251013_171932 simulation results - choose from another bin

Added 6 `model_selection` methods (see highlights below) and fixed a bug in the cost calculation from previous simulations.

# parameters

- cost_bin
  - smaller bin index means cheaper models (0 to 6)
- ==model_selection method== (updated)
  - one cost bin (old)
    - best_to_worst: select the next worst model based on perf_metric, starting from the best
    - random: randomly select a model from the remaining pool of models
    - worst_to_best: select the next best model
    - worst2_then_best_to_worst: start with the 2 worst models, then next best etc.
  - ==two cost bins== (newly added!)
    - worst2_then_best_bin_best_to_worst: select the worst 2 models from the current bin, then choose the **best** model from the **best bin**, then the next best model in the best bin etc.
    - worst2_then_best_bin_random: select the worst 2 models from the current bin, then choose a **random** model from the **best bin** etc.
    - worst2_then_best_bin_worst_to_best: select the worst 2 models from the current bin, then choose the **worst** model from the **best bin** etc.
    - worst2_then_next_bin_best_to_worst: same as above, but choose from the **next better bin** (not the best bin)
    - worst2_then_next_bin_random
    - worst2_then_next_bin_worst_to_best
- max_models: maximum number of models used to evaluate each post (i.e., for each post, can be anything between 2 and max_models)
  - I also simulated choosing 1 model from a bin, but it never shows up in the top models list, suggesting that some combination of models is generally better than a single model
- disagreement threshold: for each post, how much disagreement across models to tolerate before terminating (otherwise, keep adding models until `max_models`)
- aggregation method: how to combine ratings across models
  - simple: simple mean
  - weighted: weighted by perf_metric (see next parameter)
  - combined: weighted by perf_metric and uncertainty/disagreement across models
- perf_metric: metric for initially evaluating how good a model is
  - corr_only: correlation of an LLM's ratings with fc likert
  - icc_only: an LLM's internal reliability across multiple runs
  - perf_geometric: geometric mean of corr and icc
  - perf_mean: simple mean of corr and icc
  - perf_min: min(corr, icc)
  - cost_only: model inference cost

# best models

## across all 69552 simulations

- `param_rank_train`: parameter ranking in training set (ranked by `fc_likert_corr_train`; more negative means better performance)
- `*_train`: training set
- `*_test`: testing set

`worst2_then_best_bin_*` parameters work well (e.g., `fc_likert_corr_train <= -0.7`), but are super expensive (see the cost per million posts in the `cost_per_m_post_train` column: $30k-$500k!). They are often the top-performing models (see `param_rank_train` column).

```r
    param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric             model_selection_method
        <num>            <num>                <num>               <num>               <num>              <num>                 <num>                <num>    <num>      <num>                  <num>             <char>      <char>                             <char>
 1:     43007                1               -0.720              -0.739              -0.664             -0.690              38716.79            505767.57        2          5                   0.40             simple    icc_only        worst2_then_best_bin_random
 2:     44268                2               -0.718              -0.750              -0.658             -0.688              45780.68             45780.68        2          6                   0.10             simple    icc_only worst2_then_best_bin_worst_to_best
 3:     45528                3               -0.718              -0.750              -0.658             -0.688              45780.68             45780.68        2          6                   0.15             simple    icc_only worst2_then_best_bin_worst_to_best
 4:     46788                4               -0.718              -0.750              -0.658             -0.688              45780.68             45780.68        2          6                   0.20             simple    icc_only worst2_then_best_bin_worst_to_best
 5:     48048                5               -0.718              -0.750              -0.658             -0.688              45780.68             45780.68        2          6                   0.25             simple    icc_only worst2_then_best_bin_worst_to_best
 6:     49308                6               -0.718              -0.750              -0.658             -0.688              45780.68             45780.68        2          6                   0.30             simple    icc_only worst2_then_best_bin_worst_to_best
 7:     50568                7               -0.718              -0.750              -0.658             -0.688              45780.68             45780.68        2          6                   0.35             simple    icc_only worst2_then_best_bin_worst_to_best
 8:     51828                8               -0.718              -0.750              -0.658             -0.688              45780.68             45780.68        2          6                   0.40             simple    icc_only worst2_then_best_bin_worst_to_best
 9:     53088                9               -0.714              -0.756              -0.678             -0.698              50782.11             50782.11        2          7                   0.10             simple    icc_only worst2_then_best_bin_worst_to_best
10:     54348               10               -0.714              -0.756              -0.678             -0.698              50782.11             50782.11        2          7                   0.15             simple    icc_only worst2_then_best_bin_worst_to_best
11:     55608               11               -0.714              -0.756              -0.678             -0.698              50782.11             50782.11        2          7                   0.20             simple    icc_only worst2_then_best_bin_worst_to_best
12:     56868               12               -0.714              -0.756              -0.678             -0.698              50782.11             50782.11        2          7                   0.25             simple    icc_only worst2_then_best_bin_worst_to_best
13:     58128               13               -0.714              -0.756              -0.678             -0.698              50782.11             50782.11        2          7                   0.30             simple    icc_only worst2_then_best_bin_worst_to_best
14:     59388               14               -0.714              -0.756              -0.678             -0.698              50782.11             50782.11        2          7                   0.35             simple    icc_only worst2_then_best_bin_worst_to_best
15:     60648               15               -0.714              -0.756              -0.678             -0.698              50782.11             50782.11        2          7                   0.40             simple    icc_only worst2_then_best_bin_worst_to_best
16:     69260               16               -0.713              -0.742              -0.660             -0.715             141125.61            126294.26        6          8                   0.35           combined    perf_min                             random
17:     61908               17               -0.712              -0.755              -0.670             -0.698              63727.18             63727.18        2          8                   0.10             simple    icc_only worst2_then_best_bin_worst_to_best
18:     63168               18               -0.712              -0.755              -0.670             -0.698              63727.18             63727.18        2          8                   0.15             simple    icc_only worst2_then_best_bin_worst_to_best
19:     64428               19               -0.712              -0.755              -0.670             -0.698              63727.18             63727.18        2          8                   0.20             simple    icc_only worst2_then_best_bin_worst_to_best
20:     65688               20               -0.712              -0.755              -0.670             -0.698              63727.18             63727.18        2          8                   0.25             simple    icc_only worst2_then_best_bin_worst_to_best
21:     66948               21               -0.712              -0.755              -0.670             -0.698              63727.18             63727.18        2          8                   0.30             simple    icc_only worst2_then_best_bin_worst_to_best
22:     68208               22               -0.712              -0.755              -0.670             -0.698              63727.18             63727.18        2          8                   0.35             simple    icc_only worst2_then_best_bin_worst_to_best
23:     69468               23               -0.712              -0.755              -0.670             -0.698              63727.18             63727.18        2          8                   0.40             simple    icc_only worst2_then_best_bin_worst_to_best
24:     57930               24               -0.706              -0.640              -0.655             -0.620             135913.20             90445.96        6          7                   0.25           combined   corr_only                             random
25:     50567               25               -0.704              -0.769              -0.638             -0.709              45717.45             82886.70        2          6                   0.35             simple    icc_only        worst2_then_best_bin_random
26:     59387               26               -0.703              -0.775              -0.661             -0.714              50718.89            553647.88        2          7                   0.35             simple    icc_only        worst2_then_best_bin_random
27:     20110               27               -0.701              -0.736              -0.626             -0.687              28302.31            280006.26        6          3                   0.15           combined   perf_mean                             random
28:     71987               28               -0.700              -0.762              -0.681             -0.716             543473.14            580642.38        2          9                   0.15             simple    icc_only        worst2_then_best_bin_random
29:     74507               29               -0.700              -0.775              -0.681             -0.723             543473.14            579996.87        2          9                   0.25             simple    icc_only        worst2_then_best_bin_random
30:     64617               30               -0.699              -0.761              -0.657             -0.706              65475.52            102644.76        5          8                   0.20             simple   cost_only        worst2_then_best_bin_random
    param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric             model_selection_method
```

```r
frequency of different parameters in the top 1000 models
---
model_selection_method
best_to_worst random worst_to_best worst2_then_best_bin_best_to_worst worst2_then_best_bin_random worst2_then_best_bin_worst_to_best worst2_then_best_to_worst
          402     97            49                                 14                          93                                204                        77
worst2_then_next_bin_random worst2_then_next_bin_worst_to_best
                         22                                 42
---
aggregation_method
combined simple weighted
     202    596      202
---
perf_metric
corr_only cost_only icc_only perf_geometric perf_mean perf_min
      146       241      209            132       134      138
---
disagreement_threshold
0.1 0.15 0.2 0.25 0.3 0.35 0.4
122  128 167  142 148  144 149
---
cost_bin
 1   2  3   4   5   6
16 122 83 454 122 203
---
max_models
 2  3  4   5   6   7   8  9
88 75 97 128 195 199 159 59
```

## for parameters where cost per million posts in training set is <$100

- `param_rank_train`: parameter ranking in training set (ranked by `fc_likert_corr_train`; more negative means better performance)
- `*_train`: training set
- `*_test`: testing set

```r
    param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
        <num>            <num>                <num>               <num>               <num>              <num>                 <num>                <num>    <num>      <num>                  <num>             <char>      <char>                 <char>
 1:      9332             1354               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          2                   0.10           weighted   corr_only          best_to_worst
 2:     10592             1355               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          2                   0.15           weighted   corr_only          best_to_worst
 3:     11852             1356               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          2                   0.20           weighted   corr_only          best_to_worst
 4:     13112             1357               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          2                   0.25           weighted   corr_only          best_to_worst
 5:     14372             1358               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          2                   0.30           weighted   corr_only          best_to_worst
 6:     15632             1359               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          2                   0.35           weighted   corr_only          best_to_worst
 7:     16892             1360               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          2                   0.40           weighted   corr_only          best_to_worst
 8:     18152             1361               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          3                   0.10           weighted   corr_only          best_to_worst
 9:     19412             1362               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          3                   0.15           weighted   corr_only          best_to_worst
10:     20672             1363               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          3                   0.20           weighted   corr_only          best_to_worst
11:     21932             1364               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          3                   0.25           weighted   corr_only          best_to_worst
12:     23192             1365               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          3                   0.30           weighted   corr_only          best_to_worst
13:     24452             1366               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          3                   0.35           weighted   corr_only          best_to_worst
14:     25712             1367               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          3                   0.40           weighted   corr_only          best_to_worst
15:     26972             1368               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          4                   0.10           weighted   corr_only          best_to_worst
16:     28232             1369               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          4                   0.15           weighted   corr_only          best_to_worst
17:     29492             1370               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          4                   0.20           weighted   corr_only          best_to_worst
18:     30752             1371               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          4                   0.25           weighted   corr_only          best_to_worst
19:     32012             1372               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          4                   0.30           weighted   corr_only          best_to_worst
20:     33272             1373               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          4                   0.35           weighted   corr_only          best_to_worst
21:     34532             1374               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          4                   0.40           weighted   corr_only          best_to_worst
22:     35792             1375               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          5                   0.10           weighted   corr_only          best_to_worst
23:     37052             1376               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          5                   0.15           weighted   corr_only          best_to_worst
24:     38312             1377               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          5                   0.20           weighted   corr_only          best_to_worst
25:     39572             1378               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          5                   0.25           weighted   corr_only          best_to_worst
26:     40832             1379               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          5                   0.30           weighted   corr_only          best_to_worst
27:     42092             1380               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          5                   0.35           weighted   corr_only          best_to_worst
28:     43352             1381               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          5                   0.40           weighted   corr_only          best_to_worst
29:     44612             1382               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          6                   0.10           weighted   corr_only          best_to_worst
30:     45872             1383               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          6                   0.15           weighted   corr_only          best_to_worst
    param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```

```r
frequency of different parameters in the top 595 models where cost_per_m_post_train < 100 & fc_likert_corr_train < -0.65
---
model_selection_method
best_to_worst random
          560     35
---
aggregation_method
combined simple weighted
      59     60      476
---
perf_metric
corr_only cost_only icc_only perf_geometric perf_mean perf_min
      147         2        3            146       146      151
---
disagreement_threshold
0.1 0.15 0.2 0.25 0.3 0.35 0.4
 87   85  83   87  86   84  83
---
cost_bin
  0   1
293 302
---
max_models
  2  3  4  5  6  7  8  9
178 60 60 62 57 59 61 58
---
```

## for parameters where cost per million posts in training set is ($100,$400)

- 14041 models in this range, showing top 30 models in this cost range
- cost_per_m_post_train ranges from 280 to 400 in these 30 parameters
- performance (fc_likert_corr_train) **improves only by r=0.01** relative to the parameters above where cost per million posts is <$100

```r
    param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method    perf_metric model_selection_method
        <int>            <int>                <num>               <num>               <num>              <num>                 <num>                <num>    <int>      <int>                  <num>             <char>         <char>                 <char>
 1:     21192              426           -0.6769416          -0.6796620          -0.6597731         -0.6252380              284.6520             288.9997        3          3                   0.20           combined      perf_mean          best_to_worst
 2:     21182              463           -0.6768965          -0.6795409          -0.6596890         -0.6251304              284.6520             288.9997        3          3                   0.20           combined perf_geometric          best_to_worst
 3:     30012              503           -0.6756293          -0.6815633          -0.6453492         -0.6308724              314.5609             320.9712        3          4                   0.20           combined      perf_mean          best_to_worst
 4:     30002              504           -0.6755618          -0.6814398          -0.6452668         -0.6307591              314.5609             320.9712        3          4                   0.20           combined perf_geometric          best_to_worst
 5:     38832              512           -0.6750520          -0.6894445          -0.6571091         -0.6342248              397.5603             427.4395        3          5                   0.20           combined      perf_mean          best_to_worst
 6:     38822              515           -0.6750310          -0.6892278          -0.6568844         -0.6340867              397.5603             427.4395        3          5                   0.20           combined perf_geometric          best_to_worst
 7:     17822              520           -0.6745908          -0.7132622          -0.6442014         -0.6531874              360.4387             360.4387        3          3                   0.10             simple perf_geometric          best_to_worst
 8:     17832              521           -0.6745908          -0.7132622          -0.6442014         -0.6531874              360.4387             360.4387        3          3                   0.10             simple      perf_mean          best_to_worst
 9:     19082              522           -0.6745908          -0.7132622          -0.6442014         -0.6531874              360.4387             360.4387        3          3                   0.15             simple perf_geometric          best_to_worst
10:     19092              523           -0.6745908          -0.7132622          -0.6442014         -0.6531874              360.4387             360.4387        3          3                   0.15             simple      perf_mean          best_to_worst
11:     20342              524           -0.6745908          -0.7132622          -0.6442014         -0.6531874              360.4387             360.4387        3          3                   0.20             simple perf_geometric          best_to_worst
12:     20352              525           -0.6745908          -0.7132622          -0.6442014         -0.6531874              360.4387             360.4387        3          3                   0.20             simple      perf_mean          best_to_worst
13:     21602              526           -0.6745908          -0.7132622          -0.6442014         -0.6531874              360.4387             360.4387        3          3                   0.25             simple perf_geometric          best_to_worst
14:     21612              527           -0.6745908          -0.7132622          -0.6442014         -0.6531874              360.4387             360.4387        3          3                   0.25             simple      perf_mean          best_to_worst
15:     22862              528           -0.6745908          -0.7132622          -0.6442014         -0.6531874              360.4387             360.4387        3          3                   0.30             simple perf_geometric          best_to_worst
16:     22872              529           -0.6745908          -0.7132622          -0.6442014         -0.6531874              360.4387             360.4387        3          3                   0.30             simple      perf_mean          best_to_worst
17:     24122              530           -0.6745908          -0.7132622          -0.6442014         -0.6531874              360.4387             360.4387        3          3                   0.35             simple perf_geometric          best_to_worst
18:     24132              531           -0.6745908          -0.7132622          -0.6442014         -0.6531874              360.4387             360.4387        3          3                   0.35             simple      perf_mean          best_to_worst
19:     25382              532           -0.6745908          -0.7132622          -0.6442014         -0.6531874              360.4387             360.4387        3          3                   0.40             simple perf_geometric          best_to_worst
20:     25392              533           -0.6745908          -0.7132622          -0.6442014         -0.6531874              360.4387             360.4387        3          3                   0.40             simple      perf_mean          best_to_worst
21:     38852              914           -0.6704436          -0.6873097          -0.6498157         -0.6318334              373.2393             426.6249        3          5                   0.20           combined      corr_only          best_to_worst
22:     38842              922           -0.6704029          -0.6874381          -0.6498125         -0.6319519              373.2393             426.6249        3          5                   0.20           combined       perf_min          best_to_worst
23:     30032              949           -0.6702093          -0.6801288          -0.6361830         -0.6286456              309.3936             320.1567        3          4                   0.20           combined      corr_only          best_to_worst
24:     30022              950           -0.6701711          -0.6802595          -0.6361465         -0.6287761              309.3936             320.1567        3          4                   0.20           combined       perf_min          best_to_worst
25:     30050              953           -0.6700226          -0.7021708          -0.6733393         -0.6090356              396.1295             570.9819        3          4                   0.20           combined      cost_only                 random
26:     31272             1189           -0.6672535          -0.6725090          -0.6315820         -0.6208303              299.4848             311.6049        3          4                   0.25           combined      perf_mean          best_to_worst
27:     31262             1191           -0.6672209          -0.6724043          -0.6315670         -0.6207379              299.4848             311.6049        3          4                   0.25           combined perf_geometric          best_to_worst
28:     32532             1294           -0.6660981          -0.6711803          -0.6286871         -0.6150379              292.7168             305.8691        3          4                   0.30           combined      perf_mean          best_to_worst
29:     32522             1297           -0.6660708          -0.6710746          -0.6286906         -0.6149619              292.7168             305.8691        3          4                   0.30           combined perf_geometric          best_to_worst
30:     40082             1298           -0.6660324          -0.6810570          -0.6437485         -0.6253327              363.3305             404.4234        3          5                   0.25           combined perf_geometric          best_to_worst
    param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method    perf_metric model_selection_method
```

```r
frequency of different parameters in the top 14041 models
---
model_selection_method
best_to_worst random worst_to_best worst2_then_best_bin_best_to_worst worst2_then_best_bin_random worst2_then_best_bin_worst_to_best worst2_then_best_to_worst worst2_then_next_bin_best_to_worst
         1718   1344          1672                                805                         805                                807                      1815                               1724
worst2_then_next_bin_random worst2_then_next_bin_worst_to_best
                       1649                               1702
---
aggregation_method
combined simple weighted
    4187   4035     5819
---
perf_metric
corr_only cost_only icc_only perf_geometric perf_mean perf_min
     2636      1648     1873           2623      2618     2643
---
disagreement_threshold
 0.1 0.15  0.2 0.25  0.3 0.35  0.4
1981 1981 1967 1991 2023 2049 2049
---
cost_bin
  0    1    2    3
443 2391 5585 5622
---
max_models
   1    2    3    4    5    6    7    8    9
1020 2047 1552 1582 1584 1644 1649 1520 1443
---
```

# Most cost-effective parameter combination?

- cost_bin: 1
- max_models: 2 (up to 4 seems fine?)
- model_selection_method: best_to_worst
- aggregation_method: weighted
- disagreement threshold: doesn't seem to matter too much, but probably smaller is better?
- perf_metric: corr_only

Suggested final combination (~$50 per million posts)

```r
   param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
       <int>            <int>                <num>               <num>               <num>              <num>                 <num>                <num>    <int>      <int>                  <num>             <char>      <char>                 <char>
1:     26972             1368           -0.6643829          -0.6764605          -0.6417671         -0.5801227              50.28532             50.28532        1          4                    0.1           weighted   corr_only          best_to_worst
```
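
For reference, the suggested combination (best_to_worst selection, weighted aggregation, max_models = 4, disagreement threshold = 0.1) can be read as a per-post procedure. Below is a minimal R sketch of that reading, not the actual simulation code. The function name `rate_post`, the example ratings, and the example perf values are all hypothetical. Two assumptions are not taken from these notes: disagreement is measured as the standard deviation of the ratings collected so far, and `perf` holds positive scores where larger is better (e.g. the absolute value of each model's correlation with fc likert).

```r
# Sketch only: `rate_post` and all inputs below are hypothetical.
# ASSUMPTION: perf holds positive scores where larger is better.
# ASSUMPTION: disagreement = standard deviation of the ratings used so far.
rate_post <- function(ratings, perf, max_models = 4,
                      disagreement_threshold = 0.1) {
  stopifnot(length(ratings) == length(perf),
            length(perf) >= 2, max_models >= 2)
  ord <- order(perf, decreasing = TRUE)   # best_to_worst ordering
  n_max <- min(max_models, length(ord))
  n_used <- 2                             # start with the two best models
  while (n_used < n_max &&
         sd(ratings[ord[seq_len(n_used)]]) > disagreement_threshold) {
    n_used <- n_used + 1                  # add the next model in line
  }
  used <- ord[seq_len(n_used)]
  list(
    # "weighted" aggregation: mean of ratings weighted by perf_metric
    rating = weighted.mean(ratings[used], w = perf[used]),
    n_models = n_used
  )
}

# Made-up perf_metric values for four models in one cost bin
perf <- c(a = 0.60, b = 0.70, c = 0.50, d = 0.65)

# The two best models (b, d) agree closely, so no further model is added.
res_agree <- rate_post(c(a = 0.10, b = 0.80, c = 0.30, d = 0.85), perf)

# The models keep disagreeing, so models are added until max_models.
res_disagree <- rate_post(c(a = 0.50, b = 0.20, c = 0.60, d = 0.90), perf)

print(c(res_agree$n_models, res_disagree$n_models))
```

Under these assumptions, a post only incurs the cost of a third or fourth model when the first two disagree, which is one way to see why the cost per million posts can stay flat as max_models grows from 2 to 6 in the <$100 table.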