Added 6 `model_selection` methods (see highlights below) and fixed a bug in the cost calculation used in previous simulations.

# parameters

- cost_bin
    - smaller bin index means cheaper models (0 to 6)
- ==model_selection method== (updated)
    - one cost bin (old)
        - best_to_worst: start from the best model (based on perf_metric), then select the next worst model each time
        - random: randomly select a model from the remaining pool of models
        - worst_to_best: start from the worst model, then select the next best model each time
        - worst2_then_best_to_worst: start with the 2 worst models, then the best model, then the next best etc.
    - ==two cost bins== (newly added!)
        - worst2_then_best_bin_best_to_worst: select worst 2 models from current bin, then choose **best** model from the **best bin**, and next best model in the best bin etc.
        - worst2_then_best_bin_random: select worst 2 models from current bin, then choose **random** model from the **best bin** etc.
        - worst2_then_best_bin_worst_to_best: select worst 2 models from current bin, then choose **worst** model from the **best bin** etc.
        - worst2_then_next_bin_best_to_worst: same as above, but choose from **next better bin** (not best bin)
        - worst2_then_next_bin_random
        - worst2_then_next_bin_worst_to_best
- max_models: maximum number of models used to evaluate each post (i.e., each post gets anything between 2 and `max_models` models)
    - I also simulated choosing 1 model from a bin, but it never shows up in the top models list, suggesting some combination of models is generally better than just having one model
- disagreement threshold: for each post, how much disagreement to tolerate/allow across models before terminating (otherwise, keep adding models until `max_models`)
- aggregation method: how to combine ratings across models
    - simple: simple mean
    - weighted: weighted by perf_metric (see next parameter)
    - combined: weighted by perf_metric and uncertainty/disagreement across models
- perf_metric: metrics for initially evaluating how good a model is
    - corr_only: correlation of an LLM's ratings with fc likert
    - icc_only: an LLM's internal reliability across multiple runs
    - perf_geometric: geometric mean of corr and icc
    - perf_mean: simple mean of corr and icc
    - perf_min: min(corr, icc)
    - cost_only: model inference cost

# best models

## across all 69552 simulations

- `param_rank_train`: parameter ranking in training set (ranked by `fc_likert_corr_train` - negative is better performance)
- `*_train`: training set
- `*_test`: testing set

`worst2_then_best_bin_*` parameters work well (e.g., `fc_likert_corr_train <= -0.7`), but are super expensive (see the cost per million posts in the `cost_per_m_post_train` column: $30k-$500k!). They are often the top-performing parameter combinations (see `param_rank_train` column).
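As a rough illustration of how `param_rank_train` relates to `fc_likert_corr_train`, here is a minimal Python sketch. It assumes the simulation results are available as a list of dicts; the helper names `rank_by_train_corr` and `under_budget` are hypothetical and not part of the simulation code. The three rows are copied from the results in this section, so the ranks it produces are ranks within this toy subset only, not the real `param_rank_train` values.

```python
# Three rows copied from the results in this section (subset of columns only).
results = [
    {"param_idx": 69260, "fc_likert_corr_train": -0.713, "cost_per_m_post_train": 141125.61},
    {"param_idx": 43007, "fc_likert_corr_train": -0.720, "cost_per_m_post_train": 38716.79},
    {"param_idx": 9332, "fc_likert_corr_train": -0.664, "cost_per_m_post_train": 50.285},
]


def rank_by_train_corr(rows):
    """Sort ascending so the most negative training correlation gets rank 1."""
    ordered = sorted(rows, key=lambda r: r["fc_likert_corr_train"])
    return [dict(r, param_rank_train=i) for i, r in enumerate(ordered, start=1)]


def under_budget(rows, max_cost):
    """Keep parameter combinations cheaper than max_cost (USD per million posts)."""
    return [r for r in rows if r["cost_per_m_post_train"] < max_cost]


ranked = rank_by_train_corr(results)
cheap = under_budget(ranked, 100)

for row in ranked:
    print(row["param_rank_train"], row["param_idx"], row["fc_likert_corr_train"])
```

Filtering after ranking keeps the overall rank attached to each cheap combination, which is how a low-cost combination can still carry a rank far from the top.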
```r
    param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric             model_selection_method
        <num>            <num>                <num>               <num>               <num>              <num>                 <num>                <num>    <num>      <num>                  <num>             <char>      <char>                             <char>
 1:     43007                1               -0.720              -0.739              -0.664             -0.690              38716.79            505767.57        2          5                   0.40             simple    icc_only        worst2_then_best_bin_random
 2:     44268                2               -0.718              -0.750              -0.658             -0.688              45780.68             45780.68        2          6                   0.10             simple    icc_only worst2_then_best_bin_worst_to_best
 3:     45528                3               -0.718              -0.750              -0.658             -0.688              45780.68             45780.68        2          6                   0.15             simple    icc_only worst2_then_best_bin_worst_to_best
 4:     46788                4               -0.718              -0.750              -0.658             -0.688              45780.68             45780.68        2          6                   0.20             simple    icc_only worst2_then_best_bin_worst_to_best
 5:     48048                5               -0.718              -0.750              -0.658             -0.688              45780.68             45780.68        2          6                   0.25             simple    icc_only worst2_then_best_bin_worst_to_best
 6:     49308                6               -0.718              -0.750              -0.658             -0.688              45780.68             45780.68        2          6                   0.30             simple    icc_only worst2_then_best_bin_worst_to_best
 7:     50568                7               -0.718              -0.750              -0.658             -0.688              45780.68             45780.68        2          6                   0.35             simple    icc_only worst2_then_best_bin_worst_to_best
 8:     51828                8               -0.718              -0.750              -0.658             -0.688              45780.68             45780.68        2          6                   0.40             simple    icc_only worst2_then_best_bin_worst_to_best
 9:     53088                9               -0.714              -0.756              -0.678             -0.698              50782.11             50782.11        2          7                   0.10             simple    icc_only worst2_then_best_bin_worst_to_best
10:     54348               10               -0.714              -0.756              -0.678             -0.698              50782.11             50782.11        2          7                   0.15             simple    icc_only worst2_then_best_bin_worst_to_best
11:     55608               11               -0.714              -0.756              -0.678             -0.698              50782.11             50782.11        2          7                   0.20             simple    icc_only worst2_then_best_bin_worst_to_best
12:     56868               12               -0.714              -0.756              -0.678             -0.698              50782.11             50782.11        2          7                   0.25             simple    icc_only worst2_then_best_bin_worst_to_best
13:     58128               13               -0.714              -0.756              -0.678             -0.698              50782.11             50782.11        2          7                   0.30             simple    icc_only worst2_then_best_bin_worst_to_best
14:     59388               14               -0.714              -0.756              -0.678             -0.698              50782.11             50782.11        2          7                   0.35             simple    icc_only worst2_then_best_bin_worst_to_best
15:     60648               15               -0.714              -0.756              -0.678             -0.698              50782.11             50782.11        2          7                   0.40             simple    icc_only worst2_then_best_bin_worst_to_best
16:     69260               16               -0.713              -0.742              -0.660             -0.715             141125.61            126294.26        6          8                   0.35           combined    perf_min                             random
17:     61908               17               -0.712              -0.755              -0.670             -0.698              63727.18             63727.18        2          8                   0.10             simple    icc_only worst2_then_best_bin_worst_to_best
18:     63168               18               -0.712              -0.755              -0.670             -0.698              63727.18             63727.18        2          8                   0.15             simple    icc_only worst2_then_best_bin_worst_to_best
19:     64428               19               -0.712              -0.755              -0.670             -0.698              63727.18             63727.18        2          8                   0.20             simple    icc_only worst2_then_best_bin_worst_to_best
20:     65688               20               -0.712              -0.755              -0.670             -0.698              63727.18             63727.18        2          8                   0.25             simple    icc_only worst2_then_best_bin_worst_to_best
21:     66948               21               -0.712              -0.755              -0.670             -0.698              63727.18             63727.18        2          8                   0.30             simple    icc_only worst2_then_best_bin_worst_to_best
22:     68208               22               -0.712              -0.755              -0.670             -0.698              63727.18             63727.18        2          8                   0.35             simple    icc_only worst2_then_best_bin_worst_to_best
23:     69468               23               -0.712              -0.755              -0.670             -0.698              63727.18             63727.18        2          8                   0.40             simple    icc_only worst2_then_best_bin_worst_to_best
24:     57930               24               -0.706              -0.640              -0.655             -0.620             135913.20             90445.96        6          7                   0.25           combined   corr_only                             random
25:     50567               25               -0.704              -0.769              -0.638             -0.709              45717.45             82886.70        2          6                   0.35             simple    icc_only        worst2_then_best_bin_random
26:     59387               26               -0.703              -0.775              -0.661             -0.714              50718.89            553647.88        2          7                   0.35             simple    icc_only        worst2_then_best_bin_random
27:     20110               27               -0.701              -0.736              -0.626             -0.687              28302.31            280006.26        6          3                   0.15           combined   perf_mean                             random
28:     71987               28               -0.700              -0.762              -0.681             -0.716             543473.14            580642.38        2          9                   0.15             simple    icc_only        worst2_then_best_bin_random
29:     74507               29               -0.700              -0.775              -0.681             -0.723             543473.14            579996.87        2          9                   0.25             simple    icc_only        worst2_then_best_bin_random
30:     64617               30               -0.699              -0.761              -0.657             -0.706              65475.52            102644.76        5          8                   0.20             simple   cost_only        worst2_then_best_bin_random
    param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric             model_selection_method
```

```r
frequency of different parameters in the top 1000 models
---
model_selection_method
best_to_worst random worst_to_best worst2_then_best_bin_best_to_worst worst2_then_best_bin_random worst2_then_best_bin_worst_to_best worst2_then_best_to_worst
          402     97            49                                 14                          93                                204                        77
worst2_then_next_bin_random worst2_then_next_bin_worst_to_best
                         22                                 42
---
aggregation_method
combined simple weighted
     202    596      202
---
perf_metric
corr_only cost_only icc_only perf_geometric perf_mean perf_min
      146       241      209            132       134      138
---
disagreement_threshold
 0.1 0.15  0.2 0.25  0.3 0.35  0.4
 122  128  167  142  148  144  149
---
cost_bin
  1   2   3   4   5   6
 16 122  83 454 122 203
---
max_models
  2   3   4   5   6   7   8   9
 88  75  97 128 195 199 159  59
```

## for parameters where cost per million posts in training set is <$100

- `param_rank_train`: parameter ranking in training set (ranked by `fc_likert_corr_train` - negative is better performance)
- `*_train`: training set
- `*_test`: testing set

```r
    param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
        <num>            <num>                <num>               <num>               <num>              <num>                 <num>                <num>    <num>      <num>                  <num>             <char>      <char>                 <char>
 1:      9332             1354               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          2                   0.10           weighted   corr_only          best_to_worst
 2:     10592             1355               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          2                   0.15           weighted   corr_only          best_to_worst
 3:     11852             1356               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          2                   0.20           weighted   corr_only          best_to_worst
 4:     13112             1357               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          2                   0.25           weighted   corr_only          best_to_worst
 5:     14372             1358               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          2                   0.30           weighted   corr_only          best_to_worst
 6:     15632             1359               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          2                   0.35           weighted   corr_only          best_to_worst
 7:     16892             1360               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          2                   0.40           weighted   corr_only          best_to_worst
 8:     18152             1361               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          3                   0.10           weighted   corr_only          best_to_worst
 9:     19412             1362               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          3                   0.15           weighted   corr_only          best_to_worst
10:     20672             1363               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          3                   0.20           weighted   corr_only          best_to_worst
11:     21932             1364               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          3                   0.25           weighted   corr_only          best_to_worst
12:     23192             1365               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          3                   0.30           weighted   corr_only          best_to_worst
13:     24452             1366               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          3                   0.35           weighted   corr_only          best_to_worst
14:     25712             1367               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          3                   0.40           weighted   corr_only          best_to_worst
15:     26972             1368               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          4                   0.10           weighted   corr_only          best_to_worst
16:     28232             1369               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          4                   0.15           weighted   corr_only          best_to_worst
17:     29492             1370               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          4                   0.20           weighted   corr_only          best_to_worst
18:     30752             1371               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          4                   0.25           weighted   corr_only          best_to_worst
19:     32012             1372               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          4                   0.30           weighted   corr_only          best_to_worst
20:     33272             1373               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          4                   0.35           weighted   corr_only          best_to_worst
21:     34532             1374               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          4                   0.40           weighted   corr_only          best_to_worst
22:     35792             1375               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          5                   0.10           weighted   corr_only          best_to_worst
23:     37052             1376               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          5                   0.15           weighted   corr_only          best_to_worst
24:     38312             1377               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          5                   0.20           weighted   corr_only          best_to_worst
25:     39572             1378               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          5                   0.25           weighted   corr_only          best_to_worst
26:     40832             1379               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          5                   0.30           weighted   corr_only          best_to_worst
27:     42092             1380               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          5                   0.35           weighted   corr_only          best_to_worst
28:     43352             1381               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          5                   0.40           weighted   corr_only          best_to_worst
29:     44612             1382               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          6                   0.10           weighted   corr_only          best_to_worst
30:     45872             1383               -0.664              -0.676              -0.642              -0.58                50.285               50.285        1          6                   0.15           weighted   corr_only          best_to_worst
    param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```

```r
frequency of different parameters in the top 595 models where cost_per_m_post_train < 100 & fc_likert_corr_train < -0.65
---
model_selection_method
best_to_worst random
          560     35
---
aggregation_method
combined simple weighted
      59     60      476
---
perf_metric
corr_only cost_only icc_only perf_geometric perf_mean perf_min
      147         2        3            146       146      151
---
disagreement_threshold
 0.1 0.15  0.2 0.25  0.3 0.35  0.4
  87   85   83   87   86   84   83
---
cost_bin
  0   1
293 302
---
max_models
  2   3   4   5   6   7   8   9
178  60  60  62  57  59  61  58
---
```

# Most cost-effective parameter combination?

- cost_bin: 1
- max_models: 2 (up to 4 seems fine?)
- model_selection_method: best_to_worst
- aggregation_method: weighted
- disagreement threshold: doesn't seem to matter too much, but probably smaller is better?
- perf_metric: corr_only

Suggested final combination (~$50 per million posts)

```r
   param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
       <int>            <int>                <num>               <num>               <num>              <num>                 <num>                <num>    <int>      <int>                  <num>             <char>      <char>                 <char>
1:     26972             1368           -0.6643829          -0.6764605          -0.6417671         -0.5801227              50.28532             50.28532        1          4                    0.1           weighted   corr_only          best_to_worst
```
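
To make the suggested combination concrete, here is a minimal Python sketch of how one post might be rated with `best_to_worst` selection, `weighted` aggregation, `max_models = 4`, and `disagreement_threshold = 0.1`. It rests on several assumptions that are not confirmed by the simulation code: disagreement is taken to be the population standard deviation of the ratings collected so far, ratings are on a 0 to 1 scale, and each model's weight is a performance score where larger means better (for example, the absolute value of its correlation). The function name `rate_post` and the model names `a`, `b`, `c` are hypothetical.

```python
from statistics import pstdev


def rate_post(ratings_by_model, perf_by_model, max_models=4, disagreement_threshold=0.1):
    """Rate one post: add models best-to-worst until they agree or max_models is hit.

    Assumptions (not from the simulation code): disagreement is the population
    SD of the ratings so far, and perf scores are positive with larger = better.
    """
    # best_to_worst: order models from highest to lowest performance score.
    order = sorted(ratings_by_model, key=lambda m: perf_by_model[m], reverse=True)
    used = order[:2]  # every post gets at least 2 models
    while len(used) < min(max_models, len(order)):
        disagreement = pstdev([ratings_by_model[m] for m in used])
        if disagreement <= disagreement_threshold:
            break  # models agree closely enough, stop adding (and paying for) more
        used.append(order[len(used)])
    # weighted aggregation: performance-weighted mean of the collected ratings.
    total_weight = sum(perf_by_model[m] for m in used)
    score = sum(perf_by_model[m] * ratings_by_model[m] for m in used) / total_weight
    return score, used


perf = {"a": 0.6, "b": 0.5, "c": 0.4}  # hypothetical performance scores

# The two best models agree, so the third is never queried.
score_agree, used_agree = rate_post({"a": 0.2, "b": 0.25, "c": 0.9}, perf)

# The two best models disagree, so the next model is added.
score_split, used_split = rate_post({"a": 0.2, "b": 0.8, "c": 0.5}, perf)

print(used_agree, round(score_agree, 3))
print(used_split, round(score_split, 3))
```

In this sketch the threshold only controls how often a third or fourth model is queried, which is one way to read why cost stays flat across thresholds when the first two models usually agree.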