Added 6 new `model_selection_method` options (see highlights below) and fixed a bug in the cost calculation that affected previous simulations.
# parameters
- cost_bin
    - smaller bin index means cheaper models (0 to 6)
- ==model_selection_method== (updated)
    - one cost bin (old)
        - best_to_worst: start with the best model (by perf_metric), then add the next-best remaining model each time
        - random: randomly select a model from the remaining pool of models
        - worst_to_best: start with the worst model, then add the next-better model each time
        - worst2_then_best_to_worst: start with the 2 worst models, then the best remaining model, then the next best, etc.
    - ==two cost bins== (newly added!)
        - worst2_then_best_bin_best_to_worst: select the worst 2 models from the current bin, then choose the **best** model from the **best bin**, then the next best model in the best bin, etc.
        - worst2_then_best_bin_random: select the worst 2 models from the current bin, then choose a **random** model from the **best bin**, etc.
        - worst2_then_best_bin_worst_to_best: select the worst 2 models from the current bin, then choose the **worst** model from the **best bin**, etc.
        - worst2_then_next_bin_best_to_worst: same as worst2_then_best_bin_best_to_worst, but choose from the **next better bin** (not the best bin)
        - worst2_then_next_bin_random: same as worst2_then_best_bin_random, but choose from the next better bin
        - worst2_then_next_bin_worst_to_best: same as worst2_then_best_bin_worst_to_best, but choose from the next better bin
- max_models: maximum number of models used to rate a post (for each post, the number actually used can be anything between 2 and max_models)
    - I also simulated choosing 1 model from a bin, but it never shows up in the top models list, suggesting that some combination of models is generally better than a single model
- disagreement threshold: for each post, how much disagreement across models to tolerate before terminating (otherwise, keep adding models until `max_models`)
- aggregation method: how to combine ratings across models
    - simple: simple mean
    - weighted: weighted by perf_metric (see next parameter)
    - combined: weighted by perf_metric and uncertainty/disagreement across models
- perf_metric: metric used to initially evaluate how good a model is
    - corr_only: correlation of an LLM's ratings with fc likert
    - icc_only: an LLM's internal reliability across multiple runs
    - perf_geometric: geometric mean of corr and icc
    - perf_mean: simple mean of corr and icc
    - perf_min: min(corr, icc)
    - cost_only: model inference cost
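
The parameters above describe a sequential procedure: order the candidate models, rate the post with the first two, and keep adding models until the ratings agree closely enough or `max_models` is reached. Below is a minimal Python sketch of that idea, not the simulation code itself. The function names (`perf_score`, `order_models`, `rate_post`) are hypothetical, the population standard deviation is only an assumed disagreement measure, and `corr` and `icc` are assumed to be oriented so that higher is better. `cost_only`, the `combined` aggregation, and the two-bin selection methods are left out because their exact formulas are not spelled out here.

```python
import random
import statistics
from math import sqrt


def perf_score(corr, icc, metric):
    """Score one model. ASSUMPTION: corr and icc lie in [0, 1], higher is better."""
    if metric == "corr_only":
        return corr
    if metric == "icc_only":
        return icc
    if metric == "perf_geometric":
        return sqrt(corr * icc)
    if metric == "perf_mean":
        return (corr + icc) / 2
    if metric == "perf_min":
        return min(corr, icc)
    raise ValueError(f"unsupported perf_metric: {metric}")


def order_models(scores, method, rng=random):
    """Return model names in the order they would be tried (single cost bin)."""
    best_first = sorted(scores, key=scores.get, reverse=True)
    if method == "best_to_worst":
        return best_first
    if method == "worst_to_best":
        return best_first[::-1]
    if method == "random":
        return rng.sample(best_first, k=len(best_first))
    if method == "worst2_then_best_to_worst":
        # The two worst models first, then the remaining ones from best down.
        return best_first[-2:][::-1] + best_first[:-2]
    raise ValueError(f"unsupported model_selection_method: {method}")


def rate_post(ratings, order, scores, max_models, threshold, aggregation="simple"):
    """Aggregate one post's ratings, adding models until they agree.

    ASSUMPTION: disagreement is the population standard deviation of the
    ratings collected so far; the simulations may use a different measure.
    """
    used = order[:2]  # always start with two models
    while len(used) < min(max_models, len(order)):
        if statistics.pstdev([ratings[m] for m in used]) <= threshold:
            break  # models agree closely enough, stop adding
        used.append(order[len(used)])
    values = [ratings[m] for m in used]
    if aggregation == "simple":
        return statistics.fmean(values), used
    if aggregation == "weighted":
        weights = [scores[m] for m in used]
        return sum(w * v for w, v in zip(weights, values)) / sum(weights), used
    raise ValueError(f"unsupported aggregation_method: {aggregation}")


# Hypothetical perf_metric values and ratings for a single post.
scores = {"a": 0.9, "b": 0.7, "c": 0.5, "d": 0.3}
ratings = {"a": 0.8, "b": 0.2, "c": 0.6, "d": 0.5}
order = order_models(scores, "best_to_worst")
rating, used = rate_post(ratings, order, scores, max_models=3, threshold=0.1)
```

In this toy example the first two models disagree by more than the threshold, so a third model is added before `max_models` stops the loop.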
# best models
## across all 69552 simulations
- `param_rank_train`: parameter ranking in the training set (ranked by `fc_likert_corr_train`; more negative means better performance)
- `*_train`: training set
- `*_test`: testing set
The `worst2_then_best_bin_*` methods perform well (e.g., `fc_likert_corr_train <= -0.7`) and are often among the top-ranked parameter sets (see the `param_rank_train` column), but they are very expensive: the `cost_per_m_post_train` column shows roughly $30k-$500k per million posts.
```r
param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <char> <char> <char>
1: 43007 1 -0.720 -0.739 -0.664 -0.690 38716.79 505767.57 2 5 0.40 simple icc_only worst2_then_best_bin_random
2: 44268 2 -0.718 -0.750 -0.658 -0.688 45780.68 45780.68 2 6 0.10 simple icc_only worst2_then_best_bin_worst_to_best
3: 45528 3 -0.718 -0.750 -0.658 -0.688 45780.68 45780.68 2 6 0.15 simple icc_only worst2_then_best_bin_worst_to_best
4: 46788 4 -0.718 -0.750 -0.658 -0.688 45780.68 45780.68 2 6 0.20 simple icc_only worst2_then_best_bin_worst_to_best
5: 48048 5 -0.718 -0.750 -0.658 -0.688 45780.68 45780.68 2 6 0.25 simple icc_only worst2_then_best_bin_worst_to_best
6: 49308 6 -0.718 -0.750 -0.658 -0.688 45780.68 45780.68 2 6 0.30 simple icc_only worst2_then_best_bin_worst_to_best
7: 50568 7 -0.718 -0.750 -0.658 -0.688 45780.68 45780.68 2 6 0.35 simple icc_only worst2_then_best_bin_worst_to_best
8: 51828 8 -0.718 -0.750 -0.658 -0.688 45780.68 45780.68 2 6 0.40 simple icc_only worst2_then_best_bin_worst_to_best
9: 53088 9 -0.714 -0.756 -0.678 -0.698 50782.11 50782.11 2 7 0.10 simple icc_only worst2_then_best_bin_worst_to_best
10: 54348 10 -0.714 -0.756 -0.678 -0.698 50782.11 50782.11 2 7 0.15 simple icc_only worst2_then_best_bin_worst_to_best
11: 55608 11 -0.714 -0.756 -0.678 -0.698 50782.11 50782.11 2 7 0.20 simple icc_only worst2_then_best_bin_worst_to_best
12: 56868 12 -0.714 -0.756 -0.678 -0.698 50782.11 50782.11 2 7 0.25 simple icc_only worst2_then_best_bin_worst_to_best
13: 58128 13 -0.714 -0.756 -0.678 -0.698 50782.11 50782.11 2 7 0.30 simple icc_only worst2_then_best_bin_worst_to_best
14: 59388 14 -0.714 -0.756 -0.678 -0.698 50782.11 50782.11 2 7 0.35 simple icc_only worst2_then_best_bin_worst_to_best
15: 60648 15 -0.714 -0.756 -0.678 -0.698 50782.11 50782.11 2 7 0.40 simple icc_only worst2_then_best_bin_worst_to_best
16: 69260 16 -0.713 -0.742 -0.660 -0.715 141125.61 126294.26 6 8 0.35 combined perf_min random
17: 61908 17 -0.712 -0.755 -0.670 -0.698 63727.18 63727.18 2 8 0.10 simple icc_only worst2_then_best_bin_worst_to_best
18: 63168 18 -0.712 -0.755 -0.670 -0.698 63727.18 63727.18 2 8 0.15 simple icc_only worst2_then_best_bin_worst_to_best
19: 64428 19 -0.712 -0.755 -0.670 -0.698 63727.18 63727.18 2 8 0.20 simple icc_only worst2_then_best_bin_worst_to_best
20: 65688 20 -0.712 -0.755 -0.670 -0.698 63727.18 63727.18 2 8 0.25 simple icc_only worst2_then_best_bin_worst_to_best
21: 66948 21 -0.712 -0.755 -0.670 -0.698 63727.18 63727.18 2 8 0.30 simple icc_only worst2_then_best_bin_worst_to_best
22: 68208 22 -0.712 -0.755 -0.670 -0.698 63727.18 63727.18 2 8 0.35 simple icc_only worst2_then_best_bin_worst_to_best
23: 69468 23 -0.712 -0.755 -0.670 -0.698 63727.18 63727.18 2 8 0.40 simple icc_only worst2_then_best_bin_worst_to_best
24: 57930 24 -0.706 -0.640 -0.655 -0.620 135913.20 90445.96 6 7 0.25 combined corr_only random
25: 50567 25 -0.704 -0.769 -0.638 -0.709 45717.45 82886.70 2 6 0.35 simple icc_only worst2_then_best_bin_random
26: 59387 26 -0.703 -0.775 -0.661 -0.714 50718.89 553647.88 2 7 0.35 simple icc_only worst2_then_best_bin_random
27: 20110 27 -0.701 -0.736 -0.626 -0.687 28302.31 280006.26 6 3 0.15 combined perf_mean random
28: 71987 28 -0.700 -0.762 -0.681 -0.716 543473.14 580642.38 2 9 0.15 simple icc_only worst2_then_best_bin_random
29: 74507 29 -0.700 -0.775 -0.681 -0.723 543473.14 579996.87 2 9 0.25 simple icc_only worst2_then_best_bin_random
30: 64617 30 -0.699 -0.761 -0.657 -0.706 65475.52 102644.76 5 8 0.20 simple cost_only worst2_then_best_bin_random
param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```
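
The ranking in the table above (most negative `fc_likert_corr_train` first) can be sketched as follows. `rank_by_train_corr` is a hypothetical helper, the three rows are copied (rounded) from the table, and breaking ties by `param_idx` is an assumption based on how tied rows appear in the printout.

```python
# Three rows copied from the table above, deliberately out of order.
rows = [
    {"param_idx": 44268, "fc_likert_corr_train": -0.718},
    {"param_idx": 69260, "fc_likert_corr_train": -0.713},
    {"param_idx": 43007, "fc_likert_corr_train": -0.720},
]


def rank_by_train_corr(rows):
    """Rank parameter sets so the most negative training correlation is rank 1."""
    ordered = sorted(
        rows,
        # ASSUMPTION: ties are broken by param_idx, as the printed table suggests.
        key=lambda r: (r["fc_likert_corr_train"], r["param_idx"]),
    )
    return [dict(r, param_rank_train=i) for i, r in enumerate(ordered, start=1)]


ranked = rank_by_train_corr(rows)
```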
```r
frequency of different parameters in the top 1000 models
---
model_selection_method
best_to_worst random worst_to_best worst2_then_best_bin_best_to_worst worst2_then_best_bin_random worst2_then_best_bin_worst_to_best worst2_then_best_to_worst
402 97 49 14 93 204 77
worst2_then_next_bin_random worst2_then_next_bin_worst_to_best
22 42
---
aggregation_method
combined simple weighted
202 596 202
---
perf_metric
corr_only cost_only icc_only perf_geometric perf_mean perf_min
146 241 209 132 134 138
---
disagreement_threshold
0.1 0.15 0.2 0.25 0.3 0.35 0.4
122 128 167 142 148 144 149
---
cost_bin
1 2 3 4 5 6
16 122 83 454 122 203
---
max_models
2 3 4 5 6 7 8 9
88 75 97 128 195 199 159 59
```
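
The frequency tables above are per-parameter counts over the top-ranked parameter sets. A small sketch of that tabulation, using four rows copied from the top-30 table; `count_param_values` is a hypothetical name, and the real tables were presumably produced in R rather than with this code.

```python
from collections import Counter

# Four rows copied from the top-30 table (categorical columns only).
top_rows = [
    {"param_idx": 43007, "aggregation_method": "simple", "perf_metric": "icc_only",
     "model_selection_method": "worst2_then_best_bin_random"},
    {"param_idx": 44268, "aggregation_method": "simple", "perf_metric": "icc_only",
     "model_selection_method": "worst2_then_best_bin_worst_to_best"},
    {"param_idx": 69260, "aggregation_method": "combined", "perf_metric": "perf_min",
     "model_selection_method": "random"},
    {"param_idx": 64617, "aggregation_method": "simple", "perf_metric": "cost_only",
     "model_selection_method": "worst2_then_best_bin_random"},
]


def count_param_values(rows, columns):
    """Count how often each value of each parameter column occurs in rows."""
    return {col: Counter(r[col] for r in rows) for col in columns}


counts = count_param_values(
    top_rows, ["model_selection_method", "aggregation_method", "perf_metric"]
)
```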
## for parameters where cost per million posts in training set is <$100
- `param_rank_train`: parameter ranking in the training set (ranked by `fc_likert_corr_train`; more negative means better performance)
- `*_train`: training set
- `*_test`: testing set
```r
param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <char> <char> <char>
1: 9332 1354 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 2 0.10 weighted corr_only best_to_worst
2: 10592 1355 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 2 0.15 weighted corr_only best_to_worst
3: 11852 1356 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 2 0.20 weighted corr_only best_to_worst
4: 13112 1357 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 2 0.25 weighted corr_only best_to_worst
5: 14372 1358 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 2 0.30 weighted corr_only best_to_worst
6: 15632 1359 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 2 0.35 weighted corr_only best_to_worst
7: 16892 1360 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 2 0.40 weighted corr_only best_to_worst
8: 18152 1361 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 3 0.10 weighted corr_only best_to_worst
9: 19412 1362 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 3 0.15 weighted corr_only best_to_worst
10: 20672 1363 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 3 0.20 weighted corr_only best_to_worst
11: 21932 1364 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 3 0.25 weighted corr_only best_to_worst
12: 23192 1365 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 3 0.30 weighted corr_only best_to_worst
13: 24452 1366 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 3 0.35 weighted corr_only best_to_worst
14: 25712 1367 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 3 0.40 weighted corr_only best_to_worst
15: 26972 1368 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 4 0.10 weighted corr_only best_to_worst
16: 28232 1369 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 4 0.15 weighted corr_only best_to_worst
17: 29492 1370 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 4 0.20 weighted corr_only best_to_worst
18: 30752 1371 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 4 0.25 weighted corr_only best_to_worst
19: 32012 1372 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 4 0.30 weighted corr_only best_to_worst
20: 33272 1373 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 4 0.35 weighted corr_only best_to_worst
21: 34532 1374 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 4 0.40 weighted corr_only best_to_worst
22: 35792 1375 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 5 0.10 weighted corr_only best_to_worst
23: 37052 1376 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 5 0.15 weighted corr_only best_to_worst
24: 38312 1377 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 5 0.20 weighted corr_only best_to_worst
25: 39572 1378 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 5 0.25 weighted corr_only best_to_worst
26: 40832 1379 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 5 0.30 weighted corr_only best_to_worst
27: 42092 1380 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 5 0.35 weighted corr_only best_to_worst
28: 43352 1381 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 5 0.40 weighted corr_only best_to_worst
29: 44612 1382 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 6 0.10 weighted corr_only best_to_worst
30: 45872 1383 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 6 0.15 weighted corr_only best_to_worst
param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```
```r
frequency of different parameters in the top 595 models where cost_per_m_post_train < 100 & fc_likert_corr_train < -0.65
---
model_selection_method
best_to_worst random
560 35
---
aggregation_method
combined simple weighted
59 60 476
---
perf_metric
corr_only cost_only icc_only perf_geometric perf_mean perf_min
147 2 3 146 146 151
---
disagreement_threshold
0.1 0.15 0.2 0.25 0.3 0.35 0.4
87 85 83 87 86 84 83
---
cost_bin
0 1
293 302
---
max_models
2 3 4 5 6 7 8 9
178 60 60 62 57 59 61 58
---
```
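
The low-cost subset above is defined by two thresholds from the frequency-table header: `cost_per_m_post_train < 100` and `fc_likert_corr_train < -0.65`. A sketch of that filter, using three rows copied (rounded) from the tables above; `filter_cost_effective` and its argument names are hypothetical.

```python
# Three rows copied from the tables above (values rounded as printed).
candidates = [
    {"param_idx": 43007, "fc_likert_corr_train": -0.720, "cost_per_m_post_train": 38716.79},
    {"param_idx": 20110, "fc_likert_corr_train": -0.701, "cost_per_m_post_train": 28302.31},
    {"param_idx": 9332, "fc_likert_corr_train": -0.664, "cost_per_m_post_train": 50.285},
]


def filter_cost_effective(rows, max_cost=100.0, max_corr=-0.65):
    """Keep rows under the cost cap with a strong enough (negative) correlation.

    The result is sorted so the most negative training correlation comes first.
    """
    kept = [
        r for r in rows
        if r["cost_per_m_post_train"] < max_cost
        and r["fc_likert_corr_train"] < max_corr
    ]
    return sorted(kept, key=lambda r: r["fc_likert_corr_train"])


cheap = filter_cost_effective(candidates)
```

Of these three rows, only the `param_idx` 9332 row passes the default thresholds; raising `max_cost` shows how quickly the expensive parameter sets take over the top of the list.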
# Most cost-effective parameter combination?
- cost_bin: 1
- max_models: 2 (up to 4 seems fine?)
- model_selection_method: best_to_worst
- aggregation_method: weighted
- disagreement threshold: doesn't seem to matter much, but smaller is probably better?
- perf_metric: corr_only
Suggested final combination (~$50 per million posts)
```r
param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<int> <int> <num> <num> <num> <num> <num> <num> <int> <int> <num> <char> <char> <char>
1: 26972 1368 -0.6643829 -0.6764605 -0.6417671 -0.5801227 50.28532 50.28532 1 4 0.1 weighted corr_only best_to_worst
```