Added 6 `model_selection` methods (see highlights below) and fixed a bug in the cost calculation used in previous simulations.
# parameters
- cost_bin
    - smaller bin index means cheaper models (0 to 6)
- ==model_selection method== (updated)
    - one cost bin (old)
        - best_to_worst: select the next worst model based on perf_metric, starting from the best
        - random: randomly select a model from the remaining pool of models
        - worst_to_best: select the next best model, starting from the worst
        - worst2_then_best_to_worst: start with the 2 worst models, then the best model, then the next best etc.
    - ==two cost bins== (newly added!)
        - worst2_then_best_bin_best_to_worst: select worst 2 models from current bin, then choose **best** model from the **best bin**, and next best model in the best bin etc.
        - worst2_then_best_bin_random: select worst 2 models from current bin, then choose **random** model from the **best bin** etc.
        - worst2_then_best_bin_worst_to_best: select worst 2 models from current bin, then choose **worst** model from the **best bin** etc.
        - worst2_then_next_bin_best_to_worst: same as worst2_then_best_bin_best_to_worst, but choose from the **next better bin** (not the best bin)
        - worst2_then_next_bin_random
        - worst2_then_next_bin_worst_to_best
- max_models: maximum number of models used to evaluate each post (i.e., the number actually used for a post can be anything between 2 and max_models)
    - I also simulated choosing 1 model from a bin, but it never shows up in the top models list, suggesting some combination of models is generally better than just having one model
- disagreement threshold: for each post, how much disagreement to tolerate/allow across models before terminating (otherwise, keep adding models until `max_models`)
- aggregation method: how to combine ratings across models
    - simple: simple mean
    - weighted: weighted by perf_metric (see next parameter)
    - combined: weighted by perf_metric and uncertainty/disagreement across models
- perf_metric: metrics for initially evaluating how good a model is
    - corr_only: correlation of an LLM's ratings with fc likert
    - icc_only: an LLM's internal reliability across multiple runs
    - perf_geometric: geometric mean of corr and icc
    - perf_mean: simple mean of corr and icc
    - perf_min: min(corr, icc)
    - cost_only: model inference cost
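
The composite `perf_metric` options and the one-bin selection orders can be sketched in a few lines of Python. This is only an illustration under assumptions, not the simulation code: the model names and numbers are made up, `corr` is assumed to be stored as a magnitude (higher = better), the order of the two worst models in `worst2_then_best_to_worst` is assumed to be worst first, and `cost_only` and the two-bin methods are left out because their details are not spelled out here.

```python
import math
import random

# Hypothetical per-model statistics; names and numbers are illustrative only.
# Assumption: corr is a magnitude, so higher is better for both corr and icc.
models = {
    "model_a": {"corr": 0.70, "icc": 0.90},
    "model_b": {"corr": 0.60, "icc": 0.95},
    "model_c": {"corr": 0.50, "icc": 0.80},
    "model_d": {"corr": 0.40, "icc": 0.85},
}

def perf_score(stats, perf_metric):
    """Score one model under the chosen perf_metric (higher = better)."""
    corr, icc = stats["corr"], stats["icc"]
    if perf_metric == "corr_only":
        return corr
    if perf_metric == "icc_only":
        return icc
    if perf_metric == "perf_geometric":
        return math.sqrt(corr * icc)  # geometric mean of corr and icc
    if perf_metric == "perf_mean":
        return (corr + icc) / 2       # simple mean of corr and icc
    if perf_metric == "perf_min":
        return min(corr, icc)
    raise ValueError(f"unknown perf_metric: {perf_metric}")

def selection_order(models, perf_metric, method, seed=0):
    """Return model names in the order they would be added for one cost bin."""
    best_first = sorted(
        models, key=lambda name: perf_score(models[name], perf_metric), reverse=True
    )
    if method == "best_to_worst":
        return best_first
    if method == "worst_to_best":
        return best_first[::-1]
    if method == "random":
        shuffled = best_first[:]
        random.Random(seed).shuffle(shuffled)
        return shuffled
    if method == "worst2_then_best_to_worst":
        worst2 = best_first[-2:][::-1]  # assumption: worst first, then second worst
        return worst2 + best_first[:-2]
    raise ValueError(f"unknown model_selection method: {method}")

print(selection_order(models, "corr_only", "worst2_then_best_to_worst"))
```

Only the first `max_models` entries of the returned order would ever be used for a given post.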
# best models
## across all 69552 simulations
- `param_rank_train`: parameter ranking in training set (ranked by `fc_likert_corr_train`; more negative means better performance)
- `*_train`: training set
- `*_test`: testing set
`worst2_then_best_bin_*` parameters work well (e.g., `fc_likert_corr_train <= -0.7`), but are super expensive (see the cost per million posts in the `cost_per_m_post_train` column: roughly $30k-$500k!). They are often the top-performing models (see the `param_rank_train` column).
```r
param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <char> <char> <char>
1: 43007 1 -0.720 -0.739 -0.664 -0.690 38716.79 505767.57 2 5 0.40 simple icc_only worst2_then_best_bin_random
2: 44268 2 -0.718 -0.750 -0.658 -0.688 45780.68 45780.68 2 6 0.10 simple icc_only worst2_then_best_bin_worst_to_best
3: 45528 3 -0.718 -0.750 -0.658 -0.688 45780.68 45780.68 2 6 0.15 simple icc_only worst2_then_best_bin_worst_to_best
4: 46788 4 -0.718 -0.750 -0.658 -0.688 45780.68 45780.68 2 6 0.20 simple icc_only worst2_then_best_bin_worst_to_best
5: 48048 5 -0.718 -0.750 -0.658 -0.688 45780.68 45780.68 2 6 0.25 simple icc_only worst2_then_best_bin_worst_to_best
6: 49308 6 -0.718 -0.750 -0.658 -0.688 45780.68 45780.68 2 6 0.30 simple icc_only worst2_then_best_bin_worst_to_best
7: 50568 7 -0.718 -0.750 -0.658 -0.688 45780.68 45780.68 2 6 0.35 simple icc_only worst2_then_best_bin_worst_to_best
8: 51828 8 -0.718 -0.750 -0.658 -0.688 45780.68 45780.68 2 6 0.40 simple icc_only worst2_then_best_bin_worst_to_best
9: 53088 9 -0.714 -0.756 -0.678 -0.698 50782.11 50782.11 2 7 0.10 simple icc_only worst2_then_best_bin_worst_to_best
10: 54348 10 -0.714 -0.756 -0.678 -0.698 50782.11 50782.11 2 7 0.15 simple icc_only worst2_then_best_bin_worst_to_best
11: 55608 11 -0.714 -0.756 -0.678 -0.698 50782.11 50782.11 2 7 0.20 simple icc_only worst2_then_best_bin_worst_to_best
12: 56868 12 -0.714 -0.756 -0.678 -0.698 50782.11 50782.11 2 7 0.25 simple icc_only worst2_then_best_bin_worst_to_best
13: 58128 13 -0.714 -0.756 -0.678 -0.698 50782.11 50782.11 2 7 0.30 simple icc_only worst2_then_best_bin_worst_to_best
14: 59388 14 -0.714 -0.756 -0.678 -0.698 50782.11 50782.11 2 7 0.35 simple icc_only worst2_then_best_bin_worst_to_best
15: 60648 15 -0.714 -0.756 -0.678 -0.698 50782.11 50782.11 2 7 0.40 simple icc_only worst2_then_best_bin_worst_to_best
16: 69260 16 -0.713 -0.742 -0.660 -0.715 141125.61 126294.26 6 8 0.35 combined perf_min random
17: 61908 17 -0.712 -0.755 -0.670 -0.698 63727.18 63727.18 2 8 0.10 simple icc_only worst2_then_best_bin_worst_to_best
18: 63168 18 -0.712 -0.755 -0.670 -0.698 63727.18 63727.18 2 8 0.15 simple icc_only worst2_then_best_bin_worst_to_best
19: 64428 19 -0.712 -0.755 -0.670 -0.698 63727.18 63727.18 2 8 0.20 simple icc_only worst2_then_best_bin_worst_to_best
20: 65688 20 -0.712 -0.755 -0.670 -0.698 63727.18 63727.18 2 8 0.25 simple icc_only worst2_then_best_bin_worst_to_best
21: 66948 21 -0.712 -0.755 -0.670 -0.698 63727.18 63727.18 2 8 0.30 simple icc_only worst2_then_best_bin_worst_to_best
22: 68208 22 -0.712 -0.755 -0.670 -0.698 63727.18 63727.18 2 8 0.35 simple icc_only worst2_then_best_bin_worst_to_best
23: 69468 23 -0.712 -0.755 -0.670 -0.698 63727.18 63727.18 2 8 0.40 simple icc_only worst2_then_best_bin_worst_to_best
24: 57930 24 -0.706 -0.640 -0.655 -0.620 135913.20 90445.96 6 7 0.25 combined corr_only random
25: 50567 25 -0.704 -0.769 -0.638 -0.709 45717.45 82886.70 2 6 0.35 simple icc_only worst2_then_best_bin_random
26: 59387 26 -0.703 -0.775 -0.661 -0.714 50718.89 553647.88 2 7 0.35 simple icc_only worst2_then_best_bin_random
27: 20110 27 -0.701 -0.736 -0.626 -0.687 28302.31 280006.26 6 3 0.15 combined perf_mean random
28: 71987 28 -0.700 -0.762 -0.681 -0.716 543473.14 580642.38 2 9 0.15 simple icc_only worst2_then_best_bin_random
29: 74507 29 -0.700 -0.775 -0.681 -0.723 543473.14 579996.87 2 9 0.25 simple icc_only worst2_then_best_bin_random
30: 64617 30 -0.699 -0.761 -0.657 -0.706 65475.52 102644.76 5 8 0.20 simple cost_only worst2_then_best_bin_random
param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```
```r
frequency of different parameters in the top 1000 models
---
model_selection_method
best_to_worst random worst_to_best worst2_then_best_bin_best_to_worst worst2_then_best_bin_random worst2_then_best_bin_worst_to_best worst2_then_best_to_worst
402 97 49 14 93 204 77
worst2_then_next_bin_random worst2_then_next_bin_worst_to_best
22 42
---
aggregation_method
combined simple weighted
202 596 202
---
perf_metric
corr_only cost_only icc_only perf_geometric perf_mean perf_min
146 241 209 132 134 138
---
disagreement_threshold
0.1 0.15 0.2 0.25 0.3 0.35 0.4
122 128 167 142 148 144 149
---
cost_bin
1 2 3 4 5 6
16 122 83 454 122 203
---
max_models
2 3 4 5 6 7 8 9
88 75 97 128 195 199 159 59
```
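
A frequency table like the one above can be produced by ranking all parameter combinations on the training correlation, keeping the top N, and counting each parameter's values. A minimal Python sketch, assuming one dict per simulation; the four rows are made-up placeholders that only reuse the column names, and `top_n_frequencies` is a hypothetical helper:

```python
from collections import Counter

# Made-up rows that reuse a few column names from the results table.
results = [
    {"fc_likert_corr_train": -0.72, "aggregation_method": "simple", "cost_bin": 2},
    {"fc_likert_corr_train": -0.66, "aggregation_method": "weighted", "cost_bin": 1},
    {"fc_likert_corr_train": -0.70, "aggregation_method": "simple", "cost_bin": 2},
    {"fc_likert_corr_train": -0.61, "aggregation_method": "combined", "cost_bin": 4},
]

def top_n_frequencies(rows, n, params, rank_by="fc_likert_corr_train"):
    """Count parameter values among the n best rows (most negative rank_by first)."""
    ranked = sorted(rows, key=lambda row: row[rank_by])
    top = ranked[:n]
    return {param: Counter(row[param] for row in top) for param in params}

freq = top_n_frequencies(results, n=3, params=["aggregation_method", "cost_bin"])
for param, counts in freq.items():
    print(param, dict(counts))
```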
## for parameters where cost per million posts in training set is <$100
- `param_rank_train`: parameter ranking in training set (ranked by `fc_likert_corr_train`; more negative means better performance)
- `*_train`: training set
- `*_test`: testing set
```r
param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <num> <char> <char> <char>
1: 9332 1354 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 2 0.10 weighted corr_only best_to_worst
2: 10592 1355 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 2 0.15 weighted corr_only best_to_worst
3: 11852 1356 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 2 0.20 weighted corr_only best_to_worst
4: 13112 1357 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 2 0.25 weighted corr_only best_to_worst
5: 14372 1358 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 2 0.30 weighted corr_only best_to_worst
6: 15632 1359 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 2 0.35 weighted corr_only best_to_worst
7: 16892 1360 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 2 0.40 weighted corr_only best_to_worst
8: 18152 1361 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 3 0.10 weighted corr_only best_to_worst
9: 19412 1362 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 3 0.15 weighted corr_only best_to_worst
10: 20672 1363 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 3 0.20 weighted corr_only best_to_worst
11: 21932 1364 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 3 0.25 weighted corr_only best_to_worst
12: 23192 1365 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 3 0.30 weighted corr_only best_to_worst
13: 24452 1366 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 3 0.35 weighted corr_only best_to_worst
14: 25712 1367 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 3 0.40 weighted corr_only best_to_worst
15: 26972 1368 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 4 0.10 weighted corr_only best_to_worst
16: 28232 1369 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 4 0.15 weighted corr_only best_to_worst
17: 29492 1370 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 4 0.20 weighted corr_only best_to_worst
18: 30752 1371 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 4 0.25 weighted corr_only best_to_worst
19: 32012 1372 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 4 0.30 weighted corr_only best_to_worst
20: 33272 1373 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 4 0.35 weighted corr_only best_to_worst
21: 34532 1374 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 4 0.40 weighted corr_only best_to_worst
22: 35792 1375 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 5 0.10 weighted corr_only best_to_worst
23: 37052 1376 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 5 0.15 weighted corr_only best_to_worst
24: 38312 1377 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 5 0.20 weighted corr_only best_to_worst
25: 39572 1378 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 5 0.25 weighted corr_only best_to_worst
26: 40832 1379 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 5 0.30 weighted corr_only best_to_worst
27: 42092 1380 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 5 0.35 weighted corr_only best_to_worst
28: 43352 1381 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 5 0.40 weighted corr_only best_to_worst
29: 44612 1382 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 6 0.10 weighted corr_only best_to_worst
30: 45872 1383 -0.664 -0.676 -0.642 -0.58 50.285 50.285 1 6 0.15 weighted corr_only best_to_worst
param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```
```r
frequency of different parameters in the top 595 models where cost_per_m_post_train < 100 & fc_likert_corr_train < -0.65
---
model_selection_method
best_to_worst random
560 35
---
aggregation_method
combined simple weighted
59 60 476
---
perf_metric
corr_only cost_only icc_only perf_geometric perf_mean perf_min
147 2 3 146 146 151
---
disagreement_threshold
0.1 0.15 0.2 0.25 0.3 0.35 0.4
87 85 83 87 86 84 83
---
cost_bin
0 1
293 302
---
max_models
2 3 4 5 6 7 8 9
178 60 60 62 57 59 61 58
---
```
## for parameters where cost per million posts in training set is ($100,$400)
- 14041 models fall in this cost range; the top 30 are shown below
- cost_per_m_post_train ranges from about $285 to $398 across these 30 parameter combinations
- performance (fc_likert_corr_train) **improves only by about r=0.01** (best: -0.677 vs. -0.664) relative to the parameters above where cost per million posts is <$100
```r
param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<int> <int> <num> <num> <num> <num> <num> <num> <int> <int> <num> <char> <char> <char>
1: 21192 426 -0.6769416 -0.6796620 -0.6597731 -0.6252380 284.6520 288.9997 3 3 0.20 combined perf_mean best_to_worst
2: 21182 463 -0.6768965 -0.6795409 -0.6596890 -0.6251304 284.6520 288.9997 3 3 0.20 combined perf_geometric best_to_worst
3: 30012 503 -0.6756293 -0.6815633 -0.6453492 -0.6308724 314.5609 320.9712 3 4 0.20 combined perf_mean best_to_worst
4: 30002 504 -0.6755618 -0.6814398 -0.6452668 -0.6307591 314.5609 320.9712 3 4 0.20 combined perf_geometric best_to_worst
5: 38832 512 -0.6750520 -0.6894445 -0.6571091 -0.6342248 397.5603 427.4395 3 5 0.20 combined perf_mean best_to_worst
6: 38822 515 -0.6750310 -0.6892278 -0.6568844 -0.6340867 397.5603 427.4395 3 5 0.20 combined perf_geometric best_to_worst
7: 17822 520 -0.6745908 -0.7132622 -0.6442014 -0.6531874 360.4387 360.4387 3 3 0.10 simple perf_geometric best_to_worst
8: 17832 521 -0.6745908 -0.7132622 -0.6442014 -0.6531874 360.4387 360.4387 3 3 0.10 simple perf_mean best_to_worst
9: 19082 522 -0.6745908 -0.7132622 -0.6442014 -0.6531874 360.4387 360.4387 3 3 0.15 simple perf_geometric best_to_worst
10: 19092 523 -0.6745908 -0.7132622 -0.6442014 -0.6531874 360.4387 360.4387 3 3 0.15 simple perf_mean best_to_worst
11: 20342 524 -0.6745908 -0.7132622 -0.6442014 -0.6531874 360.4387 360.4387 3 3 0.20 simple perf_geometric best_to_worst
12: 20352 525 -0.6745908 -0.7132622 -0.6442014 -0.6531874 360.4387 360.4387 3 3 0.20 simple perf_mean best_to_worst
13: 21602 526 -0.6745908 -0.7132622 -0.6442014 -0.6531874 360.4387 360.4387 3 3 0.25 simple perf_geometric best_to_worst
14: 21612 527 -0.6745908 -0.7132622 -0.6442014 -0.6531874 360.4387 360.4387 3 3 0.25 simple perf_mean best_to_worst
15: 22862 528 -0.6745908 -0.7132622 -0.6442014 -0.6531874 360.4387 360.4387 3 3 0.30 simple perf_geometric best_to_worst
16: 22872 529 -0.6745908 -0.7132622 -0.6442014 -0.6531874 360.4387 360.4387 3 3 0.30 simple perf_mean best_to_worst
17: 24122 530 -0.6745908 -0.7132622 -0.6442014 -0.6531874 360.4387 360.4387 3 3 0.35 simple perf_geometric best_to_worst
18: 24132 531 -0.6745908 -0.7132622 -0.6442014 -0.6531874 360.4387 360.4387 3 3 0.35 simple perf_mean best_to_worst
19: 25382 532 -0.6745908 -0.7132622 -0.6442014 -0.6531874 360.4387 360.4387 3 3 0.40 simple perf_geometric best_to_worst
20: 25392 533 -0.6745908 -0.7132622 -0.6442014 -0.6531874 360.4387 360.4387 3 3 0.40 simple perf_mean best_to_worst
21: 38852 914 -0.6704436 -0.6873097 -0.6498157 -0.6318334 373.2393 426.6249 3 5 0.20 combined corr_only best_to_worst
22: 38842 922 -0.6704029 -0.6874381 -0.6498125 -0.6319519 373.2393 426.6249 3 5 0.20 combined perf_min best_to_worst
23: 30032 949 -0.6702093 -0.6801288 -0.6361830 -0.6286456 309.3936 320.1567 3 4 0.20 combined corr_only best_to_worst
24: 30022 950 -0.6701711 -0.6802595 -0.6361465 -0.6287761 309.3936 320.1567 3 4 0.20 combined perf_min best_to_worst
25: 30050 953 -0.6700226 -0.7021708 -0.6733393 -0.6090356 396.1295 570.9819 3 4 0.20 combined cost_only random
26: 31272 1189 -0.6672535 -0.6725090 -0.6315820 -0.6208303 299.4848 311.6049 3 4 0.25 combined perf_mean best_to_worst
27: 31262 1191 -0.6672209 -0.6724043 -0.6315670 -0.6207379 299.4848 311.6049 3 4 0.25 combined perf_geometric best_to_worst
28: 32532 1294 -0.6660981 -0.6711803 -0.6286871 -0.6150379 292.7168 305.8691 3 4 0.30 combined perf_mean best_to_worst
29: 32522 1297 -0.6660708 -0.6710746 -0.6286906 -0.6149619 292.7168 305.8691 3 4 0.30 combined perf_geometric best_to_worst
30: 40082 1298 -0.6660324 -0.6810570 -0.6437485 -0.6253327 363.3305 404.4234 3 5 0.25 combined perf_geometric best_to_worst
param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```
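
The trade-off against the <$100 tier can be checked directly from the two tables. A quick Python sketch of the arithmetic, using the best row of each tier (values copied from the tables; the variable names are only illustrative):

```python
# Best row where cost per million posts is <$100.
cheap_corr, cheap_cost = -0.664, 50.285
# Best row in the ($100,$400) range.
mid_corr, mid_cost = -0.6769416, 284.6520

# More negative correlation is better, so the gain is the drop in r.
gain = cheap_corr - mid_corr
cost_ratio = mid_cost / cheap_cost
extra_cost = mid_cost - cheap_cost

print(f"gain in |r|: {gain:.3f}")
print(f"cost ratio: {cost_ratio:.1f}x")
print(f"extra cost per million posts: ${extra_cost:.2f}")
```

So the better-performing tier costs several times more per million posts for a gain of roughly 0.01 in correlation.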
```r
frequency of different parameters in the top 14041 models
---
model_selection_method
best_to_worst random worst_to_best worst2_then_best_bin_best_to_worst worst2_then_best_bin_random worst2_then_best_bin_worst_to_best worst2_then_best_to_worst worst2_then_next_bin_best_to_worst
1718 1344 1672 805 805 807 1815 1724
worst2_then_next_bin_random worst2_then_next_bin_worst_to_best
1649 1702
---
aggregation_method
combined simple weighted
4187 4035 5819
---
perf_metric
corr_only cost_only icc_only perf_geometric perf_mean perf_min
2636 1648 1873 2623 2618 2643
---
disagreement_threshold
0.1 0.15 0.2 0.25 0.3 0.35 0.4
1981 1981 1967 1991 2023 2049 2049
---
cost_bin
0 1 2 3
443 2391 5585 5622
---
max_models
1 2 3 4 5 6 7 8 9
1020 2047 1552 1582 1584 1644 1649 1520 1443
---
```
# Most cost-effective parameter combination?
- cost_bin: 1
- max_models: 2 (up to 4 seems fine?)
- model_selection_method: best_to_worst
- aggregation_method: weighted
- disagreement_threshold: doesn't seem to matter too much, but probably smaller is better?
- perf_metric: corr_only

Suggested final combination (~$50 per million posts):
```r
param_idx param_rank_train fc_likert_corr_train fc_likert_corr_test fc_modal_corr_train fc_modal_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<int> <int> <num> <num> <num> <num> <num> <num> <int> <int> <num> <char> <char> <char>
1: 26972 1368 -0.6643829 -0.6764605 -0.6417671 -0.5801227 50.28532 50.28532 1 4 0.1 weighted corr_only best_to_worst
```
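
To make this combination concrete, here is a rough Python sketch of how a single post might be scored with `best_to_worst`, `weighted`, `max_models = 4` and `disagreement_threshold = 0.1`. Several things are assumptions rather than the simulation's actual definitions: disagreement is taken to be the population standard deviation of the ratings (on a 0-1 scale), the weights are each model's correlation magnitude (`corr_only`), and `rate_post` plus all example numbers are hypothetical.

```python
import statistics

def rate_post(ratings, weights, max_models=4, disagreement_threshold=0.1):
    """Score one post, adding models in order until they agree or max_models is hit.

    ratings: per-model ratings for the post, already ordered best_to_worst.
    weights: per-model weights in the same order (assumed: correlation magnitude).
    Returns (weighted mean rating, number of models used).
    """
    if len(ratings) < 2 or len(ratings) != len(weights):
        raise ValueError("need at least 2 ratings and one weight per rating")
    limit = min(max_models, len(ratings))
    n_used = 2  # always start with two models
    # Assumption: disagreement = population standard deviation of ratings so far.
    while n_used < limit and statistics.pstdev(ratings[:n_used]) > disagreement_threshold:
        n_used += 1
    used_ratings = ratings[:n_used]
    used_weights = weights[:n_used]
    score = sum(r * w for r, w in zip(used_ratings, used_weights)) / sum(used_weights)
    return score, n_used

weights = [0.7, 0.6, 0.5, 0.4]  # made-up correlation magnitudes, best model first

# First two models agree, so only two models are paid for.
print(rate_post([0.80, 0.75, 0.20, 0.10], weights))
# Models keep disagreeing, so all four (max_models) are used.
print(rate_post([0.80, 0.40, 0.70, 0.60], weights))
```

Under these assumptions, a stricter (smaller) threshold only changes the result for posts where the first models disagree, which may be part of why the threshold matters little for this combination.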