# train set (62 headlines) and test set (145 headlines) correlation for top 50 parameter combinations (only combinations with cost per million posts in train set <= 100)
- fc_likert_corr_train: correlation between fc_likert and the LLMs in the train set
- fc_likert_corr_test: correlation between fc_likert and the LLMs in the test set
  - ideally, this should correlate with fc_likert_corr_train, and it does!
- cost_per_m_post_train: cost per million posts in train set
- cost_per_m_post_test: cost per million posts in test set
  - ideally, should correlate with cost_per_m_post_train (and it does) to ensure we're keeping costs low
- other columns are parameter values
- final parameters?
  - max_models: 5 (3 or 4 could also work)
  - disagreement threshold: doesn't matter much, but probably 0.1
  - aggregation method: combined
  - perf_metric: correlation only (don't need model reliability/ICC)
  - cost_bin: second cheapest bin
  - model selection method: worst2_then_best (worst two models within a cost bin, then the best)
```r
fc_likert_corr_train fc_likert_corr_test cost_per_m_post_train cost_per_m_post_test max_models disagreement_threshold aggregation_method perf_metric model_selection_method model_cost
<num> <num> <num> <num> <num> <num> <char> <char> <char> <char>
1: -0.706 -0.667 73.405 63.990 4 0.10 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
2: -0.703 -0.673 73.405 63.990 3 0.10 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
3: -0.697 -0.673 73.405 63.990 5 0.10 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
4: -0.686 -0.676 73.405 63.990 6 0.10 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
5: -0.682 -0.630 50.285 50.285 2 0.10 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
6: -0.682 -0.630 50.285 50.285 2 0.10 combined corr_only worst_to_best 0.0000129181451612903,0.0000331427419354839
7: -0.682 -0.630 50.285 50.285 2 0.15 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
8: -0.682 -0.630 50.285 50.285 2 0.15 combined corr_only worst_to_best 0.0000129181451612903,0.0000331427419354839
9: -0.682 -0.630 50.285 50.285 2 0.20 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
10: -0.682 -0.630 50.285 50.285 2 0.20 combined corr_only worst_to_best 0.0000129181451612903,0.0000331427419354839
11: -0.682 -0.630 50.285 50.285 2 0.25 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
12: -0.682 -0.630 50.285 50.285 2 0.25 combined corr_only worst_to_best 0.0000129181451612903,0.0000331427419354839
13: -0.682 -0.630 50.285 50.285 2 0.30 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
14: -0.682 -0.630 50.285 50.285 2 0.30 combined corr_only worst_to_best 0.0000129181451612903,0.0000331427419354839
15: -0.682 -0.630 50.285 50.285 2 0.35 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
16: -0.682 -0.630 50.285 50.285 2 0.35 combined corr_only worst_to_best 0.0000129181451612903,0.0000331427419354839
17: -0.682 -0.630 50.285 50.285 2 0.40 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
18: -0.682 -0.630 50.285 50.285 2 0.40 combined corr_only worst_to_best 0.0000129181451612903,0.0000331427419354839
19: -0.676 -0.678 73.405 63.990 7 0.10 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
20: -0.674 -0.671 64.998 60.620 4 0.15 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
21: -0.673 -0.473 70.303 79.098 8 0.25 simple perf_geometric random 0.0000129181451612903,0.0000331427419354839
22: -0.673 -0.676 64.998 60.620 3 0.15 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
23: -0.672 -0.675 64.998 60.620 5 0.15 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
24: -0.672 -0.674 58.167 55.902 4 0.25 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
25: -0.672 -0.677 59.218 56.801 4 0.20 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
26: -0.672 -0.674 57.116 55.003 4 0.30 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
27: -0.671 -0.447 19.167 15.410 5 0.20 combined corr_only random 3.18258064516129e-6,0.0000118854838709677
28: -0.671 -0.521 66.139 48.790 9 0.30 simple corr_only random 0.0000129181451612903,0.0000331427419354839
29: -0.671 -0.679 58.167 55.902 3 0.25 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
30: -0.671 -0.676 57.116 55.003 5 0.30 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
31: -0.671 -0.681 59.218 56.801 3 0.20 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
32: -0.671 -0.676 58.167 55.902 5 0.25 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
33: -0.670 -0.679 59.218 56.801 5 0.20 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
34: -0.670 -0.674 52.913 53.655 4 0.40 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
35: -0.670 -0.675 52.913 53.655 5 0.40 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
36: -0.670 -0.675 52.913 53.655 6 0.40 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
37: -0.670 -0.675 52.913 53.655 7 0.40 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
38: -0.670 -0.675 52.913 53.655 8 0.40 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
39: -0.670 -0.675 52.913 53.655 9 0.40 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
40: -0.669 -0.679 57.116 55.003 3 0.30 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
41: -0.669 -0.676 55.540 54.329 5 0.35 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
42: -0.669 -0.676 55.540 54.329 6 0.35 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
43: -0.669 -0.676 55.540 54.329 7 0.35 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
44: -0.669 -0.676 55.540 54.329 8 0.35 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
45: -0.669 -0.676 55.540 54.329 9 0.35 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
46: -0.669 -0.677 57.116 55.003 6 0.30 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
47: -0.669 -0.675 55.540 54.329 4 0.35 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
48: -0.668 -0.677 64.998 60.620 6 0.15 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
49: -0.668 -0.679 52.913 53.655 3 0.40 combined corr_only worst2_then_best 0.0000129181451612903,0.0000331427419354839
50: -0.668 -0.520 64.440 65.256 6 0.15 combined perf_mean random 0.0000129181451612903,0.0000331427419354839
fc_likert_corr_train fc_likert_corr_test cost_per_m_post_train cost_per_m_post_test max_models disagreement_threshold aggregation_method perf_metric model_selection_method model_cost
```
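The selection behind this table can be sketched in base R roughly as follows. This is only a minimal sketch, not the original analysis code: the data frame `grid_results` and its toy values are hypothetical stand-ins for the full grid of results, and only the column names come from the table above.

```r
# Hypothetical stand-in for the full grid of results (values are made up).
grid_results <- data.frame(
  fc_likert_corr_train  = c(-0.70, -0.55, -0.68, -0.72, -0.30),
  fc_likert_corr_test   = c(-0.67, -0.60, -0.63, -0.50, -0.35),
  cost_per_m_post_train = c(73.4, 19.2, 50.3, 150.0, 10.0),
  max_models            = c(4, 5, 2, 8, 3)
)

# Keep only combinations that are cheap enough in the train set.
affordable <- grid_results[grid_results$cost_per_m_post_train <= 100, ]

# More negative correlation is better, so sort ascending and keep the top 50.
ranked <- affordable[order(affordable$fc_likert_corr_train), ]
top50  <- head(ranked, 50)
print(top50)
```

Filtering on cost before ranking means an expensive combination (like the toy row with cost 150) never enters the top 50, however strong its train correlation.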
# train set (62 headlines) and test set (145 headlines) correlation for all 25200 parameter combinations
```r
Parameter1 | Parameter2 | r | 95% CI
----------------------------------------------------------------
fc_likert_corr_train | fc_likert_corr_test | 0.65 | [0.65, 0.66]
```
correlations between fc-likert and LLMs tend to be stronger in the test set (though not always: several of the top 50 rows above are weaker in test), probably because it has more data/headlines?
![[1757360313.png]]
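A train/test correlation with a 95% CI, plus a quick check of how often the test correlation is actually the stronger one, could be computed along these lines. This is a sketch under stated assumptions: base R's `cor.test` is used as a stand-in for whatever produced the output above, and the two vectors are simulated toy data, not the real 25200-row grid.

```r
set.seed(1)  # toy data only; the real grid has 25200 rows

# Hypothetical stand-in: test correlations loosely track train correlations.
n <- 500
fc_likert_corr_train <- runif(n, min = -0.7, max = 0)
fc_likert_corr_test  <- fc_likert_corr_train - 0.02 + rnorm(n, sd = 0.1)

# Pearson correlation across parameter combinations, with a 95% CI.
ct <- cor.test(fc_likert_corr_train, fc_likert_corr_test, conf.level = 0.95)
r  <- unname(ct$estimate)
ci <- as.numeric(ct$conf.int)
cat(sprintf("r = %.2f, 95%% CI [%.2f, %.2f]\n", r, ci[1], ci[2]))

# Share of combinations whose test correlation is stronger (more negative)
# than the train correlation; "always stronger" would mean a share of 1.
share_stronger_in_test <- mean(fc_likert_corr_test < fc_likert_corr_train)
cat(sprintf("stronger in test: %.1f%%\n", 100 * share_stronger_in_test))
```

Reporting the share alongside r separates two questions: whether train and test rank the combinations similarly, and whether test correlations are systematically shifted.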
# train set (62 headlines) and test set (145 headlines) correlation for 4188 parameter combinations where
- cost per million posts in train set <= 200
- fc-likert and LLM correlation in train set < -0.5 (negative is better)
generally good correspondence between train and test set results?
![[1757361453.png]]
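The two filters above, followed by a train/test agreement check on the surviving combinations, might look like this in base R. Again a minimal sketch: `grid_results` and its values are hypothetical, and only the column names and thresholds are taken from these notes.

```r
# Hypothetical stand-in for the full grid (values are made up).
grid_results <- data.frame(
  fc_likert_corr_train  = c(-0.70, -0.55, -0.45, -0.62, -0.58, -0.66),
  fc_likert_corr_test   = c(-0.67, -0.60, -0.50, -0.65, -0.57, -0.68),
  cost_per_m_post_train = c(73.4, 19.2, 50.3, 250.0, 120.0, 64.9)
)

# Apply both filters: affordable in train AND strongly negative in train.
good <- subset(
  grid_results,
  cost_per_m_post_train <= 200 & fc_likert_corr_train < -0.5
)

# Train/test agreement within the filtered subset.
r_subset <- cor(good$fc_likert_corr_train, good$fc_likert_corr_test)
cat(nrow(good), "combinations kept; train/test r =", round(r_subset, 2), "\n")
```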