results below include only training-set headlines (62 of 207, leaving 145 for testing)
- test-set results are being generated now; we'll correlate them with these training-set results
# correlation(LLM ratings, fc likert ratings) ~ estimated cost (for all 25200 simulations)
![[1755835915.png]]
## performance for different values of each parameter (all 25200 simulations)
parameters (see the sketches after this list for how they might fit together)
- smaller y-axis values are better (correlations with the fact-checker are negative, so more negative = better)
- max_models: maximum number of models to use
- disagreement_threshold: how much disagreement to tolerate across models before stopping (otherwise, keep adding models until max_models is reached) - doesn't seem to matter much
- aggregation_method: how to combine ratings across models
    - simple: simple mean
    - weighted: weighted by perf_metric (see next parameter)
    - combined: weighted by perf_metric and by uncertainty/disagreement across models
- perf_metric: metric for initially evaluating how good a model is
    - corr_only: correlation of an LLM's ratings with fc likert (dave suggestion)
    - icc_only: an LLM's internal reliability across multiple runs (dave suggestion)
    - perf_geometric: geometric mean of corr and icc
    - perf_mean: simple mean of corr and icc
    - perf_min: min(corr, icc)
- model_selection_method
    - best_to_worst: start with the best model (by perf_metric), then add progressively worse ones
    - random: randomly select a model from the remaining pool
    - worst_to_best: start with the worst model, then add progressively better ones
    - worst2_then_best: start with the 2 worst models, then add the best remaining ones (dave suggestion)
- cost_bin
    - smaller bin index means cheaper models
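as a reference for the perf_metric definitions above, a minimal R sketch of the five variants (function and argument names are mine, not the pipeline's; I'm also assuming the metric is computed on the magnitude of the correlation, since the fc correlations here are negative):

```r
# sketch of the five perf_metric variants; `corr` is an LLM's correlation
# with fc likert ratings, `icc` its internal reliability across runs
perf_metric <- function(corr, icc,
                        method = c("corr_only", "icc_only", "perf_geometric",
                                   "perf_mean", "perf_min")) {
  method <- match.arg(method)
  corr <- abs(corr)  # assumption: score on magnitude, since fc correlations are negative
  switch(method,
         corr_only      = corr,
         icc_only       = icc,
         perf_geometric = sqrt(corr * icc),
         perf_mean      = (corr + icc) / 2,
         perf_min       = min(corr, icc))
}

# e.g., perf_metric(-0.68, 0.80, "perf_geometric")
```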
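and a sketch of how one simulation step might chain the pieces together: model ordering, the disagreement-based stopping rule, and the three aggregation methods. this is my reading of the setup, not the actual pipeline code; `models`, `rating`, and `perf` are hypothetical names, and I'm assuming disagreement is measured as the SD of the ratings collected so far:

```r
# order the candidate models according to model_selection_method
order_models <- function(models, method = c("best_to_worst", "worst_to_best",
                                            "random", "worst2_then_best")) {
  method <- match.arg(method)
  best_first <- order(models$perf, decreasing = TRUE)
  idx <- switch(method,
                best_to_worst    = best_first,
                worst_to_best    = rev(best_first),
                random           = sample(nrow(models)),
                # two worst models first, then the rest from best to worst
                worst2_then_best = c(rev(tail(best_first, 2)), head(best_first, -2)))
  models[idx, ]
}

# rate one headline: keep adding models until they agree enough
# (or until max_models), then aggregate their ratings
rate_headline <- function(models, max_models, disagreement_threshold,
                          aggregation_method = c("simple", "weighted", "combined")) {
  aggregation_method <- match.arg(aggregation_method)
  k <- 2  # assumption: always query at least two models
  while (k < min(max_models, nrow(models)) &&
         sd(models$rating[1:k]) > disagreement_threshold) {
    k <- k + 1
  }
  r <- models$rating[1:k]
  w <- models$perf[1:k]
  switch(aggregation_method,
         simple   = mean(r),
         weighted = weighted.mean(r, w = w),
         # combined: also downweight models that deviate most from the others
         # (one plausible reading of "weighted by perf and disagreement")
         combined = weighted.mean(r, w = w / (1 + abs(r - mean(r)))))
}
```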
![[1755835835.png]]
## focusing on the cheapest parameter combinations (k = 8477 simulations with estimated cost per million headlines < $100)
![[1755835952.png]]
8477 simulations with estimated cost per million headlines < $100
![[1755836223.png]]
## performance for different values of each parameter (k = 1686 simulations with estimated cost per million headlines < $100 and correlation with fact-checker < -0.6)
![[1755836477.png]]
## top 20 parameter combinations
- we'll check whether these results hold in the test set
```r
fc_likert_corr cost_per_m_post max_models disagreement_threshold aggregation_method model_selection_method perf_metric cost_bin
<num> <num> <int> <num> <char> <char> <char> <int>
1: -0.7061135 73.40478 4 0.10 combined worst2_then_best corr_only 2
2: -0.7034021 73.40478 3 0.10 combined worst2_then_best corr_only 2
3: -0.6970813 73.40478 5 0.10 combined worst2_then_best corr_only 2
4: -0.6862089 73.40478 6 0.10 combined worst2_then_best corr_only 2
5: -0.6824805 50.28532 2 0.10 combined worst_to_best corr_only 2
6: -0.6824805 50.28532 2 0.10 combined worst2_then_best corr_only 2
7: -0.6824805 50.28532 2 0.15 combined worst_to_best corr_only 2
8: -0.6824805 50.28532 2 0.15 combined worst2_then_best corr_only 2
9: -0.6824805 50.28532 2 0.20 combined worst_to_best corr_only 2
10: -0.6824805 50.28532 2 0.20 combined worst2_then_best corr_only 2
11: -0.6824805 50.28532 2 0.25 combined worst_to_best corr_only 2
12: -0.6824805 50.28532 2 0.25 combined worst2_then_best corr_only 2
13: -0.6824805 50.28532 2 0.30 combined worst_to_best corr_only 2
14: -0.6824805 50.28532 2 0.30 combined worst2_then_best corr_only 2
15: -0.6824805 50.28532 2 0.35 combined worst_to_best corr_only 2
16: -0.6824805 50.28532 2 0.35 combined worst2_then_best corr_only 2
17: -0.6824805 50.28532 2 0.40 combined worst_to_best corr_only 2
18: -0.6824805 50.28532 2 0.40 combined worst2_then_best corr_only 2
19: -0.6759055 73.40478 7 0.10 combined worst2_then_best corr_only 2
20: -0.6745170 70.30306 7 0.10 simple random icc_only 2
fc_likert_corr cost_per_m_post max_models disagreement_threshold aggregation_method model_selection_method perf_metric cost_bin
```
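for reference, a table like the one above could be pulled with a data.table filter-and-sort along these lines (assuming the full simulation results sit in a data.table named `sims`; the name is hypothetical):

```r
library(data.table)
# cheap runs only, best (most negative) correlations first, top 20
sims[cost_per_m_post < 100][order(fc_likert_corr)][1:20]
```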
## among the top parameter combinations (cost per million headlines < $100 and correlation with fact-checker < -0.65)
we have ==800 parameter combinations== meeting these criteria; the most common values for each parameter are shown below
```r
max_models N
<int> <int>
1: 2 199
2: 8 91
3: 3 90
4: 7 87
5: 9 87
6: 4 84
7: 6 82
8: 5 80
disagreement_threshold N # weird this didn't make any difference
<num> <int>
1: 0.40 118
2: 0.15 117
3: 0.20 114
4: 0.25 114
5: 0.10 113
6: 0.30 112
7: 0.35 112
aggregation_method N
<char> <int>
1: weighted 582
2: combined 135
3: simple 83
model_selection_method N
<char> <int>
1: best_to_worst 420
2: worst2_then_best 189
3: worst_to_best 149
4: random 42
perf_metric N
<char> <int>
1: corr_only 348
2: perf_geometric 150
3: perf_min 150
4: perf_mean 146
5: icc_only 6
cost_bin N
<int> <int>
1: 2 440
2: 1 360
```
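the frequency tables above could be reproduced along these lines (same hypothetical `sims` data.table as in the sketch above):

```r
library(data.table)
# the 800 cheap, high-performing parameter combinations
top <- sims[cost_per_m_post < 100 & fc_likert_corr < -0.65]
for (p in c("max_models", "disagreement_threshold", "aggregation_method",
            "model_selection_method", "perf_metric", "cost_bin")) {
  print(top[, .N, by = p][order(-N)])
}
```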