- ~80k parameter combinations
- ~100 posts from X with @grok and @perplexity (also fact-checked by humans)
- train/test set sizes: 40/60
- maybe the train set is too small? (Jenny's headlines train/test set sizes: 62/145)
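
A rough way to gauge the "train set is too small" worry is to look at how wide the sampling interval of a single correlation is at these sample sizes. The sketch below uses the Fisher z-transform, which assumes roughly bivariate-normal scores. The function name `fisher_ci` and the inputs are illustrative assumptions: r = 0.78 stands in for a top train-set correlation, n = 40 assumes the first number in the split sizes is the train set, and n = 62 assumes the same for Jenny's headlines.

```r
# Approximate confidence interval for a Pearson correlation via the
# Fisher z-transform (hypothetical helper, not part of the analysis code).
fisher_ci <- function(r, n, level = 0.95) {
  stopifnot(abs(r) < 1, n > 3)
  z  <- atanh(r)                      # Fisher z-transform of r
  se <- 1 / sqrt(n - 3)               # approximate standard error of z
  q  <- qnorm(1 - (1 - level) / 2)    # two-sided normal quantile
  tanh(z + c(lower = -1, upper = 1) * q * se)  # back-transform to the r scale
}

# Assumed inputs: r = 0.78, train n = 40 (this data) vs. n = 62 (Jenny's headlines).
ci_small <- fisher_ci(0.78, n = 40)
ci_large <- fisher_ci(0.78, n = 62)
print(round(rbind(n40 = ci_small, n62 = ci_large), 2))
```

This interval only covers sampling noise for one fixed parameter combination. Picking the best of ~80k combinations on the train set adds further optimism on top of it, so the real train-to-test drop can be larger than the interval suggests.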
# across all 80k parameter combinations, corr(factchecker, LLM) is much lower in the test set than in the train set
![[_temp 22.png]]
top 30 parameter combinations in train set (across all 80k parameter combinations)
- param_rank_train: train set ranking
- fc_veracity_corr_train: corr(factchecker, LLM) in train set
- fc_veracity_corr_test: corr(factchecker, LLM) in test set
corr(factchecker, LLM) in the test set tends to be 0.1 to 0.3 smaller than in the train set
```r
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<num> <num> <num> <num> <num> <num> <num> <num> <num> <char> <char> <char>
1: 37710 1 0.796 0.420 4259.263 4042.115 8 3 0.40 combined corr_only random
2: 35310 2 0.790 0.546 3604.872 3005.123 8 3 0.35 weighted corr_only random
3: 37700 3 0.789 0.687 4528.500 4826.891 8 3 0.40 combined perf_min random
4: 28710 4 0.788 0.239 4083.823 4352.782 8 3 0.15 combined corr_only random
5: 35900 5 0.786 -0.116 4240.008 4226.545 8 3 0.35 combined perf_min random
6: 55110 6 0.786 0.629 3078.064 3604.872 8 5 0.20 weighted corr_only random
7: 46700 7 0.786 0.649 5206.403 5274.974 8 4 0.30 combined perf_min random
8: 67690 8 0.785 0.588 3078.064 3167.495 8 6 0.20 weighted perf_mean random
9: 40720 9 0.785 0.398 3078.064 2706.923 8 4 0.15 weighted icc_only random
10: 17910 10 0.783 0.295 3262.244 3293.910 8 2 0.20 combined corr_only random
11: 482 11 0.783 0.647 1786.603 1786.603 8 1 0.10 simple perf_geometric best_to_worst
12: 492 12 0.783 0.647 1786.603 1786.603 8 1 0.10 simple perf_mean best_to_worst
13: 502 13 0.783 0.647 1786.603 1786.603 8 1 0.10 simple perf_min best_to_worst
14: 512 14 0.783 0.647 1786.603 1786.603 8 1 0.10 simple corr_only best_to_worst
15: 1082 15 0.783 0.647 1786.603 1786.603 8 1 0.10 weighted perf_geometric best_to_worst
16: 1092 16 0.783 0.647 1786.603 1786.603 8 1 0.10 weighted perf_mean best_to_worst
17: 1102 17 0.783 0.647 1786.603 1786.603 8 1 0.10 weighted perf_min best_to_worst
18: 1112 18 0.783 0.647 1786.603 1786.603 8 1 0.10 weighted corr_only best_to_worst
19: 1682 19 0.783 0.647 1786.603 1786.603 8 1 0.10 combined perf_geometric best_to_worst
20: 1692 20 0.783 0.647 1786.603 1786.603 8 1 0.10 combined perf_mean best_to_worst
21: 1702 21 0.783 0.647 1786.603 1786.603 8 1 0.10 combined perf_min best_to_worst
22: 1712 22 0.783 0.647 1786.603 1786.603 8 1 0.10 combined corr_only best_to_worst
23: 1730 23 0.783 0.337 1786.603 1231.282 8 1 0.10 combined cost_only random
24: 2282 24 0.783 0.647 1786.603 1786.603 8 1 0.15 simple perf_geometric best_to_worst
25: 2292 25 0.783 0.647 1786.603 1786.603 8 1 0.15 simple perf_mean best_to_worst
26: 2300 26 0.783 0.376 1786.603 1491.865 8 1 0.15 simple perf_min random
27: 2302 27 0.783 0.647 1786.603 1786.603 8 1 0.15 simple perf_min best_to_worst
28: 2312 28 0.783 0.647 1786.603 1786.603 8 1 0.15 simple corr_only best_to_worst
29: 2880 29 0.783 0.480 1786.603 1291.462 8 1 0.15 weighted perf_geometric random
30: 2882 30 0.783 0.647 1786.603 1786.603 8 1 0.15 weighted perf_geometric best_to_worst
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```
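
The train-to-test drop for the table above can be computed directly. This is a minimal sketch that copies only the first 10 rows by hand into a data frame. The names `top10` and `corr_gap` are made up for illustration. Most of rows 11 to 30 share the same 0.783 / 0.647 pair, so they are left out here.

```r
# Train and test correlations copied from the first 10 rows of the table above.
top10 <- data.frame(
  param_idx = c(37710, 35310, 37700, 28710, 35900,
                55110, 46700, 67690, 40720, 17910),
  fc_veracity_corr_train = c(0.796, 0.790, 0.789, 0.788, 0.786,
                             0.786, 0.786, 0.785, 0.785, 0.783),
  fc_veracity_corr_test  = c(0.420, 0.546, 0.687, 0.239, -0.116,
                             0.629, 0.649, 0.588, 0.398, 0.295)
)

# Generalization gap: correlation lost when moving from train to test.
top10$corr_gap <- top10$fc_veracity_corr_train - top10$fc_veracity_corr_test

# Largest drops first, then a quick summary of the gap distribution.
print(top10[order(-top10$corr_gap), ])
print(summary(top10$corr_gap))
```

Sorting by `corr_gap` makes unstable combinations easy to spot, such as param_idx 35900, whose test correlation turns negative.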
# for parameter combinations where cost per million posts is < 100, corr(factchecker, LLM) is also much lower in the test set than in the train set
![[_temp 27.png]]
```r
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<num> <num> <num> <num> <num> <num> <num> <num> <num> <char> <char> <char>
1: 37832 384 0.688 0.444 50.843 50.843 0 4 0.10 simple corr_only best_to_worst
2: 39632 385 0.688 0.444 50.843 50.843 0 4 0.15 simple corr_only best_to_worst
3: 41432 386 0.688 0.444 50.843 50.843 0 4 0.20 simple corr_only best_to_worst
4: 43232 387 0.688 0.444 50.843 50.843 0 4 0.25 simple corr_only best_to_worst
5: 45032 388 0.688 0.444 50.843 50.843 0 4 0.30 simple corr_only best_to_worst
6: 46832 389 0.688 0.444 50.843 50.843 0 4 0.35 simple corr_only best_to_worst
7: 48632 390 0.688 0.444 50.843 50.843 0 4 0.40 simple corr_only best_to_worst
8: 64233 398 0.681 0.394 74.865 73.235 0 6 0.10 combined corr_only worst2_then_best_to_worst
9: 66033 399 0.681 0.394 74.865 73.235 0 6 0.15 combined corr_only worst2_then_best_to_worst
10: 67833 400 0.681 0.394 74.865 73.235 0 6 0.20 combined corr_only worst2_then_best_to_worst
11: 69633 401 0.681 0.394 74.865 73.235 0 6 0.25 combined corr_only worst2_then_best_to_worst
12: 71433 402 0.681 0.394 74.865 73.235 0 6 0.30 combined corr_only worst2_then_best_to_worst
13: 73233 403 0.681 0.394 74.865 73.235 0 6 0.35 combined corr_only worst2_then_best_to_worst
14: 75033 404 0.681 0.394 74.865 73.235 0 6 0.40 combined corr_only worst2_then_best_to_worst
15: 39032 436 0.671 0.427 48.432 46.533 0 4 0.10 combined corr_only best_to_worst
16: 40832 437 0.671 0.427 48.432 46.533 0 4 0.15 combined corr_only best_to_worst
17: 42632 438 0.671 0.427 48.432 46.533 0 4 0.20 combined corr_only best_to_worst
18: 44432 439 0.671 0.427 48.432 46.533 0 4 0.25 combined corr_only best_to_worst
19: 46232 440 0.671 0.427 48.432 46.533 0 4 0.30 combined corr_only best_to_worst
20: 48032 441 0.671 0.427 48.432 46.533 0 4 0.35 combined corr_only best_to_worst
21: 49832 442 0.671 0.427 48.432 46.533 0 4 0.40 combined corr_only best_to_worst
22: 51632 457 0.659 0.447 63.915 60.622 0 5 0.10 combined corr_only best_to_worst
23: 53432 458 0.659 0.447 63.915 60.622 0 5 0.15 combined corr_only best_to_worst
24: 55232 459 0.659 0.447 63.915 60.622 0 5 0.20 combined corr_only best_to_worst
25: 57032 460 0.659 0.447 63.915 60.622 0 5 0.25 combined corr_only best_to_worst
26: 58832 461 0.659 0.447 63.915 60.622 0 5 0.30 combined corr_only best_to_worst
27: 60632 462 0.659 0.447 63.915 60.622 0 5 0.35 combined corr_only best_to_worst
28: 62432 463 0.659 0.447 63.915 60.622 0 5 0.40 combined corr_only best_to_worst
29: 76833 464 0.659 0.409 91.232 89.049 0 7 0.10 combined corr_only worst2_then_best_to_worst
30: 78633 465 0.659 0.409 91.232 89.049 0 7 0.15 combined corr_only worst2_then_best_to_worst
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```