- train-test split: 80-20
Using a larger train set (vs. the [[251114_190841 xposts train vs test set|previous analysis with a 40-60 train-test split]]) leads to better generalization to the test/new data. The high train-set correlations seen previously were probably overfitting.
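Rough sketch of how the split and the per-parameter-combination correlations could be computed (not the actual pipeline code; the table `scores` and the columns `post_id`, `fc_veracity`, `llm_veracity` are assumed names, only `param_idx` appears in the output below):
```r
library(data.table)
set.seed(42)

# scores: one row per (post, parameter combination); assumed to already exist
# with columns post_id, param_idx, fc_veracity (fact-checker rating) and
# llm_veracity (aggregated LLM rating) -- names are illustrative only

# 80-20 split at the post level so no post appears in both sets
post_ids  <- unique(scores$post_id)
train_ids <- sample(post_ids, size = floor(0.8 * length(post_ids)))
scores[, split := fifelse(post_id %in% train_ids, "train", "test")]

# corr(factchecker, LLM) per parameter combination and split
param_corrs <- scores[, .(fc_corr = cor(fc_veracity, llm_veracity,
                                        use = "complete.obs")),
                      by = .(param_idx, split)]
param_corrs <- dcast(param_corrs, param_idx ~ split, value.var = "fc_corr")
setnames(param_corrs, c("train", "test"),
         c("fc_veracity_corr_train", "fc_veracity_corr_test"))
```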
# across all 80k parameter combinations, corr(factchecker, LLM) is slightly lower in the test set
Models in cost bin 8 (the most expensive bin) perform well.
```r
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<num> <num> <num> <num> <num> <num> <num> <num> <num> <char> <char> <char>
1: 32300 1 0.722 0.130 4345.552 4164.892 8 3 0.25 combined perf_min random
2: 26912 2 0.722 0.686 4522.380 4428.947 8 3 0.10 combined corr_only best_to_worst
3: 28712 3 0.722 0.686 4522.380 4428.947 8 3 0.15 combined corr_only best_to_worst
4: 30512 4 0.722 0.686 4522.380 4428.947 8 3 0.20 combined corr_only best_to_worst
5: 32312 5 0.722 0.686 4522.380 4428.947 8 3 0.25 combined corr_only best_to_worst
6: 34112 6 0.722 0.686 4522.380 4428.947 8 3 0.30 combined corr_only best_to_worst
7: 35912 7 0.722 0.686 4522.380 4428.947 8 3 0.35 combined corr_only best_to_worst
8: 37712 8 0.722 0.686 4522.380 4428.947 8 3 0.40 combined corr_only best_to_worst
9: 29880 9 0.718 0.624 3022.095 2547.623 8 3 0.20 weighted perf_geometric random
10: 60490 10 0.718 0.395 3022.095 2700.950 8 5 0.35 weighted perf_mean random
11: 64090 11 0.718 0.332 3022.095 2555.959 8 6 0.10 weighted perf_mean random
12: 39512 12 0.717 0.695 5865.903 5664.804 8 4 0.10 combined corr_only best_to_worst
13: 41312 13 0.717 0.695 5865.903 5664.804 8 4 0.15 combined corr_only best_to_worst
14: 43112 14 0.717 0.695 5865.903 5664.804 8 4 0.20 combined corr_only best_to_worst
15: 44912 15 0.717 0.695 5865.903 5664.804 8 4 0.25 combined corr_only best_to_worst
16: 46712 16 0.717 0.695 5865.903 5664.804 8 4 0.30 combined corr_only best_to_worst
17: 48512 17 0.717 0.695 5865.903 5664.804 8 4 0.35 combined corr_only best_to_worst
18: 50312 18 0.717 0.695 5865.903 5664.804 8 4 0.40 combined corr_only best_to_worst
19: 26910 19 0.714 0.456 4364.244 3938.789 8 3 0.10 combined corr_only random
20: 48480 20 0.710 0.783 5874.605 5280.637 8 4 0.35 combined perf_geometric random
21: 52112 21 0.708 0.735 7218.128 6908.666 8 5 0.10 combined corr_only best_to_worst
22: 53912 22 0.708 0.735 7218.128 6908.666 8 5 0.15 combined corr_only best_to_worst
23: 55712 23 0.708 0.735 7218.128 6908.666 8 5 0.20 combined corr_only best_to_worst
24: 57512 24 0.708 0.735 7218.128 6908.666 8 5 0.25 combined corr_only best_to_worst
25: 59312 25 0.708 0.735 7218.128 6908.666 8 5 0.30 combined corr_only best_to_worst
26: 61112 26 0.708 0.735 7218.128 6908.666 8 5 0.35 combined corr_only best_to_worst
27: 62910 27 0.708 0.570 7218.128 6656.007 8 5 0.40 combined corr_only random
28: 62912 28 0.708 0.735 7218.128 6908.666 8 5 0.40 combined corr_only best_to_worst
29: 50310 29 0.707 0.508 5173.269 5202.620 8 4 0.40 combined corr_only random
30: 30480 30 0.706 0.631 4360.325 4097.658 8 3 0.20 combined perf_geometric random
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```
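A sketch of how a view like the one above could be produced from the full sweep, assuming the results sit in a data.table `results` (assumed name) with the columns printed above:
```r
# rank all ~80k parameter combinations by train-set correlation
setorder(results, -fc_veracity_corr_train)
results[, param_rank_train := .I]

# top 30 combinations overall -- all of them land in cost bin 8
results[1:30]

# train correlations run slightly above test correlations across the sweep
results[, .(mean_corr_train = mean(fc_veracity_corr_train, na.rm = TRUE),
            mean_corr_test  = mean(fc_veracity_corr_test,  na.rm = TRUE))]
```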
```python
[{'model': 'openai/gpt-4o-search-preview'}, # best model in bin 8 (search model)
{'model': 'perplexity/sonar-pro'}, # search model
{'model': 'sao10k/l3.1-70b-hanami-x1'},
{'model': 'openai/gpt-4.1'},
{'model': 'google/gemini-2.5-flash-preview-09-2025'},
{'model': 'openai/gpt-5-mini'},
{'model': 'deepcogito/cogito-v2-preview-llama-405b'},
{'model': 'anthracite-org/magnum-v4-72b'},
{'model': 'mistralai/mistral-large'},
{'model': 'mistralai/mistral-large-2407'},
{'model': 'openai/gpt-4o'},
{'model': 'mistralai/pixtral-large-2411'},
{'model': 'openai/gpt-4o-2024-11-20'},
{'model': 'x-ai/grok-code-fast-1'},
{'model': 'qwen/qwen3-vl-30b-a3b-thinking'}] # worst model in bin 8
```
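If `best_to_worst` selection just takes the first `max_models` entries from this ranking (an assumption; nothing in the output confirms it), the max_models = 3 rows above reduce to something like:
```r
# bin-8 models ordered best to worst (first few entries from the list above)
bin8_models <- c("openai/gpt-4o-search-preview",
                 "perplexity/sonar-pro",
                 "sao10k/l3.1-70b-hanami-x1",
                 "openai/gpt-4.1")

# 'best_to_worst' with max_models = 3: take the top three of the ranking
max_models <- 3
head(bin8_models, max_models)
```
If that's right, it would also explain why the best_to_worst rows have identical cost and correlation across disagreement thresholds.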
# for parameter combinations where cost per million posts is < 100, corr(factchecker, LLM) is quite similar in train and test sets
but the correlation is only around .55 in both sets
```r
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<num> <num> <num> <num> <num> <num> <num> <num> <num> <char> <char> <char>
1: 51623 898 0.575 0.544 64.196 70.200 0 5 0.10 combined perf_min worst2_then_best_to_worst
2: 53423 899 0.575 0.544 64.196 70.200 0 5 0.15 combined perf_min worst2_then_best_to_worst
3: 55223 900 0.575 0.544 64.196 70.200 0 5 0.20 combined perf_min worst2_then_best_to_worst
4: 57023 901 0.575 0.544 64.196 70.200 0 5 0.25 combined perf_min worst2_then_best_to_worst
5: 58823 902 0.575 0.544 64.196 70.200 0 5 0.30 combined perf_min worst2_then_best_to_worst
6: 60623 903 0.575 0.544 64.196 70.200 0 5 0.35 combined perf_min worst2_then_best_to_worst
7: 62423 904 0.575 0.544 64.196 70.200 0 5 0.40 combined perf_min worst2_then_best_to_worst
8: 53440 905 0.574 0.566 75.197 70.592 0 5 0.15 combined icc_only random
9: 66020 924 0.572 0.680 77.584 65.255 0 6 0.15 combined perf_min random
10: 40830 941 0.569 0.473 61.083 55.518 0 4 0.15 combined corr_only random
11: 85240 988 0.560 0.448 29.476 33.172 0 7 0.35 weighted icc_only random
12: 54600 1007 0.558 0.555 29.476 20.196 0 5 0.20 weighted perf_geometric random
13: 51603 1009 0.558 0.579 64.196 70.200 0 5 0.10 combined perf_geometric worst2_then_best_to_worst
14: 53403 1010 0.558 0.579 64.196 70.200 0 5 0.15 combined perf_geometric worst2_then_best_to_worst
15: 55203 1011 0.558 0.579 64.196 70.200 0 5 0.20 combined perf_geometric worst2_then_best_to_worst
16: 57003 1012 0.558 0.579 64.196 70.200 0 5 0.25 combined perf_geometric worst2_then_best_to_worst
17: 58803 1013 0.558 0.579 64.196 70.200 0 5 0.30 combined perf_geometric worst2_then_best_to_worst
18: 60603 1014 0.558 0.579 64.196 70.200 0 5 0.35 combined perf_geometric worst2_then_best_to_worst
19: 62403 1015 0.558 0.579 64.196 70.200 0 5 0.40 combined perf_geometric worst2_then_best_to_worst
20: 39023 1027 0.556 0.396 56.036 60.433 0 4 0.10 combined perf_min worst2_then_best_to_worst
21: 40823 1028 0.556 0.396 56.036 60.433 0 4 0.15 combined perf_min worst2_then_best_to_worst
22: 42623 1029 0.556 0.396 56.036 60.433 0 4 0.20 combined perf_min worst2_then_best_to_worst
23: 44423 1030 0.556 0.396 56.036 60.433 0 4 0.25 combined perf_min worst2_then_best_to_worst
24: 46223 1031 0.556 0.396 56.036 60.433 0 4 0.30 combined perf_min worst2_then_best_to_worst
25: 48023 1032 0.556 0.396 56.036 60.433 0 4 0.35 combined perf_min worst2_then_best_to_worst
26: 49823 1033 0.556 0.396 56.036 60.433 0 4 0.40 combined perf_min worst2_then_best_to_worst
27: 51613 1047 0.554 0.589 64.196 70.200 0 5 0.10 combined perf_mean worst2_then_best_to_worst
28: 53413 1048 0.554 0.589 64.196 70.200 0 5 0.15 combined perf_mean worst2_then_best_to_worst
29: 55213 1049 0.554 0.589 64.196 70.200 0 5 0.20 combined perf_mean worst2_then_best_to_worst
30: 57013 1050 0.554 0.589 64.196 70.200 0 5 0.25 combined perf_mean worst2_then_best_to_worst
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```
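The cheap-combination view above is presumably the same ranking filtered by cost; a sketch, under the same `results` assumption:
```r
# combinations costing under $100 per million posts, best train correlation first
cheap <- results[cost_per_m_post_train < 100][order(-fc_veracity_corr_train)]
cheap[1:30, .(param_idx, param_rank_train,
              fc_veracity_corr_train, fc_veracity_corr_test,
              cost_per_m_post_train, cost_per_m_post_test)]
```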
# only when cost per million posts is < 2000 do we see better correlations
basically, we're better off using just a single model: `openai/gpt-4o-search-preview`
```python
[{'model': 'openai/gpt-4o-search-preview'},
{'model': 'perplexity/sonar-pro'},
{'model': 'sao10k/l3.1-70b-hanami-x1'},
{'model': 'openai/gpt-4.1'},
{'model': 'google/gemini-2.5-flash-preview-09-2025'},
{'model': 'openai/gpt-5-mini'},
{'model': 'deepcogito/cogito-v2-preview-llama-405b'},
{'model': 'anthracite-org/magnum-v4-72b'},
{'model': 'mistralai/mistral-large'},
{'model': 'mistralai/mistral-large-2407'},
{'model': 'openai/gpt-4o'},
{'model': 'mistralai/pixtral-large-2411'},
{'model': 'openai/gpt-4o-2024-11-20'},
{'model': 'x-ai/grok-code-fast-1'},
{'model': 'qwen/qwen3-vl-30b-a3b-thinking'}]
```
```r
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<num> <num> <num> <num> <num> <num> <num> <num> <num> <char> <char> <char>
1: 482 127 0.694 0.714 1760.348 1760.348 8 1 0.10 simple perf_geometric best_to_worst
2: 492 128 0.694 0.714 1760.348 1760.348 8 1 0.10 simple perf_mean best_to_worst
3: 502 129 0.694 0.714 1760.348 1760.348 8 1 0.10 simple perf_min best_to_worst
4: 512 130 0.694 0.714 1760.348 1760.348 8 1 0.10 simple corr_only best_to_worst
5: 1082 131 0.694 0.714 1760.348 1760.348 8 1 0.10 weighted perf_geometric best_to_worst
6: 1092 132 0.694 0.714 1760.348 1760.348 8 1 0.10 weighted perf_mean best_to_worst
7: 1100 133 0.694 0.567 1760.348 1453.949 8 1 0.10 weighted perf_min random
8: 1102 134 0.694 0.714 1760.348 1760.348 8 1 0.10 weighted perf_min best_to_worst
9: 1112 135 0.694 0.714 1760.348 1760.348 8 1 0.10 weighted corr_only best_to_worst
10: 1682 136 0.694 0.714 1760.348 1760.348 8 1 0.10 combined perf_geometric best_to_worst
11: 1690 137 0.694 0.531 1760.348 1348.231 8 1 0.10 combined perf_mean random
12: 1692 138 0.694 0.714 1760.348 1760.348 8 1 0.10 combined perf_mean best_to_worst
13: 1702 139 0.694 0.714 1760.348 1760.348 8 1 0.10 combined perf_min best_to_worst
14: 1712 140 0.694 0.714 1760.348 1760.348 8 1 0.10 combined corr_only best_to_worst
15: 2282 141 0.694 0.714 1760.348 1760.348 8 1 0.15 simple perf_geometric best_to_worst
16: 2292 142 0.694 0.714 1760.348 1760.348 8 1 0.15 simple perf_mean best_to_worst
17: 2302 143 0.694 0.714 1760.348 1760.348 8 1 0.15 simple perf_min best_to_worst
18: 2312 144 0.694 0.714 1760.348 1760.348 8 1 0.15 simple corr_only best_to_worst
19: 2882 145 0.694 0.714 1760.348 1760.348 8 1 0.15 weighted perf_geometric best_to_worst
20: 2892 146 0.694 0.714 1760.348 1760.348 8 1 0.15 weighted perf_mean best_to_worst
21: 2902 147 0.694 0.714 1760.348 1760.348 8 1 0.15 weighted perf_min best_to_worst
22: 2912 148 0.694 0.714 1760.348 1760.348 8 1 0.15 weighted corr_only best_to_worst
23: 3482 149 0.694 0.714 1760.348 1760.348 8 1 0.15 combined perf_geometric best_to_worst
24: 3492 150 0.694 0.714 1760.348 1760.348 8 1 0.15 combined perf_mean best_to_worst
25: 3502 151 0.694 0.714 1760.348 1760.348 8 1 0.15 combined perf_min best_to_worst
26: 3512 152 0.694 0.714 1760.348 1760.348 8 1 0.15 combined corr_only best_to_worst
27: 3530 153 0.694 0.265 1760.348 1371.759 8 1 0.15 combined cost_only random
28: 4082 154 0.694 0.714 1760.348 1760.348 8 1 0.20 simple perf_geometric best_to_worst
29: 4092 155 0.694 0.714 1760.348 1760.348 8 1 0.20 simple perf_mean best_to_worst
30: 4102 156 0.694 0.714 1760.348 1760.348 8 1 0.20 simple perf_min best_to_worst
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```
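And the sub-$2000 view, again under the same `results` assumption; the rows that float to the top are the max_models = 1 configurations, i.e. the single `openai/gpt-4o-search-preview` runs:
```r
# combinations under $2000 per million posts, best train correlation first
results[cost_per_m_post_train < 2000][order(-fc_veracity_corr_train)][1:30]

# the single-model (max_models = 1) configurations on their own:
# ~0.69 train / ~0.71 test correlation at ~$1760 per million posts
unique(results[max_models == 1 & cost_per_m_post_train < 2000,
               .(fc_veracity_corr_train, fc_veracity_corr_test,
                 cost_per_m_post_train)])
```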