- train-test split: 80-20
Using a larger train set (vs. the [[251114_190841 xposts train vs test set|previous analysis with a 40-60 train-test split]]) leads to better generalization to the test/new data. The high train-set correlations seen previously were probably overfitting.
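Rough sketch of how the split and the per-parameter-combination correlations could be computed (not the actual pipeline code; the table `scores` and the columns `post_id`, `fc_veracity`, `llm_veracity` are assumed names, only `param_idx` appears in the output below):
```r
library(data.table)
set.seed(42)

# scores: one row per (post, parameter combination); assumed to already exist
# with columns post_id, param_idx, fc_veracity (fact-checker rating) and
# llm_veracity (aggregated LLM rating) -- names are illustrative only

# 80-20 split at the post level so no post appears in both sets
post_ids  <- unique(scores$post_id)
train_ids <- sample(post_ids, size = floor(0.8 * length(post_ids)))
scores[, split := fifelse(post_id %in% train_ids, "train", "test")]

# corr(factchecker, LLM) per parameter combination and split
param_corrs <- scores[, .(fc_corr = cor(fc_veracity, llm_veracity,
                                        use = "complete.obs")),
                      by = .(param_idx, split)]
param_corrs <- dcast(param_corrs, param_idx ~ split, value.var = "fc_corr")
setnames(param_corrs, c("train", "test"),
         c("fc_veracity_corr_train", "fc_veracity_corr_test"))
```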
# across all 80k parameter combinations, corr(factchecker, LLM) is slightly lower in the test set
Models in cost bin 8 (the most expensive bin) perform well.
```r
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<num> <num> <num> <num> <num> <num> <num> <num> <num> <char> <char> <char>
1: 32300 1 0.722 0.130 4345.552 4164.892 8 3 0.25 combined perf_min random
2: 26912 2 0.722 0.686 4522.380 4428.947 8 3 0.10 combined corr_only best_to_worst
3: 28712 3 0.722 0.686 4522.380 4428.947 8 3 0.15 combined corr_only best_to_worst
4: 30512 4 0.722 0.686 4522.380 4428.947 8 3 0.20 combined corr_only best_to_worst
5: 32312 5 0.722 0.686 4522.380 4428.947 8 3 0.25 combined corr_only best_to_worst
6: 34112 6 0.722 0.686 4522.380 4428.947 8 3 0.30 combined corr_only best_to_worst
7: 35912 7 0.722 0.686 4522.380 4428.947 8 3 0.35 combined corr_only best_to_worst
8: 37712 8 0.722 0.686 4522.380 4428.947 8 3 0.40 combined corr_only best_to_worst
9: 29880 9 0.718 0.624 3022.095 2547.623 8 3 0.20 weighted perf_geometric random
10: 60490 10 0.718 0.395 3022.095 2700.950 8 5 0.35 weighted perf_mean random
11: 64090 11 0.718 0.332 3022.095 2555.959 8 6 0.10 weighted perf_mean random
12: 39512 12 0.717 0.695 5865.903 5664.804 8 4 0.10 combined corr_only best_to_worst
13: 41312 13 0.717 0.695 5865.903 5664.804 8 4 0.15 combined corr_only best_to_worst
14: 43112 14 0.717 0.695 5865.903 5664.804 8 4 0.20 combined corr_only best_to_worst
15: 44912 15 0.717 0.695 5865.903 5664.804 8 4 0.25 combined corr_only best_to_worst
16: 46712 16 0.717 0.695 5865.903 5664.804 8 4 0.30 combined corr_only best_to_worst
17: 48512 17 0.717 0.695 5865.903 5664.804 8 4 0.35 combined corr_only best_to_worst
18: 50312 18 0.717 0.695 5865.903 5664.804 8 4 0.40 combined corr_only best_to_worst
19: 26910 19 0.714 0.456 4364.244 3938.789 8 3 0.10 combined corr_only random
20: 48480 20 0.710 0.783 5874.605 5280.637 8 4 0.35 combined perf_geometric random
21: 52112 21 0.708 0.735 7218.128 6908.666 8 5 0.10 combined corr_only best_to_worst
22: 53912 22 0.708 0.735 7218.128 6908.666 8 5 0.15 combined corr_only best_to_worst
23: 55712 23 0.708 0.735 7218.128 6908.666 8 5 0.20 combined corr_only best_to_worst
24: 57512 24 0.708 0.735 7218.128 6908.666 8 5 0.25 combined corr_only best_to_worst
25: 59312 25 0.708 0.735 7218.128 6908.666 8 5 0.30 combined corr_only best_to_worst
26: 61112 26 0.708 0.735 7218.128 6908.666 8 5 0.35 combined corr_only best_to_worst
27: 62910 27 0.708 0.570 7218.128 6656.007 8 5 0.40 combined corr_only random
28: 62912 28 0.708 0.735 7218.128 6908.666 8 5 0.40 combined corr_only best_to_worst
29: 50310 29 0.707 0.508 5173.269 5202.620 8 4 0.40 combined corr_only random
30: 30480 30 0.706 0.631 4360.325 4097.658 8 3 0.20 combined perf_geometric random
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```
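A sketch of how a view like the one above could be produced from the full sweep, assuming the results sit in a data.table `results` (assumed name) with the columns printed above:
```r
# rank all ~80k parameter combinations by train-set correlation
setorder(results, -fc_veracity_corr_train)
results[, param_rank_train := .I]

# top 30 combinations overall -- all of them land in cost bin 8
results[1:30]

# train correlations run slightly above test correlations across the sweep
results[, .(mean_corr_train = mean(fc_veracity_corr_train, na.rm = TRUE),
            mean_corr_test  = mean(fc_veracity_corr_test,  na.rm = TRUE))]
```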
```python
[{'model': 'openai/gpt-4o-search-preview'}, # best model in bin 8 (search model)
{'model': 'perplexity/sonar-pro'}, # search model
{'model': 'sao10k/l3.1-70b-hanami-x1'},
{'model': 'openai/gpt-4.1'},
{'model': 'google/gemini-2.5-flash-preview-09-2025'},
{'model': 'openai/gpt-5-mini'},
{'model': 'deepcogito/cogito-v2-preview-llama-405b'},
{'model': 'anthracite-org/magnum-v4-72b'},
{'model': 'mistralai/mistral-large'},
{'model': 'mistralai/mistral-large-2407'},
{'model': 'openai/gpt-4o'},
{'model': 'mistralai/pixtral-large-2411'},
{'model': 'openai/gpt-4o-2024-11-20'},
{'model': 'x-ai/grok-code-fast-1'},
{'model': 'qwen/qwen3-vl-30b-a3b-thinking'}] # worst model in bin 8
```
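If `best_to_worst` selection just takes the first `max_models` entries from this ranking (an assumption; nothing in the output confirms it), the max_models = 3 rows above reduce to something like:
```r
# bin-8 models ordered best to worst (first few entries from the list above)
bin8_models <- c("openai/gpt-4o-search-preview",
                 "perplexity/sonar-pro",
                 "sao10k/l3.1-70b-hanami-x1",
                 "openai/gpt-4.1")

# 'best_to_worst' with max_models = 3: take the top three of the ranking
max_models <- 3
head(bin8_models, max_models)
```
If that's right, it would also explain why the best_to_worst rows have identical cost and correlation across disagreement thresholds.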
# for parameter combinations where cost per million posts is < 100, corr(factchecker, LLM) is quite similar in train and test sets
but the correlation is only around .55 in both sets
```r
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<num> <num> <num> <num> <num> <num> <num> <num> <num> <char> <char> <char>
1: 51623 898 0.575 0.544 64.196 70.200 0 5 0.10 combined perf_min worst2_then_best_to_worst
2: 53423 899 0.575 0.544 64.196 70.200 0 5 0.15 combined perf_min worst2_then_best_to_worst
3: 55223 900 0.575 0.544 64.196 70.200 0 5 0.20 combined perf_min worst2_then_best_to_worst
4: 57023 901 0.575 0.544 64.196 70.200 0 5 0.25 combined perf_min worst2_then_best_to_worst
5: 58823 902 0.575 0.544 64.196 70.200 0 5 0.30 combined perf_min worst2_then_best_to_worst
6: 60623 903 0.575 0.544 64.196 70.200 0 5 0.35 combined perf_min worst2_then_best_to_worst
7: 62423 904 0.575 0.544 64.196 70.200 0 5 0.40 combined perf_min worst2_then_best_to_worst
8: 53440 905 0.574 0.566 75.197 70.592 0 5 0.15 combined icc_only random
9: 66020 924 0.572 0.680 77.584 65.255 0 6 0.15 combined perf_min random
10: 40830 941 0.569 0.473 61.083 55.518 0 4 0.15 combined corr_only random
11: 85240 988 0.560 0.448 29.476 33.172 0 7 0.35 weighted icc_only random
12: 54600 1007 0.558 0.555 29.476 20.196 0 5 0.20 weighted perf_geometric random
13: 51603 1009 0.558 0.579 64.196 70.200 0 5 0.10 combined perf_geometric worst2_then_best_to_worst
14: 53403 1010 0.558 0.579 64.196 70.200 0 5 0.15 combined perf_geometric worst2_then_best_to_worst
15: 55203 1011 0.558 0.579 64.196 70.200 0 5 0.20 combined perf_geometric worst2_then_best_to_worst
16: 57003 1012 0.558 0.579 64.196 70.200 0 5 0.25 combined perf_geometric worst2_then_best_to_worst
17: 58803 1013 0.558 0.579 64.196 70.200 0 5 0.30 combined perf_geometric worst2_then_best_to_worst
18: 60603 1014 0.558 0.579 64.196 70.200 0 5 0.35 combined perf_geometric worst2_then_best_to_worst
19: 62403 1015 0.558 0.579 64.196 70.200 0 5 0.40 combined perf_geometric worst2_then_best_to_worst
20: 39023 1027 0.556 0.396 56.036 60.433 0 4 0.10 combined perf_min worst2_then_best_to_worst
21: 40823 1028 0.556 0.396 56.036 60.433 0 4 0.15 combined perf_min worst2_then_best_to_worst
22: 42623 1029 0.556 0.396 56.036 60.433 0 4 0.20 combined perf_min worst2_then_best_to_worst
23: 44423 1030 0.556 0.396 56.036 60.433 0 4 0.25 combined perf_min worst2_then_best_to_worst
24: 46223 1031 0.556 0.396 56.036 60.433 0 4 0.30 combined perf_min worst2_then_best_to_worst
25: 48023 1032 0.556 0.396 56.036 60.433 0 4 0.35 combined perf_min worst2_then_best_to_worst
26: 49823 1033 0.556 0.396 56.036 60.433 0 4 0.40 combined perf_min worst2_then_best_to_worst
27: 51613 1047 0.554 0.589 64.196 70.200 0 5 0.10 combined perf_mean worst2_then_best_to_worst
28: 53413 1048 0.554 0.589 64.196 70.200 0 5 0.15 combined perf_mean worst2_then_best_to_worst
29: 55213 1049 0.554 0.589 64.196 70.200 0 5 0.20 combined perf_mean worst2_then_best_to_worst
30: 57013 1050 0.554 0.589 64.196 70.200 0 5 0.25 combined perf_mean worst2_then_best_to_worst
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```
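The cheap-combination view above is presumably the same ranking filtered by cost; a sketch, under the same `results` assumption:
```r
# combinations costing under $100 per million posts, best train correlation first
cheap <- results[cost_per_m_post_train < 100][order(-fc_veracity_corr_train)]
cheap[1:30, .(param_idx, param_rank_train,
              fc_veracity_corr_train, fc_veracity_corr_test,
              cost_per_m_post_train, cost_per_m_post_test)]
```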
# only when cost per million posts is < 2000 do we see better correlations
basically, we're better off using just a single model: `openai/gpt-4o-search-preview`
```python
[{'model': 'openai/gpt-4o-search-preview'},
{'model': 'perplexity/sonar-pro'},
{'model': 'sao10k/l3.1-70b-hanami-x1'},
{'model': 'openai/gpt-4.1'},
{'model': 'google/gemini-2.5-flash-preview-09-2025'},
{'model': 'openai/gpt-5-mini'},
{'model': 'deepcogito/cogito-v2-preview-llama-405b'},
{'model': 'anthracite-org/magnum-v4-72b'},
{'model': 'mistralai/mistral-large'},
{'model': 'mistralai/mistral-large-2407'},
{'model': 'openai/gpt-4o'},
{'model': 'mistralai/pixtral-large-2411'},
{'model': 'openai/gpt-4o-2024-11-20'},
{'model': 'x-ai/grok-code-fast-1'},
{'model': 'qwen/qwen3-vl-30b-a3b-thinking'}]
```
```r
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
<num> <num> <num> <num> <num> <num> <num> <num> <num> <char> <char> <char>
1: 482 127 0.694 0.714 1760.348 1760.348 8 1 0.10 simple perf_geometric best_to_worst
2: 492 128 0.694 0.714 1760.348 1760.348 8 1 0.10 simple perf_mean best_to_worst
3: 502 129 0.694 0.714 1760.348 1760.348 8 1 0.10 simple perf_min best_to_worst
4: 512 130 0.694 0.714 1760.348 1760.348 8 1 0.10 simple corr_only best_to_worst
5: 1082 131 0.694 0.714 1760.348 1760.348 8 1 0.10 weighted perf_geometric best_to_worst
6: 1092 132 0.694 0.714 1760.348 1760.348 8 1 0.10 weighted perf_mean best_to_worst
7: 1100 133 0.694 0.567 1760.348 1453.949 8 1 0.10 weighted perf_min random
8: 1102 134 0.694 0.714 1760.348 1760.348 8 1 0.10 weighted perf_min best_to_worst
9: 1112 135 0.694 0.714 1760.348 1760.348 8 1 0.10 weighted corr_only best_to_worst
10: 1682 136 0.694 0.714 1760.348 1760.348 8 1 0.10 combined perf_geometric best_to_worst
11: 1690 137 0.694 0.531 1760.348 1348.231 8 1 0.10 combined perf_mean random
12: 1692 138 0.694 0.714 1760.348 1760.348 8 1 0.10 combined perf_mean best_to_worst
13: 1702 139 0.694 0.714 1760.348 1760.348 8 1 0.10 combined perf_min best_to_worst
14: 1712 140 0.694 0.714 1760.348 1760.348 8 1 0.10 combined corr_only best_to_worst
15: 2282 141 0.694 0.714 1760.348 1760.348 8 1 0.15 simple perf_geometric best_to_worst
16: 2292 142 0.694 0.714 1760.348 1760.348 8 1 0.15 simple perf_mean best_to_worst
17: 2302 143 0.694 0.714 1760.348 1760.348 8 1 0.15 simple perf_min best_to_worst
18: 2312 144 0.694 0.714 1760.348 1760.348 8 1 0.15 simple corr_only best_to_worst
19: 2882 145 0.694 0.714 1760.348 1760.348 8 1 0.15 weighted perf_geometric best_to_worst
20: 2892 146 0.694 0.714 1760.348 1760.348 8 1 0.15 weighted perf_mean best_to_worst
21: 2902 147 0.694 0.714 1760.348 1760.348 8 1 0.15 weighted perf_min best_to_worst
22: 2912 148 0.694 0.714 1760.348 1760.348 8 1 0.15 weighted corr_only best_to_worst
23: 3482 149 0.694 0.714 1760.348 1760.348 8 1 0.15 combined perf_geometric best_to_worst
24: 3492 150 0.694 0.714 1760.348 1760.348 8 1 0.15 combined perf_mean best_to_worst
25: 3502 151 0.694 0.714 1760.348 1760.348 8 1 0.15 combined perf_min best_to_worst
26: 3512 152 0.694 0.714 1760.348 1760.348 8 1 0.15 combined corr_only best_to_worst
27: 3530 153 0.694 0.265 1760.348 1371.759 8 1 0.15 combined cost_only random
28: 4082 154 0.694 0.714 1760.348 1760.348 8 1 0.20 simple perf_geometric best_to_worst
29: 4092 155 0.694 0.714 1760.348 1760.348 8 1 0.20 simple perf_mean best_to_worst
30: 4102 156 0.694 0.714 1760.348 1760.348 8 1 0.20 simple perf_min best_to_worst
param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric model_selection_method
```
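And the sub-$2000 view, again under the same `results` assumption; the rows that float to the top are the max_models = 1 configurations, i.e. the single `openai/gpt-4o-search-preview` runs:
```r
# combinations under $2000 per million posts, best train correlation first
results[cost_per_m_post_train < 2000][order(-fc_veracity_corr_train)][1:30]

# the single-model (max_models = 1) configurations on their own:
# ~0.69 train / ~0.71 test correlation at ~$1760 per million posts
unique(results[max_models == 1 & cost_per_m_post_train < 2000,
               .(fc_veracity_corr_train, fc_veracity_corr_test,
                 cost_per_m_post_train)])
```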