251114_190841 xposts train vs test set

- ~80k parameter combinations
- ~100 posts from X with @grok and @perplexity (and fact-checked by humans)
- train-test set sizes: 40-60
    - maybe train set is too small? (jenny's headlines train-test set sizes: 62-145)

# across all 80k parameter combinations, corr(factchecker, LLM) is much lower in testing set than training set

![[_temp 22.png]]

top 30 parameter combinations in train set (across all 80k parameter combinations)
- param_rank_train: train set ranking
- fc_veracity_corr_train: corr(factchecker, LLM) in train set
- fc_veracity_corr_test: corr(factchecker, LLM) in test set

corr(factchecker, LLM) in test set tends to be 0.1 to 0.3 smaller than in the train set

```r
    param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method    perf_metric model_selection_method
        <num>            <num>                  <num>                 <num>                 <num>                <num>    <num>      <num>                  <num>             <char>         <char>                 <char>
 1:     37710                1                  0.796                 0.420              4259.263             4042.115        8          3                   0.40           combined      corr_only                 random
 2:     35310                2                  0.790                 0.546              3604.872             3005.123        8          3                   0.35           weighted      corr_only                 random
 3:     37700                3                  0.789                 0.687              4528.500             4826.891        8          3                   0.40           combined       perf_min                 random
 4:     28710                4                  0.788                 0.239              4083.823             4352.782        8          3                   0.15           combined      corr_only                 random
 5:     35900                5                  0.786                -0.116              4240.008             4226.545        8          3                   0.35           combined       perf_min                 random
 6:     55110                6                  0.786                 0.629              3078.064             3604.872        8          5                   0.20           weighted      corr_only                 random
 7:     46700                7                  0.786                 0.649              5206.403             5274.974        8          4                   0.30           combined       perf_min                 random
 8:     67690                8                  0.785                 0.588              3078.064             3167.495        8          6                   0.20           weighted      perf_mean                 random
 9:     40720                9                  0.785                 0.398              3078.064             2706.923        8          4                   0.15           weighted       icc_only                 random
10:     17910               10                  0.783                 0.295              3262.244             3293.910        8          2                   0.20           combined      corr_only                 random
11:       482               11                  0.783                 0.647              1786.603             1786.603        8          1                   0.10             simple perf_geometric          best_to_worst
12:       492               12                  0.783                 0.647              1786.603             1786.603        8          1                   0.10             simple      perf_mean          best_to_worst
13:       502               13                  0.783                 0.647              1786.603             1786.603        8          1                   0.10             simple       perf_min          best_to_worst
14:       512               14                  0.783                 0.647              1786.603             1786.603        8          1                   0.10             simple      corr_only          best_to_worst
15:      1082               15                  0.783                 0.647              1786.603             1786.603        8          1                   0.10           weighted perf_geometric          best_to_worst
16:      1092               16                  0.783                 0.647              1786.603             1786.603        8          1                   0.10           weighted      perf_mean          best_to_worst
17:      1102               17                  0.783                 0.647              1786.603             1786.603        8          1                   0.10           weighted       perf_min          best_to_worst
18:      1112               18                  0.783                 0.647              1786.603             1786.603        8          1                   0.10           weighted      corr_only          best_to_worst
19:      1682               19                  0.783                 0.647              1786.603             1786.603        8          1                   0.10           combined perf_geometric          best_to_worst
20:      1692               20                  0.783                 0.647              1786.603             1786.603        8          1                   0.10           combined      perf_mean          best_to_worst
21:      1702               21                  0.783                 0.647              1786.603             1786.603        8          1                   0.10           combined       perf_min          best_to_worst
22:      1712               22                  0.783                 0.647              1786.603             1786.603        8          1                   0.10           combined      corr_only          best_to_worst
23:      1730               23                  0.783                 0.337              1786.603             1231.282        8          1                   0.10           combined      cost_only                 random
24:      2282               24                  0.783                 0.647              1786.603             1786.603        8          1                   0.15             simple perf_geometric          best_to_worst
25:      2292               25                  0.783                 0.647              1786.603             1786.603        8          1                   0.15             simple      perf_mean          best_to_worst
26:      2300               26                  0.783                 0.376              1786.603             1491.865        8          1                   0.15             simple       perf_min                 random
27:      2302               27                  0.783                 0.647              1786.603             1786.603        8          1                   0.15             simple       perf_min          best_to_worst
28:      2312               28                  0.783                 0.647              1786.603             1786.603        8          1                   0.15             simple      corr_only          best_to_worst
29:      2880               29                  0.783                 0.480              1786.603             1291.462        8          1                   0.15           weighted perf_geometric                 random
30:      2882               30                  0.783                 0.647              1786.603             1786.603        8          1                   0.15           weighted perf_geometric          best_to_worst
    param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method    perf_metric model_selection_method
```

# for parameter combinations where cost per million posts is < 100, corr(factchecker, LLM) is also much lower in testing set than training set

![[_temp 27.png]]

```r
    param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric    model_selection_method
        <num>            <num>                  <num>                 <num>                 <num>                <num>    <num>      <num>                  <num>             <char>      <char>                    <char>
 1:     37832              384                  0.688                 0.444                50.843               50.843        0          4                   0.10             simple   corr_only             best_to_worst
 2:     39632              385                  0.688                 0.444                50.843               50.843        0          4                   0.15             simple   corr_only             best_to_worst
 3:     41432              386                  0.688                 0.444                50.843               50.843        0          4                   0.20             simple   corr_only             best_to_worst
 4:     43232              387                  0.688                 0.444                50.843               50.843        0          4                   0.25             simple   corr_only             best_to_worst
 5:     45032              388                  0.688                 0.444                50.843               50.843        0          4                   0.30             simple   corr_only             best_to_worst
 6:     46832              389                  0.688                 0.444                50.843               50.843        0          4                   0.35             simple   corr_only             best_to_worst
 7:     48632              390                  0.688                 0.444                50.843               50.843        0          4                   0.40             simple   corr_only             best_to_worst
 8:     64233              398                  0.681                 0.394                74.865               73.235        0          6                   0.10           combined   corr_only worst2_then_best_to_worst
 9:     66033              399                  0.681                 0.394                74.865               73.235        0          6                   0.15           combined   corr_only worst2_then_best_to_worst
10:     67833              400                  0.681                 0.394                74.865               73.235        0          6                   0.20           combined   corr_only worst2_then_best_to_worst
11:     69633              401                  0.681                 0.394                74.865               73.235        0          6                   0.25           combined   corr_only worst2_then_best_to_worst
12:     71433              402                  0.681                 0.394                74.865               73.235        0          6                   0.30           combined   corr_only worst2_then_best_to_worst
13:     73233              403                  0.681                 0.394                74.865               73.235        0          6                   0.35           combined   corr_only worst2_then_best_to_worst
14:     75033              404                  0.681                 0.394                74.865               73.235        0          6                   0.40           combined   corr_only worst2_then_best_to_worst
15:     39032              436                  0.671                 0.427                48.432               46.533        0          4                   0.10           combined   corr_only             best_to_worst
16:     40832              437                  0.671                 0.427                48.432               46.533        0          4                   0.15           combined   corr_only             best_to_worst
17:     42632              438                  0.671                 0.427                48.432               46.533        0          4                   0.20           combined   corr_only             best_to_worst
18:     44432              439                  0.671                 0.427                48.432               46.533        0          4                   0.25           combined   corr_only             best_to_worst
19:     46232              440                  0.671                 0.427                48.432               46.533        0          4                   0.30           combined   corr_only             best_to_worst
20:     48032              441                  0.671                 0.427                48.432               46.533        0          4                   0.35           combined   corr_only             best_to_worst
21:     49832              442                  0.671                 0.427                48.432               46.533        0          4                   0.40           combined   corr_only             best_to_worst
22:     51632              457                  0.659                 0.447                63.915               60.622        0          5                   0.10           combined   corr_only             best_to_worst
23:     53432              458                  0.659                 0.447                63.915               60.622        0          5                   0.15           combined   corr_only             best_to_worst
24:     55232              459                  0.659                 0.447                63.915               60.622        0          5                   0.20           combined   corr_only             best_to_worst
25:     57032              460                  0.659                 0.447                63.915               60.622        0          5                   0.25           combined   corr_only             best_to_worst
26:     58832              461                  0.659                 0.447                63.915               60.622        0          5                   0.30           combined   corr_only             best_to_worst
27:     60632              462                  0.659                 0.447                63.915               60.622        0          5                   0.35           combined   corr_only             best_to_worst
28:     62432              463                  0.659                 0.447                63.915               60.622        0          5                   0.40           combined   corr_only             best_to_worst
29:     76833              464                  0.659                 0.409                91.232               89.049        0          7                   0.10           combined   corr_only worst2_then_best_to_worst
30:     78633              465                  0.659                 0.409                91.232               89.049        0          7                   0.15           combined   corr_only worst2_then_best_to_worst
    param_idx param_rank_train fc_veracity_corr_train fc_veracity_corr_test cost_per_m_post_train cost_per_m_post_test cost_bin max_models disagreement_threshold aggregation_method perf_metric    model_selection_method
```
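
rough sketch to check the "train set too small?" worry: compute the train-minus-test gap for the 10 best train-ranked rows (values copied by hand from the first table) and an approximate 95% CI for a single correlation at these sample sizes. assumptions, none of which the tables confirm: corr is Pearson, "40-60" means 40 train posts and 60 test posts, and the Fisher z interval is a reasonable approximation. `top10` and `fisher_ci` are hypothetical names made up for this sketch. the interval also ignores that the top rows were picked as the best of ~80k combinations, so it probably understates how optimistic the train correlations are.

```r
# Top 10 train-ranked rows, copied by hand from the first table.
top10 <- data.frame(
  param_idx  = c(37710, 35310, 37700, 28710, 35900, 55110, 46700, 67690, 40720, 17910),
  corr_train = c(0.796, 0.790, 0.789, 0.788, 0.786, 0.786, 0.786, 0.785, 0.785, 0.783),
  corr_test  = c(0.420, 0.546, 0.687, 0.239, -0.116, 0.629, 0.649, 0.588, 0.398, 0.295)
)

# Train-minus-test gap for each parameter combination.
top10$gap <- top10$corr_train - top10$corr_test

# Approximate CI for a Pearson correlation via the Fisher z-transform.
# r: observed correlation, n: number of posts, level: confidence level.
fisher_ci <- function(r, n, level = 0.95) {
  z  <- atanh(r)                     # Fisher z-transform of r
  se <- 1 / sqrt(n - 3)              # approximate standard error on the z scale
  q  <- qnorm(1 - (1 - level) / 2)   # two-sided normal quantile
  tanh(c(lower = z - q * se, upper = z + q * se))
}

# ASSUMPTION: "40-60" is read as 40 train posts and 60 test posts.
n_train <- 40
n_test  <- 60

print(top10)
print(summary(top10$gap))
print(fisher_ci(top10$corr_train[1], n_train))  # CI around the best train correlation
print(fisher_ci(top10$corr_test[1], n_test))    # CI around its test correlation
```

if the two intervals for a row overlap a lot, the gap for that row is hard to tell apart from sampling noise at these sizes; swapping in the real set sizes (or jenny's 62-145) shows how much the intervals tighten.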