results below include only training-set headlines (62 of 207, leaving 145 for testing); test-set results are being generated now and will be correlated with the training-set results

# correlation(LLM ratings, fc likert ratings) ~ estimated cost (for all 25200 simulations)

![[1755835915.png]]

## performance for different values of each parameter (all 25200 simulations)

parameters (smaller y-axis values are better):

- max_models: maximum no. of models to use
- disagreement_threshold: how much disagreement to tolerate across models before terminating (otherwise, keep adding models until max_models is reached)
    - doesn't seem to matter that much
- aggregation_method: how to combine ratings across models
    - simple: simple mean
    - weighted: weighted by perf_metric (see next parameter)
    - combined: weighted by perf_metric and by uncertainty/disagreement across models
- perf_metric: metric for initially evaluating how good a model is
    - corr_only: correlation of an LLM's ratings with fc likert (dave suggestion)
    - icc_only: an LLM's internal reliability across multiple runs (dave suggestion)
    - perf_geometric: geometric mean of corr and icc
    - perf_mean: simple mean of corr and icc
    - perf_min: min(corr, icc)
- model_selection_method: the order in which models are added
    - best_to_worst: start from the best model (by perf_metric) and add the next-worse model each step
    - random: randomly select a model from the remaining pool of models
    - worst_to_best: start from the worst model and add the next-better model each step
    - worst2_then_best: start with the 2 worst models, then add the best remaining models (dave suggestion)
- cost_bin: a smaller bin index means cheaper models

![[1755835835.png]]

## focusing on the cheapest parameter combinations (i.e., k = 8477 simulations where estimated cost per million headlines is < $100)

![[1755835952.png]]

8477 simulations with estimated cost per million headlines < $100

![[1755836223.png]]

# performance for different values of each parameter (k = 1686 simulations with estimated cost per million headlines < $100 & correlation with fact-checker < -0.6)

![[1755836477.png]]
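the simulated pipeline the parameters describe can be sketched roughly as below. this is an illustrative reconstruction, not the actual simulation code: every function name is made up, and the exact "combined" weighting is an assumption (here, perf_metric weights further down-weighted by each model's distance from the ensemble median).

```python
# illustrative sketch of one simulated configuration; names and the
# "combined" weighting choice are assumptions, not the real simulation code
import random
import statistics


def perf_score(corr, icc, metric):
    """the five perf_metric variants: corr with fc likert, internal
    reliability (icc), or a combination of the two"""
    if metric == "corr_only":
        return corr
    if metric == "icc_only":
        return icc
    if metric == "perf_geometric":
        return (corr * icc) ** 0.5
    if metric == "perf_mean":
        return (corr + icc) / 2
    if metric == "perf_min":
        return min(corr, icc)
    raise ValueError(f"unknown perf_metric: {metric}")


def rate_headline(models, max_models, disagreement_threshold,
                  aggregation_method, model_selection_method):
    """models: list of dicts with 'rating' (for this headline) and 'perf'
    (the model's precomputed perf_metric score)"""
    by_perf = sorted(models, key=lambda m: m["perf"], reverse=True)
    if model_selection_method == "best_to_worst":
        order = by_perf
    elif model_selection_method == "worst_to_best":
        order = by_perf[::-1]
    elif model_selection_method == "worst2_then_best":
        order = by_perf[-2:][::-1] + by_perf[:-2]  # 2 worst first, then best
    else:  # "random"
        order = random.sample(models, len(models))

    # keep adding models until disagreement is tolerable or max_models is hit
    used = []
    for m in order[:max_models]:
        used.append(m)
        ratings = [u["rating"] for u in used]
        if len(used) >= 2 and statistics.pstdev(ratings) <= disagreement_threshold:
            break

    ratings = [u["rating"] for u in used]
    if aggregation_method == "simple":
        return statistics.mean(ratings)
    weights = [u["perf"] for u in used]  # "weighted": weight by perf_metric
    if aggregation_method == "combined":
        # also down-weight models that disagree with the ensemble median
        # (one plausible reading of "perf_metric and disagreement")
        med = statistics.median(ratings)
        weights = [w / (1.0 + abs(r - med)) for w, r in zip(weights, ratings)]
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)
```

each simulation in the grid is one (max_models, disagreement_threshold, aggregation_method, model_selection_method, perf_metric) combination run over all training-set headlines, then scored by its correlation with the fact-checker likert ratings and its estimated cost.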
## top 20 parameter combinations

will see if these results hold in the test set

```r
    fc_likert_corr cost_per_m_post max_models disagreement_threshold aggregation_method model_selection_method perf_metric cost_bin
             <num>           <num>      <int>                  <num>             <char>                 <char>      <char>    <int>
 1:     -0.7061135        73.40478          4                   0.10           combined       worst2_then_best   corr_only        2
 2:     -0.7034021        73.40478          3                   0.10           combined       worst2_then_best   corr_only        2
 3:     -0.6970813        73.40478          5                   0.10           combined       worst2_then_best   corr_only        2
 4:     -0.6862089        73.40478          6                   0.10           combined       worst2_then_best   corr_only        2
 5:     -0.6824805        50.28532          2                   0.10           combined          worst_to_best   corr_only        2
 6:     -0.6824805        50.28532          2                   0.10           combined       worst2_then_best   corr_only        2
 7:     -0.6824805        50.28532          2                   0.15           combined          worst_to_best   corr_only        2
 8:     -0.6824805        50.28532          2                   0.15           combined       worst2_then_best   corr_only        2
 9:     -0.6824805        50.28532          2                   0.20           combined          worst_to_best   corr_only        2
10:     -0.6824805        50.28532          2                   0.20           combined       worst2_then_best   corr_only        2
11:     -0.6824805        50.28532          2                   0.25           combined          worst_to_best   corr_only        2
12:     -0.6824805        50.28532          2                   0.25           combined       worst2_then_best   corr_only        2
13:     -0.6824805        50.28532          2                   0.30           combined          worst_to_best   corr_only        2
14:     -0.6824805        50.28532          2                   0.30           combined       worst2_then_best   corr_only        2
15:     -0.6824805        50.28532          2                   0.35           combined          worst_to_best   corr_only        2
16:     -0.6824805        50.28532          2                   0.35           combined       worst2_then_best   corr_only        2
17:     -0.6824805        50.28532          2                   0.40           combined          worst_to_best   corr_only        2
18:     -0.6824805        50.28532          2                   0.40           combined       worst2_then_best   corr_only        2
19:     -0.6759055        73.40478          7                   0.10           combined       worst2_then_best   corr_only        2
20:     -0.6745170        70.30306          7                   0.10             simple                 random    icc_only        2
    fc_likert_corr cost_per_m_post max_models disagreement_threshold aggregation_method model_selection_method perf_metric cost_bin
```

## among the top models where cost per million headlines is < $100 and correlation with fact-checker is < -0.65, we have ==800 parameter combinations==

the most common values for each parameter are shown below

```r
   max_models     N
        <int> <int>
1:          2   199
2:          8    91
3:          3    90
4:          7    87
5:          9    87
6:          4    84
7:          6    82
8:          5    80

   disagreement_threshold     N    # weird this didn't make any difference
                    <num> <int>
1:                   0.40   118
2:                   0.15   117
3:                   0.20   114
4:                   0.25   114
5:                   0.10   113
6:                   0.30   112
7:                   0.35   112

   aggregation_method     N
               <char> <int>
1:           weighted   582
2:           combined   135
3:             simple    83

   model_selection_method     N
                   <char> <int>
1:          best_to_worst   420
2:       worst2_then_best   189
3:          worst_to_best   149
4:                 random    42

      perf_metric     N
           <char> <int>
1:      corr_only   348
2: perf_geometric   150
3:       perf_min   150
4:      perf_mean   146
5:       icc_only     6

   cost_bin     N
      <int> <int>
1:        2   440
2:        1   360
```
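the "most common values" tables are plain frequency counts over the 800 surviving parameter combinations. with the survivors held as a list of dicts, the same tallies fall out of `collections.Counter`; the rows below are made-up stand-ins, not the real 800:

```python
from collections import Counter

# hypothetical stand-ins for the 800 surviving parameter combinations
survivors = [
    {"aggregation_method": "weighted", "cost_bin": 2},
    {"aggregation_method": "weighted", "cost_bin": 1},
    {"aggregation_method": "combined", "cost_bin": 2},
]

for param in ("aggregation_method", "cost_bin"):
    counts = Counter(row[param] for row in survivors)
    # most_common() orders by descending N, like the tables above
    for value, n in counts.most_common():
        print(f"{param}: {value} N={n}")
```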