250908_143755 train vs test results

# train set (62 headlines) and test set (145 headlines) correlation for top 50 parameter combinations (only for cost per million posts in train set <= 100)

- fc_likert_corr_train: fc_likert LLM correlation in train set
- fc_likert_corr_test: fc_likert LLM correlation in test set
    - ideally, this should correlate with fc_likert_corr_train, and it does!
- cost_per_m_post_train: cost per million posts in train set
- cost_per_m_post_test: cost per million posts in test set
    - ideally, should correlate with cost_per_m_post_train (and it does) to ensure we're keeping costs low
- other columns are parameter values
- final parameters?
    - max_models: 5 (3 or 4 could also work)
    - disagreement threshold: doesn't quite matter, but probably 0.1
    - aggregation method: combined
    - perf_metric: correlation only (don't need model reliability/ICC)
    - cost_bin: second cheapest bin
    - model selection method: worst two models within a cost bin and then best

```r
    fc_likert_corr_train fc_likert_corr_test cost_per_m_post_train cost_per_m_post_test max_models disagreement_threshold aggregation_method    perf_metric model_selection_method                                  model_cost
                   <num>               <num>                 <num>                <num>      <num>                  <num>             <char>         <char>                 <char>                                      <char>
 1:               -0.706              -0.667                73.405               63.990          4                   0.10           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
 2:               -0.703              -0.673                73.405               63.990          3                   0.10           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
 3:               -0.697              -0.673                73.405               63.990          5                   0.10           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
 4:               -0.686              -0.676                73.405               63.990          6                   0.10           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
 5:               -0.682              -0.630                50.285               50.285          2                   0.10           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
 6:               -0.682              -0.630                50.285               50.285          2                   0.10           combined      corr_only          worst_to_best 0.0000129181451612903,0.0000331427419354839
 7:               -0.682              -0.630                50.285               50.285          2                   0.15           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
 8:               -0.682              -0.630                50.285               50.285          2                   0.15           combined      corr_only          worst_to_best 0.0000129181451612903,0.0000331427419354839
 9:               -0.682              -0.630                50.285               50.285          2                   0.20           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
10:               -0.682              -0.630                50.285               50.285          2                   0.20           combined      corr_only          worst_to_best 0.0000129181451612903,0.0000331427419354839
11:               -0.682              -0.630                50.285               50.285          2                   0.25           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
12:               -0.682              -0.630                50.285               50.285          2                   0.25           combined      corr_only          worst_to_best 0.0000129181451612903,0.0000331427419354839
13:               -0.682              -0.630                50.285               50.285          2                   0.30           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
14:               -0.682              -0.630                50.285               50.285          2                   0.30           combined      corr_only          worst_to_best 0.0000129181451612903,0.0000331427419354839
15:               -0.682              -0.630                50.285               50.285          2                   0.35           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
16:               -0.682              -0.630                50.285               50.285          2                   0.35           combined      corr_only          worst_to_best 0.0000129181451612903,0.0000331427419354839
17:               -0.682              -0.630                50.285               50.285          2                   0.40           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
18:               -0.682              -0.630                50.285               50.285          2                   0.40           combined      corr_only          worst_to_best 0.0000129181451612903,0.0000331427419354839
19:               -0.676              -0.678                73.405               63.990          7                   0.10           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
20:               -0.674              -0.671                64.998               60.620          4                   0.15           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
21:               -0.673              -0.473                70.303               79.098          8                   0.25             simple perf_geometric                 random 0.0000129181451612903,0.0000331427419354839
22:               -0.673              -0.676                64.998               60.620          3                   0.15           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
23:               -0.672              -0.675                64.998               60.620          5                   0.15           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
24:               -0.672              -0.674                58.167               55.902          4                   0.25           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
25:               -0.672              -0.677                59.218               56.801          4                   0.20           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
26:               -0.672              -0.674                57.116               55.003          4                   0.30           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
27:               -0.671              -0.447                19.167               15.410          5                   0.20           combined      corr_only                 random   3.18258064516129e-6,0.0000118854838709677
28:               -0.671              -0.521                66.139               48.790          9                   0.30             simple      corr_only                 random 0.0000129181451612903,0.0000331427419354839
29:               -0.671              -0.679                58.167               55.902          3                   0.25           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
30:               -0.671              -0.676                57.116               55.003          5                   0.30           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
31:               -0.671              -0.681                59.218               56.801          3                   0.20           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
32:               -0.671              -0.676                58.167               55.902          5                   0.25           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
33:               -0.670              -0.679                59.218               56.801          5                   0.20           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
34:               -0.670              -0.674                52.913               53.655          4                   0.40           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
35:               -0.670              -0.675                52.913               53.655          5                   0.40           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
36:               -0.670              -0.675                52.913               53.655          6                   0.40           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
37:               -0.670              -0.675                52.913               53.655          7                   0.40           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
38:               -0.670              -0.675                52.913               53.655          8                   0.40           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
39:               -0.670              -0.675                52.913               53.655          9                   0.40           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
40:               -0.669              -0.679                57.116               55.003          3                   0.30           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
41:               -0.669              -0.676                55.540               54.329          5                   0.35           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
42:               -0.669              -0.676                55.540               54.329          6                   0.35           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
43:               -0.669              -0.676                55.540               54.329          7                   0.35           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
44:               -0.669              -0.676                55.540               54.329          8                   0.35           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
45:               -0.669              -0.676                55.540               54.329          9                   0.35           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
46:               -0.669              -0.677                57.116               55.003          6                   0.30           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
47:               -0.669              -0.675                55.540               54.329          4                   0.35           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
48:               -0.668              -0.677                64.998               60.620          6                   0.15           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
49:               -0.668              -0.679                52.913               53.655          3                   0.40           combined      corr_only       worst2_then_best 0.0000129181451612903,0.0000331427419354839
50:               -0.668              -0.520                64.440               65.256          6                   0.15           combined      perf_mean                 random 0.0000129181451612903,0.0000331427419354839
    fc_likert_corr_train fc_likert_corr_test cost_per_m_post_train cost_per_m_post_test max_models disagreement_threshold aggregation_method    perf_metric model_selection_method                                  model_cost
```

# train set (62 headlines) and test set (145 headlines) correlation for all 25200 parameter combinations

```r
Parameter1           |          Parameter2 |    r |       95% CI
----------------------------------------------------------------
fc_likert_corr_train | fc_likert_corr_test | 0.65 | [0.65, 0.66]
```

correlations between fc-likert and LLMs are generally stronger in the test set, probably because it has more data/headlines?
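
A minimal sketch of how the train/test correlation reported above could be computed, assuming the results sit in a data frame with the columns `fc_likert_corr_train` and `fc_likert_corr_test`. The name `results` is hypothetical, and the five rows are copied from the top-50 table (rows 1, 5, 19, 21, 27) purely for illustration; the real check would run on all 25200 combinations. It uses base R `cor.test()` as a stand-in, which is not necessarily the function that produced the output above.

```r
# Illustrative rows copied from the top-50 table (rows 1, 5, 19, 21, 27).
# `results` is a hypothetical name; the full analysis would hold all 25200 rows.
results <- data.frame(
  fc_likert_corr_train = c(-0.706, -0.682, -0.676, -0.673, -0.671),
  fc_likert_corr_test  = c(-0.667, -0.630, -0.678, -0.473, -0.447)
)

# Pearson correlation between train and test performance, with a 95% CI.
ct <- cor.test(results$fc_likert_corr_train, results$fc_likert_corr_test)
train_test_r  <- unname(ct$estimate)
train_test_ci <- as.numeric(ct$conf.int)

# Share of combinations whose correlation is stronger (more negative) in test.
share_stronger_in_test <- mean(
  results$fc_likert_corr_test < results$fc_likert_corr_train
)

print(round(c(r = train_test_r,
              lower = train_test_ci[1],
              upper = train_test_ci[2]), 2))
print(share_stronger_in_test)
```

Five hand-picked rows are far too few to say anything about the full grid; the point is only the shape of the check. The `share_stronger_in_test` line is one way to test the "stronger in the test set" observation directly instead of reading it off the plot.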
![[1757360313.png]]

# train set (62 headlines) and test set (145 headlines) correlation for 4188 parameter combinations where

- cost per million posts in train set <= 200
- fc-likert and LLM correlation in train set < -0.5 (negative is better)

generally good correspondence between train and test set results?

![[1757361453.png]]
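
A rough sketch of the filter described above (train cost <= 200 and train correlation < -0.5), followed by a per-row train-to-test comparison. The data frame name `results`, the threshold variable names, and the `test_minus_train` column are assumptions for illustration; the rows are copied from the top-50 table (rows 1, 5, 21, 27, 50), so all of them pass the filter by construction.

```r
# Hypothetical results table; rows copied from the top-50 table (rows 1, 5, 21, 27, 50).
results <- data.frame(
  fc_likert_corr_train  = c(-0.706, -0.682, -0.673, -0.671, -0.668),
  fc_likert_corr_test   = c(-0.667, -0.630, -0.473, -0.447, -0.520),
  cost_per_m_post_train = c(73.405, 50.285, 70.303, 19.167, 64.440)
)

# Keep combinations that are cheap enough and correlate well in the train set.
max_cost <- 200    # cost per million posts, train set
max_corr <- -0.5   # negative is better, so keep values below this
kept <- subset(
  results,
  cost_per_m_post_train <= max_cost & fc_likert_corr_train < max_corr
)

# Train-to-test change in correlation for each kept combination
# (positive values mean the test correlation is weaker than the train one).
kept$test_minus_train <- kept$fc_likert_corr_test - kept$fc_likert_corr_train
print(kept)
```

On the full grid, sorting `kept` by `test_minus_train` would surface the combinations that generalize worst, such as the `random` model selection rows in the top-50 table.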