# fact-checker validation data

- jenny dataset: 207 headlines, each fact-checked by 3 fact-checkers
- no. of headlines where all 3 fact-checkers agree: 114
- used jenny's binarized `is_true` classification (original labels in the `mis` column: true [1], misleading [2], false [3], can't tell [4])
- `is_true` is true if `mis` == 1, else false

# llm data

- model: perplexity sonar pro
- continuous rating from 0 to 1 (1 = accurate, 0 = inaccurate)
- each headline was evaluated by the model 3 times, giving 3 batches

# AUC(binarized fact-checker "ground truth", continuous LLM prediction)

- ground truth: binarized fact-checker ratings
- prediction: LLM continuous rating

## 114 headlines where all 3 fact-checkers agree

```r
   is_true     N
     <int> <int>
1:       0    41  # no. of false headlines
2:       1    73  # no. of true headlines
```

- batch 1 auc: 0.932
- batch 2 auc: 0.934
- batch 3 auc: 0.916

## all 207 headlines, where ground truth is the modal fact-checker rating

```r
   is_true     N
     <int> <int>
1:       0   125
2:       1    82
```

- batch 1 auc: 0.84
- batch 2 auc: 0.85
- batch 3 auc: 0.84

## all 207 headlines, separately for each fact-checker

```r
   n_headlines factchecker llm_batch       auc
         <int>      <char>    <char>     <auc>
1:         207          cn         1 0.8111220
2:         207          cn         2 0.8125366
3:         207          cn         3 0.8043902
4:         207          pg         1 0.8408411
5:         207          pg         2 0.8567757
6:         207          pg         3 0.8432710
7:         207          su         1 0.7990446
8:         207          su         2 0.7984713
9:         207          su         3 0.7852866
```

# correlations between each fact-checker's average continuous rating (`avg_mt`) and the LLM's continuous rating

```r
# separately for each fact-checker and LLM batch
   n_headlines factchecker llm_batch pearson_r
         <int>      <char>    <char>     <num>
1:         207          cn         1 0.6064910
2:         207          cn         2 0.6243530
3:         207          cn         3 0.6017426
4:         207          pg         1 0.6164126
5:         207          pg         2 0.6221302
6:         207          pg         3 0.6336609
7:         207          su         1 0.4913052
8:         207          su         2 0.4807078
9:         207          su         3 0.4682475
```

- pooled across raters and batches: cor( mean fact-checker likert rating, mean LLM rating across batches ) = 0.6881195
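
For reference, a minimal sketch (not the original analysis script) of how the per-rater AUCs and correlations above could be computed with `data.table` and `pROC`. The input file name and the column names `llm_rating`, and the long table layout (one row per headline x fact-checker x LLM batch), are assumptions.

```r
# Minimal sketch under assumed column names; not the original script.
library(data.table)
library(pROC)

# hypothetical input: one row per headline x fact-checker x LLM batch,
# with that rater's binarized label `is_true`, their continuous rating
# `avg_mt`, and the LLM's continuous rating `llm_rating`
dt <- fread("headlines_llm_ratings.csv")

# AUC of the continuous LLM rating against the binarized fact-checker label,
# separately for each fact-checker and LLM batch
dt[, .(auc = as.numeric(auc(roc(is_true, llm_rating,
                                levels = c(0, 1), direction = "<",
                                quiet = TRUE)))),
   by = .(factchecker, llm_batch)]

# Pearson correlation between the rater's average continuous rating (avg_mt)
# and the LLM's continuous rating, by fact-checker and LLM batch
dt[, .(pearson_r = cor(avg_mt, llm_rating)), by = .(factchecker, llm_batch)]
```

The consensus- and modal-label AUCs would be the same `roc()`/`auc()` call applied to a headline-level table with a single `is_true` column, grouped by `llm_batch` only.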