# fact-checker validation data
- jenny dataset: 207 headlines, each fact-checked by 3 fact-checkers
- no. of headlines where all 3 fact-checkers agree: 114
- used jenny's `is_true` binarized classification (original labels in the `mis` column: true [1], misleading [2], false [3], can't tell [4])
- `is_true` = 1 if `mis` == 1, else 0 (see the sketch below)
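A minimal sketch of the binarization in data.table. Object and column names (`fc`, `headline_id`) and the toy values are illustrative; only `mis`, the label coding, and the fact-checker codes come from the notes above.

```r
library(data.table)

# toy stand-in for jenny's labels: one row per headline x fact-checker,
# `mis` holds the original label (1 true, 2 misleading, 3 false, 4 can't tell)
fc <- data.table(
  headline_id = rep(1:2, each = 3),
  factchecker = rep(c("cn", "pg", "su"), times = 2),
  mis         = c(1L, 1L, 1L, 3L, 2L, 4L)
)

# binarize: a rating counts as true only when mis == 1
fc[, is_true := as.integer(mis == 1L)]
```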
# LLM data
- model: perplexity sonar pro
- 0 to 1 continuous rating (1 is accurate, 0 is inaccurate)
- each headline was evaluated by the model 3 times, so there are 3 batches of ratings (see the sketch below)
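For concreteness, a hypothetical long-format layout of the LLM ratings, one row per headline x batch, extending the `fc` sketch above. Column names and values are made up; only the 0 to 1 scale and the 3 batches come from the notes.

```r
# toy stand-in for the Sonar Pro output: one row per headline x batch
llm <- data.table(
  headline_id = rep(1:2, each = 3),
  llm_batch   = rep(c("1", "2", "3"), times = 2),
  llm_rating  = c(0.95, 0.90, 0.97, 0.10, 0.20, 0.15)  # 1 = accurate, 0 = inaccurate
)
```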
# AUC(binarized fact-checker "ground truth", continuous LLM prediction)
- ground truth: binarized fact-checker ratings
- prediction: LLM continuous rating (computation sketched below)
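A sketch of the AUC computation with pROC. `auc_one_batch()` is a helper name introduced here, not the actual analysis code; `d` is assumed to hold one batch's ratings merged with the binarized ground truth, using the column names from the sketches above.

```r
library(pROC)

# AUC of the continuous LLM rating against a 0/1 ground truth, for one batch
auc_one_batch <- function(d) {
  r <- roc(
    response  = d$is_true,      # binarized fact-checker ground truth
    predictor = d$llm_rating,   # continuous LLM rating in [0, 1]
    levels    = c(0, 1),        # 0 = false, 1 = true
    direction = "<",            # higher LLM rating should mean "more likely true"
    quiet     = TRUE
  )
  as.numeric(auc(r))
}
```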
## 114 headlines where all 3 fact-checkers agree
```r
   is_true     N
     <int> <int>
1:       0    41   # no. of false headlines
2:       1    73   # no. of true headlines
```
- batch 1 auc: 0.932
- batch 2 auc: 0.934
- batch 3 auc: 0.916
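One way the agreement subset and the per-batch AUCs could be reproduced, building on the `fc`, `llm`, and `auc_one_batch()` sketches above (a sketch on toy objects, not the actual analysis code):

```r
# headlines where all 3 fact-checkers give the same binarized label
agree_ids <- fc[, .(agree = uniqueN(is_true) == 1L), by = headline_id][
  agree == TRUE, headline_id]

# one consensus label per agreeing headline
truth <- fc[headline_id %in% agree_ids, .(is_true = is_true[1]), by = headline_id]

# AUC of each LLM batch against the consensus label
merged <- merge(llm, truth, by = "headline_id")
merged[, .(auc = auc_one_batch(.SD)), by = llm_batch]
```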
## all 207 headlines, where ground truth is the modal fact-checker rating
```r
   is_true     N
     <int> <int>
1:       0   125
2:       1    82
```
- batch 1 auc: 0.84
- batch 2 auc: 0.85
- batch 3 auc: 0.84
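A sketch of the modal ground truth, again building on the toy objects above; with 3 fact-checkers and a binary label, the mode is just the majority vote.

```r
# modal (majority) binarized label: mean > 0.5 means at least 2 of 3 said "true"
modal <- fc[, .(is_true = as.integer(mean(is_true) > 0.5)), by = headline_id]

# AUC of each LLM batch against the modal label, over all headlines
merged_all <- merge(llm, modal, by = "headline_id")
merged_all[, .(auc = auc_one_batch(.SD)), by = llm_batch]
```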
## all 207 headlines, separately for each fact-checker
```r
   n_headlines factchecker llm_batch       auc
         <int>      <char>    <char>     <auc>
1:         207          cn         1 0.8111220
2:         207          cn         2 0.8125366
3:         207          cn         3 0.8043902
4:         207          pg         1 0.8408411
5:         207          pg         2 0.8567757
6:         207          pg         3 0.8432710
7:         207          su         1 0.7990446
8:         207          su         2 0.7984713
9:         207          su         3 0.7852866
```
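A table with this shape could come from one grouped call, keeping each fact-checker's own binarized label as the ground truth (again a sketch on the toy objects above, not the actual code):

```r
# join every batch rating with every fact-checker's label
# (3 batches x 3 fact-checkers = 9 rows per headline, hence allow.cartesian)
per_fc <- merge(llm, fc[, .(headline_id, factchecker, is_true)],
                by = "headline_id", allow.cartesian = TRUE)

# one AUC per fact-checker x LLM batch
per_fc[, .(n_headlines = uniqueN(headline_id),
           auc = auc_one_batch(.SD)),
       by = .(factchecker, llm_batch)]
```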
# correlations between each fact-checker's average continuous rating (`avg_mt`) and the LLM's continuous rating
```r
# separately for each fact-checker and LLM batch
   n_headlines factchecker llm_batch pearson_r
         <int>      <char>    <char>     <num>
1:         207          cn         1 0.6064910
2:         207          cn         2 0.6243530
3:         207          cn         3 0.6017426
4:         207          pg         1 0.6164126
5:         207          pg         2 0.6221302
6:         207          pg         3 0.6336609
7:         207          su         1 0.4913052
8:         207          su         2 0.4807078
9:         207          su         3 0.4682475
# cor( mean(fact-checker likert average), mean(LLM batches) )
0.6881195
```
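A sketch of the correlation block, assuming `avg_mt` lives in a per-headline x fact-checker table (called `fc_cont` here, a hypothetical name and layout) alongside the `llm` sketch above:

```r
# fc_cont (hypothetical layout): one row per headline x fact-checker,
# avg_mt = that fact-checker's average continuous rating for the headline
fc_cont <- data.table(
  headline_id = rep(1:2, each = 3),
  factchecker = rep(c("cn", "pg", "su"), times = 2),
  avg_mt      = c(0.90, 0.80, 0.85, 0.20, 0.30, 0.25)
)

# Pearson r for each fact-checker x LLM batch
cont <- merge(llm, fc_cont, by = "headline_id", allow.cartesian = TRUE)
cont[, .(n_headlines = uniqueN(headline_id),
         pearson_r   = cor(avg_mt, llm_rating)),
     by = .(factchecker, llm_batch)]

# overall: per-headline mean across fact-checkers vs. per-headline mean across batches
overall <- merge(
  fc_cont[, .(fc_mean = mean(avg_mt)), by = headline_id],
  llm[, .(llm_mean = mean(llm_rating)), by = headline_id],
  by = "headline_id"
)
overall[, cor(fc_mean, llm_mean)]
```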