focused on 3 datasets for now (can get expensive if we want to use more datasets and/or more models! see total cost below)
- datasets: aslett, allen2021, allenvaccine
- for each, a 70%-30% train-test split; looking at just the train set for now
- 353 pieces of content in total
```python
unique content per data source: shape: (3, 2)
┌──────────────┬───────────┐
│ data_source ┆ n_content │
│ --- ┆ --- │
│ str ┆ u32 │
╞══════════════╪═══════════╡
│ aslett ┆ 147 │
│ allen2021 ┆ 144 │
│ allenvaccine ┆ 62 │
└──────────────┴───────────┘
```
selected 12 models, chosen based on vibes, i.e. what i thought were the leading models
cost per headline per model
- openrouter/perplexity/sonar-deep-research: $1.79642 (had to exclude this because it's too expensive - would cost $400 to fact-check content above)
- openrouter/openai/o1-pro: $0.44696
- openrouter/openai/gpt-5-pro: $0.21595
- openrouter/openai/gpt-5.2-pro: $0.05501
- openrouter/openai/o1: $0.04249
- openrouter/x-ai/grok-4: $0.01708
- openrouter/perplexity/sonar-reasoning-pro: $0.01219
- openrouter/anthropic/claude-opus-4.1: $0.01135
- openrouter/google/gemini-3-pro-preview: $0.01047
- openrouter/openai/gpt-5: $0.00793
- openrouter/perplexity/sonar-pro-search: $0.00198
- openrouter/perplexity/sonar-pro: $0.00197
- openrouter/openai/gpt-4o-search-preview: $0.00145
descriptives
- total LLM classifications: 4,524
- **total_cost: $310.96**
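as a sanity check on the total: the per-headline costs above times the 353 headlines give a ballpark figure. (4,524 classifications is slightly more than 353 × 12 = 4,236, which, along with per-call cost variance, accounts for the gap to the observed $310.96.)

```python
# per-headline cost for the 12 models used (from the list above,
# excluding sonar-deep-research) times the 353 train headlines
cost_per_headline = {
    "o1-pro": 0.44696,
    "gpt-5-pro": 0.21595,
    "gpt-5.2-pro": 0.05501,
    "o1": 0.04249,
    "grok-4": 0.01708,
    "sonar-reasoning-pro": 0.01219,
    "claude-opus-4.1": 0.01135,
    "gemini-3-pro-preview": 0.01047,
    "gpt-5": 0.00793,
    "sonar-pro-search": 0.00198,
    "sonar-pro": 0.00197,
    "gpt-4o-search-preview": 0.00145,
}
total = 353 * sum(cost_per_headline.values())
print(f"${total:,.2f}")  # → $291.16, vs. the observed $310.96
```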
# analyses
LLM system prompt: classify content as `true`, `false/misleading`, or `can't tell`
```xml
<role>
You're politically neutral & discerning, and you excel at determining whether some content might be false/misleading, such that reading it would cause people to have less accurate beliefs.
</role>
<output_format>
Evaluate the content and output one of the following numbers:
- true (1)
- false/misleading (0)
- can't tell (-1)
Briefly explain your reasoning in 1-2 sentences.
</output_format>
<task>
Determine whether the content is true (1), false/misleading (0), or can't tell (-1).
</task>
```
## variable recoding for each dataset
to map each dataset's categories/labels to our 3 categories
- allen2021 (variable: modal fact-checker 'is true' rating; 3 professional fact-checkers)
- 1 -> true
- 0 -> false/misleading
- aslett (variable: modal_fact_checker_rating)
- true -> true
- false/misleading -> false/misleading
- no mode -> can't tell
- could not determine -> can't tell
- allenvaccine (variable: fc_modal; 2 professional fact-checkers)
- 1.0 -> false/misleading
- 0.5 -> can't tell
- 0.0 -> true
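the recodings above expressed as plain dicts (a sketch; the `recode` helper is hypothetical, not the actual pipeline code):

```python
# per-dataset maps from the raw fact-checker variable to our 3 categories
RECODE = {
    "allen2021": {1: "true", 0: "false/misleading"},
    "aslett": {
        "true": "true",
        "false/misleading": "false/misleading",
        "no mode": "can't tell",
        "could not determine": "can't tell",
    },
    # note the reversed scale: 1.0 means false/misleading here
    "allenvaccine": {1.0: "false/misleading", 0.5: "can't tell", 0.0: "true"},
}

def recode(data_source: str, raw) -> str:
    """Map one raw fact-checker rating to true / false/misleading / can't tell."""
    return RECODE[data_source][raw]
```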
## results
- figure rows: 3 datasets
- figure columns: whether humans/fact-checkers rate content as `false/misleading` (left), `true` (middle), or `can't tell` (right)
- k = no. of headlines/posts in that category
- x (llm human agree): proportion of LLM classifications that agree with human ratings
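the `llm_human_agree` metric can be sketched in plain python (toy rows; the tuple fields mirror the columns in the data table below):

```python
from collections import defaultdict

# (model, data_source, human label, llm label) -- toy classifications
rows = [
    ("gpt-5", "aslett", "true", "true"),
    ("gpt-5", "aslett", "true", "can't tell"),
    ("gpt-5", "aslett", "false/misleading", "false/misleading"),
    ("grok-4", "aslett", "true", "true"),
]

# per (model, data_source, human label): [matches, total]
agree_n = defaultdict(lambda: [0, 0])
for model, source, human, llm in rows:
    cell = agree_n[(model, source, human)]
    cell[0] += llm == human
    cell[1] += 1

# share of LLM classifications agreeing with the human rating (= x in the figure)
llm_human_agree = {k: m / n for k, (m, n) in agree_n.items()}
print(llm_human_agree[("gpt-5", "aslett", "true")])  # → 0.5
```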
summary
- perplexity models (sonar) do better at classifying false/misleading
- gpt-4o-search and gemini-3-pro better at classifying true
- claude-opus and gpt-5 do better at can't tell
![[1766432844.png]]
data table
```r
model data_source fc0_recoded llm_human_agree n
<char> <char> <char> <num> <int>
1: gpt-4o-search-preview allenvaccine true 1.00000000 33
2: gemini-3-pro-preview allenvaccine true 0.96969697 33
3: gpt-4o-search-preview aslett true 0.96078431 102
4: grok-4 allenvaccine true 0.90909091 33
5: gpt-4o-search-preview allen2021 true 0.89189189 74
6: gemini-3-pro-preview aslett true 0.88461538 104
7: o1-pro allenvaccine true 0.84848485 33
8: gemini-3-pro-preview allen2021 true 0.83783784 74
9: grok-4 aslett true 0.82692308 104
10: gpt-5-pro allenvaccine true 0.78787879 33
11: o1 allenvaccine true 0.78787879 33
12: gpt-5-pro aslett true 0.74757282 103
13: o1-pro aslett true 0.73529412 102
14: gpt-5 aslett true 0.70476190 105
15: o1 aslett true 0.66990291 103
16: grok-4 allen2021 true 0.63513514 74
17: gpt-5 allenvaccine true 0.57575758 33
18: gpt-5-pro allen2021 true 0.55405405 74
19: claude-opus-4.1 allenvaccine true 0.54545455 33
20: gpt-5.2-pro allenvaccine true 0.54545455 33
21: gpt-5 allen2021 true 0.52702703 74
22: sonar-pro-search allenvaccine true 0.48484848 33
23: o1-pro allen2021 true 0.47297297 74
24: o1 allen2021 true 0.44594595 74
25: gpt-5.2-pro aslett true 0.40000000 105
26: sonar-pro allenvaccine true 0.39393939 33
27: sonar-pro allen2021 true 0.37837838 74
28: sonar-reasoning-pro allenvaccine true 0.36363636 33
29: sonar-reasoning-pro aslett true 0.33962264 106
30: sonar-pro-search allen2021 true 0.33783784 74
31: claude-opus-4.1 aslett true 0.30097087 103
32: sonar-pro aslett true 0.29523810 105
33: sonar-pro-search aslett true 0.25714286 105
34: sonar-reasoning-pro allen2021 true 0.25675676 74
35: claude-opus-4.1 allen2021 true 0.24324324 74
36: gpt-5.2-pro allen2021 true 0.20270270 74
37: sonar-reasoning-pro aslett false/misleading 0.96875000 32
38: sonar-pro-search allen2021 false/misleading 0.94285714 70
39: sonar-pro-search aslett false/misleading 0.93750000 32
40: sonar-reasoning-pro allen2021 false/misleading 0.92857143 70
41: sonar-pro allen2021 false/misleading 0.91428571 70
42: sonar-pro allenvaccine false/misleading 0.87500000 16
43: sonar-pro aslett false/misleading 0.84375000 32
44: o1-pro allen2021 false/misleading 0.81428571 70
45: sonar-reasoning-pro allenvaccine false/misleading 0.81250000 16
46: grok-4 allen2021 false/misleading 0.78571429 70
47: grok-4 aslett false/misleading 0.78125000 32
48: gemini-3-pro-preview allenvaccine false/misleading 0.75000000 16
49: grok-4 allenvaccine false/misleading 0.75000000 16
50: o1 allen2021 false/misleading 0.72857143 70
51: gemini-3-pro-preview allen2021 false/misleading 0.71428571 70
52: gpt-5.2-pro aslett false/misleading 0.69696970 33
53: o1-pro aslett false/misleading 0.65625000 32
54: gpt-5-pro allen2021 false/misleading 0.64285714 70
55: sonar-pro-search allenvaccine false/misleading 0.62500000 16
56: gpt-5.2-pro allen2021 false/misleading 0.61428571 70
57: gpt-5 allen2021 false/misleading 0.60000000 70
58: o1 aslett false/misleading 0.59375000 32
59: gpt-5-pro aslett false/misleading 0.59375000 32
60: gpt-4o-search-preview allen2021 false/misleading 0.54285714 70
61: claude-opus-4.1 allen2021 false/misleading 0.54285714 70
62: o1 allenvaccine false/misleading 0.50000000 16
63: gemini-3-pro-preview aslett false/misleading 0.50000000 32
64: gpt-5 aslett false/misleading 0.46875000 32
65: gpt-5.2-pro allenvaccine false/misleading 0.43750000 16
66: o1-pro allenvaccine false/misleading 0.43750000 16
67: gpt-5 allenvaccine false/misleading 0.43750000 16
68: gpt-5-pro allenvaccine false/misleading 0.43750000 16
69: claude-opus-4.1 aslett false/misleading 0.40625000 32
70: gpt-4o-search-preview allenvaccine false/misleading 0.37500000 16
71: gpt-4o-search-preview aslett false/misleading 0.28125000 32
72: claude-opus-4.1 allenvaccine false/misleading 0.25000000 16
73: claude-opus-4.1 allenvaccine can't tell 0.69230769 13
74: gpt-5.2-pro aslett can't tell 0.69230769 13
75: claude-opus-4.1 aslett can't tell 0.64285714 14
76: gpt-5 aslett can't tell 0.61538462 13
77: o1 aslett can't tell 0.61538462 13
78: o1-pro aslett can't tell 0.61538462 13
79: gpt-5-pro aslett can't tell 0.61538462 13
80: gpt-5 allenvaccine can't tell 0.46153846 13
81: o1 allenvaccine can't tell 0.38461538 13
82: gpt-5-pro allenvaccine can't tell 0.38461538 13
83: o1-pro allenvaccine can't tell 0.30769231 13
84: gpt-5.2-pro allenvaccine can't tell 0.30769231 13
85: gemini-3-pro-preview aslett can't tell 0.28571429 14
86: gpt-4o-search-preview aslett can't tell 0.25000000 16
87: sonar-reasoning-pro allenvaccine can't tell 0.15384615 13
88: grok-4 aslett can't tell 0.15384615 13
89: sonar-pro aslett can't tell 0.14285714 14
90: sonar-pro-search aslett can't tell 0.14285714 14
91: gemini-3-pro-preview allenvaccine can't tell 0.07692308 13
92: grok-4 allenvaccine can't tell 0.07692308 13
93: gpt-4o-search-preview allenvaccine can't tell 0.07692308 13
94: sonar-pro-search allenvaccine can't tell 0.07692308 13
95: sonar-pro allenvaccine can't tell 0.00000000 13
96: sonar-reasoning-pro aslett can't tell 0.00000000 13
model data_source fc0_recoded llm_human_agree n
```
## base rates
![[1766447214.png]]
## different measures of agreement that can handle multi-class variables
gemini-3-pro's classifications agree most with humans
![[20251222194358.png]]
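as one concrete example of a multi-class-capable agreement measure, a hand-rolled Cohen's kappa (chance-corrected agreement between two raters; the figure above may use other measures, this is just an illustration with toy labels):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Chance-corrected multi-class agreement between two raters."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n    # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / n**2    # agreement expected by chance
    return (po - pe) / (1 - pe)

# toy labels: 1 true, 0 false/misleading, -1 can't tell
human = [1, 0, -1, 1, 0, 0, 1, -1]
llm = [1, 0, -1, 1, 1, 0, 0, -1]
print(round(cohens_kappa(human, llm), 3))  # → 0.619
```

for two raters this matches `sklearn.metrics.cohen_kappa_score`; Krippendorff's alpha is a common alternative that also handles more than two raters and missing ratings.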