# Why Birdwatch LLM-human agreement is so much worse than human-human agreement

Exploratory analysis comparing LLM fact-checking ratings with two human rater groups (fcch and fcpg) on 457 Birdwatch posts, across 11 models and 4 context conditions.

## Data

- `fc_llm_wide.csv`: 19,194 rows (457 posts × 42 model-context combos, of 44 possible). Each row has LLM ratings, fcch (factchecker human) ratings, and fcpg (factchecker public/general) ratings on a 1-7 scale across 7 items (unbiased, objective, actually happened, accurate, reliable, trustworthy, true), plus categorical misleadingness labels (4-level, 3-level, 2-level).
- `for-llms-with-media-classified.csv`: input data with original tweet content and metadata.

## Reference benchmarks

- **Birdwatch**: inter-human r = 0.63 (from figures); LLM-human r ~ 0.2-0.3
- **Thomas dataset**: inter-human r = 0.68; LLM-human r ~ 0.6-0.8 (much better)

## Key findings

### 1. Central tendency bias (restricted range)

LLMs avoid extreme ratings, compressing their responses toward the middle of the 1-7 scale. This restricted range mechanically attenuates all correlations.

| Rater | `mr_mean` std | % extreme ratings (1 or 7) on `mra_actually_happened` |
|-------|---------------|-------------------------------------------------------|
| LLM (claude 46) | 1.15 | 10.5% |
| fcch | 1.64 | 39.4% |
| fcpg | 2.21 | 51.2% |

This pattern holds across all 7 rating items. For example, on `mru_unbiased`, humans use extremes 54-66% of the time while LLMs use them only 32% of the time.

### 2. Human raters have very different base rates

The two human groups disagree on what counts as misleading:

| Rater | % misleadingfalse | % true |
|-------|-------------------|--------|
| fcch | 82.7% | 17.3% |
| fcpg | 52.3% | 46.6% |
| LLM (all models, post+media) | 64.1% | 25.1% |

fcch labels nearly everything as misleading/false, while fcpg is much more balanced. LLMs fall in between but cannot match both groups simultaneously.
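The central-tendency mechanism from finding 1 can be illustrated with a small simulation. Everything below is an illustrative assumption, not a model fitted to these data: a latent 1-7 veracity signal, Gaussian rater noise, and an LLM that "hedges" by falling back to the scale midpoint on some fraction of posts.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Toy generative model (assumed, not fitted): latent veracity plus
# independent per-rater noise, clipped to the 1-7 scale.
truth = rng.uniform(1, 7, n)
human = np.clip(truth + rng.normal(0, 1.5, n), 1, 7)
llm_signal = np.clip(truth + rng.normal(0, 1.5, n), 1, 7)

def r(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# A hedging LLM: on a fraction of posts it abandons its own signal and
# outputs the midpoint (4), shrinking variance, emptying the extremes,
# and attenuating its correlation with the human rater.
for hedge_rate in (0.0, 0.3, 0.6):
    hedged = np.where(rng.random(n) < hedge_rate, 4.0, llm_signal)
    extremes = np.mean((hedged <= 1) | (hedged >= 7))
    print(f"hedge {hedge_rate:.0%}: std={hedged.std():.2f}, "
          f"extremes={extremes:.1%}, r vs human={r(hedged, human):.2f}")
```

Note that a purely linear shrinkage toward the midpoint would leave Pearson r unchanged; the attenuation here comes from replacing signal with a constant on part of the sample, which is one plausible reading of how the models hedge.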
This 30+ percentage-point gap in base rates between the human groups is a structural ceiling on LLM-human agreement: no single set of LLM labels can match both groups at once.

### 3. All models perform similarly poorly

No model comes close to human-human agreement. LLM-fcch r ranges from 0.20 to 0.28 and LLM-fcpg r from 0.22 to 0.30 across all model-context combinations. Adding more context (media, web search) provides only marginal improvement.

| Model | LLM-fcch r (post+media) | LLM-fcpg r (post+media) |
|-------|-------------------------|-------------------------|
| gpt 4 osearch | 0.273 | 0.285 |
| gemini 3 flash | 0.259 | 0.249 |
| claude 46 | 0.255 | 0.272 |
| grok 4 | 0.247 | 0.239 |
| gpt 52 pro | 0.232 | 0.253 |

For comparison, human-human r (fcch-fcpg) = 0.657.

### 4. LLMs sit between the two human groups

On average, LLMs rate posts slightly higher (more true) than fcch but lower than fcpg:

- Mean signed diff (LLM - fcch): +0.07 (LLM slightly more generous)
- Mean signed diff (LLM - fcpg): -0.90 (LLM much less generous)
- 54% of posts rated higher than fcch, only 37% rated higher than fcpg

### 5. LLMs especially fail on posts humans rate as "true"

When fcch labels a post as true (n=79), LLMs agree only 22.8% of the time. LLMs give these posts a mean of 3.36 on the 1-7 scale, while fcch gives them 5.28. LLMs are systematically biased toward calling things misleading.

When fcch labels a post as misleadingfalse (n=378), LLMs agree 76.7% of the time — agreement on "clearly misleading" posts is reasonable, but LLMs treat too many true posts as uncertain or misleading.

### 6. Agreement depends on whether humans themselves agree

| Condition | N | LLM-fcch r | LLM-fcpg r |
|-----------|---|------------|------------|
| Both humans agree | 297 | 0.356 | 0.352 |
| Humans disagree | 160 | -0.037 | 0.054 |

When both human groups agree on a post's label, LLM-human r jumps to 0.35 — still well below human-human levels, but substantially better. When humans disagree, LLM-human correlation drops to essentially 0.
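Finding 6 can be sketched end to end on synthetic data. All column names, cutoffs, and the generative model below are hypothetical placeholders (the real `fc_llm_wide.csv` schema may differ); the point is only that when two rater groups derive binary labels from different cutoffs on noisy scores, the disagreement cases are exactly the ambiguous, range-restricted ones, which depresses within-group correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000  # more posts than the real 457, for a stable illustration

# Synthetic stand-in for fc_llm_wide.csv; every column name is a placeholder.
latent = rng.normal(4, 2, n)                              # latent veracity
llm = np.clip(latent + rng.normal(0, 1.5, n), 1, 7)
fcch = np.clip(latent + rng.normal(0, 1.5, n), 1, 7)
fcpg = np.clip(latent + 1 + rng.normal(0, 1.5, n), 1, 7)  # fcpg rates higher

df = pd.DataFrame({
    "mr_mean_llm": llm,
    "mr_mean_fcch": fcch,
    # Group-specific cutoffs mimic the base-rate gap: fcch calls far
    # fewer posts "true" than fcpg does.
    "lab_fcch": np.where(fcch >= 6.0, "true", "misleadingfalse"),
    "lab_fcpg": np.where(fcpg >= 4.5, "true", "misleadingfalse"),
})
df["humans_agree"] = df["lab_fcch"] == df["lab_fcpg"]

# Correlate LLM and fcch scale ratings separately within each subgroup.
# Disagreement cases cluster near the cutoffs (ambiguous latent veracity),
# so their restricted range yields a much lower within-group r.
for agree, g in df.groupby("humans_agree"):
    print(f"humans_agree={agree}: n={len(g)}, "
          f"LLM-fcch r={g['mr_mean_llm'].corr(g['mr_mean_fcch']):.2f}")
```

On the real data the same split is a groupby on an agreement flag computed from the fcch and fcpg misleadingness labels.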
This suggests the poor performance is concentrated on ambiguous posts.

### 7. Post length does not predict disagreement

Correlation between post word count and LLM-human disagreement: r = -0.028. Disagreement is roughly uniform across short (24 words) and long (297 words) posts.

### 8. Per-item breakdown

All 7 rating items show similarly poor LLM-human correlations (0.19-0.26), and similarly low LLM variance relative to humans:

| Item | LLM-fcch r | LLM-fcpg r | fcch-fcpg r | LLM std | fcch std | fcpg std |
|------|------------|------------|-------------|---------|----------|----------|
| mru_unbiased | 0.261 | 0.247 | 0.534 | 1.30 | 1.59 | 2.28 |
| mro_objective | 0.261 | 0.218 | 0.520 | 1.18 | 1.56 | 2.37 |
| mra_actually_happened | 0.187 | 0.239 | 0.577 | 1.50 | 2.15 | 2.31 |
| mrc_accurate | 0.208 | 0.232 | 0.587 | 1.28 | 1.78 | 2.33 |
| mrr_reliable | 0.233 | 0.250 | 0.652 | 1.20 | 1.85 | 2.39 |
| mrw_trustworthy | 0.224 | 0.249 | 0.646 | 1.19 | 1.85 | 2.40 |
| mrh_true | 0.191 | 0.245 | 0.588 | 1.29 | 1.99 | 2.32 |
| **mr_mean** | **0.255** | **0.272** | **0.657** | **1.15** | **1.64** | **2.21** |

No single item is an outlier — the issue is systematic.

## Detailed examples

All examples use context = post+media+native across 5 LLMs: claude 46, grok 4, gemini 31 pro, gpt 54, gpt 4 osearch.

### LLM says misleading/false, humans say true

These are cases where the LLM rated a post as misleading or false, but both human rater groups rated it as true. This is the more common direction of disagreement.

---

#### Amazon "peeing in bottles" denial

**URL:** https://x.com/amazonnews/status/1374911222361956359

**Tweet:** "@repmarkpocan 1/2 You don't really believe the peeing in bottles thing, do you? If that were true, nobody would work for us. The truth is that we have over a million incredible employees around the world who are proud of what they do, and have great wages and health care from day one."
| | claude 46 | grok 4 | gemini 31 pro | gpt 54 | gpt 4 osearch | fcch | fcpg |
|---|---|---|---|---|---|---|---|
| **mis_3 levels** | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | true | true |
| **mr_mean** | 1.14 | 1.71 | 1.00 | 2.57 | 2.00 | 6.43 | 7.00 |
| mru_unbiased | 1 | 2 | 1 | 1 | 2 | 5 | 7 |
| mro_objective | 1 | 2 | 1 | 2 | 2 | 5 | 7 |
| mra_actually_happened | 2 | 2 | 1 | 7 | 2 | 7 | 7 |
| mrc_accurate | 1 | 1 | 1 | 2 | 2 | 7 | 7 |
| mrr_reliable | 1 | 2 | 1 | 2 | 2 | 7 | 7 |
| mrw_trustworthy | 1 | 2 | 1 | 2 | 2 | 7 | 7 |
| mrh_true | 1 | 1 | 1 | 2 | 2 | 7 | 7 |

**LLM explanations:**

- **claude 46:** "This tweet from Amazon's official account denied that workers urinate in bottles, a claim that was extensively documented by multiple journalists, worker testimonials, and internal Amazon documents. Amazon itself later apologized for this tweet, calling it an 'own-goal' and admitting 'the tweet was incorrect.'"
- **grok 4:** "The post denies reports of Amazon workers urinating in bottles, which evidence from multiple sources including Amazon's later apology confirms as true, making the denial false and misleading."
- **gemini 31 pro:** "The post falsely denies well-documented reports that Amazon delivery drivers sometimes have to urinate in bottles due to grueling schedules, a fact Amazon itself later admitted in a public apology."
- **gpt 54:** "The post makes factual claims denying that workers urinate in bottles and broadly portraying Amazon working conditions as strong, but Amazon later acknowledged this denial was wrong for its delivery workforce."
- **gpt 4 osearch:** "The post denies reports of Amazon workers urinating in bottles, but evidence indicates that some drivers have had to do so due to time constraints."

**What happened:** The LLM correctly identified the Amazon tweet as misleading corporate PR — Amazon later admitted the tweet was wrong. But the human raters labeled the tweet as "true."
This suggests the human raters may have been evaluating whether the tweet accurately quotes Amazon (it does), while the LLM evaluated whether Amazon's *claim* was true (it wasn't): a fundamental disagreement about what "true" means in this context.

---

#### "The greatest vaccine is being in good health"

**URL:** https://x.com/CoachDanGo/status/1356226387842117632

**Tweet:** "The greatest vaccine in the world is being in good health."

| | claude 46 | grok 4 | gemini 31 pro | gpt 54 | gpt 4 osearch | fcch | fcpg |
|---|---|---|---|---|---|---|---|
| **mis_3 levels** | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | true | true |
| **mr_mean** | 2.14 | 3.00 | 2.00 | 2.71 | 2.14 | 7.00 | 7.00 |
| mru_unbiased | 2 | 3 | 3 | 2 | 3 | 7 | 7 |
| mro_objective | 2 | 3 | 2 | 2 | 3 | 7 | 7 |
| mra_actually_happened | 3 | 1 | 1 | 7 | 1 | 7 | 7 |
| mrc_accurate | 2 | 3 | 2 | 2 | 2 | 7 | 7 |
| mrr_reliable | 2 | 4 | 2 | 2 | 2 | 7 | 7 |
| mrw_trustworthy | 2 | 4 | 2 | 2 | 2 | 7 | 7 |
| mrh_true | 2 | 3 | 2 | 2 | 2 | 7 | 7 |

**LLM explanations:**

- **claude 46:** "This post implies that being healthy is a substitute for vaccination. While good health is beneficial, it does not provide the specific immune protection that vaccines offer. This claim is misleading and potentially harmful as it could discourage vaccination."
- **grok 4:** "The post is an opinion equating good health to the 'greatest vaccine,' which is misleading as it may downplay the importance of actual vaccines, potentially leading to harmful behaviors such as vaccine hesitancy."
- **gemini 31 pro:** "The post presents a subjective metaphor equating good general health to a vaccine. This is potentially misleading, as good health alone does not provide the specific adaptive immunity that actual vaccines offer."
- **gpt 54:** "This is framed as a broad health claim/opinion, but it is misleading because good general health is not a vaccine and does not replace vaccination's specific protective effect against infectious diseases."
- **gpt 4 osearch:** "While maintaining good health is important, it does not replace the need for vaccinations, which are specifically designed to protect against infectious diseases."

**What happened:** The LLM interpreted this as anti-vaccine messaging in the COVID pandemic context. The human raters gave it perfect 7s across the board, apparently taking the statement at face value as a truism about health. This illustrates a core tension: the LLM reasons about implied meaning and downstream harm, while humans may rate the literal surface claim.

---

#### "Earth, Venus and Jupiter as seen from Mars"

**URL:** https://x.com/RyanHaecker/status/1364287711302008836

**Tweet:** "Earth, Venus and Jupiter as seen from Mars" (with an image)

| | claude 46 | grok 4 | gemini 31 pro | gpt 54 | gpt 4 osearch | fcch | fcpg |
|---|---|---|---|---|---|---|---|
| **mis_3 levels** | misleadingfalse | misleadingfalse | misleadingfalse | **true** | misleadingfalse | true | true |
| **mr_mean** | 2.14 | 4.14 | 1.86 | **5.71** | 2.14 | 6.86 | 7.00 |
| mru_unbiased | 5 | 6 | 4 | 6 | 5 | 6 | 7 |
| mro_objective | 4 | 6 | 4 | 6 | 5 | 7 | 7 |
| mra_actually_happened | 1 | 4 | 1 | 6 | 1 | 7 | 7 |
| mrc_accurate | 1 | 4 | 1 | 6 | 1 | 7 | 7 |
| mrr_reliable | 1 | 3 | 1 | 5 | 1 | 7 | 7 |
| mrw_trustworthy | 2 | 3 | 1 | 5 | 1 | 7 | 7 |
| mrh_true | 1 | 3 | 1 | 6 | 1 | 7 | 7 |

**LLM explanations:**

- **claude 46:** "The image is not a real photograph taken from Mars. Multiple fact-checkers have confirmed it is a computer-generated image likely produced using planetarium software like SkySafari, first debunked in 2012. NASA confirmed the image did not originate from any of their missions."
- **grok 4:** "The post presents a computer-generated image as if it were a real view from Mars, which is misleading since no such actual photograph exists, though the simulated alignment may be astronomically possible."
- **gemini 31 pro:** "The image is not a real photograph taken from Mars; it is a computer-generated image created using planetarium software."
- **gpt 54:** "The post makes a straightforward factual caption about an astronomical image, and nothing in the text appears deceptive or exaggerated. Based on the caption and accompanying image description, it is plausible and likely accurate." *(NOTE: gpt 54 is the only LLM that rated this as true — it failed to detect the fake image)*
- **gpt 4 osearch:** "The image in the post is a computer-generated simulation, not an actual photograph, as confirmed by experts."

**What happened:** The LLM (with native web search) correctly identified this as a debunked CGI image, giving it 1s on accuracy. The human raters gave it 7s — they likely didn't have the context to know it was fake. This is a case where the LLM is arguably *more correct* than the humans, enabled by web search access.

---

#### KKK massacre claim

**URL:** https://twitter.com/ladydownintexas/status/1374764213982355462

**Tweet:** "I am the daughter of the Grand Dragon of the Ku Klux Klan. 43 years ago my father massacred 30+ Vietnamese shrimpers in Rockport, Texas. I have been fighting to tell what I know and so far I have been unsupported."
| | claude 46 | grok 4 | gemini 31 pro | gpt 54 | gpt 4 osearch | fcch | fcpg |
|---|---|---|---|---|---|---|---|
| **mis_3 levels** | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | true | true |
| **mr_mean** | 1.43 | 1.00 | 1.57 | 2.00 | 1.00 | 5.57 | 6.57 |
| mru_unbiased | 3 | 1 | 3 | 1 | 1 | 4 | 5 |
| mro_objective | 2 | 1 | 3 | 2 | 1 | 4 | 6 |
| mra_actually_happened | 1 | 1 | 1 | 3 | 1 | 7 | 7 |
| mrc_accurate | 1 | 1 | 1 | 2 | 1 | 6 | 7 |
| mrr_reliable | 1 | 1 | 1 | 2 | 1 | 6 | 7 |
| mrw_trustworthy | 1 | 1 | 1 | 2 | 1 | 6 | 7 |
| mrh_true | 1 | 1 | 1 | 2 | 1 | 6 | 7 |

**LLM explanations:**

- **claude 46:** "The post claims a KKK Grand Dragon 'massacred 30+ Vietnamese shrimpers in Rockport, Texas' approximately 43 years ago. Extensive historical records document the real KKK-Vietnamese shrimper conflict, but there is no record whatsoever of any massacre of 30+ people."
- **grok 4:** "Historical records confirm tensions and KKK intimidation against Vietnamese shrimpers in Texas in 1979, but no massacre of 30+ people occurred in Rockport or nearby areas; the post fabricates or exaggerates events."
- **gemini 31 pro:** "While the KKK did clash with and intimidate Vietnamese fishermen in Texas during that era, there is no historical record of a massacre of over 30 people in Rockport."
- **gpt 54:** "Historical sources support KKK intimidation and violence against Vietnamese shrimpers on the Texas Gulf Coast, but I could not find corroboration for a massacre of more than 30 people."
- **gpt 4 osearch:** "The post claims a massacre of over 30 Vietnamese shrimpers in Rockport, Texas. However, there is no historical evidence supporting such an event."

**What happened:** The LLM correctly identified that while KKK-Vietnamese shrimper violence did occur, no massacre of 30+ people is documented in any historical record. The human raters apparently accepted the claim at face value.
Another case where the LLM appears more accurate.

---

#### CDC vaccine refusal claim

**URL:** https://x.com/ShannonJoyRadio/status/1408130418742857732

**Tweet:** "Here's the kicker ... the CDC is perfectly comfortable condemning your child to serious injury or DEATH to prevent ONE COVID DEATH ... while CDC employees themselves have refused the vaccine to the tune of SIXTY PERCENT!!!"

| | claude 46 | grok 4 | gemini 31 pro | gpt 54 | gpt 4 osearch | fcch | fcpg |
|---|---|---|---|---|---|---|---|
| **mis_3 levels** | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | true |
| **mr_mean** | 1.14 | 1.00 | 1.71 | 1.43 | 1.00 | 4.57 | 7.00 |
| mru_unbiased | 1 | 1 | 1 | 1 | 1 | 5 | 7 |
| mro_objective | 1 | 1 | 1 | 1 | 1 | 5 | 7 |
| mra_actually_happened | 2 | 1 | 2 | 4 | 1 | 5 | 7 |
| mrc_accurate | 1 | 1 | 2 | 1 | 1 | 4 | 7 |
| mrr_reliable | 1 | 1 | 2 | 1 | 1 | 4 | 7 |
| mrw_trustworthy | 1 | 1 | 2 | 1 | 1 | 4 | 7 |
| mrh_true | 1 | 1 | 2 | 1 | 1 | 5 | 7 |

**LLM explanations:**

- **claude 46:** "The post claims 60% of CDC employees 'refused' the vaccine, which is false. The 60% figure referred to NIAID (not CDC) employees who HAD been vaccinated. Furthermore, 'not yet vaccinated' does not equal 'refused.'"
- **grok 4:** "The post contains false information about CDC employees' vaccine refusal rates, which fact-checks show to be around 4% unvaccinated, and exaggerates myocarditis risks."
- **gemini 31 pro:** "The claim that 60% of CDC employees refused the COVID-19 vaccine is inaccurate and stems from a misinterpretation of testimony regarding vaccination rates."
- **gpt 54:** "The post makes inflammatory factual claims that are unsupported or misleading, especially the assertion that 60% of CDC employees refused vaccination."
- **gpt 4 osearch:** "The post falsely claims that 60% of CDC employees refused the COVID-19 vaccine, a statistic not supported by available data."
**What happened:** This is a case where the humans themselves disagree (fcch=4.57, fcpg=7.00). The LLM agrees with fcch that it's misleading but rates it much more harshly (1.14 vs 4.57). fcpg rated it 7.00 (completely true), possibly interpreting the post differently.

---

### LLM says true, humans say misleading/false

These are cases where the LLM rated a post as true, but human raters rated it as misleading or false. The LLM tends to take factual reporting at face value, while humans may be rating the *underlying claim* or the *framing*.

---

#### FAA gender-neutral language recommendation

**URL:** https://x.com/washingtonpost/status/1407857072130174978

**Tweet:** "The committee said the FAA should replace words and phrases such as 'cockpit,' 'unmanned aviation' and 'man-made' with gender-neutral terms such as 'flight deck,' 'uncrewed aviation' and 'machine-made'"

| | claude 46 | grok 4 | gemini 31 pro | gpt 54 | gpt 4 osearch | fcch | fcpg |
|---|---|---|---|---|---|---|---|
| **mis_3 levels** | true | true | true | true | true | misleadingfalse | misleadingfalse |
| **mr_mean** | 6.14 | 7.00 | 6.71 | 6.14 | 7.00 | 1.00 | 1.43 |
| mru_unbiased | 6 | 7 | 6 | 6 | 7 | 1 | 1 |
| mro_objective | 6 | 7 | 6 | 6 | 7 | 1 | 1 |
| mra_actually_happened | 7 | 7 | 7 | 7 | 7 | 1 | 2 |
| mrc_accurate | 6 | 7 | 7 | 6 | 7 | 1 | 2 |
| mrr_reliable | 6 | 7 | 7 | 6 | 7 | 1 | 1 |
| mrw_trustworthy | 6 | 7 | 7 | 6 | 7 | 1 | 1 |
| mrh_true | 6 | 7 | 7 | 6 | 7 | 1 | 2 |

All 5 LLMs unanimously rate this as true (6-7 range). Both human groups rate it 1-2 on every item.

**LLM explanations:**

- **claude 46:** "The post accurately summarizes a real report from the FAA's Drone Advisory Committee released in June 2021, which recommended replacing gender-specific terms with gender-neutral alternatives."
- **grok 4:** "The post accurately reports a factual recommendation from an FAA committee on adopting gender-neutral terms, as verified by multiple reliable sources."
- **gemini 31 pro:** "The post accurately reports a recommendation made in June 2021 by an FAA advisory committee to adopt gender-neutral aviation terminology."
- **gpt 54:** "This post accurately summarizes a 2021 recommendation from an FAA advisory committee that suggested replacing terms like 'cockpit' with more gender-neutral alternatives. It is a factual claim and largely neutral."
- **gpt 4 osearch:** "The post accurately reports the FAA committee's recommendations to adopt gender-neutral language in aviation terminology."

**What happened:** The LLM verified the reporting was accurate — the committee did make this recommendation. The humans rated it 1s across the board, apparently judging the *underlying recommendation* as misleading rather than the *accuracy of the reporting*. This is a clear case of different evaluation frameworks: the LLM evaluates factual accuracy of the report, humans evaluate the claim being reported on.

---

#### Meghan McCain blames media for radicalizing Republicans

**URL:** https://x.com/therecount/status/1356305017175728129

**Tweet:** "The View co-host Meghan McCain blames media coverage of Rep. Marjorie Taylor Greene (R-GA) for radicalizing Republicans: 'It makes traditional Republicans go back into their corners ... and I would argue this is how we got Trump in the first place.'"

| | claude 46 | grok 4 | gemini 31 pro | gpt 54 | gpt 4 osearch | fcch | fcpg |
|---|---|---|---|---|---|---|---|
| **mis_3 levels** | true | true | true | true | true | misleadingfalse | misleadingfalse |
| **mr_mean** | 5.57 | 7.00 | 6.71 | 5.71 | 6.71 | 1.00 | 1.00 |
| mru_unbiased | 5 | 7 | 6 | 4 | 6 | 1 | 1 |
| mro_objective | 5 | 7 | 6 | 5 | 6 | 1 | 1 |
| mra_actually_happened | 7 | 7 | 7 | 7 | 7 | 1 | 1 |
| mrc_accurate | 6 | 7 | 7 | 6 | 7 | 1 | 1 |
| mrr_reliable | 5 | 7 | 7 | 6 | 7 | 1 | 1 |
| mrw_trustworthy | 5 | 7 | 7 | 6 | 7 | 1 | 1 |
| mrh_true | 6 | 7 | 7 | 6 | 7 | 1 | 1 |

All 5 LLMs unanimously rate this as true (5.6-7.0 range).
Both human groups give all 1s.

**LLM explanations:**

- **claude 46:** "The post accurately reports what Meghan McCain said on The View. Multiple sources confirm the quote, verified by video evidence."
- **grok 4:** "The post accurately reports a statement made by Meghan McCain on The View, as confirmed by the original X post and attached video."
- **gemini 31 pro:** "The post accurately quotes Meghan McCain's statements made on 'The View' regarding media coverage of Marjorie Taylor Greene."
- **gpt 54:** "The post appears to be a straightforward quote-based report of something Meghan McCain said on The View, and the accompanying metadata supports that such a televised segment existed."
- **gpt 4 osearch:** "The post accurately reports Meghan McCain's statement on 'The View' regarding media coverage of Rep. Marjorie Taylor Greene, as evidenced by the linked video."

**What happened:** All 5 LLMs evaluated whether the quote was accurately reported (it was). The humans gave all 1s, seemingly judging whether McCain's *argument* (that media coverage radicalizes Republicans) is true — which is an opinion, not a factual claim. The LLMs and humans are answering different questions.

---

#### PolitiFact on Biden insulin price claims

**URL:** https://x.com/PolitiFact/status/1365077693042155520

**Tweet:** "Beware of Facebook posts blaming Joe Biden for insulin price hikes. Experts said any recent price changes facing individual patients are likely due to the way insurance works, not the Biden administration."
| | claude 46 | grok 4 | gemini 31 pro | gpt 54 | gpt 4 osearch | fcch | fcpg |
|---|---|---|---|---|---|---|---|
| **mis_3 levels** | true | true | true | true | true | misleadingfalse | misleadingfalse |
| **mr_mean** | 5.71 | 7.00 | 6.00 | 6.14 | 7.00 | 1.29 | 1.00 |
| mru_unbiased | 5 | 7 | 6 | 6 | 7 | 3 | 1 |
| mro_objective | 5 | 7 | 6 | 6 | 7 | 1 | 1 |
| mra_actually_happened | 6 | 7 | 6 | 7 | 7 | 1 | 1 |
| mrc_accurate | 6 | 7 | 6 | 6 | 7 | 1 | 1 |
| mrr_reliable | 6 | 7 | 6 | 6 | 7 | 1 | 1 |
| mrw_trustworthy | 6 | 7 | 6 | 6 | 7 | 1 | 1 |
| mrh_true | 6 | 7 | 6 | 6 | 7 | 1 | 1 |

All 5 LLMs unanimously rate this as true (5.7-7.0 range). Both human groups rate it at or near 1 on every item.

**LLM explanations:**

- **claude 46:** "This PolitiFact post accurately warns that Facebook posts blaming Biden for insulin price hikes are misleading. Multiple fact-checkers confirmed Biden's freeze of a Trump-era rule could not have raised insulin prices."
- **grok 4:** "The post from PolitiFact accurately debunks misleading claims blaming Biden for insulin price hikes, supported by fact-checks and data."
- **gemini 31 pro:** "This post is from a reputable fact-checking organization accurately summarizing experts' explanations that individual insulin price changes are tied to insurance rather than the Biden administration."
- **gpt 54:** "The post accurately reflects contemporaneous fact-checking: experts said early 2021 insulin price changes were more likely tied to insurance deductibles and coverage rules than to any Biden administration action."
- **gpt 4 osearch:** "The post accurately informs readers that recent insulin price changes are more likely due to insurance mechanisms rather than actions by the Biden administration."

**What happened:** The LLM treated PolitiFact as a reliable fact-checking source and verified the claim. The human raters gave it 1s — they may be evaluating PolitiFact's *conclusion* as politically biased rather than evaluating whether the reporting is accurate.
This reveals a likely ideological dimension in the human ratings.

---

#### Ted Cruz response to AOC

**URL:** https://x.com/ReporterCioffi/status/1354871915823976448

**Tweet:** "Cruz response to @AOC tweet: 'You know, there's a lot of partisan anger and rage on the Democratic side. It's, it's not healthy for our country, it's certainly not conducive of healing or unity, but everyone has to decide how they want to interact with others.' via hill pool"

| | claude 46 | grok 4 | gemini 31 pro | gpt 54 | gpt 4 osearch | fcch | fcpg |
|---|---|---|---|---|---|---|---|
| **mis_3 levels** | true | true | true | true | true | misleadingfalse | misleadingfalse |
| **mr_mean** | 6.00 | 7.00 | 6.71 | 5.00 | 6.14 | 1.57 | 1.43 |
| mru_unbiased | 5 | 7 | 6 | 4 | 5 | 1 | 1 |
| mro_objective | 5 | 7 | 6 | 5 | 5 | 1 | 1 |
| mra_actually_happened | 7 | 7 | 7 | 6 | 7 | 3 | 2 |
| mrc_accurate | 6 | 7 | 7 | 5 | 7 | 1 | 2 |
| mrr_reliable | 6 | 7 | 7 | 5 | 6 | 1 | 1 |
| mrw_trustworthy | 6 | 7 | 7 | 5 | 6 | 1 | 1 |
| mrh_true | 7 | 7 | 7 | 5 | 7 | 3 | 2 |

All 5 LLMs unanimously rate this as true (5.0-7.0 range). Both human groups give mostly 1s.

**LLM explanations:**

- **claude 46:** "This post is a straightforward journalistic report from a Hill pool reporter accurately quoting Ted Cruz's response to AOC's tweet. Multiple news sources confirm this exact quote."
- **grok 4:** "The post accurately quotes Ted Cruz's response to AOC's tweet, as verified by web searches and the original X post, presenting factual reporting without distortion."
- **gemini 31 pro:** "The post accurately reports a documented quote from Senator Ted Cruz in response to a tweet by Representative Alexandria Ocasio-Cortez in January 2021."
- **gpt 54:** "This post appears to report a specific quote attributed to Ted Cruz in response to AOC, and there is nothing internally implausible about it."
- **gpt 4 osearch:** "The post accurately reports Senator Ted Cruz's response to Representative Alexandria Ocasio-Cortez's tweet, as documented in the provided source."

**What happened:** The LLM saw a verbatim journalist quote and rated it true. Humans rated it mostly 1s — possibly evaluating Cruz's *statement* (blaming Democrats for partisan anger) as misleading rather than the accuracy of the *quote reporting*. Same pattern as the Meghan McCain example.

---

#### Robinhood "Let the people trade" screenshot

**URL:** https://x.com/PoorlyAgedStuff/status/1354818584040493056

**Tweet:** (image only — screenshot of Robinhood's old tweet "Let the people trade", posted by @PoorlyAgedStuff on the day Robinhood restricted GameStop trading)

| | claude 46 | grok 4 | gemini 31 pro | gpt 54 | gpt 4 osearch | fcch | fcpg |
|---|---|---|---|---|---|---|---|
| **mis_3 levels** | true | true | true | true | true | misleadingfalse | true |
| **mr_mean** | 6.14 | 6.57 | 6.86 | 5.57 | 6.43 | 2.57 | 1.00 |
| mru_unbiased | 5 | 5 | 6 | 4 | 6 | 1 | 1 |
| mro_objective | 5 | 6 | 7 | 6 | 6 | 1 | 1 |
| mra_actually_happened | 7 | 7 | 7 | 7 | 7 | 7 | 1 |
| mrc_accurate | 7 | 7 | 7 | 6 | 7 | 4 | 1 |
| mrr_reliable | 6 | 7 | 7 | 5 | 6 | 1 | 1 |
| mrw_trustworthy | 6 | 7 | 7 | 5 | 6 | 2 | 1 |
| mrh_true | 7 | 7 | 7 | 6 | 7 | 2 | 1 |

All 5 LLMs rate this as true. fcpg also labels it true but gives all 1s on the items (contradicting their own label). fcch gives mra_actually_happened a 7 (acknowledging the screenshot is real) but low scores elsewhere.

**LLM explanations:**

- **claude 46:** "The post shares a screenshot of Robinhood's real 2016 tweet 'Let the people trade,' posted on January 28, 2021 — the very day Robinhood restricted trading on GameStop. The screenshot is authentic."
- **grok 4:** "The post content is a link to a genuine 2016 tweet from Robinhood stating 'Let the people trade,' which is factual; shared in the context of Robinhood's 2021 trading restrictions, it highlights irony without making misleading claims."
- **gemini 31 pro:** "The post shares an authentic, historical tweet from Robinhood that contrasts with their actions in January 2021, accurately presenting a factual event."
- **gpt 54:** "The post appears to show a real 2016 Robinhood tweet saying 'Let the people trade,' which is a factual screenshot consistent with the account's branding."
- **gpt 4 osearch:** "The post accurately presents a 2016 tweet from Robinhood stating 'Let the people trade.' This tweet resurfaced in 2021 when Robinhood restricted trading on certain stocks."

**What happened:** The LLM sees an authentic screenshot and rates it true, while the humans may be rating the implied *framing* (that Robinhood is hypocritical) rather than the factual content.

---

### Highest agreement: LLM and humans perfectly aligned

These posts show LLMs and both human groups in near-perfect agreement — typically on clear-cut misinformation.

---

#### OANN election fraud promotion

**URL:** https://x.com/OANN/status/1357535145574559746

**Tweet:** "Growing evidence of election fraud reveals that the presidency of the United States has been stolen from the American people. Join MyPillow CEO Mike Lindell for an exclusive report. Tune in to 'Absolute Proof' this Saturday & Sunday starting at 8 AM ET / 5 AM PT. Only on #OANN."
| | claude 46 | grok 4 | gemini 31 pro | gpt 54 | gpt 4 osearch | fcch | fcpg |
|---|---|---|---|---|---|---|---|
| **mis_3 levels** | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse |
| **mr_mean** | 1.00 | 1.00 | 1.00 | 1.14 | 1.00 | 1.00 | 1.00 |
| All 7 items | 1 | 1 | 1 | 1-2 | 1 | 1 | 1 |

Perfect agreement across all LLMs and both human groups. gpt 54 gives mra_actually_happened a 2 instead of 1 — the only deviation.

**LLM explanations:**

- **claude 46:** "This post promotes the thoroughly debunked claim that the 2020 election was stolen through fraud. Numerous courts, election officials, and cybersecurity experts found no evidence of widespread fraud."
- **grok 4:** "The post promotes baseless claims of 2020 election fraud via Mike Lindell's 'Absolute Proof,' which has been thoroughly debunked."
- **gemini 31 pro:** "Extensive audits, court rulings, and election officials across the political spectrum have confirmed there was no widespread fraud that changed the outcome of the 2020 election."
- **gpt 54:** "The post makes a factual claim about election fraud, but courts, election officials, and subsequent reviews found no evidence of outcome-determinative fraud."
- **gpt 4 osearch:** "The post promotes a documentary that has been widely debunked for presenting false claims about election fraud."

---

#### Giuliani Antifa/BLM Capitol claim

**URL:** https://x.com/RudyGiuliani/status/1358485526421110789

**Tweet:** "Antifa and BLM attack DC like they helped organize the January 6 attack on Congress. However, so far their role is being concealed."
| | claude 46 | grok 4 | gemini 31 pro | gpt 54 | gpt 4 osearch | fcch | fcpg |
|---|---|---|---|---|---|---|---|
| **mis_3 levels** | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse |
| **mr_mean** | 1.00 | 1.00 | 1.00 | 1.14 | 1.00 | 1.00 | 1.00 |
| All 7 items | 1 | 1 | 1 | 1-2 | 1 | 1 | 1 |

Again near-perfect agreement. gpt 54 again gives mra_actually_happened a 2.

**LLM explanations:**

- **claude 46:** "The FBI found 'no evidence of anarchist violent extremists or people subscribing to Antifa in connection with Jan. 6.' Out of over 1,500 people charged, none had confirmed links to Antifa or BLM."
- **grok 4:** "The post falsely claims that Antifa and BLM helped organize the January 6 Capitol attack, which is debunked by investigations showing no such involvement."
- **gemini 31 pro:** "Extensive investigations by the FBI and Congress have found no evidence that Antifa or Black Lives Matter helped organize or lead the January 6 Capitol attack."
- **gpt 54:** "Investigations and public statements from the FBI found no evidence that antifa was involved in the Capitol attack."
- **gpt 4 osearch:** "The post falsely claims that Antifa and BLM organized the January 6 attack, debunked by the FBI and the January 6 Select Committee."

---

#### "TRUMP WON"

**URL:** https://x.com/StateofusAll/status/1399961058215870464

**Tweet:** "TRUMP WON...."

| | claude 46 | grok 4 | gemini 31 pro | gpt 54 | gpt 4 osearch | fcch | fcpg |
|---|---|---|---|---|---|---|---|
| **mis_3 levels** | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse | misleadingfalse |
| **mr_mean** | 1.00 | 1.00 | 1.29 | 2.00 | 1.00 | 1.00 | 1.00 |
| All 7 items | 1 | 1 | 1-2 | 1-4 | 1 | 1 | 1 |

Most raters give all 1s. gpt 54 is the most generous, giving mra_actually_happened a 4 — perhaps hedging on whether Trump "won" something other than the presidency.
---

#### "100% of people who get the COVID vaccine die"

**URL:** https://x.com/baseballcrank/status/1386515819497144327

**Tweet:** "100% of people who get the COVID vaccine die. 100% of people who do not get the COVID vaccine die. 100% of people who get COVID die. 100% of people who do not get COVID die. Let this inform your understanding of public health statistics."

| | claude 46 | grok 4 | gemini 31 pro | gpt 54 | gpt 4 osearch | fcch | fcpg |
|---|---|---|---|---|---|---|---|
| **mis_3 levels** | misleadingfalse | **true** | **true** | misleadingfalse | **true** | misleadingfalse | true |
| **mr_mean** | 2.86 | 6.29 | 6.00 | 2.86 | 7.00 | 1.86 | 3.86 |
| mru_unbiased | 3 | 5 | 4 | 2 | 7 | 2 | 3 |
| mro_objective | 3 | 6 | 5 | 3 | 7 | 2 | 4 |
| mra_actually_happened | 4 | 7 | 7 | 7 | 7 | 4 | 4 |
| mrc_accurate | 3 | 7 | 7 | 2 | 7 | 2 | 4 |
| mrr_reliable | 2 | 6 | 6 | 2 | 7 | 1 | 4 |
| mrw_trustworthy | 2 | 6 | 6 | 2 | 7 | 1 | 4 |
| mrh_true | 3 | 7 | 7 | 2 | 7 | 1 | 4 |

This post splits both LLMs and humans. claude 46 and gpt 54 see through the rhetorical trick (misleadingfalse, mr ~2.9). grok 4, gemini 31 pro, and gpt 4 osearch rate it as literally true (mr 6-7). fcch sides with the skeptics (1.86); fcpg is in the middle (3.86).

**LLM explanations:**

- **claude 46:** "While technically true that all humans eventually die, this post is deeply misleading because it uses a rhetorical trick to imply that COVID vaccines and COVID itself have no meaningful impact on mortality."
- **grok 4:** "The post presents literally true statements about universal mortality to make a point about interpreting public health statistics, but without context, it could potentially mislead readers into downplaying vaccination benefits."
- **gemini 31 pro:** "The post relies on the universal truth of human mortality to make a sarcastic rhetorical point about interpreting statistics; it is technically accurate since everyone eventually dies."
- **gpt 54:** "The post uses a technically eventual truth about mortality to imply that vaccine and disease statistics are meaningless, which is misleading because public health statistics concern relative risk and preventable deaths."
- **gpt 4 osearch:** "The post uses humor to highlight that death is inevitable for everyone, regardless of COVID-19 vaccination status, emphasizing the importance of understanding public health statistics."

**What happened:** claude 46, gpt 54, and fcch see through the rhetorical trick; grok 4, gemini 31 pro, and gpt 4 osearch do not. fcpg rates it as true — the statements are all literally true. On average, the LLMs sit between the two human groups, rating the post misleading but not as harshly as fcch.

---

### Patterns in the examples

Looking across these examples, the disagreements follow a few recurring patterns:

1. **Evaluating the report vs. the claim**: The LLM tends to evaluate whether the *reporting* is accurate (e.g., "did the reporter correctly quote Cruz?"), while humans often evaluate whether the *underlying claim being reported* is true or misleading. This is the dominant pattern in "LLM says true, humans say false" cases.
2. **LLM has better factual access**: In some cases (Mars image, KKK massacre claim), the LLM with web search access actually identifies misinformation that humans missed. The LLM is arguably *more accurate* here, but this counts against it in agreement metrics.
3. **Surface reading vs. implied meaning**: For the "greatest vaccine is good health" example, the LLM reads the implied anti-vaccine messaging while humans take the literal surface meaning. For the "100% die" post, claude 46, gpt 54, and fcch see the rhetorical trick while fcpg doesn't.
4. **Ideological framing**: Several cases (PolitiFact insulin, FAA language, Cruz quote) suggest human raters may evaluate based on whether they agree with the *position* being reported, not whether the reporting is accurate.

## Summary

The poor birdwatch LLM-human agreement stems from multiple compounding factors:

1. **LLMs compress ratings toward the center** of the scale (low variance), which attenuates correlations regardless of accuracy
2. **The two human groups have divergent base rates** (83% vs. 52% misleading), so no single rater can agree with both
3. **LLMs are biased toward "misleading"** labels, especially underrating posts that humans consider true
4. **Ambiguous posts drive the gap** — when humans agree, LLM performance improves substantially

These issues are largely absent in the Thomas dataset, which has better-calibrated human raters and likely less ambiguous content, explaining why LLM-human agreement is much higher there.