focused on 3 datasets for now (can get expensive if we want to use more datasets and/or more models! see total cost below)

- datasets: aslett, allen2021, allenvaccine
- for each, train-test split 70%-30%, looking at just the train set for now
- 353 pieces of content in total

```python
unique content per data source: shape: (3, 2)
┌──────────────┬───────────┐
│ data_source  ┆ n_content │
│ ---          ┆ ---       │
│ str          ┆ u32       │
╞══════════════╪═══════════╡
│ aslett       ┆ 147       │
│ allen2021    ┆ 144       │
│ allenvaccine ┆ 62        │
└──────────────┴───────────┘
```

selected 12 models, chosen based on vibes - what i thought were the leading models

cost per headline per model

- openrouter/perplexity/sonar-deep-research: $1.79642 (had to exclude this because it's too expensive - would cost ~$400 to fact-check the content above)
- openrouter/openai/o1-pro: $0.44696
- openrouter/openai/gpt-5-pro: $0.21595
- openrouter/openai/gpt-5.2-pro: $0.05501
- openrouter/openai/o1: $0.04249
- openrouter/x-ai/grok-4: $0.01708
- openrouter/perplexity/sonar-reasoning-pro: $0.01219
- openrouter/anthropic/claude-opus-4.1: $0.01135
- openrouter/google/gemini-3-pro-preview: $0.01047
- openrouter/openai/gpt-5: $0.00793
- openrouter/perplexity/sonar-pro-search: $0.00198
- openrouter/perplexity/sonar-pro: $0.00197
- openrouter/openai/gpt-4o-search-preview: $0.00145

descriptives

- total LLM classifications: 4,524
- **total_cost: $310.96**

# analyses

LLM system prompt: evaluate content as `true`, `false/misleading`, or `can't tell`

```xml
<role>
You're a politically neutral & discerning evaluator, and you excel at determining whether some content might be false/misleading, such that reading it would cause people to have less accurate beliefs.
</role>

<output_format>
Evaluate the content and output one of the following numbers:
- true (1)
- false/misleading (0)
- can't tell (-1)

Briefly explain your reasoning in 1-2 sentences.
</output_format>

<task>
Determine whether the content is true (1), false/misleading (0), or can't tell (-1).
</task>
```

## variable recoding

for each dataset, map its categories/labels to our 3 categories

- allen2021 (variable: modal fact-checker 'is true' rating; 3 professional fact-checkers)
	- 1 -> true
	- 0 -> false/misleading
- aslett (variable: modal_fact_checker_rating)
	- true -> true
	- false/misleading -> false/misleading
	- no mode -> can't tell
	- could not determine -> can't tell
- allenvaccine (variable: fc_modal; 2 professional fact-checkers)
	- 1.0 -> false/misleading
	- 0.5 -> can't tell
	- 0.0 -> true

## results

- figure rows: 3 datasets
- figure columns: whether humans/fact-checkers rate content as `false/misleading` (left), `true` (middle), or `can't tell` (right)
- k = no. of headlines/posts in that category
- x (llm human agree): proportion of LLM classifications that agree with human ratings

summary

- perplexity models (sonar) do better at classifying false/misleading
- gpt-4o-search and gemini-3-pro do better at classifying true
- claude-opus and gpt-5 do better at can't tell

![[1766432844.png]]

data table

```r
                    model  data_source      fc0_recoded llm_human_agree     n
                   <char>       <char>           <char>           <num> <int>
 1: gpt-4o-search-preview allenvaccine             true      1.00000000    33
 2:  gemini-3-pro-preview allenvaccine             true      0.96969697    33
 3: gpt-4o-search-preview       aslett             true      0.96078431   102
 4:                grok-4 allenvaccine             true      0.90909091    33
 5: gpt-4o-search-preview    allen2021             true      0.89189189    74
 6:  gemini-3-pro-preview       aslett             true      0.88461538   104
 7:                o1-pro allenvaccine             true      0.84848485    33
 8:  gemini-3-pro-preview    allen2021             true      0.83783784    74
 9:                grok-4       aslett             true      0.82692308   104
10:             gpt-5-pro allenvaccine             true      0.78787879    33
11:                    o1 allenvaccine             true      0.78787879    33
12:             gpt-5-pro       aslett             true      0.74757282   103
13:                o1-pro       aslett             true      0.73529412   102
14:                 gpt-5       aslett             true      0.70476190   105
15:                    o1       aslett             true      0.66990291   103
16:                grok-4    allen2021             true      0.63513514    74
17:                 gpt-5 allenvaccine             true      0.57575758    33
18:             gpt-5-pro    allen2021             true      0.55405405    74
19:       claude-opus-4.1 allenvaccine             true      0.54545455    33
20:           gpt-5.2-pro allenvaccine             true      0.54545455    33
21:                 gpt-5    allen2021             true      0.52702703    74
22:      sonar-pro-search allenvaccine             true      0.48484848    33
23:                o1-pro    allen2021             true      0.47297297    74
24:                    o1    allen2021             true      0.44594595    74
25:           gpt-5.2-pro       aslett             true      0.40000000   105
26:             sonar-pro allenvaccine             true      0.39393939    33
27:             sonar-pro    allen2021             true      0.37837838    74
28:   sonar-reasoning-pro allenvaccine             true      0.36363636    33
29:   sonar-reasoning-pro       aslett             true      0.33962264   106
30:      sonar-pro-search    allen2021             true      0.33783784    74
31:       claude-opus-4.1       aslett             true      0.30097087   103
32:             sonar-pro       aslett             true      0.29523810   105
33:      sonar-pro-search       aslett             true      0.25714286   105
34:   sonar-reasoning-pro    allen2021             true      0.25675676    74
35:       claude-opus-4.1    allen2021             true      0.24324324    74
36:           gpt-5.2-pro    allen2021             true      0.20270270    74
37:   sonar-reasoning-pro       aslett false/misleading      0.96875000    32
38:      sonar-pro-search    allen2021 false/misleading      0.94285714    70
39:      sonar-pro-search       aslett false/misleading      0.93750000    32
40:   sonar-reasoning-pro    allen2021 false/misleading      0.92857143    70
41:             sonar-pro    allen2021 false/misleading      0.91428571    70
42:             sonar-pro allenvaccine false/misleading      0.87500000    16
43:             sonar-pro       aslett false/misleading      0.84375000    32
44:                o1-pro    allen2021 false/misleading      0.81428571    70
45:   sonar-reasoning-pro allenvaccine false/misleading      0.81250000    16
46:                grok-4    allen2021 false/misleading      0.78571429    70
47:                grok-4       aslett false/misleading      0.78125000    32
48:  gemini-3-pro-preview allenvaccine false/misleading      0.75000000    16
49:                grok-4 allenvaccine false/misleading      0.75000000    16
50:                    o1    allen2021 false/misleading      0.72857143    70
51:  gemini-3-pro-preview    allen2021 false/misleading      0.71428571    70
52:           gpt-5.2-pro       aslett false/misleading      0.69696970    33
53:                o1-pro       aslett false/misleading      0.65625000    32
54:             gpt-5-pro    allen2021 false/misleading      0.64285714    70
55:      sonar-pro-search allenvaccine false/misleading      0.62500000    16
56:           gpt-5.2-pro    allen2021 false/misleading      0.61428571    70
57:                 gpt-5    allen2021 false/misleading      0.60000000    70
58:                    o1       aslett false/misleading      0.59375000    32
59:             gpt-5-pro       aslett false/misleading      0.59375000    32
60: gpt-4o-search-preview    allen2021 false/misleading      0.54285714    70
61:       claude-opus-4.1    allen2021 false/misleading      0.54285714    70
62:                    o1 allenvaccine false/misleading      0.50000000    16
63:  gemini-3-pro-preview       aslett false/misleading      0.50000000    32
64:                 gpt-5       aslett false/misleading      0.46875000    32
65:           gpt-5.2-pro allenvaccine false/misleading      0.43750000    16
66:                o1-pro allenvaccine false/misleading      0.43750000    16
67:                 gpt-5 allenvaccine false/misleading      0.43750000    16
68:             gpt-5-pro allenvaccine false/misleading      0.43750000    16
69:       claude-opus-4.1       aslett false/misleading      0.40625000    32
70: gpt-4o-search-preview allenvaccine false/misleading      0.37500000    16
71: gpt-4o-search-preview       aslett false/misleading      0.28125000    32
72:       claude-opus-4.1 allenvaccine false/misleading      0.25000000    16
73:       claude-opus-4.1 allenvaccine       can't tell      0.69230769    13
74:           gpt-5.2-pro       aslett       can't tell      0.69230769    13
75:       claude-opus-4.1       aslett       can't tell      0.64285714    14
76:                 gpt-5       aslett       can't tell      0.61538462    13
77:                    o1       aslett       can't tell      0.61538462    13
78:                o1-pro       aslett       can't tell      0.61538462    13
79:             gpt-5-pro       aslett       can't tell      0.61538462    13
80:                 gpt-5 allenvaccine       can't tell      0.46153846    13
81:                    o1 allenvaccine       can't tell      0.38461538    13
82:             gpt-5-pro allenvaccine       can't tell      0.38461538    13
83:                o1-pro allenvaccine       can't tell      0.30769231    13
84:           gpt-5.2-pro allenvaccine       can't tell      0.30769231    13
85:  gemini-3-pro-preview       aslett       can't tell      0.28571429    14
86: gpt-4o-search-preview       aslett       can't tell      0.25000000    16
87:   sonar-reasoning-pro allenvaccine       can't tell      0.15384615    13
88:                grok-4       aslett       can't tell      0.15384615    13
89:             sonar-pro       aslett       can't tell      0.14285714    14
90:      sonar-pro-search       aslett       can't tell      0.14285714    14
91:  gemini-3-pro-preview allenvaccine       can't tell      0.07692308    13
92:                grok-4 allenvaccine       can't tell      0.07692308    13
93: gpt-4o-search-preview allenvaccine       can't tell      0.07692308    13
94:      sonar-pro-search allenvaccine       can't tell      0.07692308    13
95:             sonar-pro allenvaccine       can't tell      0.00000000    13
96:   sonar-reasoning-pro       aslett       can't tell      0.00000000    13
                    model  data_source      fc0_recoded llm_human_agree     n
```

## base rates

![[1766447214.png]]

## different measures of agreement that can handle multi-class variables

gemini-3-pro's classification agrees most with humans

![[20251222194358.png]]
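the `llm_human_agree` column in the data table above can be reproduced with a small group-by aggregation. a minimal sketch in plain python, assuming the underlying long-format data has one row per (model, dataset, item) with a human label and an LLM label (the dict keys `model`, `data_source`, `human`, `llm` are hypothetical names for illustration, not the actual column names):

```python
from collections import defaultdict

def agreement_by_group(rows):
    """For each (model, data_source, human_label) cell, compute the
    proportion of LLM classifications that match the human label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in rows:
        key = (r["model"], r["data_source"], r["human"])
        totals[key] += 1
        hits[key] += int(r["llm"] == r["human"])
    return {k: hits[k] / totals[k] for k in totals}

# toy example: one model, one dataset, 3 classifications
rows = [
    {"model": "m", "data_source": "aslett", "human": "true", "llm": "true"},
    {"model": "m", "data_source": "aslett", "human": "true", "llm": "false/misleading"},
    {"model": "m", "data_source": "aslett", "human": "can't tell", "llm": "can't tell"},
]
print(agreement_by_group(rows)[("m", "aslett", "true")])  # 0.5
```

note that conditioning on the human label (rather than the LLM label) is what makes the per-category panels sensitive to base rates: a model that says "false/misleading" for everything scores 1.0 in the false/misleading column and 0.0 elsewhere.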
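the notes don't specify which multi-class agreement measures the last figure uses; one common chance-corrected option for two raters (here, LLM vs. modal human rating) is Cohen's kappa, which works for any number of classes. a self-contained sketch:

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two raters over the same items.
    a, b: equal-length sequences of labels (any number of classes)."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    labels = set(a) | set(b)
    # observed agreement: fraction of items where the raters match
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # expected agreement under independence, from each rater's marginals
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

human = ["true", "true", "false/misleading", "can't tell", "true", "false/misleading"]
llm   = ["true", "false/misleading", "false/misleading", "can't tell", "true", "true"]
print(round(cohens_kappa(human, llm), 3))  # 0.455
```

unlike raw proportion agreement, kappa discounts agreement expected by chance, so it penalizes models that get a high raw score just by matching the dataset's base rates; Krippendorff's alpha would be another option if items with missing ratings need handling.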