- [[250711_183656 20 llms classify jenny vaccine data - online models|see online model results]]
- dataset: jenny's vaccine data with 90 fact-checked headlines
- 20 different LLMs used to classify each headline (goal is to eventually reduce to 2 or 3 models at most per headline/post)
- ran 3 batches/iterations (so we can also check for reliability/consistency within models, in addition to across models)
- each batch: $90 \times 20 = 1800$ classifications (90 headlines × 20 models)

# prompt

- llm asked to evaluate
    - misleading
    - inaccuracy
    - confidence (exploratory; suggested by mohsen)

> You're a politically neutral & discerning social media user who excels at determining whether a social media post might be potentially inaccurate/misleading, such that reading it would cause people to have less accurate beliefs. Provide a rating between 0 & 1, where 0 is definitely not inaccurate/misleading and 1 is definitely inaccurate/misleading. Indicate your confidence in your rating between 0 (not at all confident) & 1 (very confident). Very briefly explain your rating.

```python
# response schema, ensures/requests the following structured output from the LLM
from pydantic import BaseModel

class Response(BaseModel):
    inaccurate_rating: float
    misleading_rating: float
    confidence_rating: float
    explanation: str
```

# total cost (dollars)

per batch:

```r
   batch      cost
   <int>     <num>
1:     1 0.2882899  # cost for 90 headlines x 20 models
2:     2 0.2886385
3:     3 0.2903000
```

total cost for batch 1, by model:

```r
    batch                                      model total_cost_90_headlines cost_per_headline
    <int>                                     <char>                   <num>             <num>
 1:     1                             openai/gpt-4.1              0.11289000      0.0012543333  # too expensive
 2:     1                           x-ai/grok-3-mini              0.03854670      0.0004282967
 3:     1                    google/gemini-2.5-flash              0.02415600      0.0002684000
 4:     1                 mistralai/mistral-medium-3              0.02299360      0.0002554844
 5:     1                        openai/gpt-4.1-mini              0.02021480      0.0002246089
 6:     1                         qwen/qwen3-30b-a3b              0.01436130      0.0001595700
 7:     1                    mistralai/mistral-small              0.00798560      0.0000887289
 8:     1                meta-llama/llama-4-maverick              0.00740295      0.0000822550
 9:     1                         openai/gpt-4o-mini              0.00711495      0.0000790550
10:     1      google/gemini-2.5-flash-preview-05-20              0.00674835      0.0000749817
11:     1                        openai/gpt-4.1-nano              0.00532490      0.0000591656
12:     1                google/gemini-2.0-flash-001              0.00476920      0.0000529911
13:     1 google/gemini-2.5-flash-lite-preview-06-17              0.00459080      0.0000510089
14:     1                   meta-llama/llama-4-scout              0.00423920      0.0000471022
15:     1                 cohere/command-r7b-12-2024              0.00198517      0.0000220575
16:     1                 google/gemini-flash-1.5-8b              0.00186690      0.0000207433
17:     1  mistralai/mistral-small-24b-instruct-2501              0.00138418      0.0000153798
18:     1                     mistralai/ministral-3b              0.00083068      0.0000092298
19:     1           meta-llama/llama-3.1-8b-instruct              0.00078285      0.0000086983
20:     1                     mistralai/mistral-nemo              0.00010177      0.0000011308
    batch                                      model total_cost_90_headlines cost_per_headline
```

cost for the top 8 good/cheap/reliable/fast models:

```r
   batch                                      model total_cost_90_headlines     n cost_per_headline
   <int>                                     <char>                   <num> <int>             <num>
1:     1                    mistralai/mistral-small              0.00798560    90     0.00008872889
2:     1                meta-llama/llama-4-maverick              0.00740295    90     0.00008225500
3:     1                         openai/gpt-4o-mini              0.00711495    90     0.00007905500
4:     1      google/gemini-2.5-flash-preview-05-20              0.00674835    90     0.00007498167
5:     1                        openai/gpt-4.1-nano              0.00532490    90     0.00005916556
6:     1                google/gemini-2.0-flash-001              0.00476920    90     0.00005299111
7:     1 google/gemini-2.5-flash-lite-preview-06-17              0.00459080    90     0.00005100889
8:     1                 google/gemini-flash-1.5-8b              0.00186690    90     0.00002074333
```

# LLM and fact-checker variables correlations

Headlines are more misleading than inaccurate.
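The per-batch and per-model means below (and the misleading-vs-inaccurate correlation further down) are just averages of the raw ratings. A minimal pandas sketch of that aggregation; the file name and column names (`batch`, `model`, `misleading_rating`, `inaccurate_rating`) are assumptions, not the actual pipeline code:

```python
# sketch only: file name and column names are assumptions
import pandas as pd

ratings = pd.read_csv("vaccine_llm_ratings.csv")  # one row per (batch, model, headline)

# mean misleading/inaccurate rating per batch (averaged over 90 headlines x 20 models)
per_batch = ratings.groupby("batch")[["misleading_rating", "inaccurate_rating"]].mean()

# mean rating per model within batch 1, sorted from least to most "misleading"
batch1 = ratings[ratings["batch"] == 1]
per_model = (
    batch1.groupby("model")[["misleading_rating", "inaccurate_rating"]]
    .mean()
    .sort_values("misleading_rating")
)

# how strongly the two ratings track each other within batch 1
r_mis_inacc = batch1["misleading_rating"].corr(batch1["inaccurate_rating"])
print(per_batch, per_model, f"pearson r = {r_mis_inacc:.2f}", sep="\n\n")
```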
```r
   batch llm_misleading llm_inaccurate
   <int>          <num>          <num>
1:     1        0.54060        0.46293
2:     2        0.54238        0.46148
3:     3        0.54262        0.46123

                                          model llm_misleading llm_inaccurate
                                         <char>          <num>          <num>
 1:                     google/gemini-2.5-flash      0.3729630      0.2955556
 2:       google/gemini-2.5-flash-preview-05-20      0.3750000      0.2970370
 3:                  cohere/command-r7b-12-2024      0.4095207      0.4424948
 4:                    meta-llama/llama-4-scout      0.4379630      0.3637037
 5:                              openai/gpt-4.1      0.4391111      0.3459259
 6:                      mistralai/ministral-3b      0.4411111      0.4333333
 7:                  mistralai/mistral-medium-3      0.4677778      0.3622222
 8:                         openai/gpt-4.1-nano      0.4725926      0.3718519
 9:                         openai/gpt-4.1-mini      0.4885185      0.4018519
10:                          qwen/qwen3-30b-a3b      0.5248148      0.5307407
11:                     mistralai/mistral-small      0.5361111      0.4998148
12:                          openai/gpt-4o-mini      0.5461111      0.4770370
13:                 meta-llama/llama-4-maverick      0.5825185      0.4698148
14:                            x-ai/grok-3-mini      0.5968519      0.3866667
15:   mistralai/mistral-small-24b-instruct-2501      0.6422981      0.5708717
16:            meta-llama/llama-3.1-8b-instruct      0.6429844      0.6077070
17:  google/gemini-2.5-flash-lite-preview-06-17      0.6448148      0.4494444
18:                 google/gemini-2.0-flash-001      0.6473333      0.4855556
19:                      mistralai/mistral-nemo      0.7776894      0.6985244
20:                  google/gemini-flash-1.5-8b      0.7940000      0.7503704
                                          model llm_misleading llm_inaccurate
```

misleading and inaccurate ratings are almost perfectly correlated (showing only batch 1, since all batches are pretty similar)

![[20250710153907.png]]

## LLM misleading rating

- results are almost identical across batches, so showing only batch 1
- each dot is the mean rating across 20 LLMs
- mean Pearson r across all variables: 0.67
- including only ratings where the LLMs said they were confident (> .75) didn't increase the correlations

variables (see [email](https://mail.google.com/mail/u/0/#search/jenny+/KtbxLwhGKnXwkNlpzPHKzdFgNnGKVBTdgB?compose=DmwnWsmBGvQqpHQkGnkBKHLXcLSJRdSkbBBqkFVKczMvgJSpfrvPfPzzpnwgZftGtKNtWjcTtMVg)):

- `classification`: fact-checker's answer to the question "Given current evidence, I believe this story is misinformed / potential misleading" (1) or "not misleading" (0). We asked two fact-checkers, so it's the average of the two.
- `classification_agree`: 1 = both fact-checkers said misleading; 0 = zero or one fact-checker said misleading.
- `classification_disagree`: 1 if at least one fact-checker said misleading, 0 if neither did.
- `mis_bin`: the answer to the true/false/misleading question, where misleading and false are coded 1 and true is 0.
- `mr*` questions: 7-point Likert-scale questions about how accurate each item was.
- `cof`: ???

![[vaccine_correlations_llm_misleading_batch1.png]]

## LLM inaccuracy rating

mean Pearson r across all variables: 0.67

![[vaccine_correlations_llm_inaccurate_batch1.png]]

# TODOs

- [ ] which variables to focus on in vaccine dataset?
- [x] exclude certain models from the list of 20 models (too expensive/unreliable/slow etc.)
- [ ] test algorithm for choosing just 2 (or at most a few) models while not sacrificing accuracy/correlation with fact-checker ratings (see the sketch below)
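One straightforward baseline for that last TODO (not necessarily the algorithm we'll settle on) is to exhaustively score every pair and triple of candidate models by how well the mean of their ratings correlates with a fact-checker variable such as `classification`. A hedged sketch; the file name, wide data layout (one rating column per model, one row per headline), and the choice of `classification` as the target are all assumptions:

```python
# sketch only: data layout, column names, and the use of `classification` as the target
# are assumptions; any fact-checker variable from the correlation plots could be swapped in
from itertools import combinations

import pandas as pd

df = pd.read_csv("vaccine_llm_wide.csv")  # hypothetical: one row per headline
models = [c for c in df.columns if c != "classification"]  # per-model rating columns


def subset_corr(cols) -> float:
    """Pearson r between the mean rating of a model subset and the fact-checker classification."""
    return df[list(cols)].mean(axis=1).corr(df["classification"])


# exhaustive search over all pairs and triples is cheap for ~8-20 candidate models
candidates = [combo for k in (2, 3) for combo in combinations(models, k)]
best = max(candidates, key=subset_corr)
print(best, round(subset_corr(best), 3))
```

Restricting the candidate list to the 8 good/cheap/reliable/fast models above keeps the search to under a hundred subsets and bakes the cost constraint in from the start.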