- [[250711_183656 20 llms classify jenny vaccine data - online models|see online model results]]
- dataset: Jenny's vaccine data with 90 fact-checked headlines
- 20 different LLMs used to classify each headline (the goal is to eventually reduce this to at most 2 or 3 models per headline/post)
- ran 3 batches/iterations (so we can also check reliability/consistency within models, in addition to across models)
- each batch: $90 \times 20 = 1800$ classifications, so $5400$ in total across the 3 batches
# prompt
- LLM asked to rate each headline on:
    - misleading
    - inaccuracy
    - confidence (exploratory; suggested by Mohsen)
> You're a politically neutral & discerning social media user who excels at determining whether a social media post might be potentially inaccurate/misleading, such that reading it would cause people to have less accurate beliefs. Provide a rating between 0 & 1, where 0 is definitely not inaccurate/misleading and 1 is definitely inaccurate/misleading. Indicate your confidence in your rating between 0 (not at all confident) & 1 (very confident). Very briefly explain your rating.
```python
# Response schema: requests/enforces the following structured output from each LLM
from pydantic import BaseModel

class Response(BaseModel):
    inaccurate_rating: float
    misleading_rating: float
    confidence_rating: float
    explanation: str
```
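For context, a minimal sketch of what each headline × model call could look like, assuming an OpenRouter-style OpenAI-compatible endpoint and the `Response` schema above; `classify_headline`, `PROMPT`, and the env-var name are illustrative, not the actual pipeline (and whether a given model honors structured outputs varies).
```python
# Rough sketch (not the exact pipeline): classify one headline with one model.
import os
from openai import OpenAI

PROMPT = "You're a politically neutral & discerning social media user ..."  # full prompt above

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key=os.environ["OPENROUTER_API_KEY"])

def classify_headline(headline: str, model: str) -> Response:
    # beta.parse sends the pydantic schema as a structured-output request and
    # returns the reply already validated against Response (defined above)
    completion = client.beta.chat.completions.parse(
        model=model,  # e.g. "openai/gpt-4o-mini"
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": headline},
        ],
        response_format=Response,
    )
    return completion.choices[0].message.parsed
```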
# total cost (dollars) per batch
```r
batch cost
<int> <num>
1: 1 0.2882899 # cost for 90 headlines x 20 models
2: 2 0.2886385
3: 3 0.2903000
```
total cost for batch 1, by model
```r
batch model total_cost_90_headlines cost_per_headline
<int> <char> <num> <num>
1: 1 openai/gpt-4.1 0.11289000 0.0012543333 # too expensive
2: 1 x-ai/grok-3-mini 0.03854670 0.0004282967
3: 1 google/gemini-2.5-flash 0.02415600 0.0002684000
4: 1 mistralai/mistral-medium-3 0.02299360 0.0002554844
5: 1 openai/gpt-4.1-mini 0.02021480 0.0002246089
6: 1 qwen/qwen3-30b-a3b 0.01436130 0.0001595700
7: 1 mistralai/mistral-small 0.00798560 0.0000887289
8: 1 meta-llama/llama-4-maverick 0.00740295 0.0000822550
9: 1 openai/gpt-4o-mini 0.00711495 0.0000790550
10: 1 google/gemini-2.5-flash-preview-05-20 0.00674835 0.0000749817
11: 1 openai/gpt-4.1-nano 0.00532490 0.0000591656
12: 1 google/gemini-2.0-flash-001 0.00476920 0.0000529911
13: 1 google/gemini-2.5-flash-lite-preview-06-17 0.00459080 0.0000510089
14: 1 meta-llama/llama-4-scout 0.00423920 0.0000471022
15: 1 cohere/command-r7b-12-2024 0.00198517 0.0000220575
16: 1 google/gemini-flash-1.5-8b 0.00186690 0.0000207433
17: 1 mistralai/mistral-small-24b-instruct-2501 0.00138418 0.0000153798
18: 1 mistralai/ministral-3b 0.00083068 0.0000092298
19: 1 meta-llama/llama-3.1-8b-instruct 0.00078285 0.0000086983
20: 1 mistralai/mistral-nemo 0.00010177 0.0000011308
batch model total_cost_90_headlines cost_per_headline
```
cost for the top 8 good/cheap/reliable/fast models
```r
batch model total_cost_90_headlines n cost_per_headline
<int> <char> <num> <int> <num>
1: 1 mistralai/mistral-small 0.00798560 90 0.00008872889
2: 1 meta-llama/llama-4-maverick 0.00740295 90 0.00008225500
3: 1 openai/gpt-4o-mini 0.00711495 90 0.00007905500
4: 1 google/gemini-2.5-flash-preview-05-20 0.00674835 90 0.00007498167
5: 1 openai/gpt-4.1-nano 0.00532490 90 0.00005916556
6: 1 google/gemini-2.0-flash-001 0.00476920 90 0.00005299111
7: 1 google/gemini-2.5-flash-lite-preview-06-17 0.00459080 90 0.00005100889
8: 1 google/gemini-flash-1.5-8b 0.00186690 90 0.00002074333
```
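For reference, a sketch of how these cost summaries could be reproduced from per-call costs; it assumes a tidy results table with columns `batch`, `model`, `headline_id`, and `cost` (the file name and columns are assumptions, not the pipeline's actual output).
```python
# Sketch (assumed schema): aggregate per-call API costs into the summaries above.
import pandas as pd

results = pd.read_csv("llm_results.csv")  # assumed columns: batch, model, headline_id, cost

# total cost per batch (90 headlines x 20 models)
per_batch = results.groupby("batch", as_index=False)["cost"].sum()

# per-model totals within batch 1, plus cost per headline
per_model = (results[results["batch"] == 1]
             .groupby("model")["cost"]
             .agg(total_cost_90_headlines="sum", n="count")
             .reset_index())
per_model["cost_per_headline"] = per_model["total_cost_90_headlines"] / per_model["n"]
per_model = per_model.sort_values("total_cost_90_headlines", ascending=False)

# the 8 models kept as good/cheap/reliable/fast
shortlist = [
    "mistralai/mistral-small", "meta-llama/llama-4-maverick", "openai/gpt-4o-mini",
    "google/gemini-2.5-flash-preview-05-20", "openai/gpt-4.1-nano",
    "google/gemini-2.0-flash-001", "google/gemini-2.5-flash-lite-preview-06-17",
    "google/gemini-flash-1.5-8b",
]
print(per_batch)
print(per_model[per_model["model"].isin(shortlist)])
```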
# LLM and fact-checker variable correlations
On average, the LLMs rate the headlines as more misleading than inaccurate (mean ratings by batch, then by model, below).
```r
batch llm_misleading llm_inaccurate
<int> <num> <num>
1: 1 0.54060 0.46293
2: 2 0.54238 0.46148
3: 3 0.54262 0.46123
model llm_misleading llm_inaccurate
<char> <num> <num>
1: google/gemini-2.5-flash 0.3729630 0.2955556
2: google/gemini-2.5-flash-preview-05-20 0.3750000 0.2970370
3: cohere/command-r7b-12-2024 0.4095207 0.4424948
4: meta-llama/llama-4-scout 0.4379630 0.3637037
5: openai/gpt-4.1 0.4391111 0.3459259
6: mistralai/ministral-3b 0.4411111 0.4333333
7: mistralai/mistral-medium-3 0.4677778 0.3622222
8: openai/gpt-4.1-nano 0.4725926 0.3718519
9: openai/gpt-4.1-mini 0.4885185 0.4018519
10: qwen/qwen3-30b-a3b 0.5248148 0.5307407
11: mistralai/mistral-small 0.5361111 0.4998148
12: openai/gpt-4o-mini 0.5461111 0.4770370
13: meta-llama/llama-4-maverick 0.5825185 0.4698148
14: x-ai/grok-3-mini 0.5968519 0.3866667
15: mistralai/mistral-small-24b-instruct-2501 0.6422981 0.5708717
16: meta-llama/llama-3.1-8b-instruct 0.6429844 0.6077070
17: google/gemini-2.5-flash-lite-preview-06-17 0.6448148 0.4494444
18: google/gemini-2.0-flash-001 0.6473333 0.4855556
19: mistralai/mistral-nemo 0.7776894 0.6985244
20: google/gemini-flash-1.5-8b 0.7940000 0.7503704
model llm_misleading llm_inaccurate
```
misleading and inaccurate ratings are almost perfectly correlated (showing only batch 1, since all batches are pretty similar)
![[20250710153907.png]]
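A quick way to verify the near-perfect misleading/inaccurate correlation per model, assuming a long-format ratings table (file and column names are assumptions).
```python
# Sketch (assumed schema): per-model Pearson r between the two LLM ratings, batch 1.
import pandas as pd

ratings = pd.read_csv("llm_ratings.csv")  # assumed: batch, model, headline_id,
                                          #          misleading_rating, inaccurate_rating
b1 = ratings[ratings["batch"] == 1]
r_by_model = (b1.groupby("model")
                .apply(lambda d: d["misleading_rating"].corr(d["inaccurate_rating"]))
                .sort_values())
print(r_by_model)
```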
## LLM misleading rating
- results are almost identical across batches, so showing only batch 1
- each dot is one headline's mean rating across the 20 LLMs
- mean Pearson r across all variables: 0.67 (see the sketch after the variable list below)
- restricting to ratings where the LLM reported high confidence (> .75) did not increase the correlations
variables (see [email](https://mail.google.com/mail/u/0/#search/jenny+/KtbxLwhGKnXwkNlpzPHKzdFgNnGKVBTdgB?compose=DmwnWsmBGvQqpHQkGnkBKHLXcLSJRdSkbBBqkFVKczMvgJSpfrvPfPzzpnwgZftGtKNtWjcTtMVg))
- `classification`: the fact-checker's answer to the question "Given current evidence, I believe this story is misinformed / potentially misleading" (1) vs. "not misleading" (0). Two fact-checkers rated each headline, so this is the average of the two.
- `classification_agree`: 1 = both fact-checkers said misleading; 0 = zero or one fact-checker said misleading.
- `classification_disagree`: 1 = at least one fact-checker said misleading; 0 = neither said misleading.
- `mis_bin`: binary recoding of the true/false/misleading answer, where misleading and false = 1 and true = 0.
- `mr* questions`: 7-point Likert-scale questions about how accurate each item was.
- `cof`: ???
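A sketch of how the per-headline means and the correlations in the figures below could be computed, assuming long-format LLM ratings and a fact-checker table keyed by `headline_id` (both schemas are assumptions).
```python
# Sketch (assumed schemas): correlate mean LLM misleading ratings with the
# fact-checker variables listed above (batch 1).
import pandas as pd

ratings = pd.read_csv("llm_ratings.csv")    # assumed: batch, model, headline_id, misleading_rating, ...
factcheck = pd.read_csv("factcheck.csv")    # assumed: headline_id, classification, mis_bin, ...

# each dot in the plot: one headline's mean misleading rating across the 20 LLMs
mean_llm = (ratings[ratings["batch"] == 1]
            .groupby("headline_id", as_index=False)["misleading_rating"].mean())

merged = mean_llm.merge(factcheck, on="headline_id")
fc_vars = [c for c in factcheck.columns if c != "headline_id"]
corrs = merged[fc_vars].corrwith(merged["misleading_rating"])  # Pearson r per variable
print(corrs)
print("mean r:", corrs.mean())
```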
![[vaccine_correlations_llm_misleading_batch1.png]]
## LLM inaccuracy rating
mean Pearson r across all variables: 0.67
![[vaccine_correlations_llm_inaccurate_batch1.png]]
# TODOs
- [ ] decide which variables to focus on in the vaccine dataset
- [x] exclude certain models from the list of 20 models (too expensive/unreliable/slow etc.)
- [ ] test an algorithm for choosing just 2 (or at most a few) models without sacrificing accuracy/correlation with fact-checker ratings (see the selection sketch below)
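One possible approach for the last item, sketched below: greedy forward selection that keeps adding the model which most improves the correlation between the subset's mean rating and the fact-checker score. Purely illustrative; the input matrix, target variable, and stopping rule are all assumptions, not a decided method.
```python
# Sketch: greedy forward selection of a small model subset (illustrative only).
import pandas as pd

def greedy_model_subset(wide: pd.DataFrame, target: pd.Series, k: int = 3) -> list[str]:
    """wide: headlines x models matrix of LLM ratings (one column per model);
    target: fact-checker score per headline, same index. Returns up to k models."""
    chosen: list[str] = []
    remaining = list(wide.columns)
    best_r = -1.0
    for _ in range(k):
        step_model, step_r = None, best_r
        for m in remaining:
            ensemble = wide[chosen + [m]].mean(axis=1)  # mean rating of candidate subset
            r = ensemble.corr(target)                   # Pearson r with fact-checkers
            if r > step_r:
                step_model, step_r = m, r
        if step_model is None:  # no remaining model improves the correlation
            break
        chosen.append(step_model)
        remaining.remove(step_model)
        best_r = step_r
    return chosen
```
Here `wide` could be built with something like `ratings.pivot_table(index="headline_id", columns="model", values="misleading_rating")`, again assuming the long-format ratings table used in the sketches above.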