# approach
- random subset of 1000 AI messages
- 27 strategies
- compare 5 models (listed in descending order of Elo rating)
    - gpt-4o
    - deepseek-chat
    - o3-mini (placement uncertain; it hasn't been ranked yet)
    - gpt-4o-mini
    - llama-3.1-405b
- **revised prompt** to raise the threshold for saying a strategy is used/present
    - in the [[250203_152746 strategy use in trump harris study|first version of the prompt]], mean strategy presence for gpt-4o-mini was 49% (with the revised prompt it's 39%; see table below)
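The setup above can be sketched roughly as follows; this is a hypothetical illustration, not the actual pipeline (all object names like `messages` and `strategies` are assumptions):

```r
# hypothetical sketch of the rating setup; names and data are illustrative only
library(data.table)

set.seed(1)
messages <- data.table(msg_id = 1:5000, text = sprintf("message %d", 1:5000))
sampled  <- messages[sample(.N, 1000)]        # random subset of 1000 AI messages

strategies <- sprintf("strategy_%02d", 1:27)  # the 27 strategies
models <- c("gpt-4o", "deepseek-chat", "o3-mini",
            "gpt-4o-mini", "llama-3.1-405b-instruct")

# one row per message x strategy x model; the rater LLM fills in `present` (0/1)
grid <- CJ(msg_id = sampled$msg_id, strategy = strategies, model = models)
nrow(grid)  # 1000 * 27 * 5 = 135000 ratings to collect
```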
# summary
- deepseek-chat and gpt-4o show much lower average strategy use than the other models
- based on Elo ratings, llama-3.1-405b should be the worst model, and deepseek-chat and gpt-4o should be the best
```r
# latest prompt
model presence
<char> <num>
1: deepseek-chat 0.1423353 # high elo model
2: gpt-4o 0.1875025 # high elo model
3: gpt-4o-mini 0.3904403 # the original model i used
4: llama-3.1-405b-instruct 0.3891835 # lowest elo model
5: o3-mini 0.3834099 # supposedly a high elo model, but not yet ranked
# old prompt
model presence
<char> <num>
1: gpt-4o 0.2985154 # ~11 pp higher than with the revised prompt
2: gpt-4o-mini 0.4926652 # ~10 pp higher than with the revised prompt
```
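For reference, the `presence` column above is just the mean of the binary ratings per model; a minimal data.table sketch (the `ratings` table and its columns are assumptions):

```r
library(data.table)

# toy ratings: one row per message x strategy x model, binary `present`
ratings <- data.table(
  model   = rep(c("deepseek-chat", "gpt-4o-mini"), each = 6),
  present = c(1, 0, 0, 0, 0, 0,   1, 1, 0, 1, 0, 0)
)
ratings[, .(presence = mean(present)), by = model]
```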
![[1738701573.png]]
# validation (using latest prompt)
how well do the ratings correlate across models?
- the smaller gpt models (o3-mini and gpt-4o-mini) correlate most strongly (rho = 0.74); perhaps not surprising, since they may have been distilled with similar techniques
    - both models rate strategies as present frequently (~38-39%)
- the best models by Elo (deepseek-chat and gpt-4o) also correlate quite well (rho = 0.68)
    - their strategy presence rates are much lower (14% and 19%)
![[1738702733.png]]
```r
# Correlation Matrix (spearman-method)
Parameter1 | Parameter2 | rho | 95% CI | S | p
--------------------------------------------------------------------------------------
deepseekchat | deepseekchat | 1.00 | [1.00, 1.00] | Inf | < .001***
deepseekchat | gpt4o | 0.68 | [0.67, 0.69] | 8.84e+11 | < .001***
deepseekchat | gpt4omini | 0.59 | [0.58, 0.60] | 1.13e+12 | < .001***
deepseekchat | llama31405binstruct | 0.54 | [0.54, 0.55] | 1.25e+12 | < .001***
deepseekchat | o3mini | 0.59 | [0.59, 0.60] | 1.12e+12 | < .001***
gpt4o | deepseekchat | 0.68 | [0.67, 0.69] | 8.84e+11 | < .001***
gpt4o | gpt4o | 1.00 | [1.00, 1.00] | Inf | < .001***
gpt4o | gpt4omini | 0.63 | [0.62, 0.64] | 1.01e+12 | < .001***
gpt4o | llama31405binstruct | 0.61 | [0.60, 0.62] | 1.07e+12 | < .001***
gpt4o | o3mini | 0.65 | [0.64, 0.66] | 9.58e+11 | < .001***
gpt4omini | deepseekchat | 0.59 | [0.58, 0.60] | 1.13e+12 | < .001***
gpt4omini | gpt4o | 0.63 | [0.62, 0.64] | 1.01e+12 | < .001***
gpt4omini | gpt4omini | 1.00 | [1.00, 1.00] | Inf | < .001***
gpt4omini | llama31405binstruct | 0.73 | [0.72, 0.74] | 7.43e+11 | < .001***
gpt4omini | o3mini | 0.74 | [0.73, 0.74] | 7.28e+11 | < .001***
llama31405binstruct | deepseekchat | 0.54 | [0.54, 0.55] | 1.25e+12 | < .001***
llama31405binstruct | gpt4o | 0.61 | [0.60, 0.62] | 1.07e+12 | < .001***
llama31405binstruct | gpt4omini | 0.73 | [0.72, 0.74] | 7.43e+11 | < .001***
llama31405binstruct | llama31405binstruct | 1.00 | [1.00, 1.00] | Inf | < .001***
llama31405binstruct | o3mini | 0.67 | [0.66, 0.68] | 9.06e+11 | < .001***
o3mini | deepseekchat | 0.59 | [0.59, 0.60] | 1.12e+12 | < .001***
o3mini | gpt4o | 0.65 | [0.64, 0.66] | 9.58e+11 | < .001***
o3mini | gpt4omini | 0.74 | [0.73, 0.74] | 7.28e+11 | < .001***
o3mini | llama31405binstruct | 0.67 | [0.66, 0.68] | 9.06e+11 | < .001***
o3mini | o3mini | 1.00 | [1.00, 1.00] | Inf | < .001***
```
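The formatted table above looks like output from the easystats `correlation` package; if so, it could be reproduced along these lines (the wide table of per-model ratings is an assumption):

```r
# sketch: reshape ratings to one column per model, one row per
# message x strategy pair, then correlate the columns
set.seed(1)
wide <- data.frame(
  deepseekchat = sample(0:1, 200, replace = TRUE),
  gpt4o        = sample(0:1, 200, replace = TRUE),
  gpt4omini    = sample(0:1, 200, replace = TRUE)
)
cor(wide, method = "spearman")                       # plain base-R matrix
# correlation::correlation(wide, method = "spearman")  # formatted table with CIs
```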
![[20250204165607.png]]
components
![[20250204170604.png]]
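If the components figure is a principal components decomposition of the per-model ratings, a rough base-R sketch (column names and data are assumed):

```r
# hedged sketch: PCA over one rating column per model
set.seed(1)
wide <- data.frame(
  deepseekchat = runif(200), gpt4o = runif(200),
  gpt4omini = runif(200), o3mini = runif(200),
  llama31405binstruct = runif(200)
)
pca <- prcomp(wide, scale. = TRUE)
summary(pca)   # variance explained per component
pca$rotation   # how each model loads on each component
```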