# approach
- random subset of 1000 AI messages
- 27 strategies
- compare 5 models (listed in descending order of Elo rating)
    - gpt-4o
    - deepseek-chat
    - o3-mini (placement uncertain; it hasn't been ranked yet)
    - gpt-4o-mini
    - llama-3.1-405b
- **revised prompt** to raise the threshold for saying a strategy is used/present
    - in the [[250203_152746 strategy use in trump harris study|first version of the prompt]], mean strategy presence for gpt-4o-mini was 49% (with the revised prompt it's 39%; see table below)
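The setup above can be sketched roughly as follows; this is a hypothetical illustration, not the actual pipeline (all object names like `messages` and `strategies` are assumptions):

```r
# hypothetical sketch of the rating setup; names and data are illustrative only
library(data.table)

set.seed(1)
messages <- data.table(msg_id = 1:5000, text = sprintf("message %d", 1:5000))
sampled  <- messages[sample(.N, 1000)]        # random subset of 1000 AI messages

strategies <- sprintf("strategy_%02d", 1:27)  # the 27 strategies
models <- c("gpt-4o", "deepseek-chat", "o3-mini",
            "gpt-4o-mini", "llama-3.1-405b-instruct")

# one row per message x strategy x model; the rater LLM fills in `present` (0/1)
grid <- CJ(msg_id = sampled$msg_id, strategy = strategies, model = models)
nrow(grid)  # 1000 * 27 * 5 = 135000 ratings to collect
```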
# summary
- deepseek-chat and gpt-4o show much lower average strategy use than the other models
- based on Elo ratings, llama-3.1-405b should be the worst model, and deepseek-chat and gpt-4o should be the best
```r
# latest prompt
model presence
<char> <num>
1: deepseek-chat 0.1423353 # high elo model
2: gpt-4o 0.1875025 # high elo model
3: gpt-4o-mini 0.3904403 # the original model i used
4: llama-3.1-405b-instruct 0.3891835 # lowest elo model
5: o3-mini 0.3834099 # supposedly a high elo model, but not yet ranked
# old prompt
model presence
<char> <num>
1: gpt-4o 0.2985154 # ~11 pp higher than with the revised prompt
2: gpt-4o-mini 0.4926652 # ~10 pp higher than with the revised prompt
```
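For reference, the `presence` column above is just the mean of the binary ratings per model; a minimal data.table sketch (the `ratings` table and its columns are assumptions):

```r
library(data.table)

# toy ratings: one row per message x strategy x model, binary `present`
ratings <- data.table(
  model   = rep(c("deepseek-chat", "gpt-4o-mini"), each = 6),
  present = c(1, 0, 0, 0, 0, 0,   1, 1, 0, 1, 0, 0)
)
ratings[, .(presence = mean(present)), by = model]
```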
![[1738701573.png]]
# validation (using latest prompt)
how well do the ratings correlate across models?
- the smaller gpt models (o3-mini and gpt-4o-mini) correlate most strongly (rho = 0.74); perhaps not surprising, since they may have been distilled with similar techniques
    - both models rate strategies as present frequently (~38-39%)
- the best models by Elo (deepseek-chat and gpt-4o) also correlate quite well (rho = 0.68)
    - their strategy presence rates are much lower (14% and 19%)
![[1738702733.png]]
```r
# Correlation Matrix (spearman-method)
Parameter1 | Parameter2 | rho | 95% CI | S | p
--------------------------------------------------------------------------------------
deepseekchat | deepseekchat | 1.00 | [1.00, 1.00] | Inf | < .001***
deepseekchat | gpt4o | 0.68 | [0.67, 0.69] | 8.84e+11 | < .001***
deepseekchat | gpt4omini | 0.59 | [0.58, 0.60] | 1.13e+12 | < .001***
deepseekchat | llama31405binstruct | 0.54 | [0.54, 0.55] | 1.25e+12 | < .001***
deepseekchat | o3mini | 0.59 | [0.59, 0.60] | 1.12e+12 | < .001***
gpt4o | deepseekchat | 0.68 | [0.67, 0.69] | 8.84e+11 | < .001***
gpt4o | gpt4o | 1.00 | [1.00, 1.00] | Inf | < .001***
gpt4o | gpt4omini | 0.63 | [0.62, 0.64] | 1.01e+12 | < .001***
gpt4o | llama31405binstruct | 0.61 | [0.60, 0.62] | 1.07e+12 | < .001***
gpt4o | o3mini | 0.65 | [0.64, 0.66] | 9.58e+11 | < .001***
gpt4omini | deepseekchat | 0.59 | [0.58, 0.60] | 1.13e+12 | < .001***
gpt4omini | gpt4o | 0.63 | [0.62, 0.64] | 1.01e+12 | < .001***
gpt4omini | gpt4omini | 1.00 | [1.00, 1.00] | Inf | < .001***
gpt4omini | llama31405binstruct | 0.73 | [0.72, 0.74] | 7.43e+11 | < .001***
gpt4omini | o3mini | 0.74 | [0.73, 0.74] | 7.28e+11 | < .001***
llama31405binstruct | deepseekchat | 0.54 | [0.54, 0.55] | 1.25e+12 | < .001***
llama31405binstruct | gpt4o | 0.61 | [0.60, 0.62] | 1.07e+12 | < .001***
llama31405binstruct | gpt4omini | 0.73 | [0.72, 0.74] | 7.43e+11 | < .001***
llama31405binstruct | llama31405binstruct | 1.00 | [1.00, 1.00] | Inf | < .001***
llama31405binstruct | o3mini | 0.67 | [0.66, 0.68] | 9.06e+11 | < .001***
o3mini | deepseekchat | 0.59 | [0.59, 0.60] | 1.12e+12 | < .001***
o3mini | gpt4o | 0.65 | [0.64, 0.66] | 9.58e+11 | < .001***
o3mini | gpt4omini | 0.74 | [0.73, 0.74] | 7.28e+11 | < .001***
o3mini | llama31405binstruct | 0.67 | [0.66, 0.68] | 9.06e+11 | < .001***
o3mini | o3mini | 1.00 | [1.00, 1.00] | Inf | < .001***
```
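The formatted table above looks like output from the easystats `correlation` package; if so, it could be reproduced along these lines (the wide table of per-model ratings is an assumption):

```r
# sketch: reshape ratings to one column per model, one row per
# message x strategy pair, then correlate the columns
set.seed(1)
wide <- data.frame(
  deepseekchat = sample(0:1, 200, replace = TRUE),
  gpt4o        = sample(0:1, 200, replace = TRUE),
  gpt4omini    = sample(0:1, 200, replace = TRUE)
)
cor(wide, method = "spearman")                       # plain base-R matrix
# correlation::correlation(wide, method = "spearman")  # formatted table with CIs
```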
![[20250204165607.png]]
components
![[20250204170604.png]]
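If the components figure is a principal components decomposition of the per-model ratings, a rough base-R sketch (column names and data are assumed):

```r
# hedged sketch: PCA over one rating column per model
set.seed(1)
wide <- data.frame(
  deepseekchat = runif(200), gpt4o = runif(200),
  gpt4omini = runif(200), o3mini = runif(200),
  llama31405binstruct = runif(200)
)
pca <- prcomp(wide, scale. = TRUE)
summary(pca)   # variance explained per component
pca$rotation   # how each model loads on each component
```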