Mean/median accuracy
- `n`: no. of AI messages with factual claims (fewer messages in the `nofacts` condition)

```r
    strategy conditionC      modelF median_accuracy mean_accuracy     n
      <char>      <num>      <fctr>           <num>         <num> <int>
 1: baseline       -0.5     GPT-4.1              60      63.12207   426
 2: baseline       -0.5 DeepSeek-V3              60      59.69267   423
 3: baseline       -0.5     Llama-4              70      69.03981   427
 4: baseline        0.5     GPT-4.1              60      64.02314   389
 5: baseline        0.5 DeepSeek-V3              60      60.66253   483
 6: baseline        0.5     Llama-4              70      70.75510   490
 7:  nofacts       -0.5     GPT-4.1              75      71.61765    34
 8:  nofacts       -0.5 DeepSeek-V3              85      76.06061    33
 9:  nofacts       -0.5     Llama-4              70      70.05495    91
10:  nofacts        0.5     GPT-4.1              70      64.42857    35
11:  nofacts        0.5 DeepSeek-V3              80      70.52632    19
12:  nofacts        0.5     Llama-4              70      67.24359    78
```

DV: AI message accuracy
- each person has multiple AI messages, so SEs are clustered on user (`responseid`)
- `conditionC`: pro-Carney (-0.5), pro-Poilievre (0.5)

```r
> summ(feols(accuracy ~ conditionC * modelF * strategy, d1, cluster = ~responseid))
                                                term                               result    sig
                                              <char>                               <char> <char>
 1:                                      (Intercept)  b = 63.57 [62.29, 64.85], p < .001     ***  # baseline strategy, GPT-4.1
 2:                                       conditionC    b = 0.90 [-1.66, 3.46], p = .490          # no accuracy difference between conditions
 3:                                modelFDeepSeek-V3  b = -3.40 [-5.28, -1.51], p < .001     ***  # DeepSeek-V3 is less accurate than GPT-4.1
 4:                                    modelFLlama-4     b = 6.32 [4.52, 8.13], p < .001     ***  # Llama-4 is more accurate than GPT-4.1
 5:                                  strategynofacts    b = 4.45 [-0.36, 9.26], p = .070      .   # note: nofacts has far fewer messages than baseline
 6:                   conditionC × modelFDeepSeek-V3    b = 0.07 [-3.69, 3.83], p = .971
 7:                       conditionC × modelFLlama-4    b = 0.81 [-2.79, 4.42], p = .658
 8:                     conditionC × strategynofacts  b = -8.09 [-17.72, 1.54], p = .099      .
 9:              modelFDeepSeek-V3 × strategynofacts   b = 8.67 [-0.84, 18.17], p = .074      .
10:                  modelFLlama-4 × strategynofacts  b = -5.70 [-11.55, 0.15], p = .056      .
11: conditionC × modelFDeepSeek-V3 × strategynofacts   b = 1.59 [-17.42, 20.59], p = .870
12:     conditionC × modelFLlama-4 × strategynofacts    b = 3.56 [-8.13, 15.26], p = .550
```

![[1747066479.png]]
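
A minimal sketch of how the grouped summary table above could be reproduced, assuming `d1` is a data.table with the columns shown (`accuracy`, `strategy`, `conditionC`, `modelF`) and that `accuracy` is NA for AI messages without factual claims:

```r
# Sketch only, not the original code: grouped median/mean accuracy and message counts.
# Assumes d1 is a data.table and accuracy is NA for messages without factual claims.
library(data.table)

acc_summary <- d1[!is.na(accuracy),
                  .(median_accuracy = median(accuracy),
                    mean_accuracy   = mean(accuracy),
                    n               = .N),          # n = messages with factual claims
                  by = .(strategy, conditionC, modelF)]
acc_summary[order(strategy, conditionC, modelF)]
```

Since the regression is fully interacted over condition × model × strategy, its fitted values reproduce the cell means above, e.g. baseline GPT-4.1: 63.57 + 0.90 × (-0.5) ≈ 63.12 (pro-Carney) and 63.57 + 0.90 × 0.5 ≈ 64.02 (pro-Poilievre).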