250512_103643 canada - perplexity factcheck accuracy - hello
Perplexity web search model: `sonar-pro`
# mean/median accuracy
- `n`: number of AI messages containing factual claims (the `nofacts` condition yields far fewer such messages); a sketch of this aggregation follows the table
```r
    strategy    condition      modelF median_accuracy mean_accuracy     n
      <char>       <char>      <fctr>           <num>         <num> <int>
 1: baseline    proCarney     GPT-4.1              60      63.17647   425
 2: baseline    proCarney DeepSeek-V3              60      59.37799   418
 3: baseline    proCarney     Llama-4              70      69.04481   424
 4: baseline proPoilievre     GPT-4.1              60      64.02314   389
 5: baseline proPoilievre DeepSeek-V3              60      60.41929   477
 6: baseline proPoilievre     Llama-4              70      70.65708   487
 7:  nofacts    proCarney     GPT-4.1              75      72.65625    32
 8:  nofacts    proCarney DeepSeek-V3              80      69.50000    20
 9:  nofacts    proCarney     Llama-4              70      69.39759    83
10:  nofacts proPoilievre     GPT-4.1              70      64.84848    33
11:  nofacts proPoilievre DeepSeek-V3              60      65.38462    13
12:  nofacts proPoilievre     Llama-4              70      66.81159    69
```
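A minimal sketch of how this table can be computed, assuming `d1` is a data.table with one row per AI message and per-message accuracy in `pfc` (the `is.na()` filter for claim-free messages is an assumption):
```r
library(data.table)

# aggregate per-message accuracy (pfc) into cell medians/means;
# messages without factual claims carry no accuracy score and drop out
d1[!is.na(pfc),
   .(median_accuracy = median(pfc),
     mean_accuracy   = mean(pfc),
     n               = .N),
   keyby = .(strategy, condition, modelF)]
```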
DV: AI message accuracy
- each participant has multiple AI messages, so SEs are clustered on user (`responseid`)
- `conditionC`: proCarney (-0.5), proPoilievre (+0.5); coding sketched after this list
- `modelF`: GPT-4.1 (reference), DeepSeek-V3, Llama-4
- `strategy`: baseline, nofacts
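For reference, a sketch of the coding above, assuming raw character columns `condition` and `model` in `d1` (those column names are assumptions):
```r
library(data.table)

# centered condition contrast: proCarney = -0.5, proPoilievre = +0.5
d1[, conditionC := fifelse(condition == "proCarney", -0.5, 0.5)]

# model factor with GPT-4.1 as the reference level, matching the output below
d1[, modelF := factor(model, levels = c("GPT-4.1", "DeepSeek-V3", "Llama-4"))]
```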
```r
summ(feols(pfc ~ conditionC * modelF * strategy, d1, cluster = ~responseid))
    term                                              result                               sig
    <char>                                            <char>                               <char>
 1: (Intercept)                                       b = 63.60 [62.32, 64.88], p < .001   ***  # baseline, GPT-4.1
 2: conditionC                                        b = 0.85 [-1.71, 3.40], p = .516          # no difference in accuracy between conditions
 3: modelFDeepSeek-V3                                 b = -3.70 [-5.57, -1.83], p < .001   ***  # DeepSeek-V3 is less accurate than GPT-4.1
 4: modelFLlama-4                                     b = 6.25 [4.45, 8.06], p < .001      ***  # Llama-4 is more accurate than GPT-4.1
 5: strategynofacts                                   b = 5.15 [0.15, 10.15], p = .043     *    # but note that nofacts has far fewer messages than baseline
 6: conditionC × modelFDeepSeek-V3                    b = 0.19 [-3.54, 3.93], p = .919
 7: conditionC × modelFLlama-4                        b = 0.77 [-2.84, 4.38], p = .677
 8: conditionC × strategynofacts                      b = -8.65 [-18.66, 1.35], p = .090   .
 9: modelFDeepSeek-V3 × strategynofacts               b = 2.39 [-9.35, 14.13], p = .690
10: modelFLlama-4 × strategynofacts                   b = -6.90 [-13.05, -0.75], p = .028  *
11: conditionC × modelFDeepSeek-V3 × strategynofacts  b = 3.50 [-19.99, 26.98], p = .770
12: conditionC × modelFLlama-4 × strategynofacts      b = 4.46 [-7.84, 16.76], p = .477
```
![[1747066479.png]]
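One way to probe the Llama-4 × nofacts interaction (row 10) would be to estimate the nofacts effect within each model via fixest's `split` argument; a sketch, not part of the original analysis:
```r
library(fixest)

# refit the condition × strategy model separately per model level;
# split = ~modelF runs one estimation for each value of modelF
feols(pfc ~ conditionC * strategy, data = d1,
      cluster = ~responseid, split = ~modelF)
```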
# interact with no. of facts
```r
summ(feols(pfc ~ conditionC * modelF * scale(n_factual_claims), d1, cluster = ~responseid))
    term                                                      result                              sig
    <char>                                                    <char>                              <char>
 1: (Intercept)                                               b = 63.75 [62.52, 64.99], p < .001  ***
 2: conditionC                                                b = 0.43 [-2.04, 2.90], p = .736
 3: modelFDeepSeek-V3                                         b = -3.83 [-5.70, -1.95], p < .001  ***
 4: modelFLlama-4                                             b = 6.38 [4.69, 8.07], p < .001     ***
 5: scale(n_factual_claims)                                   b = 2.45 [1.27, 3.62], p < .001     ***  # more facts, more accurate overall?
 6: conditionC × modelFDeepSeek-V3                            b = 0.86 [-2.89, 4.61], p = .653
 7: conditionC × modelFLlama-4                                b = 0.59 [-2.80, 3.97], p = .734
 8: conditionC × scale(n_factual_claims)                      b = 1.37 [-0.98, 3.72], p = .253
 9: modelFDeepSeek-V3 × scale(n_factual_claims)               b = 0.35 [-1.50, 2.20], p = .709
10: modelFLlama-4 × scale(n_factual_claims)                   b = 0.72 [-0.83, 2.27], p = .362
11: conditionC × modelFDeepSeek-V3 × scale(n_factual_claims)  b = -2.56 [-6.25, 1.13], p = .174
12: conditionC × modelFLlama-4 × scale(n_factual_claims)      b = -1.37 [-4.47, 1.74], p = .388
```
![[20250512170307.png]]
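As a raw check on the positive `scale(n_factual_claims)` slope, one could correlate claim counts with accuracy within each model; a minimal sketch:
```r
# correlation between number of factual claims and accuracy, by model
d1[, .(r = cor(n_factual_claims, pfc, use = "complete.obs"), n = .N),
   by = modelF]
```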