- see [[250520_101204 poland - perplexity factcheck accuracy|poland accuracy results]]

Perplexity web search model: `sonar-pro`

Mean/median accuracy

- `n`: no. of AI messages with factual claims

```r
      condition median_accuracy mean_accuracy     n
         <char>           <num>         <num> <int>
1:    proCarney              85      80.96718  2224
2: proPoilievre              80      77.29054  2251

    strategy    condition      modelF median_accuracy mean_accuracy     n
      <char>       <char>      <fctr>           <num>         <num> <int>
 1: baseline    proCarney     GPT-4.1              85      82.39494   514
 2: baseline    proCarney DeepSeek-V3              85      79.72627   453
 3: baseline    proCarney     Llama-4              85      83.08279   459
 4: baseline proPoilievre     GPT-4.1              85      79.53593   487
 5: baseline proPoilievre DeepSeek-V3              80      77.48358   548
 6: baseline proPoilievre     Llama-4              85      81.70406   517
 7:  nofacts    proCarney     GPT-4.1              85      81.07500   200
 8:  nofacts    proCarney DeepSeek-V3              80      75.46667   210
 9:  nofacts    proCarney     Llama-4              85      80.94330   388
10:  nofacts proPoilievre     GPT-4.1              75      72.59459   185
11:  nofacts proPoilievre DeepSeek-V3              70      68.22222   180
12:  nofacts proPoilievre     Llama-4              80      74.35629   334
```

![[1747147217.png]]

DV: AI message accuracy

- each person has multiple AI messages, so SEs are clustered on user
- `conditionC`: proCarney (-0.5), proPoilievre (0.5)
- `modelF`: gpt, deepseek, llama
- `strategy`: baseline, nofacts
- see the fitting sketch after the second figure below

```r
summ(feols(pfc ~ conditionC * modelF * strategy, d1, cluster = ~responseid))

                                                term                             result    sig
                                              <char>                             <char> <char>
 1:                                      (Intercept) b = 80.97 [80.09, 81.84], p < .001    *** # baseline, gpt4
 2:                                       conditionC b = -2.86 [-4.62, -1.10], p = .001     ** # proP messages slightly less accurate
 3:                                modelFDeepSeek-V3 b = -2.36 [-3.73, -0.99], p = .001    *** # deepseek less accurate
 4:                                    modelFLlama-4 b = 1.43 [0.20, 2.66], p = .023         * # llama more accurate
 5:                                  strategynofacts b = -4.13 [-5.84, -2.42], p < .001    *** # nofacts LESS accurate than baseline
 6:                   conditionC × modelFDeepSeek-V3 b = 0.62 [-2.12, 3.35], p = .659
 7:                       conditionC × modelFLlama-4 b = 1.48 [-0.98, 3.94], p = .237
 8:                     conditionC × strategynofacts b = -5.62 [-9.04, -2.21], p = .001     ** # nofacts proP messages LESS accurate
 9:              modelFDeepSeek-V3 × strategynofacts b = -2.63 [-5.33, 0.07], p = .056       .
10:                  modelFLlama-4 × strategynofacts b = -0.61 [-2.88, 1.65], p = .596
11: conditionC × modelFDeepSeek-V3 × strategynofacts b = 0.62 [-4.78, 6.02], p = .822
12:     conditionC × modelFLlama-4 × strategynofacts b = 0.41 [-4.12, 4.95], p = .858
```

# interact with no. of facts

```r
summ(feols(pfc ~ conditionC * modelF * scale(n_factual_claims), d1, cluster = ~responseid))

                                                        term                             result    sig
                                                      <char>                             <char> <char>
 1:                                              (Intercept) b = 79.66 [78.89, 80.42], p < .001    ***
 2:                                               conditionC b = -4.48 [-6.01, -2.95], p < .001    ***
 3:                                        modelFDeepSeek-V3 b = -3.09 [-4.29, -1.88], p < .001    *** # deepseek less accurate
 4:                                            modelFLlama-4 b = 1.17 [0.15, 2.20], p = .025         *
 5:                                  scale(n_factual_claims) b = 2.09 [1.38, 2.81], p < .001       *** # more facts, more accurate overall
 6:                           conditionC × modelFDeepSeek-V3 b = 1.39 [-1.03, 3.80], p = .260
 7:                               conditionC × modelFLlama-4 b = 1.28 [-0.76, 3.33], p = .219
 8:                     conditionC × scale(n_factual_claims) b = 2.25 [0.81, 3.68], p = .002        **
 9:              modelFDeepSeek-V3 × scale(n_factual_claims) b = 0.60 [-0.52, 1.72], p = .293
10:                  modelFLlama-4 × scale(n_factual_claims) b = 0.61 [-0.36, 1.59], p = .220
11: conditionC × modelFDeepSeek-V3 × scale(n_factual_claims) b = -1.12 [-3.36, 1.12], p = .327
12:     conditionC × modelFLlama-4 × scale(n_factual_claims) b = -0.20 [-2.15, 1.75], p = .843
```

![[20250513104117.png]]
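For reference, a minimal sketch of how the two models above could be fit with `fixest`, assuming `d1` has one row per AI message with columns `pfc`, `strategy`, `n_factual_claims`, and `responseid`, plus raw `condition` and `model` columns (hypothetical names) used to build `conditionC` and `modelF`; `summ()` in the output above looks like a local summary helper, so plain `summary()` is used here:

```r
library(fixest)

# illustrative recoding (assumes raw columns `condition` and `model` exist):
# center the persuasion condition at +/- 0.5 and set GPT-4.1 as the reference model
d1$conditionC <- ifelse(d1$condition == "proPoilievre", 0.5, -0.5)
d1$modelF     <- factor(d1$model, levels = c("GPT-4.1", "DeepSeek-V3", "Llama-4"))

# accuracy ~ condition x model x strategy, SEs clustered on participant (responseid)
m_strategy <- feols(pfc ~ conditionC * modelF * strategy, data = d1, cluster = ~responseid)

# accuracy ~ condition x model x standardized no. of factual claims
m_nfacts <- feols(pfc ~ conditionC * modelF * scale(n_factual_claims), data = d1, cluster = ~responseid)

summary(m_strategy)
summary(m_nfacts)
```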
# check: harris-trump study with sonar-pro

Sonar-pro is a newer, more up-to-date model (released March 2025); the older model we used (sonar-huge) no longer exists. Randomly sampled 1,000 messages to check whether the asymmetry is still there when using sonar-pro (it is).

- median rating: 82 (90 for old model)
- correlation between sonar-pro scores and scores from old model (sonar-huge): `rho = 0.64 [0.60, 0.68]` (see the sketch at the end of this note)

```r
summ(feols(sonar_pro_new ~ condition + topic, data = d2, cluster = ~responseid))

                term                                result    sig
              <char>                                <char> <char>
1:       (Intercept) b = 81.59 [80.35, 82.82], p < .001       ***
2: conditionproTrump b = -12.54 [-14.02, -11.06], p < .001    *** # old model: b = -22.46 [-23.77, -21.15], p < .001
3:       topicpolicy b = 3.84 [2.38, 5.30], p < .001          *** # old model: b = 6.06 [4.75, 7.37], p < .001
```

manuscript

> To evaluate the accuracy of the claims made by the AI models, we conducted post hoc analyses using Perplexity AI’s online LLM (Sonar Huge 128k-online)—which can access real-time information from the internet—to fact-check all statements made by the AI models during the conversations that it determined contained claims or factual information (k = 8,134 statements). Each statement was rated on a scale from 0 (completely inaccurate) to 100 (completely accurate). We find that, on average, the statements made by the AI models were mostly accurate, receiving a median accuracy score of 90 (median absolute deviation = 14.83). In a post hoc analysis comparing accuracy scores across conditions, however, we find stark differences (Fig. 3A). Linear regression predicting accuracy scores using center-coded persuasion condition and conversation focus dummies and their interaction finds that the pro-Trump AI made substantially more inaccurate statements compared to the pro-Harris AI (b = -22.46 [-23.77, -21.15], p < .001). We also find that the statements were more accurate in the policy-focused condition than the personality-focused condition (b = 6.06 [4.75, 7.37], p < .001), and that the accuracy difference between the pro-Harris and pro-Trump statements was significantly larger when the conversation was focused on personality rather than policy (interaction: b = 7.27 [4.65, 9.90], p < .001). That is, the AI was more misinformative when trying to convince people to vote for Trump, and this was particularly stark when the conversation focused on the candidates’ personal characteristics (see SI Section S3.1 and SI Table S20 for examples). We also examined how the AI models’ accuracy changed throughout a given conversation and find that accuracy increases across successive statements within a conversation (see SI Section S3.2).
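Sketch for the sonar-pro re-check above (referenced from the correlation bullet), assuming `d2` holds the 1,000 resampled Harris-Trump messages with the new scores in `sonar_pro_new` and the old sonar-huge scores in `sonar_huge_old` (hypothetical column name); the bootstrap interval is just one way to get a CI for rho, not necessarily how the interval reported above was computed:

```r
library(fixest)

# Spearman correlation between new (sonar-pro) and old (sonar-huge) accuracy scores
# (`sonar_huge_old` is an assumed column name)
rho <- cor(d2$sonar_pro_new, d2$sonar_huge_old, method = "spearman", use = "complete.obs")

# percentile bootstrap CI for rho (one option; the note does not say how its CI was obtained)
set.seed(1)
boot_rho <- replicate(2000, {
  i <- sample(nrow(d2), replace = TRUE)
  cor(d2$sonar_pro_new[i], d2$sonar_huge_old[i], method = "spearman", use = "complete.obs")
})
c(rho = rho, quantile(boot_rho, c(.025, .975)))

# re-fit the condition + topic model on the new sonar-pro scores, SEs clustered on participant
m_check <- feols(sonar_pro_new ~ condition + topic, data = d2, cluster = ~responseid)
summary(m_check)
```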