```r
> summ(feols(pfc ~ conditionC * modelF * scale(n_factual_claims), d1, cluster = ~responseid))
                                                         term                                 result    sig
                                                       <char>                                 <char> <char>
 1:                                               (Intercept)    b = 79.66 [78.89, 80.42], p < .001    ***
 2:                                                conditionC    b = -4.48 [-6.01, -2.95], p < .001    ***
 3:                                         modelFDeepSeek-V3    b = -3.09 [-4.29, -1.88], p < .001    ***  # deepseek less accurate
 4:                                             modelFLlama-4      b = 1.17 [0.15, 2.20], p = .025      *
 5:                                   scale(n_factual_claims)      b = 2.09 [1.38, 2.81], p < .001    ***  # more facts, more accurate overall
 6:                            conditionC × modelFDeepSeek-V3     b = 1.39 [-1.03, 3.80], p = .260
 7:                                conditionC × modelFLlama-4     b = 1.28 [-0.76, 3.33], p = .219
 8:                      conditionC × scale(n_factual_claims)      b = 2.25 [0.81, 3.68], p = .002     **
 9:               modelFDeepSeek-V3 × scale(n_factual_claims)     b = 0.60 [-0.52, 1.72], p = .293
10:                   modelFLlama-4 × scale(n_factual_claims)     b = 0.61 [-0.36, 1.59], p = .220
11: conditionC × modelFDeepSeek-V3 × scale(n_factual_claims)    b = -1.12 [-3.36, 1.12], p = .327
12:     conditionC × modelFLlama-4 × scale(n_factual_claims)    b = -0.20 [-2.15, 1.75], p = .843
```

![[20250513104117.png]]

# check: harris-trump study with sonar-pro

Sonar-pro is a newer model (released March 2025). The older model we used (sonar-huge) no longer exists. Randomly sampled 1,000 messages to check whether the asymmetry is still there when using sonar-pro (it is).

- median rating: 82 (90 for the old model)
- correlation between sonar-pro scores and scores from the old model (sonar-huge): `rho = 0.64 [0.60, 0.68]`

```r
> summ(feols(sonar_pro_new ~ condition + topic, data = d2, cluster = ~responseid))
                term                                  result    sig
              <char>                                  <char> <char>
1:       (Intercept)     b = 81.59 [80.35, 82.82], p < .001    ***
2: conditionproTrump  b = -12.54 [-14.02, -11.06], p < .001    ***  # old model: b = -22.46 [-23.77, -21.15], p < .001
3:       topicpolicy       b = 3.84 [2.38, 5.30], p < .001    ***  # old model: b = 6.06 [4.75, 7.37], p < .001
```

manuscript

> To evaluate the accuracy of the claims made by the AI models, we conducted post hoc analyses using Perplexity AI’s online LLM (Sonar Huge 128k-online)—which can access real-time information from the internet—to fact-check all statements made by the AI models during the conversations that it determined contained claims or factual information (k = 8,134 statements). Each statement was rated on a scale from 0 (completely inaccurate) to 100 (completely accurate). We find that, on average, the statements made by the AI models were mostly accurate, receiving a median accuracy score of 90 (median absolute deviation = 14.83). In a post hoc analysis comparing accuracy scores across conditions, however, we find stark differences (Fig. 3A). Linear regression predicting accuracy scores using center-coded persuasion condition and conversation focus dummies and their interaction finds that the pro-Trump AI made substantially more inaccurate statements compared to the pro-Harris AI (b = -22.46 [-23.77, -21.15], p < .001). We also find that the statements were more accurate in the policy-focused condition than the personality-focused condition (b = 6.06 [4.75, 7.37], p < .001), and that the accuracy difference between the pro-Harris and pro-Trump statements was significantly larger when the conversation was focused on personality rather than policy (interaction: b = 7.27 [4.65, 9.90], p < .001). That is, the AI was more misinformative when trying to convince people to vote for Trump, and this was particularly stark when the conversation focused on the candidates’ personal characteristics (see SI Section S3.1 and SI Table S20 for examples).
> We also examined how the AI models’ accuracy changed throughout a given conversation and find that accuracy increases across successive statements within a conversation (see SI Section S3.2).
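
A minimal sketch of how that within-conversation trend could be checked with `fixest`, assuming the per-statement data (`d2` here) are ordered by occurrence within each conversation and carry an `accuracy` column; the column names and the bare specification are placeholders rather than the actual SI S3.2 analysis, and plain `summary()` stands in for the custom `summ()` helper used above:

```r
# Minimal sketch, not the actual SI S3.2 specification.
# Assumes one row per fact-checked statement, ordered by occurrence
# within each conversation (responseid), with an `accuracy` score.
library(data.table)
library(fixest)

dt <- as.data.table(d2)
dt[, statement_order := seq_len(.N), by = responseid]  # 1, 2, 3, ... within each conversation

# A positive coefficient on scale(statement_order) indicates accuracy
# rising across successive statements; SEs clustered by conversation.
summary(feols(accuracy ~ scale(statement_order), data = dt, cluster = ~responseid))
```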