250512_103643 canada - perplexity factcheck accuracy - hello
Perplexity web search model: `sonar-pro`
# mean/median accuracy
- `n`: number of AI messages containing factual claims (the `nofacts` condition yields far fewer such messages); a sketch of this aggregation follows the table
```r
    strategy    condition      modelF median_accuracy mean_accuracy     n
      <char>       <char>      <fctr>           <num>         <num> <int>
 1: baseline    proCarney     GPT-4.1              60      63.17647   425
 2: baseline    proCarney DeepSeek-V3              60      59.37799   418
 3: baseline    proCarney     Llama-4              70      69.04481   424
 4: baseline proPoilievre     GPT-4.1              60      64.02314   389
 5: baseline proPoilievre DeepSeek-V3              60      60.41929   477
 6: baseline proPoilievre     Llama-4              70      70.65708   487
 7:  nofacts    proCarney     GPT-4.1              75      72.65625    32
 8:  nofacts    proCarney DeepSeek-V3              80      69.50000    20
 9:  nofacts    proCarney     Llama-4              70      69.39759    83
10:  nofacts proPoilievre     GPT-4.1              70      64.84848    33
11:  nofacts proPoilievre DeepSeek-V3              60      65.38462    13
12:  nofacts proPoilievre     Llama-4              70      66.81159    69
```
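A minimal sketch of how this table can be computed, assuming `d1` is a data.table with one row per AI message and per-message accuracy in `pfc` (the `is.na()` filter for claim-free messages is an assumption):
```r
library(data.table)

# aggregate per-message accuracy (pfc) into cell medians/means;
# messages without factual claims carry no accuracy score and drop out
d1[!is.na(pfc),
   .(median_accuracy = median(pfc),
     mean_accuracy   = mean(pfc),
     n               = .N),
   keyby = .(strategy, condition, modelF)]
```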
DV: AI message accuracy
- each participant has multiple AI messages, so SEs are clustered on user (`responseid`)
- `conditionC`: proCarney (-0.5), proPoilievre (+0.5); coding sketched after this list
- `modelF`: GPT-4.1 (reference), DeepSeek-V3, Llama-4
- `strategy`: baseline, nofacts
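For reference, a sketch of the coding above, assuming raw character columns `condition` and `model` in `d1` (those column names are assumptions):
```r
library(data.table)

# centered condition contrast: proCarney = -0.5, proPoilievre = +0.5
d1[, conditionC := fifelse(condition == "proCarney", -0.5, 0.5)]

# model factor with GPT-4.1 as the reference level, matching the output below
d1[, modelF := factor(model, levels = c("GPT-4.1", "DeepSeek-V3", "Llama-4"))]
```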
```r
summ(feols(pfc ~ conditionC * modelF * strategy, d1, cluster = ~responseid))
    term                                              result                               sig
    <char>                                            <char>                               <char>
 1: (Intercept)                                       b = 63.60 [62.32, 64.88], p < .001   ***  # baseline, GPT-4.1
 2: conditionC                                        b = 0.85 [-1.71, 3.40], p = .516          # no difference in accuracy between conditions
 3: modelFDeepSeek-V3                                 b = -3.70 [-5.57, -1.83], p < .001   ***  # DeepSeek-V3 is less accurate than GPT-4.1
 4: modelFLlama-4                                     b = 6.25 [4.45, 8.06], p < .001      ***  # Llama-4 is more accurate than GPT-4.1
 5: strategynofacts                                   b = 5.15 [0.15, 10.15], p = .043     *    # but note that nofacts has far fewer messages than baseline
 6: conditionC × modelFDeepSeek-V3                    b = 0.19 [-3.54, 3.93], p = .919
 7: conditionC × modelFLlama-4                        b = 0.77 [-2.84, 4.38], p = .677
 8: conditionC × strategynofacts                      b = -8.65 [-18.66, 1.35], p = .090   .
 9: modelFDeepSeek-V3 × strategynofacts               b = 2.39 [-9.35, 14.13], p = .690
10: modelFLlama-4 × strategynofacts                   b = -6.90 [-13.05, -0.75], p = .028  *
11: conditionC × modelFDeepSeek-V3 × strategynofacts  b = 3.50 [-19.99, 26.98], p = .770
12: conditionC × modelFLlama-4 × strategynofacts      b = 4.46 [-7.84, 16.76], p = .477
```
![[1747066479.png]]
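One way to probe the Llama-4 × nofacts interaction (row 10) would be to estimate the nofacts effect within each model via fixest's `split` argument; a sketch, not part of the original analysis:
```r
library(fixest)

# refit the condition × strategy model separately per model level;
# split = ~modelF runs one estimation for each value of modelF
feols(pfc ~ conditionC * strategy, data = d1,
      cluster = ~responseid, split = ~modelF)
```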
# interact with no. of facts
```r
summ(feols(pfc ~ conditionC * modelF * scale(n_factual_claims), d1, cluster = ~responseid))
    term                                                      result                              sig
    <char>                                                    <char>                              <char>
 1: (Intercept)                                               b = 63.75 [62.52, 64.99], p < .001  ***
 2: conditionC                                                b = 0.43 [-2.04, 2.90], p = .736
 3: modelFDeepSeek-V3                                         b = -3.83 [-5.70, -1.95], p < .001  ***
 4: modelFLlama-4                                             b = 6.38 [4.69, 8.07], p < .001     ***
 5: scale(n_factual_claims)                                   b = 2.45 [1.27, 3.62], p < .001     ***  # more facts, more accurate overall?
 6: conditionC × modelFDeepSeek-V3                            b = 0.86 [-2.89, 4.61], p = .653
 7: conditionC × modelFLlama-4                                b = 0.59 [-2.80, 3.97], p = .734
 8: conditionC × scale(n_factual_claims)                      b = 1.37 [-0.98, 3.72], p = .253
 9: modelFDeepSeek-V3 × scale(n_factual_claims)               b = 0.35 [-1.50, 2.20], p = .709
10: modelFLlama-4 × scale(n_factual_claims)                   b = 0.72 [-0.83, 2.27], p = .362
11: conditionC × modelFDeepSeek-V3 × scale(n_factual_claims)  b = -2.56 [-6.25, 1.13], p = .174
12: conditionC × modelFLlama-4 × scale(n_factual_claims)      b = -1.37 [-4.47, 1.74], p = .388
```
![[20250512170307.png]]
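As a raw check on the positive `scale(n_factual_claims)` slope, one could correlate claim counts with accuracy within each model; a minimal sketch:
```r
# correlation between number of factual claims and accuracy, by model
d1[, .(r = cor(n_factual_claims, pfc, use = "complete.obs"), n = .N),
   by = modelF]
```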