From the original Qualtrics data, I randomly selected 100 participants from each condition-topic combination, for 400 participants in total. `fc_score` is Perplexity's (mean) fact-check score. The model is labeled `perplexity+gpt`. The **first** AI responses to all 400 participants' prompts (generated during the study) all contained factual content that could be evaluated.
```r
model condition topic n fc_score
<char> <char> <char> <int> <num>
1: perplexity+gpt Pro-Harris AI Personality 100 84.39
2: perplexity+gpt Pro-Harris AI Policy 100 88.44
3: perplexity+gpt Pro-Trump AI Personality 100 38.95
4: perplexity+gpt Pro-Trump AI Policy 100 52.56
```
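A minimal sketch of the sampling/summary step with data.table (table and column names like `qualtrics` and `participant_id` are illustrative, and the conditioning on fact-checkable first messages noted below is omitted):
```r
library(data.table)

# randomly sample 100 participants within each condition-topic cell
# (assumes `qualtrics` has one row per participant)
set.seed(42)  # arbitrary seed, just for reproducibility
sampled <- qualtrics[, .SD[sample(.N, 100)], by = .(condition, topic)]

# mean Perplexity fact-check score per cell, as in the table above
sampled[, .(n = .N, fc_score = mean(fc_score)), by = .(condition, topic)]
```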
Each participant has a unique system+user prompt that was used to initiate the conversation during the study. I fed each system+user prompt to 12 other models to simulate the **very first AI response** to that initial prompt.
```r
                         model     N
                        <char> <int>
 1:     deepseek/deepseek-chat   400
 2:             gemini-pro-1.5   400
 3:             gemma-2-27b-it   400
 4:                gpt-4-turbo   400
 5:                     gpt-4o   400
 6:                grok-2-1212   400
 7:    llama-3.1-405b-instruct   400
 8:     llama-3.1-70b-instruct   400
 9:     llama-3.3-70b-instruct   400
10:         mistral-large-2407   400
11:               mistral-nemo   400
12:             perplexity+gpt   400  # original data/model (not simulated)
13: qwen/qwen-2.5-72b-instruct   400
```
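A sketch of how the simulations were generated, assuming `sampled` carries `participant_id`, `system_prompt`, and `user_prompt` columns and assuming a hypothetical wrapper `get_first_response(model, system_prompt, user_prompt)` around whatever chat-completions API was actually used; only the loop structure is the point:
```r
models <- c("deepseek/deepseek-chat", "gemini-pro-1.5", "gemma-2-27b-it",
            "gpt-4-turbo", "gpt-4o", "grok-2-1212",
            "llama-3.1-405b-instruct", "llama-3.1-70b-instruct",
            "llama-3.3-70b-instruct", "mistral-large-2407",
            "mistral-nemo", "qwen/qwen-2.5-72b-instruct")

# one simulated "very first AI response" per participant per model,
# using each participant's original system+user prompt
# (get_first_response() is a hypothetical API wrapper)
simulated <- rbindlist(lapply(models, function(m) {
  sampled[, .(model    = m,
              response = get_first_response(m, system_prompt, user_prompt)),
          by = participant_id]
}))

# sanity check: 400 simulated first responses per model
simulated[, .N, by = model]
```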
# results
## proportion of messages that contain factual, verifiable content
About 20% of the simulated AI messages could not be fact-checked. FYI, the prompt used for Perplexity fact-checking:
"""You are a professional fact-checker, and you excel at objectively evaluating the veracity/accuracy of information. Rate the overall veracity of the text using a scale of 0 (completely inaccurate) to 100 (completely accurate), but if the text contains no factual information to evaluate for accuracy, you must provide a score of -100 and briefly explain why. For scores lower than 100, briefly explain and justify your score. You must respond with a valid JSON matching the schema provided, and only return the JSON object without markdown json code block with backticks or extra text before or after the JSON object: {"score": "int", "explanation": "string"}"""
```r
                         model perc_factchecked n_factchecked
                        <char>            <num>         <num>
 1:     deepseek/deepseek-chat            92.25           369  # chinese model performing well
 2:             gemini-pro-1.5            81.75           327
 3:             gemma-2-27b-it            77.75           311
 4:                gpt-4-turbo            83.00           332
 5:                     gpt-4o            84.75           339
 6:                grok-2-1212            82.75           331
 7:    llama-3.1-405b-instruct            71.00           284
 8:     llama-3.1-70b-instruct            72.75           291
 9:     llama-3.3-70b-instruct            65.75           263
10:         mistral-large-2407            89.00           356
11:               mistral-nemo            87.50           350
12:             perplexity+gpt           100.00           400  # all messages can be fact checked
13: qwen/qwen-2.5-72b-instruct            88.25           353  # chinese model performing well
```
Fitted logistic regression to compare fact-checking rates between all pairs of models (Bonferroni correction just to be super stringent, but not necessarily the best correction here).
Summary: Models differ in how much fact-based/verifiable content they generate. The Perplexity+GPT combo (the setup used in the experiment) produced the most factual/verifiable content (though note that when randomly selecting the 400 participants initially, I conditioned on those whose first AI message contained verifiable content and could be fact-checked).
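A sketch of that comparison via `glm()` + `emmeans` (the actual analysis may have used different tooling; column names follow the sketch above):
```r
library(emmeans)

# binary outcome: was the message fact-checkable?
m_fc <- glm(fact_checked ~ model, family = binomial, data = fc)

# all pairwise differences in fact-checking rates between models, Bonferroni-adjusted
pairs(emmeans(m_fc, ~ model), adjust = "bonferroni")
```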
![[1736214131.png]]
## among the verifiable, factual content, perplexity+gpt combo has the highest fc_score
```r
                         model fc_score     N
                        <fctr>    <num> <int>
 1:             perplexity+gpt 66.08500   400
 2: qwen/qwen-2.5-72b-instruct 61.17847   353  # chinese model is good
 3:     deepseek/deepseek-chat 58.66125   369  # chinese model is good
 4:                gpt-4-turbo 58.20482   332
 5:                     gpt-4o 54.07080   339
 6:         mistral-large-2407 53.36798   356
 7:             gemini-pro-1.5 53.14373   327
 8:    llama-3.1-405b-instruct 47.96831   284
 9:     llama-3.1-70b-instruct 45.73540   291
10:             gemma-2-27b-it 45.62701   311
11:                grok-2-1212 43.89728   331
12:               mistral-nemo 39.80286   350
13:     llama-3.3-70b-instruct 37.83270   263
```
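These means use only the fact-checkable messages (scores above the -100 sentinel); with the same assumed `fc` table:
```r
# mean veracity score per model, excluding the -100 sentinel
fc[score > -100,
   .(fc_score = mean(score), N = .N),
   by = model][order(-fc_score)]
```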
Fitted OLS to compare fact-checked scores between all pairs of models (same Bonferroni correction as above).
Summary: Perplexity+GPT AI messages were more factually accurate than messages generated by all other models.
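A sketch of the OLS version, mirroring the logistic setup above (same caveat that the exact tooling is an assumption):
```r
# OLS on the fact-checkable messages only
m_score <- lm(score ~ model, data = fc[score > -100])

# all pairwise differences in mean fc_score between models, Bonferroni-adjusted
pairs(emmeans(m_score, ~ model), adjust = "bonferroni")
```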
![[1736214193.png]]
## every model showed the same bias
All models generated much more accurate content in the pro-Harris condition than in the pro-Trump condition.
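One way to quantify this per model (again assuming the `fc` table also carries the `condition` column):
```r
# mean fc_score by model and condition (fact-checkable messages only);
# every model scores higher in the pro-Harris condition
fc[score > -100,
   .(fc_score = mean(score), N = .N),
   by = .(model, condition)][order(model, condition)]

# or, more formally, test the condition gap within each model
m_bias <- lm(score ~ model * condition, data = fc[score > -100])
emmeans(m_bias, pairwise ~ condition | model, adjust = "bonferroni")
```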
![[1736214232.png]]