A comparison of 1,500 synthetic respondents and 1,500 real people using six OpenAI LLMs showed large discrepancies between AI and human responses. The LLMs were prompted with standard political survey questions and instructed to simulate specific demographic profiles. Overall error ranged from 4 to 23 points compared with real respondents, and accuracy declined markedly for underrepresented groups such as Black, Asian and Pacific Islander respondents. Cost and time incentives drive adoption of synthetic surveying despite these limitations, posing risks for poll accuracy and representation.
In a white paper for the survey platform Verasight, data journalist G. Elliott Morris compared 1,500 "synthetic" survey respondents with 1,500 real people and found that large language models (LLMs) were, overall, very bad at reflecting the views of actual human respondents. The LLMs were asked typical real-world political survey questions, such as "Do you approve or disapprove of the way Donald Trump is handling his job as president?", and given a five-point scale: "strongly approve," "slightly approve," "slightly disapprove," "strongly disapprove" and "don't know/not sure."

Using six OpenAI models - GPT-4.1, GPT-4.1 nano, GPT-4.1 mini, GPT-4o, GPT-4o mini, and o4-mini - Morris instructed each LLM to respond as various demographics. In one example, he prompted the LLMs to respond as a white 61-year-old woman in Florida who makes between $50,000 and $75,000 per year and who considers herself a moderate voter.

The results weren't exactly inspiring. The worst-performing model, which was not specified, was 23 points off from the real respondents overall, while the best-performing model, GPT-4o mini, was 4 points off. And the more closely Morris zoomed in, the worse things looked. As a graphic from the study shows, even the "voters" generated by the best-performing model veered further from reality when instructed to respond as groups that are less well represented in the United States population, such as Black, Asian and Pacific Islander respondents.
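The persona-prompting setup described in the study can be sketched roughly as follows. This is a minimal illustration, not Morris's actual code: the profile fields, prompt wording, and helper names are all assumptions, and the resulting messages would then be sent to a chat-completion API such as OpenAI's.

```python
# Hypothetical sketch of persona-based survey prompting.
# Field names, wording, and structure are illustrative assumptions,
# not the white paper's actual prompts.

APPROVAL_SCALE = [
    "strongly approve",
    "slightly approve",
    "slightly disapprove",
    "strongly disapprove",
    "don't know/not sure",
]

def build_persona_prompt(profile: dict) -> str:
    """Render a demographic profile into a system-style instruction."""
    return (
        f"Answer as a {profile['race']} {profile['age']}-year-old "
        f"{profile['gender']} in {profile['state']} who makes "
        f"{profile['income']} per year and considers herself a "
        f"{profile['ideology']} voter."
    )

def build_survey_message(question: str, scale: list) -> str:
    """Combine a survey question with its fixed response scale."""
    options = "; ".join(scale)
    return f"{question} Respond with exactly one of: {options}."

# The example profile from the article.
profile = {
    "race": "white", "age": 61, "gender": "woman", "state": "Florida",
    "income": "between $50,000 and $75,000", "ideology": "moderate",
}
persona = build_persona_prompt(profile)
question = build_survey_message(
    "Do you approve or disapprove of the way Donald Trump is handling "
    "his job as president?",
    APPROVAL_SCALE,
)

# These two strings would then be sent as, e.g., the "system" and
# "user" messages of a chat-completion request, once per synthetic
# respondent, and the chosen scale option tallied against real polls.
```

Tallying the model's chosen scale options across many such synthetic respondents, and comparing those distributions with real survey results, is what yields the overall and per-subgroup error figures the study reports.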