Latest Validation: Evaluating anthropomorphic alignment in next-gen models. New data for Anthropic Claude Haiku 4.5 included.

HumanStudy-Bench: Towards AI Agent Design for Participant Simulation

A non-profit, open-source platform that converts human-subject research papers into a standardized testbed where AI agents replay the original experiments end-to-end, and that evaluates agent alignment with human participants at the level of scientific inference.

What is HumanStudy-Bench?

HumanStudy-Bench treats participant simulation as an agent design problem. It provides a standardized testbed with two components: an Execution Engine that reconstructs full experimental protocols from published studies, and a Benchmark with standardized evaluation metrics. Together they replay human-subject experiments end-to-end and evaluate alignment at the level of scientific inference.

Standardized Testbed

Test different agent designs on the same experiments, run agents through real studies covering 6,000+ trials, and compare results rigorously using inferential-level metrics.

12 Foundational Studies: covering major behavioral phenomena

6,000+ Experimental Trials: replayed with AI agents

10 to 2,000+ Human Sample Range: per-study participant count

Evaluation Metrics: PAS and ECS for alignment

Open-ended and growing — join us to add more

Pipeline Architecture

From published human studies to a reusable simulation environment in four stages.

Stage 1: Filter: Curates human studies that are scientifically important and practically reproducible, ensuring full experimental details, quantifiable outcomes, and simulation feasibility.
Stage 2: Extract: Extracts participant profiles, experimental designs, statistical tests, and human ground-truth outcomes from unstructured papers into machine-executable representations.
Stage 3: Execute: Runs agent designs through reconstructed experimental protocols, generating trial-level data via a shared execution engine that handles agent sampling, instruction dispatch, and response collection.
Stage 4: Evaluate: Compares agent responses against human ground-truth using Probability Alignment Score (PAS) for inferential agreement and Effect Consistency Score (ECS) for effect-size alignment.
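
The four stages above can be sketched as a small Python pipeline. This is a minimal illustration, not the platform's actual code: the `StudySpec` schema, the function names, the agent signature, and the toy study values are all hypothetical, and the evaluation step is a placeholder mean comparison rather than PAS or ECS.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class StudySpec:
    """Hypothetical machine-executable study representation (Stage 2 output)."""
    name: str
    has_full_materials: bool
    has_quantifiable_outcome: bool
    instructions: str
    n_participants: int
    human_mean: float  # placeholder ground truth, not taken from any paper

# Hypothetical agent interface: instructions and an RNG in, numeric response out.
Agent = Callable[[str, random.Random], float]

def filter_studies(candidates: List[dict]) -> List[dict]:
    """Stage 1: keep studies with full materials and a quantifiable outcome."""
    return [c for c in candidates
            if c["has_full_materials"] and c["has_quantifiable_outcome"]]

def extract(candidate: dict) -> StudySpec:
    """Stage 2: stand-in for paper extraction; here it only wraps a dict."""
    return StudySpec(**candidate)

def execute(study: StudySpec, agent: Agent, seed: int = 0) -> List[float]:
    """Stage 3: sample one simulated participant per human participant."""
    rng = random.Random(seed)
    return [agent(study.instructions, rng) for _ in range(study.n_participants)]

def evaluate(study: StudySpec, responses: List[float]) -> Dict[str, float]:
    """Stage 4 placeholder: compare means (the real metrics are PAS and ECS)."""
    agent_mean = sum(responses) / len(responses)
    return {"agent_mean": agent_mean,
            "human_mean": study.human_mean,
            "gap": agent_mean - study.human_mean}

if __name__ == "__main__":
    # Toy candidates with made-up values, for illustration only.
    candidates = [
        dict(name="toy_study_a", has_full_materials=True,
             has_quantifiable_outcome=True,
             instructions="Estimate a percentage between 0 and 100.",
             n_participants=5, human_mean=40.0),
        dict(name="toy_study_b", has_full_materials=False,
             has_quantifiable_outcome=True,
             instructions="(materials missing)", n_participants=3,
             human_mean=10.0),
    ]

    def uniform_agent(instructions: str, rng: random.Random) -> float:
        return rng.uniform(0, 100)

    for candidate in filter_studies(candidates):
        spec = extract(candidate)
        print(spec.name, evaluate(spec, execute(spec, uniform_agent)))
```

Keeping the agent behind a single callable is the point of the sketch: different agent designs can then be run through the same `execute` step on the same studies.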

Evaluation Metrics

PAS: Probability Alignment Score

Measures whether agents reach the same scientific conclusions as humans at the phenomenon level. It quantifies the probability that agent and human populations exhibit behavior consistent with the same hypothesis.
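
The exact PAS formula is not given on this page. As a rough illustration of the idea only, assume each population's evidence for a hypothesis is summarized as a single probability and that the two populations are independent; the chance that both fall on the same side of the hypothesis is then a simple product sum. The function name and inputs below are hypothetical.

```python
def agreement_probability(p_human: float, p_agent: float) -> float:
    """Probability that two independent populations agree on a hypothesis.

    p_human, p_agent: assumed probabilities that the hypothesis holds in each
    population. This is an illustrative stand-in, not the PAS definition.
    """
    for p in (p_human, p_agent):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    # Both support the hypothesis, or both fail to support it.
    return p_human * p_agent + (1.0 - p_human) * (1.0 - p_agent)
```

Under this assumption, an agent population that is maximally uncertain (probability 0.5) agrees with humans half the time regardless of the human evidence, which is why a score near 50% should not be read as strong alignment in this toy model.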

ECS: Effect Consistency Score

Measures how closely agents reproduce the magnitude and pattern of human behavioral effects at the data level. It assesses both the precision and accuracy of agent responses compared to human ground truth.
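
This page does not state how ECS is computed. One well-known statistic that rewards both precision (correlation) and accuracy (absence of systematic shift) is Lin's concordance correlation coefficient, so the sketch below uses it as a stand-in. Treating it as ECS is an assumption, and the function name and inputs are hypothetical.

```python
from statistics import fmean
from typing import Sequence

def concordance_correlation(human: Sequence[float],
                            agent: Sequence[float]) -> float:
    """Lin's concordance correlation between paired effect sizes.

    Returns a value in [-1, 1]; 1 means the agent effects match the human
    effects exactly. Used here only as an illustrative stand-in for ECS.
    """
    if len(human) != len(agent) or len(human) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    n = len(human)
    mean_h, mean_a = fmean(human), fmean(agent)
    # Population (divide-by-n) variances and covariance.
    var_h = sum((x - mean_h) ** 2 for x in human) / n
    var_a = sum((y - mean_a) ** 2 for y in agent) / n
    cov = sum((x - mean_h) * (y - mean_a) for x, y in zip(human, agent)) / n
    # The squared mean difference penalizes systematic over- or under-shoot.
    denom = var_h + var_a + (mean_h - mean_a) ** 2
    if denom == 0:
        raise ValueError("undefined when both sequences are the same constant")
    return 2 * cov / denom
```

Unlike a plain Pearson correlation, this statistic drops when agent effects track the human pattern but are uniformly too large or too small, and it can go negative when the pattern is reversed.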

Leaderboard

Evaluating agent design alignment with human behavior using Probability Alignment Score (PAS) and Effect Consistency Score (ECS) across 12 foundational human-subject studies.

Rank	Model	Variant	PAS (Alignment)	ECS	Cost	Tokens
1	gemini-3-flash-preview	v3_human_plus_demo	49.7%	0.1593	$2.7883	3,891,506
2	gemini-3-flash-preview	v4_background	46.5%	0.0076	$5.2743	7,786,439
3	gpt-5-nano	v4_background	45.9%	0.0498	$0.3919	5,197,263
4	mistral-nemo	v3_human_plus_demo	44.0%	0.1311	$0.4451	4,789,725
5	qwen3-next-80b-a3b-instruct	v4_background	43.4%	0.1142	$0.9527	4,695,595
6	mistral-nemo	v4_background	43.2%	0.0389	$0.2004	6,973,836
7	mistral-nemo	v1_empty	42.7%	0.0700	$0.3916	4,277,209
8	gpt-oss-20b	v1_empty	41.9%	0.0284	$1.4158	8,450,730
9	gpt-oss-20b	v3_human_plus_demo	41.8%	0.0223	$1.1677	7,561,478
10	mistral-nemo	v2_human	41.1%	0.0339	$0.3872	4,370,464
11	ai-grok-4.1-fast-none	v3_human_plus_demo	41.0%	0.0030	$0.8498	7,037,212
12	gpt-5-nano	v3_human_plus_demo	40.1%	0.0284	$2.6135	9,988,192
13	mistral-small-creative	v3_human_plus_demo	39.3%	0.0124	$0.4529	3,867,556
14	claude-haiku-4.5	v4_background	38.9%	0.0707	$8.2819	5,970,093
15	gpt-oss-20b	v4_background	38.8%	0.0674	$1.2320	11,628,944
16	gpt-5-nano	v2_human	37.7%	0.0078	$2.8796	11,400,115
17	deepseek-v3.2	v4_background	37.4%	0.0249	$3.0434	9,653,594
18	gpt-oss-120b	v3_human_plus_demo	37.2%	0.0557	$1.7409	6,221,295
19	gemini-3-flash-preview	v2_human	37.0%	0.0962	$2.7641	3,651,210
20	gemini-3-flash-preview	v1_empty	36.8%	0.1657	$2.9355	3,678,290
21	mistral-small-creative	v4_background	35.9%	0.0832	$0.6348	5,678,107
22	gpt-5-nano	v1_empty	35.6%	0.0650	$6.5044	19,518,351
23	qwen3-next-80b-a3b-instruct	v3_human_plus_demo	35.1%	0.1445	$0.8590	3,843,866
24	qwen3-next-80b-a3b-instruct	v1_empty	34.9%	0.1138	$0.8090	3,386,421
25	claude-haiku-4.5	v3_human_plus_demo	34.0%	0.0213	$6.4699	4,450,022
26	gpt-oss-120b	v4_background	33.7%	0.0074	$0.9912	8,149,909
27	deepseek-v3.2	v2_human	33.7%	0.0124	$0.8018	3,302,727
28	ai-grok-4.1-fast-none	v4_background	33.4%	-0.0279	$1.2736	7,232,883
29	gpt-oss-120b	v2_human	33.3%	0.0184	$1.6862	6,130,775
30	qwen3-next-80b-a3b-instruct	v2_human	33.1%	0.1798	$0.8273	3,474,833
31	gpt-oss-20b	v2_human	33.0%	0.0036	$1.3697	8,117,261
32	ai-grok-4.1-fast-none	v1_empty	31.9%	0.0717	$0.5784	5,291,271
33	claude-haiku-4.5	v1_empty	30.4%	0.0066	$9.2877	4,737,817
34	ai-grok-4.1-fast-none	v2_human	29.9%	0.0057	$0.5012	5,140,245
35	deepseek-v3.2	v3_human_plus_demo	29.7%	0.0516	$1.0522	3,753,460
36	claude-haiku-4.5	v2_human	29.3%	0.0586	$10.1626	4,988,373
37	deepseek-v3.2	v1_empty	29.3%	0.0462	$0.8000	3,253,924
38	gpt-oss-120b	v1_empty	28.5%	-0.0136	$1.6939	6,049,728
39	mistral-small-creative	v1_empty	25.9%	0.0445	$0.6975	4,905,549
40	mistral-small-creative	v2_human	12.6%	-0.0031	$0.6422	4,482,800
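
Because the leaderboard reports cost alongside PAS, rows can be compared on value as well as rank. The sketch below parses four rows copied from the table and sorts them by PAS points per dollar. That ratio is an illustrative quantity chosen here, not a metric the benchmark defines, and the parsing helper is hypothetical.

```python
# Four leaderboard rows copied from the table (first seven columns only).
ROWS = (
    "1\tgemini-3-flash-preview\tv3_human_plus_demo\t49.7%\t0.1593\t$2.7883\t3,891,506\n"
    "3\tgpt-5-nano\tv4_background\t45.9%\t0.0498\t$0.3919\t5,197,263\n"
    "6\tmistral-nemo\tv4_background\t43.2%\t0.0389\t$0.2004\t6,973,836\n"
    "14\tclaude-haiku-4.5\tv4_background\t38.9%\t0.0707\t$8.2819\t5,970,093\n"
)

def parse_row(line: str) -> dict:
    """Parse one tab-separated row; any trailing columns are ignored."""
    rank, model, variant, pas, ecs, cost, tokens = line.split("\t")[:7]
    return {
        "rank": int(rank),
        "model": model,
        "variant": variant,
        "pas": float(pas.rstrip("%")),          # percentage points
        "ecs": float(ecs),
        "cost": float(cost.lstrip("$")),        # value of the Cost column
        "tokens": int(tokens.replace(",", "")),
    }

entries = [parse_row(line) for line in ROWS.splitlines()]
for entry in entries:
    # Illustrative ratio only: PAS percentage points per unit of cost.
    entry["pas_per_dollar"] = entry["pas"] / entry["cost"]

by_value = sorted(entries, key=lambda e: e["pas_per_dollar"], reverse=True)

if __name__ == "__main__":
    for e in by_value:
        print(f'{e["model"]:28s} {e["variant"]:20s} {e["pas_per_dollar"]:8.1f}')
```

On these four rows the value ordering differs from the rank ordering: the top-ranked configuration is not the cheapest route to a given PAS.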

Study Dataset

An initial curated collection of 12 foundational human-subject studies spanning individual cognition, strategic interaction, and social psychology, all with complete experimental materials and clearly specified statistical tests.

Study	Citation	Domain	Phenomenon
The False Consensus Effect	Ross et al., 1977	Individual Cognition	False consensus bias
Measures of Anchoring	Jacowitz & Kahneman, 1995	Individual Cognition	Anchoring effect
Framing of Decisions	Tversky & Kahneman, 1981	Individual Cognition	Framing effect
Subjective Probability	Kahneman & Tversky, 1972	Individual Cognition	Representativeness heuristic
Intentional Action	Knobe, 2003	Social Psychology	Knobe effect
Forming Impressions	Asch, 1946	Social Psychology	Primacy effect
Social Categorization	Billig & Tajfel, 1973	Social Psychology	Minimal group paradigm
Pluralistic Ignorance	Prentice & Miller, 1993	Social Psychology	Pluralistic ignorance
Guessing Games	Nagel, 1995	Strategic Interaction	Keynesian beauty contest
Thinking through Uncertainty	Shafir & Tversky, 1992	Strategic Interaction	Disjunction effect
Fairness in Bargaining	Forsythe et al., 1994	Strategic Interaction	Dictator game giving
Trust and Reciprocity	Berg et al., 1995	Strategic Interaction	Trust game