HumanStudy-Bench: Towards AI Agent Design for Participant Simulation
A reusable platform that converts human-study research papers into a standardized testbed where AI agents replay human-subject experiments end-to-end, evaluating how well agents align with human participants at the level of scientific inference.
What is HumanStudy-Bench?
HumanStudy-Bench treats participant simulation as an agent design problem. It provides a standardized testbed with two components: an Execution Engine that reconstructs full experimental protocols from published studies, and a Benchmark with standardized evaluation metrics. Together they let agents replay human-subject experiments end-to-end, with alignment evaluated at the level of scientific inference.
Standardized Testbed
Test different agent designs on the same experiments, run agents through real studies covering 6,000+ trials, and compare results rigorously using inferential-level metrics.
Pipeline Architecture
From published human studies to a reusable simulation environment in four stages.
- Stage 1: Filter
Curates human studies that are scientifically important and practically reproducible, ensuring full experimental details, quantifiable outcomes, and simulation feasibility.
- Stage 2: Extract
Extracts participant profiles, experimental designs, statistical tests, and human ground-truth outcomes from unstructured papers into machine-executable representations.
- Stage 3: Execute
Runs agent designs through reconstructed experimental protocols, generating trial-level data via a shared execution engine that handles agent sampling, instruction dispatch, and response collection.
- Stage 4: Evaluate
Compares agent responses against human ground truth using the Probability Alignment Score (PAS) for inferential agreement and the Effect Consistency Score (ECS) for effect-size alignment.
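The four stages above can be pictured as a small data flow. The sketch below is only an illustration under stated assumptions: `StudySpec`, `filter_studies`, `execute`, `evaluate`, and `constant_agent` are hypothetical names, not the project's API, and the Stage 4 function uses a simple mean gap as a stand-in rather than the actual PAS or ECS computation.

```python
import random
from dataclasses import dataclass
from typing import Callable

# All names here are hypothetical and only illustrate the four stages.


@dataclass
class StudySpec:
    """Stage 2 output: a machine-executable description of one study."""
    study_id: str
    instructions: str
    participant_profiles: list[dict]
    human_outcome: float  # human ground-truth statistic from the paper
    has_full_protocol: bool = True


# An agent maps (participant profile, instructions) to a numeric response.
Agent = Callable[[dict, str], float]


def filter_studies(candidates: list[StudySpec]) -> list[StudySpec]:
    """Stage 1: keep only studies with enough detail to be replayed."""
    return [s for s in candidates if s.has_full_protocol and s.participant_profiles]


def execute(study: StudySpec, agent: Agent, n_trials: int, seed: int = 0) -> list[float]:
    """Stage 3: sample a profile, dispatch instructions, collect the response."""
    rng = random.Random(seed)
    responses = []
    for _ in range(n_trials):
        profile = rng.choice(study.participant_profiles)
        responses.append(agent(profile, study.instructions))
    return responses


def evaluate(study: StudySpec, responses: list[float]) -> dict:
    """Stage 4 stand-in: gap between agent mean and human outcome (not PAS/ECS)."""
    agent_mean = sum(responses) / len(responses)
    return {
        "study_id": study.study_id,
        "agent_mean": agent_mean,
        "human_outcome": study.human_outcome,
        "abs_gap": abs(agent_mean - study.human_outcome),
    }


def constant_agent(profile: dict, instructions: str) -> float:
    """Trivial agent design that ignores its inputs."""
    return 0.5


candidates = [
    StudySpec("demo", "Rate the statement from 0 to 1.", [{"age": 25}, {"age": 40}], 0.6),
    StudySpec("incomplete", "", [], 0.0, has_full_protocol=False),
]
runnable = filter_studies(candidates)
report = evaluate(runnable[0], execute(runnable[0], constant_agent, n_trials=10))
print(report)
```

Keeping the agent as a plain callable means different agent designs can be swapped in while the sampling, dispatch, and collection loop stays shared, which mirrors the role the text gives to the execution engine.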
Evaluation Metrics
PAS: Probability Alignment Score
Measures whether agents reach the same scientific conclusions as humans at the phenomenon level. It quantifies the probability that agent and human populations exhibit behavior consistent with the same hypothesis.
ECS: Effect Consistency Score
Measures how closely agents reproduce the magnitude and pattern of human behavioral effects at the data level. It assesses both the precision and accuracy of agent responses compared to human ground truth.
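The exact formulas for PAS and ECS are not given in this section, so the following is a sketch of two standard quantities that match the verbal descriptions, not the benchmark's definitions. The first function assumes that each population supports a hypothesis with some probability and returns the chance that both land on the same side. The second is Lin's concordance correlation coefficient, a common statistic that penalizes both scatter (precision) and systematic offset (accuracy). Both function names are hypothetical.

```python
from statistics import fmean


def agreement_probability(p_human: float, p_agent: float) -> float:
    """Probability that two independent populations agree on a hypothesis.

    Assumption: p_human and p_agent are the probabilities that each
    population's data supports the hypothesis. This illustrates the idea
    behind PAS; it is not the benchmark's formula.
    """
    return p_human * p_agent + (1.0 - p_human) * (1.0 - p_agent)


def concordance_correlation(human_effects: list[float], agent_effects: list[float]) -> float:
    """Lin's concordance correlation coefficient between paired effect sizes.

    Assumption: used here as a stand-in for an ECS-like score. It equals 1
    only when every agent effect matches the paired human effect exactly.
    """
    if len(human_effects) != len(agent_effects) or len(human_effects) < 2:
        raise ValueError("need two paired sequences with at least 2 values")
    mean_h = fmean(human_effects)
    mean_a = fmean(agent_effects)
    # Population (divide-by-n) variances and covariance.
    var_h = fmean([(h - mean_h) ** 2 for h in human_effects])
    var_a = fmean([(a - mean_a) ** 2 for a in agent_effects])
    cov = fmean([(h - mean_h) * (a - mean_a) for h, a in zip(human_effects, agent_effects)])
    denom = var_h + var_a + (mean_h - mean_a) ** 2
    if denom == 0:
        raise ValueError("both sequences are constant and identical")
    return 2.0 * cov / denom


print(agreement_probability(0.9, 0.8))
print(concordance_correlation([0.2, 0.5, 0.8], [0.2, 0.5, 0.8]))  # perfect match
print(concordance_correlation([0.2, 0.5, 0.8], [0.4, 0.7, 1.0]))  # right pattern, shifted
```

The third call shows why a concordance-style score is stricter than a plain correlation: the agent effects follow the human pattern perfectly but are offset, so the score drops below 1.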
Leaderboard
Evaluating agent design alignment with human behavior using Probability Alignment Score (PAS) and Effect Consistency Score (ECS) across 12 foundational human-subject studies.
| Rank | Model | Variant | PAS (Alignment) | ECS | Cost | Tokens |
|---|---|---|---|---|---|---|
| 1 | gemini-3-flash-preview | v3_human_plus_demo | 49.7% | 0.1593 | $2.7883 | 3,891,506 |
| 2 | gemini-3-flash-preview | v4_background | 46.5% | 0.0076 | $5.2743 | 7,786,439 |
| 3 | gpt-5-nano | v4_background | 45.9% | 0.0498 | $0.3919 | 5,197,263 |
| 4 | mistral-nemo | v3_human_plus_demo | 44.0% | 0.1311 | $0.4451 | 4,789,725 |
| 5 | qwen3-next-80b-a3b-instruct | v4_background | 43.4% | 0.1142 | $0.9527 | 4,695,595 |
| 6 | mistral-nemo | v4_background | 43.2% | 0.0389 | $0.2004 | 6,973,836 |
| 7 | mistral-nemo | v1_empty | 42.7% | 0.0700 | $0.3916 | 4,277,209 |
| 8 | gpt-oss-20b | v1_empty | 41.9% | 0.0284 | $1.4158 | 8,450,730 |
| 9 | gpt-oss-20b | v3_human_plus_demo | 41.8% | 0.0223 | $1.1677 | 7,561,478 |
| 10 | mistral-nemo | v2_human | 41.1% | 0.0339 | $0.3872 | 4,370,464 |
| 11 | ai-grok-4.1-fast-none | v3_human_plus_demo | 41.0% | 0.0030 | $0.8498 | 7,037,212 |
| 12 | gpt-5-nano | v3_human_plus_demo | 40.1% | 0.0284 | $2.6135 | 9,988,192 |
| 13 | mistral-small-creative | v3_human_plus_demo | 39.3% | 0.0124 | $0.4529 | 3,867,556 |
| 14 | claude-haiku-4.5 | v4_background | 38.9% | 0.0707 | $8.2819 | 5,970,093 |
| 15 | gpt-oss-20b | v4_background | 38.8% | 0.0674 | $1.2320 | 11,628,944 |
| 16 | gpt-5-nano | v2_human | 37.7% | 0.0078 | $2.8796 | 11,400,115 |
| 17 | deepseek-v3.2 | v4_background | 37.4% | 0.0249 | $3.0434 | 9,653,594 |
| 18 | gpt-oss-120b | v3_human_plus_demo | 37.2% | 0.0557 | $1.7409 | 6,221,295 |
| 19 | gemini-3-flash-preview | v2_human | 37.0% | 0.0962 | $2.7641 | 3,651,210 |
| 20 | gemini-3-flash-preview | v1_empty | 36.8% | 0.1657 | $2.9355 | 3,678,290 |
| 21 | mistral-small-creative | v4_background | 35.9% | 0.0832 | $0.6348 | 5,678,107 |
| 22 | gpt-5-nano | v1_empty | 35.6% | 0.0650 | $6.5044 | 19,518,351 |
| 23 | qwen3-next-80b-a3b-instruct | v3_human_plus_demo | 35.1% | 0.1445 | $0.8590 | 3,843,866 |
| 24 | qwen3-next-80b-a3b-instruct | v1_empty | 34.9% | 0.1138 | $0.8090 | 3,386,421 |
| 25 | claude-haiku-4.5 | v3_human_plus_demo | 34.0% | 0.0213 | $6.4699 | 4,450,022 |
| 26 | gpt-oss-120b | v4_background | 33.7% | 0.0074 | $0.9912 | 8,149,909 |
| 27 | deepseek-v3.2 | v2_human | 33.7% | 0.0124 | $0.8018 | 3,302,727 |
| 28 | ai-grok-4.1-fast-none | v4_background | 33.4% | -0.0279 | $1.2736 | 7,232,883 |
| 29 | gpt-oss-120b | v2_human | 33.3% | 0.0184 | $1.6862 | 6,130,775 |
| 30 | qwen3-next-80b-a3b-instruct | v2_human | 33.1% | 0.1798 | $0.8273 | 3,474,833 |
| 31 | gpt-oss-20b | v2_human | 33.0% | 0.0036 | $1.3697 | 8,117,261 |
| 32 | ai-grok-4.1-fast-none | v1_empty | 31.9% | 0.0717 | $0.5784 | 5,291,271 |
| 33 | claude-haiku-4.5 | v1_empty | 30.4% | 0.0066 | $9.2877 | 4,737,817 |
| 34 | ai-grok-4.1-fast-none | v2_human | 29.9% | 0.0057 | $0.5012 | 5,140,245 |
| 35 | deepseek-v3.2 | v3_human_plus_demo | 29.7% | 0.0516 | $1.0522 | 3,753,460 |
| 36 | claude-haiku-4.5 | v2_human | 29.3% | 0.0586 | $10.1626 | 4,988,373 |
| 37 | deepseek-v3.2 | v1_empty | 29.3% | 0.0462 | $0.8000 | 3,253,924 |
| 38 | gpt-oss-120b | v1_empty | 28.5% | -0.0136 | $1.6939 | 6,049,728 |
| 39 | mistral-small-creative | v1_empty | 25.9% | 0.0445 | $0.6975 | 4,905,549 |
| 40 | mistral-small-creative | v2_human | 12.6% | -0.0031 | $0.6422 | 4,482,800 |
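Because the leaderboard reports cost and token counts next to PAS, it can also be read for cost efficiency. The sketch below copies four rows from the table and derives two quantities that are not part of the benchmark: PAS points per dollar and dollars per million tokens. It assumes the Cost column is the total spend in US dollars for the run; the page does not say exactly what that figure covers.

```python
# Four rows copied from the leaderboard:
# (model, variant, PAS in percent, cost in dollars, tokens)
rows = [
    ("gemini-3-flash-preview", "v3_human_plus_demo", 49.7, 2.7883, 3_891_506),
    ("gpt-5-nano", "v4_background", 45.9, 0.3919, 5_197_263),
    ("mistral-nemo", "v4_background", 43.2, 0.2004, 6_973_836),
    ("claude-haiku-4.5", "v4_background", 38.9, 8.2819, 5_970_093),
]


def efficiency(row: tuple) -> dict:
    """Derive illustrative cost-efficiency figures for one leaderboard row."""
    model, variant, pas, cost, tokens = row
    return {
        "model": model,
        "variant": variant,
        "pas_per_dollar": pas / cost,
        "dollars_per_million_tokens": cost / (tokens / 1_000_000),
    }


# Sort the selected rows from most to least PAS per dollar.
ranked = sorted((efficiency(r) for r in rows), key=lambda d: d["pas_per_dollar"], reverse=True)

for entry in ranked:
    print(
        f'{entry["model"]:<26} {entry["variant"]:<20} '
        f'{entry["pas_per_dollar"]:8.2f} PAS/$  '
        f'{entry["dollars_per_million_tokens"]:6.3f} $/Mtok'
    )
```

Among these four rows the ordering by PAS per dollar differs from the ordering by PAS alone, so a cheaper configuration a few points behind the leader may be the better choice when many studies have to be replayed.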
Study Dataset
A curated collection of 12 foundational human-subject studies spanning individual cognition, strategic interaction, and social psychology, all with complete experimental materials and clearly specified statistical tests.