ADE-Bench Benchmark Results

altimate-code achieves a 74.4% pass rate (32/43 tasks) on ADE-Bench, ranking #1 among the agents compared on this agentic data engineering benchmark.

About ADE-Bench

ADE-Bench is a benchmark created by Benn Stancil (founder of Mode) in collaboration with dbt Labs. It evaluates AI agents on real-world analytics and data engineering tasks using actual dbt projects and databases. Each task runs in a Docker container sandbox — the agent attempts to resolve the task, and success is measured by whether all dbt tests pass afterward. Tasks include realistic data problems: vague requests like "it’s broken," debugging, schema issues, and complex analytics queries.
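The pass criterion described above can be sketched as a small check. This is only an illustration, not ADE-Bench's actual harness code: the `TaskResult` class and its field names are hypothetical, and the sketch assumes the only inputs are how many dbt tests ran and how many passed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record of one benchmark task (not an ADE-Bench type)."""

    name: str
    tests_passed: int
    tests_total: int

    @property
    def passed(self) -> bool:
        # A task counts as solved only when every dbt test passes.
        return self.tests_total > 0 and self.tests_passed == self.tests_total

    @property
    def test_pass_rate(self) -> float:
        # Share of dbt tests that passed, as a percentage.
        return 100 * self.tests_passed / self.tests_total


partial = TaskResult("airbnb001", tests_passed=8, tests_total=10)
full = TaskResult("airbnb003", tests_passed=7, tests_total=7)
print(partial.passed, full.passed)
```

Under this rule, a task with 8 of 10 tests passing still counts as a failure, so a task's own test pass rate and its contribution to the overall task pass rate can differ.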

Test Configuration

Harness and LLM: altimate-code (Sonnet 4.6)
Database: DuckDB (local)
Total Tasks: 43 (45 ran, 2 excluded due to dev tasks)
Max Retries on Failures: 3
Best Run: 32/43 (74.4%)
Worst Run: 29/43 (67.4%)
Excluded: f1008, workday001 (dev tasks)
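The headline figures follow directly from the task counts in this configuration. A minimal sketch of the arithmetic, where `pass_rate` is a hypothetical helper and not part of any benchmark tooling:

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the task pass rate as a percentage, rounded to one decimal."""
    return round(100 * passed / total, 1)


tasks_ran = 45
excluded = ["f1008", "workday001"]  # dev tasks, per the configuration
total_tasks = tasks_ran - len(excluded)

best_rate = pass_rate(32, total_tasks)
worst_rate = pass_rate(29, total_tasks)
print(total_tasks, best_rate, worst_rate)  # 43 74.4 67.4
```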

Benchmark Comparison

Agent             Model                    Tasks Passed   Pass Rate
altimate-code     Sonnet 4.6               32/43          74.4%
Cortex Code CLI   Opus 4.6                 28/43          65%
dbt Labs          Sonnet 4.5               ~25/43         59%
Claude Code       Sonnet 4.6 (baseline)    ~17/43         40%
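To make the gaps between entries explicit, the reported pass rates can be ranked and compared with the top score. This sketch uses only the percentages listed in the comparison. Two of the task counts are marked approximate there, so the gaps for those entries should be read as indicative.

```python
# (agent, model, reported pass rate in %) as listed in the comparison
results = [
    ("altimate-code", "Sonnet 4.6", 74.4),
    ("Cortex Code CLI", "Opus 4.6", 65.0),
    ("dbt Labs", "Sonnet 4.5", 59.0),
    ("Claude Code", "Sonnet 4.6", 40.0),
]

ranked = sorted(results, key=lambda r: r[2], reverse=True)
top_rate = ranked[0][2]

# Gap to the leader in percentage points, rounded to avoid float noise.
gaps = {agent: round(top_rate - rate, 1) for agent, _, rate in ranked}
for agent, gap in gaps.items():
    print(f"{agent}: {gap} pp behind the top score")
```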

Key Insight: The Harness Matters More Than the Model

Cortex Code CLI uses Opus 4.6, a more capable and more expensive model, yet scores lower than altimate-code with Sonnet 4.6 (65% vs. 74.4%). On the same model, Sonnet 4.6, the Claude Code baseline reaches roughly 40%. These results suggest that purpose-built tooling and deterministic operations can outperform raw model capability alone.

The harness — not the model — is the differentiator.

Per-Task Results

Best Run — 32 passed, 9 failed out of 41 tasks

#   Task                       Result  Score   Pass Rate
1   airbnb001                  Fail    8/10    80%
2   airbnb002                  Fail    9/11    82%
3   airbnb003                  Pass    7/7     100%
4   airbnb004                  Pass    2/2     100%
5   airbnb005                  Pass    4/4     100%
6   airbnb006                  Pass    7/7     100%
7   airbnb007                  Fail    8/11    73%
8   airbnb008                  Fail    2/4     50%
9   airbnb009                  Pass    1/1     100%
10  analytics_engineering001   Pass    1/1     100%
11  analytics_engineering002   Pass    2/2     100%
12  analytics_engineering003   Pass    2/2     100%
13  analytics_engineering004   Pass    2/2     100%
14  analytics_engineering005   Pass    3/3     100%
15  analytics_engineering006   Pass    7/7     100%
16  analytics_engineering007   Pass    10/10   100%
17  analytics_engineering008   Pass    1/1     100%
18  asana001                   Pass    2/2     100%
19  asana002                   Pass    3/3     100%
20  asana003                   Pass    17/17   100%
21  asana004                   Fail    5/6     83%
22  asana005                   Pass    8/8     100%
23  f1001                      Pass    6/6     100%
24  f1002                      Fail    9/10    90%
25  f1003                      Pass    4/4     100%
26  f1004                      Pass    2/2     100%
27  f1005                      Pass    4/4     100%
28  f1006                      Pass    4/4     100%
29  f1007                      Pass    6/6     100%
30  f1009                      Pass    1/1     100%
31  f1010                      Pass    2/2     100%
32  f1011                      Fail    4/6     67%
33  intercom001                Pass    2/2     100%
34  intercom002                Pass    4/4     100%
35  intercom003                Pass    2/2     100%
36  quickbooks001              Pass    6/6     100%
37  quickbooks002              Pass    8/8     100%
38  quickbooks003              Fail    5/14    36%
39  quickbooks004              Fail    28/48   58%
40  simple001                  Pass    1/1     100%
41  simple002                  Pass    1/1     100%
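The pass and fail counts for the best run can be recomputed from the per-task scores. The sketch below copies the scores from the table and treats a task as passed only when all of its dbt tests pass. The `parse_scores` helper and the `SCORES` text layout are hypothetical conveniences for this example.

```python
# Best-run scores copied from the per-task table: "task passed/total".
SCORES = """
airbnb001 8/10
airbnb002 9/11
airbnb003 7/7
airbnb004 2/2
airbnb005 4/4
airbnb006 7/7
airbnb007 8/11
airbnb008 2/4
airbnb009 1/1
analytics_engineering001 1/1
analytics_engineering002 2/2
analytics_engineering003 2/2
analytics_engineering004 2/2
analytics_engineering005 3/3
analytics_engineering006 7/7
analytics_engineering007 10/10
analytics_engineering008 1/1
asana001 2/2
asana002 3/3
asana003 17/17
asana004 5/6
asana005 8/8
f1001 6/6
f1002 9/10
f1003 4/4
f1004 2/2
f1005 4/4
f1006 4/4
f1007 6/6
f1009 1/1
f1010 2/2
f1011 4/6
intercom001 2/2
intercom002 4/4
intercom003 2/2
quickbooks001 6/6
quickbooks002 8/8
quickbooks003 5/14
quickbooks004 28/48
simple001 1/1
simple002 1/1
"""


def parse_scores(text: str) -> dict:
    """Map each task name to a (tests_passed, tests_total) tuple."""
    scores = {}
    for line in text.strip().splitlines():
        task, score = line.split()
        passed, total = score.split("/")
        scores[task] = (int(passed), int(total))
    return scores


scores = parse_scores(SCORES)
solved = [task for task, (p, n) in scores.items() if p == n]
failed = [task for task, (p, n) in scores.items() if p < n]
print(len(scores), len(solved), len(failed))  # 41 32 9
print(failed)
```

The failed list shows that several tasks missed by only one or two tests, for example asana004 (5/6) and f1002 (9/10).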

Retry Variance

Run        Passed    Pass Rate
Best       32/43     74.4%
Worst      29/43     67.4%
Variance   3 tasks   7.0pp

Across 3 retries, Sonnet 4.6 showed a 7.0 percentage point gap (3 tasks) between the best and worst runs, indicating moderate non-determinism on borderline tasks.
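The reported figure is the range between the best and worst runs, not a statistical variance. A short sketch of the calculation, using only the counts reported here (the variable names are illustrative):

```python
total_tasks = 43
best_passed, worst_passed = 32, 29

# Difference in solved tasks, then expressed in percentage points.
task_gap = best_passed - worst_passed
gap_pp = round(100 * task_gap / total_tasks, 1)
print(task_gap, gap_pp)  # 3 7.0
```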

Sources