ADE-Bench Benchmark Results
altimate-code achieves a 74.4% pass rate (32/43 tasks), ranking #1 on agentic data engineering benchmarks.
About ADE-Bench
ADE-Bench is a benchmark created by Benn Stancil (founder of Mode) in collaboration with dbt Labs. It evaluates AI agents on real-world analytics and data engineering tasks using actual dbt projects and databases. Each task runs in a Docker container sandbox — the agent attempts to resolve the task, and success is measured by whether all dbt tests pass afterward. Tasks include realistic data problems: vague requests like "it’s broken," debugging, schema issues, and complex analytics queries.
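The grading rule above — a task succeeds only if all dbt tests pass afterward — can be sketched in a few lines. The function names and structure here are illustrative, not ADE-Bench's actual harness code, but the artifact they read is real: after `dbt test`, dbt writes per-test statuses to `target/run_results.json`.

```python
import json
from pathlib import Path

def task_passed(run_results: dict) -> bool:
    """True only if every dbt test recorded in run_results.json passed."""
    results = run_results.get("results", [])
    return bool(results) and all(r.get("status") == "pass" for r in results)

def grade(target_dir: str) -> bool:
    """Hypothetical grader: read dbt's artifact from the sandbox and decide."""
    artifact = Path(target_dir) / "run_results.json"
    return task_passed(json.loads(artifact.read_text()))

# One failing test sinks the whole task — which is why near-misses
# (e.g. 11/12 tests passing) still count as failures per task.
example = {"results": [{"unique_id": "test.airbnb.not_null_id", "status": "pass"},
                       {"unique_id": "test.airbnb.unique_id", "status": "fail"}]}
print(task_passed(example))  # False
```

This all-or-nothing scoring explains why the per-task table below shows several failures with 85%+ of tests passing.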
Test Configuration
| Setting | Value |
|---|---|
| Harness and LLM | altimate-code (Sonnet 4.6) |
| Database | Snowflake |
| Total Tasks | 43 scored (45 ran; 2 excluded as dev tasks) |
| Max Retries on Failure | 3 |
| Best Run | 32/43 (74.4%) |
| Worst Run | 29/43 (67.4%) |
| Excluded | f1008, workday001 (dev tasks) |
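The headline percentages follow from simple arithmetic over the scored tasks; a quick check (helper name is illustrative):

```python
def pass_rate(passed: int, total: int) -> float:
    """Percentage of scored tasks solved, rounded to one decimal place."""
    return round(100 * passed / total, 1)

print(pass_rate(32, 43))  # best run:  74.4
print(pass_rate(29, 43))  # worst run: 67.4
```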
Benchmark Comparison
Agents evaluated on ADE-Bench with Snowflake (competitor figures from the sources listed below):

| Harness | Model | Pass Rate |
|---|---|---|
| altimate-code | Sonnet 4.6 | 74.4% |
| Cortex Code | Opus 4.6 | 65% |
| dbt Labs | Sonnet 4.5 | 59% |
Key Insight: The Harness Matters More Than the Model
On this benchmark, altimate-code running Sonnet 4.6 (74.4%) beats Cortex Code running Opus 4.6 (65%), a more capable, more expensive model. Purpose-built tooling and deterministic operations outperform raw model capability alone.
The harness — not the model — is the differentiator.
Per-Task Results — Snowflake
Best Run — 32 passed, 11 failed out of 43 tasks
| # | Task | Result | Tests Passed | Pass Rate |
|---|---|---|---|---|
| 1 | airbnb001 | ✓ | 11/11 | 100% |
| 2 | airbnb002 | ✓ | 12/12 | 100% |
| 3 | airbnb003 | ✓ | 8/8 | 100% |
| 4 | airbnb004 | ✓ | 3/3 | 100% |
| 5 | airbnb005 | ✗ | 4/5 | 80% |
| 6 | airbnb006 | ✓ | 8/8 | 100% |
| 7 | airbnb007 | ✗ | 11/12 | 92% |
| 8 | airbnb008 | ✓ | 5/5 | 100% |
| 9 | airbnb009 | ✓ | 2/2 | 100% |
| 10 | analytics_engineering001 | ✓ | 2/2 | 100% |
| 11 | analytics_engineering002 | ✓ | 3/3 | 100% |
| 12 | analytics_engineering002.medium | ✓ | 3/3 | 100% |
| 13 | analytics_engineering003 | ✓ | 3/3 | 100% |
| 14 | analytics_engineering004 | ✗ | 2/3 | 67% |
| 15 | analytics_engineering005 | ✓ | 4/4 | 100% |
| 16 | analytics_engineering006 | ✗ | 6/8 | 75% |
| 17 | analytics_engineering007 | ✓ | 11/11 | 100% |
| 18 | analytics_engineering007.medium | ✓ | 11/11 | 100% |
| 19 | asana001 | ✓ | 3/3 | 100% |
| 20 | asana002 | ✓ | 4/4 | 100% |
| 21 | asana003 | ✗ | 17/18 | 94% |
| 22 | asana004 | ✗ | 6/7 | 86% |
| 23 | asana005 | ✗ | 8/9 | 89% |
| 24 | asana005.hard | ✗ | 8/9 | 89% |
| 25 | f1001 | ✓ | 7/7 | 100% |
| 26 | f1002 | ✗ | 10/11 | 91% |
| 27 | f1003 | ✓ | 5/5 | 100% |
| 28 | f1003.hard | ✓ | 5/5 | 100% |
| 29 | f1004 | ✓ | 3/3 | 100% |
| 30 | f1005 | ✓ | 5/5 | 100% |
| 31 | f1005.medium | ✓ | 5/5 | 100% |
| 32 | f1006 | ✓ | 5/5 | 100% |
| 33 | f1006.hard | ✓ | 5/5 | 100% |
| 34 | f1007 | ✓ | 7/7 | 100% |
| 35 | f1007.hard | ✓ | 7/7 | 100% |
| 36 | f1007.medium | ✓ | 7/7 | 100% |
| 37 | f1009 | ✓ | 2/2 | 100% |
| 38 | f1010 | ✓ | 3/3 | 100% |
| 39 | f1010.medium | ✓ | 3/3 | 100% |
| 40 | f1011 | ✗ | 6/7 | 86% |
| 41 | intercom001 | ✗ | 2/3 | 67% |
| 42 | intercom002 | ✓ | 5/5 | 100% |
| 43 | intercom003 | ✓ | 3/3 | 100% |
Sources
- ADE-Bench on GitHub (dbt-labs) — Benchmark repository and methodology
- Snowflake Blog: Cortex Code CLI Expands Support — Cortex Code benchmark results (65%, Opus 4.6)
- dbt Labs Blog: Introducing ADE-Bench — dbt Labs benchmark results (59%, Sonnet 4.5)