ADE-Bench Benchmark Results
altimate-code achieves a 74.4% pass rate (32/43 tasks), ranking #1 on agentic data engineering benchmarks.
About ADE-Bench
ADE-Bench is a benchmark created by Benn Stancil (founder of Mode) in collaboration with dbt Labs. It evaluates AI agents on real-world analytics and data engineering tasks using actual dbt projects and databases. Each task runs in a Docker container sandbox — the agent attempts to resolve the task, and success is measured by whether all dbt tests pass afterward. Tasks include realistic data problems: vague requests like "it’s broken," debugging, schema issues, and complex analytics queries.
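The grading rule above — a task succeeds only if all dbt tests pass afterward — can be sketched in a few lines. The function names and structure here are illustrative, not ADE-Bench's actual harness code, but the artifact they read is real: after `dbt test`, dbt writes per-test statuses to `target/run_results.json`.

```python
import json
from pathlib import Path

def task_passed(run_results: dict) -> bool:
    """True only if every dbt test recorded in run_results.json passed."""
    results = run_results.get("results", [])
    return bool(results) and all(r.get("status") == "pass" for r in results)

def grade(target_dir: str) -> bool:
    """Hypothetical grader: read dbt's artifact from the sandbox and decide."""
    artifact = Path(target_dir) / "run_results.json"
    return task_passed(json.loads(artifact.read_text()))

# One failing test sinks the whole task — which is why near-misses
# (e.g. 11/12 tests passing) still count as failures per task.
example = {"results": [{"unique_id": "test.airbnb.not_null_id", "status": "pass"},
                       {"unique_id": "test.airbnb.unique_id", "status": "fail"}]}
print(task_passed(example))  # False
```

This all-or-nothing scoring explains why the per-task table below shows several failures with 85%+ of tests passing.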
Test Configuration
| Setting | Value |
|---|---|
| Harness and LLM | altimate-code (Sonnet 4.6) |
| Database | Snowflake |
| Total Tasks | 43 scored (45 ran; 2 excluded as dev tasks) |
| Max Retries on Failure | 3 |
| Best Run | 32/43 (74.4%) |
| Worst Run | 29/43 (67.4%) |
| Excluded | f1008, workday001 (dev tasks) |
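The headline percentages follow from simple arithmetic over the scored tasks; a quick check (helper name is illustrative):

```python
def pass_rate(passed: int, total: int) -> float:
    """Percentage of scored tasks solved, rounded to one decimal place."""
    return round(100 * passed / total, 1)

print(pass_rate(32, 43))  # best run:  74.4
print(pass_rate(29, 43))  # worst run: 67.4
```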
Benchmark Comparison
Agents evaluated on ADE-Bench with Snowflake (competitor figures from the sources listed below):

| Harness | Model | Pass Rate |
|---|---|---|
| altimate-code | Sonnet 4.6 | 74.4% |
| Cortex Code | Opus 4.6 | 65% |
| dbt Labs | Sonnet 4.5 | 59% |
Key Insight: The Harness Matters More Than the Model
On this benchmark, altimate-code running Sonnet 4.6 (74.4%) beats Cortex Code running Opus 4.6 (65%), a more capable, more expensive model. Purpose-built tooling and deterministic operations outperform raw model capability alone.
The harness — not the model — is the differentiator.
Per-Task Results — Snowflake
Best Run — 32 passed, 11 failed out of 43 tasks
| # | Task | Result | Tests Passed | Pass Rate |
|---|---|---|---|---|
| 1 | airbnb001 | ✓ | 11/11 | 100% |
| 2 | airbnb002 | ✓ | 12/12 | 100% |
| 3 | airbnb003 | ✓ | 8/8 | 100% |
| 4 | airbnb004 | ✓ | 3/3 | 100% |
| 5 | airbnb005 | ✗ | 4/5 | 80% |
| 6 | airbnb006 | ✓ | 8/8 | 100% |
| 7 | airbnb007 | ✗ | 11/12 | 92% |
| 8 | airbnb008 | ✓ | 5/5 | 100% |
| 9 | airbnb009 | ✓ | 2/2 | 100% |
| 10 | analytics_engineering001 | ✓ | 2/2 | 100% |
| 11 | analytics_engineering002 | ✓ | 3/3 | 100% |
| 12 | analytics_engineering002.medium | ✓ | 3/3 | 100% |
| 13 | analytics_engineering003 | ✓ | 3/3 | 100% |
| 14 | analytics_engineering004 | ✗ | 2/3 | 67% |
| 15 | analytics_engineering005 | ✓ | 4/4 | 100% |
| 16 | analytics_engineering006 | ✗ | 6/8 | 75% |
| 17 | analytics_engineering007 | ✓ | 11/11 | 100% |
| 18 | analytics_engineering007.medium | ✓ | 11/11 | 100% |
| 19 | asana001 | ✓ | 3/3 | 100% |
| 20 | asana002 | ✓ | 4/4 | 100% |
| 21 | asana003 | ✗ | 17/18 | 94% |
| 22 | asana004 | ✗ | 6/7 | 86% |
| 23 | asana005 | ✗ | 8/9 | 89% |
| 24 | asana005.hard | ✗ | 8/9 | 89% |
| 25 | f1001 | ✓ | 7/7 | 100% |
| 26 | f1002 | ✗ | 10/11 | 91% |
| 27 | f1003 | ✓ | 5/5 | 100% |
| 28 | f1003.hard | ✓ | 5/5 | 100% |
| 29 | f1004 | ✓ | 3/3 | 100% |
| 30 | f1005 | ✓ | 5/5 | 100% |
| 31 | f1005.medium | ✓ | 5/5 | 100% |
| 32 | f1006 | ✓ | 5/5 | 100% |
| 33 | f1006.hard | ✓ | 5/5 | 100% |
| 34 | f1007 | ✓ | 7/7 | 100% |
| 35 | f1007.hard | ✓ | 7/7 | 100% |
| 36 | f1007.medium | ✓ | 7/7 | 100% |
| 37 | f1009 | ✓ | 2/2 | 100% |
| 38 | f1010 | ✓ | 3/3 | 100% |
| 39 | f1010.medium | ✓ | 3/3 | 100% |
| 40 | f1011 | ✗ | 6/7 | 86% |
| 41 | intercom001 | ✗ | 2/3 | 67% |
| 42 | intercom002 | ✓ | 5/5 | 100% |
| 43 | intercom003 | ✓ | 3/3 | 100% |
Sources
- ADE-Bench on GitHub (dbt-labs) — Benchmark repository and methodology
- Snowflake Blog: Cortex Code CLI Expands Support — Cortex Code benchmark results (65%, Opus 4.6)
- dbt Labs Blog: Introducing ADE-Bench — dbt Labs benchmark results (59%, Sonnet 4.5)