# ADE-Bench Benchmark Results
altimate-code achieves 74.4% pass rate (32/43 tasks) — #1 on agentic data engineering benchmarks.
## About ADE-Bench
ADE-Bench is a benchmark created by Benn Stancil (founder of Mode) in collaboration with dbt Labs. It evaluates AI agents on real-world analytics and data engineering tasks using actual dbt projects and databases. Each task runs in a Docker container sandbox — the agent attempts to resolve the task, and success is measured by whether all dbt tests pass afterward. Tasks include realistic data problems: vague requests like "it’s broken," debugging, schema issues, and complex analytics queries.
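The pass/fail mechanics described above can be sketched roughly as follows. This is an illustrative sketch, not the actual ADE-Bench harness code; `run_dbt_tests` and `task_passed` are hypothetical names. It relies only on the documented behavior of the real `dbt test` command, which exits 0 only when every test passes.

```python
import subprocess

def run_dbt_tests(project_dir: str) -> int:
    """Run the project's dbt tests inside the task sandbox.

    'dbt test' exits with code 0 only when all tests pass."""
    proc = subprocess.run(["dbt", "test"], cwd=project_dir, capture_output=True)
    return proc.returncode

def task_passed(returncode: int) -> bool:
    # ADE-Bench's success criterion: the agent's attempt counts as a pass
    # only if every dbt test passes afterward.
    return returncode == 0
```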
## Test Configuration

| Setting | Value |
|---|---|
| Harness and LLM | altimate-code (Sonnet 4.6) |
| Database | DuckDB (local) |
| Total tasks | 43 (45 ran; 2 dev tasks excluded) |
| Max retries on failure | 3 |
| Best run | 32/43 (74.4%) |
| Worst run | 29/43 (67.4%) |
| Excluded | f1008, workday001 (dev tasks) |
## Benchmark Comparison

| Harness | Model | Pass Rate |
|---|---|---|
| altimate-code | Sonnet 4.6 | 74.4% |
| Cortex Code CLI | Opus 4.6 | 65% |
| dbt Labs | Sonnet 4.5 | 59% |
### Key Insight: The Harness Matters More Than the Model

Cortex Code CLI uses Opus 4.6 (a more capable, more expensive model) yet scores 65%, below altimate-code's 74.4% with Sonnet 4.6. Purpose-built tooling and deterministic operations outperform raw model capability alone.
The harness — not the model — is the differentiator.
## Per-Task Results
### Best Run: 32 passed, 9 failed (41 of 43 tasks listed)
| # | Task | Result | Tests Passed | Test Pass Rate |
|---|---|---|---|---|
| 1 | airbnb001 | ✗ | 8/10 | 80% |
| 2 | airbnb002 | ✗ | 9/11 | 82% |
| 3 | airbnb003 | ✓ | 7/7 | 100% |
| 4 | airbnb004 | ✓ | 2/2 | 100% |
| 5 | airbnb005 | ✓ | 4/4 | 100% |
| 6 | airbnb006 | ✓ | 7/7 | 100% |
| 7 | airbnb007 | ✗ | 8/11 | 73% |
| 8 | airbnb008 | ✗ | 2/4 | 50% |
| 9 | airbnb009 | ✓ | 1/1 | 100% |
| 10 | analytics_engineering001 | ✓ | 1/1 | 100% |
| 11 | analytics_engineering002 | ✓ | 2/2 | 100% |
| 12 | analytics_engineering003 | ✓ | 2/2 | 100% |
| 13 | analytics_engineering004 | ✓ | 2/2 | 100% |
| 14 | analytics_engineering005 | ✓ | 3/3 | 100% |
| 15 | analytics_engineering006 | ✓ | 7/7 | 100% |
| 16 | analytics_engineering007 | ✓ | 10/10 | 100% |
| 17 | analytics_engineering008 | ✓ | 1/1 | 100% |
| 18 | asana001 | ✓ | 2/2 | 100% |
| 19 | asana002 | ✓ | 3/3 | 100% |
| 20 | asana003 | ✓ | 17/17 | 100% |
| 21 | asana004 | ✗ | 5/6 | 83% |
| 22 | asana005 | ✓ | 8/8 | 100% |
| 23 | f1001 | ✓ | 6/6 | 100% |
| 24 | f1002 | ✗ | 9/10 | 90% |
| 25 | f1003 | ✓ | 4/4 | 100% |
| 26 | f1004 | ✓ | 2/2 | 100% |
| 27 | f1005 | ✓ | 4/4 | 100% |
| 28 | f1006 | ✓ | 4/4 | 100% |
| 29 | f1007 | ✓ | 6/6 | 100% |
| 30 | f1009 | ✓ | 1/1 | 100% |
| 31 | f1010 | ✓ | 2/2 | 100% |
| 32 | f1011 | ✗ | 4/6 | 67% |
| 33 | intercom001 | ✓ | 2/2 | 100% |
| 34 | intercom002 | ✓ | 4/4 | 100% |
| 35 | intercom003 | ✓ | 2/2 | 100% |
| 36 | quickbooks001 | ✓ | 6/6 | 100% |
| 37 | quickbooks002 | ✓ | 8/8 | 100% |
| 38 | quickbooks003 | ✗ | 5/14 | 36% |
| 39 | quickbooks004 | ✗ | 28/48 | 58% |
| 40 | simple001 | ✓ | 1/1 | 100% |
| 41 | simple002 | ✓ | 1/1 | 100% |
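The task-level verdicts above follow an all-or-nothing rule: a task counts as passed only when every one of its dbt tests passes, which is why airbnb001 fails despite an 80% test pass rate. A minimal sketch of that aggregation, using a few rows from the table (the tuple format is an assumption, not the benchmark's data format):

```python
# (task, tests_passed, tests_total) for a few rows from the table above.
rows = [
    ("airbnb001", 8, 10),      # fails: high test pass rate, but not 100%
    ("airbnb003", 7, 7),       # passes
    ("quickbooks003", 5, 14),  # fails
    ("simple001", 1, 1),       # passes
]

# A task passes only if all of its dbt tests pass.
passed = [name for name, ok, total in rows if ok == total]
print(f"{len(passed)}/{len(rows)} tasks passed")  # → 2/4 tasks passed
```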
## Retry Variance

| Run | Passed | Pass Rate |
|---|---|---|
| Best | 32/43 | 74.4% |
| Worst | 29/43 | 67.4% |
| Variance | 3 tasks | 7.0 pp |
Across 3 retries, Sonnet 4.6 showed a 7 percentage point spread between best and worst runs, indicating moderate non-determinism on borderline tasks.
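The arithmetic behind these figures, as a quick check:

```python
TOTAL_TASKS = 43
best, worst = 32, 29

best_rate = round(100 * best / TOTAL_TASKS, 1)    # 74.4
worst_rate = round(100 * worst / TOTAL_TASKS, 1)  # 67.4
spread_pp = round(best_rate - worst_rate, 1)      # 7.0 percentage points
```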
## Sources
- ADE-Bench on GitHub (dbt-labs) — Benchmark repository and methodology
- Snowflake Blog: Cortex Code CLI Expands Support — Cortex Code benchmark results (65%, Opus 4.6)
- dbt Labs Blog: Introducing ADE-Bench — dbt Labs benchmark results (59%, Sonnet 4.5)