ACES v2 — Effective April 2026 · Anchors recalibrated June 2026

Evaluation Methodology

A systematic framework for assessing agentic developer tools across six dimensions of enterprise relevance.

The ACES Framework

ACES v2 is the evaluation methodology powering Agentic Tools Radar — every tracked tool is scored on six dimensions using a 5-band comparative rubric, then assigned a signal level based on the quality and recency of supporting evidence.

The six dimension scores, each on a 1–20 scale, are averaged and multiplied by 5 to produce a rating on a 0–100 scale. A signal level (Validated, Assessed, Tracked, or Detected) reflects the quality of supporting evidence. An evidence grade (A–D) summarizes recency, depth, and hands-on testing.

Six Evaluation Dimensions

Each tool is assessed across six equally-weighted dimensions. Scores run 1–20 across five comparative bands.

Dimension | What It Measures
Autonomy | Ability to plan and execute tasks independently
Integration | VCS, CI/CD, project management, and team connectivity
Context | Depth of codebase and ecosystem understanding
Compliance | Enterprise governance: certs, access controls, audit, deployment
Viability | Vendor sustainability and developer experience
Interface | Interaction surface breadth and maturity

Scoring Scale — 5-Band Rubric

Each dimension uses the same five comparative bands on the 1–20 scale. Bands are defined relative to the category cohort, not by counting features: the middle band must be earned by clearing named downward triggers, and the top band requires hands-on or independent third-party evidence (vendor-only claims cap a dimension at 16). Detailed per-dimension anchors live in the scoring guide.

1–5: Absent / Disqualifying

Capability essentially missing, or an active disqualifier for this dimension

6–10: Below Baseline

Present but below what a serious tool in this category is expected to offer; named gaps pull a tool down here

11–13: Meets Baseline

Meets the expected baseline for a credible tool in its cohort. This is not a default: a tool must clear the Below-Baseline downward triggers to be here

14–16: Above Cohort

Demonstrably better than the cohort median on this dimension, with specific differentiators named relative to ≥2 named competitors

17–20: Best-in-Class

Sets the bar the category measures against. Evidence-gated: requires hands-on or independent third-party evidence; vendor-only claims cap this dimension at 16
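
As an illustration of how the bands and the evidence gate interact, here is a minimal Python sketch. The band boundaries and the cap of 16 come from the rubric above; the function names (`gate_score`, `band_for`) and the boolean `independent_evidence` flag are hypothetical and are not part of any published ACES tooling.

```python
# Band boundaries as listed in the rubric above (inclusive on both ends).
BANDS = [
    (1, 5, "Absent / Disqualifying"),
    (6, 10, "Below Baseline"),
    (11, 13, "Meets Baseline"),
    (14, 16, "Above Cohort"),
    (17, 20, "Best-in-Class"),
]

# Vendor-only claims cap a dimension at 16 (top of "Above Cohort").
VENDOR_ONLY_CAP = 16


def gate_score(proposed: int, independent_evidence: bool) -> int:
    """Apply the evidence gate to a proposed 1-20 dimension score.

    `independent_evidence` is an assumed flag meaning "hands-on or
    independent third-party evidence exists for this dimension".
    """
    if not 1 <= proposed <= 20:
        raise ValueError("dimension scores run 1-20")
    return proposed if independent_evidence else min(proposed, VENDOR_ONLY_CAP)


def band_for(score: int) -> str:
    """Return the band name for a 1-20 dimension score."""
    for low, high, name in BANDS:
        if low <= score <= high:
            return name
    raise ValueError("dimension scores run 1-20")
```

Under these assumptions, a proposed 18 backed only by vendor claims is gated down to 16 and lands in Above Cohort, while the same 18 with hands-on evidence stays in Best-in-Class.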

Rating & Tiers

The Rating is the pure capability score (0–100) — what the tool can do when working as intended. It is an internal sort key. What you see in the UI is the tier derived from that score.

Formula:

Rating = avg(Autonomy, Integration, Context, Compliance, Viability, Interface) × 5

Each dimension is scored 1–20. The average across the six dimensions is multiplied by 5 to map it onto a 0–100 scale: a 20 in every dimension gives 20 × 5 = 100, and because the lowest dimension score is 1, the lowest attainable rating is 5. In practice, no tool scores 20 in every dimension, so the practical ceiling for well-rounded tools is in the mid-to-high 70s.

Example Calculation:

Autonomy: 12, Integration: 15, Context: 16, Compliance: 12, Viability: 15, Interface: 17
Average = (12 + 15 + 16 + 12 + 15 + 17) ÷ 6 = 87 ÷ 6 = 14.5
Rating = 14.5 × 5 = 72.5 → displayed as 73 → tier: Proven
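
The calculation above can be reproduced with a short Python sketch. The formula is the one given in this section; the function names are illustrative, and rounding half up is an assumption inferred from the single worked example (72.5 displayed as 73), since the rounding rule is not stated explicitly. `Decimal` is used because Python's built-in `round` rounds halves to the nearest even number and would return 72 for 72.5.

```python
from decimal import Decimal, ROUND_HALF_UP

DIMENSIONS = (
    "Autonomy", "Integration", "Context",
    "Compliance", "Viability", "Interface",
)


def rating(scores: dict[str, int]) -> Decimal:
    """Rating = avg(six dimension scores) x 5, kept exact as a Decimal."""
    missing = set(DIMENSIONS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for dim in DIMENSIONS:
        if not 1 <= scores[dim] <= 20:
            raise ValueError(f"{dim} must be between 1 and 20")
    total = sum(scores[dim] for dim in DIMENSIONS)
    return Decimal(total) * 5 / len(DIMENSIONS)


def displayed_rating(scores: dict[str, int]) -> int:
    """Round the rating to a whole number (assumed rule: half up)."""
    return int(rating(scores).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


example = {
    "Autonomy": 12, "Integration": 15, "Context": 16,
    "Compliance": 12, "Viability": 15, "Interface": 17,
}
print(rating(example), displayed_rating(example))  # 72.5 73
```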

Rating Tiers

Scores are currently research-grade for most tools, so treat a Tracked or Detected tool's tier as directional until hands-on testing improves evidence quality.

Leading: ≥ 76

Top-performing tools with strong capabilities across most dimensions

Proven: 64–75

Solid, enterprise-suitable tools with well-rounded scores

Emerging: 50–63

Capable in specific areas; not yet broad enough for all enterprise contexts

Watch: < 50

Early-stage, limited scope, or constrained by active score caps
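
The tier thresholds can be expressed as a simple lookup. This sketch assumes the tier is derived from the rounded, whole-number rating (as in the worked example, where 72.5 is displayed as 73 before the tier is assigned); the function name `tier_for` is hypothetical.

```python
def tier_for(displayed_rating: int) -> str:
    """Map a whole-number rating (0-100) to its tier label."""
    if not 0 <= displayed_rating <= 100:
        raise ValueError("rating must be between 0 and 100")
    if displayed_rating >= 76:
        return "Leading"
    if displayed_rating >= 64:
        return "Proven"
    if displayed_rating >= 50:
        return "Emerging"
    return "Watch"


print(tier_for(73))  # Proven
```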

Signal Levels

Signal levels describe how much evidence supports a tool's rating. A high-rated tool at Detected carries more risk than a moderate-rated tool at Validated. Signal level is independent of rating.

Validated

Production-validated in enterprise environments. Multiple independent deployments confirmed.

Evidence required: Named enterprise customers, production metrics, or independent audits. No critical open blockers.

Assessed

Active evaluation with substantial evidence. Hands-on testing completed or underway.

Evidence required: Internal evaluation data, multiple independent sources, documented trial findings.

Tracked

Known tool being monitored. Research-based assessment without hands-on testing.

Evidence required: Official documentation, community consensus, analyst coverage. No direct testing.

Detected

Recently identified. Minimal evaluation completed; scored from public signals only.

Evidence required: Vendor documentation, initial public signals. Evaluation has not yet commenced.

Evidence Grades

Each evaluation carries an evidence grade (A–D) derived from three factors. The grade signals how much to trust the score — stronger evidence means less interpretation needed when comparing tools.

Grade | Evidence Score | Profile
A | ≥ 0.75 | Recent (<30d), thorough evaluation, hands-on tested
B | 0.50–0.74 | Recent or thorough, with partial hands-on evidence
C | 0.25–0.49 | Moderate age or depth; limited or no hands-on testing
D | < 0.25 | Stale (>90d), minimal depth, no hands-on testing

Evidence Factors:

  • Recency (30%): <30 days = 1.0, 30–90 days = 0.5, >90 days = 0.0
  • Evaluation Depth (35%): Thorough = 1.0, Moderate = 0.5, Minimal = 0.0
  • Hands-On Testing (35%): Tested = 1.0, Demo = 0.5, Not Tested = 0.0
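
One way to read the weights and thresholds above is as a weighted sum of the three factor values. The sketch below assumes exactly that combination rule. It works in integer points (200 points = 1.0) so that scores sitting on a grade boundary, such as 0.50, are not disturbed by floating-point rounding. The function names and the lowercase string keys are illustrative only.

```python
# Weights in percent, as listed above.
W_RECENCY, W_DEPTH, W_HANDS_ON = 30, 35, 35

# Factor values in half-steps: 2 = 1.0, 1 = 0.5, 0 = 0.0.
DEPTH = {"thorough": 2, "moderate": 1, "minimal": 0}
HANDS_ON = {"tested": 2, "demo": 1, "not_tested": 0}


def recency_level(age_days: int) -> int:
    """<30 days = 1.0, 30-90 days = 0.5, >90 days = 0.0 (in half-steps)."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < 30:
        return 2
    if age_days <= 90:
        return 1
    return 0


def evidence_points(age_days: int, depth: str, hands_on: str) -> int:
    """Weighted sum in integer points, where 200 points equals 1.0."""
    return (
        W_RECENCY * recency_level(age_days)
        + W_DEPTH * DEPTH[depth]
        + W_HANDS_ON * HANDS_ON[hands_on]
    )


def evidence_score(age_days: int, depth: str, hands_on: str) -> float:
    """Evidence score on the 0.0-1.0 scale used in the grade table."""
    return evidence_points(age_days, depth, hands_on) / 200


def evidence_grade(age_days: int, depth: str, hands_on: str) -> str:
    """Grade thresholds: A >= 0.75, B >= 0.50, C >= 0.25, else D."""
    points = evidence_points(age_days, depth, hands_on)
    if points >= 150:
        return "A"
    if points >= 100:
        return "B"
    if points >= 50:
        return "C"
    return "D"
```

For example, a 45-day-old evaluation of moderate depth with demo-level testing scores 0.15 + 0.175 + 0.175 = 0.50 under this reading, which sits exactly on the Grade B threshold.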

Score Caps

14 score caps limit dimension scores when specific conditions are documented. Caps are grouped across security, capability, enterprise, trust, and stability categories. Temporary caps are removed when the triggering condition is resolved; permanent caps reflect fundamental design limitations.

Cap | Trigger | Impact
Critical Security Vuln | Unpatched critical CVE or active security incident | Compliance ≤ 5
No Codebase Indexing | No codebase indexing or semantic search capability | Context ≤ 11
Single IDE Only | Only works in one IDE with no CLI or web option | Interface ≤ 10
No Automation Mode | No CLI, API, or headless mode for CI/CD integration | Interface ≤ 12
No Enterprise Features | No enterprise customers/features (SSO, RBAC, audit, compliance certs) | All dims ≤ 14
Pricing Opacity | No public pricing or highly opaque pricing model | Compliance ≤ 15
Pricing Volatility | Frequent pricing changes causing budget unpredictability | Compliance ≤ 12
Severe Negative Sentiment | Widespread negative sentiment (sentiment score 1–2) | Status impact only
Reliability Complaints | Widespread reliability complaints (breaks often, unreliable output) | Autonomy ≤ 12
Unvalidated Benchmarks | Vendor-only benchmark claims with no independent validation | Autonomy ≤ 14
Community Exodus | Documented user exodus or mass migration away from the tool | All dims ≤ 12
Stalled Development | No releases or meaningful updates in 90+ days | All dims ≤ 12
Acquisition Uncertainty | Acquisition with unclear product roadmap | Compliance ≤ 12
Funding Concerns | Funding or runway concerns | Compliance ≤ 10

Four capability caps were removed in June 2026 (no tools used them): manual-acceptance-required, no-git-integration, file-level-only, small-context-window. Cap count: 18 → 14.
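
A rough sketch of how the caps in the table might be applied to a set of dimension scores follows. The ceilings are taken from the table; the slug-style cap identifiers and the `apply_caps` function are hypothetical. The sketch also assumes that when several caps hit the same dimension, the lowest ceiling wins, which this page does not state. Severe Negative Sentiment is omitted because it has no dimension impact.

```python
DIMENSIONS = (
    "Autonomy", "Integration", "Context",
    "Compliance", "Viability", "Interface",
)
ALL = "all"  # marker for caps that apply to every dimension

# cap identifier (hypothetical slug) -> (target dimension or ALL, ceiling)
CAPS = {
    "critical-security-vuln": ("Compliance", 5),
    "no-codebase-indexing": ("Context", 11),
    "single-ide-only": ("Interface", 10),
    "no-automation-mode": ("Interface", 12),
    "no-enterprise-features": (ALL, 14),
    "pricing-opacity": ("Compliance", 15),
    "pricing-volatility": ("Compliance", 12),
    "reliability-complaints": ("Autonomy", 12),
    "unvalidated-benchmarks": ("Autonomy", 14),
    "community-exodus": (ALL, 12),
    "stalled-development": (ALL, 12),
    "acquisition-uncertainty": ("Compliance", 12),
    "funding-concerns": ("Compliance", 10),
}


def apply_caps(scores: dict[str, int], active_caps: list[str]) -> dict[str, int]:
    """Return a copy of `scores` with every active cap's ceiling applied.

    Assumption: overlapping caps combine by taking the lowest ceiling.
    """
    capped = dict(scores)
    for cap in active_caps:
        target, ceiling = CAPS[cap]
        targets = DIMENSIONS if target == ALL else (target,)
        for dim in targets:
            capped[dim] = min(capped[dim], ceiling)
    return capped


raw = {
    "Autonomy": 12, "Integration": 15, "Context": 16,
    "Compliance": 12, "Viability": 15, "Interface": 17,
}
capped = apply_caps(raw, ["single-ide-only", "stalled-development"])
```

With these two caps active, every dimension is held to 12 or below and Interface is further limited to 10, so the scores from the earlier worked example would drop from an average of 14.5 to (12 × 5 + 10) ÷ 6 ≈ 11.67.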

DX Testing Protocol (Planned)

A standardized 60-minute first-contact testing protocol is under development. It will cover installation, a defined task scenario, error recovery, and a structured exit-interview rubric. Currently, most evaluations are research-based with hands-on testing conducted opportunistically. When formal DX testing is completed, it will unlock Grade A evidence and improve signal-level confidence.

Risk Context

The same tool presents different risk profiles depending on use case. A tool assessed as Validated for frontend prototyping carries different implications than one writing data migrations or touching production infrastructure. Scores reflect general enterprise-engineering capability; teams should layer their own use-case risk assessment on top of signal levels.

Evaluation Cadence

Evaluations are published on a monthly release cycle. The AI engineering landscape moves too fast for quarterly cycles — significant security incidents, pricing changes, and major releases can shift a tool's position within days.

  • Monthly re-scoring uses AI-assisted analysis of new releases, community signals, and market intelligence
  • Human spot-checks review AI-proposed score changes above a defined threshold before publication
  • Critical security incidents trigger immediate out-of-cycle updates
  • Quarterly data quality audits review completeness, score validity, and cap consistency across the full catalog

ACES v2 + Phase A tiers + Phase B anchor recalibration

  • Core framework (ACES v2, effective April 2026): 6-dimension model and three-signal model (Rating / Signal Level / Evidence Grade).
  • Phase A (June 2026): the 0–100 composite rating is presented as a 4-tier band (Leading / Proven / Emerging / Watch); the numeric score is retained as an internal sort key.
  • Phase B (June 2026): the per-dimension rubric anchors were recalibrated to be comparative, forced-distribution, and evidence-gated, and the 7-level rubric was consolidated to 5 bands. The dimension count is unchanged at 6.

June 2026
