Evaluation Methodology

A systematic framework for assessing agentic developer tools for enterprise adoption.

Evaluation Philosophy

Our evaluation framework balances capability assessment with validation confidence. The raw capability score (Rating) represents a tool's technical potential, while the Adjusted Score scales that Rating by how confident we are in it, based on real-world enterprise deployments and the quality of our evidence.

This approach recognizes that a highly capable tool with limited enterprise validation carries more risk than a slightly less capable tool with proven production deployments. The Adjusted Score gives enterprise decision-makers this risk-adjusted view.

Scoring Formulas

Rating (0-100) — Pure Capability Score

The Rating reflects what the tool can do when it works as intended, without accounting for validation level or enterprise readiness.

Formula:

Rating = (AI Autonomy + Collaboration + Contextual Understanding + Governance + User Interface) ÷ 5 × 5

Each dimension is scored 0-20. The five scores are averaged and the average is multiplied by 5 to convert it to a 0-100 scale.

Example Calculation:

AI Autonomy: 16, Collaboration: 12, Contextual Understanding: 16, Governance: 8, User Interface: 16
Average = (16 + 12 + 16 + 8 + 16) ÷ 5 = 68 ÷ 5 = 13.6
Rating = 13.6 × 5 = 68.0
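
The Rating formula can be sketched in a few lines of Python. This is an illustrative sketch, not the radar's actual implementation: the names `DIMENSIONS` and `rating` are hypothetical, and the range check assumes the 0-20 scale given for the five dimensions.

```python
from typing import Mapping

# The five equally weighted dimensions named in the formula.
DIMENSIONS = (
    "AI Autonomy",
    "Collaboration",
    "Contextual Understanding",
    "Governance",
    "User Interface",
)


def rating(scores: Mapping[str, float]) -> float:
    """Average the five dimension scores and scale the result to 0-100."""
    missing = [name for name in DIMENSIONS if name not in scores]
    if missing:
        raise ValueError(f"missing dimension scores: {missing}")
    for name in DIMENSIONS:
        # Assumption: each dimension is scored on a 0-20 scale.
        if not 0 <= scores[name] <= 20:
            raise ValueError(f"{name} must be between 0 and 20")
    average = sum(scores[name] for name in DIMENSIONS) / len(DIMENSIONS)
    return round(average * 5, 1)


example = {
    "AI Autonomy": 16,
    "Collaboration": 12,
    "Contextual Understanding": 16,
    "Governance": 8,
    "User Interface": 16,
}
print(rating(example))  # 68.0
```

Because the five dimensions are equally weighted and each tops out at 20, the Rating is numerically the same as the plain sum of the five scores.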

Adjusted Score (0-100) — Confidence-Adjusted Score

The Adjusted Score accounts for evaluation status and evidence quality. Each status has a confidence band (floor–ceiling). Evidence factors — evaluation recency, research depth, and hands-on testing — determine position within the band.

Formula:

Confidence = Floor + Evidence Score × (Ceiling - Floor)
Adjusted Score = Rating × Confidence

Where Confidence ranges from 0.30 (Not Enterprise Viable floor) to 1.00 (Adopted ceiling).

Evidence Factors:

Recency (30%)
<30 days = 1.0, 30-90 days = 0.5, >90 days = 0.0

Evaluation Depth (35%)
Thorough = 1.0, Moderate = 0.5, Minimal = 0.0

Hands-On Testing (35%)
Tested = 1.0, Demo = 0.5, Not Tested = 0.0
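
One way to combine the three evidence factors is a weighted sum using the percentages above. The methodology lists the weights but does not spell out how they are combined, so the weighted sum here is an assumption. The function names and the treatment of exactly 30 and 90 days (both mapped to 0.5, following the "30-90 days" bucket) are also illustrative.

```python
# Weights and factor values taken from the Evidence Factors list.
WEIGHTS = {"recency": 0.30, "depth": 0.35, "hands_on": 0.35}
DEPTH_VALUES = {"Thorough": 1.0, "Moderate": 0.5, "Minimal": 0.0}
HANDS_ON_VALUES = {"Tested": 1.0, "Demo": 0.5, "Not Tested": 0.0}


def recency_value(days_since_evaluation: int) -> float:
    """Map the age of the evaluation to its recency factor."""
    if days_since_evaluation < 30:
        return 1.0
    if days_since_evaluation <= 90:
        return 0.5
    return 0.0


def evidence_score(days_since_evaluation: int, depth: str, hands_on: str) -> float:
    """Assumed combination: weighted sum of the three factors, in [0, 1]."""
    total = (
        WEIGHTS["recency"] * recency_value(days_since_evaluation)
        + WEIGHTS["depth"] * DEPTH_VALUES[depth]
        + WEIGHTS["hands_on"] * HANDS_ON_VALUES[hands_on]
    )
    return round(total, 4)


print(evidence_score(10, "Thorough", "Tested"))      # 1.0
print(evidence_score(45, "Moderate", "Demo"))        # 0.5
print(evidence_score(120, "Minimal", "Not Tested"))  # 0.0
```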

Adopted Tool Example:

Rating: 68.0
Status: Adopted (85-100% band)
Adjusted Score = 68.0 × 1.00 = 68.0
High evidence → top of band

Emerging Tool Example:

Rating: 68.0
Status: Emerging (55-80% band)
Adjusted Score = 68.0 × 0.70 = 47.6
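Default evidence → confidence of 0.70, which corresponds to an Evidence Score of 0.6 (0.55 + 0.6 × 0.25), just above the band midpoint of 0.675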
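
The two examples above can be reproduced with a short sketch of the Confidence and Adjusted Score formulas. The function names are hypothetical. The Evidence Score of 0.6 used for the Emerging case is not stated in the methodology: it is inferred from the 0.70 multiplier in the example.

```python
def confidence(floor: float, ceiling: float, evidence: float) -> float:
    """Confidence = Floor + Evidence Score x (Ceiling - Floor)."""
    if not 0.0 <= evidence <= 1.0:
        raise ValueError("evidence score must be between 0 and 1")
    return floor + evidence * (ceiling - floor)


def adjusted_score(rating: float, floor: float, ceiling: float, evidence: float) -> float:
    """Adjusted Score = Rating x Confidence, rounded to one decimal place."""
    return round(rating * confidence(floor, ceiling, evidence), 1)


# Adopted band (0.85 to 1.00) with full evidence: top of the band.
adopted = adjusted_score(68.0, 0.85, 1.00, 1.0)

# Emerging band (0.55 to 0.80); evidence 0.6 is inferred from the example.
emerging = adjusted_score(68.0, 0.55, 0.80, 0.6)

print(adopted)   # 68.0
print(emerging)  # 47.6
```

The same Rating of 68.0 therefore yields scores about 20 points apart, purely because of status and evidence.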

Rating vs Status: Independent Dimensions

Rating and Status measure different things and are intentionally independent:

Rating (0-100)

Pure capability score — what the tool can do when working as intended. Based on five evaluation dimensions.

Status

Validation level — how much we've verified and trust those capabilities. Based on enterprise deployments and evidence.

Example combinations: A tool can be Adopted (fully validated) with Rating 60 (limited capabilities), or Emerging (limited validation) with Rating 85 (very capable but unproven at scale).

Confidence Bands

Each evaluation status has a confidence band reflecting the range of possible confidence levels. Evidence quality determines position within the band.

Status	Band	Description
Adopted	85–100%	Production-validated across enterprise deployments
In Review	65–90%	Active evaluation with substantial evidence
Emerging	55–80%	Promising capabilities, limited validation
Watch	50–75%	Established tool being monitored, not yet formally evaluated
Deferred	40–65%	Evaluation paused, will revisit
Not Enterprise Viable	30–50%	Significant blockers for enterprise use
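
To see how much each status constrains the outcome, the band table can be encoded as a lookup and used to compute the lowest and highest Adjusted Score a given Rating can reach. This is a minimal sketch; `CONFIDENCE_BANDS` and `adjusted_range` are illustrative names, not part of any published tooling.

```python
# Confidence bands from the table, expressed as (floor, ceiling) fractions.
CONFIDENCE_BANDS = {
    "Adopted": (0.85, 1.00),
    "In Review": (0.65, 0.90),
    "Emerging": (0.55, 0.80),
    "Watch": (0.50, 0.75),
    "Deferred": (0.40, 0.65),
    "Not Enterprise Viable": (0.30, 0.50),
}


def adjusted_range(rating: float, status: str) -> tuple:
    """Return the (minimum, maximum) Adjusted Score for a Rating and status."""
    floor, ceiling = CONFIDENCE_BANDS[status]
    return (round(rating * floor, 1), round(rating * ceiling, 1))


for status in CONFIDENCE_BANDS:
    print(status, adjusted_range(68.0, status))
# Adopted (57.8, 68.0)
# In Review (44.2, 61.2)
# Emerging (37.4, 54.4)
# Watch (34.0, 51.0)
# Deferred (27.2, 44.2)
# Not Enterprise Viable (20.4, 34.0)
```

Adjacent bands overlap, so a well-evidenced tool in a lower status can outscore a poorly evidenced tool in the status above it.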

Five Evaluation Dimensions

Each tool is assessed across five equally-weighted dimensions (0-20 scale each).

AI Autonomy

20% of total

Ability to plan and execute multi-step tasks (assistive → agentic → self-directed)

Task completion without manual steps

Multi-step workflow execution

Error recovery and self-correction

Background/async operation capability

Collaboration

20% of total

Human + AI co-creation fluency (prompting → pairing → natural collaboration)

Git/GitHub integration depth

Team communication tools (Slack, Teams)

Project management integration

Code review workflow support

Contextual Understanding

20% of total

Depth of understanding across repos, projects, and systems (file → repo → ecosystem)

Codebase indexing and search

Cross-repository awareness

Historical context retention

Organization-specific knowledge

Governance

20% of total

Enterprise readiness: compliance, observability, and trust controls

SSO/SAML authentication

Audit logging and compliance

Data residency controls

Role-based access management

User Interface

20% of total

Interaction maturity: keyboard → chat → multimodal ("vibe coding")

IDE integration breadth

CLI/terminal support

Web interface quality

API/headless operation

Release Cadence

Our evaluation follows a structured release schedule to balance thoroughness with timeliness.

Monthly Evaluation

First week of each month

Full evaluation refresh with documented changes

Score adjustments based on tool updates
Status transitions with rationale
New tool evaluations from pipeline
Quick take and rationale updates

Quarterly Deep-Dive

Q1, Q2, Q3, Q4

Strategic analysis and methodology review

Comprehensive dimension scoring review
Enterprise adoption trend analysis
Methodology refinements
Stakeholder briefing preparation

Continuous Monitoring

Ongoing

Market intelligence and pipeline management

Security vulnerability tracking
Acquisition and funding news
Community sentiment analysis
Submission review and triage

Evaluation Pipeline

Submitted

Initial intake

Backlog

Validated, queued

In Review

Active evaluation

Decision

Final status
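
The four pipeline stages can be modeled as an ordered enumeration. The sketch below assumes a strictly linear progression from Submitted to Decision, which the stage list suggests but does not state explicitly. `Stage` and `next_stage` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Pipeline stages in order, each with its description from the list above."""
    SUBMITTED = "Initial intake"
    BACKLOG = "Validated, queued"
    IN_REVIEW = "Active evaluation"
    DECISION = "Final status"


# Enum members iterate in definition order.
_ORDER = list(Stage)


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the following stage, or None once a decision is reached."""
    index = _ORDER.index(stage)
    return _ORDER[index + 1] if index + 1 < len(_ORDER) else None


print(next_stage(Stage.SUBMITTED))  # Stage.BACKLOG
print(next_stage(Stage.DECISION))   # None
```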

Backlog vs Deferred

Backlog

New tool approved for evaluation but not yet reviewed. Waiting in the queue for its initial assessment.

Deferred

Previously reviewed, now paused. We have context on the tool but are deprioritizing it (e.g., no public product, strategic hold, capacity constraints).

Data Sources

Product Documentation

Official docs, security whitepapers, compliance certifications

Enterprise Feedback

Client deployments, stakeholder interviews, production metrics

Benchmark Data

SWE-Bench, Terminal Bench, third-party evaluations

Market Intelligence

Funding rounds, acquisitions, partnership announcements
