Evaluation Methodology

A systematic framework for assessing agentic developer tools for enterprise adoption.

Evaluation Philosophy

Our evaluation framework balances capability assessment with validation confidence. Raw capability scores (Rating) represent a tool's technical potential, while the Adjusted Score reflects our certainty based on real-world enterprise deployments.

This approach recognizes that a highly capable tool with limited enterprise validation carries more risk than a slightly less capable tool with proven production deployments. The adjusted score helps enterprise decision-makers understand this risk-adjusted view.

Scoring Formulas

Rating (0-100) — Pure Capability Score

The Rating reflects what the tool can do when it works as intended, without accounting for validation level or enterprise readiness.

Formula:

Rating = (AI Autonomy + Collaboration + Contextual Understanding + Governance + User Interface) ÷ 5 × 5

Each dimension is scored 1-20. The five scores are averaged, and the average is multiplied by 5 to convert to a 0-100 scale (equivalently, the Rating equals the sum of the five dimension scores).

Example Calculation:

AI Autonomy: 16, Collaboration: 12, Contextual Understanding: 16, Governance: 8, User Interface: 16
Average = (16 + 12 + 16 + 8 + 16) ÷ 5 = 68 ÷ 5 = 13.6
Rating = 13.6 × 5 = 68.0
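As a minimal sketch, the Rating formula can be expressed in a few lines of Python. The function and dictionary keys below are illustrative, not from a published implementation:

```python
# Illustrative sketch of the Rating formula (names are hypothetical).
DIMENSIONS = ["ai_autonomy", "collaboration", "contextual_understanding",
              "governance", "user_interface"]

def rating(scores: dict[str, int]) -> float:
    """Average the five 1-20 dimension scores, then scale to 0-100."""
    avg = sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)
    return avg * 5

# Reproduces the worked example above:
example = {"ai_autonomy": 16, "collaboration": 12,
           "contextual_understanding": 16, "governance": 8,
           "user_interface": 16}
print(rating(example))  # 68.0
```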

Adjusted Score (0-100) — Confidence-Adjusted Score

The Adjusted Score accounts for evaluation status and evidence quality. Each status has a confidence band (floor–ceiling). Evidence factors — evaluation recency, research depth, and hands-on testing — determine position within the band.

Formula:

Confidence = Floor + Evidence Score × (Ceiling - Floor)
Adjusted Score = Rating × Confidence

Where Confidence ranges from 0.30 (Not Enterprise Viable floor) to 1.00 (Adopted ceiling)

Evidence Factors:

Factor | Weight | Scoring
Recency | 30% | <30 days = 1.0, 30-90 days = 0.5, >90 days = 0.0
Evaluation Depth | 35% | Thorough = 1.0, Moderate = 0.5, Minimal = 0.0
Hands-On Testing | 35% | Tested = 1.0, Demo = 0.5, Not Tested = 0.0
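A minimal sketch of the weighted combination, assuming each factor has already been mapped to 0.0, 0.5, or 1.0 per the table above (the function name and signature are illustrative):

```python
# Hypothetical sketch: combine the three evidence factors by their weights.
def evidence_score(recency: float, depth: float, hands_on: float) -> float:
    """Each argument is 0.0, 0.5, or 1.0 per the factor table above."""
    return 0.30 * recency + 0.35 * depth + 0.35 * hands_on

# A fresh, thorough, hands-on evaluation scores the maximum:
print(evidence_score(recency=1.0, depth=1.0, hands_on=1.0))  # 1.0
```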

Adopted Tool Example:

Rating: 68.0
Status: Adopted (85-100% band)
Adjusted Score = 68.0 × 1.00 = 68.0
High evidence → top of band

Emerging Tool Example:

Rating: 68.0
Status: Emerging (55-80% band)
Adjusted Score = 68.0 × 0.70 = 47.6
Default evidence → mid-band confidence

Rating vs Status: Independent Dimensions

Rating and Status measure different things and are intentionally independent:

Rating (0-100)

Pure capability score — what the tool can do when working as intended. Based on five evaluation dimensions.

Status

Validation level — how much we've verified and trust those capabilities. Based on enterprise deployments and evidence.

Example combinations: A tool can be Adopted (fully validated) with Rating 60 (limited capabilities), or Emerging (limited validation) with Rating 85 (very capable but unproven at scale).

Confidence Bands

Each evaluation status has a confidence band reflecting the range of possible confidence levels. Evidence quality determines position within the band.

Status | Band | Description
Adopted | 85-100% | Production-validated across enterprise deployments
In Review | 65-90% | Active evaluation with substantial evidence
Emerging | 55-80% | Promising capabilities, limited validation
Watch | 50-75% | Established tool being monitored, not yet formally evaluated
Deferred | 40-65% | Evaluation paused, will revisit
Not Enterprise Viable | 30-50% | Significant blockers for enterprise use
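Putting the pieces together, here is a sketch of the Adjusted Score calculation using the band floors and ceilings above. The dictionary mirrors the table; the evidence value of 0.6 in the second call is an inferred input that reproduces the 0.70 confidence shown in the Emerging example:

```python
# Band floors and ceilings from the table above, as fractions.
BANDS = {
    "Adopted":               (0.85, 1.00),
    "In Review":             (0.65, 0.90),
    "Emerging":              (0.55, 0.80),
    "Watch":                 (0.50, 0.75),
    "Deferred":              (0.40, 0.65),
    "Not Enterprise Viable": (0.30, 0.50),
}

def adjusted_score(rating: float, status: str, evidence: float) -> float:
    """Confidence = floor + evidence * (ceiling - floor); then scale Rating."""
    floor, ceiling = BANDS[status]
    confidence = floor + evidence * (ceiling - floor)
    return rating * confidence

# Reproduces the worked examples above:
print(adjusted_score(68.0, "Adopted", evidence=1.0))   # 68.0
print(adjusted_score(68.0, "Emerging", evidence=0.6))  # 47.6 (0.70 confidence)
```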

Five Evaluation Dimensions

Each tool is assessed across five equally weighted dimensions (1-20 scale each).

AI Autonomy

20% of total

Ability to plan and execute multi-step tasks (assistive → agentic → self-directed)

Task completion without manual steps
Multi-step workflow execution
Error recovery and self-correction
Background/async operation capability

Collaboration

20% of total

Human + AI co-creation fluency (prompting → pairing → natural collaboration)

Git/GitHub integration depth
Team communication tools (Slack, Teams)
Project management integration
Code review workflow support

Contextual Understanding

20% of total

Depth of understanding across repos, projects, and systems (file → repo → ecosystem)

Codebase indexing and search
Cross-repository awareness
Historical context retention
Organization-specific knowledge

Governance

20% of total

Enterprise readiness: compliance, observability, and trust controls

SSO/SAML authentication
Audit logging and compliance
Data residency controls
Role-based access management

User Interface

20% of total

Interaction maturity: keyboard → chat → multimodal ("vibe coding")

IDE integration breadth
CLI/terminal support
Web interface quality
API/headless operation

Release Cadence

Our evaluation follows a structured release schedule to balance thoroughness with timeliness.

Monthly Evaluation

First week of each month

Full evaluation refresh with documented changes

  • Score adjustments based on tool updates
  • Status transitions with rationale
  • New tool evaluations from pipeline
  • Quick take and rationale updates

Quarterly Deep-Dive

Q1, Q2, Q3, Q4

Strategic analysis and methodology review

  • Comprehensive dimension scoring review
  • Enterprise adoption trend analysis
  • Methodology refinements
  • Stakeholder briefing preparation

Continuous Monitoring

Ongoing

Market intelligence and pipeline management

  • Security vulnerability tracking
  • Acquisition and funding news
  • Community sentiment analysis
  • Submission review and triage

Evaluation Pipeline

1. Submitted: Initial intake
2. Backlog: Validated, queued
3. In Review: Active evaluation
4. Decision: Final status

Backlog vs Deferred

Backlog

A new tool approved for evaluation but not yet reviewed. Waiting in queue for initial assessment.

Deferred

Previously reviewed, now paused. We have context but are deprioritizing (e.g., no public product, strategic hold, capacity constraints).

Data Sources

Product Documentation

Official docs, security whitepapers, compliance certifications

Enterprise Feedback

Client deployments, stakeholder interviews, production metrics

Benchmark Data

SWE-Bench, Terminal Bench, third-party evaluations

Market Intelligence

Funding rounds, acquisitions, partnership announcements
