The Metrics Problem
Most AI coding agent dashboards lead with vanity metrics: PRs opened, tickets processed, lines generated. These are easy to count and tell you almost nothing about whether the platform is working.
The right metrics are harder to measure, but they answer the actual question: is this platform delivering compounding value, or is it generating noise? This post covers twelve metrics that matter, with healthy-target ranges for each, plus the metrics to ignore even though they look impressive.
1. First-Time Acceptance Rate
What it measures: The percentage of AI-opened PRs merged with no human-requested changes.
Why it matters: This is the clearest proxy for whether the AI is actually finishing the work, versus doing 80% of it and leaving the engineer to close the gap.
Healthy range: 70%+ after week 8. Below 50% indicates the AI's scope is too broad or the validation pipeline is too lax. See [how the self-improving engine raises this over time](/blog/self-improving-ai-learns-from-code-reviews).
Failure mode: Tracked in aggregate, not per-category. A 70% aggregate number can hide a 30% rate in one category that's quietly losing reviewer trust.
2. Per-Category Acceptance Rate
What it measures: Acceptance rate broken out by ticket type — bug fix, feature, dependency bump, test backfill, etc.
Why it matters: The aggregate hides drift. If feature work falls from 60% to 40% acceptance while dependency bumps stay at 95%, the aggregate looks healthy while feature work rots.
Healthy range: Each category above 60%. Categories below that threshold should either be improved or removed from the AI's scope; the sketch below shows how to flag them.
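To make the breakdown concrete, here is a minimal sketch that computes both the aggregate and per-category first-time acceptance rates. The record fields (`category`, `merged`, `human_changes_requested`) are illustrative, not a real API.

```python
from collections import defaultdict

def acceptance_rates(prs):
    """Aggregate and per-category first-time acceptance rates.

    Each PR record is assumed to carry 'category' (str), 'merged' (bool),
    and 'human_changes_requested' (bool); field names are illustrative.
    """
    totals, accepted = defaultdict(int), defaultdict(int)
    for pr in prs:
        totals[pr["category"]] += 1  # denominator: all AI-opened PRs
        if pr["merged"] and not pr["human_changes_requested"]:
            accepted[pr["category"]] += 1
    per_category = {c: accepted[c] / totals[c] for c in totals}
    aggregate = sum(accepted.values()) / sum(totals.values())
    return aggregate, per_category

# Toy data: a healthy-looking aggregate coexisting with a weak category.
prs = [
    {"category": "dependency_bump", "merged": True, "human_changes_requested": False},
    {"category": "dependency_bump", "merged": True, "human_changes_requested": False},
    {"category": "dependency_bump", "merged": True, "human_changes_requested": False},
    {"category": "feature", "merged": True, "human_changes_requested": True},
]
aggregate, by_category = acceptance_rates(prs)
print(f"aggregate={aggregate:.0%}", {c: f"{r:.0%}" for c, r in by_category.items()})
# aggregate=75% {'dependency_bump': '100%', 'feature': '0%'}
```

The toy data makes the point above: a 75% aggregate can coexist with a 0% feature category.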
3. Bug Escape Rate
What it measures: Production incidents traceable to AI-generated changes per 1,000 AI-merged PRs.
Why it matters: This is the safety metric. Acceptance rate without escape rate is a false reassurance — high acceptance and high escape means you're shipping bugs through the validation pipeline.
Healthy range: Below the team's baseline escape rate for human-authored changes. See [the bug escape data](/blog/ai-code-review-bug-escape-rate-data).
Failure mode: Hard to attribute. An AI change merged in March that causes an incident in June — was it the AI? Investment in incident-attribution tooling pays for itself.
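The arithmetic itself is simple; attribution is the hard part. A worked example with made-up counts, showing the per-1,000 normalization and the baseline comparison:

```python
# Made-up quarter: 2 incidents traced to AI changes across 1,200 merged AI PRs.
ai_escapes, ai_merged_prs = 2, 1200
ai_escape_rate = 1000 * ai_escapes / ai_merged_prs          # ~1.67 per 1,000

# Human baseline over the same window, same normalization.
human_escapes, human_merged_prs = 9, 3000
human_baseline = 1000 * human_escapes / human_merged_prs    # 3.0 per 1,000

healthy = ai_escape_rate < human_baseline                   # True
```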
4. Mean Time to PR (MTTP)
What it measures: Time from ticket creation to PR open, for AI-handled tickets.
Why it matters: This is the customer-perceived response time. Lower is better.
Healthy range: Under 30 minutes for the median ticket. Under 4 hours for P95.
Failure mode: Confused with MTTR. MTTP is "ticket to PR." MTTR is "ticket to merged-and-deployed." Track both.
5. Mean Time to Resolution (MTTR)
What it measures: Time from ticket creation to deployed fix.
Why it matters: What customers actually feel. PRs that sit unmerged help nobody.
Healthy range: AI-handled bug categories should drop median MTTR to under 24 hours. See [reducing MTTR with AI](/blog/reducing-mttr-with-ai-code-generation).
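Both clocks start at ticket creation and stop at different events, so both can come from the same records. A minimal sketch, assuming each ticket carries `created`, `pr_opened`, and `deployed` timestamps (illustrative field names):

```python
from statistics import median, quantiles

def hours(start, end):
    return (end - start).total_seconds() / 3600

def response_times(tickets):
    """MTTP (ticket creation to PR open) and MTTR (creation to deployed).

    Each ticket is assumed to carry 'created', 'pr_opened', and 'deployed'
    datetime values; field names are illustrative. Returns hours.
    """
    mttp = [hours(t["created"], t["pr_opened"]) for t in tickets]
    mttr = [hours(t["created"], t["deployed"]) for t in tickets]
    p95 = lambda xs: quantiles(xs, n=20)[-1]  # 95th-percentile cut point
    return {
        "mttp_median": median(mttp), "mttp_p95": p95(mttp),
        "mttr_median": median(mttr), "mttr_p95": p95(mttr),
    }
```

Compare `mttp_median` against the 30-minute target (0.5 hours) and `mttp_p95` against 4 hours; the same dict gives the 24-hour MTTR check.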
6. Cost per Merged PR
What it measures: Total AI spend divided by PRs that actually merged (not opened).
Why it matters: Cost per opened PR is misleading because rejected PRs are pure cost. Cost per merged PR shows the real economics.
Healthy range: $1-$8 for typical work. Above $15 indicates either over-iteration or over-scoped tickets.
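The pitfall is the denominator, as a worked example with made-up numbers shows:

```python
# Illustrative month: $900 of model spend, 300 AI PRs opened, 150 merged.
spend = 900.00
opened_prs, merged_prs = 300, 150

cost_per_opened = spend / opened_prs  # $3.00 -- flattering, and misleading
cost_per_merged = spend / merged_prs  # $6.00 -- the real unit economics
```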
7. Confidence-vs-Acceptance Calibration
What it measures: When the AI's confidence score is X, what percentage of those PRs actually merge?
Why it matters: A miscalibrated confidence score is worse than no confidence score. If the AI says 90% confident but only 60% merge, the decision engine is making bad routing decisions.
Healthy range: The confidence score should track actual acceptance within ±5 percentage points. A gap above 10 points means the scoring needs recalibration.
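A minimal calibration check buckets PRs by stated confidence and compares each bucket's midpoint to its actual merge rate. The `confidence` and `merged` fields are illustrative:

```python
from collections import defaultdict

def calibration_report(prs, bucket_width=0.1, tolerance=0.10):
    """Per-bucket gap between stated confidence and actual merge rate.

    Each PR is assumed to carry 'confidence' (0.0-1.0) and 'merged' (bool);
    field names are illustrative. Buckets whose gap exceeds the tolerance
    are flagged for recalibration.
    """
    n_buckets = round(1 / bucket_width)
    counts = defaultdict(lambda: [0, 0])  # bucket index -> [merged, total]
    for pr in prs:
        b = min(int(pr["confidence"] / bucket_width), n_buckets - 1)
        counts[b][0] += pr["merged"]
        counts[b][1] += 1
    report = []
    for b in sorted(counts):
        merged, total = counts[b]
        stated = (b + 0.5) * bucket_width  # bucket midpoint
        actual = merged / total
        report.append(
            {"stated": stated, "actual": actual,
             "recalibrate": abs(actual - stated) > tolerance}
        )
    return report
```

A bucket with `stated` 0.85 and `actual` 0.60 is exactly the 90%-says, 60%-merges failure described above.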
8. Reviewer Effort per AI PR
What it measures: Average time a human reviewer spends on an AI PR vs a human PR.
Why it matters: AI PRs that take 30 minutes to review are not saving time. The goal is PRs that take less time to review than the change would have taken to write by hand.
Healthy range: AI PRs should take under half the review time of equivalent human PRs. If reviewer effort is high, the validation pipeline isn't catching enough before a human sees the PR.
9. Reviewer Acceptance vs Override Rate
What it measures: When a reviewer requests changes, are those changes substantive (real bugs) or stylistic (preference)?
Why it matters: Distinguishes real quality issues from reviewer-AI mismatch. High substantive change requests = real quality problem. High stylistic change requests = pattern config drift.
Healthy range: Stylistic change requests under 10%, which indicates the per-repo config is well-tuned.
10. Catastrophic Action Rate
What it measures: AI attempts at actions outside its allowed scope (touching forbidden paths, exceeding the max-files limit, attempting destructive operations) that the validation pipeline blocked.
Why it matters: Tracks whether the guardrails are doing their job and whether AI is trending into unsafe territory.
Healthy range: Near zero. Non-zero is fine as long as the pipeline blocks them; an increasing trend means the guardrails need tightening.
Failure mode: Conflated with normal blocked-by-validation cases. Catastrophic actions are specifically those that would have been dangerous if executed.
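One way to keep the two populations separate is to tag every blocked action with the guardrail rule that fired and count only the dangerous ones. The rule names here are hypothetical; substitute whatever your pipeline emits:

```python
# Hypothetical rule names; real ones depend on your guardrail layer.
CATASTROPHIC_RULES = {
    "forbidden_path",       # touched a path outside the allowed scope
    "max_files_exceeded",   # diff larger than the per-PR file limit
    "destructive_command",  # e.g. force-push, branch delete, schema drop
}

def catastrophic_action_rate(blocked_actions, merged_pr_count):
    """Catastrophic blocks per 1,000 merged PRs.

    Routine validation failures (lint, failing tests, style) stay out of
    the numerator; only rules that would have been dangerous if executed
    count. The 'rule' field is illustrative.
    """
    dangerous = sum(1 for a in blocked_actions if a["rule"] in CATASTROPHIC_RULES)
    return 1000 * dangerous / max(merged_pr_count, 1)
```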
11. Auto-Merge Rate
What it measures: Percentage of AI PRs that auto-merge without human review (in categories where that's enabled).
Why it matters: Auto-merge is the highest-trust state. The metric tracks how much of the workload is fully autonomous.
Healthy range: Starts at 0% in month 1, climbs to 30-50% for low-risk categories (dependency bumps, test additions, format-only changes) by month 6. See [the rollout pattern](/blog/scaling-ai-code-generation-500-repositories).
12. Coverage of Eligible Backlog
What it measures: Percentage of AI-eligible tickets that the AI actually handles.
Why it matters: The AI can be high-acceptance and high-quality but only handle 5% of eligible tickets. The platform's value is bounded by adoption.
Healthy range: 60%+ of eligible backlog handled by AI by month 6. Below 30% indicates a workflow problem (people are routing around the platform), not an AI quality problem.
What to Ignore
These metrics look impressive and are mostly noise:
- Total tickets processed. Volume without quality is not progress.
- Total lines of code generated. AI loves to add lines. Fewer lines for the same outcome is better.
- Tokens consumed. Operational, not value-bearing.
- PR open rate. A PR that doesn't merge is cost without value.
- Time saved. Estimated, not measured. Use cost-per-merged-PR and reviewer-effort instead.
Dashboard Hierarchy
How to organize the dashboard:
- Top: Acceptance rate, escape rate, MTTR, cost-per-merged-PR. Four numbers.
- Middle: Per-category breakdown of acceptance and cost.
- Bottom: Operational — confidence calibration, catastrophic action rate, auto-merge rate, coverage of eligible backlog.
Anything else is for investigation, not the dashboard.
Reporting Cadence
- Daily: Operational metrics for the on-call platform engineer.
- Weekly: Aggregate metrics for the platform team.
- Monthly: Reported to engineering leadership with category breakdowns and trend lines.
- Quarterly: Strategic review — what categories to expand, contract, or change.
Reporting too often trades signal for noise; reporting too rarely lets problems compound.
Anti-Goals
There are targets a deployment should not optimize for, even when they're easy to track:
- 100% AI handling of all tickets. AI should handle the AI-suitable subset; the rest belong to humans.
- Zero human review. Some auto-merge is healthy; all auto-merge is reckless.
- Maximum throughput. Throughput at the cost of quality erodes trust.
- Lowest possible cost. Cheap AI that fails frequently is more expensive than expensive AI that succeeds.
Summary
The right metrics for an AI coding agent program track quality (acceptance rate, escape rate, calibration), value (MTTR, cost-per-merged-PR, reviewer effort), and adoption (coverage of eligible backlog, auto-merge rate). Twelve numbers are enough; vanity metrics (total tickets, total lines) are distractions. With this measurement frame, the platform team can tell the difference between a healthy deployment and a noisy one — and act before noise turns into withdrawal.
For the broader rollout pattern these metrics support, see [scaling AI across 500 repositories](/blog/scaling-ai-code-generation-500-repositories). For the cost dimension, see [the 50-engineer ROI analysis](/blog/ai-code-generation-roi-50-engineer-team).
Ready to automate your tickets?
See ensurefix process a real ticket from your backlog in a live demo.
Request a Demo