Pull requests were designed for a world where code is written faster than it can be reviewed. Queue it up, wait for a human, get a thumbs up, merge. That made sense when humans wrote all the code and other humans were the best validation available.
Agents flip this entirely. Code is written and validated faster than a human can review it. The bottleneck is no longer writing — it's the human in the loop. And the numbers show it: GitHub saw 518 million pull requests merged in 2025 — up 29% year-over-year — while comments on commits dropped 27%. More PRs, less actual review. The volume is exploding. The substance is evaporating.
What a pull request actually does
Strip away the ritual and a PR serves four functions: code review, CI gates, discussion, and visibility. None of these require a pull request specifically.
- CI gates can run on push to any branch — or to main directly.
- Discussion can happen in issues, commit messages, or chat.
- Visibility is just git log with good commit messages.
- Review is the only one that genuinely benefits from the PR interface. And review is the one that breaks at agent scale.
A pull request is not a Git concept. Git has no notion of it. It's a GitHub feature — a web page that shows a diff and lets people comment before merging. Everything else got bolted on.
The case for review is strong — and outdated
Let's be honest about the evidence. Code review has a remarkable track record. An AT&T study of a 200-person organization found a 90% decrease in defects after introducing reviews. An IBM study tracked 11 programs built by the same team — the first 5 without reviews averaged 4.5 errors per 100 lines of code, the next 6 with reviews averaged 0.82. That's an 82% reduction. Capers Jones's analysis of over 12,000 projects found formal inspections catch 60-65% of defects — more than any single form of testing.
These numbers are real. But they describe a practice that no longer exists.
Those studies measured formal inspections — structured, in-person walkthroughs with trained reviewers following a defined methodology. What we do today is async PR reviews: skim the diff between meetings, leave a comment or two, approve. "Code review" in 1993 and "code review" in 2026 are completely different activities sharing the same name. The modern version inherited the reputation without the methodology.
Microsoft Research studied what modern review actually does. They found it's "less about finding defects than expected" — the primary value is knowledge transfer and team awareness. They literally titled one paper "Code Reviews Do Not Find Bugs." The more files in a change, the lower the proportion of useful comments. The reality of most PRs is a quick skim and an approval.
Meanwhile, Capers Jones found that unit testing catches about 25% of defects, function testing 35%, and integration testing 45%. Combine those with static analysis and type checking, and automated tools consistently outperform lightweight reviews at catching the things that actually break production.
Why this matters for agents
Even if we could bring back formal inspections — the rigorous kind that actually works — it wouldn't matter for agent workflows. The reason is that review's strongest remaining benefit doesn't apply.
Microsoft's research showed the main value of modern code review is knowledge transfer. A senior leaves a comment, a junior learns a pattern, the team builds shared understanding over time. This is genuinely valuable for human teams. It compounds.
Agents don't benefit from any of this. They don't learn from your PR comments the way a junior engineer does. Your "nice catch, let's use a map here instead" teaches a human. An agent doesn't carry that lesson forward unless you encode it as a rule in its configuration. The review is write-only — all cost, no compounding value.
Now add volume. An agent can produce dozens of changes per day. Are you reading every diff? Meaningfully? Or are you rubber-stamping? Rubber-stamping is worse than no review at all, because it creates false confidence while consuming the hours you could spend improving automated checks.
What replaces PRs
Validation at write-time, not review-time. The agent should verify the code is correct before it commits — not ask permission after. Tests, type checks, integration checks, schema validation, acceptance criteria. All automated, all before the commit happens.
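A minimal sketch of what write-time validation can look like: the agent runs every check and only commits on green. The check commands here are trivial placeholders; a real setup would substitute your own test runner, type checker, and schema validator, and the `git commit` call is left commented out.

```python
import subprocess
import sys

# Placeholder checks -- swap in your real commands, e.g.
# ["pytest", "-q"] or ["mypy", "src/"].
CHECKS = [
    ("unit tests", [sys.executable, "-c", "assert 1 + 1 == 2"]),
    ("type check", [sys.executable, "-c", "x: int = 1"]),
]

def validate() -> bool:
    """Run every check; a single failure blocks the commit."""
    for name, cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            print(f"FAIL {name}")
            return False
        print(f"PASS {name}")
    return True

def commit_if_valid(message: str) -> bool:
    """Verify first, commit second -- never ask permission after."""
    if not validate():
        return False
    # subprocess.run(["git", "commit", "-am", message])  # real commit goes here
    return True
```

The point of the shape: validation is a precondition of the commit, not a follow-up request. If a check fails, there is nothing for anyone to review, because nothing was committed.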
Continuous deployment over batched review. PRs batch changes and block on a human. Agents work continuously. The right model is: commit, validate, deploy, monitor, rollback if needed. Not: commit, open PR, wait, merge, deploy.
Observability over approval. Instead of a human reading diffs before merge, you watch production after deploy. Alerts, metrics, error rates. You catch the problems that code review misses anyway — which is most of them.
The control illusion
PRs give humans a feeling of control. "I reviewed it, so it's safe." But this is a comfort mechanism, not a safety mechanism. The actual safety comes from whether your tests are good, whether your types are strict, whether your deployment can roll back.
At agent scale, the feeling of control becomes actively harmful. You're spending hours reviewing diffs instead of improving the automated checks that actually prevent breakage. The review becomes the bottleneck that slows everything down while catching almost nothing.
When PRs still make sense
- Open source, where strangers contributing to your project need a gate.
- Cross-team coordination, where multiple humans need to agree on an interface change.
- Regulatory compliance that explicitly requires a human sign-off.
- Security-sensitive code, where the stakes justify the overhead.
These are real use cases. But they're the exception, not the default. Most teams using PRs today are doing it out of habit, not necessity.
The agent-native workflow
Here's what actually works when agents are writing code:
- Define acceptance criteria upfront — not review comments after.
- Agent writes code and runs full validation before committing.
- Strong CI on main — if it passes, it ships.
- Monitoring and alerting in production catch what tests miss.
- Automated rollback when something breaks.
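The loop above can be sketched in a few lines. Everything platform-specific is stubbed out: `deploy()` and `rollback()` are hypothetical hooks for your infrastructure, and the error budget is an illustrative number, not a recommendation.

```python
from dataclasses import dataclass

ERROR_BUDGET = 0.01  # illustrative: max tolerated post-deploy error rate

@dataclass
class Release:
    version: str
    healthy: bool = True

def ship(version: str, checks_pass: bool, observed_error_rate: float) -> Release:
    """Deploy on green validation; roll back automatically if production degrades."""
    if not checks_pass:
        # Validation is the gate -- nothing ships without it.
        raise RuntimeError("validation failed, nothing ships")
    release = Release(version)
    # deploy(release)        # hypothetical platform hook
    if observed_error_rate > ERROR_BUDGET:
        # rollback(release)  # hypothetical platform hook
        release.healthy = False
    return release
```

Note what is absent: there is no step where the release waits on a person. The human's job moved upstream, into the checks and the error budget.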
No PR. No review queue. No human bottleneck. The guardrails are in code, not in process.
This is the same principle that applies to managing agents in general: guardrails in code, not in prompts. A PR is a prompt — a soft suggestion that someone should check this. A test suite is a guardrail — a hard gate that prevents breakage. Build the gate, skip the suggestion.
The horseless carriage
The industry's response to "agents write too much code for humans to review" has been predictable: have agents review each other. Agent writes code, opens a PR, a separate reviewer agent reads the diff, leaves comments, the author agent addresses them, CI runs, merge. It's the same ceremony, just faster.
The agents are cosplaying as a development team.
Multi-agent review tools run 10+ specialized agents in parallel — one for security, one for performance, one for architecture. The principle sounds right: "an auditor doesn't prepare the books." But think about what's actually happening: one model is reading another model's diff. Different prompts and specializations help at the margins, but the fundamental approach — pattern matching on surface features of a diff — is the same whether a human or an agent does it. You've made review faster, not better.
What actually changed? Latency. Not quality.
This is the "horseless carriage" stage. We removed the horse — the human reviewer — but kept the carriage — the PR workflow. The car hasn't been invented yet.
Verify, don't review
The real shift isn't faster review. It's replacing review with verification.
Review is opinion. An agent reading a diff and saying "this looks right" is pattern matching on surface features — no different from a human doing the same thing. Verification is fact. An agent running the code against a specification and demonstrating it satisfies every constraint is a fundamentally different activity.
"This looks correct" versus "this is correct — here's the evidence." The first depends on the reviewer's attention. The second depends on the specification's completeness. One scales with headcount. The other scales with compute.
The bottleneck moves from "do we have enough reviewers" to "is our specification good enough." That's the right bottleneck — because a specification is an asset that compounds. A review is a one-time event that evaporates. Every test you write makes every future change safer. Every review you do is gone the moment you click approve.
The agent doesn't ask "does this look right?" It asks "does this satisfy the defined transitions, pass the defined checks, meet the defined criteria?" No opinion involved. No "looks good to me" — from a human or from another agent.
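The difference between opinion and evidence can be made concrete. In this sketch, a specification is just a set of named predicates, and verification returns both a verdict and the per-check evidence. The example spec checks Python's built-in `sorted`; the check names and properties are illustrative.

```python
# Verification-as-evidence: "this is correct, here's the proof,"
# not "this looks right."

def verify(candidate, spec):
    """Run every spec check against the candidate; collect the evidence."""
    evidence = {name: check(candidate) for name, check in spec.items()}
    return all(evidence.values()), evidence

# An illustrative spec: properties any correct sort must satisfy.
SPEC = {
    "returns sorted output": lambda f: f([3, 1, 2]) == [1, 2, 3],
    "handles empty input": lambda f: f([]) == [],
    "is idempotent": lambda f: f(f([2, 1])) == f([2, 1]),
}

ok, evidence = verify(sorted, SPEC)
```

A reviewer's approval vanishes the moment it is given; the `SPEC` dictionary sticks around and re-verifies every future change for free. That is the compounding asset the essay is arguing for.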
The real question
The value was never the pull request. It was confidence that a change is safe. PRs were the best tool we had for generating that confidence in a human-only workflow. They're not the best tool anymore.
If your automated validation is good enough, you don't need a human reviewing diffs. If it's not good enough, a human reviewing diffs won't save you — they'll just slow you down while missing most of the problems. And replacing the human with another agent reading the same diff is just automating the wrong thing faster.
Invest in the specification, not the ceremony. Build the verification, skip the review. The teams that figure this out first won't just ship faster — they'll ship with more confidence than any review process ever provided.
Sources
- Capers Jones, "Software Defect Removal Efficiency" — analysis of 12,000+ projects on defect removal rates across inspections, testing, and static analysis.
- Fagan, "Advances in Software Inspections" (IEEE TSE, 1986) / AT&T case study — 200-person organization saw 90% defect reduction and 14% productivity increase after introducing formal inspections.
- Russell, "Experience with Inspections in Ultralarge-Scale Developments" (IEEE Software, 1991) — IBM study of 11 programs showing 82% defect reduction with reviews versus without.
- Bacchelli & Bird, "Expectations, Outcomes, and Challenges of Modern Code Review" (Microsoft Research / ICSE 2013) — found modern code review is less about defects than expected; primary value is knowledge transfer and team awareness.
- Czerwonka, Greiler & Tilford, "Code Reviews Do Not Find Bugs" (Microsoft Research, 2015) — empirical study showing most review comments address code style and maintainability rather than functional defects.
- Bosu, Greiler & Bird, "Characteristics of Useful Code Reviews: An Empirical Study at Microsoft" (MSR 2015) — analysis of 1.5 million review comments across five Microsoft projects.
- GitHub Octoverse 2025 — 518.7M pull requests merged (up 29% YoY), commits up 25%, commit comments down 27%. Coding agents created 1M+ pull requests between May and September 2025.