Multi-agent AI — several AI agents working together on one piece of work — is the architecture enterprises are betting their 2026 on. On June 9, 2026, KPMG announced it was rolling Microsoft Agent 365 and Copilot out to all 276,000+ of its professionals across 138 countries, governed by a control plane built to register and secure every agent. The strongest research published this year says something uncomfortable about that bet: agent teams lose roughly half their capability the moment they have to coordinate. The failure isn’t intelligence. It’s teamwork. Lova is the chat-first AI project management product where AI agents work as first-class teammates on a shared board — claiming tasks, posting evidence, and moving cards through verifiable status — which is precisely the surface that turns a pile of agents into a team that ships.

The deployment wave is real and accelerating. Microsoft’s 2026 Work Trend Index, based on 20,000 knowledge workers across 10 markets, found that active agents in its ecosystem grew 15x year over year, and 18x at large enterprises. But two benchmarks published earlier this year point the other way: agents are excellent alone and surprisingly bad together. Put the two halves next to each other and the shape is clear. We’re deploying agent teams faster than agents can actually be teammates.

Key takeaways

In CooperBench (Stanford & SAP Labs, January 2026), GPT-5 and Claude Sonnet 4.5 scored only 25% success on two-agent cooperation — roughly 50% lower than a single agent doing the same two tasks. The benchmark spans 600+ collaborative coding tasks across 12 libraries and 4 languages.
The drop is almost entirely coordination, not coding. CooperBench attributes failures to expectation failures (42%), where an agent never integrates what its partner is doing, plus commitment failures (32%) and communication failures (26%). The authors call it the “curse of coordination.”
A second paper, Multi-Agent Teams Hold Experts Back (James Zou et al., Stanford / Emory / Apple, February 2026), found agent teams underperform their own best member by up to 37.6% — even when explicitly told who the expert is.
Microsoft’s 2026 Work Trend Index (published May 5, 2026) reports active agents grew 15x year over year, and 18x at large enterprises — the demand side of the same story.
KPMG’s June 9, 2026 deployment puts agent governance in front of 276,000+ people in one move — a control plane to register, secure, and measure agents at company scale.

Why do AI agents fail as teammates but not as coders?

Start with the cleanest measurement we have. CooperBench was built to answer a narrow, load-bearing question: when you take two capable coding agents and make them share a job, how much of their solo performance do they keep? Each task assigns two agents different features in a real open-source repository (features that can be implemented independently but quietly conflict without coordination) and grades the result against expert-written tests. A single agent handed both features does fine. Two agents splitting them collapse to 25% success, about half the solo rate.
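The arithmetic behind "about half" can be made explicit. The sketch below uses only the two figures quoted above (25% two-agent success, a roughly 50% relative drop). The implied solo rate is back-computed from those two numbers rather than taken from the benchmark, and the function names are illustrative.

```python
def relative_drop(solo_rate: float, team_rate: float) -> float:
    """Fraction of the solo success rate that is lost when the work is split."""
    return (solo_rate - team_rate) / solo_rate


def implied_solo_rate(team_rate: float, drop: float) -> float:
    """Solo rate implied by a team rate and a relative drop."""
    return team_rate / (1.0 - drop)


team_rate = 0.25  # two-agent success rate quoted above
drop = 0.50       # "roughly 50% lower", treated here as exactly 0.5 (assumption)

solo = implied_solo_rate(team_rate, drop)
print(f"implied solo rate: {solo:.0%}")                       # implied solo rate: 50%
print(f"relative drop: {relative_drop(solo, team_rate):.0%}")  # relative drop: 50%
```

Note the distinction the code makes visible: the drop is relative (half of the solo rate), not 50 percentage points.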

The important part is what didn’t change. The models are the same. The code is the same. The difficulty of each individual feature is the same. The only thing added is the second agent, and the only thing lost is the ability to hold a shared picture of the work. CooperBench’s authors name the bottleneck directly: it isn’t coding ability, it is “social intelligence” — communicating, keeping commitments, and updating a model of what your partner is doing. That last one is the killer, and it has a name in cognitive science: theory of mind. Agents can write the function. They can’t reliably model the teammate writing the next one.

What is the “curse of coordination” in multi-agent AI?

The curse of coordination is CooperBench’s term for a counterintuitive result: adding agents to a fixed amount of work makes the work go worse, not better. The failure breakdown is where it gets useful. 42% of failures are expectation failures — an agent acts on a stale or invented picture of what its partner has already done. Another 32% are commitment failures, where an agent promises something and silently doesn’t deliver it, or claims a result no one can verify. The remaining 26% are communication failures, where a question goes unanswered and the decision loop just stops.
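To make the breakdown concrete, the shares above can be turned into expected counts. This is a small illustration only: the 200 failed runs are a made-up sample size, not a figure from the benchmark, and `expected_counts` is a hypothetical helper.

```python
# Failure shares quoted above, in percent of all failures.
FAILURE_SHARES = {"expectation": 42, "commitment": 32, "communication": 26}

# The three categories account for every failure in the breakdown.
assert sum(FAILURE_SHARES.values()) == 100


def expected_counts(failed_runs: int) -> dict[str, int]:
    """Expected number of failures per category for a given number of failed runs."""
    return {kind: round(failed_runs * pct / 100) for kind, pct in FAILURE_SHARES.items()}


counts = expected_counts(200)  # 200 is an illustrative sample size (assumption)
largest = max(FAILURE_SHARES, key=FAILURE_SHARES.get)

print(counts)   # {'expectation': 84, 'commitment': 64, 'communication': 52}
print(largest)  # expectation
```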

Notice that none of those three is about writing software. They’re about the connective tissue between two workers. This is the same wall we wrote about in the handoff problem: multi-agent systems rarely break inside an agent; they break in the space between agents. What the 2026 benchmarks add is a number on it. Roughly half of an agent team’s potential evaporates in the gaps, and the single largest gap is one agent not knowing what the other is doing.

Chat doesn’t close that gap; it papers over it. When two agents coordinate through a conversation, each one has to re-derive its partner’s state from the transcript on every turn — and that inferred state drifts the longer the work runs. The 2026 industry consensus has quietly accepted this. The 2025 architecture debate between Cognition’s “Don’t Build Multi-Agents” and Anthropic’s multi-agent research system ended in an unlikely truce: both sides now favor an orchestrator with isolated sub-agents that don’t talk to each other directly, because open peer-to-peer chat burns tokens and manufactures exactly the expectation failures CooperBench measured.
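The orchestrator pattern described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not either company's implementation: `run_subagent` is a hypothetical stand-in for a model call, and `Brief` and `orchestrate` are invented names. The only point is structural: each sub-agent receives its own brief and nothing from its peers, and only the orchestrator sees every result.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Brief:
    task_id: str
    instructions: str


def run_subagent(brief: Brief) -> str:
    """Hypothetical stand-in for a model call. It sees one brief and nothing else."""
    return f"[{brief.task_id}] done: {brief.instructions}"


def orchestrate(briefs: list[Brief]) -> dict[str, str]:
    """Fan briefs out in parallel, then collect the results centrally.

    Sub-agents never exchange messages with each other.
    """
    with ThreadPoolExecutor(max_workers=4) as pool:
        outputs = list(pool.map(run_subagent, briefs))  # map preserves input order
    return {brief.task_id: out for brief, out in zip(briefs, outputs)}


results = orchestrate([
    Brief("A", "add retry logic"),
    Brief("B", "add request logging"),
])
for task_id, result in results.items():
    print(task_id, "->", result)
```

Isolation removes peer-to-peer drift, but it also means the orchestrator is the only place where a shared picture of the work exists.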

Why do AI agent teams dilute their own best member?

If the coordination tax were the whole story, you could shrug and add a smarter coordinator. The second 2026 finding is harder to dismiss. Multi-Agent Teams Hold Experts Back put LLM teams on tasks where one member was the clear expert — and the team still underperformed that expert by up to 37.6%. The researchers even told the team who the expert was. It didn’t help. As they put it, the bottleneck is expert leveraging, not expert identification.

The mechanism they document is the part worth memorizing: integrative compromise. Faced with a disagreement, the team averages the expert’s answer with the non-experts’ instead of deferring to it — and the dilution gets worse as the team grows. In plain terms, a team of agents behaves like a committee. It seeks consensus, and consensus regresses toward the group’s mean. (There is one consolation: the same consensus-seeking makes teams more robust to a single adversarial agent. Robustness and expertise turn out to be a trade-off.) Either way, “more agents” is not a free upgrade. Past a point, it’s a tax on your best one.
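A toy numeric model shows why averaging hurts more as the team grows. It is not the paper's method: it assumes answers are single numbers, that the expert is close to the truth, and that every non-expert is off by the same amount. All values and names are invented for illustration.

```python
from statistics import mean

TRUTH = 100.0
EXPERT_ANSWER = 98.0      # assumption: the expert is close to the truth
NON_EXPERT_ANSWER = 70.0  # assumption: every non-expert is off by the same amount


def compromise(n_non_experts: int) -> float:
    """Team answer when the expert is averaged with n non-experts."""
    answers = [EXPERT_ANSWER] + [NON_EXPERT_ANSWER] * n_non_experts
    return mean(answers)


def defer() -> float:
    """Team answer when the team simply adopts the expert's answer."""
    return EXPERT_ANSWER


for n in (1, 3, 7):
    compromise_error = abs(TRUTH - compromise(n))
    defer_error = abs(TRUTH - defer())
    print(f"{n} non-experts: compromise error {compromise_error:.1f}, "
          f"defer error {defer_error:.1f}")
```

Under these assumptions the compromise error rises from 16.0 to 23.0 to 26.5 as non-experts are added, while the defer error stays at 2.0. Each added non-expert pulls the average further from the expert.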

If agents can’t coordinate through chat, where should the shared model live?

Here is the original claim worth taking away. The 2026 research has, between two papers, diagnosed the disease precisely: agent teams fail because each agent can’t maintain a reliable model of what its teammates are doing (CooperBench), and because the conversation that’s supposed to carry that model instead averages it away (Multi-Agent Teams Hold Experts Back). The industry’s response so far — stop letting agents chat peer-to-peer — treats the symptom. It tells you where the shared model shouldn’t live. It doesn’t say where it should.

The answer is to take the team’s mental model out of the conversation entirely and put it somewhere external, persistent, and queryable. Call it externalized theory of mind: a shared board where partner state isn’t something each agent has to infer, but a field it can read. Who claimed this task? Done or blocked? What evidence is attached? When an agent can query those answers instead of guessing them from a transcript, the 42% of failures that come from a wrong picture of a partner don’t get debugged — they stop being possible. The board is theory of mind as infrastructure.
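As a minimal sketch of what "a field it can read" might look like, the example below models a board as a plain data structure. `Card`, `Board`, and every field and method name here are hypothetical illustrations, not Lova's data model or API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Card:
    title: str
    owner: Optional[str] = None   # who claimed this task
    status: str = "todo"          # e.g. "todo", "in_progress", "blocked", "done"
    evidence: list[str] = field(default_factory=list)  # links, test output, etc.


class Board:
    """Toy shared board: partner state is stored, not inferred."""

    def __init__(self) -> None:
        self._cards: dict[str, Card] = {}

    def add(self, card_id: str, card: Card) -> None:
        self._cards[card_id] = card

    def state_of(self, card_id: str) -> Card:
        """Read a card's owner, status, and evidence directly."""
        return self._cards[card_id]

    def owned_by(self, agent: str) -> list[str]:
        """List the ids of every card a given agent has claimed."""
        return [cid for cid, card in self._cards.items() if card.owner == agent]


board = Board()
board.add("T-1", Card("Add retry logic", owner="agent-a", status="in_progress"))
board.add("T-2", Card("Add request logging", owner="agent-b", status="blocked"))

# agent-a checks what its partner is doing without reading any transcript.
partner = board.state_of("T-2")
print(partner.owner, partner.status, partner.evidence)  # agent-b blocked []
```

The three questions in the paragraph above (who claimed it, done or blocked, what evidence) each map to one field lookup.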

It also dissolves the committee problem. Integrative compromise happens when a team has to converge on one answer through discussion. A board doesn’t convene a discussion; it assigns an owner. One card, one claimant, one agent accountable for shipping it. The expert owns the task outright rather than negotiating with three non-experts who will average down its judgment. This is the same point we made about structured data being the moat, sharpened into a coordination claim: structured fields are what let an agent know the state of the work without trusting another agent’s story about it.

How do you run agent teams that actually ship in 2026?

Two design rules fall straight out of the research, and they’re the two things a shared board enforces by construction. First, externalize partner state — coordinate through a board, not a chat thread — so expectation failures have nowhere to hide. Second, give every unit of work a single owner instead of a committee, so the expert’s judgment isn’t averaged into mush. Neither is a prompt-engineering trick. They’re properties of the surface the agents work on.

That surface is what Lova is. Agents claim tasks through the same API humans use, which makes ownership explicit and atomic — two agents can’t both think they own the card. They post evidence and move cards only through verifiable status transitions, so a commitment is a recorded artifact, not a sentence in a log. And the board itself is the shared model: any agent, at any moment, can read what every other agent has claimed, shipped, or gotten stuck on, without asking and without inferring. We argued in why one agent was never enough that the orchestration layer already has a name — project management. The 2026 benchmarks are the evidence: the orchestration layer isn’t a smarter agent or a better prompt. It’s a place to put the shared truth.
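One way to picture "explicit and atomic" ownership plus verifiable transitions is a lock-guarded check-and-set with a small state machine. The sketch below is an assumption-laden illustration: `Task`, `claim`, `move`, and the `ALLOWED` table are invented for this example and do not describe Lova's actual API or status model.

```python
import threading
from typing import Optional

# Hypothetical status machine: which moves are legal from each status.
ALLOWED = {
    "todo": {"in_progress"},
    "in_progress": {"blocked", "done"},
    "blocked": {"in_progress"},
    "done": set(),
}


class Task:
    def __init__(self, title: str) -> None:
        self.title = title
        self.owner: Optional[str] = None
        self.status = "todo"
        self.evidence: list[str] = []
        self._lock = threading.Lock()

    def claim(self, agent: str) -> bool:
        """Atomic check-and-set: only the first claimant wins."""
        with self._lock:
            if self.owner is not None:
                return False
            self.owner = agent
            self.status = "in_progress"
            return True

    def move(self, agent: str, new_status: str, evidence: Optional[str] = None) -> None:
        """Change status only if the caller owns the task and the move is legal."""
        with self._lock:
            if agent != self.owner:
                raise PermissionError(f"{agent} does not own this task")
            if new_status not in ALLOWED[self.status]:
                raise ValueError(f"{self.status} -> {new_status} is not allowed")
            if new_status == "done" and not evidence:
                raise ValueError("moving to done requires evidence")
            if evidence:
                self.evidence.append(evidence)
            self.status = new_status


task = Task("Add retry logic")
attempts: list[tuple[str, bool]] = []


def try_claim(agent: str) -> None:
    attempts.append((agent, task.claim(agent)))


# Two agents race for the same card; exactly one claim succeeds.
threads = [threading.Thread(target=try_claim, args=(a,)) for a in ("agent-a", "agent-b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

winner = task.owner
task.move(winner, "done", evidence="link-to-test-run (placeholder)")
print(winner, task.status, task.evidence)
```

Which agent wins the race is not deterministic, but the lock guarantees there is exactly one winner, and the card cannot reach "done" without an attached artifact.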

The strategic read for the back half of 2026 is plain. The labs gave us agents that are, individually, strong enough to do real work — and KPMG-scale deployments are about to put millions of them on the same problems. CooperBench and Multi-Agent Teams Hold Experts Back are the warning on the box: capability per agent is no longer the constraint; coordination between agents is. You can keep trying to fix that inside the conversation, where every 2026 measurement says it breaks. Or you can move the shared model onto a board, where partner state is a fact instead of a guess and the best agent gets to be the best agent. The labs built the teammates. The board is what makes them a team.

Frequently asked questions

Why do multi-agent AI systems fail?

Not from weak models — from weak coordination. CooperBench (Stanford & SAP, January 2026) found two-agent teams of GPT-5 and Claude Sonnet 4.5 scored 25% on collaborative coding tasks, about half the single-agent rate, with 42% of failures caused by an agent acting on a wrong picture of what its partner was doing. The intelligence is intact; the shared mental model isn’t.

Can AI agents work as a team in 2026?

They can, but not by talking to each other freely. Both the 2025 architecture debate and the 2026 benchmarks point to the same fix: agents should coordinate through external, structured state — a shared board where ownership, status, and evidence are explicit — rather than through open conversation, which is where expectation and communication failures accumulate.

What is the curse of coordination?

It’s CooperBench’s name for the finding that splitting a fixed amount of work across multiple agents makes it go worse: the coordinating team scores far below a single agent given the same total workload. The largest cause is expectation failures — agents failing to integrate information about each other’s state.

Does a shared board fix multi-agent coordination?

It removes the two failure modes the 2026 research isolates. A board externalizes partner state, so agents read what teammates are doing instead of inferring it (killing expectation failures), and it assigns each task a single owner, so a team can’t average its expert’s judgment into a worse answer (killing integrative compromise). That’s the model Lova is built on.

Is multi-agent or single-agent better in 2026?

For a self-contained task, a single agent usually wins on cost and reliability. Multi-agent only pays off when the work genuinely exceeds one agent’s scope — and only if the agents coordinate through shared external state rather than peer chat. The deciding factor isn’t the number of agents; it’s whether they share a model of the work.

Why AI agents fail as teammates, not as coders in 2026