A Carnegie Mellon study found AI agents fail 70% or more of standard business tasks. Here’s what the research actually shows and a practical framework for staying in control.
Carnegie Mellon researchers staffed a simulated company entirely with AI agents and found that even the best models failed more than 70% of standard office tasks. The top performer, Claude 3.5 Sonnet, completed just 24% of assignments successfully.
This doesn’t mean you should stop using AI agents. It means you need a clear framework for which tasks to hand over and which to keep under human oversight. This article gives you that framework.
Here’s a statistic that barely made headlines but deserves a lot more attention: researchers at Carnegie Mellon set up a fake company, staffed it entirely with AI agents, gave those agents real business tasks, and watched what happened.
The result wasn’t pretty. Even the best AI agent, Anthropic’s Claude 3.5 Sonnet, successfully completed only 24% of assigned tasks. [1] Google’s Gemini managed 11%. Amazon’s Nova recorded a 1.7% success rate. [2]
Read that again. The best AI agent in the study failed on roughly three out of every four office tasks it was given.
Now, before you conclude that AI agents are useless and cancel your subscriptions: context matters enormously here. These were complex, multi-step business tasks in a simulated environment with interconnected dependencies. Not every task you’d give an agent in practice is this hard. But the findings do raise a genuinely important question: when can you actually trust an AI agent, and when are you setting yourself up for a problem you’ll have to fix later?
The CMU study created a simulation called TheAgentCompany. It replicated a realistic office environment with roles including a CTO, an HR manager, and engineers. Tasks were drawn from real business functions: finance, administration, HR operations, and software engineering. [1]
The tasks weren’t trivial, but they also weren’t exotic. Things like: close a specific pop-up window, find the right colleague to approve a purchase, log the outcome of a client call in the system, send a summary email to the correct distribution list. The kind of work that any reasonably capable employee would handle without much drama.
What tripped the agents up wasn’t lack of intelligence. It was the same thing that trips up new human employees: navigating unfamiliar interfaces, knowing which person or system to involve, handling unexpected states that didn’t match the instructions, and recovering gracefully when something didn’t go as expected.
Claude 3.5 Sonnet’s 24% success rate made it the clear leader. [2] But even that number means the most capable agent on the market today fails three-quarters of the time on realistic business tasks. That’s not a bug to be fixed in the next model update. It’s a signal about where we actually are in the development of this technology.
The failure patterns in the CMU study were revealing. Agents didn’t fail because they couldn’t understand the instructions. They failed because of what researchers called “multi-step brittleness”: the tendency for small errors early in a task to cascade into compounding failures. [3]
Think about a task like: “Find the client record for Northstar Ltd, check the last three interactions, and draft a follow-up email based on the outcome of the most recent call.” A human does this fluidly because they’re constantly adapting to what they find. If the client record has a typo in the company name, a human notices and adjusts. An agent often doesn’t. It either stops entirely or, worse, continues with incorrect information and produces something that looks right but isn’t.
There’s also the problem of decision points. Many office tasks involve an implicit judgment call: which of two possible actions should I take here? Humans navigate these constantly and unconsciously. Agents are increasingly capable of handling well-defined decision trees, but they still struggle with the kind of contextual judgment that experience provides.
The CMU researchers documented agents making genuinely strange choices when they got stuck, including, in one case, renaming a colleague in the system to get the outcome they needed. [1] That’s not a bad-faith action. It’s an agent doing something technically plausible that a human would immediately recognise as wrong. And it’s a reminder that agents don’t have the same baseline common sense that people do.
The honest take: None of this means AI agents aren’t useful. It means they need supervision. The same way you wouldn’t let a brand-new hire send client emails without review for their first month, you shouldn’t let an agent act autonomously on anything consequential until you’ve validated it works reliably in your specific environment.
The CMU study is a useful reality check, but it doesn’t tell the whole story. There are specific task types where AI agents perform consistently and deliver real value today.
Single-step, well-defined tasks. Drafting an email from bullet points, summarising a meeting transcript, reformatting data from one structure to another, generating a first draft of a standard document. These don’t involve the multi-step cascading that causes most failures. Agents handle these well.
Monitoring and alerting tasks. An agent that watches for specific conditions (a deal going quiet, a ticket sitting unassigned, an anomaly in a data feed) and surfaces it for human attention is operating in a mode that suits current capabilities well. It’s not being asked to solve the problem. It’s being asked to notice it.
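To make the pattern concrete, here’s a minimal sketch of a “notice, don’t act” monitor in plain Python. The deal structure and the two-week threshold are invented for illustration, not drawn from the study; the point is that the only output is a flag for a human to review.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Deal:
    name: str
    last_contact: datetime  # when someone last spoke to this client

def find_quiet_deals(deals: list[Deal], max_silence_days: int = 14) -> list[str]:
    """Return the names of deals with no contact inside the window.
    Note what this does NOT do: it never drafts or sends anything.
    It only surfaces items for a human to look at."""
    cutoff = datetime.now() - timedelta(days=max_silence_days)
    return [deal.name for deal in deals if deal.last_contact < cutoff]

alerts = find_quiet_deals([
    Deal("Northstar Ltd", datetime(2024, 1, 2)),  # long silent: flagged
    Deal("Acme Corp", datetime.now()),            # recent contact: ignored
])
for name in alerts:
    print(f"Review needed: no contact with {name} for over two weeks")
```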
Structured data tasks with clear rules. If the rules are explicit and the inputs are clean, agents do well. Routing incoming requests to the right team based on keywords. Categorising expenses by type. Matching job applications to pre-defined criteria. The more structured the task, the better the agent performs.
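As a sketch of just how explicit the rules can be, here’s a keyword router in the same spirit. The keywords and team names are hypothetical; the detail worth copying is the fallback, where anything the rules don’t cover goes to a person rather than a best guess.

```python
# Explicit routing rules: keyword to team. Invented for illustration.
ROUTES = {
    "invoice": "finance",
    "refund": "finance",
    "password": "it-support",
    "onboarding": "hr",
}

def route_request(subject: str) -> str:
    """Route a request by keyword. Unmatched requests go to human triage
    instead of being guessed, which keeps errors cheap to recover from."""
    lowered = subject.lower()
    for keyword, team in ROUTES.items():
        if keyword in lowered:
            return team
    return "human-triage"

assert route_request("Invoice #1042 is overdue") == "finance"
assert route_request("Strange edge case nobody predicted") == "human-triage"
```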
The pattern is consistent: agents are most reliable when the task is narrow, the inputs are clean, and the definition of success is unambiguous. Anything involving judgment, relationship context, or recovery from unexpected states still benefits significantly from human oversight.
Beyond the general pattern, here are the specific task categories where current AI agents fail often enough that you shouldn’t rely on them without close supervision.
Client-facing communications requiring relationship awareness. An agent doesn’t know that your biggest client is currently frustrated about a delivery delay and that any email from your company right now needs to be extra careful with its tone. You do. Keep a human in the loop on these.
Multi-system tasks with handoffs. Anything that requires data to flow correctly between multiple platforms (CRM to email to calendar to project management) introduces multiple failure points. The agent may succeed at each individual step but fail at the handoff.
Tasks involving sensitive or regulated information. HR processes, financial approvals, compliance-related actions. Not because agents are inherently untrustworthy with sensitive data, but because the consequences of an error here are significant and the tasks often require judgment calls that current agents aren’t reliably equipped to make.
Anything where the agent needs to know what it doesn’t know. This is perhaps the hardest category. Agents don’t reliably flag their own uncertainty. They’ll often produce a confident-sounding output when they should be saying “I’m not sure, please check this.” Our anti-hallucination toolkit covers specific techniques for addressing this, including prompting strategies that explicitly ask the AI to surface its uncertainty.
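The toolkit goes deeper, but as one generic illustration, here is the shape of an uncertainty-surfacing prompt. The wording below is a hypothetical example, not a quote from the toolkit.

```python
# Hypothetical uncertainty-surfacing preamble for any task prompt.
# The exact wording is illustrative; adapt it to your own workflow.
UNCERTAINTY_PREAMBLE = """\
Before answering:
1. List any assumptions you had to make.
2. Mark any fact you cannot verify as [UNVERIFIED].
3. If the task has more than one reasonable interpretation, describe
   the options and stop, instead of silently picking one.
"""

def build_prompt(task: str) -> str:
    """Prepend the uncertainty instructions to a task description."""
    return f"{UNCERTAINTY_PREAMBLE}\nTask: {task}"

print(build_prompt("Draft a follow-up email to Northstar Ltd."))
```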
Here’s the decision framework I’d recommend for any professional evaluating whether to let an agent act autonomously on a given task.
Step 1: Classify the stakes. Low stakes means that if the agent gets this wrong, it’s easy to fix and the cost is minimal. High stakes means a mistake has real consequences: financial, reputational, or relationship-related. High-stakes tasks always need human review of agent outputs, at least until you’ve validated reliability over time.
Step 2: Check for ambiguity. Could a reasonable person interpret this task in two different ways? If yes, the agent will probably pick one interpretation without signalling its choice. That’s a recipe for errors. Either clarify the task before handing it to an agent, or review the output to confirm it interpreted things correctly.
Step 3: Assess the recovery cost. If the agent makes a mistake, how hard is it to fix? An incorrectly tagged CRM record: easy to fix. An email sent to 500 people: not easy to fix. Scale your oversight to the recovery cost.
Step 4: Run a validation period. Before fully trusting an agent on any task, run it in “review mode” for two weeks. Check every output. Measure the error rate. If it’s below 5% and errors are all easy to fix, you can reduce oversight. If it’s higher, either the task isn’t right for autonomous execution yet, or the agent needs better instructions.
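If you log each reviewed output during the validation period, the end-of-period check is simple arithmetic. A minimal sketch, assuming you record whether each output was correct and whether any error was easy to fix:

```python
# Minimal end-of-validation check, applying the thresholds from Step 4:
# an error rate under 5% where every error was easy to fix.
def ready_for_reduced_oversight(reviews: list[dict]) -> bool:
    """reviews: one dict per checked output, e.g.
    {"correct": True, "easy_to_fix": True}"""
    if not reviews:
        return False  # no evidence yet, so keep reviewing
    errors = [r for r in reviews if not r["correct"]]
    error_rate = len(errors) / len(reviews)
    all_recoverable = all(r["easy_to_fix"] for r in errors)
    return error_rate < 0.05 and all_recoverable

# Two weeks of spot checks: 58 correct outputs, 2 easily fixed mistakes.
log = [{"correct": True, "easy_to_fix": True}] * 58
log += [{"correct": False, "easy_to_fix": True}] * 2
print(ready_for_reduced_oversight(log))  # 2/60 is about 3.3%, so True
```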
This isn’t a complicated framework. It’s the same judgment any good manager applies when delegating to a new team member. The difference is that with human team members, most people apply this intuition naturally. With AI agents, you have to be intentional about it.
If you’re currently using AI agents in any part of your workflow, do a 15-minute audit. For each task the agent handles, ask: what happens if it gets this wrong, and would I notice quickly? If the answer to “would I notice quickly” is no, add a spot-check to your routine until you build confidence.
If you’re not yet using agents and wondering whether to start: begin with the monitoring and alerting category. Set up an agent that flags things for your attention rather than acts on them. That’s the lowest-risk entry point and it delivers real value quickly without requiring you to trust the agent’s judgment.
The CMU study isn’t a reason to distrust AI agents. It’s a reason to approach them with the same clear-eyed calibration you’d apply to any new tool or team member: useful when well-directed, risky when left without guardrails.
What professionals most want to know about AI agent reliability.
What was the Carnegie Mellon AI agent study?
Carnegie Mellon researchers created a simulated company called TheAgentCompany and staffed it entirely with AI agents. They gave the agents real business tasks drawn from finance, HR, administration, and software engineering, then measured how often agents successfully completed each task. The results showed failure rates of 70% or higher even for the best-performing models.
Which AI agent performed best in the CMU study?
Anthropic’s Claude 3.5 Sonnet was the top performer with a 24% task completion rate. Google’s Gemini achieved 11%, and Amazon’s Nova recorded just 1.7%. While Claude led the field by a significant margin, even a 24% success rate means the best available agent fails on three out of four tasks in a complex office environment.
Should I stop using AI agents if they fail 70% of the time?
No, but you should be selective about what tasks you hand over and how much oversight you apply. The CMU study used complex, multi-step tasks in an interconnected system. For narrow, well-defined tasks with clean inputs (drafting documents, summarising transcripts, flagging specific conditions), agents perform much more reliably. The key is matching task type to current capability.
How do I know if an AI agent has made a mistake?
The main challenge is that agents often don’t flag their own uncertainty. They produce confident-looking outputs even when they’ve gone wrong. Your best defence is a review step, especially during the first weeks of deploying any agent on a new task. Look for logical inconsistencies, verify any specific facts or figures the agent cites, and check that the agent correctly interpreted the task rather than solving a subtly different one.
What types of tasks are AI agents most reliable for right now?
Agents perform best on single-step, well-defined tasks (drafting from bullet points, summarising content, reformatting data), monitoring and alerting tasks (flagging conditions for human review), and structured data tasks with explicit rules (categorising records, routing requests). They struggle most with multi-step processes, tasks involving judgment about relationships or context, and anything where recovering from an error is expensive.
Sources