It has been almost two years since Microsoft CEO Satya Nadella forecasted that generative AI would revolutionize knowledge work. However, in many law firms and investment banks today, human employees still dominate the workforce. Despite the buzz surrounding “reasoning” and “planning,” a recent study from Mercor, a training-data company, reveals why the robot takeover has hit a roadblock: AI struggles with the complexity of real-world tasks.
A reality check for the “replacement” theory
Mercor has introduced a challenging benchmark called APEX-Agents. Unlike traditional tests that focus on AI’s ability to write poetry or solve math problems, this benchmark uses actual queries from professionals like lawyers, consultants, and bankers, requiring AI models to complete intricate, multi-step tasks that involve navigating different types of information.
The outcomes? Even the top-performing models on the market, such as Gemini 3 Flash and GPT-5.2, failed to reach an accuracy rate of 25%. Gemini 3 Flash led with 24%, followed closely by GPT-5.2 at 23%. Most other models lagged in the teens.
Why AI is struggling with real-world tasks
Mercor’s CEO Brendan Foody highlights that the challenge lies in context rather than raw intelligence. In practical scenarios, answers are not readily available. For instance, a lawyer might need to review a Slack conversation, analyze a PDF document, examine a spreadsheet, and then synthesize this information to address a query about GDPR compliance.
Humans switch between contexts naturally, but AI models do not. When asked to retrieve information from several sources, they often get confused, provide incorrect answers, or simply give up.
The “Unreliable Intern”
For those concerned about job security, the study offers some reassurance. It suggests that current AI capabilities resemble those of an inexperienced intern who gets things right only about a quarter of the time.
However, the pace of progress is rapid. Just a year ago, these models were achieving accuracy rates between 5% and 10%. Now, the best of them reach 24%. While they are not yet ready to take over knowledge work entirely, they are improving at a much faster rate than anticipated. For now, the revolution in “knowledge work” is on hold until AI masters the art of working across many sources of context at once.