As someone who's spent years helping businesses automate their workflows, I've watched AI evolve from a promising technology to an essential business tool. But there's a critical gap between AI's potential and its actual performance that most people aren't talking about—and it could be costing your business more than you think.
Recent research has revealed something that should make every automation professional pause: on basic everyday calculations, even the best-performing AI chatbots get the answer wrong nearly 40% of the time. Let me break down what this means for your business and how you can work around these limitations.
The Omni Research on Calculation in AI (ORCA) study tested five leading AI models with 500 everyday math problems—the kind of calculations businesses rely on daily. The results were eye-opening, and frankly, they challenge the narrative that AI can handle everything we throw at it.
The study examined the current generation of publicly available AI chatbots:
These aren't obscure models. These are the tools millions of businesses and individuals use every single day to make decisions, process data, and automate calculations.
Here's what keeps me up at night as an automation consultant: not a single AI model achieved better than 63% accuracy on everyday math problems.
Let me give you the complete picture:
Gemini leads the pack at 63% accuracy—meaning it still fails nearly 4 out of every 10 problems
Grok follows closely at 62.8%, essentially tied for first place
DeepSeek ranks third with 52% accuracy
ChatGPT trails at 49.4%, meaning it gets fewer than half of the problems right
These numbers should fundamentally change how we think about deploying AI in business-critical scenarios.
Not all hope is lost. When it comes to pure mathematical operations and unit conversions (147 of the 500 test prompts), AI performs significantly better:
With an average accuracy of 72.1% across all models, this category is the highest-performing area. Even here, though, more than 1 in 4 answers is wrong.
According to the research, AI accuracy is at its lowest on calculations about physical tasks. This is particularly concerning for businesses in manufacturing, logistics, construction, or any field where real-world measurements matter.
As someone who designs automated systems, I've learned that understanding failure modes is just as important as understanding capabilities. The ORCA study identified four distinct error categories that explain why AI gets calculations wrong.
This is where AI understands the problem and selects the correct formula but fumbles during execution:
Think of this as knowing how to drive but accidentally pressing the wrong pedal. The AI "gets it" conceptually but fails in execution—especially dangerous in multi-step calculations where one error cascades through the entire solution.
These are the errors that concern me most as an automation professional because they reveal fundamental comprehension failures:
When AI makes logic errors, it's not just getting numbers wrong. It's solving the wrong problem entirely, which is far more dangerous in business contexts.
AI sometimes fails to correctly parse what's being asked. This manifests as:
This is particularly problematic in business environments where precision in language matters and ambiguity is common.
Sometimes AI simply refuses to answer or deflects the question rather than attempting a solution. While this might seem like the safest failure mode, it's frustrating when you need actionable answers quickly.
After reviewing this research and reflecting on years of automation work, here's my practical advice for business leaders and automation professionals:
The study offers clear guidance on which AI to use for specific scenarios:
Never trust AI calculations blindly. Here's what I recommend:
Always verify critical calculations using traditional calculators or spreadsheets
Use multiple AI models for important computations and cross-reference results
Rephrase and resubmit the same problem to check for consistency
Build validation steps into your automated workflows
Keep humans in the loop for high-stakes decisions
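To make the validation idea concrete, here is a minimal Python sketch of a check that compares AI-reported numbers with a value computed by ordinary code. Everything in it is illustrative: the function name `check_against_reference`, the model names, and the sample answers are my own assumptions, not something from the ORCA study, and the sketch assumes an earlier step has already parsed each chatbot's reply into a number.

```python
import math
from dataclasses import dataclass


@dataclass
class CheckResult:
    accepted: bool     # True only if every AI answer matches the reference
    reference: float   # the deterministically computed value
    mismatches: dict   # model name -> answer that disagreed


def check_against_reference(ai_answers, reference, rel_tol=1e-6):
    """Compare AI-reported numbers with a value computed by ordinary code.

    ai_answers: hypothetical mapping of model name -> numeric answer,
                already parsed from each chatbot's reply.
    reference:  the same quantity computed deterministically.
    """
    if not ai_answers:
        raise ValueError("no AI answers to check")
    mismatches = {
        name: value
        for name, value in ai_answers.items()
        if not math.isclose(value, reference, rel_tol=rel_tol)
    }
    return CheckResult(
        accepted=not mismatches,
        reference=reference,
        mismatches=mismatches,
    )


# Example: 18% tax on a 1,249.50 subtotal, computed in plain Python.
subtotal = 1249.50
reference_total = round(subtotal * 1.18, 2)

# Made-up answers standing in for replies from two different chatbots.
answers = {"model_a": 1474.41, "model_b": 1474.14}

result = check_against_reference(answers, reference_total, rel_tol=1e-9)
if not result.accepted:
    # In a real workflow this is where a human reviewer would be looped in.
    print("Escalate to a human reviewer:", result.mismatches)
```

The design choice here is that the AI answer is never the source of truth. The workflow accepts a number only when it agrees with the deterministic calculation, and any disagreement is routed to a person.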
The research specifically highlights that rounding errors compound in multi-step calculations. If an error occurs at any point in a sequence, the final result can be dramatically off.
Action item: For any calculation involving multiple steps, break down the process and verify intermediate results, not just the final answer.
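One way to put that action item into practice is to recompute each step yourself and find the first intermediate value that disagrees with what the AI reported. The sketch below is a rough illustration under my own assumptions: `first_divergent_step`, the order example, and the claimed values are hypothetical, and it uses Python's `decimal` module so the check does not add rounding drift of its own.

```python
from decimal import Decimal, ROUND_HALF_UP


def first_divergent_step(start, steps, claimed, places="0.01"):
    """Recompute a chain of steps and return the label of the first
    intermediate value that disagrees with the claimed one, or None.

    steps:   list of (label, function) pairs applied in order.
    claimed: intermediate values reported by the AI (hypothetical input),
             one Decimal per step.
    """
    if len(steps) != len(claimed):
        raise ValueError("need exactly one claimed value per step")
    quantum = Decimal(places)
    value = start
    for (label, func), reported in zip(steps, claimed):
        value = func(value)  # carry the exact value forward, unrounded
        expected = value.quantize(quantum, rounding=ROUND_HALF_UP)
        if expected != reported:
            return label
    return None


# Example order: 3 units at 19.99, a 15% discount, then 8.25% tax.
unit_price = Decimal("19.99")
steps = [
    ("subtotal", lambda x: x * 3),
    ("discount", lambda x: x * Decimal("0.85")),
    ("tax", lambda x: x * Decimal("1.0825")),
]

# Made-up intermediate values standing in for an AI's step-by-step reply.
good = [Decimal("59.97"), Decimal("50.97"), Decimal("55.18")]
bad = [Decimal("59.97"), Decimal("51.97"), Decimal("56.26")]

print(first_divergent_step(unit_price, steps, good))
print(first_divergent_step(unit_price, steps, bad))
```

Checking only the final total would flag the second reply as wrong without saying where it went wrong. Checking each step points to the discount calculation as the place the error entered.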
As someone who advocates for intelligent automation, I need to be honest with you: this research doesn't diminish the value of AI—it clarifies where we are in the technology's evolution.
AI remains transformational for:
But for calculation accuracy, we're not there yet. An error rate of roughly 40% or more isn't a reason to abandon AI. It's a reason to be smarter about how we deploy it.
The study's conclusion is clear: significant improvements are still needed to achieve reliable mathematical and conversational logic in AI systems.
As these models continue to evolve, I expect we'll see:
Until then, the responsibility falls on us—the practitioners, consultants, and business leaders—to use these tools wisely and implement appropriate safeguards.
AI has revolutionized how we work, and I'm more bullish than ever on automation's potential. But blind faith in technology is never the answer. The ORCA study gives us valuable data to make better decisions about when to trust AI and when to verify.
In my automation practice, I've always emphasized that the goal isn't to replace human judgment—it's to augment it. These findings reinforce that philosophy. Use AI as a powerful assistant, but keep your critical thinking engaged, especially when numbers matter.
The businesses that will win in the AI era aren't those that automate everything blindly. They're the ones that understand the technology's limitations and build systems that account for them.
Stay automated, stay intelligent, and most importantly—stay skeptical.
Hamza Baig is the founder of Hexona Systems, an automation agency and software platform that helps thousands of entrepreneurs and business owners implement AI-powered workflows at scale.