Collective Intelligence

Rethinking AI Evaluation Through Human Paradigms


Disclaimer: I am not an AI expert. The thoughts that follow are offered from the perspective of an ordinary AI user—albeit one with hands-on experience building intelligent software. It is from this vantage point, as both a builder and an enthusiast, that I write, pumped for the future of AI.

The discourse surrounding artificial intelligence evaluation has reached a curious inflection point. As researchers benchmark AI systems against standardized tests like AIME and polyglot assessments, we find ourselves trapped in a reductionist framework that fundamentally misunderstands both human and artificial intelligence. A recent paper by Apple, "The Illusion of Thinking," got me thinking about this narrow perspective. The study critiques the reasoning capabilities of Large Reasoning Models (LRMs) like DeepSeek-R1 and Claude 3.7 Sonnet, treating data contamination and the models' failure to execute algorithmic reasoning as disqualifying factors for genuine intelligence. I find that this approach misses the forest for the trees. I strongly believe that the true potential of AI lies not in creating a solitary, polymathic entity, but in fostering collective intelligence: a network of specialized, interconnected AI agents that mirrors the very engine of human progress.

Human Intelligence

This got me thinking: what if we turned the lens inward and examined human intelligence through the same unforgiving microscope? If we applied identical evaluation standards to human cognition, the results would be humbling. Despite years of exposure to knowledge, most humans would score below 80% on virtually any standardized assessment, and very few could achieve high scores across diverse specialized tests. Under the stringent criteria we impose on AI systems, our hypothetical creators would likely deem humanity "unready" for deployment. Yet this perspective misses the fundamental truth about human intelligence: we are not individually omniscient, but collectively brilliant.

The Specialization Imperative

Consider the elegant division of human expertise. A pilot, masterful in navigating aircraft through complex atmospheric conditions, would likely stumble through a surgical procedure. Similarly, a surgeon, capable of intricate life-saving operations, might struggle inside the cockpit of an aircraft. A theater director understands narrative arc and human emotion but may be lost in a courtroom where a lawyer thrives on legal precedent and argumentation.

This professional specialization reflects foundational economic principles that trace back to Adam Smith's insights on the division of labor. Smith argued that breaking complex work into simpler, specialized tasks increases productivity and efficiency because individuals become experts in their specific tasks, performing them more quickly and accurately. David Ricardo's theory of comparative advantage further developed these concepts. Contemporary theories, particularly Nobel laureate Gary Becker's work on "human capital," provide a modern lens for understanding how individuals invest in domain-specific expertise to maximize their productive capacity within specialized fields.

Each professional has undergone domain-specific training—years of knowledge drills, skill refinement, and experiential learning—representing what economists recognize as human capital investment. The "Illusion of Thinking" paper reveals that Claude 3.7 Sonnet could generate over 100 correct moves for Tower of Hanoi problems but failed after just 4-5 moves in River Crossing puzzles, attributing this to data contamination—Tower of Hanoi being more prevalent in training data than complex River Crossing instances. But isn't this precisely analogous to human specialization? In the context of AI evaluation, wouldn't our specialized education constitute "data contamination"? The surgeon has been "contaminated" by medical school, residency, and countless procedures. The pilot has been "contaminated" by flight training, simulator hours, and real-world flying experience. What we label as "contamination" in AI mirrors the very foundation of human expertise—concentrated exposure to domain-specific knowledge that creates comparative advantage, just as economic theory predicts.

This economic logic finds a direct parallel in software architecture's evolution from monolithic systems to microservices. A monolithic system, much like a generalist, attempts to handle all tasks within a single, tightly coupled codebase, often becoming brittle and difficult to scale. Microservice architecture, however, mirrors human specialization precisely: it constructs complex applications from collections of small, independent services, each an expert in its domain. The Platonic ideal of such a service is a pure function—a mathematical concept for an operation that, given the same input, produces predictable results with no unintended side effects. Deep training creates this "pure function" of expertise for specific domains. The critical breakthrough lies in orchestration—the equivalent of tool-calling in AI systems—which functions as an API gateway or service mesh in software architecture. Just as these orchestration layers coordinate microservices through standardized protocols, AI orchestration enables specialized agents to collaborate seamlessly. Therefore, evaluating a single AI model across all human knowledge resembles criticizing a highly optimized microservice for not being the entire monolith. The profound challenge isn't building an all-knowing entity, but architecting the orchestration layer that allows specialized "pure function" agents to collaborate into powerful, resilient collective intelligence.
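As a rough sketch of this analogy (every name here is invented for illustration, not any real framework's API): each specialist is a pure function that knows one domain, and a gateway object plays the role of the orchestration layer, knowing nothing about any domain except how to route requests.

```python
from typing import Callable, Dict

# Hypothetical "specialist" services: each is a pure function that maps
# a request to a result with no side effects, mirroring deep domain training.
def diagnose(symptoms: str) -> str:
    return f"diagnosis for: {symptoms}"

def plan_route(origin: str, dest: str) -> str:
    return f"flight plan from {origin} to {dest}"

# The orchestration layer plays the role of an API gateway: it holds no
# domain knowledge itself, only a registry for dispatching to specialists.
class Gateway:
    def __init__(self) -> None:
        self._services: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, fn: Callable[..., str]) -> None:
        self._services[name] = fn

    def call(self, name: str, *args: str) -> str:
        if name not in self._services:
            raise KeyError(f"no specialist registered for '{name}'")
        return self._services[name](*args)

gateway = Gateway()
gateway.register("medicine", diagnose)
gateway.register("aviation", plan_route)

print(gateway.call("medicine", "persistent cough"))
print(gateway.call("aviation", "JFK", "LHR"))
```

The point of the sketch is the shape, not the stubs: the gateway can be criticized only for its routing, and each specialist only for its domain, which is exactly the evaluation boundary the monolith-versus-microservices argument draws.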

While the monolith versus microservices debate continues in software engineering, human civilization has already provided the answer: our greatest achievements emerge not from individual polymaths, but from networks of specialized experts collaborating through standardized protocols.

Beyond Individual Metrics

The benchmarks themselves serve an important purpose and effectively showcase AI's impressive capabilities. However, we need to fundamentally rethink our priorities. These evaluations are more a measure of how well a model can imitate a generalized human than a true assessment of its capabilities. Instead of obsessing over whether a single AI can rival a human polymath, the more pertinent questions are: How rapidly can these models learn? What are the boundaries of their learning capacity? And how can they be fine-tuned to excel in specific domains, much like human professionals? Instead of focusing solely on individual AI performance against human-designed tests, we should be developing frameworks to evaluate collective intelligence.

The "Illusion of Thinking" study, while methodologically rigorous, reveals an important limitation in our evaluation frameworks. It shows that LRMs perform well in medium-complexity tasks but collapse at high complexity, and that models "think less" as problems become more complex. However, these findings point to a deeper issue: we're evaluating AI through the lens of individual performance rather than collective capability.

Current state-of-the-art AI systems already exceed the capabilities of even the most accomplished human polymaths across most domains. With research being a notable exception (and even that gap is rapidly closing), we've reached a point where the limiting factor isn't individual AI capability but our ability to orchestrate collective intelligence networks. Given this reality, shouldn't we be focusing our evaluation efforts on collective intelligence—the very foundation of human civilization's greatest achievements?

Here's where my thinking may diverge from conventional wisdom: without any fundamental improvements to current AI models, truly agentic AI systems operating collectively could exponentially amplify productivity and potentially achieve Artificial General Intelligence and even Superintelligence. This isn't speculative—it's a logical extension of what we observe throughout human civilization.

In my view, the Model Context Protocol (MCP) becomes the game-changer in this paradigm. MCP offers a standardized way for AI systems to connect to external tools, data sources, and one another, enabling them to share information and collaborate on complex tasks. This protocol is the fertile ground from which agentic AI will flourish, representing the critical infrastructure for the transition toward collective superintelligence.
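To make the collaboration pattern concrete, here is a minimal sketch of agents sharing a standardized context. To be clear, the names (`Agent`, `SharedContext`, `run_pipeline`) and the message format are invented for this example; they illustrate the idea of a shared protocol, not the actual MCP specification.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class SharedContext:
    """A standardized context that any agent can read and append to."""
    messages: List[dict] = field(default_factory=list)

    def post(self, sender: str, content: str) -> None:
        self.messages.append({"sender": sender, "content": content})

@dataclass
class Agent:
    name: str
    skill: Callable[[str], str]  # the agent's specialized capability

    def handle(self, task: str, ctx: SharedContext) -> str:
        result = self.skill(task)
        ctx.post(self.name, result)  # share the result with the collective
        return result

def run_pipeline(task: str, agents: List[Agent], ctx: SharedContext) -> str:
    # Each agent builds on the previous agent's output; the shared
    # context keeps a full, inspectable record of the collaboration.
    current = task
    for agent in agents:
        current = agent.handle(current, ctx)
    return current

ctx = SharedContext()
pipeline = [
    Agent("analyst", lambda t: f"analysis({t})"),
    Agent("planner", lambda t: f"plan({t})"),
]
print(run_pipeline("reduce logistics cost", pipeline, ctx))
```

The value of a shared protocol shows up in the last line: because every agent speaks the same context format, swapping in a new specialist, or auditing who contributed what, requires no changes to the agents already in the pipeline.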

Imagine a future where a network of specialized AIs, each an expert in its own domain—from medical diagnosis and financial analysis to logistics and creative design—can seamlessly collaborate. This is not the creation of a single, all-knowing entity, but something far more powerful: a decentralized, collective intelligence that can tackle challenges far beyond the scope of any single human or AI.

The Ensemble of Specialized Intellects

The relentless drive to advance AI towards autonomy is, of course, crucial. However, we must evolve our evaluation frameworks beyond individual performance metrics. True intelligence—whether human or artificial—emerges not from isolated problem-solving ability, but from the capacity to contribute meaningfully to collective knowledge creation and application. The future of AI is not a solitary genius, but an ensemble of specialized intellects. It is time to move beyond the illusion of individualistic intelligence and embrace the paradigm of collective artificial intelligence. The benchmarks will then shift from measuring mimicry to evaluating collaboration, and the true, awe-inspiring potential of artificial intelligence will finally be realized. We stand at the threshold of collective superintelligence. The question is not whether AI can match human performance on standardized tests, but whether we can create systems that amplify human collective intelligence while introducing entirely new dimensions of cognitive capability. The answer, powered by protocols like MCP and agentic AI architectures, appears to be a resounding yes.
