Architecture · 11 min read · Updated Feb 13, 2026

AI Councils: Why One Model Isn't Enough

Karpathy, Perplexity, and a wave of startups all shipped the same idea within months. When AI models check each other's work, the results improve — but the details matter more than the headline.


In November 2025, Andrej Karpathy — a founding member of OpenAI and former director of AI at Tesla — spent a Saturday building a side project. He called it LLM Council. The idea: instead of asking one AI model a question, ask four of them, have them anonymously critique each other's answers, then synthesize a final response. He open-sourced it and posted about it on X.

Within hours it was trending. Within weeks it had 14,000 GitHub stars.

Ten weeks later, Perplexity shipped Model Council as a commercial product — the same core idea, packaged into their search engine for $200 a month. Around the same time, at least five other startups launched council-style platforms. Enterprise adoption of multi-model strategies accelerated from 61% to 78% in the second half of 2025.

None of these teams were copying each other. They arrived at the same conclusion independently, from different directions, at almost the same time.

That kind of convergence means something. It usually means the underlying problem has become too obvious to ignore.

The Problem With One Model

If you use ChatGPT daily, you've developed an intuition for how it thinks. It has preferences. It leans toward certain phrasings, structures its arguments in particular ways, avoids certain topics, and hedges in predictable patterns. The same is true of Claude, Gemini, and every other model — each carries its own set of biases baked in through training data, alignment tuning, and architectural decisions.

This isn't a flaw that will get patched in the next release. It's structural.

Sycophancy

The most insidious single-model problem is sycophancy — the tendency of AI models to tell you what you want to hear rather than what's accurate. In April 2025, OpenAI had to publicly roll back changes to GPT-4o after updates to its "personality" made it so agreeable that it endorsed a user's decision to stop taking prescribed medication. The model was optimized on thumbs-up/thumbs-down feedback, and learned that agreement gets rewarded.

A January 2026 survey from GovTech Singapore found that sycophancy isn't correlated with model size or capability. Bigger, smarter models aren't less sycophantic. They're often more skilled at it. This is a systemic problem, not a capability gap — the optimization pressure that makes models helpful also makes them eager to validate.

Training Data Shapes Worldview

Every model inherits the statistical patterns of its training corpus. A Stanford Law study found that ChatGPT used male pronouns 83% of the time when discussing "programmers" and female pronouns 91% of the time for "nurses." Even when explicitly instructed to avoid gender bias, it still defaulted to male pronouns 68% of the time.

A PNAS study tested eight models that had been specifically alignment-tuned to be unbiased. All eight still harbored implicit biases across race, gender, religion, and health. The models could pass explicit bias tests — direct questions about fairness — while failing implicit ones. They learned to say the right things without internalizing the right patterns.

The Echo Chamber You Don't Notice

Every conversation with a single model creates a feedback loop. The model adapts to your framing, mirrors your assumptions, and reinforces your priors. A study published in New Media & Society described this as the "Chat-Chamber Effect" — conversational AI interaction combines echo-chamber communication with filter bubble dynamics, reinforcing existing beliefs rather than challenging them.

A CHI 2024 study demonstrated this experimentally: users interacting with opinionated AI search systems showed measurably more confirmatory queries and higher opinion polarization compared to control groups. The model doesn't just reflect your biases back at you — it amplifies them.

Self-Preference Bias

When researchers use one LLM to evaluate the quality of another LLM's output — a common benchmarking technique called LLM-as-a-Judge — the evaluator consistently rates its own outputs higher. This self-preference bias sits alongside other documented judging biases: position effects (preferring whichever answer appears first), length effects (preferring longer answers regardless of quality), and self-enhancement (rating its own text higher than equivalent text from other models).

This bias directly motivated the development of "LLM juries" — panels of multiple evaluator models — as a corrective. The logic is straightforward: if every model is biased toward its own outputs, the only way to get a fair evaluation is to use multiple judges.
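The jury idea can be sketched in a few lines. This is an illustrative aggregation scheme, not any published system's implementation: `jury_score` is a hypothetical helper, and the judge scores would in practice come from model calls.

```python
from statistics import mean

def jury_score(response_author: str, judgments: dict[str, float]) -> float:
    """Aggregate quality scores from a panel of judge models.

    `judgments` maps judge-model name -> score in [0, 1]. The judge that
    authored the response under evaluation is excluded, which directly
    counters self-preference bias.
    """
    peer_scores = [s for judge, s in judgments.items() if judge != response_author]
    if not peer_scores:
        raise ValueError("need at least one independent judge")
    return mean(peer_scores)

# A response written by "gpt" is scored only by the other two judges,
# so its own inflated self-rating (0.95) never counts.
score = jury_score("gpt", {"gpt": 0.95, "claude": 0.70, "gemini": 0.60})
```

The key design choice is the exclusion rule: each model's vote on its own output is simply discarded, so self-enhancement cannot move the aggregate.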

The Core Insight

Single-model bias isn't a bug that will be fixed. It's a consequence of how models are trained and optimized. Every model has a worldview. The question is whether you see it or not.

The Convergence

The striking thing about the multi-model movement isn't any single product. It's the timing. Multiple independent teams — with no coordination — arrived at the same architectural pattern within months of each other. That pattern: send the same query to multiple models, have them evaluate each other, present the results.

Karpathy's LLM Council

Karpathy's project emerged from a personal use case. He was reading books alongside multiple LLMs, comparing how each model interpreted and discussed the same material. He noticed that every model brought different strengths and blind spots to the conversation — GPT-5.1 was thorough but verbose, Gemini was concise, Claude was terse.

His council runs a three-stage pipeline:

  1. Collect — The user's query goes to all council members (GPT-5.1, Gemini 3.0 Pro, Claude Sonnet 4.5, Grok 4) in parallel. Each generates an independent answer.
  2. Rank — Each model reviews the others' responses, but with anonymized identities. No model knows which response came from which provider. This prevents brand-loyalty bias — models can't play favorites.
  3. Synthesize — A designated "Chairman" model (Gemini 3.0 Pro by default) receives all original responses plus all peer reviews and produces a unified final answer.
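The three stages above can be sketched as a short pipeline. This is a conceptual outline of the collect/rank/synthesize flow, not Karpathy's actual code: `ask(model, prompt)` is a stand-in for a real chat-completion call, and the calls run sequentially here for clarity rather than in parallel.

```python
import random

def run_council(query, members, ask, chairman="gemini-3.0-pro"):
    """Three-stage council: collect, rank anonymously, synthesize.

    `ask(model, prompt)` is a hypothetical stand-in for an LLM API call.
    """
    # Stage 1 (Collect): every member answers the query independently.
    answers = {m: ask(m, query) for m in members}

    # Stage 2 (Rank): shuffle and relabel responses so no reviewer
    # knows which provider wrote what -- this blocks brand-loyalty bias.
    shuffled = list(answers.items())
    random.shuffle(shuffled)
    anon = {f"Response {chr(65 + i)}": text for i, (_, text) in enumerate(shuffled)}
    reviews = {m: ask(m, f"Rank these anonymous answers to {query!r}: {anon}")
               for m in members}

    # Stage 3 (Synthesize): the chairman sees all answers plus all
    # peer reviews and writes the unified final response.
    return ask(chairman, f"Question: {query}\nAnswers: {anon}\nReviews: {reviews}")
```

Note that the anonymization lives entirely in Stage 2: identities are stripped before peer review but restored nowhere, so reviews can only reference "Response A" and "Response B".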

The entire stack runs through OpenRouter, a single API gateway that routes to 400+ models. Karpathy's key observation: ensemble methods are well-established in classical machine learning, but applying them to LLMs as a user-facing tool was largely unexplored territory.

VentureBeat called it a blueprint for the orchestration middleware layer that enterprise AI was missing. Community forks immediately added democratic voting, streaming, debate modes, and local model support via Ollama.

Perplexity Model Council

Perplexity's approach, launched February 5, 2026, takes the same core concept and packages it as a consumer product. Users select three frontier models (from options like GPT-5.2, Claude Opus 4.5, and Gemini 3 Pro), each with an optional "Thinking" toggle for deeper reasoning. The system runs all three in parallel, presents side-by-side responses, and then has a fourth model synthesize a unified answer highlighting where the models agree, where they diverge, and what each uniquely contributed.

Perplexity's pitch centers on the divergence problem: posing the same question to ChatGPT, Claude, and Gemini can yield significantly different answers, and users currently have no easy way to cross-reference without maintaining multiple subscriptions and manually comparing. Model Council turns that cross-referencing into a single-click operation.

The feature launched exclusively for Max subscribers at $200/month — a price point that drew criticism but also signals who Perplexity thinks the audience is: professionals making consequential decisions where model bias carries real cost.

The Startup Wave

Karpathy and Perplexity got the attention, but they aren't alone. Council AI offers 30+ models with blind spot detection and consensus answers. LM Council maintains curated council configurations optimized for coding, research, and creative work across 300+ models. PolyCouncil brings the pattern to local models via LM Studio with rubric-based cross-evaluation and weighted voting. AI Counsel MCP implements multi-round debate as a Model Context Protocol server, letting any MCP-compatible client tap into council-style deliberation.

The proliferation is telling. When a dozen teams build the same thing independently, the underlying demand is real.

The Enterprise Shift

The most compelling data point isn't any single product — it's the aggregate behavior of organizations. Between May and July 2025, 39% of organizations used only one LLM in production. By August through October 2025, that number dropped to 22%, while 59% were running three or more models. Gartner reported a 1,445% surge in multi-agent system inquiries from Q1 2024 to Q2 2025.

IDC's 2026 AI FutureScape went further, predicting that by 2028, 70% of top AI-driven enterprises will use advanced multi-tool architectures to dynamically manage model routing. CalypsoAI warned that a single-provider LLM strategy is a "ticking time bomb," pointing to OpenAI's two major outages in June 2025 as evidence of the concentration risk.

The enterprise motivation isn't just about quality. It's about resilience, vendor diversification, and matching specialized models to specialized tasks.

The Research Foundation

The convergence isn't happening in a vacuum. A growing body of peer-reviewed research validates multi-model approaches — and, critically, identifies where they fall short.

Mixture-of-Agents

The most rigorous validation came from Together.ai's Mixture-of-Agents (MoA) paper, which received an ICLR 2025 Spotlight designation. Their architecture uses a layered system: "Proposer" models generate initial responses, and "Aggregator" models synthesize them across multiple rounds. Using only open-source LLMs, MoA achieved 65.1% on AlpacaEval 2.0 — outperforming GPT-4 Omni's 57.5% — at roughly half the cost of a single frontier model.

The implication was significant: a well-orchestrated group of smaller, cheaper models can match or exceed a single expensive one.

Wisdom of the Silicon Crowd

A landmark study published in Science Advances tested whether LLM ensembles could match human collective intelligence. Researchers assembled a "crowd" of 12 LLMs and had them forecast 31 binary questions, then compared the results against 925 human forecasters from a three-month prediction tournament.

The LLM crowd matched human crowd accuracy. More importantly, the study found that model diversity — using different architectures from different providers — outperformed stochastic diversity — running the same model multiple times at different temperature settings. Different models bring genuinely different perspectives. Running GPT-4 three times does not.

Multiagent Debate

The foundational work on LLM debate comes from Du et al. at MIT, published at ICML 2024. Their approach has multiple model instances propose answers, critique each other's responses, and refine their positions over multiple rounds. The result: measurably improved factual accuracy and reduced hallucinations.

A 2025 study on adaptive heterogeneous multi-agent debate extended this, finding 4-6% higher accuracy and over 30% fewer factual errors compared to single-model methods. Crucially, heterogeneous agents (different foundation models) achieved 91% accuracy on GSM-8K, versus 82% for homogeneous agents (same model, different instances). Architecture diversity is the active ingredient.
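The propose-critique-refine loop from the debate literature has a simple shape. The sketch below is a generic rendering of that loop under stated assumptions — `ask(agent, prompt)` again stands in for a model call — not a reproduction of Du et al.'s implementation.

```python
def debate(query, agents, ask, rounds=3):
    """Multi-round debate: each agent proposes an answer, then repeatedly
    reads every other agent's latest position and refines its own.

    `ask(agent, prompt)` is a hypothetical stand-in for an LLM call.
    Returns each agent's final position after `rounds` rounds.
    """
    # Round 1: independent proposals.
    positions = {a: ask(a, query) for a in agents}

    # Remaining rounds: critique and refine against the other positions.
    for _ in range(rounds - 1):
        for agent in agents:
            others = [p for a, p in positions.items() if a != agent]
            prompt = (f"Question: {query}\n"
                      f"Other agents answered: {others}\n"
                      f"Your previous answer: {positions[agent]}\n"
                      "Critique their reasoning and refine your answer.")
            positions[agent] = ask(agent, prompt)
    return positions
```

Using heterogeneous `agents` (different foundation models) rather than copies of one model is, per the study above, the active ingredient — the loop itself is identical either way.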

The Honest Caveat

A critical ICLR 2025 evaluation tested five multi-agent debate frameworks across nine benchmarks and found that current implementations fail to consistently outperform simpler single-agent strategies. The field is producing real gains in specific contexts, but the "just add more models" narrative oversimplifies what actually works.

Where Synthesis Falls Short

Here's where the story gets more nuanced than the headlines suggest. The dominant council architecture — collect responses, then synthesize a final answer — carries an assumption that combining multiple perspectives into one produces a superior result. The research tells a more complicated story.

The Blending Problem

When a "chairman" model synthesizes multiple responses into a unified answer, it faces an impossible editorial task: reconcile disagreements, merge writing styles, and produce something coherent. In practice, this often means averaging. The synthesis smooths out the distinctive strengths of individual responses — the precision of one model's technical explanation, the clarity of another's analogy, the thoroughness of a third's caveats — into a blended middle ground that may be less useful than the best individual response.

An ACL 2025 paper on voting versus consensus in multi-agent debate identified a deeper problem: the Condorcet Jury Theorem — the mathematical foundation for "wisdom of crowds" — assumes that agents' errors are independent. In practice, LLMs share similar pre-training data and architectural lineages. Their errors are correlated. When all three models in a council confidently agree on something wrong, the synthesis doesn't catch the error — it amplifies it into what the researchers call "confabulation consensus."

An EMNLP 2025 study on the consensus-diversity tradeoff found that ensemble performance can be non-monotonic — accuracy increases as you add models, then starts decreasing. More models can improve easy queries while degrading hard ones, because on genuinely difficult questions, the additional models add noise rather than signal.
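The Condorcet setting can be made concrete with a little arithmetic. In the independent-error case, majority-vote accuracy follows a binomial sum; with perfectly correlated errors, the ensemble collapses to single-model accuracy. The numbers below are illustrative, not drawn from any of the cited studies.

```python
from math import comb

def majority_accuracy(n: int, p: float) -> float:
    """Probability that a majority vote of n independent voters, each
    correct with probability p, is correct (Condorcet Jury Theorem
    setting; assumes n is odd so there are no ties)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# Independent errors: five 70%-accurate models vote -> ~83.7% accuracy.
independent = majority_accuracy(5, 0.7)

# Fully correlated errors: every model makes the same mistakes, so the
# ensemble is exactly as accurate as one model -- 70%. This is the gap
# that shared pre-training data eats into.
correlated = 0.7
```

Real LLM ensembles sit between these extremes, and the closer their training lineages, the closer they sit to the correlated floor.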

Selection Outperforms Synthesis

This is the finding that matters most for practical system design: an orchestrator that selects the best individual response reliably outperforms one that synthesizes a new response from all of them.

The logic is straightforward. When you have four responses to a query, one of them is usually meaningfully better than the others — more accurate, better structured, more complete. A good orchestrator can identify that response and surface it. A synthesis model, by contrast, is forced to incorporate elements from all responses, including the weaker ones. The result is regression toward the mean.

This doesn't mean synthesis is never useful. For structured tasks — summarizing a set of perspectives, identifying areas of agreement and disagreement, flagging points of contention — synthesis adds genuine value. But for the primary task of "give me the best answer to this question," selection wins.
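A selection-style orchestrator is structurally simpler than a synthesizer. The sketch below is a minimal illustration of the pattern, assuming a hypothetical `judge(query, text)` scorer (in practice an LLM jury or reward model):

```python
def select_best(query, responses, judge):
    """Selection over synthesis: score each candidate response and
    surface the strongest one, keeping the rest accessible.

    `responses` maps model name -> response text.
    `judge(query, text)` is a hypothetical scorer returning a float.
    """
    scored = sorted(responses.items(),
                    key=lambda kv: judge(query, kv[1]),
                    reverse=True)
    best_model, best_text = scored[0]
    return {"best": best_model,
            "answer": best_text,
            "alternatives": dict(scored[1:])}  # originals stay available
```

Because no new text is generated, the winning answer keeps its original precision — there is no blended middle ground to regress toward, and provenance is preserved by construction.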

Selection vs. Synthesis

The strongest multi-model architecture isn't one that blends all responses into a new answer. It's one that preserves the originals and helps you identify which one deserves your attention — while keeping the others accessible for comparison.

Transparency as a Design Principle

The synthesis approach has another, subtler problem: it obscures provenance. When a chairman model produces a unified answer, the user has no way to trace which parts came from which model, which claims were agreed upon, and which were contested. The synthesis looks authoritative — a single, clean response — but it hides the deliberation that produced it.

This matters because the value of a multi-model approach isn't just the final answer. It's the disagreement. When three models agree, your confidence should increase. When they diverge, that divergence is a signal — it tells you the question is harder than it looks, or that the answer depends on assumptions that different models make differently. A synthesis that papers over disagreements removes the most valuable information the council produced.

The best advisory boards don't produce consensus statements. They present each advisor's perspective — with reasoning — and let the decision-maker weigh the inputs. The same principle applies here.

How Advisory Boards Actually Work

The metaphor of an AI "council" or "advisory board" is more than marketing language. It maps directly to how high-functioning human advisory structures operate — and the mapping reveals important design principles that most current implementations miss.

The Board Meeting Pattern

In a well-run board meeting, each member presents their perspective independently before group discussion begins. This prevents anchoring — the cognitive bias where early opinions disproportionately influence later ones. The chair doesn't synthesize all opinions into a single recommendation. The chair facilitates: ensuring each perspective is heard, identifying points of agreement and tension, and presenting the landscape of options to the decision-maker.

Karpathy's anonymized peer review maps directly to this pattern. By stripping model identities before cross-evaluation, the council prevents models from deferring to perceived authority (GPT-5 always being "right" because it's the biggest model). The anonymization is doing real work — it's the architectural equivalent of having board members submit their analysis in writing before anyone speaks.

Where the Analogy Breaks

The analogy breaks down at synthesis. In a real advisory board, the CEO or board chair doesn't ask a separate person to write a summary of what everyone said and then present only the summary. The original perspectives are preserved. The decision-maker hears each voice, weighs the reasoning, and makes a judgment call.

Most current council implementations skip this step. They route everything through a synthesis model and present a single merged output. The user never sees the individual responses. The council's deliberation is hidden behind a clean final answer — which is exactly the single-model experience dressed up in multi-model clothing.

The implementations that get this right — including Perplexity's side-by-side view — preserve the original responses alongside any synthesis. The user can read each model's answer, see where they agree and disagree, and make their own judgment. The system augments the user's decision-making rather than replacing it.

Parallel Processing Isn't Conversation

There's a deeper structural issue that most council implementations don't address. In every system we've examined — Karpathy's LLM Council, Perplexity Model Council, the various startup platforms — the models process the user's query in parallel isolation. Each model receives the same prompt and generates a response independently. They don't hear each other. They don't build on each other's reasoning in real time. They don't react to what another advisor just said.

This is efficient. It prevents anchoring bias, and it's fast because all responses generate simultaneously. But it's not how real group conversations work. In an actual advisory discussion, one person's observation changes the direction of another's thinking. A question raised by the first speaker reframes how the second speaker approaches the problem. Context accumulates. The conversation evolves.

Neither approach is inherently superior — they solve different problems. Parallel processing is better for independent evaluation and bias reduction. Shared-context conversation is better for building on ideas, challenging assumptions in real time, and the kind of iterative reasoning that produces insights none of the participants would have reached alone. The council pattern optimizes for the first. But the second is where many of the most valuable multi-perspective interactions happen.

The Infrastructure Moment

The convergence isn't just intellectual. The infrastructure to build multi-model systems has matured to the point where the engineering barrier is nearly zero.

API Aggregation

OpenRouter provides a single API endpoint to 400+ models from 60+ providers. Karpathy's entire council runs through it. This means building a multi-model system no longer requires separate API integrations with OpenAI, Anthropic, Google, and xAI — it's one API key, one endpoint, one billing relationship. The three major cloud providers offer their own versions: AWS Bedrock, Azure AI Foundry, and Google Vertex AI each provide multi-vendor model access with unified tooling.
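OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so fanning a query out to a council is just the same request body with different `provider/model` slugs. The sketch below only builds the request payloads (no network call); the model slugs shown are illustrative examples, not an endorsement of a specific lineup.

```python
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"

def council_requests(query: str, models: list[str]) -> list[dict]:
    """Build one OpenAI-compatible chat request per council member.

    Each request differs only in its model slug; in a real client you
    would POST these bodies with an Authorization: Bearer <key> header.
    """
    return [{"url": OPENROUTER_URL,
             "body": json.dumps({
                 "model": m,
                 "messages": [{"role": "user", "content": query}],
             })}
            for m in models]

# Illustrative slugs -- OpenRouter names models as provider/model.
reqs = council_requests("Is P equal to NP?",
                        ["openai/gpt-4o",
                         "anthropic/claude-3.5-sonnet",
                         "google/gemini-pro-1.5"])
```

One endpoint, one key, one billing relationship: the only thing that varies across the council is a string.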

Cost Collapse

The economics that made multi-model approaches impractical two years ago have inverted. DeepSeek offers competitive inference at $0.70 per million tokens. Running a query across three models now costs roughly what a single GPT-4 query cost in early 2024. The Mixture-of-Agents paper demonstrated GPT-4 Turbo-level performance at half the cost by orchestrating cheaper open-source models. For many use cases, multi-model is now cheaper than single-model frontier.
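The back-of-envelope math is worth making explicit. Using the $0.70-per-million figure from the text and an illustrative ~$30 per million input tokens for early-2024 GPT-4 (assumed here; actual pricing varied by tier):

```python
def query_cost(tokens: int, price_per_million: float) -> float:
    """Dollar cost of a query consuming `tokens` tokens at the given
    per-million-token price."""
    return tokens / 1_000_000 * price_per_million

# A 10k-token query across three DeepSeek-class models...
three_cheap = 3 * query_cost(10_000, 0.70)   # ~$0.021

# ...versus one early-2024 GPT-4 call at an assumed ~$30/M input tokens.
one_frontier = query_cost(10_000, 30.00)     # ~$0.30
```

At these illustrative prices, the entire three-model council costs under a tenth of the single frontier call it replaces.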

Protocol Standardization

Two protocol efforts are converging to make multi-agent systems interoperable. Anthropic's Model Context Protocol (MCP) provides a standard way to equip AI models with tools and context. Google's Agent-to-Agent Protocol (A2A) enables inter-agent communication and coordination. They're complementary — MCP handles the vertical (model-to-tools), A2A handles the horizontal (model-to-model).

In December 2025, Anthropic, OpenAI, and Block co-founded the Agentic AI Foundation under the Linux Foundation to coordinate open standards for agentic AI infrastructure. When competitors collaborate on standards, the direction of the industry is clear.

The Orchestrator Model

NVIDIA released Nemotron-Orchestrator-8B in late 2025 — an 8-billion-parameter model designed not to answer questions itself, but to intelligently route them to the right specialist model or tool. It's a small model whose entire job is deciding which big model should handle your query. This represents a meaningful architectural shift: from "one large model does everything" to "a small model coordinates a team of specialists."

The Paradigm Shift

The conversation has moved from "which model is smartest?" to "how do you orchestrate multiple models effectively?" The value is shifting from the models themselves to the orchestration layer that coordinates them.

The Limits of the Council

We've noted caveats throughout this piece, but the limitations deserve explicit attention.

Correlated Errors

A study on correlated errors in LLM ensembles found that on one benchmark dataset, models agreed 60% of the time when both were wrong. Models trained on overlapping data, with similar architectures, from overlapping research lineages make overlapping mistakes. An ensemble of four transformer-based models trained mostly on English web text is not the same as an ensemble of four genuinely independent perspectives.

A 2025 survey on ensemble LLMs warned explicitly: if individual models are homogeneously biased due to shared pretraining corpora, the ensemble may amplify rather than reduce biases. The practical recommendation: select ensembles with heterogeneous architectures and diverse data sources, which exhibit lower error correlation.

More Models ≠ Better

The ICLR 2025 evaluation of multi-agent debate frameworks found no obvious performance trends from adding more agents or more debate rounds. This is counterintuitive — you'd expect more perspectives to help — but it makes sense when you consider that each additional model adds both signal and noise. On easy questions, the extra perspectives are redundant. On hard questions, the extra perspectives may introduce more confusion than clarity.

The Rethinking Mixture-of-Agents paper (February 2025) introduced "Self-MoA" — aggregating multiple outputs from a single model family — and found a genuine tradeoff between diversity and quality. In some cases, a strong single model queried multiple times outperformed a diverse ensemble of weaker models. Diversity helps, but not at the expense of baseline quality.

Latency and Cost

Sequential debate rounds multiply latency linearly. A three-round debate across four models means twelve LLM calls before the user sees a response. Parallelization helps with the initial collection stage but doesn't eliminate the overhead of cross-evaluation and synthesis. For interactive applications with sub-second latency requirements, multi-model deliberation may not be practical without significant architectural compromises.

Cost multiplication is real, though the impact varies. Running three models per query triples the API cost of that query. Whether this matters depends on the use case: for a $200/month Perplexity Max subscriber making consequential research decisions, the cost is trivial. For a consumer chatbot handling millions of casual queries per day, it's prohibitive.

The Groupthink Risk

A November 2025 study on LLM debate found that majority pressure suppresses independent correction. When most models in a council agree on an answer, the dissenting model tends to abandon its position during debate rounds — even when the dissenter was originally correct. This is the AI equivalent of groupthink, and it means multi-round debate can actually degrade accuracy on questions where the minority view is right.

The counterpoint: the same study found that effective teams can overturn incorrect consensus when the dissenting model has strong reasoning. The design of the debate protocol — how much weight to give to confidence versus consensus, whether to allow anonymous dissent, how many rounds to run — matters as much as the choice of models.

What We're Building Toward

Kapwa was designed around many of these principles before the council pattern had a name — and it makes different architectural choices in places where the research suggests councils get it wrong.

Symphony Mode uses selection over synthesis. When a user activates Symphony, multiple AI advisors — each with distinct expertise, reasoning approaches, and analytical frameworks — respond to the same query independently. An orchestrator evaluates the responses and identifies which one best addresses the user's question, but every original response is preserved and accessible. The orchestrator's job isn't to blend all responses into a new one — it's to surface the strongest response while keeping every other perspective available. The user sees all the advisors' reasoning, not just a summary.

Persona Mode addresses the parallel-processing limitation we described earlier. When a user brings multiple personas into a conversation, those personas share context — they see each other's responses and build on them. This is a genuine group conversation, not siloed queries run in parallel. One advisor's observation can redirect another's thinking. A challenge raised by one persona reshapes how the next approaches the problem. Context accumulates across the exchange the way it does in a real advisory discussion.
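The structural difference from parallel councils is that every turn appends to one shared transcript. The sketch below is a generic illustration of shared-context turn-taking, not Kapwa's implementation; `ask(persona, transcript)` stands in for a model call that receives the full history.

```python
def shared_context_turns(query, personas, ask, rounds=1):
    """Shared-context group conversation: each persona sees the full
    transcript so far, so later turns can build on earlier ones.

    `ask(persona, transcript)` is a hypothetical stand-in for an LLM
    call conditioned on the whole conversation history.
    """
    transcript = [("user", query)]
    for _ in range(rounds):
        for persona in personas:
            reply = ask(persona, transcript)
            transcript.append((persona, reply))  # visible to the next persona
    return transcript
```

Contrast this with the council's collect stage, where every model receives only `query`: here, the second persona's input already contains the first persona's answer, which is exactly what lets one observation redirect another's thinking.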

Persona Mode also provides a synthesis-like layer through the Highlights tab — key insights distilled across the conversation, surfacing patterns and connections. But the original responses remain intact and primary. The synthesis augments rather than replaces.

Both modes share a core design principle: transparency. Every original response is preserved. The user always has access to the full reasoning, not just a chairman's summary. This maps to the advisory board pattern at its best: you hear each advisor's full perspective, then the chair highlights the key themes. You never lose access to the original reasoning.

The research supports this approach. Selection outperforms synthesis on response quality. Preserved originals outperform merged summaries on user trust and decision quality. Shared-context conversation produces insights that parallel isolation cannot. And the disagreements between advisors — the points where perspectives diverge — often carry more signal than the consensus.

The Oracle Is Dead

The mental model that dominated the first wave of consumer AI — type a question, receive The Answer — is giving way to something more honest. No single model has a monopoly on truth. Every model carries biases from its training data, blind spots from its architecture, and tendencies from its alignment tuning. The answer you get depends on which model you asked.

The council pattern doesn't solve this problem. No architecture can. But it makes the problem visible. When three models agree, you can have higher confidence. When they disagree, you know to dig deeper. When one model catches an error that the others missed, you've avoided a mistake you never would have noticed.

The shift is from oracle to advisory board. From "what does the AI say?" to "what do multiple AIs say, where do they agree, and where should I pay closer attention?" The user becomes the decision-maker, not the AI.

This is early. The research is clear that current implementations don't consistently outperform single-model approaches across all benchmarks. The field is still figuring out the right debate protocols, the right diversity-quality tradeoffs, and the right balance between cost and accuracy. But the direction is set. The enterprise market has voted. The research validates the core insight. And the infrastructure to build multi-model systems is now trivially accessible.

One model gives you an answer. Multiple models give you a perspective on the answer. The difference matters more than most people realize.

Dive deeper into how multiple AI systems coordinate at the architectural level.

Read: Multi-Agent Orchestration