After I learned agents can argue with each other, I began to think — can they work together and complement each other? Maybe 1+1 can be greater than 2?

The First Idea: A Pre-Input Agent

My original instinct was to put the verification layer before my prompts reached the AI — a "pre-input" agent that would check my questions, clarify anything ambiguous, and improve the prompt quality before passing it on. The logic seemed sound: better input → better output.

My AI assistant pushed back on this, and the reasoning was worth writing down:

About 80% of the value in a good AI response comes from having a clear, direct question — which most people can write naturally without an intermediary. A pre-input agent adds latency, adds a step, and in most cases produces marginal improvement at real cost. For the 20% of cases where prompt quality actually matters — complex technical questions, ambiguous multi-part requests — the better solution is to just ask a follow-up question, not to route everything through an extra agent.

That framing stuck with me. Not "this doesn't work" but "this isn't where 80% of the value lives." It was a calibration, not a rejection.

Shifting to Pre-Output

If the input side wasn't the right place to add verification, what about the output side? That's where the conversation shifted. The idea: before the AI sends me a final answer, have another model check it.

The first version was simple: run the answer through Gemini and flag any disagreements. One checker. Low cost, low latency, minimal overhead.

Then Two Checkers

A follow-up question: what if Gemini itself was wrong? A single verifier has a single blind spot. Two independent verifiers — from different model families, trained on different data — are much less likely to share the same error.

So we added GPT-4o-mini alongside Gemini. Now: one primary model, two independent checkers, both running in parallel. The cost of the second checker was about $0.0003 per query — negligible. The latency impact was near zero because both ran simultaneously.

Why different model families matter
Claude, Gemini, and GPT are trained differently, on different data, with different fine-tuning approaches. Errors that are systematic in one model are unlikely to be systematic in the others. Running all three and comparing is much more robust than running one model three times.

The Agent Council

At this point we had the council pattern in practice — we just didn't have a name for it yet. All three models receiving the same query in parallel, one synthesizing the others. The synthesis layer is key: it's not a vote (majority rules), it's a structured comparison that explicitly looks for disagreement and makes that disagreement visible.

When I later learned that "agent council" and "council mode" are established terms in multi-agent AI research — with papers specifically studying them for hallucination reduction — it was satisfying in the way that naming something you've already discovered always is.

What This Process Actually Looked Like

I want to be honest about how this evolved: it wasn't a clean design session. It was a conversation that wandered, doubled back, and found its shape gradually. I'd propose something. My AI assistant would tell me where it thought the value was (and wasn't). I'd refine. We'd try it. Something unexpected would come up, and we'd adjust.

That back-and-forth is how most of the interesting things in this blog actually got built. Not a product roadmap, not a spec — a conversation that kept producing better questions than it started with.

The agent council is the right setup for what I need. But the more useful thing I learned is the process that got there: start with a clear goal, take the pushback seriously, iterate on the idea rather than the implementation until the idea is right, then implement it once.