ACES Multi-Agent Architecture: Closing the Comprehension Gap
Designing a multi-agent code intelligence system grounded in first principles: deterministic fact extraction, queryable Constructs, rigorous evaluation, and protocol-based extensibility. Built in 3 months at 50% allocation.
Stakes and Scale
By Q3 2025, it was becoming clear that code generation would soon be a solved problem. For a hyper-scale enterprise with ~3-5k microservices across 8+ service frameworks, 5+ compute orchestration platforms, and a fragmented platform ecosystem, this opened the opportunity to tackle problems previously considered intractable: modernize the stack, consolidate onto the best-in-class components, and let a unified platform absorb the complexity that had fragmented across dozens of teams.
In 9 months, the best model on SWE-bench went from solving 62% of real GitHub issues to 81%.
| Model | SWE-bench Verified |
|---|---|
| Claude 3.7 Sonnet (Feb '25) | 62.3% |
| Claude Sonnet 4.5 (Sep '25) | 77.2% |
| Claude Opus 4.5 (Nov '25) | 80.9% |
Source: Anthropic model announcements; SWE-bench Verified leaderboard.
The rate of change meant that agent engineering expertise barely existed. For large enterprises with significant platform fragmentation, the gap between current capabilities and what agents could enable was widening every quarter.
Status Quo
Most teams building agents were writing prompts. A PR review agent that baked a methodology into its system prompt was the standard pattern. Prompt engineering, not agent engineering. Context engineering had barely entered the vocabulary. By Q4 2025, LLMs were seasoned coders, and agents, to most, still meant prompts with read and write tools attached.
What made this strange: TDD had been table stakes for over a decade. Code changes without tests don't make it to a PR. But agent changes without evals shipped constantly. "Eval First" was basically nonexistent. The conference circuit ran on recorded demos showing isolated agent runs under controlled conditions. Nobody was asking whether the agent actually outperformed a baseline model on the real task.
Strong models, weak methodology. The models were already good enough. Nobody was validating whether any given agent system actually outperformed the baseline model on its own.
Understanding the Problem Space
The volume of generated code is increasing exponentially while the capacity of humans maintaining that code stays the same. Nothing in the current structure resolves this.
When a maintainer reviews a PR, they're mapping the change against a representation of the codebase they carry internally, one that feels intuitive but is actually tacit knowledge, built over years of working the system, resolving incidents, fixing bugs, improving performance. That knowledge costs time to build. It requires writing the code.
Since agents write the code now, developers don't fix the bugs or trace the incidents. The mental map, the intuition that review depends on, never forms. Even a 50-line PR can't be reviewed effectively in isolation without it. If unverified code is mass-produced and deployed, security issues and production incidents aren't a risk. They're inevitable.
Addy Osmani named it in March 2026: comprehension debt, code that works but nobody understands why. Anthropic's own RCT (Shen and Tamkin, 2026) measured a 17 percentage point comprehension gap between developers who used AI and a control group.
A new developer onboarding onto a large codebase would depend on maintainers, tribal knowledge, and docs that were hopefully up to date. Even then, the real learnings came through the pain: the cognitive rake applied across the codebase when enhancing it, attending to incidents, tracing failures. Agents operating over the code hit the same wall at a different scale, because context windows are their skull, and a lack of structural awareness means less efficient context usage.
Three audiences, one root cause: volume is outgrowing the capacity to comprehend it, and there is no intermediate representation to bridge the gap.
Humans have always solved this problem the same way. When the territory got too large to hold in working memory, we built maps. When circuits got too complex to trace by reading schematics, we built simulation models. When musical scores outgrew what performers could sight-read cold, we developed rehearsal notation. We discard the concrete details that offer low value and surface the ones that let us reason better.
"The purpose of abstraction is not to be vague, but to create a new semantic level in which one can be absolutely precise." — Edsger Dijkstra
Constraints
Preparing for engine change during flight
The agent SDK ecosystem was churning: Autogen deprecated and rewritten from scratch, Claude SDK, Google ADK, and CrewAI all active targets with no clarity on which would exist in six months. We couldn't bet on one runtime. The answer was protocol-based boundaries: swap one agent runtime for another without touching the domain core.
3,000–5,000 services, every codebase unique
The platform serves thousands of services across 8+ frameworks and 5+ compute platforms. Every codebase has nuances that matter: different languages, different patterns, different things a reviewer needs to see. A one-size-fits-all approach can't capture that diversity. Teams needed to compose their own pipelines from shared components, add their own, and share them back, whether as capabilities, reusable agent definitions, or both.
New paradigms take time to absorb
Agent engineering as a discipline barely existed in Q4 2025. Context engineering had barely entered the vocabulary. Expecting every team to internalize prompt engineering, context windows, and orchestration patterns while shipping product wasn't realistic. The architecture needed to let teams extend the system without understanding the engine. Adding a new agent is a filesystem operation: write a markdown definition, list it in a YAML config.
The trust deficit requires trust infrastructure
Developers don't trust AI-generated code, not for autocomplete, not for agents. The trust erodes further when a demo impresses and the real thing doesn't match. Verifiable outputs close the gap for deterministic tools. For agents, you need evals: pre-registered hypotheses, controlled conditions, blind judging, statistical validation. We don't commit the organization based on theory. We validate before committing.
First Principles
"The first principle is that you must not fool yourself — and you are the easiest person to fool." — Richard Feynman
A mentor handed me this one about a year before this project started. (Thanks Flint!) If there's one first principle to hold onto as a raft when navigating the phase shift, it's this one. And in early 2025 the burns were common, sometimes hilariously so. I was worried for a bit as folks started referencing ChatGPT responses in discussions.
I derived most of these a couple of months before this project, working on data plane proxies and API gateways. When I applied them to agent systems, they mapped directly. I hadn't expected that. The environments are different, the failure modes look different, but the underlying constraints are the same. I'll make that connection explicit in a separate case study with a callback here, but see if you can spot the parallels.
1. Reliability degrades multiplicatively
R_system = R₁ × R₂ × ... × Rₙ — Series reliability model (Reliability Engineering)
The formula applies to sequential dependencies: every stochastic component in the chain multiplies its degradation into the product. Five agents at 90% reliability in sequence: 0.9⁵ = 59%. At 80%: 33%.
Deterministic components (parsers, encoders) contribute R≈1.0. Push everything deterministic into deterministic components. Use stochastic capability, meaning agents, only where reasoning is genuinely required.
For the remaining stochastic components, run them in parallel against shared state, not in sequence through message passing. Message passing creates conditional failures: each agent interprets the previous agent's interpretation. Shared state decouples the chain. Agents read from a common source independently, so one agent's error doesn't compound into the next.
The multiplicative math forces this architecture. It's arithmetic, not preference.
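The arithmetic is quick to check. A minimal sketch (the function name is illustrative, not part of the system):

```python
from functools import reduce

def series_reliability(components: list[float]) -> float:
    """Reliability of components in sequence: the product of each one's."""
    return reduce(lambda acc, r: acc * r, components, 1.0)

# Five agents at 90% each, chained in sequence.
print(round(series_reliability([0.9] * 5), 2))   # → 0.59
# The same five at 80%.
print(round(series_reliability([0.8] * 5), 2))   # → 0.33
# Replace three links with deterministic parsers (R ≈ 1.0):
# only two stochastic links remain in the chain.
print(round(series_reliability([1.0, 1.0, 1.0, 0.9, 0.9]), 2))  # → 0.81
```

Every stochastic link you convert to a deterministic one pulls the product back toward 1.0, which is the whole argument for parsers-first.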
2. Attention is finite
Context windows are finite, which makes context zero-sum: every token spent on discovery is a token not spent on reasoning. Pre-compute what's computable so the context budget goes to judgment. The eval data proves it: 583 file reads in the control condition, 141 in the treatment. Same agent. Different starting knowledge.
3. You cannot reason about dimensions you've collapsed
"No processing of data can increase the information that the data provides." — Data Processing Inequality (Information Theory)
Processing cannot increase information. If a transformation loses structure, nothing downstream recovers it. The loss is permanent.
Johnson-Lindenstrauss shows that dimensionality reduction isn't inevitably destructive. A random projection from high-dimensional space to low-dimensional space can preserve pairwise distances within bounded error. Structure survives, if you choose the right kind of reduction for the right kind of structure.
J-L distance preservation is effective where similarity is the question. Recommendation systems ("people who bought X also bought Y"), NLP similarity search, image retrieval. For code, the question isn't "what's similar to this function?" It's "what breaks if I change it?" Two functions can be semantically identical, both make HTTP POST requests, but sit in completely different parts of the dependency graph. One in the payment pipeline. The other in a test utility. Embedding distance says neighbors. The call graph says different blast radii.
Code syntax is parsed as a tree (the AST). Dependencies and call relationships form graphs. Both structures are precise. Treating code as flat text loses the structure that determines whether a change is safe. Embeddings preserve semantic similarity between text chunks. Parsers preserve relational position in a graph. Embed a function and you know what it's about. Parse it and you know where it sits.
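The distinction is easy to see with Python's stdlib `ast` module. A sketch, with hypothetical function names: an embedding of this snippet would tell you it's about payments, while the parse tells you who calls whom.

```python
import ast

source = """
def charge(order):
    total = compute_total(order)
    return post_payment(total)

def post_payment(total):
    ...
"""

tree = ast.parse(source)

# Walk each function and record which names it calls: the relational
# position that a flat-text embedding of the same code collapses.
calls: dict[str, list[str]] = {}
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        calls[node.name] = [
            n.func.id
            for n in ast.walk(node)
            if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
        ]

print(calls)
# {'charge': ['compute_total', 'post_payment'], 'post_payment': []}
```

The edges `charge → compute_total` and `charge → post_payment` are exactly the "where it sits" dimension: blast radius, not aboutness.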
The Construct is that second kind of reduction. Precisely the new semantic level Dijkstra described.
4. The right tool for the right problem
"Only variety can absorb variety." — W. Ross Ashby, Law of Requisite Variety (1956)
A tool must match the variety of the problem it solves. A deterministic problem has one correct answer. A parser matches it: low variety in, low variety out, verifiable. An LLM applied to the same problem has high variety, meaning most of its output states are wrong. The extra degrees of freedom are noise, not capability.
The inverse holds. Judgment problems (reasoning about architecture, identifying patterns, synthesizing across files) have high variety. No parser handles them. Agency is the right tool because the problem demands it.
The architecture composes both. Parsers compute facts. Agents reason over them. Configuration declares which components run. The ordering falls out of their dependencies.
5. Encode the orchestration
"Go To Statement Considered Harmful." — Edsger Dijkstra, 1968
Dijkstra's argument wasn't about goto syntax. It was about verification: unstructured control flow makes programs impossible to reason about because you can't trace what executed or why. Agentive orchestration has the same property. A master agent deciding which specialists to run is stochastic control flow where deterministic control flow would suffice.
Encoding provides two guarantees agency structurally cannot. Approval integrity: a compiled DAG executes exactly what the configuration declares. The gap between what you approved and what runs is zero. Traceability: read the configuration, know what ran. With agentive orchestration, you can't reconstruct why a specialist was invoked or skipped.
If the orchestration path is knowable in advance, encode it.
Implications
No inter-agent communication
Do agents need to talk to each other? Is that communication essential complexity inherent to the domain, or accidental complexity introduced by assumption? The act of comprehending a codebase, understanding its structure, decomposes naturally. Architecture analysis, maintenance analysis, security analysis, each is independently evaluatable. Where one analysis depends on another, we sequence them and pass the output. That's data flow, not conversation. The cost is clear: every message between agents is a stochastic link, and reliability degrades multiplicatively.
No RAG over code
Semantic search captures what something is about. It loses where it sits, what it depends on, and what depends on it. Embed a sentence and you lose its position in the argument. Embed a function and you lose its position in the call graph, the containment hierarchy, the dependency tree. The meaning of a unit is inseparable from its relationships to other units. Code syntax is parsed as a tree (the AST). Dependencies and call relationships form graphs. Those tree-and-graph structures are the dimensions that determine whether a change is safe. Treating code as flat text loses them.
No fine-tuned models
Fine-tuning encodes a snapshot of the codebase into model weights. The relationships between components change with every PR. Across thousands of services, each with different patterns, you'd need to retrain per repo on every change. This treats a structural problem as a training problem. The Construct generates on every commit, low cost because it's parsers and static analyzers. This is context engineering: give the agent precisely what it needs to reason, nothing more.
No discovery by prompting
The agents are just agent definitions with tools and skills configured. The point isn't "no prompting." The point is that the agents don't discover facts by prompting. The facts are already computed. The agents reason over them. Don't use an agent to figure out what a function's cyclomatic complexity is when a parser can compute it in milliseconds.
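The cyclomatic complexity case is concrete enough to sketch. A simplified McCabe count using stdlib `ast`, not the system's actual analyzer: one plus the number of branch points.

```python
import ast

def cyclomatic_complexity(source: str) -> int:
    """Simplified McCabe complexity: 1 + number of branch points."""
    branch_nodes = (ast.If, ast.For, ast.While, ast.IfExp,
                    ast.ExceptHandler, ast.And, ast.Or)
    tree = ast.parse(source)
    return 1 + sum(isinstance(n, branch_nodes) for n in ast.walk(tree))

src = """
def validate(x):
    if x is None:
        return False
    for item in x:
        if item < 0 or item > 100:
            return False
    return True
"""
# Two ifs, one for, one boolean-or: four branch points.
print(cyclomatic_complexity(src))  # → 5
```

This runs in milliseconds, is reproducible, and is verifiable against the source. An agent asked the same question burns tokens to produce a number you then have to check anyway.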
Navigation is not comprehension
LSP preserves the structural dimensions that embeddings lose. Go to definition, find references, call hierarchy, real structure, not flat text. But LSP answers "what is the code right now." It gives you the static graph at a single point in time. It doesn't tell you how this code changed over time, whether the callers should even know about it, whether complexity is climbing, or what architectural patterns the relationships form. LSP is something the Construct composes with, not replaces.
Experienced engineers carry these dimensions intuitively. Which functions are volatile, which dependencies cross boundaries, which modules are growing in complexity. That knowledge takes years to build, and it doesn't scale. The current abstractions aren't precise enough to capture it. We needed a new semantic level, one where we could be absolutely precise about what matters for maintaining code.
Guide-level Explanation
The principles and their implications point to one thing: an intermediate representation between raw code and comprehension. I called it the Construct, after the training simulation in The Matrix where structural knowledge loads directly into the operator's context. I built the system over one quarter.
The Construct
The Construct is a semantic and structural representation of a codebase. A maintainer uses it to understand current state and decide whether to tune agents or rearchitect components. A new developer uses it to build a mental model without years of tacit knowledge. An agent reasons from computed facts instead of discovering them one file at a time.
It lives as a directory in the project itself. Three layers build on each other.
Structural graphs
Parsers and version history produce seven representations of the same codebase.
The enriched layers don't add new information. They surface patterns already latent in the structure.
This decomposed structure covers the same conceptual ground as Code Property Graphs (Yamaguchi et al., 2014; IEEE Test-of-Time Award, 2024), which merge AST, CFG, and program dependence graphs into a single queryable structure. The Construct keeps them separate so each dimension can be queried independently, and composed only where a specific analysis demands it.
Findings
Deterministic analysis tools produce metrics and findings, serialized into an embedded database. No server, no API key. Queryable with SQL.
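What that looks like in practice, sketched with stdlib `sqlite3` standing in for the embedded database (the system itself persists to DuckDB; the schema and rows here are illustrative):

```python
import sqlite3

# An embedded findings store: no server, no API key, plain SQL.
db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE findings (
        file TEXT, line INTEGER, tool TEXT, metric TEXT, value REAL
    )
""")
db.executemany(
    "INSERT INTO findings VALUES (?, ?, ?, ?, ?)",
    [
        ("src/payments.py", 42, "complexity", "cyclomatic", 14.0),
        ("src/payments.py", 90, "complexity", "cyclomatic", 3.0),
        ("src/util.py", 7, "complexity", "cyclomatic", 21.0),
    ],
)

# An agent (or a maintainer) asks a precise question instead of reading files:
rows = db.execute(
    "SELECT file, line, value FROM findings "
    "WHERE metric = 'cyclomatic' AND value > 10 ORDER BY value DESC"
).fetchall()
print(rows)  # [('src/util.py', 7, 21.0), ('src/payments.py', 42, 14.0)]
```

The query is the verification mechanism: the same SQL that feeds an agent's context also lets a human audit the claim.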
Insights
Specialist agents (architecture, security, structure, technical debt) read the code and the Construct, then write insights back. Each agent reads shared state and produces independent output. Zero inter-agent messages.
The Construct directory includes a static visualization of the dependency graphs. A maintainer sees the shape of their system without reading a line of code. Experienced engineers carry architectural knowledge as structure, not text: what connects to what, where the boundaries are, which parts are volatile. The visualization externalizes that mental model.
Because it's files in the project directory, any tool that can read files can use it. Code editors, MCP servers, direct SQL queries against the embedded database. git diff on the Construct shows what changed structurally between commits.
When an agent claims "12 rate limiting implementations," verify by querying the Construct directly. Parser bugs are possible, but they're reproducible and fixable. Stochastic hallucinations are neither.
Usage
Convention over configuration. The default pipeline covers the 80% case: point it at a codebase, it runs. Deterministic encoders produce facts. Specialist agents reason over them. No config file required.
For the 20% that need customization, a config file in the project directory overrides defaults. Opt-in component selection: list what's enabled, unlisted components don't run. Per-agent runtime selection: one runtime for production (higher accuracy in benchmarks), another for local development (iteration speed over reasoning quality). Capability negotiation catches incompatibilities at initialization, not mid-run.
For platform teams managing the system across an organization, configuration cascades across three levels. Packaged defaults ship with the system, user-wide config customizes globally, project-specific config overrides locally. Later tiers override earlier. Same precedence as git config.
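The precedence rule is a plain left-to-right merge. A minimal sketch (keys and values are illustrative; a real implementation may deep-merge nested sections rather than replace them wholesale):

```python
def cascade(*layers: dict) -> dict:
    """Merge config layers left to right; later layers override earlier ones."""
    merged: dict = {}
    for layer in layers:
        merged.update(layer)
    return merged

packaged = {"runtime": "claude-sdk", "agents": ["architecture", "security"]}
user_wide = {"runtime": "adk-ollama"}   # this machine favors local iteration
project = {"agents": ["architecture"]}  # this repo opts down to one agent

print(cascade(packaged, user_wide, project))
# {'runtime': 'adk-ollama', 'agents': ['architecture']}
```

Same shape as `git config`: system, global, local, with the nearest scope winning.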
Adding a new agent is a filesystem operation. Write an agent definition, list it in config. No code changes, no framework imports. Drop a file, add a line, it runs.
The same orchestrator runs from multiple entry points. CLI, CI/CD pipeline, pre-commit hook, direct API access. Multiple driving adapters, same domain core.
Evaluation
I needed an evaluation harness. I reached for the existing orchestrator, and it worked, because a grader follows the same Component protocol as an encoder. A benchmark scenario is a component. A grader is a component. Same protocol, same DAG, same Construct. The system evaluated itself.
Six hypotheses, pre-registered before the first run. The first principle says don't fool yourself. Post-hoc hypothesis selection is the most common way to fool yourself with data.
| # | Hypothesis | Metric |
|---|---|---|
| H1 | Treatment more accurate | Accuracy score |
| H2 | Treatment fewer steps | Step count |
| H3 | Treatment costs less | Token count |
| H4 | Treatment reads fewer files | File read count |
| H5 | Higher precision and recall | P/R |
| H6 | Higher quality answers | Quality score |
40 runs across 4 scenario types. Control: agent with raw codebase access. Treatment: agent with Construct. Every run evaluated blind. 29 valid after filtering.
Results
Five of six hypotheses validated.
The standard approach, an agent exploring raw files, achieves 31% accuracy within the same resource budget. 583 file reads, 20.8 steps, 275K tokens. The agent builds structural understanding from scratch, one file at a time, the same way a developer opens an unfamiliar codebase and starts reading. The treatment approach starts with that structural understanding pre-computed. 51.7% accuracy, 141 file reads, 13.9 steps, 175K tokens. The difference isn't the agent. It's what the agent knows before it starts.
| Metric | Control | Treatment | Delta |
|---|---|---|---|
| Accuracy | 31.0% | 51.7% | +21pp |
| Steps | 20.8 | 13.9 | -33% |
| Tokens | 275K | 175K | -36% |
| File reads | 583 | 141 | -76% |
| Precision | — | 83% | — |
| Recall | — | 91% | — |
| Effect size | — | 0.42 | Medium |
95% confidence intervals: Control [17%-49%], Treatment [34%-69%]. The treatment's point estimate (51.7%) sits above the upper bound of the control interval. This is not random variance.
+21 percentage points.
The Paradox
H6 failed. Treatment scored lower on answer quality. Control: 0.51. Treatment: 0.46.
Control agents narrate: "Let me explore the codebase... I found some implementations... there might be more..." Treatment agents answer: "12 implementations found. [file paths, line numbers, type]."
The LLM judge, itself a frontier model, rewarded verbose narration and process transparency. It penalized terse correctness. A hedge reads as "thoughtful." A direct answer reads as "incomplete." The judge measures surface quality, not factual quality.
Pre-registration caught what ad-hoc evaluation would have missed. Without it, you'd report five passing hypotheses and a nice chart. With it, you discover that even the measurement needs measuring.
For agent systems operating at scale, eval infrastructure is safety infrastructure.
Reference-level Explanation
Component Protocol
Because the fifth principle says to encode the orchestration, the orchestrator needs to schedule, validate, and compose components without knowing what any of them do internally.
Every component declares three things: what it consumes, what it produces, and how it runs. The output is self-describing: it carries its own type identifier, so the orchestrator can verify it matches the declaration without knowing what's inside. Structural typing means any object with those three declarations satisfies the protocol. No base class, no framework import. A parser that computes a syntax graph in milliseconds and an agent that reasons via LLM for minutes both satisfy the same contract. The orchestrator doesn't distinguish between them.
From declarations alone, the orchestrator builds a dependency graph, validates topology (no missing producers, no cycles, no duplicate outputs), and schedules execution order. During execution, the orchestrator checks each component's output against its declaration. Two independent checks: one when the DAG compiles, one during execution. Errors surface early.
Components compose through the Construct, not through each other. The Construct is a typed, append-only ledger. Each entry is an artifact: frozen, timestamped, attributed to its producer. Components query the ledger by type and receive frozen artifacts. They produce new artifacts. The orchestrator appends them. No update, no delete, no mutation of existing entries. Parallelism and fault isolation are structural consequences: because no component can modify another's output, the orchestrator parallelizes anything without a dependency edge. Adding a component that consumes existing types extends the system without changing anything upstream.
Components depend on external tools through interfaces. No component imports infrastructure directly. The hexagonal boundary holds at the component level because the injection mechanism enforces it, not because convention requires it.
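A sketch of what the structural contract could look like in Python, using `typing.Protocol`. The component names, artifact types, and `run` signature are illustrative assumptions, not the system's actual definitions:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Component(Protocol):
    """Structural contract: any object with these three members qualifies.
    No base class, no framework import."""
    consumes: tuple[str, ...]   # artifact types read from the Construct
    produces: str               # artifact type this component appends
    def run(self, inputs: dict) -> dict: ...

class SyntaxParser:
    consumes: tuple[str, ...] = ()
    produces = "syntax_graph"
    def run(self, inputs: dict) -> dict:
        # Output is self-describing: it carries its own type identifier.
        return {"type": self.produces, "nodes": []}

class ArchitectureAgent:
    consumes = ("syntax_graph",)
    produces = "architecture_insights"
    def run(self, inputs: dict) -> dict:
        return {"type": self.produces, "insights": []}

# Neither class inherits anything, yet both satisfy the same contract.
assert isinstance(SyntaxParser(), Component)
assert isinstance(ArchitectureAgent(), Component)
```

The orchestrator only ever sees `consumes`, `produces`, and `run`; a millisecond parser and a minutes-long agent are indistinguishable at this boundary.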
Ports and Adapters
The first constraint had a second dimension beyond survivability. The architecture needed multiple runtimes simultaneously: Claude SDK and Google ADK running side by side, one for production accuracy, the other for local iteration speed. The domain core depends on nothing external. Everything external depends on the domain.
The component protocol and the hex boundary operate on different axes. The protocol governs how components compose, through declarations and the Construct. The hex boundary governs how the system isolates external dependencies, through ports and adapters. Both operate simultaneously. Parsers, encoders, agents, and graders all implement the component protocol. The hex determines which need adapters and which are pure domain logic.
- Domain core: Orchestrator and Construct. Zero external imports.
- Driving ports: Composition (how pipelines are triggered).
- Driven ports: Agent runtime, config source, construct storage. All interfaces.
- Adapters: Claude SDK and Google ADK for agent execution, YAML for configuration, DuckDB for construct persistence. Swapping Claude SDK for ADK means changing the composition configuration. One adapter replaced, zero component changes, zero orchestrator changes. The domain core doesn't know which runtime executes its agents.
The two validated runtimes are not equivalent. Claude's SDK supports MCP and deeper tool use. ADK with Ollama supports local models and faster iteration. Because they're not equivalent, the system validates compatibility before anything runs: each runtime declares its capabilities, and agent definitions declare their requirements. A mismatch means immediate failure with a clear error, not a silent degradation mid-pipeline.
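The negotiation itself can be as simple as a set-difference check at initialization. A sketch, with hypothetical capability names:

```python
class CapabilityError(RuntimeError):
    """Raised at initialization when a runtime can't satisfy an agent."""

def negotiate(runtime_caps: set[str], agent_requirements: set[str]) -> None:
    """Fail before anything runs if the chosen runtime lacks a requirement."""
    missing = agent_requirements - runtime_caps
    if missing:
        raise CapabilityError(
            f"runtime lacks required capabilities: {sorted(missing)}"
        )

claude_sdk = {"mcp", "tool_use", "sub_agents"}
adk_ollama = {"tool_use", "local_models"}

negotiate(claude_sdk, {"mcp", "tool_use"})      # compatible: returns silently
try:
    negotiate(adk_ollama, {"mcp", "tool_use"})  # missing 'mcp'
except CapabilityError as e:
    print(e)  # immediate, explicit failure, not silent mid-pipeline degradation
```

The point is where the check runs: at initialization, so a mismatch costs seconds instead of surfacing as a half-finished pipeline.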
Orchestration
A master agent deciding which specialists to run introduces stochastic control flow at the orchestration level. The fifth principle resolves this: if the scheduling of components can be encoded as a DAG compiled from configuration, encode it. A compiled DAG removes that class of failure entirely. What the configuration declares is what executes.
Three phases. During compilation, the orchestrator reads component declarations, builds a dependency graph, and validates topology. Missing producer, cycle, or duplicate output type: caught before anything runs. During scheduling, topological sort produces an ordered sequence of batches. Components within a batch share no dependency edges and execute in parallel. Components across batches execute sequentially because the later batch consumes what the earlier batch produced.
The number of layers is not fixed. The diagram shows three batches because that's what the default pipeline produces: parsers, then encoders, then agents. A different configuration produces different layers. Adding a component between encoders and agents creates a fourth batch. The DAG compiler determines the batching from whatever the configuration declares.
Deterministic components naturally land in earlier batches because agents depend on their outputs. The dependency graph determines the separation, not the design. The orchestrator guarantees that all enabled components will run, in dependency order, with outputs produced for each. The orchestrator doesn't control what happens inside any given component. An agent may use tools, spawn sub-agents, or decide how deep to analyze. The orchestration is compiled. The behavior within each component is that component's concern.
Because scheduling is resolved when the DAG compiles, the execution trace is reproducible. Same configuration, same component set, same batch order. The only variance is inside the agents, and that variance is isolated per component. One agent failing doesn't cascade.
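The compile-then-batch step maps directly onto a topological sort. A sketch using stdlib `graphlib`, with illustrative component names (not the system's actual pipeline):

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Dependency graph compiled from component declarations:
# component -> the producers it consumes.
deps = {
    "syntax_parser": set(),
    "git_history": set(),
    "metrics_encoder": {"syntax_parser"},
    "dependency_encoder": {"syntax_parser", "git_history"},
    "architecture_agent": {"metrics_encoder", "dependency_encoder"},
    "security_agent": {"metrics_encoder"},
}

ts = TopologicalSorter(deps)
ts.prepare()  # cycle detection happens here, before anything runs

batches = []
while ts.is_active():
    ready = list(ts.get_ready())  # no dependency edges among these:
    batches.append(sorted(ready))  # safe to run in parallel
    ts.done(*ready)

print(batches)
# [['git_history', 'syntax_parser'],
#  ['dependency_encoder', 'metrics_encoder'],
#  ['architecture_agent', 'security_agent']]
```

Parsers land in the first batch and agents in the last not by design decree but because the dependency edges put them there, which is the point of deriving batching from declarations.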
The Construct is generic over its subject type. For code analysis, the subject is a project. For evaluation, the subject is an agent trace. A component typed for one cannot receive the other. I didn't build a separate eval system. I composed one from the same parts: different subject, same protocol, same DAG, same engine.
Deterministic foundation, composable components, adaptable runtimes. All derived from first principles, not a framework's opinions. The ecosystem keeps changing. New techniques need critical evaluation, and evals provide the mechanism. But the eval paradox shows that the judge's biases can corrupt the verdict. It's the old problem: who watches the watchmen?
References
- Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Effect size standards: d = 0.2 (small), 0.5 (medium), 0.8 (large).
- Dijkstra, E.W. (1968). "Go To Statement Considered Harmful." Communications of the ACM, 11(3), 147-148. cs.utexas.edu
- Kim, Y., et al. (2025). "Towards a Science of Scaling Agent Systems." Google DeepMind. arXiv:2512.08296
- Ashby, W.R. (1956). "An Introduction to Cybernetics." panarchy.org
- Cockburn, A. (2005). "Hexagonal Architecture (Ports and Adapters)." alistair.cockburn.us
- Shen, J.H. & Tamkin, A. (2026). "How AI Impacts Skill Formation." Anthropic Research. arXiv:2601.20245
- Osmani, A. (2026). "Comprehension Debt." addyosmani.com
- Yamaguchi, F., Golde, N., Arp, D. & Rieck, K. (2014). "Modeling and Discovering Vulnerabilities with Code Property Graphs." IEEE S&P. (IEEE Test-of-Time Award, 2024)
Acknowledgments
Flint Weiss shared Feynman's first principle with me about a year before this project started. It has been one of the most valuable tools in the kit.