The Codebase Semantic Graph

A senior engineer decomposing a feature doesn’t think in files. They think in relationships: the User model is referenced by 15 downstream symbols, the auth module’s 6 files form a tight cluster with 47 internal references and only 3 external connections, and process_payment is called from both checkout_handler and webhook_receiver.

An agent without this information sees files and text. It can grep for symbols, but it can’t know which clusters of code should stay together, which modifications carry risk, or how changing one symbol ripples through the rest of the system. Building that understanding through exploration burns turns.

The Codebase Semantic Graph (CSG) computes this understanding from static analysis before any agent runs. tree-sitter parses every source file into an AST; graph algorithms (networkx) extract structure and relationships. No LLM. Deterministic. Four layers, each built from the one below.

Follow the User model through all four layers to see what each one adds.

Layer A (Symbol) turns every defined symbol in the codebase into a node: classes, functions, methods, types, constants. ORM model classes also carry schema annotations (table name, columns, foreign keys, relationships).

For User, Layer A captures:

| Property | Value |
| --- | --- |
| Class | User in src/backend/models/user.py, inheriting from Base |
| Columns | id (Integer, PK), email (String, not null), team_id (Integer, nullable) |
| Foreign key | team_id → teams.id |
| Relationship | team → Team (many-to-one) |

The old pipeline had a separate symbol index and a separate schema snapshot. The CSG unifies both as properties on graph nodes.
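As a rough illustration of what a Layer A node might hold, the sketch below extracts class, method, and function symbols from a small Python source string. It uses Python's built-in `ast` module as a stand-in for tree-sitter, and `SymbolNode`, `extract_symbols`, and the sample source are hypothetical names invented for this example, not the CSG's actual schema.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class SymbolNode:
    """Hypothetical Layer A node; the real CSG node format is not shown here."""
    name: str
    kind: str  # "class", "method", or "function"
    file: str
    bases: list[str] = field(default_factory=list)


# Made-up sample source, loosely modeled on the User example.
SOURCE = '''
class Base:
    pass

class User(Base):
    def display_name(self):
        return self.email

def create_user(email):
    return User()
'''


def extract_symbols(source: str, file: str) -> list[SymbolNode]:
    """Collect top-level classes and functions, plus methods defined in classes."""
    nodes = []
    for stmt in ast.parse(source).body:
        if isinstance(stmt, ast.ClassDef):
            bases = [ast.unparse(base) for base in stmt.bases]
            nodes.append(SymbolNode(stmt.name, "class", file, bases))
            for item in stmt.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    nodes.append(SymbolNode(f"{stmt.name}.{item.name}", "method", file))
        elif isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef)):
            nodes.append(SymbolNode(stmt.name, "function", file))
    return nodes


symbols = extract_symbols(SOURCE, "src/backend/models/user.py")
for symbol in symbols:
    print(symbol.kind, symbol.name, symbol.bases)
```

Schema annotations such as columns and foreign keys would be extra properties on the same node, which is the unification described above.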

Layer B adds edges between Layer A nodes. These edges capture call sites, imports, type references, and attribute access across the codebase.

For User, Layer B reveals:

  • create_user in auth/service.py instantiates User
  • register in api/users.py calls create_user
  • charge in billing/service.py references type User
  • user.py imports Base from models/base.py

Layer A tells you User exists. Layer B tells you User is created by the auth service, exposed through the user API, and referenced by billing. That is the difference between a file listing and a structural understanding of the codebase.
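The four references listed for User can be sketched as typed edges with a reverse index for "who points at this symbol" queries. The edge kinds, the bare symbol names used as node IDs, and the `referrers` helper are assumptions made for this example; the CSG's real edge format may differ.

```python
from collections import defaultdict

# (source, edge kind, target), taken from the four references listed for User.
# Edge-kind names are illustrative, not the CSG's actual vocabulary.
EDGES = [
    ("create_user", "instantiates", "User"),
    ("register", "calls", "create_user"),
    ("charge", "references_type", "User"),
    ("user.py", "imports", "Base"),
]

outgoing = defaultdict(list)  # symbol -> [(kind, target)]
incoming = defaultdict(list)  # symbol -> [(kind, source)]
for src, kind, dst in EDGES:
    outgoing[src].append((kind, dst))
    incoming[dst].append((kind, src))


def referrers(symbol: str) -> list[str]:
    """Return the symbols that directly reference `symbol`, sorted by name."""
    return sorted(src for _, src in incoming.get(symbol, []))


print(referrers("User"))
print(outgoing["register"])
```

Keeping both directions indexed makes "what does this symbol use" and "what uses this symbol" equally cheap, and the second query is the one the later layers lean on.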

Layer C (Domain) runs community detection (the Louvain algorithm) on the Layer B reference graph to discover domain clusters: groups of symbols that are densely connected internally and sparsely connected externally.

User lands in the auth cluster:

| Metric | Value |
| --- | --- |
| Symbols | User, create_user, authenticate, auth_middleware, and 12 others |
| Files | models/user.py, auth/service.py, auth/middleware.py, and 3 more |
| Internal references | 47 (within cluster) |
| External references | 3 (crossing cluster boundary) |
| Cohesion | 0.94 |

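A cohesion of 0.94 is consistent with the counts above: 47 of the cluster’s 50 references (47 internal plus 3 external) stay inside it. Splitting the cluster across tasks could turn up to 47 of those internal references into cross-task coordination points. The Architect sees this number and keeps the cluster together.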
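One plausible way to compute the cohesion figure is the share of a cluster's references that stay internal. That definition is an assumption: it reproduces 47 / (47 + 3) = 0.94 but is not confirmed here as the CSG's formula. The toy edge list and the `cohesion` helper are likewise illustrative.

```python
def cohesion(edges, cluster):
    """Return (internal, external, ratio) for one cluster.

    An edge is internal if both endpoints are in the cluster and external
    if exactly one is. Edges entirely outside the cluster are ignored.
    """
    internal = external = 0
    for src, dst in edges:
        endpoints_inside = (src in cluster) + (dst in cluster)
        if endpoints_inside == 2:
            internal += 1
        elif endpoints_inside == 1:
            external += 1
    total = internal + external
    return internal, external, (internal / total if total else 0.0)


# Toy reference edges (referrer, referenced); not real project data.
TOY_EDGES = [
    ("create_user", "User"),
    ("authenticate", "User"),
    ("auth_middleware", "authenticate"),
    ("register", "create_user"),
    ("charge", "User"),
]
AUTH_CLUSTER = {"User", "create_user", "authenticate", "auth_middleware"}

print(cohesion(TOY_EDGES, AUTH_CLUSTER))

# The same ratio applied to the counts quoted for the auth cluster.
print(round(47 / (47 + 3), 2))
```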

The coordination cost of any proposed decomposition is computable: sum the cross-cluster edges that would span task boundaries. Minimum coordination cost points to the optimal decomposition.
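The cost calculation can be sketched as counting edges whose endpoints land in different tasks, then comparing candidate decompositions. The edges, the two plans, and the task names below are made up for illustration, and real edges might carry weights rather than each counting as 1.

```python
def coordination_cost(edges, task_of):
    """Count reference edges whose two endpoints are assigned to different tasks."""
    return sum(1 for src, dst in edges if task_of[src] != task_of[dst])


# Toy reference edges (referrer, referenced); not real project data.
EDGES = [
    ("create_user", "User"),
    ("authenticate", "User"),
    ("auth_middleware", "authenticate"),
    ("register", "create_user"),
    ("charge", "User"),
]

# Plan A keeps the auth cluster in one task.
PLAN_A = {
    "User": "auth", "create_user": "auth", "authenticate": "auth",
    "auth_middleware": "auth", "register": "api", "charge": "billing",
}
# Plan B pulls User out into its own "models" task.
PLAN_B = dict(PLAN_A, User="models")

plans = {"A": PLAN_A, "B": PLAN_B}
costs = {name: coordination_cost(EDGES, plan) for name, plan in plans.items()}
best = min(costs, key=costs.get)
print(costs, "-> choose plan", best)
```

Pulling User out of the auth task doubles the cross-task edges in this toy graph, which is the kind of signal the Architect would use to keep the cluster intact.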

Layer D (Impact) runs graph analysis on Layer B edges to compute per-symbol risk metrics.

For User:

| Metric | Value | Meaning |
| --- | --- | --- |
| Blast radius | 23 | Changing User can transitively affect 23 other symbols |
| Centrality | 0.72 | High betweenness centrality; User bridges domain clusters |
| Dependents | 15 | Fifteen symbols directly reference User |
| Stability | bridge | High centrality, crosses domain boundaries, highest modification risk |

Compare to a helper function like format_date: blast radius 0, centrality 0.01, stability volatile. The Reviewer receives these numbers alongside the diff and adjusts scrutiny accordingly. A change to a bridge symbol with blast radius 23 gets a different level of review than a change to a leaf function with blast radius 0.
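Two of these metrics are simple to sketch. The example assumes dependents means the symbols that directly reference a symbol, and blast radius means every symbol that can reach it by following reference edges. Both readings match the table's wording but are not confirmed definitions. The edge list reuses the four User references from Layer B, with the import of Base modeled as a User → Base edge.

```python
from collections import defaultdict, deque

# (referrer, referenced) pairs; a change to the right side may affect the left.
EDGES = [
    ("create_user", "User"),
    ("register", "create_user"),
    ("charge", "User"),
    ("User", "Base"),
]

dependents_of = defaultdict(set)  # symbol -> symbols that reference it directly
for src, dst in EDGES:
    dependents_of[dst].add(src)


def direct_dependents(symbol: str) -> int:
    """Number of symbols that reference `symbol` directly."""
    return len(dependents_of.get(symbol, ()))


def blast_radius(symbol: str) -> int:
    """Number of other symbols that reference `symbol` directly or transitively."""
    seen = set()
    queue = deque([symbol])
    while queue:
        current = queue.popleft()
        for dependent in dependents_of.get(current, ()):
            if dependent != symbol and dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return len(seen)


for name in ("User", "Base", "register"):
    print(name, direct_dependents(name), blast_radius(name))
```

Betweenness centrality needs a shortest-path computation over the whole graph and is left out of this sketch.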

The individual layers are established techniques. Applying them to AI agent orchestration is new. Four capabilities emerge from the combination:

| Capability | How it works | Who uses it |
| --- | --- | --- |
| Coordination cost optimization | Cross-cluster edge counts quantify the coordination overhead of any task decomposition. The Architect minimizes this cost when drawing task boundaries. | Architect |
| Graph-distance context scoping | The context pipeline selects what each agent sees based on structural proximity in the reference graph: full content for files touched and 1-hop neighbors, skeletons for 2-hop, nothing beyond. | Layer 2 (all agents) |
| Risk-proportional review | Blast radius and centrality set review intensity per change. The Reviewer sees “blast radius 23, stability bridge, verify these 8 downstream consumers” alongside the diff. | Reviewer |
| Impact-aware decomposition | High-centrality symbols (bridges between domains) require downstream coordination when modified. The Architect sees this before creating task boundaries. | Architect |
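Graph-distance context scoping amounts to a depth-limited breadth-first search. The sketch below assumes a file-level neighbor map that treats references as undirected. That is a simplifying assumption, and the map itself is derived from the Layer B example rather than from a real project. `scope_context` is a hypothetical helper name.

```python
from collections import deque

# File-level neighbors implied by the Layer B example, treated as undirected.
NEIGHBORS = {
    "api/users.py": ["auth/service.py"],
    "auth/service.py": ["api/users.py", "models/user.py"],
    "models/user.py": ["auth/service.py", "billing/service.py", "models/base.py"],
    "billing/service.py": ["models/user.py"],
    "models/base.py": ["models/user.py"],
}


def scope_context(touched, neighbors, max_hops=2):
    """Map each file within `max_hops` of a touched file to a detail level.

    Touched files and 1-hop neighbors get "full"; 2-hop files get "skeleton";
    anything further away is left out entirely.
    """
    distance = {path: 0 for path in touched}
    queue = deque(touched)
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in neighbors.get(current, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return {path: ("full" if hops <= 1 else "skeleton") for path, hops in distance.items()}


context = scope_context(["api/users.py"], NEIGHBORS)
print(context)
```

For a task touching api/users.py, the auth service arrives in full, models/user.py as a skeleton, and the billing and base-model files (3 hops away) are omitted.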

The CSG’s layers draw from established research. The combination into a multi-layer semantic graph designed for AI agent orchestration is novel.

| Layer | Foundation | Key work |
| --- | --- | --- |
| A (Symbol) | Compiler symbol tables | Standard since the 1960s; extended with ORM schema annotations |
| B (Reference) | Call graphs and use-def chains | Allen & Cocke (1976); Aider’s repo map uses similar tree-sitter extraction |
| C (Domain) | Community detection | Blondel et al. (2008), the Louvain algorithm; Mancoridis et al. (1999), software clustering |
| D (Impact) | Change impact analysis | Weiser (1981), program slicing; Freeman (1977), betweenness centrality |
| Multi-layer | Code property graphs | Yamaguchi et al. (2014), combined AST + control flow + dependence graphs for vulnerability detection |

The entire CSG is built from tree-sitter: one parser infrastructure, 100+ language grammars, proper AST access. Each source file is parsed once. Multiple queries against the same AST feed both the CSG and file skeletons.

tree-sitter was created at GitHub by Max Brunsfeld, who now works on Zed, and it powers Zed, Neovim, Helix, Aider, and Repomix.

For the full JSON structure, build pipeline, and performance characteristics, see Architecture: Context Pipeline.