The Codebase Semantic Graph
A senior engineer decomposing a feature doesn’t think in files. They think in relationships: the User model is referenced by 15 downstream symbols, the auth module’s 6 files form a tight cluster with 47 internal references and only 3 external connections, and process_payment is called from both checkout_handler and webhook_receiver.
An agent without this information sees files and text. It can grep for symbols, but it can’t know which clusters of code should stay together, which modifications carry risk, or how changing one symbol ripples through the rest of the system. Building that understanding through exploration burns turns.
The Codebase Semantic Graph (CSG) computes this understanding from static analysis before any agent runs. tree-sitter parses every source file into an AST; graph algorithms (networkx) extract structure and relationships. No LLM. Deterministic. Four layers, each built from the one below.
Four layers, one through-line
Follow the User model through all four layers to see what each one adds.
Layer A: Symbol
Every defined symbol in the codebase becomes a node: classes, functions, methods, types, constants. ORM model classes carry schema annotations (table name, columns, foreign keys, relationships).
For User, Layer A captures:
| Property | Value |
|---|---|
| Class | User in src/backend/models/user.py, inheriting from Base |
| Columns | id (Integer, PK), email (String, not null), team_id (Integer, nullable) |
| Foreign key | team_id → teams.id |
| Relationship | team → Team (many-to-one) |
The old pipeline had a separate symbol index and a separate schema snapshot. The CSG unifies both as properties on graph nodes.
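A minimal sketch of what such a unified node could look like, using the User values from the table above. The `SymbolNode` and `Column` classes and their field names are illustrative assumptions, not the CSG's actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    type: str
    primary_key: bool = False
    nullable: bool = True


@dataclass
class SymbolNode:
    """One Layer A node (hypothetical shape): symbol facts plus schema annotations."""
    name: str
    kind: str                                   # "class", "function", "method", ...
    file: str
    bases: list = field(default_factory=list)
    # Schema annotations, populated only for ORM model classes.
    table: Optional[str] = None                 # left unset: not shown in the example
    columns: list = field(default_factory=list)
    foreign_keys: dict = field(default_factory=dict)
    relationships: dict = field(default_factory=dict)


user = SymbolNode(
    name="User",
    kind="class",
    file="src/backend/models/user.py",
    bases=["Base"],
    columns=[
        Column("id", "Integer", primary_key=True, nullable=False),  # PKs are non-null
        Column("email", "String", nullable=False),
        Column("team_id", "Integer", nullable=True),
    ],
    foreign_keys={"team_id": "teams.id"},
    relationships={"team": "Team"},             # many-to-one
)

print([c.name for c in user.columns if not c.nullable])
```

Keeping the schema on the node means a consumer that already holds the symbol does not need a second lookup into a separate snapshot.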
Layer B: Reference
Layer B adds edges between Layer A nodes. These edges capture call sites, imports, type references, and attribute access across the codebase.
For User, Layer B reveals:
- create_user in auth/service.py instantiates User
- register in api/users.py calls create_user
- charge in billing/service.py references type User
- user.py imports Base from models/base.py
Layer A tells you User exists. Layer B tells you User is created by the auth service, exposed through the user API, and referenced by billing. The difference is between a file listing and structural understanding of the codebase.
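The User edges can be sketched as typed triples with an index in both directions. The edge-kind labels and the `index_edges` helper are assumptions made for illustration; the real pipeline is described as using networkx, while plain dictionaries stand in for it here.

```python
from collections import defaultdict

# (source, edge kind, target). Edge-kind names are illustrative labels.
EDGES = [
    ("create_user", "instantiates", "User"),
    ("register", "calls", "create_user"),
    ("charge", "references_type", "User"),
    ("user.py", "imports", "Base"),
]


def index_edges(edges):
    """Build outgoing and incoming adjacency lists from edge triples."""
    outgoing, incoming = defaultdict(list), defaultdict(list)
    for src, kind, dst in edges:
        outgoing[src].append((kind, dst))
        incoming[dst].append((kind, src))
    return outgoing, incoming


outgoing, incoming = index_edges(EDGES)

# Who touches User directly, and how?
print(incoming["User"])
# [('instantiates', 'create_user'), ('references_type', 'charge')]
```

The incoming index answers "what depends on this symbol", which is the question Layers C and D build on.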
Layer C: Domain
Community detection (Louvain algorithm) on the Layer B reference graph discovers domain clusters: groups of symbols that are densely connected internally and sparsely connected externally.
User lands in the auth cluster:
| Metric | Value |
|---|---|
| Symbols | User, create_user, authenticate, auth_middleware, and 12 others |
| Files | models/user.py, auth/service.py, auth/middleware.py, and 3 more |
| Internal references | 47 (within cluster) |
| External references | 3 (crossing cluster boundary) |
| Cohesion | 0.94 |
A cohesion of 0.94 means 47 of the cluster's 50 references stay inside it. Splitting the cluster across tasks would turn up to 47 of those internal references into cross-task coordination points. The Architect sees this number and keeps the cluster together.
The coordination cost of any proposed decomposition is computable: sum the cross-cluster edges that would span task boundaries. Minimum coordination cost points to the optimal decomposition.
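Both numbers are cheap to compute once the edges exist. The sketch below assumes cohesion is the share of a cluster's references that stay internal, which matches 47 / (47 + 3) = 0.94 in the table. The function names, the toy edge list, and the placement of register and charge in "api" and "billing" tasks are hypothetical.

```python
def cohesion(internal_refs: int, external_refs: int) -> float:
    """Share of a cluster's references that stay inside the cluster (assumed definition)."""
    total = internal_refs + external_refs
    return internal_refs / total if total else 0.0


def coordination_cost(edges, task_of):
    """Count reference edges whose two endpoints are assigned to different tasks."""
    return sum(
        1
        for src, dst in edges
        if src in task_of and dst in task_of and task_of[src] != task_of[dst]
    )


# Toy subset of the reference graph: (referencing symbol, referenced symbol).
EDGES = [
    ("create_user", "User"),
    ("authenticate", "User"),
    ("auth_middleware", "authenticate"),
    ("register", "create_user"),
    ("charge", "User"),
]

# Decomposition 1: the auth cluster stays in one task.
together = {
    "User": "auth", "create_user": "auth", "authenticate": "auth",
    "auth_middleware": "auth", "register": "api", "charge": "billing",
}
# Decomposition 2: the auth cluster is split, with User alone in its own task.
split = {**together, "User": "auth-a", "create_user": "auth-b",
         "authenticate": "auth-b", "auth_middleware": "auth-b"}

print(round(cohesion(47, 3), 2))           # 0.94
print(coordination_cost(EDGES, together))  # 2
print(coordination_cost(EDGES, split))     # 4
```

When a task boundary cuts through a cluster, edges that were internal to the cluster start counting toward the cost, which is why the split decomposition scores worse.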
Layer D: Impact
Graph analysis on Layer B edges computes per-symbol risk metrics.
For User:
| Metric | Value | Meaning |
|---|---|---|
| Blast radius | 23 | Changing User can transitively affect 23 other symbols |
| Centrality | 0.72 | High betweenness centrality; User bridges domain clusters |
| Dependents | 15 | Fifteen symbols directly reference User |
| Stability | bridge | High centrality, crosses domain boundaries, highest modification risk |
Compare to a helper function like format_date: blast radius 0, centrality 0.01, stability volatile. The Reviewer receives these numbers alongside the diff and adjusts scrutiny accordingly. A change to a bridge symbol with blast radius 23 gets a different level of review than a change to a leaf function with blast radius 0.
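Two of these metrics can be sketched as graph traversals, assuming blast radius is the set of symbols that reach the changed symbol by following references transitively, and dependents are the direct referencers. The toy edge list and function names are illustrative; betweenness centrality is omitted here.

```python
from collections import defaultdict, deque

# (referencing symbol, referenced symbol): a toy subset of Layer B.
EDGES = [
    ("create_user", "User"),
    ("register", "create_user"),
    ("charge", "User"),
    ("User", "Base"),
]


def build_dependents(edges):
    """Map each symbol to the symbols that directly reference it."""
    dependents = defaultdict(set)
    for src, dst in edges:
        dependents[dst].add(src)
    return dependents


def blast_radius(symbol, dependents):
    """Symbols that can be transitively affected by a change to `symbol`."""
    seen, queue = {symbol}, deque([symbol])
    while queue:
        current = queue.popleft()
        for dep in dependents.get(current, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    seen.discard(symbol)  # the changed symbol itself is not part of its radius
    return seen


dependents = build_dependents(EDGES)

print(sorted(dependents["User"]))                   # ['charge', 'create_user']
print(sorted(blast_radius("User", dependents)))     # ['charge', 'create_user', 'register']
print(len(blast_radius("format_date", dependents))) # 0: nothing references it here
```

In this toy graph register never mentions User, yet it sits inside User's blast radius because it calls create_user. That transitive reach is what the Reviewer would not see from the diff alone.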
What the CSG enables
The individual layers are established techniques. Applying them to AI agent orchestration is new. Four capabilities emerge from the combination:
| Capability | How it works | Who uses it |
|---|---|---|
| Coordination cost optimization | Cross-cluster edge counts quantify the coordination overhead of any task decomposition. The Architect minimizes this cost when drawing task boundaries. | Architect |
| Graph-distance context scoping | The context pipeline selects what each agent sees based on structural proximity in the reference graph: full content for files touched and 1-hop neighbors, skeletons for 2-hop, nothing beyond. | Layer 2 (all agents) |
| Risk-proportional review | Blast radius and centrality set review intensity per change. The Reviewer sees “blast radius 23, stability bridge, verify these 8 downstream consumers” alongside the diff. | Reviewer |
| Impact-aware decomposition | High-centrality symbols (bridges between domains) require downstream coordination when modified. The Architect sees this before creating task boundaries. | Architect |
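Graph-distance context scoping, for example, reduces to a bounded breadth-first search. This sketch assumes file-level reference edges treated as undirected; the function names and the toy edge list are hypothetical, and only the full/skeleton/nothing thresholds come from the table.

```python
from collections import defaultdict, deque

# File-level reference edges (toy subset), treated as undirected for proximity.
FILE_EDGES = [
    ("api/users.py", "auth/service.py"),
    ("auth/service.py", "models/user.py"),
    ("billing/service.py", "models/user.py"),
    ("models/user.py", "models/base.py"),
]


def hop_distances(edges, touched, max_hops=2):
    """BFS from the touched files, stopping after `max_hops` hops."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    dist = {node: 0 for node in touched}
    queue = deque(touched)
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand past the outermost ring
        for nxt in neighbors[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def scope_context(edges, touched):
    """Full content for touched files and 1-hop neighbors, skeletons at 2 hops."""
    levels = {0: "full", 1: "full", 2: "skeleton"}
    return {node: levels[d] for node, d in hop_distances(edges, touched).items()}


print(scope_context(FILE_EDGES, ["api/users.py"]))
# {'api/users.py': 'full', 'auth/service.py': 'full', 'models/user.py': 'skeleton'}
```

Files three or more hops away, such as billing/service.py in this toy graph, never enter the result, so they cost the agent no context.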
Research foundations
The CSG’s layers draw from established research. The combination into a multi-layer semantic graph designed for AI agent orchestration is novel.
| Layer | Foundation | Key work |
|---|---|---|
| A (Symbol) | Compiler symbol tables | Standard since the 1960s; extended with ORM schema annotations |
| B (Reference) | Call graphs and use-def chains | Allen & Cocke (1976); Aider’s repo map uses similar tree-sitter extraction |
| C (Domain) | Community detection | Blondel et al. (2008), the Louvain algorithm; Mancoridis et al. (1999), software clustering |
| D (Impact) | Change impact analysis | Weiser (1981), program slicing; Freeman (1977), betweenness centrality |
| Multi-layer | Code property graphs | Yamaguchi et al. (2014), combined AST + control flow + dependence graphs for vulnerability detection |
Built with tree-sitter
The entire CSG is built from tree-sitter: one parser infrastructure, 100+ language grammars, proper AST access. Each source file is parsed once. Multiple queries against the same AST feed both the CSG and file skeletons.
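The parse-once, query-many pattern can be illustrated without any dependencies. The sketch below deliberately swaps in Python's standard-library `ast` module as a stand-in for tree-sitter, so it handles Python source only. The sample source, the `display_name` method, and both extractor functions are hypothetical.

```python
import ast

SOURCE = '''
class User(Base):
    def display_name(self):
        return self.email.split("@")[0]

def create_user(email):
    return User(email=email)
'''

tree = ast.parse(SOURCE)  # parse once; both consumers below reuse this tree


def extract_symbols(tree):
    """Consumer 1: (name, kind) pairs that would become Layer A nodes."""
    symbols = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            symbols.append((node.name, "class"))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append((node.name, "function"))
    return symbols


def extract_skeleton(tree):
    """Consumer 2: top-level signatures only, with bodies dropped."""
    lines = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            bases = ", ".join(ast.unparse(b) for b in node.bases)
            lines.append(f"class {node.name}({bases}): ...")
        elif isinstance(node, ast.FunctionDef):
            lines.append(f"def {node.name}({ast.unparse(node.args)}): ...")
    return lines


print(sorted(extract_symbols(tree)))
print(extract_skeleton(tree))
# ['class User(Base): ...', 'def create_user(email): ...']
```

With tree-sitter the same shape would apply across languages: one tree per file, and a separate query per consumer.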
tree-sitter is maintained by Zed (previously GitHub) and powers Zed, Neovim, Helix, Aider, and Repomix.
For the full JSON structure, build pipeline, and performance characteristics, see the Architecture: Context Pipeline.