Language Support & Extraction
SPEED uses a multi-layered extraction system to build the Codebase Semantic Graph (CSG). This system shifts the burden of understanding the codebase from the agent to the infrastructure, ensuring high-fidelity context without hallucinations.
The Language Registry
The `LanguageRegistry` is the single source of truth for all language knowledge. It maps file extensions to canonical language names and defines how the pipeline treats each language.
- Helix Integration: SPEED vendors a copy of the Helix editor’s `languages.toml` for extension-to-name mappings (180+ languages). This is separate from `extraction.toml`, which controls how each language is parsed.
- Extraction Levels (defined in `lib/context/data/extraction.toml`): Not every language requires full symbol extraction. Three levels:
  - `rules`: Full symbol and relationship extraction using ast-grep (e.g., Python, Go, TypeScript).
  - `skeleton`: Structure-only parsing for large files (e.g., HTML, CSS, JSON).
  - `none`: File-type recognition only, no internal parsing.
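The two-step lookup described above (extension to language name, then language name to extraction level) can be sketched roughly as follows. This is a minimal illustration, not SPEED's actual `LanguageRegistry`: the `resolve` function, the two dictionaries, and the fallback to `none` for unlisted languages are all assumptions made for the example.

```python
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical sample data. In SPEED the extension mappings come from the
# vendored languages.toml and the levels from lib/context/data/extraction.toml.
EXTENSION_TO_LANGUAGE = {
    "py": "python", "go": "go", "ts": "typescript",
    "html": "html", "json": "json", "md": "markdown",
}
EXTRACTION_LEVELS = {
    "python": "rules", "go": "rules", "typescript": "rules",
    "html": "skeleton", "json": "skeleton",
}


def resolve(path: str) -> Tuple[Optional[str], str]:
    """Return (canonical language name, extraction level) for a file path."""
    ext = Path(path).suffix.lstrip(".").lower()
    language = EXTENSION_TO_LANGUAGE.get(ext)
    if language is None:
        return None, "none"  # unknown extension: nothing to parse
    # Assumption: a recognized language without an entry falls back to "none".
    return language, EXTRACTION_LEVELS.get(language, "none")
```

Keeping the two tables separate mirrors the split between `languages.toml` (naming) and `extraction.toml` (parsing behavior), so a language can be recognized without being parsed.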
Extraction Engine: tree-sitter + ast-grep
tree-sitter parses every source file into an AST. ast-grep queries that AST using declarative YAML rules to extract symbols and references. Both are required: tree-sitter provides the parse, ast-grep provides the query layer.
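To make the parse/query split concrete, here is a small stand-in that runs without any third-party packages. It is not tree-sitter or ast-grep: Python's built-in `ast` module plays the parser role, and a list of dictionaries plays the role of the declarative rules. The `RULES` shape and the `extract_symbols` helper are hypothetical.

```python
import ast

# Declarative "rules": which node kind to match and which symbol kind it yields.
# Hypothetical shape for illustration; real rules are ast-grep YAML files.
RULES = [
    {"kind": "ClassDef", "symbol": "Class"},
    {"kind": "FunctionDef", "symbol": "Function"},
]


def extract_symbols(source: str) -> list:
    """Parse source, then apply every rule to every node in the tree."""
    tree = ast.parse(source)  # parse step (the role tree-sitter plays)
    symbols = []
    for node in ast.walk(tree):  # query step (the role ast-grep plays)
        for rule in RULES:
            if type(node).__name__ == rule["kind"]:
                symbols.append((rule["symbol"], node.name, node.lineno))
    # Sort by line number so output order does not depend on traversal order.
    return sorted(symbols, key=lambda s: s[2])
```

The point of the sketch is that adding a new symbol kind means adding a rule entry, not changing the traversal code.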
Why ast-grep?
- Declarative: Symbol extraction is defined in YAML, not Python code.
- Cross-Language: A single rule format works for Go structs, Python classes, and TypeScript interfaces.
- Predictable: No LLM calls are involved in symbol extraction, ensuring deterministic results.
Rule Metadata
Each ast-grep rule in `lib/context/rules/` declares what it produces via metadata. This mapping informs the CSG construction:
| Produces | CSG Artifact |
|---|---|
| `node` | A Layer A symbol node (Class, Function, Struct). |
| `node` + `orm: true` | A symbol node with database schema annotations. |
| `edge` | A structural relationship (Inherits, Implements). |
| `reference` | A Layer B usage edge (Attribute access, Method call). |
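The table can be read as a small dispatch on rule metadata. The sketch below assumes the metadata is available as a dictionary with a `produces` key and an optional `orm` flag. Those key names are inferred from the table and may not match the real rule files, and `csg_artifact` is a hypothetical helper.

```python
# Artifact descriptions taken from the table above.
ARTIFACTS = {
    "node": "symbol node (Layer A)",
    "edge": "structural relationship",
    "reference": "usage edge (Layer B)",
}


def csg_artifact(metadata: dict) -> str:
    """Describe the CSG artifact a rule yields, based on its metadata."""
    produces = metadata.get("produces")  # assumed key name
    if produces not in ARTIFACTS:
        raise ValueError(f"unknown 'produces' value: {produces!r}")
    artifact = ARTIFACTS[produces]
    # Only nodes carry the ORM annotation in the table.
    if produces == "node" and metadata.get("orm"):
        artifact += " with database schema annotations"
    return artifact
```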
Custom Language Support
For languages not built into ast-grep (like Prisma or GraphQL), SPEED uses a custom grammar pipeline:
- Grammar Compilation: Tree-sitter grammars are compiled to platform-specific shared libraries (`.dylib`/`.so`).
- Registration: Libraries are registered in `sgconfig.yml`.
- Unified Querying: Once registered, these languages are queried using the same YAML rule syntax as built-in languages.
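As a rough illustration of the registration step, the sketch below builds the kind of entry that would be written to `sgconfig.yml`. The `customLanguages`, `libraryPath`, and `extensions` keys follow ast-grep's custom-language configuration as commonly documented, and are worth verifying against the ast-grep docs for your version. The `grammars` directory, the file naming, and all three helper functions are assumptions for the example, not SPEED's actual layout.

```python
import sys


def shared_library_suffix(platform: str = sys.platform) -> str:
    """Pick the platform-specific suffix mentioned above (.dylib or .so)."""
    return ".dylib" if platform == "darwin" else ".so"


def custom_language_entry(lang, extensions, lib_dir="grammars", platform=sys.platform):
    """Build one customLanguages entry (lib_dir is a hypothetical location)."""
    return {
        "libraryPath": f"{lib_dir}/{lang}{shared_library_suffix(platform)}",
        "extensions": list(extensions),
    }


def render_sgconfig(entries: dict) -> str:
    """Render the entries as a YAML fragment without needing a YAML library."""
    lines = ["customLanguages:"]
    for lang, entry in entries.items():
        lines.append(f"  {lang}:")
        lines.append(f"    libraryPath: {entry['libraryPath']}")
        lines.append(f"    extensions: [{', '.join(entry['extensions'])}]")
    return "\n".join(lines) + "\n"
```

For example, `render_sgconfig({"prisma": custom_language_entry("prisma", ["prisma"])})` produces a fragment that could be merged into an existing `sgconfig.yml`.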
Grammar Detection
SPEED automatically detects installed tree-sitter grammars via pip. If you encounter a warning about “unparseable files” in `speed status`, you can enable full extraction by installing the corresponding grammar package:
    pip install tree-sitter-kotlin tree-sitter-elixir
Adding a New Language
The `speed add-language <lang>` command automates the process. Behind the scenes, it performs three steps:
- Install the Grammar: The `tree-sitter-{lang}` pip package must be installed first. The command validates this and exits with instructions if missing.
- Define YAML Rules: The command reads the grammar’s `node-types.json` and generates draft `definitions.yml` and `references.yml` rules in `lib/context/rules/{lang}/`. Review and test these with `speed test-rules {lang}`.
- Register in `extraction.toml`: The command adds a `[{lang}]` section with `extraction = "rules"` to `lib/context/data/extraction.toml` if not already present.
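The first and third steps lend themselves to a short sketch. The code below is not the command's real implementation. It assumes that a `tree-sitter-{lang}` pip package exposes an importable `tree_sitter_{lang}` module, and it treats `extraction.toml` as plain text. Both `grammar_installed` and `register_language` are hypothetical helpers.

```python
import importlib.util


def grammar_installed(lang: str) -> bool:
    """Check whether a grammar module for the language can be imported."""
    # Assumption: tree-sitter-{lang} installs a module named tree_sitter_{lang}.
    module = "tree_sitter_" + lang.replace("-", "_")
    return importlib.util.find_spec(module) is not None


def register_language(toml_text: str, lang: str) -> str:
    """Append a [lang] section with extraction = "rules" unless it exists."""
    header = f"[{lang}]"
    if any(line.strip() == header for line in toml_text.splitlines()):
        return toml_text  # already registered: leave the file unchanged
    # Make sure existing content ends with a newline before appending.
    prefix = toml_text if toml_text.endswith("\n") or not toml_text else toml_text + "\n"
    separator = "\n" if prefix else ""
    return f'{prefix}{separator}{header}\nextraction = "rules"\n'
```

Because `register_language` returns its input unchanged when the section already exists, running it repeatedly is safe, which matches the "if not already present" behavior described in the third step.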