← Back to all posts

Symtrace: A Semantic Git Diff Engine Built on AST Analysis

rustdeveloper-toolsgit-toolingstatic-analysissemantic-diffopen-source

Symtrace: A Semantic Git Diff Engine Built on AST Analysis

Deterministic structural diffing using tree-sitter, BLAKE3 hashing, and a five-phase matching algorithm.

1. The Problem With Line-Based Diffing

git diff is a line differ. It compares sequences of text and marks what was added, removed, or unchanged. For most workflows that is sufficient. For code review that involves refactoring, it regularly misleads.

Consider a function that is extracted and moved to a utilities module:

- func validate(input string) bool {
-     return len(input) > 0 && input != "null"
- }
+ // utils/validation.go
+ func validate(input string) bool {
+     return len(input) > 0 && input != "null"
+ }

The textual diff presents this as a deletion and an addition — two unrelated changes across two files. There is no signal that the logic is identical and the only change is location. A reviewer reading this diff must mentally reconstruct the equivalence.

This degrades further at scale:

  • Move detection — line diff has no concept of a function moving across files. Each file is diffed independently.
  • Refactor noise — renaming a variable generates a large textual diff even when the semantics are unchanged.
  • Cross-file blind spots — inlining a function, extracting an interface, or splitting a class are structurally significant changes that produce noisy line-level output.

None of these are edge cases. They are among the most common operations in active codebases. A diff tool that cannot reason about them is operating below the level of abstraction that code actually lives at.

2. Core Idea: Semantic Diffing

Symtrace works at the syntax tree level, not at the text level. The approach: parse both versions of each file into an Abstract Syntax Tree, then compare the trees structurally.

Parsing with Tree-sitter

Tree-sitter provides incremental, error-tolerant parsers for many languages. Symtrace uses it to produce typed, hierarchical ASTs from source files — function declarations, class bodies, import lists, and similar constructs become named, traversable nodes rather than lines of text.

Source file
    └─ FunctionDeclaration "validate"
           ├─ ParameterList
           │      └─ Parameter "input: string"
           └─ BlockStatement
                  └─ ReturnStatement
                         └─ BinaryExpression ...

Node-Level Comparison and Structural Hashing

Each node in the AST is assigned a four-hash identity tuple:

HashCovers
syntax_hashNode type + children structure
content_hashLiteral text of the node
semantic_hashCanonicalized form (whitespace/comment stripped)
position_hashFile path + byte offset

This identity model lets symtrace distinguish between a node that moved (syntax_hash matches, position_hash differs), a node that was renamed (syntax_hash matches, content_hash differs partially), and a node that was genuinely modified.

Five-Phase Matching

Matching is performed in ordered phases. Each phase resolves a class of changes:

  1. Exact match — identical content_hash across versions; node unchanged.
  2. Semantic match — identical semantic_hash; whitespace-only or comment-only delta.
  3. Structure match — identical syntax_hash; content changed, shape preserved.
  4. Move detection — same node by identity appearing at a different position_hash, including cross-file.
  5. Rename + modify — partial structural similarity with both location and content change.

Nodes that remain unmatched after all phases are classified as additions or deletions.

The output is a structured change set — not a line diff — with explicit labels: UNCHANGED, MODIFIED, MOVED, RENAMED, ADDED, DELETED.

3. Architecture Overview

┌──────────────────────────────────────────────────────┐
│                      symtrace                        │
│                                                      │
│  Input: two git tree states (blob hashes)            │
│         │                                            │
│         ▼                                            │
│  ┌─────────────────┐   Short-circuit on matching     │
│  │  Blob Hash Check│──────────────────────────────►  │
│  └────────┬────────┘   blob hash → skip parse        │
│           │ hash differs                             │
│           ▼                                          │
│  ┌─────────────────────────┐                         │
│  │  Incremental AST Parser │  (tree-sitter, arena)   │
│  │  LRU cache keyed on     │                         │
│  │  content_hash           │                         │
│  └────────────┬────────────┘                         │
│               ▼                                      │
│  ┌────────────────────────┐                          │
│  │  Four-Hash Identity    │  BLAKE3 (all four lanes) │
│  │  per AST node          │                          │
│  └────────────┬───────────┘                          │
│               ▼                                      │
│  ┌────────────────────────┐                          │
│  │  Five-Phase Matcher    │  hash-bucket indexed     │
│  └────────────┬───────────┘                          │
│               ▼                                      │
│  ┌────────────────────────┐                          │
│  │  ChangeSet Output      │  structured, typed       │
│  └────────────────────────┘                          │
└──────────────────────────────────────────────────────┘

Incremental Parsing and Tree Reuse

Tree-sitter supports incremental re-parsing: given an edit range, it reuses the unchanged subtrees from the previous parse. Symtrace uses this across a multi-file session — if the same file appears in multiple diffs with an overlapping change range, prior parse work is not discarded. This significantly reduces CPU time for workflows that process a sequence of commits.

LRU Cache

Parsed trees are cached in an LRU store keyed on the file’s content hash. The cache is bounded in both entry count and total memory footprint. Cache entries are versioned — a schema version in the envelope prevents stale deserialized trees from a prior symtrace version from producing incorrect results.

4. Performance Engineering

The following benchmarks compare symtrace against git diff (running --histogram algorithm) on a set of representative scenarios. These are honest measurements; results will vary by repository and workload.

ScenarioFiles changedgit diff (ms)symtrace (ms)Notes
Pure refactor (move + rename)12411Extra parse cost, richer output
Small feature commit524Parse overhead dominates
Large refactor (cross-file)803829Parallel parse + hash indexing
Repeated commit sequence80 × 1038061Tree reuse eliminates redundant parse

For single small commits, symtrace is slower than git diff because parsing and hashing introduce overhead that a line differ does not pay. The break-even is reached around 20–30 changed files, and symtrace scales sub-linearly beyond that point due to:

  • Parallel parsing: each file’s AST is computed on a separate thread. Parsing is independent per file and parallelizes naturally.
  • Arena allocator: AST nodes are allocated in a typed arena per parse session, eliminating per-node heap allocation overhead and improving cache locality.
  • Blob hash short-circuit: if the git blob hash for a file is identical between revisions, that file is skipped entirely — no read, no parse, no comparison.
  • Hash bucket indexing: the five-phase matcher builds inverted indexes over each hash type so match resolution is O(1) per node per phase rather than O(n).

The practical implication: symtrace is not a replacement for git diff in interactive terminal use. It is positioned for batch workflows — CI pipelines, pre-receive hooks, refactor audit tooling — where the additional parse cost is justified by structured output.

5. Security and Supply Chain Hardening

Symtrace processes source code from arbitrary repositories. The security posture is built on a few clear constraints.

No network surface. Symtrace performs no network I/O. There is no telemetry, no update check, no remote call of any kind. All processing is local and offline. This is not a default-off feature — there is no network code in the codebase.

unsafe code denied. The Cargo manifest carries #![deny(unsafe_code)]. No unsafe blocks exist in the primary codebase. Tree-sitter’s C bindings are wrapped via the tree-sitter crate, which provides a safe API surface; that boundary is the only location where Rust’s safety guarantees do not apply, and it is isolated to a thin FFI layer.

Bounded cache deserialization. Parsed tree caches can be persisted to disk and loaded on subsequent runs. The deserialization path enforces a size limit before reading to prevent malformed or adversarial cache files from causing unbounded memory allocation.

Versioned cache envelope. Every cache file begins with a schema version identifier. A version mismatch is treated as a cache miss, not an error, and the file is reparsed cleanly. This prevents a class of subtle bugs where an old cached tree is interpreted with a new node schema.

Dependency pinning and audit. Cargo.lock is committed. CI runs cargo audit on every push to surface known CVEs in the dependency tree. cargo deny enforces license compatibility and blocks duplicate dependency versions above a configurable threshold.

6. When to Use Symtrace

Symtrace is not a general-purpose replacement for git diff. It solves a specific class of problems. Use it when:

Refactor auditing — you want to verify that a refactoring PR introduces no semantic changes, only structural reorganization. Symtrace can confirm that all changed nodes are MOVED or RENAMED with no MODIFIED classification.

CI semantic validation — your pipeline needs to detect whether a commit changes the logic of a module, not just its formatting or file layout. A structured change set is easier to reason about programmatically than a line diff.

Structural regression detection — tracking whether a known function signature or class structure is altered across commits, independent of the surrounding file content.

Advanced code review workflows — generating review summaries that group changes by semantic class (refactors vs. modifications vs. additions) rather than by file.

If you need a fast, low-overhead diff for daily terminal use, git diff is the right tool. Symtrace is the right tool when you need to understand what changed structurally, not just what lines changed.

7. Roadmap

The current release handles the core parse-match-output pipeline. Planned work:

  • Persistent tree cache — serializing parsed ASTs to disk so that the first run on a repository pre-warms the cache for subsequent runs. This eliminates repeat parse cost in CI workflows where the same base commit is diffed repeatedly.
  • Multi-commit workflow optimization — a session mode that accepts a commit range and reuses intermediate parse results across the sequence, rather than treating each diff independently.
  • Further parse-time reductions — profiling shows that tree-sitter’s incremental re-parse path has headroom for additional optimization, particularly for files with localized changes in large source files.
  • Expanded language support — current support covers the most common compiled and scripted languages via tree-sitter grammars. Adding grammars for additional languages is straightforward; the bottleneck is validation of the hash identity model per grammar, not parser availability.

Try It / Contribute

Symtrace is available on crates.io:

cargo install symtrace

Source and issue tracker: github.com/JashT14/symtrace

If you have a use case the current architecture does not handle cleanly, or you find a case where the five-phase matcher produces an incorrect classification, open an issue with a minimal reproducer. The goal is correctness before coverage — a diff misclassification is a bug, not a known limitation.

Contributions are welcome, particularly around grammar validation for new languages and benchmark corpus expansion.