
Rust engine parity: port the 11 remaining JS-only language extractors #1071

@carlos-alm


Summary

The Rust native engine is missing extractors for 11 languages that the JS engine supports via WASM. Files in those languages are silently dropped by the Rust orchestrator (no nodes, no edges, no analysis), and the JS-side WASM backfill (#967, #1068) papers over the gap.

PR #1070 makes the orchestrator behave correctly when this asymmetry exists, but the asymmetry itself is the underlying issue — every WASM-only language is a future opportunity for the same class of regression.

Languages with JS extractor + WASM grammar but no Rust extractor

| Language | Extensions | JS grammar |
|---|---|---|
| F# | .fs, .fsx, .fsi | tree-sitter-fsharp |
| Gleam | .gleam | tree-sitter-gleam |
| Clojure | .clj, .cljs, .cljc | tree-sitter-clojure |
| Julia | .jl | tree-sitter-julia |
| R | .r, .R | tree-sitter-r |
| Erlang | .erl, .hrl | tree-sitter-erlang |
| Solidity | .sol | tree-sitter-solidity |
| Objective-C | .m | (verify) |
| CUDA | .cu, .cuh | (verify) |
| Groovy | .groovy, .gvy | (verify) |
| Verilog | .v, .sv | tree-sitter-verilog |

Source-of-truth registries:

  • JS: LANGUAGE_REGISTRY in src/domain/parser.ts — 35 entries
  • Rust: crates/codegraph-core/src/extractors/mod.rs — 21 modules, plus SUPPORTED_EXTENSIONS in crates/codegraph-core/src/file_collector.rs

What "add a Rust extractor" requires per language

  1. Verify a Rust tree-sitter grammar crate exists on crates.io (or vendor one).
  2. Add a LanguageKind variant in crates/codegraph-core/src/parser_registry.rs and wire from_extension.
  3. Add the extension(s) to SUPPORTED_EXTENSIONS in file_collector.rs.
  4. Add pub mod <lang>; to extractors/mod.rs and a struct implementing SymbolExtractor.
  5. Add the dispatch arm in extract_symbols_with_opts.
  6. Add a fixture under tests/benchmarks/resolution/fixtures/<lang>/ with expected-edges.json (or extend the existing one).
  7. Add a per-language extraction test under tests/parsers/<lang>.test.ts (JS-side, exercises the WASM extractor) and a Rust-side unit test.
  8. Verify build-parity tests pass (WASM and native produce identical output for the language).
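
To make steps 2 to 5 concrete, here is a minimal sketch for one language (Gleam). Everything in it is a simplified stand-in, not the actual codegraph-core API: the real LanguageKind, SymbolExtractor, and extract_symbols_with_opts signatures are not shown in this issue, so the trait shape, the Symbol struct, the extract_symbols name, and the line-scanning "extractor" are assumptions for illustration only. A real extractor would walk a tree-sitter tree instead of scanning lines.

```rust
use std::path::Path;

/// Hypothetical stand-in for the enum in parser_registry.rs (step 2).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LanguageKind {
    Python, // an already-supported language, shown for contrast
    Gleam,  // the newly added variant
}

impl LanguageKind {
    /// Step 2: map a file path to a language by its extension.
    pub fn from_extension(path: &str) -> Option<Self> {
        match Path::new(path).extension()?.to_str()? {
            "py" => Some(Self::Python),
            "gleam" => Some(Self::Gleam),
            _ => None,
        }
    }
}

/// Step 3: hypothetical stand-in for SUPPORTED_EXTENSIONS in file_collector.rs.
pub const SUPPORTED_EXTENSIONS: &[&str] = &["py", "gleam"];

/// Simplified symbol record (assumption; the real node type is richer).
#[derive(Debug, PartialEq, Eq)]
pub struct Symbol {
    pub name: String,
    pub line: usize,
}

/// Step 4: assumed trait shape. The real trait presumably takes a parsed tree.
pub trait SymbolExtractor {
    fn extract(&self, source: &str) -> Vec<Symbol>;
}

pub struct GleamExtractor;

impl SymbolExtractor for GleamExtractor {
    fn extract(&self, source: &str) -> Vec<Symbol> {
        // Placeholder: scan for `fn name` / `pub fn name` at the start of a line.
        source
            .lines()
            .enumerate()
            .filter_map(|(i, line)| {
                let rest = line.trim_start();
                let rest = rest.strip_prefix("pub ").unwrap_or(rest);
                let rest = rest.strip_prefix("fn ")?;
                let name: String = rest
                    .chars()
                    .take_while(|c| c.is_alphanumeric() || *c == '_')
                    .collect();
                if name.is_empty() {
                    None
                } else {
                    Some(Symbol { name, line: i + 1 })
                }
            })
            .collect()
    }
}

/// Step 5: the dispatch arm (hypothetical simplification of extract_symbols_with_opts).
pub fn extract_symbols(kind: LanguageKind, source: &str) -> Vec<Symbol> {
    match kind {
        LanguageKind::Gleam => GleamExtractor.extract(source),
        LanguageKind::Python => Vec::new(), // existing extractor elided
    }
}
```

The point of the sketch is the shape of the change: each language touches the enum, the extension list, one extractor struct, and one match arm, which is why splitting the work per language is practical.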

Why this matters

  • Any file in one of these languages costs at least one WASM parse cycle per rebuild through the JS backfill. PR #1070 (fix(native): skip unsupported-extension files in detect_removed_files) makes that path safe, but not free.
  • The asymmetry is invisible to users: they enable the language, builds work, but they're paying a hidden perf cost on every incremental rebuild on the native engine.
  • New WASM-only languages added in the future will hit the same trap unless the Rust port keeps pace.

Acceptance criteria

  • All 11 languages above have Rust extractors that pass per-language extraction tests.
  • The build-parity tests pass for each language (WASM and native produce the same node/edge counts on the language fixture).
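  • LANGUAGE_REGISTRY.length === (number of modules declared in extractors/mod.rs) + 3, because three registry entries reuse another language's module: typescript and tsx share javascript.rs, and ocaml-interface shares ocaml.rs. With the counts above, 35 = (21 + 11) + 3.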
  • A CI gate prevents future drift: a test that fails when a JS LANGUAGE_REGISTRY entry has no corresponding Rust extractor (or an explicit allowlist of intentionally WASM-only languages).
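
The drift gate in the last criterion could be sketched roughly as follows. This is an illustration under assumptions, not the project's test: the function name missing_extractors, the idea of passing registry ids and module names as plain string lists, and the sample ids in the usage note are all hypothetical. How the real test would read LANGUAGE_REGISTRY and extractors/mod.rs is left open.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns every registry id that has neither a Rust extractor module
/// (directly or via a shared module) nor an entry in the WASM-only allowlist.
/// An empty result means the gate passes.
pub fn missing_extractors(
    registry_ids: &[&str],
    rust_modules: &[&str],
    shared: &BTreeMap<&str, &str>, // registry id -> module it reuses
    wasm_only_allowlist: &[&str],
) -> Vec<String> {
    let modules: BTreeSet<&str> = rust_modules.iter().copied().collect();
    let allow: BTreeSet<&str> = wasm_only_allowlist.iter().copied().collect();

    let mut missing = Vec::new();
    for &id in registry_ids {
        if allow.contains(id) {
            continue; // intentionally WASM-only
        }
        // Resolve ids such as tsx to the module that actually implements them.
        let module = shared.get(id).copied().unwrap_or(id);
        if !modules.contains(module) {
            missing.push(id.to_string());
        }
    }
    missing
}
```

For example, with a registry of javascript, typescript, tsx, ocaml, ocaml-interface, gleam, and julia, Rust modules javascript, ocaml, and julia, and the three shared mappings named above, the function reports only gleam; adding gleam to the allowlist makes the result empty. Keeping the allowlist explicit means a new WASM-only language has to be acknowledged in review rather than slipping in silently.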

This is a substantial body of work and probably wants to be split per language — opening this as the umbrella so the effort is tracked.


Labels: enhancement
