VectorVault: Offline Semantic Search Engine in C
Understanding Semantic Search at the Systems Level
Semantic search is not about embeddings or vector databases. It is about numerical representation, geometric relationships, and deterministic computation.
Abstractions hide these fundamentals. VectorVault exposes them.
Overview
VectorVault is a lightweight, offline semantic search engine built from scratch in C11 and compiled to WebAssembly. Unlike traditional keyword-based search engines, VectorVault understands the meaning of text by converting words into high-dimensional vectors and measuring their semantic similarity.
The project uses GloVe (Global Vectors for Word Representation) pre-trained embeddings to map words into a 50-dimensional vector space where semantically similar words are positioned closer together. Documents are split into chunks, embedded by averaging their word vectors, and stored in a vector database. Search queries are embedded the same way and matched against stored chunks using cosine similarity.
Technical Stack
- Language: Pure C11 (no C++ dependencies in core)
- Vector Model: GloVe 50-dimensional embeddings
- UI Framework: Dear ImGui (via cimgui C bindings)
- Target Platform: WebAssembly (Emscripten) + Native builds
- Graphics: SDL2 + OpenGL ES3
This architecture enables the system to run entirely in a browser without any server-side processing or external API calls.
The Problem with Modern Search
Traditional keyword search operates on exact string matching. A query for “machine learning” will not find documents about “neural networks” or “artificial intelligence” unless those exact terms appear.
Modern search systems solve this through semantic understanding:
- Convert text to numerical vectors
- Position similar concepts close together in vector space
- Measure similarity through geometric distance
However, most implementations hide this process behind:
- Hosted embedding APIs that require network connectivity
- Black-box vector databases with opaque internals
- Managed services that abstract away computational details
- Cloud-only execution models with latency and privacy concerns
VectorVault strips away these layers to expose the underlying mechanics.
How Vector Embeddings Work
The Core Insight
Words are not independent symbols. They exist in a web of relationships:
- “King” relates to “Queen” the same way “Man” relates to “Woman”
- “Paris” relates to “France” the same way “Tokyo” relates to “Japan”
GloVe embeddings capture these relationships by training on word co-occurrence patterns across massive text corpora. The result is a mapping from words to vectors where:
- Similar words have similar vectors
- Analogical relationships are preserved arithmetically
- Semantic distance becomes geometric distance
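The arithmetic behind such analogies can be sketched in a few lines of C. This is an illustration only: the function names `analogy_vector` and `nearest_index` are hypothetical, and the usage note relies on made-up 2D toy vectors rather than real GloVe values. It computes b - a + c ("a is to b as c is to ?") and returns the candidate closest to that point by squared Euclidean distance.

```c
#include <stddef.h>

/* Computes out = b - a + c, i.e. "a is to b as c is to ?". */
void analogy_vector(const float *a, const float *b, const float *c,
                    float *out, size_t dim)
{
    for (size_t d = 0; d < dim; d++)
        out[d] = b[d] - a[d] + c[d];
}

/* Returns the index of the candidate closest to `target`.
 * `candidates` is a flat array holding `count` vectors of length `dim`.
 * Squared Euclidean distance is enough for ranking, so no sqrt is needed. */
size_t nearest_index(const float *target, const float *candidates,
                     size_t count, size_t dim)
{
    size_t best = 0;
    float best_dist = -1.0f;   /* negative means "nothing seen yet" */

    for (size_t i = 0; i < count; i++) {
        float dist = 0.0f;
        for (size_t d = 0; d < dim; d++) {
            float diff = candidates[i * dim + d] - target[d];
            dist += diff * diff;
        }
        if (best_dist < 0.0f || dist < best_dist) {
            best = i;
            best_dist = dist;
        }
    }
    return best;
}
```

With toy vectors man = (0, 1), woman = (0, -1), king = (1, 1), queen = (1, -1), the expression king - man + woman lands exactly on queen. Real embeddings only land near the target, and analogy lookups usually exclude the three input words from the candidates.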
Vector Space Representation
Each word maps to a 50-dimensional vector. In simplified 2D projection:
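[Figure: 2D projection with axes "Dimension 1" and "Dimension 2". "AI" and "ML" sit close together, "cat" and "dog" lie near each other along Dimension 1, and "human" sits apart from both groups.]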
In the full 50D space:
- Similar concepts cluster together
- Distance approximates semantic dissimilarity
- Direction encodes meaning relationships
Why 50 Dimensions?
Lower dimensions lose semantic nuance. Higher dimensions increase memory and computation costs without proportional accuracy gains. GloVe 50D provides a practical balance:
- Captures most semantic relationships
- Fits comfortably in browser memory
- Enables real-time search performance
Document Indexing Process
VectorVault processes documents through a multi-stage pipeline:
Stage 1: Text Chunking
Input Document (1500 chars)
|
Split into ~200 char chunks
|
Chunks: ["Machine learning is...", "Neural networks...", ...]
Chunk size is configurable via CHUNK_SIZE_CHARS in the source. Smaller chunks provide finer granularity but increase storage overhead.
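One possible way to implement this stage is sketched below. It assumes chunks are cut at the last space before the limit so that words stay intact, which the description above does not confirm. The helpers `next_chunk_length` and `count_chunks` are hypothetical names, and `limit` is assumed to be greater than zero.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE_CHARS 200  /* same name as the setting in the source */

/* Returns the length of the next chunk starting at `text`, at most `limit`
 * characters. If the cut would land inside a word, back up to the previous
 * space so the whole word moves to the following chunk. */
size_t next_chunk_length(const char *text, size_t limit)
{
    size_t remaining = strlen(text);
    if (remaining <= limit)
        return remaining;

    size_t cut = limit;
    while (cut > 0 && text[cut] != ' ')
        cut--;

    /* No space found (one very long word): fall back to a hard cut. */
    return cut == 0 ? limit : cut;
}

/* Counts how many chunks a document would produce. */
size_t count_chunks(const char *text, size_t limit)
{
    size_t count = 0;
    while (*text != '\0') {
        text += next_chunk_length(text, limit);
        while (*text == ' ')   /* skip the separator between chunks */
            text++;
        count++;
    }
    return count;
}
```

Cutting on word boundaries keeps every token intact, at the cost of chunks that are slightly shorter than `CHUNK_SIZE_CHARS`.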
Stage 2: Tokenization
Chunk: "Machine learning is powerful"
|
Lowercase: "machine learning is powerful"
|
Split: ["machine", "learning", "is", "powerful"]
|
Remove stopwords: ["machine", "learning", "powerful"]
Stopword removal eliminates common words that carry little semantic meaning (“is”, “the”, “a”) to improve signal-to-noise ratio.
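The tokenization steps can be sketched as a single pass over the chunk. The stopword list, the `tokenize` signature, and the 16-byte token limit (chosen to match the 16-byte word slot in the model file) are assumptions for illustration, not the project's actual code.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKEN_LENGTH 16  /* assumption: 15 characters plus the NUL */

/* Illustrative stopword list; the real list is likely longer. */
static const char *STOPWORDS[] = { "is", "the", "a", "an", "of", "and" };

static int is_stopword(const char *word)
{
    size_t n = sizeof STOPWORDS / sizeof STOPWORDS[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(word, STOPWORDS[i]) == 0)
            return 1;
    return 0;
}

/* Lowercases `text`, splits it on non-alphanumeric characters, drops
 * stopwords, and copies up to `max_tokens` tokens into `out`.
 * Overlong words are truncated. Returns the number of tokens written. */
size_t tokenize(const char *text, char out[][MAX_TOKEN_LENGTH],
                size_t max_tokens)
{
    size_t count = 0;
    size_t len = 0;
    char word[MAX_TOKEN_LENGTH];

    for (const char *p = text; ; p++) {
        unsigned char c = (unsigned char)*p;
        if (isalnum(c)) {
            if (len < MAX_TOKEN_LENGTH - 1)
                word[len++] = (char)tolower(c);
        } else {
            if (len > 0) {               /* a word just ended */
                word[len] = '\0';
                if (!is_stopword(word) && count < max_tokens)
                    strcpy(out[count++], word);
                len = 0;
            }
            if (c == '\0')
                break;
        }
    }
    return count;
}
```

Calling `tokenize` on "Machine learning is powerful" yields the three tokens "machine", "learning", and "powerful".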
Stage 3: Word Vector Lookup
"machine" -> [0.23, -0.45, 0.12, ...] (50D vector)
"learning" -> [0.34, -0.21, 0.56, ...]
"powerful" -> [0.11, -0.67, 0.33, ...]
Each word is looked up in the GloVe vocabulary. Words not found are skipped.
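A lookup of this kind might look like the sketch below. The `VocabEntry` layout mirrors the 16-byte word plus 50-float record of the model file, but the struct name, the `lookup_word` function, and the linear scan are assumptions; the project may well use a hash table or a sorted array instead.

```c
#include <stddef.h>
#include <string.h>

#define EMBEDDING_DIMENSION 50
#define WORD_SLOT_BYTES     16

/* One vocabulary record: null-padded word followed by its vector. */
typedef struct {
    char  word[WORD_SLOT_BYTES];
    float vector[EMBEDDING_DIMENSION];
} VocabEntry;

/* Returns the vector for `word`, or NULL when the word is out of
 * vocabulary so the caller can skip it. */
const float *lookup_word(const VocabEntry *vocab, size_t vocab_size,
                         const char *word)
{
    for (size_t i = 0; i < vocab_size; i++)
        if (strncmp(vocab[i].word, word, WORD_SLOT_BYTES) == 0)
            return vocab[i].vector;
    return NULL;
}
```

Returning `NULL` for unknown words keeps the skip decision in the caller, which can also count misses to report vocabulary coverage.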
Stage 4: Mean Pooling
Average all word vectors:
chunk_vector = (v_machine + v_learning + v_powerful) / 3
|
Normalize to unit length (L2 norm = 1)
Mean pooling aggregates word vectors into a single chunk representation. Normalization ensures consistent magnitude for similarity computation.
Stage 5: Storage
Store: {
id: 42,
text: "Machine learning is powerful",
vector: [0.227, -0.443, 0.337, ...],
doc_id: 5
}
Each chunk is stored with its text content, embedding vector, and source document reference.
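In C, such a record could be a fixed-size struct. The sketch below is an assumption about the layout: the `Chunk` type and `chunk_init` helper are hypothetical names, and the field order follows the on-disk chunk entry (ID, document ID, text, vector) rather than the listing above.

```c
#include <stdint.h>
#include <string.h>

#define EMBEDDING_DIMENSION   50
#define MAX_CHUNK_TEXT_LENGTH 256

typedef struct {
    uint32_t id;                           /* chunk ID                 */
    uint32_t doc_id;                       /* source document ID       */
    char     text[MAX_CHUNK_TEXT_LENGTH];  /* null-padded chunk text   */
    float    vector[EMBEDDING_DIMENSION];  /* unit-length embedding    */
} Chunk;

/* 4 + 4 + 256 + 200 bytes; holds on ABIs with 4-byte floats. */
_Static_assert(sizeof(Chunk) == 464, "Chunk should be a 464-byte record");

/* Fills a chunk, truncating text that does not fit the fixed buffer. */
void chunk_init(Chunk *chunk, uint32_t id, uint32_t doc_id,
                const char *text, const float *vector)
{
    memset(chunk, 0, sizeof *chunk);       /* also null-pads the text */
    chunk->id = id;
    chunk->doc_id = doc_id;
    strncpy(chunk->text, text, MAX_CHUNK_TEXT_LENGTH - 1);
    memcpy(chunk->vector, vector, sizeof chunk->vector);
}
```

Fixed-size records make the store a flat array: chunk `i` lives at offset `i * sizeof(Chunk)`, with no per-chunk allocation.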
Cosine Similarity: The Core Algorithm
VectorVault uses cosine similarity as the semantic distance metric:
A · B
similarity(A, B) = ─────────────
||A|| × ||B||
Where:
- A · B = dot product (sum of element-wise multiplication)
- ||A||, ||B|| = vector magnitudes (L2 norm)
The result ranges from -1 to 1, though VectorVault filters negative values since they indicate semantic opposition rather than similarity.
Interpretation Scale
| Score | Interpretation | Example |
|---|---|---|
| 0.95+ | Near identical | “car” vs “automobile” |
| 0.85-0.95 | Very similar | “king” vs “monarch” |
| 0.70-0.85 | Related | “king” vs “royalty” |
| 0.50-0.70 | Loosely related | “king” vs “castle” |
| < 0.50 | Unrelated | “king” vs “banana” |
Search Execution
- Query “machine learning” -> Embed -> Query Vector Q
- For each stored chunk vector C_i:
- Calculate similarity(Q, C_i)
- Sort by similarity (descending)
- Return top N results
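Time complexity: scoring is O(n) where n = number of chunks (each comparison costs a fixed 50 multiply-adds). This is a linear scan without indexing. Sorting all scores adds O(n log n), which stays small at the default limit of 1,000 chunks.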
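The search steps can be sketched as one brute-force function. It assumes that the query and all chunk vectors are already normalized to unit length, so the dot product equals the cosine similarity, and that chunk vectors sit in one flat array. `SearchResult` and `search_top_n` are hypothetical names.

```c
#include <stddef.h>
#include <stdlib.h>

#define EMBEDDING_DIMENSION 50

typedef struct {
    size_t chunk_index;
    float  score;
} SearchResult;

/* qsort comparator: higher scores first. */
static int by_score_descending(const void *lhs, const void *rhs)
{
    float a = ((const SearchResult *)lhs)->score;
    float b = ((const SearchResult *)rhs)->score;
    return (a < b) - (a > b);
}

/* Scores every chunk against `query`, drops negative scores, and sorts
 * the rest best-first. `results` must have room for `chunk_count`
 * entries. Returns how many leading entries form the top-N answer. */
size_t search_top_n(const float *query, const float *chunk_vectors,
                    size_t chunk_count, SearchResult *results, size_t top_n)
{
    size_t kept = 0;

    for (size_t i = 0; i < chunk_count; i++) {
        const float *chunk = chunk_vectors + i * EMBEDDING_DIMENSION;
        float dot = 0.0f;
        for (size_t d = 0; d < EMBEDDING_DIMENSION; d++)
            dot += query[d] * chunk[d];
        if (dot < 0.0f)
            continue;                      /* filter negative scores */
        results[kept].chunk_index = i;
        results[kept].score = dot;
        kept++;
    }

    qsort(results, kept, sizeof results[0], by_score_descending);
    return kept < top_n ? kept : top_n;
}
```

Sorting every score is the simplest option. Keeping only the best N in a small heap during the scan would avoid the full sort, at the cost of more code.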
Binary File Format Design
GloVe Model Binary
The original GloVe text format is space-inefficient and slow to parse. VectorVault converts it to a custom binary format:
+------------------------------------------+
| Header (16 bytes) |
+------------------------------------------+
| Magic Number: 0x474C5645 ("GLVE") 4B |
| Version: 1 4B |
| Word Count: 10000 4B |
| Embedding Dimension: 50 4B |
+------------------------------------------+
| Word Entries (repeated) |
+------------------------------------------+
| Word String (null-padded) 16B |
| Vector [50 floats] 200B |
| ... (repeat for each word) |
+------------------------------------------+
Total size: ~2.16 MB for 10,000 words
Benefits:
- 10-50x faster loading than text parsing
- Magic number prevents loading corrupted files
- Platform-independent byte ordering
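Header validation for this format might look like the sketch below. It assumes the four fields are stored little-endian, which the layout above does not state, and it decodes them byte by byte so the result does not depend on the host's byte order. `GloveHeader` and `parse_glove_header` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define GLOVE_MAGIC       0x474C5645u  /* "GLVE" */
#define GLOVE_VERSION     1u
#define GLOVE_HEADER_SIZE 16

typedef struct {
    uint32_t magic;
    uint32_t version;
    uint32_t word_count;
    uint32_t dimension;
} GloveHeader;

/* Decodes a 32-bit little-endian value (assumed byte order). */
static uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0]
         | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16
         | (uint32_t)p[3] << 24;
}

/* Parses and validates the 16-byte header at the start of `bytes`.
 * Returns 0 on success, -1 if the data is too short or the magic
 * number, version, or dimension is not acceptable. */
int parse_glove_header(const unsigned char *bytes, size_t length,
                       GloveHeader *out)
{
    if (length < GLOVE_HEADER_SIZE)
        return -1;

    out->magic      = read_u32_le(bytes + 0);
    out->version    = read_u32_le(bytes + 4);
    out->word_count = read_u32_le(bytes + 8);
    out->dimension  = read_u32_le(bytes + 12);

    if (out->magic != GLOVE_MAGIC || out->version != GLOVE_VERSION)
        return -1;
    if (out->dimension == 0)
        return -1;
    return 0;
}
```

Checking the magic number and version before touching the word entries means a truncated or unrelated file is rejected early.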
Index Cache Binary
Indexed chunks are persisted for session recovery:
+------------------------------------------+
| Header (32 bytes) |
+------------------------------------------+
| Magic: 0x53534543 ("SSEC") 4B |
| Version: 1 4B |
| Chunk Count: 100 4B |
| Embedding Dimension: 50 4B |
| Reserved 16B |
+------------------------------------------+
| Chunk Entries (repeated) |
+------------------------------------------+
| Chunk ID 4B |
| Source Document ID 4B |
| Text Content (null-padded) 256B |
| Embedding Vector [50 floats] 200B |
+------------------------------------------+
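Writing the 32-byte index header can be sketched the same way. Little-endian field order is again an assumption, and `encode_index_header` is a hypothetical helper that fills a byte buffer ready to be written to disk.

```c
#include <stdint.h>
#include <string.h>

#define INDEX_MAGIC       0x53534543u  /* "SSEC" */
#define INDEX_VERSION     1u
#define INDEX_HEADER_SIZE 32

/* Encodes a 32-bit value as little-endian bytes (assumed byte order). */
static void write_u32_le(unsigned char *p, uint32_t value)
{
    p[0] = (unsigned char)(value & 0xFFu);
    p[1] = (unsigned char)((value >> 8) & 0xFFu);
    p[2] = (unsigned char)((value >> 16) & 0xFFu);
    p[3] = (unsigned char)((value >> 24) & 0xFFu);
}

/* Fills `out` with the index cache header: magic, version, chunk count,
 * embedding dimension, then 16 reserved bytes left as zero. */
void encode_index_header(unsigned char out[INDEX_HEADER_SIZE],
                         uint32_t chunk_count, uint32_t dimension)
{
    memset(out, 0, INDEX_HEADER_SIZE);
    write_u32_le(out + 0,  INDEX_MAGIC);
    write_u32_le(out + 4,  INDEX_VERSION);
    write_u32_le(out + 8,  chunk_count);
    write_u32_le(out + 12, dimension);
}
```

Zeroing the reserved bytes keeps the output reproducible and leaves room for future fields without changing the header size.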
WebAssembly Architecture
Why WebAssembly?
Traditional ML stacks require:
- Python runtime
- NumPy/SciPy installations
- Model framework dependencies
- Server infrastructure
WebAssembly enables:
- Browser-native execution
- Zero installation
- Cross-platform compatibility
- Sandboxed security model
Memory Model
WASM operates within a linear memory space. VectorVault manages this explicitly:
- Default memory: 256 MB initial, 2 GB maximum
- Vocabulary: ~2.16 MB (10,000 words x 216 bytes)
- Chunk storage: ~464 bytes per chunk
- Maximum chunks: 1,000 (configurable)
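The figures above follow from the record sizes, as this short sketch shows. The helper names are hypothetical, and a 4-byte `float` is assumed.

```c
#include <stddef.h>

enum {
    WORD_SLOT_BYTES       = 16,
    EMBEDDING_DIMENSION   = 50,
    MAX_CHUNK_TEXT_LENGTH = 256
};

/* Bytes needed for the vocabulary: one word slot plus one vector each. */
size_t vocabulary_bytes(size_t word_count)
{
    return word_count
         * (WORD_SLOT_BYTES + EMBEDDING_DIMENSION * sizeof(float));
}

/* Bytes needed for chunk storage: two 4-byte IDs, text, and vector. */
size_t chunk_store_bytes(size_t chunk_count)
{
    return chunk_count
         * (4 + 4 + MAX_CHUNK_TEXT_LENGTH
            + EMBEDDING_DIMENSION * sizeof(float));
}
```

For 10,000 words this gives 2,160,000 bytes, and for 1,000 chunks 464,000 bytes. A 400,000-word vocabulary in the same layout would need 86,400,000 bytes, still well under the 256 MB initial memory.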
Emscripten Integration
The build process uses Emscripten to:
- Compile C11 to WASM bytecode
- Generate JavaScript glue code
- Preload GloVe binary into virtual filesystem
- Bind SDL2/OpenGL for rendering
Performance Characteristics
Indexing Performance
- Speed: ~1,000 chunks/second
- Memory: ~464 bytes per chunk
- Limit: 1,000 chunks default (configurable)
Search Performance
- Linear scan: O(n) complexity
- Latency: < 50ms for 1,000 chunks
- No indexing overhead
Comparison with Indexed Approaches
| Approach | Build Time | Search Time | Memory |
|---|---|---|---|
| VectorVault (brute-force) | O(n) | O(n) | O(n) |
| HNSW | O(n log n) | O(log n) | O(n) |
| IVF | O(n) | O(sqrt(n)) | O(n) |
VectorVault trades search speed for simplicity and transparency. For datasets under 10,000 chunks, the difference is negligible.
Development Challenges
Building VectorVault surfaced several non-obvious technical challenges:
Out-of-Vocabulary Words
Problem: Queries containing words not in the GloVe vocabulary (limited to 10,000 words by default) produce degraded results.
Symptoms:
- Search returns irrelevant matches
- Status shows “0 words found in vocabulary”
Resolution: The system gracefully handles OOV words by skipping them during embedding. Users can expand the vocabulary by loading the full GloVe model (400,000 words) at the cost of increased memory usage.
WebAssembly Memory Constraints
Problem: Browser memory limits caused crashes when indexing large document sets.
Symptoms:
- Browser tab crashes
- Console shows “Out of memory” errors
Resolution: Implemented chunk limits via MAX_CHUNKS configuration. Default is 1,000 chunks (~464 KB storage). Users can adjust based on their browser’s memory capacity.
File Upload Timeouts
Problem: Large files (> 5 MB) would time out during browser upload due to blocking I/O.
Resolution: Implemented chunked file reading with progress callbacks. Files are processed incrementally rather than loaded entirely into memory.
CORS Restrictions
Problem: Opening the application directly via file:// protocol fails due to browser security restrictions.
Resolution: Documentation emphasizes HTTP server requirement. Included multiple server options (Python, Node.js) in setup instructions.
Floating-Point Precision
Problem: Different platforms produced slightly different similarity scores due to floating-point handling variations.
Resolution: Implemented consistent normalization and rounding to ensure deterministic behavior across browsers and native builds.
Submodule Dependencies
Problem: Dear ImGui C bindings (cimgui) required careful submodule management.
Resolution: Added explicit submodule initialization step to build instructions. Documented common errors and fixes for missing dependencies.
Configuration Options
Edit include/common_types.h to customize behavior:
#define EMBEDDING_DIMENSION 50 // Vector size (must match GloVe)
#define MAX_VOCABULARY_SIZE 10000 // Number of words to load
#define CHUNK_SIZE_CHARS 200 // Characters per chunk
#define MAX_CHUNK_TEXT_LENGTH 256 // Max stored text per chunk
#define MAX_SEARCH_RESULTS 50 // Top-k results to return
#define MAX_CHUNKS 1000 // Maximum indexed chunks
Recompilation required after changes.
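Because these values depend on one another, compile-time checks can catch an inconsistent edit before the program runs. The assertions below are a suggestion rather than part of the project; they use C11 `_Static_assert` with the default values repeated so the fragment compiles on its own.

```c
#define EMBEDDING_DIMENSION   50
#define MAX_VOCABULARY_SIZE   10000
#define CHUNK_SIZE_CHARS      200
#define MAX_CHUNK_TEXT_LENGTH 256
#define MAX_SEARCH_RESULTS    50
#define MAX_CHUNKS            1000

/* A chunk and its terminating NUL must fit in the stored text buffer. */
_Static_assert(CHUNK_SIZE_CHARS < MAX_CHUNK_TEXT_LENGTH,
               "CHUNK_SIZE_CHARS must be smaller than MAX_CHUNK_TEXT_LENGTH");

/* Returning more results than there are chunks is never possible. */
_Static_assert(MAX_SEARCH_RESULTS <= MAX_CHUNKS,
               "MAX_SEARCH_RESULTS must not exceed MAX_CHUNKS");

/* Empty vectors or an empty vocabulary make the engine meaningless. */
_Static_assert(EMBEDDING_DIMENSION > 0 && MAX_VOCABULARY_SIZE > 0,
               "dimension and vocabulary size must be positive");
```

With these in place, setting `CHUNK_SIZE_CHARS` to 300 without raising `MAX_CHUNK_TEXT_LENGTH` fails at compile time instead of silently truncating stored text.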
Project Structure
VectorVault/
├── src/ # Core C implementation
│ ├── main.c # Entry point and main loop
│ ├── search_engine.c # GloVe loading and embedding
│ ├── vector_store.c # Chunk storage and similarity search
│ ├── tokenizer.c # Text processing and chunking
│ ├── serializer.c # Binary file I/O
│ ├── ui_renderer.c # ImGui interface
│ └── app_state.c # Global state management
├── include/ # Header files
├── data/ # Binary model files (gitignored)
├── tools/ # Utility scripts
│ └── convert_glove_to_bin.py
├── libs/ # Third-party libraries
│ ├── cimgui/ # Dear ImGui C bindings
│ └── cJSON/ # JSON parsing
└── build/ # Build outputs (gitignored)
Use Cases
Personal Knowledge Base Search
Search through notes, articles, and documentation by meaning rather than exact keywords.
Example: Query “learning algorithms” finds documents about “neural networks” and “training models” even without those exact terms.
Document Similarity Detection
Identify duplicate or similar content across document collections without manual comparison.
Content Recommendation
Given a document chunk, find semantically similar chunks to recommend related content.
Offline Research Tool
Index research papers, books, or articles and search semantically without internet access.
Privacy-Focused Search
All processing happens locally in the browser. No data is transmitted to external servers.
Limitations and Trade-Offs
| Design Choice | Advantage | Limitation |
|---|---|---|
| Pure C implementation | Full control, transparency | Higher development effort |
| Brute-force similarity | Simple, correct | O(n) search complexity |
| Static embeddings | Deterministic, fast loading | No contextual understanding |
| 10K vocabulary | Low memory footprint | OOV words ignored |
| WASM target | Cross-platform, zero install | Memory constraints |
| No ML inference | Reduced complexity | Cannot generate new embeddings |
These trade-offs are intentional. VectorVault prioritizes understanding over capability.
Future Directions
Potential extensions while preserving architectural clarity:
- Approximate nearest neighbor indexing (HNSW, IVF)
- Vector quantization for reduced memory (INT8/FP16)
- Hybrid keyword + vector search
- Streaming document ingestion
- RAG-style retrieval integration
- PCA visualization of vector space
- Benchmarking suite against indexed approaches
Repository
Full source code and documentation: https://github.com/JashT14/VectorVault
License
MIT License - free for educational, personal, and commercial use.
Final Perspective
VectorVault demonstrates that semantic search is not inherently complex. It becomes complex when hidden behind layers of abstraction.
By implementing the complete pipeline in C - from text chunking through vector embedding to similarity computation - this project exposes the real mechanics of vector-based retrieval. Every calculation is explicit. Every data structure is visible. Every trade-off is documented.
Modern AI systems often feel like magic. VectorVault proves they are mathematics.
For engineers who want to understand why systems behave as they do, not just how to use them, this project provides a foundation. In an ecosystem dominated by APIs and managed services, VectorVault operates at the opposite end of the spectrum - where understanding is earned through implementation.