VectorVault: Offline Semantic Search Engine in C

Dec 30, 2025 · 8 min read

semantic-search · vector-embeddings · information-retrieval · systems-programming · webassembly · offline-ai

Understanding Semantic Search at the Systems Level

Semantic search is not about embeddings or vector databases. It is about numerical representation, geometric relationships, and deterministic computation.

Abstractions hide these fundamentals. VectorVault exposes them.

Overview

VectorVault is a lightweight, offline semantic search engine built from scratch in C11 and compiled to WebAssembly. Unlike traditional keyword-based search engines, VectorVault understands the meaning of text by converting words into high-dimensional vectors and measuring their semantic similarity.

The project uses GloVe (Global Vectors for Word Representation) pre-trained embeddings to map words into a 50-dimensional vector space where semantically similar words are positioned closer together. Documents are split into chunks, embedded by averaging their word vectors, and stored in a vector database. Search queries are embedded the same way and matched against stored chunks using cosine similarity.

Technical Stack

Language: Pure C11 (no C++ dependencies in core)
Vector Model: GloVe 50-dimensional embeddings
UI Framework: Dear ImGui (via cimgui C bindings)
Target Platform: WebAssembly (Emscripten) + Native builds
Graphics: SDL2 + OpenGL ES3

This architecture enables the system to run entirely in a browser without any server-side processing or external API calls.

The Problem with Modern Search

Traditional keyword search operates on exact string matching. A query for “machine learning” will not find documents about “neural networks” or “artificial intelligence” unless those exact terms appear.

Modern search systems solve this through semantic understanding:

Convert text to numerical vectors
Position similar concepts close together in vector space
Measure similarity through geometric distance

However, most implementations hide this process behind:

Hosted embedding APIs that require network connectivity
Black-box vector databases with opaque internals
Managed services that abstract away computational details
Cloud-only execution models with latency and privacy concerns

VectorVault strips away these layers to expose the underlying mechanics.

How Vector Embeddings Work

The Core Insight

Words are not independent symbols. They exist in a web of relationships:

“King” relates to “Queen” the same way “Man” relates to “Woman”
“Paris” relates to “France” the same way “Tokyo” relates to “Japan”

GloVe embeddings capture these relationships by training on word co-occurrence patterns across massive text corpora. The result is a mapping from words to vectors where:

Similar words have similar vectors
Analogical relationships are preserved arithmetically
Semantic distance becomes geometric distance
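
The "preserved arithmetically" claim can be sketched in a few lines. This is a minimal illustration, not code from the project: `analogy_target` is a hypothetical helper, and any vectors passed to it in a quick experiment would be toy values rather than real GloVe data.

```c
#include <stddef.h>

/* Hypothetical helper: out = b - a + c, the classic analogy offset.
 * With a = "man", b = "king", c = "woman", the result should land
 * near the vector for "queen" in a well-trained embedding space. */
void analogy_target(const float *a, const float *b, const float *c,
                    float *out, size_t dim)
{
    for (size_t i = 0; i < dim; i++) {
        out[i] = b[i] - a[i] + c[i];
    }
}
```

With real embeddings the result is only close to the expected word, so the usual follow-up is a nearest-neighbor search over the vocabulary using cosine similarity.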

Vector Space Representation

Each word maps to a 50-dimensional vector. In a simplified 2D projection:

      ^ Dimension 2
      |
  AI  *     * ML
      |
------*----------> Dimension 1
  cat |     * dog
      |
    human
      *

In the full 50D space:

Similar concepts cluster together
Distance equals semantic dissimilarity
Direction encodes meaning relationships

Why 50 Dimensions?

Lower dimensions lose semantic nuance. Higher dimensions increase memory and computation costs without proportional accuracy gains. GloVe 50D provides a practical balance:

Captures most semantic relationships
Fits comfortably in browser memory
Enables real-time search performance

Document Indexing Process

VectorVault processes documents through a multi-stage pipeline:

Stage 1: Text Chunking

Input Document (1500 chars)
       |
Split into ~200 char chunks
       |
Chunks: ["Machine learning is...", "Neural networks...", ...]

Chunk size is configurable via CHUNK_SIZE_CHARS in the source. Smaller chunks provide finer granularity but increase storage overhead.
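
A minimal sketch of the chunking step, assuming chunks are cut at the last space inside the window so that words stay whole. The post only says "~200 char chunks", so the word-boundary rule and the helper name `next_chunk` are assumptions, not the project's actual code.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE_CHARS 200  /* value from the configuration section */

/* Hypothetical helper: copies the next chunk of `text`, starting at
 * *offset, into `out` (capacity out_cap, terminator included).
 * Prefers to break at a space so words are not split.
 * Returns the chunk length, or 0 when the text is exhausted. */
size_t next_chunk(const char *text, size_t *offset, char *out, size_t out_cap)
{
    size_t total = strlen(text);
    if (*offset >= total || out_cap == 0) return 0;

    size_t remaining = total - *offset;
    size_t len = remaining < CHUNK_SIZE_CHARS ? remaining : CHUNK_SIZE_CHARS;
    if (len > out_cap - 1) len = out_cap - 1;

    /* When cutting mid-text, back up to the nearest space in the window. */
    if (len < remaining) {
        size_t cut = len;
        while (cut > 0 && text[*offset + cut] != ' ') cut--;
        if (cut > 0) len = cut;  /* no space found: fall back to a hard cut */
    }

    memcpy(out, text + *offset, len);
    out[len] = '\0';
    *offset += len;

    /* Skip separating spaces so the next chunk starts on a word. */
    while (text[*offset] == ' ') (*offset)++;
    return len;
}
```

Calling `next_chunk` in a loop until it returns 0 yields the chunk list; the caller keeps `offset` between calls.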

Stage 2: Tokenization

Chunk: "Machine learning is powerful"
       |
Lowercase: "machine learning is powerful"
       |
Split: ["machine", "learning", "is", "powerful"]
       |
Remove stopwords: ["machine", "learning", "powerful"]

Stopword removal eliminates common words that carry little semantic meaning (“is”, “the”, “a”) to improve signal-to-noise ratio.
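
The tokenization steps above can be sketched as a single pass over the chunk. The stopword list here is a small illustrative sample (the project's real list is not shown in this post), and `tokenize` and `MAX_TOKEN_LEN` are hypothetical names.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define MAX_TOKEN_LEN 16  /* assumption: mirrors the 16-byte word slot */

/* Illustrative stopword sample; not the project's actual list. */
static const char *const STOPWORDS[] = { "is", "the", "a", "an", "of", "and" };

static int is_stopword(const char *word)
{
    for (size_t i = 0; i < sizeof STOPWORDS / sizeof STOPWORDS[0]; i++) {
        if (strcmp(word, STOPWORDS[i]) == 0) return 1;
    }
    return 0;
}

/* Hypothetical helper: lowercases `text`, splits on non-alphanumeric
 * characters, drops stopwords, and stores up to max_tokens tokens.
 * Overlong tokens are truncated. Returns the number of tokens stored. */
size_t tokenize(const char *text, char tokens[][MAX_TOKEN_LEN], size_t max_tokens)
{
    size_t count = 0;
    size_t len = 0;
    char current[MAX_TOKEN_LEN];

    for (const char *p = text; ; p++) {
        unsigned char ch = (unsigned char)*p;
        if (isalnum(ch)) {
            if (len < MAX_TOKEN_LEN - 1) current[len++] = (char)tolower(ch);
        } else {
            if (len > 0) {  /* a token just ended: keep or discard it */
                current[len] = '\0';
                if (!is_stopword(current) && count < max_tokens) {
                    memcpy(tokens[count++], current, len + 1);
                }
                len = 0;
            }
            if (ch == '\0') break;
        }
    }
    return count;
}
```

Splitting on every non-alphanumeric character is a simplification: a word like "don't" becomes two tokens, which a production tokenizer would likely handle differently.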

Stage 3: Word Vector Lookup

"machine"  -> [0.23, -0.45, 0.12, ...]  (50D vector)
"learning" -> [0.34, -0.21, 0.56, ...]
"powerful" -> [0.11, -0.67, 0.33, ...]

Each word is looked up in the GloVe vocabulary. Words not found are skipped.
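
A sketch of the lookup, assuming the vocabulary is held in memory as an array of fixed-size entries shaped like the binary format's word records. `VocabEntry` and `lookup_vector` are hypothetical names, and a linear search is used only to keep the example short.

```c
#include <stddef.h>
#include <string.h>

#define EMBEDDING_DIMENSION 50
#define WORD_SLOT_BYTES     16  /* null-padded word field */

typedef struct {
    char  word[WORD_SLOT_BYTES];
    float vector[EMBEDDING_DIMENSION];
} VocabEntry;

/* Hypothetical helper: returns the word's vector, or NULL when the
 * word is out of vocabulary so the caller can skip it. */
const float *lookup_vector(const VocabEntry *vocab, size_t vocab_size,
                           const char *word)
{
    for (size_t i = 0; i < vocab_size; i++) {
        /* strncmp bounds the comparison to the fixed-size slot */
        if (strncmp(vocab[i].word, word, WORD_SLOT_BYTES) == 0) {
            return vocab[i].vector;
        }
    }
    return NULL;
}
```

At 10,000 words a linear search costs up to 10,000 string comparisons per token; a hash table or a sorted array with binary search would be the natural upgrade.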

Stage 4: Mean Pooling

Average all word vectors:
chunk_vector = (v_machine + v_learning + v_powerful) / 3
       |
Normalize to unit length (L2 norm = 1)

Mean pooling aggregates word vectors into a single chunk representation. Normalization ensures consistent magnitude for similarity computation.

Stage 5: Storage

Store: {
  id: 42,
  text: "Machine learning is powerful",
  vector: [0.227, -0.443, 0.337, ...],
  doc_id: 5
}

Each chunk is stored with its text content, embedding vector, and source document reference.
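
In C, a record like this maps naturally onto a fixed-size struct. The sketch below is an assumption about the layout rather than the project's actual definition: the name `ChunkRecord` is hypothetical, and the field order follows the index cache format described later in this post.

```c
#include <stdint.h>

#define EMBEDDING_DIMENSION   50
#define MAX_CHUNK_TEXT_LENGTH 256

/* Hypothetical in-memory chunk record. */
typedef struct {
    uint32_t id;                            /* chunk ID            */
    uint32_t doc_id;                        /* source document ID  */
    char     text[MAX_CHUNK_TEXT_LENGTH];   /* null-padded text    */
    float    vector[EMBEDDING_DIMENSION];   /* unit-length embedding */
} ChunkRecord;

/* 4 + 4 + 256 + 200 = 464 bytes, assuming 4-byte floats and no padding. */
_Static_assert(sizeof(ChunkRecord) == 464, "unexpected padding in ChunkRecord");
```

The compile-time assertion catches accidental layout changes, which matters if the struct is ever written to disk directly.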

Cosine Similarity: The Core Algorithm

VectorVault uses cosine similarity as the semantic distance metric:

                      A · B
similarity(A, B) = ─────────────
                   ||A|| × ||B||

Where:

A · B = dot product (sum of element-wise multiplication)
||A||, ||B|| = vector magnitudes (L2 norm)

The result ranges from -1 to 1, though VectorVault filters negative values since they indicate semantic opposition rather than similarity.

Interpretation Scale

Score	Interpretation	Example
0.95+	Near identical	“car” vs “automobile”
0.85-0.95	Very similar	“king” vs “monarch”
0.70-0.85	Related	“king” vs “royalty”
0.50-0.70	Loosely related	“king” vs “castle”
< 0.50	Unrelated	“king” vs “banana”

Search Execution

Query “machine learning” -> Embed -> Query Vector Q
For each stored chunk vector C_i:
- Calculate similarity(Q, C_i)
Sort by similarity (descending)
Return top N results

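Time complexity: O(n) similarity computations, where n = number of chunks; each computation costs O(d) for d = 50 dimensions. This is a linear scan without indexing. Sorting every score adds O(n log n), which is small at this scale.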
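
The search loop can be sketched as follows. It assumes chunk vectors are stored contiguously and are already unit length (as produced by Stage 4), so the dot product equals cosine similarity. Keeping a small sorted top-N list instead of sorting every score is a choice made for this sketch; `SearchHit` and `search_top_n` are hypothetical names.

```c
#include <stddef.h>

#define EMBEDDING_DIMENSION 50

typedef struct {
    size_t index;  /* position of the chunk in the store */
    float  score;  /* similarity to the query */
} SearchHit;

/* Hypothetical helper: scans `count` unit-length chunk vectors and keeps
 * the best `top_n` hits in descending score order.
 * Returns the number of hits written to `hits`. */
size_t search_top_n(const float *query, const float *chunks, size_t count,
                    SearchHit *hits, size_t top_n)
{
    size_t found = 0;

    for (size_t i = 0; i < count; i++) {
        const float *c = chunks + i * EMBEDDING_DIMENSION;
        float score = 0.0f;
        for (size_t d = 0; d < EMBEDDING_DIMENSION; d++) score += query[d] * c[d];

        if (score < 0.0f) continue;  /* negative scores are filtered out */
        if (found == top_n && (top_n == 0 || score <= hits[found - 1].score)) {
            continue;                /* not better than the current worst hit */
        }

        /* Insert into the sorted list, shifting lower scores down. */
        size_t pos = found < top_n ? found++ : top_n - 1;
        while (pos > 0 && hits[pos - 1].score < score) {
            hits[pos] = hits[pos - 1];
            pos--;
        }
        hits[pos].index = i;
        hits[pos].score = score;
    }
    return found;
}
```

For a small N this avoids a full sort: each chunk costs one dot product plus at most N shifts.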

Binary File Format Design

GloVe Model Binary

The original GloVe text format is space-inefficient and slow to parse. VectorVault converts it to a custom binary format:

+------------------------------------------+
| Header (16 bytes)                        |
+------------------------------------------+
| Magic Number: 0x474C5645 ("GLVE")  4B    |
| Version: 1                         4B    |
| Word Count: 10000                  4B    |
| Embedding Dimension: 50            4B    |
+------------------------------------------+
| Word Entries (repeated)                  |
+------------------------------------------+
| Word String (null-padded)         16B    |
| Vector [50 floats]               200B    |
| ... (repeat for each word)               |
+------------------------------------------+

Total size: ~2.16 MB for 10,000 words

Benefits:

10-50x faster loading than text parsing
Magic number prevents loading corrupted files
Platform-independent byte ordering
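
A sketch of header validation for this format. It decodes each field byte by byte so the result does not depend on the host's endianness. The post does not state which byte order the file uses, so little-endian is an assumption here, as are the names `GloveHeader` and `parse_glove_header`.

```c
#include <stddef.h>
#include <stdint.h>

#define GLOVE_MAGIC       0x474C5645u  /* "GLVE" */
#define GLOVE_HEADER_SIZE 16

typedef struct {
    uint32_t magic;
    uint32_t version;
    uint32_t word_count;
    uint32_t dimension;
} GloveHeader;

/* Decode a 32-bit little-endian value (assumed on-disk byte order). */
static uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: parses the 16-byte header from `buf`.
 * Returns 0 on success, -1 on a short buffer, a wrong magic number,
 * or an embedding dimension that differs from `expected_dim`. */
int parse_glove_header(const unsigned char *buf, size_t len,
                       uint32_t expected_dim, GloveHeader *out)
{
    if (len < GLOVE_HEADER_SIZE) return -1;

    out->magic      = read_u32_le(buf);
    out->version    = read_u32_le(buf + 4);
    out->word_count = read_u32_le(buf + 8);
    out->dimension  = read_u32_le(buf + 12);

    if (out->magic != GLOVE_MAGIC) return -1;
    if (out->dimension != expected_dim) return -1;
    return 0;
}
```

Checking the dimension against the compiled-in value guards against loading a model that does not match `EMBEDDING_DIMENSION`.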

Index Cache Binary

Indexed chunks are persisted for session recovery:

+------------------------------------------+
| Header (32 bytes)                        |
+------------------------------------------+
| Magic: 0x53534543 ("SSEC")         4B    |
| Version: 1                         4B    |
| Chunk Count: 100                   4B    |
| Embedding Dimension: 50            4B    |
| Reserved                          16B    |
+------------------------------------------+
| Chunk Entries (repeated)                 |
+------------------------------------------+
| Chunk ID                           4B    |
| Source Document ID                 4B    |
| Text Content (null-padded)       256B    |
| Embedding Vector [50 floats]     200B    |
+------------------------------------------+
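
Because every entry has a fixed size, locating a chunk in the cache file is plain arithmetic. The helpers below are hypothetical and simply restate the layout above as code.

```c
#include <stdint.h>

#define INDEX_HEADER_SIZE 32u
#define CHUNK_ENTRY_SIZE  (4u + 4u + 256u + 200u)  /* 464 bytes per entry */

/* Byte offset of the chunk entry at position `chunk_index` (0-based). */
uint64_t chunk_entry_offset(uint32_t chunk_index)
{
    return INDEX_HEADER_SIZE + (uint64_t)chunk_index * CHUNK_ENTRY_SIZE;
}

/* Expected total file size for `chunk_count` entries. */
uint64_t index_file_size(uint32_t chunk_count)
{
    return chunk_entry_offset(chunk_count);
}
```

For the 100-chunk example in the diagram this gives 32 + 100 x 464 = 46,432 bytes, a cheap sanity check to run against the actual file size before reading entries.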

WebAssembly Architecture

Why WebAssembly?

Traditional ML stacks require:

Python runtime
NumPy/SciPy installations
Model framework dependencies
Server infrastructure

WebAssembly enables:

Browser-native execution
Zero installation
Cross-platform compatibility
Sandboxed security model

Memory Model

WASM operates within a linear memory space. VectorVault manages this explicitly:

Default memory: 256 MB initial, 2 GB maximum
Vocabulary: ~2.16 MB (10,000 words x 216 bytes)
Chunk storage: ~464 bytes per chunk
Maximum chunks: 1,000 (configurable)

Emscripten Integration

The build process uses Emscripten to:

Compile C11 to WASM bytecode
Generate JavaScript glue code
Preload GloVe binary into virtual filesystem
Bind SDL2/OpenGL for rendering

Performance Characteristics

Indexing Performance

Speed: ~1,000 chunks/second
Memory: ~464 bytes per chunk
Limit: 1,000 chunks default (configurable)

Search Performance

Linear scan: O(n) complexity
Latency: < 50ms for 1,000 chunks
No indexing overhead

Comparison with Indexed Approaches

Approach	Build Time	Search Time	Memory
VectorVault (brute-force)	O(n)	O(n)	O(n)
HNSW	O(n log n)	O(log n)	O(n)
IVF	O(n)	O(sqrt(n))	O(n)

VectorVault trades search speed for simplicity and transparency. For datasets under 10,000 chunks, the difference is negligible.

Development Challenges

Building VectorVault surfaced several non-obvious technical challenges:

Out-of-Vocabulary Words

Problem: Queries containing words not in the GloVe vocabulary (limited to 10,000 words by default) produce degraded results.

Symptoms:

Search returns irrelevant matches
Status shows “0 words found in vocabulary”

Resolution: The system gracefully handles OOV words by skipping them during embedding. Users can expand the vocabulary by loading the full GloVe model (400,000 words) at the cost of increased memory usage.

WebAssembly Memory Constraints

Problem: Browser memory limits caused crashes when indexing large document sets.

Symptoms:

Browser tab crashes
Console shows “Out of memory” errors

Resolution: Implemented chunk limits via MAX_CHUNKS configuration. Default is 1,000 chunks (~464 KB storage). Users can adjust based on their browser’s memory capacity.

File Upload Timeouts

Problem: Large files (> 5 MB) would time out during browser upload due to blocking I/O.

Resolution: Implemented chunked file reading with progress callbacks. Files are processed incrementally rather than loaded entirely into memory.
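
The idea of incremental reading with progress callbacks can be sketched in standard C. This is a native-side illustration only: the browser upload path likely involves JavaScript file APIs that are not shown in this post, and the block size, callback types, and function names here are all assumptions.

```c
#include <stddef.h>
#include <stdio.h>

#define READ_BLOCK_BYTES 4096  /* assumed block size */

typedef void (*block_fn)(const char *data, size_t len, void *user);
typedef void (*progress_fn)(size_t bytes_done, void *user);

/* Hypothetical helper: reads `fp` in fixed-size blocks, handing each
 * block to on_block and reporting progress after every block.
 * on_progress may be NULL. Returns total bytes read, or -1 on error. */
long read_in_blocks(FILE *fp, block_fn on_block, progress_fn on_progress,
                    void *user)
{
    char buf[READ_BLOCK_BYTES];
    long total = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0) {
        on_block(buf, n, user);
        total += (long)n;
        if (on_progress) on_progress((size_t)total, user);
    }
    return ferror(fp) ? -1 : total;
}

/* Example block callback: adds each block's length to a size_t counter. */
void count_block(const char *data, size_t len, void *user)
{
    (void)data;
    *(size_t *)user += len;
}
```

Only one block is resident at a time, so memory use stays constant regardless of file size. In a real pipeline the block callback would feed the chunker instead of counting bytes.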

CORS Restrictions

Problem: Opening the application directly via file:// protocol fails due to browser security restrictions.

Resolution: The documentation emphasizes that an HTTP server is required, and the setup instructions include multiple server options (Python, Node.js).

Floating-Point Precision

Problem: Different platforms produced slightly different similarity scores due to floating-point handling variations.

Resolution: Implemented consistent normalization and rounding to ensure deterministic behavior across browsers and native builds.

Submodule Dependencies

Problem: Dear ImGui C bindings (cimgui) required careful submodule management.

Resolution: Added an explicit submodule initialization step to the build instructions. Documented common errors and fixes for missing dependencies.

Configuration Options

Edit include/common_types.h to customize behavior:

#define EMBEDDING_DIMENSION     50      // Vector size (must match GloVe)
#define MAX_VOCABULARY_SIZE     10000   // Number of words to load
#define CHUNK_SIZE_CHARS        200     // Characters per chunk
#define MAX_CHUNK_TEXT_LENGTH   256     // Max stored text per chunk
#define MAX_SEARCH_RESULTS      50      // Top-k results to return
#define MAX_CHUNKS              1000    // Maximum indexed chunks

Recompilation is required after changes.
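
Since these values depend on each other, compile-time checks can catch an inconsistent edit before it turns into a runtime bug. The assertions below are a suggestion and are not known to exist in the project; they use C11's `_Static_assert`.

```c
#define EMBEDDING_DIMENSION     50
#define MAX_VOCABULARY_SIZE     10000
#define CHUNK_SIZE_CHARS        200
#define MAX_CHUNK_TEXT_LENGTH   256
#define MAX_SEARCH_RESULTS      50
#define MAX_CHUNKS              1000

/* Hypothetical sanity checks, evaluated at compile time. */
_Static_assert(CHUNK_SIZE_CHARS < MAX_CHUNK_TEXT_LENGTH,
               "a chunk plus its terminator must fit in the stored text field");
_Static_assert(MAX_SEARCH_RESULTS <= MAX_CHUNKS,
               "cannot return more results than chunks can exist");
_Static_assert(EMBEDDING_DIMENSION > 0 && MAX_VOCABULARY_SIZE > 0,
               "dimension and vocabulary size must be positive");
```

If `CHUNK_SIZE_CHARS` were raised to 300 without also raising `MAX_CHUNK_TEXT_LENGTH`, the build would fail with the first message instead of silently truncating stored text.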

Project Structure

VectorVault/
├── src/                    # Core C implementation
│   ├── main.c             # Entry point and main loop
│   ├── search_engine.c    # GloVe loading and embedding
│   ├── vector_store.c     # Chunk storage and similarity search
│   ├── tokenizer.c        # Text processing and chunking
│   ├── serializer.c       # Binary file I/O
│   ├── ui_renderer.c      # ImGui interface
│   └── app_state.c        # Global state management
├── include/               # Header files
├── data/                  # Binary model files (gitignored)
├── tools/                 # Utility scripts
│   └── convert_glove_to_bin.py
├── libs/                  # Third-party libraries
│   ├── cimgui/           # Dear ImGui C bindings
│   └── cJSON/            # JSON parsing
└── build/                # Build outputs (gitignored)

Use Cases

Personal Knowledge Base Search

Search through notes, articles, and documentation by meaning rather than exact keywords.

Example: Query “learning algorithms” finds documents about “neural networks” and “training models” even without those exact terms.

Document Similarity Detection

Identify duplicate or similar content across document collections without manual comparison.

Content Recommendation

Given a document chunk, find semantically similar chunks to recommend related content.

Offline Research Tool

Index research papers, books, or articles and search semantically without internet access.

Privacy-Focused Search

All processing happens locally in the browser. No data is transmitted to external servers.

Limitations and Trade-Offs

Design Choice	Advantage	Limitation
Pure C implementation	Full control, transparency	Higher development effort
Brute-force similarity	Simple, correct	O(n) search complexity
Static embeddings	Deterministic, fast loading	No contextual understanding
10K vocabulary	Low memory footprint	OOV words ignored
WASM target	Cross-platform, zero install	Memory constraints
No ML inference	Reduced complexity	Cannot generate new embeddings

These trade-offs are intentional. VectorVault prioritizes understanding over capability.

Future Directions

Potential extensions while preserving architectural clarity:

Approximate nearest neighbor indexing (HNSW, IVF)
Vector quantization for reduced memory (INT8/FP16)
Hybrid keyword + vector search
Streaming document ingestion
RAG-style retrieval integration
PCA visualization of vector space
Benchmarking suite against indexed approaches

Repository

Full source code and documentation: https://github.com/JashT14/VectorVault

License

MIT License - free for educational, personal, and commercial use.

Final Perspective

VectorVault demonstrates that semantic search is not inherently complex. It becomes complex when hidden behind layers of abstraction.

By implementing the complete pipeline in C - from text chunking through vector embedding to similarity computation - this project exposes the real mechanics of vector-based retrieval. Every calculation is explicit. Every data structure is visible. Every trade-off is documented.

Modern AI systems often feel like magic. VectorVault proves they are mathematics.

For engineers who want to understand why systems behave as they do, not just how to use them, this project provides a foundation. In an ecosystem dominated by APIs and managed services, VectorVault operates at the opposite end of the spectrum - where understanding is earned through implementation.