Systems Engineering · Retrieval Architecture
AI Retrieval Core
Understanding retrieval latency at the systems level, before recommending it at the product level.
RAG (Retrieval-Augmented Generation) is now a standard pattern in AI product development, but most AI PMs who spec RAG systems have never built the retrieval layer themselves. They don't know what 'fast retrieval' actually costs at the systems level, what the latency-accuracy tradeoff looks like in practice, or when a simpler keyword approach beats vector search. I built this C++ retrieval engine to fix that blind spot: not to become an infrastructure engineer, but to earn credibility in architecture conversations and to know when to push back.
- 1. Vector Similarity Search: Implement approximate nearest-neighbour search from scratch in C++, without relying on libraries like Faiss, to understand the underlying algorithm at the implementation level.
- 2. Index Construction: Build and compare multiple index strategies (flat, IVF-style partitioned) to understand the latency and memory tradeoffs at different corpus sizes.
- 3. Query Routing Logic: Implement a query router that selects the appropriate search strategy based on query characteristics and corpus size, mirroring decisions made in production retrieval systems.
- 4. Latency Benchmarking: Instrument the system to produce per-query latency distributions across index configurations, giving concrete numbers to what 'fast' and 'slow' actually mean.
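To make the latency-benchmarking objective concrete, here is a minimal sketch of per-query timing and percentile extraction. It is an illustration under assumptions, not the project's actual instrumentation: the names `percentile` and `time_queries` are hypothetical, the percentile uses the simple nearest-rank definition, and the search under test is passed in as any callable.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Nearest-rank percentile over latency samples (any unit); p must be in (0, 1].
// Takes the samples by value so the caller's ordering is left untouched.
double percentile(std::vector<double> samples, double p) {
    assert(!samples.empty() && p > 0.0 && p <= 1.0);
    std::sort(samples.begin(), samples.end());
    std::size_t rank = static_cast<std::size_t>(
        std::ceil(p * static_cast<double>(samples.size())));
    rank = std::max<std::size_t>(rank, 1);  // guard against rank 0 for tiny p
    return samples[rank - 1];
}

// Calls search(i) for i in [0, num_queries) and records each call's
// wall-clock duration in milliseconds. `search` is any callable that
// runs query number i against the index configuration under test.
template <typename SearchFn>
std::vector<double> time_queries(SearchFn&& search, std::size_t num_queries) {
    using Clock = std::chrono::steady_clock;  // monotonic, suitable for intervals
    std::vector<double> ms;
    ms.reserve(num_queries);
    for (std::size_t i = 0; i < num_queries; ++i) {
        const auto start = Clock::now();
        search(i);
        const auto stop = Clock::now();
        ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    return ms;
}
```

Reporting `percentile(ms, 0.5)` alongside `percentile(ms, 0.99)` for each index configuration gives a distribution rather than a single average, which is what separates a typical query from a tail-latency problem.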
- Algorithm Study: Read the Faiss and HNSW papers before writing a line of C++, building a precise mental model of what approximate nearest-neighbour search is trading off.
- Implementation: Implemented flat brute-force search first as a correctness baseline, then IVF-style partitioning, benchmarking against the brute-force baseline at each stage.
- Benchmarking Design: Designed benchmark suites across three corpus sizes (10K, 100K, 1M vectors) and four dimensionalities (128d, 256d, 512d, 1536d) to map the latency surface.
- Insight Extraction: Documented the concrete thresholds (corpus sizes, dimensionalities, latency targets) at which each index strategy becomes the right or wrong choice.
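The implementation order described above (exact flat search first, then IVF-style partitioning checked against it) can be sketched as follows. This is a simplified illustration, not the project's code: every name here is hypothetical, the centroids are assumed to be supplied by the caller (a real build would train them, typically with k-means), and distances are squared L2.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Vec = std::vector<float>;
using Scored = std::pair<float, std::size_t>;  // (squared L2 distance, id)

float l2_sq(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t j = 0; j < a.size(); ++j) {
        const float d = a[j] - b[j];
        sum += d * d;
    }
    return sum;
}

// Keep the ids of the k smallest distances, nearest first.
std::vector<std::size_t> top_k(std::vector<Scored> scored, std::size_t k) {
    k = std::min(k, scored.size());
    std::partial_sort(scored.begin(),
                      scored.begin() + static_cast<std::ptrdiff_t>(k),
                      scored.end());
    std::vector<std::size_t> ids;
    for (std::size_t i = 0; i < k; ++i) ids.push_back(scored[i].second);
    return ids;
}

// Baseline: exact search that scores every vector in the corpus.
std::vector<std::size_t> flat_search(const std::vector<Vec>& corpus,
                                     const Vec& query, std::size_t k) {
    std::vector<Scored> scored;
    for (std::size_t id = 0; id < corpus.size(); ++id)
        scored.emplace_back(l2_sq(corpus[id], query), id);
    return top_k(std::move(scored), k);
}

struct IvfIndex {
    std::vector<Vec> centroids;                   // one per partition
    std::vector<std::vector<std::size_t>> lists;  // vector ids in each partition
    std::vector<Vec> vectors;                     // stored vectors, indexed by id
};

// Partition ids ordered by centroid distance to v, nearest first.
std::vector<Scored> rank_centroids(const std::vector<Vec>& centroids, const Vec& v) {
    std::vector<Scored> ranked;
    for (std::size_t c = 0; c < centroids.size(); ++c)
        ranked.emplace_back(l2_sq(centroids[c], v), c);
    std::sort(ranked.begin(), ranked.end());
    return ranked;
}

// Assign every corpus vector to the partition of its nearest centroid.
IvfIndex build_ivf(std::vector<Vec> centroids, std::vector<Vec> corpus) {
    assert(!centroids.empty());
    IvfIndex index;
    index.centroids = std::move(centroids);
    index.lists.resize(index.centroids.size());
    index.vectors = std::move(corpus);
    for (std::size_t id = 0; id < index.vectors.size(); ++id) {
        const std::size_t c =
            rank_centroids(index.centroids, index.vectors[id]).front().second;
        index.lists[c].push_back(id);
    }
    return index;
}

// Approximate search: scan only the nprobe partitions nearest to the query.
std::vector<std::size_t> ivf_search(const IvfIndex& index, const Vec& query,
                                    std::size_t k, std::size_t nprobe) {
    const std::vector<Scored> ranked = rank_centroids(index.centroids, query);
    nprobe = std::min(nprobe, ranked.size());
    std::vector<Scored> candidates;
    for (std::size_t p = 0; p < nprobe; ++p)
        for (std::size_t id : index.lists[ranked[p].second])
            candidates.emplace_back(l2_sq(index.vectors[id], query), id);
    return top_k(std::move(candidates), k);
}

// Fraction of the exact results that the approximate search also returned.
double recall(const std::vector<std::size_t>& approx,
              const std::vector<std::size_t>& exact) {
    if (exact.empty()) return 1.0;
    std::size_t hits = 0;
    for (std::size_t id : exact)
        if (std::count(approx.begin(), approx.end(), id) > 0) ++hits;
    return static_cast<double>(hits) / static_cast<double>(exact.size());
}
```

In this sketch `nprobe` is the knob that trades accuracy for latency: probing fewer partitions scans fewer vectors but can miss a true neighbour that sits just across a partition boundary, and probing every partition reduces to the flat baseline.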
Before building this, I knew vector search was 'fast'. After building it, I know that 'fast' means sub-10ms for indexed corpora under 1M documents at 1536d, but that index construction time scales non-linearly and can take hours at scale. I know that below 50K documents, flat search outperforms IVF, so the indexing overhead isn't worth paying. These are the numbers that let an AI PM say 'that architecture won't hit your latency target at that corpus size' instead of deferring to the engineering team.
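A threshold like the 50K crossover is exactly what a query router encodes. The following is a minimal sketch under assumptions rather than the router actually built: `Strategy`, `RouterConfig`, `QueryInfo`, and `route` are hypothetical names, the `exact_required` flag is an invented example of a query characteristic, and the default threshold is simply the 50K figure quoted above, which would need re-measuring on other hardware or dimensionalities.

```cpp
#include <cassert>
#include <cstddef>

enum class Strategy { Flat, Ivf };

struct RouterConfig {
    // Assumed crossover: below this corpus size, flat search is preferred.
    std::size_t flat_below = 50000;
};

struct QueryInfo {
    std::size_t corpus_size;  // number of vectors the query will run against
    bool exact_required;      // hypothetical flag: caller cannot tolerate misses
};

// Pick a search strategy from corpus size and query characteristics.
Strategy route(const QueryInfo& q, const RouterConfig& cfg) {
    assert(cfg.flat_below > 0);
    if (q.exact_required) return Strategy::Flat;  // only flat search is exact
    return q.corpus_size < cfg.flat_below ? Strategy::Flat : Strategy::Ivf;
}
```

Keeping the threshold in a config struct rather than hard-coding it reflects that the crossover is an empirical result of a benchmark, not a constant.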
Python would have been faster to write. C++ was the right choice because it forces you to think about memory allocation, cache locality, and algorithmic complexity in a way that high-level languages abstract away. The goal wasn't a production system; it was an accurate mental model of what production retrieval systems are actually doing. C++ achieves that. A Python wrapper around a library does not.
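As one illustration of the memory questions C++ puts in front of you, the sketch below computes the raw footprint of a float32 corpus and stores vectors in a single contiguous, row-major buffer. It is an assumption-laden example, not the project's storage layer: `raw_bytes` and `ContiguousStore` are hypothetical names, and 4-byte floats are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes needed for `count` float32 vectors of `dim` dimensions, excluding
// any index structures. Example: 1,000,000 x 1536 x 4 = 6,144,000,000 bytes
// (about 6.1 GB), versus 10,000 x 128 x 4 = 5,120,000 bytes (about 5 MB).
constexpr std::uint64_t raw_bytes(std::uint64_t count, std::uint64_t dim) {
    return count * dim * sizeof(float);
}

// Row-major contiguous storage: one allocation, vectors laid out back to
// back, so a linear scan walks memory sequentially. A vector of vectors
// would instead place each row in its own heap allocation.
struct ContiguousStore {
    std::size_t dim;
    std::vector<float> data;

    explicit ContiguousStore(std::size_t d) : dim(d) { assert(d > 0); }

    std::size_t size() const { return data.size() / dim; }

    // Allocate once up front instead of growing during ingestion.
    void reserve(std::size_t count) { data.reserve(count * dim); }

    void add(const std::vector<float>& v) {
        assert(v.size() == dim);
        data.insert(data.end(), v.begin(), v.end());
    }

    // Pointer to the first component of vector i.
    const float* row(std::size_t i) const {
        assert(i < size());
        return data.data() + i * dim;
    }
};
```

The arithmetic alone shows why the largest benchmark configuration behaves differently from the smallest: one fits comfortably in CPU cache-adjacent memory budgets, the other is several gigabytes before any index overhead is added.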
An AI PM who understands retrieval latency makes better calls when the engineering team says 'this approach won't scale.' I built this not to be an infrastructure engineer but to earn credibility in those conversations, and to know when to push back.
First-principles understanding of the performance tradeoffs in retrieval systems that informs every RAG product decision I make.