Searching 1 billion embeddings from disk in 15 ms on a single machine
In my last post, I built a semantic map of 4 million Hacker News posts and dived deep into semantic search and AI embeddings. Continuing the fun research and hacking in this space, in this post I'll take a deep dive into how vector search works with interactive visualizations (and the interesting theory behind them), the limitations of current approaches, and how I ended up building a vector DB called CoreNN. It's able to search through 1 billion Reddit comment embeddings in 15 ms from a 4.8 TB index on disk—all time is spent embedding the query and fetching results:
The problem of finding the nearest candidates (e.g. documents) to a query is known as nearest neighbor search. The naive way is to compare the query to every candidate, which is O(n); algorithms exist to build indices for searching vectors in O(log n) time, such as HNSW. However, while they're quite good at building fast and accurate indices, they have shortcomings when trying to use them in real-world production use cases:
- These algorithms are in-memory, meaning that for reasonable performance, you have to fit the entire dataset (graph + vectors) in RAM. For 1 billion embeddings (quite small for a search engine), this means around 2 TB of RAM.
- Loading such a large amount into memory is neither easy nor efficient. Storing and transporting this data (e.g. backups, deployment) is also not pleasant.
- These algorithms tend not to support upserts or deletes, relying on a post-filtering mechanism to remove deleted vectors from the results while the deleted entries still bloat the index.
- Being designed memory-first means that even if mutations were possible, persisting to disk is not easy, and typically requires a full serialization to disk regardless of how small the change is.
The common theme is that current algorithms are like an in-memory data structure, whereas what I want is an on-disk data structure. Real-world databases support datasets larger than RAM, update parts of the dataset quickly, and safely persist all mutations, while ensuring queries are accurate, fast, and up-to-date.
CoreNN aims to solve these problems in an ergonomic and economical way:
- Uses cheap flash storage, not expensive DRAM, retaining high recall and throughput while costing 40x–100x less.
- Seamlessly scales from 1 to 1 billion vectors in a single index, without any downtime or manual management.
- Truly upserts and deletes vectors such that they free space and improve query performance, and don't require a full rebuild and rewrite.
- Supports concurrent live queries and updates without blocking each other.
It's a vector database + algorithm designed from the ground up to be disk native. My motivation is that I've been working on a public search engine based on new SOTA embedding models as a fun challenge, so efficiency and simplicity as a solo side project are important. It allowed me to scale down my cluster of sharded HNSW indices totalling 3 TB RAM into a single 96 GB RAM machine, with a much simpler update and query pipeline: no sharding across machines, continuous rebuilding, chunking, storing diffs, dynamically un/loading into memory, etc. It builds on top of the DiskANN and FreshDiskANN papers, with some novel improvements discussed later.
CoreNN does this while still outperforming HNSW in query accuracy and speed, given the same parameters. For example, on the GIST1M dataset, it achieves a higher recall using far fewer graph traversals:

I also set out to make CoreNN as easy and ubiquitous to use as SQLite. It's a single binary with a simple REST API. It's also an embedded database that is available for Node.js, Python, and Rust, with more language bindings to come.
The project is completely open source; head on over to the GitHub repository to learn more. I'll be posting more about CoreNN and related projects, so follow my X for updates.
- Curse of dimensionality
- Graphs
- Six Degrees of Kevin Bacon
- Hierarchical Navigable Small World (HNSW)
- Limitations of HNSW
- Vamana
- Analyzing Vamana
- Comparison between HNSW and Vamana
- CoreNN
- Storage layout
- Backedge deltas
- Product quantization
- float16, binary quantization, Matryoshka
- Async vs sync
- Usages and getting started
Curse of dimensionality
As mentioned before, our goal is to be able to find the closest k vectors in sublinear time, as doing a brute-force scan is not efficient enough for most use cases. There are many use cases for this, not just text embeddings. Essentially, we want some ideal data structure to index vectors.
Data structures like k-d trees and locality-sensitive hashing exist, but they tend to degrade to O(n) time complexity in higher-dimensional spaces due to the curse of dimensionality. This is described better by many others: empirical, formal analysis, geometric, probabilistic. I will attempt to explain in a more informal way — please point out any errors!
The main issue is distances becoming uniformly far and therefore less discriminating. I discovered some interesting intuitions to think about this:
Imagine a 2D shape, like a circle. As you move from the center to the periphery, a few points get closer, but more points get further away. As dimensions increase, the space of this periphery dominates and the "center" is vanishingly small (see this great visualization). Most points will lie "near the edge", and the expansive volume means almost every other point is near some other edge (unlikely to be near the exact same place or the tiny center), so almost everything's uniformly (edge A <-> center <-> edge B) far (edge to edge) away from everything else, including the "nearest" neighbors.
Given a point as a vector, being near the edge only requires one element being near the min/max for that element – imagine a 2D square with (x, y) points (vectors) and see this is true. Given many elements, it's very unlikely for a point to be "in the center", where every single element happens to be near its mean. This is why most points are "near the edge" in high dimensional space. The more contributing dimensions to the difference between two vectors, the higher the mean distance (distances are far) and lower the variance (distances are uniform); exceptions are usually due to skewed data or non-independent features.
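To see this concretely, here's a tiny NumPy experiment (random Gaussian data, purely illustrative, not from the post's datasets): as the dimension grows, the mean distance to a query grows while the relative spread and the nearest/farthest contrast shrink.

```python
# Distance concentration sketch: mean distance grows with dimension while the
# relative spread (std/mean) and nearest-vs-farthest contrast shrink.
import numpy as np

rng = np.random.default_rng(0)
for dim in (2, 16, 128, 1024):
    points = rng.standard_normal((2000, dim))
    query = rng.standard_normal(dim)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest.
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  mean={dists.mean():7.2f}  "
          f"std/mean={dists.std() / dists.mean():.3f}  contrast={contrast:.2f}")
```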
A more relatable real-world example: we could model humans as a vector of trait probabilities. Humans are complex, so there may be thousands of features.
- People are unlikely to be "perfectly normal", where every dimension is average; we're all "near the boundary" in this high-dimensional space.
- Person A can be similar to B because of traits 1, C by 2+3, D by 1+4+5, and so on. It's hard to cleanly delineate the search space and organize people in a rigid hierarchy or groups.
I'd recommend going through the interesting links above if you're interested. They provide more context and background, different perspectives and intuitions, and more thorough explanations for what the curse is, what happens, and why it matters.
Because these data structures rely on distances to prune the search space quickly, but distances become less useful in higher dimensions, they don't work well in practice.
Graphs
This curse led to the discovery of graphs as the ideal data structure. We can model the problem as a graph by making searchable vectors (e.g. documents) nodes, and query by traversing the graph to find the node with the minimum distance to the query, representing our nearest neighbor. Unlike trees and hashes, we have a lot of flexibility, as edges can connect any two nodes (vectors); for example, we can connect two vectors simply because they're close – no need to find some ideal partition or hash function.
The graph should be greedily traversable such that, starting from some consistent entry node, we can stop when we've reached a local minimum and be confident that it is the true neighbor. Otherwise, we can always say "the true neighbor could be elsewhere" and end up doing a global search, defeating the purpose.
The flexibility of graphs is what makes them work over the others: we can "adapt" the graph to our data distribution (whatever it is) and our greedy "query" operation, to ensure sublinear time complexity (i.e. not global search) regardless of our data.
Now we have an optimization problem: what edges should we pick to optimize for these outcomes?
A random graph – where every node has N arbitrary edges to other nodes – is a good baseline to establish the lower bound of effectiveness, and why smarter approaches discussed later are necessary. Let's try that:
The above is an interactive visualization. Vectors here are 2D (represented via positioned dots) for easy human visualization, and lines are the random edges. The blue node is the randomly-picked entry node, and the purple node is your query. Try it a few times: click anywhere to query that position, and the path taken to find it (or fail to) will light up.
After a few times, you'll notice that many queries fail to find the ground truth (the true nearest neighbor i.e. k=1), and that they take zig-zagged inefficient paths. This gets worse with more data points and higher-dimensional spaces – many queries succeed here by "luck", as the space is small and points are few.
This indicates we need to add some structure to this graph, as pure randomness doesn't work. So how do we figure out what this structure should be?
Six Degrees of Kevin Bacon
It turns out sociology can help us. HNSW has its roots in the famous Milgram small-world experiment ("six degrees of separation"), which found that most people can be reached from anyone else through a chain of only about six personal acquaintances, instead of having to know everyone else directly. The SW in HNSW stands for "small world".
Our needs are similar: we have a graph, with potentially billions of nodes, and we'd like to get to some specific node (the closest to some query) from some starting node. So what can we glean from how our society works?
- Our transportation is designed hierarchically, with long (highways), medium (avenues), and short (streets) mechanisms for getting anywhere without connecting every possible two points.
- In the Milgram experiment, people forwarded the letters to distant strangers by first handing them to long-range connections (people who could bridge social groups, geographies, etc.), then to more local connections.
Long connections make short work of getting to the "general" area, and then short connections hone in on the exact "neighborhood" for accuracy. And there's no global knowledge: we simply go to our nearest connection and navigate from there. If we translate to our desired graph properties:
- We'd like an ideal distribution of long edges such that we can get to where we need quickly, preferably in sublinear time.
- The edges should be picked optimally in such a way that we can find the global optimum only by traversing the locally nearest adjacent nodes, so we can stop in sublinear time.
Armed with this knowledge, let's try some smarter strategies for placing edges. Let's start with the simplest idea: use some ratio of short to long edges.
To explore this idea, I've extended the demo to allow tweaking how many edges should connect to nearby nodes (short edges) and how many should be random (likely long edges, given most nodes are definitionally relatively further away). Play around with the parameters and get a sense for what works well.
In general, adding structure by connecting neighbors increases accuracy, up to a point, before local minima prevent reachability too often (and the paths become inefficient with too many hops). There's also a relationship between node count and edge count, which we'll analyze deeper later. So it's on the right track, but we need to more explicitly figure out how to pick optimal edges rather than just guessing the ratio and picking random edges.
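To make the experiment concrete, here's a rough toy version of it in Python (not the demo's actual code): brute-force connect each node to a few nearest neighbors (short edges) plus a few random nodes (likely long edges), then greedily hop toward the query. The parameter names and recall measurement are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((2000, 2))              # 2D points, like the demo

def build_graph(points, n_short=4, n_random=2):
    # Toy brute-force build: nearest-neighbor (short) edges plus random edges.
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    graph = []
    for i in range(len(points)):
        short = np.argsort(dists[i])[1:1 + n_short]            # skip self
        rand = rng.choice(len(points), size=n_random, replace=False)
        graph.append((set(short.tolist()) | set(rand.tolist())) - {i})
    return graph

def greedy_walk(graph, points, query, entry=0):
    # Hop to whichever neighbor is closest to the query; stop at a local minimum.
    current = entry
    while True:
        best = min(graph[current] | {current},
                   key=lambda j: np.linalg.norm(points[j] - query))
        if best == current:
            return current
        current = best

graph = build_graph(points)
queries = rng.random((200, 2))
truth = np.argmin(np.linalg.norm(points[None] - queries[:, None], axis=2), axis=1)
found = [greedy_walk(graph, points, q) for q in queries]
print("recall@1:", np.mean(np.array(found) == truth))
```

Playing with `n_short` and `n_random` reproduces the same trade-off as the sliders above: too few short edges and we wander; too few long edges and we crawl.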
There's a great post by Bartholomy for further reading on the sociology context and structure behind ANN graphs.
Hierarchical Navigable Small World (HNSW)
While CoreNN doesn't use HNSW, it is the current most-popular graph-based ANN index and forerunner to the Vamana algorithm CoreNN uses. There is a surprising amount of similarity between HNSW and Vamana, as we'll soon see.
It's interesting to think about how to add those long edges strategically such that they empirically allow quick navigation across the graph regardless of the destination. HNSW's innovation is the H in its name. We can build a graph by inserting edges to neighbors of each node; once we are in the "neighborhood", this should allow us to find our exact target with a few hops. This also makes edge selection simple: just connect to the closest nodes. However, as previously mentioned, we also need some long edges. The HNSW authors came up with a neat trick to build on top of this to get optimal long edges without much extra effort:
We can build multiple graphs called levels, where each higher level has fewer, uniformly-sampled nodes, and the bottom level is our graph of all nodes. When building, we use the same approach for picking edges within each level, but a node is randomly assigned to a specific level (the higher, the less likely) and inserted into that level and all levels below it.
Why does this work? Think about how uniform sampling works: we pick nodes from across the map, but there are much fewer at a higher level. Now, edges are longer on average, because it's unlikely that a node's true neighbor also happens to be at the same level, so its level-specific neighbors are likely further away and "spread" across the map. We're still just adding edges based on nearest neighbors. It's easier to see in the visualization above: just by reducing the amount of nodes at a higher level, we get longer edges as a result.
Now we can navigate quickly: starting from the top-most level, we find the nearest node to our query. Then we drop down to the level below at the same node, and repeat. Long edges at the higher levels navigate quickly; shorter ones below hone in on the exact local neighborhood.
It's important to note that we're not intentionally picking spread-out nodes. It's still uniform sampling, so the data distribution is still the same. This is important, as we want our graph to be optimized for our dataset. If there are 100 nodes spread out across the latent space, but 1,000 in this one small corner, we want our higher levels to have more nodes in that small corner too. While they don't help queries "jump far" in absolute distances, they still skip over large amounts of nodes, helping navigate that dense region. If we intentionally only pick one node from that entire area for the higher level, we'd have to explore 1,000 once we drop down.
HNSW uses an exponential decay for the probability of being picked in each successive level. As you'd expect, this means that each level has exponentially fewer nodes than the one below it. It's logarithmic because we want to have sublinear querying time, so this way, we exponentially "close the gap" to our target at each level. You can think of it as similar to narrowing the "cone of visibility" each time.
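The level assignment itself is tiny. Here's a sketch following the HNSW paper's formula (level = floor(−ln(u) · mL) with mL = 1/ln(M)), with M = 16 as an arbitrary example:

```python
import math, random
from collections import Counter

M = 16                       # example degree parameter
mL = 1.0 / math.log(M)       # level-generation factor from the HNSW paper

def draw_level():
    u = 1.0 - random.random()            # uniform in (0, 1]
    return int(-math.log(u) * mL)

levels = Counter(draw_level() for _ in range(1_000_000))
for level in sorted(levels):
    print(f"level {level}: {levels[level]} nodes")
# Each level ends up with roughly 1/M as many nodes as the one below it.
```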

Empirically, this works extremely well in terms of query accuracy and speed, which is why HNSW is still popular. For more details on how HNSW works, see this great article by Oracle.
Limitations of HNSW
CoreNN didn't exist yet at the time I was working on my search engine, so HNSW was my pick. Unfortunately, HNSW has some limitations for large-scale indices:
Memory access assumption
The algorithm was designed with memory operations in mind, and trying to use it from disk is too slow.
It takes around 100 ns to read memory, and 20,000 ns to read a random 4K SSD page. This is a 200x increase — differences of this magnitude break the assumptions that algorithms build around; one could perform multiple queries from in-memory HNSW before even a single node read request returns.
Its design also means that the index must be loaded entirely into memory before you can use it, even for just a few queries. Even some vector databases have this requirement.
If you have more vectors than you can fit in memory, you now have to shard your indices so you can use multiple machines for more combined memory. As I scaled my search engine, I ended up running a cluster of expensive high-memory machines with a combined 3 TB of RAM, which wasn't cost-efficient.
True deletions
The algorithm, as described in the original paper, was not designed with a mechanism to truly delete vectors. There are some discussions on possible approaches, but as of right now the official hnswlib library still has no such API.
Deletions are currently handled by marking them as deleted in the graph, and filtering them out in the final results. This poses a problem: as more and more nodes are deleted, more and more of the graph is "garbage" and queries waste a lot of time visiting them only to discard them.
Note that updates are also deletes; you can't simply change the vector of a node, or delete it from the graph, because the graph edges are optimally picked in order to make it navigable with that node as it exists.
Storage format
hnswlib, the official library, saves and loads the index essentially using memory dumps. While fast and simple, this makes it hard to do updates and persist them without a full rewrite. Given how large indices can be, this can become problematic. For example, inserting 10 vectors (<1 MB) into a 250 GB index requires copying the file to a large machine, loading into memory, performing the insert, then dumping 250 GB back to disk.
Vamana
In 2019, Microsoft Research published a paper: high accuracy, low latency querying over 1B embeddings on a 64 GB machine, by leveraging the disk. It defines a new graph, Vamana, with a new strategy, DiskANN, for using the disk with it. Combined, they bring a few innovations over HNSW:
- Simpler algorithms for querying and inserting that only need one graph, building optimal long edges without the need of a hierarchy.
- Tunable parameter to create longer edges, which allows fewer hops to navigate to the "local" area (and therefore fewer disk reads).
- A procedure for truly deleting vectors in a disk-efficient manner.
- An optimized storage layout and querying approach that exploit SSD characteristics — and allows indices to be queried from much larger and cheaper SSDs while retaining high accuracy.
It's a graph algorithm, in the same vein as HNSW, so it works via two mechanisms:
- Search: repeatedly expand the closest unvisited node in the search list (starting from the entry node) by inserting its neighbors into the search list. Stop once there are no more unvisited nodes.
- Insert: query the vector to insert (I) and collect all the nodes we visit (V) along the way. Insert edges V→I and I→V, except for any V where it's greedier to go V→V'→I.
Intuitively, we can see that the search is greedy and only considers the local optima. We can also see that the insert works by making it easier to reach I next time simply by adding all edges we saw on our path. It'll create long edges from the earlier nodes in V, short edges from the final nodes we saw before we found I, and everything in between. Where HNSW has to add another mechanism (hierarchy) to get those longer edges, Vamana doesn't. Nice and simple.
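Here's a rough Python sketch of those two operations (not CoreNN's implementation), assuming a `dist` function, an adjacency dict of sets, and a `robust_prune` helper like the one sketched in the RobustPrune discussion below.

```python
def greedy_search(graph, vectors, query, entry, L, dist):
    # Beam search: returns (candidates closest-first, all nodes expanded).
    search = {entry: dist(vectors[entry], query)}
    visited = set()
    while True:
        unexpanded = [n for n in search if n not in visited]
        if not unexpanded:
            return sorted(search, key=search.get), visited
        node = min(unexpanded, key=search.get)     # closest unexpanded candidate
        visited.add(node)
        for nbr in graph[node]:
            if nbr not in search:
                search[nbr] = dist(vectors[nbr], query)
        # Keep only the L closest candidates seen so far.
        search = dict(sorted(search.items(), key=lambda kv: kv[1])[:L])

def insert(graph, vectors, new_id, entry, L, R, alpha, dist):
    # Search for the new vector, remembering every node expanded along the way.
    _, visited = greedy_search(graph, vectors, vectors[new_id], entry, L, dist)
    # I -> V edges: prune the visited set down to at most R good neighbors.
    graph[new_id] = robust_prune(new_id, visited, vectors, R, alpha, dist)
    # V -> I backedges: neighbors point back, re-pruning if they now exceed R.
    for nbr in list(graph[new_id]):
        graph[nbr].add(new_id)
        if len(graph[nbr]) > R:
            graph[nbr] = robust_prune(nbr, graph[nbr], vectors, R, alpha, dist)
```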
Because V can be quite large, and backedges are also inserted to the nodes on the other end, nodes can exceed the degree bound (the limit of edges a node can have, same as M in HNSW and the "Edges per node" slider in those demos above). The degree bound is important to avoid excessive disk reads (i.e. tail latencies). RobustPrune is Vamana's innovation for ensuring this while still picking edges that result in a very high quality graph.
One thing it does: we don't need V→I if V'→I exists and V is closer to V' than to I. This is because our algorithm is greedy, so at V we'll go to V' (or some other node) and from there find I. One cool thing is that we don't actually check that V'→I exists; the built graph just ends up making this work—no locking and consistency complexity necessary.
But what about long edges? HNSW had the hierarchy, but Vamana doesn't, so just picking the closest neighbors won't work as we saw from that earlier experiment.
RobustPrune therefore also has an alpha parameter, where alpha >= 1.0: instead of pruning all V→I where there is another greedier indirect path, we give some additional slack by setting the threshold to be (V → I) <= (V → V') * alpha, allowing some longer paths to survive. Note that alpha = 1 means we're back to no slack. This might be easier to see:
Blue edges are the candidates for connecting to the target node. All nodes we visit are candidates for one end. Notice how rarely we pick longer edges (the edge from the entry point being the most obvious) unless we increase alpha, which allows us to tune based on what works best for our data distribution.
Combined, these pick and remove edges in a more intelligent way that enforces the hard upper bound limit of edges, while also preserving only the best edges that result in a high quality graph.
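As a concrete reference, here's a small Python sketch of RobustPrune following the DiskANN paper's formulation: greedily keep the closest remaining candidate, then drop candidates that are better reached through a kept neighbor, with alpha relaxing that test. The function signature is my own, shared with the insert sketch above.

```python
def robust_prune(node, candidates, vectors, R, alpha, dist):
    pool = set(candidates) - {node}
    kept = set()
    while pool and len(kept) < R:
        # The closest remaining candidate is always worth keeping.
        closest = min(pool, key=lambda c: dist(vectors[node], vectors[c]))
        kept.add(closest)
        pool.remove(closest)
        # Drop candidates better reached via `closest`: the greedy search would
        # hop node -> closest -> candidate anyway. alpha > 1 relaxes the test,
        # letting some longer edges survive.
        pool = {c for c in pool
                if dist(vectors[node], vectors[c])
                < alpha * dist(vectors[closest], vectors[c])}
    return kept
```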
Deletion works in a similar fashion: we can't simply "delete" a node, as its edges are part of the network for effective navigation, so the simplest approach is to connect all nodes connected to it to all nodes it connects to (a.k.a. contraction). However, this adds a lot of extra edges, likely redundant and exceeding the limit. Because we have RobustPrune, we can handle this by running it on all affected nodes. This allows for true deletions while preserving a high quality graph.
Let's now implement Vamana in our interactive demo to see that it works and how it compares to our strategies so far:
Analyzing Vamana
Let's empirically investigate how well Vamana works and what effects its parameters have. We'll use the popular GIST1M dataset, as it's reasonably large (1M) and high dimensional (960).
To greatly speed up exploring the space of hyperparameters, I precomputed the pairwise distances between all vectors in batches on a GPU using a small Python script. RobustPrune does O(n²) comparisons, so this reduces the time to build and evaluate each index configuration. The dataset was subsampled to 250K, as a 1M×1M f16 matrix requires 1.8 TB of RAM. (Spilling to slow disk to avoid much faster CPU computation would defeat the purpose.)
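For reference, a sketch of that kind of precompute script (assuming PyTorch; the file names and batch size are placeholders, not my exact script):

```python
import numpy as np
import torch

vectors = torch.from_numpy(np.load("gist_subsample.npy")).cuda()   # (N, 960)
n, batch = vectors.shape[0], 4096
out = np.empty((n, n), dtype=np.float16)

with torch.no_grad():
    for start in range(0, n, batch):
        chunk = vectors[start:start + batch]
        d = torch.cdist(chunk, vectors)            # (batch, N) pairwise L2
        out[start:start + batch] = d.half().cpu().numpy()

np.save("pairwise_f16.npy", out)
```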
We know there's a parameter for how many edges each node should have. The higher this is, the better steps we take at each hop, with the trade-off being that we must compare the query against every edge (so the more edges, the more computation). It's easy to see this if we take the extremes:
- If a node has edges to every single other node, we can take the perfect step by going straight to our target node, at the cost of doing essentially a brute force global scan.
- If a node has only one edge, it's unlikely that we'll take a good step in the direction and distance we want—what are the chances that our query just happens to align with that edge?
The curse of dimensionality applies here: since higher dimensions mean more disparate vectors look similar, we need more edges for better disambiguation. It's also the reason why we have to check every edge: very different vectors can be similar, and we can't cleanly divide the space. Let's take a look at various values for this parameter:
Here, we use the HNSW notation and refer to the degree bound as M. (You can also see the effects of different alpha parameter values via the top dropdown, discussed next.) The chart shows how quickly it found the true neighbors for queries on average, and there's a clear trend: the higher the M, the more accurate and quickly-converging. We can also see the sublinear characteristics generally: queries rapidly converge (thanks to those long edges) and we find most neighbors in just a few iterations, with diminishing returns as we try to find every last neighbor. In our high-dimensional space, we can clearly get more gains as we increase M until around 150. In fact, there appears to be an optimal minimum; below M=32, queries are not only inaccurate, but also take a lot longer to complete.
To put it simply: higher M = better accuracy, with diminishing returns past a certain point. Increase by as much query compute time you're willing to pay, and don't drop below a certain point.
The other main parameter is alpha, which increases the preservation of longer edges during the RobustPrune algorithm. If we transpose the chart and instead compare alpha values given a reasonable M:
To make it easier to see, it only highlights 4 values initially: 1.0, 1.1, 1.2, and 4.0. (Click legend items to toggle them.) Compared to M, it's more interesting, as there isn't a clear relationship between increasing alpha and increasing accuracy. It peaks at around 1.1, before dropping quite dramatically. It seems that excessive long edges forced a suboptimal graph fit for our dataset, and after alpha=1.1 they hinder instead of help query hops on average. This shows the importance of evals for quantifying effects of changing hyperparameters for your dataset.
We can take a look at the graph and query properties to see some of the internals as we tweak these hyperparameters. Let's start with the basic one: does alpha change edge lengths? If we look at the distribution of edge lengths, we can see that increasing alpha not only increases long edges, but all edges across the board:
This is expected as the alpha parameter increases acceptable edge lengths during pruning, so more edges are kept overall. Keep this in mind as we go through the subsequent analyses.
Let's look at the distance of the next closest node to expand at each query iteration:
We can see that the graph was very well optimized: we rapidly close in on the local neighborhood, then spend a long time within it to find every last neighbor. Our most optimized variant closed in the quickest.
Another perspective, told in two charts:
In the first chart, we can see that all the top-performing parameters make efficient graphs, where most nodes newly discovered are not discarded due to being seen before. In the second chart, even amongst the top-performing alpha values, alpha=1.1 stands out, by dropping the fewest newly discovered nodes, with a noticeable difference to the other high performers; it's the most efficient amongst the most accurate.
Meanwhile, we see the issues with the "unconverging" low alpha values: they don't find many nodes or exploit the graph enough. On the other hand, the high alpha values are finding lots of candidate nodes but dropping most of them, likely "overconverging" and unable to hone in accurately in the local area.
Comparison between HNSW and Vamana
Because HNSW and Vamana both build graphs with similar goals and properties (e.g. greedily searchable, sparse neighborhood graphs), I suspected it was possible to leverage CoreNN over existing HNSW indices, which I had and wanted to migrate without expensive rebuilds. To verify this, I analyzed the graphs Vamana and HNSW build. To make HNSW graphs comparable to Vamana, I flattened all levels into one graph, merging all edges across all levels for each given node.
When looking at the graph edge distribution when built over the GIST1M dataset with the same parameters, we see a similar shape:

From this, we can see that the Vamana graph has a lot more edges, which we can confirm from the distribution of node edge counts, where Vamana is utilizing the degree bound a lot more:

We also see that Vamana enforces a strict upper bound on the edge count, important to avoid excessive disk reads in the worst case. Despite these rigid limits, the resulting query accuracy validates that Vamana builds higher quality graphs than HNSW: Vamana finds more of the ground truth and converges faster:

RobustPrune does seem to accomplish both goals: optimizing for disk (avoiding tail latencies) while producing higher quality edges.
After flattening, it's possible to use GreedySearch and RobustPrune from Vamana to search and insert into an existing HNSW index. In fact, using GreedySearch often exceeds the accuracy of HNSW on the HNSW index while taking similar time. To do this, I ported relevant parts of hnswlib to hnswlib-rs in order to extract its built graph.
A nice benefit of this is that it's possible to migrate your existing in-memory HNSW indices to CoreNN cheaply. As we've seen, the graph would not be as optimized as one built from scratch using CoreNN, but as you insert vectors, the graph will optimize over time. This allows you to migrate from in-memory indices and use disk-based scalability, without an expensive reindex and rebuild process.
Similarly, because Vamana tends to outperform HNSW algorithmically, CoreNN also has an ephemeral in-memory mode, to replace the need for HNSW in those situations too.
CoreNN
CoreNN builds on top of Vamana by using RocksDB as the backing persistent store for the graph, vectors, and other metadata.
Vectors are inserted with a key. For flexibility, arbitrary strings are accepted, and they're mapped to integer IDs internally. These integers are never reused, avoiding the complexity of correctly handling subtle race conditions across all parts of the system when an ID changes keys whilst querying or updating the graph. It also provides a total ordering of updates, which could come in handy for replication or streaming.
RocksDB is a robust, easy-to-use, fast KV database often used as the backing store for many production systems. LMDB was evaluated, but had large on-disk bloat (around 1.5-2x the size of the raw data, compared to RocksDB's 1x) and a mmap-based API that required an upfront allocation size and bumped into system resource limits.
The index used by CoreNN goes through two stages:
- Initially, all nodes and full vectors are cached in memory and used for queries. This provides full accuracy and fast performance. This is the in-memory stage.
- After running into memory limits, only compressed vectors are stored in memory, which moves the graph and full vectors to disk. Queries navigate the graph by reading nodes from the disk instead of memory. This is the on-disk stage, and what allows CoreNN to achieve truly large scale indices.
These transitions are handled transparently by CoreNN and querying remains seamlessly available and accurate at all times, including during transitions. There's no configuration, downtime, or manual work needed to scale the database from 1 to 1 billion.
Storage layout
DiskANN describes an optimization called implicit re-ranking using full-precision vectors. In the on-disk stage, the neighbors list for a node we expand during a query iteration is on disk instead of in memory. To capitalize on the fact that disk reads are usually a minimum of 4K, we also store the original vector with the neighbors list to read the full-precision vector for "free". Our vectors in memory are quantized, so this will allow us to recalculate the distance to the candidate for a more accurate result.
This is better than fetching all the vectors and re-ranking after the search, as that means doing another round of disk reads when we're already tight on latency budget. Note that we're only reranking the already-expanded node: we don't fetch full vectors for all candidate nodes, as there are M per iteration, exploding I/O costs.
CoreNN stores each node's full vector + neighbors in a single RocksDB entry, to approximate the optimization. Keys have a 1-byte type prefix and use binary, not string, integer representation for maximum compactness. A common prefix allows efficient iteration. How the graph is built and how queries traverse it are unpredictable, so there's no "ideal" order to lay out nodes on disk to exploit spatial locality, and we gain nothing by omitting the full vector for more density.
In both stages, vectors and nodes are stored in the DB using the same entries, for durable persistence. Using the same DB format means transitioning between in-memory and on-disk requires no mass rewrites or deletes; we simply evict from memory and instantly begin using the disk, already optimally laid out.
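As an illustration only (this is not CoreNN's actual byte layout), the idea looks roughly like this: one entry per node, keyed by a 1-byte type tag plus a fixed-width binary ID, with the neighbor list and full-precision vector packed into the value.

```python
import struct
import numpy as np

NODE_PREFIX = b"\x01"     # hypothetical 1-byte type tag for graph node entries

def node_key(node_id: int) -> bytes:
    # Common prefix + fixed-width big-endian u64: compact and iterable in order.
    return NODE_PREFIX + struct.pack(">Q", node_id)

def encode_node(vector: np.ndarray, neighbors: list) -> bytes:
    # Value = neighbor count, neighbor IDs, then the full-precision vector.
    return (struct.pack(">I", len(neighbors))
            + struct.pack(f">{len(neighbors)}Q", *neighbors)
            + vector.astype(np.float32).tobytes())

def decode_node(value: bytes, dim: int):
    (n,) = struct.unpack_from(">I", value, 0)
    neighbors = list(struct.unpack_from(f">{n}Q", value, 4))
    vector = np.frombuffer(value, dtype=np.float32, offset=4 + 8 * n, count=dim)
    return vector, neighbors
```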
Backedge deltas
When inserting a new node, Vamana inserts backedges to its new neighbors. This poses a problem when on disk: each vector will require updating M other nodes, a huge write amplification that uses up a lot of IOPS.
The FreshDiskANN paper remedies this by using temporary in-memory indices. Inserts first go to an in-memory Vamana graph. Once this reaches a certain size, it's snapshotted to disk and inserts go to a new temporary in-memory graph. After there are a number of these snapshots, the whole system begins the StreamingMerge procedure, where vectors from these frozen graphs get inserted into the on-disk index all at once, ideally coalescing writes.
This setup has one major drawback: all our computation optimizing those temporary graphs is discarded, and we have to expend the same amount of computation again to re-insert all those vectors into our main on-disk graph. During this time, the machine will be under heavy CPU utilization, causing a drop in QPS.
CoreNN takes a different approach. We can observe two things:
- Inserting a backedge means rewriting the node's neighbors list of size M just to add 1/M more data, causing write amplification.
- The node's neighbors list has size M as it was optimized with RobustPrune; adding just 1 node exceeds M and triggers RobustPrune immediately.
Therefore, CoreNN stores a list of "additional neighbors" for each node, which is merged with the base neighbors when expanding. When writing, it's persisted at a different DB key, so no wasteful rewriting of existing nodes is done. RocksDB compacts small values efficiently, and the nodes touched during an insert are random on average (and therefore have few additional neighbors), which combine to reduce write amplification significantly.
CoreNN introduces max_degree_bound, which allows a node to reach this threshold of neighbors before optimizing down to degree_bound; this reduces the frequency of the O(n²) computations and rewrites of the node in the DB. This has no effect on accuracy, as we're only adding additional edges to consider, not removing any existing optimized ones; it just increases query computation times slightly.
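A minimal Python sketch of the idea, with plain dicts standing in for the two RocksDB entries and reusing the robust_prune sketch from earlier; names like max_degree_bound mirror the parameters described here, but the code is illustrative, not CoreNN's.

```python
base_neighbors = {}    # node -> pruned neighbor set (size <= degree_bound)
extra_neighbors = {}   # node -> backedges appended since the last prune

def add_backedge(node, new_neighbor, vectors, degree_bound, max_degree_bound,
                 alpha, dist):
    extra = extra_neighbors.setdefault(node, set())
    extra.add(new_neighbor)                        # cheap, small delta write
    if len(base_neighbors[node]) + len(extra) > max_degree_bound:
        # Only now do we pay for the expensive prune and base-record rewrite.
        merged = base_neighbors[node] | extra
        base_neighbors[node] = robust_prune(node, merged, vectors,
                                            degree_bound, alpha, dist)
        extra_neighbors[node] = set()              # deltas folded into the base

def neighbors_for_expansion(node):
    # Queries always see base + deltas, so accuracy is unaffected.
    return base_neighbors[node] | extra_neighbors.get(node, set())
```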
I believe this is a better trade-off, as it's easier and cheaper to add another SSD and double the durability, IOPS, and space than to double the CPU core count or clock speed. Flash storage can be scaled out with almost no upper limit, but one machine can only have so many CPU cores running at so high a clock speed. Additionally, NVMe SSDs are massively parallel with separate read/write pipelines, whereas CPU time spent inserting cannot answer queries.
There is also the benefit of a near-instant transition to on-disk, as the DB inserts and optimization calculations are amortized. Using the FreshDiskANN approach, there is a sudden period of degraded SLA upon triggering the long and intensive process to re-insert all vectors from temporary indices into the on-disk index.
FreshDiskANN's approach requires rebuilding the latest unfrozen temporary index when loading up, as it hasn't been snapshotted yet. This makes recovery and repeated opening slow and expensive. CoreNN doesn't need to recompute as inserts are done directly to the main on-disk index.
Product quantization
Quantization comes into play as we move into the on-disk stage and start reading from disk instead of memory while traversing the graph.
When we expand a node during a query iteration, all its unseen neighbors are new candidates and we calculate the distance to the query for each. Even with a reasonable degree bound of 50, if our vectors were solely on disk, that would be 50 disk reads per iteration; the latency and I/O pressure adds up quickly.
We'd prefer the latency of RAM, but the vectors alone for 1 billion entries would occupy around 1 TB of RAM. Therefore, we keep reading the full vectors and graph nodes from disk, but store quantized vectors in memory. The accuracy remains high, as the graph was built using full-precision original vectors.
Product quantization is the approach CoreNN takes, as suggested by the DiskANN paper. It's a lossy vector compression mechanism that works by:
- Dividing each vector into smaller chunks ("subspaces"). For example, a 512D vector becomes 64×8D vectors.
- Clustering (e.g. k-means) all chunks for each subspace into C clusters, usually 256.
- Representing each chunk by its cluster index. Now 64×8D vectors become 64×1D, as 256 possible values can be represented in a single u8 byte.
What was 64 subspaces × 8 dimensions × 4 bytes per f32 becomes 64 × 1 byte per u8, a compression ratio of 32. Like most other compression algorithms, PQ works well because real-world data have exploitable correlations and structures. Now, we can fit all vectors in memory.
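Here's a small sketch of PQ training and encoding with NumPy and scikit-learn, using the same shapes as the example above (64 subspaces, 256 centroids). It's a textbook version of the technique, not CoreNN's implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def pq_train(vectors, n_sub=64, n_clusters=256):
    # One codebook (set of centroids) per subspace.
    sub_dim = vectors.shape[1] // n_sub
    codebooks = []
    for s in range(n_sub):
        chunk = vectors[:, s * sub_dim:(s + 1) * sub_dim]
        km = KMeans(n_clusters=n_clusters, n_init=4).fit(chunk)
        codebooks.append(km.cluster_centers_)     # (256, sub_dim) per subspace
    return np.stack(codebooks)                    # (n_sub, 256, sub_dim)

def pq_encode(vectors, codebooks):
    n_sub, _, sub_dim = codebooks.shape
    codes = np.empty((len(vectors), n_sub), dtype=np.uint8)
    for s in range(n_sub):
        chunk = vectors[:, s * sub_dim:(s + 1) * sub_dim]
        d = np.linalg.norm(chunk[:, None] - codebooks[s][None], axis=2)
        codes[:, s] = d.argmin(axis=1)            # nearest centroid per chunk
    return codes                                  # 64 bytes per 512D vector

def pq_decode(codes, codebooks):
    # Reconstruct an approximate vector by concatenating the chosen centroids.
    return np.concatenate([codebooks[s][codes[:, s]]
                           for s in range(codes.shape[1])], axis=1)
```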
Here's a quick interactive visualization:
After quantizing, we can represent each point's position only with the index of the line for each axis (in this case, only ⌈log2(20)⌉ = 5 bits), rather than a more precise but expensive float (32 bits). Distance relationships are largely retained, but we save a lot of bits. Of course, as mentioned, in practice:
- You cluster more than one dimension at once.
- "Lines" are drawn along these clusters, not uniformly across the space.
But you get the idea.
float16, binary quantization, Matryoshka
For ANN, using float16 or bfloat16 over float32 often has negligible accuracy impact: the important thing is to preserve distance relationships between points, and both have enough precision for most well-designed embeddings, esp. in a high dimensional space. The upside is that they double the number of vectors that can fit in memory, as well as halving storage costs. CoreNN supports these (as well as float64) natively, which is an improvement over hnswlib as of this writing.
All computations, regardless of dtype, use optimized AVX-512 where possible. (SVE is not yet supported in Rust.)
Surprisingly, text embeddings tend to perform reasonably well when precision quantized, all the way down to binary quantization, where only 1 bit is used to represent each dimension (usually the sign i.e. positive or negative). A theory is that effective training naturally leads to effective "spreading" of concepts across the space, using the full dynamic range effectively, so the sign alone contains enough signal to separate differing concepts, esp. when combined with a lot of dimensions.
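As a quick sketch of what binary quantization boils down to (sign bits plus Hamming distance), assuming NumPy and random data for illustration:

```python
import numpy as np

def binarize(vectors):
    bits = (vectors > 0).astype(np.uint8)      # keep only the sign of each dim
    return np.packbits(bits, axis=1)           # e.g. 1024D float32 -> 128 bytes

def hamming(a, b):
    # Popcount of the XOR: how many sign bits disagree between the two vectors.
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())

codes = binarize(np.random.randn(1000, 1024).astype(np.float32))
print(hamming(codes[0], codes[1]))
```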
CoreNN supports binary quantization as another compression mode out of the box. It also supports truncation compression if you have Matryoshka embeddings.
Async vs sync
My initial approach was to use async Rust, if only because this is a database that does I/O. But it turned out that async was not the best fit:
- RocksDB itself is not async, as it's a C++ library. This means a lot of spawn_blocking at C++ boundaries, but the RocksDB C++ code itself does a lot of computation and not just I/O. A backing store written from the ground up using async Rust would likely fare better here.
- There's a lot of computation, not just I/O, which doesn't mix easily with an async program. A lot of careful handling of compute so they don't block the async runtime, while also not oversaturating the system, is necessary.
- Async lets you do millions of lightweight tasks concurrently, which makes sense with networking and event-driven programs, but less so with disk I/O, which tends to scale reasonably well with threads.
But I think the biggest annoyance was that async doesn't work great with tooling like profiling and debugging, which are really nice things to have that work when you have a "traditional" program. Using perf and flamegraphs makes light work of finding and squashing hotspots quickly, and is way better than 1) guessing or 2) manually adding timing statements everywhere. People suggest alternatives like tracing and tokio-console, but there's already an entire ready-to-go ecosystem with well-built-out infrastructure, polished tools, great reading material, and standardized concepts that work well with all of these.
Usages and getting started
One of the nice things is that you can use CoreNN in many places for many different situations, given it uses a consistent data format, is highly scalable, and can function as an embedded library or a production-scale service:
- Prototypes or research projects can drop in the dependency for the embedded library and get something working without setting up any services. Once you need to scale to production, you can simply switch out the library for the service.
- Analysts can use CoreNN to process and query billions of social media comments, semantic data, etc. and have it updated in real time, leveraging the power of AI and deep learning based semantic analysis.
- Libraries and programs can drop this into their code like SQLite and instantly add powerful semantic search capabilities.
- The full disk mode means you can quickly run queries without loading the index from disk first, for one-off queries over giant datasets, or systems without a need for high QPS (e.g. only one or a few users). Examples: desktop OSes, Jupyter Notebooks, personal apps, static data dumps.
As mentioned, CoreNN is open source: you can get started by checking out the GitHub repo, which contains quickstarts for supported languages and more documentation. Please come and contribute to the open source project! If you're working on something in this space around this vision, and/or using CoreNN for a project you'd like to share, please drop by and let me know; I'd love to hear more.
Thanks for reading!