Building blobd: single-machine object store with sub-millisecond reads and 15 GB/s uploads

For a past content platform, I used S3 to serve user content (videos and documents), which meant lots of small range requests for streaming and seeking. Despite serving from same-region datacenters 2 ms from the user, S3 would take 30-200 ms to respond to each request. Even slight delays when jumping around quickly felt grating, and in UX every millisecond counts.

S3 also felt suboptimal for small objects, like thumbnails. They have the same high TTFB as large objects, and request overhead begins to dominate in terms of throughput, pricing, and rate limits. For example, a dynamic webpage may have hundreds of thumbnails that need to be shown quickly. That might mean 100 billed GetObject calls, saturating S3 rate limits and internal connections 100x faster for a single page, and it would still feel unresponsive since users decide and scroll past most of them within milliseconds. At small sizes, the time spent handling requests dominates the actual transfer time.

To improve on this, I set out to build a new object store from scratch, optimized for many low-latency random reads and for small objects. It would also be an opportunity to experiment with newer ideas.

In terms of design trade-offs, the object store would prioritize reads over writes. For writes, I prioritized creates over updates, and updates over deletes. Read performance should focus on latency: low, constant latency regardless of object size or read offset/length. These fit the typical user content platform, where reads are more frequent and performance-sensitive than writes, and content typically grows over time — updates and deletes are rare compared to creation.

From the physical limits perspective, modern NVMe SSDs can do hundreds of thousands of random read I/Os per second, and local DCs are just 1-5 ms from the user. How close can the object store get to these raw numbers?

In this blog post, I'll go over the design decisions and process behind blobd, an open source object store built from scratch around these principles, with explanations of the underlying concepts. I'll also go over the benchmarks, which show strong performance: on a system with eight 1.8 GB/s write NVMe SSDs, it saturates the underlying disks when uploading, exceeding 15 GB/s:

Then, when doing random 4K reads across these objects, it achieves sub-millisecond time-to-first-byte, regardless of object size:

Store | 12.0 KB | 77.0 KB | 338 KB | 1.4 MB | 3.8 MB | 9.7 MB | 30.5 MB
blobd | 0.330 ms | 0.339 | 0.380 | 0.360 | 0.413 | 0.553 | 0.767
minio | 66.404 ms | 69.126 | 85.882 | 105.755 | 104.941 | 195.205 | 100.248

Turbostore

For the month of March 2022, I experimented with implementing a new object store written in C, called Turbostore.

I only required point lookups and did not need to list objects. Listing typically requires tree-like data structures that have logarithmic time lookups, and some complexity from rebalancing. Without needing this, would it be possible to design an alternate simple object store and achieve constant low latency?

I decided to start with a basic design:

The disk layout would be simple: reserve space at the start for the journal, allocator state, and index, and use the remaining space for the heap. Each object would have metadata ("inode") and data: the data would be the actual value, while its inode would contain its key, size, version, heap address, etc. Since this metadata is itself variable-length, it uses the same heap and allocation system to dynamically allocate space on disk for it to reside at.

Index

The index maps keys to where their allocated object inode resides on disk. All operations involving objects interact with the index, so its design largely determines the performance characteristics of the system. For reads specifically, after looking up an object's location in the index, reading it from disk typically becomes I/O-bound, so focusing on the index is key to increasing reads per second and lowering TTFB, our goals.

Most databases use a tree as their index, as opposed to a hash map. The primary benefit is the ability to iterate by keys in order. However, I typically access objects via some key derived from an ID that exists elsewhere. For example, if a user requests a video, its ID is in the DB and its object key is statically mapped to videos/$ID/1080p.mp4. To list, I SELECT from the DB, not ListObjectsV2 from S3. So the set of objects that exist is known and enumerating is not needed.

A second reason systems often prefer trees is their ability to grow without needing a full rewrite. Expanding a hash map requires rehashing everything, which is expensive in I/O. Hash maps also have O(n) worst-case performance, whereas trees guarantee O(log n).

However, I tried to think how a hash-based index could work well.

A fixed hash map with chaining is very easy to use: each bucket is a pointer, and inserting and removing elements involves simple pointer updates; the performance is guaranteed by the bucket size and hash function. With a tree, each node has a more complex structure, and the logic for rebalancing can get complicated, but it is necessary to ensure logarithmic performance.

The argument that resizing hash indices is expensive makes sense in theory. But what do the numbers look like? The size of a hash index is proportional to how many elements (in this case, objects) you want it to be able to store, and the number of objects you can have is inversely proportional to object size. For example, given a large 64 TB disk:

Content | Typical size | Capacity
Image | 128 KB | 537 million
PDF | 2 MB | 34 million
Audio | 10 MB | 7 million
Video | 200 MB | 336 thousand

Given that most object stores hold large blobs, as opposed to the small records seen in KV stores and relational databases, the absolute numbers may be low enough that you could simply preallocate a very large fixed hash index and avoid the issues around O(n) degradation and expensive resizes. Continuing the example, if we assume a mixed workload (e.g. user content), an upper bound might be 30 million objects. If each bucket pointer occupies 6 bytes, and we overprovision by 4x and use 120 million buckets, that's 687 MB, or 0.001% of the disk space. This makes sense — if objects are 2 MB, then 6×4 bytes per object is insignificant. Even at 500×3 million (1.5 billion) buckets, that's 8.4 GB, or 0.01% of disk space.

I did a quick empirical test by hashing all 4 million paths from find / into 12 million buckets (3x) using xxhash3_64 and found that 4% required more than 1 hop, 0.5% more than 2, and 72% of buckets were empty, close to the optimal 67%. The total size of the buckets was only 69 MB.
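
A rough sketch of that kind of test, not the original script: it reads one key per line from stdin and uses Rust's standard hasher as a stand-in for xxhash3_64.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::io::{self, BufRead};

fn main() {
    const BUCKETS: u64 = 12_000_000; // ~3x the number of keys
    let mut counts = vec![0u32; BUCKETS as usize];
    // Read one key (path) per line from stdin and hash it into a bucket.
    for line in io::stdin().lock().lines() {
        let key = line.unwrap();
        let mut h = DefaultHasher::new(); // stand-in for xxhash3_64
        key.hash(&mut h);
        counts[(h.finish() % BUCKETS) as usize] += 1;
    }
    let empty = counts.iter().filter(|&&c| c == 0).count();
    let multi = counts.iter().filter(|&&c| c > 1).count();
    println!(
        "empty buckets: {:.1}%  buckets with >1 entry: {}",
        100.0 * empty as f64 / BUCKETS as f64,
        multi
    );
}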

The O(1) lookup is likely more beneficial here than for in-memory data structures, because disk latency is much higher than memory latency, and each node traversal in a tree requires another roundtrip to disk. log(30e6) is about 17; assuming EBS latency of 1 ms, that could mean 17 ms and 17 read operations just to find the object on disk. This saturates IOPS quicker, and 17 ms can be significant if you're fetching multiple objects, doing things like media streaming or edge cache serving, or working as part of a larger system. The hash index, by contrast, is so small that it fits entirely in memory and jumps directly to the location on disk almost all the time (tunable via the hash index size).

So in Turbostore, a hash index with a configurable number of buckets is used. On disk, it is stored as an array of bucket pointers. In memory, one read-write lock exists per bucket, which covers both the bucket state (linked list) and any objects (inode + data) in it. Each operation looks up the relevant bucket, takes its lock, and then reads or mutates the chain and the objects hanging off it.

While locking a bucket could theoretically lock more than one object, we expect one utilized bucket to equal one object almost all the time.
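
To illustrate, here is a minimal sketch of the in-memory side of such an index, written in Rust for brevity (Turbostore itself is C) with hypothetical types:

use std::sync::RwLock;

// Address of an object's inode within the on-disk heap.
type InodeAddr = u64;

struct BucketEntry {
    key: Vec<u8>,
    inode_addr: InodeAddr,
}

// One read-write lock per bucket covers the chain and the objects hanging off it.
struct HashIndex {
    buckets: Vec<RwLock<Vec<BucketEntry>>>,
}

impl HashIndex {
    fn lookup(&self, key: &[u8], hash: u64) -> Option<InodeAddr> {
        let bucket = &self.buckets[(hash % self.buckets.len() as u64) as usize];
        // Readers share the lock; creates and deletes take it exclusively.
        let chain = bucket.read().unwrap();
        chain.iter().find(|e| e.key == key).map(|e| e.inode_addr)
    }
}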

Allocation algorithm

With the index in place, we need a subsystem that allocates space for objects on the device, which index entries (object keys) then point to.

The simplest way to allocate is to hand out exactly the bytes requested. The problem is external fragmentation: future allocations rarely match existing "holes". Over time, this degrades into lots of free space in total but no contiguous region large enough, so you have to compact (lots of I/O) or fail the request. And even with efficient tracking data structures like trees, many distinct allocation sizes can cause excessive overhead.

This is why allocator algorithms typically use a small set of fixed sizes, reducing external fragmentation while accepting some internal fragmentation (overallocating). I initially used a very simple allocation algorithm: split the heap into "tiles", hard coded to be 16 MB. To support small allocations, free tiles can convert to "microtiles" that are bump allocated. For example, an allocation of 65 MB would allocate 4 tiles + 1 MB from a microtile.
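
As a tiny illustration of that split (my own sketch, not Turbostore code):

const TILE_SIZE: u64 = 16 * 1024 * 1024; // 16 MB tiles

// Returns (whole tiles to allocate, bytes to bump-allocate from a microtile).
fn split_allocation(bytes: u64) -> (u64, u64) {
    (bytes / TILE_SIZE, bytes % TILE_SIZE)
}

fn main() {
    // A 65 MB request becomes 4 full tiles plus 1 MB from a microtile.
    assert_eq!(split_allocation(65 * 1024 * 1024), (4, 1024 * 1024));
}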

At the time, I was working with objects on the scale of 200 MB to 4 GB (hence the 16 MB tile size), so this fixed the primary external fragmentation issue — the larger the file, the less likely that much contiguous space is available. However, it has the same "exact allocation" issues for small allocations (microtiles), meaning that many such allocations over time would likely result in more compactions than ideal. As I was dealing with mostly large objects, which meant less metadata and that data (not metadata) would dominate I/O and space, I accepted the trade-offs at the time.

This approach required background processes to coalesce microtiles to make freed space usable, which was complex and led to spikes in I/O. However, this was not a major problem, as the object store was used to store user content, where most users upload files and rarely delete them, so most objects were immutable, and deletions often happened in bulk which allowed for coalescing I/O work at the same time.

Allocation state tracking

I thought about whether it was possible to allocate and deallocate in bounded time. Given 16 MB tiles, and a reasonable max disk size of 256 TB, there would be 16,777,216 tiles. That's too many to have any sort of brute force linear algorithm in a hot path, but it is possible to leverage a hierarchy to efficiently divide-and-conquer a large but bounded search space in guaranteed fixed time.

In this case, there are only two states for a tile: free or used. Therefore, I used a hierarchical bitmap, where each bit represents whether there is any free tile in the level below for the range it represents. At the lowest level, each bit represents one tile; the next level up has one bit for its 64 tiles, the next level has one for its 64×64 tiles, and so on. Here's a conceptual demo, with two bits per element for simplicity:

On top of having fixed latency, CPUs have very efficient bitwise manipulation instructions. We need an instruction to find any set bit in an integer, and one does exist: _tzcnt_u64 — get the position of the lowest set bit. Just four of these instructions can find a free tile across 16 million tiles. Similarly, marking a tile as used or free is simply four bit flips:

uint32_t fast_allocate_one_tile(freelist_t* fl) {
  uint64_t i1 = _tzcnt_u64(fl->tile_bitmap_1);
  uint64_t i2 = _tzcnt_u64(fl->tile_bitmap_2[i1]);
  uint64_t i3 = _tzcnt_u64(fl->tile_bitmap_3[i1][i2]);
  uint64_t i4 = _tzcnt_u64(fl->tile_bitmap_4[i1][i2][i3]);

  propagate_tile_bitmap_change(fl, fl->tile_bitmap_4[i1][i2][i3] & ~(1llu << i4), i1, i2, i3);
  uint32_t tile = (((((i1 * 64) + i2) * 64) + i3) * 64) + i4;

  return tile;
}

static inline void propagate_tile_bitmap_change(freelist_t* fl, uint64_t new_bitmap, uint64_t i1, uint64_t i2, uint64_t i3) {
  if (!(fl->tile_bitmap_4[i1][i2][i3] = new_bitmap)) {
    if (!(fl->tile_bitmap_3[i1][i2] &= ~(1llu << i3))) {
      if (!(fl->tile_bitmap_2[i1] &= ~(1llu << i2))) {
        fl->tile_bitmap_1 &= ~(1llu << i1);
      }
    }
  }
}

Allocating more than one tile would require enumerating each set bit in an element, which gets inefficient with large allocation requests. I searched for another CPU intrinsic to do this quickly, and found _mm512_mask_compressstoreu_epi8, which uses a bitmask to pick up to 64 elements from a source vector and packs them contiguously in a destination. A visualization, reduced to 16 bits for compactness, to illustrate:

An out-of-range value is used as the "padding" to indicate the end of the elements. This can transform a bitmap element in the hierarchy into an array of indices in one instruction, which can then be iterated over:

array_u8_64_t vec_find_indices_of_nonzero_bits_64(uint64_t bits) {
  uint8_t indices_raw[64] = {0, 1, 2, ..., 62, 63};
  __m512i indices = _mm512_loadu_epi8(indices_raw);

  array_u8_64_t result;
  memset(result.elems, 64, 64);
  _mm512_mask_compressstoreu_epi8(result.elems, bits, indices);
  return result;
}

void freelist_consume_tiles(freelist_t* fl, uint64_t tiles_needed, cursor_t* out) {
  array_u8_64_t i1_candidates = vec_find_indices_of_nonzero_bits_64(fl->tile_bitmap_1);
  for (uint64_t i = 0, i1; i < 64 && (i1 = i1_candidates.elems[i]) != 64; i++) {
    // ...
  }
}

For a bitmap at the last level, as I didn't want all free tiles but only up to n, _mm512_mask_blend_epi8 was used to mask off elements after n with the out-of-range value:

static inline vec_512i_u8_t find_free_tiles_in_region(uint64_t region_tile_bitmap, uint8_t tile_count_wanted) {
  uint8_t indices_raw[64] = {0, 1, 2, ..., 62, 63};
  __m512i indices = _mm512_loadu_epi8(indices_raw);
  __m512i fill = _mm512_set1_epi8(64);

  // Find tiles that are available.
  vec_512i_u8_t available;
  available.vec = _mm512_mask_compress_epi8(fill, region_tile_bitmap, indices);
  // Limit tile count to tile_count_wanted.
  available.vec = _mm512_mask_blend_epi8((1llu << tile_count_wanted) - 1, fill, available.vec);
  return available;
}

I took a similar approach for finding a microtile with enough space for any given allocation request. Instead of representing any(is_free for tile in next level), each element represents max(free_space for microtile in next level), since that's the objective we use to divide and navigate the search space:

struct freelist_s {
  // For each element, bits [31:8] represent the maximum free amount 
  // of all elements it represents in the next layer, [7:7] if the 
  // tile is not a microtile, and [6:0] the self-index (although 
  // only 4 bits are used as there are only 16 elements per vector).
  vec_512i_u32_t microtile_free_map_1;
  vec_512i_u32_t microtile_free_map_2[16];
  vec_512i_u32_t microtile_free_map_3[16][16];
  vec_512i_u32_t microtile_free_map_4[16][16][16];
  vec_512i_u32_t microtile_free_map_5[16][16][16][16];
  vec_512i_u32_t microtile_free_map_6[16][16][16][16][16];
};

We're tracking more information so it's less dense: 32 bits per "region" rather than 1 bit. We use wide 512-bit (32x16) values and AVX-512, which is not as dense as the previous 64-bit (1x64) but still achieves 16x vectorization. 24 bits are used to represent the actual free space value, and 4 bits are used to represent the self-index as AVX-512 currently has no instruction for getting argmax, only max.

A query starts off by broadcasting the value to compare to (i.e. requested allocation size), to vectorize the comparison against all 16 elements at once:

uint64_t freelist_consume_microtile_space(freelist_t* fl, uint32_t bytes_needed) {
  __m512i y512 = _mm512_set1_epi32(bytes_needed << 8);

To find the element with enough capacity, _mm512_cmpge_epu32_mask is used, which returns a mask with a bit set for each element that is greater than or equal to the value. Then _tzcnt_u32 picks any such element:

  uint32_t m1 = _mm512_cmpge_epu32_mask(fl->microtile_free_map_1.vec, y512);
  uint32_t p1 = _tzcnt_u32(m1);

Now we know the element at p1 in fl->microtile_free_map_1 contains enough space for our request. This is then repeated, traversing down the hierarchy:

  uint32_t i1, i2, i3, i4, i5, i6;
  i1 = fl->microtile_free_map_1.elems[p1] & 127;
  uint16_t m2 = _mm512_cmpge_epu32_mask(fl->microtile_free_map_2[i1].vec, y512);
  uint32_t p2 = _tzcnt_u32(m2);

  i2 = fl->microtile_free_map_2[i1].elems[p2] & 127;
  uint16_t m3 = _mm512_cmpge_epu32_mask(fl->microtile_free_map_3[i1][i2].vec, y512);
  uint32_t p3 = _tzcnt_u32(m3);

  i3 = fl->microtile_free_map_3[i1][i2].elems[p3] & 127;
  uint16_t m4 = _mm512_cmpge_epu32_mask(fl->microtile_free_map_4[i1][i2][i3].vec, y512);
  uint32_t p4 = _tzcnt_u32(m4);

  // ... and so on

Once the leaf element is reached, the offset can be calculated and the updated value can be propagated back up using _mm512_reduce_max_epu32, which calculates the maximum value:

  uint32_t meta = fl->microtile_free_map_6[i1][i2][i3][i4][i5].elems[p6];
  cur_free = meta >> 8;
  i6 = meta & 127;
  microtile_addr = (((((((((i1 * 16) + i2) * 16) + i3) * 16) + i4) * 16) + i5) * 16) + i6;

  uint32_t new_free = cur_free - bytes_needed;
  fl->microtile_free_map_6[i1][i2][i3][i4][i5].elems[i6] = (new_free << 8) | i6;
  fl->microtile_free_map_5[i1][i2][i3][i4].elems[i5] = (_mm512_reduce_max_epu32(fl->microtile_free_map_6[i1][i2][i3][i4][i5].vec) & ~15) | i5;
  fl->microtile_free_map_4[i1][i2][i3].elems[i4] = (_mm512_reduce_max_epu32(fl->microtile_free_map_5[i1][i2][i3][i4].vec) & ~15) | i4;
  fl->microtile_free_map_3[i1][i2].elems[i3] = (_mm512_reduce_max_epu32(fl->microtile_free_map_4[i1][i2][i3].vec) & ~15) | i3;
  fl->microtile_free_map_2[i1].elems[i2] = (_mm512_reduce_max_epu32(fl->microtile_free_map_3[i1][i2].vec) & ~15) | i2;
  fl->microtile_free_map_1.elems[i1] = (_mm512_reduce_max_epu32(fl->microtile_free_map_2[i1].vec) & ~15) | i1;

  return ((uint64_t) microtile_addr) * TILE_SIZE + (TILE_SIZE - cur_free);
}

This approach uses a fixed six compare + bit-count operations to find a microtile with at least n bytes across up to 16 million microtiles. It may seem more efficient to store the index of the next level's maximum, rather than the self-index, as that would mean directly following indices down the tree at query time instead of doing comparisons. But as mentioned, there is no argmax instruction, so that would require a max + equality comparison + find-set-bit at each level when bubbling up, rather than a single max instruction, and it approximately evens out.

This is essentially a max-heap-like tree with very wide branching to leverage SIMD. Having a fixed size means branchless execution, fixed latency, and no rebalancing. It also allows for updating any arbitrary microtile free space, since there's a computable fixed element index. This approach does require AVX-512F, which can cause CPU throttling and have higher latency than expected, so it may not be so clear cut.

Journal

To atomically write critical metadata to disk and handle interruptions that could cause corruption, a journal is used. Any metadata writes to disk are first recorded as events in the journal. Each event entry contains enough data to be reapplied to reach a consistent state that was guaranteed to clients, in case of a crash or power loss. Client requests are acknowledged as successful only once their events have been written to the journal on disk (and synced), since that guarantees persistence, even if the events have not yet been applied.

A subtlety arises when ordering events. Take, for example, a DB with three subsystems, each with their own internal state: index (A), allocator (B), and log (C). From the user's perspective, each request is atomic, advancing the state of the entire object store in lockstep as one, even if internally there are three separate subsystems. For example, it's not correct to insert an entry into the index but point it to space that's not actually allocated for it. A simple correct approach is to lock the entire system and perform each request one at a time:

void handle_request() {
  lock();
  a();
  b();
  c();
  commit();
  unlock();
}

This has the downside of serializing requests; it'd be nice to do some pipelining here with more granular locking:

void handle_request() {
  lock_a(); a(); unlock_a();
  lock_b(); b(); unlock_b();
  lock_c(); c(); unlock_c();
  lock_journal(); commit(); unlock_journal();
}

The issue is that there's no guarantee of execution ordering between locked subsystem calls. For example, it's possible for thread 1 handling request 1 to mutate A, then yield to thread 2, which handles request 2 but reaches and mutates both A and B before yielding back — perhaps it was faster at A and the OS decided to give it more CPU time. Now request 1 will mutate B after B was affected by request 2. This becomes a problem if request 1 is in the journal but the system crashes before request 2 makes it. Now, reapplying request 1 causes an incorrect system state, due to the partial changes in B for a non-existent request 2. If B was the allocator, this might mean lost space, double free, or worse. This applies all the way down: even if all subsystems executed in the same order across all requests, it would cause issues if request 3 came before request 2 in the journal.

One fix is to acquire the next lock before releasing the current one. This should guarantee sequential processing of requests across systems while still pipelining, and never deadlock if the execution path is always the same:

void handle_request() {
    lock_a(); a(); lock_b();
  unlock_a(); b(); lock_c();
  unlock_b(); c(); lock_journal();
  unlock_c(); commit(); unlock_journal();
}

Another approach would be to issue sequence numbers for requests. Each subsystem, including the journal, only processes in order, deferring out-of-order requests in a queue. You could also try batch processing requests serially.
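
A minimal sketch of the sequence-number idea (illustrative only, not Turbostore's implementation): each subsystem applies mutations strictly in ticket order and parks any that arrive early.

use std::collections::HashMap;

// Per-subsystem gate: mutations are applied strictly in sequence order,
// and any that arrive early are parked until their turn comes.
struct OrderedApplier<M> {
    next_seq: u64,
    parked: HashMap<u64, M>,
}

impl<M> OrderedApplier<M> {
    fn new() -> Self {
        Self { next_seq: 0, parked: HashMap::new() }
    }

    fn submit(&mut self, seq: u64, mutation: M, apply: &mut impl FnMut(M)) {
        self.parked.insert(seq, mutation);
        // Drain everything that is now in order.
        while let Some(m) = self.parked.remove(&self.next_seq) {
            apply(m);
            self.next_seq += 1;
        }
    }
}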

The journal is checksummed using xxHash, as the journal itself could be interrupted while being written. Similarly, applying the journal is idempotent: we can safely reapply it as many times as we want, which handles crashing during journal recovery. If the journal hash mismatches, we can safely assume it was never fully written or was being erased, which by our semantics is OK to discard.
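
A simplified sketch of that recovery rule (the record layout is hypothetical, and Rust's standard hasher stands in for xxHash):

use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

// Stand-in for xxHash over the serialized journal payload.
fn checksum(payload: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(payload);
    h.finish()
}

// Hypothetical layout: an 8-byte checksum followed by the serialized events.
fn recover(journal: &[u8], mut apply_events: impl FnMut(&[u8])) {
    if journal.len() < 8 {
        return;
    }
    let (stored, payload) = journal.split_at(8);
    if u64::from_le_bytes(stored.try_into().unwrap()) != checksum(payload) {
        // Torn or erased journal: nothing in it was acknowledged, so discarding is safe.
        return;
    }
    // Applying is idempotent, so re-running this after a crash mid-recovery is fine.
    apply_events(payload);
}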

Background flush

These metadata writes are quite small, typically less than 100 bytes. Writing each one to disk immediately would be wasteful.

Therefore, these journal events first go into an in-memory buffer. Once it fills up, or roughly every 100 ms, a background thread writes the buffered data to disk. By waiting a few milliseconds, the hope is that more events get collected and coalesced, reducing write amplification and exploiting larger sequential writes. It also reduces the number of fsync calls. A condvar with a timed wait is used to wake the thread on either the timer or the capacity threshold.
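
A sketch of such a flush loop using a mutex-protected buffer and a timed condvar wait, shown in Rust for brevity (Turbostore's version is C):

use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

struct FlushState {
    buffer: Vec<u8>, // pending serialized journal events
    capacity: usize, // flush immediately once this many bytes are buffered
}

fn spawn_flusher(
    state: Arc<(Mutex<FlushState>, Condvar)>,
    mut write_and_sync: impl FnMut(&[u8]) + Send + 'static,
) {
    std::thread::spawn(move || loop {
        let (lock, cvar) = &*state;
        let mut s = lock.lock().unwrap();
        // Wake on the ~100 ms timer, or earlier if a writer signals that the buffer is full.
        while s.buffer.len() < s.capacity {
            let (guard, timeout) = cvar.wait_timeout(s, Duration::from_millis(100)).unwrap();
            s = guard;
            if timeout.timed_out() {
                break;
            }
        }
        if s.buffer.is_empty() {
            continue;
        }
        let batch = std::mem::take(&mut s.buffer);
        drop(s); // don't hold the lock during disk I/O
        write_and_sync(&batch); // one larger sequential write + one fsync
    });
}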

Server

For efficiency, epoll was used to avoid blocking threads when reading from and writing to network sockets during request processing. A simple binary protocol was adopted: a 255-byte leading metadata packet containing the request type and arguments, followed by a request or response payload. Multiplexing is not supported, but connections are reused across requests.

A standard C server approach is to accept clients, then make blocking read and write calls to process requests; threads are used to scale this approach, usually some multiple beyond CPU count since many threads may actually be blocked-idle at any time. With epoll, the approach is slightly different:

  1. An epoll instance is created.
  2. accept4 is provided with the SOCK_NONBLOCK flag and sockets are registered with the epoll instance.
  3. A server loop will wait for events, like "ready to read/write", on registered epoll sockets, then pass them off to request handlers.
  4. A non-blocking socket means read and write will return if not possible to fulfill immediately.
  5. At this point, we can "rearm" the socket, re-registering it with epoll so that we're notified once more data can be sent or received and the request can continue processing.

This is efficient because our threads are either busy doing work or efficiently sleeping while I/O happens in the background, never blocked. The downside is that blocking code is simpler to write because it looks like normal sequentially-executed code. With epoll, one has to track the state of the handler function and handle "re-entering" at a higher level, since at any point where you read or write, you may return from the function and later be called again, needing to resume from that logical position and state.

Async in other languages

This is why syntactic sugar like async exists in other languages: write regular-looking functions and let the language handle state across yielding and resuming behind the scenes. You'll see something similar if you downlevel JS for a pre-async browser, where it goes from:

async function f() {
  let a = 1;
  await x(a);
  a += 1;
  await y(a);
  a += 2;
  return a;
}

to something like:

// To be used in combination with and driven by some generator runtime.
function f(state) {
  switch (state._entrypoint) {
  case 0:
    state.a = 1;
    return x(state.a);
  case 1:
    state.a += 1;
    return y(state.a);
  case 2:
    state.a += 2;
    return state.a;
  }
}

The Asynchronous Programming in Rust book has a good primer and detailed explanation of async programming.

blobd

About a year later, when I was running into scaling issues with the object storage for my search engine, I decided to revisit Turbostore. There were some pain points that I wanted to resolve.

Turbostore already achieved the goal of constant low-latency random reads, and I was satisfied with its feature set. However, I wanted to experiment with further speed gains, seeking better mechanical sympathy with the underlying NVMe SSDs.

I renamed it to blobd: single binary, simple design, does one thing and does it well. New design changes went to blobd while the existing Turbostore was kept as blobd-lite, useful for systems with older Linux kernels or low memory.

Direct I/O

In Turbostore, the underlying storage was treated as a very large addressable region of bytes. But storage devices actually operate on aligned blocks, typically 512 bytes, not individual bytes. The kernel abstracts this away via in-memory buffers: when you write 6 bytes to offset 510, the kernel actually fetches blocks 0 and 1 (since it touches both), applies your changes to those memory buffers, then writes them back. A single 6-byte write, which seemed efficient, turned into a 1024-byte read-then-write.

While hiding this complexity works great for most applications, it's not so good for a low level storage system:

Examples of inefficiencies due to leaky abstraction
  • Two subsystems are placed at locations [0, 1000] and [1001, 2000]. In theory, they have independent authority over their regions. In reality, their boundary lies on the same disk block, so they get entangled: a short write issued by subsystem A actually has to wait for a longer write by subsystem B to the same block. What seems like a one-way (and therefore immediate) write by subsystem B actually requires first reading the underlying block, including data from A, before the write can be issued. A write by A can somehow damage B because the write to the underlying shared block was interrupted.
  • Write 1 writes across two blocks. The second block is already in memory, but the first block still needs to be read. Or maybe the first block is still being changed by an earlier write 0. Should block 2 be sent now for efficiency, causing only part of write 1 to be written and jump ahead of write 0, or should it wait for block 1 and send both together? What if write 5 touches block 4, 5, 6, but block 4 is not yet in memory, write 4 (earlier) touches block 5, and write 6 (later) touches block 5 and 6? There's a lot of complexity here to implement both correctly and optimally, and it also interferes with modeling I/O at the program layer. Synchronization points like fsync, necessary for correctness, become entangled and inefficient as per the previous point.
  • While you need similar synchronization points even with blocks (for example: write to 1000, write to 100 to update metadata like allocator state that references 1000, write 1000 again — a sync/barrier after the metadata write prevents the later overwrite of 1000 from making no sense), it's a bit easier and faster because you can just issue a write and await it, whereas with the Linux buffer abstraction you must then wait for a global fsync, which as per the previous point requires a lot more global pause-the-world timing. Each subsystem truly has full independent control over its own blocks, can sync anytime without waiting on or joining others (implicitly or explicitly), knows the exact order of writes (e.g. write 1 to a block will always land before write 2 to the same block), etc.
  • A subsystem might choose to use very small on-disk structures and values, or use in-memory state that allows for granular reading and writing of disk data, not realizing that every non-direct I/O reads and writes 4 KB of data, negating any optimization, and not leveraging this "bulk read/write" for free.

In blobd, direct I/O is used to enforce operating on disk realities. This is a Linux flag that makes every read/write correspond to I/Os issued directly to the physical storage device, bypassing the kernel abstraction layer. Because disks operate on blocks, all such reads and writes must be block-aligned and sized in multiples of the block size.

Direct I/O means no more fsync: no more complexity via background flushes and optimal scheduling of syncs. There's no kernel overhead from copying and coalescing. It essentially provides the performance, control, and simplicity of issuing raw 1:1 I/O requests.
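
As a minimal sketch of what this looks like from userspace (assuming the libc crate for the O_DIRECT flag; the block size and device path are placeholders):

use std::alloc::{alloc_zeroed, dealloc, Layout};
use std::fs::OpenOptions;
use std::os::unix::fs::{FileExt, OpenOptionsExt};

const BLOCK: usize = 4096; // assumed logical block size; query the device in real code

fn main() -> std::io::Result<()> {
    // O_DIRECT bypasses the page cache: every read/write goes straight to the device,
    // so offsets, lengths, and buffer addresses must all be block-aligned.
    let file = OpenOptions::new()
        .read(true)
        .custom_flags(libc::O_DIRECT)
        .open("/dev/nvme0n1")?; // placeholder device path

    // Allocate a block-aligned buffer; a plain Vec<u8> offers no alignment guarantee.
    let layout = Layout::from_size_align(BLOCK, BLOCK).unwrap();
    let ptr = unsafe { alloc_zeroed(layout) };
    assert!(!ptr.is_null());
    let buf = unsafe { std::slice::from_raw_parts_mut(ptr, BLOCK) };

    file.read_exact_at(buf, 0)?; // read the first block directly from the device

    unsafe { dealloc(ptr, layout) };
    Ok(())
}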

Atomic writes

Another interesting hardware feature is atomic writes. When set on a write request, the physical device ensures that it either succeeds or leaves existing data untouched — no torn writes. Currently, our journal provides the equivalent via transactions for disk mutations, to prevent corruption in the case of interruptions. But it writes everything twice, inefficiently. It's another indirection layer to model and manage. And as mentioned previously, it has complexities that can cause subtle correctness issues.

My understanding of the history of atomic writes is that it was mostly an implicit guarantee by hardware manufacturers for some devices. Then NVMe added it as an explicit part of the spec, so devices can expose what size of atomic writes they support. However, only recently did the Linux kernel add support for this via standard APIs: a flag to signal that a write must be atomic, and fail if it can't be (e.g. the hardware doesn't support it). It seems that before this, one just assumed all writes below some size were atomic, as suggested by AWS. There's a great Stack Overflow answer for further deep diving.

As I proceeded with the redesign, I tried to incorporate atomic writes as well as direct I/O.

Tuple blocks

The original design revolved around disk-resident data structures to represent state. For example, the hash index was stored as a long array of bucket pointers, such that the location on disk of bucket i's pointer was at index_start + i * 6. For the allocator state, the amount of free space for a tile i was located at the 24-bit integer at state_start + i * 3. One reason was the simplicity of correctly reading and writing such data structures compared to others. Another benefit was allowing these data structures to be used without needing to load into memory first, since they are already structured on disk.

But there are downsides. As mentioned, reading or writing 6 bytes actually reads/writes 512 bytes, an 85x amplification; hashes are intentionally random, so write coalescing is an anti-goal. In reality, all the data gets read into memory anyway: state needs to appear modified as soon as each operation is processed, but must not actually change on disk until flushed via the journal, so it's easier to mirror state in memory. Keeping memory and disk in sync then becomes a simpler one-way flow (write from memory to disk), and there's no need to couple them.

In blobd, state on disk is instead represented by blocks of serialized tuples, where a block is the physical device I/O unit. When a tuple is created, it gets placed in any block with enough space. To coalesce writes, existing dirty blocks are prioritized, only allocating if none are suitable. When the server starts, all tuple blocks are read and processed from disk.

In this case, tuples represent index entries of (object_id, object_address). However, this approach generalizes to any state that can be represented as a flat bag or set of entries; I believe heap files are conceptually similar. The block is used as the unit size because it's the smallest writable unit, which minimizes write amplification.

It also ties into atomic writes. Previously object creations and deletions would require multiple writes at disparate locations: bucket pointer, allocator state, and possibly linked list node pointers. This meant it was not possible to leverage atomic writes, and so journaling was necessary. Now, it's just one atomic block write, so the journal can be dropped.
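
Here is a sketch of the idea, with a layout of my own invention rather than blobd's exact format: index entries are packed into a block-sized buffer that is written in a single shot.

const BLOCK_SIZE: usize = 4096; // the device I/O unit
const TUPLE_SIZE: usize = 16;   // (object_id: u64, object_address: u64)

// Pack index-entry tuples into one block-sized buffer, written as a single block write.
fn serialize_tuple_block(tuples: &[(u64, u64)]) -> [u8; BLOCK_SIZE] {
    assert!(tuples.len() <= BLOCK_SIZE / TUPLE_SIZE);
    let mut block = [0u8; BLOCK_SIZE];
    for (i, (id, addr)) in tuples.iter().enumerate() {
        let off = i * TUPLE_SIZE;
        block[off..off + 8].copy_from_slice(&id.to_le_bytes());
        block[off + 8..off + 16].copy_from_slice(&addr.to_le_bytes());
    }
    // Since the whole mutation fits in one (atomic) block write, no journal entry is needed.
    block
}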

In-memory index

With no more disk-resident index, the initial approach was to simply build an index in memory at startup. There are two main trade-offs. One is slower startup times, which I think is fine for a production system that has far more uptime than restarts. The other is memory usage. What are the actual numbers here?

Using the previous example of 30 million objects, assuming every key was a very long 256 bytes, it would require 7 GB of memory to load the entire key set into memory. At a more reasonable 80 bytes on average, it would be 2 GB. If you hash every key using BLAKE3, it would only occupy 0.9 GB of RAM, which is fine for point lookups. So it seems reasonable to hold the entire index in memory.

I did a quick empirical test over a real world upper-scale dataset: indexing 282 million URLs (15.1 GB) from my search engine crawl on an AMD EPYC 7502P server:

Data structure | Insert time (sec.) | Inserts per second | Memory usage (GB)
ahash::HashMap | 102.3 | 2,755,047 | 31.4
ahash::HashMap + BLAKE3 | 171.8 | 1,641,377 | 16.5
ahash::HashMap + MD5 | 183.2 | 1,539,116 | 8.5
blart::TreeMap + CString | 72.3 | 3,897,173 | 43.8
fst::Set | 644.7 | 437,264 | 5.5
std::collections::BTreeMap | 121.3 | 2,323,753 | 32.5

The standard data structures took up about 2x the raw key bytes in memory. Interestingly, there isn't much difference between the BTreeMap and HashMap, meaning range functionality could be added without much cost. blart is an implementation of the adaptive radix tree, which promises tree functionality with both faster performance and lower memory usage, but only delivered the former here. If only point lookups are needed and memory is constrained, hashing keys is a viable approach.

I believe there is much more room for optimization here, as the fst crate demonstrates. FSTs are very compact but static string-key maps. One approach is to rebuild them periodically as the index grows, using an overlay for intermediate keys between snapshots. Another hybrid technique could be LSM trees.

For my use case, where I have plenty of memory and want the fastest performance possible, I prefer this trade off. Writes are much simpler to correctly implement, while being faster. Reads are even faster: looking up a key is instant, and the data location on disk can be jumped to immediately. It's possible to switch out indices depending on use case (e.g. hash map for point lookups, tree map for range queries), and optimize them over time, without any rewriting or migration.

Buddy allocation

The buddy memory allocation algorithm was picked as the replacement. It's quite easy to understand and implement, and it minimizes fragmentation while remaining very fast to allocate and deallocate using only bitwise operations and simple arrays.

It works by defining a set of allocation sizes, increasing by powers of 2. In blobd, the possible allocation sizes are 4 KB (2^12), 8 KB (2^13), 16 KB (2^14), all the way up to 16 MB (2^24). The heap is then divided into pieces, and a bitmap tracks whether each piece is free, with one bitmap per size. Initially, all 16 MB pieces are free, while all pieces of all other sizes are not.

To allocate, find a free piece of the nearest allocation size. If none is available, take a piece of the next allocation size up and "break" it into two, so that there are now two free pieces of the intended size; this can recurse all the way up to the largest allocation size. To deallocate, mark the piece as free. If its "buddy" piece (the previous even-address piece if it's odd-addressed, and the next piece if it's even) is also free, the two coalesce back into a piece of the next size up, recursing all the way up to the largest allocation size.
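
Here is a compact sketch of that split/merge logic, tracking free pieces per size in ordered sets keyed by heap offset (illustrative only; blobd's actual allocator tracks this with per-size bitmaps as described above):

use std::collections::BTreeSet;

const MIN_ORDER: u32 = 12; // 4 KB
const MAX_ORDER: u32 = 24; // 16 MB

struct BuddyAllocator {
    // free[o - MIN_ORDER] holds heap offsets of free pieces of size 1 << o.
    free: Vec<BTreeSet<u64>>,
}

impl BuddyAllocator {
    fn new(heap_size: u64) -> Self {
        let mut free = vec![BTreeSet::new(); (MAX_ORDER - MIN_ORDER + 1) as usize];
        // Initially only the largest (16 MB) pieces are free.
        let piece = 1u64 << MAX_ORDER;
        let top = free.last_mut().unwrap();
        for offset in (0..heap_size / piece * piece).step_by(piece as usize) {
            top.insert(offset);
        }
        Self { free }
    }

    // `order` is the power of two of the requested size, already clamped to [12, 24].
    fn allocate(&mut self, order: u32) -> Option<u64> {
        let idx = (order - MIN_ORDER) as usize;
        let found = self.free[idx].iter().next().copied();
        if let Some(offset) = found {
            self.free[idx].remove(&offset);
            return Some(offset);
        }
        if order == MAX_ORDER {
            return None; // no free piece of any usable size
        }
        // Break a piece of the next size up into two buddies; keep the upper half free.
        let offset = self.allocate(order + 1)?;
        self.free[idx].insert(offset + (1u64 << order));
        Some(offset)
    }

    fn deallocate(&mut self, mut offset: u64, mut order: u32) {
        // Coalesce with the buddy piece for as long as it is also free.
        while order < MAX_ORDER {
            let buddy = offset ^ (1u64 << order);
            if !self.free[(order - MIN_ORDER) as usize].remove(&buddy) {
                break;
            }
            offset = offset.min(buddy);
            order += 1;
        }
        self.free[(order - MIN_ORDER) as usize].insert(offset);
    }
}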

It's greedy and simple, with minimal overhead and complexity, yet handles large disks, many objects, varying sizes, and reclaiming of space over long periods. The Wikipedia article has a good example with a visualization.

Unlike malloc, we don't need to store all of an object's data contiguously, so we can further reduce internal fragmentation by dividing the final <16 MB chunk into smaller-and-smaller fragments.

io_uring

I went with io_uring to get fast async I/O, which works by using two queues: one to send requests, one to get responses. Instead of frequently making expensive syscalls (e.g. read), you issue a lot of request messages (e.g. IORING_OP_READ) and then submit them to io_uring, which will execute them on your behalf in their kernel threads. This also provides asynchronicity, since your userland threads aren't blocked on these syscalls. Then you have some thread that repeatedly polls the other queue for responses. If you're busy enough, you can even have the kernel threads poll in a loop, rather than you submitting — reducing syscalls even further.

In io_uring, you submit requests in one queue, and get responses out in another. Syscalls are avoided by having the kernel, on the other end of the queues, do them for you in the background. Diagram courtesy of Red Hat.

io_uring has generalized beyond its initial I/O focus, and you can make many other syscalls with it, not just open/read/write/close.

This worked nicely with Rust, which has its own primitives: async, queues, and oneshots. These made it straightforward to tie into io_uring seamlessly. The file interface can remain very familiar and easy to use:

impl UringFile {
  async fn read_at(&self, offset: u64, len: u64) -> Vec<u8> {
    // Notifier to wake us up when io_uring has done our thing.
    let (tx, rx) = oneshot::channel();
    // Send the request to our submission thread, which submits it to io_uring.
    self.userspace_requests_queue.send(Request::Read {
      req: ReadRequest { out_buf: vec![0u8; len], offset, len: u32!(len) },
      res: tx,
    });
    // Wait on our notifier, sleeping until then.
    rx.await
  }
}

// Using it looks like a normal async File:
let res = file.read_at(offset, len).await;

Under the hood, we spawn two userspace threads to submit requests and consume responses. Standard queues make it easy to do this in Rust:

// Spawn our userspace submission thread.
thread::spawn(|| {
  let mut uring_submission_queue = unsafe { ring.submission_shared() };
  // We use IDs to track requests on our end, match up responses from io_uring.
  let mut next_id = 0;
  while let Ok(msg) = userspace_requests_queue.recv() {
    let id = next_id; next_id += 1;
    // Build io_uring specific request entry.
    let entry = match &msg {
      Request::Read { req, .. } => {
        let ptr = req.out_buf.as_ptr() as *mut u8;
        // `Fixed(0)` means use the file descriptor we registered during setup.
        opcode::Read::new(Fixed(0), ptr, req.len).offset(req.offset).build().user_data(id)
      }
      Request::Write { req, .. } => {
        let ptr = req.data.as_ptr() as *mut u8;
        let len = req.data.len() as u32;
        opcode::Write::new(Fixed(0), ptr, len).offset(req.offset).build().user_data(id)
      }
    };
    // Store the request in userspace so we know how to handle when we get its response.
    // This also ensures buffers we passed to io_uring remain alive.
    pending.insert(id, msg);
    if uring_submission_queue.is_full() {
      uring_submission_queue.sync();
      ring.submit_and_wait(1).unwrap();
    }
    unsafe { uring_submission_queue.push(&entry).unwrap(); };
    uring_submission_queue.sync();
    ring.submit().unwrap();
  }
});

These code snippets have been simplified for brevity and skip many checks and optimizations; for example, you can wait for more requests before submitting.

On the responses end:

// Spawn our userspace completion thread.
thread::spawn(|| {
  let mut uring_completion_queue = unsafe { ring.completion_shared() };
  loop {
    // Wake when io_uring has new response ready.
    let Some(e) = uring_completion_queue.next() else {
      ring.submit_and_wait(1).unwrap();
      uring_completion_queue.sync();
      continue;
    };
    // Get corresponding userspace request.
    let id = e.user_data();
    let req = pending.remove(&id).unwrap().1;
    match req {
      // Signal caller that it's done.
      Request::Read { req, res } => res.send(req.out_buf),
      Request::Write { req, res } => res.send(req.data),
    }
  }
});

The use of oneshot is what allows the interface to remain a simple async method, where the Rust async caller automatically efficiently sleeps and wakes, rather than having callers deal with queuing and synchronization.

Performance

Let’s see how these design decisions perform in practice. I wrote a Rust benchmark to hit object stores concurrently in phases, one for each operation type, and then tested against:

Store | Configuration
blobd |
blobd-lite | mdadm RAID 0, 16777216 buckets
RocksDB with BlobDB | mdadm RAID 0, xfs
MinIO | xfs
RocksDB lacks the ability to read ranges and upload in parts; it was included as an interesting but not equivalent comparison point. MinIO was accessed via the Rust AWS SDK and its S3-compatible REST API over 127.0.0.1, which adds around 20–100 µs of round trip, essentially negligible given the resulting numbers.

For each phase, at most 1024 operations are in flight at any time, across all object keys. Object keys are range(object_count), serialized as big-endian 64-byte integers, shuffled in each phase to randomize request order. Object data is generated as random bytes of object_size. The kernel page cache is evicted before each read phase, and malloc_trim(0) is called between stores, to accurately measure memory usage and avoid existing caches skewing results. The phases are:

  1. Begin multipart upload (no-op for RocksDB)
  2. Upload parts (one part for RocksDB)
  3. Commit object (no-op for RocksDB)
  4. Inspect object
  5. Read small range of object from random offset
  6. Read entire object
  7. Delete object

To balance time and thoroughness, different object sizes and counts were tested. A consistent object count would result in excessive time for large objects or not enough for small objects. Numbers were unaligned to avoid hitting special code pathways.

Object count | Object size (bytes) | Total bytes
567,890 | 12,345 | 6.5 GB
400,890 | 78,901 | 29.5 GB
300,890 | 345,678 | 96.9 GB
123,456 | 1,512,345 | 173.9 GB
45,678 | 4,012,345 | 170.7 GB
17,890 | 10,123,123 | 168.7 GB
7,004 | 31,987,654 | 208.7 GB

Testing was done on a machine with:

The system was tuned with the following commands:

System tuning script
# File descriptor limits
sysctl -w fs.file-max=2097152
sysctl -w fs.nr_open=2097152
ulimit -n 1048576

# Virtual memory settings
sysctl -w vm.swappiness=10
sysctl -w vm.dirty_ratio=15
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.vfs_cache_pressure=50

# AIO limits
sysctl -w fs.aio-max-nr=1048576

# Network buffer sizes (256MB max)
sysctl -w net.core.rmem_max=268435456
sysctl -w net.core.wmem_max=268435456
sysctl -w net.core.rmem_default=16777216
sysctl -w net.core.wmem_default=16777216

# TCP buffer auto-tuning
sysctl -w net.ipv4.tcp_rmem="4096 87380 268435456"
sysctl -w net.ipv4.tcp_wmem="4096 65536 268435456"
sysctl -w net.ipv4.tcp_mem="786432 1048576 26777216"

# Connection handling
sysctl -w net.core.somaxconn=65535
sysctl -w net.core.netdev_max_backlog=250000
sysctl -w net.ipv4.tcp_max_syn_backlog=8192
sysctl -w net.ipv4.tcp_syncookies=1

# TCP optimization
sysctl -w net.ipv4.tcp_fin_timeout=15
sysctl -w net.ipv4.tcp_tw_reuse=1
sysctl -w net.ipv4.tcp_max_tw_buckets=2000000
sysctl -w net.ipv4.tcp_slow_start_after_idle=0
sysctl -w net.ipv4.tcp_mtu_probing=1

# Local port range for more connections
sysctl -w net.ipv4.ip_local_port_range="10000 65535"

Starting with uploading objects, blobd and blobd-lite have much faster rates of creating and committing objects than MinIO. RocksDB isn't present as it doesn't have these methods.

The drop in commits per second for blobd at higher object sizes is likely due to not enough requests to fully utilize the performance — at the expected 300K op/s, 7K operations would take 0.02 seconds, where measurement variance and overhead may have an effect. Even at 50K op/s, at 32 MB object sizes you would need to be writing 1.5 TB/s before this became a bottleneck.

When uploading the parts, blobd is significantly faster than all other solutions, at times exceeding raw theoretical disk write speeds of 8×1.8 (14.4) GB/s:

A key focus was optimizing for random reads. In this test, 4000-byte reads are done from random offsets. (Note that RocksDB does not support range reads.) Both blobd and blobd-lite deliver that here, with blobd having much higher performance:

Part of this was a goal for consistent low latency random reads, regardless of object size. Both blobd and blobd-lite achieve this:

blobd has sub-millisecond latencies, 130x faster than blobd-lite and at times close to raw NVMe disk latency. Here are the exact numbers:

Store | Metric | 12.0 KB | 77.0 KB | 338 KB | 1.4 MB | 3.8 MB | 9.7 MB | 30.5 MB
direct | Avg | 0.330 | 0.339 | 0.380 | 0.360 | 0.413 | 0.553 | 0.767
lite | Avg | 43.256 | 42.995 | 42.291 | 41.141 | 41.754 | 41.979 | 40.500
minio | Avg | 66.404 | 69.126 | 85.882 | 105.755 | 104.941 | 195.205 | 100.248
rocksdb | Avg | 4.122 | 5.235 | 13.169 | 55.003 | 163.355 | 391.346 | 1123.244
direct | P99 | 0.996 | 0.998 | 1.314 | 1.385 | 3.500 | 3.994 | 5.319
lite | P99 | 69.547 | 68.484 | 67.511 | 65.665 | 68.627 | 73.304 | 82.044
minio | P99 | 227.921 | 374.208 | 980.492 | 1258.599 | 1294.902 | 1380.968 | 153.009
rocksdb | P99 | 11.260 | 14.015 | 39.980 | 193.706 | 602.324 | 1374.650 | 3789.236

The improved and simplified design of blobd also means it uses less system resources:

The increase in memory usage as object sizes go up is likely due to the benchmarking harness itself. 1024 requests of 32 MB in flight is 32 GB just for the combined payload data. We'd expect memory usage from database state (e.g. the index) to go down as size increases, since there are fewer objects.

blobd also has lower disk write amplification and overhead, writing fewer bytes and issuing fewer requests:

Stochastic stresser

Correctness and reliability are important for a system holding critical data and sitting in critical performance paths. One way I sought to achieve this was through simplicity, getting as close to a transparently and self-evidently correct design as possible.

However, I also wanted to stress test blobd, so I built a harness, called the stochastic stresser, that hits it with lots of varied concurrent operations to try to put it into unexpected states and elicit bad behavior. It works off a queue of valid tasks that could be performed. But instead of predictable FIFO polling, it dequeues random entries — legal tasks, but executed unpredictably.

For example, we can create an object with a random key and size at any time. The next legal action is to upload the parts, so those tasks get enqueued. Any of them could therefore happen immediately, at the end, or any time in between. Combined with many other random tasks, objects, and concurrent executors, there should be enough "chaotic mixing" to hit blobd in unseen and unpredictable ways. The hope is that this triggers rare code paths, unforeseen configurations, and race conditions. The stresser will assert that each task completes correctly and panic if otherwise.

This "stochastic" queue is backed by a vector of entries. When pushing, it is appended to the end. Popping picks a random index, then does an O(1) swap_remove; this is how the queue is made random. It's wrapped in a condvar, woken on pushes and sender/receiver changes, for efficient waiting on messages.

It's quite simple, but it was able to catch subtle correctness bugs and race conditions. It's conceptually similar to fuzzing, and I think it's worth checking out for stateful systems generally. The next improvement would be to simulate crashes, hardware failures, and system errors.

What's next

There are some low-hanging improvements to be had around performance, such as caching metadata to avoid repeated disk reads when the same object is hit repeatedly in a short time period. Improving the developer experience is also a near-term goal — polishing documentation, creating more language bindings, and simplifying the configuration.

I am experimenting with a distributed blobd with replication and sharding, for cases where a single node is not enough.

It would be interesting to investigate using these primitives — atomic writes, io_uring, tuple blocks — for a KV store, optimized for small values. Could there be similar gains in performance and simplicity?

You can check out the open source code and usage instructions on the GitHub repo. Feel free to raise issues, ask questions, and start discussions about blobd there! I'm especially interested in ideas for or considerations around using blobd in your projects, how its performance could help, and any limitations you run into.