Yang Cen

Swap, Cache, and mmap

Notes on swap, page cache, mmap, why databases moved away from mmap as a buffer-pool replacement, and why read-only mmap can still be acceptable for immutable Milvus data.

As is well known, Milvus needs to load all data into memory before executing queries. This creates high memory requirements for query nodes. In offline scenarios, users may not be sensitive to query performance and can accept degradation. The conventional solution is to build a buffer pool and load data into memory on demand during queries.

However, that would require a large refactor and is not suitable for the current stage of Milvus, so I will not expand on it here. A simpler and more direct option is mmap. While working on this change recently, I ran into several issues and learned a lot, so I am recording the notes here.

Cache Mechanisms

Swap Space

Operating systems provide swap space. When physical memory runs low, the kernel moves some pages out to disk, into the swap space. The size of the swap space is configurable and is usually chosen relative to the machine's RAM; recommended ratios are easy to find online.

If memory is not enough, why not simply increase swap space and let the OS decide which pages to evict? Because the application has little control over which pages are swapped out. If pages on the critical path are swapped out, performance can suffer badly.

Page Cache

When a file is accessed, the OS often reads neighboring pages into the page cache as well, usually a window of about 32 pages around the requested one (for example the current page, the previous 15 pages, and the next 16 pages), although the exact window depends on kernel settings. Pages in the page cache are evicted under memory pressure much like pages sent to swap. The difference is that the OS knows which file a page comes from, so the file itself serves as the backing store. If the page is dirty, it must be written back to the file first. Otherwise, it can simply be dropped from physical memory.

In short, swap space is the backing store for anonymous pages, meaning pages not associated with any file. Anonymous pages are swapped out to swap space, while file-backed pages are evicted to their associated files.

mmap

mmap can be divided into two types: file-backed maps and anonymous maps. A file-backed map usually looks like this:

void* map = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, offset);

An anonymous map usually looks like this:

void* map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

Memory allocated by an anonymous map consists of the anonymous pages described above. When memory is insufficient, those pages are swapped out to swap space.

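However, this still differs from malloc. malloc is a user-space allocator that hands out memory from the heap, while an anonymous map is a mapping created directly by the kernel. Creating the mapping only reserves address space; physical memory is allocated page by page, when each page is first touched. In practice, malloc may itself call mmap to create anonymous maps for large allocations.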
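To make the lazy allocation concrete, here is a minimal sketch, assuming Linux, that creates an anonymous map and uses mincore to count resident pages before and after touching them. The helper names resident_pages and touch_anonymous are made up for this note.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE  /* for mincore() and MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: number of pages in [addr, addr + len) that are
 * currently resident in physical memory, or -1 on error. */
long resident_pages(void *addr, size_t len) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t n = (len + page - 1) / page;
    unsigned char *vec = malloc(n);
    if (vec == NULL) return -1;
    long count = -1;
    if (mincore(addr, len, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < n; i++) count += vec[i] & 1;  /* bit 0 = resident */
    }
    free(vec);
    return count;
}

/* Hypothetical helper: map npages anonymous pages, record how many are
 * resident right after mmap, write one byte to every page, and return
 * how many are resident afterwards (-1 on error). */
long touch_anonymous(size_t npages, long *resident_before) {
    assert(npages > 0 && resident_before != NULL);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;

    *resident_before = resident_pages(p, len);        /* nothing touched yet */
    for (size_t i = 0; i < npages; i++) p[i * page] = 1;  /* fault each page in */
    long resident_after = resident_pages(p, len);

    munmap(p, len);
    return resident_after;
}
```

On a typical Linux machine I would expect resident_before to be small (usually zero) and the return value to equal npages, but the exact numbers depend on kernel configuration such as transparent huge pages.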

mmap in Database Systems

File IO

Consider reading a file. The simplest approach is open followed by read. In this process, the data is first read from disk into the page cache and then copied into a user-space buffer, so there is one extra copy:

file -> page cache -> user buffer

The write path is similar:

user buffer -> page cache -> file

With file-backed mmap, the application accesses the page cache directly through the mapping. Dirty pages are also written back directly from the page cache to the file, so the copy through a user buffer is avoided. Based on this, some databases use mmap as an optimization.
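The two read paths can be put side by side. This is a minimal sketch, assuming a POSIX system; sum_with_read and sum_with_mmap are names I made up, and both simply add up the bytes of a file so that the results can be compared.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Path 1: file -> page cache -> user buffer.
 * Each read() copies a chunk out of the page cache into buf. */
int sum_with_read(const char *path, uint64_t *sum) {
    assert(path != NULL && sum != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    unsigned char buf[4096];
    ssize_t n;
    *sum = 0;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        for (ssize_t i = 0; i < n; i++) *sum += buf[i];
    close(fd);
    return n < 0 ? -1 : 0;
}

/* Path 2: file -> page cache, accessed in place through the mapping.
 * There is no intermediate user buffer. */
int sum_with_mmap(const char *path, uint64_t *sum) {
    assert(path != NULL && sum != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    *sum = 0;
    if (st.st_size == 0) { close(fd); return 0; }  /* mmap rejects length 0 */
    const unsigned char *p =
        mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) return -1;
    for (off_t i = 0; i < st.st_size; i++) *sum += p[i];
    munmap((void *)p, (size_t)st.st_size);
    return 0;
}
```

Both functions return the same sum for the same file; the difference is only where the bytes live while they are being read.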

Buffer Pool

Databases usually manage pages in one of two ways. One option is to implement their own buffer pool. The other is to mmap data files directly, effectively reusing the operating system's cache mechanism.

Why We No Longer Need mmap

Using mmap as a replacement for a buffer pool is now generally considered a bad idea. See this paper. According to the same paper, using mmap to accelerate file IO is also a poor fit today.

For file IO, modern operating systems provide asynchronous IO mechanisms such as io_uring. mmap has no asynchronous interface: a page fault blocks the faulting thread until the page has been read, so it can perform much worse than async IO. When reading files, the application can also ask the kernel to bypass the page cache (for example with O_DIRECT on Linux), which avoids the extra copy and gives improvements similar to mmap.
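As an illustration of bypassing the page cache, here is a minimal sketch, assuming Linux, where one way to do this is the O_DIRECT open flag. The function name read_bypassing_cache, the 4096-byte alignment, and the fallback to buffered IO are my own assumptions; real code should query the device for its alignment requirements.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0  /* assumption: degrade to buffered IO if the flag is missing */
#endif

#define DIRECT_ALIGN 4096u                 /* assumed buffer/offset alignment */
#define DIRECT_CHUNK (16u * DIRECT_ALIGN)  /* bytes requested per read() */

/* Read fd to the end in aligned chunks; return total bytes or -1. */
static long read_all(int fd, void *buf) {
    long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, DIRECT_CHUNK);
        if (n < 0) return -1;
        total += n;
        if ((size_t)n < DIRECT_CHUNK) break;  /* short read: end of file */
    }
    return total;
}

/* Hypothetical helper: read a whole file, preferring O_DIRECT and falling
 * back to ordinary buffered reads when direct IO is rejected
 * (some filesystems, such as tmpfs, may not support it). */
long read_bypassing_cache(const char *path) {
    assert(path != NULL);
    void *buf = NULL;
    if (posix_memalign(&buf, DIRECT_ALIGN, DIRECT_CHUNK) != 0) return -1;

    long total = -1;
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd >= 0) { total = read_all(fd, buf); close(fd); }
    if (total < 0) {  /* direct IO unavailable: use the page cache after all */
        fd = open(path, O_RDONLY);
        if (fd >= 0) { total = read_all(fd, buf); close(fd); }
    }
    free(buf);
    return total;
}
```

The loop stops at the first short read instead of issuing one more read at an unaligned offset, since direct IO generally expects aligned offsets and lengths.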

For buffer pools, the page cache behind an mmap is fully controlled by the operating system. Although madvise can adjust OS behavior for the mapping, it is only advisory and easy to use incorrectly. More seriously, the OS may write a dirty page back to the file at any time, including before the transaction that modified it has committed, which may break consistency in transaction processing.
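The flushing problem is easier to see in code. This is a minimal sketch, assuming a POSIX system, with a made-up helper update_byte_via_mmap that modifies a file through a writable shared mapping. The application can force writeback with msync, but there is no corresponding call to hold a dirty page back.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: overwrite one byte of an existing file of length
 * file_len through a MAP_SHARED mapping. Returns 0 on success, -1 on error. */
int update_byte_via_mmap(const char *path, size_t file_len,
                         size_t pos, unsigned char value) {
    assert(path != NULL && pos < file_len);
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;
    unsigned char *p = mmap(NULL, file_len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    close(fd);
    if (p == MAP_FAILED) return -1;

    /* After this store the page is dirty. From here on the kernel is free
     * to write it back to the file whenever it likes; the application is
     * not told when that happens. */
    p[pos] = value;

    /* What the application can do is force writeback now and wait for it. */
    int rc = msync(p, file_len, MS_SYNC);

    munmap(p, file_len);
    return rc;
}
```

A database that wants "write the log first, then the page" ordering therefore has to build extra machinery on top of a writable mapping, which is the kind of complexity a buffer pool avoids by owning its pages.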

Why mmap Still Works for Milvus

The ideal solution is still to implement a buffer pool. A buffer pool designed around the system's own access patterns can achieve better performance. But mmap is not too problematic for Milvus in its current situation, because Milvus data is immutable. We only need read-only mappings, so there are never dirty pages to flush at the wrong time, and evicting a page simply drops it from memory.