Things that go wrong with disk IO
There are a few interesting scenarios to keep in mind when writing applications (not just databases!) that read and write files, particularly in transactional contexts where you actually care about the integrity of the data and when you are editing data in place (versus copy-on-write, for example). We'll go into a few scenarios where the following can happen:

- Data you write never actually makes it to disk
- Data you write gets sent to the wrong location on disk
- Data you read is read from the wrong location on disk
- Data gets corrupted on disk

And how real-world data systems think about these scenarios. (They don't always think of them at all!) If I don't say otherwise I'm talking about behavior on Linux.

The post is largely a review of two papers: "Parity Lost and Parity Regained" and "Characteristics, Impact, and Tolerance of Partial Disk Failures". These two papers also go into the frequency of some of the issues discussed here. These behaviors actually happen in real life!

Thank you to Alex Miller and George Xanthakis for reviewing a draft of this post.

Some of these terms are reused in different contexts, and sometimes they are reused because they effectively mean the same thing in a certain configuration. But I'll try to be explicit to avoid confusion.

Sector: the smallest amount of data that can be read and written atomically by hardware. It used to be 512 bytes, but on modern disks it is often 4KiB. There doesn't seem to be any safe assumption you can make about sector size, despite filesystem defaults (see below). You must check your disks to know.

Block (filesystem/kernel): typically set to the sector size, since only this block size is atomic. The default in ext4 is 4KiB.

Page (OS): a disk block that is in memory. Any reads/writes less than the size of a block will read the entire block into kernel memory, even if less than that amount is sent back to userland.

Page (database/application): the smallest amount of data the system (database, application, etc.) chooses to act on when it's read or written or held in memory. The page size is some multiple of the filesystem/kernel block size (including the multiple being 1). SQLite's default page size is 4KiB. MySQL's default page size is 16KiB. Postgres's default page size is 8KiB.

By default, file writes succeed when the data is copied into kernel memory (buffered IO). The man page for write(2) says:

A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. In this case, some errors might be delayed until a future write(), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data.

If you don't call fsync on Linux the data isn't necessarily durably on disk, and if the system crashes or restarts before the disk writes the data to non-volatile storage, you may lose data.

With O_DIRECT, file writes succeed when the data is copied to at least the disk cache. Alternatively you could open the file with O_DIRECT|O_SYNC (or O_DIRECT|O_DSYNC) and forgo fsync calls. fsync on macOS is a no-op. If you're confused, read Userland Disk I/O.

Postgres, SQLite, MongoDB, and MySQL fsync data before considering a transaction successful by default. RocksDB does not.
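To make the write-then-fsync discipline concrete, here is a minimal sketch in C of durably appending to a log file on Linux. The file name, the append-only shape, and the decision to treat a failed fsync as fatal are my choices for illustration, not code from any of the systems above.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void die(const char *msg) { perror(msg); exit(EXIT_FAILURE); }

int main(void) {
    const char *path = "data.log";            /* hypothetical file name */
    const char buf[] = "hello, durable world\n";

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) die("open");

    /* write(2) may return a short count; loop until everything is queued
     * in the page cache. At this point nothing is guaranteed durable. */
    size_t off = 0;
    while (off < sizeof(buf) - 1) {
        ssize_t n = write(fd, buf + off, sizeof(buf) - 1 - off);
        if (n < 0) {
            if (errno == EINTR) continue;
            die("write");
        }
        off += (size_t)n;
    }

    /* A successful write() says nothing about the disk; fsync() is what
     * asks the kernel (and the disk) to make the data durable. */
    if (fsync(fd) < 0) {
        /* After a failed fsync the kernel may already have dropped the
         * dirty pages, so retrying and seeing success proves nothing.
         * Treat it as fatal. */
        die("fsync");
    }
    if (close(fd) < 0) die("close");

    /* The file may be newly created, so persist the directory entry too. */
    int dirfd = open(".", O_RDONLY);
    if (dirfd < 0) die("open .");
    if (fsync(dirfd) < 0) die("fsync .");
    close(dirfd);
    return 0;
}
```

Opening the file with O_DIRECT|O_DSYNC instead would make each write itself synchronous and let you skip the explicit fsync, at the cost of alignment requirements and, usually, throughput.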
fsync isn't guaranteed to succeed. And when it fails you can't tell which write failed. It may not even be a failure of a write to a file that your process opened:

Ideally, the kernel would report errors only on file descriptions on which writes were done that subsequently failed to be written back. The generic pagecache infrastructure does not track the file descriptions that have dirtied each individual page however, so determining which file descriptors should get back an error is not possible. Instead, the generic writeback error tracking infrastructure in the kernel settles for reporting errors to fsync on all file descriptions that were open at the time that the error occurred. In a situation with multiple writers, all of them will get back an error on a subsequent fsync, even if all of the writes done through that particular file descriptor succeeded (or even if there were no writes on that file descriptor at all).

Don't be 2018-era Postgres. The only way to have known which exact write failed would be to open the file with O_DIRECT|O_SYNC (or O_DIRECT|O_DSYNC), though this is not the only way to handle fsync failures.

If you don't checksum your data on write and check the checksum on read (as well as doing periodic scrubbing a la ZFS) you will never be aware if and when the data gets corrupted, and you will have to restore (who knows how far back in time) from backups if and when you notice. ZFS, MongoDB (WiredTiger), MySQL (InnoDB), and RocksDB checksum data by default. Postgres and SQLite do not (though databases created from Postgres 18+ will). You should probably turn on checksums on any system that supports it, regardless of the default.

Only when the page size you write = the block size of your filesystem = the sector size of your disk is a write guaranteed to be atomic. If you need to write multiple sectors of data atomically, there is the risk that some sectors are written and then the system crashes or restarts. This behavior is called torn writes or torn pages. Postgres, SQLite, and MySQL (InnoDB) handle torn writes. Torn writes are by definition not relevant to immutable storage systems like RocksDB (and other LSM-tree or copy-on-write systems like MongoDB (WiredTiger)) unless writes that update metadata span sectors. If your system duplicates all writes like MySQL (InnoDB) does, or your filesystem does (as with data=journal in ext4), you may also not have to worry about torn writes. On the other hand, this amplifies writes 2x.

Sometimes fsync succeeds but the data isn't actually on disk because the disk is lying. This behavior is called lost writes or phantom writes. You can be resilient to phantom writes by always reading back what you wrote (expensive) or by versioning what you wrote. Databases and filesystems generally do not seem to handle this situation.

If you aren't including where data is supposed to be on disk as part of the checksum or the page itself, you risk being unaware that you wrote data to the wrong place or that you read from the wrong place. This is called misdirected writes/reads. Databases and filesystems generally do not seem to handle this situation either (a small sketch of the idea follows at the end of the post).

In increasing levels of paranoia (laudatory), follow ZFS, Andrea and Remzi Arpaci-Dusseau, and TigerBeetle.
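As a closing illustration of the checksum ideas above, here is a sketch of a page checksum that also covers the page's intended location, so both plain corruption and misdirected reads/writes fail verification. The 8KiB page size, the header layout, and the FNV-1a hash are stand-ins I picked for brevity; real systems use stronger, faster checksums such as CRC-32C or xxHash.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 8192
#define HDR_SIZE  sizeof(uint64_t)   /* checksum stored at the front of the page */

/* FNV-1a, 64-bit: a simple stand-in for a real page checksum. */
static uint64_t fnv1a(const void *data, size_t len, uint64_t h) {
    const uint8_t *p = data;
    for (size_t i = 0; i < len; i++) { h ^= p[i]; h *= 1099511628211ULL; }
    return h;
}

/* Hash the page number first, then the payload after the header, so the
 * checksum is bound to where the page is supposed to live on disk. */
static uint64_t page_checksum(uint64_t page_no, const uint8_t *page) {
    uint64_t h = 14695981039346656037ULL;        /* FNV offset basis */
    h = fnv1a(&page_no, sizeof(page_no), h);
    return fnv1a(page + HDR_SIZE, PAGE_SIZE - HDR_SIZE, h);
}

/* Stamp the checksum into the page header before writing the page out. */
static void seal_page(uint64_t page_no, uint8_t *page) {
    uint64_t sum = page_checksum(page_no, page);
    memcpy(page, &sum, sizeof(sum));
}

/* Returns 1 if the page matches the location we think we read it from. */
static int verify_page(uint64_t page_no, const uint8_t *page) {
    uint64_t stored;
    memcpy(&stored, page, sizeof(stored));
    return stored == page_checksum(page_no, page);
}

int main(void) {
    static uint8_t page[PAGE_SIZE];
    memcpy(page + HDR_SIZE, "some row data", 13);

    seal_page(42, page);
    printf("read back at page 42: %s\n", verify_page(42, page) ? "ok" : "BAD");
    /* A misdirected read: the same bytes fetched from the wrong location
     * fail verification even though they are not corrupted. */
    printf("read back at page 7:  %s\n", verify_page(7, page) ? "ok" : "BAD");
    return 0;
}
```

The important design choice is hashing the page number before the payload: a page that lands at, or is fetched from, the wrong offset then carries a checksum that cannot match, so a misdirected write or read looks exactly like corruption.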