Although we do not use ZFS in our production environment, the hosts of the SDCast podcast invited me to talk about it. This article grew out of that episode; you can also listen to the audio version here.
So, today I’m talking about ZFS: how the file system works, what components it consists of, and what new features have appeared or will soon appear in the latest releases.
ZFS And How It Differs From Other Solutions, Using Linux As An Example
ZFS is a symbiosis of a file system and a volume manager that provides tools to manage a disk array efficiently.
Any file system is an abstraction for convenient data storage. Each file system is designed around specific requirements: how many disks it will manage, what kind of storage sits underneath, and so on.
For example, the EXT family is a straightforward system inspired by UFS.
XFS is a system with an emphasis on parallel access, while ZFS aims to be a system that includes everything you need to build large local storage; in particular, this is reflected in its ease of use.
Linux uses the Logical Volume Manager (LVM) as a de facto standard. It also offers abstractions over the underlying block devices – Physical Volumes, Volume Groups, Logical Volumes, and so on. You can achieve much of what ZFS does, but only by stacking several more layers on top of LVM.
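To make the layering concrete, here is an illustrative sketch of both stacks. The device names (`/dev/sdb`, `/dev/sdc`), the volume group name `data`, and the sizes are placeholders, not taken from any real setup; these commands are destructive and are shown for comparison only.

```shell
# LVM + ext4: several layers, each created and managed separately.
pvcreate /dev/sdb /dev/sdc           # 1. physical volumes
vgcreate data /dev/sdb /dev/sdc      # 2. volume group
lvcreate -n vol0 -L 100G data        # 3. logical volume
mkfs.ext4 /dev/data/vol0             # 4. file system on top
mount /dev/data/vol0 /mnt/data       # 5. mount it

# ZFS: one command creates the pool, the file system, and mounts it.
zpool create data /dev/sdb /dev/sdc
```

The difference in the number of moving parts is the point: in ZFS the volume manager and the file system share one set of abstractions, so there is no seam between them to configure.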
Basics Of ZFS
ZFS is a copy-on-write file system: it never overwrites data in place. Every write goes to a new block, so we do not need a journal to ensure data consistency, as most other file systems do.
Databases like MySQL and PostgreSQL use a write-ahead log (WAL). By default, every change is first written to the log and then to a data block on disk, resulting in a double write. And each time, the database has to wait until the file system confirms that the data has reached the disk.
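The double write described above can be sketched in a few lines. This is a toy in-memory model, not how any real database implements its WAL; the names `wal_update` and `recover` are made up for illustration.

```python
# Sketch: the "double write" of a WAL-based database. Every update hits
# the log first, then the data page, so each change is written twice.

log = []        # write-ahead log: append-only
datafile = {}   # data pages, updated in place

def wal_update(key, value):
    log.append((key, value))   # write 1: append to the log
    datafile[key] = value      # write 2: update the data page

def recover():
    # After a crash, replaying the log rebuilds the data file.
    replayed = {}
    for key, value in log:
        replayed[key] = value
    return replayed

wal_update("a", 1)
wal_update("a", 2)
wal_update("b", 3)
assert datafile == {"a": 2, "b": 3}
assert recover() == datafile   # log replay reproduces the same state
```

The log exists only so that the in-place updates can be repeated after a crash; a copy-on-write system avoids the in-place update entirely, which is why it can drop the log.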
Copy-on-write has the following advantage: since old data never changes, there is no need to journal and replay previously written data. We are not afraid of corrupting existing data: the new version of a block is written to a new location without overwriting the old one.
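A minimal sketch of the idea, assuming a simplified in-memory block store (the class and method names are mine, not ZFS internals): an "update" never touches the old block, it allocates a new one and repoints the block table.

```python
# Minimal copy-on-write sketch: logical overwrites become physical appends.

class CowStore:
    def __init__(self):
        self.blocks = []   # append-only block storage ("the disk")
        self.table = {}    # logical block id -> physical index

    def write(self, block_id, data):
        self.blocks.append(data)             # always write to a new location
        old = self.table.get(block_id)       # remember where the old version is
        self.table[block_id] = len(self.blocks) - 1
        return old                           # old version remains readable

    def read(self, block_id):
        return self.blocks[self.table[block_id]]

store = CowStore()
store.write("b0", b"version 1")
old_idx = store.write("b0", b"version 2")    # logical overwrite, physical append
assert store.read("b0") == b"version 2"
assert store.blocks[old_idx] == b"version 1" # old data is untouched
```

Because the old block survives every update, a crash mid-write can never leave a half-overwritten block: readers see either the old version or the new one.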
The copy-on-write mechanism by itself does not guarantee data consistency, but ZFS builds on a Merkle (hash) tree and is therefore always in a consistent state: it uses atomic transactions. There is a tree of blocks; a hash is computed for each block, propagating from the lowest blocks up to the topmost one. The hash of the top block (the uberblock) lets you validate the state of the entire file system as of that transaction.
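The hash propagation can be sketched as follows. This is a generic Merkle-tree toy, not ZFS's actual on-disk layout (ZFS stores checksums in block pointers and its trees are not binary); the point is only that one root hash covers every block beneath it.

```python
# Merkle-tree sketch: each level hashes pairs from the level below, so the
# single root hash (the role ZFS's uberblock plays) validates every block.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [h(leaf) for leaf in leaves]     # hash each data block
    while len(level) > 1:
        if len(level) % 2:                   # duplicate last hash if odd count
            level.append(level[-1])
        level = [h(level[i] + level[i + 1])  # hash adjacent pairs upward
                 for i in range(0, len(level), 2)]
    return level[0]

blocks = [b"block-a", b"block-b", b"block-c", b"block-d"]
root = merkle_root(blocks)

# Changing any single block changes the root, so corruption is detectable:
tampered = [b"block-a", b"block-X", b"block-c", b"block-d"]
assert merkle_root(tampered) != root
assert merkle_root(blocks) == root
```

Combined with copy-on-write, this is what makes the transactions atomic: a new tree is built alongside the old one, and committing is just publishing the new root.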
However, because a copy-on-write system never writes to the same place, data fragmentation becomes a problem, and read efficiency has to be addressed. SSDs largely mask this problem, but it is noticeable on hard drives.
ZFS, like any copy-on-write system, needs free space on the disks so that there is always somewhere to write new data. Added to this is the problem of write ordering: successive blocks of a single file end up in different locations on disk, so allocating data efficiently is a hard task. However, even in classical file systems, fragmentation can only be avoided by allocating a region sequentially and working exclusively within it. That brings us back to pinning program entities to a specific part of the disk, which is less convenient (and any file system, as we said earlier, seeks to make life easier for the developer).
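The fragmentation effect follows directly from the copy-on-write sketch above. Reusing the same toy append-only model (my own simplification, not ZFS's allocator), rewriting a couple of blocks in the middle of a file is enough to scatter it:

```python
# Sketch: why copy-on-write fragments files. Rewriting blocks of one file
# sends them to fresh locations, so the file's blocks drift apart on "disk".

disk = []       # append-only "disk": each write takes the next free slot
file_map = {}   # logical block number -> position on disk

def cow_write(block_no, data):
    disk.append(data)
    file_map[block_no] = len(disk) - 1

# Initial sequential write: blocks 0..3 land contiguously.
for i in range(4):
    cow_write(i, f"v1-{i}")
assert [file_map[i] for i in range(4)] == [0, 1, 2, 3]

# Rewrite blocks 1 and 3: they move to fresh positions 4 and 5.
cow_write(1, "v2-1")
cow_write(3, "v2-3")
assert [file_map[i] for i in range(4)] == [0, 4, 2, 5]  # no longer contiguous
```

A sequential read of the file now jumps between positions 0, 4, 2, 5: cheap on an SSD, but a seek per block on a spinning disk, which is exactly the trade-off described above.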