vulcanridr

Nobody said there was math on this exam!

Every time I think I am starting to get a handle on my knowledge of ZFS, I either learn something new, or I climb way out on a limb and someone saws it off behind me, and what I thought was true turns out to mean bupkus.

I have been an avid user of webzfs since it was first released, as I stated in an earlier blog post. One of the first things I learned after installing it on my laptop was that my pool was at 83% usage, so I went in and adjusted the snapshot retention, having never needed a single file from a year-old snapshot. More recently, I found that during an upgrade, when FreeBSD replaced user home directories with datasets, the dataset somehow got overlaid on top of the existing home directory data (36GB), similar to mounting a filesystem over an existing directory...The data in the directory is still sitting underneath the mounted filesystem, and still consuming space. I did a zfs set mountpoint=none on the home dataset and deleted the (2-year-old) data, then remounted the home dataset. Finally, I deleted the snapshots on NX72003/home to clear out the old data. And the reason I was able to find this extraneous data hiding on the pool at all is because I learned the meaning of the REFER value in zfs list.

Traditionally, I have used zpool list to keep tabs on the usage of my storage. As an example, here is a listing from zpool list (which I truncated to fit on a single line):

$ zpool list NX74205
NAME      SIZE  ALLOC  FREE  CKPOINT  EXPANDSZ  FRAG  CAP
NX74205   928G   203G  725G        -         -   50%  21%

But using these raw values seems not to be the best use of the myriad of statistics that ZFS provides. zfs list, which also happens to be the source of the values displayed on my NAS boxes in the Storage -> Pools display, had different numbers. For the same pool on the same host, zfs list showed:

$ zfs list NX74205
NAME     USED  AVAIL  REFER  MOUNTPOINT
NX74205  384G   516G    88K  /NX74205

So I started digging into why the numbers were so wildly different, and ended up getting pulled into the somewhat crazy world of ZFS math. I nearly named this post The Calculus of ZFS. I'm still working through it (Jim Salter wrote a good article for Klara Systems that I am still digesting). But the conclusions I have drawn thus far:

  • zpool list is the view that displays your storage from the disk perspective. It does not account for any RAID (mirrors, RAID-Z, etc.), parity, or snapshots. SIZE represents the raw size of the disks before any RAIDing, compression, space reservations, deduplication, and the like are factored in. ALLOC is the amount of data written to disk...all of the disks. Similarly, FREE is the amount of free space across all disks. zpool list is the pie-in-the-sky, best-foot-forward set of numbers that managers use to present to the executives or to customers when showing capabilities in a corporate setting. To me, it almost feels parallel to the way that, on traditional filesystems, you had a raw disk capacity that was larger than the formatted amount of space on the disk, after all of the structural stuff needed to save and retrieve your data from said disk. A 20TB drive, for instance, may only give you 18TB of usable space after formatting and making it usable. Obviously, ZFS' system is far more complex, as we will see below.
  • zfs list has space USED, which includes space used by the dataset and all of its children, plus all of the ZFS magic, such as metadata, any space reservations to keep your pool from filling up, inline compression, the disk setup (e.g. is it a single-disk stripe, a mirrored pair, or a RAID-Z/Z2/Z3), and the "slop space" calculation. USED is similar to ALLOC above, except it only counts the data writes once. REFER is referenced data, the amount of data stored on the dataset itself (sans children). In the example above, my top-level dataset, NX74205, does not have any data on it, as it should be. (It is usually bad form to write data to the top-level dataset, and it can be a pain for you down the road.) So the 88K of REFER is likely metadata of some sort. Thus, zfs list provides a more accurate view of what your storage actually looks like.
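To make the USED accounting a bit more concrete: ZFS itself will break USED down for you with zfs list -o space, which splits it into space charged to the dataset, its snapshots, its children, and any unfilled refreservation. The sketch below just adds those pieces up by hand, with entirely hypothetical numbers, to show how they roll up into one USED figure:

```shell
# Hypothetical breakdown of one dataset's USED value, in GiB.
# These are the same components zfs list -o space reports
# (USEDDS, USEDSNAP, USEDCHILD, USEDREFRESERV).
usedbydataset=36        # live data charged to this dataset (close to REFER)
usedbysnapshots=12      # blocks held only by snapshots
usedbychildren=150      # sum of all child datasets
usedbyrefreservation=0  # unfilled refreservation, if any

# USED is simply the sum of the four components.
used=$((usedbydataset + usedbysnapshots + usedbychildren + usedbyrefreservation))
echo "USED=${used}G"    # prints USED=198G
```

This is also why deleting a file doesn't always free space: if a snapshot still references the blocks, the charge just moves from the dataset column to the snapshot column.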

A small note about the slop space calculation. Slop space, a minimal amount of space reserved by ZFS to keep the pool from filling completely during a large operation, is calculated as 3.125% of the total pool size from zpool list. But in this age of petabyte-and-beyond pools, such a reservation would take an inordinate amount of drive space from your pool. For example, if you had a 1PB pool, 3.125% would reserve 31TB of slop space. So in 2021, the OpenZFS developers set limits on slop space: it was given a min/max range of 128MB to 128GB. The slop space is calculated from the values in zpool list, not zfs list.
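As a back-of-the-envelope sketch (not the actual OpenZFS code, which lives in the kernel), 3.125% is just 1/32, so the clamped calculation can be modeled in a few lines of shell. The sizes here are binary units (TiB/GiB), which is an assumption on my part:

```shell
# Model of the slop space rule: 1/32 of the pool (3.125%),
# clamped to a floor of 128 MiB and a ceiling of 128 GiB.
slop() {
  local size=$1
  local slop=$((size / 32))
  local min=$((128 * 1024 * 1024))          # 128 MiB floor
  local max=$((128 * 1024 * 1024 * 1024))   # 128 GiB ceiling
  [ "$slop" -lt "$min" ] && slop=$min
  [ "$slop" -gt "$max" ] && slop=$max
  echo "$slop"
}

slop $((1 << 40))   # 1 TiB pool: prints 34359738368 (32 GiB)
slop $((1 << 50))   # 1 PiB pool: capped, prints 137438953472 (128 GiB)
slop $((1 << 30))   # 1 GiB pool: floored, prints 134217728 (128 MiB)
```

The 1 PiB case is exactly the scenario above: without the cap, 1/32 of the pool would be 32 TiB of dead reservation.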

From a sysadmin perspective, the numbers from zfs list are the more practical ones for day-to-day space management, since the ones from zpool list are more abstract values. For instance, zpool list on my data pool on luna, my zVault NAS, shows an ALLOC of 5.58TiB and a FREE of 49TiB, whereas zfs list shows a USED of 10.7TiB and an AVAIL of 25.5TiB. That 10.7T includes a 7T reservation, bringing the real USED down to 3.7T. The latter is overall a more accurate number, since zpool list factors in multiple writes and parity across a 6-disk RAID-Z2...So it turns out that, in my experience, the total storage from zfs list (USED+AVAIL) usually lands somewhere at or below 100% of the zpool list SIZE, depending on the disk arrangement and other ZFS options.
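The luna numbers roughly check out if you work the RAID-Z2 geometry by hand. On a 6-disk RAID-Z2, two of every six disks' worth of space goes to parity, so only 4/6 of the raw pool is usable data space. This sketch uses tenths of a TiB to keep the shell arithmetic in integers, and it deliberately ignores padding, metadata, and slop, so it is only approximate:

```shell
# Rough sanity check: raw pool size vs usable space on RAID-Z2.
disks=6
parity=2                 # RAID-Z2 burns two disks' worth on parity

raw=545                  # ~54.5 TiB of raw zpool SIZE, in tenths of TiB
# Usable fraction of a RAID-Z vdev is (disks - parity) / disks.
usable=$((raw * (disks - parity) / disks))

echo "expected usable: ${usable}"   # prints 363, i.e. ~36.3 TiB
```

That ~36.3 TiB lines up closely with the 36.2 TiB that zfs list reports as USED+AVAIL, which is exactly the point: zfs list has already done the parity math for you.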

The other reason that the zfs list USED number is more accurate than the zpool list number is that the latter represents the number of physical blocks written. What I mean by this is that if you have a single-disk pool and you write a 1GB file to the filesystem, zpool list will show 1GB of ALLOC, and zfs list will show 1GB of USED. Now, suppose you do the same operation on a 2-disk mirror. While zfs list USED will still show 1GB, zpool list will now show 2GB ALLOC. Why? Because in a mirrored pair, the data is written to each disk, so zpool is listing each instance of that 1GB file. zfs list will still show the 1GB USED...plus or minus...taking into account all of the ZFS flourishes, like RAIDing and compression and reservations and the like.
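That stripe-vs-mirror accounting difference boils down to one multiplication, sketched here as a toy model (the sizes and variable names are mine, not ZFS output):

```shell
# Toy model of the accounting difference, in GiB.
file_size=1                # logical data written by the user

# Single-disk pool: one physical copy of every block.
stripe_alloc=$((file_size * 1))   # zpool list ALLOC
stripe_used=$file_size            # zfs list USED

# Two-way mirror: every block is written to both disks.
mirror_copies=2
mirror_alloc=$((file_size * mirror_copies))  # zpool ALLOC counts both copies
mirror_used=$file_size                       # zfs USED counts the data once

echo "mirror: ALLOC=${mirror_alloc}G USED=${mirror_used}G"  # ALLOC=2G USED=1G
```

The same logic extends to a 3-way mirror (ALLOC triples) and, with the parity fraction instead of a copy count, to RAID-Z.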

The zfs list numbers reflect practical values for sysadmins and users, whereas the zpool list numbers look good in a corporate presentation to the executives (which I have been directed to do before), or to woo potential customers, but are dead useless in the practical management of space on your storage. In my case, the lower total (36.2T vs 54.5T) represents the peace of mind of having checksums and snapshots and redundancy against disk failures and everything else that ZFS makes available to me as a sysadmin...Yes, users and data scientists and researchers benefit as well, but it also keeps them from coming after their friendly, neighborhood sysadmin with pitchforks and torches when things go sideways. And I have had ZFS pools fill up. It can get very, very messy cleaning that up.

What's more, I believe that this number (the total of USED + AVAIL, which is the total ZFS space available to the system, in my case 36.2T) is the number you need to be using when doing your ZFS 80% calculations. Now, I could be wrong on this, but that remaining 20% is a buffer you can slowly eat away at while you work on procuring new drives, which, in this day and age, can be a bit of a challenge...
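Applied to my own pool, the arithmetic is trivial but worth writing down once (again in tenths of a TiB to keep shell integer math happy):

```shell
# The 80% rule of thumb, applied to the zfs list total (USED + AVAIL).
used=107                          # 10.7 TiB, in tenths of TiB
avail=255                         # 25.5 TiB
total=$((used + avail))           # 362 -> 36.2 TiB

threshold=$((total * 80 / 100))   # 289 -> ~28.9 TiB

echo "start planning new drives at ~${threshold} (tenths of TiB) used"
```

Crossing that ~28.9 TiB mark is the signal to start the procurement paperwork, not the point of failure; the remaining headroom is what buys you time.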

But basically, all of that additional complication and consumed space is being used (wisely, IMHO) to ensure data integrity and availability (and confidentiality, if you choose to use OpenZFS encryption), eliminate bitrot, compress data on the fly, and provide a whole plethora of behind-the-scenes features that put ZFS head and shoulders above any of the other modern offerings.

And yes, this is technically a FreeBSD Tribal Knowledge series post, but a) it also applies to anything else that runs ZFS, and b) (more importantly), I couldn't pass up the title...

Thoughts? Leave a comment