vulcanridr

FreeBSD tribal knowledge: Changes to snapshot strategy

I have been using the new webzfs tool (Thanks, JT!), and one of the things I immediately noticed was that on my laptop, with a mirrored pair of 480GB SSDs, my pool was at 83% utilization. ZFS follows the 80% rule for pool utilization: because ZFS is a copy-on-write filesystem, it needs free space in which to write new copies of data, and the fuller the pool gets, the more allocation becomes like a game of Tetris. 45Drives, the company we buy our big storage units from at work, has one of the best explanations of the 80% rule.

I started digging into my snapshots on danube. What I found was that the two largest datasets were my home directory (26GB) and the current base system, which I call "default" (22.5GB). On top of this, I had extensive snapshots. To review my snapshot strategy from an earlier blog post, I use the following snapshot layout using zfstools. Note that this is on my FreeBSD boxes, because I haven't found a way to do hierarchical snapshots like this on TrueNAS/zVault.

  • 3 frequent (every 15 minutes) at 15, 30, and 45 after the current hour;
  • 24 hourly;
  • 7 daily;
  • 4 weekly;
  • 12 monthly.
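The schedule above translates to a handful of root crontab entries. A sketch of what mine looks like, assuming a stock zfstools install (verify paths and snapshot-property setup against the zfstools documentation; `zroot/usr/home` is a placeholder dataset):

```shell
# zfstools only snapshots datasets tagged with this property
zfs set com.sun:auto-snapshot=true zroot/usr/home

# /etc/crontab entries matching the layout above
PATH=/etc:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin
15,30,45 * * * * root zfs-auto-snapshot frequent  3
0        * * * * root zfs-auto-snapshot hourly   24
7        0 * * * root zfs-auto-snapshot daily     7
14       0 * * 7 root zfs-auto-snapshot weekly    4
28       0 1 * * root zfs-auto-snapshot monthly  12
```

The second argument is the number of snapshots to keep for that interval; zfs-auto-snapshot prunes the oldest ones automatically as it goes.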

I have since reduced the hourly snapshot frequency to one snapshot every other hour, and reduced the number of monthly snapshots to 4. I have found, in over 10 years, that I have never needed any data from a year-old snapshot. In fact, most data losses are noticed immediately (e.g. "Oops! I just deleted this directory I needed!"). What I have noticed is that the more time passes, the less likely you are to need that piece of data.
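Trimming is mostly a matter of finding the hungry snapshots and destroying the stale ones. A sketch, with a placeholder dataset name and zfstools-style snapshot names:

```shell
# Show per-snapshot space usage, biggest last (zroot/usr/home is a placeholder)
zfs list -t snapshot -o name,used,creation -s used zroot/usr/home

# Destroy a contiguous range of old monthlies using the % range syntax
zfs destroy zroot/usr/home@zfs-auto-snap_monthly-2024-01-01-00h28%zfs-auto-snap_monthly-2024-08-01-00h28
```

Running `zfs destroy` with `-nv` first does a dry run and reports how much space the deletion would reclaim, which is a good sanity check before committing.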

Finally, I also reduced the number of boot environments I am keeping. Since 15.0p3 is out, there was really no reason to keep my older 14.x BEs. This also freed up a bunch of space.
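Boot environment cleanup is a two-liner with bectl; the BE name below is hypothetical:

```shell
# List boot environments with their space usage
bectl list

# Destroy a stale BE; -o also destroys its origin snapshot, reclaiming the space
bectl destroy -o 14.2-RELEASE_2024-10-01_120000
```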

All of this cleanup dropped my pool utilization from 83% down to 30%.

Having said that, there are three types of calamities from which backups can protect. First is the accidental deletion of data, as noted above. Second is hardware failure: you lose a drive, a RAID goes corrupt, or the like. Finally, a major catastrophe: a meteor falls on your house or your office or your data center, and you need to reconstitute your data.

This calls for what is generally referred to as a 3-2-1 backup scheme: keep at least three copies of your data, on two different types of storage media, with at least one copy off-site. This will safeguard your data.

I use ZFS because of its enhanced capabilities over traditional backups. For instance, I can delete my entire home directory and, using zfs rollback, recover it in under 2 seconds. ZFS snapshots start as zero-length references to the current dataset; as changes occur to the dataset, the snapshot grows, but it only holds the differences. You can keep tens or hundreds of snapshots for very little space. As an example, I have a dataset that is 650TB; two weeks' worth of snapshots are only consuming 15TB.
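The recovery itself looks something like this (dataset, snapshot, and file names are all hypothetical):

```shell
# Roll the whole dataset back to its most recent snapshot
zfs rollback zroot/usr/home@zfs-auto-snap_hourly-2025-06-01-10h00

# Or, to recover a single file without touching anything else, copy it
# out of the hidden .zfs/snapshot directory at the dataset's mountpoint
cp /usr/home/.zfs/snapshot/zfs-auto-snap_hourly-2025-06-01-10h00/me/notes.txt /usr/home/me/
```

Note that plain `zfs rollback` only goes to the most recent snapshot; rolling back further requires `-r`, which destroys the intervening snapshots.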

In addition, I use ZFS replication to copy datasets to another ZFS installation. rsync has to read every file on both sides before it can begin to synchronize files across the network. ZFS, however, already knows the contents on both sides, since each snapshot is a read-only, immutable, point-in-time copy of the state of the filesystem. The sending side knows exactly what has changed since the last snapshot the remote holds, and sends only the differences between that snapshot and the current one, so replication can be (and usually is) a much quicker operation.
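A minimal replication sketch, assuming a backup host reachable as `backup` and placeholder dataset and snapshot names:

```shell
# First run: send the full dataset up to snap1 to the backup host
zfs send zroot/usr/home@snap1 | ssh backup zfs receive -u tank/backup/home

# Subsequent runs: send only the delta between the last common
# snapshot (snap1) and the newest one (snap2)
zfs send -i @snap1 zroot/usr/home@snap2 | ssh backup zfs receive -u tank/backup/home
```

The `-u` flag keeps the received dataset unmounted on the backup host, which avoids surprises with overlapping mountpoints.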

At work, I manage multiple NAS units. I have them set up in a three-tier arrangement to follow the 3-2-1 model: I schedule regular snapshots, which are replicated to a local DR NAS, which in turn replicates to an off-site NAS.

It does take a bit of work to set up, but once you get a good snapshot and replication routine, it lifts a lot of weight off the sysadmin's shoulders.

Thoughts? Leave a comment