Geographic distribution with Sanoid and Syncoid

August 3, 2020

Failures happen at multiple levels: a single disk can fail, and so can multiple disks, a single server, multiple servers, a geographic region, a country, the world, the universe. The probability decreases as the number of simultaneous events grows, while cost and complexity increase with the number of failure scenarios you want to handle. It’s up to you to find the right balance between all those variables.

For my own infrastructure at home, I was able to put storage servers in three different locations: two in Belgium (about 10 km apart) and one in France. They all share the same data. Up to two storage servers can burn or be flooded entirely without data loss. There are also redundancy solutions at the host level, but I will not cover them in this article.

Backup management

The storage layer relies on ZFS pools. Sanoid is a wonderful piece of free software that takes snapshots of your datasets and manages their retention. Here is an example configuration on a storage host:

[zroot]
    hourly = 0
    daily = 0
    monthly = 0
    yearly = 0
    autosnap = no
    autoprune = no

[storage/xxx]
    use_template = storage

[storage/yyy]
    use_template = storage

[storage/zzz]
    use_template = storage

[template_storage]
    hourly = 0
    daily = 31
    monthly = 12
    yearly = 10
    autosnap = yes
    autoprune = yes

Here storage/xxx, storage/yyy, and storage/zzz are datasets exposed to my family’s computers. With this configuration, I am able to keep 10 years of snapshots. This may change over time depending on disk space, performance, or retention requirements. The zroot dataset has no snapshot or prune policy but is declared in the configuration for monitoring purposes.
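Because every dataset is declared in the configuration, Sanoid’s Nagios-style checks can watch them all. A minimal sketch of how this can be wired up:

# warn if any declared dataset is missing recent snapshots per its policy
/usr/local/sbin/sanoid --monitor-snapshots
# warn if a pool is in a degraded state
/usr/local/sbin/sanoid --monitor-health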

Sanoid is compatible with FreeBSD, but it requires a few system changes. You’ll need an “sh”-compatible shell for mbuffer to work. I’ve chosen to install and use “bash” because I’m familiar with it on GNU/Linux servers.
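On FreeBSD, that change can look like the following sketch (my assumption of the required steps, adapt to your own setup):

# install bash from packages
pkg install bash
# make it the login shell of the user the replication connects as
chsh -s /usr/local/bin/bash root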

To automatically create and prune snapshots, I’ve created a cron job that runs every minute:

* * * * * /usr/local/sbin/sanoid --cron --verbose >> /var/log/sanoid.log

Remote sync

Sanoid comes with Syncoid, a tool to sync local snapshots with a remote host. It is similar to “rsync”, but for ZFS snapshots. If the synchronization fails in the middle, Syncoid can resume the replication where it left off, without restarting from scratch. It also supports compression on the wire, which is handy for low-bandwidth networks like mine. To be able to send datasets to the remote destinations, I’ve set up direct SSH communication (through the VPN) with ed25519 keys.
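The key setup is plain OpenSSH; a minimal sketch (the key file name is my choice, not from the original setup):

# on the sending host: generate a dedicated ed25519 key pair
ssh-keygen -t ed25519 -f /root/.ssh/id_ed25519_syncoid -N ""
# install the public key on the remote storage hosts over the VPN
ssh-copy-id -i /root/.ssh/id_ed25519_syncoid.pub root@storage2
ssh-copy-id -i /root/.ssh/id_ed25519_syncoid.pub root@storage3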

Then come the cron jobs for automation:

0 2,6 * * * /usr/local/sbin/syncoid storage/xxxxx root@storage2:storage/xxxxx --no-sync-snap >> /var/log/syncoid/xxxxx.log 2>&1
0 3,7 * * * /usr/local/sbin/syncoid storage/xxxxx root@storage3:storage/xxxxx --no-sync-snap >> /var/log/syncoid/xxxxx.log 2>&1
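Syncoid compresses the stream with lzo by default; on a very slow link, a heavier codec can be worth the extra CPU time. A hypothetical variant of the first job:

0 2,6 * * * /usr/local/sbin/syncoid --compress=gzip storage/xxxxx root@storage2:storage/xxxxx --no-sync-snap >> /var/log/syncoid/xxxxx.log 2>&1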

Beware: I use the “root” user for this connection, which can be a security flaw. You should create a user with low privileges, possibly with “sudo” restricted to the required commands, and you should disable root login over SSH. The countermeasure I’ve implemented is to disable password authentication for the root user (“PermitRootLogin without-password” in the sshd_config file of the OpenSSH server). I’ve also restricted SSH connections to the VPN and local networks only; no public network is allowed.
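The relevant excerpt of sshd_config would look like this (the VPN subnet is a made-up placeholder; in practice the restriction can also live in the firewall):

# /etc/ssh/sshd_config (excerpt)
# root may authenticate with keys only, never with a password
PermitRootLogin without-password
PasswordAuthentication no
# only accept root from the (hypothetical) VPN subnet
AllowUsers root@10.8.0.*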

Local usage

Now that ZFS snapshots are automatically created and replicated, how can we start using the service? I want to send my data! Every location has its own storage server. The idea is to send data to the local server over the local network, and let the Sanoid/Syncoid pair handle the rest over the VPN for data safety.

At the beginning, all my family members were using Microsoft Windows (10). To provide the most user-friendly experience, I thought it was a good idea to create a CIFS share with Samba. The authentication system was a pain to configure, but the network drive was recognized and it worked… for a while. Every single Samba update on the storage server broke the share. I’ve lost countless hours debugging this s**t.

I started showing them alternatives to Windows. One day, my wife agreed to switch and opted for Kubuntu. Then my parents-in-law switched too. I was able to remove the Samba share and use NFS instead. This changed my life: the network folder has never stopped working since the switch. For my personal use, I rely on rsync and cron to automatically send my local folders to the server.
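As an illustration (paths, addresses, and schedule are my assumptions, not from the original setup), the export on the FreeBSD storage host and a client-side cron job could look like this:

# /etc/exports on the storage host: expose a dataset to the local network
/storage/xxx -network 192.168.1.0 -mask 255.255.255.0

# client crontab: push local folders to the NFS mount every hour
0 * * * * rsync -a --delete /home/me/Documents/ /mnt/storage/Documents/ >> /home/me/.rsync-backup.log 2>&1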

The storage infrastructure looks like this (storage1 example):

Geographic distribution diagram

Syncoid is configured to replicate to the other nodes:

Geographic distribution part 2

The most important rule is to strictly forbid writes to the same dataset from two different locations at the same time. This setup is not “multi-master” capable at all.
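One way to help enforce this rule (my suggestion, not part of the original setup) is to mark the replicated copies read-only on the receiving hosts:

# on storage2 and storage3: reject local writes to the replicated dataset
zfs set readonly=on storage/xxxxx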

In the end, data management is fully automated. Data loss belongs to the past.