This blog posting is obsolete. I’ve moved everything over to Github. There are a lot of new goodies there that aren’t here, so mosey on over and check it out.
I’ve been using CrashPlan for Home for about five years. Alas, their Home product is being discontinued, and people who want to keep using CrashPlan will need to pay 2-3 times as much to switch to their Small Business service. I’m not willing to pay that much.
Why are you still reading this? As I said above, this blog posting is obsolete. I’ve moved everything over to Github. There are a lot of new goodies there that aren’t here, so mosey on over and check it out.
Before using CrashPlan, I implemented my own backup solution, which I documented in an earlier blog posting entitled Free Linux cloud backups for cheap bastards. My solution was gross and complicated and required lots of manual hand-holding and scrounging around looking for cloud services that I could stick backup archives into. I switched to CrashPlan a year later because the price they were charging was more than reasonable for the sake of avoiding all of the time and energy necessary to maintain my hacky, home-grown solution. CrashPlan’s more expensive Small Business price is not reasonable, because both the cloud storage options and the free software available for using them have improved dramatically in the past five years.
After looking at what’s available now, I settled on storing my backups in Backblaze B2 and using Rclone to put the data there. I’ve just about finished implementing that solution, so I’m posting what I’ve done here on the off chance that it might be useful to others. If you decide to use the work I’ve done and you want to be kept informed about updates, subscribe to the RSS comment feed for this blog posting; I’ll post comments below when I post updates to my scripts.
I’m not the only person to write up a tutorial like this. Here’s another one you might want to look at. There’s more functionality here, though.
My requirements
My computing environment has the following data locations that need to be backed up:
- a Linode server which runs my mail server, my blog, my wife’s blog, an NNTP server, some databases (MongoDB, MySQL) that I care about, and moderation software for several Usenet newsgroups;
- an iMac which my kids keep files on indiscriminately without paying any attention to whether they’re on the local hard drive or in a backed-up location such as Dropbox;
- a Synology NAS which contains our family photo and video archive, other important data that we don’t want to lose, and some data that we don’t particularly care about; and
- my Linux desktop computer, which has some other MongoDB and MySQL databases that I care about, and my home directory that I care a lot about.
My requirements for backup are as follows:
- Data that’s worthy of being backed up should be backed up twice: once to on-site storage for quick restores in case of hardware loss or failure, and once to off-site storage in case of catastrophe (e.g., our house is burglarized or destroyed).
- I need to be able to exclude files from on-site backups.
- I need to be able to exclude additional files from off-site backups, over and above the files excluded from on-site backups.
- Backups need to be incremental. I’m worrying about hundreds of gigabytes of data; there’s no way in hell I can do complete backups every day.
- Previous versions of modified files and deleted files need to be preserved for some period of time before they are permanently expunged.
- Off-site backups need to be encrypted.
- I need to be able to see what’s taking up space in my backups, so that I can adjust my exclude rules as needed to stop backing up space hogs that don’t need to be backed up.
- Backups need to run automatically once they’re configured and need to let me know if something is going wrong.
- The integrity of backed-up data needs to be verified automatically as part of the backup process so that I will have confidence that I will be able to restore data from backup when I need to.
- Most people prefer a user-friendly backup service with a graphical user interface built by somebody else. I prefer a command-line backup system that I built myself, so that I know exactly what it is doing, I can change its behavior to suit my needs, I can “look under the hood” as needed, and I have confidence that it is doing exactly what I want it to.
- I want to pay as little as possible for storage.
My strategy
Overview
- My Linode server backs itself up daily using rsync (with a large, hand-crafted exclude list) over SSH into a chroot jail in free space on my Linux desktop at home. This is all configured via Ansible playbooks so it’s self-documenting and can be reconstructed easily if needed.
- The family iMac backs itself up automatically via Time Machine to an external USB drive, and also backs itself up daily to my Linux desktop in exactly the same way as the Linode server (though with a different exclude list, obviously).
- I consider our Synology NAS to be its own on-site backup, given that it’s configured to use redundant RAID so I won’t lose any data if one of its hard drives fails as long as I replace the drive before a second one fails.
- I export my MySQL and MongoDB databases nightly into a format on which it is easier to do reliable incremental backups than the output of mongodump, mongoexport, or mysqldump.
- The important data on my Linux desktop is backed up nightly using rsync onto a separate drive.
- I wrote a wrapper around Rclone which reads a simple configuration file and follows the instructions in it — including Rclone filter rules — to back up a local directory into an encrypted B2 bucket. The script also knows how to explicitly verify files in the backup (yes, Rclone claims that it does this, but I am paranoid, and while I haven’t read all of the Rclone source code to have confidence that it is doing what it claims, I have read my own verification code, and it is simple enough to be easy to understand).
- A nightly cron job on my Linux desktop calls several instances of the Rclone wrapper script on different configuration files to run several backups and verifications in parallel. Some of these backups are of directories on my NAS which are mounted on my Linux desktop via either CIFS or NFS.
- My B2 bucket is configured to preserve deleted files for a year before purging them automatically.
All the pieces that make up this backup system are documented in the following sections. The code referenced below is all available in this ZIP file. If you only care about the Rclone bit, skip to “Wrapper script around rclone”.
Some of the code I’m providing here should be usable out-of-the-box with essentially no modifications. However, some of it is intended more as an example than as running code, and you’ll probably need to either use it as inspiration for writing your own stuff, or slice and dice it a bit to get it working. While I’m happy to provide this code to people who can benefit from it, I do not have the time or energy to help you get this code working for you. This is not intended to be plug-and-play; rather, it’s intended to be an assist for people who can read this code and know what to do with it themselves.
A note about cron
As described above and in more detail below, all of my automated backup stuff is driven by cron. Cron captures the output of jobs that it runs and emails them to the owner of the job. It’s important to ensure that email on your machine is configured in such a way that these emails will be delivered successfully rather than lost into a black hole. Otherwise, you won’t know if your backup scripts are generating errors and failing!
In addition to ensuring that your computer is configured to deliver email sent by cron, you also need to ensure that cron is sending emails to the correct address. This means putting a MAILTO setting into the crontab files and/or putting an alias for root in /etc/aliases, unless you’re the kind of person who actually reads root’s email spool in /var/mail.
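For example, one or both of the following will do the trick; the address shown is a placeholder, so substitute an account whose mail you actually read:

# At the top of a crontab (edit with "crontab -e"), before any job lines:
MAILTO=you@example.com

# Or in /etc/aliases (run "newaliases" afterward to rebuild the alias database):
root: you@example.com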
Using Jailkit for safe rsync / SSH backups
Several of my machines back themselves up every night automatically using rsync over SSH to my Linux desktop. I want these backups to be automated and unattended, which means that the SSH key needed for these backups needs to be stored without a passphrase on these machines. However, I don’t want to increase my home network’s attack surface by allowing anyone who manages to break into one of these machines to use the unprotected SSH key to log into my Linux desktop. I solve this problem by creating a dedicated user on my Linux desktop for each of these backups and isolating that user inside a minimal chroot jail, such that if someone does manage to get access to that SSH key, all they’ll be able to gain access to is the backups stored inside that jail.
If you didn’t understand the preceding paragraph, you should probably stop reading this and go buy yourself an off-the-shelf backup product. 😉
I use Jailkit for constructing chroot jails on my Linux desktop, i.e., the target host for the backups.
You will find the following files in the ZIP which illustrate how this is done:
- install_jailkit.yml is the Ansible playbook I use to install Jailkit on the backup target host.
- jk_init_fixer.py is a script called by install_jailkit.yml to fix /etc/jailkit/jk_init.ini after it is installed to remove references to nonexistent library paths.
- jailkit_backup.yml shows how to set up the target environment for the backup on the backup target host, and the SSH key for the backup on backup source host. Note that in this file:
- “bkuptarget.example.com” is the name of the backup target host
- “bkupsource.example.com” is the name of the backup source host
- “/mnt/backup” is the directory on the backup target host in which you want to store backups
- this playbook assumes that home directories are in /home on your system, root’s home directory is /root, and filesystems are controlled by /etc/fstab
- rsync-backup.sh is the script run on the backup source host to use rsync to do the backup to the target host. You will probably want to add to and/or modify the exclude list.
- unchanged-rpm-files.pl is a script called by rsync-backup.sh to determine which RPM-controlled files on the source host are unmodified from the versions in the RPMs and add those files automatically to the exclude list for the backup. If you are using a Linux or Unix variant that uses a different package format such as deb or Pacman, then you may want to write your own version of this script. Alternatively, you can just remove the invocation of it from rsync-backup.sh, but then you will probably want to add more paths to the hard-coded exclude list so you don’t waste space backing up OS files that don’t need to be backed up (or what the heck, you can just back them up and not worry about it, since the bandwidth and storage space are probably less valuable than your time).
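The real script in the ZIP does more than this, but to give you a feel for its shape, the core of rsync-backup.sh boils down to an invocation along these lines; the SSH key path, the jailed user name, the destination directory inside the jail, and the exclude list are all hypothetical placeholders:

# Stripped-down sketch of the rsync-over-SSH backup run on the source host.
# The real script also appends excludes generated by unchanged-rpm-files.pl
# and checks for errors.
rsync -a --numeric-ids --delete --delete-excluded \
    --exclude /proc --exclude /sys --exclude /dev --exclude /tmp \
    -e 'ssh -i /root/.ssh/backup-key' \
    / backup@bkuptarget.example.com:backup/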
Exporting MongoDB databases to an incremental-backup-friendly format
The script mongo-incremental-export.py takes one or more MongoDB connection strings as arguments and exports the specified databases into subdirectories of the current directory named after the databases. Every document in every collection in the database is exported into a separate file. Subsequent runs only export the documents that have been modified. Restoring a database from this data should be a simple, straightforward reverse of this export process, though I haven’t bothered to write that script yet since I haven’t actually needed to do such a restore. Some notes about this:
- The script stores a “checksums” file in each collection subdirectory of the database directory. These files are used to make the script itself run faster, and they should be excluded from backups since they’re not needed for restores and are not particularly incremental-backup-friendly.
- The script puts the exported document files in a directory hierarchy that is several levels deep to prevent directories from having too many files in them.
- I’m sure this script is not scalable to extremely large databases, but that’s OK, because if you’ve got databases that large, you probably have a better way to back them up than this silly little thing. It’s certainly good enough for the relatively small databases I work with.
- The script could be made more scalable by adding configuration code to allow it to be told that some collections are write-once, i.e., it’s not necessary for the script to revisit documents that have already been exported, and/or that some collections have timestamp fields that can be used to determine which documents have been modified since the incremental export. If you want to do this, I will happily accept patches to the code. 😉
Note that I include all of /var/lib/mongodb in my on-site backups done via rsync, since rsync is smart about scanning these files for changed blocks and only copying them over into the backup. This incremental export is only used for the off-site backups done via Rclone to B2. This is necessary (as I understand it) because Rclone isn’t as good as rsync at doing block-based incremental backups.
I run this script on the databases I want to export in a cron job that runs every night prior to my Rclone backup job.
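To give a concrete picture (the install path, export directory, and connection strings are made up for illustration), the nightly export job amounts to something like this:

# Hypothetical sketch of the nightly MongoDB export cron job; run it in
# the directory you back up, before the rclone backup kicks off.
cd /var/backups/mongo-exports || exit 1
/usr/local/bin/mongo-incremental-export.py \
    mongodb://localhost/blogdb \
    mongodb://localhost/moderationdb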
Exporting MySQL databases to an incremental-backup-friendly format
The script mysql-dump-splitter.pl plays a role similar to mongo-incremental-export.py, but for MySQL databases. Basically, it reads mysqldump output on stdin or from a file specified on the command line and splits it into separate files in the current directory, such that each table in the dump is in a separate file. These files are numbered and can easily be recombined with cat to recreate the original dump file, which can be executed as a SQL script to recreate the database.
The splitting makes it more likely — albeit not guaranteed — that Rclone will be able to back up the data incrementally.
I run mysqldump and feed the output into this script from a nightly cron job that runs before my Rclone backup job.
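For example (the database name and export directory are hypothetical), the nightly job boils down to something like the following; restoring is a matter of concatenating the numbered pieces back together in order and feeding the result to mysql:

# Hypothetical sketch of the nightly MySQL export cron job.
cd /var/backups/mysql-exports || exit 1
mysqldump --single-transaction blogdb | mysql-dump-splitter.pl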
Just like for MongoDB, I actually back up all of /var/lib/mysql in my on-site backups; the purpose of this split backup is more efficient off-site backup.
Wrapper script around rclone
Note: You must be using either an official release of rclone newer than 1.37, or a beta release / nightly build from July 24, 2017 or newer. The version of rclone provided by your OS may not be new enough (run rclone --version to check). You can download a newer version from the rclone web site. It’s a Go binary, so you should just be able to drop the downloaded binary right on top of the OS binary with no ill effects.
The script rclone-backup is my wrapper around Rclone. It can use any source or destination type supported by Rclone, so although I’m using local directories as the source and B2 buckets as the destination (well, actually, I’m using an encrypted bucket; see the Rclone documentation about encryption), you should be able to use this script with other source and destination types if you want.
The configuration files read by this script look like this:
[default]
source=source-directory-or-rclone-location
destination=target-directory-or-rclone-location
archive_specials=yes|no
copy_links=yes|no

[filters]
list rclone filters here, as documented at https://rclone.org/filtering/

[test-filters]
list rclone filters here, as documented at https://rclone.org/filtering/
(see below for what these are for)
The archive_specials setting is a hack to work around the fact that Rclone doesn’t know how to handle special files (e.g., devices and named pipes). When it’s set to a true-ish value (the default), rclone-backup finds all of the special files in the source before doing the sync and saves a tar file containing them, called “special-files.tar.gz”, at the root of the source directory.
The copy_links setting tells rclone-backup whether to tell rclone to attempt to copy symbolic links. It defaults to false if not specified. It can also be specified on the command line as --copy-links.
In addition to reading the configuration file to find out what to do, rclone-backup also takes the following command-line options:
- --help — print a usage message and exit
- --verbose — be more verbose itself and also tell rclone to be verbose
- --quiet — tell rclone to be quiet
- --dryrun — show what would be done without actually doing it
- --copy-links — try to copy symbolic links rather than skipping them
- --rclone-config=file — use the specified rclone configuration file instead of the default ~/.rclone.conf
- --ls — call rclone ls on the source directory instead of doing a sync
- --verify=verify-condition — verify the backup as described below instead of doing a sync
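To make this concrete, here is the sort of configuration file I might drop into /etc/rclone-backups for one backup. The remote name “b2crypt”, the paths, and the filter rules are illustrative assumptions, not copies of my real setup:

[default]
source=/backups/bkupsource.example.com
destination=b2crypt:bkupsource.example.com
archive_specials=yes
copy_links=no

[filters]
- /var/cache/**
- *~

A typical invocation would then look something like this (the positional configuration-file argument is my assumption; check the script’s usage message for the exact calling convention):

rclone-backup --verbose /etc/rclone-backups/bkupsource.conf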
Using test filters to reduce overhead when auditing space consumption in backups
The story behind [test-filters] in the configuration file revolves around how one makes sure that one isn’t backing up large data that doesn’t need to be backed up, wasting bandwidth, storage space and (potentially) money. To do this properly, you also need another script of mine called tar-ls-du.pl, which is also in the ZIP (the name is an historical artifact; when I originally wrote this script it only supported ls -l and tar tvf output, but now it also supports rclone ls).
[UPDATE 2017-08-25: Release 1.37 of rclone adds the “ncdu” command, which provides an ncurses interface for exploring the space taken up by the various files and directories. If you have version 1.37 or newer of rclone (and if you don’t, consider getting it!), then instead of using “tar-ls-du.pl” as shown below, you may wish to consider doing something like: rclone --filter-from <(grep '^[-+]' configuration-file) --fast-list ncdu backup-source-directory.]
I will illustrate this by way of example.
If you have an rclone-backup configuration file as shown above with a [filters] section indicating which files to include in or exclude from the backup, then you might run this to find out what’s going to take up the most space in the backup:
rclone --filter-from <(grep '^[-+]' configuration-file) ls backup-source-directory | tar-ls-du.pl --rclone | sort -n
The output produced by this command will show you how much space is taken up by the files and directories that will be included in the backup, with the space taken up by subdirectories and files in directories included in their parents’ totals.
Now, suppose you’re reviewing this output looking for space hogs, and you see some stuff in the output that, yes, is taking up a lot of space, but you know that and you want it to be in the backup anyway, and you don’t want to have to keep skipping over it every time you do one of these space audits. You can then put filters covering this stuff in the [test-filters] section of the configuration file, and those files will no longer be listed in the audit output.
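For instance, suppose the audit keeps flagging a directory of disc images that I already know about and want backed up anyway. As I understand the mechanism, the audit command above pulls its rules from the whole configuration file while the actual sync honors only the [filters] section, so adding an exclude rule like the following (the path is hypothetical) to [test-filters] hides that directory from future audits without removing it from the backup:

[test-filters]
- /isos/**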
If you’re backing up to B2, then big files and directories aren’t all you have to worry about when auditing your backups to reduce waste. You also have to worry about smaller files that are modified frequently, because rclone will preserve previous versions of those files and not clean them up until you tell it to or your bucket policy says to purge the old versions. Files that change frequently could therefore cost you a lot in storage costs even if they aren’t terribly large.
Here’s an example of how I would audit for that when setting up a backup:
rclone --max-age 7d --filter-from <(grep '^[-+]' configuration-file) ls backup-source-directory | tar-ls-du.pl --rclone | sort -n
This will audit only files modified within the past seven days. Of course you can use a longer or shorter time window if you’d like.
Verifying backups
As noted above, if you specify “--verify” to rclone-backup with a verify condition, then it will verify that the contents of the backup destination match the source, by downloading the backed-up files from the destination and comparing them to the source files. The exact form the verification will take depends on what you specify as the argument to --verify. You can specify multiple verify conditions to enforce them all. In particular:
- If you specify “all”, then every single file in the backup is verified. Clearly, this can take a lot of time and bandwidth if there’s a lot of data, not to mention money if your backup destination charges for downloads as, e.g., S3 and B2 do. So think carefully before using this.
- If you specify “data=number”, then up to that many bytes of data in the backup will be verified. You can prefix the number by “<” to enforce a hard limit (otherwise, the final verified file may push the verify over the specified number of bytes) and/or suffix the number by “%” to indicate that the specified number should be interpreted as a percentage of the total number of bytes of all files in the backup.
- If you specify “files=number”, then up to that many files in the backup will be verified. You can suffix the number by “%” to indicate that it’s a percentage of the total number of files in the backup.
- If you specify “age=rclone-age-spec”, then only files up to the specified age (using the same syntax as rclone’s “--max-age” argument) will be verified.
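Purely as an illustration (whether multiple conditions are passed as repeated --verify options or in some other form is an assumption on my part, so check the script’s usage message), a verification limited to roughly 500 MB of files modified within the past week might be requested like this:

rclone-backup --verify=data='<500000000' --verify=age=7d /etc/rclone-backups/bkupsource.conf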
Nightly backup cron job
The file z-rclone-backup-cron in the ZIP is installed in /etc/cron.daily on my Linux desktop (the name starts with “z” to ensure that it runs after all of the other daily cron jobs). In addition, rclone-backup configuration files for each of the directories I want to back up to B2 are in the directory /etc/rclone-backups. The script does the following (a simplified sketch appears after this list):
- Count the number of backups it is going to run.
- Calculate the amount of data we want to verify from each backup, starting with the 1GB of free data downloads that B2 allows per day, and dividing by the number of backups.
- Launch a separate background process for each backup, which first runs and then verifies that backup.
- Wait until all of the background processes exit.
- If the backups are successful and the CANARY variable is set in /etc/default/rclone-backups, fetch the specified canary URL (see Coal Mine).
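The real z-rclone-backup-cron in the ZIP also filters out noisy rclone output; the heavily simplified sketch below shows the overall logic. The .conf extension, the positional configuration-file argument, and the exact --verify syntax are assumptions made for illustration:

#!/bin/bash
# Simplified sketch of the nightly driver installed in /etc/cron.daily.

configs=(/etc/rclone-backups/*.conf)
count=${#configs[@]}

# Split B2's 1 GB of free daily downloads across the verifications.
per_backup_bytes=$(( 1000000000 / count ))

pids=()
for conf in "${configs[@]}"; do
    ( rclone-backup "$conf" &&
      rclone-backup --verify=data="<$per_backup_bytes" "$conf" ) &
    pids+=($!)
done

failed=0
for pid in "${pids[@]}"; do
    wait "$pid" || failed=1
done

# Ping the canary URL only if everything succeeded and one is configured.
[ -r /etc/default/rclone-backups ] && . /etc/default/rclone-backups
if [ "$failed" -eq 0 ] && [ -n "$CANARY" ]; then
    curl -fsS "$CANARY" > /dev/null
fi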
Just to give you some idea of how I’m using this, here are some of the backups in my /etc/rclone-backups directory:
- my wife’s CloudStation drive folder from a mounted NAS filesystem.
- a local “isos” directory containing CD and DVD images that I don’t want to lose because I may not be able to obtain them again.
- the local directories containing the backup sent from my Linode server and the family iMac
- the local directory containing the local backup of my desktop (i.e., as noted above, the desktop backs up itself nightly via rsync to an internal drive that is separate from the drive being backed up, to protect against hard drive failure, and then that backup is what’s being backed up to B2 by the nightly rclone job)
- my music archive, mounted from the NAS
- the family photo / video archive, mounted from the NAS
Note that all of these backup sources are stable, i.e., none of them is being actively modified while the nightly rclone backups are running. This is important to avoid false errors during the backup verification step.
A note about offline backups
Off-site backups are not the same as offline backups.
When all of your backups are online, you’re vulnerable to an attacker who gains access to your computer deleting (or encrypting, if it’s RansomWare) not only your canonical data, but also your backups. This is not necessarily something that a “mass-market” attacker would bother doing, but if someone is out to get you specifically, they may very well do this.
For this reason, it’s usually wise to periodically write your backups to offline media such as DVDs or BluRay discs. How to do this is left as an exercise to the reader.
In conclusion…
Don’t forget to subscribe to the comments RSS feed if you use my code and want to find out about updates.
If you benefited from this, leave a comment below (or, if you’re feeling generous, a donation).
This blog posting is obsolete. I’ve moved everything over to Github. There’s a lot of stuff there that isn’t available here, so check it out.
New ZIP file uploaded here. Changes:
Replace some deprecated ansible syntax.
Add –filters command-line option to rclone-backup which tells it to display the configured filters and then exit.
Add a pre_command setting to rclone-backup config files which tells it to run the specified command before performing the backup and abort the backup if the command fails. I use this to confirm that the filesystems being backed up are actually mounted.
Revamp the logic in rclone-backup for verifying backups and make it verify only 1,000 backed-up files by default, because rclone performance chokes when it’s given a filter with too many files in it.
Fix a bug in tar-ls-du.pl when the listing output being processed didn’t include all intermediate directories.
Fix the backup cron job to filter out some rclone messages that aren’t useful and can’t be suppressed.
#3 isn’t a solution. Example: Burglar steals Synology. Fire or other natural event destroys your Synology device.
I’m not really sure what your point is. As I said above, my Synology NAS is my on-site backup. The whole point of this blog posting is that I also have an off-site backup of my data to protect against exactly the possibilities you mentioned.
New ZIP uploaded. Fixed uninitialized value error in last night’s upload. Diff:
New ZIP file just uploaded.
Blog text above updated to indicate which version of rclone you need to be using:
Changes in the new ZIP file:
Diff:
New ZIP posted with updates, as well as updates to the text above about –verify and the nightly backup cron script:
Diff:
ZIP update: correct rclone-backup exit status when verify fails. Diff:
Three new changes in the ZIP:
* When rclone-backup is printing the rclone command we’re about to execute, wrap arguments with spaces in quotes to make it easier to cut and paste the command into a shell when debugging the behavior of the script.
* Rclone-backup needs to specify –delete-excluded, so changes to filters are propagated into the backed-up files.
* Exclude *~ and .#* files from backups automatically.
Here’s what the changes look like:
Just posted an update to the ZIP. Three fixes:
* Rclone-backup needs to close the “rclone ls” file handle when verifying so there isn’t a defunct rclone process hanging around.
* Fix a bug in tar-ls-du.pl with parsing file paths with spaces in them when the input type is specified on the command line rather than auto-detected.
* Fix a bug in tar-ls-du.pl which was causing parent directories to sometimes show up below their children in the output.
The changes look like this:
I wouldn’t be comfortable with backups of deleted files being purged after one year. I almost lost a GPG key when I discovered that it had been truncated to 0 bytes at an unknown time in the past, and the truncated file had been backed up for a long time. All of my online backups had the truncated file. I only recovered it from an old CD-R backup, and I had to go through a few disks to find one that was readable.
It’s also way too easy to delete something by accident and not realize it for a long time, especially if it’s a file you rarely access. And then there’s bitrot, which tends to go unnoticed for a long time, and could result in corrupted files being backed up (e.g. if metadata changed) and replacing old, intact files.
I’m fully convinced that a serious backup system has to support flexible retention schedules. For example, with Obnam (may it RIP–though it’s still perfectly usable), I use a schedule like 5y,12m,8w,30d, so, if I’d been using Obnam that long, I’d have a yearly backup snapshot going back 5 years. I’m evaluating Restic now, and it also supports this.
Those are totally legit concerns. Right now the kind of logic you’re describing can’t be built into rclone because it doesn’t support removing individual versions of old files (bug). Also, it would be complicated to implement when using an encrypted rclone bucket backed by B2, as I’m doing, because the crypt code in rclone doesn’t know how to deal with B2 versions (bug). I’m hoping that rclone will evolve over time to make it possible for me to implement complex retention policies using rclone directly.
If I really wanted to do it right now, I could, using a combination of running “rclone –crypt-show-mapping” on the encrypted bucket to get the mappings between encrypted and unencrypted file names, and then using the Backblaze B2 CLI to implement the version purging logic.
I don’t care enough about this problem to spend time on it right now. Maybe later.
Maybe I should put all of my scripts into a public Github repo so other people can submit patches for stuff like this. That is, assuming that anybody is actually using my code. 😉
Just posted another update ZIP to account for the fact that the newest rclone uses multiple transactions unless --fast-list is specified, and this is more expensive and much slower on B2.
New ZIP posted with rclone-backup enhanced to account for the fact that the current version of rclone increases its verbosity level each time –verbose is specified:
I just posted an update above about the new “ncdu” command in rclone version 1.37.
I just posted a new version of the ZIP file. One minor change: a message in rclone-backup which should only have been printing in --verbose mode was instead printing all the time: