How I lost my family’s email, recovered most of it, and made sure it wouldn’t happen again

Introduction

I’ve been running my own family mail server for over thirty years. I take a lot of precautions to ensure we don’t lose any email, but last week a series of unfortunate events led to all the email on the server going *poof*, forcing a restore from backup. This revealed some fragility in my email infrastructure and inadequacies in my backup infrastructure, both of which I’ve since addressed. Perhaps you will find this retelling worth reading, if you like sysadmin cautionary tales, post-mortems with juicy lessons learned, or merely laughing at other people’s misfortune.

The working environment

I run sendmail (yes, really, sendmail, I’ve been writing sendmail configuration files since before you were born, you whippersnapper) and Cyrus imapd on a CentOS Stream 9 Linode with 4GB of RAM and 80GB of disk space.

Cyrus imapd supports two different mail indexing backends, Squat and Xapian. Squat is the older and less functional backend. Xapian is much more functional but consumes more resources and is, as you will find out if you keep reading, somewhat more… fragile than Squat. Up until last week I was using Xapian.

Several of the people in my family never delete any emails, which means the disk usage on the server gradually increases over time. I have a monitor set up to warn me when the disk is getting close enough to full that I should go in and see what I can clean up.
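A monitor like this doesn't need to be elaborate. As a minimal sketch (not necessarily the monitor described above; the function names, the threshold argument, and the example address in the comment are all made up for illustration), something along these lines run from cron would do the job:

```shell
# Print the percentage of space used on the filesystem containing $1.
# "df -P" forces the POSIX one-line-per-filesystem format, where column 5
# is the capacity, e.g. "91%".
disk_used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit status 0) when usage of the filesystem containing $1 is at
# or above the threshold percentage given as $2.
disk_over_threshold() {
    [ "$(disk_used_pct "$1")" -ge "$2" ]
}

# Hypothetical cron usage:
#   disk_over_threshold / 90 &&
#       echo "disk at $(disk_used_pct /)%" | mail -s "disk warning" admin@example.com
```

The threshold is left as an argument because the right number depends on how fast the disk fills up and how quickly you can react.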

In my experience, Cyrus imapd’s documentation is often confusing, internally inconsistent, and outdated. There’s a lot of important information you should probably know to successfully administer Cyrus imapd that isn’t documented anywhere in the official documentation. Sometimes you can find what you need by searching the mailing list archives, or you can ask a question on the list and someone may point you in the right direction. However, either of those assumes that you know there’s a question you should be asking, which isn’t always obvious.

The triggering event

I received an email alert warning me that disk usage on the mail server had reached 90%. It was yet again time to dig around looking for stuff to clean up.

I did what I always do when I get that email: I logged into the server and ran sudo du -ax / | sort -nr > /tmp/du, which gives me a full accounting of everything on the disk sorted in reverse order by how much space it’s taking up.

This revealed, as usual, that email storage was the biggest guilty party, but I noticed something I hadn’t before: the Xapian search indexes seemed to be taking up significantly more space than the emails they were supposed to be indexes of, which seemed wrong. Perhaps, I said to myself, I could free up some space by rebuilding or compacting these indexes.
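To put a number on that kind of comparison without reading through the whole du listing, a rough sketch like the following would work. It assumes the index directories are literally named xapian and live inside the mail spool, which matches the layout described later in this post but may not match yours; the function names are invented for this example.

```shell
# Print the total KiB used by all directories named "xapian" under $1.
# -prune stops find from descending into a match, so nothing is counted twice.
xapian_kb() {
    find "$1" -type d -name xapian -prune -exec du -sk {} + |
        awk '{ s += $1 } END { print s + 0 }'
}

# Print the total KiB used by everything under $1, indexes included.
total_kb() {
    du -sk "$1" | awk '{ print $1 }'
}

# Hypothetical usage:
#   idx=$(xapian_kb /var/spool/imap)
#   all=$(total_kb /var/spool/imap)
#   echo "indexes: $idx KiB, everything else: $((all - idx)) KiB"
```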

Learning how to rebuild Xapian search indexes

A bit of research on the mailing list revealed that I was correct: Cyrus imapd’s Xapian search indexes are never compacted or rebuilt unless you configure something to explicitly compact or rebuild them. I did not find this fact documented anywhere in the official documentation when I first switched from Squat to Xapian years ago, and as far as I can tell it still isn’t documented anywhere in the official documentation for how to set up Xapian, the (very small) extent of which appears to be here.

There is a hint at the end of that section of the documentation (emphasis added): “If you want to do more complex search tiers and repacking, you’ll want to read: [link to a message in the mailing list archive]”. Note that this doesn’t explain what repacking is or say you need to do repacking or your indexes will grow without bound forever. Furthermore, the language is ambiguous, and I read it wrong years ago when I set up Xapian, leading me to conclude, incorrectly, that I didn’t need to worry about whatever that message in the mailing list archive was talking about. What that sentence actually means, and should have been written to say, is, “If you want to do search tiers or repacking, you’ll want to read:…” Also, this section should certainly link to the section of the documentation about search tiers, but for some inexplicable reason it does not.

But I’m getting a bit ahead of myself. Before it even occurred to me to go back to the web documentation, I first consulted the man page for the Cyrus squatter program, which I already knew was responsible for building the search indexes. Unfortunately, the squatter man page is rather bad. Some examples:

In the “SYNOPSIS” section it gives this usage syntax: squatter [ -C config-file ] [mode] [options] [source]. Then, below that, it says, “mode is one of indexer, search, rolling, synclog, compact or audit.” You would think, therefore, that when invoking the program you would specify one of those words (“indexer”, “search”, etc.) in place of the word “mode” in the usage syntax, but nope! You would be wrong.
The statement, “This feature is only available on the master branch,” appears seven times in the man page. I leave it as an exercise to the reader to guess how many of those “master branch features” have been included in multiple Cyrus imapd production releases at this point.

Anyway, after trying and failing to figure out from the man page how to compact my indexes, I went searching online, found the message in the mailing list archive that the documentation links to, and read through it and the example shell script attached to it to try to figure out what the correct syntax was for repacking mailboxes. From this, I learned just enough to be dangerous.

The catastrophic mistake(s)

Cyrus imapd stores users’ email messages in a hierarchy organized by username and folder name (this actually changed in recent versions of Cyrus, but the old structure is still supported and is what I use). For example, my inbox is stored in /var/spool/imap/j/user/jik, and my Sent folder is /var/spool/imap/j/user/jik/Sent.

Old-style Cyrus Squat indexes are stored in the same location, i.e., within each folder. For example, /var/spool/imap/j/user/jik/cyrus.squat indexes my inbox, /var/spool/imap/j/user/jik/Sent/cyrus.squat indexes my Sent folder.

When I read the documentation for switching to Xapian, it said I had to set up at least one search tier. I saw that Xapian search tiers were organized in the same structure as the mail spool, e.g., <root>/j/user/jik/xapian and <root>/j/user/jik/Sent/xapian, so I concluded it would be fine to overlay my single search tier with email storage, i.e., to store the Xapian indexes within /var/spool/imap in the same directories as the email messages. Nothing in the documentation said then, or says now, that this would be a bad idea. Nevertheless, it turns out that this was my first mistake, made many years ago without realizing it. Because here’s the thing I just learned: when you tell squatter to compact your indexes, after doing that it apparently deletes all the files in the target storage tier that aren’t in the new index it just built. Including, apparently, mail message files and other metadata files unrelated to the index. You see where this is going, right?
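A cheap pre-flight check for this particular trap is to refuse to run anything destructive when the index directory and the mail spool overlap. The sketch below is only an illustration: the function names are invented, it takes the two directories as plain arguments rather than reading them from the Cyrus configuration, and both directories must already exist.

```shell
# Succeed if directory $1 is the same as, or nested inside, directory $2.
# Both paths are resolved with "pwd -P" so symlinks can't hide an overlap.
is_inside() {
    child=$(cd "$1" && pwd -P) || return 2
    parent=$(cd "$2" && pwd -P) || return 2
    [ "$parent" = / ] && return 0        # everything is inside the root
    case "$child/" in
        "$parent"/*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeed if either directory contains (or is) the other.
dirs_overlap() {
    is_inside "$1" "$2" || is_inside "$2" "$1"
}

# Hypothetical usage (the search directory name is made up):
#   if dirs_overlap /var/spool/imap-search /var/spool/imap; then
#       echo "search tier overlaps the mail spool, not compacting" >&2
#       exit 1
#   fi
```

Appending a slash before matching keeps sibling directories with a shared prefix, such as imap and imap-search, from being mistaken for parent and child.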

Armed with the information gleaned from the mailing list archives, I attempted to compact the mailbox of my family member who uses email the least and doesn’t really use it socially. This was a good mailbox to test on because even if I screwed something up completely they wouldn’t really lose anything important, and it had by far the fewest email messages of any mailbox on the server so it was the fastest one to test.

That first compact seemed to work, so I went ahead and ran the slightly different squatter command to rebuild the indexes for all the mailboxes, not just one of them.

Looking back, here’s what I still don’t know: did I fail to notice that when I rebuilt the indexes for that first mailbox, it deleted all of the email messages in the mailbox? Or is the behavior of the squatter command different when run on a single mailbox vs. across all mailboxes, such that it didn’t actually delete any messages until I ran it without specifying a single mailbox to operate on? If it’s the former, then I made another mistake here: not noticing that the command had deleted all the messages in the mailbox. I may never know, because (as I explain below) I have since switched back to Squat and have no intention of using Xapian again any time soon, and I don’t have the time or inclination to set up a test mail server to experiment on just to understand exactly how dumb squatter is.
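Either way, the “did it quietly delete everything?” question is easy to answer mechanically. Cyrus stores each message as a file whose name is the message UID followed by a dot (1., 2., and so on), so counting those files before and after a risky command is a reasonable smoke test. A minimal sketch, with invented function names:

```shell
# Count Cyrus message files (a numeric UID followed by a dot, e.g. "1234.")
# anywhere under the directory given as $1.
count_messages() {
    # grep -c exits non-zero when the count is 0; "|| true" hides that
    # so the function is safe to use under "set -e".
    find "$1" -type f -name '[0-9]*.' | grep -c '/[0-9][0-9]*\.$' || true
}

# Run a command, then complain if the message count under $1 went down.
# Usage: guard_messages <spool-dir> <command> [args...]
# Note: the exit status of the command itself is not checked here.
guard_messages() {
    dir=$1
    shift
    before=$(count_messages "$dir")
    "$@"
    after=$(count_messages "$dir")
    if [ "$after" -lt "$before" ]; then
        echo "WARNING: $dir went from $before to $after message files" >&2
        return 1
    fi
}
```

On a live server the count can also drop because somebody expunged mail while the command was running, so a warning from this check is a reason to go look, not proof of disaster.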

Anyway, soon after running the reindex command across all mailboxes, imapd started logging I/O errors, and it was then that I discovered that all of the message and metadata files had been deleted for all of my family’s mailboxes. Well, all but one… For some reason I still do not understand and never will, even though squatter supposedly rebuilt the search indexes for all of the mailboxes, one mailbox’s messages were not deleted. I have no idea why, and as noted above, I am not going to do any further experiments to try to figure it out.

The recovery

At this point, I shut down Cyrus imapd, shut down sendmail, and sent out a message in the family chat notifying everyone of what I had done and explaining that email would be down until I recovered what I was able to from backups.

I have daily backups of the mail server, yay! The daily backups are regularly tested to confirm they are accessible and the data in them aren’t corrupted, yay! I have never tested a full restore from backup of the mail server data, uh oh.

It turns out that I had mistakenly configured my backups to exclude some Cyrus imapd files which are actually needed to fully reconstruct mailbox contents. As a result, when I restored the backup onto the mail server and told Cyrus to reconstruct the mailboxes, I lost message stars and flags, read/unread status, and deleted/undeleted status, so every message in every folder showed up as unread and previously deleted messages reappeared. Not great, but not the end of the world.

Also, I’d lost more than half a day’s worth of everybody’s emails because the backups run in the middle of the night and this all went down in the evening. For the family members who use Thunderbird (myself, my wife, one of my kids), I was able to write a script to pull additional emails out of their Thunderbird profiles’ local storage and copy them back into their IMAP mailbox, but some mails were still lost.

The last straw for me and Xapian

After restoring everybody’s mailboxes and changing the Xapian configuration to store the indexes somewhere else (!!), I attempted to rebuild the Xapian indexes, but I had a really hard time getting it to work because there is a bug in squatter which was causing it, in at least some circumstances, to reindex the same twenty messages over and over forever, causing the indexes to rapidly grow without bound without actually indexing most of the mailbox. Awesome!

I did finally manage to find a particular invocation of squatter that seemed to work properly, but it wasn’t the “right” way to use it and I was concerned the fact that I was struggling so much meant that I was doing something wrong, so I sent a message to the mailing list explaining what I was seeing and asking what I was doing wrong and how to fix it. I had assumed, incorrectly, that surely this must be user error and not an obvious bug in essential functionality.

The first person who responded on the mailing list asserted that there must be a problem with a specific message that was causing the indexing loop (even though I had already explained that it was taking place in any folder I tried to index, so obviously it wasn’t being caused by a specific message), or perhaps the problem was that I was using an old version of Cyrus so I should upgrade.

The latter certainly seemed plausible, and I’m a big believer in not reporting bugs in an application until you’ve tested the current version to see if the bug is still present, so I spent several hours figuring out how to get the current version of Cyrus imapd installed onto my mail server, only to discover that not only was the problem still present in the current version, it was worse: my workaround that had worked with the old version was no longer working with the new version, so it was now impossible for me to rebuild any of the Xapian indexes for any of my users. Downgrading to the old version of Cyrus was too difficult to be worth the effort, and frankly wouldn’t be guaranteed to make things better, so that wasn’t an option.

Only then did someone else reply to the mailing list and say, oh, yeah, there’s a longstanding issue with Xapian indexing, here’s the workaround. *fml* I tried the workaround, but it had poor performance, and as far as I could tell it wouldn’t prevent messages from continuing to be repeatedly reindexed, negatively impacting the server’s performance and storage capacity. At this point I switched back to Squat message indexing. I don’t plan on switching back to Xapian until these bugs are fixed.

I did file bug reports [1], [2], which have received no response a week later. 🤷

Future-proofing

I don’t like the idea of losing most of a day’s worth of email again, so I’ve switched from daily backups of the mail server to hourly backups.

I actually have two layers of mail server backups: the server is first backed up onto a computer on my home network, and then that backup is mirrored into my encrypted cloud backup. I didn’t think hourly backups into the cloud were necessary, nor did I want to have to pay for them, so I needed to figure out a performant way to keep full hourly snapshots on my home network without them taking up too much space.

I probably could have come up with an adequate way to do this with rsync --link-dest, but I decided to use this as an opportunity to learn how to use btrfs snapshots, which I’d never before gotten around to. So I:

set up a separate btrfs filesystem for backups;
created separate subvolumes for each backup source on the btrfs filesystem;
modified the backup scripts for each backup source to create a flag file at the root of the backup directory at the end of the backup [example];
set up a cron job to watch for those flag files and then, each time one of them appears, create a read-only subvolume snapshot of its backup subvolume and then delete the flag file; and
made that same cron job automatically delete any snapshots more than a week old.
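The last two steps above can be sketched as a pair of shell functions. This is an illustration rather than the actual cron job: the .snapshots directory layout, the epoch-second snapshot names, the flag file name in the usage comment, and the BTRFS override variable are all assumptions made for this example. The only btrfs commands involved are btrfs subvolume snapshot -r and btrfs subvolume delete.

```shell
# Command used to talk to btrfs; overridable so the logic can be dry-run.
BTRFS=${BTRFS:-btrfs}

# snapshot_flagged <backup-root> <flag-name> <now-epoch>
# For every subvolume <backup-root>/<source> that contains the flag file,
# create a read-only snapshot at <backup-root>/.snapshots/<source>/<now-epoch>
# and then remove the flag so the same backup isn't snapshotted twice.
snapshot_flagged() {
    root=$1 flag=$2 now=$3
    for f in "$root"/*/"$flag"; do
        [ -e "$f" ] || continue            # the glob matched nothing
        src=$(dirname "$f")
        dest="$root/.snapshots/$(basename "$src")"
        mkdir -p "$dest"
        $BTRFS subvolume snapshot -r "$src" "$dest/$now" && rm -f "$f"
    done
}

# prune_snapshots <backup-root> <now-epoch> <max-age-seconds>
# Delete snapshots whose name (their creation time) is older than max-age.
prune_snapshots() {
    root=$1 now=$2 max_age=$3
    for snap in "$root"/.snapshots/*/*; do
        [ -e "$snap" ] || continue
        created=$(basename "$snap")
        case "$created" in *[!0-9]*) continue ;; esac   # skip unexpected names
        if [ $((now - created)) -gt "$max_age" ]; then
            $BTRFS subvolume delete "$snap"
        fi
    done
}

# Hypothetical cron usage, keeping one week (604800 seconds) of snapshots:
#   now=$(date +%s)
#   snapshot_flagged /backups .backup-done "$now"
#   prune_snapshots /backups "$now" 604800
```

Putting the creation time in the snapshot name avoids relying on directory timestamps, which on a snapshot reflect the source subvolume rather than the moment the snapshot was taken.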

I also changed my backup script so that it’s backing up the Cyrus index files I should have been backing up before but wasn’t:

all the global and user-specific databases in imapd’s configdirectory (in my case, /var/lib/imap) except for the conversations databases, which can be reconstructed from the contents of email message files; and
the cyrus.index file in each mail folder, which I was previously incorrectly excluding from backups (I was already backing up cyrus.annotations and cyrus.header).
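An exclusion mistake like this is easy to catch after the fact by scanning a finished backup for folders that look incomplete. The sketch below assumes, for illustration only, that every backed-up mail folder has a cyrus.header file and should therefore also have a cyrus.index next to it; the function name and the backup path in the usage comment are hypothetical.

```shell
# List mail folders under $1 that contain cyrus.header but lack cyrus.index.
# Prints one directory per line; prints nothing when the backup looks complete.
folders_missing_index() {
    find "$1" -type f -name cyrus.header | while IFS= read -r header; do
        dir=$(dirname "$header")
        [ -f "$dir/cyrus.index" ] || printf '%s\n' "$dir"
    done
}

# Hypothetical usage against a backup copy of the spool:
#   missing=$(folders_missing_index /backups/mail/var/spool/imap)
#   [ -z "$missing" ] || printf 'folders without cyrus.index:\n%s\n' "$missing" >&2
```

This is no substitute for an actual test restore, but it would have flagged the missing files long before they were needed.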

Lessons learned

Cyrus imapd is a pain in the ass to administer, its documentation is subpar, and I don’t think its maintainers care all that much about making it accessible for small-server administrators vs. huge organizations like Fastmail (which, it appears, is responsible for much of its maintenance nowadays). I should probably bite the bullet and switch to a different IMAP server at some point. Maybe Dovecot? Not sure.

It would be better if I had a test rig I could easily spin up for playing with IMAP administration changes before rolling them out in production. For this to be practical I’d have to automate my IMAP server deployment and configuration with Ansible, something I haven’t gotten around to doing (I originally set up the server before I even knew how to use Ansible, heck, perhaps before Ansible even existed) and may not bother doing until I bite the bullet and switch to a different IMAP server.

It would have been better if I had tested a full restore of email server data from my backups. However, this depends on the aforementioned test rig, which I don’t have and don’t have the time right now to set up.

All things considered, I think losing less than a day’s worth of email (though metadata was lost) and being able to recover back to a working state in under two hours was pretty good, actually. I’m glad about the changes I’ve made to be more resilient moving forward, but I’m not terribly disappointed with how well I was able to recover this time around.

Notwithstanding all the drama currently playing out in the Linux kernel developer community about bcachefs, btrfs, etc., btrfs was easy to set up and use for hourly snapshots of mail server backups. I haven’t taken the time to understand in depth the technical concerns people have with btrfs. I am crossing my fingers and hoping I don’t regret that, but I’m not too worried about it, since I’m only using btrfs for storing backups. If something does go wrong with my btrfs filesystem it’s easily recreated backups, rather than master data, that will be lost.
