Rambling about hacking on my creaky old email archival infrastructure

By | January 26, 2024

I have archived emails going back more than 30 years, stored as compressed Mbox files. I periodically transfer old emails from my mail server into the archive and index them with mairix for easy searching later.

Yesterday my mail server (which I also maintain myself) was running a bit low on disk space, so I decided to archive some messages. Thus began an absurd odyssey involving many hours of work to track down and fix multiple bugs…

First, the script I use to pull messages from the mail server into an Mbox file wouldn’t work because the maintainer of one of the Perl modules it uses made a breaking change to the module which he couldn’t possibly have tested before release. I found and fixed the bug and submitted a bug report. It is worth noting that this breaking change was made in response to an earlier bug report I also submitted. In trying to fix one bug he introduced another one because he didn’t test his fix. *sigh*

I found and fixed multiple issues with how my scripts parse the “From” lines that separate messages in Mbox files (e.g., the day of the month is sometimes space-padded; there are sometimes multiple spaces after the sender’s email address; the email address can be an arbitrary string that includes spaces enclosed in quotation marks). The most obvious impact of these bugs is that the scripts weren’t recognizing all the From lines and therefore weren’t separating messages properly, so multiple consecutive messages were being treated as one. Not good!
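For illustration, here is a minimal Python sketch of a From-line matcher that tolerates those three quirks. It is not the actual code from my scripts; the names `FROM_LINE` and `parse_from_line` are made up for this example, and it assumes the date is in the plain asctime style with no time zone.

```python
import re

# Hypothetical pattern, assuming an asctime-style date with no time zone.
FROM_LINE = re.compile(
    r'^From '
    # Sender: runs of non-space characters and/or quoted strings,
    # where a quoted string is allowed to contain spaces.
    r'(?P<sender>(?:"[^"]*"|[^\s"])+)'
    # One or more spaces between the sender and the date.
    r'\s+'
    # Date: \s+ before the day lets it be space-padded ("Jan  3").
    r'(?P<date>[A-Z][a-z]{2}\s+[A-Z][a-z]{2}\s+\d{1,2}'
    r'\s+\d{2}:\d{2}:\d{2}\s+\d{4})'
    r'\s*$'
)


def parse_from_line(line):
    """Return (sender, date) for an Mbox From line, or None if no match."""
    match = FROM_LINE.match(line)
    if match is None:
        return None
    return match.group('sender'), match.group('date')
```

A body line that merely starts with the word “From” (say, “From the archive, I learned”) doesn’t match, because what follows the first word isn’t a date.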

When something fails during the archiving process, the script that does the archiving is supposed to restore things to their previous state, but I discovered an edge case in which it was not doing that. The net result is that over the years I have probably lost some messages without realizing it, when an archiving attempt failed and they went astray. D’oh!
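The general “put everything back if anything fails” idea can be sketched in Python like this. It is only a sketch under assumptions, not what my archiving script does: the helper name `restore_on_failure` is invented, and it assumes the files involved are small enough to copy to a temporary directory first.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def restore_on_failure(*paths):
    """Snapshot the given files; put them back if the with-block raises."""
    backup_dir = tempfile.mkdtemp(prefix="archive-backup-")
    saved = {}
    try:
        for index, path in enumerate(paths):
            if os.path.exists(path):
                saved[path] = os.path.join(backup_dir, str(index))
                shutil.copy2(path, saved[path])
            else:
                # Remember that the file did not exist beforehand.
                saved[path] = None
        try:
            yield
        except BaseException:
            for path, copy in saved.items():
                if copy is None:
                    # The block created this file, so remove it again.
                    if os.path.exists(path):
                        os.remove(path)
                else:
                    shutil.copy2(copy, path)
            raise
    finally:
        shutil.rmtree(backup_dir, ignore_errors=True)
```

Anything that modifies the archive then goes inside `with restore_on_failure(archive_path):`, and the original exception is still raised after the files are restored, so failures stay visible.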

I had an off-by-one bug in the script I wrote to sort messages by date in Mbox files, with the result that although messages weren’t lost, they were being stored out of order in some cases. This isn’t a huge deal since I usually search with mairix which doesn’t care what file messages are located in, but it’s still a bit of a pain.

My email archive goes back far enough that the format of the Date header wasn’t consistently standardized when the oldest messages were written, and it turns out I had many messages in my archive (over 6,000) with invalid or ambiguous dates. My scripts don’t like that, so I wrote a script to process the entire archive, use heuristics to fix most of the dates automatically, and alert me to the ones that needed to be fixed by hand.
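The “flag what needs a human” half of that can be sketched in Python with the standard library’s date parser. This is an illustration, not my actual script or its heuristics; `triage_date` and the default year bounds are assumptions made up for the example.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime


def triage_date(value, earliest_year=1980, latest_year=None):
    """Return a datetime for a usable Date header, else None.

    None means "needs attention". The year bounds are arbitrary
    assumptions meant to catch dates that parse but are implausible.
    """
    if latest_year is None:
        latest_year = datetime.now().year
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Pythons raise TypeError and newer ones ValueError
        # when the header cannot be parsed at all.
        return None
    if not earliest_year <= parsed.year <= latest_year:
        return None
    return parsed
```

Anything that comes back as `None` would go on the list to be fixed, either by a more specific heuristic or by hand.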

One of my scripts wasn’t properly handling the fact that all email header fields, including specifically the Date field, are allowed to be split over multiple lines. I mean, anybody who’s generating emails with multi-line Date fields is a jerk who should feel bad, but alas, it turns out there are in fact jerks of this sort, so my code had to be tweaked to accommodate them.
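Unfolding is simple enough once you remember to do it: a header line that starts with a space or tab is a continuation of the line before it. A minimal Python sketch (the function name `unfold_headers` is invented here, and this is not the code from my script):

```python
def unfold_headers(lines):
    """Join folded header lines; stop at the blank line ending the headers."""
    unfolded = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            # A blank line marks the end of the header section.
            break
        if line[0] in " \t" and unfolded:
            # Continuation: append it, keeping its leading whitespace.
            unfolded[-1] += line
        else:
            unfolded.append(line)
    return unfolded
```

Given the lines `Date: Fri,` and ` 26 Jan 2024 10:00:00 -0500`, this produces the single field `Date: Fri, 26 Jan 2024 10:00:00 -0500`.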

I have some old, old email archive files that are in Babyl format instead of Mbox format. It made sense to convert some of them during all this. In the process of doing that I discovered that the b2m conversion tool that comes with Emacs is bad (it doesn’t put a blank line before each From line, which is required by the format). I tried using the Python mailbox module instead, but I discovered that it has two significant problems: it doesn’t know that when the “full header” section of a Babyl message is empty the program is supposed to use the “visible header” to represent the full one, and it can’t cope with messages that don’t have a body. Absent these two issues I could have written a 4-line converter script using mailbox, but instead I had to write a 78-line script to work around the bugs. I submitted bug reports about the two bugs.
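For reference, the short converter that should have been enough looks roughly like the Python sketch below. The function name `babyl_to_mbox` is made up, and this naive version is subject to the two mailbox problems described above, so it is only suitable for Babyl files that don’t trigger them.

```python
import mailbox


def babyl_to_mbox(babyl_path, mbox_path):
    """Copy every message from a Babyl file into an Mbox file."""
    source = mailbox.Babyl(babyl_path, create=False)
    dest = mailbox.mbox(mbox_path)
    try:
        for message in source:
            # mailbox converts the Babyl message to Mbox form on add.
            dest.add(message)
        dest.flush()
    finally:
        dest.close()
        source.close()
```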

Once all the bugs were fixed and the utility scripts written, I had to clean up a ton of files in the archive: fixing dates, re-sorting, and reindexing them. Fun times.

On the one hand all this was kind of a pain, and I definitely had other things I needed to be doing, but on the other hand, I do derive some satisfaction from tracking down bugs and making things work better.
