I am an email pack-rat. I have saved just about every email message I’ve sent or received at work or at home in the past twenty years (excluding messages sent to public mailing lists). That’s a lot of email: over 400,000 messages.
I don’t just do this for the heck of it. I look for things in my archived email on a regular basis. Most of the time I’m only looking back a few months or perhaps a couple of years, but I do occasionally find it necessary for one reason or another to go digging through the really old stuff. To make that feasible, I need to be able to quickly search hundreds of thousands of messages.
Over the years I have imposed a lot of requirements on acceptable solutions for this problem:
- I want to be able to store the email in a compressed form, so the search engine needs to understand how to decompress the email archives when indexing or retrieving messages from them.
- To achieve decent compression, the email has to be stored in files that hold multiple messages, rather than storing each message in its own file. Therefore, the search engine needs to understand how to break up the mailbox files into separate messages, index them separately, and retrieve them separately during searches.
- I don’t want to use a proprietary or binary mailbox format — I want to be able to look at the mailboxes in a text editor and manipulate them easily with tools such as Perl. I used to store my email in BABYL Files, but now I use mbox format (which Thunderbird and Eudora also use for local folders).
- I don’t want to be locked into a GUI — I need to be able to update the index, do searches, and retrieve results through the command line.
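To make the first two requirements concrete, here is a minimal Python sketch of reading a gzip-compressed mbox file and splitting it into individual messages. It is only an illustration of the idea, not how any particular search engine does it. It assumes gzip compression, and it assumes that any body line beginning with "From " has been escaped as ">From ", so every line that starts with "From " marks a new message. The function name `iter_mbox_messages` and the sample addresses are made up for this example.

```python
import gzip
import os
import tempfile


def iter_mbox_messages(path):
    """Yield each message in a gzip-compressed mbox file as a list of lines.

    Assumption: body lines starting with "From " were escaped as ">From "
    when the mailbox was written, so an unescaped "From " always begins
    a new message.
    """
    current = []
    # latin-1 maps every byte to a character, so decoding never fails
    # on old messages with unknown or mixed encodings.
    with gzip.open(path, "rt", encoding="latin-1") as fh:
        for line in fh:
            if line.startswith("From ") and current:
                yield current
                current = []
            current.append(line)
    if current:
        yield current


if __name__ == "__main__":
    # Build a tiny two-message archive in a temporary directory and split it.
    sample = (
        "From alice@example.com Mon Jan  1 00:00:00 2001\n"
        "Subject: first\n\nhello\n\n"
        "From bob@example.com Tue Jan  2 00:00:00 2001\n"
        "Subject: second\n\nworld\n"
    )
    path = os.path.join(tempfile.mkdtemp(), "archive.mbox.gz")
    with gzip.open(path, "wt", encoding="latin-1") as out:
        out.write(sample)
    for number, message in enumerate(iter_mbox_messages(path), start=1):
        print(number, message[0].rstrip())
```

An indexer built this way can decompress on the fly and treat each yielded message as its own document, which is exactly what the multi-message, compressed-archive requirements call for.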
For many years, I was unable to find any actively maintained open-source software package that satisfied all of these requirements. I was therefore stuck using freeWAIS 0.5, one of the very first Internet search engines, which was developed and released by Thinking Machines Corporation over 20 years ago. I was an active developer on the project; my efforts were focused on making the indexing code faster and less of a disk-hog, and on fixing a myriad of bugs and memory leaks. (freeWAIS was written in C; its primary authors, all of them Lisp programmers, were so used to automatic garbage collection that they were very bad about cleaning up after themselves.) Ever since the freeWAIS project went defunct, I’ve maintained my own personal version of the code, layering hack on top of hack to keep it compiling and running on new versions of Linux. It was gross, but it was good enough.
That is, it was good enough until I upgraded my Linux box at home about a month ago and went from 32-bit Linux to 64-bit Linux. The freeWAIS code is very dependent on things like the size of an integer, the size of “time_t” and “off_t”, etc. Furthermore, to say that the code is not particularly clean or portable would be a gross understatement. When I rebuilt it for 64-bit, it stopped working. After spending several hours trying yet again, unsuccessfully, to nurse it back to health, I decided to take another look around to see whether any new search engines that would do what I need had come onto the scene.
I was delighted to discover mairix, a package written and maintained by Richard Curnow which bills itself as “a program for indexing and searching email messages stored in maildir, MH or mbox folders.” Wow, someone went and wrote exactly the tool I needed. w00t!
Well, actually, it’s not exactly the tool I needed, because when I set it up last night I discovered some minor issues with its parsing of mbox files. But I fixed those issues and sent Richard my patches, and I hope they’ll be incorporated into the next release so I don’t have to go down the maintain-my-own-software road again :-). (If you’re trying to use version 0.21 of mairix to index mbox files, email me and I’ll send you my patches.)
Mairix indexed all 400,000-plus of my email messages in 12 minutes (3.5GHz CPU, 7,200 RPM SATA hard drive). The mairix index consumes only 144MB of space, even though my email archives take up 1.1GB compressed. It takes mairix less than 0.2 seconds to do an AND search for two search terms and save the 8 matching messages into an mbox. That is simply incredible.
One of the ways Richard made mairix so fast is by using another tool he wrote, dfasyn, which he describes as “a tool for building general deterministic finite automata (DFAs) given a description as a non-deterministic finite automaton.” You use a high-level syntax to describe the various legal state transitions, and dfasyn compiles that into some big numeric arrays and a very trivial function which transitions between machine states using nothing but pointer and integer arithmetic. The nerd in me thinks this is très cool.
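For a feel of what "big numeric arrays and a very trivial function" means, here is a hand-built Python sketch of a table-driven DFA. It is not dfasyn's input syntax or its generated output; the table is filled in by hand rather than compiled from an NFA description, and every name in it is invented for illustration. The machine accepts any line that begins with "From ", the mbox message separator.

```python
# Map each interesting character to a small integer class; everything
# else falls into OTHER. (All names here are hypothetical.)
CLASSES = {"F": 0, "r": 1, "o": 2, "m": 3, " ": 4}
OTHER = 5
NUM_CLASSES = 6

ACCEPT = 5   # reached after reading "From "
DEAD = 6     # sink state: the line cannot match any more
NUM_STATES = 7

# Flat transition table: TABLE[state * NUM_CLASSES + char_class] is the
# next state. Every entry starts out as DEAD.
TABLE = [DEAD] * (NUM_STATES * NUM_CLASSES)

# States 0..4 each advance on exactly one expected character.
for state, ch in enumerate("From "):
    TABLE[state * NUM_CLASSES + CLASSES[ch]] = state + 1

# Once accepted, stay accepted whatever follows.
for char_class in range(NUM_CLASSES):
    TABLE[ACCEPT * NUM_CLASSES + char_class] = ACCEPT


def next_state(state, ch):
    """One transition: a dictionary lookup for the class, then array indexing."""
    return TABLE[state * NUM_CLASSES + CLASSES.get(ch, OTHER)]


def starts_with_from(line):
    """Run the DFA over the line and report whether it reached ACCEPT."""
    state = 0
    for ch in line:
        state = next_state(state, ch)
        if state in (ACCEPT, DEAD):
            break  # the outcome can no longer change
    return state == ACCEPT


if __name__ == "__main__":
    for text in ["From alice@example.com Mon Jan  1", ">From escaped", "Frrom "]:
        print(repr(text), starts_with_from(text))
```

The scanning loop never branches on what the pattern is; all of the pattern's structure lives in the table. That is why a generated DFA can be both fast and easy to regenerate when the grammar changes.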
I’m posting about mairix both to let other people know about this great tool, and to give Richard the kudos he deserves for implementing it.
Share and enjoy!