A case study in troubleshooting the email pipeline

By | September 9, 2017

I maintain the email infrastructure for my family. This includes:

  • an SMTP server (sendmail; I’m old-school!) for outside entities to send email into our domain;
  • an MX record indicating the host name of the SMTP server for our domain;
  • DNSBL checks of incoming SMTP connections against the Spamhaus ZEN blocklist
  • an SMTP server for people in my family to send outbound email;
  • an IMAP server (Cyrus) with mailboxes for the people in my familiy
  • several plugins (a.k.a. “milters”) hooked into the SMTP server:
    • bogofilter-milter (which I wrote) for Bayesian spam filtering using bogofilter
    • mimedefang, for generic email safety filtering
    • archivesmtp, which I use for capturing copies of my outgoing emails into my “Sent Items” IMAP folder
    • opendkim for adding DKIM signatures to outbound email
  • A script called spamitarium which is used by bogofilter-milter to clean up messages before feeding them into bogofilter.
  • procmail for user-level message filtering and final message delivery to the IMAP server (again: old-school!)
  • several scripts (described on the bogofilter-milter page linked above) which allow me to use special IMAP folders to tell bogofilter when a message was miscategorized as spam or non-spam so that it can be recategorized properly

In short, it’s complicated.

Here’s what happens when someone sends me an email message:

  1. Their SMTP server looks up the MX record for our domain to get the host name of our SMTP server.
  2. Their SMTP server connects to our SMTP server.
  3. Our SMTP server verifies that their SMTP server isn’t on the ZEN blocklist (if it is, the email is rejected).
  4. Their SMTP server transmits the message to ours.
  5. Our SMTP server feeds the message through the milters listed above, each of which has the ability to modify or reject the message as appropriate.
  6. In particular, bogofilter-milter:
    1. filters the message through spamitarium to clean it up before feeding it into bogofilter;
    2. feeds the message into bogofilter, which indicates whether it classifies the message as spam, non-spam, or unsure;
    3. adds a header to the message indicating its classification; and
    4. saves a copy of the message in a file called “notspam” or “bogospam” in my home directory, depending on whether the message was determined to be spam or not.
  7. If the message was determined to be spam, bogofilter-milter tells the SMTP server to reject it.
  8. Otherwise, our SMTP server hands off the message to procmail, which applies the filter rules specified in my .procmailrc file, and then, assuming that none of the filter rules indicate special treatment, saves the message into my IMAP Inbox.
  9. Our SMTP server informs their SMTP server that the message has been delivered successfully.

Obviously, there are a lot of steps in this workflow where something can go wrong. When that happens, it can be challenging to determine the root cause of the problem. I recently had to troubleshoot an email delivery issue which led me to eventually discover (spoiler!) that Constant Contact is sending email messages which violate the SMTP protocol [UPDATE: Nope, it wasn’t Constant Contact. See my followup blog posting]. After finally getting to the bottom of it, I thought it would be interesting to document the troubleshooting process I went through. Maybe this will be of interest to no one but me ;-), but even so, writing it down will help me spot ways to troubleshoot more efficiently in the future, and it will also help me to preserve the details of my troubleshooting which I might need again if I have a similar problem in the future.

The first hint that something is wrong

Occasionally, my spam filter starts acting up and classifying too many messages incorrectly, and I have to tune it using the “bogotune” utility that comes with bogofilter. Bogotune takes a bunch of spam and non-spam messages, analyzes them in the context of your current bogofilter word list, and spits out a set of configuration parameters you can put in your .bogofilter.cf file to optimize the accuracy of future classification.

The script that I use for running bogotune takes the bogospam and notspam archives mentioned above and feeds them into bogotune. Recently, for reasons that are not important here, I decided it was necessary to modify this script to sort the message archives in chronological order before feeding them into bogotune. I have another script which does just this, i.e., time-sorts an mbox-format email archive file. Unfortunately, when I added the code to my bogotune script to do this sorting, it started spitting out a bunch of warning message I hadn’t seen before, like this one:

Couldn't parse RFC822 date Fri, 8 Sep 2017 13:44:33 -0400 (EDT) From: "Example Name" <example@example.com>

I went and found in notspam the message that caused that particular warning, and part of its header looked like this:

Date: Fri, 8 Sep 2017 13:44:33 -0400 (EDT) From: "Example Name" <example@example.com>
From: [no-from]
Reply-To: example@example.com
To: jik@kamens.us

What you see on the first line above is not a transcription error in this blog posting. For some reason, in notspam, the line break which should have occurred at the end of the “Date:” line in the header had been replaced by a space.

Let the hunt for the root cause commence!

I read that message, and it didn’t look like that

“That’s weird,” I said to myself. “I don’t remember that message looking broken or wrong when I saw it while reading my email.” I therefore went and found the message in my IMAP Inbox in my mail client, and indeed, it looked entirely normal. Not only that, but when I viewed the header of the IMAP message, its header was correct. Here’s the same piece of the header as above, this time from the IMAP message instead of notspam:

Date: Fri, 8 Sep 2017 13:44:33 -0400 (EDT)
From: "Example Name" <example@example.com>
Reply-To: example@example.com
To: jik@kamens.us

So, in the IMAP copy of the message, there was a line break at the end of the “Date:” header, and there was no “From: [no-from]” header at all. Bizarre!

My first wrong hypothesis

At this point I needed a hypothesis for how the message ended up corrupted in notspam. I posited that perhaps the message had initially been miscategorized as spam, and so I found it in my “isspam” IMAP folder, noticed that it should not have been spam, and moved it to my “despam” folder. Then, the “spamtrain” script that runs periodically to check for miscategorized messages saw it in “despam” and did what it was supposed to do, i.e., (a) locate the message in bogospam, (b) tell bogofilter to recategorize it as non-spam, and (c) move the message from the bogospam to notspam. My hypothesis was that there was a bug in this script which caused the message to get corrupted in the process of performing those steps, such that the version of it that was saved in notspam was unfortunately corrupted from the version that had been taken out of bogospam.

I didn’t specifically remember this message having been categorized as spam, but this did not dissuade me from this hypothesis, because I get so much spam and deal with it all so quickly at this point that it’s entirely possible for me to de-spamify a message and then forget that I’d done so just a few minutes later.

What should have dissuaded me from this hypothesis, but didn’t because I wasn’t paying good enough attention, was this line at the end of the header of this message in notspam:

X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.3, milter_id=268f.59b2d703.32866855

This is the line that bogofilter-milter added to the message in step 6c of the above workflow. The fact that it says “Ham” there rather than “Spam” indicates that in fact this message was never categorized as spam, never saved into bogospam, and never moved from bogospam into notspam. In short, my hypothesis was totally wrong, and I should have known that from the available data.

But for whatever reason I didn’t realize that, so I proceeded to spend some time poring over the (somewhat gross) code in spamtrain to try to figure out where it had corrupted the message, ripping out pieces of the script and feeding input into them in isolation to try to reproduce the problem, etc. All to no avail, because spamtrain wasn’t actually the problem.

Follow the clues!

When my spamtrain theory didn’t pan out and I noticed the “X-Bogosity: Ham” header line and realized that I had been on a wild goose chase, I decided that I needed to regroup. Rather than trying to come up with a hypothesis for the root cause of the problem, I needed to first try to identify the cause of one of the symptoms of the problem, which would bring me closer to identifying the root cause.

In particular, what was up with “From: [no-from]” in the message header in notspam? Which piece of the email delivery workflow inserted that line into the message? If I could figure that out, then hopefully that would narrow down the possibilities for where I should be looking for the root cause.

I tried to answer this question by searching for the string “[no-from]” in all of the files associated with all of the steps in the delivery pipeline. I.e., I searched the sendmail binary, the bogofilter-milter source code, the spamitarium source code, the mimedefang source code, etc. No dice… I just couldn’t find that string in anything.

OK, fine, so presumably whatever inserted “From: [no-from]” into the message header did so because it believed that the message was missing a “From:” line. All I needed to do, then, was feed a message without a “From:” line into the various steps of the workflow and see which of them inserted the “[no-from]” line.

Well, that’s easier said than done, since some of the steps of the workflow are much easier to invoke in isolation than others. It made sense to do the easy ones first, and fortunately, I got lucky. Lo and behold, when I fed a message with no “From:” line into spamitarium, the cleaned-up message it spit out had “From: [no-from]” in it. Progress!

One of these things is just like the others

At this point it occurred to me that a useful thing to do would be to look in notspam for the other messages with the same problem as this one and see if I could identify any commonalities between them which might further narrow down the scope of my search for the root cause.

I was not disappointed. It turns out that every single one of the messages in notspam with “Date:” and “From:” merged together into a single header line came from Constant Contact. Furthermore, the affected messages were sent by three different Constant Contact customers, making it far more likely that this is a problem with Constant Contact’s code than human error by their customers.

But what could Constant Contact be doing to cause incoming messages to have concatenated Date: and From: headers in notspam but not in the copies of the messages in IMAP? [UPDATE: The Constant Contact messages were indeed formatted in a way that was triggering the problem, but they weren’t formatted incorrectly. The problem was on my end, as described here. Just to be clear: it wasn’t Constant Contact’s fault! We now return you to the original narrative already in progress…] I needed another hypothesis.

My second (right!) hypothesis

[UPDATE: Nope, this hypothesis wasn’t right either. It just so happens that the behavior of this hypothesis mirrored the behavior of the actual root cause of the problem, which I’ve documented in my followup blog posting.]

It helps when debugging issues like this to have pretty deep knowledge and experience about email protocols and how applications that process email tend to behave. Drawing on my knowledge and experience, I reasoned as follows:

  • The SMTP protocol requires transmitted lines of data to be terminated with CRLF.
  • It is not uncommon for people to write software that incorrectly terminates some lines of data with just CR or just LF.
  • If a header line in a message being transmitted via SMTP is separated from the next header line with just CR or LF rather than CRLF, it’s possible that something in the email delivery pipeline will interpret the second header line as a continuation of the first one.
  • It’s also possible — likely, in fact — that something later in the pipeline will fix the bad line terminators, such that by the time the message ends up in my IMAP Inbox, it will be formatted properly.

My hypothesis: Constant Contact is sending some email messages with Date: headers ending in just CR or LF rather than CRLF. [UPDATE: Wrong! See above.]

This hypothesis is consistent with the fact that I was unable to reproduce the problem in any of my earlier troubleshooting and testing, since I was using proper line endings in all of the test messages and input I was using. Therefore, to prove or disprove my hypothesis, all I had to do was send messages through my mail server with adjacent “Date:” and “From:” headers and with the “Date:” header ending in just CR or LF rather than CRLF.

This is easier said than done, since all of the ready-made software one would typically use to easily send test messages is smart enough to ensure that line terminators are correct, as well it should. So I had to write my own script to generate the test message:

#!/usr/bin/env python

from email.utils import formatdate, make_msgid
import re
import socket
import sys
import time

msg_template = """\
Date: {date}{term}From: Jonathan Kamens <jik@kamens.us>
To: Jonathan Kamens <jik@kamens.us>
Message-ID: {msgid}
Subject: Test message with {which} terminator

This is a test message sent over a socket from a Python script.
""".replace('\n', '\r\n')

def r(s):
    m = s.recv(1024).strip()
    print(m)

def doit(template, terminator, description):
    date = formatdate(time.time())
    msgid = make_msgid()
    msg = msg_template.format(date=date, msgid=msgid, term=terminator,
                              which=description)

    s = socket.create_connection(('email.kamens.us', 25))
    r(s)
    s.send('MAIL FROM:<jik@kamens.us>\r\n')
    r(s)
    s.send('RCPT TO:<jik@kamens.us>\r\n')
    r(s)
    s.send('DATA\r\n')
    r(s)
    s.send(msg)
    s.send('.\r\n')
    r(s)
    s.close()

doit(msg_template, '\r\n', 'CRLF')
doit(msg_template, '\r', 'CR')
doit(msg_template, '\n', 'LF')

And here are the relevant bits of what ended up in notspam for these three messages:

Date: Fri, 08 Sep 2017 21:30:13 -0000
From: Jonathan Kamens <jik@kamens.us>
To: Jonathan Kamens <jik@kamens.us>
Subject: Test message with CRLF terminator
...
Date: Fri, 08 Sep 2017 21:40:14 -0000 From: Jonathan Kamens <jik@kamens.us>
From: [no-from]
To: Jonathan Kamens <jik@kamens.us>
Subject: Test message with CR terminator
...
Date: Fri, 08 Sep 2017 21:40:15 -0000
From: Jonathan Kamens <jik@kamens.us>
To: Jonathan Kamens <jik@kamens.us>
Subject: Test message with LF terminator

Success! I was able to reproduce the problem by sending a message through my mail server whose “Date:” header was terminated by CR rather than CRLF.

So, what should I do about it?

At this point I knew that Constant Contact was violating Internet RFC 5321 [UPDATE: No, it wasn’t! See above], which specifies SMTP (Simple Mail Transfer Protocol). In particular, section 2.3.8 reads in part:

Lines consist of zero or more data characters terminated by the sequence ASCII character “CR” (hex value 0D) followed immediately by ASCII character “LF” (hex value 0A). This termination sequence is denoted as <CRLF> in this document. Conforming implementations MUST NOT recognize or generate any other character or character sequence as a line terminator.

I decided to try to let the people at Constant Contact know that their software is generating broken email messages so that they can fix it. However, as I write this blog posting, I haven’t done that yet. I decided to wait until after I’ve published the blog posting, so that I can email a link to it to Constant Contact to explain fully what led me to conclude that their software is broken.

Plugging the leak

[UPDATE: The real fix is described here.]

In the meantime, what could I do now — not waiting for me to successfully contact Constant Contact, and not waiting for them to get around to fixing the problem — to fix my infrastructure to stop corrupted messages from ending up in my notspam and bogospam archives?

Because any change I make to my infrastructure to fix this would be, technically, a violation of RFC 5321, I wanted to make the minimal change necessary to solve this particular problem. Therefore, I decided to modify spamitarium to detect Date: headers ending in CR rather than CRLF and add the missing LF. I did that and tested the change to spamitarium successfully in isolation, but when I then reran the test script shown above, the CR message in notspam was still corrupted.

Next, I modified the script that calls spamitarium to save a copy of its input and output, which enabled me to discover that by the time spamitarium gets the message from bogofilter-milter, the CRLFs in it have already been converted to LF, but my changes to spamitarium were assuming CRLF line endings. I enhanced my changes to make them work with either CRLF or LF line endings, then tested spamitarium successfully in isolation and end-to-end, thus confirming that the problem was fixed.

In conclusion

Here, have some cute cats as a reward for reading this far.

Share

One thought on “A case study in troubleshooting the email pipeline

  1. Pingback: A case study in troubleshooting the email pipeline, chapter 2 – Something better to do

Leave a Reply

Your email address will not be published. Required fields are marked *