Python fails Postel’s Law parsing email messages, with a workaround

Jon Postel, one of the early architects of TCP, coined what is now known as Postel’s Law: “Be liberal in what you accept and conservative in what you generate.” In other words, computer software should make every effort to accept input that can be accurately understood, even if it is not fully compliant, while at the same time it should be careful to generate only compliant output for others’ consumption. Today I ran into an example of Python failing at this when parsing MIME email messages. Here’s what I learned, and how I fixed it.

I have a Python script which looks for large email messages in an IMAP folder; saves the large attachments in those messages to the filesystem; and rewrites the messages on the IMAP server to replace the large attachments with plain-text stubs that say where they were saved. The goal of doing this is twofold: (1) I don’t want these attachments taking up space on every computer I sync my mail to; (2) my mail server is a Linode, where disk space is relatively expensive, and I don’t want to pay to store years’ worth of large email attachments there.

I ran the script today and noticed that there were several large messages that it was failing to save attachments from. Further investigation revealed that they all had one thing in common: the file names associated with these attachments had an apostrophe in them and were not quoted properly in the Content-Disposition and Content-Type headers. For example:

Content-Type: application/pdf;
  name=Jon's document.pdf
Content-Disposition: attachment;
  filename=Jon's document.pdf;
  size=3304291

I’ve been using this script for many years so it was surprising to me that I had not noticed this problem before. It turns out that a couple of years ago I changed my Python script from using the default, Python-3.2-compatible email message parser to the newer one added to Python in 3.3. I made this change because the new parser handles encoding and folding of filenames automatically, so my script wouldn’t have to, but in fixing that problem I introduced the one this post is about.
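To see the difference between the two parsers side by side, a minimal sketch like the following can help. It feeds the sample headers from above to both `email.policy.compat32` and `email.policy.default` and prints what `get_filename()` returns under each. The message body and the variable names are made up for illustration, and the exact result under the newer policy may depend on your Python version, so run it rather than trusting any particular output.

```python
import email
import email.policy

# The malformed headers from the example above, plus a placeholder body.
RAW = (
    "Content-Type: application/pdf;\n"
    "  name=Jon's document.pdf\n"
    "Content-Disposition: attachment;\n"
    "  filename=Jon's document.pdf;\n"
    "  size=3304291\n"
    "\n"
    "placeholder body\n"
)

# Old-style parsing (the Python-3.2-compatible behavior).
legacy = email.message_from_string(RAW, policy=email.policy.compat32)
# Newer header-registry-based parsing.
modern = email.message_from_string(RAW, policy=email.policy.default)

legacy_name = legacy.get_filename()
modern_name = modern.get_filename()

print("compat32:", repr(legacy_name))
print("default: ", repr(modern_name))
```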

Let’s be clear here: the headers shown above are invalid MIME. And yet, there’s software out there generating headers like that, and I can’t really do anything about it, so I have to do my best to accommodate it. Postel’s Law in action.

Here’s how I ended up fixing the problem after diving deep into the bowels of Python email message parsing to understand how it works and how its behavior can be tweaked:

# I cried when I wrote this code.

# Some email generators allow apostrophes in the "name" Content-Type parameter
# or the "filename" Content-Disposition parameter without quoting it. The
# email.policy.compat32 policy parses these just fine but the newer parser
# based on the header registry doesn't; it simply loses the parameter and
# refuses to return it, so we don't get back a file name for these attachments.

# I don't want to just fall back on the old parser because it is presumably
# worse in other ways and may eventually be deprecated, so I need to patch the
# header-registry-based parser to repair these headers before they are parsed.

# It took me more than three hours to figure out exactly what was going wrong
# here and come up with this solution, hence the crying (for that, and for the
# fact that this is a gross solution which should not be necessary because
# people shouldn't be generating non-compliant emails).

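import email.parser
import email.policy
import re
from email.headerregistry import (ContentDispositionHeader, ContentTypeHeader,
                                  HeaderRegistry)


class MyContentDispositionHeader(ContentDispositionHeader):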
    @classmethod
    def parse(cls, value, kwds):
        new_value = re.sub(r'filename=([^;\"\']+\'[^;\"\']+)(;|$)',
                           r'filename="\1"\2', value)
        return super().parse(new_value, kwds)


class MyContentTypeHeader(ContentTypeHeader):
    @classmethod
    def parse(cls, value, kwds):
        new_value = re.sub(r'name=([^;\"\']+\'[^;\"\']+)(;|$)',
                           r'name="\1"\2', value)
        return super().parse(new_value, kwds)


class MyHeaderRegistry(HeaderRegistry):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.registry['content-disposition'] = MyContentDispositionHeader
        self.registry['content-type'] = MyContentTypeHeader


def do_message(args, imap, uid):
    policy = email.policy.EmailPolicy(header_factory=MyHeaderRegistry())
    parser = email.parser.BytesParser(policy=policy)
    # etc.

I wanted to be very conservative here, so the modified behavior of my script is extremely limited. The code above overrides the classes that parse Content-Type and Content-Disposition header values, replacing them with subclasses which check for filenames that are not quoted and contain a single apostrophe. When one is found, the subclass wraps the filename in quotation marks so that the actual parser can handle it without error.
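To illustrate how narrow the rewrite is, here is a small standalone sketch of the same regular expression applied to a few header values. The helper name `quote_single_apostrophe` and the sample values are hypothetical; only the pattern itself comes from the code above.

```python
import re


def quote_single_apostrophe(param, header_value):
    """Quote an unquoted `param` value that contains exactly one apostrophe.

    Hypothetical helper: it wraps the same pattern used in the subclasses
    above so that it can be tried on plain strings.
    """
    pattern = re.escape(param) + r'=([^;"\']+\'[^;"\']+)(;|$)'
    return re.sub(pattern, param + r'="\1"\2', header_value)


samples = [
    # Unquoted, one apostrophe: this is the only case that gets rewritten.
    "attachment; filename=Jon's document.pdf; size=3304291",
    # Already quoted: left alone.
    'attachment; filename="Jon\'s document.pdf"',
    # No apostrophe: left alone.
    "attachment; filename=report.pdf",
    # More than one apostrophe: left alone.
    "attachment; filename=Jon's 'final' draft.pdf",
]

for sample in samples:
    print(quote_single_apostrophe("filename", sample))
```

Only the first sample changes. Anything already quoted, or with zero or several apostrophes, passes through untouched, which is the conservative behavior described above.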

I will monitor the behavior of the script moving forward and if there are additional parsing issues I will adjust the code to make the minimum changes necessary to accommodate those as they occur.

While addressing this, I could also have changed the behavior of the parser to raise an exception when it encounters invalid MIME input instead of silently ignoring it, but I decided not to. That could catch other issues that I don’t care about, not just the one described above, and I decided this script isn’t high-stakes enough to be worth dealing with that.
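For anyone who does want the stricter behavior, one possible approach is to check headers after parsing rather than change the parser itself. Header objects produced by the header-registry-based policies carry a `defects` attribute listing the problems found during parsing. The sketch below raises when that list is non-empty. The names `HeaderDefectError` and `require_clean_headers` and the sample messages are hypothetical, and this is not what my script does.

```python
import email
import email.policy


class HeaderDefectError(ValueError):
    """Hypothetical exception raised when a header was parsed with defects."""


def require_clean_headers(msg, names=("Content-Type", "Content-Disposition")):
    """Raise HeaderDefectError if any of the named headers recorded defects."""
    for name in names:
        header = msg[name]
        if header is None:
            continue  # Header not present in this message.
        defects = getattr(header, "defects", ())
        if defects:
            raise HeaderDefectError(f"{name}: {defects!r}")


CLEAN = (
    'Content-Type: application/pdf; name="report.pdf"\n'
    'Content-Disposition: attachment; filename="report.pdf"\n'
    "\n"
    "placeholder body\n"
)
SUSPECT = (
    "Content-Disposition: attachment;\n"
    "  filename=Jon's document.pdf\n"
    "\n"
    "placeholder body\n"
)

for raw in (CLEAN, SUSPECT):
    msg = email.message_from_string(raw, policy=email.policy.default)
    try:
        require_clean_headers(msg)
        print("no defects recorded")
    except HeaderDefectError as exc:
        print("defects recorded:", exc)
```

As noted above, a check like this would also flag problems unrelated to apostrophes, so it trades convenience for strictness.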

I contemplated whether I should report this issue to the maintainers of the Python email parsing code, and I decided not to. There’s a good chance they’re not believers in Postel’s Law and I don’t have the time or energy to get into an argument about whether it’s more important for library code to strictly adhere to the relevant standards or to successfully handle the data that’s out there flitting around in the real world.
