Yet another night debugging email delivery problems

I’ve run my own mail server for 30+ years. It’s a pain sometimes, but I’m a stubborn old cuss and I think it’s worth it both because I value my privacy and don’t want my emails being stored on somebody else’s servers, and because I’m a sysadmin at heart and I love a good sysadmin challenge, which running a mail server definitely is.

Every once in a while the corporate email service providers come up with a new way to screw things up for small mail server operators like me. Most recently, both Proofpoint and Cisco IronPort SMTP servers started having trouble delivering emails to my server. Maybe you’ll find the explanation of why amusing, or at least maybe you’ll find the bits below about how to get sendmail to accept more TLS ciphers useful.

The behavior I was seeing was that whenever someone using Proofpoint or IronPort for their outbound email delivery attempted to send me email, it would be delayed for many hours before finally being delivered. I started digging into my sendmail logs and saw messages like this every hour:

Nov 17 00:33:33 jik4 sendmail[372431]: STARTTLS=server, error: accept failed=-1, reason=no shared cipher, SSL_error=1, errno=0, retry=-1, relay=esa12.hc6077-55.iphmx.com [139.138.46.199]
Nov 17 00:33:33 jik4 sendmail[372431]: 4AH5XXPf372431: esa12.hc6077-55.iphmx.com [139.138.46.199] did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA

My mail server runs on CentOS Stream 9, which is pretty strict about which TLS ciphers its servers accept by default. Therefore, my first theory about what was causing the delivery issues was that Proofpoint’s and IronPort’s servers are running old SSL libraries that don’t support modern, secure ciphers.

However, this raises an obvious question: if that’s true, then why were these messages eventually being delivered at all. Shouldn’t the servers just keep trying and failing to connect with their old TLS ciphers and then eventually give up? I.e., shouldn’t delivery just fail completely, rather than being delayed?

I hand-waved away this question by observing that either of the following might be true:

Some of their servers run more modern code, so eventually one of those servers attempts to deliver the message and it works.
They have a fallback built into their delivery pipeline, such that if a certain number of delivery attempts with TLS fail, they fall back on delivering without TLS.

Neither of these is true, and I shouldn’t have hand-waved away this question, but I’ll get to that later since I’m telling the debugging story in chronological order.

Since at this point my theory was that the problem was that Proofpoint and IronPort servers didn’t support modern SSL ciphers, I decided to try figuring out how to get my sendmail to accept a larger list of SSL ciphers. I have attempted to do this in the past and failed, but this time I decided to be more persistent.

The first problem I had to solve was how to identify which TLS ciphers were supported by the running sendmail process, so I could easily test whether my attempts to make it accept more were working. I dig around a bit online and found the tool “sslscan” which works great for this purpose: “sslscan –starttls-smtp hostname:25″. Sslscan is conveniently available for installation from Debian’s repositories.

First attempt at expanding the list of accepted ciphers: I captured the list of supported ciphers with sslscan, ran “update-crypto-policies --set LEGACY“, restarted sendmail, and tested again with sslscan. No change, unfortunately.

Next, I said to myself, “Hmm, is there some way to explicitly tell sendmail in its configuration file to accept more ciphers.” After consulting /usr/share/sendmail-cf/README and the openssl-ciphers man page, I concluded that adding “define(`confCIPHER_LIST', `ALL')dnl” to the M4 source file for my sendmail.cf, then rebuilding and reinstalling it and restarting sendmail, might yield a bigger list of supported ciphers. And indeed, sslscan confirmed that this added 5 additional supported ciphers to the original list of 12.

However, those ciphers looked mostly like the old ones, so I was pretty sure I had to go further. Next I tried using “@SECLIST=1:ALL” as the cipher list instead of just “ALL“. I suspected that this might make a difference because /etc/crypto-policies/back-ends/openssl.config starts with “@SECLIST=2:“. Alas, it didn’t help.

I tried again with “@SECLIST=0” instead of “@SECLIST=1“, and the added an additional 14 ciphers. Some of these looked quite different, so maybe this would be enough to get Proofpoint and/or IronPort to be able to deliver email? I went back to one of the websites where I knew how to trigger an email message sent to me through Proofpoint, and tried it out. No luck, still “no shared cipher” in the log.

At this point I needed another solution, so I asked myself whether I might have been wrong to hand-wave away the question of why the email messages were eventually being delivered successfully. Just allowing my brain that small amount of space to question my priors caused me to have a revelation. It suddenly occurred to me that there were other log messages related to Proofpoint and IronPort delivery fails that I had been dismissing as irrelevant which in fact where not irrelevant at all. They looked like this:

Nov 17 00:34:34 jik4 sendmail[372437]: 4AH5XXIZ372437: timeout waiting for input from esa12.hc6077-55.iphmx.com during server cmd read
Nov 17 00:34:34 jik4 sendmail[372437]: 4AH5XXIZ372437: lost input channel from esa12.hc6077-55.iphmx.com [139.138.46.199] to MTA after rcpt
Nov 17 14:33:44 jik4 sendmail[614551]: 4AHJH3eD614551: collect: premature EOM: Connection timed out with mx0a-001e6701.pphosted.com
Nov 17 19:15:28 jik4 sendmail[701129]: 4AI0ESSZ701129: timeout waiting for input from mx0a-001e6701.pphosted.com during server cmd read
Nov 18 06:58:18 jik4 sendmail[908302]: 4AIBvHH3908302: lost input channel from mx0a-001e6701.pphosted.com [148.163.149.215] to MTA after mail

When I first saw these messages, I just assumed they indicated that Proofpoint’s and IronPort’s servers were being stupid about something. That was a mistake. In fact, it was my server that was being stupid.

I use fail2ban configured on my server, and I have it configured to aggressively block suspicious SMTP connections. It blocks a server for an hour after a single suspicious log message from that server, and the “did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA” messages shown above, that are generated after Proofpoint or IronPort attempts and fails to initiate TLS and then disconnects, count as suspicious.

So while the fact that Proofpoint and IronPort don’t support modern TLS ciphers was the root cause of the issue, my server was exacerbating the problem by banning their IP address immediately after they tried to do TLS, so that they were then unable to reconnect and deliver the message without TLS.

The reason why the messages from these servers were eventually being delivered is that there’s sometimes a 1–2 second delay between when a suspicious log message appears and when fail2ban bans the associated server IP, and after trying over and over eventually the Proofpoint or IronPort managed to sneak in a successful message delivery during that short delay.

The solution, therefore, is to make fail2ban less aggressive about banning suspicious SMTP servers. However, if I do that, then my server is much more vulnerable to brute-force SMTP authentication attacks? What to do, what to do?

Maybe there’s another suspicious log message I can match against in the logs instead of the “did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA” messages which are caused innocuously when a TLS connection fails? With just a little bit of digging I found these:

Nov 21 23:07:50 jik4 sendmail[58510]: 4AM47hXh058510: AUTH failure (LOGIN): authentication failure (-13) SASL(-13): authentication failure: checkpass failed, user=postmaster, relay=[109.172.142.229]

Wait a minute, why the heck isn’t fail2ban banning based on this message which is obviously, unambiguously a brute-force login attempt?

Well, you see, it turns out that I originally configured fail2ban to ban based on the “did not issue MAIL/EXPN/VRFY/ETRN during connection to MTA” messages way back in 2007 (like I said, I’ve been running my own mail server for a long time), and back in 2007, the “AUTH failure” log messages did note have the “relay=” at the end of them. That wasn’t added until release sendmail 8.14.4, which was released in December 2009 (I figured out which release this was added in by doing a binary search through all of the old sendmail source archives on ftp.sendmail.org).

Putting this all together:

One root cause of the incoming email delivery issues is Proofpoint and IronPort being stupid about which TLS ciphers they support.
A second root cause is my fail2ban configuration being too aggressive about banning suspicious SMTP servers, resulting in servers being banned for engaging in innocuous activities.
Configuring sendmail to accept more TLS ciphers is the wrong solution.
The right solution, which I’ve implemented, is to make fail2ban more selective about which log messages it considers suspicious, so it won’t ban the Proofpoint and IronPort servers for failing to initiate a TLS connection.

See, now, this is actually one of the reasons I love running my own mail server: there is immense satisfaction in getting to the bottom of one of these problems and successfully solving it.

Yet another night debugging email delivery problems

Related

Leave a Reply Cancel reply

Share this:

Related

Leave a Reply Cancel reply

Discover more from Something better to do