You’re validating email addresses all wrong!

VP of Technology, Opreto


Ever tried entering your email address, only to receive a “Please enter a valid email address” error? Frustrating, right?

There’s a lot more to email validation than just spotting an ‘@’ and a ‘.com’. Dive into the world of RFCs where email addresses are not just strings, but a complex interplay of rules and standards. From the subtle nuances of RFC 5321 and 5322 to the curious world of quoted strings and dot-atoms, we’re tackling the real challenge of validating emails. Because let’s face it, nobody wants to be told their perfectly good email is “invalid” – especially not by a machine.

Imagine you’re tasked with building a form to collect email addresses for a client. Simple enough, right? At first glance, it appears trivial, but the real challenge lies beneath the surface: ensuring the validity of these email addresses. This goes beyond the basic checks for an ‘@’ symbol or a domain suffix. It delves into the intricacies of what constitutes a valid email address according to established standards and protocols. Understanding this complexity is key to accurate and effective email validation.

Let me share a personal experience. Gmail allows you to append a + sign followed by alphanumeric characters to your email username. It’s a nifty trick for creating custom filters and tracking who shares your information. I use this feature frequently, but not without hiccups. A few years back, while signing up at my local gym, I encountered an error message stating my email was invalid. Surprised, I reached out to their support team, who escalated the issue to their development team. I had a direct email exchange with the application’s developer, who insisted that my email address wasn’t valid because it contained a + character, and promptly closed the ticket.

Understanding RFCs

RFCs, or Requests for Comments, are documents that define standards for various Internet systems and infrastructure. For email address validation, two key RFCs are prominent: RFC 5321 and RFC 5322.

RFC 5321 covers the Simple Mail Transfer Protocol (SMTP), the protocol used for sending emails. It specifies the syntax for email addresses in SMTP transactions. It focuses on the operational aspects of email addresses, how they’re used in sending emails, and the response codes related to address validation.

RFC 5322 is about the Internet Message Format, detailing the syntax of email messages themselves. It obsoletes RFC 2822, and the changes are mostly corrections and clarifications to the message syntax and structure, with a few constructs (such as control characters inside quoted strings) moved into the obsolete (obs-) syntax. Support for international characters in email headers arrived later, in RFC 6532. This steady revision reflects the growing diversity and global nature of email communication.

The Anatomy of an Email Address

Section 3.4.1 of RFC 5322 provides the formal specification for email addresses as follows:

   addr-spec       =   local-part "@" domain

   local-part      =   dot-atom / quoted-string / obs-local-part

   domain          =   dot-atom / domain-literal / obs-domain

   domain-literal  =   [CFWS] "[" *([FWS] dtext) [FWS] "]" [CFWS]

   dtext           =   %d33-90 /          ; Printable US-ASCII
                       %d94-126 /         ;  characters not including
                       obs-dtext          ;  "[", "]", or "\"

RFC 5322 defines the local-part of an email address as dot-atom / quoted-string / obs-local-part. This notation indicates that the local-part takes one of three formats:

  1. Dot-Atom:

    • The dot-atom format consists of alphanumeric characters and certain special characters that are not enclosed in quotes.
    • The special characters allowed in this format include: ! # $ % & ' * + - / = ? ^ _ ` { | } ~
    • These characters can be combined with periods (.), but a period cannot be the first or last character, and two periods cannot appear consecutively.
    • This format is typically what most people recognize as the standard part of an email address, like example.part in example.part@example.com.
  2. Quoted-String:

    • The quoted-string format is enclosed in double quotes (" ").
    • This format allows a wider range of characters, including spaces, tabs, and characters that might otherwise be interpreted as controls or separators.
    • Within a quoted string, characters can be escaped with a backslash (\). For example, "abc\"def" is a valid quoted-string.
  3. Obs-Local-Part:

    • obs-local-part refers to the obsolete local-part format from earlier specifications (like RFC 2822 and RFC 822).
    • This format is included in RFC 5322 for backward compatibility and to accommodate email addresses that were valid under the older standards.
    • It includes some formats that are no longer recommended, but might still be encountered in practice. For example, addresses that mix quoted strings and unquoted atoms separated by dots (such as "john".doe), or that put whitespace or comments around those dots.

In practical terms, when implementing email validation, most modern systems focus on the dot-atom and quoted-string formats, as they are more commonly used and align with current email standards. The obs-local-part is typically supported for backward compatibility but is not recommended for new email addresses.

These formats are documented in RFC 5322, Section 3.2 (Lexical Tokens).
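To make the first two formats concrete, here is a minimal Python sketch that classifies a local part as a dot-atom or a quoted-string. The names ATEXT, DOT_ATOM, QUOTED_STRING, and classify_local_part are my own for illustration, and the sketch deliberately ignores comments, folding whitespace beyond a plain space or tab, and the obsolete forms.

```python
import re

# atext: letters, digits, and the special characters listed above (assumed names).
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# dot-atom: runs of atext separated by single dots, no leading or trailing dot.
DOT_ATOM = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")

# quoted-string: printable characters except " and \, or a backslash-escaped pair.
QUOTED_STRING = re.compile(r'"(?:[ \t\x21\x23-\x5b\x5d-\x7e]|\\[ \t\x21-\x7e])*"')


def classify_local_part(local: str) -> str:
    """Return 'dot-atom', 'quoted-string', or 'neither' for a local part."""
    if DOT_ATOM.fullmatch(local):
        return "dot-atom"
    if QUOTED_STRING.fullmatch(local):
        return "quoted-string"
    return "neither"


for sample in ["example.part", "user+tag", '"abc\\"def"', ".leading", "a..b"]:
    print(f"{sample!r}: {classify_local_part(sample)}")
```

Note that user+tag lands squarely in the dot-atom bucket, which is exactly the case my gym's sign-up form rejected.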

Best Practices in Email Validation

Given the complexity and evolving nature of email formats, this is one use case where I consider it prudent to rely on third-party libraries that adhere to RFC standards for email validation. Libraries in languages like Python, Java, and JavaScript often offer RFC-compliant validation methods. Choosing well-maintained and regularly updated libraries is vital, because the standards keep evolving: a published RFC is never edited, but newer RFCs update or obsolete it, as RFC 5322 did for RFC 2822.
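Which library to pick depends on your ecosystem, so I won't prescribe one. As a dependency-free illustration of delegating to a maintained parser, here is a sketch built on the Address class from Python's standard email.headerregistry module. The wrapper parse_addr_spec is a hypothetical helper of mine, and how strict the underlying parser is may differ between Python versions, so treat this as a starting point rather than a verdict on validity.

```python
from email.errors import HeaderParseError
from email.headerregistry import Address


def parse_addr_spec(text: str):
    """Return (local_part, domain), or None if the stdlib parser objects.

    Hypothetical helper: it only wraps Address(addr_spec=...) and
    treats any reported parse problem as "not an address".
    """
    try:
        addr = Address(addr_spec=text)
    except (ValueError, HeaderParseError):
        return None
    return addr.username, addr.domain


print(parse_addr_spec("jane+gym@example.com"))
print(parse_addr_spec("not-an-email"))
```

The point of the design is that your code never sees the grammar; it only asks the parser for the two parts and handles a refusal.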

If you don’t want to rely on third-party libraries for security or regulatory compliance reasons, try to understand the RFC and design a set of validation rules that comply with the established standard.

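Here’s a regular expression that captures most of the spec: the dot-atom and quoted-string forms of the local part (including the obsolete control characters), followed by a dot-separated domain name. It does not attempt comments, folding whitespace, or domain literals: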

(?:[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?

Note: Python supports a verbose mode for regular expressions, and CoffeeScript also natively supports multi-line regular expressions with comments, which results in far cleaner and more usable complex regular expressions. I really wish more languages supported that feature.
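As a sketch of what that looks like, here is the same expression written with Python's re.VERBOSE flag. The name EMAIL_RE and the comments are mine; the pattern itself is the one above, only spread across lines. In verbose mode whitespace outside character classes is ignored, and a # only starts a comment when it is outside a character class, so the # inside the brackets is still a literal.

```python
import re

EMAIL_RE = re.compile(
    r"""
    (?:
        # dot-atom: runs of allowed characters separated by single dots
        [a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+
        (?: \. [a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+ )*
      |
        # quoted-string: plain characters or backslash-escaped pairs in double quotes
        "
        (?: [\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]
          | \\ [\x01-\x09\x0b\x0c\x0e-\x7f]
        )*
        "
    )
    @
    (?: [a-zA-Z0-9] (?: [a-zA-Z0-9-]* [a-zA-Z0-9] )? \. )+   # labels, each followed by a dot
    [a-zA-Z0-9] (?: [a-zA-Z0-9-]* [a-zA-Z0-9] )?             # final label
    """,
    re.VERBOSE,
)

# The pattern has no anchors, so fullmatch is used to test a whole string.
print(bool(EMAIL_RE.fullmatch("jane+gym@example.com")))
print(bool(EMAIL_RE.fullmatch("jane..doe@example.com")))
```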

Here are the components of the regular expression:

  1. Local Part:

    • (?:[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*|
      • This section matches the local part of the email address.
      • It includes alphanumeric characters and special characters (!#$%&'*+/=?^_`{|}~-).
      • The + after the character set [] means one or more occurrences of these characters.
      • (?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)* allows for dot-separated parts in the local part, ensuring dots are not at the beginning or end and not consecutive.
  2. Quoted Strings:

    • "(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*"
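      • This part allows for quoted strings, which can carry special characters and, when escaped, whitespace.
      • The [\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f] class covers control characters and printable characters, leaving out tab, line feed, carriage return, space, the double quote (\x22), and the backslash (\x5c).
      • \\[\x01-\x09\x0b\x0c\x0e-\x7f] permits escaped characters: a backslash followed by almost any ASCII character, which is how a space, a double quote, or a backslash gets inside the quotes.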
  3. Domain:

    • @(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?
      • This section matches the domain part of the email address.
      • It starts with @, following the standard email format.
      • Each domain label must start and end with an alphanumeric character. Hyphens are allowed inside a label, even consecutively, but not at its beginning or end.
      • The domain allows for subdomains, separated by dots, with every label following the same rules. At least one dot is required, so a single-label host such as localhost does not match.

The expression is cryptic and accepts email addresses that most SMTP servers will fail to parse. For instance, control characters (ASCII codes 0-31 and 127) in email addresses are generally not recommended and often unsupported in practice. RFC 5322 technically allows them only through its obsolete syntax (within quoted strings or when escaped), and practical support for such addresses varies widely across email systems.
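To see the problem concretely, the following sketch isolates the quoted-string portion of the expression and feeds it local parts containing control characters. QUOTED_FULL is just my name for that fragment; nothing else is assumed.

```python
import re

# Quoted-string fragment of the full expression, control characters included.
QUOTED_FULL = re.compile(
    r'"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]'
    r'|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*"'
)

bell = '"ring\x07bell"'    # contains BEL (ASCII 7)
delete = '"gone\x7f"'      # contains DEL (ASCII 127)

# Both are accepted as local parts, although few mail systems would cope.
print(bool(QUOTED_FULL.fullmatch(bell)))
print(bool(QUOTED_FULL.fullmatch(delete)))
```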

If we exclude control characters, we get a simpler regular expression whose validation logic is suitable for most modern applications:

(?:[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x20\x21\x23-\x5b\x5d-\x7e]|\\[\x20-\x7e])*")@(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?

Here’s the breakdown:

  1. Local Part (Unquoted):

    • [a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]+)*
      • Matches one or more alphanumeric characters or special characters listed.
      • Dots (.) are allowed but cannot be the first or last character, and cannot appear consecutively.
  2. Local Part (Quoted):

    • "(?:[\x20\x21\x23-\x5b\x5d-\x7e]|\\[\x20-\x7e])*"
      • Matches a sequence of characters within double quotes.
      • Includes the printable ASCII characters (\x20-\x7E) except the double quote and the backslash, which must be escaped with a backslash (\\) to appear inside the quotes.
  3. Domain:

    • @(?:[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.)+[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?
      • Matches a domain consisting of alphanumeric characters and hyphens, separated by dots.
      • Each domain label must start and end with an alphanumeric character.

This version of the regex is more practical for typical email address validation scenarios. It omits support for control characters, focusing instead on the characters and formats most commonly used in email addresses. Remember, though, that while this regex is comprehensive, it may still not cover every edge case defined in RFC 5322, and for critical applications, using a well-maintained library is recommended.
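One practical detail: a pattern like this carries no ^ or $ anchors, so how you apply it matters. The sketch below assembles a printable-only pattern from named pieces and tests whole strings with fullmatch. ATEXT, LABEL, QUOTED, SIMPLE_EMAIL_RE, and is_plausible_email are illustrative names of mine, and the quoted-string class is written as explicit ranges that leave out the bare double quote and backslash.

```python
import re

ATEXT = r"[a-zA-Z0-9!#$%&'*+/=?^_`{|}~-]"
LABEL = r"[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?"
# Printable ASCII only; " and \ must be backslash-escaped inside the quotes.
QUOTED = r'"(?:[\x20\x21\x23-\x5b\x5d-\x7e]|\\[\x20-\x7e])*"'

SIMPLE_EMAIL_RE = re.compile(
    rf"(?:{ATEXT}+(?:\.{ATEXT}+)*|{QUOTED})@(?:{LABEL}\.)+{LABEL}"
)


def is_plausible_email(address: str) -> bool:
    """True if the whole string matches the simplified pattern."""
    return SIMPLE_EMAIL_RE.fullmatch(address) is not None


for candidate in [
    "jane+gym@example.com",
    '"jane doe"@example.com',
    "jane..doe@example.com",
    "jane@-example.com",
    "see jane@example.com now",
]:
    print(f"{candidate!r}: {is_plausible_email(candidate)}")
```

Using search instead of fullmatch would happily find jane@example.com inside the last candidate, which is fine for extracting addresses from text but wrong for validating a form field.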

Fin

As we’ve explored the complexities of validating email addresses, it’s evident that adhering to standards such as RFC 5322 is not just about conforming to rules; it’s about removing ambiguity and ensuring robustness in our systems. These standards are the pillars that support interoperability across different systems and platforms, allowing us to create software that reliably works in a wide array of contexts.

Understanding and implementing these standards can be intricate, and the best way to ensure compliance is often through specialized libraries. These libraries, maintained by experts and constantly updated, take the guesswork out of compliance, ensuring that our implementations are up-to-date and adhere to the latest specifications.

Moreover, it’s crucial to acknowledge that the world of email has expanded beyond the ASCII character set. Internationalized emails and Unicode support are covered under RFCs such as RFC 6530, RFC 6531, and RFC 6532. These documents provide guidelines for handling email addresses with non-ASCII characters, a necessity in our increasingly globalized digital landscape. Evaluating and supporting these RFCs is not just a technical requirement; it’s a commitment to inclusivity and global communication.
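As a rough sketch of what that means in code, the helpers below separate the two halves of an internationalized address. All three function names are hypothetical. The assumptions are that a non-ASCII local part needs the SMTPUTF8 extension from RFC 6531, and that a non-ASCII domain can be converted to its ASCII form with Python's built-in idna codec, which implements the older IDNA 2003 rules and so will not agree with newer IDNA processing in every case.

```python
def split_address(address: str) -> tuple[str, str]:
    """Split at the last '@' into (local part, domain)."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"no local part or domain in {address!r}")
    return local, domain


def needs_smtputf8(address: str) -> bool:
    """True if the local part contains non-ASCII characters."""
    local, _ = split_address(address)
    return not local.isascii()


def domain_to_ascii(address: str) -> str:
    """Re-encode the domain with the built-in idna codec (IDNA 2003)."""
    local, domain = split_address(address)
    return f"{local}@{domain.encode('idna').decode('ascii')}"


print(needs_smtputf8("jösé@example.com"))
print(domain_to_ascii("jane@bücher.example"))
```

An ASCII-only regular expression like the ones above rejects both of these addresses outright, which is worth knowing before you decide who your form is for.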
