How AI weeds the spam out of our inboxes

Of more than 300 billion emails sent every day, at least half are spam. Email providers have the huge task of filtering out spam and making sure their users receive the messages that matter.

Spam detection is messy. The line between spam and non-spam messages is fuzzy, and the criteria change over time. From various efforts to automate spam detection, machine learning has so far proven to be the most effective and favored approach by email providers. Although we still see spammy emails, a quick look at the junk folder will show how much spam gets weeded out of our inboxes every day thanks to machine learning algorithms.

How does machine learning determine which emails are spam and which are not? Here’s an overview of how machine learning-based spam detection works.

The challenge

Spam email comes in different flavors. Many are just annoying messages aiming to draw attention to a cause or spread false information. Some of them are phishing emails with the intent of luring the recipient into clicking on a malicious link or downloading a malware.

The one thing they have in common is that they are irrelevant to the needs of the recipient. A spam-detector algorithm must find a way to filter out spam while and at the same time avoid flagging authentic messages that users want to see in their inbox. And it must do it in a way that can match evolving trends such as panic caused from pandemics, election news, sudden interest in cryptocurrencies, and others.

Static rules can help. For instance, too many BCC recipients, very short body text, and all caps subjects are some of the hallmarks of spam emails. Likewise, some sender domains and email addresses can be associated with spam. But for the most part, spam detection mainly relies on analyzing the content of the message.

Naïve Bayes machine learning

Machine learning algorithms use statistical models to classify data. In the case of spam detection, a trained machine learning model must be able to determine whether the sequence of words found in an email are closer to those found in spam emails or safe ones.

Different machine learning algorithms can detect spam, but one that has gained appeal is the “naïve Bayes” algorithm. As the name implies, naïve Bayes is based on “Bayes’ theorem,” which describes the probability of an event based on prior knowledge.