By Joel Shore
May 8, 2003
Spam sucks. Over the last month I’ve been taking a close look at my
incoming e-mail. A whopping 47 percent of it is junk. Multiply that by
the hundreds or thousands of people in a large corporation, and you’ve
got a major headache.
How widespread is spam? According to a forecast from research firm IDC, the spam
glut will propel the worldwide daily volume of e-mail from 31 billion
messages in 2003 to 60 billion in 2006. My personal guess is that spam
is growing even faster than that.
The problems with spam are far-flung. Individual workers waste time
looking at and deleting junk e-mail, taking a toll on productivity.
The network infrastructure has to be robust enough to deal with this
incoming deluge. And workers offended by the content in some spam may
seek recourse, exposing the employer to potential litigation for not
The answer is pretty simple; you’ve got to stop spam before it
reaches users. And that means blocking spam at the server handling the
incoming message stream.
It turns out there are a couple of different ways to do this. Some
spam-blocking software relies on rules to hunt down and delete spam.
Not surprisingly, these rules need continuous updating. Another method
is to examine the structure of an e-mail message, not its content, to
determine if it’s spam.
Cloudmark, a small company based in San Francisco, has adopted the
latter approach. Its gateway spam-blocking product, called Authority,
uses predictive statistical analysis methods to keep up with the
ever-changing nature and sophistication of spam.
Authority is built on Bayesian statistics, which grew out of the work
of one Thomas Bayes.
If you think this guy is some whiz kid from MIT, Caltech, or
Stanford, then think again. Thomas Bayes (1701–1761), who would have
turned 300 in 2001, was,
of all things, an English Presbyterian minister.
Bayesian statistics says that you can combine current data (like
today’s incoming e-mail) with historical data (a database of spam and
spam characteristics) to predict an outcome—though with varying
levels of confidence.
And that’s just what Authority does. It looks at incoming mail and
assigns a “confidence factor” to it. If Authority is 99 percent sure
that a message is spam, you can take several actions including
deleting it or sending a refusal error message back to the sender.
Mail can be sent to the addressee with a “SPAM ALERT” warning stuffed
into the subject field. Or you might not send it to the addressee at
all. Alternatively, if Authority is only one percent sure the message
is spam (meaning it almost certainly is not), it is untouched and
routed to the addressee.
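To make the idea concrete, here is a minimal Python sketch of Bayes' rule applied to a single message trait. It is not Cloudmark's code; the function name and every number in it are assumptions chosen for illustration, apart from the 47 percent junk rate, which is the figure from my own inbox.

```python
def posterior_spam(prior_spam, p_trait_given_spam, p_trait_given_legit):
    """Bayes' rule: probability a message is spam, given that it shows a trait.

    prior_spam           -- share of all mail that is spam (historical data)
    p_trait_given_spam   -- how often spam shows the trait
    p_trait_given_legit  -- how often legitimate mail shows the trait
    """
    evidence = (p_trait_given_spam * prior_spam
                + p_trait_given_legit * (1.0 - prior_spam))
    if evidence == 0:
        raise ValueError("trait never seen in the historical data")
    return p_trait_given_spam * prior_spam / evidence


# Hypothetical rates: the trait appears in 60% of spam and 2% of legitimate mail.
confidence = posterior_spam(prior_spam=0.47,
                            p_trait_given_spam=0.60,
                            p_trait_given_legit=0.02)
print(f"confidence factor: {confidence:.1%}")
```

With those made-up rates, one trait alone pushes the confidence factor above 95 percent but short of 99, which is why a real filter weighs many traits together.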
You’ve already seen Bayesian theory at work. That obnoxious
animated little paper clip character in Microsoft Office is an
example. The help engine observes a user’s usage patterns and pops up
to offer help, based on its internal database and user analysis. As time passes and the
help engine continually adds its observations, it pops up less and less often. (Of course, you probably
disabled the darn thing right after it popped up for the first time.)
What’s cool about Authority is that it doesn’t need daily updating
to a set of rules, potentially an enormous headache. Instead of tens
of thousands of rules, Authority has only about 200 Bayesian
statistical models. Cloudmark refers to each of these as a spamGene.
Collectively, they constitute spamDNA. Instead of daily updates,
spamDNA is updated about once a month.
Authority examines not just content, but the structure of an e-mail
message, the IP relay path from sender to all servers to addressee, and more. IP spoofing or
forging is always caught, as is any attempt to bypass the e-mail server
by encoding the body of a message in base64. The use of nonsense text, embedded
spaces in words, upper case, exclamation points, presence of external
links, presence of graphics, recipient’s name embedded in the subject
or sender name fields, and a whole lot more are all considered.
Authority's spamDNA looks at these factors as a whole, making judgments based on
its ever-changing experience.
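As a rough illustration of weighing several cues together, the Python sketch below pulls a few of the traits listed above out of a message and combines them with a naive Bayes calculation. This is a stand-in technique, not a description of spamDNA: the cue names, the rates in `CUE_RATES`, and the assumption that cues are independent are all hypothetical.

```python
import math
import re

# Hypothetical (P(cue | spam), P(cue | legitimate)) pairs. Illustrative only;
# none of these figures come from Cloudmark.
CUE_RATES = {
    "many_exclamations": (0.40, 0.02),
    "mostly_upper_case": (0.30, 0.01),
    "external_link": (0.70, 0.20),
    "spaced_out_word": (0.15, 0.001),
}


def extract_cues(subject, body):
    """Return which structural cues are present in a message."""
    text = subject + "\n" + body
    letters = [c for c in text if c.isalpha()]
    upper_ratio = (sum(c.isupper() for c in letters) / len(letters)
                   if letters else 0.0)
    return {
        "many_exclamations": "!!!" in text,
        "mostly_upper_case": upper_ratio > 0.5,
        "external_link": re.search(r"https?://", text, re.I) is not None,
        # Single characters separated by spaces, e.g. "F R E E"
        "spaced_out_word": re.search(r"\b(?:\w ){3,}\w\b", text) is not None,
    }


def spam_confidence(cues, prior_spam=0.5):
    """Combine cues in log-odds space, treating them as independent."""
    log_odds = math.log(prior_spam / (1.0 - prior_spam))
    for name, present in cues.items():
        p_spam, p_legit = CUE_RATES[name]
        if present:
            log_odds += math.log(p_spam / p_legit)
        else:
            log_odds += math.log((1.0 - p_spam) / (1.0 - p_legit))
    return 1.0 / (1.0 + math.exp(-log_odds))


junk = extract_cues("F R E E MONEY!!!", "CLICK NOW http://example.com")
note = extract_cues("Lunch on Friday?", "Are you free at noon?")
print(spam_confidence(junk), spam_confidence(note))
```

A message that trips every cue scores far higher than one that trips a single cue, which is the point of judging the factors as a whole rather than one rule at a time.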
The genetic reference is apt. As human genes sometimes mutate, so
too does spam. And because spamDNA is predictive, it can modify
itself to a certain degree.
And if you're wondering where Cloudmark gets a never-ending torrent
of spam messages to add to its database, look no further than its own
SpamNet product. This inexpensive plug-in for
Microsoft Outlook filters out most incoming spam. Whatever slips
through is reported back to Cloudmark with a single mouse click. The
worldwide SpamNet community is nearly 450,000 strong and growing.
Testing the Product
Authority is less than a megabyte in size. It runs at the message
transfer agent (MTA) as either a .dll on
a Windows SMTP server or as an .so shared object on a Unix or Linux
Sendmail server. The spamDNA module, which Cloudmark oddly calls a
“cartridge,” is about 460 Kbytes. That’s minuscule.
In a UNIX or Linux Sendmail environment, Authority is installed via the Milter
interface. The product plugs into Sendmail, using it to relay
messages. In a Windows server environment, Authority uses
the SMTP Server that is part of Windows 2000 Server. Authority
interoperates with all SMTP-compliant e-mail servers (Lotus, Exchange,
Eudora, Postfix, etc.) via standard SMTP relay methods.
The way Authority works is pretty simple: It intercepts the
incoming stream of SMTP mail traffic through either the Sendmail
Milter interface (in Unix and Linux environments) or the Windows SMTP
Server (Windows 2000 Server or later). It examines the stream of
SMTP messages, filters the spam, and then returns the filtered stream
back to the same SMTP source. Spam never reaches the corporate e-mail
We looked at Authority on a Windows 2000 Server.
Installation was simple. Authority installs via the included
InstallShield utility. In a Unix/Linux environment, it installs from a
console command. One caveat for Unix/Linux environments: If Milter
support is not compiled into Sendmail, you’ll have to recompile Sendmail
to enable it. Not a big deal, but an extra task nonetheless.
Next, we defined confidence levels and how messages at each level
should be handled. Normally, you’d simply delete messages with a high
spam likelihood, but since we wanted to keep track of each message, we
chose to save each message to a quarantine folder. That allowed us to
keep an overall tally. Several actions are possible: delete, delete
and send a refusal to the sender, save to a quarantine area and do not
route to the addressee, insert a warning message and deliver, or take
no action and deliver.
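The mapping from confidence level to action can be pictured as a small lookup table. The Python sketch below is an assumption about how such a policy might be expressed, not Authority's actual configuration format; the thresholds, the `POLICY` table, and the function names are hypothetical, while the five actions and the “SPAM ALERT” subject warning are the ones described in this review.

```python
from enum import Enum


class Action(Enum):
    DELETE = "delete"
    DELETE_AND_REFUSE = "delete and send a refusal to the sender"
    QUARANTINE = "save to quarantine, do not route to the addressee"
    WARN_AND_DELIVER = "insert a warning message and deliver"
    DELIVER = "take no action and deliver"


# Hypothetical thresholds, checked from highest to lowest. We quarantine
# rather than delete so that every message can still be tallied.
POLICY = [
    (0.99, Action.QUARANTINE),
    (0.80, Action.WARN_AND_DELIVER),
    (0.00, Action.DELIVER),
]


def choose_action(confidence, policy=POLICY):
    """Return the action for the first threshold the confidence meets."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    for threshold, action in policy:
        if confidence >= threshold:
            return action
    return Action.DELIVER


def add_warning(subject, tag="SPAM ALERT"):
    """Prefix the subject line for messages delivered with a warning."""
    return f"{tag}: {subject}"


for score in (0.995, 0.90, 0.01):
    print(score, "->", choose_action(score).value)
```

Lowering the top threshold catches more spam at the cost of more legitimate mail landing in quarantine, which is the trade-off the confidence levels let an administrator tune.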
We used three different message streams. The first was all
legitimate mail. No messages from this stream should be filtered out.
The second stream was all spam. All the messages in this stream should be filtered
out. The third was a combination, pretty much what you’d expect to see in
day-to-day operation of a business.
Filtering out legitimate mail can be, at best, a mere inconvenience
or, at worst, a very, very bad thing. There’s the
“false positive,” in which, say, a newsletter you subscribe to is
considered spam. Though that’s incorrect, it's not a big problem. Far worse is
the dreaded “false critical.” Suppose the company CEO sends an urgent e-mail
message with “Fix this problem now!!!” in the subject field. Because
three consecutive exclamation points raise the likelihood that this
message is spam, filtering it out could have dire consequences—not
just for the person who failed to receive it, but for the IT
department that removed it. Of course, many other factors are taken
To see where the mail is going, we used a client PC running
Outlook 2002. For Unix/Linux power users who’d scoff at a client e-mail
viewer, the server-based Mutt utility does just fine.
First, we detected no perceptible delay in processing the mail
stream. Even if a mail message is held up by a quarter of a millisecond,
that is more than made up for by the productivity saved at
the end-user level and by the cost avoided on network infrastructure
expansion that might otherwise be necessary.
In our test, 98.5 percent of known spam was filtered out. That’s
really good. For every 10,000 incoming spam messages, only 150 will
get through. Even better was the stream of all-legitimate mail. Only
three legit messages were considered potential spam: one was a
newsletter, one a routine e-mail blast from Expedia, and one a survey from
a major airline sent to its frequent fliers. That’s a “false positive”
success rate exceeding 99.99 percent. Simply assign a low enough
confidence level, and these messages will get delivered—perhaps with
a potential spam warning, but delivered nonetheless. Best of all, not
one “false critical” occurred.
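For readers who want to check the arithmetic, the short Python sketch below reproduces the 150-per-10,000 figure and works out what three false positives and a success rate above 99.99 percent would jointly imply. The function names are made up for this example, and the implied stream size is only an inference from the stated numbers, not a figure reported from the test.

```python
from fractions import Fraction
import math


def spam_getting_through(incoming_spam, catch_rate):
    """Spam expected to slip past a filter with the given catch rate.

    catch_rate is passed as a string so the arithmetic stays exact.
    """
    return incoming_spam * (1 - Fraction(catch_rate))


def smallest_legit_stream(false_positives, success_rate):
    """Fewest legitimate messages for which the share handled correctly
    strictly exceeds success_rate, given a count of false positives."""
    allowed_error = 1 - Fraction(success_rate)
    return math.floor(false_positives / allowed_error) + 1


print(spam_getting_through(10_000, "0.985"))   # 150
print(smallest_legit_stream(3, "0.9999"))      # 30001
```

In other words, three misjudged messages are consistent with a rate better than 99.99 percent only if the legitimate stream held more than 30,000 messages.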
Face it, fighting spam is a fact of life. Who these people are and
what profits they may reap from inundating every e-mail user with spam
beats the hell out of me, but legions of them are out there.
For now, Authority is sold by a small direct sales team. It will be
interesting to see if demand will overwhelm that team, pushing the
product into the distribution channel and the hands of solutions
Pricing is by the seat, based on an annual subscription. As the
number of seats climbs, the price per seat drops.
Cloudmark Authority is a forward-thinking product that not only
breaks all the spam-fighting rules, it doesn’t even bother with them.