Why anti-spam effectiveness testing sucks
April 16th, 2008, by ksimpson
InfoWorld has released a review of various anti-spam systems, along with a comparison chart of effectiveness based on long-term (2 week) testing of each system. The report ends with the common problem of how to determine which one is best given that multiple variables are involved. Terry Zink has taken the results a step further and attempted to reduce the capture rate and false positive results to a single value. I agree that a single figure would help comparison, but that makes it even more important to get the underlying data right and to measure the right things. I think variation in effectiveness is a more important measure of spam protection than overall capture rate.
Anti-spam effectiveness tests suck because:
a) nobody seems to be able to analyze and report statistics these days and
b) they test the wrong thing. Outbreak response time is the issue, not long-term capture rates.
First, let's talk about statistics. Initially I was going to rant about the general poverty of meaning in statistical reporting, in terms of missing standard deviations and excessive significant digits, but then I realized that even the capture rate calculations are wrong. If you're going to go to all the effort of testing, at least put some quality into your statistical analysis.
Looking at these results I see a wildly divergent volume of mail and spam being received by each of the anti-spam systems during the test period. The author reports that each system received a similar amount of mail (13,000~14,000 messages) but that the systems varied in how many messages they rejected as spam at the connection level (using reputation filtering or DNSBLs). If that's true, the results of this test are reported incorrectly, because the dropped connections are not reported or factored into the spam capture rate.
If I'm Barracuda and I drop 10,000 spam messages at the connection level and then another 1,750 with content filtering, that's a capture rate of 98%, not 88%. It also means I'm doing a lot more to reduce load on the server, since those dropped messages are never received and scanned. So the results are wrong, which is especially annoying since these results are going to be quoted and used in sales calls for the next 3 years and will affect some people's lives, or at least livelihoods.
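The arithmetic above can be sketched in a few lines. This is illustrative only: the 10,000 and 1,750 figures come from the example in this post, and the count of missed messages is an assumption I've chosen so the percentages work out to roughly the 88% and 98% quoted.

```python
# Corrected capture-rate arithmetic for the hypothetical Barracuda example.
# The "missed" figure is an assumption, not a number from the review.

def capture_rate(caught, total_spam):
    """Fraction of spam stopped, as a percentage."""
    return 100.0 * caught / total_spam

dropped_at_connection = 10_000  # rejected via reputation/DNSBL, never scanned
caught_by_content     = 1_750   # flagged by content filtering
missed                = 240     # spam reaching the inbox (assumed)

total_spam = dropped_at_connection + caught_by_content + missed

# Counting only content filtering, as the review effectively did:
content_only = capture_rate(caught_by_content, caught_by_content + missed)

# Counting connection-level drops as spam captures too:
overall = capture_rate(dropped_at_connection + caught_by_content, total_spam)

print(f"content-filter-only rate: {content_only:.0f}%")  # 88%
print(f"overall capture rate:     {overall:.0f}%")       # 98%
```

The point is not the exact numbers but that omitting connection-level rejections from both numerator and denominator understates the systems that do the most filtering before accepting the message.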
But I have a bigger concern with these tests, one they share with every report on spam testing I've seen over the last 5 years of watching these things: they look at the wrong issue.
Spam is not a two-week issue, it is a NOW issue. What matters is how much spam I am getting right now. How much of it is getting through my filters, hammering my email servers, annoying my users and filling up my archiving system?
If we want a single number, or any measure at all, it needs to be useful, and long-term capture rates are not very meaningful, especially when they are based on medium-term tests.
What I want to know is what the spammers were doing during each of those tests. Which vendors were hit with big new spam campaigns, and which were tested during a lull in spam activity? Which were hit with a whole lot of new spam techniques, and which received only stale old campaigns that anyone should detect?
We can't tell what was actually happening because all the data is rolled up into one nice neat number: 9x.xxx% spam detection. A real-world comparison of anti-spam effectiveness would measure the capture rate every 10 minutes, plot it, and look at how often the capture rate dropped below some threshold, say 80% for the sake of argument, then measure how long it took to recover to a 95% or so capture rate. The number of outbreaks that hit, together with the response time, gives us a measure of the resiliency of the anti-spam system to new campaigns and of the ability of the vendors' labs to respond to those issues.
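The measurement described above is easy to sketch: given capture rates sampled every 10 minutes, count dips below the outbreak threshold and time the recovery. The thresholds and the sample data here are illustrative assumptions, not figures from the review.

```python
# Sketch of the proposed outbreak/recovery metric: detect dips in a
# 10-minute capture-rate series and measure time to recover.
# Thresholds (80% dip, 95% recovered) are the ones assumed in the post.

def outbreaks(rates, interval_min=10, dip=80.0, recovered=95.0):
    """Return a list of (start_index, recovery_minutes) per outbreak."""
    events = []
    in_outbreak = False
    start = None
    for i, rate in enumerate(rates):
        if not in_outbreak and rate < dip:
            in_outbreak, start = True, i      # outbreak begins
        elif in_outbreak and rate >= recovered:
            events.append((start, (i - start) * interval_min))
            in_outbreak = False               # recovered
    return events

# Example series: one dip at sample 3 that recovers by sample 6.
rates = [97, 96, 95, 72, 85, 90, 96, 97]
print(outbreaks(rates))  # [(3, 30)] -> one outbreak, 30 minutes to recover
```

Comparing vendors on the count of such events and their recovery times would say far more about resiliency than a single long-term percentage.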
The key element of anti-spam protection is how organizations respond to new outbreaks, the sorts of outbreaks that cause noticeable dips in effectiveness, which in turn result in server load peaks, help desk calls and significant spam impacts. These are the spam concerns an ISP or an IT manager needs to plan for, not the ongoing background spam level that most people just put up with.
If we are comparing anti-spam effectiveness, let's compare the systems' ability to deal with outbreaks, not their ability to deal with the everyday junk that most vendors catch 95+% of.