Anti-SPAM experience at LAL
Michel Jouvin
LAL / IN2P3

26/5/2004 Anti-SPAM at LAL - HEPix - Edinburgh 2004

LAL Context
Message router: Sendmail
–Milter API to call an external program for filtering before delivery
Message store: Execmail IMAP
–Derived from Cyrus v1
Mail clients capable of message filtering
–Mulberry, Pine, Outlook, Netscape/Mozilla, Entourage…

Policy Decisions…
Do virus and SPAM detection at server level
Let the user choose the final processing when there is no security problem
–Only for SPAM, not for viruses
Virus: forbidden extensions rather than an antivirus
–A virus is the main threat during its first hours/days, when the antivirus is not yet up to date
–(+) Proactive, low resource consumption
–(-) Blocks some useful extensions (e.g. .zip)
–Antivirus runs on the desktop
SPAM: tagged at server level with a SPAM probability (score)
–Some predefined filters proposed for supported clients

… Policy Decisions
Avoid black / grey lists
–Effective for no more than a few months (spammers work around them)
–Negative side effects on users (blacklisted ISPs)
–Reliance on an uncontrolled critical service (the blacklist maintainer)

Virus Protection: MIMEDefang
Configured to remove suspect parts based on their extensions
–The recipient still receives the message, with a text replacing the attachment
–One header (X-MIMEdefang-action) added to help filtering
2 classes of suspect extensions
–Always junk (.scr, .pif…): just thrown away…
–Sometimes useful (.exe, .zip): quarantined, retrieval possible
MIMEDefang can call other modules
–Embedded Perl interpreter to ease calling external modules
–Can be used to call Amavis (antivirus), SpamAssassin…
–Can restrict calls to external modules to certain messages
  –Don't call SpamAssassin for large messages (> 100K): never SPAM
  –Provides a significant performance improvement

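The extension policy and the size cut-off above can be sketched in a few lines. This is an illustration only, written in Python rather than as a real MIMEDefang filter (which is Perl); the function names, the returned action strings, and the reading of "100K" as 100 KiB are assumptions, and the extension sets contain only the examples named on the slide.

```python
import os

# Hypothetical extension lists: only the examples given on the slide.
ALWAYS_JUNK = {".scr", ".pif"}
SOMETIMES_USEFUL = {".exe", ".zip"}

# Assumption: "> 100K" is read here as 100 KiB.
SPAM_SCAN_MAX_BYTES = 100 * 1024


def attachment_action(filename):
    """Return 'discard', 'quarantine' or 'keep' based on the last extension."""
    ext = os.path.splitext(filename.lower())[1]
    if ext in ALWAYS_JUNK:
        return "discard"
    if ext in SOMETIMES_USEFUL:
        return "quarantine"
    return "keep"


def should_run_spam_check(message_size_bytes):
    """Skip the SPAM check for large messages, as the slide suggests."""
    return message_size_bytes <= SPAM_SCAN_MAX_BYTES
```

Only the last extension is examined, so a name such as invoice.pdf.exe is still quarantined.
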
SPAM Detection: SpamAssassin…
At LAL: Perl module called by MIMEDefang
–No extra process, no startup cost for every message
–Dependent on other Perl modules
  –Experienced a bad problem with HTML because of an old HTML::Parse
Several types of filtering
–Rule-based
–Bayesian analysis: based on message tokenization and statistics
–Black / grey lists

… SPAM Detection: SpamAssassin
Computes a score (probability of being SPAM)
–Score >= 5 can be considered SPAM
–Very few false positives: always related to misconfigured clients
Adds headers (X-Spam-Score/Status) and an attachment (SpamAssasin.Report)
–Headers and attachment list the reasons behind the score
–Possibility to modify the subject
  –LAL: prefix the subject with (SPAM ****), number of * = score / 5
–Efficient filtering possible by looking at the headers

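The subject tagging described above could look roughly like the following sketch. The threshold of 5 and the "score / 5" star count come from the slide; rounding the star count down and the function names are assumptions made for illustration, not the actual LAL configuration.

```python
SPAM_THRESHOLD = 5.0  # "Score >= 5 can be considered SPAM"


def tag_subject(subject, score):
    """Prefix the subject with (SPAM ***) when the score reaches the threshold."""
    if score < SPAM_THRESHOLD:
        return subject
    # Assumption: the number of stars is score / 5, rounded down.
    stars = "*" * int(score // 5)
    return f"(SPAM {stars}) {subject}"


def looks_tagged(subject):
    """Client-side check a mail filter could apply to the rewritten subject."""
    return subject.startswith("(SPAM ")
```

With these assumptions, a message scoring 12.3 gets two stars, and a client rule can match on the "(SPAM " prefix or on the number of stars.
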
Bayesian Analysis…
Rule-based analysis less and less efficient
–Spammers very responsive to rule improvements
–LAL: 30% of SPAM undetected last winter
  –Bayesian analysis was inactive because of a misconfiguration
Bayesian analysis: based on an (old) text analysis method
–Message is tokenized: tokens are made of one set of characters, token separators of another
–Learning phase: for each token, count every time it appears in a SPAM or a HAM (non-SPAM) and compute a probability (stored in a DB)
–Analysis: compute a probability for the message from the probabilities of the tokens it contains

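The three steps above (tokenize, learn, analyse) can be illustrated with a minimal naive Bayes sketch. It is not SpamAssassin's implementation: the token character set, the add-one smoothing, the equal priors, and the log-odds combination are all assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter

# Assumption: tokens are runs of these characters; anything else separates them.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")


def tokenize(text):
    """Return the set of distinct lower-cased tokens in a message."""
    return set(TOKEN_RE.findall(text.lower()))


class BayesFilter:
    def __init__(self):
        self.spam_counts = Counter()  # token -> number of SPAM containing it
        self.ham_counts = Counter()   # token -> number of HAM containing it
        self.n_spam = 0
        self.n_ham = 0

    def learn(self, text, is_spam):
        """Learning phase: update per-token counts for one sorted message."""
        tokens = tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def token_log_odds(self, token):
        """Log of P(token | SPAM) / P(token | HAM), with add-one smoothing."""
        p_spam = (self.spam_counts[token] + 1) / (self.n_spam + 2)
        p_ham = (self.ham_counts[token] + 1) / (self.n_ham + 2)
        return math.log(p_spam / p_ham)

    def spam_probability(self, text):
        """Analysis: combine the token evidence, assuming equal priors."""
        log_odds = sum(self.token_log_odds(t) for t in tokenize(text))
        log_odds = max(-50.0, min(50.0, log_odds))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-log_odds))
```

A token never seen during learning contributes nothing, so a message made only of unknown tokens scores 0.5.
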
… Bayesian Analysis
Uses message headers and content
–Important to teach the filter with the original (not forwarded) message
Not language sensitive
Very difficult for spammers to work around
–Every token database is unique
Very few false positives
–False positive: valid message with score >= 5
–LAL: no false positive so far (after a few weeks)

Bayesian Filter Administration
Learning phase is critical
–Initial learning with 1000s of SPAM and HAM
  –LAL initial set: 5000 messages (2/3 HAM, 1/3 SPAM)
–Must cover message diversity (language, topic…) to avoid side effects
–Messages used for learning must be carefully (manually) sorted between SPAM and HAM
Learning must be renewed periodically
–Token expiration protects against evolving patterns and limits DB size
–Auto-learn feature helps keep the database accurate
–Need to manually feed the filter with incorrectly classified messages (false positives or false negatives) to refine the database

Conclusions
Pattern matching is not enough, Bayesian looks promising
–Raised SPAM detection efficiency to > 90% with initial learning
–Hope to reach at least 95% by refining the learning
Takes time to converge, don't make changes every day
–SPAM profile / volume not the same every day
–Needs time to stabilize (auto-learning curve)
Validate changes
–Keep a reference set of SPAM and HAM (needs to be updated)
Administration load still a question
–How to collect / process false positives / negatives from users?
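
One possible way to validate a change against a reference set is to re-score the set and compare detection and false positive rates before and after. The sketch below assumes the reference set is available as (score, is_spam) pairs; the function name and the data layout are hypothetical, and only the threshold of 5 comes from the talk.

```python
def evaluate(scored, threshold=5.0):
    """Compute detection and false positive rates on a reference set.

    scored: iterable of (score, is_spam) pairs, one per reference message.
    """
    spam_total = ham_total = detected = false_pos = 0
    for score, is_spam in scored:
        flagged = score >= threshold
        if is_spam:
            spam_total += 1
            detected += flagged      # SPAM correctly flagged
        else:
            ham_total += 1
            false_pos += flagged     # HAM wrongly flagged
    return {
        "detection_rate": detected / spam_total if spam_total else 0.0,
        "false_positive_rate": false_pos / ham_total if ham_total else 0.0,
    }
```

Running it on the same reference set before and after a configuration change shows whether the change helped, independently of day-to-day variations in incoming SPAM.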