Filtering Spam Under Attack: some notes from the field

Filtering Spam Under Attack: some notes from the field Aleksander Kołcz Microsoft Live Labs (Slides stolen from lots of people, especially Josh Goodman and Geoff Hulten)

Source: Pew Internet & American Life Project

Email addiction (Source: AOL Email Addiction Survey): 41% check email first thing in the morning. 23% have checked in bed in their pajamas. 40% of email users have checked their email in the middle of the night. 26% say they haven't gone more than two to three days without checking their email.

Too bad we have spam SPAM is the number one problem for email systems Estimates from about 71% to 87% of mail is spam At 71%, if you stop 90% of the spam, 1/5 of your mail will be spam Over a billion spam a day will get past filters worldwide. http://www.tekrati.com/research/News.asp?id=6933
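The "1/5 of your mail" figure above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the filter loses no good mail; the function name is illustrative and not from the slides.

```python
def residual_spam_fraction(spam_share: float, catch_rate: float) -> float:
    """Share of delivered mail that is still spam after filtering.

    spam_share: fraction of incoming mail that is spam (0.71 to 0.87 on the slide)
    catch_rate: fraction of spam the filter stops (0.90 on the slide)
    Assumes no good mail is lost to the filter.
    """
    spam_left = spam_share * (1.0 - catch_rate)
    good = 1.0 - spam_share
    return spam_left / (spam_left + good)


low = residual_spam_fraction(0.71, 0.90)   # roughly one message in five
high = residual_spam_fraction(0.87, 0.90)  # roughly two messages in five
print(f"{low:.3f} {high:.3f}")
```

At the upper estimate of 87% spam, the same 90% catch rate leaves an inbox that is about 40% spam.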

Overview Email Spam An important application Lots of great research problems Spam Definitional problems Challenges in applying standard ML/TM procedures Techniques spammers use Other solutions to Spam Other kinds of communication abuse (SPIM)

Email : some interesting problems Finding what’s important Priorities Task Flags Organizing mail Auto foldering Auto tagging Finding what’s interesting Automatic search Contact finding

Other interesting email research Not all email research is language oriented Social Network Analysis Calendar Research HCI Visualization Email Storage Next generation email protocols

Solutions to Spam. Filtering: Machine Learning, Matching/Fuzzy Hashing, (Blackhole Lists (IP addresses)). Postage: Turing Tests, Money, Computation. (Disposable Email Addresses). Smart Proof.

Machine Learning/Text Mining A labeled iid sample from the spam/non-spam distribution Well defined performance criteria: false positive/negative rates, misclassification costs, processing/storage costs Word features (e.g., breaking by whitespace) possibly enhanced with email specific ones (e.g., header features) The favorite linear model (Naïve Bayes, SVM, Logistic Regression)
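The standard recipe on this slide can be sketched in a few lines: whitespace tokens as features and a Naïve Bayes model on top. This is a toy illustration, not the filter discussed in the talk; the class name, the add-one smoothing, and the training messages are all assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Word features obtained by lowercasing and breaking on whitespace."""
    return text.lower().split()


class NaiveBayesFilter:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "good": Counter()}
        self.doc_counts = Counter()

    def train(self, text, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokenize(text))

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["good"])
        # Start from the log prior odds of spam versus good.
        log_odds = math.log(self.doc_counts["spam"]) - math.log(self.doc_counts["good"])
        for label, sign in (("spam", 1), ("good", -1)):
            total = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in vocab:
                    continue  # words never seen in training carry no evidence
                p = (self.word_counts[label][word] + 1) / (total + len(vocab))
                log_odds += sign * math.log(p)
        return 1.0 / (1.0 + math.exp(-log_odds))


nb = NaiveBayesFilter()
nb.train("cheap pills buy now", "spam")
nb.train("buy cheap diploma now", "spam")
nb.train("meeting agenda for monday", "good")
nb.train("lunch on monday", "good")
print(nb.spam_probability("buy cheap pills") > 0.5)  # spammy words dominate
```

The slides that follow explain why this textbook setup is harder to apply to spam than it looks: the labels are noisy, the sample is biased, and the adversary rewrites the features.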

What is spam ? Unsolicited commercial mail? A user signs up for eBay and signs an agreement to receive communications from various parts of eBay An email arrives from half.com The user classifies it as spam Yet half.com is truly part of eBay Users are not likely to play detective and trace back affiliations and business relationships of email senders

What is spam ? The emails individual users don’t like? RE: how much stuff can I safely load on my rear rack? After receiving the 15th posting to this mailing list thread (cycling) I was ready to consider further follow-ups as spam I like some offers from Amazon.com, but not others What I like right now may depend on what I bought recently Emails with certain content or from certain senders can sometimes be considered spam, but not always

What is spam ? Emails people agree are spam? There are a lot of emails users are split about – should we just let them go (graymail)? Agree on what? Personalization and per-message randomization make it hard to figure out who gets the same campaign Thanks to botnets, a single spam campaign comes from a lot of different senders Emails with certain content or from certain senders can sometimes be considered spam, but not always

Spam campaign randomization

What is spam ? Emails caught in “honeypot” accounts? Long dead accounts and never used accounts should not receive legitimate email Everything they get is most likely spam, but... Mistyping and old subscriptions could deliver legitimate mail occasionally Users who engage in real communication may be getting spam that is different Honeypots are great but one has to consider false positives and sample selection bias

Examples of good mail Even if noisy, labeled examples of spam are relatively easy to acquire (people are eager to help) Examples of good mail are much tougher to come by Nobody wants to share sensitive material, but such messages are arguably the most expensive to misclassify Good-email datasets often suffer from sample-selection bias, where certain important content areas are under-represented

Label noise People make old-fashioned mistakes when donating email samples The error rates can run as high as 1-5% This complicates ensuring a low enough FP rate With sampling bias, misclassifying a few legitimate messages as spam may translate into a very large FP rate Setting a very low FP rate over noisy data may lead to an unacceptably low recall

Operating point worries Most machine learning focuses on accuracy Assumes all errors equally bad For spam (and most other problems) cost of deleting good mail much higher than cost of spam in inbox Some research on optimizing area under the curve – so you get good performance everywhere Almost no research on how to optimize for a specific point. (Figure: trade-off curve. One axis runs from "All spam missed" to "No missed spam", the other from "No good caught" to "All good caught", each scaled up to 1.)

The cost formula and its problems Classic cost-sensitive decision making The prevalence of spam, π , is highly time and user dependent The misclassification costs tend to be hard to quantify Setting the FP rate < th is often more sensible
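The formula itself did not survive in this transcript. As an assumption, the textbook cost-sensitive rule is: flag a message when P(spam | x) · C_FN > (1 − P(spam | x)) · C_FP, which gives a cutoff of C_FP / (C_FP + C_FN) on the posterior (the posterior already folds in the prevalence π). The sketch below contrasts that cutoff with the alternative the slide recommends, picking the cutoff that keeps the measured FP rate under a limit. All function names and scores are hypothetical.

```python
def cost_threshold(cost_fp, cost_fn):
    """Cutoff on P(spam | message) under the textbook cost-sensitive rule."""
    return cost_fp / (cost_fp + cost_fn)


def threshold_for_fp_rate(good_scores, max_fp_rate):
    """Cutoff t such that at most max_fp_rate of good mail scores above t."""
    ranked = sorted(good_scores, reverse=True)
    allowed = int(max_fp_rate * len(ranked))  # good messages we may lose
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def rates(good_scores, spam_scores, t):
    """(FP rate, recall) when every message scoring above t is flagged."""
    fp = sum(s > t for s in good_scores) / len(good_scores)
    recall = sum(s > t for s in spam_scores) / len(spam_scores)
    return fp, recall


# Made-up validation scores; the 0.97 "good" message stands in for a
# mislabeled or genuinely spam-like legitimate message.
good = [0.01, 0.02, 0.03, 0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.97]
spam = [0.99, 0.98, 0.95, 0.90, 0.80, 0.70, 0.55, 0.35]

t_loose = threshold_for_fp_rate(good, 0.10)   # tolerate losing 1 good in 10
t_strict = threshold_for_fp_rate(good, 0.00)  # tolerate no lost good mail
print(rates(good, spam, t_loose), rates(good, spam, t_strict))
```

With costs of 24 to 1, `cost_threshold(24, 1)` yields 0.96, but that ratio is exactly the kind of number the slide calls hard to quantify. In the toy data, a single spam-like message among the good mail forces the strict cutoff so high that recall collapses, which is the label-noise problem from the previous slide.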

Adversarial attacks and why size matters Spammers respond to new filter defenses, but not uniformly Attacking my personal filter is a waste of time/money Attacking a corporate filter may lead to a few thousand successful deliveries A successful attack against Hotmail, AOL, Yahoo! or Gmail can lead to millions of eyeballs!

Adversarial attacks: agility counts Spammers can respond very fast with new tactics Large systems are often hampered by lengthy deployment procedures Solutions need to be naturally adaptive or allow for easy manual intervention Constant monitoring and rapid response are paramount

What Happened When we Shipped an Adaptive Spam Filter The first spam filter we shipped was adaptive If the user corrected mistakes, we improved the filter. What to do if the user does not correct mistakes? We assumed the filter was correct For users who rarely fixed mistakes, this led to catastrophically bad results – the filter got worse and worse and worse

Threshold Drift: Conservative Threshold Setting. We are conservative in our filtering. For instance, maybe we need to be 96% certain that mail is spam before we classify it as spam. (Figure labels: Separator: 50/50 mark; Conservative Threshold: 96% sure)

Threshold Drift: Lots of Spam Classified as Good. (Figure labels: Separator: 50/50 mark; Conservative Threshold: 96% sure)

Threshold Drift. (Figure labels: Old Separator; Old Conservative Threshold: 96% sure; New Separator; New Conservative Threshold: 96% sure)

Adaptation with partial user feedback is hard Users may correct all errors, or only all spam, all good, 50% spam, 10% spam, no errors, etc. Need to work no matter what the user correction rate is Great problem that you find when you try to build a real system

Attack vectors/techniques Reputation based attacks: Sending (or pretending to send) from sources (IPs, domains, accounts) with good or unknown reputation Content based attacks: Chaff Invisible ink Encoding Image spam Spelling

The Hitchhiker Chaffer Content Chaff Random passages from the Hitchhiker’s Guide Footers from valid mail “This must be Thursday,” said Arthur to himself, sinking low over his beer, “I never could get the hang of Thursdays.” Express yourself with MSN Messenger 6.0…

Hitchhiker Chaffer’s Later Work: invisible ink Can use hidden text, e.g. white on white or many other tricks User sees only spammy text Spam filter sees everything, including good words.

Hitchhiker Chaffer’s Later Work: invisible ink Can use hidden text, e.g. white on white or many other tricks Also included a number of unusual statements made by candidates during, ‘On display? I eventually had to go down to the cellar to find them.’ http://join.msn.com/?Page=features/es

Weather Report Guy Content in Image Good Word Chaff Weather, Sunny, High 82, Low 81, Favorites…

Secret Decoder Ring Looks easy Is it? Viagra – Proven sexual aid to enhance performance…

Secret Decoder Ring Dude Character Encoding HTML word breaking Pharmacy Produc<!LZJ>t<!LG>s
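One way a filter can see what the reader sees is to normalize the markup before tokenizing. The sketch below strips tag-like spans such as `<!LZJ>` (which browsers do not display) and then decodes character entities. It is an illustration under those assumptions, not the method used by any particular product, and a production filter would need a real HTML parser.

```python
import html
import re

# Anything between "<" and ">" is treated as markup the reader never sees.
TAG_LIKE = re.compile(r"<[^>]*>")


def visible_text(fragment):
    """Approximate the text a reader would see in an HTML fragment."""
    without_tags = TAG_LIKE.sub("", fragment)  # rejoin words split by tags
    return html.unescape(without_tags)         # decode &#80; style entities


print(visible_text("Pharmacy Produc<!LZJ>t<!LG>s"))
```

Tags are removed before entities are decoded so that an encoded `&lt;` is not mistaken for the start of a tag.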

Diploma Guy Word Obscuring Dplmoia Pragorm Caerte a mroe prosoeprus

Diploma Guy Word Obscuring Dipmloa Paogrrm Cterae a more presporous

Diploma Guy Word Obscuring Dimlpoa Pgorram Cearte a more poosperrus

Diploma Guy Word Obscuring Dpmloia Pragorm Caetre a more prorpeosus

Diploma Guy Word Obscuring Dplmoia Pragorm Carete a mroe prorpseous

More of Diploma Guy Diploma Guy is good at what he does
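Diploma Guy's variants all keep the first and last letter of each word and shuffle the middle, so one possible countermeasure is to index a lexicon by a key that ignores the order of the middle letters. This is a hedged sketch; the tiny lexicon and the function names are illustrative, and distinct words can share a key.

```python
def scramble_key(word):
    """First letter + sorted middle letters + last letter, lowercased."""
    w = word.lower()
    if len(w) <= 3:
        return w  # nothing to shuffle in very short words
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]


# Hypothetical lexicon; a real filter would index its whole vocabulary.
LEXICON = ["diploma", "program", "create", "a", "more", "prosperous"]
INDEX = {scramble_key(w): w for w in LEXICON}


def unscramble(text):
    """Map each token back to a lexicon word when the keys match."""
    return " ".join(INDEX.get(scramble_key(t), t.lower()) for t in text.split())


print(unscramble("Dplmoia Pragorm Caerte a mroe prosoeprus"))
```

Every variant on the preceding slides collapses to the same token sequence, which removes the per-message randomization the spammer paid for.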

Trends in Spam Exploits (Hulten et al.)
Exploit | 2003 Spam | 2004 Spam | Delta (Absolute %) | Description
Word Obscuring | 4% | 20% | 16% | Misspelling words, putting words into images, etc.
URL Spamming | 0% | 10% | | Adding URLs to non-spam sites (e.g. msn.com).
Domain Spoofing | 41% | 50% | 9% | Using an invalid or fake domain in the from line.
Token Breaking | 7% | 15% | 8% | Breaking words with punctuation, space, etc.
MIME Attacks | 5% | 11% | 6% | Putting non-spam content in one body part and spam content in another.
Text Chaff | 52% | 56% | | Random strings of characters, random series of words, or unrelated sentences.
URL Obscuring | 22% | 17% | -5% | Encoding a URL in hexadecimal, hiding the true URL with an @ sign, etc.
Character Encoding | | | | An entity-encoded spelling of Pharmacy renders into Pharmacy.
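The two URL Obscuring tricks named in the table can be undone with Python's standard `urllib.parse`. In this sketch the URLs are made up: `198.51.100.7` is a documentation-only address and `example.test` is a reserved test domain.

```python
from urllib.parse import unquote, urlsplit


def real_host(url):
    """Host the browser will actually contact, ignoring any user-info part."""
    return urlsplit(url).hostname


def decode_hex(url):
    """Undo %XX hexadecimal escapes so the filter sees the plain characters."""
    return unquote(url)


# Everything before "@" is user info, so the link does not go to msn.com.
print(real_host("http://www.msn.com@198.51.100.7/offer"))
print(decode_hex("http://example.test/%62%75%79"))
```

A filter that scores the text before the `@` would credit the message with a link to a reputable site, which is also the idea behind the URL Spamming row.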

Economy considerations Building complex systems (e.g., combinations of system-wide and personal filters) can improve filtering accuracy The implementation costs can be substantial though E.g., fully personalized service-side spam filtering is complex: Per user model storage Per user message processing (important for messages with multiple recipients)

Solutions to Spam. Filtering: Machine Learning, Matching/Fuzzy Hashing, (Blackhole Lists (IP addresses)). Postage: Turing Tests, Money, Computation. (Disposable Email Addresses). Smart Proof.

Matching/Fuzzy Hashing Use “Honeypots” – addresses that should never get mail All mail sent to them is spam Look for similar messages that arrive in real mailboxes Exact match easily defeated Use fuzzy hashes How effective? Dedicated attacks can defeat near-duplicate detection Example of randomized wording: {Make | Earn} {thousands of dollars | lots of money} {working at home | in the comfort of your own house} !!!
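The difference between exact and fuzzy matching can be illustrated with word shingles and Jaccard similarity. This is one possible near-duplicate scheme, chosen here for brevity; the slides do not say which fuzzy hash was used, and the messages below are adapted from the example wording above.

```python
import hashlib


def shingles(text, k=2):
    """Set of overlapping k-word sequences from a message."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Similarity of two messages as the overlap of their shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def exact_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


honeypot = "make thousands of dollars working at home"
one_swap = "earn thousands of dollars working at home"
rewritten = "earn lots of money in the comfort of your own house"

print(exact_hash(honeypot) == exact_hash(one_swap))  # exact match is defeated
print(jaccard(honeypot, one_swap), jaccard(honeypot, rewritten))
```

A single substituted word breaks the exact hash but leaves most shingles intact, while a message that randomizes every phrase shares none, which is the dedicated attack the slide warns about.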

Blackhole Lists Lists of IP addresses that send spam Open relays, Open proxies, DSL/Cable lines, etc… Easy to make mistakes Open relays, DSL, Cable send good and spam… Who makes the lists? Some list-makers very aggressive Some list-makers too slow News example: "MSN blocks e-mail from rival ISPs," by Stefanie Olsen, Staff Writer, CNET News.com, February 28, 2003, 2:34 PM PT. Microsoft's MSN said its e-mail services had blocked some incoming messages from rival Internet service providers earlier this week, after their networks were mistakenly banned as sources of junk mail. The Redmond, Wash., company, which has nearly 120 million e-mail customers through its Hotmail and MSN Internet services, confirmed Friday it had wrongly placed a group of Internet protocol addresses from AOL Time Warner's RoadRunner broadband service and EarthLink on its "blocklist" of known spammers whose mail should be barred from customer in-boxes. Once notified of the error by the two ISPs, MSN moved the IP addresses "over to a safe list immediately," according to a Microsoft spokeswoman.

Postage Basic problem with email is that it is free Force everyone to pay (especially spammers) and spam goes away Send payment pre-emptively, with each outbound message, or wait for challenge Multiple kinds of payment: Turing Test, Computation, Money

Turing Tests (HIP, CAPTCHA) (Naor ’96) You send me mail; I don’t know you I send you a challenge: type these letters Your response is sent to my computer Your message is moved to my inbox, where I read it

Computational Challenge (Dwork and Naor ’92) Sender must perform time consuming computation Example: find a hash collision Easy for recipient to verify, hard for sender to find collision Requires say 10 seconds (or 5 minutes?) of sender CPU time (in background) Can be done preemptively, or in response to challenge
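The asymmetry described above (hard to produce, cheap to verify) can be sketched with a hashcash-style puzzle. Note the substitution: instead of a true hash collision, the sender searches for a nonce whose SHA-256 digest starts with a given number of zero bits. The puzzle format, the function names, and the address are assumptions, not the construction from Dwork and Naor.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Number of zero bits at the start of a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def verify(message: str, nonce: int, bits: int) -> bool:
    """Recipient's check: a single hash, regardless of difficulty."""
    digest = hashlib.sha256(f"{message}:{nonce}".encode("utf-8")).digest()
    return leading_zero_bits(digest) >= bits


def mint(message: str, bits: int) -> int:
    """Sender's work: about 2**bits hashes on average."""
    for nonce in count():
        if verify(message, nonce, bits):
            return nonce


stamp = mint("to:alice@example.test", 12)  # small difficulty, for the demo only
print(verify("to:alice@example.test", stamp, 12))
```

Each extra bit of difficulty doubles the sender's expected work while the recipient still computes one hash, so the cost can be tuned toward the "10 seconds (or 5 minutes?)" range mentioned on the slide.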

$$$ Money Pay actual money (1 cent?) to send a message Interesting variation: take money only when user hits “Report Spam” button Otherwise, refund to sender Free for non-spammers to send mail, but expensive for spammers Requires multiple monetary transactions for every message sent – expensive Who pays for infrastructure?

SmartProof: Most challenge-response approaches challenge every message Use machine learning. Challenge only suspicious messages (avoids annoying challenges) Can auto-respond with computation Least annoying to sender – may never see challenge Can respond by solving a Turing Test Works for people with old or incompatible computers Can respond with micro-payment

Other kinds of abuse Email spam Chat rooms (SPAT) Instant Messenger (SPIM) Blog spam Web spam SMS spam IP phone spam

Chat Room Spam MSN closed its free chat rooms Spambots come in and pretend to chat But really just advertising porn sites Some spambots trivial Don’t talk at all, but take up space Link to porn spam in their profile Some spambots very sophisticated You can have a short conversation with them before they try to convince you to go to their website Randomized conversations so hard for users to spot

Chat Bot
joshuagood9: hi there
superchristina: hey there how u doin?
joshuagood9: doing fine, and you?
superchristina: hey there how u doin?
joshuagood9: are you a bot?
superchristina: im not a bot are u? lol
joshuagood9: are you a bot?
superchristina: i hate bots lol
joshuagood9: how old are you?
superchristina: whats up?
joshuagood9: asl?
superchristina: im 21 f usa and u?
joshuagood9: I am fine, thank you
superchristina: right on asl?... im 20 f usa
joshuagood9: 74/M, WA
superchristina: nice age
joshuagood9: thank you
superchristina: yw sweety..could u do me a favor..check out my homepage and my profile see if my cam works? brb

Instant Messenger Spam “SPIM” Send messages to people via IM Microsoft solved this by requiring people to get permission before IMing Spammers put spam in their “name” – so permission request message now has spam!

Blog Spam Post comments with links in blogs The links used to be used by search engines as part of rankings Most search engines now completely ignore these links (throwing away valuable information) Spammer posts links from his blog to victim blog Trackback software shows victim that there is a link to his blog Victim uses trackback to see who linked Many providers disabling trackbacks

SPIM, etc. are great NLP problems Tons of ways to obfuscate email spam, because you can send pictures and arbitrary HTML IM, chat rooms, blog comments all basically restricted to plain text NLP techniques may be more appropriate for these domains than for email spam Other kinds of abuse in chat rooms Pedophiles, phishing, etc. MSN and Yahoo have both closed off large parts of their chat room systems because of pedophiles

Conclusion: lots of exciting research Email Priorities, Task Flags, Auto-folder, Auto-Tag, Automatic Search Spam Still haven’t solved it – can keep improving New problems like phishing Apply to other domains (SPIM, etc) The conference on Email and Anti-Spam CEAS <www.ceas.cc>