Sender Reputation in a Large Webmail Service by Bradley Taylor (2006) Presented by : Manoj Kumar & Harsha Vardhana
Overview Introduction Primitive reputation service Gmail reputation service - Authentication - Reputation calculation Results Problems Simple rules Conclusion Discussion
Introduction Gmail, a free email service Gmail is very concerned about detecting spammy emails and eliminating them Reputation systems and spam filters
Some interesting stats 50 million Gmail users in all Maximum disk space 130,000 terabytes Assuming 20% usage 104,000 terabytes (backups included) 5-30% of all emails received are spam Considering an average of 17.5% spam About 18,200 terabytes of the data stored by Gmail is spam
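The spam-volume estimate on this slide can be reproduced with a few lines of arithmetic. This is only a sketch that reuses the slide's own figures; the variable names are illustrative, and the 17.5% figure is taken to be the midpoint of the quoted 5-30% range.

```python
# Figures quoted on the slide; variable names are illustrative.
stored_tb = 104_000                    # estimated data stored, in terabytes (backups included)
spam_low_pct, spam_high_pct = 5, 30    # quoted range of spam as a share of received mail

# Midpoint of the quoted range, used as the "average" spam share.
avg_spam_pct = (spam_low_pct + spam_high_pct) / 2

# Estimated spam volume in terabytes.
spam_tb = stored_tb * avg_spam_pct / 100

print(avg_spam_pct, spam_tb)
```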
Reputation-Based & Content-Based classification Reputation-based classification uses the sender's reputation, rather than the content of the email, to classify the mail as spam or not spam. In contrast, content-based classification uses the contents of the email to classify it.
Rudimentary Reputation System Whitelists Block lists
Working Use connecting IP (crude authentication) Check if in whitelist Check if in block list If not in any list send to spam filter (now a content based filtering is done by the filter)
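The lookup order described on this slide can be sketched as follows. This is a minimal illustration, not Gmail's implementation: the IP addresses come from documentation ranges, and `content_filter` is a hypothetical stand-in for a real content-based spam filter.

```python
import ipaddress

# Hypothetical lists keyed by connecting IP (documentation-range addresses).
WHITELIST = {ipaddress.ip_address("192.0.2.10")}
BLOCK_LIST = {ipaddress.ip_address("203.0.113.7")}


def content_filter(message: str) -> str:
    """Stand-in for a content-based spam filter (toy keyword check)."""
    return "spam" if "free money" in message.lower() else "inbox"


def classify(connecting_ip: str, message: str) -> str:
    """Check the whitelist, then the block list, then fall back to content filtering."""
    ip = ipaddress.ip_address(connecting_ip)
    if ip in WHITELIST:
        return "inbox"
    if ip in BLOCK_LIST:
        return "spam"
    # The sender is on neither list, so only the content can decide.
    return content_filter(message)
```

Note that a whitelisted IP bypasses the content filter entirely, which is why whitelist management matters so much in this scheme.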
Problems with the rudimentary system Removing false positives Manual whitelist management Figuring out the true sender Multiple domains sharing a set of IP addresses
Gmail reputation system Detecting solicited and bulk emails What is spam & what is not?
Some definitions for spam
To indiscriminately send unsolicited, unwanted, irrelevant, or inappropriate messages, especially commercial advertising in mass quantities. Noun: electronic "junk mail".
Spam refers to electronic junk mail or junk newsgroup postings. Some people define spam even more generally as any unsolicited email. In addition to being a nuisance, spam also eats up a lot of network bandwidth. Because the Internet is a public network, little can be done to prevent spam, just as it is impossible to prevent junk mail. However, the use of software filters in email programs can be used to remove most spam sent through email. (nces.ed.gov/pubs2003/secureweb/glossary.asp)
To crash a program by overrunning a fixed-size buffer with excessively large input data. Also, to cause a person or newsgroup to be flooded with irrelevant or inappropriate messages.
"SPAM" mail is the practice of sending massive amounts of promotions or advertisements (and scams) to people that have not asked for it. Spam mail is controversial and there are many levels of definitions for it. Many times, spam lists are created by "harvesting" email addresses from discussion boards and groups, chat rooms, IRC, and web pages. Pugmarks strictly prohibits sending spam from accounts on our servers.
Authenticating a domain IP addresses don’t represent the sender Domain-based authentication systems –SPF (Sender Policy Framework) –DomainKeys
Working of SPF Domains use DNS to direct requests. All domains publish MX records to determine which machines receive mail for the domain. SPF works by domains publishing "reverse MX" records to determine which machines send mail from the domain. The recipient can check those records to make sure mail is coming from where it should be coming from. With SPF, those "reverse MX" records are easy to publish: one line in DNS is all it takes. (www.openspf.org)
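To make the "one line in DNS" concrete, here is a deliberately simplified sketch of the recipient-side check. It assumes the SPF record text has already been fetched from DNS, and it evaluates only the `ip4:`/`ip6:` mechanisms and a closing `all`; real SPF also defines mechanisms such as `a`, `mx`, and `include`, which this sketch skips. The function name `check_spf` and the sample record are hypothetical.

```python
import ipaddress


def check_spf(record: str, sender_ip: str) -> str:
    """Evaluate only ip4:/ip6: mechanisms and 'all' in an already-fetched SPF record."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"  # not an SPF record

    ip = ipaddress.ip_address(sender_ip)
    qualifiers = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}

    for term in terms[1:]:
        qualifier = "+"  # a mechanism without a qualifier means "pass"
        if term[0] in qualifiers:
            qualifier, term = term[0], term[1:]
        name, _, value = term.partition(":")
        name = name.lower()
        if name in ("ip4", "ip6"):
            # Membership is False when address and network versions differ.
            if ip in ipaddress.ip_network(value, strict=False):
                return qualifiers[qualifier]
        elif name == "all":
            return qualifiers[qualifier]
        # Other mechanisms (a, mx, include, ...) are ignored in this sketch.
    return "neutral"


# Hypothetical record a domain might publish as a TXT entry.
record = "v=spf1 ip4:192.0.2.0/24 -all"
print(check_spf(record, "192.0.2.55"), check_spf(record, "198.51.100.1"))
```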
Working of DomainKeys DomainKeys adds a header named "DomainKey-Signature" that contains a digital signature of the contents of the mail message. Parameters : –SHA-1 (cryptographic hash) –RSA (Public key encryption) –Base64 (to encode encrypted hash)
Signature Header DKIM-Signature: a=rsa-sha1; q=dns; d=example.com; s=jun2005.eng; c=relaxed/simple; t= ; x= ; h=from:to:subject:date; b=dzdVyOfAKCdLXdJOc9G2q8LoXSlEniSb av+yuU4zGeeruD00lszZVoG4ZHRNiYzR DNS query will be made to: jun2005.eng._domainkey.example.com
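The DNS name on this slide is built from two tags in the signature header: the selector (`s=`) and the signing domain (`d=`). A minimal sketch of that step is below; the helper names are illustrative, and the `t=`, `x=`, and `b=` tags are left out of the sample value because they play no part in building the query name.

```python
def parse_tags(header_value: str) -> dict:
    """Split a 'name=value; name=value' tag list into a dictionary."""
    tags = {}
    for part in header_value.split(";"):
        name, sep, value = part.partition("=")  # split on the first '=' only
        if sep:
            tags[name.strip()] = value.strip()
    return tags


def key_query_name(tags: dict) -> str:
    """Build the DNS name where the signer's public key is published."""
    return f"{tags['s']}._domainkey.{tags['d']}"


header_value = (
    "a=rsa-sha1; q=dns; d=example.com; s=jun2005.eng; "
    "c=relaxed/simple; h=from:to:subject:date"
)
tags = parse_tags(header_value)
print(key_query_name(tags))  # jun2005.eng._domainkey.example.com
```

The verifier would then fetch the public key from that name and use it to check the RSA signature over the hashed message; that verification step is not shown here.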
SPF Authentication Method Plain-SPF Best-guess SPF DNS PTR zone
An example - DNS PTR Zone A message arrives claiming to be from abc.xyz.com The DNS PTR record of the message's connecting IP (found using reverse DNS) is xyz.com (or) foo.xyz.com The PTR name is in the same zone as the sender, so AUTHENTICATE
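The comparison in this example can be sketched as a zone match between the claimed sender domain and the PTR name. The sketch assumes the reverse lookup has already been done (for instance with `socket.gethostbyaddr`), and it naively treats the last two labels of a hostname as its zone, which is an assumption that breaks for suffixes such as `co.uk`. Both function names are hypothetical.

```python
def naive_zone(hostname: str) -> str:
    """Treat the last two labels as the zone (an assumption; wrong for e.g. co.uk)."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def ptr_zone_match(sender_domain: str, ptr_name: str) -> bool:
    """Authenticate when the PTR name falls in the same zone as the sender domain."""
    return naive_zone(sender_domain) == naive_zone(ptr_name)


print(ptr_zone_match("abc.xyz.com", "foo.xyz.com"))     # True
print(ptr_zone_match("abc.xyz.com", "mail.other.net"))  # False
```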
SPF Authentication Method Breakdown ~half of SPF authenticated messages used plain SPF
Authentication Breakdown : Nonspam Most of Gmail’s incoming mail that is not spam is authenticated (~75%)
Authentication Breakdown : spam Most of Gmail’s spam is not authenticated (~60%)
When an email arrives… When an email arrives, it is classified and an event is logged. The authentication associated with the message is also logged. Manual reclassification by the user (either Report spam or Not spam) is also recorded.
Reputation calculation Four variables involved in the calculation:
- autospam
- autononspam
- manualspam
- manualnonspam
Reputation is calculated as a number between 0 and 100
A simplified formula
good = autononspam + manualnonspam - manualspam
total = autononspam + autospam
reputation = (100 * good) / total
An Example weliketospam.com sends 100 mails in a day
autospam = 40
autononspam = 60
manualspam = 8
manualnonspam = 14
good = 60 + 14 - 8 = 66
total = 60 + 40 = 100
reputation = (100 * 66) / 100 = 66
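The simplified formula translates directly into a small function. This sketch follows the slide's formula term by term; the guard against a zero total is an added assumption, since the slides do not say what happens for a domain with no classified mail.

```python
def reputation(autospam: int, autononspam: int,
               manualspam: int, manualnonspam: int) -> float:
    """Simplified reputation score from the four logged counters."""
    good = autononspam + manualnonspam - manualspam
    total = autononspam + autospam
    if total == 0:
        # Assumption: with no classified mail there is nothing to score.
        raise ValueError("no classified mail for this sender")
    return 100 * good / total


# The weliketospam.com example from the slide.
print(reputation(autospam=40, autononspam=60, manualspam=8, manualnonspam=14))  # 66.0
```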
More on reputation Unfortunately false positives also occur Reputation calculation is done over many days Only a subset of users are considered for reputation calculation No bias towards heavy users over light users.
How it works
define threshold T
while (1) {
    wait for new mail / collect from queue
    if current mail reputation < T then
        send to SPAM folder
    else if current mail reputation fairly high then
        send to INBOX
    else
        send to spam_filter(reputation)
}
end
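A runnable sketch of this loop is given below, under several assumptions. The slide does not give values for T or for "fairly high", so `SPAM_THRESHOLD` and `INBOX_THRESHOLD` are hypothetical; `spam_filter` is a toy stand-in for the content-based filter; and the loop drains a finite queue instead of waiting forever, so that it can be run and tested.

```python
from collections import deque

SPAM_THRESHOLD = 20    # hypothetical value for T
INBOX_THRESHOLD = 90   # hypothetical cutoff for "fairly high"


def spam_filter(body: str, reputation: float) -> str:
    """Toy content filter that is stricter for lower-reputation senders."""
    suspicious = sum(word in body.lower() for word in ("winner", "free", "urgent"))
    allowed = 1 if reputation >= 50 else 0
    return "spam" if suspicious > allowed else "inbox"


def route(body: str, reputation: float) -> str:
    """Decide on reputation alone at the extremes; otherwise ask the content filter."""
    if reputation < SPAM_THRESHOLD:
        return "spam"
    if reputation >= INBOX_THRESHOLD:
        return "inbox"
    return spam_filter(body, reputation)


def process(queue: deque) -> list:
    """Drain a queue of (body, sender_reputation) pairs and return each verdict."""
    results = []
    while queue:
        body, reputation = queue.popleft()
        results.append(route(body, reputation))
    return results
```

Passing the reputation into the content filter mirrors the `spam_filter(reputation)` call in the pseudocode: mid-range senders are not judged on reputation alone, but their reputation can still shift how strict the filter is.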
RESULTS Distribution of SPF reputations Distribution of DomainKey reputations
Results Contd. Selected SPF domains and their reputations Selected DomainKey domains and their reputations
Some of the good bulk sending practices
Use a consistent IP address to send bulk mail
Automatically unsubscribe users whose addresses bounce multiple pieces of mail
Provide a 'List-Unsubscribe' header which points to a web form where the user can unsubscribe easily from future mailings
Messages should indicate that they are bulk mail, using the 'Precedence: bulk' header field
You must terminate, in a timely fashion, all users and/or clients who use your service to send spam mail
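The two header-related practices above can be illustrated with Python's standard `email.message.EmailMessage` class. This is a sketch only: the addresses, the subject, and the unsubscribe URL are made-up placeholders, and `build_bulk_message` is a hypothetical helper name.

```python
from email.message import EmailMessage


def build_bulk_message(sender: str, recipient: str, subject: str,
                       body: str, unsubscribe_url: str) -> EmailMessage:
    """Build a message that marks itself as bulk mail and offers an easy unsubscribe."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    # Points the recipient's mail client at a web form for unsubscribing.
    msg["List-Unsubscribe"] = f"<{unsubscribe_url}>"
    # Declares the message as bulk mail.
    msg["Precedence"] = "bulk"
    msg.set_content(body)
    return msg


msg = build_bulk_message(
    "news@example.com",
    "user@example.org",
    "Monthly update",
    "Hello! Here is this month's news.",
    "https://example.com/unsubscribe",
)
print(msg["Precedence"])
```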
Problems Forwarding spam to Gmail using tools that modify envelope sender Mailing lists (granularity) Corporate bulk senders who rarely send spammy bulk messages Users who report spam on a mailing list they are subscribed to
Simple rules for senders Authenticate using both SPF and DomainKeys Forwarded emails should not be authenticated unless the spam has been filtered out Try to keep spammers off the network Observe good bulk sending practices
CONCLUSIONS Using this reputation system, spammers and good senders are easily identified. Some problems remain, but they can be solved eventually. Both SPF and DomainKeys should be used; one cannot replace the other.
DISCUSSION Are reputation systems vulnerable to attacks? What kinds of attacks can be expected? Is Gmail's decision not to share spam information (whitelists/block lists) questionable? How feasible is a more granular reputation system?