Published by Rosalind Williamson. Modified over 8 years ago.
1
Density-Based Spam Detector
Hiromitsu FUJIKAWA, Katsuyuki YAMAZAKI
KDDI R&D Laboratories Inc., 2-1-15 Ohara, Kami-fukuoka, Saitama 356-8502, Japan
{fujikawa,yamazaki}@kddilabs.jp
Fuminori ADACHI, Takashi WASHIO, Hiroshi MOTODA
ISIR, Osaka University, 8-1 Mihogaoka, Ibaraki-shi, Osaka 567-0047, Japan
{adachi,washio,motoda}@ar.sanken.osaka-u.ac.jp
Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data/text mining
RESEARCH TRACK, INDUSTRIAL TRACK (Acceptance rate = 40/337 = 12%)
2
Density-Based Spam Detector
A new spam detection method that uses document space density information
–The use of document space density
–An efficient implementation through the use of a direct-mapped cache
Purpose
–A spam filter that runs on a mail server needs high processing speed, easy maintenance, high accuracy, and privacy protection
3
System Architecture
[Diagram labels: mail corpus; find features by N-gram hash; hash table; hash DB; read features; write/update features, similar email, email; incoming email; calculate similarity; threshold; SPAM]
–Unsupervised learning: solves the privacy and maintenance problems
–Hash table: solves the processing speed problem
–Similarity and threshold: solve the spam filter accuracy problem
4
Related work
Bayesian-like approach
Rule-based approach
Checksum database
–http://www.dcc-servers.net/dcc/
Vector representation
Hash-based text representation
–Text retrieval, text compression, spam filtering
–A direct-mapped cache is used in place of LRU
5
Density-based spam detector
Document space density
–Count the number of earlier emails that are similar to each incoming email
–With this count, a simple threshold is enough to distinguish spam from other emails.
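The counting idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `DensityDetector`, the toy predicate `same_words`, and the threshold value are all assumptions, and the linear scan over earlier emails stands in for the hash database described on later slides.

```python
from typing import Callable, List


class DensityDetector:
    """Toy density detector: counts earlier emails similar to the new one."""

    def __init__(self, is_similar: Callable[[str, str], bool], threshold: int):
        self.is_similar = is_similar  # hypothetical similarity predicate
        self.threshold = threshold    # hypothetical density threshold
        self.seen: List[str] = []     # linear store; stand-in for the hash DB

    def check(self, email: str) -> bool:
        """Return True when enough similar emails were seen before this one."""
        density = sum(1 for old in self.seen if self.is_similar(old, email))
        self.seen.append(email)
        return density >= self.threshold


def same_words(a: str, b: str) -> bool:
    """Toy similarity: two texts match when they use the same set of words."""
    return set(a.lower().split()) == set(b.lower().split())


detector = DensityDetector(same_words, threshold=2)
results = [detector.check(m) for m in ["Buy now", "hello Bob", "buy NOW", "now buy"]]
```

Only the fourth message is flagged, because it is the first one that arrives after two similar messages have already been stored.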
6
Mail System Design
Mass Mail Detector
–Monitors network packets
–Analyzes SMTP traffic between mail servers and reconstructs the text of emails
–Converts the text into a vector representation
7
Hash design
Hash-based representation
–The hash value of each length-L substring is calculated, and the first N of these values are used as the vector representation of the email
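A minimal sketch of this representation follows. Several details are assumptions, since the slide does not state them: CRC32 (`zlib.crc32`) stands in for the hash function, "the first N" is read as the N smallest distinct hash values, and the function name `hash_features` and the default values of `L` and `N` are hypothetical.

```python
import zlib
from typing import List


def hash_features(text: str, L: int = 8, N: int = 100) -> List[int]:
    """Hash every length-L byte substring and keep N of the hash values.

    Assumption: "first N" means the N smallest distinct values, so that
    near-identical texts tend to keep the same values.
    """
    data = text.encode("utf-8")
    # One CRC32 value per sliding window of L bytes; empty if the text is too short.
    hashes = {zlib.crc32(data[i:i + L]) for i in range(len(data) - L + 1)}
    return sorted(hashes)[:N]


features = hash_features("hello world, this is a test", L=4, N=5)
```

With this reading, two emails that differ in only a few characters share most of their substrings and therefore most of their kept hash values.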
8
Caching Architecture
Direct-mapped cache architecture
–The hash database stores the hash values of each email and the number of similar emails.
–The direct-mapped cache is a simple hash table; through it, the algorithm can find the entries of emails that have the same hash value.
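A direct-mapped cache gives every key exactly one possible slot, so a lookup is a single probe and a colliding insert simply overwrites the old entry. The sketch below shows that behaviour under stated assumptions: the class name `DirectMappedCache`, the slot layout (hash value plus an email identifier), and the use of `hash_value % num_slots` as the slot index are illustrative choices, not details taken from the slides.

```python
from typing import List, Optional, Tuple


class DirectMappedCache:
    """Fixed-size table in which each hash value maps to exactly one slot."""

    def __init__(self, num_slots: int):
        self.num_slots = num_slots
        # Each slot holds (hash_value, email_id) or None when empty.
        self.slots: List[Optional[Tuple[int, int]]] = [None] * num_slots

    def _index(self, hash_value: int) -> int:
        return hash_value % self.num_slots

    def get(self, hash_value: int) -> Optional[int]:
        """Return the email id stored for this hash value, if still cached."""
        entry = self.slots[self._index(hash_value)]
        if entry is not None and entry[0] == hash_value:
            return entry[1]
        return None

    def put(self, hash_value: int, email_id: int) -> None:
        """Store the entry, overwriting whatever occupied the slot before."""
        self.slots[self._index(hash_value)] = (hash_value, email_id)


cache = DirectMappedCache(num_slots=4)
cache.put(1, 10)
cache.put(5, 20)  # 5 % 4 == 1, so this evicts the entry for hash value 1
```

Unlike LRU, no usage order has to be tracked: the replacement decision is fixed by the slot index, which keeps both lookups and updates at constant cost.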
9
Similar emails
To check a single email, the detector looks for previous emails that share S% of the same hash values
–Algorithm
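The S% test itself can be sketched as a set overlap. This is a hedged illustration rather than the algorithm from the slide: the function names, measuring the shared fraction relative to the new email's hash values, and treating S as a minimum ("at least S%") are all assumptions.

```python
from typing import Iterable


def shared_fraction(new_hashes: Iterable[int], old_hashes: Iterable[int]) -> float:
    """Fraction of the new email's hash values that also occur in the old email."""
    new_set, old_set = set(new_hashes), set(old_hashes)
    if not new_set:
        return 0.0
    return len(new_set & old_set) / len(new_set)


def is_similar(new_hashes: Iterable[int], old_hashes: Iterable[int], S: float = 80.0) -> bool:
    """Assumed rule: similar when at least S percent of hash values are shared."""
    return shared_fraction(new_hashes, old_hashes) * 100.0 >= S


fraction = shared_fraction([1, 2, 3, 4], [2, 3, 4, 5])
```

In the example, three of the four hash values are shared, so the two emails count as similar for any S up to 75.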
11
Experimental results
Summary of experimental results
–The distribution of similar emails follows Zipf's law
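For reference, Zipf's law in its usual form says that the frequency of an item falls off as a power of its rank. The symbols $f$, $r$, and $\alpha$ below are generic notation rather than symbols from the slides; reading $f(r)$ as the size of the $r$-th largest group of similar emails is an assumption about how the law applies here.

```latex
f(r) \propto \frac{1}{r^{\alpha}}, \qquad \alpha \approx 1
```

Under such a distribution a few groups of similar emails are very large while most groups are small, which is consistent with a simple count threshold separating mass mail from ordinary mail.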
12
Experimental results Recall rate Cache usage Log
13
Experimental results Comparison with other methods Testing method
14
Experimental results
Effect of topic change
–bsfilter, tested on 2 groups of mailing-list mails
–S: 528 mails, H: 1538 mails
–Training: H + 1/2 S; testing: 1/2 S
–Result: after some period, bsfilter misclassified most of the mails
–Reason: the topics changed
15
Experimental results Effect of On-line Learning
16
Maintenance and privacy
Supervised learning methods require positive and negative examples of spam
–Someone has to check the contents of the mail manually, which can violate privacy.
Although our method requires a white list, maintaining a white list is relatively easy, especially compared with maintaining a black list.
17
Conclusion
High processing speed
–13000 emails per second (1.25 billion emails per day)
Maintenance free
98% recall rate and 100% precision
Privacy protection