Density-Based Spam Detector Hiromitsu FUJIKAWA Katsuyuki YAMAZAKI KDDI R&D Laboratories Inc. 2-1-15 Ohara, Kami-fukuoka, Saitama 356-8502, Japan


1 Density-Based Spam Detector
Hiromitsu FUJIKAWA, Katsuyuki YAMAZAKI
KDDI R&D Laboratories Inc., 2-1-15 Ohara, Kami-fukuoka, Saitama 356-8502, Japan
{fujikawa,yamazaki}@kddilabs.jp
Fuminori ADACHI, Takashi WASHIO, Hiroshi MOTODA
ISIR, Osaka University, 8-1, Mihogaoka, Ibarakishi, Osaka 567-0047, Japan
{adachi,washio,motoda}@ar.sanken.osaka-u.ac.jp
Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data/text mining
RESEARCH TRACK INDUSTRIAL TRACK (Acceptance rate = 40/337 = 12%)

2 Density-Based Spam Detector
A new spam detection method that uses document space density information
–The use of document space density
–An efficient implementation through the use of a direct-mapped cache
Purpose
–A spam filter used in a mail server has to offer: high processing speed, easy maintenance, high accuracy, privacy protection

3 System Architecture
[Figure: system architecture diagram. Labels: mail corpus; find features by N-gram hash; hash table; hash DB; write/update features, similar emails, emails; an incoming email; read features; calculate similarity; threshold; SPAM]
Unsupervised learning: solves the privacy problem and the maintenance problem
Hash table: solves the processing speed problem
Similarity and threshold: solve the spam filter accuracy problem

4 Related work
Bayesian-like approach
Rule-based approach
Checksum database
–http://www.dcc-servers.net/dcc/
Vector representation
Hash-based text representation
–Text retrieval, text compression, spam filtering
–A direct-mapped cache is used in place of LRU

5 Density-based spam detector
Document space density
–Count the number of similar emails
–Once the number of similar emails has been counted, a simple threshold is enough to distinguish spam from other emails.
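The thresholding idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `is_spam_by_density`, the `is_similar` predicate and the threshold value are hypothetical and not taken from the presentation, and the strict `>` comparison is one possible reading of the architecture slide.

```python
from typing import Callable, List


def is_spam_by_density(
    email: str,
    previous: List[str],
    is_similar: Callable[[str, str], bool],
    threshold: int,
) -> bool:
    """Flag an email when the count of similar earlier emails exceeds a threshold.

    `is_similar` and `threshold` are placeholders (assumptions); the slides
    do not give concrete values or a concrete similarity function here.
    """
    # Density = number of previously seen emails judged similar to this one.
    density = sum(1 for old in previous if is_similar(email, old))
    return density > threshold


# Toy usage with exact string equality standing in for real similarity.
seen = ["buy now", "buy now", "hello"]
print(is_spam_by_density("buy now", seen, lambda a, b: a == b, threshold=1))
```

In a real deployment the similarity predicate would compare hash-based representations rather than raw strings.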

6 Mail System Design
Mass Mail Detector
–Monitors network packets
–Analyzes SMTP traffic between mail servers and reconstructs the text of emails
–Transforms the text into a vector representation

7 Hash design
Hash-based representation
–A hash value is calculated for each substring of length L, and the first N of these values are used as the vector representation of the email
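A minimal sketch of this representation follows. The hash function (CRC32 from the standard library), the default values of `L` and `N`, and the reading of "first N" as the first N substrings in order of position are all assumptions made for illustration; the slide does not specify them.

```python
import zlib
from typing import List


def hash_vector(text: str, L: int = 9, N: int = 100) -> List[int]:
    """Return hash values of the first N length-L substrings of `text`.

    CRC32 is a stand-in hash (assumption); L and N defaults are illustrative.
    """
    data = text.encode("utf-8")
    hashes: List[int] = []
    # Slide a window of L bytes over the text, one position at a time.
    for start in range(max(0, len(data) - L + 1)):
        hashes.append(zlib.crc32(data[start:start + L]))
        if len(hashes) == N:
            break  # keep only the first N hash values
    return hashes


vec = hash_vector("Limited offer, buy now!", L=5, N=8)
print(len(vec))
```

Texts shorter than L bytes yield an empty vector in this sketch; a production system would need a policy for such emails.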

8 Caching Architecture
Direct-mapped cache architecture
–The hash database stores the hash values of each email and the number of similar emails.
–The direct-mapped cache is a simple hash table; through it, the algorithm can find the entries of emails that have the same hash value.
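The following sketch shows one way a direct-mapped cache keyed by hash value might look. The class name, slot layout and eviction behaviour are assumptions for illustration, not the authors' implementation: each hash value maps to exactly one slot, and a colliding insert simply overwrites the previous occupant, so no LRU bookkeeping is needed.

```python
from typing import List, Optional, Tuple


class DirectMappedCache:
    """Toy direct-mapped cache: one slot per index, overwrite on conflict."""

    def __init__(self, num_slots: int) -> None:
        self.num_slots = num_slots
        # Each slot holds (hash_value, [email ids]) or None when empty.
        self.slots: List[Optional[Tuple[int, List[str]]]] = [None] * num_slots

    def _index(self, hash_value: int) -> int:
        # The slot is fully determined by the hash value.
        return hash_value % self.num_slots

    def lookup(self, hash_value: int) -> List[str]:
        entry = self.slots[self._index(hash_value)]
        if entry is not None and entry[0] == hash_value:
            return entry[1]
        return []  # miss: slot is empty or holds a different hash value

    def insert(self, hash_value: int, email_id: str) -> None:
        i = self._index(hash_value)
        entry = self.slots[i]
        if entry is not None and entry[0] == hash_value:
            entry[1].append(email_id)
        else:
            # Conflict or empty slot: the old entry (if any) is evicted.
            self.slots[i] = (hash_value, [email_id])


cache = DirectMappedCache(num_slots=4)
cache.insert(1, "mail-a")
cache.insert(1, "mail-b")
print(cache.lookup(1))
```

The trade-off in this design is that lookups and inserts take constant time with no replacement bookkeeping, at the cost of losing entries whenever two hash values map to the same slot.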

9 Similar emails
To check a single email, find the previous emails that share at least S% of the same hash values
–Algorithm
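The algorithm figure is not part of this transcript, so the following is only a sketch of one way the S% check could be done. It assumes a hypothetical inverted index from hash value to the IDs of earlier emails containing it (each ID listed at most once per hash value), and it measures the shared fraction against the distinct hash values of the new email; both choices are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List


def find_similar(
    new_hashes: Iterable[int],
    index: Dict[int, List[str]],
    s_percent: float,
) -> List[str]:
    """Return IDs of stored emails sharing at least s_percent of the new hashes.

    `index` maps hash value -> IDs of earlier emails that contain it
    (a hypothetical structure standing in for the hash DB and cache).
    """
    unique = set(new_hashes)
    if not unique:
        return []
    shared: Counter = Counter()
    # Count, per stored email, how many of the new hash values it also has.
    for h in unique:
        for email_id in index.get(h, []):
            shared[email_id] += 1
    needed = len(unique) * s_percent / 100.0
    return sorted(e for e, count in shared.items() if count >= needed)


index = {1: ["a", "b"], 2: ["a"], 3: ["a"], 4: ["b"]}
print(find_similar([1, 2, 3, 4], index, s_percent=75))
```

The length of the returned list is the "number of similar emails" that a density threshold would then be applied to.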

10

11 Experimental results
Summary of experimental results
The distribution of similar emails follows Zipf’s law

12 Experimental results Recall rate Cache usage Log

13 Experimental results Comparison with other methods Testing method

14 Experimental results
Effect of topic change
bsfilter
–Two groups of mailing-list mail: S: 528 mails, H: 1538 mails
–Training: H + 1/2 S
–Testing: 1/2 S
–Result: after some period, bsfilter misclassified most of the mails
–Reason: topic change

15 Experimental results Effect of On-line Learning

16 Maintenance and privacy
Supervised learning methods require positive and negative examples of spam
–Someone has to check the contents of the mail manually, which has the potential to violate privacy.
Although our method requires a white list, maintaining a white list is relatively easy, especially compared with maintaining a black list.

17 Conclusion
High processing speed
–13000 emails per second (1.25 billion emails per day)
Maintenance free
98% recall rate and 100% precision
Privacy protection

