Density-Based Spam Detector Hiromitsu FUJIKAWA Katsuyuki YAMAZAKI KDDI R&D Laboratories Inc. 2-1-15 Ohara, Kami-fukuoka, Saitama 356-8502, Japan

Density-Based Spam Detector
Hiromitsu FUJIKAWA, Katsuyuki YAMAZAKI
KDDI R&D Laboratories Inc., 2-1-15 Ohara, Kami-fukuoka, Saitama 356-8502, Japan
{fujikawa,yamazaki}@kddilabs.jp
Fuminori ADACHI, Takashi WASHIO, Hiroshi MOTODA
ISIR, Osaka University, 8-1, Mihogaoka, Ibarakishi, Osaka 567-0047, Japan
{adachi,washio,motoda}@ar.sanken.osaka-u.ac.jp
Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data/text mining
RESEARCH TRACK INDUSTRIAL TRACK (Acceptance rate = 40/337 = 12%)

Density-Based Spam Detector
A new spam detection method that uses document space density information
–The use of document space density
–An efficient implementation through the use of a direct-mapped cache
Purpose
–A spam filter that runs on a mail server needs high processing speed, easy maintenance, high accuracy, and privacy protection

System Architecture
[Figure: system architecture. Labels recovered from the slide: Mail corpus; Find features by N-gram hash; Hash table of features; Hash DB (read features; write/update features, similar emails, email); An incoming email; Calculate similarity; Similarity > threshold: SPAM]
–Unsupervised learning: solves the privacy problem and the maintenance problem
–Hash table: solves the processing speed problem
–Similarity and threshold: solve the spam filter accuracy problem

Related work
Bayesian-like approach
Rule-based approach
Checksum database
–http://www.dcc-servers.net/dcc/
Vector representation
Hash-based text representation
–Text retrieval, text compression, spam filtering
–A direct-mapped cache is used as a replacement for LRU

Density-based spam detector
Document space density
–Count the number of similar e-mails
–Once the number of similar e-mails has been counted, a simple threshold is enough to distinguish spam from other e-mails.
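
The density idea on this slide can be sketched in a few lines. This is only an illustrative brute-force version, not the authors' implementation: the function names, the `min_shared` similarity cut-off, and the `density_threshold` value are all assumptions made for the example, and each email is assumed to be already reduced to a set of integer hash features.

```python
from typing import FrozenSet, List


def shared_fraction(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Fraction of the features in `a` that also occur in `b`."""
    return len(a & b) / len(a) if a else 0.0


def density(email: FrozenSet[int], seen: List[FrozenSet[int]],
            min_shared: float = 0.8) -> int:
    """Number of previously seen emails that are similar to `email`."""
    return sum(1 for other in seen
               if shared_fraction(email, other) >= min_shared)


def is_spam(email: FrozenSet[int], seen: List[FrozenSet[int]],
            density_threshold: int = 3, min_shared: float = 0.8) -> bool:
    """Flag the email when its neighbourhood is denser than the threshold.

    Both threshold values are hypothetical; the slides do not state them.
    """
    return density(email, seen, min_shared) > density_threshold
```

A mass mailing produces many near-identical feature sets, so its density is high, while ordinary correspondence rarely has more than a few close neighbours.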

Mail System Design
Mass Mail Detector
–Monitors network packets
–Analyzes SMTP traffic between mail servers and reconstructs the text of emails
–Converts the text into a vector representation

Hash design
Hash-based representation
–A hash value is calculated for each substring of length L, and the first N of these values are used as the vector representation of the email
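
A minimal sketch of this representation follows. It assumes CRC-32 as the hash function and reads "the first N" as the first N values in order of appearance; the slides specify neither, and the function name and default values of `L` and `N` are hypothetical.

```python
import zlib
from typing import List


def hash_features(text: str, L: int = 8, N: int = 100) -> List[int]:
    """Hash every length-L substring and keep the first N hash values.

    CRC-32 and the defaults for L and N are assumptions for illustration.
    """
    data = text.encode("utf-8")
    features: List[int] = []
    # Slide a window of L bytes over the text, one position at a time.
    for start in range(max(0, len(data) - L + 1)):
        if len(features) == N:
            break
        features.append(zlib.crc32(data[start:start + L]))
    return features
```

Because each feature depends only on a short window, two emails that differ in a few words still share most of their hash values.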

Caching Architecture
Direct-mapped cache architecture
–The hash database stores the hash values of each email and the number of similar emails.
–The direct-mapped cache is a simple hash table; through it, the algorithm can find the entries of emails that have the same hash value.
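
To illustrate what "direct-mapped" means here, the sketch below maps each hash value to exactly one slot and simply overwrites whatever was there on a collision, so no LRU bookkeeping is needed. The class and field names and the modulo slot mapping are assumptions; the slides do not describe the authors' data layout.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    hash_value: int  # full hash, kept to detect slot collisions
    email_id: int    # hypothetical identifier of the email that produced it


class DirectMappedCache:
    """Fixed-size table in which every hash value has exactly one slot."""

    def __init__(self, num_slots: int) -> None:
        self.slots: List[Optional[Entry]] = [None] * num_slots

    def _index(self, hash_value: int) -> int:
        return hash_value % len(self.slots)

    def put(self, hash_value: int, email_id: int) -> None:
        # A colliding older entry is overwritten: eviction costs nothing.
        self.slots[self._index(hash_value)] = Entry(hash_value, email_id)

    def get(self, hash_value: int) -> Optional[int]:
        entry = self.slots[self._index(hash_value)]
        if entry is not None and entry.hash_value == hash_value:
            return entry.email_id
        return None
```

Lookups and updates each touch a single slot, which is what keeps the per-email cost constant; the price is that an entry can be evicted by an unrelated hash value that lands in the same slot.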

Similar emails
To check a single email, find a similar previous e-mail, that is, one that shares S% of the same hash values
–Algorithm
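
The algorithm figure from this slide did not survive in the transcript, so the following is only one plausible reading of the S% rule, not the authors' procedure. It assumes an index from hash value to the id of the most recent email containing it (a plain `dict` standing in for the cache), and it treats "shares S%" as "at least S% of the incoming email's hash values". All names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def find_similar(features: List[int], index: Dict[int, int],
                 s_percent: float) -> Optional[int]:
    """Return the id of an earlier email sharing at least S% of the hashes."""
    if not features:
        return None
    # Each hash value found in the index votes for the email it came from.
    votes = Counter(index[h] for h in features if h in index)
    if not votes:
        return None
    email_id, shared = votes.most_common(1)[0]
    if 100.0 * shared / len(features) >= s_percent:
        return email_id
    return None


def register(email_id: int, features: List[int],
             index: Dict[int, int]) -> None:
    """Record the email's hash values so that later emails can match it."""
    for h in features:
        index[h] = email_id
```

For example, after `register(1, [10, 20, 30, 40], index)`, an incoming email with features `[10, 20, 30, 99]` shares 3 of its 4 hash values with email 1, so it counts as similar for any S up to 75.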

Experimental results
Summary of experimental results
–The distribution of similar e-mails follows Zipf’s law

Experimental results
[Figures: recall rate; cache usage; log]

Experimental results
Comparison with other methods
–Testing method

Experimental results
Effect of topic change
bsfilter
–2 groups from a mail list: S: 528 mails, H: 1538 mails
–Training: H + 1/2 S
–Testing: 1/2 S
–Result: after some period, bsfilter misclassified most of the mails
–Reason: the topics changed

Experimental results
Effect of on-line learning

Maintenance and privacy
Supervised learning methods require positive and negative examples of spam
–Someone has to check the contents of the mail manually, which can violate privacy.
Although our method requires a white list, maintaining such a white list is relatively easy, especially compared with maintaining a black list.

Conclusion
High processing speed
–13,000 emails per second (1.25 billion emails per day)
Maintenance free
98% recall rate and 100% precision
Privacy protection
