Project Presentation B92902041 王 立 B92902051 陳俊甫 B92902092 張又仁

Outline: spam filtering technology; personalization; statistics; our approach.

Spam filtering technology
  basic structured text filters
  whitelist/verification filters
  distributed blacklists (Pyzor)
  rule-based rankings (SpamAssassin)
  Bayesian word distribution filters
  Bayesian trigram filters

Table 1. Quantitative accuracy of spam filtering techniques (correctly identified vs. incorrectly identified)

Technique       Good corpus           Spam corpus
"The Truth"     1851 vs. 0            1916 vs. 0
Trigram model   1849 vs. 2            1774 vs. 142
Word model      1847 vs. 4            1819 vs. 97
SpamAssassin    1846 vs. 5            1558 vs. 358
Pyzor           1847 vs. 0 (4 err)    943 vs. 971 (2 err)
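The counts in Table 1 are easier to compare as rates. The sketch below is not part of the original slides; it assumes the false positive rate is the share of good mail that was misclassified and the catch rate is the share of spam that was correctly identified. The messages Pyzor reported as "err" are left out.

```python
# Counts copied from Table 1: (good_correct, good_wrong, spam_correct, spam_wrong).
# Assumption: the "(N err)" messages listed for Pyzor are excluded from the rates.
TABLE_1 = {
    "Trigram model": (1849, 2, 1774, 142),
    "Word model": (1847, 4, 1819, 97),
    "SpamAssassin": (1846, 5, 1558, 358),
    "Pyzor": (1847, 0, 943, 971),
}


def rates(good_ok, good_bad, spam_ok, spam_bad):
    """Return (false_positive_rate, spam_catch_rate) as fractions."""
    false_positive_rate = good_bad / (good_ok + good_bad)
    spam_catch_rate = spam_ok / (spam_ok + spam_bad)
    return false_positive_rate, spam_catch_rate


for name, counts in TABLE_1.items():
    fp, caught = rates(*counts)
    print(f"{name:14s} false positives {fp:.2%}  spam caught {caught:.2%}")
```

Read this way, Pyzor makes no false positives but misses about half of the spam, while the two Bayesian models catch most spam at a small false positive cost.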

Bayesian filtering: Pantel and Lin wrote one of the first two papers to apply Bayesian methods to spam filtering. In 1998 their filter caught 92% of spam with 1.16% false positives. Yet Bayesian filtering was not widely used at the beginning. Why?

Bayesian filtering (cont.): Jonathan Zdziarski is someone we found later. The main problem with the previous work was that its false positive rate was too high. By 2002, Bayesian filtering caught 99.5% of spam with 0.03% false positives. Why so different?

Possible reasons
  too little training data: 160 spam and 466 non-spam mails
  message headers were ignored
  tokens were stemmed, which reduces words in a harmful way
  all tokens were used, which works worse than using only the 15 most significant
  no bias against false positives

Personalization: advantages of per-user filters
  filters become more effective
  users decide for themselves what counts as spam
  it is hard for spammers to tune their mail against the filter

Statistics
The fifteen most interesting words in this spam, with their probabilities, are:
  madam            0.99
  promotion        0.99
  republic         0.99
  shortest         0.047225013
  mandatory        0.047225013
  standardization  0.07347802
  sorry            0.08221981
  supported        0.09019077
  people's         0.09019077
  enter            0.9075001
  quality          0.8921298
  organization     0.12454646
  investment       0.8568143
  very             0.14758544
  valuable         0.82347786
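To see how fifteen per-word probabilities become one verdict, here is a minimal sketch. It assumes the values are combined with the formula from "A Plan for Spam", the product of the probabilities divided by that product plus the product of their complements. The slide itself does not state the combination rule, so treat it as an illustration.

```python
import math

# Per-word spam probabilities copied from the slide, in the same order.
PROBS = [
    0.99, 0.99, 0.99, 0.047225013, 0.047225013,
    0.07347802, 0.08221981, 0.09019077, 0.09019077, 0.9075001,
    0.8921298, 0.12454646, 0.8568143, 0.14758544, 0.82347786,
]


def combine(probs):
    """Combine per-word spam probabilities, treating words as independent."""
    spam = math.prod(probs)
    ham = math.prod(1.0 - p for p in probs)
    return spam / (spam + ham)


combined = combine(PROBS)
print(f"combined spam probability: {combined:.4f}")
```

With these values the combined probability comes out at roughly 0.90. The three words at 0.99 outweigh the several words with low probabilities.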

Our approach: data set, sparse format, machine learning (training and testing).

Data set
  Source: http://iit.demokritos.gr/skel/i-config/downloads/
  Corpora: Lingspam, PU1, PU123, Enron-spam

Ling-spam: collected from the Linguist mailing list, with 481 spam messages and 2412 non-spam messages. The topics of the legitimate mails are alike, so the corpus may be good for training but is not general enough. There are 4 versions of the corpus: with or without a lemmatiser, and with or without a stop-list.

Example Subject: want best economical hunt vacation life ? want best hunt camp vacation life , felton 's hunt camp wild wonderful west virginium . $ 50 . 0 per day pay room three home cook meal ( pack lunch want stay wood noon ) cozy accomodation . reserve space . follow season book 1998 : buck season - nov . 23 - dec . 5 doe season - announce ( please call ) muzzel loader ( deer ) - dec . 14 - dec . 19 archery ( deer ) - oct . 17 - dec . 31 turkey sesson - oct . 24 - nov . 14 e - mail us 110734 . 2622 @ compuserve . com

Features
  'Words' are used as features.
  A word is a sequence of letters, digits, and some symbols.
  Only the subject and body fields are considered.
  CJK text is not supported for now.
  Features are collected from spam messages only.
  The feature set is unlimited, so only features that appear often enough are used.

Example for Features
Collected from the spams of the lemm_stop section (count, word):
  104 please
  104 free
  103 our
  95 mail
  91 address
  86 send
  81 one
  80 information
  77 us
  77 list
  74 receive
  74 name
  73 money
  …
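A feature list like the one above could be produced along these lines. This is a hedged sketch, not the project's code: the token pattern, the `min_count` threshold, and the choice to count each word once per message are all assumptions, since the slides do not specify them.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, digits and a few symbols.
# The exact symbol set used in the project is not given on the slides.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'!-]+")


def tokenize(text):
    """Split a subject or body string into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(spam_messages, min_count=2):
    """Map each word seen in at least min_count spam messages to an index."""
    counts = Counter()
    for message in spam_messages:
        counts.update(set(tokenize(message)))  # count each word once per message
    frequent = [(word, n) for word, n in counts.items() if n >= min_count]
    frequent.sort(key=lambda item: (-item[1], item[0]))  # most frequent first
    return {word: index for index, (word, _) in enumerate(frequent)}


# Hypothetical toy messages, only for illustration.
spam = [
    "please send money",
    "free money , please reply",
    "free list of address",
]
vocab = build_vocabulary(spam, min_count=2)
print(vocab)
```

Dropping rare words keeps the otherwise unlimited feature set small enough to train on.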

Sparse Format
Some results from lemm_stop/part1:
  0, 2:1, 3:1, 4:1, 5:1, 6:1, 10:1, 12:1, 15:1, 16:1, 20:1, …
  0, 0:1, 4:1, 5:1, 6:1, 7:1, 8:1, 12:1, 16:1, 20:1, 22:1, …
  0, 0:1, 4:1, 5:1, 7:1, 8:1, 11:1, 13:1, 25:1, 41:1, 53:1, …
  0, 0:1, 4:1, 5:1, 6:1, 8:1, 9:1, 11:1, 12:1, 13:1, 14:1, …
  1, 0:1, 3:1, 6:1, 10:1, 17:1, 18:1, 23:1, 26:1, 28:1, …
  1, 3:1, 4:1, 5:1, 6:1, 8:1, 9:1, 11:1, 13:1, 14:1, 15:1, …
  1, 0:1, 1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, …
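Each row above appears to be a class label followed by `index:1` pairs for the features present in one message. The sketch below writes and reads that layout. The helper names and the small vocabulary are hypothetical, and which label value means spam is an assumption (here 1).

```python
def to_sparse_line(label, tokens, vocab):
    """Encode one message as 'label, index:1, index:1, ...' with sorted indices."""
    indices = sorted({vocab[token] for token in tokens if token in vocab})
    fields = [str(label)] + [f"{index}:1" for index in indices]
    return ", ".join(fields)


def parse_sparse_line(line):
    """Decode a sparse line back into (label, set_of_feature_indices)."""
    label, *pairs = [field.strip() for field in line.split(",")]
    indices = {int(pair.split(":")[0]) for pair in pairs if pair}
    return int(label), indices


# Hypothetical vocabulary; real indices come from the feature collection step.
vocab = {"free": 0, "money": 1, "please": 2, "send": 3}
line = to_sparse_line(1, ["please", "send", "money", "now", "money"], vocab)
print(line)
print(parse_sparse_line(line))
```

Words outside the vocabulary ("now") are dropped, and a repeated word ("money") still yields a single `index:1` entry, which matches the binary values in the rows above.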

Training methods: Naïve Bayes; k-NN (k = 3 or less); CART decision tree.
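As an illustration of the first method, here is a minimal Naïve Bayes sketch over binary sparse features. The slides do not say which variant or library was used. The Bernoulli form with add-one smoothing is an assumption chosen because the feature values are all 1, and label 1 is assumed to mean spam.

```python
import math


def train_naive_bayes(rows, n_features):
    """rows: list of (label, set_of_feature_indices); labels are 0 or 1."""
    doc_count = {0: 0, 1: 0}
    feat_count = {0: [0] * n_features, 1: [0] * n_features}
    for label, indices in rows:
        doc_count[label] += 1
        for i in indices:
            feat_count[label][i] += 1
    total = sum(doc_count.values())
    model = {}
    for label in (0, 1):
        log_prior = math.log((doc_count[label] + 1) / (total + 2))
        # Add-one smoothed estimate of P(feature present | label).
        present = [(c + 1) / (doc_count[label] + 2) for c in feat_count[label]]
        model[label] = (log_prior, present)
    return model


def predict(model, indices):
    """Return the label with the highest log-probability for one message."""
    scores = {}
    for label, (log_prior, present) in model.items():
        score = log_prior
        for i, p in enumerate(present):
            score += math.log(p if i in indices else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy rows: features 0-2 lean spam, features 3-4 lean non-spam.
rows = [
    (1, {0, 1}), (1, {0, 2}), (1, {0, 1, 2}),
    (0, {3}), (0, {2, 3}), (0, {3, 4}),
]
model = train_naive_bayes(rows, n_features=5)
print(predict(model, {0, 1}), predict(model, {3}))
```

Working in log space avoids underflow when the feature set grows to thousands of words.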

Training and testing: Ling-spam is split into 10 parts; 9 parts are used for training and 1 part for testing.
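The 9-to-1 split can be sketched as follows. Rotating the held-out part so that every part is tested once (10-fold cross-validation) is an assumption, since the slide only says that one part is used for testing. The function name and placeholder messages are hypothetical.

```python
def ten_fold_splits(parts):
    """Yield (train_rows, test_rows), holding out each part exactly once."""
    for held_out in range(len(parts)):
        test = parts[held_out]
        train = [
            row
            for i, part in enumerate(parts)
            if i != held_out
            for row in part
        ]
        yield train, test


# Hypothetical stand-in for the 10 parts: 3 placeholder messages per part.
parts = [[f"msg{p}_{k}" for k in range(3)] for p in range(10)]
splits = list(ten_fold_splits(parts))

for fold, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold}: {len(train)} training rows, {len(test)} test rows")
```

Averaging the accuracy over all folds gives a steadier estimate than a single 9-to-1 split.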

References
  Spam filtering technology: http://www-128.ibm.com/developerworks/linux/library/l-spamf.html
  Better Bayesian Filtering: http://www.paulgraham.com/better.html
  A Plan for Spam: http://www.paulgraham.com/spam.html