Project Presentation B92902041 王 立 B92902051 陳俊甫 B92902092 張又仁 B92902095 李佳穎.



Outline
1. Spam filtering technology
2. Personalization
3. Statistics
4. Our approach

Spam filtering technology
1. Basic structured text filters
2. Whitelist/verification filters
3. Distributed blacklists (Pyzor)
4. Rule-based rankings (SpamAssassin)
5. Bayesian word distribution filters
6. Bayesian trigram filters

Table 1. Quantitative accuracy of spam filtering techniques

Technique      | Good corpus (correctly vs. incorrectly identified) | Spam corpus (correctly vs. incorrectly identified)
"The Truth"    | 1851 vs. [...]                                     | [...] vs. 0
Trigram model  | 1849 vs. [...]                                     | [...] vs. 142
Word model     | 1847 vs. [...]                                     | [...] vs. 97
SpamAssassin   | 1846 vs. [...]                                     | [...] vs. 358
Pyzor          | 1847 vs. 0 (4 err)                                 | 943 vs. 971 (2 err)

Bayesian filtering
- One of the first two filters to use a Bayesian method: Pantel and Lin
- Their Bayesian filter (1998) caught 92% of spam with a 1.16% false-positive rate
- Bayesian filtering was not widely used at the beginning
- Why?

Bayesian filtering (cont.)
- Someone we found later: Jonathan Zdziarski
- The main problem of the previous work was a false-positive rate that was too high
- Bayesian filtering in 2002: 99.5% of spam caught with a 0.03% false-positive rate
- Why so different?

Possible reasons
1. Too little training data: 160 spam and 466 non-spam mails
2. Message headers were ignored
3. Tokens were stemmed, which reduces words in a harmful way
4. All tokens were used, which works worse than using only the 15 most significant ones
5. No bias against false positives
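
Reason 5 can be illustrated with a small sketch. It assumes the per-token estimate described in "A Plan for Spam" (listed on the reference slide): occurrences in good mail are counted double, rare tokens are skipped, and the result is clamped. The function name, the parameter names and the default values are illustrative assumptions, not the presenters' implementation.

```python
def token_spam_probability(good_count, bad_count, n_good, n_bad,
                           good_bias=2, min_occurrences=5):
    """Estimate P(spam | token), biased against false positives.

    good_count / bad_count: occurrences of the token in good / spam mail.
    n_good / n_bad: number of good / spam messages in the training set.
    Returns None when the token is too rare to be trusted.
    """
    g = good_bias * good_count      # counting good mail double is the bias
    b = bad_count
    if g + b < min_occurrences:
        return None
    good_ratio = min(1.0, g / n_good)
    bad_ratio = min(1.0, b / n_bad)
    p = bad_ratio / (good_ratio + bad_ratio)
    # Clamp so that no single token is ever treated as absolute proof.
    return max(0.01, min(0.99, p))
```

With equal counts in both corpora the estimate falls below 0.5, so a token has to lean clearly toward spam before it counts against a message.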

Personalization
Advantages of personalization:
1. It makes filters more effective
2. It lets each user decide what their own spam filter blocks
3. It makes it hard for spammers to tune a mail to get through

Statistics
The fifteen most interesting words in this spam, with their probabilities, are:
- madam 0.99
- promotion 0.99
- republic 0.99
- shortest
- mandatory
- standardization
- sorry
- supported
- people's
- enter
- quality
- organization
- investment
- very
- valuable
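
A minimal sketch of how such a list could be used follows. It assumes that "most interesting" means furthest from the neutral value 0.5, and that the selected probabilities are combined with the naive Bayes product rule from "A Plan for Spam". The function names are hypothetical. In the usage example only the three 0.99 values come from the slide; the `token_a` and `token_b` entries are made-up placeholders.

```python
from math import prod


def most_interesting(token_probs, n=15):
    """Return the n (token, probability) pairs furthest from 0.5."""
    ranked = sorted(token_probs.items(),
                    key=lambda kv: abs(kv[1] - 0.5),
                    reverse=True)
    return ranked[:n]


def combined_spam_probability(probs):
    """Combine per-token probabilities, treating tokens as independent."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


if __name__ == "__main__":
    probs = {"madam": 0.99, "promotion": 0.99, "republic": 0.99,
             "token_a": 0.20, "token_b": 0.55}   # last two are placeholders
    top = most_interesting(probs, n=4)
    print(top)
    print(combined_spam_probability([p for _, p in top]))
```

Keeping only the strongest tokens means that long, mostly neutral text does not dilute a few decisive words.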

Our approach
Data set -> Sparse format -> Machine learning (training) -> Machine learning (testing)

Data set
Sources:
- Ling-spam
- PU1
- PU123
- Enron-spam

Ling-spam
- Collected from a mailing list (the Linguist list); the corpus is called "Ling-spam"
- 481 spam messages and 2412 non-spam messages
- The legitimate mails all cover similar topics
  - This may be good for training, but the corpus does not generalize well
- 4 versions of the corpus
  - With or without a lemmatiser
  - With or without a stop-list

Example
Subject: want best economical hunt vacation life ?
want best hunt camp vacation life, felton 's hunt camp wild wonderful west virginium. $ per day pay room three home cook meal ( pack lunch want stay wood noon ) cozy accomodation. reserve space. follow season book 1998 : buck season - nov dec. 5 doe season - announce ( please call ) muzzel loader ( deer ) - dec dec. 19 archery ( deer ) - oct dec. 31 turkey sesson - oct nov. 14 e - mail us compuserve. com

Features
- "Words" as features
  - A word is a sequence of letters, digits and some symbols
  - Only the subject and body fields are considered
  - CJK text is not supported for now
- Features are collected from spam messages only
- The feature set is unlimited in size
  - Only features that appear often enough are used
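
The feature rules above can be sketched as follows. The slide does not say which symbols count as part of a word, how counts are taken, or what "often enough" means, so the character class in `TOKEN_RE`, the use of total occurrence counts and the `min_count` threshold are all assumptions. Messages are represented as hypothetical `(subject, body)` pairs.

```python
import re
from collections import Counter

# Assumed definition of a "word": letters, digits and a few symbols.
TOKEN_RE = re.compile(r"[A-Za-z0-9$']+")


def tokenize(text):
    """Split text into lower-cased word tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def build_vocabulary(spam_messages, min_count=2):
    """Map each frequent token to a feature index.

    spam_messages: iterable of (subject, body) pairs taken from spam only.
    Tokens occurring fewer than min_count times in total are dropped.
    """
    counts = Counter()
    for subject, body in spam_messages:
        counts.update(tokenize(subject + " " + body))
    frequent = [tok for tok, c in counts.most_common() if c >= min_count]
    return {tok: idx for idx, tok in enumerate(frequent)}
```

Because tokens are ordered by frequency before numbering, low indices correspond to the most common spam words.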

Example for features
- 104 please
- 104 free
- 103 our
- 95 mail
- 91 address
- 86 send
- 81 one
- 80 information
- 77 us
- 77 list
- 74 receive
- 74 name
- 73 money
- …
Collected from the spam messages in the lemm_stop section.

Sparse format
Some results from lemm_stop/part1:
0, 2:1, 3:1, 4:1, 5:1, 6:1, 10:1, 12:1, 15:1, 16:1, 20:1, …
0, 0:1, 4:1, 5:1, 6:1, 7:1, 8:1, 12:1, 16:1, 20:1, 22:1, …
0, 0:1, 4:1, 5:1, 7:1, 8:1, 11:1, 13:1, 25:1, 41:1, 53:1, …
0, 0:1, 4:1, 5:1, 6:1, 8:1, 9:1, 11:1, 12:1, 13:1, 14:1, …
1, 0:1, 3:1, 6:1, 10:1, 17:1, 18:1, 23:1, 26:1, 28:1, …
1, 3:1, 4:1, 5:1, 6:1, 8:1, 9:1, 11:1, 13:1, 14:1, 15:1, …
1, 0:1, 1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, …
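
Each row above appears to be a class label followed by `index:1` pairs for the features present in one message. A minimal sketch of writing and reading such rows, under that assumption, is shown below. The slide does not state which label means spam, and the function names are hypothetical.

```python
def to_sparse_line(label, tokens, vocabulary):
    """Encode one message as 'label, i:1, j:1, ...' with sorted indices.

    vocabulary maps token -> feature index; unknown tokens are ignored,
    and a token that occurs several times is still written once.
    """
    indices = sorted({vocabulary[t] for t in tokens if t in vocabulary})
    return ", ".join([str(label)] + [f"{i}:1" for i in indices])


def parse_sparse_line(line):
    """Decode a complete sparse row into (label, set of feature indices)."""
    fields = [f.strip() for f in line.split(",") if f.strip()]
    label = int(fields[0])
    features = {int(f.split(":")[0]) for f in fields[1:]}
    return label, features
```

Only the features that are present are stored, which keeps rows short even when the vocabulary is large.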

Training methods
1. Naïve Bayes
2. k-NN, with k = 3 or less
3. CART (decision tree)
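
The first two methods can be sketched in plain Python on binary sparse rows. This is an illustration under assumptions rather than the presenters' code: the naive Bayes variant is a Bernoulli model with Laplace smoothing, k-NN uses the Hamming distance between feature sets, and label 1 is assumed to mean spam. CART is not sketched here.

```python
import math


def train_naive_bayes(rows, n_features):
    """rows: list of (label, set of feature indices), label in {0, 1}."""
    class_counts = {0: 0, 1: 0}
    feature_counts = {0: [0] * n_features, 1: [0] * n_features}
    for label, features in rows:
        class_counts[label] += 1
        for i in features:
            feature_counts[label][i] += 1
    model = {}
    for label in (0, 1):
        log_prior = math.log((class_counts[label] + 1) / (len(rows) + 2))
        # Laplace-smoothed P(feature present | class)
        present = [(feature_counts[label][i] + 1) / (class_counts[label] + 2)
                   for i in range(n_features)]
        model[label] = (log_prior, present)
    return model


def predict_naive_bayes(model, features):
    """Return the label with the highest log posterior score."""
    scores = {}
    for label, (log_prior, present) in model.items():
        score = log_prior
        for i, p in enumerate(present):
            score += math.log(p if i in features else 1.0 - p)
        scores[label] = score
    return max(scores, key=scores.get)


def predict_knn(rows, features, k=3):
    """Majority vote among the k rows with the smallest Hamming distance."""
    nearest = sorted(rows, key=lambda row: len(row[1] ^ features))[:k]
    votes = sum(label for label, _ in nearest)
    return 1 if 2 * votes > len(nearest) else 0
```

An odd `k` such as 3 avoids tied votes with two classes; with an even `k` this sketch resolves ties in favour of label 0.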

Training and testing
- Ling-spam is split into 10 parts
- 9 parts are used for training
- 1 part is used for testing
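
A sketch of this split, assuming each part is a list of `(label, features)` rows and that label 1 means spam. The helper names are hypothetical. The slide does not say whether the held-out part is rotated; calling `split_parts` once for every `test_index` would give 10-fold cross-validation.

```python
def split_parts(parts, test_index):
    """Use one part for testing and all the others for training."""
    train = [row for i, part in enumerate(parts) if i != test_index
             for row in part]
    test = list(parts[test_index])
    return train, test


def accuracy(predict, test_rows):
    """Fraction of test rows whose predicted label matches the true one."""
    correct = sum(1 for label, features in test_rows
                  if predict(features) == label)
    return correct / len(test_rows)


def false_positive_rate(predict, test_rows, spam_label=1):
    """Fraction of legitimate test rows that were classified as spam."""
    legit = [features for label, features in test_rows if label != spam_label]
    if not legit:
        return 0.0
    flagged = sum(1 for features in legit if predict(features) == spam_label)
    return flagged / len(legit)
```

Reporting the false-positive rate next to accuracy matters here, because the earlier slides single out false positives as the main weakness of early Bayesian filters.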

References
- Data
- Spam filtering technology: ibm.com/developerworks/linux/library/l-spamf.html
- Better Bayesian Filtering
- A Plan for Spam