© 2003 Franz J. Kurfess Spam Filtering 1 CPE/CSC 481: Knowledge-Based Systems Dr. Franz J. Kurfess Computer Science Department Cal Poly.

Presentation transcript:

© 2003 Franz J. Kurfess Spam Filtering 1 CPE/CSC 481: Knowledge-Based Systems Dr. Franz J. Kurfess Computer Science Department Cal Poly

Course Overview
- Introduction
- Knowledge Representation
  - Semantic Nets, Frames, Logic
- Reasoning and Inference
  - Predicate Logic, Inference Methods, Resolution
- Reasoning with Uncertainty
  - Probability, Bayesian Decision Making
- Expert System Design
  - ES Life Cycle
- CLIPS Overview
  - Concepts, Notation, Usage
- Pattern Matching
  - Variables, Functions, Expressions, Constraints
- Expert System Implementation
  - Salience, Rete Algorithm
- Expert System Examples
- Conclusions and Outlook

Overview Spam Filtering
- Motivation
- Objectives
- Chapter Introduction
  - Spam
  - Terminology
- Dealing with Spam
  - Laws and Regulations
  - Filtering via Keywords
  - Filtering via Rules
  - Learning
- Spam and Bayes
  - Binary Classification of Documents
  - N-ary Classification
- Implementation
  - SpamBayes Project
  - Related Projects
- Important Concepts and Terms
- Summary

Logistics
- Introductions
- Course Materials
  - textbooks (see below)
  - lecture notes
    - PowerPoint slides will be available on my Web page
  - handouts
  - Web page
- Term Project
- Lab and Homework Assignments
- Exams
- Grading

Bridge-In

Pre-Test

Motivation
- dealing with spam “manually” is very time-consuming, tedious, and prone to errors
- various methods have been tried to “filter” spam, with varying success
- early results with Bayesian approaches look very promising

Objectives
- to be familiar with the terminology
  - spam
  - Bayesian approaches
- to understand
  - elementary methods for handling spam automatically
  - more advanced methods
  - scenarios and applications for those methods
  - important characteristics
    - differences between methods, advantages, disadvantages, performance, typical scenarios
- to evaluate the suitability of approaches for specific tasks
  - binary classification
  - n-ary classification
- to be able to apply Bayesian filtering
  - spam
  - similar problems

Evaluation Criteria


Spam
- broadly: any email that is not wanted by the recipient
  - similar to paper “junk” mail
  - easily recognized by recipients
- unsolicited bulk email
  - not requested by the recipients
  - automatically sent out to a large number of recipients
- “optional” characteristics
  - disguised or forged sender, return addresses, and forwarding information
  - questionable contents
    - illegal, unethical, fraudulent, ...
  - hidden activities
    - acknowledgement of receipt, spyware (“Web bugs”), viruses

Terminology
- spam terms
  - spam: negative (bad stuff)
  - ham: positive (good stuff)
- filtering terms
  - false negative
    - spam incorrectly classified as ham: spam “gets through”
  - false positive
    - ham incorrectly classified as spam: valid messages are blocked
  - corpus
    - body of documents (email messages)
  - hapax, hapax legomenon
    - unique word in a specific message
  - sample or training set
    - messages used to train the system
  - test set
    - messages used to evaluate the system
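The two error types defined above can be counted mechanically once each message has both a true label and a filter decision. The following is a minimal sketch; the function name and the string labels "spam" and "ham" are assumptions made for illustration.

```python
def confusion_counts(actual, predicted):
    """Count filter outcomes, using the slide's definitions of the two error types."""
    counts = {"caught_spam": 0, "false_negative": 0,
              "false_positive": 0, "passed_ham": 0}
    for truth, guess in zip(actual, predicted):
        if truth == "spam" and guess == "spam":
            counts["caught_spam"] += 1
        elif truth == "spam":
            counts["false_negative"] += 1   # spam "gets through"
        elif guess == "spam":
            counts["false_positive"] += 1   # a valid message is blocked
        else:
            counts["passed_ham"] += 1
    return counts
```

In practice the two errors are rarely weighted equally: a blocked valid message is usually considered more costly than a spam message that gets through.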

Filtering Spam
- Keywords
- Rules
- Learning

Keywords
- identify keywords that frequently occur in spam
- simple and efficient
  - all incoming messages are checked for the occurrence of these keywords
  - if a message contains any or several of them, it is blocked
  - the list of keywords can be modified easily
- not very accurate
  - many false positives
    - legitimate messages that happen to include “forbidden” words
  - many false negatives
    - can be easily circumvented
- used in some early filtering and Web blocking tools
  - little to moderate success
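A keyword filter of the kind described above fits in a few lines, which also makes its weaknesses easy to see. This is only a sketch: the keyword list, the function names, and the `min_hits` parameter are hypothetical and not taken from any particular tool.

```python
import re

# Hypothetical keyword list; a real deployment would maintain its own.
SPAM_KEYWORDS = {"viagra", "lottery", "winner"}

def keyword_hits(message, keywords=SPAM_KEYWORDS):
    """Return the keywords that occur as whole words in the message."""
    words = set(re.findall(r"[a-z0-9']+", message.lower()))
    return words & keywords

def is_blocked(message, min_hits=1):
    """Block a message that contains at least min_hits distinct keywords."""
    return len(keyword_hits(message)) >= min_hits
```

With this list, a legitimate message such as "The lottery scene in the novel" is blocked (a false positive), while "You are a w1nner" passes (a false negative), which matches the accuracy problems listed on the slide.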

Rules
- characteristics of spam messages are described through if ... then rules
- not too complicated, moderately efficient
  - characteristics can be combined
  - not only keywords
  - also formatting, headers
- more accurate
  - fewer false positives
    - allows a better description of spam messages
  - fewer false negatives
    - somewhat more difficult to circumvent
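One way to picture rule-based filtering is as a list of named conditions over the subject, body, and headers. The sketch below additionally attaches a weight to each rule and sums the weights of the rules that fire; the weights, the threshold, the rules themselves, and the dictionary layout of a message are all assumptions for illustration, not something the slides prescribe.

```python
# Each rule: (description, condition on the message, weight). All hypothetical.
RULES = [
    ("subject is all capitals", lambda m: m["subject"].isupper(), 2.0),
    ("body contains HTML", lambda m: "<html" in m["body"].lower(), 1.0),
    ("missing Message-ID header", lambda m: "Message-ID" not in m["headers"], 1.5),
    ("body mentions 'free'", lambda m: "free" in m["body"].lower(), 1.0),
]

def rule_score(message):
    """Return the summed weight and the names of all rules that fire."""
    fired = [(name, weight) for name, test, weight in RULES if test(message)]
    return sum(weight for _, weight in fired), [name for name, _ in fired]

def is_spam(message, threshold=3.0):
    """IF the combined weight reaches the threshold THEN classify as spam."""
    score, _ = rule_score(message)
    return score >= threshold
```

Because several characteristics must coincide before the threshold is reached, a single "forbidden" word in a legitimate message no longer causes a block.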

Learning
- samples of good (ham) and bad (spam) messages are given to the system before it is deployed
- the system analyses various criteria, and tries to determine which criteria are most valuable for the distinction
- used earlier for general email categorization
  - assignment of messages to folders
  - suggestion of actions to be performed (e.g. reply, delete, forward)
  - spam was not a problem at that time

Spam and Bayes
- Binary Classification of Documents
  - two bins: spam, ham
  - sometimes an implicit “undecided” bin is used
- N-ary Classification
  - uses n bins
    - “sure spam”, “probably spam”, “maybe spam”, “unclear”, “maybe ham”, “probably ham”, “sure ham”
- Related Approaches
  - neural networks instead of Bayesian filtering
    - essentially also uses statistical techniques
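If a filter produces a single spam score between 0 and 1, the seven bins named above can be obtained by cutting that range into intervals. This is only one possible reading of n-ary classification; the cut-off values and names below are hypothetical.

```python
from bisect import bisect_right

# Hypothetical cut-offs on a spam score in [0, 1]; six cut-offs give seven bins.
CUTOFFS = [0.05, 0.20, 0.40, 0.60, 0.80, 0.95]
BINS = ["sure ham", "probably ham", "maybe ham", "unclear",
        "maybe spam", "probably spam", "sure spam"]

def bin_for(score):
    """Map a spam score to one of the seven bins."""
    return BINS[bisect_right(CUTOFFS, score)]
```

A mail client could then deliver the ham bins, quarantine the spam bins, and ask the user about anything that lands in "unclear".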

Binary Classification of Documents
- documents are parsed, and tokens extracted
  - pieces of the message that may serve as classification criteria
  - determined by the developer
- the number of occurrences for each token is calculated
  - done for two corpora: one ham, one spam
  - results in two tables with occurrences of tokens in ham and spam
- a third table is created that reflects the probability of a message being ham or spam
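The three tables described above can be sketched as follows. The slides do not give the formula for the third table, so the estimate used here (a token's relative frequency in spam divided by the sum of its relative frequencies in spam and ham) is an assumption in the spirit of Graham's articles; whitespace tokenization and both function names are likewise illustrative. Both corpora are assumed to be non-empty.

```python
from collections import Counter

def count_tokens(corpus):
    """Count occurrences of each whitespace-separated token in a corpus."""
    counts = Counter()
    for message in corpus:
        counts.update(message.lower().split())
    return counts

def probability_table(ham_corpus, spam_corpus):
    """Build the third table: an estimated spam probability per token."""
    ham_counts = count_tokens(ham_corpus)      # first table
    spam_counts = count_tokens(spam_corpus)    # second table
    ham_total = sum(ham_counts.values())
    spam_total = sum(spam_counts.values())
    table = {}
    for token in set(ham_counts) | set(spam_counts):
        ham_rate = ham_counts[token] / ham_total
        spam_rate = spam_counts[token] / spam_total
        table[token] = spam_rate / (ham_rate + spam_rate)
    return table
```

Tokens seen in only one corpus get a probability of exactly 0 or 1 under this estimate; practical filters usually clamp such values so that a single token cannot decide the outcome.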

Calculation of Probabilities
- Tokenizer
- Scoring
- Training
- Testing

Tokenizer
- breaks up a mail message into a series of tokens
  - usually words or word stems
  - sometimes complete phrases
  - may consider non-textual elements
    - message headers, HTML constructs, images, comments
- it can be difficult to identify meaningful tokens
  - message body tokens
  - embedded URLs
  - message headers
  - correlation between different types of clues
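A small tokenizer can keep header, URL, and body tokens apart by prefixing them with the part of the message they came from, in the style of the `Subject*free` and `Url*free` tokens that appear in Graham's article. The sketch below uses Python's standard `email` package; the regular expressions, the choice of headers, and the decision to ignore multipart bodies are simplifying assumptions.

```python
import re
from email import message_from_string

TOKEN_RE = re.compile(r"[A-Za-z0-9'$!-]+")   # keeps "free!" distinct from "free"
URL_RE = re.compile(r"https?://\S+")

def tokenize(raw_message):
    """Split a raw message into header, URL, and body tokens."""
    msg = message_from_string(raw_message)
    tokens = []
    for header in ("Subject", "From", "To"):
        value = msg.get(header, "")
        tokens += [f"{header}*{t}" for t in TOKEN_RE.findall(value)]
    body = msg.get_payload()
    if not isinstance(body, str):   # multipart messages are not handled here
        body = ""
    for url in URL_RE.findall(body):
        tokens += [f"Url*{t}" for t in TOKEN_RE.findall(url)]
    body = URL_RE.sub(" ", body)    # avoid counting URL parts twice
    tokens += TOKEN_RE.findall(body)
    return tokens
```

Case is preserved on purpose, so that `FREE` in a subject line and `free` in the body can receive different probabilities.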

Scoring
- assigns a number to each message
  - 0: definite ham
  - 1: definite spam
- most difficult and sensitive part of the system
  - incorrect scores
    - false positives
    - false negatives
  - unjustified confidence
    - scores are mostly close to 0 and 1, and rarely in between
- improvements through using two separate probabilities
  - ham probability
  - spam probability
  - allows better treatment of unknown cases as “unsure”
  - substantial reduction of false positives and false negatives
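One simple scoring scheme combines the per-token spam probabilities with the naive Bayes rule, in the manner of Graham's articles. It is not the two-probability method the slide mentions as an improvement. The sketch below assumes a probability table that maps tokens to values in [0, 1]; the clamping floor, the value for unknown tokens, and the limit of 15 tokens are illustrative parameters.

```python
import math

def combine(probabilities, floor=0.01):
    """Naive Bayes combination of per-token spam probabilities into one score."""
    clamped = [min(max(p, floor), 1.0 - floor) for p in probabilities]
    log_spam = sum(math.log(p) for p in clamped)
    log_ham = sum(math.log(1.0 - p) for p in clamped)
    # Equivalent to prod(p) / (prod(p) + prod(1 - p)), computed in log space.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

def score(tokens, table, unknown=0.5, top_n=15):
    """Score a message from its most decisive tokens (those furthest from 0.5)."""
    probs = [table.get(t, unknown) for t in set(tokens)]
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return combine(probs[:top_n])
```

Multiplying many probabilities pushes the result toward 0 or 1 very quickly, which illustrates the “unjustified confidence” noted on the slide.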

Training
- presentation of examples for ham and spam
  - generates the probabilities used by the scoring system to assign values to new messages
- corpus size
  - usually the larger, the better
  - too large may lead to overtraining
  - the number of ham and spam examples should be roughly equal
- corpus quality
  - representative samples are very valuable
  - better quality can make up for lack of quantity
  - avoid misleading cues
    - e.g. recent spam vs. old ham; tags added by the mail system

Testing
- messages categorized as ham or spam are used for testing the performance of the system
  - frequently the existing collection of categorized messages is divided into a training and a testing set
- intuitive insights often don’t work well
  - HTML tags
  - exclamation marks in the header
  - MESSAGES WRITTEN IN CAPITALS
- cross-validation
  - formal technique that systematically divides the corpus into various combinations of training and test sets
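A common form of cross-validation is k-fold: the corpus is cut into k parts, and each part serves once as the test set while the remaining parts are used for training. The generator below is a minimal sketch; the name and the default k are arbitrary, and the corpus is assumed to have been shuffled beforehand.

```python
def k_fold_splits(messages, k=5):
    """Yield (training set, test set) pairs so that each message is tested once."""
    folds = [messages[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [m for j, fold in enumerate(folds) if j != i for m in fold]
        yield train, test
```

Averaging the error counts over all k rounds gives a more stable performance estimate than a single training/test split.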

Results
- performance results are notoriously difficult to compare
  - message corpus
  - training methods
  - threshold
    - cut-off value for spam
  - “magic numbers”
    - parameters adjusted by the developer or user
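The effect of the threshold can be seen by recounting errors on the same scored messages with different cut-off values. The function below is a sketch, and the scores in the usage note are made up for illustration.

```python
def error_counts(scored, threshold):
    """scored: list of (score, label) pairs, with label 'ham' or 'spam'.

    A message is treated as spam when its score reaches the threshold.
    """
    false_pos = sum(1 for s, label in scored if label == "ham" and s >= threshold)
    false_neg = sum(1 for s, label in scored if label == "spam" and s < threshold)
    return false_pos, false_neg
```

For the hypothetical data `[(0.05, "ham"), (0.60, "ham"), (0.70, "spam"), (0.99, "spam")]`, a threshold of 0.5 blocks one valid message, while a threshold of 0.9 lets one spam message through instead. Results reported with different thresholds are therefore not directly comparable.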

Selected Results
- based on Paul Graham’s article “Better Bayesian Filtering” [Graham, 2003b]
- filtering rate on 1750 messages over 1 month
  - 4 false negatives: spam got through
    - usage of mostly legitimate words
    - neutral text with an innocent-sounding URL
  - 3 false positives: ham got blocked
    - newsletters sent through commercial emailers
    - “almost spam”
      - email that happens to have features typically associated with spam
      - ALL CAPITALS, in-line images, URLs

Token Probabilities
- based on Paul Graham’s article “Better Bayesian Filtering” [Graham, 2003b]
- tokens for variants of “free”, distinguished by case, punctuation, and message part
  - Subject*FREE
  - free!!
  - To*free
  - Subject*free
  - free!
  - Free
  - Url*free
  - FREE
  - From*free
  - free

N-ary Classification
- more than two categories
- similar techniques as in the binary approach
- can be substantially more complex

Related Approaches
- collaborative filtering
  - many people categorize messages as spam, and submit them to a central system
  - also should have ham samples
  - may “wash out” individual differences
- neural networks
  - similar concepts, but different learning methods

Implementation
- SpamBayes Project [SpamBayes]
  - stand-alone filter
  - plug-in for some popular mail programs
- Related Projects
  - SpamAssassin
    - combines statistical techniques, rules, black-lists, collaborative filtering
  - see Paul Graham’s list of spam filters at

Future Work
- extension to more sophisticated tokens
  - phrases
  - letters replaced by visually similar symbols
    - e.g. o/0, l/1
  - separators inserted between characters
    - spam -> s p a m, s-p-a-m
- combination with other approaches
  - blacklists, whitelists, rule-based systems, ...
- genetic algorithms
  - construction of filters through evolution
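The two obfuscation tricks named above can be partly undone by a normalization pass before tokenizing. The sketch below handles only the substitutions listed on the slide (0 for o, 1 for l) and single characters separated by spaces, dots, or hyphens; the function name and the regular expression are assumptions. The mapping is lossy, since genuine numbers are rewritten too, so the result suits matching only.

```python
import re

# Only the look-alike pairs named on the slide: 0 -> o, 1 -> l.
LOOKALIKES = str.maketrans({"0": "o", "1": "l"})

# Three or more single characters separated by a space, dot, or hyphen.
SPACED_OUT = re.compile(r"\b\w(?:[ .\-]\w){2,}\b")

def normalize(text):
    """Undo simple character-level obfuscation before tokenizing."""
    def join(match):
        return re.sub(r"[ .\-]", "", match.group(0))
    text = SPACED_OUT.sub(join, text)
    return text.translate(LOOKALIKES).lower()
```

An alternative is to leave the text untouched and let a Bayesian filter learn obfuscated spellings as tokens in their own right, since they rarely occur in legitimate mail.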

References
- [Graham, 2003a] Paul Graham, A Plan for Spam. August
- [Graham, 2003b] Paul Graham, Better Bayesian Filtering. January
- [SpamBayes] SpamBayes: Bayesian anti-spam classifier written in Python. visited Feb
- [Robinson, 2002] Gary Robinson's Rants: Spam Detection. etection.html, Dec (A revised version is to appear in the March 2003 issue of the Linux Journal.)
- [Giarratano & Riley 1998]

Important Concepts and Terms
- agenda
- backward chaining
- common-sense knowledge
- conflict resolution
- expert system (ES)
- expert system shell
- explanation
- forward chaining
- inference
- inference mechanism
- If-Then rules
- knowledge
- knowledge acquisition
- knowledge base
- knowledge-based system
- knowledge representation
- Markov algorithm
- matching
- Post production system
- problem domain
- production rules
- reasoning
- RETE algorithm
- rule
- working memory

Summary Spam Filtering
