Representation of Electronic Mail Filtering Profiles: A User Study Michael J. Pazzani Information and Computer Science University of California, Irvine.

Issues Addressed
- Would you let an agent filter your mail?
- If you could examine its filtering criteria, would this increase acceptance?
- Comprehensible filters can reduce legal liability: "This release of Outlook Express comes equipped with a new 'junk' filter. Insofar as Blue Mountain can ascertain, Microsoft's e-mail filter relegates greeting cards sent from Blue Mountain's web site to a 'junk mail' folder for immediate discard, rather than receipt by the user."
- How should the mail filtering profile be represented?

Mail Filtering: Rule-based (SpamFilter© by Novasoft; Microsoft Outlook)

Learning to Filter Mail
- Vector space (TF-IDF): R. Segal and J. Kephart, "MailCat: An Intelligent Assistant for Organizing E-mail," Proceedings of the Third International Conference on Autonomous Agents, May 1999.
- Rules: W. Cohen (1996), "Learning Rules that Classify E-mail."
- Bayesian: M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz (1998), "A Bayesian Approach to Filtering Junk E-mail."
- Support vector machines: S. Dumais, J. Platt, D. Heckerman, and M. Sahami (1998), "Inductive Learning Algorithms and Representations for Text Categorization."
- Neural networks: D. Lewis, R. Schapire, J. Callan, and R. Papka (1996), "Training Algorithms for Linear Text Classifiers."
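The Bayesian approach in Sahami et al. can be illustrated with a small multinomial naive Bayes classifier. This is a minimal sketch, not a claim about their exact model: the toy training messages are invented, and the add-one smoothing is a standard textbook choice.

```python
import math
from collections import Counter

def train_nb(messages):
    """Count word occurrences per class and class priors."""
    counts = {"spam": Counter(), "ham": Counter()}
    priors = Counter()
    for text, label in messages:
        priors[label] += 1
        counts[label].update(text.lower().split())
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, priors, vocab

def classify_nb(text, counts, priors, vocab):
    """Return the class maximizing log P(class) + sum of log P(word | class),
    with add-one smoothing over the vocabulary."""
    total = sum(priors.values())
    best, best_score = None, float("-inf")
    for label in priors:
        n = sum(counts[label].values())
        score = math.log(priors[label] / total)
        for w in text.lower().split():
            score += math.log((counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Toy training set, invented purely for illustration
train = [("free money call now", "spam"),
         ("free internet business", "spam"),
         ("meeting about the homework", "ham"),
         ("computer science talk today", "ham")]
counts, priors, vocab = train_nb(train)
```

With this tiny corpus, `classify_nb("free business call", counts, priors, vocab)` comes out spam because all three words have higher smoothed likelihood under the spam class.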

The paper I was going to write: "Word pairs increase user acceptance of learned rule-based filters"
- Collect representative e-mail messages
- Learn rule-based models with and without word pairs
- Ask users to rate profiles learned under various conditions
- Demonstrate increased acceptance of models with word pairs

Assumptions
Why rules? W. Cohen (1996): "the greater comprehensibility of the rules may be advantageous in a system that allows users to extend or otherwise modify a learned classifier."
Word pairs: treating two contiguous words as a single term.
- Restaurant recommendation, Pazzani (in press): "goat" vs. "goat cheese", "prime" vs. "prime rib"
- General finding: negligible increase in accuracy of the learned profile
- Intuition: word pairs might make profiles much more understandable
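Treating contiguous word pairs as extra terms is a one-line feature transform. A minimal sketch (the function name and whitespace tokenization are my assumptions):

```python
def terms_with_pairs(text):
    """Tokenize a message and append each contiguous word pair as an extra term."""
    words = text.lower().split()
    pairs = [" ".join(p) for p in zip(words, words[1:])]
    return words + pairs
```

For example, `terms_with_pairs("prime rib")` yields `['prime', 'rib', 'prime rib']`, so a learner can prefer the pair "prime rib" over the ambiguous single word "prime".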

Ripper Rules: Comprehensible => Acceptable?
Discard if the message contains our & internet
Discard if the message contains free & call
Discard if the message contains http & com
Discard if the message contains UCI & available
Discard if the message contains all & our & not
Discard if the message contains business & you
Discard if the message contains by & Humanities
Discard if the message contains over & you & can
Otherwise Forward
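A rule list like this can be applied as an ordered match. A minimal sketch using the first three rules from this slide; the matching logic (a rule fires when all of its terms appear in the message, first firing rule wins, default Forward) and the tokenization are my assumptions about how such a rule list is evaluated:

```python
# First three rules transcribed from the slide; each rule is the set of
# terms that must all be present for the rule to fire.
RULES = [
    {"our", "internet"},
    {"free", "call"},
    {"http", "com"},
]

def apply_rules(message, rules):
    """Fire the first rule whose terms all appear in the message; default Forward."""
    words = set(message.lower().split())
    for terms in rules:
        if terms <= words:  # every term of the rule is present
            return "Discard"
    return "Forward"
```

So `apply_rules("Call now for free prizes", RULES)` discards (the free & call rule fires), while a message matching no rule is forwarded.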

Ripper Rules with Word Pairs: A "Floor" Effect
Discard if the message contains you can & to be
Discard if the message contains the UCI
Discard if the message contains the internet & if you
Discard if the message contains you can & you have
Discard if the message contains email
Discard if the message contains P.M.
Discard if the message contains you want
Discard if the message contains one of
Discard if the message contains there are
Discard if the message contains please contact
Otherwise Forward

Ripper Rules for Forwarding
Forward if the message contains I & not business & not you can
Forward if the message contains computer science
Forward if the message contains Subject: Re:
Forward if the message contains in your & not free
Forward if the message contains I & not us
Forward if the message contains use the
Otherwise Discard

Ripper Rules with Style Features
Discard if the message has greater than 5% capital letters & does not contain I & does not contain computing
Discard if there is greater than 1 $ & not they
Discard if the message contains our & http
Discard if greater than 2% of the words are in ALL CAPS
Discard if the message contains please & not your
Otherwise Forward
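The style features these rules test (percent capital letters, number of $ signs, percent of words in ALL CAPS) can be computed from the raw message text. A sketch; the precise definitions, such as counting only alphabetic characters toward the capitalization percentage, are my assumptions:

```python
def style_features(message):
    """Surface statistics of the kind used by the style rules above."""
    letters = [c for c in message if c.isalpha()]
    caps = sum(1 for c in letters if c.isupper())
    words = message.split()
    all_caps = sum(1 for w in words if w.isalpha() and w.isupper())
    return {
        # Percentage of alphabetic characters that are uppercase
        "pct_capital_letters": 100 * caps / max(len(letters), 1),
        # Raw count of dollar signs in the message
        "dollar_signs": message.count("$"),
        # Percentage of purely-alphabetic words written in ALL CAPS
        "pct_allcaps_words": 100 * all_caps / max(len(words), 1),
    }
```

A rule such as "greater than 2% of the words are in ALL CAPS" then becomes a simple threshold test on `pct_allcaps_words`.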

FOCL Rules with Word Pairs
Discard if the message contains not I & not science
Discard if the message contains business & not Subject: Re
Discard if the message contains our & internet
Discard if the message contains income
Discard if the message contains you can & not all your
Discard if the message contains the UCI
Otherwise Forward

Ripper Rules: 80% Accurate Profile
Discard if the message contains the UCI & to the
Discard if the message contains the internet & you have
Discard if the message contains email & you can
Discard if the message contains are available
Discard if the message contains you will
Discard if the message contains web site
Discard if the message contains of the & we are
Discard if the message contains a new
Otherwise Forward

Evaluation Criteria for Mail Filtering
- Accuracy (and precision, recall, sensitivity, etc.)
- Efficiency (learning and classification)
- Cost sensitivity
- Traceability: the ease with which the user can emulate the categorization using a model
- Credibility: the degree to which the user believes the decision-making criteria will produce the desirable results
- Accountability: the degree to which the representation allows a user to distinguish an accurate model from an inaccurate one
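The quantitative criteria (accuracy, precision, recall) can be computed from predicted and actual labels. A small sketch, treating Discard as the positive class (my assumption, not stated on the slide):

```python
def filter_metrics(predicted, actual, positive="Discard"):
    """Accuracy over all messages; precision and recall for the positive class."""
    tp = sum(p == a == positive for p, a in zip(predicted, actual))
    fp = sum(p == positive != a for p, a in zip(predicted, actual))
    fn = sum(a == positive != p for p, a in zip(predicted, actual))
    correct = sum(p == a for p, a in zip(predicted, actual))
    return {
        "accuracy": correct / len(actual),
        "precision": tp / max(tp + fp, 1),  # of messages discarded, how many should be
        "recall": tp / max(tp + fn, 1),     # of messages that should be discarded, how many are
    }
```

For a filter, precision on the Discard class matters most: a false positive silently deletes legitimate mail, which is exactly the cost-sensitivity concern above.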

Text classification for e-mail

Pilot Study: People are greater than 95% accurate

Willingness to use profiles

Text classification profiles
Goals:
- Create user-understandable and user-editable profiles
- Create profiles that make errors easy to detect and correct
Rule-based representation (similar to Outlook): disappointing results
Speculations on representation issues:
- Are weighted representations less understandable?
- Are "prototype" representations more understandable?
Hypotheses:
- Using word pairs as terms makes a profile more understandable
- Using the absence of words makes a profile less understandable

Prototype Representation
IF the message contains more of: papers, particular, business, internet, http, money, us
THAN: I, me, Re, science, problem, talk, ICS, begins
THEN Discard
OTHERWISE Forward
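A prototype profile classifies a message by comparing its term overlap with the two prototypes. A sketch using the terms from this slide; the lowercasing and the tie-break toward Forward are my assumptions:

```python
# Prototype term sets transcribed from the slide (lowercased)
DISCARD_TERMS = {"papers", "particular", "business", "internet", "http", "money", "us"}
FORWARD_TERMS = {"i", "me", "re", "science", "problem", "talk", "ics", "begins"}

def prototype_classify(message):
    """Discard when the message shares strictly more terms with the
    discard prototype than with the forward prototype."""
    words = set(message.lower().split())
    if len(words & DISCARD_TERMS) > len(words & FORWARD_TERMS):
        return "Discard"
    return "Forward"
```

The appeal of this representation is that the user sees two short word lists and a single comparison, rather than an ordered rule list or a vector of weights.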

Linear Threshold
IF ( 11·"remove" + 10·"internet" + 8·"http" + 7·"call" + 7·"business"
   + 5·"center" + 3·"please" + 3·"marketing" + 2·"money" + 1·"us"
   + 1·"reply" + 1·"my" + 1·"free"
   - 14·"ICS" - 10·"me" - 8·"science" - 6·"thanks" - 6·"meeting"
   - 5·"problem" - 5·"begins" - 5·"I" - 3·"mail" - 3·"com" - 2·"www"
   - 2·"talk" - 2·"homework" - 1·"our" - 1·"it" - 1·"email" - 1·"all"
   - 1 ) is positive
THEN Delete
ELSE Forward
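Evaluating such a model is a weighted sum over the terms present in the message, plus a bias, compared against zero. A sketch using a subset of the weights from this slide (lowercased; the presence-based tokenization is my assumption):

```python
# Subset of the learned weights shown on the slide, keyed by lowercased term
WEIGHTS = {"remove": 11, "internet": 10, "http": 8, "call": 7, "business": 7,
           "ics": -14, "me": -10, "science": -8, "thanks": -6, "i": -5}
BIAS = -1  # the trailing constant in the learned model

def linear_classify(message):
    """Delete when the weighted sum of present terms plus the bias is positive."""
    words = set(message.lower().split())
    score = BIAS + sum(w for term, w in WEIGHTS.items() if term in words)
    return "Delete" if score > 0 else "Forward"
```

This is exactly the representation users found hard to read: understanding one decision requires mentally summing many signed weights.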

Linear Threshold with Pairs
IF ( 10·"business" + 7·"internet" + 6·"you can" + 6·"http" + 6·"center"
   + 5·"our" + 5·"email" + 3·"money" + 2·"the UCI" + 1·"I have"
   - 13·"ICS" - 10·"I'm" - 7·"science" - 7·"com" - 6·"but I" - 6·"Subject: Re"
   - 5·"I" - 4·"thanks" - 4·"problem" - 4·"me" - 4·"computer science"
   - 4·"I can" - 2·"talk" - 2·"mail" - 1·"my" - 2 ) is positive
THEN Delete
ELSE Forward

Prototype Representation with Pairs
IF the message contains more of: com, service, us, marketing, financial, 'the UCI', 'http www', 'you can', 'removed from'
THAN: I, me, ICS, learning, 'Subject: Re:', function, 'talk begins', 'computer science', 'the end'
THEN Discard
OTHERWISE Forward

Prototype Representation: 80% Accurate Profile
IF the message contains more of: looking, are, over, mailing, expert, reply, 'the subject', 'send an', 'at UCI'
THAN: done, I, research, sorry, science, because, minute, overview, similar, 'of it', 'need to', 'a minute'
THEN Discard
OTHERWISE Forward

Preferences

Algorithm             Mean Rating
Rules                 0.015
Rules (Pairs)
Rules (Noise)
Linear Model          0.421
Linear Model (Pairs)  0.518
Linear Model (Noise)
Prototype             0.677
Prototype (Pairs)     1.06
Prototype (Noise)     0.195

The following differences were highly significant (at least at the .005 level):
- Prototype representations with word pairs received higher ratings than rule representations with word pairs, t(132) =
- Inaccurate prototype models (learned from noisy training data) are less acceptable to users than accurate ones, t(132) =

The following differences were significant (at least at the .05 level):
- Prototype representations with word pairs received higher ratings than linear model representations with word pairs, t(132) =
- Inaccurate linear models are less acceptable to users than accurate ones, t(132) = 2.99.

The following difference was marginally significant (between the .1 and .05 level):
- For prototype representations, using word pairs as terms increases user ratings, t(132) = 2.37.

Learning Prototypes: A First Pass
Genetic algorithm:
- An instance is a pair of term vectors (discard terms, forward terms)
- 128 most informative terms
- Initialized with 10% of the features of each example
- Fitness function: number correct on the training data
- Operators: breeding and mutation
- Results on mail and S&W data: as good as anything else

Algorithm     Mail   Goats   Sheep   Bands
Perceptron
Nearest ID
Naïve Bayes
Rocchio
Prototype
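The genetic-algorithm loop described above (a population of term-vector pairs, fitness measured as the number of training messages classified correctly, evolved by breeding and mutation) can be sketched as follows. The vocabulary, toy training set, population size, and operator details are all invented for illustration and are much smaller than the 128-term setup on the slide:

```python
import random

random.seed(0)  # deterministic toy run

# Invented vocabulary and training data, purely illustrative
VOCAB = ["free", "money", "internet", "business", "science", "meeting", "talk", "i"]
TRAIN = [({"free", "money"}, "Discard"), ({"internet", "business"}, "Discard"),
         ({"science", "talk"}, "Forward"), ({"i", "meeting"}, "Forward")]

def classify(proto, words):
    """Prototype classification: larger overlap with the discard side wins."""
    d, f = proto  # (discard terms, forward terms)
    return "Discard" if len(words & d) > len(words & f) else "Forward"

def fitness(proto):
    """Number of training messages the prototype classifies correctly."""
    return sum(classify(proto, w) == lab for w, lab in TRAIN)

def random_proto():
    return (set(random.sample(VOCAB, 3)), set(random.sample(VOCAB, 3)))

def mutate(proto):
    """Toggle one random vocabulary term in one side of the prototype."""
    d, f = (set(s) for s in proto)
    side = random.choice([d, f])
    side.symmetric_difference_update({random.choice(VOCAB)})
    return (d, f)

pop = [random_proto() for _ in range(20)]
for _ in range(30):  # keep the fitter half, refill with mutated copies
    pop.sort(key=fitness, reverse=True)
    pop = pop[:10] + [mutate(random.choice(pop[:10])) for _ in range(10)]
best = max(pop, key=fitness)
```

Because the fitter half of the population is always retained, the best fitness is non-decreasing across generations, matching the "as good as anything else" outcome on small data.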