Representation of Electronic Mail Filtering Profiles: A User Study
Michael J. Pazzani
Information and Computer Science
University of California, Irvine
Issues Addressed
–Would you let an agent filter your mail?
–If you could examine its filtering criteria, would this increase acceptance?
–Comprehensible filters can reduce legal liability
  This release of Outlook Express comes equipped with a new "junk" filter. Insofar as Blue Mountain can ascertain, Microsoft's e-mail filter relegates greeting cards sent from Blue Mountain's web site to a "junk mail" folder for immediate discard, rather than receipt by the user.
–How should the mail filtering profile be represented?
Mail Filtering: Rule-based
–SpamFilter© by Novasoft
–Microsoft Outlook
Learning to Filter Mail
–Vector Space (TF-IDF): Segal, R. and Kephart, J. MailCat: An Intelligent Assistant for Organizing E-Mail. In Proceedings of the Third International Conference on Autonomous Agents, May
–Rules: Cohen, W. (1996). Learning Rules that Classify E-Mail.
–Bayesian: Sahami, M., Dumais, S., Heckerman, D. and Horvitz, E. (1998). A Bayesian approach to filtering junk e-mail.
–Support Vector Machines: Dumais, S., Platt, J., Heckerman, D., and Sahami, M. (1998). Inductive learning algorithms and representations for text categorization.
–Neural Networks: Lewis, D., Schapire, R., Callan, J., and Papka, R. (1996). Training algorithms for linear text classifiers.
The paper I was going to write
Word pairs increase user acceptance of learned rule-based filters
–Collect representative messages
–Learn rule-based models with and without word pairs
–Ask users to rate profiles learned under various conditions
–Demonstrate increased acceptance of models with word pairs
Assumptions
–Why Rules? W. Cohen (1996): “the greater comprehensibility of the rules may be advantageous in a system that allows users to extend or otherwise modify a learned classifier.”
–Word Pairs: Treating two contiguous words as a single term
–Restaurant Recommendation: Pazzani (in press)
  “goat” vs. “goat cheese”
  “prime” vs. “prime rib”
–General finding: Negligible increase in accuracy of learned profile
–Intuition: It might make profiles much more understandable
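The word-pair idea above can be sketched in a few lines. This is only an illustration, not the feature extractor used in the study: the tokenizer (lowercasing, splitting on non-alphanumeric characters) and the function names `tokenize` and `terms_with_pairs` are assumptions made for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def terms_with_pairs(text):
    """Return the set of single words plus every pair of contiguous words."""
    words = tokenize(text)
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return set(words) | set(pairs)

# "goat cheese" becomes a term of its own alongside "goat" and "cheese".
terms = terms_with_pairs("Warm goat cheese salad")
```

Only adjacent words are paired, so "goat salad" is not a term in this example.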
Ripper Rules: Comprehensible Acceptable
Discard if the message contains our & internet
Discard if the message contains free & call
Discard if the message contains http & com
Discard if the message contains UCI & available
Discard if the message contains all & our & not
Discard if the message contains business & you
Discard if the message contains by & Humanities
Discard if the message contains over & you & can
Otherwise Forward
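A rule list like the one above reads as "discard on the first rule whose words all appear, otherwise forward". The sketch below applies the slide's rules in that way; it is not Ripper itself (which learns the rules). Case-insensitive matching, the tokenizer, and the names `DISCARD_RULES` and `classify_with_rules` are assumptions for illustration.

```python
import re

# Each rule is a set of words that must all appear in the message.
# The words are taken from the slide, lowercased (an assumption).
DISCARD_RULES = [
    {"our", "internet"},
    {"free", "call"},
    {"http", "com"},
    {"uci", "available"},
    {"all", "our", "not"},
    {"business", "you"},
    {"by", "humanities"},
    {"over", "you", "can"},
]

def words_in(message):
    """Return the set of lowercased word tokens in the message."""
    return set(re.findall(r"[a-z0-9']+", message.lower()))

def classify_with_rules(message, rules=DISCARD_RULES):
    """Discard if any rule's words are all present; otherwise forward."""
    words = words_in(message)
    for rule in rules:
        if rule <= words:  # subset test: every word of the rule occurs
            return "Discard"
    return "Forward"
```

For instance, `classify_with_rules("Call now for your FREE trial")` fires the "free & call" rule.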
Ripper Rules with Word Pairs; A “floor” effect
Discard if the message contains you can & to be
Discard if the message contains the UCI
Discard if the message contains the internet & if you
Discard if the message contains you can & you have
Discard if the message contains
Discard if the message contains P.M.
Discard if the message contains you want
Discard if the message contains one of
Discard if the message contains there are
Discard if the message contains please contact
Otherwise Forward
Ripper Rules for Forwarding
Forward if the message contains I ¬ business ¬ you can
Forward if the message contains computer science
Forward if the message contains Subject Re:
Forward if the message contains in your ¬ free
Forward if the message contains I ¬ us
Forward if the message contains use the
Otherwise Discard
Ripper Rules with Style Features
Discard if the message has greater than 5% capital letters & does not contain I & does not contain computing
Discard if there is greater than 1 $ & not they
Discard if the message contains our & http
Discard if greater than 2% of the words are in ALL CAPS
Discard if the message contains please ¬ your
Otherwise Forward
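The style features these rules test (share of capital letters, number of dollar signs, share of ALL-CAPS words) could be computed roughly as follows. The slide does not say how the percentages were defined, so the denominators (alphabetic characters, whitespace-separated words) and the name `style_features` are assumptions.

```python
def style_features(message):
    """Compute three style features of a message (definitions are assumed)."""
    letters = [c for c in message if c.isalpha()]
    words = message.split()
    # Percentage of alphabetic characters that are upper case.
    capital_pct = 100.0 * sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    # Words written entirely in capitals; single letters such as "I" are ignored.
    caps_words = [w for w in words if len(w) > 1 and w.isalpha() and w.isupper()]
    all_caps_word_pct = 100.0 * len(caps_words) / len(words) if words else 0.0
    return {
        "capital_pct": capital_pct,
        "dollar_count": message.count("$"),
        "all_caps_word_pct": all_caps_word_pct,
    }

features = style_features("EARN $$$ NOW at home")
```

A rule such as "greater than 1 $" then becomes a simple comparison on `features["dollar_count"]`.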
FOCL Rules with Word Pairs
Discard if the message contains not I ¬ science
Discard if the message contains business ¬ Subject:Re
Discard if the message contains our & internet
Discard if the message contains income
Discard if the message contains you can ¬ all your
Discard if the message contains the UCI
Otherwise Forward
Ripper Rules: 80% accurate profile
Discard if the message contains the UCI & to the
Discard if the message contains the internet & you have
Discard if the message contains & you can
Discard if the message contains are available
Discard if the message contains you will
Discard if the message contains web site
Discard if the message contains of the & we are
Discard if the message contains a new
Otherwise Forward
Evaluation Criteria for Mail Filtering
–Accuracy (and precision, recall, sensitivity, etc.)
–Efficiency (Learning and Classification)
–Cost Sensitivity
–Traceability: The ease with which the user can emulate the categorization using a model.
–Credibility: The degree to which the user believes the decision-making criteria will produce the desirable results.
–Accountability: The degree to which the representation allows a user to distinguish an accurate model from an inaccurate one.
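Of these criteria, only the first is purely numeric. As a reminder of how accuracy, precision, and recall relate for a filter, here is a minimal sketch that treats "Discard" as the positive class; that choice and the name `filter_metrics` are assumptions for illustration.

```python
def filter_metrics(actual, predicted, positive="Discard"):
    """Accuracy, precision, and recall, with `positive` as the class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)  # junk correctly discarded
    fp = sum(a != positive and p == positive for a, p in pairs)  # good mail discarded
    fn = sum(a == positive and p != positive for a, p in pairs)  # junk forwarded
    correct = sum(a == p for a, p in pairs)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

actual = ["Discard", "Discard", "Discard", "Forward", "Forward"]
predicted = ["Discard", "Discard", "Forward", "Forward", "Discard"]
metrics = filter_metrics(actual, predicted)
```

Here a false positive is a legitimate message that gets discarded, which is the costly error in the Blue Mountain example.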
Text classification for
Pilot Study: People are greater than 95% accurate
Willingness to use profiles
Text classification profiles
Goals
–Create user-understandable, editable profiles
–Create profiles that make errors easy to detect and correct
Rule-based Representation
–Similar to Outlook
–Disappointing results
Speculations
–Representation issues
–Are weighted representations less understandable?
–Are “prototype” representations more understandable?
Hypotheses
–Using word pairs as terms makes profiles more understandable
–Using the absence of words makes profiles less understandable
Prototype Representation
IF the message contains more of
  papers particular business internet http money us
THAN
  I me Re science problem talk ICS begins
THEN Discard
OTHERWISE Forward
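One way to read the prototype above is as a comparison of two counts: how many of the "discard" terms and how many of the "forward" terms occur in the message. The sketch below counts each distinct term once and ignores case; both choices, and the names used, are assumptions, since the slide does not say whether repeated occurrences count.

```python
import re

# Term lists copied from the slide, lowercased (an assumption).
DISCARD_TERMS = {"papers", "particular", "business", "internet", "http", "money", "us"}
FORWARD_TERMS = {"i", "me", "re", "science", "problem", "talk", "ics", "begins"}

def classify_with_prototype(message):
    """Discard if more discard-prototype terms than forward-prototype terms occur."""
    words = set(re.findall(r"[a-z0-9']+", message.lower()))
    discard_hits = len(words & DISCARD_TERMS)
    forward_hits = len(words & FORWARD_TERMS)
    return "Discard" if discard_hits > forward_hits else "Forward"
```

Ties go to Forward, which matches the "OTHERWISE Forward" default on the slide.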
Linear Threshold
IF ( 11 "remove" + 10 "internet" + 8 "http" + 7 "call" + 7 "business" + 5 "center" + 3 "please" + 3 "marketing" + 2 "money" + 1 "us" + 1 "reply" + 1 "my" + 1 "free"
  - 14 "ICS" - 10 "me" - 8 "science" - 6 "thanks" - 6 "meeting" - 5 "problem" - 5 "begins" - 5 "I" - 3 "mail" - 3 "com" - 2 "www" - 2 "talk" - 2 "homework" - 1 "our" - 1 "it" - 1 " " - 1 "all" - 1 ) is positive
THEN Delete
ELSE Forward
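The linear threshold profile above amounts to a weighted sum compared against zero. The sketch below uses the weights from the slide, omitting the one term that is blank in the source. It assumes each term contributes its weight once if present (rather than once per occurrence) and that matching is case-insensitive; the names `WEIGHTS`, `BIAS`, `score`, and `classify_linear` are illustrative.

```python
import re

# Weights copied from the slide; the term that is blank in the source is left out.
WEIGHTS = {
    "remove": 11, "internet": 10, "http": 8, "call": 7, "business": 7,
    "center": 5, "please": 3, "marketing": 3, "money": 2, "us": 1,
    "reply": 1, "my": 1, "free": 1,
    "ics": -14, "me": -10, "science": -8, "thanks": -6, "meeting": -6,
    "problem": -5, "begins": -5, "i": -5, "mail": -3, "com": -3,
    "www": -2, "talk": -2, "homework": -2, "our": -1, "it": -1, "all": -1,
}
BIAS = -1  # the trailing constant in the slide's expression

def score(message):
    """Sum the weights of the terms present in the message, plus the bias."""
    words = set(re.findall(r"[a-z0-9']+", message.lower()))
    return sum(weight for term, weight in WEIGHTS.items() if term in words) + BIAS

def classify_linear(message):
    """Delete if the weighted sum is positive; otherwise forward."""
    return "Delete" if score(message) > 0 else "Forward"
```

For example, "Please call our business center" scores 3 + 7 - 1 + 7 + 5 - 1 = 20, so it would be deleted. A user checking such a decision by hand has to do the same arithmetic.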
Linear Threshold with Pairs
IF ( 10 "business" + 7 "internet" + 6 "you can" + 6 "http" + 6 "center" + 5 "our" + 5 " " + 3 "money" + 2 "the UCI" + 1 "I have"
  - 13 "ICS" - 10 "I'm" - 7 "science" - 7 "com" - 6 "but I" - 6 "Subject: Re" - 5 "I" - 4 "thanks" - 4 "problem" - 4 "me" - 4 "computer science" - 4 "I can" - 2 "talk" - 2 "mail" - 1 "my" - 2 ) is positive
Prototype Representation with Pairs
IF the message contains more of
  com service us marketing financial 'the UCI' 'http www' 'you can' 'removed from'
THAN
  I me ICS learning 'Subject: Re:' function 'talk begins' 'computer science' 'the end'
THEN Discard
OTHERWISE Forward
Prototype Representation: 80% accurate
IF the message contains more of
  looking are over mailing expert reply ‘the subject’ ‘send an’ ‘at UCI’
THAN
  done I research sorry science because minute overview similar ‘of it’ ‘need to’ ‘a minute’
THEN Discard
OTHERWISE Forward
Preferences

Algorithm               Mean Rating
Rules                   0.015
Rules (Pairs)
Rules (Noise)
Linear Model            0.421
Linear Model (Pairs)    0.518
Linear Model (Noise)
Prototype               0.677
Prototype (Pairs)       1.06
Prototype (Noise)       0.195

The following differences were highly significant (at least at the .005 level).
–Prototype representations with word pairs received higher ratings than rule representations with word pairs: t(132) =
–Inaccurate prototype models (learned from noisy training data) are less acceptable to users than accurate ones: t(132) =
The following differences were significant (at least at the .05 level).
–Prototype representations with word pairs received higher ratings than linear model representations with word pairs: t(132) =
–Inaccurate linear models are less acceptable to users than accurate ones: t(132) = 2.99.
The following difference was marginally significant (between the .1 and .05 levels).
–For prototype representations, using word pairs as terms increases user ratings: t(132) = 2.37.
Learning Prototype: A First Pass
Genetic Algorithm
–Instance is a pair of term vectors
–128 most informative terms
–Initialized with 10% of features of each example
–Fitness function: number correct on training data
–Operators: breeding, mutation
Results on mail, S&W data: as good as anything else
Algorithm  Mail  Goats  Sheep  Bands
Perceptron  Nearest  ID  Naïve Bayes  Rocchio  Prototype
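The genetic algorithm outlined above can be sketched as follows. This is a toy reconstruction from the bullet points, not the implementation behind the reported results. The population size, mutation rate, uniform crossover, and keep-the-fitter-half selection are all assumptions, and the selection of the 128 most informative terms is skipped (the vocabulary here is simply every term in the training data).

```python
import random

def classify(individual, doc_terms):
    """An individual is a pair (discard_terms, forward_terms) of term sets."""
    discard, forward = individual
    return "Discard" if len(doc_terms & discard) > len(doc_terms & forward) else "Forward"

def fitness(individual, data):
    """Number of training examples classified correctly."""
    return sum(classify(individual, terms) == label for terms, label in data)

def random_individual(data, rng, fraction=0.1):
    """Seed each prototype with about 10% of the terms of each example of its class."""
    discard, forward = set(), set()
    for terms, label in data:
        k = min(len(terms), max(1, round(fraction * len(terms))))
        sample = rng.sample(sorted(terms), k)
        (discard if label == "Discard" else forward).update(sample)
    return (frozenset(discard), frozenset(forward))

def breed(a, b, vocab, rng):
    """Uniform crossover: each term's membership is inherited from a random parent."""
    child = []
    for side in (0, 1):
        child.append(frozenset(
            t for t in vocab if t in (a if rng.random() < 0.5 else b)[side]))
    return tuple(child)

def mutate(individual, vocab, rng, rate=0.05):
    """Toggle each term's membership in each prototype with a small probability."""
    mutated = []
    for side in individual:
        terms = set(side)
        for t in vocab:
            if rng.random() < rate:
                terms ^= {t}
        mutated.append(frozenset(terms))
    return tuple(mutated)

def learn_prototype(data, generations=30, pop_size=20, seed=0):
    """Evolve a prototype pair; the fitter half of each generation survives unchanged."""
    rng = random.Random(seed)
    vocab = sorted(set().union(*(terms for terms, _ in data)))
    population = [random_individual(data, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=lambda ind: fitness(ind, data), reverse=True)
        parents = population[: pop_size // 2]
        children = [
            mutate(breed(rng.choice(parents), rng.choice(parents), vocab, rng), vocab, rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=lambda ind: fitness(ind, data))

# Hypothetical toy training set: (set of terms, label).
toy_data = [
    ({"free", "money", "internet"}, "Discard"),
    ({"business", "http", "call"}, "Discard"),
    ({"ics", "talk", "begins"}, "Forward"),
    ({"science", "homework", "meeting"}, "Forward"),
]
best = learn_prototype(toy_data, generations=5, pop_size=8, seed=1)
```

Because the fitter half always survives, the best fitness never decreases from one generation to the next. The toy data above is made up for the example and says nothing about the results on the mail or S&W data.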