Preventing Email Leaks & A Recommendation System for Email Recipients. Vitor R. Carvalho and William W. Cohen, Carnegie Mellon University, March 2007.

Presentation transcript:

Preventing Email Leaks & A Recommendation System for Email Recipients. Vitor R. Carvalho and William W. Cohen, Carnegie Mellon University. March 2007.

On July 6th 2001, the news agency Bloomberg.com published… “California Governor Gray Davis’s office released data on the state’s purchases in the spot electricity market — information Davis has been trying to keep secret — through a misdirected e-mail. The e-mail, containing data on California’s power purchases yesterday, was intended for members of the governor’s staff, said Davis spokesman Steve Maviglio. It was accidentally sent to some reporters on the office’s press list, he said. Davis is fighting disclosure of state power purchases, saying it would compromise negotiations for future contracts”.

Other examples… just google “email leak”:
– “Leaked email exposes MS charity as PR exercise”
– “Leaked email may be behind Morgan Stanley's Asia economist's sudden resignation”
– “Dell leaked email shows channel plans - Direct threat haunts dealers - A leaked email reveals Dell wants to get closer to UK resellers”
– “California Power-Buying Data Disclosed in Misdirected E-Mail”

Information Leaks via Email
Email leak = email message accidentally sent to “unintended” recipients. Typical causes:
1. Similar first/last names, aliases
2. Aggressive auto-completion of email addresses
3. Typos
4. Keyboard settings
Email leaks may contain sensitive information, leading to disastrous consequences.

Detecting Email Leaks: Method
Idea:
1. Goal: detect email messages accidentally sent to the wrong person.
2. Generate artificial leaks: leaks can be simulated by various criteria: a typo, similar last names, identical first names, aggressive auto-completion of email addresses, etc.
3. Method: look for outliers.
Look for Outliers:
1. Build a model for (msg, recipients) pairs: train a classifier on real data to detect simulated outliers (added to the “true” recipient list).
2. Features: textual (subject, body) and network features (frequencies, co-occurrences, etc.).
3. Rank potential outliers: detect the outlier and warn the user based on the classifier’s confidence.

Detecting Email Leaks: Method
[Diagram: candidate recipients Rec_6, Rec_2, …, Rec_K, Rec_5 ranked by P(rec_t), from most likely outlier to least likely outlier.]
P(rec_t) = probability that recipient t is an outlier given the message text and the other recipients in the message.
Look for Outliers: build a model for (msg, recipients) pairs, extract textual and network features, rank potential outliers, and warn the user based on the classifier’s confidence.

Leak Criteria: how to generate (artificial) outliers
Several options:
– Frequent typos, same/similar last names, identical/similar first names, aggressive auto-completion of email addresses, etc.
We adopted the “3g-address” criterion:
– On each trial, one of the msg recipients is randomly chosen and an outlier is generated as follows: if the Address Book contains another address sharing the same first three characters (3-gram), randomly select one of those; else, randomly select an Address Book entry.
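A minimal sketch of this artificial-leak generation step, assuming the Address Book is a plain list of address strings; the function and variable names are illustrative, not taken from the paper:

import random

def generate_3g_leak(true_recipients, address_book, rng=random):
    """Simulate an email leak with the "3g-address" criterion (a sketch).

    One true recipient is chosen at random; the artificial outlier is an
    Address Book entry sharing its first three characters, if any exists,
    otherwise a random Address Book entry.
    """
    chosen = rng.choice(true_recipients)
    prefix = chosen[:3]
    # Addresses that "look like" the chosen recipient (same initial 3-gram),
    # excluding the true recipients themselves.
    lookalikes = [a for a in address_book
                  if a.startswith(prefix) and a not in true_recipients]
    if lookalikes:
        return rng.choice(lookalikes)
    # Fallback branch from the slide: randomly select an Address Book entry.
    others = [a for a in address_book if a not in true_recipients]
    return rng.choice(others)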

Data Preprocessing
Used the Enron Dataset.
Set up a realistic temporal split:
– For each user, the 10% most recent sent messages are used as the test set.
All users had their Address Books extracted:
– List of all recipients in the sent messages.
Self-addressed messages were disregarded.

Data Preprocessing (continued)
ISI version of Enron:
– Removed repeated messages and inconsistencies.
Disambiguated main Enron addresses:
– List provided by Corrada-Emmanuel from UMass.
Bag-of-words:
– Messages were represented as the union of the BOW of the body and the BOW of the subject (textual features).
– Some stop words removed.

Experiments: Textual Features only
Three baseline methods:
– Random: rank recipient addresses randomly.
– Cosine or TfIdf Centroid (Rocchio): create a “TfIdf centroid” for each user in the Address Book. A user1 centroid is the sum of all training messages (in TfIdf vector format) that were addressed to user1. For testing, rank according to the cosine similarity between the test message and each centroid.
– Knn-30: given a test msg, get the 30 most similar msgs in the training set; rank users according to the “sum of similarities” over that 30-msg set.
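A minimal sketch of the TfIdf-centroid (Rocchio) baseline, using scikit-learn; the data layout (one text per sent message, plus its recipient list) and all names are illustrative assumptions:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_by_centroid(train_texts, train_recipients, test_text):
    # One TfIdf vector per training message (subject + body text).
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(train_texts)
    address_book = sorted({r for recs in train_recipients for r in recs})
    centroids = []
    for user in address_book:
        rows = [i for i, recs in enumerate(train_recipients) if user in recs]
        # The user's centroid is the sum of all training messages addressed to them.
        centroids.append(np.asarray(X[rows].sum(axis=0)).ravel())
    sims = cosine_similarity(vectorizer.transform([test_text]), np.vstack(centroids)).ravel()
    # Rank Address Book entries by similarity to the test message, best first.
    return [user for _, user in sorted(zip(sims, address_book), reverse=True)]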

Experiments: Textual Features only
Leak prediction results: accuracy over 10 trials; on each trial, a different set of outliers is generated. [Results chart.]

Using Network Features
1. Frequency features
– Number of received messages (from this user)
– Number of sent messages (to this user)
– Number of sent+received messages
2. Co-occurrence features
– Number of times a user co-occurred with all other recipients. “Co-occur” means two recipients were addressed in the same message in the training set.
3. Max3g features
– For each recipient R, find Rm (the address with the maximum score from the 3g-address list of R), then use score(R) - score(Rm) as a feature. Scores come from the CV10 procedure. Leak-recipient scores are likely to be smaller than their highest 3g-address score.
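A minimal sketch of the frequency and co-occurrence features, assuming `sent` is a list of recipient lists for the user's sent messages and `received_from` is a list of sender addresses for received messages; names are illustrative:

from collections import Counter
from itertools import combinations

def frequency_features(sent, received_from, user):
    n_sent = sum(user in recipients for recipients in sent)        # messages sent to this user
    n_received = sum(sender == user for sender in received_from)   # messages received from this user
    return {"sent": n_sent, "received": n_received, "sent+received": n_sent + n_received}

def cooccurrence_feature(sent, user, other_recipients):
    # How often `user` was addressed in the same training message as the
    # other recipients of the message being scored.
    pair_counts = Counter()
    for recipients in sent:
        for a, b in combinations(set(recipients), 2):
            pair_counts[frozenset((a, b))] += 1
    return sum(pair_counts[frozenset((user, r))] for r in other_recipients if r != user)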

Combining Textual and Network Features
10-fold cross-validation scheme.
Training:
– Use Knn-30 in a 10-fold cross-validation setting to get a “textual score” for each user on all training messages.
– Turn each training example into |R| binary examples, where |R| is the number of recipients of the message: |R|-1 positive (the real recipients) and 1 negative (the leak-recipient).
– Augment the “textual score” with the network features.
– Quantize the features.
– Train a classifier.
VP5: classification-based ranking scheme (VP5 = Voted Perceptron with 5 passes over the training set).
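A minimal sketch of the training-example construction for leak detection: each message yields |R| binary examples, |R|-1 positives (the real recipients) and one negative (the injected leak-recipient). The `textual_score` and `network_features` helpers stand in for the Knn-30 score and the features above; all names are illustrative:

def build_leak_examples(message, real_recipients, leak_recipient,
                        textual_score, network_features):
    examples = []
    for r in real_recipients + [leak_recipient]:
        feats = {"textual": textual_score(message, r)}         # Knn-30 cross-validated score
        feats.update(network_features(r, real_recipients))     # frequency / co-occurrence / Max3g
        label = -1 if r == leak_recipient else +1              # the leak is the only negative
        examples.append((feats, label))
    return examples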

Results: Textual+Network Features

Finding Real Leaks in Enron
How can we find them?
– Look for “mistake”, “sorry” or “accident”. We were looking for sentences like “Sorry. Sent this to you by mistake. Please disregard.”, “I accidentally sent you this reminder”, etc.
How many can we find?
– Dozens of cases. Unfortunately, most of these cases originated from non-Enron addresses, or from an Enron address that is not one of the 151 Enron users whose messages were collected. Our method requires a collection of sent (and received) messages from a user.
Found 2 real “valid” cases! (“valid” = testable)
– Message germanyc/sent/930: 20 recipients, one of which is the leak.
– Message kitchen-l/sent items/497: 44 recipients, one of which is the leak.

Finding Real Leaks in Enron
– Very disappointing results!!
– Reason: the two leak addresses were never observed in the training set!
[Results table: accuracy and average rank, 100 trials.]

“Smoothing” the leak generation
With some probability α, generate a random address NOT in the Address Book (sampling from random unseen recipients); else, select from the Address Book as before (3g-address criterion, falling back to a random Address Book entry).
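A minimal sketch of the smoothed leak generation, assuming a mixture parameter alpha: with probability alpha the outlier is a random address outside the Address Book (an unseen recipient), otherwise the 3g-address criterion sketched earlier is used; names are illustrative:

import random

def generate_smoothed_leak(true_recipients, address_book, alpha, rng=random):
    if rng.random() < alpha:
        # Sample a random address NOT in the Address Book ("unseen" recipient).
        while True:
            addr = "user%06d@example.com" % rng.randrange(10**6)
            if addr not in address_book:
                return addr
    # Otherwise fall back to the 3g-address criterion (generate_3g_leak above).
    return generate_3g_leak(true_recipients, address_book, rng)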

Some results: kitchen-l has 4 unseen addresses out of its 44 recipients; germany-c has only one out of 20.

Mixture parameter α: [results chart]

Back to the simulated leaks:

Conclusions
Privacy and email papers are rare. To the best of our knowledge, this was the first paper on preventing information leaks via email.
Can prevent HUGE problems.
Easy to implement in any email client; no change on the server side.
The Leak paper was accepted at SDM.
“This is a feature I would like to have in the email client I use myself.”
“Personally, I am eager to use such a tool if its accuracy is good.”

Preventing Email Leaks & A Recommendation System for Email Recipients. Vitor R. Carvalho and William W. Cohen, Carnegie Mellon University. March 2007.

Recommending Email Recipients
1. Prevent a user from forgetting to add an important collaborator or manager as a recipient, avoiding costly misunderstandings and communication delays. The cost of errors in task management is high: for instance, deadlines can be missed or opportunities wasted because of such errors.
2. Find people in an organization who are working on a similar topic or project, or who have appropriate expertise or skills.
A valuable addition to email systems, particularly in large corporations: systems that can suggest who the recipients of a message might be while the message is being composed, given its current contents and its previously-specified recipients.

Two Recommendation Tasks
TO+CC+BCC Prediction and CC+BCC Prediction.
Method:
1. Extract features: textual and non-textual.
2. Build a model for (msg, recipients) pairs: train a classifier to detect “true” missing recipients.
3. Rank all addresses in the Address Book according to the classifier’s confidence.

Methods
Large-scale multi-class, multi-label classification tasks: Address Books have hundreds, sometimes thousands, of addresses (classes).
One-vs-all training is too expensive, even for users with a small collection of messages.
Using Information Retrieval techniques as baselines: Rocchio (TfIdf centroid) and KNN.
Enron Dataset, with preprocessing steps similar to the Leak Problem.

Using Network Features
1. Frequency features
– Number of received messages (from this user)
– Number of sent messages (to this user)
– Number of sent+received messages
2. Co-occurrence features (CC+BCC prediction only)
– Number of times a user co-occurred with all other recipients. “Co-occur” means two recipients were addressed in the same message in the training set.
3. Recency features
– How frequently a recipient appears in the last 20, 50 and 100 messages.
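A minimal sketch of the recency features, assuming `sent_recipients` holds the recipient lists of the user's sent messages in chronological order (oldest first); names are illustrative:

def recency_features(sent_recipients, user, windows=(20, 50, 100)):
    feats = {}
    for w in windows:
        recent = sent_recipients[-w:]                        # last w sent messages
        feats["recency_%d" % w] = sum(user in recs for recs in recent)
    return feats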

Combining Textual and Network Features
10-fold cross-validation scheme.
Training:
– Use Knn-30 in a 10-fold cross-validation setting to get a “textual score” for each user on all training messages.
– Turn each training example into |AB| binary examples, where |AB| is the number of addresses in the Address Book: J positive (the real recipients) and |AB|-J negative (all other addresses in the Address Book).
– Augment the “textual score” with the network features.
– Train a classifier.
VP5: classification-based ranking scheme (VP5 = Voted Perceptron with 5 passes over the training set).
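A minimal sketch of recipient recommendation as classification-based ranking: every Address Book entry not already listed is scored by the trained classifier and the most confident candidates are suggested. `featurize` and `score_fn` stand in for the feature construction and the VP5 model; both names are illustrative assumptions:

def recommend_recipients(message, given_recipients, address_book,
                         featurize, score_fn, top_k=5):
    scored = []
    for candidate in address_book:
        if candidate in given_recipients:
            continue                                  # skip already-specified recipients
        score = score_fn(featurize(message, candidate, given_recipients))
        scored.append((score, candidate))
    scored.sort(reverse=True)                         # highest classifier confidence first
    return [cand for _, cand in scored[:top_k]]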

Results: TO+CC+BCC Prediction

Avg. Recall vs Rank Curves

Overall Results

Related Work
Privacy Enforcement Systems
– Boufaden et al. (CEAS-2005) used information extraction techniques and domain knowledge to detect privacy breaches via email in a university environment. Breaches: student names, student grades and student IDs.
CC Prediction
– Pal & McCallum (CEAS-06): the counterpart problem, predicting the most likely intended recipients of an email message. A single user, limited evaluation, not public data.
Expert Finding in Email
– Dom et al. (SIGMOD-03), Campbell et al. (CIKM-03)
– Balog & de Rijke (WWW-06), Balog et al. (SIGIR-06)
– Soboroff, Craswell, de Vries (TREC-Enterprise …): expert finding task on the W3C corpus.

Conclusions
Submitted to KDD-07.
The Recipient Prediction task can be seen as the negative counterpart of the Leak Prediction task: in the former we want to find the intended recipients of messages, whereas in the latter we want to find the unintended recipients, or email leaks.
A desirable addition to email systems, helping to avoid misunderstandings and communication delays.
Efficient, easy to implement and integrate, particularly in systems where traditional search is already available.