1
Ranking Users for Intelligent Message Addressing
Vitor R. Carvalho and William Cohen, Carnegie Mellon University. Glasgow, April 2nd, 2008.
2
Outline
Intelligent Message Addressing
Models
Data & Experiments
Auto-completion
Mozilla Thunderbird Extension*
Learning to Rank Results*
4
Demo: recipient suggestions while composing a message.
Ramesh Nallapati <ramesh@cs.cmu.edu> [Add]
William Cohen [Add] Akiko Matsui [Add] Yifen Huang [Add]
6
Ramesh Nallapati <ramesh@cs.cmu.edu> [Add]
Akiko Matsui [Add] Yifen Huang [Add]
7
einat <einat@cs.cmu.edu> [Add]
Ramesh Nallapati [Add] Jon Elsas [Add] Andrew Arnold [Add]
9
Ramesh Nallapati <ramesh@cs.cmu.edu> [Add]
Jon Elsas [Add] Andrew Arnold [Add]
10
Tom Mitchell <tom@cs.cmu.edu> [Add]
Andrew Arnold [Add] Jon Elsas [Add] Frank Lin [Add]
12
The Task: Intelligent Message Addressing
Predicting likely recipients of a message given:
(1) the contents of the message being composed;
(2) other recipients already specified;
(3) a few initial letters of the intended recipient's contact (intelligent auto-completion).
13
What for?
Identifying people related to specific topics (or with specific relevant skills)
Relation to Expert Finding [Dom et al., 03; Campbell et al., 03]: message ↔ (long) query, addresses ↔ experts
Improved address auto-completion
Preventing high-cost management errors: people simply forget to add important recipients, particularly in large corporations, leading to costly misunderstandings, communication delays, and missed opportunities.
14
How Frequent Are These Errors?
Grep for “forgot”, “sorry” or “accident” in the Enron corpus, half a million real messages from a large corporation:
“Sorry, I forgot to CC you his final offer”
“Oops, I forgot to send it to Vince.”
“Adding John to the discussion…..(sorry John)”
“Sorry....missed your name on the cc: list!”
More frequent than expected: at least 9.27% of the users forgot to add a desired recipient, and at least 20.52% of the users were left off the recipient list of at least one message in which they were intended recipients. Both figures are lower bounds.
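As a rough illustration, here is a minimal Python sketch of that keyword search. The message format and the exact pattern are assumptions (the slide only names the three keywords), and matches still need manual inspection, since "sorry" alone is noisy.

```python
import re

# Cues suggestive of a forgotten recipient, per the slide's keywords.
LEAK_PATTERN = re.compile(r"\b(forgot|sorry|accident)\w*\b", re.IGNORECASE)

def find_addressing_errors(messages):
    """Return the messages whose body mentions a forgotten-recipient cue.

    `messages` is assumed to be an iterable of (msg_id, body) pairs.
    """
    return [(msg_id, body) for msg_id, body in messages
            if LEAK_PATTERN.search(body)]

# Example:
hits = find_addressing_errors([
    ("m1", "Oops, I forgot to send it to Vince."),
    ("m2", "Please find the report attached."),
])
print(hits)  # -> only m1 matches
```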
15
Two Ranking Tasks
TO+CC+BCC prediction: rank all candidate recipients of a message, given only its contents.
CC+BCC prediction: rank additional recipients (CC/BCC), given the message contents and the recipients already specified.
16
Models
Non-textual models: Frequency only; Recency only
Expert Finding models [Balog et al., 2006]: M1 Candidate Model, M2 Document Model
Rocchio (TFIDF)
K-Nearest Neighbors (KNN)
Rank aggregation of the above
17
Non-Textual Models
Frequency model: rank candidates by the total number of messages addressed to them in the training set.
Recency model: the same counts, but with an exponential decay over the chronologically ordered messages, so recent messages weigh more.
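A minimal sketch of both baselines, assuming sent messages are available as a chronologically ordered list of recipient lists. The exact decay used in the talk is not shown on the slide; this sketch assumes the j-th most recent message contributes exp(-j / beta).

```python
import math
from collections import defaultdict

def frequency_scores(sent_messages):
    """Score candidates by how many past messages were addressed to them."""
    scores = defaultdict(float)
    for recipients in sent_messages:
        for addr in recipients:
            scores[addr] += 1.0
    return scores

def recency_scores(sent_messages, beta=100.0):
    """Frequency with exponential decay; `sent_messages` is oldest-first."""
    scores = defaultdict(float)
    n = len(sent_messages)
    for i, recipients in enumerate(sent_messages):
        weight = math.exp(-(n - 1 - i) / beta)  # most recent message: j = 0
        for addr in recipients:
            scores[addr] += weight
    return scores
```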
18
Expert Search Models
M1: Candidate Model [Balog et al., 2006]
M2: Document Model [Balog et al., 2006]
The document-candidate association f(doc, ca) is estimated as user-centric (UC) or document-centric (DC).
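The formulas for M1 and M2 did not survive the transcript; below is a reconstruction following the standard formulations in Balog et al. (2006), with smoothing omitted, q the message text, ca a candidate recipient, n(t,q) the count of term t in q, and f(d,ca) the document-candidate association from the slide.

```latex
% Reconstruction of the standard Balog et al. (2006) models;
% smoothing terms omitted for brevity.
\begin{align*}
\text{M1 (candidate model):}\quad
p(q \mid ca) &= \prod_{t \in q}
  \Big( \sum_{d} p(t \mid \theta_d)\, f(d, ca) \Big)^{n(t,q)} \\
\text{M2 (document model):}\quad
p(q \mid ca) &= \sum_{d} \Big( \prod_{t \in q}
  p(t \mid \theta_d)^{n(t,q)} \Big)\, f(d, ca)
\end{align*}
```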
19
Other Models Rocchio (TFIDF) [Joachims, 1997; Salton & Buckley, 1988]
K-Nearest Neighbors [Yang & Liu, 1999]
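A sketch of the KNN scheme of Yang & Liu (1999) adapted to recipient ranking: find the k training messages most similar to the message being composed, then credit each neighbor's recipients with that neighbor's similarity. The names `train_messages` and `similarity` are illustrative, not from the talk.

```python
from collections import defaultdict

def knn_recipient_scores(query_vec, train_messages, similarity, k=30):
    """Score candidate recipients from the k nearest training messages.

    `train_messages` is assumed to be a list of (tfidf_vec, recipients)
    pairs; `similarity` is a function such as cosine similarity.
    """
    neighbors = sorted(train_messages,
                       key=lambda m: similarity(query_vec, m[0]),
                       reverse=True)[:k]
    scores = defaultdict(float)
    for vec, recipients in neighbors:
        sim = similarity(query_vec, vec)
        for addr in recipients:
            scores[addr] += sim
    return scores
```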
20
Model Parameters
Chosen from preliminary tests:
Recency: β ∈ {10, 20, 50, 100, 200, 500}
KNN: K ∈ {3, 5, 10, 20, 30, 40, 50, 100}
Rocchio: β ∈ {0, 0.1, 0.25, 0.5}
21
Data: Enron Email Collection
Some good reasons:
Large: half a million messages
Natural work-related email, not mailing lists
Public and free
Different roles: managers, assistants, etc.
Unfortunately:
No clear message thread information
No complete Address Book information: no first/last/full names for many recipients
22
Enron Data Preprocessing
Set up a realistic temporal split (per user): for each user, the 10% most recent sent messages are used as the test set.
36 users; all users had their Address Books (AB) extracted.
Both tasks evaluated: TO+CC+BCC and CC+BCC.
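A minimal sketch of that per-user split, assuming each message is a dict with a sortable "date" field (the field name is illustrative):

```python
def temporal_split(sent_messages, test_frac=0.10):
    """Chronological split: the most recent `test_frac` of a user's sent
    messages become the test set, the rest the training set."""
    msgs = sorted(sent_messages, key=lambda m: m["date"])
    cut = int(round(len(msgs) * (1.0 - test_frac)))
    return msgs[:cut], msgs[cut:]  # (train, test)
```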
23
Enron Data Preprocessing
Bag-of-words representation: messages were represented as the union of the BOW of the body and the BOW of the subject.
Removed inconsistencies and repeated messages.
Disambiguated several Enron addresses.
Stop words removed; no stemming.
Self-addressed messages were removed.
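A sketch of that representation, assuming whitespace tokenization (the talk's exact tokenizer is not specified):

```python
from collections import Counter

def message_bow(subject, body, stop_words=frozenset()):
    """Union of the subject's and body's bag-of-words, as on the slide:
    lowercase, stop words removed, no stemming."""
    tokens = (subject + " " + body).lower().split()
    return Counter(t for t in tokens if t not in stop_words)
```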
24
Threading
No explicit thread information in Enron, so we try to reconstruct it: build a “Message Thread Set” MTS(msg), the set of messages with the same subject as the current one.
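A minimal sketch of MTS construction. The talk only says "same subject"; stripping reply/forward prefixes is an assumption added here so that "Re: budget" matches "budget".

```python
def normalize_subject(subject):
    """Strip common reply/forward prefixes before comparing subjects."""
    s = subject.strip().lower()
    for prefix in ("re:", "fw:", "fwd:"):
        while s.startswith(prefix):
            s = s[len(prefix):].strip()
    return s

def message_thread_set(msg_subject, earlier_messages):
    """Return MTS(msg): earlier messages sharing the normalized subject.

    `earlier_messages` is assumed to be a list of dicts with a "subject" key.
    """
    key = normalize_subject(msg_subject)
    return [m for m in earlier_messages
            if normalize_subject(m["subject"]) == key]
```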
25
Results
26
Results
27
Results
28
Rank Aggregation
Rankings from the base models are combined by Reciprocal Rank: a candidate's aggregate score is the sum, over the base rankings, of 1/rank(ca).
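A sketch of the reciprocal-rank fusion the slide names, assuming each base model produces a best-first list of candidate addresses:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings):
    """Combine ranked candidate lists by summed reciprocal rank.

    A candidate absent from a base ranking contributes 0 for that ranking.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, addr in enumerate(ranking, start=1):
            scores[addr] += 1.0 / rank
    return sorted(scores, key=scores.get, reverse=True)

# Example: two base models that disagree on the top candidate.
print(reciprocal_rank_fusion([
    ["vince@enron.com", "john@enron.com"],
    ["john@enron.com", "vince@enron.com", "kay@enron.com"],
]))
```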
29
Rank Aggregation Results
30
Observations
‘Threading’ improves MAP for all models.
KNN seems to be the best choice overall: a document-based model that focuses on a few top documents.
The data-fusion method for rank aggregation improved performance significantly, since the base systems make different types of mistakes.
31
Intelligent Email Auto-completion
[Figures: auto-completion results for the TO+CC+BCC and CC+BCC tasks]
32
Intelligent Email Auto-completion
33
Mozilla Thunderbird extension (Cut Once)
Suggestions: Click to add
34
Mozilla Thunderbird extension (Cut Once)
Interested? Just google “mozilla extension carnegie mellon”.
User study using Cut Once: it encourages write-then-address behavior instead of addressing first.
35
Can we do better ranking?
Learning to Rank: machine learning to improve ranking with a feature-based ranking function. Many recently proposed methods:
RankSVM [Joachims, KDD-02]
ListNet [Cao et al., ICML-07]
RankBoost [Freund et al., 2003]
Perceptron variations: online, scalable [Elsas, Carvalho & Carbonell, WSDM-08]
36
Learning to Rank Recipients
Ranking scores as features: combine the textual score with other “network” features.
Textual feature: KNN score.
Network features: frequency score, recency score, co-occurrence features.
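A sketch of this setup: each candidate gets a feature vector of base-model scores, and a pairwise ranking perceptron learns weights over them. This is a generic online pairwise perceptron under the slide's feature list, not the exact variant used in the talk; all names are illustrative.

```python
import random

def feature_vector(candidate, knn, freq, rec, cooc):
    """Per-candidate features: one textual score (KNN) plus the
    'network' scores (frequency, recency, co-occurrence)."""
    return [knn.get(candidate, 0.0), freq.get(candidate, 0.0),
            rec.get(candidate, 0.0), cooc.get(candidate, 0.0)]

def train_ranking_perceptron(pairs, dim=4, epochs=10, lr=1.0):
    """Pairwise perceptron: for each (true-recipient, non-recipient)
    feature-vector pair, nudge w whenever the true recipient is not
    scored strictly higher."""
    w = [0.0] * dim
    for _ in range(epochs):
        random.shuffle(pairs)
        for pos, neg in pairs:
            if sum(wi * (p - n) for wi, p, n in zip(w, pos, neg)) <= 0:
                for i in range(dim):
                    w[i] += lr * (pos[i] - neg[i])
    return w
```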
37
Learning to Rank Recipients: Results
38
Conclusions
Problem: predicting recipients of email messages.
Useful for auto-completion, finding related people, and preventing costly addressing errors.
Evidence of such errors from a large collection.
Two subtasks: TO+CC+BCC and CC+BCC.
Various models; KNN best model in general.
Rank aggregation improved performance.
Improvements in auto-completion.
Thunderbird extension (Cut Once)*
Promising results on learning to rank recipients*
39
Thank you
41
Comments (Thanks, reviewers!)
No account for structural info (body ≠ subject ≠ quoted text).
Identifying named entities (“Dear Mr. X”, etc.): implicitly doing this, but could be better; Enron did not provide many first/last names.
Fair estimation of f(doc, ca)? Might explain the weaker performance of the M2 models.