Machine Learning for Personal Information Management. William W. Cohen, Machine Learning Department and Language Technologies Institute, School of Computer Science, Carnegie Mellon University. With Vitor Carvalho, Einat Minkov, Tom Mitchell, Andrew Ng (Stanford), and Ramnath Balasubramanyan.

ML for Email. Starting point: Ishmail, an emacs RMAIL extension written by Charles Isbell in summer '95 (largely for Ron Brachman). One could manually write mailbox definitions and filtering rules in Lisp. [Cohen, AAAI Spring Symposium on ML and IR 1996]

Foldering tasks: rule-learning method [Cohen, ICML 95] vs. Rocchio/TF-IDF [Rocchio, 71].

Machine Learning in Email. Why study learning for email? 1. Email has more visible impact than anything else you do with computers. 2. Email is hard to manage: people get overwhelmed; people lose important information in email archives; people make horrible mistakes.

Machine Learning in Email. Why study learning for email? For which tasks can learning help?
– Foldering: search, don't sort!
– Spam filtering: important and well-studied
– Email search: beyond keyword search
– Recognizing errors: "Oops, did I just hit reply-to-all?"
– Help for tracking tasks: "Dropping the ball"

Learning to Search Email [SIGIR 2006, CEAS 2006, WebKDD/SNA 2007]. (Diagram: an email message shown as a graph, with nodes such as "proposal", "CMU", "CALO", "graph", "William", and the dates 6/17/07 and 6/18/07, linked by edges like Sent-To and Term-In-Subject.)

Email as a Graph: a closed set of 5 node types; a closed set of 18 edge types (including inverse edges).

Basic idea: learning to search email is learning to query a graph for information. (Example diagram for the query "what are Jason's aliases?": the term "Jason" links through messages Msg 2, Msg 5, and Msg 18 via edges such as Has-Term (inverse), Sent-To, Sent-From, Address-Of, and Similar-To to Jason Ernst's addresses at andrew.cmu.edu and cs.cmu.edu.)

How do you pose queries to a graph? An extended similarity measure via graph walks:
– Propagate "similarity" from start nodes through edges in the graph, accumulating evidence of similarity over multiple connecting paths.
– Fixed probability of halting the walk at every step, i.e., shorter connecting paths have greater importance (exponential decay).
– In practice we can approximate this with a short finite graph walk, implemented with sparse matrix multiplication.
– The result is a list of nodes, sorted by "similarity" to an input node distribution (final node probabilities).
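Not from the slides: a minimal sketch of the truncated lazy walk just described, assuming the graph has already been compiled into a row-stochastic sparse transition matrix (edge-type weights folded in). The function name and default parameters are illustrative, not the original system's.

```python
import numpy as np
from scipy import sparse


def lazy_walk_scores(M, start, gamma=0.5, k_steps=6):
    """Approximate graph-walk similarity to a start distribution.

    M       : n x n row-stochastic sparse matrix; M[i, j] is the probability
              of stepping from node i to node j (edge-type weights folded in).
    start   : length-n start distribution over the query nodes V_q.
    gamma   : probability of halting at each step, so shorter connecting
              paths get exponentially more weight.
    k_steps : where to truncate the walk (the "short finite walk" above).
    """
    v = np.asarray(start, dtype=float)
    v = v / v.sum()
    scores = np.zeros_like(v)
    stay = 1.0
    for _ in range(k_steps):
        v = M.T @ v                 # one walk step: v_{t+1} = v_t M
        stay *= (1.0 - gamma)
        scores += gamma * stay * v  # probability mass that halts right here
    return scores                   # higher score = more "similar" to V_q


# Toy usage: a 3-node chain 0 -> 1 -> 2 with a self-loop on node 2.
M = sparse.csr_matrix(np.array([[0.0, 1.0, 0.0],
                                [0.0, 0.0, 1.0],
                                [0.0, 0.0, 1.0]]))
print(lazy_walk_scores(M, start=[1.0, 0.0, 0.0]))
```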

Email, contacts, etc. as a graph: graph nodes are typed; edges are directed and typed; multiple edges may connect two given nodes. Every edge type is assigned a fixed weight, which determines the probability of it being followed in a walk (e.g., uniform). A query language: a query specifies a set of start nodes V_q (the query "terms") and a desired node type, and returns a list of nodes of that type ranked by the graph-walk probabilities. Related: random walk with restart, graph kernels, heat diffusion kernels, diffusion processes, Laplacian regularization, graph databases (BANKS, DbExplorer, …), graph mincut, associative Markov networks, …

Tasks that are like similarity queries:
– Person name disambiguation: [term "andy", file msgId] → "person"
– Threading: what are the adjacent messages in this thread? A proxy for finding "more messages like this one": [file msgId] → "file"
– Alias finding: what are the email addresses of Jason? [term "Jason"] → "email-address"
– Meeting attendees finder: which email addresses (persons) should I notify about this meeting? [meeting mtgId] → "email-address"
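A sketch (continuing the previous one, and calling its `lazy_walk_scores`) of how these tasks reduce to one query interface: a set of start nodes plus a desired output type. The lookup tables `node_index` and `node_type`, and the node-name strings in the comments, are hypothetical stand-ins for however the email graph is actually stored.

```python
import numpy as np


def graph_query(M, node_index, node_type, start_nodes, target_type,
                gamma=0.5, k_steps=6, top_k=10):
    """Run the lazy walk from start_nodes, keep only nodes of target_type,
    and return them ranked by walk probability (uses lazy_walk_scores above)."""
    start = np.zeros(M.shape[0])
    for name in start_nodes:
        start[node_index[name]] = 1.0
    scores = lazy_walk_scores(M, start, gamma=gamma, k_steps=k_steps)
    candidates = [n for n in node_index if node_type[n] == target_type]
    candidates.sort(key=lambda n: scores[node_index[n]], reverse=True)
    return candidates[:top_k]


# The four tasks above, phrased as queries (names are made up):
# person    = graph_query(M, idx, typ, ["term:andy", "file:msg123"], "person")
# thread    = graph_query(M, idx, typ, ["file:msg123"], "file")
# aliases   = graph_query(M, idx, typ, ["term:Jason"], "email-address")
# attendees = graph_query(M, idx, typ, ["meeting:mtg42"], "email-address")
```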

Learning to search better. For a task T (query class), each training query a, b, …, q is run through the graph walk to produce a ranked node list (rank 1, 2, 3, …, 50), paired with its relevant answers. Standard set of features used for a candidate node x on each problem: edge n-grams in all paths from V_q to x; number of reachable source nodes; features of the top-ranking paths (e.g., edge bigrams).
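A rough sketch of that kind of feature vector, assuming each candidate node x comes with its set of connecting paths from the walk, represented here as (source node, edge-label sequence) pairs; the real system's path bookkeeping differs.

```python
from collections import Counter


def path_features(paths_to_x, n=2):
    """Build features for one candidate node x.

    paths_to_x : iterable of (source_node, edge_labels) pairs, one per
                 connecting path from the query nodes V_q to x,
                 e.g. ("term:Jason", ["has-term-inv", "sent-from"]).
    """
    feats = Counter()
    sources = set()
    for source, labels in paths_to_x:
        sources.add(source)
        for i in range(len(labels) - n + 1):           # edge-label n-grams
            feats["edges:" + ">".join(labels[i:i + n])] += 1
    feats["n_reachable_sources"] = len(sources)        # reachable source nodes
    return feats


print(path_features([("term:Jason", ["has-term-inv", "sent-from"]),
                     ("term:Jason", ["has-term-inv", "sent-to", "alias"])]))
```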

Learning approach: node re-ordering. Train task: graph walk → feature generation → learn re-ranker → re-ranking function. Test task: graph walk → feature generation → score by the re-ranking function. Re-rankers: Boosting [Collins & Koo, CL 2005; Collins, ACL 2002]; Voted Perceptron; RankSVM; Perceptron Committees; … [Joachims, KDD 2002; Elsas et al, WSDM 2008].
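A minimal sketch of a Collins-style reranking perceptron; the voted variant cited above is approximated here by weight averaging, and the other re-rankers (boosting, RankSVM, committees) are out of scope. Each training query supplies the walk's candidate list as a feature matrix plus relevance flags.

```python
import numpy as np


def train_reranker(queries, n_features, epochs=10):
    """queries: list of (X, relevant) pairs, where X is a (candidates x
    n_features) array of candidate-node features from the graph walk and
    relevant is a boolean array marking the correct answers."""
    w = np.zeros(n_features)
    w_sum = np.zeros(n_features)
    for _ in range(epochs):
        for X, relevant in queries:
            rel_idx = np.flatnonzero(relevant)
            if rel_idx.size == 0:
                continue
            pred = int(np.argmax(X @ w))        # current top-ranked candidate
            if not relevant[pred]:              # mistake: promote best relevant
                best_rel = rel_idx[np.argmax(X[rel_idx] @ w)]
                w += X[best_rel] - X[pred]
            w_sum += w
    return w_sum / max(1, epochs * len(queries))  # averaged ("voted") weights


def rerank(X, w):
    """Reorder the walk's candidates by the learned scoring function."""
    return np.argsort(-(X @ w))
```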

Tasks that are like similarity queries (recap): person name disambiguation, threading, alias finding, meeting attendees finding. First: PERSON NAME DISAMBIGUATION ([term "andy", file msgId] → "person").

PERSON NAME DISAMBIGUATION: corpora and datasets. Person names include nicknames (Dave for David, Kai for Keiko, Jenny for Qing), and common names are ambiguous.

CSpace email, collected at CMU: 15,000+ emails from a semester-long management course; students formed groups that acted as "companies" and worked together; dozens of groups, with some known social connections (e.g., "president").

PERSON NAME DISAMBIGUATION: results on the Mgmt. Game corpus, and results on all three problems (Mgmt. Game, Enron: Sager-E, Enron: Shapiro-R). (Results charts omitted.)

Tasks (recap). Next: THREADING. What are the adjacent messages in this thread? A proxy for finding "more messages like this one": [file msgId] → "file".

Threading: results (MAP on the Mgmt. Game and Enron: Farmer corpora), comparing graph configurations that use header & body only; header & body + subject; and header & body + subject + reply lines. (Charts omitted.)

Learning approaches.
– Edge weight tuning: iterate graph walk → weight update to learn edge-type weights Θ* for the task [Diligenti et al, IJCAI 2005; Toutanova & Ng, ICML 2005; …].
– Node re-ordering: graph walk → feature generation → learn re-ranker (Boosting; Voted Perceptron); at test time, graph walk → feature generation → score by the re-ranking function.
Question: which is better?

Results (MAP) on name disambiguation, threading, and alias finding (charts omitted). Reranking and edge-weight tuning are complementary; the best result is usually to tune weights and then rerank; reranking overfits on small datasets (meetings).

Machine Learning in Email. Why study learning for email? For which tasks can learning help?
– Foldering
– Spam filtering
– Email search beyond keyword search
– Recognizing errors: "Oops, did I just hit reply-to-all?"
– Help for tracking tasks: "Dropping the ball"

Preventing email errors [SDM 2007]. Leak: an email accidentally sent to the wrong person. Idea:
– Goal: detect emails accidentally sent to the wrong person.
– Generate artificial leaks: leaks may be simulated by various criteria: a typo, similar last names, identical first names, aggressive auto-completion of addresses, etc.
– Method: look for outliers.

Preventing Email Leaks: method.
– Create simulated/artificial recipients.
– Build a model of P(rec_t), the probability that recipient t is an outlier given the message text and the other recipients on the message: train a classifier on real data to detect synthetically created outliers (added to the true recipient list). Features: textual (subject, body) and network features (frequencies, co-occurrences, etc.).
– Rank potential recipients from most likely outlier to least likely outlier, and warn the user based on the classifier's confidence.
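A condensed sketch of that training loop, with logistic regression standing in for whatever classifier is actually used in the paper; `featurize` and `simulate_leak` are assumed callables covering the textual/network features and the leak simulation described on the next slides.

```python
from sklearn.linear_model import LogisticRegression


def train_leak_detector(train_msgs, featurize, simulate_leak):
    """train_msgs: real sent messages, each with a .recipients list.
    For every message, one synthetic outlier is added to the true recipient
    list and labelled positive; the classifier learns P(recipient is a leak)."""
    X, y = [], []
    for msg in train_msgs:
        leak = simulate_leak(msg)
        for rcpt in list(msg.recipients) + [leak]:
            X.append(featurize(msg, rcpt))
            y.append(1 if rcpt == leak else 0)
    return LogisticRegression(max_iter=1000).fit(X, y)


def rank_outliers(clf, msg, featurize):
    """Rank the message's recipients from most- to least-likely leak."""
    probs = clf.predict_proba([featurize(msg, r) for r in msg.recipients])[:, 1]
    return sorted(zip(msg.recipients, probs), key=lambda pair: -pair[1])
```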

Enron data preprocessing (1): a realistic scenario.
– For each user, the 10% most recent sent messages are used as the test set.
– Construct address books for all users: the list of all recipients in that user's sent messages.

Simulating leaks. Several options: frequent typos, same/similar last names, identical/similar first names, aggressive auto-completion of addresses, etc. In this paper we adopted the 3g-address criterion: on each trial, one of the message's recipients is randomly chosen and an outlier is generated as follows: with probability α, a random non-address-book address; otherwise, a randomly selected address-book entry from the chosen recipient's 3g-address list (entries sharing a character 3-gram with that recipient's address).
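A sketch of one way to generate such an outlier, under my reading of the 3g-address criterion above (an address-book entry sharing a character 3-gram with the true recipient); `random_external_address` is a hypothetical generator for the non-address-book case.

```python
import random


def char_3grams(address):
    return {address[i:i + 3] for i in range(len(address) - 2)}


def simulate_leak(true_recipient, address_book, alpha=0.0,
                  random_external_address=None):
    """With probability alpha return a random non-address-book address;
    otherwise return a random address-book entry from the true recipient's
    3g-address list (entries sharing at least one character 3-gram)."""
    if random_external_address is not None and random.random() < alpha:
        return random_external_address()
    threegram_list = [a for a in address_book
                      if a != true_recipient
                      and char_3grams(a) & char_3grams(true_recipient)]
    return random.choice(threegram_list or list(address_book))
```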

Enron data preprocessing (2).
– ISI version of Enron: repeated messages and inconsistencies removed.
– Main Enron addresses disambiguated using the list provided by Corrada-Emmanuel (UMass).
– Bag-of-words: messages were represented as the union of the BOW of the body and the BOW of the subject; some stop words removed.
– Self-addressed messages were removed.

Experiments using textual features only. Three baseline methods:
– Random: rank recipient addresses randomly.
– Rocchio/TF-IDF centroid [Rocchio, 71]: create a "TF-IDF centroid" for each user in the address book; a user1 centroid is the sum of all training messages (in TF-IDF vector form) that were addressed to user1; for testing, rank by cosine similarity between the test message and each centroid.
– Knn-30 [Yang and Chute, SIGIR 94]: given a test message, find the 30 most similar messages in the training set; rank each user by the sum of similarities over that 30-message set.
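A sketch of the Rocchio/TF-IDF centroid baseline using scikit-learn; training messages are assumed to arrive as (text, recipient-list) pairs, which simplifies the actual preprocessing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_centroids(train_msgs):
    """train_msgs: list of (text, recipients). One TF-IDF centroid per user,
    summed over the training messages addressed to that user."""
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform([text for text, _ in train_msgs])
    centroids = {}
    for i, (_, recipients) in enumerate(train_msgs):
        for user in recipients:
            centroids[user] = centroids[user] + X[i] if user in centroids else X[i]
    return vectorizer, centroids


def rank_by_centroid(vectorizer, centroids, test_text):
    """Rank address-book users by cosine similarity to the test message."""
    q = vectorizer.transform([test_text])
    scored = [(user, cosine_similarity(q, c)[0, 0]) for user, c in centroids.items()]
    return sorted(scored, key=lambda pair: -pair[1])
```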

Experiments using textual features only: leak-prediction results over 10 trials; on each trial a different set of outliers is generated. (Chart omitted.)

Network features: how frequently a recipient was addressed, and how these recipients co-occurred in the training set.

Using network features.
1. Frequency features: number of received messages (from this user); number of sent messages (to this user); number of sent + received messages.
2. Co-occurrence features: number of times a user co-occurred with all other recipients.
3. Max3g features: for each recipient R, find Rm (the address with the maximum score from the 3g-address list of R), then use score(R) - score(Rm) as a feature.
These are combined with the text-only scores using voted-perceptron reranking, trained on simulated leaks.
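A sketch of how the frequency and co-occurrence tables could be precomputed from the training (sent) messages; received-message counts and the Max3g feature are omitted, and the function names are mine, not the paper's.

```python
from collections import Counter
from itertools import combinations


def network_tables(sent_recipient_lists):
    """sent_recipient_lists: the recipient list of each training (sent) message."""
    sent_count = Counter()   # how often each address was sent a message
    cooccur = Counter()      # how often two addresses appeared together
    for recipients in sent_recipient_lists:
        unique = sorted(set(recipients))
        sent_count.update(unique)
        for a, b in combinations(unique, 2):
            cooccur[(a, b)] += 1
    return sent_count, cooccur


def network_features(candidate, other_recipients, sent_count, cooccur):
    """Frequency and co-occurrence features for one candidate recipient."""
    co = sum(cooccur[tuple(sorted((candidate, other)))]
             for other in other_recipients)
    return {"n_sent_to": sent_count[candidate], "cooccurrence": co}
```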

Precision at rank 1, with α = 0. (Chart omitted.)

Finding real leaks in Enron. How can we find them? Grep for "mistake", "sorry", or "accident"; the message must also be from one of the Enron users. Found 2 good cases: (1) message germany-c/sent/930, which has 20 recipients, one of which is the leak; (2) kitchen-l/sent items/497, which has 44 recipients, one of which is the leak.

Results on the real leaks: kitchen-l has 4 unseen addresses out of the 44 recipients; germany-c has only one. (Results chart omitted.)

The other kind of recipient error: how accurately can you fill in missing recipients, using the message text as evidence? Mean average precision over 36 users, after using thread information. (Chart omitted.) [ECIR 2008]

Current prototype (Thunderbird plug-in; classifier/rankers written in JavaScript). Leak warnings: hit x to remove a recipient. Suggestions: hit + to add one. Pause or cancel sending of the message; by default the message is sent after a 10-second timer.

Machine Learning in Email. Why study learning for email? For which tasks can learning help?
– Foldering
– Spam filtering
– Email search beyond keyword search
– Recognizing errors
– Help for tracking tasks: "Dropping the ball"

Dropping the Ball

(Mock-up: an email containing a request, with an extracted time/date and a check-off task: "get new screen shots for kickoff talk".) [Minkov et al, IJCAI 2005; ACL 2005]

Speech Acts for Email [EMNLP 2004, SIGIR 2005, ACL Acts WS 2006]

Classifying Speech Acts. [Carvalho & Cohen, SIGIR 2005]: relational model including adjacent messages in the thread; pseudo-likelihood/RDN model with an annealing phase. [Carvalho & Cohen, ACL workshop 2006]: IE preprocessing, n-grams, feature extraction, YFLA. ** Ciranda package. Related work: [Dabbish et al, CHI 05; Dredze et al, IUI 06; Khoussainov & Kushmerick, CEAS 2006; Goldstein et al, CEAS 2006; Goldstein & Sabin, HICSS 06].

(Mock-up: for a detected request, add a task "follow up on: 'request for screen shots'", with a reminder a set number of days (e.g., 2) before a due date chosen from extracted time/date expressions: "next Wed" (12/5/07), "end of the week" (11/30/07), "Sunday" (12/2/07), or other.)

(Mock-up: for a detected commitment, add a task "METAL – fairly urgent feedback sought" due by "tomorrow noon" (11/29/07), or another extracted date, with a warning: "Warning! You are making a commitment! Hit cancel to abort!")

Conclusions/Summary. Email is visible and important. There are lots of interesting problems associated with processing email:
– learning to query heterogeneous data graphs;
– modeling patterns of interactions (user ↔ user textual communication; user ↔ user communication frequency, recency, …);
– … to predict likely recipients/non-recipients, correct possible errors, and/or aid the user in tracking requests and commitments.
Email: a perfect ML application.


Bibliography: Our Group
Einat Minkov and William Cohen (2007): Learning to Rank Typed Graph Walks: Local and Global Approaches, in WebKDD/SNA 2007.
Vitor Carvalho, Wen Wu and William Cohen (2007): Discovering Leadership Roles in Workgroups, in CEAS 2007.
Vitor Carvalho and William Cohen (2007): Ranking Users for Intelligent Message Addressing, to appear in ECIR 2008.
Vitor Carvalho and William W. Cohen (2007): Preventing Information Leaks in Email, in SDM 2007.
Einat Minkov and William W. Cohen (2006): An Email and Meeting Assistant using Graph Walks, in CEAS 2006.
Einat Minkov, Andrew Ng and William W. Cohen (2006): Contextual Search and Name Disambiguation in Email using Graphs, in SIGIR 2006.
Vitor Carvalho and William W. Cohen (2006): Improving Speech Act Analysis via N-gram Selection, in the HLT/NAACL ACTS Workshop 2006.
William W. Cohen, Einat Minkov & Anthony Tomasic (2005): Learning to Understand Web Site Update Requests, in IJCAI 2005.
Einat Minkov, Richard C. Wang, and William W. Cohen (2005): Extracting Personal Names from Email: Applying Named Entity Recognition to Informal Text, in EMNLP/HLT 2005.
Vitor Carvalho & William W. Cohen (2005): On the Collective Classification of Speech Acts, in SIGIR 2005.
William W. Cohen, Vitor R. Carvalho & Tom Mitchell (2004): Learning to Classify Email into "Speech Acts", in EMNLP 2004.
Vitor R. Carvalho & William W. Cohen (2004): Learning to Extract Signature and Reply Lines from Email, in CEAS 2004.

Bibliography: Other Cited Papers
M. Collins. Ranking algorithms for named-entity extraction: Boosting and the voted perceptron. In ACL, 2002.
M. Collins and T. Koo. Discriminative reranking for natural language parsing. Computational Linguistics, 31(1):25–69, 2005.
M. Diligenti, M. Gori, and M. Maggini. Learning web page scores by error back-propagation. In IJCAI, 2005.
T. Joachims. Optimizing Search Engines Using Clickthrough Data. In Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), 2002.
Jonathan Elsas, Vitor R. Carvalho and Jaime Carbonell. Fast Learning of Document Ranking Functions with the Committee Perceptron. In WSDM 2008 (ACM International Conference on Web Search and Data Mining).
Y. Yang and C. G. Chute. An example-based mapping method for text classification and retrieval. ACM Transactions on Information Systems, 12(3), 1994.
L. A. Dabbish, R. E. Kraut, S. Fussell, and S. Kiesler. Understanding email use: predicting action on a message. In CHI '05: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2005, pp. 691–700.

Bibliography: Other Cited Papers (continued)
M. Dredze, T. Lau, and N. Kushmerick. Automatically classifying emails into activities. In IUI '06: Proceedings of the 11th International Conference on Intelligent User Interfaces, 2006, pp. 70–77.
D. Feng, E. Shaw, J. Kim, and E. Hovy. Learning to detect conversation focus of threaded discussions. In Proceedings of HLT/NAACL 2006 (Human Language Technology Conference / North American Chapter of the Association for Computational Linguistics), New York City, NY, 2006.
J. Goldstein, A. Kwasinski, P. Kingsbury, R. E. Sabin, and A. McDowell. Annotating subsets of the Enron email corpus. In Conference on Email and Anti-Spam (CEAS 2006), 2006.
J. Goldstein and R. E. Sabin. Using speech acts to categorize email and identify email genres. In Proceedings of the 39th Annual Hawaii International Conference on System Sciences (HICSS '06), vol. 3, p. 50b, 2006.
R. Khoussainov and N. Kushmerick. Email task management: An iterative relational learning approach. In Conference on Email and Anti-Spam (CEAS 2005), 2005.
David Allen. Getting Things Done: The Art of Stress-Free Productivity. Penguin Books, 2001.