1
Contextual Search and Name Disambiguation in Email using Graphs
Einat Minkov, William W. Cohen, Andrew Y. Ng. SIGIR 2006.
2
Outline
- Extended similarity measure using graph walks
- Instantiation for email
- Learning
- Evaluation: person name disambiguation; threading
- Summary and future directions

In this talk, I will first describe our framework for representing structured data as a graph, and how we derive extended similarity measures between objects in the graph using graph walks. I will give a detailed instantiation of this framework for email. Then I'll describe how we incorporate learning into this framework to improve performance, give evaluation results for two email-related tasks, and conclude with related work and future directions.
3
Object Similarity
- Textual similarity measures model document-document (or query-document) similarity.
- However, in structured data, documents are not isolated.
- We are interested in extending text-based similarity measures to complex structured settings: represent structured data as a graph, and derive object similarity using lazy graph walks.
- We instantiate this framework for email (a special case of structured data).

In traditional IR, inter-document similarity and query-document similarity are modeled in terms of text. However, in structured data, documents are not isolated objects: they may be connected to other documents through hyperlinks or metadata. Email is one example of structured data, and so is the evolving semantic web. In order to use the information embedded in the data structure, we need to extend plain text-based similarity measures to more general settings. In this work we suggest representing such data as a graph that includes entities and the relations between them. We then derive object similarity measures in this framework using finite lazy graph walks.
4
Email as a Graph

[Slide diagram: an example message from chris.germany@enron.com represented as a graph. Visible labels include a file node (file 1), person nodes (Chris Germany, Melissa), email-address nodes (chris.germany@enron.com, an address at ch2m.com), term nodes such as "work", "where", "you", a date node, and edge labels alias, sent_from, sent_to, has_term, has_subj_term, on_date.]

Email is a natural example of structured data: in addition to its textual content, an email contains a header with structured information. How do we represent email as a graph? Given this example message, we first represent the message as an email-file node. Then person names can be extracted from the header, as well as their email addresses; both are represented as nodes, and we draw an edge labeled alias to represent the relation between them. We also draw edges from the file node to the sender and recipient, with labels sent_from and sent_to, and similarly to the email-address nodes. Every term of the content is represented as a node, and we differentiate between has_term and has_subject_term relations. Finally, we represent the time signature of the email as a date node, with a relation on_date.
5
Email as a Graph
- A directed graph.
- A node carries an entity type.
- An edge carries a relation type.
- Edges are bi-directional (the graph is cyclic).
- Nodes inter-connect via linked entities.

In general, we have a directed graph, where the nodes carry an entity type and the edges carry a relation type. Every edge in the graph has an opposite edge going in the other direction, so the graph is definitely cyclic. One thing to note is that, unlike a social network representation, for example, here there are no direct edges between person nodes. Persons are connected to other persons through the file nodes in which they co-occurred, and also through longer paths that include terms and email addresses. In this sense, this is a direct representation of the data.
6
Edge Weights
Graph G:
- nodes x, y, z
- node types T(x), T(y), T(z)
- edge labels l
- parameters Theta (edge-label weights)

Edge weight x -> y is assigned in two steps:
a. Pick an outgoing edge label l from x, given T(x), according to Theta.
b. Pick the target node y uniformly among the nodes reachable from x via an edge of type l.

Probability distribution: the total outgoing probability mass from any node sums to 1.

Some formal details about the framework. The graph nodes are denoted x, y, z, where every node has a type, denoted T. A directed edge from node x to y has a label of some type l, and the graph edge weights are determined by the graph parameters Theta. The edge weight from x to y is assigned in two steps: we first choose the outgoing edge type from x, given x's type, according to Theta. This probability mass is then split uniformly over all target nodes y that are connected to x over this edge type. We require that the total outgoing probability mass from any node equals 1, so it is a probability distribution.
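To make the two-step edge-weight assignment concrete, here is a minimal Python sketch (not the authors' code; the per-node normalization of the label weights and all names are illustrative assumptions):

```python
from collections import defaultdict

def edge_weights(out_edges, theta):
    """Assign outgoing probabilities for one node x.

    out_edges: dict mapping edge label -> list of target nodes reachable from x
    theta:     dict mapping edge label -> non-negative weight (the graph parameters)
    Returns a dict mapping target node y -> Pr(x -> y).
    """
    labels = [l for l in out_edges if out_edges[l]]
    z = sum(theta[l] for l in labels)            # normalize label weights at x
    weights = defaultdict(float)
    for l in labels:
        p_label = theta[l] / z                   # step a: pick an outgoing edge label
        targets = out_edges[l]
        for y in targets:
            weights[y] += p_label / len(targets) # step b: split uniformly over targets
    return dict(weights)

# Example: a file node with edges sent_from -> {Chris}, has_term -> {work, where, you}
probs = edge_weights(
    {"sent_from": ["Chris"], "has_term": ["work", "where", "you"]},
    {"sent_from": 1.0, "has_term": 1.0},
)
assert abs(sum(probs.values()) - 1.0) < 1e-9     # outgoing mass sums to 1
```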
7
Graph Similarity
Defined by lazy graph walks over k steps. Given:
- a stay probability gamma (larger values favor shorter paths),
- a transition matrix,
- an initial node distribution Vq,
- the walk produces an output node distribution.

In order to derive graph similarity between source nodes and other nodes in the graph, we perform a lazy graph walk over k steps. In a 'lazy' graph walk there is some stay probability gamma: the larger gamma is, the larger the weight of shorter paths. We use a default of 0.5. Given the transition matrix and an initial distribution, we perform a k-step walk by simple matrix multiplication. In every single step, probability mass is propagated to adjacent nodes, and the k-step matrix multiplication sums up the probability contributions to each node in the graph over multiple connecting paths. We use this framework to perform search of related items in the graph: a query is an initial distribution Vq over nodes together with a desired output type Tout. The graph walk returns a ranked list of nodes, filtered by the pre-defined node type and ranked according to the final graph-walk scores.
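Here is a minimal sketch of the k-step lazy walk as repeated matrix multiplication (one common formulation of a lazy walk, not necessarily the paper's exact recurrence; the transition matrix, node ids, and the gamma = 0.5 default are as described above, everything else is assumed):

```python
import numpy as np

def lazy_walk(M, v_q, k, gamma=0.5):
    """k-step lazy graph walk.

    M:     (n x n) row-stochastic transition matrix, M[x, y] = Pr(x -> y)
    v_q:   length-n initial query distribution over nodes
    gamma: stay probability; larger values favor shorter paths
    Returns the final distribution over nodes, used to rank nodes of the output type.
    """
    v = v_q.copy()
    for _ in range(k):
        v = gamma * v + (1.0 - gamma) * (v @ M)   # stay with prob gamma, else take one step
    return v

# Toy example: 3 nodes, query starts at node 0, rank all nodes after a 2-step walk
M = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
v_q = np.array([1.0, 0.0, 0.0])
scores = lazy_walk(M, v_q, k=2)
print(np.argsort(-scores))  # node ids ranked by walk score
```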
8
Relation to IDF
- Reduce the graph to files and terms only.
- One-dimensional search of files, over one step (the query is multiple source term nodes).
- A natural IDF filter: terms occurring in multiple files will 'spread' their probability mass into small fractions over many file nodes.

[Slide diagram: term nodes (term 1 through term 7) linked to file nodes (file 1, file 2, file 3).]

An interesting note about this framework is that it has an inherent IDF property. To demonstrate this, suppose we reduce the graph to term and file nodes only. Then a query can be represented as a uniform distribution over the query term nodes, which propagate probability to the files that include those terms. Terms that occur in multiple files spread their mass thinly over many file nodes, while infrequent terms make a significant contribution to the specific file nodes that contain them. This holds, of course, for all node types in the graph.
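As a small worked example of this IDF-like effect (the numbers are illustrative): with uniform edge weights, a one-step walk passes a query term t a fraction 1/df(t) of its mass to each file containing it. A term that occurs in 100 files contributes only 1/100 = 0.01 to any single file, while a term that occurs in 2 files contributes 1/2 = 0.5, so rare terms dominate the file ranking, much as high-IDF terms dominate a TF-IDF score.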
9
Learning
Learn how to better rank graph nodes for a particular task.
- The parameters Theta can be adjusted using gradient descent methods (Diligenti et al., IJCAI 2005).
- We suggest a re-ranking approach (Collins and Koo, Computational Linguistics, 2005), which can take advantage of 'global' features.
- A training example includes: a ranked list of l_i nodes, each node represented through m features, and at least one known correct node.
- Features will describe the graph-walk paths.

Now we add learning to this framework. The graph-walk results are determined by the graph layout and the graph parameters Theta. An interesting question is how to tune the graph walk for a particular task, so as to direct probability mass to the relevant nodes. A gradient descent algorithm that learns the graph parameters has been suggested for a similar framework with infinite walks, and can probably be adapted to our framework. In this work, however, we suggest another approach: node re-ranking. In this approach, following closely on Collins and Koo, we represent the top-ranked candidate nodes as bags of features and train a linear classifier to re-score them. The advantage of re-ranking is that it can use global features that cannot easily be considered in local weight tuning, so this approach can be viewed as complementary to weight tuning. The features we use describe the graph paths that lead to each node in the ranked list.
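As an illustration of the re-ranking step, here is a minimal sketch (a plain perceptron-style linear ranker stands in for the Collins-and-Koo-style re-ranker actually referenced; the feature matrices are assumed to be given):

```python
import numpy as np

def train_reranker(examples, epochs=20):
    """Perceptron-style training of a linear re-scorer.

    examples: list of (feature_matrix, correct_index) pairs, where feature_matrix
              is (num_candidates x m) and correct_index marks a known correct node.
    Returns a weight vector w used to re-rank candidates by w . f(x).
    """
    m = examples[0][0].shape[1]
    w = np.zeros(m)
    for _ in range(epochs):
        for feats, correct in examples:
            predicted = int(np.argmax(feats @ w))
            if predicted != correct:              # promote the correct node's features
                w += feats[correct] - feats[predicted]
    return w

def rerank(feats, w):
    """Return candidate indices sorted by the learned linear score."""
    return list(np.argsort(-(feats @ w)))
```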
10
Path-describing Features
The full set of paths to a target node within k steps can be recovered.

[Slide diagram: nodes x1 through x5 over walk steps k = 0, 1, 2, illustrating the paths reaching x3. Paths to x3 at k = 2: x2 -> x3; x2 -> x1 -> x3; x4 -> x1 -> x3; x2 -> x2 -> x3.]

- 'Edge unigrams': was edge type l used in reaching x from Vq?
- 'Edge bigrams': were edge types l1 and l2 used (in that order) in reaching x from Vq?
- 'Top edge bigrams': were edge types l1 and l2 used (in that order) in reaching x from Vq, among the top two highest-scoring paths?

In order to describe nodes in terms of the graph walk, we first recover the set of full paths leading to every node. This can probably be done using a linear programming method; here we used a backward breadth-first search that we call 'path unfolding', similar to the common back-propagation procedure in neural networks. The set of paths leading to x3 in this slide over a 2-step walk includes one path of length 1, going from x2 to x3, and three more paths that reach x3 in exactly 2 steps. Given these sets of connecting paths, we define three types of features: edge unigrams, edge bigrams, and top edge bigrams.
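A small sketch of turning the recovered paths into these binary features (illustrative; each path is assumed to be a list of edge labels, sorted by descending path probability so that the first two count as the 'top' paths):

```python
def path_features(paths, top_paths=2):
    """Build edge-unigram / edge-bigram / top-edge-bigram features for one candidate node.

    paths: list of connecting paths, each a list of edge labels, assumed sorted
           by descending path probability.
    Returns a set of binary feature names.
    """
    feats = set()
    for rank, path in enumerate(paths):
        for label in path:
            feats.add(f"unigram:{label}")
        for l1, l2 in zip(path, path[1:]):
            feats.add(f"bigram:{l1}>{l2}")
            if rank < top_paths:
                feats.add(f"top_bigram:{l1}>{l2}")
    return feats

# Example: two paths leading to a candidate person node
print(path_features([["sent_to", "alias"], ["has_term", "sent_from"]]))
```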
11
Outline
- Extended similarity measure using graph walks
- Instantiation for email
- Learning
- Evaluation: person name disambiguation; threading
- Summary and future directions

Getting to the evaluation results: since we had no relevant evaluation corpus, we defined tasks for which it was easy to come up with a set of definite correct answers. The first task we evaluate on is person name disambiguation; we also experimented with threading.
12
Person Name Disambiguation

[Slide diagram: a file node containing the term "andy" ("who is Andy?"), connected to several candidate file and person nodes.]

In the person name disambiguation task we are given a mention of a person name, where it is not trivial to map the mention to the actual person. For example, here Gloria asks Shelley to talk to Andy, who is not mentioned in the header. Resolving name identity is complementary to the task of personal name annotation, as part of automatic email processing, and it might also be an end-user application for people who communicate with a large number of people. Formally, we define the task as follows.

Given: a term that is known to be a personal name and is not mentioned 'as is' in the header (otherwise the task is easy).
Output: ranked person nodes.
13
Corpora and Datasets
Example types: Andy -> Andrew; Kai -> Keiko; Jenny -> Xing.
Two-fold problem:
- Map terms to person nodes (co-occurrence)
- Disambiguation (context)

We created three datasets labeled with the correct person nodes, corresponding to three separate corpora. The first corpus, Mgmt. game, is a collection of emails written by CMU students in a management game project in 1997; there we have team information, and we were able to resolve relatively complicated name mentions. The other two corpora are subsets of the public Enron corpus, using the folders of two employees. The Enron datasets were generated automatically: we identified names that appear both in the content and exclusively in the cc line, and then discarded the cc line. There are several types of name-mention resolution in the datasets. For example, mapping a nickname like Andy to a person named Andrew. In the Mgmt. game corpus we also have international names and nicknames, like Kai referring to Keiko, and some examples of American names adopted by foreign students, like Jenny for Xing. So the problem is two-fold. First, we need to map terms to person nodes; we will show that the graph approach is effective here, as it uses co-occurrence information. In addition, in many cases disambiguation is required: for example, Andy may refer to multiple persons named Andrew in the corpus. We apply disambiguation in the graph walk using the context that is naturally available in email.
14
Methods
1. Baseline: string matching (& common nicknames). Find persons similar to the name term (Jaro measure). Models lexical similarity.
2. Graph walk: Term. Vq = name term node. Models co-occurrence.
3. Graph walk: Term + File. Vq = name term node + file node. Models co-occurrence and ambiguity, but incorporates additional noise.
4. Graph walk: Term + File, re-ranked. Re-rank (3) using path-describing features, a 'source count' feature (do the paths originate from a single source node or from both?), and string similarity.

We used four methods in the experiments. As a baseline, we applied string matching using the Jaro similarity measure; this method ranks all the known person names by their lexical similarity to the name term. String matching is successful in many cases, but it fails when there is little or no similarity between the reference and the formal name, and whenever there are multiple persons with the same name it cannot point out the relevant one. The second method is a graph walk applied for 2 steps, starting from the term node. Unlike string matching, the graph walk models co-occurrence information; however, it still cannot handle ambiguity: when multiple people are referenced the same way, the graph walk assigns higher weight to the person who is most dominant, or central, in the graph. Therefore, in the third method we add context information by starting from both the term node and the file node; the file node in which the term appears serves as naturally available context. Adding the file context solves the disambiguation problem, but the file-node bias also brings in person nodes that are not related to the initial term node. In the last method, we re-rank the top 10 nodes of the previous method using the path-describing features. We also use a feature named 'source count', which indicates whether the paths originate from one or both of the source nodes; we hope to put more weight on nodes that can be reached from both. Finally, we use another feature, not related to the graph, indicating string similarity.
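To make method 3 concrete, here is a tiny sketch of how the query distribution might be formed and the walk output filtered by node type (illustrative assumptions throughout; lazy_walk refers to the sketch shown after the graph-similarity slide):

```python
import numpy as np

def disambiguation_query(term_node, file_node, n_nodes):
    """Vq for the Term + File method: uniform mass over the name-term node and its file node."""
    v_q = np.zeros(n_nodes)
    v_q[term_node] = 0.5
    v_q[file_node] = 0.5
    return v_q

def rank_persons(scores, node_types):
    """Keep only person nodes and rank them by their walk score."""
    persons = [x for x in range(len(scores)) if node_types[x] == "person"]
    return sorted(persons, key=lambda x: -scores[x])

# scores = lazy_walk(M, disambiguation_query(term, file, n), k=2)
# ranked = rank_persons(scores, node_types)
```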
15
Results Mgmt. game This is a plot of the recall level at every rank in the returned list, up to rank 10. This data is for the management game corpus, averaged over 80 examples. Using string matching gives this curve. In about 60% of the instances the correct answer is in the top 5 ranks. The recall at rank 10 is limited to about 70%.
16
Results Mgmt. game With graph walk, starting from term only, performance is better – reaching recall of almost 80% at rank 5, and better recall is achieved overall. This is due to co-occurrence modeling.
17
Results Mgmt. game Graph walk, using both term and file information, gets much better recall - about 90%, and at higher ranks. This is mostly due to disambiguation resolution – since the correct person node in ambiguous cases gets to be ranked first here. However, the performance at rank 1 is somewhat worse here, and that’s because person nodes that are connected only to the file but not to the term nodes get included.
18
Results Mgmt. game Finally, re-ranking the top 10 nodes of the previous method gives this curve – where the recall is almost 100% already at rank 5.
19
Results: Mgmt. Game, Enron: Sager-E, Enron: Shapiro-R

[Slide table: Mean Average Precision and accuracy of the four methods on the Mgmt. game, Enron: Sager-E, and Enron: Shapiro-R datasets.]

These are the full results for the three datasets, given in terms of Mean Average Precision (MAP) and also in terms of accuracy, the percentage of correct answers at rank 1. For the management game corpus, the MAP using contextual search and re-ranking is 82% higher than the baseline, and accuracy is more than twice as good. For the Enron corpora, the re-ranked approach gives a 30-40% improvement in MAP; here the baseline MAP is higher (due to the name mixture), and accuracy is about doubled as well. These are very strong results, and we are quite happy with them.
20
Threading
- There are often irregularities in thread structural information (Lewis and Knowles, 1997).
- Threading can improve message categorization into topical folders (Klimt and Yang, 2004).
- Adjacent messages in a thread can be assumed to be the most similar messages to each other in the corpus: an approximation for finding similar messages in a corpus.

The second task we experimented with is threading. Threading is an interesting task for several reasons. For example, people sometimes start a new thread as a reply to an unrelated message, so thread structural information is often irregular. It has also been pointed out that threads can improve message categorization into folders. We are also interested in this problem because we assume that adjacent messages in a thread are the most similar messages to each other in a corpus; threading can therefore serve as an approximation to the general task of finding related messages. We define the threading task as retrieving the messages adjacent to a given message.

Given: an email file.
Output: ranked file nodes, where the files adjacent in the thread are the correct answers.
21
The Joint Graph

[Slide diagram: starting from a node "file x", two-step paths reach other files via shared content (terms), the social network (persons and email addresses), and the timeline (dates).]

In general, threading should benefit from our framework, since thread messages are related along multiple dimensions: text, social network, and timeline. Here is a sketch that gives some intuition for how the graph walk accumulates such evidence. Starting from some file node, in two steps we can reach files that share term nodes with the source file. We also reach files that are connected to the source file through person or email-address nodes, modeling a shared social network. And if we reach other files via date nodes, we are modeling a timeline relation.
22
Threading: Experiments
1. Baseline: TF-IDF similarity. Consider all the available information (header & body) as text.
2. Graph walk: uniform weights. Vq = file node, 2 steps.
3. Graph walk: random weights. Vq = file node, 2 steps (best out of 10 runs).
4. Graph walk: re-ranked. Re-rank the output of (3) using the graph-describing features.

For the threading experiments, we applied a TF-IDF baseline: for every example file, we generate a list of related files ranked by TF-IDF cosine similarity, where the TF-IDF vector also includes the header information, parsed as terms. We have two versions of graph walks, both over 2 steps: one uses uniform edge weights, and the other uses random weights, selected as the best out of 10 runs. In the fourth experiment, we re-rank the top 50 nodes output by the random-weight graph walk using the graph-describing features.
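A minimal sketch of such a TF-IDF cosine baseline using scikit-learn (illustrative; how the header is parsed into the message text is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_ranking(messages, query_index):
    """Rank all messages by TF-IDF cosine similarity to messages[query_index].

    messages: list of strings, each the concatenated header and body of one email.
    Returns message indices sorted by descending similarity, excluding the query itself.
    """
    tfidf = TfidfVectorizer().fit_transform(messages)
    sims = cosine_similarity(tfidf[query_index], tfidf).ravel()
    ranked = sims.argsort()[::-1]
    return [i for i in ranked if i != query_index]
```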
23
Results

[Slide table: threading MAP for the Mgmt. Game and Enron: Farmer corpora under three settings (header & body + subject + reply lines; header & body + subject; header & body only) for the TF-IDF baseline and the graph-walk variants. Individual cell values are only partially recoverable here (79.8, 73.8, 71.5, 65.7, 65.1, 60.3, 58.4, 50.2, 36.2, 36.1).]

The results of the threading experiments are given in terms of Mean Average Precision, for the Mgmt. game corpus and one of the Enron corpora. We made an additional split in the experimental settings, from left to right: we first consider all the available information, including reply lines; then we eliminate the reply lines, which hurts TF-IDF the most; then we also eliminate the subject. In this last setting, the thread messages should resemble generally related messages, since the information shared by reply lines and the subject line has been eliminated. The results show that the graph approach does better than TF-IDF, especially when reply lines and subject are not available. Re-ranking improves results substantially in almost all cases.
24
Main Contributions
- Presented an extended similarity measure incorporating non-textual objects.
- Perform finite lazy random walks for typed search.
- A re-ranking paradigm to improve on graph-walk results.
- Instantiation of this framework for email.
- Enron datasets and corpora are available online.
25
Future Directions
- Scalability: sampling-based approximation to iterative matrix multiplication; 10-step walks on a million-node corpus in seconds.
- Language model.
- Learning: adjust the weights; eliminate noise in contextual/complex queries.
- Timeline.
26
Related Research
- IR: infinite walks for node centrality; graph walks for query expansion; spreading activation over semantic/association networks.
- Data mining: relational data representation.
- Machine learning: semi-supervised learning in graphs.
27
Thank you! Questions?