Contextual Search and Name Disambiguation in Email Using Graphs Einat Minkov, William W. Cohen, Andrew Y. Ng Carnegie Mellon University and Stanford University.

Presentation transcript:

Contextual Search and Name Disambiguation in Email Using Graphs. Einat Minkov, William W. Cohen, Andrew Y. Ng. Carnegie Mellon University and Stanford University. SIGIR 2006

INTRODUCTION: When computing document similarity, other information is available besides textual features, e.g. hyperlinks in web pages, meta-data, and header information in email. In this paper we consider extended similarity metrics for documents and other objects embedded in graphs, facilitated via a lazy graph walk.

INTRODUCTION: In a lazy graph walk, there is a fixed probability of halting the walk at each step. Two problems are considered: disambiguating personal names in email, and email threading.

EMAIL AS A GRAPH

“Einat Minkov”: person node. “Einat Minkov”: email-address node “ ”. Other rules.

Edge weights: To walk away from a node x, one first picks an edge label l. We assume that the probability of picking the label l depends only on the type T(x) of the node x.

Edge weights: Once l is picked, y is chosen uniformly from the set of all y reachable from x by an edge labeled l.

Graph walks: In a lazy graph walk, there is some probability γ of staying at x. If V_0 is some initial probability distribution over nodes, then the distribution after a k-step walk is proportional to $V_k = V_0 M^k$, where M is the transition matrix of the walk.

Graph walks: In our framework, a query is an initial distribution Vq over nodes, plus a desired output type Tout. Ex. for the query “economic impact of recycling tires”, Vq would be an appropriate distribution over the query terms, with Tout = file.
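The walk described on the slides above can be sketched in a few lines. This is only an illustration under stated assumptions: the node names, edge labels (`in-file`, `has-term`, `sent-from`, `sent-from-inv`) and label probabilities are hypothetical, and renormalizing over the labels that actually leave a node is a choice made here, not something the slides specify.

```python
from collections import defaultdict

def lazy_walk(edges, node_type, label_probs, v0, k, gamma=0.5):
    """Return the node distribution after k lazy steps.

    edges:       node -> {label -> [neighbor nodes]}
    node_type:   node -> type name, i.e. T(x)
    label_probs: type -> {label -> Pr(label | type)}
    v0:          node -> initial probability (the query distribution Vq)
    gamma:       probability of staying at the current node
    """
    dist = dict(v0)
    for _ in range(k):
        nxt = defaultdict(float)
        for x, mass in dist.items():
            # Assumption: only labels with at least one outgoing edge can be
            # picked, and their probabilities are renormalized.
            out = {l: p for l, p in label_probs[node_type[x]].items()
                   if edges.get(x, {}).get(l)}
            total = sum(out.values())
            if total == 0:
                nxt[x] += mass  # dead end: all mass stays at x
                continue
            nxt[x] += gamma * mass
            for l, p in out.items():
                ys = edges[x][l]
                # y is chosen uniformly among the nodes reachable via label l.
                share = (1 - gamma) * mass * (p / total) / len(ys)
                for y in ys:
                    nxt[y] += share
        dist = dict(nxt)
    return dist

def answer(dist, node_type, t_out):
    """Rank nodes of the desired output type Tout by walk probability."""
    hits = [(n, p) for n, p in dist.items() if node_type[n] == t_out]
    return sorted(hits, key=lambda item: -item[1])

# Hypothetical toy graph: one term, two files, one person.
edges = {
    "t:tires": {"in-file": ["f1", "f2"]},
    "f1": {"has-term": ["t:tires"], "sent-from": ["p:alice"]},
    "f2": {"has-term": ["t:tires"]},
    "p:alice": {"sent-from-inv": ["f1"]},
}
node_type = {"t:tires": "term", "f1": "file", "f2": "file", "p:alice": "person"}
label_probs = {
    "term": {"in-file": 1.0},
    "file": {"has-term": 0.5, "sent-from": 0.5},
    "person": {"sent-from-inv": 1.0},
}
one_step = lazy_walk(edges, node_type, label_probs, {"t:tires": 1.0}, k=1)
ranked_files = answer(one_step, node_type, "file")
```

Here the query is the single term node with Tout = file, so `ranked_files` lists the two file nodes with the probability mass each received after one step.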

Relation to TF-IDF: Suppose we restrict ourselves to only two types, terms and files, and allow only in-file edges. A common term like “the” will spread its probability mass into small fractions over many file nodes. An unusual term like “aardvark” will spread its weight over only a few files. The effect will be similar to use of an IDF weighting scheme.
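A small numeric illustration of this effect, assuming a hypothetical posting list and a query that puts equal weight on each term: each term spreads its mass uniformly over the files containing it, so a file receives 1/df from each matching term, which plays the role of an IDF weight (without the logarithm).

```python
def one_step_from_term(term, postings):
    """Spread a term's unit mass uniformly over the files that contain it."""
    files = postings[term]
    return {f: 1.0 / len(files) for f in files}

# Hypothetical posting lists: "the" is in every file, "aardvark" in one.
postings = {
    "the": ["f1", "f2", "f3", "f4"],
    "aardvark": ["f2"],
}
query = ["the", "aardvark"]

scores = {}
for term in query:
    for f, mass in one_step_from_term(term, postings).items():
        # Each query term starts with probability 1 / len(query).
        scores[f] = scores.get(f, 0.0) + mass / len(query)
```

In this toy case the file containing the rare term ends up with most of the mass, while the files that only match “the” each receive a small fraction.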

LEARNING: Previous researchers have described schemes for adjusting the parameters using gradient descent-like methods. In this paper, we suggest an alternative approach of learning to re-order an initial ranking.

LEARNING: The reranking algorithm is provided with a training set containing n examples. Example i includes a ranked list of l_i nodes. Let w_ij be the j-th node for example i. A candidate node w_ij is represented through m features, which are computed by m feature functions f_1, ..., f_m.

LEARNING: The ranking function for node x is defined as $F(x, \bar{\alpha}) = \alpha_0 L(x) + \sum_{k=1}^{m} \alpha_k f_k(x)$, where L(x) = log(p(x)) and ᾱ is a vector of real-valued parameters. The parameters ᾱ are learned by minimizing a loss function on the training data.
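A minimal sketch of such a reranker follows. The transcript does not show the formulas, so both pieces here are assumptions: the score is taken to be a linear combination of L(x) = log p(x) and the feature values, and the loss is taken to be an exponential pairwise loss of the kind used in boosting-based reranking. The feature values and weights are hypothetical.

```python
import math

def score(alpha, log_p, feats):
    """Assumed form: F(x) = alpha[0] * L(x) + sum_k alpha[k] * f_k(x)."""
    return alpha[0] * log_p + sum(a * f for a, f in zip(alpha[1:], feats))

def exp_loss(alpha, examples):
    """Assumed loss: sum of exp(-(F(correct) - F(other))) over all examples.

    examples: list of (correct, others); every node is a (log_p, feats) pair.
    """
    loss = 0.0
    for correct, others in examples:
        f_correct = score(alpha, *correct)
        for other in others:
            loss += math.exp(-(f_correct - score(alpha, *other)))
    return loss

def rerank(alpha, candidates):
    """Re-order (name, log_p, feats) candidates by decreasing score."""
    return sorted(candidates, key=lambda c: -score(alpha, c[1], c[2]))

# Hypothetical example: the walk ranks the wrong node first (p = 0.5 vs 0.2),
# but the correct node fires feature 1.
examples = [((math.log(0.2), [1, 0]), [(math.log(0.5), [0, 1])])]
walk_only = [1.0, 0.0, 0.0]       # ignore features, trust the walk
with_feature = [1.0, 2.0, 0.0]    # reward feature 1
candidates = [("a", math.log(0.5), [0, 1]), ("b", math.log(0.2), [1, 0])]
reranked = rerank(with_feature, candidates)
```

With the walk-only weights the loss is 2.5; giving weight to the feature that the correct node fires lowers the loss and moves that node to the top of the list.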

Corpora: The CSpace corpus contains email messages collected from a management course conducted at Carnegie Mellon University in 1997. The Enron corpus is a collection of mail from Enron that has been made available for the research community.

Person Name Disambiguation: “Andrew” = “Andrew Y. Ng” or “Andrew McCallum”? For the CSpace corpus, we collected 106 cases in which single-token names were mentioned in the body of a message but did not match any name from the header.

Person Name Disambiguation: For Enron, two datasets were generated automatically. We eliminate the collected person name from the header. The names in this corpus include people that are in the header, but cannot be matched because

Results for person name disambiguation: Baseline method. The similarity score between the name term and a person name is calculated as the maximal Jaro similarity score between the term and any single token of the personal name (ranging from 0 to 1). In addition, we incorporate a nickname dictionary, such that if the name term is a known nickname of the person name, the similarity score of that pair is set to 1.
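The baseline can be sketched as follows, using the standard definition of Jaro similarity. The lowercasing, whitespace tokenization, and the shape of the nickname dictionary (nickname mapped to a set of full first names) are assumptions for illustration, and the entry `"andy"` is a hypothetical example.

```python
def jaro(s, t):
    """Standard Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order count as
    # half a transposition each.
    s_seq = [c for c, h in zip(s, s_hit) if h]
    t_seq = [c for c, h in zip(t, t_hit) if h]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) / 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def name_similarity(term, person_name, nicknames):
    """Max Jaro score over name tokens; 1.0 if term is a known nickname."""
    term = term.lower()
    tokens = person_name.lower().split()
    if any(tok in nicknames.get(term, set()) for tok in tokens):
        return 1.0
    return max(jaro(term, tok) for tok in tokens)

nicknames = {"andy": {"andrew"}}  # hypothetical dictionary entry
candidates = ["Andrew Y. Ng", "Andrew McCallum", "Einat Minkov"]
ranked = sorted(candidates, key=lambda p: -name_similarity("andy", p, nicknames))
```

Note that both “Andrew” candidates receive the same score of 1 here, which illustrates why string similarity alone cannot resolve the ambiguity the slides describe.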

Results for person name disambiguation: Graph walk methods. Two choices of Vq are tried: (1) a query distribution on the name term; (2) equal weight to the name term node and the file in which it appears. Tout = person type. We use a uniform weighting of labels.

Reranking the output of a walk: Edge unigram features: for each edge label L, whether L was used in reaching x from Vq. Edge bigram features: whether L1 and L2 were used (in that order) in reaching x from Vq. Top edge bigram features: paths leading to a node originate from one or two nodes in Vq.
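The edge unigram and bigram features might be computed as below, assuming the paths that reach a node x are available explicitly as sequences of edge labels. That representation, the label names, and the reading of "in that order" as "consecutively" are all assumptions made for this sketch.

```python
def edge_features(paths, labels):
    """Binary unigram and bigram edge-label features for one target node.

    paths:  list of label sequences, one per path from Vq to the node x
    labels: all edge labels in the graph schema
    """
    unigrams = {l: False for l in labels}
    bigrams = {(a, b): False for a in labels for b in labels}
    for path in paths:
        for l in path:
            unigrams[l] = True
        # Assumption: a bigram fires when two labels are used consecutively.
        for a, b in zip(path, path[1:]):
            bigrams[(a, b)] = True
    return unigrams, bigrams

# Hypothetical labels and paths for a person node reached from a name term.
labels = ["in-file", "sent-from", "sent-to"]
paths = [["in-file", "sent-from"], ["in-file", "sent-to"]]
unigrams, bigrams = edge_features(paths, labels)
```

Each feature is a binary indicator, so the resulting dictionaries can be flattened in a fixed order into the feature vector consumed by a reranker.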

Person name disambiguation results: Recall at rank k

Person Name Disambiguation Results

Threading: A thread is a conversation among 2 or more people carried out by exchange of messages. Threading problem: retrieving other messages in an email thread given a single message from the thread. Given an email file as a query, produce a ranked list of related email files, where the immediate parent and child of the given file are considered to be “correct” answers.

Threading: Several information types are available. Header: sender, recipients and date. Body: the textual content of an email. Reply lines: quoted lines from previous messages. Subject: the content of the subject line.

Threading: Baseline method: TF-IDF term weighting + cosine similarity. Graph walk methods: Vq assigns probability 1 to the file node corresponding to the original message, Tout = file.
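A minimal sketch of a TF-IDF plus cosine baseline for this task follows. The slides do not say which TF-IDF variant was used, so raw term frequency times log(N/df) is an assumption, as are the pre-tokenized toy messages.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf * idf} vector."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_related(query_idx, vectors):
    """Rank every other message by similarity to the query message."""
    others = [i for i in range(len(vectors)) if i != query_idx]
    return sorted(others, key=lambda i: -cosine(vectors[query_idx], vectors[i]))

# Hypothetical tokenized messages.
docs = [
    ["budget", "meeting", "monday"],
    ["re", "budget", "meeting"],
    ["lunch", "friday"],
]
vectors = tfidf_vectors(docs)
related = rank_related(0, vectors)
```

The query is a whole message, mirroring the graph walk setting where Vq puts all its probability on one file node and the output type is file.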

Graph walk methods: Weight-tuning method: we evaluate 10 randomly-chosen sets of weights and pick the one that performs best (in terms of MAP) on the CSpace training data. Reranking the output of walks: the features applied are edge unigram, edge bigram and top edge bigram.
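The weight-tuning step amounts to a small random search, which could look like the sketch below. The sampling distribution, the per-type normalization, and the `evaluate` callback (which would compute MAP on the training queries) are assumptions; `toy_map` is a stand-in used only to make the example runnable.

```python
import random

def random_weight_search(labels_by_type, evaluate, n_trials=10, seed=0):
    """Try n_trials random label weightings and keep the best-scoring one.

    labels_by_type: node type -> list of outgoing edge labels
    evaluate:       callable taking {type: {label: weight}}, returning a score
    """
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_trials):
        weights = {}
        for node_type, labels in labels_by_type.items():
            # Assumption: sample positive weights, then normalize so that
            # they form Pr(label | type).
            raw = [rng.uniform(0.01, 1.0) for _ in labels]
            total = sum(raw)
            weights[node_type] = {l: r / total for l, r in zip(labels, raw)}
        s = evaluate(weights)
        if s > best_score:
            best, best_score = weights, s
    return best, best_score

# Hypothetical schema and a stand-in objective in place of training-set MAP.
labels_by_type = {"file": ["has-term", "sent-from"], "term": ["in-file"]}

def toy_map(weights):
    return weights["file"]["sent-from"]

best_weights, best_score = random_weight_search(labels_by_type, toy_map)
```

In a real run, `evaluate` would execute the graph walk with the candidate weights on each training query and return the mean average precision.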

Threading Results: MAP

Threading results: Recall at rank k
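For reference, the two evaluation measures named on these slides can be computed as below. These are the common textbook definitions; the transcript does not spell out the exact variants used, so treat them as assumptions. The ranked list and relevant set are hypothetical.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """Average AP over a list of (ranked, relevant) query results."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Hypothetical query result: parent and child messages are "a" and "c".
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
ap = average_precision(ranked, relevant)
```

For the threading task, `relevant` would hold the immediate parent and child of the query message.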

CONCLUSION: We have presented a scheme for representing a corpus of email messages with a graph of typed entities. This scheme provides good performance on two representative email-related tasks: disambiguating person names, and threading.