Mehran Sahami, Timothy D. Heilman: A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets

Presentation transcript:

Introduction
- Goal: determine how similar two short text snippets are
- High degree of semantic similarity with few or no shared terms: "United Nations Secretary General" vs. "Kofi Annan"; "AI" vs. "Artificial Intelligence"
- Shared terms but different meanings: "graphical models" vs. "graphical interface"

Related Work
- Query expansion techniques
- Other means of determining query similarity
- Set overlap (intersection)
- SVM for text classification
- Latent Semantic Kernels (LSK)
- Semantic Proximity Matrix
- Cross-lingual techniques

A New Similarity Function
- Let $x$ represent a short text snippet, issued as a query to a search engine $S$
- Let $R(x)$ be the set of $n$ retrieved documents $d_1, \ldots, d_n$
- Compute the TFIDF term vector $v_i$ for each document $d_i \in R(x)$
- Truncate each vector $v_i$ to include its $m$ highest-weighted terms

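Normalize
- Let $C(x)$ be the centroid of the L2-normalized vectors $v_i$ (the truncated TFIDF vectors of the $n$ retrieved documents): $C(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{v_i}{\|v_i\|_2}$
- Let $QE(x)$ be the L2 normalization of the centroid $C(x)$: $QE(x) = \frac{C(x)}{\|C(x)\|_2}$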
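
The truncation and normalization steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes each retrieved document is already available as a sparse TFIDF vector (a dict from term to weight), and the helper names `l2_normalize`, `truncate`, and `query_expansion` are hypothetical.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a sparse vector (term -> weight) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return {}
    return {term: w / norm for term, w in vec.items()}


def truncate(vec, m):
    """Keep only the m highest-weighted terms of a sparse vector."""
    top = sorted(vec.items(), key=lambda kv: kv[1], reverse=True)[:m]
    return dict(top)


def query_expansion(doc_vectors, m):
    """Return QE(x): the L2-normalized centroid C(x) of the truncated,
    L2-normalized document vectors (assumed to be TFIDF-weighted)."""
    if not doc_vectors:
        return {}
    centroid = defaultdict(float)
    for vec in doc_vectors:
        for term, w in l2_normalize(truncate(vec, m)).items():
            centroid[term] += w / len(doc_vectors)
    return l2_normalize(centroid)
```

Normalizing each document before averaging keeps one long document from dominating the centroid, and the final normalization gives every expansion unit length.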

Kernel Function
- $K(x, y) = QE(x) \cdot QE(y)$, the inner product of the query expansions of the two snippets
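
Assuming the kernel is the inner product of the two query expansions, $K(x, y) = QE(x) \cdot QE(y)$, it reduces to a sparse dot product. The sketch below uses dicts from term to weight. The function name `kernel` and the two toy vectors are made up for illustration and are not taken from the slides.

```python
def kernel(qe_x, qe_y):
    """Inner product of two sparse query expansions (term -> weight)."""
    # Iterate over the smaller vector; terms missing from the other
    # vector contribute nothing to the sum.
    if len(qe_x) > len(qe_y):
        qe_x, qe_y = qe_y, qe_x
    return sum(w * qe_y.get(term, 0.0) for term, w in qe_x.items())


# Hypothetical unit-length expansions, for illustration only.
qe_a = {"artificial": 0.6, "intelligence": 0.8}
qe_b = {"intelligence": 0.6, "learning": 0.8}
print(kernel(qe_a, qe_b))
```

Because both inputs are unit vectors with non-negative weights, the score lies between 0 and 1 and equals the cosine of the angle between the two expansions.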

Initial Results with Kernel
Three genres of text snippet matching:
- Acronyms
- Individuals and their positions
- Multi-faceted terms

Acronyms
Table columns: Text 1, Text 2, Kernel, Cosine, Set Overlap (score values not preserved in this transcript)
- Support vector machine / SVM
- Portable document format / PDF
- Artificial intelligence / AI
- Artificial insemination / AI
- term frequency inverse document frequency / tf idf
- term frequency inverse document frequency / tfidf
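
To see why acronym pairs are hard for the two term-matching baselines in the table, here is a rough sketch of both. It assumes cosine is computed over raw term-count vectors and that set overlap is intersection over union. The slides do not spell out either definition, so both are assumptions.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercase, whitespace-split tokenization (an assumption)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between raw term-count vectors of two snippets."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def set_overlap(a, b):
    """Shared terms divided by all distinct terms (Jaccard, assumed)."""
    sa, sb = set(tokens(a)), set(tokens(b))
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


print(cosine("Support vector machine", "SVM"))
print(set_overlap("Support vector machine", "SVM"))
```

An acronym and its expansion share no tokens, so both baselines score the pair as completely unrelated.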

Individuals and their positions

Multi-faceted terms

Related Query Suggestion
- Compute the kernel function $K(u, q_i)$ for each $q_i \in Q$, where $u$ is any newly issued user query
- $Q$ is a repository of approximately 116 million popular user queries issued in 2003, determined by sampling anonymized web search logs from the Google search engine

Algorithm
- Given: user query $u$ and a list of matched queries from the repository
- Output: list of queries to suggest
- Initialize the suggestion list
- Sort kernel scores in descending order to produce an ordered list of corresponding queries
- MAX is set to the maximum number of suggestions

Post-Filter
- $|q|$ denotes the number of terms in query $q$
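
The ranking and post-filtering steps might look roughly like the sketch below. The filter formula did not survive in this transcript, so the rule used here is an assumption: a candidate is kept only if the number of its terms not found in an already accepted query (or in the user query) exceeds half that query's length. The names `suggest` and `min_new_fraction` and the scores in the usage example are hypothetical.

```python
def terms(query):
    """Set of lowercase whitespace-separated terms in a query."""
    return set(query.lower().split())


def suggest(user_query, scored_matches, max_suggestions, min_new_fraction=0.5):
    """Pick suggestions from (query, kernel_score) pairs.

    Candidates are visited in descending kernel-score order. The filter
    rule and its 0.5 threshold are assumptions made for illustration.
    """
    ranked = sorted(scored_matches, key=lambda pair: pair[1], reverse=True)
    suggestions = []
    for candidate, _score in ranked:
        if len(suggestions) >= max_suggestions:
            break
        cand_terms = terms(candidate)
        # Compare against the user query and everything accepted so far.
        reference_queries = [user_query] + suggestions
        if all(
            len(cand_terms - terms(z)) > min_new_fraction * len(terms(z))
            for z in reference_queries
        ):
            suggestions.append(candidate)
    return suggestions


# Hypothetical kernel scores, for illustration only.
matches = [
    ("california lotto home", 0.95),
    ("california lotto", 0.90),
    ("winning lotto numbers in california", 0.70),
    ("super lotto plus", 0.60),
]
print(suggest("california lottery", matches, max_suggestions=2))
```

The filter stops near-duplicates such as "california lotto" from crowding out suggestions that add new terms.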

Evaluation of Query Suggestion System
Rating scale:
1. Suggestion is totally off topic.
2. Suggestion is not as good as the original query.
3. Suggestion is basically the same as the original query.
4. Suggestion is potentially better than the original query.
5. Suggestion is fantastic: it should be suggested, since it might help a user find what they're looking for if they issued it instead of the original query.

Evaluations
Table columns: Original Query, Suggested Queries, Kernel Score, Human Rating (score and rating values not preserved in this transcript)
- Original query "california lottery": california lotto home; winning lotto numbers in california; california lottery super lotto plus
- Original query "valentines day": 2003 valentine's day; valentine day card; valentines day greeting cards; I love you valentine; new valentine one

Average ratings at various kernel thresholds

Average ratings versus average number of query suggestions

Application in QA
Kernel scores compared (numeric values not preserved in this transcript):
- K("Who shot Abraham Lincoln", "John Wilkes Booth")
- K("Who shot Abraham Lincoln", "Abraham Lincoln")

Conclusion
- A new kernel function for measuring the semantic similarity between pairs of short text snippets
- Future work: the first direction is improving the generation of query expansions, with the goal of improving the match score for the kernel function

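Term Weighting Scheme
- The weight $w_{i,j}$ associated with the term $t_i$ in document $d_j$ is defined to be: $w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$
- where $tf_{i,j}$ is the frequency of $t_i$ in $d_j$, $N$ is the total number of documents, and $df_i$ is the total number of documents that contain $t_i$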
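
A small sketch of this weighting, assuming the standard form "term frequency times log of (N over document frequency)", documents given as token lists, and the natural logarithm. The slide does not state the log base, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute one sparse TFIDF vector (term -> weight) per document.

    docs is a list of token lists. The weight of a term in a document is
    its frequency there times log(N / df), where N is the number of
    documents and df the number of documents containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append(
            {t: f * math.log(n_docs / doc_freq[t]) for t, f in term_freq.items()}
        )
    return vectors


corpus = [["kernel", "kernel", "web"], ["web", "search"]]
print(tfidf_vectors(corpus))
```

A term that occurs in every document gets weight 0, since log(N / N) = 0, so it contributes nothing to similarity scores.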

Lp Norm
- Given by: $\|x\|_p = \left( \sum_{i} |x_i|^p \right)^{1/p}$
- Most common cases:
- $p = 1$: the $L_1$ norm, also called the Manhattan distance
- $p = 2$: the $L_2$ norm, also called the Euclidean distance
- $p = \infty$: the $L_\infty$ norm, also called the infinity norm or the Chebyshev norm, $\|x\|_\infty = \max_i |x_i|$
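
The three common cases can be checked with a short function. This is a plain-Python sketch using the standard definition of the p-norm; the function name `lp_norm` is hypothetical.

```python
import math


def lp_norm(x, p):
    """Return the Lp norm of a sequence of numbers, for p >= 1 or p = inf."""
    if p < 1:
        raise ValueError("p must be >= 1 for a norm")
    if p == math.inf:
        # Limit case: the largest absolute component (Chebyshev norm).
        return max((abs(v) for v in x), default=0.0)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


vector = [3, -4]
print(lp_norm(vector, 1))         # L1 (Manhattan)
print(lp_norm(vector, 2))         # L2 (Euclidean)
print(lp_norm(vector, math.inf))  # L-infinity (Chebyshev)
```

L2 normalization of a vector, as used for the term vectors and the centroid, means dividing every component by `lp_norm(x, 2)`.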