Information Retrieval: Models and Methods

Information Retrieval: Models and Methods October 15, 2003 CMSC Gina-Anne Levow.
Roadmap
- Problem: matching topics and documents
- Methods (classic): Vector Space Model, N-grams, HMMs
- Challenge: beyond literal matching
  - Expansion strategies
  - Aspect models

Matching Topics and Documents
Two main perspectives:
- Pre-defined, fixed, finite topics: “Text Classification”
- Arbitrary topics, typically defined by a statement of information need (aka query): “Information Retrieval”

Matching Topics and Documents
- Documents are “about” some topic(s)
- Question: what is the evidence of “aboutness”?
  - Words!
  - Possibly also metadata in documents: tags, etc.
- The model encodes how words capture topic
  - E.g. “bag of words” model, Boolean matching
  - What information is captured?
  - How is similarity computed?

Models for Retrieval and Classification
- A plethora of models are used
- Here: Vector Space Model, N-grams, HMMs

Vector Space Information Retrieval
- Task:
  - Document collection
  - Query specifies information need: free text
  - Relevance judgments: 0/1 for all docs
- Word evidence: bag of words
  - No ordering information

Vector Space Model
- Represent documents and queries as vectors of term-based features
- Features: tied to occurrence of terms in the collection
- E.g. Solution 1: binary features: t = 1 if present, 0 otherwise
  - Similarity: number of terms in common
  - Dot product
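
As a minimal sketch of the binary-feature idea above: with 0/1 features, the dot product of a document vector and a query vector counts the terms they share. The vocabulary, the whitespace tokenizer, and the function names are illustrative assumptions, not part of the original slides.

```python
def binary_vector(text, vocabulary):
    """Return a 0/1 vector: 1 if the vocabulary term occurs in the text."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in vocabulary]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical four-term vocabulary, chosen only for illustration.
vocabulary = ["cat", "dog", "house", "white"]
doc = binary_vector("The white dog chased the cat", vocabulary)
query = binary_vector("white house", vocabulary)

# With binary features the dot product is the number of shared terms.
overlap = dot(doc, query)
```

Here only "white" occurs in both texts, so the overlap is a single term.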

Vector Space Model II
- Problem: not all terms are equally interesting
  - E.g. the vs dog vs Levow
- Solution: replace binary term features with weights
  - Document collection: term-by-document matrix
  - View each document as a vector in multidimensional space
    - Nearby vectors are related
  - Normalize for vector length

Vector Similarity Computation
- Similarity = dot product
- Normalization:
  - Normalize weights in advance
  - Normalize post-hoc
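
The slide's formula did not survive extraction. The sketch below assumes the usual length-normalized dot product (cosine similarity) and shows that normalizing the weights in advance and normalizing post-hoc give the same score. All function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))


def cosine_post_hoc(u, v):
    """Take the raw dot product, then divide by both vector lengths."""
    denom = norm(u) * norm(v)
    return dot(u, v) / denom if denom else 0.0


def unit(u):
    """Scale a vector to length 1 (a zero vector is returned unchanged)."""
    length = norm(u)
    return [a / length for a in u] if length else list(u)


def cosine_in_advance(u, v):
    """Normalize each vector first; the plain dot product is then the cosine."""
    return dot(unit(u), unit(v))


doc = [3.0, 0.0, 4.0]
query = [1.0, 0.0, 0.0]
score_post_hoc = cosine_post_hoc(doc, query)
score_in_advance = cosine_in_advance(doc, query)
```

Normalizing in advance is the natural choice when document vectors are stored in an index, since the division is then done once per document instead of once per query.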

Term Weighting
- “Aboutness”
  - To what degree is this term what the document is about?
  - Within-document measure
  - Term frequency (tf): # occurrences of t in doc j
- “Specificity”
  - How surprised are you to see this term?
  - Collection frequency
  - Inverse document frequency (idf)
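
The idf formula itself is missing from the transcript. A common choice, assumed here, is idf(t) = log(N / n_t), where N is the number of documents and n_t is the number of documents containing t. The term weight is then tf × idf. The sketch below uses a toy three-document collection and whitespace tokenization, both of which are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, weight = tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = ["the dog barks", "the cat sleeps", "the dog sleeps"]
weights = tf_idf(docs)
```

Because "the" occurs in every document, its idf is log(3/3) = 0 and it contributes nothing to similarity, while "barks" (one document) gets the largest weight.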

Term Selection & Formation
- Selection:
  - Some terms are truly useless
    - Too frequent, no content: e.g. the, a, and, …
  - Stop words: ignore such terms altogether
- Creation:
  - Too many surface forms for the same concepts
    - E.g. inflections of words: verb conjugations, plurals
  - Stem terms: treat all forms as the same underlying form
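
A rough sketch of both steps, stop-word removal and stemming. The stop list and the suffix-stripping rule are deliberately crude, hypothetical stand-ins; a real system would use a proper stemmer such as Porter's.

```python
# Tiny illustrative stop list; real stop lists are much longer.
STOP_WORDS = {"the", "a", "and", "of", "in"}

# Suffixes tried in order; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "es", "ed", "s")


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Lowercase, drop stop words, and stem what remains."""
    return [crude_stem(t) for t in text.lower().split() if t not in STOP_WORDS]


terms = index_terms("The dogs and the dog barked")
```

After this step "dogs" and "dog" map to the same index term, so a query for either form matches both.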

N-grams
- Simple model
- Evidence: more than bag of words
  - Captures context, order information
  - E.g. White House
- Applicable to many text tasks
  - Language identification, authorship attribution, genre classification, topic/text classification
  - Language modeling for ASR, etc.

Text Classification with N-grams
- Task: classes identified by document sets
  - Assign new documents to the correct class
- N-gram categorization:
  - Text: D; category: c
  - Select c maximizing posterior probability

Text Classification with N-grams
- Representation: for each class, train an N-gram model
- “Similarity”: for each document D to classify, select c with highest likelihood
  - Can also use entropy/perplexity
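
The procedure above can be sketched as follows: train one character bigram model per class, then assign a new document to the class whose model gives it the highest log-likelihood. The add-one smoothing, the uniform class prior, the tiny training texts, and all names are assumptions made for illustration; the slides do not specify them.

```python
import math
from collections import Counter


class CharNgramModel:
    """Character n-gram model with add-one smoothing (an assumed choice)."""

    def __init__(self, texts, n=2):
        self.n = n
        self.ngrams = Counter()    # counts of full n-grams
        self.contexts = Counter()  # counts of the (n-1)-character prefixes
        self.alphabet = set()
        for text in texts:
            padded = " " * (n - 1) + text
            self.alphabet.update(padded)
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                self.ngrams[gram] += 1
                self.contexts[gram[:-1]] += 1

    def log_likelihood(self, text, vocab_size):
        padded = " " * (self.n - 1) + text
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            # Add-one smoothing keeps unseen n-grams from zeroing the product.
            numerator = self.ngrams[gram] + 1
            denominator = self.contexts[gram[:-1]] + vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(text, models):
    """Pick the class whose model assigns the text the highest likelihood."""
    alphabet = set(text)
    for model in models.values():
        alphabet |= model.alphabet
    return max(models, key=lambda c: models[c].log_likelihood(text, len(alphabet)))


# Toy training sets, purely illustrative.
models = {
    "english": CharNgramModel(["the white house is in washington", "the dog and the cat"]),
    "spanish": CharNgramModel(["la casa blanca esta en washington", "el perro y el gato"]),
}
label = classify("the cat is in the house", models)
```

With a uniform prior over classes, maximizing the likelihood is the same as maximizing the posterior probability named on the previous slide.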

Assessment & Smoothing
- Comparable to “state of the art”: 0.89 accuracy
- Reliable
  - Across smoothing techniques
  - Across languages: generalizes to Chinese characters
- Results table (columns: n, Abs, G-T, Lin, W-B; rows n = 4, 5, 6); surviving values: 0.88, 0.87, 0.89

HMMs
- Provides a generative model of topicality
  - Solid probabilistic framework rather than ad hoc weighting
- Noisy channel model:
  - View query Q as output of underlying relevant document D, passed through mind of user

HMM Information Retrieval
- Task: given user-generated query Q, return ranked list of relevant documents
- Model:
  - Maximize Pr(D is Relevant) for some query Q
  - Output symbols: terms in document collection
  - States: process to generate output symbols
    - From document D
    - From General English
- Figure: between Query start and Query end, a General English state (transition a, output Pr(q|GE)) and a Document state (transition b, output Pr(q|D))

HMM Information Retrieval
- Generally use EM to train transition and output probabilities
  - E.g. from query-relevant document pairs
  - Data often insufficient
- Simplified strategy:
  - EM for transition probabilities, assumed the same across docs
  - Output distributions
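
The output-distribution formulas are not preserved in the transcript. The sketch below assumes simple maximum-likelihood estimates: Pr(q|D) is the relative frequency of q in the document, and Pr(q|GE) is its relative frequency in the whole collection. Each query term is then scored by the two-state mixture a·Pr(q|GE) + b·Pr(q|D). The values a = 0.7 and b = 0.3, the toy documents, and the function names are illustrative assumptions.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Maximum-likelihood unigram estimate; unseen terms get probability 0."""
    counts = Counter(tokens)
    total = len(tokens)
    return lambda term: counts[term] / total


def query_log_likelihood(query, doc_tokens, collection_tokens, a=0.7, b=0.3):
    """log Pr(Q | D) under the mixture a * Pr(q|GE) + b * Pr(q|D)."""
    p_doc = relative_frequencies(doc_tokens)
    p_ge = relative_frequencies(collection_tokens)
    # Assumes every query term occurs somewhere in the collection,
    # otherwise the mixture is 0 and the logarithm is undefined.
    return sum(math.log(a * p_ge(q) + b * p_doc(q)) for q in query)


docs = {
    "d1": "white house press briefing".split(),
    "d2": "dog chased the white cat".split(),
}
collection = [t for tokens in docs.values() for t in tokens]
query = ["white", "house"]

ranked = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], collection),
    reverse=True,
)
```

The General English component plays the role of smoothing: d2 never mentions "house", yet it still receives a finite score.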

EM Parameter Update
- Figure: transition probabilities a and b re-estimated as a′ and b′
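
The slide's diagram is lost, so what follows is only the textbook EM step for a two-component mixture, assumed to match the simplified strategy of sharing a and b across documents. For each query term in the training pairs, the E-step computes the posterior probability that the Document state generated it; the M-step sets b′ to the average posterior and a′ = 1 − b′. The input probabilities are made-up numbers.

```python
def em_update(a, b, term_probs):
    """One EM step for the shared transition weights.

    term_probs holds one (Pr(q|GE), Pr(q|D)) pair per query term
    observed in the query / relevant-document training pairs.
    """
    # E-step: posterior that each term came from the Document state.
    doc_posteriors = [
        b * p_doc / (a * p_ge + b * p_doc) for p_ge, p_doc in term_probs
    ]
    # M-step: the new weight is the mean posterior.
    new_b = sum(doc_posteriors) / len(doc_posteriors)
    return 1.0 - new_b, new_b


# Hypothetical training evidence: (Pr(q|GE), Pr(q|D)) for three query terms.
term_probs = [(0.01, 0.2), (0.05, 0.0), (0.02, 0.1)]

a1, b1 = em_update(0.5, 0.5, term_probs)

# Iterating the update moves the weights toward a local optimum.
a, b = 0.5, 0.5
for _ in range(50):
    a, b = em_update(a, b, term_probs)
```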

Evaluation
- Comparison to VSM
  - HMM can outperform VSM
  - Some variation related to implementation
- Can integrate other features, e.g. bigram or trigram models

Key Issue
- All approaches operate on term matching
  - If a synonym, rather than the original term, is used, the approach fails
- Develop more robust techniques
  - Match “concept” rather than term
- Expansion approaches
  - Add in related terms to enhance matching
- Mapping techniques
  - Associate terms to concepts
  - Aspect models, stemming

Expansion Techniques
- Can apply to query or document
- Thesaurus expansion
  - Use linguistic resource (thesaurus, WordNet) to add synonyms/related terms
- Feedback expansion
  - Add terms that “should have appeared”
  - User interaction: direct or relevance feedback
  - Automatic: pseudo relevance feedback

Query Refinement
- Typical queries are very short, ambiguous
  - Cat: animal/Unix command
  - Add more terms to disambiguate, improve
- Relevance feedback
  - Retrieve with original queries
  - Present results
  - Ask user to tag relevant/non-relevant
  - “Push” toward relevant vectors, away from non-relevant ones
  - “Rocchio” expansion formula
    - β + γ = 1 (0.75, 0.25); r: rel docs, s: non-rel docs
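
The formula itself is missing from the transcript. The sketch assumes the common form of Rocchio feedback: the new query is the original query plus β times the centroid of the relevant vectors minus γ times the centroid of the non-relevant vectors, with β = 0.75 and γ = 0.25 as on the slide. Leaving the original query at weight 1 is an assumption, as are the toy vectors.

```python
def centroid(vectors, dims):
    """Mean of a list of vectors; a zero vector if the list is empty."""
    if not vectors:
        return [0.0] * dims
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def rocchio(query, relevant, nonrelevant, beta=0.75, gamma=0.25):
    """Push the query toward relevant documents and away from non-relevant ones."""
    dims = len(query)
    r = centroid(relevant, dims)
    s = centroid(nonrelevant, dims)
    return [q + beta * ri - gamma * si for q, ri, si in zip(query, r, s)]


# Three-term toy space; vectors are illustrative term weights.
query = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
nonrelevant = [[0.0, 0.0, 1.0]]

new_query = rocchio(query, relevant, nonrelevant)
```

The second term, absent from the original query but present in every relevant document, gains positive weight; the third term, seen only in the non-relevant document, goes negative. Many implementations clip negative weights to zero, which this sketch does not do.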

Compression Techniques
- Reduce surface term variation to concepts
- Stemming
  - Map inflectional variants to root
    - E.g. see, sees, seen, saw -> see
  - Crucial for highly inflected languages: Czech, Arabic
- Aspect models
  - Matrix representations typically very sparse
  - Reduce dimensionality to a small # of key aspects
    - Mapping contextually similar terms together
    - Latent semantic analysis

Latent Semantic Analysis

Latent Semantic Analysis

LSI

Classic LSI Example (Deerwester)

SVD: Dimensionality Reduction

LSI, SVD, & Eigenvectors
- SVD decomposes the term x document matrix X as X = TSD’
  - Where T, D are the left and right singular vector matrices, and
  - S is a diagonal matrix of singular values
- Corresponds to the eigenvector-eigenvalue decomposition Y = VLV’
  - Where V is orthonormal and L is diagonal
  - T: matrix of eigenvectors of Y = XX’
  - D: matrix of eigenvectors of Y = X’X
  - S: square roots of the eigenvalues in the diagonal matrix L
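
A small numerical check of these relations, using NumPy on a made-up 5-term by 4-document matrix (the matrix and variable names are illustrative assumptions). It decomposes X, confirms that the eigenvalues of XX’ are the squared singular values, and builds the rank-k approximation that LSI uses for dimensionality reduction.

```python
import numpy as np

# Toy term-by-document matrix: rows are terms, columns are documents.
X = np.array(
    [
        [1, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 1],
        [1, 0, 0, 1],
    ],
    dtype=float,
)

# Thin SVD: X = T @ diag(s) @ Dt, where Dt is D' (D transposed).
T, s, Dt = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of the symmetric matrix X X', largest first.
# X X' is 5 x 5, so only its top len(s) eigenvalues can be nonzero.
eigenvalues = np.linalg.eigvalsh(X @ X.T)[::-1][: len(s)]

# Rank-k approximation: keep the k largest singular values.
k = 2
X_k = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]

# Documents in the reduced k-dimensional space, one column each.
docs_k = np.diag(s[:k]) @ Dt[:k, :]
```

Similarities between documents can then be computed on the columns of `docs_k` instead of the original sparse columns of X.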

Computing Similarity in LSI

SVD details

SVD Details (cont’d)