Under The Hood [Part I] — Web-Based Information Architectures, MSEC 20-760, Mini II, 28-October-2003, Jaime Carbonell

Topics Covered
- The Vector Space Model for IR (VSM)
- Evaluation Metrics for IR
- Query Expansion (the Rocchio Method)
- Inverted Indexing for Efficiency
- A Glimpse into Harder Problems

The Vector Space Model

Definitions of document and query vectors: each document d_i is represented as a vector of term counts,

  d_i = [c(w_1, d_i), c(w_2, d_i), ..., c(w_m, d_i)]

where w_j = the j-th word in the vocabulary, and c(w_j, d_i) = the count of occurrences of w_j in document d_i. Queries are represented the same way.
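The count vectors above can be sketched as term-count dictionaries; this is a minimal illustration (the function name and toy texts are my own, not from the lecture):

```python
from collections import Counter

def term_vector(text):
    """Represent a document as a sparse vector of raw term counts c(w_j, d_i)."""
    return Counter(text.lower().split())

d1 = term_vector("Heart attack medicine heart")
q = term_vector("heart medicine")
```

A `Counter` returns 0 for absent terms, which conveniently models the zero components of the vector.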

Computing the Similarity

Dot-product similarity:
  sim(q, d_i) = q . d_i = Sum_j c(w_j, q) * c(w_j, d_i)

Cosine similarity:
  sim(q, d_i) = (q . d_i) / (|q| |d_i|)

Computing Norms and Products

Dot product:
  q . d = Sum_j q_j * d_j

Euclidean vector norm (aka "2-norm"):
  |d| = sqrt(Sum_j d_j^2)
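The dot product, 2-norm, and cosine similarity above can be sketched over sparse count vectors (function names are my own):

```python
import math
from collections import Counter

def dot(u, v):
    """Dot product, summing only over terms that are non-zero in u."""
    return sum(u[t] * v.get(t, 0) for t in u)

def norm(v):
    """Euclidean 2-norm: sqrt of the sum of squared weights."""
    return math.sqrt(sum(w * w for w in v.values()))

def cosine(u, v):
    """Cosine similarity = dot(u, v) / (|u| * |v|)."""
    n = norm(u) * norm(v)
    return dot(u, v) / n if n else 0.0

q = Counter({"heart": 1, "medicine": 1})
d = Counter({"heart": 2, "attack": 1, "medicine": 1})
```

Cosine normalizes away document length, so a long document is not favored merely for repeating terms.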

Similarity in Retrieval

Similarity ranking: if sim(q, d_i) > sim(q, d_j), then d_i ranks higher than d_j.
Retrieving the top k documents: return the k documents with the highest sim(q, d).
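Top-k retrieval by similarity ranking can be sketched as follows; the shared-term similarity here is a toy stand-in (names and documents are my own, not from the lecture):

```python
def sim(q, d):
    """Toy similarity: number of shared terms (dot product of 0/1 vectors)."""
    return len(set(q) & set(d))

def top_k(query, docs, k):
    """Rank documents by descending similarity to the query; keep the top k."""
    return sorted(docs, key=lambda d: sim(query, d), reverse=True)[:k]

docs = [["heart", "disease"],
        ["terrorist", "attack"],
        ["heart", "attack", "medicine"]]
best = top_k(["heart", "attack", "medicine"], docs, k=2)
```

In practice one would use cosine over weighted vectors and a partial sort (heap) rather than sorting the whole collection.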

Refinements to VSM (1)

Word normalization
- Words in morphological root form: countries => country, interesting => interest
- Stemming as a fast approximation: countries, country => countr; moped => mop
- Reduces vocabulary (always good)
- Generalizes matching (usually good)
- More useful for non-English IR (Arabic has > 100 variants per verb)
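A crude suffix-stripper illustrates the stemming idea above; this is not a real stemmer such as Porter's algorithm, and the function name and suffix list are my own:

```python
def crude_stem(word):
    """Very crude stemming: strip one common suffix if the remaining
    stem is long enough. Real stemmers apply ordered rule cascades."""
    for suffix in ("ies", "es", "ing", "s", "y"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

Note how it conflates "countries" and "country" into the non-word stem "countr", exactly the fast approximation the slide describes.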

Refinements to VSM (2)

Stop-Word Elimination
- Discard articles, auxiliaries, prepositions, ... (typically the most frequent small words)
- Reduces document "length" by 30-40%
- Retrieval accuracy improves slightly (5-10%)

Refinements to VSM (3)

Proximity Phrases
- E.g.: "air force" => airforce
- Found by high mutual information:
    p(w1 w2) >> p(w1) p(w2)
    p(w1 & w2 in k-window) >> p(w1 in k-window) p(w2 in same k-window)
- Retrieval accuracy improves slightly (5-10%)
- Too many phrases => inefficiency
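The mutual-information test above can be sketched over adjacent word pairs; the function name, threshold, and minimum-count filter are my own assumptions:

```python
from collections import Counter

def phrase_candidates(tokens, threshold=2.0):
    """Flag adjacent pairs whose joint probability far exceeds the
    product of their individual probabilities (high mutual information).
    Pairs seen only once are skipped as unreliable evidence."""
    n = len(tokens)
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    phrases = []
    for (w1, w2), c in bi.items():
        p_joint = c / (n - 1)
        p_indep = (uni[w1] / n) * (uni[w2] / n)
        if c >= 2 and p_joint > threshold * p_indep:
            phrases.append((w1, w2))
    return phrases

tokens = "air force base air force jet air force".split()
phrases = phrase_candidates(tokens)
```

On a real corpus the window-based variant from the slide would count co-occurrence within k words rather than strict adjacency.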

Refinements to VSM (4)

Words => Terms
- term = word | stemmed word | phrase
- Use exactly the same VSM method on terms (instead of words)

Evaluating Information Retrieval (1)

Contingency table:

                    relevant    not-relevant
  retrieved            a             b
  not retrieved        c             d

Recall = a/(a+c) = fraction of relevant documents retrieved
Precision = a/(a+b) = fraction of retrieved documents that are relevant

Evaluating Information Retrieval (2)

P = a/(a+b)    R = a/(a+c)
Accuracy = (a+d)/(a+b+c+d)
F1 = 2PR/(P+R)
Miss = c/(a+c) = 1 - R (false negatives)
F/A = b/(a+b+c+d) (false positives)
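These metrics follow directly from the four contingency counts; a minimal sketch (function name and example counts are my own):

```python
def ir_metrics(a, b, c, d):
    """Standard IR metrics from contingency-table counts:
    a = relevant retrieved, b = irrelevant retrieved,
    c = relevant missed,    d = irrelevant not retrieved."""
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (a + d) / (a + b + c + d)
    return {"P": precision, "R": recall, "F1": f1, "Acc": accuracy}

m = ir_metrics(a=30, b=20, c=10, d=940)
```

Note how accuracy (0.97 here) is dominated by the many true negatives, which is why IR evaluation prefers precision, recall, and F1.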

Evaluating Information Retrieval (3)

11-point precision curves
- The IR system generates a total ranking of the collection
- Plot precision at 0%, 10%, 20%, ..., 100% recall
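An 11-point curve can be sketched from a ranked list of relevance judgments; this version uses the common interpolation (max precision at any recall at or above each level), which is an assumption on my part, as is the function name:

```python
def eleven_point_precision(relevance, total_relevant):
    """Precision at the 11 standard recall levels 0.0, 0.1, ..., 1.0.
    `relevance` is the ranked list of 0/1 judgments for retrieved docs."""
    pr = []  # (recall, precision) observed after each rank
    hits = 0
    for i, rel in enumerate(relevance, start=1):
        hits += rel
        pr.append((hits / total_relevant, hits / i))
    points = []
    for level in [i / 10 for i in range(11)]:
        # interpolated precision: best precision at recall >= level
        ps = [p for r, p in pr if r >= level]
        points.append(max(ps) if ps else 0.0)
    return points

pts = eleven_point_precision([1, 0, 1], total_relevant=2)
```

Averaging such curves over many queries gives the familiar precision-recall plot for a system.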

Query Expansion (1)

Observations:
- Longer queries often yield better results
- The user's vocabulary may differ from the document vocabulary
    Q: how to avoid heart disease
    D: "Factors in minimizing stroke and cardiac arrest: Recommended dietary and exercise regimens"
- Longer queries give more chances for query terms to match relevant documents, helping recall

Query Expansion (2)

Bridging the Gap
- Human query expansion (user or expert)
- Thesaurus-based expansion: seldom works in practice (unfocused)
- Relevance feedback
    - Widens a thin bridge over the vocabulary gap
    - Adds words from the document space to the query
- Pseudo-relevance feedback
- Local context analysis

Relevance Feedback: Rocchio's Method

Idea: update the query via user feedback.
Exact method (vector sums):

  Q' = alpha * Q + beta * Sum_{d in D_rel} d - gamma * Sum_{d in D_irr} d

Relevance Feedback (2)

For example, if:
  Q = (heart attack medicine)
  W(heart,Q) = W(attack,Q) = W(medicine,Q) = 1

  D_rel = (cardiac arrest prevention medicine nitroglycerine heart disease ...)
  W(nitroglycerine,D_rel) = 2, W(medicine,D_rel) = 1

  D_irr = (terrorist attack explosive semtex attack nitroglycerine proximity fuse ...)
  W(attack,D_irr) = 1, W(nitroglycerine,D_irr) = 2, W(explosive,D_irr) = 1

and alpha = 1, beta = 2, gamma = 0.5

Relevance Feedback (3)

Then:
  W(attack,Q') = 1*1 + 2*0 - 0.5*1 = 0.5
  W(nitroglycerine,Q') = 1*0 + 2*2 - 0.5*2 = 3
  W(medicine,Q') = 1*1 + 2*1 - 0.5*0 = 3
  W(explosive,Q') = 1*0 + 2*0 - 0.5*1 = -0.5
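The Rocchio update from the example can be sketched term by term over weight dictionaries; the function name is my own, and the full D_rel/D_irr weight maps are filled in from the phrases above with weight 1 for unlisted terms:

```python
def rocchio(query, rel_docs, irr_docs, alpha=1.0, beta=2.0, gamma=0.5):
    """Rocchio update: Q' = alpha*Q + beta*sum(rel) - gamma*sum(irr),
    computed independently for every term seen in any vector."""
    terms = set(query)
    for d in rel_docs + irr_docs:
        terms |= set(d)
    return {t: (alpha * query.get(t, 0)
                + beta * sum(d.get(t, 0) for d in rel_docs)
                - gamma * sum(d.get(t, 0) for d in irr_docs))
            for t in terms}

Q = {"heart": 1, "attack": 1, "medicine": 1}
D_rel = {"cardiac": 1, "arrest": 1, "prevention": 1, "medicine": 1,
         "nitroglycerine": 2, "heart": 1, "disease": 1}
D_irr = {"terrorist": 1, "attack": 1, "explosive": 1, "semtex": 1,
         "nitroglycerine": 2, "proximity": 1, "fuse": 1}
Q2 = rocchio(Q, [D_rel], [D_irr])
```

Running this reproduces the weights worked out on the slide, including the negative weight that pushes "explosive" away from the query.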

Term Weighting Methods (1)

Salton's Tf*IDf
- Tf = term frequency in a document
- Df = document frequency of the term = # documents in the collection containing this term
- IDf = Df^-1

Term Weighting Methods (2)

Salton's Tf*IDf
- TfIDf = f1(Tf) * f2(IDf)
- E.g. f1(Tf) = Tf * avg(|D_j|) / |D|
- E.g. f2(IDf) = log2(IDf)
- f1 and f2 can differ for Q and D
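One common concrete instantiation of the scheme above scales the inverse document frequency by collection size, giving weight = Tf * log2(N/Df); that scaling, and the function name, are my assumptions rather than the slide's exact formula:

```python
import math

def tfidf(tf, df, n_docs):
    """Tf*IDf weight with f2(IDf) = log2(N / Df).
    tf: term count in the document; df: # documents containing the term;
    n_docs: total number of documents N in the collection."""
    return tf * math.log2(n_docs / df)
```

A term appearing in every document gets weight 0 (log2(1) = 0), while rare terms are boosted, which is exactly the discriminative behavior Tf*IDf is designed for.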

Efficient Implementations of VSM (1)

Exploit sparseness
- Only compute non-zero multiplies in dot-products
- Do not even look at zero elements (how?)
    => Use non-stop terms to index documents

Efficient Implementations of VSM (2)

Inverted Indexing
- Find all unique [stemmed] terms in the document collection
- Remove stopwords from the word list
- If the collection is large (over 100,000 documents), [optionally] remove singletons (usually spelling errors or obscure names)
- Alphabetize or use a hash table to store the list
- For each term, create a data structure like:

Efficient Implementations of VSM (3)

  [term, IDF(term), <(doc_i, freq(term, doc_i)), (doc_j, freq(term, doc_j)), ...>]

or, with positions:

  [term, IDF(term), <(doc_i, freq(term, doc_i), [pos_1,i, pos_2,i, ...]),
                     (doc_j, freq(term, doc_j), [pos_1,j, pos_2,j, ...]), ...>]

where pos_1,j indicates the first position of the term in document j, and so on.
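The postings structure above (with positions) can be sketched as a dictionary from term to a list of (doc_id, frequency, positions) tuples; the function name is my own:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Build a positional inverted index: for each term, a postings list
    of (doc_id, freq(term, doc), [positions of term in doc])."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return dict(index)

index = build_inverted_index(["heart attack medicine", "terrorist attack"])
```

At query time, only the postings lists of the query's terms are touched, which is how the sparseness of the term-document matrix is exploited.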

Open Research Problems in IR (1)

Beyond VSM
- Vectors in different spaces: Generalized VSM, Latent Semantic Indexing, ...
- Probabilistic IR (language modeling):
    P(D|Q) = P(Q|D) P(D) / P(Q)

Open Research Problems in IR (2)

Beyond Relevance
- Appropriateness of a document to the user (comprehension level, etc.)
- Novelty of the information in a document to the user (anti-redundancy as an approximation to novelty)

Open Research Problems in IR (3)

Beyond one Language
- Translingual IR
- Transmedia IR

Open Research Problems in IR (4)

Beyond Content Queries
- "What's new today?"
- "What sort of things do you know about?"
- "Build me a Yahoo-style index for X"
- "Track the event in this news-story"