3: Search & retrieval: Structures

Presentation transcript:

3: Search & retrieval: Structures

Example document: "The dog stopped attacking the cat, that lived in U.S.A."

Sources: collection, corpus, database, web
Docs d1 ... dn are processed into a term-doc matrix (plus meta-data):

term     d1   dn
dog      1    1
stop     1    0
attack   1    1
cat      1    0
live     1    0
USA      1    0
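The processing step on this slide can be sketched roughly as follows. This is a minimal illustration, not the method behind the slide: the stop list, the tiny stem table, the helper names `terms` and `term_doc_matrix`, and the second document are all assumptions chosen so that the output matches the matrix above.

```python
import re

# Hypothetical stop list and stem table, just large enough for this example.
STOP_WORDS = {"the", "that", "in"}
STEMS = {"stopped": "stop", "attacking": "attack", "attacks": "attack", "lived": "live"}

def terms(text):
    """Lowercase, drop periods inside tokens (U.S.A. -> usa), remove stop words, stem."""
    tokens = re.findall(r"[a-z.]+", text.lower())
    cleaned = (t.replace(".", "") for t in tokens)
    return [STEMS.get(t, t) for t in cleaned if t and t not in STOP_WORDS]

def term_doc_matrix(docs):
    """Return {term: [0/1 per document]}, with terms in first-seen order."""
    doc_terms = [terms(d) for d in docs]
    vocab = []
    for doc in doc_terms:
        for t in doc:
            if t not in vocab:
                vocab.append(t)
    return {t: [int(t in doc) for doc in doc_terms] for t in vocab}

docs = [
    "The dog stopped attacking the cat, that lived in U.S.A.",  # d1 from the slide
    "The dog attacks.",                                         # assumed stand-in for dn
]
matrix = term_doc_matrix(docs)
for term, row in matrix.items():
    print(f"{term:8}{row}")
```

A real system would use a proper stemmer and stop list; the lookup table only keeps the sketch self-contained.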

Term-document matrix: an entry is 1 if the play contains the word, 0 otherwise
Example query: Brutus AND Caesar but NOT Calpurnia
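One way to answer such a query is to combine the 0/1 rows of the matrix position by position. The sketch below assumes the incidence rows commonly used with this Shakespeare example (they are not given on the slide, so treat them as illustrative); `boolean_query` is a hypothetical helper name.

```python
PLAYS = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

# Illustrative incidence rows: 1 if the play contains the word, 0 otherwise.
INCIDENCE = {
    "Brutus":    [1, 1, 0, 1, 0, 0],
    "Caesar":    [1, 1, 0, 1, 1, 1],
    "Calpurnia": [0, 1, 0, 0, 0, 0],
}

def boolean_query(incidence, must_have, must_not_have):
    """AND together the rows in must_have and the complements of must_not_have."""
    n_docs = len(next(iter(incidence.values())))
    result = [1] * n_docs
    for term in must_have:
        result = [r & bit for r, bit in zip(result, incidence[term])]
    for term in must_not_have:
        result = [r & (1 - bit) for r, bit in zip(result, incidence[term])]
    return result

hits = boolean_query(INCIDENCE, ["Brutus", "Caesar"], ["Calpurnia"])
matching = [play for play, hit in zip(PLAYS, hits) if hit]
print(hits)      # [1, 0, 0, 1, 0, 0]
print(matching)  # ['Antony and Cleopatra', 'Hamlet']
```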

Inverted index construction
Documents to be indexed: "Friends, Romans, countrymen."
→ Tokenizer → token stream: Friends, Romans, Countrymen
→ Linguistic modules → modified tokens: friend, roman, countryman
→ Indexer → inverted index: friend, roman, countryman
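The three stages of this pipeline could look something like the sketch below. The normalization table stands in for the "linguistic modules", and the second document, the document IDs, and the function names are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the linguistic modules (case folding plus stemming).
NORMALIZE = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def tokenize(text):
    """Tokenizer: split the raw text into a stream of word tokens."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Linguistic modules: lowercase, then map to a base form if one is known."""
    lowered = token.lower()
    return NORMALIZE.get(lowered, lowered)

def build_inverted_index(docs):
    """Indexer: map each modified token to the sorted list of doc IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            postings[normalize(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

docs = {
    1: "Friends, Romans, countrymen.",  # from the slide
    2: "Romans and friends",            # assumed second document
}
index = build_inverted_index(docs)
print(index)
# {'and': [2], 'countryman': [1], 'friend': [1, 2], 'roman': [1, 2]}
```

Keeping each postings list sorted by document ID is what makes merging lists for AND queries straightforward.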

TF-IDF (ranking)
Binary match (Boolean) vs. probabilistic ranking (similarity)
Term frequency (occurrences of a term in a doc): $\mathrm{tf}_{t,d}$ = frequency of term $t$ in doc $d$
Document frequency (docs with the term): $\mathrm{df}_t$ = number of documents containing term $t$
Inverse document frequency ($n$ = total docs): $\mathrm{idf}_t = \log(n / \mathrm{df}_t)$
The tf.idf weight for term $t$ in document $d$, $\mathrm{tf}_{t,d} \times \mathrm{idf}_t$, is:
(1) highest when $t$ occurs many times in few documents
(2) lower when $t$ occurs few times in $d$ or occurs in many documents
(3) lowest when $t$ is frequent across many documents
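As a rough sketch of these definitions, the function below computes tf.idf weights for a toy corpus. The slide does not state the base of the logarithm, so base 10 is an assumption here, as are the function name `tf_idf` and the three pre-tokenized example documents.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log10(n / df)."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log10(n / df[t]) for t in tf})
    return weights

# Assumed toy corpus of already-tokenized documents.
docs = [
    ["dog", "attack", "dog"],
    ["cat", "attack"],
    ["dog", "cat", "attack"],
]
weights = tf_idf(docs)
print(weights[0]["attack"])  # 0.0, because "attack" occurs in every document
print(weights[0]["dog"])     # 2 * log10(3 / 2), roughly 0.352
```

The term that appears in every document gets weight zero, which matches case (3) above.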

Documents as vectors
Each doc d can now be viewed as a vector of wf × idf values, one component for each term
So we have a vector space
– terms are axes
– docs live in this space
– even with stemming, may have 50,000+ dimensions (axes)

High-dimensional vector space
Postulate: Documents that are “close together” in the vector space talk about the same things.
[Figure: documents d1 to d5 drawn as vectors against term axes t1, t2, t3, with angles θ and φ between document vectors; the example documents are built from the terms dog, cat, attack.]
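A common way to make "close together" concrete is the cosine of the angle between two document vectors; the slide does not name a measure, so cosine similarity is an assumption here, as are the example vectors. Documents are stored as sparse {term: weight} dicts so that tens of thousands of dimensions do not need to be materialized.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat an empty document as similar to nothing
    return dot / (norm_u * norm_v)

# Assumed example documents over the terms dog, cat, attack.
d1 = {"dog": 2, "attack": 1}
d2 = {"dog": 4, "attack": 2}  # same direction as d1, twice as long
d3 = {"cat": 3}               # shares no terms with d1

print(cosine(d1, d2))  # approximately 1.0: same topic mix, length ignored
print(cosine(d1, d3))  # 0.0: no terms in common
```

Using the angle rather than the distance between vector tips means a long document and a short one on the same topic still count as close.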

Classic IR: match query to indexed docs
Re-articulating need as query
Faceted search: “chunking” and “aliasing”

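Precision = relevant docs returned / all docs returned
Recall = relevant docs returned / all relevant docs
Polysemy: one term, several concepts (Term1 → Concept1, Concept2, Concept3)
Synonymy: several terms, one concept (Term1, Term2, Term3 → Concept1)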
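The two ratios can be computed from the set of documents a system returned and the set judged relevant. This is a minimal sketch; the function name and the document IDs are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned docs and a set of relevant docs."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were actually returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list and relevance judgments.
returned = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d7"}

precision, recall = precision_recall(returned, relevant)
print(precision)  # 0.5: two of the four returned docs are relevant
print(recall)     # two of the three relevant docs were returned
```

Both ratios share the same numerator; they differ only in whether the returned set or the relevant set is the denominator.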

“Text” processing: 200 factors
– Document similarity: like tf.idf
– Web page: update, au, anchor
– Link structure: PageRank
Google: commercial; ad populum fallacy
GoogleScholar: indexing; 10-50% accessible