Search: A Basic Overview. Debapriyo Majumdar, Data Mining – Fall 2014, Indian Statistical Institute Kolkata, October 20, 2014.

Similar presentations
Introduction to Information Retrieval Lecture 7: Scoring and results assembly.

Chapter 5: Introduction to Information Retrieval
Introduction to Information Retrieval
Improved TF-IDF Ranker
Contextual Advertising by Combining Relevance with Click Feedback D. Chakrabarti D. Agarwal V. Josifovski.
Under The Hood [Part I] Web-Based Information Architectures MSEC – Mini II 28-October-2003 Jaime Carbonell.
Gertjan van Noord 2014 Zoekmachines Lecture 4.
Intelligent Information Retrieval 1 Vector Space Model for IR: Implementation Notes CSC 575 Intelligent Information Retrieval These notes are based, in.
Ranking models in IR Key idea: We wish to return in order the documents most likely to be useful to the searcher To do this, we want to know which documents.
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model.
Learning for Text Categorization
IR Models: Overview, Boolean, and Vector
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
Database Management Systems, R. Ramakrishnan1 Computing Relevance, Similarity: The Vector Space Model Chapter 27, Part B Based on Larson and Hearst’s slides.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) IR Queries.
Ch 4: Information Retrieval and Text Mining
Hinrich Schütze and Christina Lioma
Evaluating the Performance of IR Systems
Retrieval Models II Vector Space, Probabilistic.  Allan, Ballesteros, Croft, and/or Turtle Properties of Inner Product The inner product is unbounded.
INEX 2003, Germany Searching in an XML Corpus Using Content and Structure INEX 2003, Germany Yiftah Ben-Aharon, Sara Cohen, Yael Grumbach, Yaron Kanza,
Information Retrieval (Recuperação de Informação). IR: representation, storage, organization of, and access to information items. Emphasis is on the retrieval of information (not data).
CS246 Basic Information Retrieval. Today’s Topic  Basic Information Retrieval (IR)  Bag of words assumption  Boolean Model  Inverted index  Vector-space.
Chapter 5: Information Retrieval and Web Search
Indexing Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 6 9/8/2011.
Probabilistic Models in IR Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata Using majority of the slides from.
1 Vector Space Model Rong Jin. 2 Basic Issues in A Retrieval Model How to represent text objects What similarity function should be used? How to refine.
Latent Semantic Indexing Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.
MPI Informatik 1/17 Oberseminar AG5 Result merging in a Peer-to-Peer Web Search Engine Supervisors: Speaker : Sergey Chernov Prof. Gerhard Weikum Christian.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
Term Weighting and Ranking Models Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
Term Frequency. Term frequency Two factors: – A term that appears just once in a document is probably not as significant as a term that appears a number.
Chapter 6: Information Retrieval and Web Search
A fast algorithm for the generalized k- keyword proximity problem given keyword offsets Sung-Ryul Kim, Inbok Lee, Kunsoo Park Information Processing Letters,
1 Computing Relevance, Similarity: The Vector Space Model.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Ranking in Information Retrieval Systems Prepared by: Mariam John CSE /23/2006.
Web Search. Crawling Start from some root site e.g., Yahoo directories. Traverse the HREF links. Search(initialLink) fringe.Insert( initialLink ); loop.
Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.
IR Homework #2 By J. H. Wang Mar. 31, Programming Exercise #2: Query Processing and Searching Goal: to search relevant documents for a given query.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Introduction to Information Retrieval CS276 Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan.
Vector Space Models.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
NRA: Top-k query processing using No Random Access. Only sequential access. Algorithm: 1) scan index lists in parallel; 2) consider.
Modern Information Retrieval Lecture 2: Key concepts in IR.
Information Retrieval Techniques MS(CS) Lecture 7 AIR UNIVERSITY MULTAN CAMPUS Most of the slides adapted from IIR book.
Sudhanshu Khemka.  Treats each document as a vector with one component corresponding to each term in the dictionary  Weight of a component is calculated.
Natural Language Processing Topics in Information Retrieval August, 2002.
Evaluation. The major goal of IR is to search for documents relevant to a user query. The evaluation of the performance of IR systems relies on the notion.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Introduction to Information Retrieval Lecture 9: Scoring, Term Weighting and the Vector Space Model.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
CS315 Introduction to Information Retrieval Boolean Search 1.
Text Similarity: an Alternative Way to Search MEDLINE James Lewis, Stephan Ossowski, Justin Hicks, Mounir Errami and Harold R. Garner Translational Research.
Lecture 1: Introduction and the Boolean Model Information Retrieval
Indexing & querying text
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Information Retrieval and Web Search
IST 516 Fall 2011 Dongwon Lee, Ph.D.
Information Retrieval, Search Engines and Libraries
Basic Information Retrieval
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
6. Implementation of Vector-Space Retrieval
Chapter 5: Information Retrieval and Web Search
Boolean and Vector Space Retrieval Models
INF 141: Information Retrieval
Information Retrieval and Web Design
Presentation transcript:

Search: A Basic Overview. Debapriyo Majumdar, Data Mining – Fall 2014, Indian Statistical Institute Kolkata, October 20, 2014

Back in those days
• Once upon a time, there were days without search engines
• We had access to a much smaller amount of information
• We had to find information manually

Search engine
• User needs some information
• Assumption: the required information is present somewhere
• A search engine tries to bridge this gap
• How:
– User “expresses” the information need as a query
– The engine returns a list of documents, or answers by some better means

Search engine
• Simplest model:
– User submits a query: a set of words (terms)
– Search engine returns documents “matching” the query
– Assumption: matching the query satisfies the information need
• Modern search has come a long way from this simple model, but the fundamentals are still required

Basic approach
Example documents:
– This is in Indian Statistical Institute, Kolkata, India
– Statistically flying is the safest mode of journey
– Diwali is a huge festival in India
– India’s population is huge
– Thank god it is a holiday
– This is autumn
– There is no end of learning
Query: india statistics
• Documents contain terms
• Documents are represented by the terms present in them
• Match queries and documents by terms
• For simplicity: ignore positions, treat documents as “bags of words”
• There may be many matching documents; they need to be ranked

Vector space model
• Each term represents a dimension
• Documents are vectors in the term space
• The term-document matrix is a very sparse matrix
• The query is also a vector in the term space

             d1   d2   d3   d4   d5    q
diwali        1    0    0    0    0    0
india         1    0    0    1    1    1
flying        0    1    0    0    0    0
population    0    0    0    1    0    0
autumn        0    0    1    0    0    0
statistical   0    1    0    0    1    1

(here d1 = the Diwali document, d2 = the flying document, d3 = the autumn document, d4 = the population document, d5 = the Indian Statistical Institute document, and q = the query “india statistics”)

• The similarity of each document d with the query q is measured by the cosine similarity, the dot product normalized by the norms of the vectors: $\cos(d, q) = \frac{d \cdot q}{\lVert d \rVert \, \lVert q \rVert}$
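To make this concrete, here is a minimal Python sketch of cosine similarity over sparse term vectors, using binary weights as in the toy matrix above (the vectors shown are assumptions matching the example documents):

```python
# Minimal sketch: cosine similarity of sparse term vectors (dicts).
import math

def cosine(u, v):
    """Dot product of two sparse vectors, normalized by their norms."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# d5 ~ "This is in Indian Statistical Institute, Kolkata, India"
d5 = {"india": 1.0, "statistical": 1.0}
d1 = {"diwali": 1.0, "india": 1.0}
q  = {"india": 1.0, "statistical": 1.0}   # query "india statistics", stemmed

print(cosine(d5, q))   # ~1.0 (same direction as the query)
print(cosine(d1, q))   # ~0.5 (shares only the term "india")
```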

Scoring function: TF.iDF
• How important is a term t in a document d?
• Approach: take two factors into account
– With what significance does t occur in d? [term frequency]
– Does t occur in many other documents as well? [document frequency]
– Called TF.iDF: TF × iDF; there are many variants of TF and iDF
• Variants of TF(t, d):
1. Number of times t occurs in d: freq(t, d)
2. Logarithmically scaled frequency: $1 + \log \mathrm{freq}(t, d)$ for all t in d; 0 otherwise
3. Augmented frequency, avoiding a bias towards longer documents: $0.5 + 0.5 \cdot \frac{\mathrm{freq}(t, d)}{\max_{t'} \mathrm{freq}(t', d)}$ (half the score just for being present, the rest a function of the frequency)
• Inverse document frequency of t: $\mathrm{iDF}(t) = \log \frac{N}{\mathrm{DF}(t)}$, where N = total number of documents and DF(t) = number of documents in which t occurs
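As a sketch, the TF and iDF variants above can be written as follows; a document here is simply a list of tokens (the tokenization is assumed, not specified on the slide):

```python
# Sketch of the TF and iDF variants above. A document is a list of tokens.
import math
from collections import Counter

def tf_raw(term, doc):
    return doc.count(term)

def tf_log(term, doc):
    f = doc.count(term)
    return 1 + math.log(f) if f > 0 else 0.0

def tf_augmented(term, doc):
    counts = Counter(doc)
    if counts[term] == 0:
        return 0.0
    return 0.5 + 0.5 * counts[term] / max(counts.values())

def idf(term, corpus):
    """corpus: list of documents; DF(t) = number of docs containing t."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, doc, corpus):
    return tf_log(term, doc) * idf(term, corpus)
```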

BM25
• From the Okapi IR system: Okapi BM25
• If the query q = {q1, …, qn}, where the qi are the words in the query, the score of a document d is

$$\mathrm{score}(d, q) = \sum_{i=1}^{n} \mathrm{iDF}(q_i) \cdot \frac{\mathrm{freq}(q_i, d) \cdot (k_1 + 1)}{\mathrm{freq}(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}$$

where |d| = length of d, avgdl = average length of the documents, and k1 and b are optimized parameters, usually b = 0.75 and 1.2 ≤ k1 ≤ 2.0; a common choice for the first factor is $\mathrm{iDF}(q_i) = \log \frac{N - \mathrm{DF}(q_i) + 0.5}{\mathrm{DF}(q_i) + 0.5}$ with N = total number of documents
• BM25 has consistently exhibited better performance than TF.iDF in TREC evaluations
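A sketch of BM25 scoring under these definitions; the slide does not fix a particular iDF component, so the common Robertson–Spärck Jones variant is assumed here:

```python
# Sketch of Okapi BM25 with k1 = 1.2, b = 0.75. Documents are token lists.
import math

def bm25(query_terms, doc, corpus, k1=1.2, b=0.75):
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for q in query_terms:
        df = sum(1 for d in corpus if q in d)
        if df == 0:
            continue
        # one common iDF variant; can go negative for very frequent terms,
        # which implementations often clamp
        idf = math.log((n - df + 0.5) / (df + 0.5))
        f = doc.count(q)
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score
```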

Relevance  Simple IR model: query, documents, returned results  Relevant document: a document that satisfies the information need expressed by the query – Merely matching query terms does not make a document relevant – Relevance is human perception, not a mathematical statement – User may want some statistics on population of India by the query “india statistics” – The document “Indian Statistical Institute” matches the query terms, but not relevant  To evaluate effectiveness of a system, we need for each query 1.Given a result, an assessment of whether it is relevant 2.The set of all relevant results assessed (pre-validated) If the second is available, it serves the purpose of the first as well  Measures: precision, recall, F-measure (harmonic mean of precision and recall) 9

Inverted index  Standard representation: document  terms  Inverted index: term  documents  For each term t, store the list of the documents in which t occurs 10 This is in Indian Statistical Institute, Kolkata, India Statistically flying is the safest mode of journey Diwali is a huge festival in India India’s population is huge Thank god it is a holiday This is autumn There is no end of learning diwali: d3d3 india:d2d2 d3d3 d7d7 flying: d1d1 population: d7d7 autumn: d4d4 statistical: d1d1 d2d2 Scores?

Inverted index  Standard representation: document  terms  Inverted index: term  documents  For each term t, store the list of the documents in which t occurs 11 diwali: d 3 (0.5) india:d 2 (0.7)d 3 (0.3)d 7 (0.4) flying: d 1 (0.3) population:d 7 (0.6) autumn:d 4 (0.8) statistical: d 1 (0.2)d 2 (0.5) Note: These scores are dummy, not by any formula This is in Indian Statistical Institute, Kolkata, India Statistically flying is the safest mode of journey Diwali is a huge festival in India India’s population is huge Thank god it is a holiday This is autumn There is no end of learning

Positional index  Just documents and scores follows bag of words model  Cannot perform proximity search or phrase query search  Positional inverted index: also store position of each occurrence of term t in each document d where t occurs 12 diwali: d 3 (0.5): india:d 2 (0.7): d 3 (0.3): d 7 (0.4): flying: d 1 (0.3): population:d 7 (0.6): autumn:d 4 (0.8): statistical: d 1 (0.2): d 2 (0.5): This is in Indian Statistical Institute, Kolkata, India Statistically flying is the safest mode of journey Diwali is a huge festival in India India’s population is huge Thank god it is a holiday This is autumn There is no end of learning

Pre-processing  Removal of stopwords: of, the, and, … – Modern search does not completely remove stopwords – Such words add meaning to sentences as well as queries  Stemming: words  stem (root) of words – Statistics, statistically, statistical  statistic (same root) – Loss of slight information (the form of the word also matters) – But unifies differently expressed queries on the same topic – Lemmatization: doing this properly with morphological analysis of words  Normalization: unify equivalent words as much as possible – U.S.A, USA – Windows, windows  Stemming, lemmatization, normalization, synonym finding, all are important subfields on their own!! 13

Creating an inverted index
• For each document, write out the pairs (term, docId)
• Sort the pairs by term
• Group by term, compute DF

Step 1, pairs as emitted: (statistic, 1), (fly, 1), (safe, 1), …, (india, 2), (statistic, 2), (india, 3), …, (india, 7)
Step 2, sorted by term: (india, 2), (india, 3), (india, 7), …, (fly, 1), (safe, 1), (statistic, 1), (statistic, 2), …
Step 3, grouped with document frequencies: india (df=3): 2, 3, 7; fly (df=1): 1; statistic (df=2): 1, 2; …
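A sketch of this sort-based construction (the toy documents match the example; the tuple layout of the result is an assumption):

```python
# Sketch: sort-based inverted index construction with document frequencies.
from itertools import groupby

def build_index(docs):
    """docs: {doc_id: list of tokens}. Returns term -> (df, posting list)."""
    pairs = sorted({(term, doc_id)                 # step 1: (term, docId) pairs
                    for doc_id, tokens in docs.items()
                    for term in tokens})           # step 2: sort by term
    index = {}
    for term, group in groupby(pairs, key=lambda p: p[0]):  # step 3: group
        postings = [doc_id for _, doc_id in group]
        index[term] = (len(postings), postings)    # df = length of posting list
    return index

docs = {1: ["statistic", "fly", "safe"],
        2: ["india", "statistic"],
        3: ["india"],
        7: ["india"]}
for term, (df, postings) in build_index(docs).items():
    print(f"{term} (df={df}): {postings}")
# fly (df=1): [1]
# india (df=3): [2, 3, 7]
# safe (df=1): [1]
# statistic (df=2): [1, 2]
```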

Traditional architecture
• Indexing path: different types of documents → basic format conversion, parsing → analysis (stemming, normalization, …) → indexing → index
• Query path: user → query handler (query parsing) → core query processing (accessing the index, ranking) → results handler (displaying results) → back to the user

Query processing: merging the posting lists
• The posting lists are sorted by doc id
• Keep one pointer in each list; at each step, pick the smallest doc id among the pointers, combine its postings into the merged list, and advance the corresponding pointers
• The merged output grows step by step: doc 5 (0.6), doc 10 (0.1), doc 14 (0.6), doc 17 (1.6), …
• The fully merged list is still sorted by doc id: doc 5 (0.6), doc 10 (0.1), doc 14 (0.6), doc 17 (1.6), doc 21 (0.5), doc 25 (0.6), doc 38 (0.6), doc 44 (0.1), doc 61 (0.3), doc 65 (0.1), doc 78 (0.5), doc 81 (0.2), doc 83 (1.8), doc 91 (0.1)
• A (partial) sort by score gives the top-2: doc 83 (1.8), doc 17 (1.6)
• Complexity of the partial sort: heapify the n merged documents in O(n), then extract the k best at O(log n) each, i.e. O(k log n)

Merge
• Simple and efficient, with minimal overhead
• Lists sorted by doc id → merged list
• But the lists have to be scanned fully!
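A sketch of the merge using Python's heap-based heapq.merge; the posting lists here are illustrative, not the exact lists of the figures:

```python
# Sketch: k-way merge of posting lists sorted by doc id, summing the scores
# of postings that refer to the same document.
import heapq

def merge_postings(lists):
    """lists: iterables of (doc_id, score), each sorted by doc_id."""
    merged = []
    for doc_id, score in heapq.merge(*lists):      # O(n log k) over n postings
        if merged and merged[-1][0] == doc_id:
            merged[-1] = (doc_id, merged[-1][1] + score)
        else:
            merged.append((doc_id, score))
    return merged                                  # still sorted by doc id

lists = [[(5, 0.6), (17, 0.3)],
         [(10, 0.1), (17, 0.6)],
         [(14, 0.6), (17, 0.7)]]
merged = merge_postings(lists)
print(merged)   # [(5, 0.6), (10, 0.1), (14, 0.6), (17, 1.6)] up to float rounding
print(heapq.nlargest(2, merged, key=lambda p: p[1]))   # top-2: doc 17, doc 5
```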

Top-k algorithms  If there are millions of documents in the lists – Can the ranking be done without accessing the lists fully?  Exact top-k algorithms (used more in databases) – Family of threshold algorithms (Ronald Fagin et al) – Threshold algorithm (TA) – No random access algorithm (NRA) [we will discuss, as an example] – Combined algorithm (CA) – Other follow up works  Inexact top-k algorithms – Exact top-k not required, the scores are only “crude” approximation of “relevance” (human perception) – Several heuristics – Further reading: IR book by Manning, Raghavan and Schuetze, Ch. 7 30

NRA (No Random Access) algorithm
• The posting lists are sorted by score, in descending order
• Fagin’s NRA algorithm: in each round, read one doc from every list, sequentially

NRA (No Random Access) algorithm: round 1
• Read one doc from every list
• Each candidate carries an interval [current score, best score]: the current (worst) score is the sum of the scores seen so far; the best score additionally assumes the score at the current position of every list in which the doc has not been seen yet
• Candidates: doc 83 [0.9, 2.1], doc 17 [0.6, 2.1], doc 25 [0.6, 2.1]
• min top-2 score: 0.6; maximum score for unseen docs: 2.1
• min-top-2 < best-score of candidates, so continue

NRA (No Random Access) algorithm: round 2
• Read one more doc from every list
• Candidates: doc 17 [1.3, 1.8], doc 83 [0.9, 2.0], doc 25 [0.6, 1.9], doc 38 [0.6, 1.8], doc 78 [0.5, 1.8]
• min top-2 score: 0.9; maximum score for unseen docs: 1.8
• min-top-2 < best-score of candidates, so continue

NRA (No Random Access) algorithm: round 3
• Candidates: doc 83 [1.3, 1.9], doc 17 [1.3, 1.7], doc 25 [0.6, 1.5], doc 78 [0.5, 1.4]
• min top-2 score: 1.3; maximum score for unseen docs: 1.3
• No new doc can get into the top-2 any more, but extra candidates are left in the queue, so continue

NRA (No Random Access) algorithm: round 4
• Candidates: doc 83 [1.3, 1.9], doc 25 [0.6, 1.4]
• min top-2 score: 1.3; maximum score for unseen docs: 1.1
• No new doc can get into the top-2 any more, but extra candidates are left in the queue, so continue

NRA (No Random Access) algorithm: round 5
• min top-2 score: 1.6; maximum score for unseen docs: 0.8
• No extra candidate left in the queue. Done!

More approaches:
• Periodically also perform random accesses on candidate documents to reduce the uncertainty (Combined Algorithm, CA)
• Sophisticated scheduling of the accesses to the lists
• Crude approximation: NRA may take a long time to stop, so just stop after a while with an approximate top-k; who cares whether the results are perfect according to the scores?
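Finally, a compact sketch of NRA; the three lists are an illustrative reconstruction of the running example, so round counts and intermediate intervals may differ slightly from the slides:

```python
# Sketch of Fagin's NRA: sequential accesses only, one entry read from each
# list per round; every candidate keeps a [worst, best] score interval.

def nra(lists, k):
    """lists: one [(doc, score), ...] per list, each sorted by score descending."""
    known = {}                                     # doc -> {list index: score}
    depth = 0
    while depth < max(len(lst) for lst in lists):
        for i, lst in enumerate(lists):            # one sequential access per list
            if depth < len(lst):
                doc, score = lst[depth]
                known.setdefault(doc, {})[i] = score
        depth += 1
        # frontier ("high") score of each list: last value read, 0 if exhausted
        high = [lst[depth - 1][1] if depth <= len(lst) else 0.0 for lst in lists]
        worst = {d: sum(s.values()) for d, s in known.items()}
        best = {d: worst[d] + sum(h for i, h in enumerate(high) if i not in known[d])
                for d in known}
        top = sorted(known, key=lambda d: worst[d], reverse=True)[:k]
        if len(top) < k:
            continue
        threshold = min(worst[d] for d in top)     # min top-k (worst) score
        # stop when neither an unseen doc nor a leftover candidate can beat it
        if sum(high) <= threshold and all(best[d] <= threshold
                                          for d in known if d not in top):
            return [(d, worst[d]) for d in top]
    worst = {d: sum(s.values()) for d, s in known.items()}  # lists exhausted
    return sorted(worst.items(), key=lambda p: p[1], reverse=True)[:k]

lists = [
    [(83, 0.9), (17, 0.7), (61, 0.3), (44, 0.2), (13, 0.1)],
    [(17, 0.6), (38, 0.6), (5, 0.6), (83, 0.5), (21, 0.4)],
    [(25, 0.6), (78, 0.5), (83, 0.4), (17, 0.3), (91, 0.1)],
]
print(nra(lists, k=2))   # [(83, 1.8), (17, 1.6)] up to float rounding
```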

References  Primarily: IR Book by Manning, Raghavan and Schuetze: 37