Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC, Fall 2008

Overview
– Indexing
– Ranking
– Query Expansion
– Query Evaluation
– Tupleflow

Topics Not Covered
– Binned Probabilities
– Score-Sorted Index Optimization
– Document-Sorted Index Optimization
– Navigational Search with Complex Features

Document Indexing
– Inverted list: a mapping from a single word to the set of documents that contain that word.
– Inverted index: a set of inverted lists.

Inverted Index
– Contains one inverted list for each term in the document collection.
– Often omits frequently occurring words such as “a,” “and” and “the.”

Inverted Index Example
Sample documents:
1. Cats, dogs, dogs.
2. Dogs, cats, sheep.
3. Whales, sheep, goats.
4. Fish, whales, whales.

Inverted index:
cats: 1, 2
dogs: 1, 2
fish: 4
goats: 3
sheep: 2, 3
whales: 3, 4

Query            Answer
cats             1, 2
sheep + dogs     2

Expanding Inverted Indexes
Include the term frequency with each posting: more occurrences of a term imply the document is more “about” that term. Each entry is (document, count):
cats: (1,1), (2,1)
dogs: (1,2), (2,1)
fish: (4,1)
goats: (3,1)
sheep: (2,1), (3,1)
whales: (3,1), (4,2)

Expanding Inverted Indexes (cont.)
Add word position information, which facilitates phrase searching. Each entry is (document, count): positions:
cats: (1,1): 1; (2,1): 2
dogs: (1,2): 2, 3; (2,1): 1
fish: (4,1): 1
goats: (3,1): 3
sheep: (2,1): 3; (3,1): 2
whales: (3,1): 1; (4,2): 2, 3
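
To make the positional index concrete, here is a minimal Python sketch (mine, not Strohman's implementation) that builds an index recording document, term frequency, and word positions for the four sample documents:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to {doc_id: [positions]}; term frequency is len(positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split(), start=1):
            term = word.strip(".,")
            index[term].setdefault(doc_id, []).append(pos)
    return index

docs = {1: "Cats, dogs, dogs.", 2: "Dogs, cats, sheep.",
        3: "Whales, sheep, goats.", 4: "Fish, whales, whales."}
index = build_index(docs)
print(index["dogs"])   # {1: [2, 3], 2: [1]}, i.e. (1,2): 2,3 and (2,1): 1
```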

Inverted Index Statistics
Compressed inverted indexes containing only word counts:
– about 5% of the document collection in size
– built and queried faster
Compressed inverted indexes containing word counts and positions:
– about 20% of the document collection in size
– essential for high effectiveness, even in queries not using phrases

Document Ranking
– Documents are returned in order of relevance.
– Perfect ranking is impossible.
– Retrieval systems instead calculate the probability that a document is relevant.

Computing Relevance
Assume a “bag of words” model with term independence.
Simple estimate: P(w | D) = (# occurrences of w in D) / (document length)
Problems:
1. A document that does not contain every word of a multi-word query is not retrieved: a document containing none of the query words scores the same as one containing some of them. Smoothing can help.
2. All words are treated equally. For the query “Maltese falcon”, document(maltese: 2, falcon: 1) = document(maltese: 1, falcon: 2) for documents of similar length.
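
As an illustration of how smoothing repairs the zero-probability problem, here is a small query-likelihood sketch using Jelinek-Mercer smoothing (one standard choice; the dissertation's exact scoring functions may differ, and lambda_ is a free parameter of my choosing):

```python
import math

def score(query_terms, doc_counts, doc_len, coll_counts, coll_len, lambda_=0.5):
    s = 0.0
    for w in query_terms:
        p_doc = doc_counts.get(w, 0) / doc_len        # simple per-document estimate
        p_coll = coll_counts.get(w, 0) / coll_len     # collection background model
        p = (1 - lambda_) * p_doc + lambda_ * p_coll  # smoothed mixture, > 0 for any collection term
        s += math.log(p) if p > 0 else float("-inf")
    return s
```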

Computing Relevance (cont.)
Add additional features:
– position/field in the document, e.g. the title
– proximity of the query terms
– combinations of the above

Computing Relevance (cont.)
Add query-independent information:
– Number of links from other documents
– URL depth (shorter = general, longer = specific)
– User clicks: may reflect matched expectations rather than relevance
– Dwell time
– Document quality models: an unusual term distribution implies poor grammar, so the document is not a good retrieval candidate

Query Expansion
Stemming: groups words that express the same concept, based on natural-language rules, e.g. run, runs, running, ran.
– Aggressive stemmer: may group words that are not related, e.g. marine, marinate.
– Conservative stemmer: may fail to group words that are related, e.g. run, ran.
– Statistical stemmer: uses word co-occurrence data to determine whether words are related; would probably avoid the marine/marinate mistake.
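
For illustration, NLTK's Porter stemmer (assuming nltk is installed; the dissertation does not prescribe this particular stemmer) shows both the useful groupings and the kinds of mistakes a rule-based stemmer can make:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["run", "runs", "running", "ran", "marine", "marinate"]:
    print(word, "->", stemmer.stem(word))
# "run", "runs", and "running" share a stem, but "ran" does not;
# aggressive suffix rules can also conflate unrelated words like marine/marinate.
```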

Query Expansion (cont.)
Synonyms: group terms that mean the same concept.
Problem: synonyms may differ depending on context.
– US: president = head of state = commander in chief
– UK: prime minister = head of state
– Corporation: president = chief executive (maybe)
Solutions:
– Include synonyms in the query but prefer exact term matches.
– Use context from the whole query: “president of canada” should suggest “prime minister”.

Query Expansion (cont.)
Relevance feedback: the user selects relevant documents, and they are used to find similar documents.
Pseudo-relevance feedback: the system assumes the first few documents retrieved are relevant and uses them to search for more. No user involvement, so not as precise.
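
A bare-bones sketch of pseudo-relevance feedback, under the simplifying assumption (mine, not the dissertation's) that expansion terms are just the most frequent new terms in the top k documents; real systems weight candidate terms more carefully:

```python
from collections import Counter

def expand_query(query_terms, ranked_docs, k=3, n_new_terms=5, stopwords=()):
    counts = Counter()
    for doc_text in ranked_docs[:k]:          # assume the top k documents are relevant
        counts.update(w for w in doc_text.lower().split()
                      if w not in stopwords and w not in query_terms)
    new_terms = [w for w, _ in counts.most_common(n_new_terms)]
    return list(query_terms) + new_terms      # original query plus expansion terms
```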

Evaluation
– Effectiveness
– Efficiency

Effectiveness
– Precision: # of relevant results / # of results
– Success: whether the first document was relevant
– Recall: # of relevant documents found / # of relevant documents that exist
– Mean Average Precision (MAP): average precision over all relevant documents
– Normalized Discounted Cumulative Gain (NDCG): calculated as a sum over result ranks

Calculating MAP
Assume a retrieval set of 10 documents, with the documents at ranks 1, 5, 7, 8 and 10 relevant.

Rank   Precision
1      1/1 = 1
5      2/5 = .4
7      3/7 = .43
8      4/8 = .5
10     5/10 = .5

If there were only 5 relevant documents, the average precision is (1 + .4 + .43 + .5 + .5) / 5 = .57.
If we retrieved only 5 of 6 relevant documents, it is (1 + .4 + .43 + .5 + .5) / 6 = .47.
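
The calculation above is easy to verify in code; this small function (the names are mine) computes average precision from the ranks of the relevant retrieved documents. MAP is then the mean of this value over all queries.

```python
def average_precision(relevant_ranks, total_relevant):
    """Mean of the precision values observed at each relevant rank."""
    precisions = [(i + 1) / rank for i, rank in enumerate(sorted(relevant_ranks))]
    return sum(precisions) / total_relevant

print(average_precision([1, 5, 7, 8, 10], total_relevant=5))  # ~0.57
print(average_precision([1, 5, 7, 8, 10], total_relevant=6))  # ~0.47
```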

NDCG
Uses graded relevance values rather than a binary relevant/not-relevant judgment, e.g. 0 (not relevant) through 4 (most relevant).
Calculated as NDCG = N · Σi (2^r(i) − 1) / log(1 + i)
where i is the rank, r(i) is the relevance value at that rank, and N normalizes so that a perfect ranking scores 1.
Example (table lost in transcription): two sample rankings of relevant and non-relevant results, compared by MAP and NDCG.
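
A small NDCG sketch following the formula above; log base 2 is assumed here (the slide does not state the base), and the normalizer is computed from the ideal ordering of the same relevance values:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain over a ranked list of relevance grades."""
    return sum((2 ** r - 1) / math.log2(1 + i)
               for i, r in enumerate(relevances, start=1))

def ndcg(relevances):
    ideal = dcg(sorted(relevances, reverse=True))  # best possible ordering
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 0, 2, 1]))   # graded relevance values for ranks 1..4
```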

Efficiency
– Throughput: # of queries processed per second. Comparisons must use identical systems.
– Latency: time between when the user issues a query and when the system delivers a response. Under 150 ms is considered “instantaneous.”
– Generally, improving one implies worsening the other.

Measuring Efficiency
– Direct: build a real-world system and measure its statistics. Straightforward, but limited by the hardware the experimenter can access.
– Simulation: system operation is simulated in software. Repeatable, but only as good as its model.

Query Evaluation
– Document-at-a-time: evaluate every query term for a document before moving to the next document.
– Term-at-a-time: evaluate every document for a query term before moving to the next term.

Document-at-a-Time
– Produces complete document scores early, so partial results can be displayed quickly.
– Can fetch inverted list data incrementally, so it uses less memory.

Document-at-a-Time Algorithm

procedure DocumentAtATimeRetrieval(Q)
    L ← Array()
    R ← PriorityQueue()
    for all terms wi in Q do
        li ← InvertedList(wi)
        L.add( li )
    end for
    for all documents D in the collection do
        sD ← 0
        for all inverted lists li in L do
            sD ← sD + f(Q,C,wi)(c(wi;D))    # update the document score
        end for
        sD ← sD · d(Q,C)(|D|)               # multiply by a document-dependent factor
        R.add( sD, D )
    end for
    return the top n results from R
end procedure
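
A runnable Python rendering of the pseudocode above (a sketch, not Strohman's code): f and d stand in for the abstract feature and document-dependent functions, and the inverted lists are plain dictionaries rather than compressed on-disk structures.

```python
import heapq

def document_at_a_time(query_terms, inverted_lists, doc_ids, f, d, n=10):
    results = []                               # min-heap of (score, doc)
    for doc in doc_ids:                        # one document at a time
        score = 0.0
        for term in query_terms:
            count = inverted_lists.get(term, {}).get(doc, 0)
            score += f(term, count)            # update the document score
        score *= d(doc)                        # document-dependent factor
        heapq.heappush(results, (score, doc))
        if len(results) > n:
            heapq.heappop(results)             # keep only the top n
    return sorted(results, reverse=True)
```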

Term-at-a-Time
– Does not jump between inverted lists, which saves branching.
– The inner loop iterates over documents and so runs for a long time, making it easier to optimize.
– Efficient query processing strategies have been developed for term-at-a-time evaluation.
– Preferred for efficient system implementation.

Term-at-a-Time Algorithm

procedure TermAtATimeRetrieval(Q)
    A ← HashTable()
    for all terms wi in Q do
        li ← InvertedList(wi)
        for all documents D in li do
            A[D] ← A[D] + f(Q,C,wi)(c(wi;D))    # update the accumulator
        end for
    end for
    R ← PriorityQueue()
    for all accumulators A[D] in A do
        sD ← A[D] · d(Q,C)(|D|)                 # normalize the accumulator value
        R.add( sD, D )
    end for
    return the top n results from R
end procedure
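
The same stand-in f and d functions render term-at-a-time evaluation in Python; note the accumulator table that replaces the per-document outer loop:

```python
import heapq
from collections import defaultdict

def term_at_a_time(query_terms, inverted_lists, f, d, n=10):
    acc = defaultdict(float)                   # doc -> accumulated score
    for term in query_terms:                   # one inverted list at a time
        for doc, count in inverted_lists.get(term, {}).items():
            acc[doc] += f(term, count)         # update the accumulator
    results = []
    for doc, score in acc.items():
        score *= d(doc)                        # normalize the accumulator value
        heapq.heappush(results, (score, doc))
        if len(results) > n:
            heapq.heappop(results)             # keep only the top n
    return sorted(results, reverse=True)
```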

Optimization Types
– Unoptimized
– Unsafe
– Set Safe
– Rank Safe
– Score Safe

Unoptimized
– Compare the query to each document and calculate its score.
– Sort the documents; documents with the same score may appear in any order.
– Return results in ranked order. Because ties can be broken differently, the “top k documents” could differ between runs.

Optimized
– Unsafe: the documents returned have no guaranteed properties.
– Set Safe: the documents are guaranteed to be in the result set, but may not be in the same order as the unoptimized results.
– Rank Safe: the documents are guaranteed to be in the result set and in the correct order, but their scores may not be the same as the unoptimized results.
– Score Safe: the documents are guaranteed to be in the result set and to have the same scores as the unoptimized results.

Tupleflow
A distributed computing framework for indexing.
– Flexibility: settings are made in parameter files; no code changes required.
– Scalability: independent tasks are spread across processors.
– Disk abstraction: streaming data model.
– Low abstraction penalty: code handles custom hashing, sorting and serialization.

Traditional Indexing Approach
Create a word occurrence model by counting the unique terms in each document.
– Serial processing: parse one document, then move to the next.
– Large memory requirements: the hash table of unique words over a large document set must hold words, misspellings, numbers, URLs, etc.
– Different code is required for each document type: documents, web pages, databases, etc.

Tupleflow Approach
Break processing into steps:
1. Count terms (countsMaker)
2. Sort terms
3. Combine counts (countsReducer)

Tupleflow Example
Input: “The cat in the hat.”
countsMaker output: (the, 1), (cat, 1), (in, 1), (the, 1), (hat, 1)
sort output: (cat, 1), (hat, 1), (in, 1), (the, 1), (the, 1)
countsReducer output: (cat, 1), (hat, 1), (in, 1), (the, 2)
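
A toy rendering of the three stages (not the actual Tupleflow API): streams of (word, count) tuples flow through a maker, a sort, and a reducer.

```python
from itertools import groupby

def counts_maker(text):
    """Emit a (word, 1) tuple for each word occurrence."""
    for word in text.lower().split():
        yield (word.strip(".,"), 1)

def counts_reducer(sorted_counts):
    """Combine counts for adjacent tuples with the same word (input must be sorted)."""
    for word, group in groupby(sorted_counts, key=lambda t: t[0]):
        yield (word, sum(c for _, c in group))

tuples = sorted(counts_maker("The cat in the hat."))   # the sort stage
print(list(counts_reducer(tuples)))
# [('cat', 1), ('hat', 1), ('in', 1), ('the', 2)]
```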

Tupleflow Execution Graph
– Single processor: filenames → read text → parse text → count words.
– Multiple processors: several read text → parse text → count words pipelines run in parallel over the filenames, feeding a shared combine counts stage.

Summary
– Document indexing and querying are time- and resource-intensive tasks.
– Optimizing and parallelizing wherever possible is essential to minimize resource use and maximize efficiency.
– Tupleflow is one example of efficient indexing through parallelization.

Questions?