1
Efficient Processing of Complex Features for Information Retrieval Dissertation by Trevor Strohman Presented by Laura C. Vandivier For ITCS6050, UNCC, Fall 2008
2
Overview Indexing Ranking Query Expansion Query Evaluation Tupleflow
3
Topics Not Covered Binned Probabilities Score-Sorted Index Optimization Document-Sorted Index Optimization Navigational Search with Complex Features
4
Document Indexing Inverted List A mapping from a single word to a set of documents that contain the word Inverted Index A set of inverted lists
5
Inverted Index Contains one inverted list for each term in the document collection. Often omits frequently occurring words such as “a,” “and” and “the.”
6
Inverted Index Example
Sample Documents
1. Cats, dogs, dogs.
2. Dogs, cats, sheep.
3. Whales, sheep, goats.
4. Fish, whales, whales.
Inverted Index
cats    dogs    fish    goats   sheep   whales
1       1       4       3       2       3
2       2                       3       4
Query           Answer
cats            1, 2
sheep + dogs    2
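The example can be reproduced with a minimal Python sketch. It assumes a simple tokenizer (lowercase, alphabetic words only), and the names `build_index` and `search_all` are illustrative rather than taken from the dissertation.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Cats, dogs, dogs.",
    2: "Dogs, cats, sheep.",
    3: "Whales, sheep, goats.",
    4: "Fish, whales, whales.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words (an assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(docs):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

def search_all(index, *words):
    """Return the ids of documents that contain every query word."""
    postings = [set(index.get(word, ())) for word in words]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(DOCS)
print(search_all(index, "cats"))           # [1, 2]
print(search_all(index, "sheep", "dogs"))  # [2]
```

A multi-word query is answered by intersecting the inverted lists of its terms, which is why "sheep + dogs" returns only document 2.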
7
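Expanding Inverted Indexes Include term frequency: the more often a term occurs in a document, the more the document is “about” that term. Each entry is (document, count).
cats    dogs    fish    goats   sheep   whales
(1,1)   (1,2)   (4,1)   (3,1)   (2,1)   (3,1)
(2,1)   (2,1)                   (3,1)   (4,2)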
8
Expanding Inverted Indexes (cont.) Add word position information. Facilitates phrase searching. Each entry is (document, count): positions.
cats        dogs          fish        goats       sheep       whales
(1,1): 1    (1,2): 2,3    (4,1): 1    (3,1): 3    (2,1): 3    (3,1): 1
(2,1): 2    (2,1): 1                              (3,1): 2    (4,2): 2,3
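A positional index for the same four sample documents can be sketched as follows. This is an illustration under assumptions: positions are 1-based word offsets, the tokenizer keeps only lowercase alphabetic words, and `phrase_search` is a hypothetical helper that handles two-word phrases only.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Cats, dogs, dogs.",
    2: "Dogs, cats, sheep.",
    3: "Whales, sheep, goats.",
    4: "Fish, whales, whales.",
}

def build_positional_index(docs):
    """Map word -> {doc_id: [1-based positions of the word in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())
        for position, word in enumerate(words, start=1):
            index[word].setdefault(doc_id, []).append(position)
    return dict(index)

def phrase_search(index, first, second):
    """Return ids of documents where `second` immediately follows `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(position + 1 in following for position in positions):
            hits.append(doc_id)
    return sorted(hits)

pos_index = build_positional_index(DOCS)
print(pos_index["dogs"])                         # {1: [2, 3], 2: [1]}
print(phrase_search(pos_index, "dogs", "cats"))  # [2]
```

The term frequency is not stored separately here because it is just the length of each position list.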
9
Inverted Index Statistics
Compressed inverted indexes containing only word counts
– 5% of the document collection in size
– Built and queried faster
Compressed inverted indexes containing word counts and positions
– 20% of the document collection in size
– Essential for high effectiveness, even in queries not using phrases
10
Document Ranking Documents returned in order of relevance Perfect ranking impossible Retrieval systems calculate probability a document is relevant
11
Computing Relevance Assume “bag of words” with term independence.
Simple estimation: P(word | document) = # occurrences / document length
Problems
1. If a document does not contain all words of a multi-word query, it will not be retrieved. A single missing word makes the estimate zero, so a document containing 0 query words = a document containing some query words.
2. All words are treated equally. Query = Maltese falcon: document(maltese:2, falcon:1) = document(maltese:1, falcon:2) for documents of similar length.
Smoothing can help.
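The first problem, and how smoothing eases it, can be illustrated with a small sketch. The slide does not name a smoothing method; linear interpolation with a collection-wide estimate is assumed here as one common choice, and the weight `lam`, the function names, and the toy documents are all hypothetical.

```python
from collections import Counter

def unsmoothed_score(query, doc):
    """Product over query terms of (# occurrences in doc) / (document length)."""
    counts = Counter(doc)
    score = 1.0
    for term in query:
        score *= counts[term] / len(doc)
    return score

def smoothed_score(query, doc, collection, lam=0.5):
    """Same product, but each term estimate is mixed with a collection-wide estimate.

    Assumed smoothing: (1 - lam) * P(term | doc) + lam * P(term | collection).
    """
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    score = 1.0
    for term in query:
        p_doc = doc_counts[term] / len(doc)
        p_coll = coll_counts[term] / len(collection)
        score *= (1 - lam) * p_doc + lam * p_coll
    return score

query = ["maltese", "falcon"]
doc_some = ["maltese", "dog", "show"]  # contains one of the two query terms
doc_none = ["cat", "dog", "show"]      # contains neither query term
collection = doc_some + doc_none + ["maltese", "falcon", "film"]

# Without smoothing both documents score zero and cannot be told apart.
print(unsmoothed_score(query, doc_some), unsmoothed_score(query, doc_none))
# With smoothing the partially matching document ranks higher.
print(smoothed_score(query, doc_some, collection) > smoothed_score(query, doc_none, collection))
```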
12
Computing Relevance (cont.) Add additional features –Position/field in document, ex. title –Proximity of query terms –Combinations
13
Computing Relevance (cont.) Add query-independent information
– # links from other documents
– URL depth: shorter = general, longer = specific
– User clicks: may match expectations but not relevance
– Dwell time
– Document quality models: unusual term distribution implies poor grammar, so the document is not a good retrieval candidate
14
Query Expansion Stemming Groups words that mean the same concept based on natural language rules. ex: run, runs, running, ran Aggressive Stemmer May group words that are not related. ex. marine, marinate Conservative Stemmer May fail to group words that are related. ex. run, ran Statistical Stemmer Uses word co-occurrence data to determine if they are related. Would probably avoid the marine, marinate mistake.
15
Query Expansion (cont.) Synonyms Group by terms that mean the same concept. Problem: May be different depending on context. US: President = head of state = commander in chief. UK: prime minister = head of state. Corporation: president = chief executive (maybe). Solutions –Include synonyms in query but prefer term matches –Use context from the whole query: “president of canada” → “prime minister”
16
Query Expansion (cont.) Relevance Feedback User selects relevant documents and they are used to find similar documents. Pseudo Relevance Feedback System assumes the first few documents retrieved are relevant and uses them to search for more. No user involvement, so not as precise.
17
Evaluation Effectiveness Efficiency
18
Effectiveness Precision # of relevant results / # results Success Whether the first document was relevant Recall # relevant docs found / # relevant docs that exist Mean Average Precision (MAP) Average precision over all relevant documents Normalized Discounted Cumulative Gain (NDCG) Calculates using sum over result ranks
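The first three measures can be sketched directly from their definitions. The function names and the toy ranking below are illustrative assumptions, not part of the presentation.

```python
def precision(ranked, relevant):
    """# of relevant results / # of results."""
    if not ranked:
        return 0.0
    return sum(1 for doc in ranked if doc in relevant) / len(ranked)

def recall(ranked, relevant):
    """# of relevant docs found / # of relevant docs that exist."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked if doc in relevant) / len(relevant)

def success_at_1(ranked, relevant):
    """Whether the first returned document is relevant."""
    return bool(ranked) and ranked[0] in relevant

ranked = ["d1", "d2", "d3", "d4"]  # hypothetical result list, best first
relevant = {"d1", "d3", "d9"}      # hypothetical relevance judgments

print(precision(ranked, relevant))     # 0.5
print(success_at_1(ranked, relevant))  # True
```

Recall for this toy ranking is 2/3: two of the three relevant documents were found, while `d9` was never retrieved.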
19
Calculating MAP Assume a retrieval set of 10 documents with 1, 5, 7, 8 and 10 relevant.
Rank    Precision
1       1/1 = 1
5       2/5 = .4
7       3/7 = .43
8       4/8 = .5
10      5/10 = .5
If there were only 5 relevant documents, then (1 + .4 + .43 + .5 + .5) / 5 = .57
If we retrieved only 5 of 6 relevant documents, then (1 + .4 + .43 + .5 + .5) / 6 = .47
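The calculation for this single query can be checked with a short sketch, where the precision at rank 5 is 2/5 = 0.4. The function name `average_precision` is an illustrative choice; MAP is then the mean of this value over a set of queries.

```python
def average_precision(relevant_ranks, total_relevant):
    """Average the precision measured at each rank that holds a relevant document.

    relevant_ranks: 1-based ranks of the relevant documents that were retrieved.
    total_relevant: number of relevant documents that exist for the query.
    """
    precisions = [
        hits / rank
        for hits, rank in enumerate(sorted(relevant_ranks), start=1)
    ]
    return sum(precisions) / total_relevant

ranks = [1, 5, 7, 8, 10]
print(round(average_precision(ranks, 5), 2))  # 0.57
print(round(average_precision(ranks, 6), 2))  # 0.47
```

Dividing by the number of relevant documents that exist, rather than the number retrieved, is what penalizes the missed sixth document in the second case.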
20
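NDCG Uses graded relevance values, not just is/is not, with 0 being not relevant and 4 being most relevant. Calculated as
$$\mathrm{NDCG} = N \sum_i \frac{2^{r(i)} - 1}{\log(1 + i)}$$
where i is the rank, r(i) is the relevance value at that rank, and N is a normalization constant.
Example: [figure: three result lists over ranks i = 1 to 20, each result marked relevant or not relevant, compared by MAP and NDCG. Surviving values: 1.00; MAP .51, NDCG .79; MAP .33, NDCG .55]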
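A minimal sketch of the NDCG sum follows. Two assumptions are made that the slide does not state: N is chosen so that an ideal ordering scores 1, and the ideal ordering is built only from the judgments in the given list. The logarithm base is also unspecified; the natural log is used, and the base cancels out in the normalized value.

```python
import math

def dcg(relevances):
    """Sum of (2^r(i) - 1) / log(1 + i) over 1-based ranks i."""
    return sum(
        (2 ** r - 1) / math.log(1 + i)
        for i, r in enumerate(relevances, start=1)
    )

def ndcg(relevances):
    """DCG scaled by N = 1 / (DCG of the same judgments sorted best first)."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(ndcg([4, 3, 0]))                    # ideal ordering scores 1.0
print(ndcg([0, 3, 4]) < ndcg([4, 3, 0]))  # True: the best document is ranked last
```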
21
Efficiency Throughput: # of queries processed per second. Comparisons must use identical systems. Latency: Time between when the user issues a query and the system delivers a response. < 150ms considered “instantaneous.” Generally, improving one implies worsening the other.
22
Measuring Efficiency Direct Attempt to create a real world system and measure statistics. Straightforward but limited to experimenter access. Simulation System operation is simulated in software. Repeatable but is only as good as its model.
23
Query Evaluation Document-at-a-time Evaluate each term for a document before moving to the next document. Term-at-a-time Evaluate each document for a term before moving to the next term.
24
Document-at-a-Time Produces complete document scores early so can quickly display partial results. Can incrementally fetch the inverted list data so uses less memory.
25
Document-at-a-Time Algorithm
procedure DocumentAtATimeRetrieval(Q)
    L ← Array()
    R ← PriorityQueue()
    for all terms wi in Q do
        li ← InvertedList(wi)
        L.add( li )
    end for
    for all documents D in the collection do
        sD ← 0    #Start each document score at zero
        for all inverted lists li in L do
            sD ← sD + f(Q,C,wi)(c(wi;D))    #Update the document score
        end for
        sD ← sD · d(Q,C)(|D|)    #Multiply by a document-dependent factor
        R.add( sD,D )
    end for
    return the top n results from R
end procedure
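The pseudocode can be turned into a small runnable sketch. The scoring functions are placeholders: the raw term count stands in for f and division by document length stands in for d, neither of which is the dissertation's actual feature function. The index layout (term mapped to a list of `(doc_id, count)` pairs sorted by document id) is likewise an assumption.

```python
import heapq

def document_at_a_time(query, index, doc_lengths, n=10):
    """Score one document completely before moving on to the next one."""
    lists = [index.get(term, []) for term in query]
    cursors = [0] * len(lists)  # one read position per inverted list
    results = []
    for doc_id in sorted(doc_lengths):  # every document in the collection
        score = 0.0
        for k, postings in enumerate(lists):
            pos = cursors[k]
            if pos < len(postings) and postings[pos][0] == doc_id:
                score += postings[pos][1]  # placeholder f: raw term count
                cursors[k] += 1            # this posting has been consumed
        score /= doc_lengths[doc_id]       # placeholder d: length normalization
        results.append((score, doc_id))
    return heapq.nlargest(n, results)

index = {
    "cats": [(1, 1), (2, 1)],
    "dogs": [(1, 2), (2, 1)],
    "sheep": [(2, 1), (3, 1)],
}
doc_lengths = {1: 3, 2: 3, 3: 3, 4: 3}
top = document_at_a_time(["dogs", "cats"], index, doc_lengths, n=2)
print([doc_id for _, doc_id in top])  # [1, 2]
```

Each cursor only ever moves forward, which is what lets a real system read the inverted lists incrementally instead of holding them in memory.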
26
Term-at-a-Time Does not jump between inverted lists so saves branching. Inner loop iterates over documents so is executed for a long time, thus is easier to optimize. Efficient query processing strategies have been developed for term-at-a-time. Preferred for efficient system implementation.
27
Term-at-a-Time Algorithm
procedure TermAtATimeRetrieval(Q)
    A ← HashTable()
    for all terms wi in Q do
        li ← InvertedList(wi)
        for all documents D in li do
            A[D] ← A[D] + f(Q,C,wi)(c(wi;D))    #Update the accumulator for D
        end for
    end for
    R ← PriorityQueue()
    for all accumulators A[D] in A do
        sD ← A[D] · d(Q,C)(|D|)    #Normalize the accumulator value
        R.add( sD,D )
    end for
    return the top n results from R
end procedure
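A matching sketch of term-at-a-time evaluation follows, with the same placeholder scoring as an assumption: the raw term count stands in for f and division by document length stands in for d. The index layout (term mapped to a list of `(doc_id, count)` pairs) is also assumed for illustration.

```python
import heapq
from collections import defaultdict

def term_at_a_time(query, index, doc_lengths, n=10):
    """Process one whole inverted list before moving on to the next term."""
    accumulators = defaultdict(float)  # partial score per document seen so far
    for term in query:
        for doc_id, count in index.get(term, []):
            accumulators[doc_id] += count  # placeholder f: raw term count
    results = [
        (partial / doc_lengths[doc_id], doc_id)  # placeholder d: length normalization
        for doc_id, partial in accumulators.items()
    ]
    return heapq.nlargest(n, results)

index = {
    "cats": [(1, 1), (2, 1)],
    "dogs": [(1, 2), (2, 1)],
    "sheep": [(2, 1), (3, 1)],
}
doc_lengths = {1: 3, 2: 3, 3: 3, 4: 3}
top = term_at_a_time(["dogs", "cats"], index, doc_lengths, n=2)
print([doc_id for _, doc_id in top])  # [1, 2]
```

No document has a final score until the last term is processed, so the accumulator table must hold every candidate document at once. That is the memory cost this strategy pays for its simple inner loop.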
28
Optimization Types Unoptimized Unsafe Set Safe Rank Safe Score Safe
29
Unoptimized Compare the query to each document and calculate the score. Sort the documents. Documents with the same score may appear in any order. Return results in ranked order. “Top k documents” could be different.
30
Optimized
Unsafe: Documents returned have no guaranteed set of properties.
Set Safe: Documents are guaranteed to be in the result set but may not be in the same order as the unoptimized results.
Rank Safe: Documents are guaranteed to be in the result set and in the correct order, but document scores may not be the same as the unoptimized results.
Score Safe: Documents are guaranteed to be in the result set and have the same scores as the unoptimized results.
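The four levels can be illustrated by comparing one optimized result list against the unoptimized one. This is a hypothetical checker, not something from the dissertation. It looks at a single query, so it can expose a violation but cannot prove a guarantee, and it ignores tied scores, which may legitimately appear in any order.

```python
def safety_level(baseline, optimized, tol=1e-9):
    """Classify an optimized top-k list against the unoptimized top-k list.

    Both arguments are lists of (doc_id, score) pairs, best first.
    """
    base_ids = [doc_id for doc_id, _ in baseline]
    opt_ids = [doc_id for doc_id, _ in optimized]
    if set(base_ids) != set(opt_ids):
        return "unsafe"      # different documents were returned
    if base_ids != opt_ids:
        return "set safe"    # same documents, different order
    scores_match = all(
        abs(base_score - opt_score) <= tol
        for (_, base_score), (_, opt_score) in zip(baseline, optimized)
    )
    return "score safe" if scores_match else "rank safe"

baseline = [("a", 3.0), ("b", 2.0), ("c", 1.0)]
print(safety_level(baseline, [("a", 0.9), ("b", 0.5), ("c", 0.1)]))  # rank safe
print(safety_level(baseline, [("b", 2.0), ("a", 3.0), ("c", 1.0)]))  # set safe
```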
31
Tupleflow Distributed computing framework for indexing. Flexibility: Settings made in parameter files, no code changes required. Scalability: Independent tasks spread across processors. Disk abstraction: Streaming data model. Low abstraction penalty: Code handles custom hashing, sorting and serialization.
32
Traditional Indexing Approach Create a word occurrence model by counting the unique terms in each document. Serial processing: Parse one document, move to the next. Large memory requirements for unique word hash over large document set (words, misspellings, numbers, URLs, etc.). Different code required for each document type (documents, web pages, databases, etc.).
33
Tupleflow Approach Break processing into steps Count terms (countsMaker) Sort terms Combine counts (countsReducer)
34
Tupleflow Example Input: The cat in the hat.
countsMaker     sort            countsReducer
Word  Count     Word  Count     Word  Count
the   1         cat   1         cat   1
cat   1         hat   1         hat   1
in    1         in    1         in    1
the   1         the   1         the   2
hat   1         the   1
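The three stages can be mimicked in plain Python. This sketch does not use Tupleflow itself: the generator functions `counts_maker` and `counts_reducer` are stand-ins named after the stages on the slide, and the tokenizer (lowercase alphabetic words) is an assumption.

```python
import re
from itertools import groupby

def counts_maker(text):
    """Emit one (word, 1) tuple per word occurrence."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield (word, 1)

def counts_reducer(sorted_tuples):
    """Sum the counts of adjacent tuples that share a word.

    The input must already be sorted by word, so only one running
    total is held in memory at a time.
    """
    for word, group in groupby(sorted_tuples, key=lambda pair: pair[0]):
        yield (word, sum(count for _, count in group))

made = list(counts_maker("The cat in the hat."))
result = list(counts_reducer(sorted(made)))
print(result)  # [('cat', 1), ('hat', 1), ('in', 1), ('the', 2)]
```

Because the reducer sees equal words next to each other, it never needs the large in-memory hash of unique words that the traditional approach relies on.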
35
Tupleflow Execution Graph
Single Processor: filenames → read text → parse text → count words
Multiple Processors: filenames → three parallel (read text → parse text → count words) pipelines → combine counts
36
Summary Document indexing and querying are time and resource intensive tasks. Optimizing and parallelizing wherever possible is essential to minimize resources and maximize efficiency. Tupleflow is one example of efficient indexing by parallelization.
37
Questions?