Published by Emory Robinson. Modified over 8 years ago.
1
Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard
2
Agenda Questions Finish up evaluation from last time Computational complexity Inverted indexes Project planning
3
User Studies Goal is to account for interface issues –By studying the interface component –By studying the complete system Formative evaluation –Provide a basis for system development Summative evaluation –Designed to assess performance
4
Quantitative User Studies Select independent variable(s) –e.g., what info to display in selection interface Select dependent variable(s) –e.g., time to find a known relevant document Run subjects in different orders –Average out learning and fatigue effects Compute statistical significance –Null hypothesis: independent variable has no effect –Rejected if p<0.05
5
Variation in Automatic Measures System –What we seek to measure Topic –Sample topic space, compute expected value Topic+System –Pair by topic and compute statistical significance Collection –Repeat the experiment using several collections
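A minimal sketch of "pair by topic and compute statistical significance", assuming a two-sided exact randomization (sign-flip) test on per-topic differences. The slide does not name a specific test, so this choice, the function name `paired_sign_flip_p_value`, and the scores below are illustrative assumptions.

```python
from itertools import product


def paired_sign_flip_p_value(scores_a, scores_b):
    """Two-sided exact randomization test on paired per-topic scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis the system label is arbitrary for each topic,
    # so every assignment of signs to the differences is equally likely.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total


# Hypothetical average-precision scores for two systems on the same six topics.
system_a = [0.50, 0.40, 0.70, 0.30, 0.60, 0.55]
system_b = [0.40, 0.35, 0.60, 0.20, 0.50, 0.45]
p = paired_sign_flip_p_value(system_a, system_b)
print(p)  # 0.03125: only the all-plus and all-minus assignments are as extreme
```

Full enumeration costs 2^n sign assignments, so it is only practical for a small number of topics; larger topic sets would need random sampling of assignments instead.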
6
Additional Effects in User Studies Learning –Vary topic presentation order Fatigue –Vary system presentation order Topic+User (Expertise) –Ask about prior knowledge of each topic
7
Presentation Order
8
Document Selection Experiments [Diagram labels: Topic Description, Standard Ranked List, Interactive Selection; measure: F (α = 0.8)]
9
Measures of Effectiveness Query Formulation: Uninterpolated Average Precision –Expected value of precision [over relevant document positions] –Interpreted based on query content at each iteration Document Selection: Unbalanced F-Measure: –P = precision –R = recall –α = 0.8 favors precision Models expensive human translation
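A hedged sketch of the two measures. The F formula itself did not survive in the slide text, so the weighted harmonic mean F = 1 / (α/P + (1 − α)/R) used here is an assumption; it is the usual form in which a larger α favors precision. The function names and the example ranking are illustrative.

```python
def average_precision(ranked_relevance, num_relevant):
    """Uninterpolated average precision for one ranked list.

    ranked_relevance: booleans in rank order (True = relevant document).
    num_relevant: total number of relevant documents for the topic.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant position
    return precision_sum / num_relevant if num_relevant else 0.0


def f_alpha(precision, recall, alpha=0.8):
    """Assumed form: weighted harmonic mean of precision and recall."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


ap = average_precision([True, False, True, False], num_relevant=2)
high_precision = f_alpha(precision=1.0, recall=0.5)
high_recall = f_alpha(precision=0.5, recall=1.0)
print(ap, high_precision, high_recall)
```

With α = 0.8 the high-precision case scores higher than the high-recall case, which matches the slide's point that selecting a wrong document is costly when each selected document must be translated by a human.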
10
End-to-End Experiments [Diagram labels: Topic Description, Query Formulation, Automatic Retrieval, Interactive Selection; measures: Average Precision, F (α = 0.8)]
11
End-to-End Experiment Results [Chart: F (α = 0.8)] English queries, German documents; 4 searchers, 20 minutes per topic
12
Summary Qualitative user studies suggest what to build Design decomposes task into components Automated evaluation helps to refine components Quantitative user studies show how well it works
13
Supporting the Search Process [Diagram labels: Source Selection, Query Formulation, Query, Search, Ranked List, Selection, Document, Examination, Document, Delivery; IR System; Acquisition, Collection, Indexing, Index]
14
Some Questions for Today How long will it take to find a document? –Is there any work we can do in advance? If so, how long will that take? How big a computer will I need? –How much disk space? How much RAM? What if more documents arrive? –How much of the advance work must be repeated? –Will searching become slower? –How much more disk space will be needed?
15
A Cautionary Tale Searching is easy - just ask Microsoft! –“Find” can search my hard drive in a few minutes If it only looks at the file names... How long would it take? –For a 100 GB disk? –For the World Wide Web? Computers are getting faster, but… –How does Google give answers in 3 seconds?
16
Find “complex” in the dictionary [Illustration with dictionary words: marsupial, belligerent, complex, arcade, astronomical, mastiff, relatively, relaxation, resplendent]
17
Computational Complexity Time complexity: how long will it take? Space complexity: how much memory is needed? Things you need to know to assess complexity: –What is the “size” of the input? (“n”) What aspects of the input are we paying attention to? –How is the input represented? –How is the output represented? –What are the internal data structures? –What is the algorithm?
18
Worst Case Complexity
19
10n: O(n); 100n: O(n); 100n + 25263: O(n); n^2: O(n^2); n^2 + 45662: O(n^2)
20
“Asymptotic” Complexity Constant, i.e. O(1): n doesn’t matter Sublinear, e.g. O(log n): n = 65536, log n = 16 (base 2) Linear, i.e. O(n): n = 65536, n = 65536 Polynomial, e.g. O(n^3): n = 65536, n^3 = 281,474,976,710,656 Exponential, e.g. O(2^n): n = 65536, 2^n is beyond astronomical
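The figures on this slide can be checked with a few lines of arithmetic. This is only a sketch; the dictionary name `growth` is an illustrative choice, and the logarithm is taken base 2 because that is the base that gives 16 for n = 65536.

```python
import math

n = 65536

growth = {
    "O(1)": 1,                          # does not depend on n
    "O(log n)": int(math.log2(n)),      # 16
    "O(n)": n,                          # 65536
    "O(n^3)": n ** 3,                   # 281,474,976,710,656
}

# 2^n is too large to be worth printing; estimate its number of decimal digits.
digits_in_two_to_the_n = int(n * math.log10(2)) + 1

for name, steps in growth.items():
    print(f"{name:10s} {steps:,}")
print(f"O(2^n)     a number with roughly {digits_in_two_to_the_n:,} decimal digits")
```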
21
The “Inverted File” Trick Organize the bag of words matrix by terms –You know the terms that you are looking for Look up terms like you search dictionaries –For each letter, jump directly to the right spot For terms of reasonable length, this is very fast –For each term, store the document identifiers For every document that contains that term At query time, use the document identifiers –Consult a “postings file”
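The idea on this slide can be sketched in memory as a mapping from each term to the identifiers of the documents that contain it. This is a simplified illustration: a Python dictionary stands in for the inverted file, the lists stand in for the postings file, and the three documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort terms so lookup order mirrors a dictionary, and sort each postings list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical documents, not the ones used in the lecture.
documents = {
    1: "quick brown fox",
    2: "now is the time",
    3: "quick fox over lazy dog",
}

index = build_inverted_index(documents)
print(index["quick"])   # [1, 3]
print(index["time"])    # [2]
```

At query time only the postings for the query terms are read, so the cost depends on the length of those lists rather than on the size of the whole collection.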
22
An Example
Term-document matrix (1 = the term occurs in the document) and the postings stored for each term:
Term    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Doc 6  Doc 7  Doc 8   Postings
aid       0      0      0      1      0      0      0      1     4, 8
all       0      1      0      1      0      1      0      0     2, 4, 6
back      1      0      1      0      0      0      1      0     1, 3, 7
brown     1      0      1      0      1      0      1      0     1, 3, 5, 7
come      0      1      0      1      0      1      0      1     2, 4, 6, 8
dog       0      0      1      0      1      0      0      0     3, 5
fox       0      0      1      0      1      0      1      0     3, 5, 7
good      0      1      0      1      0      1      0      1     2, 4, 6, 8
jump      0      0      1      0      0      0      0      0     3
lazy      1      0      1      0      1      0      1      0     1, 3, 5, 7
men       0      1      0      1      0      0      0      1     2, 4, 8
now       0      1      0      0      0      1      0      1     2, 6, 8
over      1      0      1      0      1      0      1      1     1, 3, 5, 7, 8
party     0      0      0      0      0      1      0      1     6, 8
quick     1      0      1      0      0      0      0      0     1, 3
their     1      0      0      0      1      0      1      0     1, 5, 7
time      0      1      0      1      0      1      0      0     2, 4, 6
Inverted file (letter-by-letter lookup): A (AI, AL), B (BA, BR), C, D, F, G, J, L, M, N, O, P, Q, T (TH, TI)
23
The Finished Product
Term    Inverted file entry   Postings
aid     AI                    4, 8
all     AL                    2, 4, 6
back    BA                    1, 3, 7
brown   BR                    1, 3, 5, 7
come    C                     2, 4, 6, 8
dog     D                     3, 5
fox     F                     3, 5, 7
good    G                     2, 4, 6, 8
jump    J                     3
lazy    L                     1, 3, 5, 7
men     M                     2, 4, 8
now     N                     2, 6, 8
over    O                     1, 3, 5, 7, 8
party   P                     6, 8
quick   Q                     1, 3
their   TH                    1, 5, 7
time    TI                    2, 4, 6
24
What Goes in a Postings File? Boolean retrieval –Just the document number Ranked Retrieval –Document number and term weight (TF*IDF,...) Proximity operators –Word offsets for each occurrence of the term Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
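A minimal sketch of postings that carry word offsets, in the spirit of the "Doc 3 (t17, t36)" example. The nested dictionary layout, the function names, and the two documents are assumptions made for illustration; a real postings file would be a compact on-disk structure.

```python
from collections import defaultdict


def build_positional_index(documents):
    """Map term -> {doc_id: [word offsets]}, with offsets counted from 1."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for offset, term in enumerate(text.lower().split(), start=1):
            index[term].setdefault(doc_id, []).append(offset)
    return index


def within(index, term_a, term_b, max_distance):
    """Proximity operator: docs where the two terms occur within max_distance words."""
    docs_a = index.get(term_a, {})
    docs_b = index.get(term_b, {})
    matches = []
    for doc_id in sorted(docs_a.keys() & docs_b.keys()):
        if any(abs(i - j) <= max_distance
               for i in docs_a[doc_id] for j in docs_b[doc_id]):
            matches.append(doc_id)
    return matches


# Hypothetical documents.
documents = {
    1: "the quick brown fox",
    2: "quick thinking saved the brown fox",
}
index = build_positional_index(documents)
print(index["quick"])                       # {1: [2], 2: [1]}
print(within(index, "quick", "brown", 1))   # [1]
print(within(index, "quick", "brown", 4))   # [1, 2]
```

Storing one offset per occurrence is what makes this kind of postings file so much larger than the Boolean one, which stores each document number once per term.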
25
How Big Is the Postings File? Very compact for Boolean retrieval –About 10% of the size of the documents If an aggressive stopword list is used! Not much larger for ranked retrieval –Perhaps 20% Enormous for proximity operators –Sometimes larger than the documents!
26
Building an Inverted Index Simplest solution is a single sorted array –Fast lookup using binary search –But sorting large files on disk is very slow –And adding one document means starting over Tree structures allow easy insertion –But the worst case lookup time is linear Balanced trees provide the best of both –Fast lookup and easy insertion –But they require 45% more disk space
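The "single sorted array" option can be sketched with Python's standard `bisect` module. An in-memory list stands in for the sorted file on disk, which is an assumption of this sketch; the helper name `lookup` is illustrative.

```python
from bisect import bisect_left, insort


def lookup(sorted_terms, term):
    """Binary search: return the position of term, or -1 if it is absent."""
    position = bisect_left(sorted_terms, term)
    if position < len(sorted_terms) and sorted_terms[position] == term:
        return position
    return -1


terms = sorted(["now", "time", "good", "all"])   # ['all', 'good', 'now', 'time']
print(lookup(terms, "good"))   # 1
print(lookup(terms, "men"))    # -1

# Insertion keeps the array sorted, but every later entry has to shift over,
# so each insert costs time proportional to the array length.
insort(terms, "men")
print(terms)                   # ['all', 'good', 'men', 'now', 'time']
```

Lookup takes O(log n) comparisons, while insertion is O(n). That imbalance is the motivation for the balanced tree structures discussed on this slide.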
27
Starting a B+ Tree Inverted File [Diagram: B+ tree built while reading “Now is the time for all good …”; labels: now, time, good, all; index keys: aaaaa, now]
28
Adding a New Term [Diagram: the B+ tree after reading “Now is the time for all good men …”; labels: now, time, good, all; index keys: aaaaa, now; new keys: aaaaa, men]
29
How Big is the Inverted Index? Typically smaller than the postings file –Depends on number of terms, not documents Eventually, most terms will already be indexed –But the postings file will continue to grow Postings dominate asymptotic space complexity –Linear in the number of documents
30
Index Compression CPUs are much faster than disks –A disk can transfer 1,000 bytes in ~20 ms –The CPU can do ~10 million instructions in that time Compressing the postings file is a big win –Trade decompression time for fewer disk reads Key idea: reduce redundancy –Trick 1: store relative offsets (some will be the same) –Trick 2: use an optimal coding scheme
31
Compression Example Postings (one byte each = 7 bytes = 56 bits) –37, 42, 43, 48, 97, 98, 243 Difference –37, 5, 1, 5, 49, 1, 145 Optimal Huffman Code –0:1, 10:5, 110:37, 1110:49, 1111: 145 Compressed (17 bits) –11010010111001111
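The example on this slide can be reproduced in a few lines. The code table is copied from the slide as given rather than derived, and the function names are illustrative. A Huffman construction run on these gap frequencies could produce a slightly shorter 16-bit encoding, so the table is treated here simply as a prefix code.

```python
# Prefix code from the slide: gap value -> bit string.
CODE = {1: "0", 5: "10", 37: "110", 49: "1110", 145: "1111"}


def gaps(postings):
    """Trick 1: replace each document number by its offset from the previous one."""
    previous = 0
    result = []
    for doc_id in postings:
        result.append(doc_id - previous)
        previous = doc_id
    return result


def encode(postings):
    """Trick 2: write each gap with its variable-length code."""
    return "".join(CODE[gap] for gap in gaps(postings))


def decode(bits):
    """Read codes bit by bit and accumulate gaps back into document numbers."""
    reverse = {code: gap for gap, code in CODE.items()}
    postings, buffer, current = [], "", 0
    for bit in bits:
        buffer += bit
        if buffer in reverse:      # safe because no code is a prefix of another
            current += reverse[buffer]
            postings.append(current)
            buffer = ""
    return postings


postings = [37, 42, 43, 48, 97, 98, 243]
bits = encode(postings)
print(gaps(postings))      # [37, 5, 1, 5, 49, 1, 145]
print(bits, len(bits))     # 11010010111001111 17
print(decode(bits))        # [37, 42, 43, 48, 97, 98, 243]
```

The 17-bit figure does not include the cost of storing the code table itself, which a real index would also need to account for.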
32
Indexing and Searching Indexing –Walk the inverted file, splitting if needed –Insert into the postings file in sorted order –Hours or days for large collections Query processing –Walk the inverted file –Read the postings file –Manipulate postings based on query –Seconds, even for enormous collections
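"Manipulate postings based on query" can be illustrated for a Boolean AND query as a single merge pass over two sorted postings lists. This is a sketch under assumptions: the dictionary `postings` and its contents are illustrative, and a real system would read the lists from the postings file on disk.

```python
def intersect(postings_a, postings_b):
    """Boolean AND: merge two sorted postings lists in one linear pass."""
    i, j, result = 0, 0, []
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1          # advance whichever list is behind
        else:
            j += 1
    return result


# Illustrative postings lists keyed by term.
postings = {
    "good": [2, 4, 6, 8],
    "men": [2, 4, 8],
    "party": [6, 8],
}

good_and_men = intersect(postings["good"], postings["men"])
all_three = intersect(good_and_men, postings["party"])
print(good_and_men)   # [2, 4, 8]
print(all_three)      # [8]
```

Because both lists are kept in sorted order at indexing time, the merge touches each posting at most once, which is part of why query processing stays fast even for large collections.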
33
Summary Slow indexing yields fast query processing –Key fact: most terms don’t appear in most documents We use extra disk space to save query time –Index space is in addition to document space –Time and space complexity must be balanced Disk block reads are the critical resource –This makes index compression a big win
34
Project Options LBSC 796 MLS/MIM –Option 1: TREC-like IR evaluation (team of 2) –Option 2: Design and run a user study (team of 3) LBSC 796 Ph.D. –Research paper CMSC 828o –Program a new capability
35
One Minute Paper What was the muddiest point in today’s lecture?