
1 Today's Topics: Boolean IR, signature files, inverted files, PAT trees, suffix arrays

2 Boolean IR
Documents are composed of TERMS (words, stems). Results are expressed in set-theoretic terms: the set of documents containing term A, term B, term C, combined with operators such as A AND B or (A AND B) OR C.
History: predates the 1970s; the dominant industrial model through 1994 (Lexis-Nexis, DIALOG).

3 Boolean Operators
Basic operators: A AND B, A OR B, (A AND B) OR C, A AND (NOT B).
Proximity operators (extended ANDs, matching within +/- K words):
Adjacent AND: "A B", e.g. "Johns Hopkins", "The Who"
Proximity window: A w/10 B means A and B within +/- 10 words; A w/sent B means A and B in the same sentence.
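The set-theoretic reading of these operators can be sketched directly with set operations. This is a minimal illustration, not from the slides; the posting sets and doc IDs are made up.

```python
# Hypothetical postings: each term maps to the set of doc IDs containing it.
postings = {
    "A": {1, 2, 5, 8},
    "B": {2, 3, 5, 9},
    "C": {4, 5, 7},
}

a, b, c = postings["A"], postings["B"], postings["C"]

and_ab = a & b          # A AND B        -> set intersection
or_ab = a | b           # A OR B         -> set union
and_or = (a & b) | c    # (A AND B) OR C
a_not_b = a - b         # A AND (NOT B)  -> set difference

print(sorted(and_ab))   # [2, 5]
print(sorted(and_or))   # [2, 4, 5, 7]
print(sorted(a_not_b))  # [1, 8]
```

Proximity operators need word positions, not just doc IDs, so they require a positional index (see the proximity-search slide).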

4 Boolean IR (implementation)
Options: bit vectors, inverted files (a.k.a. an index), PAT trees (a more powerful index).
Bit vectors (one bit per term per document) are impractical: very sparse (wastefully big) and costly to compare.

5 Problems with Boolean IR
Does not effectively support relevance ranking of returned documents. In the base model, expression satisfaction is Boolean: a document matches the expression or it doesn't.
Extensions that permit ordering, e.g. for (A AND B) OR C:
- Supermatches (a doc matching 5 query terms ranks above one matching 3)
- Partial matches (expression incompletely satisfied; give partial credit)
- Importance weighting (10A OR 5B, where the coefficients are term weights)

6 Boolean IR
Advantages: users can directly control the search; good for precise queries over structured data (e.g. database search or a legal index).
Disadvantages: users must directly control the search; users must be familiar with the domain and the term space (know what to ask for and what to exclude); poor at relevance ranking; poor at weighted query expansion, user modelling, etc.

7 Signature Files
Superimposed coding: a mapping/hash function f( ) compresses each document bit vector into a much shorter signature (fewer bits).
Problem: several different document bit vectors (i.e. different word sets) get mapped to the same signature. (Use a stoplist to keep common words from overwhelming the signatures.)
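A minimal sketch of superimposed coding, not from the slides: each word hashes to a few bit positions, and a document's signature is the OR of its words' bits. The signature width, bits per word, hash choice, and stoplist here are all illustrative assumptions.

```python
import hashlib

SIG_BITS = 64       # signature width (hypothetical choice)
BITS_PER_WORD = 3   # bits set per word (hypothetical choice)

def word_bits(word):
    # Derive BITS_PER_WORD bit positions from a hash of the word.
    h = hashlib.md5(word.encode()).digest()
    return [h[i] % SIG_BITS for i in range(BITS_PER_WORD)]

def signature(words, stoplist=frozenset({"the", "a", "of"})):
    # Superimposed coding: OR together the bit patterns of all
    # non-stopword words in the document.
    sig = 0
    for w in words:
        if w.lower() in stoplist:
            continue
        for b in word_bits(w.lower()):
            sig |= 1 << b
    return sig

def may_contain(doc_sig, query_words):
    # The signature qualifies if all query bits are set. This can be a
    # false drop, so a match still needs secondary validation.
    q = signature(query_words)
    return doc_sig & q == q

doc = "the Viterbi algorithm of Bayes nets".split()
s = signature(doc)
print(may_contain(s, ["Viterbi"]))  # True
```

Note that `may_contain` can return True for words not in the document (the false-drop problem on the next slide), but it never returns False for a word that is present.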

8 False Drop Problem
On retrieval, all documents whose bit vectors map to the queried signature are returned, but only a portion are relevant. A secondary validation step is needed to make sure the target words actually match.
Prob(false drop) = Prob(signature qualifies AND text does not).

9 Efficiency Problem
Testing for a signature match may require a linear scan through all document signatures.

10 Vertical Partitioning
Bit-slice the signatures onto different devices for parallel comparison, then AND together the matches from each segment. This improves the speed of comparing the input signature against stored signatures (sig1, sig2, ...), but still requires an O(N) linear search over all signatures.

11 Horizontal Partitioning
Goal: avoid sequentially scanning the signature file. Use a hash function or an index over the signature database that maps an input signature to specific candidates to try.

12 Inverted Files
Like the index of a book: each term (e.g. Baum, Bayes, Viterbi) points to the list of documents where it occurs (e.g. Bayes -> docs 14, 39, 156; Viterbi -> docs 39, 45, 156, 290).

13 Inverted Files
Very efficient for single-word queries: just enumerate the documents pointed to by the index, O(S_A), where S_A is the length of term A's postings list.
Efficient for ORs: just enumerate both lists and remove duplicates, O(S_A + S_B).
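The OR case can be sketched as a linear merge of two sorted postings lists, dropping duplicates as they meet; the postings below reuse the Bayes/Viterbi document numbers from the previous slide.

```python
def or_merge(a, b):
    """Merge two sorted postings lists, removing duplicates: O(S_A + S_B)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1   # common doc, emit once
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])   # one list is exhausted;
    out.extend(b[j:])   # append the remainder of the other
    return out

print(or_merge([14, 39, 156], [39, 45, 156, 290]))  # [14, 39, 45, 156, 290]
```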

14 ANDs using Inverted Files
Method 1: begin with two pointers (i, j) into the two postings lists (A for Bayes, B for Viterbi):
if A[i] = B[j], write A[i] to output and advance both pointers
if A[i] < B[j], i++
else j++
Cost: O(S_A + S_B), the same as OR, but with smaller output (a "meet" search).
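Method 1 above, written out, again using the slide's Bayes/Viterbi postings:

```python
def and_merge(a, b):
    """Linear merge intersection of two sorted postings lists: O(S_A + S_B)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1   # match: emit and advance both
        elif a[i] < b[j]:
            i += 1                              # advance the smaller side
        else:
            j += 1
    return out

bayes = [14, 39, 156, 227, 319]
viterbi = [39, 45, 58, 96, 156, 208]
print(and_merge(bayes, viterbi))  # [39, 156]
```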

15 ANDs using Inverted Files
Method 2: useful if one index is much smaller than the other (S_A << S_B), e.g. A = index for "Johns" (39, 227), B = index for "Hopkins" (1, 5, 25, 28, 39, 45, 58, 96, 156).
For each member of the smaller list A, do a binary search for A[i] in the larger index B: bsearch(A[i], B).
For A AND B AND C, order the lists by size and intersect pairwise, smallest first.
Cost: S_A * log2(S_B); interpolation search can achieve S_A * log log (S_B).
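Method 2 as a sketch, using the standard-library `bisect` for the binary search into the larger list:

```python
from bisect import bisect_left

def and_bsearch(small, big):
    """For each member of the smaller sorted list, binary-search the larger.
    Cost: S_A * log2(S_B)."""
    out = []
    for x in small:
        k = bisect_left(big, x)          # leftmost insertion point for x
        if k < len(big) and big[k] == x:  # present in the larger list?
            out.append(x)
    return out

johns = [39, 227]
hopkins = [1, 5, 25, 28, 39, 45, 58, 96, 156]
print(and_bsearch(johns, hopkins))  # [39]
```

This wins over the linear merge exactly when S_A * log2(S_B) < S_A + S_B, i.e. when one list is much shorter than the other.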

16 Proximity Search
Document-level indexes are not adequate for phrase and proximity queries (e.g. finding "Anthony Johns Hopkins" across Doc 1 ... Doc i).
Option 1: index position offsets into the corpus (so the size of the index is about the size of the corpus).
Before: match if ptr_A = ptr_B (same document).
Now: "A B" = match if ptr_A = ptr_B - 1; A w/10 B = match if |ptr_A - ptr_B| <= 10.
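The two positional conditions can be sketched over per-term position lists; the word offsets below are made up for illustration.

```python
def adjacent(pos_a, pos_b):
    """Phrase "A B": positions p of A such that p + 1 is a position of B."""
    b = set(pos_b)
    return [p for p in pos_a if p + 1 in b]

def within(pos_a, pos_b, k):
    """A w/k B: some occurrence of A and B within +/- k word positions."""
    return any(abs(pa - pb) <= k for pa in pos_a for pb in pos_b)

# Hypothetical word offsets within one document:
johns_pos = [10, 42]
hopkins_pos = [11, 90]
print(adjacent(johns_pos, hopkins_pos))    # [10] -> "Johns Hopkins" at offset 10
print(within(johns_pos, hopkins_pos, 10))  # True
```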

17 Variations 1
Don't index function words (e.g. "The" in "The Johns Hopkins": the index wordlist keeps "Johns" but not "The"); do a linear match search in the corpus to verify the function words.
Savings: roughly 50% of the index size; potential speed improvement given data-access costs.

18 Variations 2: Multilevel Indexes
Two levels: a document-level index ("Anthony", "Johns", "Hopkins" -> documents) and a position-level index within each document.
Supports parallel search; may have a paging-cost advantage.
Cost: a large index, roughly N + dV (corpus size plus average document count times vocabulary size).

19 Interpolation Search
Useful when the data are numeric and uniformly distributed.
Example: an index of 100 cells holding sorted values in the range 0 ... 1000 (e.g. 174, 195, 211, 226, 230, 231, 246, ..., 483, 496, 521, 526, ..., 995); goal: find the value 211.
Binary search begins by looking at cell 50. Interpolation search makes a better guess for the first cell to examine: 211 is about 21% of the way through the value range, so start near cell 21.

20 Binary Search vs. Interpolation Search
Binary search:
Bsearch(low, high, key):
  mid = (high + low) / 2
  if key = A[mid], return mid
  else if key < A[mid], Bsearch(low, mid-1, key)
  else Bsearch(mid+1, high, key)
Interpolation search:
Isearch(low, high, key):
  mid = best estimate of position
      = low + (high - low) * (expected % of the way through the range)
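A runnable sketch of the interpolation estimate from the pseudocode above, using the example cell values from the previous slide (the loop structure and tie-breaking are my own filling-in):

```python
def isearch(a, key):
    """Interpolation search on a sorted list of roughly uniform numbers.
    Returns the index of key, or -1 if absent."""
    low, high = 0, len(a) - 1
    while low <= high and a[low] <= key <= a[high]:
        if a[high] == a[low]:
            break  # all remaining values equal; fall through to final check
        # mid = low + (high - low) * (expected fraction of the way through)
        mid = low + (high - low) * (key - a[low]) // (a[high] - a[low])
        if a[mid] == key:
            return mid
        if a[mid] < key:
            low = mid + 1
        else:
            high = mid - 1
    return low if low <= high and a[low] == key else -1

cells = [174, 195, 211, 226, 230, 231, 246, 483, 496, 521, 526, 995]
print(isearch(cells, 211))  # 2
```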

21 Comparison
Binary search probes a typical sequence of cells such as 50, 25, 12, 18, 22, 21, 19, ...; interpolation search goes directly to the expected region, e.g. 21, 19.
Typical number of cells tested: log log (N).

22 Cost of Computing an Inverted Index
1. Simple approach: emit (word, position) pairs and sort them: N log N in the corpus size N.
2. If N >> memory size:
   1) Tokenize (map words to integers)
   2) Create a histogram of token counts
   3) Allocate space in the index
   4) Do multiple passes (K passes) through the corpus, each pass adding only the tokens that fall in the current 1/K of the bins
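The simple approach (step 1) can be sketched in a few lines; the toy corpus is illustrative.

```python
from collections import defaultdict

def build_index(corpus_words):
    """Simple approach: collect (word, position) pairs, sort them
    (the N log N step), then group positions by word."""
    pairs = sorted((w, pos) for pos, w in enumerate(corpus_words))
    index = defaultdict(list)
    for w, pos in pairs:
        index[w].append(pos)   # positions come out in ascending order
    return dict(index)

idx = build_index("to be or not to be".split())
print(idx["be"])  # [1, 5]
print(idx["to"])  # [0, 4]
```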

23 K-pass Indexing
Each pass fills one block of the index (pass K = 1 fills block 1 with tokens W1 ... W4, and so on).
Time = KN + 1, but a big win over N log N on paging.

24 Vector Models for IR
Gerard Salton, Cornell (Salton + Lesk, 68) (Salton, 71) (Salton + McGill, 83).
SMART system: "Salton's Magical Automatic Retrieval Tool" (?).
Chris Buckley, Cornell: current keeper of the flame.

25 Vector Models for IR
Boolean model: document vectors of 0/1 bits, one per term (term present or not).
SMART vector model: document vectors such as (1.0, 3.5, 4.6, 0.1, 0.0, 0.0) composed of real-valued term weights, NOT simply Boolean term presence. Terms may be words, stems, or special compounds.

26 Example
Document vectors over terms such as Comput*, C++, Sparc, Compiler, genome, Biolog*, protein, DNA, e.g. Doc V1 = (3, 5, 4, 1, 0, 1, 0, 0) and Doc V2 = (2, 8, 0, 1, 0, 1, 0, 0) (computing terms dominate) vs. Doc V3 = (1, 0, 0, 0, 5, 3, 1, 4) (biology terms dominate).
Issues:
- How are weights determined? (simple options: raw frequency; weighted by region, titles, keywords)
- Which terms to include? Stoplists? Stem or not?

27 Queries and Documents Share the Same Vector Representation
Given a query Q, map it to a vector V_Q and find the document D_i for which sim(V_i, V_Q) is greatest.

28 Similarity Functions
Cosine similarity is self-normalizing: the vectors can use arbitrary non-negative values (they don't need to be probabilities), so V1 = (100, 200, 300, 50), V2 = (1, 2, 3, 0.5), and V3 = (10, 20, 30, 5) all point in the same direction and are treated alike.
Many other options are available (Dice, Jaccard).
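A short sketch of cosine similarity on the three vectors above, showing the self-normalizing property: scalar multiples of the same vector score 1.0 against each other.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

v1 = [100, 200, 300, 50]
v2 = [1, 2, 3, 0.5]   # v1 scaled down by 100
v3 = [10, 20, 30, 5]  # v1 scaled down by 10
print(round(cosine(v1, v2), 6))  # 1.0
print(round(cosine(v1, v3), 6))  # 1.0
```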

