Algorithms for Information Retrieval Is algorithmic design a 5-mins thinking task ???


1 Algorithms for Information Retrieval Is algorithmic design a 5-mins thinking task ???

2 Toy problem #1: Max Subarray
Goal: find the time window achieving the best "market performance".
Math problem: find the subarray of maximum sum.

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Algorithm:
  Compute P[1,n], the array of prefix sums over A
  Compute M[1,n], where M[i] is the minimum over P[1,i]
  Find end such that P[end] - M[end] is maximum; start is the position where that minimum prefix sum P[start] occurs.

P = 2 -3 3 4 2 6 9 -4 5 -1 6
M = 2 -3 -3 -3 -3 -3 -3 -4 -4 -4 -4

Note: max_{x ≤ y} A[x,y] = max_{x ≤ y} ( P[y] - P[x] ) = max_y [ P[y] - (min_{x ≤ y} P[x]) ]
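The prefix-sum solution above can be sketched in a single pass that maintains the running prefix sum and its minimum so far; the function name and the returned (start, end, sum) convention are mine, not from the slide:

```python
def max_subarray_prefix(A):
    """Max-sum subarray via prefix sums.

    Returns (start, end, best) with the maximum sum over A[start:end],
    using max_y (P[y] - min_{x<y} P[x]) as in the slide.
    """
    best = float("-inf")
    best_start = best_end = 0
    min_prefix = 0          # P[0] = 0, the empty prefix
    min_idx = 0
    prefix = 0
    for y, a in enumerate(A, start=1):
        prefix += a                       # prefix == P[y]
        if prefix - min_prefix > best:    # candidate subarray A[min_idx:y]
            best = prefix - min_prefix
            best_start, best_end = min_idx, y
        if prefix < min_prefix:           # new minimum prefix sum
            min_prefix, min_idx = prefix, y
    return best_start, best_end, best
```

On the slide's array the optimum is the window 6 1 -2 4 3 with sum 12.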

3 Toy problem #1 (solution 2)
Algorithm:
  sum = 0;
  for i = 1, ..., n do
    if (sum + A[i] ≤ 0) sum = 0;
    else { max_sum = MAX(max_sum, sum + A[i]); sum += A[i]; }

A = 2 -5 6 1 -2 4 3 -13 9 -6 7

Note: sum = 0 exactly when the optimum subarray starts (it is preceded by a prefix of sum ≤ 0); sum > 0 while scanning within the optimum.
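A minimal Python sketch of this single-scan solution (Kadane-style); I record the candidate before resetting, so arrays that are entirely negative are also handled, which the slide does not discuss:

```python
def max_subarray_scan(A):
    """Single scan: restart the running sum whenever it drops to <= 0."""
    max_sum = float("-inf")
    s = 0
    for a in A:
        max_sum = max(max_sum, s + a)  # best subarray ending at this item
        s = max(s + a, 0)              # reset when the running sum goes <= 0
    return max_sum
```

This uses O(1) space versus the O(n) extra arrays P and M of solution 1.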

4 Toy problem #2: Top-freq elements
Goal: top queries over a stream of n items (n large).
Math problem: find the item y whose frequency is > n/2, using the smallest space (i.e., the mode, when it occurs > n/2 times).

Algorithm (uses just a pair of variables X, C):
  For each item s of the stream:
    if (C == 0) { X = s; C = 1; }
    else if (X == s) C++;
    else C--;
  Return X;

A = b a c c c d c b a a a c c b c c c

Proof: if X ≠ y at the end, then every one of y's occurrences was cancelled by a distinct "negative" mate, so the mates number ≥ #occ(y) and hence 2 * #occ(y) ≤ n, contradicting #occ(y) > n/2.
Note: the returned X can be wrong if no item occurs > n/2 times.
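This is the Boyer-Moore majority-vote scheme; a short sketch (function name mine), with the caveat from the slide that the answer must be verified by a second counting pass when the majority is not guaranteed:

```python
def majority_candidate(stream):
    """One-pass, O(1)-space majority vote.

    Returns the only possible item with frequency > n/2; if no such
    item exists, the result is arbitrary and needs a verification pass.
    """
    X, C = None, 0
    for s in stream:
        if C == 0:        # adopt a new candidate
            X, C = s, 1
        elif X == s:      # another occurrence of the candidate
            C += 1
        else:             # cancel one occurrence against a mate
            C -= 1
    return X
```

On the slide's stream the item c occurs 9 times out of 17 (> n/2), and the scan indeed returns c.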

5 Toy problem #3: Indexing
Consider the following TREC collection:
  N = 6 * 10^9 bytes (collection size)
  n = 10^6 documents
  TotT = 10^9 total terms (avg term length is 6 chars)
  t = 5 * 10^5 distinct terms
What kind of data structure should we build to support word-based searches?

6 Solution 1: Term-Doc matrix
Entry (term, play) is 1 if the play contains the word, 0 otherwise.
With t = 500K terms and n = 1 million documents, the matrix has 5 * 10^11 one-bit entries: space is 500Gb!

7 Solution 2: Inverted index
Brutus    → 1 2 3 5 8 13 21 34
Calpurnia → 2 4 8 16 32 64 128
Caesar    → 13 16
1. Each posting typically uses about 12 bytes
2. We have 10^9 total terms → at least 12Gb of space
3. Compressing the 6Gb of documents gets ≈ 1.5Gb of data
A better index, but still >10 times the (compressed) text!!!
We can do still better: i.e. 30-50% of the original text.
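A minimal sketch of the structure: a dictionary mapping each term to its sorted posting list of doc-ids, plus the classic linear-time merge of two posting lists for an AND query. The toy documents and function names are my own illustration, not part of the slide:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of doc-ids that contain it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        for term in sorted(set(text.lower().split())):
            index[term].append(doc_id)   # doc_ids arrive in increasing order
    return index

def intersect(p1, p2):
    """AND two sorted posting lists in O(|p1| + |p2|) time."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out
```

Keeping the postings sorted is what makes the merge-based intersection (and the compression mentioned above, via gap encoding) possible.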

8 Toy problem #4: sorting
How do we sort tuples (objects) on disk?
  10^9 objects of 12 bytes each, hence 12 Gb of data.
Key observation: the array A to sort is an "array of pointers to objects"; the tuples themselves sit elsewhere in memory.
Each object-to-object comparison A[i] vs A[j] costs 2 random accesses to the memory locations of A[i] and A[j].
If we use qsort, this is an indirect sort!!! Θ(n log n) random memory accesses!! (I/Os?)

9 Cost of Quicksort on large data
Typical parameter settings:
  N = 10^9 tuples of 12 bytes each
  Typical disk (Seagate Cheetah 150Gb): seek time ~5ms
Analysis of qsort on disk — qsort is an indirect sort, so Θ(n log2 n) random memory accesses:
  [5ms] * n log2 n = 10^9 * log2(10^9) * 5ms ≥ 3 years
In practice a little bit better because of caching, but...
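A quick back-of-the-envelope check of the slide's estimate (the 5ms seek figure is taken from the slide; the rest is plain arithmetic):

```python
import math

n = 10**9
seek = 5e-3                           # 5 ms per random access
accesses = n * math.log2(n)           # ~ n log2 n random accesses, ~30n
seconds = accesses * seek
years = seconds / (365 * 24 * 3600)   # roughly 4.7 years of pure seeking
```

So even the slide's "≥ 3 years" is conservative if every access really hit the disk.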

10 B-trees for sorting?
Using a well-tuned B-tree library (Berkeley DB): n = 10^9 insertions → the tuples get distributed arbitrarily across the leaves ("tuple pointers")!!!
What about listing the tuples in order? Possibly 10^9 random I/Os = 10^9 * 5ms ≈ 2 months.

11 Binary Merge-Sort
Merge-Sort(A)
01  if length(A) > 1 then
02    Copy the first half of A into array A1      (Divide)
03    Copy the second half of A into array A2
04    Merge-Sort(A1)                              (Conquer)
05    Merge-Sort(A2)
06    Merge(A, A1, A2)                            (Combine)
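The pseudocode above, made runnable in Python (returning a new sorted list rather than sorting in place, for clarity):

```python
def merge_sort(A):
    """Divide, conquer, combine — as in the slide's pseudocode."""
    if len(A) <= 1:
        return A
    mid = len(A) // 2
    A1 = merge_sort(A[:mid])   # sort first half
    A2 = merge_sort(A[mid:])   # sort second half
    return merge(A1, A2)       # combine the two sorted halves

def merge(A1, A2):
    """Merge two sorted lists in linear time."""
    out, i, j = [], 0, 0
    while i < len(A1) and j < len(A2):
        if A1[i] <= A2[j]:
            out.append(A1[i]); i += 1
        else:
            out.append(A2[j]); j += 1
    out.extend(A1[i:])
    out.extend(A2[j:])
    return out
```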

12 Merge-Sort Recursion Tree
Example on input: 10 2 5 1 13 19 9 7 15 4 8 3 12 17 6 11
Pairs of runs are merged level by level into sorted runs of 2, 4, 8, ... items, until the fully sorted sequence 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17 19 is obtained, over log2 n levels.
How do we exploit the disk features??

13 External Binary Merge-Sort
Increase the size of the initial runs to be merged!
  First, main-memory sorts produce N/M sorted runs of M items each;
  then external two-way merges combine the runs, level by level.
Each level is 2 passes (read/write) over the data.
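An in-memory simulation of the scheme (runs are modeled as Python lists standing in for disk runs; `M` is the memory capacity in items — names and structure are my illustration):

```python
def external_binary_mergesort(data, M):
    """Simulate external binary merge-sort.

    Phase 1: sort M-item chunks in "memory" to form the initial runs.
    Phase 2: repeatedly 2-way merge runs; each while-iteration models one
    level of the merge tree, i.e. one full read/write pass over the data.
    """
    runs = [sorted(data[i:i + M]) for i in range(0, len(data), M)]
    while len(runs) > 1:
        merged = []
        for i in range(0, len(runs), 2):
            pair = runs[i:i + 2]
            if len(pair) == 1:
                merged.append(pair[0])          # odd run carried over
            else:
                merged.append(merge2(pair[0], pair[1]))
        runs = merged
    return runs[0] if runs else []

def merge2(a, b):
    """Linear-time merge of two sorted runs."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]
```

With N/M initial runs there are about log2(N/M) merge levels, matching the cost analysis on the next slide.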

14 Cost of External Binary Merge-Sort
Typical parameter settings:
  n = 10^9 tuples of 12 bytes each, N = 12 Gb of data
  Typical disk (Seagate): seek time ~8ms, avg transfer rate 100Mb per sec = 10^-8 secs/byte
Analysis of binary merge-sort on disk (M = 10Mb = 10^6 tuples):
  Data divided into N/M runs: ≈ 10^3 runs
  #levels is log2(N/M) ≈ 10
  It executes 2 * log2(N/M) ≈ 20 passes (R/W) over the data
  I/O-scanning cost: 20 * [12 * 10^9] * 10^-8 ≈ 2400 sec = 40 min

15 Multi-way Merge-Sort
Sort N items using internal memory M and disk pages of size B:
  Pass 1: produce N/M sorted runs.
  Passes 2, ...: merge X ≈ M/B runs per pass.
Main memory holds X input buffers of B items each (INPUT 1, ..., INPUT X) plus one OUTPUT buffer; runs stream through them from and to disk.

16 Multiway Merging
Merge X = M/B runs at once: keep one input buffer Bf_i per run, with a pointer p_i to its current item, plus an output buffer Bf_o with pointer p_o.
Repeatedly append min(Bf_1[p_1], Bf_2[p_2], ..., Bf_X[p_X]) to the output (the merged run), until every run reaches EOF:
  Fetch the next page of run i when p_i = B;
  Flush Bf_o to the output file when it is full.
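The min over the X buffer heads is usually maintained with a priority queue, so each step costs O(log X) rather than O(X). A sketch using Python's `heapq` (buffering/paging omitted — runs are plain lists here):

```python
import heapq

def multiway_merge(runs):
    """X-way merge of sorted runs via a min-heap of the current heads.

    Heap entries are (value, run_index, position); the run index breaks
    ties so unequal list items are never compared.
    """
    heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, r, i = heapq.heappop(heap)       # overall minimum head
        out.append(val)
        if i + 1 < len(runs[r]):              # advance within that run
            heapq.heappush(heap, (runs[r][i + 1], r, i + 1))
    return out
```

For a real external merge, each `runs[r]` would be read page by page into its buffer Bf_r, and `out` flushed whenever it fills a page.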

17 Cost of Multi-way Merge-Sort
Number of passes = log_{M/B} #runs ≈ log_{M/B} (N/M)
Cost of a pass = 2 * (N/B) I/Os
Increasing the fan-out (M/B) reduces the number of passes (at the price of more random I/Os within each pass)!
Parameters: M = 10Mb; B = 8Kb; N = 12 Gb → N/M ≈ 10^3 runs; #passes = log_{M/B} (N/M) ≈ 1 !!!
I/O-scanning: from 20 passes (40 min) down to 2 passes (4 min).
Tuning depends on the disk features.
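Checking the slide's numbers with the stated parameters (I interpret Mb/Kb/Gb as binary megabytes/kilobytes/gigabytes; the conclusion is the same under decimal units):

```python
import math

M = 10 * 2**20                  # 10 MB of internal memory
B = 8 * 2**10                   # 8 KB disk pages
N = 12 * 2**30                  # 12 GB of data

runs = math.ceil(N / M)         # ~10^3 initial sorted runs
fan_out = M // B                # ~1280 runs merged per pass
merge_passes = math.ceil(math.log(runs, fan_out))   # a single merge pass
```

So one run-formation pass plus one merge pass suffice, versus ~10 merge levels for the binary scheme.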

18 Can compression help?
Goal: enlarge M (more items fit in memory) and reduce N (less data to scan).
  #passes = O(log_{M/B} (N/M))
  Cost of a pass = O(N/B)

19 Please !! Do not underestimate the features of disks in algorithmic design

