IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.

Presentation transcript:

IR Paolo Ferragina Dipartimento di Informatica Università di Pisa

Paradigm shift: Web 2.0 is about the many

Do big DATA need big PCs? An Italian ad of the '80s about a BIG brush or a brush BIG...

big DATA ⇒ big PC ?
We have three types of algorithms: $T_1(n) = n$, $T_2(n) = n^2$, $T_3(n) = 2^n$ ... and assume that 1 step = 1 time unit.
How many input data $n$ may each algorithm process within $t$ time units?
$n_1 = t$, $n_2 = \sqrt{t}$, $n_3 = \log_2 t$
What about a k-times faster processor? ...or, what is $n$ when the available time is $k \cdot t$ ?
$n_1 = k \cdot t$, $n_2 = \sqrt{k} \cdot \sqrt{t}$, $n_3 = \log_2 (kt) = \log_2 k + \log_2 t$
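The effect of a faster processor on the three running times can be checked numerically. This is a minimal sketch: the function name `max_input_sizes` and the rounding down to integers are assumptions made for illustration, not part of the slides.

```python
import math

def max_input_sizes(t, k=1):
    """Largest n processable within k*t steps for T1(n)=n, T2(n)=n^2, T3(n)=2^n.

    Assumes t >= 1 and k >= 1 are integers and 1 step = 1 time unit.
    """
    budget = k * t
    n1 = budget                      # linear: n = budget
    n2 = math.isqrt(budget)          # quadratic: n = floor(sqrt(budget))
    n3 = budget.bit_length() - 1     # exponential: n = floor(log2(budget))
    return n1, n2, n3

print(max_input_sizes(2**20))        # baseline processor
print(max_input_sizes(2**20, k=4))   # 4-times faster processor
```

A 4-times faster processor multiplies $n_1$ by 4 and $n_2$ by 2, but only adds 2 to $n_3$.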

A new scenario for Algorithmics
Data are more available than ever before
n → ∞ ... is more than a theoretical assumption
The RAM model is too simple: it charges $\Theta(1)$ for every step

The memory hierarchy
CPU registers
Cache (L1, L2): few MBs, some nanosecs, few words fetched
RAM: few GBs, tens of nanosecs, some words fetched
HD: few TBs, few millisecs, B = 32K page
net: many TBs, even secs, packets

Does Virtual Memory help ?
M = memory size, N = problem size
p = prob. of memory access [0.3 ÷ 0.4 (Hennessy-Patterson)]
C = cost of an I/O [$10^5$ ÷ $10^6$ (Hennessy-Patterson)]
If N ≤ M, then the cost per step is 1
If N = (1+ε) M, then the avg cost per step is: $1 + C \cdot p \cdot \varepsilon/(1+\varepsilon)$
This is at least > $10^4 \cdot \varepsilon/(1+\varepsilon)$
If ε = 1/1000 (e.g. M = 1Gb, N = 1Gb + 1Mb), then with $C \cdot p \ge 3 \times 10^4$ the extra cost is about 30, so the avg step-cost is > 20
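The average step cost above is easy to evaluate for concrete values. The following is a minimal sketch; the function name `avg_step_cost` and its parameter names are assumptions, while the formula is the one stated on the slide.

```python
def avg_step_cost(io_cost, p_access, eps):
    """Average cost per step when N = (1 + eps) * M.

    io_cost  : C, cost of one I/O measured in steps
    p_access : p, probability that a step accesses memory
    eps      : fraction by which the problem exceeds the memory size
    """
    if eps <= 0:
        return 1.0  # N <= M: everything fits in memory, unit cost per step
    return 1.0 + io_cost * p_access * eps / (1.0 + eps)

# Most favourable values from the slide: C = 10^5, p = 0.3, eps = 1/1000
print(avg_step_cost(1e5, 0.3, 0.001))
```

Even with just 0.1% of the data outside memory, the average step is more than 20 times slower than in the in-memory case.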

The I/O-model
[Figure: CPU, RAM and HD; a CPU-RAM access costs 1, each RAM-HD transfer moves a page of B items]
Count I/Os
“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
Spatial locality or temporal locality ⇒ fewer and faster I/Os, caching

Other issues ⇒ other models
Random vs sequential I/Os: scanning is better than jumping ⇒ Streaming algorithms
Not just one CPU: many PCs, multi-core CPUs or even GPUs ⇒ Parallel or Distributed algorithms
Parameter-free algorithms: anywhere, anytime, anyway... Optimal !! ⇒ Cache-oblivious algorithms

What about energy-consumption ? [Leventhal, CACM 2008] ≈10 IO/s/W ≈6000 IO/s/W

Our topics, on an example
[Figure: search-engine architecture: Web, Crawler (which pages to visit next?), Page archive, Page Analyzer, Indexer (text, structure, auxiliary), Query resolver, Ranker]
Topics: Hashing, Data Compression, Dictionaries, Sorting, Linear Algebra, Clustering, Classification

Warm up...
Take Wikipedia in Italian, and compute word freq:
Few GBs ⇒ n ≈ $10^9$ words
How do you proceed ??
Tokenize into a sequence of strings
Sort the strings
Create tuples (word, freq)
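The three steps can be sketched in a few lines for a text that fits in memory. This is only an illustration: the function name `word_frequencies`, the lowercasing, and the `\w+` tokenization rule are assumptions, and the point of the lecture is precisely what changes when the data no longer fits in RAM.

```python
import re
from itertools import groupby

def word_frequencies(text):
    """Tokenize, sort, then scan the sorted tokens to build (word, freq) tuples."""
    # Step 1: tokenize into a sequence of strings (lowercased, an assumption)
    tokens = re.findall(r"\w+", text.lower())
    # Step 2: sort the strings, so equal words become adjacent
    tokens.sort()
    # Step 3: one sequential scan groups equal words and counts them
    return [(word, sum(1 for _ in group)) for word, group in groupby(tokens)]

print(word_frequencies("la casa e la strada"))
```

Sorting first means the counting step is a single sequential scan, which is the access pattern that later slides argue is cheap on disk.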

Binary Merge-Sort

Merge-Sort(A,i,j)
  if (i < j) then
    m = (i+j)/2;            Divide
    Merge-Sort(A,i,m);      Conquer
    Merge-Sort(A,m+1,j);
    Merge(A,i,m,j)          Combine

Merge is linear in the #items to be merged
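A runnable version of the pseudocode, as a minimal in-memory sketch. The body of `merge` is not given on the slide, so the one below is one standard way to write it, using temporary copies of the two halves.

```python
def merge(a, i, m, j):
    """Merge the sorted slices a[i..m] and a[m+1..j] (inclusive bounds) in place."""
    left, right = a[i:m + 1], a[m + 1:j + 1]
    li = ri = 0
    k = i
    while li < len(left) and ri < len(right):
        if left[li] <= right[ri]:
            a[k] = left[li]
            li += 1
        else:
            a[k] = right[ri]
            ri += 1
        k += 1
    # Copy whatever is left over; at most one of the two halves is non-empty here
    rest = left[li:] + right[ri:]
    a[k:k + len(rest)] = rest

def merge_sort(a, i, j):
    """Sort a[i..j] in place, following the Divide / Conquer / Combine steps."""
    if i < j:
        m = (i + j) // 2          # Divide
        merge_sort(a, i, m)       # Conquer
        merge_sort(a, m + 1, j)
        merge(a, i, m, j)         # Combine

data = [5, 2, 9, 1, 5, 6]
merge_sort(data, 0, len(data) - 1)
print(data)
```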

But...
Few key observations:
Items = (short) strings = atomic...
$\Theta(n \log n)$ memory accesses (I/Os ??)
[5ms] * $n \log_2 n$ ≈ $1.5 \times 10^8$ secs ≈ 5 years for $n \approx 10^9$
In practice it is faster, why?

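Implicit Caching…
[Figure: merge-sort recursion tree of height $\log_2 N$; the bottom $\log_2 M$ levels fit in internal memory, the top $\log_2 (N/M)$ levels each need 2 passes (R/W)]
N/M runs, each sorted in internal memory (no I/Os)
Each of the remaining $\log_2 (N/M)$ levels takes 2 passes (one Read/one Write) = 2 * (N/B) I/Os
I/O-cost for binary merge-sort is ≈ $2 (N/B) \log_2 (N/M)$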

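A key inefficiency
After few steps, every run is longer than B !!!
We are using only 3 pages
But memory contains M/B pages ≈ $2^{30}/2^{15} = 2^{15}$
[Figure: two input runs on disk, each read one page of B items at a time, merged through an output buffer of B items into the output run 1, 2, 3, 4, ...]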

Multi-way Merge-Sort
Sort N items with main-memory M and disk-pages B:
Pass 1: Produce (N/M) sorted runs.
Pass i: merge $X = (M/B) - 1$ runs ⇒ $\log_X (N/M)$ passes
[Figure: main memory holds buffers of B items: one page for each of run 1, run 2, ..., run X, plus one output page; runs are read from and written to disk]
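The two phases can be sketched in memory as follows. This is a simplified illustration, not an external-memory implementation: lists stand in for disk-resident runs, `run_size` plays the role of M, `fan_in` plays the role of X, and all function names are assumptions.

```python
import heapq

def merge_runs(runs):
    """Merge sorted runs using a min-heap that holds one head item per run."""
    heap = [(run[0], r, 0) for r, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        item, r, pos = heapq.heappop(heap)
        out.append(item)
        if pos + 1 < len(runs[r]):
            # Refill the heap from the run the smallest item came from
            heapq.heappush(heap, (runs[r][pos + 1], r, pos + 1))
    return out

def multiway_merge_sort(items, run_size, fan_in):
    """Pass 1 builds sorted runs of run_size items; later passes merge fan_in runs at a time."""
    if run_size < 1 or fan_in < 2:
        raise ValueError("need run_size >= 1 and fan_in >= 2")
    runs = [sorted(items[i:i + run_size]) for i in range(0, len(items), run_size)]
    while len(runs) > 1:
        runs = [merge_runs(runs[i:i + fan_in]) for i in range(0, len(runs), fan_in)]
    return runs[0] if runs else []

print(multiway_merge_sort([5, 3, 8, 1, 9, 2, 7], run_size=2, fan_in=3))
```

In a real external sort each run would be streamed from disk one page of B items at a time, so the heap only ever needs one page per run plus the output page in memory.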

Cost of Multi-way Merge-Sort
Number of passes = $\log_X (N/M) \approx \log_{M/B} (N/M)$
Total I/O-cost is $\Theta((N/B) \log_{M/B} (N/M))$ I/Os
Since $\log_{M/B} M = \log_{M/B} [(M/B) \cdot B] = (\log_{M/B} B) + 1$, we have $\log_{M/B} (N/M) = \log_{M/B} (N/B) - 1$, so N/M can be replaced by N/B inside the logarithm
Large fan-out (M/B) decreases #passes
In practice M/B ≈ $10^5$ ⇒ #passes = 1 ⇒ few mins
Tuning depends on disk features
Compression would decrease the cost of a pass!
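To see how much the fan-out matters, the number of merge passes and the resulting I/O count can be computed for binary and multi-way merging. This is a rough sketch under stated assumptions: sizes are counted in items, every pass reads and writes all N/B pages, and the run-formation pass is charged as one extra pass; the function names are hypothetical.

```python
def merge_passes(n_items, mem_items, fan_in):
    """Number of merge passes needed to reduce ceil(N/M) initial runs to one."""
    runs = -(-n_items // mem_items)      # ceil(N / M) sorted runs after pass 1
    passes = 0
    while runs > 1:
        runs = -(-runs // fan_in)        # each pass merges fan_in runs into one
        passes += 1
    return passes

def sort_ios(n_items, mem_items, page_items, fan_in):
    """I/Os of merge-sort: 2 * (N/B) per pass, counting run formation as a pass."""
    pages = -(-n_items // page_items)    # ceil(N / B)
    return 2 * pages * (1 + merge_passes(n_items, mem_items, fan_in))

N, M, B = 2**40, 2**30, 2**15            # illustrative sizes, in items
print(merge_passes(N, M, 2), sort_ios(N, M, B, 2))                     # binary merging
print(merge_passes(N, M, M // B - 1), sort_ios(N, M, B, M // B - 1))   # X = M/B - 1
```

With these sizes binary merging needs 10 merge passes, while a fan-out of M/B - 1 needs a single one.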

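I/O-lower bound for Sorting
Every I/O fetches B items, in a memory of size M
Decision tree with fan-out: $\binom{M}{B}$
There are N/B steps in which the fan-out is multiplied by B! cmp-outcomes
Find t > N/B such that: $\binom{M}{B}^{t} \cdot (B!)^{N/B} \ge N!$
We get $t = \Omega((N/B) \log_{M/B} (N/B))$ I/Os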

Pay attention...
If sorting needs to manage arbitrarily long strings: indirect sort
Key observations:
Array A is an “array of pointers to objects”
For each object-to-object comparison A[i] vs A[j]: 2 random accesses to the 2 memory locations pointed to by A[i] and A[j]
$\Theta(n \log n)$ random memory accesses (I/Os ??)
Again caching helps, but it may be less effective than before
[Figure: array A of pointers into the memory containing the strings]
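An indirect sort can be sketched by sorting an array of indices and dereferencing them at every comparison. This is an illustrative sketch: list indices stand in for pointers, and the access counter and the name `indirect_sort` are assumptions added to make the 2 accesses per comparison visible.

```python
from functools import cmp_to_key

def indirect_sort(objects):
    """Sort pointers (indices) to objects; return the pointers and the #object accesses."""
    accesses = 0

    def compare(i, j):
        nonlocal accesses
        accesses += 2                    # each comparison dereferences 2 pointers
        x, y = objects[i], objects[j]
        return (x > y) - (x < y)

    pointers = sorted(range(len(objects)), key=cmp_to_key(compare))
    return pointers, accesses

strings = ["pisa", "algoritmi", "web", "information", "retrieval"]
pointers, accesses = indirect_sort(strings)
print([strings[p] for p in pointers], accesses)
```

The strings themselves never move; only the pointer array is permuted, which is why each comparison may land on two unrelated memory locations.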