Sorting by the Numbers: Sorting Part Four

Presentation transcript:

Sorting by the Numbers
Sorting Part Four

Question
Suppose you are given the task of writing an application to sort a big data file. What do you need to know to pick a good solution?
• File Size = 1 GB
• Record Size = 250 Bytes
• Available Memory = ¼ GB

How many Runs? How big is each Run?
Total Records to Process
• 1 billion bytes in the file
• 250 bytes for each record
• = 4 million records in the file
Run Size
• 1 GB file
• ¼ GB memory
• = 4 Runs of 1 million records each
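
The arithmetic on this slide can be sketched in a few lines of Python. The function name `plan_runs` is hypothetical, and the sketch assumes decimal units (1 GB = 1,000,000,000 bytes) and that a run fills all of the available memory with records.

```python
def plan_runs(file_bytes, record_bytes, memory_bytes):
    """Return (total_records, records_per_run, run_count) for a file to sort."""
    total_records = file_bytes // record_bytes
    # Assumption: a run holds as many whole records as fit in memory.
    records_per_run = memory_bytes // record_bytes
    # Integer ceiling division: a partly filled last run still counts as a run.
    run_count = (total_records + records_per_run - 1) // records_per_run
    return total_records, records_per_run, run_count


# The slide's numbers: 1 GB file, 250-byte records, 1/4 GB of memory.
total, per_run, runs = plan_runs(1_000_000_000, 250, 250_000_000)
print(total, per_run, runs)  # 4000000 1000000 4
```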

Time to Create the Runs
Sorting One Run
• Using either Quicksort or an Ordered Binary Tree: N log2 N compares
• log2 of 1 million is about 20, so 1 million * 20 = approximately 20 million comparisons of internal memory locations
Sorting Four Runs
• 4 * 20 million = 80 million internal memory comparisons
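
As a rough check of the N log2 N estimate, the sketch below computes the comparison count for one run and for all four. The name `sort_compares` is hypothetical, and rounding log2 N up to a whole number is an assumption chosen to match the slide's factor of 20.

```python
import math


def sort_compares(n):
    """Estimate the compares needed to sort n records as n * ceil(log2 n)."""
    # Assumption: log2(n) is rounded up to a whole number of levels.
    return n * math.ceil(math.log2(n))


one_run = sort_compares(1_000_000)
print(one_run)      # 20000000
print(4 * one_run)  # 80000000
```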

Refresher on Merging Files
• To merge 2 files of N random records each requires about 2N compares.
• To merge 2 files where the runs were built from an already sorted file requires only about N compares.
[Figure: File One and File Two, shown once for each case]
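
The two cases can be observed with a small two-way merge that counts its compares. This is a minimal in-memory sketch using Python lists in place of files. The function name `merge_counting` is hypothetical.

```python
def merge_counting(a, b):
    """Merge two sorted lists and return (merged_list, number_of_compares)."""
    merged, compares = [], 0
    i = j = 0
    while i < len(a) and j < len(b):
        compares += 1  # one key comparison per record written in this loop
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # One input is exhausted; the rest is copied without further compares.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged, compares


# Interleaved keys: close to 2N compares (here 2N - 1 with N = 4).
print(merge_counting([0, 2, 4, 6], [1, 3, 5, 7])[1])  # 7
# Runs cut from an already sorted file: N compares.
print(merge_counting([0, 1, 2, 3], [4, 5, 6, 7])[1])  # 4
```

The 2N figure is an upper estimate: the last record of the longer-lived input is copied without a compare, so the exact worst case is 2N - 1.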

Merging the Four Files
[Figure: two diagrams merging runs R1, R2, R3 and R4 through temporary files T1 and T2; the merge steps are labeled 2 million, 4 million, 3 million, 2 million and 4 million compares]
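
The figure's labels are consistent with two merge orders. It is an assumption that these are the two orders drawn: a balanced order that merges the runs in pairs, and a chained order that folds one run at a time into a growing temp file. The sketch below costs each order by charging one compare per record written, following the 2N rule from the merging refresher. Both function names are hypothetical.

```python
def balanced_merge_cost(run_sizes):
    """Compares needed when runs are merged pairwise, pass after pass."""
    sizes, total = list(run_sizes), 0
    while len(sizes) > 1:
        next_pass = []
        for k in range(0, len(sizes) - 1, 2):
            merged = sizes[k] + sizes[k + 1]
            total += merged  # about one compare per record written
            next_pass.append(merged)
        if len(sizes) % 2:  # an unpaired run is carried to the next pass
            next_pass.append(sizes[-1])
        sizes = next_pass
    return total


def chained_merge_cost(run_sizes):
    """Compares needed when each run is merged into one growing file."""
    total, accumulated = 0, run_sizes[0]
    for size in run_sizes[1:]:
        accumulated += size
        total += accumulated
    return total


runs = [1_000_000] * 4
print(balanced_merge_cost(runs))  # 8000000  (2 + 2 + 4 million)
print(chained_merge_cost(runs))   # 9000000  (2 + 3 + 4 million)
```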

Total Processing Time
Time to Create the 4 Runs
• 80 million comparisons
Time to Merge the 4 Runs
• 8 million comparisons
Assuming a File Read takes just 100 times longer than a Memory Read
• Total Time = 80 million + (8 million * 100) = 880 million time units
• Note: we have omitted the time to read the runs into memory and to write the runs to temp files

Second Example
2 Runs of 2 Million Records each
• Internal Sorting: N log2 N = 2 million * 21 = 42 million compares; 84 million to create both runs
• File Merging: 4 million compares
• Total Time: 84 million + (4 million * 100) = 484 million time units
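
To compare run configurations under the same cost model, the whole estimate can be wrapped in one function. This is a sketch under stated assumptions, not a measurement: the name `estimate_time` and its parameters are hypothetical, merging is assumed to be balanced and two-way, and the log factor defaults to log2 of the run size rounded up but can be overridden to try a different rounding.

```python
import math


def estimate_time(runs, records_per_run, log_factor=None, io_penalty=100):
    """Estimate total time units: in-memory sort compares plus weighted merge compares."""
    if log_factor is None:
        # Assumption: round log2 of the run size up to a whole number.
        log_factor = math.ceil(math.log2(records_per_run))
    sort_compares = runs * records_per_run * log_factor
    # Balanced two-way merging touches every record once per merge pass.
    merge_passes = math.ceil(math.log2(runs)) if runs > 1 else 0
    merge_compares = runs * records_per_run * merge_passes
    # Each merge compare is charged as a file read, io_penalty times a memory read.
    return sort_compares + io_penalty * merge_compares


four_runs = estimate_time(4, 1_000_000)
two_runs = estimate_time(2, 2_000_000)
print(four_runs, two_runs, two_runs < four_runs)
```

Under this model, fewer and larger runs come out ahead because the saving in file compares outweighs the extra in-memory sorting work.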

Next in this course
So how much time does it take to access the disk?