I/O-Algorithms Lars Arge Spring 2006 February 2, 2006.


Lars Arge I/O-Algorithms (spring 2006)

Massive Data
Massive datasets are being collected everywhere
Storage management software is a billion-$ industry
Examples (2002):
– Phone: AT&T 20TB phone call database, wireless tracking
– Consumer: WalMart 70TB database, buying patterns
– WEB: Web crawl of 200M pages and 2000M links, Akamai stores 7 billion clicks per day
– Geography: NASA satellites generate 1.2TB per day

Example: Grid Terrain Data
Appalachian Mountains (800km x 800km)
– 100m resolution → ~64M cells → ~128MB raw data (~500MB when processing)
– ~1.2GB at 30m resolution (NASA SRTM mission acquired 30m data for 80% of the earth land mass)
– ~12GB at 10m resolution (much of US available from USGS)
– ~1.2TB at 1m resolution (selected, mostly military, availability)
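The grid sizes above follow from simple arithmetic, which the sketch below reproduces. The 2 bytes per cell is an assumption inferred from "64M cells, 128MB"; `grid_stats` is a hypothetical helper name. The slide's figures are rounded, so the output matches them only approximately (the 30m case in particular comes out somewhat higher).

```python
def grid_stats(extent_km: float, resolution_m: float, bytes_per_cell: int = 2):
    """Return (cells, raw_bytes) for a square grid terrain.

    bytes_per_cell=2 is an assumption inferred from 64M cells -> ~128MB.
    """
    cells_per_side = extent_km * 1000 / resolution_m
    cells = round(cells_per_side ** 2)
    return cells, cells * bytes_per_cell


# 800km x 800km at the four resolutions mentioned on the slide
for res in (100, 30, 10, 1):
    cells, raw = grid_stats(800, res)
    print(f"{res:>4} m: {cells / 1e6:,.0f}M cells, {raw / 1e9:,.2f} GB raw")
```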

Example: LIDAR Terrain Data
Massive (irregular) point sets (1-10m resolution)
– Becoming relatively cheap and easy to collect
Appalachian Mountains between 50GB and 5TB

Random Access Machine Model
Standard theoretical model of computation:
– Infinite memory
– Uniform access cost
Simple model crucial for success of computer industry

Hierarchical Memory
Modern machines have complicated memory hierarchy
– Levels get larger and slower further away from CPU
– Data moved between levels using large blocks
[Figure: memory hierarchy with levels L1, L2, RAM]

Slow I/O
Disk access is 10^6 times slower than main memory access
“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)
– Disk systems try to amortize large access time by transferring large contiguous blocks of data (8-16Kbytes)
Important to store/access data to take advantage of blocks (locality)

Scalability Problems
Most programs developed in RAM-model
– Run on large datasets because OS moves blocks as needed
Modern OS utilizes sophisticated paging and prefetching strategies
– But if program makes scattered accesses even good OS cannot take advantage of block access
→ Scalability problems!
[Figure: running time vs. data size]

External Memory Model
N = # of items in the problem instance
B = # of items per disk block
M = # of items that fit in main memory
T = # of items in output
I/O: Move block between memory and disk
We assume (for convenience) that M > B^2
[Figure: disk D, processor P, and memory M; block I/O between memory and disk]

Fundamental Bounds
             Internal     External
Scanning:    N            N/B
Sorting:     N log N      (N/B) log_{M/B}(N/B)
Permuting:   N            min{N, (N/B) log_{M/B}(N/B)}
Searching:   log_2 N      log_B N
Note:
– Linear I/O: O(N/B)
– Permuting not linear
– Permuting and sorting bounds are equal in all practical cases
– B factor VERY important: N/B < (N/B) log_{M/B}(N/B) << N
– Cannot sort optimally with search tree
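To get a feel for how much the B factor matters, the following sketch evaluates the standard external-memory bounds (scan N/B, sort (N/B) log_{M/B}(N/B), permute min{N, sort}, search log_B N) for concrete parameters. Constants hidden by the O-notation are ignored, and `io_bounds` and the sample values of N, M, B are illustrative assumptions.

```python
import math


def io_bounds(N: float, M: float, B: float) -> dict:
    """Evaluate the standard I/O bounds, ignoring constant factors."""
    n = N / B                                  # number of blocks
    sort = n * math.log(n, M / B)              # (N/B) log_{M/B}(N/B)
    return {
        "scan": n,
        "sort": sort,
        "permute": min(N, sort),
        "search": math.log(N, B),              # log_B N
    }


# Hypothetical machine: 10^9 items, 10^6 items of memory, 10^3 items per block
for name, value in io_bounds(1e9, 1e6, 1e3).items():
    print(f"{name:>8}: {value:,.0f} I/Os")
```

With these parameters the sorting bound is only about twice the scanning bound, while both are far below N.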

Scalability Problems: Block Access Matters
Example: Reading an array from disk
– Array size N = 10 elements
– Disk block size B = 2 elements
– Main memory size M = 4 elements (2 blocks)
– Algorithm 1: N = 10 I/Os
– Algorithm 2: N/B = 5 I/Os
Difference between N and N/B large since block size is large
– Example: N = 256 x 10^6, B = 8000, 1ms disk access time
→ N I/Os take 256 x 10^3 sec = 4266 min = 71 hr
→ N/B I/Os take 256/8 sec = 32 sec
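The two examples on this slide can be checked with a few lines of arithmetic. This is a minimal sketch; `scan_ios` and `seconds` are hypothetical helper names, and the 1ms latency is the figure given on the slide.

```python
import math


def scan_ios(n_items: int, block_size: int, blocked: bool) -> int:
    """I/Os to read n_items: one per item if unblocked, one per block otherwise."""
    return math.ceil(n_items / block_size) if blocked else n_items


def seconds(ios: int, access_ms: float = 1.0) -> float:
    """Wall-clock seconds for a number of I/Os at a fixed per-I/O latency."""
    return ios * access_ms / 1000


# Toy array: N = 10, B = 2
print(scan_ios(10, 2, blocked=False), scan_ios(10, 2, blocked=True))

# Large example: N = 256 x 10^6, B = 8000, 1 ms per I/O
slow = seconds(scan_ios(256_000_000, 8000, blocked=False))
fast = seconds(scan_ios(256_000_000, 8000, blocked=True))
print(f"{slow:.0f} s = {slow / 3600:.1f} hr versus {fast:.0f} s")
```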

Queues and Stacks
Queue:
– Maintain push and pop blocks in main memory
→ O(1/B) I/Os per Push/Pop operation (amortized)
Stack:
– Maintain push/pop block in main memory
→ O(1/B) I/Os per Push/Pop operation (amortized)
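A minimal sketch of the stack idea, with the disk simulated by a Python list and I/Os counted explicitly. The class name `ExternalStack` and the choice of a 2B-item in-memory buffer are assumptions made here so that at least B operations separate any two I/Os; the slide itself only says that a push/pop block is kept in memory.

```python
class ExternalStack:
    """Stack keeping at most 2*B items in memory; the rest sits in 'disk' blocks."""

    def __init__(self, block_size: int):
        self.B = block_size
        self.buffer = []   # in-memory items, top of the stack at the end
        self.disk = []     # simulated disk: a list of full blocks of B items
        self.ios = 0       # number of block transfers performed

    def push(self, x):
        if len(self.buffer) == 2 * self.B:
            # Buffer full: write the B oldest in-memory items out as one block.
            self.disk.append(self.buffer[:self.B])
            self.buffer = self.buffer[self.B:]
            self.ios += 1
        self.buffer.append(x)

    def pop(self):
        if not self.buffer:
            if not self.disk:
                raise IndexError("pop from empty stack")
            # Buffer empty: read the most recently written block back.
            self.buffer = self.disk.pop()
            self.ios += 1
        return self.buffer.pop()


stack = ExternalStack(block_size=4)
for i in range(100):
    stack.push(i)
popped = [stack.pop() for _ in range(100)]
print(popped[:5], stack.ios)
```

Because the buffer holds B items right after every transfer, the next transfer cannot happen before B further operations, which gives the O(1/B) amortized bound.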

Sorting
<M/B sorted lists (queues) can be merged in O(N/B) I/Os
– M/B blocks in main memory
Unsorted list (queue) can be distributed using <M/B split elements in O(N/B) I/Os
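The merging claim can be illustrated with a small simulation: each sorted run is consumed one block at a time, so only one block per run (plus one output block) needs to be in memory, and every block is read and written once. `merge_runs` is a hypothetical name, and the runs are plain Python lists standing in for files on disk.

```python
import heapq


def merge_runs(runs, B):
    """Merge sorted runs block by block; return (merged, io_count).

    io_count counts one I/O per input block read and per output block written.
    """
    ios = 0

    def blocks(run):
        nonlocal ios
        for i in range(0, len(run), B):
            ios += 1                      # fetch the next block of this run
            yield from run[i:i + B]

    merged = list(heapq.merge(*(blocks(r) for r in runs)))
    ios += -(-len(merged) // B)           # ceil(N/B) output blocks written
    return merged, ios


runs = [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
print(merge_runs(runs, B=2))
```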

Sorting
Merge sort:
– Create N/M memory sized sorted lists
– Repeatedly merge lists together Θ(M/B) at a time
→ O(log_{M/B}(N/M)) phases using O(N/B) I/Os each
→ O((N/B) log_{M/B}(N/B)) I/Os
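A rough cost calculator for external merge sort, assuming every pass reads and writes each block exactly once and that one memory block is reserved for output (so the merge fan-in is M/B − 1). `merge_sort_cost` and the sample parameters are illustrative assumptions, not measurements.

```python
import math


def merge_sort_cost(N: int, M: int, B: int):
    """Return (merge_phases, total_ios) for external merge sort."""
    if M // B < 3:
        raise ValueError("need at least 3 blocks of memory to merge")
    fan_in = M // B - 1                  # assumption: one block kept for output
    runs = math.ceil(N / M)              # memory-sized sorted lists
    pass_ios = 2 * math.ceil(N / B)      # read + write every block once
    phases = 0
    while runs > 1:
        runs = math.ceil(runs / fan_in)  # merge fan_in lists at a time
        phases += 1
    # one run-formation pass plus one pass per merge phase
    return phases, pass_ios * (1 + phases)


print(merge_sort_cost(1_000_000, 10_000, 10))
print(merge_sort_cost(1_000_000, 1_000, 100))
```

With a large fan-in a single merge phase is enough; shrinking the memory to ten blocks pushes the same input to four phases.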

Sorting
Distribution sort (multiway quicksort):
– Compute Θ(M/B) splitting elements
– Distribute unsorted list into Θ(M/B) unsorted lists of equal size
– Recursively split lists until fit in memory
→ O(log_{M/B}(N/M)) phases
→ O((N/B) log_{M/B}(N/B)) I/Os if splitting elements computed in O(N/B) I/Os
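The recursive structure of distribution sort can be sketched in memory as follows. Note the swapped-in technique: splitting elements are chosen by random sampling purely for illustration, not by a deterministic linear-I/O computation, so bucket sizes are only roughly equal. `distribute`, `distribution_sort`, and the `fanout` parameter are hypothetical names.

```python
import random
from bisect import bisect_right


def distribute(items, splitters):
    """Route each item to the bucket delimited by consecutive sorted splitters."""
    buckets = [[] for _ in range(len(splitters) + 1)]
    for x in items:
        buckets[bisect_right(splitters, x)].append(x)
    return buckets


def distribution_sort(items, M, fanout, rng=None):
    """Sort by recursive multiway distribution until a list fits in 'memory' M."""
    rng = rng or random.Random(0)
    if len(items) <= M:
        return sorted(items)              # fits in memory: sort directly
    # Assumption: random splitters stand in for properly computed ones.
    splitters = sorted(rng.sample(items, fanout - 1))
    out = []
    for bucket in distribute(items, splitters):
        if len(bucket) == len(items):     # no progress, e.g. all keys equal
            out.extend(sorted(bucket))
        else:
            out.extend(distribution_sort(bucket, M, fanout, rng))
    return out


print(distribute([5, 1, 9, 3, 7], [4, 8]))
print(distribution_sort([9, 2, 7, 4, 1, 8, 3, 6, 5, 0], M=3, fanout=3))
```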

Computing Splitting Elements
In internal memory (deterministic) quicksort split element (median) found using linear time selection
Selection algorithm: Finding i’th element in sorted order
1) Select median of every group of 5 elements
2) Recursively select median of ~N/5 selected elements
3) Distribute elements into two lists using computed median
4) Recursively select in one of two lists
Analysis:
– Step 1 and 3 performed in O(N/B) I/Os
– Step 4 recursion on at most ~7N/10 elements
→ T(N) = O(N/B) + T(N/5) + T(7N/10) = O(N/B) I/Os
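The four steps of the selection algorithm map directly onto a short recursive function. This is an in-memory sketch (the external version performs steps 1 and 3 as scans); `select` uses 0-based ranks and, as a small deviation, keeps elements equal to the pivot out of both lists so that duplicates are handled.

```python
def select(items, i):
    """Return the i-th smallest element (0-based) by median-of-medians."""
    if len(items) <= 5:
        return sorted(items)[i]
    # 1) median of every group of 5 elements
    groups = (items[j:j + 5] for j in range(0, len(items), 5))
    medians = [sorted(g)[len(g) // 2] for g in groups]
    # 2) recursively select the median of the ~N/5 medians
    pivot = select(medians, len(medians) // 2)
    # 3) distribute elements around the computed median
    low = [x for x in items if x < pivot]
    high = [x for x in items if x > pivot]
    equal = len(items) - len(low) - len(high)
    # 4) recurse in one of the two lists
    if i < len(low):
        return select(low, i)
    if i < len(low) + equal:
        return pivot
    return select(high, i - len(low) - equal)


data = [17, 3, 44, 8, 23, 15, 4, 42, 16, 1, 30]
print(select(data, len(data) // 2))   # median of the list
```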

Sorting
Distribution sort (multiway quicksort):
Computing splitting elements:
– Θ(M/B) times linear I/O selection → O(NM/B^2) I/O algorithm
– But can use selection algorithm to compute √(M/B) splitting elements in O(N/B) I/Os, partitioning into lists of size < (3/2) N/√(M/B)
→ O(log_{√(M/B)}(N/M)) = O(log_{M/B}(N/M)) phases
→ O((N/B) log_{M/B}(N/B)) algorithm

Computing Splitting Elements
1) Sample 4N/√(M/B) elements:
– Create N/M memory sized sorted lists
– Pick every (1/4)√(M/B)’th element from each sorted list
2) Choose √(M/B) split elements from sample:
– Use selection algorithm √(M/B) times to find every (4N/√(M/B))/√(M/B) = 4N/(M/B)’th element
Analysis:
– Step 1 performed in O(N/B) I/Os
– Step 2 performed in √(M/B) · O(N/(√(M/B) · B)) = O(N/B) I/Os
→ O(N/B) I/Os
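The two-step sampling scheme can be tried out on a small input. In this sketch the sampling interval `step` and the number of buckets `k` are left as free parameters rather than tied to M and B, and sorting the sample stands in for repeated use of the selection algorithm; both are simplifying assumptions, and the function names are hypothetical.

```python
import random
from bisect import bisect_right


def sample_splitters(items, M, step, k):
    """Pick k-1 split elements (for k buckets) via per-run sampling."""
    sample = []
    for start in range(0, len(items), M):
        run = sorted(items[start:start + M])   # one memory-sized sorted list
        sample.extend(run[step - 1::step])     # every step'th element of the run
    sample.sort()   # stand-in for repeated use of the selection algorithm
    return [sample[j * len(sample) // k] for j in range(1, k)]


def bucket_sizes(items, splitters):
    """Count how many items fall between consecutive split elements."""
    sizes = [0] * (len(splitters) + 1)
    for x in items:
        sizes[bisect_right(splitters, x)] += 1
    return sizes


data = list(range(4096))
random.Random(1).shuffle(data)
splitters = sample_splitters(data, M=256, step=4, k=4)
print(splitters, bucket_sizes(data, splitters))
```

Because every run contributes evenly spaced samples, each bucket can exceed its ideal size N/k only by a small amount per run, which is what keeps the recursion balanced.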


Computing Splitting Elements
Elements in range R defined by consecutive split elements
– Sampled elements in R: 4N/(M/B)
– Between sampled elements in R: (4N/(M/B)) · ((1/4)√(M/B) − 1)
– Between sampled element in R and outside R: 2 · (N/M) · ((1/4)√(M/B) − 1)
→ < (3/2) N/√(M/B) elements in R

Permuting Lower Bound
Permuting N elements according to a given permutation takes Ω(min{N, (N/B) log_{M/B}(N/B)}) I/Os
– Also a lower bound on sorting
Proof:

Sorting lower bound
Sorting N elements takes Ω((N/B) log_{M/B}(N/B)) I/Os in comparison model
Proof:
– Initially N elements stored in N/B first blocks on disk
– Initially all N! possible orderings consistent with our knowledge
– After t I/Os?

Sorting lower bound
Consider one input assuming:
– S consistent orderings before input
– Compute total order of elements in memory
– Adversary chooses ”worst” outcome of comparisons done
(M choose B) · B! possible orderings of M-B ”old” and B new elements in memory
Adversary can choose outcome such that still ≥ S / ((M choose B) · B!) consistent orderings
Only get B! term N/B times
→ ≥ N! / ((M choose B)^t · (B!)^{N/B}) consistent orderings after t I/Os
→ t = Ω((N/B) log_{M/B}(N/B))
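The last step of the counting argument can be written out as a short derivation. It takes as an assumption the standard form of this argument, namely that at least N!/(\binom{M}{B}^t (B!)^{N/B}) orderings remain consistent after t I/Os, and uses the elementary estimates N! ≥ (N/e)^N, B! ≤ B^B and \binom{M}{B} ≤ (eM/B)^B.

```latex
\begin{align*}
\frac{N!}{\binom{M}{B}^{t}\,(B!)^{N/B}} \le 1
  &\;\Longrightarrow\;
  t \log \binom{M}{B} \;\ge\; \log N! - \frac{N}{B}\log B! \\
  &\;\Longrightarrow\;
  t \cdot B \log\frac{eM}{B} \;\ge\; N\log\frac{N}{e} - N\log B
     \;=\; N \log\frac{N}{eB} \\
  &\;\Longrightarrow\;
  t \;\ge\; \frac{N \log\frac{N}{eB}}{B \log\frac{eM}{B}}
     \;=\; \Omega\!\left(\frac{N}{B}\,\log_{M/B}\frac{N}{B}\right).
\end{align*}
```

The algorithm can only stop once a single ordering remains, which is why the left-hand side must drop to at most 1.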

Summary/Conclusion: Sorting
I/O (and memory hierarchy efficiency) increasingly important
– Massive data everywhere
External merge or distribution sort takes O((N/B) log_{M/B}(N/B)) I/Os
– Merge-sort based on M/B-way merging
– Distribution sort based on √(M/B)-way distribution and partition elements finding
Optimal in comparison model
Can prove lower bound in stronger model
– Holds even for permuting

References
Input/Output Complexity of Sorting and Related Problems. A. Aggarwal and J.S. Vitter. CACM 31(9), 1988.
External partition element finding. Lecture notes by L. Arge and M. G. Lagoudakis.
Lower bound on External Permuting/Sorting. Lecture notes by L. Arge.
