Presentation is loading. Please wait.

Presentation is loading. Please wait.

Heavily based on slides by Lars Arge I/O-Algorithms Thomas Mølhave Spring 2012 February 9, 2012.

Similar presentations


Presentation on theme: "Heavily based on slides by Lars Arge I/O-Algorithms Thomas Mølhave Spring 2012 February 9, 2012."— Presentation transcript:

1 Heavily based on slides by Lars Arge I/O-Algorithms Thomas Mølhave Spring 2012 February 9, 2012

2 Heavily based on slides by Lars Arge Pervasive use of computers and sensors Increased ability to acquire/store/process data → Massive data collected everywhere Society increasingly “data driven” → Access/process data anywhere any time Nature/Science special issues 2/06,9/08, 2/11 Scientific data size growing exponentially, while quality and availability improving Paradigm shift: Science will be about mining data Massive Data Obviously not only in sciences: Economist 02/10: From 150 Billion Gigabytes five years ago to 1200 Billion today Managing data deluge difficult; doing so will transform business/public life

3 Heavily based on slides by Lars Arge I/O-Algorithms 3 Example: Grid Terrain Data Appalachian Mountains (800km x 800km) –100m resolution  ~ 64M cells  ~128MB raw data (~500MB when processing) –~ 1.2GB at 30m resolution NASA SRTM mission acquired 30m data for 80% of the earth land mass –~ 12GB at 10m resolution (much of US available from USGS) –~ 1.2TB at 1m resolution (quickly becoming available) –…. 10-100 points per m 2 already possible

4 Heavily based on slides by Lars Arge I/O-Algorithms 4 Example: LIDAR Terrain Data Massive (irregular) point sets (~1m resolution) –Becoming relatively cheap and easy to collect Sub-meter resolution using mobile mapping

5 Heavily based on slides by Lars Arge I/O-Algorithms 5 Example: LIDAR Terrain Data ~1,2 km ~ 280 km/h at 1500-2000m ~ 1,5 m between measurements

6 Heavily based on slides by Lars Arge Example: LIDAR Terrain Data ~2 million points at 30 meter (<1GB) ~18 billion points at 1 meter (>1TB) I/O-Algorithms 6

7 Heavily based on slides by Lars Arge I/O-Algorithms 7 Application Example: Flooding Prediction +1 meter +2 meter

8 Heavily based on slides by Lars Arge Example: Detailed Data Essential Mandø with 2 meter sea-level raise 80 meter terrain model 2 meter terrain model

9 Heavily based on slides by Lars Arge I/O-Algorithms 9 Random Access Machine Model Standard theoretical model of computation: –Infinite memory –Uniform access cost Simple model crucial for success of computer industry RAMRAM

10 Heavily based on slides by Lars Arge I/O-Algorithms 10 Hierarchical Memory Modern machines have complicated memory hierarchy –Levels get larger and slower further away from CPU –Data moved between levels using large blocks L1L1 L2L2 RAMRAM

11 Heavily based on slides by Lars Arge I/O-Algorithms 11 Slow I/O –Disk systems try to amortize large access time transferring large contiguous blocks of data (8-16Kbytes) Important to store/access data to take advantage of blocks (locality) Disk access is 10 6 times slower than main memory access “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

12 Heavily based on slides by Lars Arge I/O-Algorithms 12 Scalability Problems Most programs developed in RAM-model –Run on large datasets because OS moves blocks as needed Moderns OS utilizes sophisticated paging and prefetching strategies –But if program makes scattered accesses even good OS cannot take advantage of block access  Scalability problems! data size running time

13 Heavily based on slides by Lars Arge I/O-Algorithms 13 N= # of items in the problem instance B = # of items per disk block M = # of items that fit in main memory T = # of items in output I/O: Move block between memory and disk D P M Block I/O External Memory Model

14 Heavily based on slides by Lars Arge I/O-Algorithms 14 Fundamental Bounds Internal External Scanning: N Sorting: N log N Permuting Searching: Note: –Linear I/O: O(N/B) –Permuting not linear –Permuting and sorting bounds are equal in all practical cases –B factor VERY important: –Cannot sort optimally with search tree

15 Heavily based on slides by Lars Arge Scalability Problems: Block Access Matters Example: Traversing linked list (List ranking) –Array size N = 10 elements –Disk block size B = 2 elements –Main memory size M = 4 elements (2 blocks) Large difference between N and N/B large since block size is large –Example: N = 256 x 10 6, B = 8000, 1ms disk access time  N I/Os take 256 x 10 3 sec = 4266 min = 71 hr  N/B I/Os take 256/8 sec = 32 sec Algorithm 2: N/B=5 I/OsAlgorithm 1: N=10 I/Os 1526 7 34108912 9 8 54763 15 I/O-Algorithms

16 Heavily based on slides by Lars Arge I/O-Algorithms 16 Queues and Stacks Queue: –Maintain push and pop blocks in main memory  O(1/B) Push/Pop operations Stack: –Maintain push/pop blocks in main memory  O(1/B) Push/Pop operations Push Pop

17 Heavily based on slides by Lars Arge I/O-algorithms 17 Hierarchical Memory Modern machines have complicated memory hierarchy –Levels get larger and slower further away from CPU –Large access time amortized using block transfer between levels Bottleneck often transfers between largest memory levels in use L1L1 L2L2 RAMRAM

18 Heavily based on slides by Lars Arge I/O-algorithms 18 I/O-Bottleneck I/O is often bottleneck when handling massive datasets –Disk access is 10 6 times slower than main memory access –Large transfer block size (typically 8-16 Kbytes) Important to obtain “locality of reference” –Need to store and access data to take advantage of blocks

19 Heavily based on slides by Lars Arge I/O-algorithms 19 I/O-Model Parameters N = # elements in problem instance B = # elements that fits in disk block M = # elements that fits in main memory T = # output size in searching problem We often assume that M>B 2 I/O: Movement of block between memory and disk D P M Block I/O

20 Heavily based on slides by Lars Arge I/O-algorithms 20 Fundamental Bounds Internal External Scanning: N Sorting: N log N Searching: Note: –Linear I/O: O(N/B) –B factor VERY important: –Cannot sort optimally with search tree

21 Heavily based on slides by Lars Arge I/O-algorithms 21 –If nodes stored arbitrarily on disk  Search in I/Os Binary search tree: –Standard method for search among N elements –Search traces at least one root-leaf path External Search Trees

22 Heavily based on slides by Lars Arge I/O-algorithms 22 External Search Trees BFS blocking: –Block height –Output elements blocked  Rangesearch in I/Os Optimal: O(N/B) space and query

23 Heavily based on slides by Lars Arge I/O-algorithms 23 Maintaining BFS blocking during updates? –Balance normally maintained in search trees using rotations Seems very difficult to maintain BFS blocking during rotation External Search Trees x y x y

24 Heavily based on slides by Lars Arge I/O-algorithms 24 B-trees BFS-blocking naturally corresponds to tree with fan-out B-trees balanced by allowing node degree to vary –Rebalancing performed by splitting and merging nodes

25 Heavily based on slides by Lars Arge I/O-algorithms 25 (a,b)-tree uses linear space and has height  Choosing a,b = each node/leaf stored in one disk block   space and query (a,b)-tree T is an (a,b)-tree (a≥2 and b≥2a-1) –All leaves on the same level and contain between a and b elements –Except for the root, all nodes have degree between a and b –Root has degree between 2 and b  tree

26 Heavily based on slides by Lars Arge I/O-algorithms 26 (a,b)-Tree Insert Insert: Search and insert element in leaf v DO v has b+1 elements/children Split v: make nodes v’ and v’’ with and elements insert element (ref) in parent(v) (make new root if necessary) v=parent(v) Insert touch nodes v v’v’’

27 Heavily based on slides by Lars Arge I/O-algorithms 27 (2,4)-Tree Insert

28 Heavily based on slides by Lars Arge I/O-algorithms 28 (a,b)-Tree Delete Delete: Search and delete element from leaf v DO v has a-1 elements/children Fuse v with sibling v’: move children of v’ to v delete element (ref) from parent(v) (delete root if necessary) If v has >b (and ≤ a+b-1<2b) children split v v=parent(v) Delete touch nodes v v

29 Heavily based on slides by Lars Arge I/O-algorithms 29 (2,4)-Tree Delete

30 Heavily based on slides by Lars Arge I/O-algorithms 30 (a,b)-tree properties: –If b=2a-1 every update can cause many rebalancing operations –If b≥2a update only cause O(1) rebalancing operations amortized –If b>2a only rebalancing operations amortized *Both somewhat hard to show –If b=4a easy to show that update causes rebalance operations amortized *After split during insert a leaf contains  4a/2=2a elements *After fuse during delete a leaf contains between  2a and  5a elements (split if more than 3a  between 3/2a and 5/2a) (a,b)-Tree insert delete (2,3)-tree

31 Heavily based on slides by Lars Arge I/O-algorithms 31 Summary/Conclusion: B-tree B-trees: (a,b)-trees with a,b = –O(N/B) space –O(log B N) query –O(log B N) update B-trees with elements in the leaves sometimes called B + -tree Construction in I/Os –Sort elements and construct leaves –Build tree level-by-level bottom-up

32 Heavily based on slides by Lars Arge I/O-algorithms 32 Merge Sort Merge sort: –Create N/M memory sized sorted runs –Merge runs together M/B at a time  phases using I/Os each Distribution sort similar (but harder – partition elements)


Download ppt "Heavily based on slides by Lars Arge I/O-Algorithms Thomas Mølhave Spring 2012 February 9, 2012."

Similar presentations


Ads by Google