Gerth Stølting Brodal Universitets-Samvirket Århus, Statsbiblioteket, Århus, November 16, 2010 Udfordringer ved håndtering af massive datamængder: Forskingen.

Slides:



Advertisements
Similar presentations
Time-Space Trade-Offs for 2D Range Minimum Queries Gerth Stølting BrodalPooya Davoodi Aarhus University S. Srinivasa Rao Seoul National University 18 th.
Advertisements

Kandidatorientering, April 20, 2012 Algorithms and Data Structures.
Lars Arge 1/43 Big Terrain Data Analysis Algorithms in the Field Workshop SoCG June 19, 2012 Lars Arge.
Lars Arge 1/13 Efficient Handling of Massive (Terrain) Datasets Lars Arge A A R H U S U N I V E R S I T E T Department of Computer Science.
Kandidatorientering, November 1, 2010 Algorithms and Data Structures.
MADALGO ― Center for Massive Data Algorithmics MADALGO is a major new basic research center funded by The Danish National Research Foundation initially.
Kandidatorientering, October 28, 2011 Algorithms and Data Structures.
Hierarchy-conscious Data Structures for String Analysis Carlo Fantozzi PhD Student (XVI ciclo) Bioinformatics Course - June 25, 2002.
Lasse Deleuran 1/37 Homotopic Polygonal Line Simplification Lasse Deleuran PhD student.
I/O-Algorithms Lars Arge January 31, Lars Arge I/O-algorithms 2 Random Access Machine Model Standard theoretical model of computation: –Infinite.
Time-Space Trade-Offs for 2D Range Minimum Queries
Computer Science Day 2008 Algorithms and Data Structures for Faulty Memory Gerth Stølting Brodal Department of Computer Science, University of Aarhus,
External-Memory and Cache-Oblivious Algorithms: Theory and Experiments Gerth Stølting Brodal Oberwolfach Workshop on ”Algorithm Engineering”, Oberwolfach,
I/O-Algorithms Lars Arge Spring 2009 February 2, 2009.
I/O-Algorithms Lars Arge Spring 2009 January 27, 2009.
I/O-Algorithms Lars Arge Spring 2007 January 30, 2007.
I/O-Algorithms Lars Arge Aarhus University February 16, 2006.
I/O-Algorithms Lars Arge Aarhus University February 7, 2005.
I/O-Algorithms Lars Arge Aarhus University February 6, 2007.
I/O-Algorithms Lars Arge Spring 2006 February 2, 2006.
Lars Arge 1/14 A A R H U S U N I V E R S I T E T Department of Computer Science Efficient Handling of Massive (Terrain) Datasets Professor Lars Arge University.
I/O-Algorithms Lars Arge Aarhus University February 9, 2006.
I/O-Algorithms Lars Arge Aarhus University February 14, 2008.
Massive Data Algorithmics Faglig Dag, January 17, 2008 Gerth Stølting Brodal University of Aarhus Department of Computer Science.
Comparison Based Dictionaries: Fault Tolerance versus I/O Efficiency Gerth Stølting Brodal Allan Grønlund Jørgensen Thomas Mølhave University of Aarhus.
Chapter 4 Parallel Sort and GroupBy 4.1Sorting, Duplicate Removal and Aggregate 4.2Serial External Sorting Method 4.3Algorithms for Parallel External Sort.
IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.
Virtual Memory and Paging J. Nelson Amaral. Large Data Sets Size of address space: – 32-bit machines: 2 32 = 4 GB – 64-bit machines: 2 64 = a huge number.
Cache Oblivious Search Trees via Binary Trees of Small Height
Flow modeling on grid terrains. Why GIS?  How it all started.. Duke Environmental researchers: computing flow accumulation for Appalachian Mountains.
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
Lars Arge 1/12 Lars Arge. 2/12  Pervasive use of computers and sensors  Increased ability to acquire/store/process data → Massive data collected everywhere.
Hinrich Schütze and Christina Lioma Lecture 4: Index Construction
I/O-Algorithms Lars Arge Spring 2008 January 31, 2008.
Retreat Sandbjerg, November 19-20, Homework Session 16:30 Homework handout 16:40 Homework 19:00 Dinner 21:00 Homework presentations.
Algorithms for Information Retrieval Prologue. References Managing gigabytes A. Moffat, T. Bell e I. Witten, Kaufmann Publisher A bunch of scientific.
TerraStream: From Elevation Data to Watershed Hierarchies Thursday, 08 November 2007 Andrew Danner (Swarthmore), T. Moelhave (Aarhus), K. Yi (HKUST), P.
The Study of Computer Science Chapter 0 Intro to Computer Science CS1510, Section 2.
I/O-Algorithms Lars Arge Fall 2014 August 28, 2014.
Heavily based on slides by Lars Arge I/O-Algorithms Thomas Mølhave Spring 2012 February 9, 2012.
Overview of Computing. Computer Science What is computer science? The systematic study of computing systems and computation. Contains theories for understanding.
The Study of Computer Science Chapter 0 Intro to Computer Science CS1510.
Algorithms and Data Structures Gerth Stølting Brodal Computer Science Day, Department of Computer Science, Aarhus University, May 25, 2012.
Bin Yao Spring 2014 (Slides were made available by Feifei Li) Advanced Topics in Data Management.
X-Stream: Edge-Centric Graph Processing using Streaming Partitions
Analysis of Algorithms
1  2004 Morgan Kaufmann Publishers Multilevel cache Used to reduce miss penalty to main memory First level designed –to reduce hit time –to be of small.
IT253: Computer Organization
CSE332: Data Abstractions Lecture 8: Memory Hierarchy Tyler Robison Summer
External Storage Primary Storage : Main Memory (RAM). Secondary Storage: Peripheral Devices –Disk Drives –Tape Drives Secondary storage is CHEAP. Secondary.
Algorithms and Data Structures Gerth Stølting Brodal Computer Science Day, Department of Computer Science, Aarhus University, May 25, 2012.
Fast Crash Recovery in RAMCloud. Motivation The role of DRAM has been increasing – Facebook used 150TB of DRAM For 200TB of disk storage However, there.
CPSC 404, Laks V.S. Lakshmanan1 External Sorting Chapter 13: Ramakrishnan & Gherke and Chapter 2.3: Garcia-Molina et al.
Flow Modeling on Massive Grids Laura Toma, Rajiv Wickremesinghe with Lars Arge, Jeff Chase, Jeff Vitter Pat Halpin, Dean Urban in collaboration with.
Equivalence Between Priority Queues and Sorting in External Memory
ProblemAssumption and PreliminariesAlgorithm  How does the water flow and depressions fill during non-uniform rain over a terrain?  Can we efficiently.
The Sort Benchmark AlgorithmsSolid State Disks External Memory Multiway Mergesort  Phase 1: Run Formation  Phase 2: Merge Runs  Careful parameter selection.
Lecture 1: Basic Operators in Large Data CS 6931 Database Seminar.
COMP 5704 Project Presentation Parallel Buffer Trees and Searching Cory Fraser School of Computer Science Carleton University, Ottawa, Canada
Internal Memory Pointer MachineRandom Access MachineStatic Setting Data resides in records (nodes) that can be accessed via pointers (links). The priority.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
The Sort Benchmark AlgorithmsSolid State Disks External Memory Multiway Mergesort  Phase 1: Run Formation  Phase 2: Merge Runs  Careful parameter selection.
Digital Terrain Analysis for Massive Grids
Advanced Topics in Data Management
Thesis Preparation Algorithms Gerth Stølting Brodal.
2.C Memory GCSE Computing Langley Park School for Boys.
8th Workshop on Massive Data Algorithms, August 23, 2016
CS246: Search-Engine Scale
CSE 332: Data Abstractions Memory Hierarchy
Presentation transcript:

Gerth Stølting Brodal Universitets-Samvirket Århus, Statsbiblioteket, Århus, November 16, 2010 Udfordringer ved håndtering af massive datamængder: Forskingen ved Grundforskningscenteret for Massive Data Algortihmics

AU 1983 September 1989 August 1993 January 1994 November M.Sc January PhD 2004 April Associate Professor 1969 September 1998 August MPII April AU M.Sc. PhD Post DocFaculty AU Erik Meineche Schmidt Kurt Mehlhorn Gerth Stølting Brodal

Outline of Talk – who, where, what ? – reseach areas External memory algorithmics – models – searching and sorting Flow simulation Fault-tolerant searching

– Where?

 Center of  Lars Arge, Professor, Centerleader  Gerth S. Brodal, Associate Professor  5 Post Docs, 10 PhD students, 4 TAP  Total budget for 5 years ca. 60 million DKR Arge BrodalMehlhorn MeyerDemaine Indyk AUMITMPII Frankfurt

Faculty Lars Arge Gerth Stølting Brodal Researchers Henrik Blunck Brody Sandel Nodari Sitchinava Elad Verbin Qin Zhang PhD Students Lasse Kosetski DeleuranJakob Truelsen Freek van Walderveen Kostas Tsakalidis Casper Kejlberg-RasmussenMark Greve Kasper Dalgaard LarsenMorten Revsbæk Jesper Erenskjold MoeslundPooya Davoodi

”3+5” 8. år PhD Part B 7. år 6. år PhD Part A 5. år MSc 4. år 3. år 2. år Bachelor 1. år 00’erne PhD AU ”5+3” 8. år 7. år Licentiat 6. år (PhD) 5. år 4. år 3. år MSc 2. år 1. år 80’erne ”4+4” 8. år PhD Part B 7. år 6. år PhD Part A 5. år MSc 4. år 3. år 2. år Bachelor 1. år 90’erne merit

PhD MADALGO ”3+5” 8. år PhD 7. år Part B 6. år PhD Part A 5. år MSc 4. år 3. år 2. år Bachelor 1. år Kasper Morten, Pooya, Freek Mark, Jakob, Lasse, Kostas Casper 6 months abroad merit

High level objectives – Advance algorithmic knowledge in “massive data” processing area – Train researchers in world-leading international environment – Be catalyst for multidisciplinary/industry collaboration Building on – Strong international team – Vibrant international environment (focus on people)

Pervasive use of computers and sensors Increased ability to acquire/store/process data → Massive data collected everywhere Society increasingly “data driven” → Access/process data anywhere any time Nature special issues – 2/06: “2020 – Future of computing” – 9/08: “BIG DATA Scientific data size growing exponentially, while quality and availability improving Paradigm shift: Science will be about mining data → Computer science paramount in all sciences Massive Data Obviously not only in sciences:  Economist 02/10:  From 150 Billion Gigabytes five years ago to 1200 Billion today  Managing data deluge difficult; doing so will transform business/public life

Example: Massive Terrain Data New technologies: Much easier/cheaper to collect detailed data – Previous ‘manual’ or radar based methods − Often 30 meter between data points − Sometimes 10 meter data available – New laser scanning methods (LIDAR) − Less than 1 meter between data points − Centimeter accuracy (previous meter) Denmark ~ 2 million points at 30 meter (<1GB) ~ 18 billion points at 1 meter (>1TB)

Algorithm Inadequacy Algorithms: Problem solving “recipies” Importance of scalability/efficiency → Algorithmics core computer science area Traditional algorithmics: Transform input to output using simple machine model Inadequate with e.g. – Massive data – Small/diverse devices – Continually arriving data → Software inadequacies! 2 n n 3 n 2 n·log n n

Cache Oblivious Algorithms Streaming Algorithms Algorithm Engineering I/O Efficient Algorithms

The problem... input size running time Normal algorithm I/O-efficient algorithm bottleneck = memory size

Memory Hierarchies Processor L1L2 A R M L3Disk bottleneck increasing access times and memory sizes CPU

Memory Hierarkies vs. Running Time input size running time L1 RAM L2L3

Disk Mechanics “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer) I/O is often bottleneck when handling massive datasets Disk access is 10 7 times slower than main memory access! Disk systems try to amortize large access time transferring large contiguous blocks of data Need to store and access data to take advantage of blocks ! I/O-efficient algorithms Move as few disk blocks as possible to solve given problem !

Memory Access Times LatencyRelative to CPU Register0.5 ns1 L1 cache0.5 ns1-2 L2 cache3 ns2-7 DRAM150 ns TLB500+ ns Disk10 ms10 7

I/O-Efficient Algorithms Matter Example: Traversing linked list (List ranking) – Array size N = 10 elements – Disk block size B = 2 elements – Main memory size M = 4 elements (2 blocks) Difference between N and N/B large since block size is large – Example: N = 256 x 10 6, B = 8000, 1ms disk access time  N I/Os take 256 x 10 3 sec = 4266 min = 71 hr  N/B I/Os take 256/8 sec = 32 sec Algorithm 2: N/B=5 I/OsAlgorithm 1: N=10 I/Os

I/O Efficient Scanning N B A O ( N / B ) I/Os

External-Memory Merging Merging k sequences with N elements requires O ( N / B ) IOs (provided k ≤ M / B – 1)

External-Memory Sorting MergeSort uses O ( N/B·log M/B ( N/B)) I/Os Practice number I/Os: 4-6 x scanning input M M Partition into runs Sort each run Merge pass I Merge pass II... Run 1Run 2Run N/M Sorted N Sorted ouput Unsorted input

Energy-Efficient Sorting using Solid State Disks (Bechman, Meyer, Sanders, Siegler 2010) Size [GB] Time [s] Energy [kJ] Rec./JTime [s] Energy [kJ] Rec./JEnergy Saving Factor *2920* Sorting large data sets  Is easily described  Has many applications  Stresses both CPU and the I/O system  Benchmark introduced 1985 Energy Efficiency  Energy (and cooling) is a significant cost factor in data centers  Energy consumption correlates to pollution

B-trees - The Basic Searching Structure  Searches Practice: 4-5 I/Os.... B Search path Internal memory  Repeated searching Practice: 1-2 I/Os

B-trees  Searches O(log B N) I/Os.... B Search path Internal memory  Updates O(log B N) I/Os Best possible ?

B-trees with Buffered Updates.... √B√B Brodal and Fagerberg (2003) B xxxx  N updates cost O(N /√B ∙ log B N) I/Os  Searches cost O(log B N) I/Os Trade-off between search and update times – optimal !

B-trees with Buffered Updates Brodal and Fagerberg (2003)

B-trees with Buffered Updates Experimental Study elements Search time basically unchanged with buffersize Updates 100 times faster Hedegaard (2004)....

Flood Prediction Important Prediction areas susceptible to floods – Due to e.g. raising sea level or heavy rainfall Example: Hurricane Floyd Sep. 15, am 3pm

Detailed Terrain Data Essential Mandø with 2 meter sea-level raise 80 meter terrain model 2 meter terrain model

Surface Flow Modeling Conceptually flow is modeled using two basic attributes – Flow direction: The direction water flows at a point – Flow accumulation: Amount of water flowing through a point 7 am 3pm Hurricane Floyd (September 15, 1999)

Flow accumulation on grid terrain model: – Initially one unit of water in each grid cell – Water (initial and received) distributed from each cell to lowest lower neighbor cell (if existing) – Flow accumulation of cell is total flow through it Note: – Flow accumulation of cell = size of “upstream area” – Drainage network = cells with high flow accumulation Flow Accumulation

Massive Data Problems Commercial systems: – Often very slow – Performance somewhat unpredictable – Cannot handle 2-meter Denmark model Collaboration environmental researchers in late 90’ties – US Appalachian mountains dataset 800x800km at 100m resolution  a few Gigabytes Customized software on ½ GB machine: Appalachian dataset would be Terabytes sized at 1m resolution! 14 days!!

Flow Accumulation Performance Natural algorithm access disk for each grid cell – “Push” flow down the terrain by visiting cells in height order  Problem since cells of same height scattered over terrain Natural to try “tiling” (commercial systems?) – But computation in different tiles not independent

I/O-Efficient Flow Accumulation We developed I/O-optimal algorithms (assessing disk a lot less) Avoiding scattered access by: – Grid storing input: Data duplication – Grid storing flow: “Lazy write” Implementation very efficient – Appalachian Mountains flow accumulation in 3 hours! – Denmark at 2-meter resolution in a few days

Flood Modeling Not all terrain below height h is flooded when water rise to h meters! Theoretically not too hard to compute area flooded when rise to h meters – But no software could do it for Denmark at 2-meter resolution Use of I/O-efficient algorithms  Denmark in a few days Even compute new terrain where terrain below h is flooded when water rise to h

TerraSTREAM Flow/flooding work part of comprehensive software package – TerraSTREAM: Whole pipeline of terrain data processing software TerraSTREAM used on full 2 meter Denmark model (~ 25 billion points, ~ 1.5 TB) – Terrain model (grid) from LIDAR point data – Surface flow modeling: Flow directions and flow accumulation – Flood modeling

Interdisciplinary Collaboration Flow/flooding work example of center interdisciplinary collaboration – Flood modeling and efficient algorithms in biodiversity modeling – Allow for used of global and/or detailed geographic data Brings together researchers from – Biodiversity – Ecoinformatics – Algorithms – Datamining –…–…

A bit in memory changed value because of e.g. background radiation, system heating,...

"You have to provide reliability on a software level. If you're running 10,000 machines, something is going to die every day." ― fellow Jeff Dean

Binary Search for 16 O(log N) comparisons

soft memory error Binary Search for = =8 Requirement: If the search key ocours in the array as an uncorrupted value, then we should report a match !

Where is Laurits ?

If at most 4 faulty answers then Laurits is somewhere here

Faulty-Memory RAM Model Finocchi and Italiano, STOC’04  Content of memory cells can get corrupted  Corrupted and uncorrupted content cannot be distinguished  O(1) safe registers  Assumption: At most δ corruptions

Faulty-Memory RAM: Searching 16? Problem? High confidenceLow confidence

Faulty-Memory RAM: Searching When are we done (δ=3)? Contradiction, i.e. at least one fault If range contains at least δ+1 and δ+1 then there is at least one uncorrupted and, i.e. x must be contained in the range

If verification fails → contradiction, i.e. ≥1 memory-fault → ignore 4 last comparisons → backtrack one level of search Faulty-Memory RAM: Θ(log N + δ) Searching Brodal, Fagerberg, Finocchi, Grandoni, Italiano, Jørgensen, Moruz, Mølhave, ESA’07

Summary Basic research center in Aarhus – Organization – PhD education Examples of research – Theoretical external memory algorithmics – Practical (flow and flooding simulation)

Tau ∙ Jërë-jëf ∙ Tashakkur ∙ S.aHHa ∙ Sag olun Giihtu ∙ Djakujo ∙ Dâkujem vám ∙ Thank you Tesekkür ederim ∙ To-siä ∙ Merci ∙ Tashakur Taing ∙ Dankon ∙ Efharisto´ ∙ Shukriya ∙ Kiitos Dhanyabad ∙ Rakhmat ∙ Trugarez ∙ Asante Köszönöm ∙ Blagodarya ∙ Dziekuje ∙ Eskerrik asko Grazie ∙ Tak ∙ Bayarlaa ∙ Miigwech ∙ Dank u Spasibo ∙ Dêkuji vám ∙ Ngiyabonga ∙ Dziakuj Obrigado ∙ Gracias ∙ A dank aych ∙ Salamat Takk ∙ Arigatou ∙ Tack ∙ Tänan ∙ Aciu Korp kun kah ∙ Multumesk ∙ Terima kasih ∙ Danke Rahmat ∙ Gratias ∙ Mahalo ∙ Dhanyavaad Paldies ∙ Faleminderit ∙ Diolch ∙ Hvala Kam-sa-ham-ni-da ∙ Xìe xìe ∙ Mèrcie ∙ Dankie Thank You Gerth Stølting Brodal