Massive Data Algorithmics Faglig Dag, January 17, 2008 Gerth Stølting Brodal University of Aarhus Department of Computer Science.

Slides:



Advertisements
Similar presentations
Fabián E. Bustamante, Spring 2007
Advertisements

Lars Arge 1/43 Big Terrain Data Analysis Algorithms in the Field Workshop SoCG June 19, 2012 Lars Arge.
Lars Arge 1/13 Efficient Handling of Massive (Terrain) Datasets Lars Arge A A R H U S U N I V E R S I T E T Department of Computer Science.
MADALGO ― Center for Massive Data Algorithmics MADALGO is a major new basic research center funded by The Danish National Research Foundation initially.
CMPE 421 Parallel Computer Architecture MEMORY SYSTEM.
Modeling & Analyzing Massive Terrain Data Sets (STREAM Project) Pankaj K. Agarwal Workshop on Algorithms for Modern Massive Data Sets.
I/O-Algorithms Lars Arge January 31, Lars Arge I/O-algorithms 2 Random Access Machine Model Standard theoretical model of computation: –Infinite.
Gerth Stølting Brodal Cache-Oblivious Algorithms - A Unified Approach to Hierarchical Memory Algorithms Current Trends in Algorithms, Complexity Theory,
External-Memory and Cache-Oblivious Algorithms: Theory and Experiments Gerth Stølting Brodal Oberwolfach Workshop on ”Algorithm Engineering”, Oberwolfach,
I/O-Algorithms Lars Arge Spring 2009 February 2, 2009.
Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
I/O-Algorithms Lars Arge Spring 2009 January 27, 2009.
I/O-Algorithms Lars Arge Spring 2007 January 30, 2007.
I/O-Algorithms Lars Arge Aarhus University February 16, 2006.
I/O-Algorithms Lars Arge Aarhus University February 7, 2005.
I/O-Algorithms Lars Arge Aarhus University February 6, 2007.
I/O-Algorithms Lars Arge Spring 2006 February 2, 2006.
Design Patterns for Efficient Graph Algorithms in MapReduce Jimmy Lin and Michael Schatz University of Maryland Tuesday, June 29, 2010 This work is licensed.
Lars Arge 1/14 A A R H U S U N I V E R S I T E T Department of Computer Science Efficient Handling of Massive (Terrain) Datasets Professor Lars Arge University.
I/O-Algorithms Lars Arge Aarhus University February 9, 2006.
I/O-Algorithms Lars Arge Aarhus University February 14, 2008.
IR Paolo Ferragina Dipartimento di Informatica Università di Pisa.
Chapter 2: Impact of Machine Architectures What is the Relationship Between Programs, Programming Languages, and Computers.
Flow modeling on grid terrains. Why GIS?  How it all started.. Duke Environmental researchers: computing flow accumulation for Appalachian Mountains.
From Elevation Data to Watershed Hierarchies Pankaj K. Agarwal Duke University Supported by ARO W911NF
Lars Arge 1/12 Lars Arge. 2/12  Pervasive use of computers and sensors  Increased ability to acquire/store/process data → Massive data collected everywhere.
I/O-Algorithms Lars Arge Spring 2008 January 31, 2008.
From Topographic Maps to Digital Elevation Models Daniel Sheehan IS&T Academic Computing Anne Graham MIT Libraries.
1 A Novel Page-Based Data Structure for Interactive Walkthroughs Behzad Sajadi Yan Huang Pablo Diaz-Gutierrez Sung-Eui Yoon M. Gopi.
Systems I Locality and Caching
Algorithms for Information Retrieval Prologue. References Managing gigabytes A. Moffat, T. Bell e I. Witten, Kaufmann Publisher A bunch of scientific.
1 Building National Cyberinfrastructure Alan Blatecky Office of Cyberinfrastructure EPSCoR Meeting May 21,
TerraStream: From Elevation Data to Watershed Hierarchies Thursday, 08 November 2007 Andrew Danner (Swarthmore), T. Moelhave (Aarhus), K. Yi (HKUST), P.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
Lecture 11: DMBS Internals
I/O-Algorithms Lars Arge Fall 2014 August 28, 2014.
Heavily based on slides by Lars Arge I/O-Algorithms Thomas Mølhave Spring 2012 February 9, 2012.
ECE 526 – Network Processing Systems Design Network Processor Architecture and Scalability Chapter 13,14: D. E. Comer.
Bin Yao Spring 2014 (Slides were made available by Feifei Li) Advanced Topics in Data Management.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Ohio State University Department of Computer Science and Engineering 1 Cyberinfrastructure for Coastal Forecasting and Change Analysis Gagan Agrawal Hakan.
1 Advance Computer Architecture CSE 8383 Ranya Alawadhi.
1  2004 Morgan Kaufmann Publishers Multilevel cache Used to reduce miss penalty to main memory First level designed –to reduce hit time –to be of small.
Benchmarking MapReduce-Style Parallel Computing Randal E. Bryant Carnegie Mellon University.
CSE332: Data Abstractions Lecture 8: Memory Hierarchy Tyler Robison Summer
Big Data Analytics Large-Scale Data Management Big Data Analytics Data Science and Analytics How to manage very large amounts of data and extract value.
Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence Wednesday, March 29, 2000.
INNOV-10 Progress® Event Engine™ Technical Overview Prashant Thumma Principal Software Engineer.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
Flow Modeling on Massive Grids Laura Toma, Rajiv Wickremesinghe with Lars Arge, Jeff Chase, Jeff Vitter Pat Halpin, Dean Urban in collaboration with.
1 Chapter Seven CACHE MEMORY AND VIRTUAL MEMORY. 2 SRAM: –value is stored on a pair of inverting gates –very fast but takes up more space than DRAM (4.
A Memory-hierarchy Conscious and Self-tunable Sorting Library To appear in 2004 International Symposium on Code Generation and Optimization (CGO ’ 04)
Lecture 1: Basic Operators in Large Data CS 6931 Database Seminar.
DMBS Internals I February 24 th, What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
1 Cache-Oblivious Query Processing Bingsheng He, Qiong Luo {saven, Department of Computer Science & Engineering Hong Kong University of.
Internal Memory Pointer MachineRandom Access MachineStatic Setting Data resides in records (nodes) that can be accessed via pointers (links). The priority.
What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently and safely. Provide.
1 Lecture 16: Data Storage Wednesday, November 6, 2006.
Supercomputing versus Big Data processing — What's the difference?
Big Data Analytics and HPC Platforms
Memory COMPUTER ARCHITECTURE
Lecture 16: Data Storage Wednesday, November 6, 2006.
Fast Kernel-Density-Based Classification and Clustering Using P-Trees
Digital Terrain Analysis for Massive Grids
Advanced Topics in Data Management
External-Memory and Cache-Oblivious Algorithms: Theory and Experiments
CSE 332: Data Abstractions Memory Hierarchy
Presentation transcript:

Massive Data Algorithmics Faglig Dag, January 17, 2008 Gerth Stølting Brodal University of Aarhus Department of Computer Science

Gerth Stølting Brodal 2 The core problem... data size running time Normal algorithm I/O-efficient algorithm Main memory size

Gerth Stølting Brodal 3 Outline of Talk  Examples of massive data  Hierarchical memory  Basic I/O efficient techniques  MADALGO center presentation  A MADALGO project

Gerth Stølting Brodal 4 Massive Data Examples  Massive data being acquired/used everywhere  Storage management software is billion-$ industry  Phone: AT&T 20TB phone call database, wireless tracking  Consumer: WalMart 70TB database, buying patterns  WEB: Google index 8 billion web pages  Bank: Danske Bank 250TB DB2  Geography: NASA satellites generate Terrabytes each day

Gerth Stølting Brodal 5 Massive Data Examples  Society will become increasingly “data driven”  Sensors in building, cars, phones, goods, humans  More networked devices that both acquire and process data → Access/process data anywhere any time  Nature 2/06 issue highlight trends in sciences: “2020 – Future of computing”  Exponential growth of scientific data  Due to e.g. large experiments, sensor networks, etc  Paradigm shift: Science will be about mining data → Computer science paramount in all sciences  Increased data availability: “nano-technology-like” opportunity

Gerth Stølting Brodal 6 Where does the slowdown come from ? data size running time

Gerth Stølting Brodal 7 Hierarchical Memory Basics CPUL1L2 A R M L3Disk Bottleneck Increasing access time and space

Gerth Stølting Brodal 8 Memory Hierarchy vs Running Time data size running time L1 RAM L2L3

Gerth Stølting Brodal 9 Memory Access Times LatencyRelative to CPU Register0.5 ns1 L1 cache0.5 ns1-2 L2 cache3 ns2-7 DRAM150 ns TLB500+ ns Disk10 ms10 7 Increasing

Gerth Stølting Brodal 10 Disk Mechanics “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)  I/O is often bottleneck when handling massive datasets  Disk access is 10 7 times slower than main memory access!  Disk systems try to amortize large access time transferring large contiguous blocks of data  Need to store and access data to take advantage of blocks !

Gerth Stølting Brodal 11 The Algorithmic Challenge  Modern hardware is not uniform — many different parameters  Number of memory levels  Cache sizes  Cache line/disk block sizes  Cache associativity  Cache replacement strategy  CPU/BUS/memory speed...  Programs should ideally run for many different parameters  by knowing many of the parameters at runtime, or  by knowing few essential parameters, or  ignoring the memory hierarchies  Programs are executed on unpredictable configurations  Generic portable and scalable software libraries  Code downloaded from the Internet, e.g. Java applets  Dynamic environments, e.g. multiple processes Practice

Gerth Stølting Brodal 12 Basic Algorithmic I/O Efficient Techniques  Scanning  Sorting  Recursion  B-trees

Gerth Stølting Brodal 13 I/O Efficient Scanning sum = 0 for i = 1 to N do sum = sum + A[i] sum = 0 for i = 1 to N do sum = sum + A[i] N B A O ( N / B ) I/Os

Gerth Stølting Brodal 14 External-Memory Merging Merging k sequences with N elements requires O ( N / B ) IOs (provided k ≤ M / B – 1)

Gerth Stølting Brodal 15 External-Memory Sorting  MergeSort uses O ( N/B·log M/B ( N/B)) I/Os  Practice number I/Os: 4-6 x scanning input M M Partition into runs Sort each run Merge pass I Merge pass II... Run 1Run 2Run N/M Sorted N Sorted ouput Unsorted input

Gerth Stølting Brodal 16 B-trees - The Basic Searching Structure  Searches Practice: 4-5 I/Os.... B Search path Internal memory  Repeated searching Practice: 1-2 I/Os !!! Bottleneck !!! Use sorting instead of B-tree (if possible)

Gerth Stølting Brodal 17

Gerth Stølting Brodal 18 About MADALGO (AU)  Center of  Lars Arge, Professor  Gerth S. Brodal, Assoc. Prof.  3 PostDocs, 9 PhD students, 5 MSc students  Total 5 year budget ~60 million kr (8M Euro)  High level objectives  Advance algorithmic knowledge in massive data processing area  Train researchers in world-leading international environment  Be catalyst for multidisciplinary collaboration Center Leader Prof. Lars Arge

Gerth Stølting Brodal 19 Center Team  International core team of algorithms researchers  Including top ranked US and European groups  Leading expertise in focus areas  AU: I/O, cache and algorithm engineering  MPI: I/O (graph) and algorithm engineering  MIT: Cache and streaming AU MPI MIT Arge Brodal Mehlhorn Meyer Demaine Indyk

Gerth Stølting Brodal 20 Center Collaboration  COWI, DHI, DJF, DMU, Duke, NSCU  Support from Danish Strategic Research Counciland US Army Research Office  Software platform for Galileo GPS  Various Danish academic/industry partners  Support from Danish High-Tech Foundation  European massive data algorithmics network  8 main European groups in area

Gerth Stølting Brodal 21 MADALGO Focus Areas Cache Oblivious Algorithms Streaming Algorithms Algorithm Engineering I/O Efficient Algorithms

Gerth Stølting Brodal 22 A MADALGO Project

Gerth Stølting Brodal 23 Massive Terrain Data

Gerth Stølting Brodal 24 Terrain Data  New technologies: Much easier/cheaper to collect detailed data  Previous ‘manual’ or radar based methods  Often 30 meter between data points  Sometimes 10 meter data available  New laser scanning methods (LIDAR)  Less than 1 meter between data points  Centimeter accuracy (previous meter) Denmark  ~2 million points at 30 meter (<<1GB)  ~18 billion points at 1 meter (>>1TB)  COWI (and other) now scanning DK  NC scanned after Hurricane Floyd in 1999

Gerth Stølting Brodal 25 Hurricane Floyd Sep. 15, pm7 am

Gerth Stølting Brodal 26 Denmark Flooding +1 meter +2 meter

Gerth Stølting Brodal 27  Conceptually flow is modeled using two basic attributes  Flow direction: The direction water flows at a point  Flow accumulation: Amount of water flowing through a point  Flow accumulation used to compute other hydrological attributes: drainage network, topographic convergence index… Example: Terrain Flow

Gerth Stølting Brodal 28 Example: Flow on Terrains  Modeling of water flow on terrains has many important applications  Predict location of streams  Predict areas susceptible to floods  Compute watersheds  Predict erosion  Predict vegetation distribution  ……

Gerth Stølting Brodal 29 Terrain Flow Accumulation  Collaboration with environmental researchers at Duke University  Appalachian mountains dataset:  800x800km at 100m resolution  a few Gigabytes  On ½GB machine:  ArcGIS:  Performance somewhat unpredictable  Days on few gigabytes of data  Many gigabytes of data…..  Appalachian dataset would be Terabytes sized at 1m resolution 14 days!!

Gerth Stølting Brodal 30 Terrain Flow Accumulation: TerraFlow  We developed theoretically I/O-optimal algorithms  TPIE implementation was very efficient  Appalachian Mountains flow accumulation in 3 hours!  Developed into comprehensive software package for flow computation on massive terrains: TerraFlow  Efficient: times faster than existing software  Scalable: >1 billion elements!  Flexible: Flexible flow modeling (direction) methods  Extension to ArcGIS

Gerth Stølting Brodal 31 Examples of Ongoing ¨Terrain Work  Terrain modeling, e.g  “Raw” LIDAR to point conversion (LIDAR point classification) (incl feature, e.g. bridge, detection/removal)  Further improved flow and erosion modeling (e.g. carving)  Contour line extraction (incl. smoothing and simplification)  Terrain (and other) data fusion (incl format conversion)  Terrain analysis, e.g  Choke point, navigation, visibility, change detection,…  Major grand goal:  Construction of hierarchical (simplified) DEM where derived features (water flow, drainage, choke points) are preserved/consistent

Gerth Stølting Brodal 32 Summary  Massive datasets appear everywhere  Leads to scalability problems  Due to hierarchical memory and slow I/O  I/O-efficient algorithms greatly improves scalability  New major research center will focus on massive data algorithms issues