
SciHadoop: Processing Array-Based Scientific Data in Hadoop MapReduce

For further information please contact Joe Buck. This work was supported by the Systems Research Lab at the University of California, Santa Cruz, the Department of Energy, and the HPC-5 group at Los Alamos National Laboratory.

Introduction

Scientific data is growing rapidly in size. Examples include:
1) The Large Synoptic Survey Telescope (LSST), which will generate 30 TB per night
2) CERN's Large Hadron Collider (LHC), which generates around 15 PB per year
3) The AR4 climate model (2007): 12 TB
4) The AR5 climate model (2013): more than 300 TB

This growth is creating problems for data management, storage, and efficient processing. Such data is often stored in highly structured, array-based binary file formats (HDF5, NetCDF-3, etc.). MapReduce, and specifically the Hadoop implementation, has become a popular framework for large-scale, parallel data processing. However, using MapReduce to process scientific data stored in array-based binary formats is not straightforward: the interface to data in these formats is a logical model expressed as n-dimensional arrays, whereas Hadoop's interface is based on a byte stream.

Figure 1. The MapReduce process takes a logical description of the data, integrates it with a view of the physical layout of the same data, and generates an execution plan for the query that maximizes data locality.

Data Model

An array-based data model maps an n-dimensional shape onto a byte stream for storage in a file, as can be seen in Figure 2.

Partitioning

One of the goals of MapReduce systems is to reduce remote data access, thereby conserving network resources and accelerating the processing of the data. Ensuring that a large percentage of reads are done locally requires efficient partitioning of the input data. We have implemented three partitioning methods that provide different trade-offs:

Proportional data placement (total logical space / units desired)
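Both the array data model and the partitioning problem hinge on translating logical array coordinates into offsets in the underlying byte stream. The following is a minimal sketch of that translation, assuming a dense row-major (C-order) layout; the function is illustrative, not SciHadoop's actual code.

```python
# Toy sketch (not SciHadoop's implementation): mapping an
# n-dimensional coordinate onto an offset in a flat byte stream,
# assuming a dense row-major (C-order) array layout.

def byte_offset(coord, shape, elem_size):
    """Return the byte offset of `coord` in a row-major array of `shape`."""
    offset = 0
    for c, dim in zip(coord, shape):
        if not 0 <= c < dim:
            raise IndexError("coordinate out of bounds")
        offset = offset * dim + c  # Horner-style linearization
    return offset * elem_size

# A 4 x 5 grid of 8-byte floats: element (2, 3) lives at
# (2 * 5 + 3) * 8 = 104 bytes into the stream.
print(byte_offset((2, 3), (4, 5), 8))  # 104
```

Inverting this mapping (going from a byte range back to the logical coordinates it covers) is exactly what the Physical-to-Logical partitioning strategy relies on.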
By resolving the logical interface to the byte-stream interface, it becomes possible not only to execute MapReduce programs over scientific data, but also to apply several optimizations. Traditionally, a MapReduce program creates tasks by analyzing the input file(s) and assigning regions of their byte streams to map tasks, which are then assigned to nodes.

Joe Buck, Noah Watkins, Kleoni Ioannidou, Carlos Maltzahn, Scott Brandt (UCSC – Systems Research Lab); Jeff LeFevre, Neoklis Polyzotis (UCSC – Database Systems Group); John Bent, Meghan Wingate, Gary Grider (LANL)

Figure 2. An example of a 2-dimensional data set on the left, and the software stack that maps accesses of the array to an underlying byte stream. The shaded area on the left represents a specific geographical sub-region within the larger data set.

Queries

The purpose of the MapReduce paradigm is to apply a function over a set of data. When processing scientific data, it is often desirable to process only the part of the total input that is required to satisfy the query. To accomplish this, our query uses a constraining space to specify the contiguous portion of the data set to process. Array-based data typically holds values of a single type, say floats representing observed temperatures, while the various dimensions carry information describing the recorded data. In Figure 2, the two dimensions represent the latitude and longitude at which each temperature was recorded. A third dimension representing time could easily be added, producing a data set that is a time series of temperature measurements at specified latitude/longitude coordinates.

Figure 3. The MapReduce process takes a logical description of the data, integrates it with a view of the physical layout of the same data, and generates an execution plan for the query that maximizes data locality.

A Slab Extraction function is then applied to the constraining space.
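As a rough illustration of how a constraining space and a slab extraction shape yield Input Sets, the sketch below tiles an axis-aligned region with fixed-shape slabs. The function name and shapes are assumptions for illustration, not SciHadoop's real API.

```python
# Hedged sketch of the query model: a constraining space limits the
# query to a contiguous sub-region, and a slab extraction shape cuts
# that region into Input Sets that can be processed independently.

def slab_extraction(constraint, slab_shape):
    """Yield the corner coordinate of each slab tiling the constraining space.

    `constraint` is a list of (start, end) pairs per dimension (end
    exclusive); `slab_shape` gives the slab extent per dimension.
    """
    def walk(dim, corner):
        if dim == len(constraint):
            yield tuple(corner)
            return
        start, end = constraint[dim]
        for c in range(start, end, slab_shape[dim]):
            yield from walk(dim + 1, corner + [c])
    yield from walk(0, [])

# A 4 x 6 lat/lon constraining space tiled by 2 x 3 slabs -> 4 Input Sets.
corners = list(slab_extraction([(0, 4), (0, 6)], (2, 3)))
print(corners)  # [(0, 0), (0, 3), (2, 0), (2, 3)]
```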
Given that dimensions represent attributes such as longitude and latitude, the slab extraction shape dictates which data points will be processed together. For example, if the function is meant to apply to measurements from fixed geographical areas, then the slab extraction function produces Input Sets with that fixed shape on the latitude and longitude dimensions. Each Input Set is then processed by the function (f in Figure 3) to produce the Result Set.

Physical-to-Logical: search the logical space, using sampling, to find where physical boundaries fall, and produce Input Sets that map precisely onto the underlying byte stream.

Chunking & Grouping: use smaller shapes, place them via sampling, and aggregate all shapes that fall mostly within the same Input Set.

Holistic Functions

This class of functions, which requires that all elements be processed at the same time, is not readily amenable to efficient execution as a MapReduce program. Two optimizations allow holistic functions to be processed more efficiently by SciHadoop: the Opportunistic Holistic Combiner and Holistic-aware Partitioning.

Holistic Combiner: determines whether all of the elements required by the function happen to be present at a map node and, if so, applies the function there.

Holistic-aware Partitioning: adjusts partitions to increase the chance that all elements needed by the function are present at a single mapper (thereby increasing the efficacy of the Holistic Combiner).

Figure 5. On the left, data for the desired partition (light gray) is split across two mappers, preventing the Holistic Combiner from executing; all data is sent across the network and stored on both the map and reduce nodes as intermediate data. On the right, the partitions are adjusted by Holistic-aware Partitioning; the Holistic Combiner can now execute, greatly reducing the data passed through the system.

Figure 4.
Three different partitioning strategies.

Results

Experiments were executed on a 30-node cluster in which each node had two 1.8 GHz dual-core Opterons, 8 GB of RAM, and four 250 GB SATA hard drives; all nodes were interconnected via gigabit Ethernet on a single switch. The sample NetCDF file was extrapolated from an environmental dataset and stored 132 GB of data in a variable representing wind-pressure measurements across four dimensions (time, latitude, longitude, elevation). The query calculated the median pressure across two time steps for a fixed latitude x longitude area and a fixed range of elevation.

The table above shows several interesting results:
1) Chunking & Grouping and Physical-to-Logical partitioning both greatly increase read locality.
2) The Holistic Combiner greatly reduced the amount of intermediate data generated (20x) and, in turn, the execution time of the query (5-7x).
3) Read locality is not the only useful metric. Comparing Holistic-aware Partitioning with Chunking & Grouping against Physical-to-Logical (the third-to-last and last tests in the table), the reduction in read locality did not hurt the runtime; rather, the increased efficacy of the Holistic Combiner, visible in the reduction of temporary data, yielded a ~10% reduction in run time.

Future Work

The SciHadoop project is currently being extended to determine how controlling the mapping of data from map tasks to reduce tasks can be leveraged to decrease the amount of intermediate data, reduce network communication, and increase reducer data locality. Work is also being done to leverage structural knowledge of the input to begin reducer execution before all mappers have completed, which provides access to partial results much more quickly and reduces total resource usage during execution.
Additionally, support for the HDF5 file format is being added, and the optimizations made possible by its more flexible internal structure are being explored.
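The experimental query above (a median, which is a holistic function, over a constrained region) can be sketched as a toy, in-memory map/shuffle/reduce. All records, keys, and bounds below are invented for illustration; the real run operated on a 132 GB NetCDF file.

```python
# Toy sketch of the experiment's query: median wind pressure over a
# constraining space, keyed by (time, lat, lon, elevation). Data and
# names are illustrative assumptions, not the actual dataset.

import statistics
from collections import defaultdict

records = [((0, 10, 20, 1), 101.0), ((0, 10, 20, 2), 99.0),
           ((1, 10, 20, 1), 100.0), ((1, 10, 20, 2), 102.0)]

# Map phase: keep only records inside the constraining space
# (two time steps, one lat/lon cell, a fixed elevation range).
def map_phase(recs):
    for (t, lat, lon, elev), value in recs:
        if t in (0, 1) and (lat, lon) == (10, 20) and 1 <= elev <= 2:
            yield (lat, lon), value

# Shuffle: group intermediate values by key. An Opportunistic
# Holistic Combiner could already apply the median here whenever a
# single mapper happened to hold every value for a key.
groups = defaultdict(list)
for key, value in map_phase(records):
    groups[key].append(value)

# Reduce phase: the holistic median runs once per key, over all values.
result = {key: statistics.median(vals) for key, vals in groups.items()}
print(result)  # {(10, 20): 100.5}
```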