A MapReduce Workflow System for Architecting Scientific Data Intensive Applications
By Phuong Nguyen and Milton Halem (phuong3 or halem @umbc.edu)
Introduction
Scientific workflow systems
– Support the integration of components for data acquisition, transformation, and analysis to build complex data-analysis pipelines
– Computation is often used to derive new data from old data
– Many scientific workflow systems are data-driven
– Computational steps are represented as nodes, and the data passed between these tasks is made explicit as edges between them
Advantages of scientific workflows compared to script-based approaches
– Component model
– Data source discovery
– Provenance framework
– Parallelism
Limitations of existing systems
– Little support for data modeling; NetCDF and XML support is limited in Kepler and Triana
– A scientific workflow should tolerate certain changes in the structure of its input data and be easy to modify
– Parallelism is limited
– Optimization is not performed automatically
The MapReduce workflow system
Exploiting data parallelism through the MapReduce programming model, we propose a workflow system for integrating, structuring, and orchestrating MapReduce jobs in scientific data intensive workflows. The system consists of:
– Workflow design tools: a simple workflow design C++ API; an XML design tool will also be supported (a hypothetical sketch of such an API follows below).
– Automatic optimization: a job scheduler and a runtime support system for the Hadoop and Sector/Sphere frameworks.
– Support for data modeling: data parallel processing that scales out linearly with HBase data storage and management (random, real-time read/write access for data intensive applications), plus data transfer tools (for HDF, NetCDF).
– Use case: a climate satellite data intensive processing and analysis application.
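The slides do not show the design API itself, so the following is only a minimal hypothetical sketch of how a two-job MapReduce workflow might be declared through such a C++ API; the names Workflow, Job, Edge, addJob, and addEdge are illustrative assumptions, not the system's actual interface.

```cpp
// Hypothetical sketch of a C++ workflow-design API: jobs are DAG vertices,
// datasets flow along directed edges. Names are illustrative only.
#include <iostream>
#include <string>
#include <vector>

struct Job {
    std::string name;        // MapReduce job identifier
    std::string mapper;      // mapper to run
    std::string reducer;     // reducer to run
};

struct Edge {
    std::string producer;    // job that writes the dataset
    std::string consumer;    // job that reads it
    std::string dataset;     // dataset passed between them
};

class Workflow {
public:
    void addJob(const Job& j)   { jobs_.push_back(j); }
    void addEdge(const Edge& e) { edges_.push_back(e); }

    // Print the DAG; a real system would hand it to the scheduler instead.
    void describe() const {
        for (const auto& e : edges_)
            std::cout << e.producer << " -> " << e.consumer
                      << " via " << e.dataset << "\n";
    }
private:
    std::vector<Job> jobs_;
    std::vector<Edge> edges_;
};

int main() {
    Workflow w;
    w.addJob({"grid_modis", "grid_map", "grid_reduce"});     // gridding step
    w.addJob({"monthly_avg", "avg_map", "avg_reduce"});      // analysis step
    w.addEdge({"grid_modis", "monthly_avg", "gridded_l1b"}); // dataset dependency
    w.describe();
    return 0;
}
```

The describe() call here only prints the DAG; in the proposed system the graph would instead be handed to the job scheduler and runtime support for execution on Hadoop or Sector/Sphere.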
The proposed workflow system
w = (J, E, D), where
– w is the workflow
– J is the set of jobs, the vertices of a DAG
– E is the set of directed edges
– D is the set of datasets; the input dataset I and output dataset O of the workflow w are special types of D
A directed edge j1 => j2 states that job j1 produces a dataset d and job j2 consumes it as input.
Let t_j be the estimated execution time of job j, estimated from the job's previous execution on a unit input of the dataset; for a given job j, the estimated remaining time then scales t_j by the amount of input still to be processed.
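As a worked illustration of the timing model just described, and under the assumption (not shown explicitly on the slide) that the estimated remaining time is the per-unit execution time t_j scaled by the input still to be processed: a job that previously processed 12 GB in 90 s has t_j = 7.5 s/GB, so 36 GB of remaining input gives roughly 270 s. A minimal sketch:

```cpp
// Minimal sketch of the assumed per-unit-input timing model: t_j comes from
// a previous run, and remaining time scales with the input left to process.
// Illustrative only; not the system's actual estimator.
#include <iostream>

// Estimated execution time per unit of input (e.g. seconds per GB),
// measured from a previous execution of job j.
double unitExecutionTime(double prevRuntimeSec, double prevInputUnits) {
    return prevRuntimeSec / prevInputUnits;
}

// Estimated remaining time for job j given how much input is left.
double estimatedRemainingTime(double tJ, double remainingInputUnits) {
    return tJ * remainingInputUnits;
}

int main() {
    double tJ = unitExecutionTime(90.0, 12.0);         // 7.5 s per GB
    std::cout << estimatedRemainingTime(tJ, 36.0)      // 270 s
              << " seconds remaining\n";
    return 0;
}
```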
The MapReduce constructs and examples
MapReduce
– map(k1, v1) -> list(k2, v2)
– reduce(k2, list(v2)) -> list(v3)
Figure 1. An example of the climate data processing and analysis workflow in steps.
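To make the constructs concrete, here is a small framework-free C++ sketch of map(k1, v1) -> list(k2, v2) and reduce(k2, list(v2)) -> list(v3), using a toy gridding-style average (assign observations to grid cells, then average per cell); the cell assignment and data values are invented for illustration and are not the workflow's actual gridding code.

```cpp
// Framework-free sketch of the MapReduce constructs:
// map(k1,v1) -> list(k2,v2) and reduce(k2, list(v2)) -> list(v3).
// Records are (observation id, value) pairs; reduce averages each cell.
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// map(k1, v1) -> list(k2, v2): emit (cell id, value) for one observation.
std::vector<std::pair<int, double>> mapFn(int obsId, double value) {
    int cellId = obsId % 100;               // toy assignment of obs to a cell
    return { {cellId, value} };
}

// reduce(k2, list(v2)) -> list(v3): average all values for one cell.
std::vector<double> reduceFn(int cellId, const std::vector<double>& values) {
    double sum = 0.0;
    for (double v : values) sum += v;
    return { sum / values.size() };
}

int main() {
    // Shuffle phase emulated with an in-memory map from k2 to list(v2).
    std::map<int, std::vector<double>> grouped;
    std::vector<std::pair<int, double>> observations = {
        {1, 250.0}, {101, 254.0}, {2, 230.0} };  // obs 1 and 101 share cell 1

    for (const auto& obs : observations)
        for (const auto& kv : mapFn(obs.first, obs.second))
            grouped[kv.first].push_back(kv.second);

    for (const auto& g : grouped)
        std::cout << "cell " << g.first << " avg "
                  << reduceFn(g.first, g.second)[0] << "\n";
    return 0;
}
```

In the actual workflow the grouping step emulated here with an in-memory map would be performed by the MapReduce runtime's shuffle phase on Hadoop or Sector/Sphere rather than in memory.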
Use case: a climate data intensive application
– AIRS and MODIS level 1B data over 10 years: ~petabytes
– AIRS and MODIS gridded datasets at 100 km: ~terabytes
Fig 2. Use case: scientific climate application.
Fig 3. Left: performance of gridding one day of MODIS data (48 GB) on bluegrit, the PPC cluster (524 MB RAM per node). Right: speedup comparison against sequential data processing using the Sector/Sphere system on 48 GB of MODIS data on the bluegrit Intel cluster (32 GB RAM per node).
Fig 4. Left: performance of Hadoop MapReduce averaging the AIRS gridded dataset stored in HBase tables on 9 Intel nodes. Right: Hadoop gridding with different input dataset sizes vs. non-Hadoop gridding (text format).
Summary
We propose a workflow system for scientific applications focused on
– Exploiting data parallelism through the MapReduce programming model
– Automatic optimization: scheduling MapReduce jobs to Hadoop or Sector/Sphere clusters
– Support for data modeling: data storage and management through the Hadoop database (HBase) and data transfer tools (HDF, NetCDF)