Jianwu Wang, Daniel Crawl, Ilkay Altintas San Diego Supercomputer Center, University of California, San Diego 9500 Gilman Drive, MC 0505 La Jolla, CA 92093-0505,

Jianwu Wang, Daniel Crawl, Ilkay Altintas
San Diego Supercomputer Center, University of California, San Diego
9500 Gilman Drive, MC 0505, La Jolla, CA, U.S.A.
{jianwu, crawl,
Presentation by Woodrow H. Edwards

Kepler
• Open-source scientific workflow system
• A workflow is an executable model of the many stages that transform data into the desired result in a scientific domain
• Scientific domains using Kepler: bioinformatics, computational chemistry, ecoinformatics, and geoinformatics
• All of these domains have large data sets and require a lot of computation

Kepler
• User-friendly GUI that connects data sources to built-in procedures or independent applications by drag and drop
• Promotes component reuse and sharing
• Written in Java
• Designed to run on clusters, grids, or the Web
• A natural fit for integration with MapReduce

Kepler
• Components of a Kepler workflow
  • Actors
    ○ Independently process data
    ○ Atomic or composite
    ○ Ports input and output data (tokens) or signals
    ○ Could be R or MATLAB scripts or an outside application
  • Channels
    ○ Link actors
    ○ Carry data or other signals
  • Directors
    ○ Specify when actors run
    ○ Sequential (SDF) or parallel (PN)
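The actor, channel, and director roles listed above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Kepler's Java API: the class names `Channel`, `Actor`, and `SequentialDirector` and their methods are hypothetical, chosen only to mirror the terms on the slide.

```python
from collections import deque


class Channel:
    """Links two actors and carries tokens between them."""

    def __init__(self):
        self._tokens = deque()

    def put(self, token):
        self._tokens.append(token)

    def get(self):
        return self._tokens.popleft()

    def has_token(self):
        return bool(self._tokens)


class Actor:
    """Atomic actor: consumes one token per input channel, emits one result."""

    def __init__(self, name, func, inputs=(), outputs=()):
        self.name = name
        self.func = func
        self.inputs = list(inputs)
        self.outputs = list(outputs)

    def can_fire(self):
        # An actor with no inputs (a source) is always ready.
        return all(channel.has_token() for channel in self.inputs)

    def fire(self):
        args = [channel.get() for channel in self.inputs]
        result = self.func(*args)
        for channel in self.outputs:
            channel.put(result)


class SequentialDirector:
    """Decides when actors run: here, once each, in the given order."""

    def __init__(self, actors):
        self.actors = list(actors)

    def run(self):
        for actor in self.actors:
            if actor.can_fire():
                actor.fire()


# Workflow: source -> square -> sink
source_to_square = Channel()
square_to_sink = Channel()
results = []

source = Actor("source", lambda: 7, outputs=[source_to_square])
square = Actor("square", lambda x: x * x,
               inputs=[source_to_square], outputs=[square_to_sink])
sink = Actor("sink", results.append, inputs=[square_to_sink])

SequentialDirector([source, square, sink]).run()
print(results)  # [49]
```

Keeping the firing schedule in the director rather than in the actors is what lets the same actors be reused under a sequential or a parallel execution model.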

Figure 1: Example Kepler workflow [2]

Hadoop
• Open-source implementation of MapReduce
  map(in_key, in_value) → (out_key, intermediate_value) list
  reduce(out_key, intermediate_value list) → out_value list
• HDFS (Hadoop Distributed File System)
• Handles data partitioning, scheduling, load balancing, and fault tolerance
• Also written in Java
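The two signatures above can be exercised without a cluster. The following is a single-process sketch in plain Python, not Hadoop's API: `run_mapreduce` is a hypothetical helper that only imitates the map, group-by-key, and reduce phases, and the sensor records are made-up sample data.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to every record, group by out_key, then reduce each group."""
    groups = defaultdict(list)
    for in_key, in_value in records:
        for out_key, intermediate_value in map_fn(in_key, in_value):
            groups[out_key].append(intermediate_value)
    return {out_key: reduce_fn(out_key, values)
            for out_key, values in groups.items()}


def map_fn(in_key, in_value):
    # in_key is a line number (unused); in_value is "sensor,reading".
    sensor, reading = in_value.split(",")
    return [(sensor, float(reading))]


def reduce_fn(out_key, intermediate_values):
    # Returns an out_value list with a single element: the maximum reading.
    return [max(intermediate_values)]


records = [(0, "a,1.5"), (1, "b,4.0"), (2, "a,3.0")]
maxima = run_mapreduce(records, map_fn, reduce_fn)
print(maxima)  # {'a': [3.0], 'b': [4.0]}
```

In Hadoop itself the grouping step runs across machines, and the framework rather than user code takes care of partitioning, scheduling, and recovery from failures.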

Kepler + Hadoop
• Implement a MapReduce composite actor
  • Map actor
    ○ MapInputKey: in_key
    ○ MapInputValue: in_value
    ○ MapOutputList: (out_key, intermediate_value) list
  • Reduce actor
    ○ ReduceInputKey: out_key
    ○ ReduceInputList: intermediate_value list
    ○ ReduceOutputValue: out_value list
Figure 2: (a) MapReduce composite actor. (b) Map actor. (c) Reduce actor. [1]
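One way to picture the composite actor is as a wrapper around two sub-workflows that communicate only through the named ports. The sketch below is a plain-Python illustration under that assumption, not the authors' implementation: the port names come from the slide, while `MapReduceActor`, `fire`, `word_count_map`, and `word_count_reduce` are hypothetical, and everything runs in a single process rather than on Hadoop.

```python
from collections import defaultdict


class MapReduceActor:
    """Toy composite actor holding a Map sub-workflow and a Reduce sub-workflow."""

    def __init__(self, map_workflow, reduce_workflow):
        self.map_workflow = map_workflow
        self.reduce_workflow = reduce_workflow

    def fire(self, records):
        grouped = defaultdict(list)
        for in_key, in_value in records:
            ports = self.map_workflow(
                {"MapInputKey": in_key, "MapInputValue": in_value})
            for out_key, intermediate_value in ports["MapOutputList"]:
                grouped[out_key].append(intermediate_value)
        output = {}
        for out_key, values in grouped.items():
            ports = self.reduce_workflow(
                {"ReduceInputKey": out_key, "ReduceInputList": values})
            output[out_key] = ports["ReduceOutputValue"]
        return output


def word_count_map(ports):
    # Emit (word, 1) for every word in the input line.
    words = ports["MapInputValue"].split()
    return {"MapOutputList": [(word, 1) for word in words]}


def word_count_reduce(ports):
    # Sum the counts collected for one word; the result is a one-element list.
    return {"ReduceOutputValue": [sum(ports["ReduceInputList"])]}


actor = MapReduceActor(word_count_map, word_count_reduce)
counts = actor.fire([(0, "to be or not"), (1, "to be")])
print(counts)  # {'to': [2], 'be': [2], 'or': [1], 'not': [1]}
```

Because the sub-workflows see only their ports, a workflow author can swap in different map and reduce logic without touching the code that distributes the work.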

Kepler + Hadoop Figure 3: Hierarchical execution of MapReduce composite actor with Hadoop [1]

Kepler + Hadoop Figure 4: (a) Word Count workflow. (b) Map actor. (c) Reduce actor. (d) IterateOverArray actor. [1]

Kepler + Hadoop
• Takes 10 to 15% longer than native Hadoop MapReduce
• Makes up for the overhead in ease of implementation
• Scientists can use MapReduce without needing to know the Hadoop framework
• They only need to know where their workflow can benefit from parallelism

References
1. J. Wang, D. Crawl, and I. Altintas. Kepler + Hadoop: A General Architecture Facilitating Data-Intensive Applications in Scientific Workflow Systems. In WORKS 09, ACM, Nov.
2. The Kepler Project.
3. The Apache Hadoop Project.