MPI and MapReduce CCGSC 2010 Flat Rock NC September 8 2010 Geoffrey Fox

Slides:

Advertisements

Similar presentations

SALSA HPC Group School of Informatics and Computing Indiana University.

Advertisements

SALSASALSASALSASALSA Applying Twister for Scientific Applications NSF Cloud PI Workshop March 17, 2011 Judy Qiu School of Informatics.

Introduction to Programming Paradigms Activity at Data Intensive Workshop Shantenu Jha represented by Geoffrey Fox

Twister4Azure Iterative MapReduce for Windows Azure Cloud Thilina Gunarathne Indiana University Iterative MapReduce for Azure Cloud.

UC Berkeley Spark Cluster Computing with Working Sets Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica.

Clouds from FutureGrid’s Perspective April Geoffrey Fox Director, Digital Science Center, Pervasive.

Hybrid MapReduce Workflow Yang Ruan, Zhenhua Guo, Yuduo Zhou, Judy Qiu, Geoffrey Fox Indiana University, US.

SALSASALSASALSASALSA Using MapReduce Technologies in Bioinformatics and Medical Informatics Computing for Systems and Computational Biology Workshop SC09.

Interpolative Multidimensional Scaling Techniques for the Identification of Clusters in Very Large Sequence Sets April 27, 2011.

Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.

1 Clouds and Sensor Grids CTS2009 Conference May Alex Ho Anabas Inc. Geoffrey Fox Computer Science, Informatics, Physics Chair Informatics Department.

Clouds Cyberinfrastructure and Collaboration CTS2010 Chicago IL May Geoffrey Fox

Parallel Data Analysis from Multicore to Cloudy Grids Indiana University Geoffrey Fox, Xiaohong Qiu, Scott Beason, Seung-Hee.

MapReduce in the Clouds for Science CloudCom 2010 Nov 30 – Dec 3, 2010 Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox {tgunarat, taklwu,

Dimension Reduction and Visualization of Large High-Dimensional Data via Interpolation Seung-Hee Bae, Jong Youl Choi, Judy Qiu, and Geoffrey Fox School.

Grids and Clouds for Cyberinfrastructure IIT June Geoffrey Fox

SALSASALSASALSASALSA Digital Science Center June 25, 2010, IIT Geoffrey Fox Judy Qiu School.

Iterative computation is a kernel function to many data mining and data analysis algorithms. Missing in current MapReduce frameworks is collective communication,

Panel Session The Challenges at the Interface of Life Sciences and Cyberinfrastructure and how should we tackle them? Chris Johnson, Geoffrey Fox, Shantenu.

Applying Twister to Scientific Applications CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.

SCSI: Platforms & Foundations: Cyberinfrastructure Socially Coupled Systems & Informatics: Science, Computing & Decision Making in a Complex Interdependent.

Cloud Data mining and FutureGrid SC10 New Orleans LA AIST Booth November Geoffrey Fox

SALSASALSASALSASALSA AOGS, Singapore, August 11-14, 2009 Geoffrey Fox 1,2 and Marlon Pierce 1

Science in Clouds SALSA Team salsaweb/salsa Community Grids Laboratory, Digital Science Center Pervasive Technology Institute Indiana University.

Science of Cloud Computing Panel Cloud2011 Washington DC July Geoffrey Fox

Experimenting with FutureGrid CloudCom 2010 Conference Indianapolis December Geoffrey Fox

SALSASALSA Twister: A Runtime for Iterative MapReduce Jaliya Ekanayake Community Grids Laboratory, Digital Science Center Pervasive Technology Institute.

Science Clouds and FutureGrid’s Perspective June Science Clouds Workshop HPDC 2012 Delft Geoffrey Fox

Pregel: A System for Large-Scale Graph Processing Presented by Dylan Davis Authors: Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert,

Bioinformatics on Cloud Cyberinfrastructure Bio-IT April Geoffrey Fox

Biomedical Cloud Computing iDASH Symposium San Diego CA May Geoffrey Fox

Portable Parallel Programming on Cloud and HPC: Scientific Applications of Twister4Azure Thilina Gunarathne Bingjing Zhang, Tak-Lon.

FutureGrid Dynamic Provisioning Experiments including Hadoop Fugang Wang, Archit Kulshrestha, Gregory G. Pike, Gregor von Laszewski, Geoffrey C. Fox.

SALSASALSASALSASALSA Design Pattern for Scientific Applications in DryadLINQ CTP DataCloud-SC11 Hui Li Yang Ruan, Yuduo Zhou Judy Qiu, Geoffrey Fox.

Parallel Applications And Tools For Cloud Computing Environments Azure MapReduce Large-scale PageRank with Twister Twister BLAST Thilina Gunarathne, Stephen.

SALSA HPC Group School of Informatics and Computing Indiana University.

Cloud Technologies and Data Intensive Applications INGRID 2010 Workshop Poznan Poland May Geoffrey Fox

Cloud Technologies and Data Intensive Applications INGRID 2010 Workshop Poznan Poland May Geoffrey Fox

SALSASALSASALSASALSA FutureGrid Venus-C June Geoffrey Fox

SALSASALSASALSASALSA Scalable Programming and Algorithms for Data Intensive Life Science Applications Data Intensive Seattle, WA Judy Qiu

SALSASALSASALSASALSA Clouds Ball Aerospace March Geoffrey Fox

X-Informatics MapReduce February Geoffrey Fox Associate Dean for Research.

Clouds will win! CTS Conference 2011 Philadelphia May Geoffrey Fox

Looking at Use Case 19, 20 Genomics 1st JTC 1 SGBD Meeting SDSC San Diego March Judy Qiu Shantenu Jha (Rutgers) Geoffrey Fox

Performance of MapReduce on Multicore Clusters

Security: systems, clouds, models, and privacy challenges iDASH Symposium San Diego CA October Geoffrey.

Cloud Computing Paradigms for Pleasingly Parallel Biomedical Applications Thilina Gunarathne, Tak-Lon Wu Judy Qiu, Geoffrey Fox School of Informatics,

SALSA Group Research Activities April 27, Research Overview  MapReduce Runtime  Twister  Azure MapReduce  Dryad and Parallel Applications 

SALSASALSASALSASALSA Digital Science Center February 12, 2010, Bloomington Geoffrey Fox Judy Qiu

Parallel Applications And Tools For Cloud Computing Environments CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.

HPC in the Cloud – Clearing the Mist or Lost in the Fog Panel at SC11 Seattle November Geoffrey Fox

Memcached Integration with Twister Saliya Ekanayake - Jerome Mitchell - Yiming Sun -

SALSASALSASALSASALSA Data Intensive Biomedical Computing Systems Statewide IT Conference October 1, 2009, Indianapolis Judy Qiu

Cloud Data mining and FutureGrid SC10 New Orleans LA AIST Booth November Geoffrey Fox

SALSASALSA Dynamic Virtual Cluster provisioning via XCAT on iDataPlex Supports both stateful and stateless OS images iDataplex Bare-metal Nodes Linux Bare-

SALSASALSASALSASALSA IU Twister Supports Data Intensive Science Applications School of Informatics and Computing Indiana University.

Bioinformatics on Cloud Cyberinfrastructure Bio-IT April Geoffrey Fox

Directions in eScience Interoperability and Science Clouds June Interoperability in Action – Standards Implementation.

SALSA HPC Group School of Informatics and Computing Indiana University Workshop on Petascale Data Analytics: Challenges, and.

Private Public FG Network NID: Network Impairment Device

Our Objectives Explore the applicability of Microsoft technologies to real world scientific domains with a focus on data intensive applications Expect.

FutureGrid: a Grid Testbed

Applying Twister to Scientific Applications

MapReduce for Data Intensive Scientific Analyses

Biology MDS and Clustering Results

SC09 Doctoral Symposium, Portland, 11/18/2009

Parallel Applications And Tools For Cloud Computing Environments

Clouds from FutureGrid’s Perspective

Cloud versus Cloud: How Will Cloud Computing Shape Our World?

Iterative and non-Iterative Computations

Presentation transcript:

MPI and MapReduce CCGSC 2010 Flat Rock NC September Geoffrey Fox Director, Digital Science Center, Pervasive Technology Institute Associate Dean for Research and Graduate Studies, School of Informatics and Computing Indiana University Bloomington

MapReduce Implementations (Hadoop – Java; Dryad – Windows) support: – Splitting of data with customized file systems – Passing the output of map functions to reduce functions – Sorting the inputs to the reduce function based on the intermediate keys – Quality of service 20 petabytes per day (on an average of 400 machines) processed by Google using MapReduce September 2007 Map(Key, Value) Reduce(Key, List ) Data Partitions Reduce Outputs A hash function maps the results of the map tasks to reduce tasks

MapReduce “File/Data Repository” Parallelism Instruments Disks Map 1 Map 2 Map 3 Reduce Communication Map = (data parallel) computation reading and writing data Reduce = Collective/Consolidation phase e.g. forming multiple global sums as in histogram Portals /Users MPI or Iterative MapReduce Map Reduce Map Reduce Map

Typical Application Challenge: DNA Sequencing Pipeline Illumina/Solexa Roche/454 Life Sciences Applied Biosystems/SOLiD Modern Commercial Gene Sequencers Internet Read Alignment Visualization Blocking Sequence Alignment/ Assembly Sequence Alignment/ Assembly MDS Dissimilarity Matrix N(N-1)/2 values Dissimilarity Matrix N(N-1)/2 values FASTA File N Sequences block Pairings Pairwise clustering Pairwise clustering MapReduce MPI Linear Algebra or Expectation Maximization based data mining poor on MapReduce – equivalent to using MPI writing messages to disk and restarting processes each step/iteration of algorithm

Metagenomics This visualizes results of dimension reduction to 3D of gene sequences from an environmental sample. The many different genes are classified by clustering algorithm and visualized by MDS dimension reduction

All-Pairs Using MPI or DryadLINQ Calculate Pairwise Distances (Smith Waterman Gotoh) 125 million distances 4 hours & 46 minutes 125 million distances 4 hours & 46 minutes Calculate pairwise distances for a collection of genes (used for clustering, MDS) Fine grained tasks in MPI Coarse grained tasks in DryadLINQ Performed on 768 cores (Tempest Cluster) Moretti, C., Bui, H., Hollingsworth, K., Rich, B., Flynn, P., & Thain, D. (2009). All-Pairs: An Abstraction for Data Intensive Computing on Campus Grids. IEEE Transactions on Parallel and Distributed Systems, 21,

Smith Waterman MPI DryadLINQ Hadoop Hadoop is Java; MPI and Dryad are C#

Twister(MapReduce++) Streaming based communication Intermediate results are directly transferred from the map tasks to the reduce tasks – eliminates local files Cacheable map/reduce tasks Static data remains in memory Combine phase to combine reductions User Program is the composer of MapReduce computations Extends the MapReduce model to iterative computations Data Split D MR Driver User Program Pub/Sub Broker Network D File System M R M R M R M R Worker Nodes M R D Map Worker Reduce Worker MRDeamon Data Read/Write Communication Reduce (Key, List ) Iterate Map(Key, Value) Combine (Key, List ) User Program Close() Configure() Static data Static data δ flow Different synchronization and intercommunication mechanisms used by the parallel runtimes

Iterative and non-Iterative Computations K-means Performance of K-Means Smith Waterman is a non iterative case and of course runs fine

Matrix Multiplication 64 cores Square blocks Twister Row/Col decomp Twister Square blocks OpenMPI

Overhead OpenMPI v Twister negative overhead due to cache

Performance of Pagerank using ClueWeb Data (Time for 20 iterations) using 32 nodes (256 CPU cores) of Crevasse

Fault Tolerance and MapReduce MPI does “maps” followed by “communication” including “reduce” but does this iteratively There must (for most communication patterns of interest) be a strict synchronization at end of each communication phase – Thus if a process fails then everything grinds to a halt In MapReduce, all Map processes and all reduce processes are independent and stateless and read and write to disks – As 1 or 2 (reduce+map) iterations, no difficult synchronization issues Thus failures can easily be recovered by rerunning process without other jobs hanging around waiting Re-examine MPI fault tolerance in light of MapReduce – Relevant for Exascale? Re-examine MapReduce in light of MPI experience …..

MPI & Iterative MapReduce papers MapReduce on MPI Torsten Hoefler, Andrew Lumsdaine and Jack Dongarra, Towards Efficient MapReduce Using MPI, Recent Advances in Parallel Virtual Machine and Message Passing Interface Lecture Notes in Computer Science, 2009, Volume 5759/2009, MPI with generalized MapReduce Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, Geoffrey Fox Twister: A Runtime for Iterative MapReduce, Proceedings of the First International Workshop on MapReduce and its Applications of ACM HPDC 2010 conference, Chicago, Illinois, June 20-25, Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski Pregel: A System for Large-Scale Graph Processing, Proceedings of the 2010 international conference on Management of data Indianapolis, Indiana, USA Pages: Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst HaLoop: Efficient Iterative Data Processing on Large Clusters, Proceedings of the VLDB Endowment, Vol. 3, No. 1, The 36th International Conference on Very Large Data Bases, September 1317, 2010, Singapore. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica Spark: Cluster Computing with Working Sets poster at

AzureMapReduce

Scaled Timing with Azure/Amazon MapReduce

Cap3 Cost

Smith Waterman: “Scaled Speedup” Timing

Smith Waterman: daily effect

SWG Cost

TwisterMPIReduce Runtime package supporting subset of MPI mapped to Twister Set-up, Barrier, Broadcast, Reduce TwisterMPIReduce PairwiseClustering MPI Multi Dimensional Scaling MPI Generative Topographic Mapping MPI Other … Azure Twister (C# C++) Java Twister Microsoft Azure FutureGrid Local Cluster Amazon EC2

Some Issues with AzureTwister and AzureMapReduce Transporting data to Azure: Blobs (HTTP), Drives (GridFTP etc.), Fedex disks Intermediate data Transfer: Blobs (current choice) versus Drives (should be faster but don’t seem to be) Azure Table v Azure SQL: Handle all metadata Messaging Queues: Use real publish-subscribe system in place of Azure Queues to get scaling (?) with multiple brokers – especially AzureTwister Azure Affinity Groups: Could allow better data-compute and compute-compute affinity

Research Issues Clouds are suitable for “Loosely coupled” data parallel applications “Map Only” (really pleasingly parallel) certainly run well on clouds (subject to data affinity) with many programming paradigms Parallel FFT and adaptive mesh PDE solver very bad on MapReduce but suitable for classic MPI engines. MapReduce is more dynamic and fault tolerant than MPI; it is simpler and easier to use Is there an intermediate class of problems for which Iterative MapReduce useful? – Long running processes? – Mutable data small in size compared to fixed data(base)? – Only support reductions? – Is it really different from a fault tolerant MPI? – Multicore implementation – Link to HDFS or equivalent data parallel file system – Will AzureTwister run satisfactorily?

FutureGrid in a Nutshell FutureGrid provides a testbed with a wide variety of computing services to its users – Supporting users developing new applications and new middleware using Cloud, Grid and Parallel computing (Hypervisors – Xen, KVM, ScaleMP, Linux, Windows, Nimbus, Eucalyptus, Hadoop, Globus, Unicore, MPI, OpenMP …) – Software supported by FutureGrid or users – ~5000 dedicated cores distributed across country The FutureGrid testbed provides to its users: – A rich development and testing platform for middleware and application users looking at interoperability, functionality and performance – A rich education and teaching platform for advanced cyberinfrastructure classes Each use of FutureGrid is an experiment that is reproducible Cloud infrastructure supports loading of general images on Hypervisors like Xen; FutureGrid dynamically provisions software as needed onto “bare-metal” using Moab/xCAT based environment

FutureGrid: a Grid/Cloud Testbed Operational: IU Cray operational; IU, UCSD, UF & UC IBM iDataPlex operational Network, NID operational TACC Dell running acceptance tests – ready ~September 15 NID : Network Impairment Device Private Public FG Network

FutureGrid Dynamic Provisioning Results Time elapsed between requesting a job and the jobs reported start time on the provisioned node. The numbers here are an average of 2 sets of experiments. Time minutes Number of nodes

200 papers submitted to main track; 4 days of tutorials