Early Experience with Cloud Technologies

Presentation transcript:

Early Experience with Cloud Technologies
Microsoft External Research Symposium, March 31, 2009, Microsoft, Seattle
Geoffrey Fox, gcf@indiana.edu, www.infomall.org/salsa
Community Grids Laboratory; Chair, Department of Informatics; School of Informatics, Indiana University

Collaboration in SALSA Project
Microsoft Research technology collaboration: Dryad (Roger Barga), CCR (George Chrysanthakopoulos), DSS (Henrik Frystyk Nielsen)
Indiana University SALSA Team: Geoffrey Fox, Xiaohong Qiu, Scott Beason, Seung-Hee Bae, Jaliya Ekanayake, Jong Youl Choi, Yang Ruan, and others
Application collaborations:
Bioinformatics, CGB: Haixu Tang, Mina Rho, Qunfeng Dong
IU Medical School: Gilbert Liu
Demographics (GIS): Neil Devadasan
Cheminformatics: Rajarshi Guha, David Wild
Physics: CMS group at Caltech (Julian Bunn)
Community Grids Lab and UITS RT -- PTI: Sangmi Pallickara, Shrideep Pallickara, Marlon Pierce

Data Intensive (Science) Applications
1) Data starts on some disk, sensor or instrument. It needs to be partitioned; often the partitioning is natural from the source of the data.
2) One runs a filter of some sort, extracting the data of interest and (re)formatting it. This step is pleasingly parallel, often with "millions" of jobs; communication latencies can be many milliseconds and can involve disks.
3) Using the same decomposition (or mapping to a new one), one runs a parallel application that may require iterative steps between communicating processes or may itself be pleasingly parallel. Communication latencies are at most a few microseconds and involve shared memory or high-speed networks.
Workflow links 1), 2) and 3), with multiple instances of 2) and 3), as a pipeline or a more complex graph.
Filters are "Maps" or "Reductions" in MapReduce language.

"File/Data Repository" Parallelism
Map = (data parallel) computation reading and writing data
Reduce = collective/consolidation phase, e.g. forming multiple global sums as in a histogram
Communication via messages/files
[Diagram: Instruments feed Disks; Map1, Map2, Map3 run over the data, followed by a Reduce, delivering results to Portals/Users on computers/disks]

Data Analysis Examples
LHC particle physics analysis: file parallel over events
  Filter1: Process raw event data into "events with physics parameters"
  Filter2: Process physics into histograms
  Reduce2: Add together separate histogram counts
Information retrieval: similar parallelism over data files
Bioinformatics - gene families: data parallel over sequences
  Filter1: Calculate similarities (distances) between sequences
  Filter2: Align sequences (if needed)
  Filter3: Cluster to find families
  Filter4/Reduce4: Apply dimension reduction to 3D
  Filter5: Visualize

Philosophy
Clouds are (by definition) a commercially supported approach to large-scale computing, so we should expect Clouds to replace Compute Grids; current Grid experience gives a not-so-positive evaluation of "non-commercial" software solutions.
Information retrieval is the major data-intensive commercial application, so we can expect technologies from this field (Dryad, Hadoop) to be relevant for related scientific (file/data parallel) applications.
The technology needs to be packaged for general use.

MapReduce
map(key, value) followed by reduce(key, list<value>)
MapReduce is implemented by Hadoop using files for communication, or by CGL-MapReduce using in-memory queues as an "Enterprise bus" (publish-subscribe).
Example: word histogram. Start with a set of words; each map task counts the number of occurrences in its data partition; the reduce phase adds these counts together.
Dryad supports general dataflow; it currently communicates via files and will use queues.
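
As a concrete (and hypothetical) rendering of this word-histogram example, here is a minimal self-contained C# sketch; the class and method names are illustrative and this is not the CGL-MapReduce or Hadoop API, whose real implementations also handle partitioning, scheduling and fault tolerance.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Minimal word-histogram MapReduce sketch (illustrative only).
// Each "map" counts words in one partition; "reduce" adds the
// per-partition counts for a given word.
static class WordHistogram
{
    // Map: one data partition -> (word, count) pairs
    public static IEnumerable<KeyValuePair<string, int>> Map(string partition)
    {
        return partition
            .Split(new[] { ' ', '\t', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(w => w)
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()));
    }

    // Reduce: (word, list of counts) -> (word, total count)
    public static KeyValuePair<string, int> Reduce(string word, IEnumerable<int> counts)
    {
        return new KeyValuePair<string, int>(word, counts.Sum());
    }

    // Sequential driver standing in for the runtime's shuffle/communication step
    public static Dictionary<string, int> Run(IEnumerable<string> partitions)
    {
        return partitions
            .SelectMany(Map)                        // map phase
            .GroupBy(kv => kv.Key, kv => kv.Value)  // group by word (the "shuffle")
            .Select(g => Reduce(g.Key, g))          // reduce phase
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

For example, WordHistogram.Run(new[] { "a b a", "b c" }) would yield a=2, b=2, c=1; a real runtime would perform the shuffle step via files (Hadoop) or in-memory queues (CGL-MapReduce).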

Distributed Grep - Performance
Performs a "grep" operation on a collection of documents; this is an abstraction of a real Information Retrieval use of Dryad.
Results are not normalized for machine performance: CGL-MapReduce and Hadoop both used all the cores of 4 gridfarm nodes, while Dryad used only 1 core per node on four nodes of Barcelona.

Histogramming of Words - Performance
Performs a "histogramming" operation on a collection of documents.
Results are not normalized for machine performance; again, CGL-MapReduce and Hadoop both used all the cores of 4 gridfarm nodes, while Dryad used only 1 core per node on four nodes of Barcelona.

Particle Physics (LHC) Data Analysis
MapReduce for LHC data analysis: execution time vs. the volume of data (fixed compute resources).
Root runs in a distributed fashion, allowing the analysis to access distributed data, i.e. computing next to the data.
LINQ is not optimal for expressing the final merge.
Jaliya Ekanayake

Reduce Phase of Particle Physics “Find the Higgs” using Dryad Combine Histograms produced by separate Root “Maps” (of event data to partial histograms) into a single Histogram delivered to Client
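
Conceptually, that reduce step is just element-wise addition of the partial histograms; the short C# sketch below (illustrative only, not the actual Dryad or Root code) shows the operation the final Reduce vertex performs.

```csharp
using System.Collections.Generic;

static class HistogramReduce
{
    // Combine partial histograms (one per Root "Map") into a single histogram
    // by adding counts bin by bin. Illustrative only; not the actual Dryad code.
    public static double[] Merge(IEnumerable<double[]> partialHistograms)
    {
        double[] total = null;
        foreach (var h in partialHistograms)
        {
            if (total == null)
                total = new double[h.Length];   // size taken from the first partial histogram
            for (int bin = 0; bin < h.Length; bin++)
                total[bin] += h[bin];           // element-wise sum of counts
        }
        return total;
    }
}
```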

Cluster Configuration
Number of nodes and processor cores: 4 nodes => 4 x 8 = 32 processor cores (both clusters)
Processors: Quad Core Intel Xeon E5335, 2 processors, 2000.12 MHz (CGL-MapReduce/Hadoop) vs. Quad Core AMD Opteron 2356, 2 processors, 2.29 GHz (Dryad)
Memory: 16 GB
Operating system: Red Hat Enterprise Linux 4 (CGL-MapReduce/Hadoop) vs. Windows Server 2008 HPC Edition (Dryad)
Language: Java (CGL-MapReduce/Hadoop) vs. C# (Dryad)
Data placement: Hadoop -> Hadoop Distributed File System (HDFS); CGL-MapReduce -> shared file system (NFS); Dryad -> individual nodes with shared directories
Note: our current version of Dryad can run only one PN process per node, so Hadoop and CGL-MapReduce were configured to use only one parallel task per node.

Notes on Performance
Speedup = T(1)/T(P) = efficiency x P with P processors.
Overhead f = P T(P)/T(1) - 1 = (1/efficiency) - 1 is linear in the individual overheads and is usually the best way to record results when the overhead is small.
For MPI communication, f is proportional to the ratio of data communicated to calculation complexity, which is n^(-1/2) for matrix multiplication, where n (the grain size) is the number of matrix elements per node.
MPI communication overheads decrease as the problem size n increases (the edge-over-area rule); dataflow communicates all the data, so its overhead does not decrease.
Scaled speedup: keep the grain size n fixed as P increases. Conventional speedup: keep the problem size fixed, so n varies as 1/P.
VMs and Windows threads have runtime fluctuation and synchronization overheads.
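
Written out as formulas (restating the definitions above, with ε denoting efficiency and n the grain size):

```latex
\begin{align*}
S(P) &= \frac{T(1)}{T(P)} = \varepsilon\,P
  && \text{(speedup on $P$ processors)}\\
f &= \frac{P\,T(P)}{T(1)} - 1 = \frac{1}{\varepsilon} - 1
  && \text{(parallel overhead, linear in the individual overheads)}\\
f &\propto \frac{\text{data communicated}}{\text{calculation complexity}} \sim n^{-1/2}
  && \text{(MPI, matrix multiplication with grain size $n$)}
\end{align*}
```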

Comparison of MPI and Threads on Classic Parallel Code
Parallel overhead f = P T(P)/T(1) - 1 = (1/efficiency) - 1 on P processors (approximately 1 - efficiency when the overhead is small); 24-way speedup = 24/(1 + f).
[Chart: parallel overhead for 1-, 2-, 4-, 8-, 16- and 24-way parallelism, across combinations of MPI processes and CCR threads per run, for 3 dataset sizes]
Hardware: 4 six-core Intel Xeon E7450, 2.4 GHz, 48 GB memory, 12 MB L2 cache.

Performance of Parallel Pairwise Clustering
Scaled speedup tests on an eight-node, 16-cores-per-node system (up to 128-way parallelism) with different choices of MPI processes and threads.
Parallel overhead = runtime fluctuations/synchronization (VMs, threads) + communication time / (n x calculation time), where n = total points / number of execution units, varying from 10,000 down to 2000/128 = 16; communication time = 0 for threads.
[Chart: parallel overhead vs. parallel pattern (node x MPI-process x thread combinations from 1x1x1 up to 16x1x8) for 2000-, 4000- and 10,000-point datasets, at 2-way through 128-way parallelism. The labeled 128-way cases use 8 nodes with 16 MPI processes per node (1 thread per process) and 8 nodes with 16 threads per process.]

Performance of Parallel Pairwise Clustering (continued)
[Chart: the same scaled-speedup parallel-overhead data shown per dataset (2000, 4000 and 10,000 points), from 2-way to 128-way parallelism, over the node x MPI-process x thread patterns listed above.]

HEP Data Analysis - Overhead Overhead of Different Runtimes vs. Amount of Data Processed

Some Other File/Data Parallel Examples from the Indiana University Biology Department
EST (Expressed Sequence Tag) assembly: 2 million mRNA sequences generate 540,000 files, taking 15 hours on 400 TeraGrid nodes (the CAP3 run dominates)
MultiParanoid/InParanoid gene sequence clustering: 476 core-years just for prokaryotes
Population genomics (Lynch): looking at all pairs separated by up to 1000 nucleotides
Sequence-based transcriptome profiling (Cherbas, Innes): MAQ, SOAP
Systems microbiology (Brun): BLAST, InterProScan
Metagenomics (Fortenberry, Nelson): pairwise alignment of 7243 16S sequences took 12 hours on TeraGrid
All can use Dryad

Cap3 Data Analysis - Performance Normalized Average Time vs. Amount of Data Processed

Cap3 Data Analysis - Overhead Overhead of Different Runtimes vs. Amount of Data Processed

The many forms of MapReduce
MPI, Hadoop, Dryad, (Web or Grid) services, workflow (Taverna, mashups, BPEL) and (Enterprise) Service Buses all consist of execution units exchanging messages.
They differ in performance, long- vs. short-lived processes, communication mechanism, control vs. data communication, fault tolerance, user interface, flexibility (dynamic vs. static processes), etc.
As MPI can do all parallel problems, so can Hadoop, Dryad, etc. (cf. the well-known paper on MapReduce for data mining).
MPI is called "data-parallel", but it is really "memory-parallel": the "owner computes" rule says each computer evolves the points in its own memory.
Dryad and Hadoop support "file/repository-parallel" computing (attach computing to data on disk), which is natural for the vast majority of experimental science.
Dryad/Hadoop typically transmit all the data between steps (maps) via either queues or files (a process lasts only as long as its map does).
MPI transmits only the needed state changes, using rendezvous semantics with long-running processes, which gives higher performance but is less dynamic and less fault tolerant.

Kmeans Clustering in MapReduce
[Chart: K-means clustering performance; CGL-MapReduce behaves like a "millisecond MPI", compared with native "microsecond" MPI]
So Dryad will do better when it uses pipes rather than files for communication.
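
To make the iterative structure concrete, here is a hypothetical C# sketch of K-means expressed as repeated map/reduce cycles (squared Euclidean distance assumed; empty clusters ignored). It is not the code behind the measurements, but it shows why a runtime that keeps data in memory between iterations, via queues or pipes rather than files, helps this kind of algorithm.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative K-means as iterative MapReduce (not the CGL-MapReduce API).
// Each iteration is one map/reduce cycle over all data partitions.
static class KMeansMapReduce
{
    // Map: for each point in a partition, emit (index of nearest centroid, point)
    public static IEnumerable<(int Cluster, double[] Point)> Map(
        IEnumerable<double[]> partition, double[][] centroids)
    {
        foreach (var p in partition)
        {
            int best = 0;
            double bestDist = double.MaxValue;
            for (int c = 0; c < centroids.Length; c++)
            {
                // squared Euclidean distance to centroid c
                double d = p.Zip(centroids[c], (a, b) => (a - b) * (a - b)).Sum();
                if (d < bestDist) { bestDist = d; best = c; }
            }
            yield return (best, p);
        }
    }

    // Reduce: average the points assigned to one centroid to get its new position
    public static double[] Reduce(IEnumerable<double[]> assignedPoints)
    {
        var points = assignedPoints.ToList();
        var mean = new double[points[0].Length];
        foreach (var p in points)
            for (int i = 0; i < mean.Length; i++)
                mean[i] += p[i] / points.Count;
        return mean;
    }

    // Driver: one map/reduce cycle per iteration; a file-based runtime would
    // re-read the partitions from disk on every cycle.
    public static double[][] Run(List<double[][]> partitions, double[][] centroids, int iterations)
    {
        for (int iter = 0; iter < iterations; iter++)
        {
            centroids = partitions
                .SelectMany(part => Map(part, centroids))   // map phase
                .GroupBy(t => t.Cluster, t => t.Point)      // shuffle by cluster id
                .OrderBy(g => g.Key)
                .Select(Reduce)                             // reduce phase
                .ToArray();
        }
        return centroids;
    }
}
```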

MapReduce in MPI.NET (C#)
A couple of setup calls and one call for the Reduce: follow a data-decomposed MPI calculation (the map), with no communication, by
MPI_communicator.Allreduce<UserDataStructure>(LocalStructure, UserReductionRoutine)
with a struct UserDataStructure instance LocalStructure and a general reduction routine ReducedStruct = UserReductionRoutine(Struct1, Struct2).
Or, for example,
MPI_communicator.Allreduce<double>(Histogram, Operation<double>.Add)
with Histogram a double array gives the particle-physics Root application of summing histograms.
One could drive this from a higher-level language that chooses Dryad or MPI depending on the required trade-offs.
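
A minimal sketch of the histogram case, assuming the MPI.NET library (namespace MPI, Communicator.world, Operation<double>.Add) and following the call shape on the slide; the exact Allreduce overloads may differ between MPI.NET versions.

```csharp
using MPI;

// Histogram Allreduce pattern described above, as a hedged MPI.NET sketch.
// Assumes MPI.NET; the array Allreduce overload shown follows the slide's
// call shape and may differ slightly by library version.
class HistogramAllreduce
{
    static void Main(string[] args)
    {
        using (new MPI.Environment(ref args))
        {
            Intracommunicator comm = Communicator.world;

            // The "map": each rank fills a local histogram from its own events.
            double[] localHistogram = new double[100];
            localHistogram[comm.Rank % 100] += 1.0;   // stand-in for real event processing

            // The "reduce": element-wise sum of all local histograms on every rank.
            double[] globalHistogram = comm.Allreduce(localHistogram, Operation<double>.Add);

            if (comm.Rank == 0)
                System.Console.WriteLine("Total entries: {0}",
                    System.Linq.Enumerable.Sum(globalHistogram));
        }
    }
}
```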

Data Intensive Cloud Architecture
[Diagram: instruments and user data feed files into a Linux cloud, a Windows cloud and an MPI/GPU cloud, with users accessing the results]
Dryad should manage decomposed data from database/file to the Windows cloud (Azure), to the Linux cloud, and to specialized engines (MPI, GPU, ...).
Does Dryad replace workflow? How does it link to MPI-based data mining?

MPI Cloud Overhead
Eucalyptus (Xen) versus "bare-metal" Linux on a communication-intensive trivial problem (2D Laplace) and on matrix multiplication.
Cloud overhead is roughly 3 times bare metal; acceptable if communication is modest.
[Charts: overhead vs. grid size in each of the 2 dimensions, up to a 7200 by 7200 grid]