1 Models and Frameworks for Data Intensive Cloud Computing Douglas Thain University of Notre Dame IDGA Cloud Computing 8 February 2011

The Cooperative Computing Lab We collaborate with people who have large scale computing problems in science, engineering, and other fields. We operate computer systems on the scale of 1000 cores. (Small) We conduct computer science research in the context of real people and problems. We publish open source software that captures what we have learned. 2

3 I have a standard, debugged, trusted application that runs on my laptop. A toy problem completes in one hour; a real problem will take a month (I think). Can I get a single result faster? Can I get more results in the same time? Last year, I heard about this grid thing. This year, I heard about this cloud thing. What do I do next?

4 Our Application Communities Bioinformatics –I just ran a tissue sample through a sequencing device. I need to assemble 1M DNA strings into a genome, then compare it against a library of known human genomes to find the difference. Biometrics –I invented a new way of matching iris images from surveillance video. I need to test it on 1M hi-resolution images to see if it actually works. Data Mining –I have a terabyte of log data from a medical service. I want to run 10 different clustering algorithms at 10 levels of sensitivity on 100 different slices of the data.

Why Consider Scientific Apps? Highly motivated to get a result that is bigger, faster, or higher resolution. Willing to take risks and move rapidly, but don’t have the effort/time for major retooling. Often already have access to thousands of machines in various forms. Security is usually achieved by selecting resources appropriately at a high level. 5

Cloud Broadly Considered Cloud: Any system or method that allows me to allocate as many machines as I need in short order: –My own dedicated cluster. (ssh) –A shared institutional batch system (SGE) –A cycle scavenging system (Condor) –A pay-as-you-go IaaS system (Amazon EC2) –A pay-as-you-go PaaS system (Google App) Scalable, Elastic, Dynamic, Unreliable… 6

7

8

9

greencloud.crc.nd.edu 10

What they want. 11 What they get.

The Most Common Application Model? 12 Every program attempts to grow until it can read mail. - Jamie Zawinski

13 An Old Idea: The Unix Model. [Diagram: a chain of small programs transforming input into output.]

Advantages of Little Processes Easy to distribute across machines. Easy to develop and test independently. Easy to checkpoint halfway. Easy to troubleshoot and continue. Easy to observe the dependencies between components. Easy to control resource assignments from an outside process. 14

15 Our approach: Encourage users to decompose their applications into simple programs. Give them frameworks that can assemble them into programs of massive scale with high reliability.

16 Working with Abstractions. [Diagram: a compact data structure (A1..An, B1..Bn) and a simple function F are declared as AllPairs( A, B, F ); a custom workflow engine expands the abstraction and runs it on a cloud or grid.]

MapReduce User provides two simple programs: Map( x ) -> list of (key,value) Reduce( key, list of (value) ) -> Output The Map-Reduce implementation puts them together in a way to maximize data parallelism. Open source implementation: Hadoop 17
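To make the contract concrete, here is a minimal, purely local sketch of the model in Python: a toy word count in which the function names and the in-memory "shuffle" are illustrative assumptions, not the Hadoop API.

from collections import defaultdict

def map_fn(line):
    # Map(x) -> list of (key, value)
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce(key, list of value) -> output
    return (key, sum(values))

def mapreduce(lines):
    groups = defaultdict(list)   # the "shuffle": group intermediate values by key
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

print(mapreduce(["the cloud is elastic", "the cloud is unreliable"]))

In a real implementation the map calls, the grouping by key, and the reduce calls are each spread across many machines; the data parallelism comes from the fact that every map call and every per-key reduce call is independent.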

18 [Diagram: map tasks (M) emit (key, value) pairs; the values (V) are grouped under Key0 through KeyN and fed to reduce tasks (R), which produce the outputs O0, O1, and O2.]

19 Of course, not all science fits into the Map-Reduce model!

20 Example: Biometrics Research. Goal: design a robust face comparison function F. [Images: F applied to two pairs of face images, scoring one pair 0.05 and the other 0.97.]

21 Similarity Matrix Construction. Challenge workload: 60,000 images, 1 MB each, 0.02 s per F; 833 CPU-days; 600 TB of I/O.
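The CPU figure follows directly from the workload; a quick back-of-envelope check in Python, assuming an all-against-all comparison of the 60,000 images:

comparisons = 60_000 * 60_000        # 3.6e9 evaluations of F
cpu_seconds = comparisons * 0.02     # at 0.02 s per F: 7.2e7 CPU-seconds
print(cpu_seconds / 86400)           # ~833 CPU-days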

22 All-Pairs Abstraction. AllPairs( set A, set B, function F ) returns matrix M where M[i][j] = F( A[i], B[j] ) for all i, j. [Diagram: every element A1..An is compared against every element B1..Bn by F.] Command line: allpairs A B F.exe
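A sequential reference version of the abstraction is only a few lines of Python; this is a sketch for intuition, whereas the real engine distributes the F invocations across a cluster instead of looping locally, and the toy similarity function below is purely hypothetical.

def all_pairs(A, B, F):
    # M[i][j] = F(A[i], B[j]) for all i, j
    return [[F(a, b) for b in B] for a in A]

# toy example: compare short strings by the number of shared characters
similarity = lambda x, y: len(set(x) & set(y))
M = all_pairs(["iris01", "iris02"], ["iris01", "iris03"], similarity)
print(M)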

23 How Does the Abstraction Help? The custom workflow engine: –Chooses right data transfer strategy. –Chooses the right number of resources. –Chooses blocking of functions into jobs. –Recovers from a larger number of failures. –Predicts overall runtime accurately. All of these tasks are nearly impossible for arbitrary workloads, but are tractable (not trivial) to solve for a specific abstraction.

24

25 Choose the Right # of CPUs

26 Resources Consumed

27 All-Pairs in Production Our All-Pairs implementation has provided over 57 CPU-years of computation to the ND biometrics research group in the first year. Largest run so far: 58,396 irises from the Face Recognition Grand Challenge. The largest experiment ever run on publicly available data. Competing biometric research relies on samples of images, which can miss important population effects. Reduced computation time from 833 days to 10 days, making it feasible to repeat multiple times for a graduate thesis. (We can go faster yet.)

28 All-Pairs Abstraction. AllPairs( set A, set B, function F ) returns matrix M where M[i][j] = F( A[i], B[j] ) for all i, j. [Diagram as on slide 22.] Command line: allpairs A B F.exe

29 Are there other abstractions?

30 Wavefront( matrix M, function F(x,y,d) ) returns matrix M such that M[i,j] = F( M[i-1,j], M[i,j-1], M[i-1,j-1] ). [Diagram: the first row and column of M are given; each remaining cell is computed from its three already-computed neighbors, so the work sweeps across the matrix as a diagonal wavefront.]
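As with All-Pairs, the recurrence is easy to state sequentially; the value of the abstraction is that the engine can run each anti-diagonal in parallel. A sequential Python sketch, where the boundary row and column and the toy recurrence are assumptions for illustration:

def wavefront(row0, col0, F):
    # row0 = M[0][*] and col0 = M[*][0] are given; they share M[0][0]
    n = len(row0)
    M = [[None] * n for _ in range(n)]
    M[0] = list(row0)
    for i in range(n):
        M[i][0] = col0[i]
    for i in range(1, n):
        for j in range(1, n):
            # x = M[i-1][j], y = M[i][j-1], d = M[i-1][j-1]
            M[i][j] = F(M[i-1][j], M[i][j-1], M[i-1][j-1])
    return M

# toy recurrence in the style of sequence-alignment scoring
M = wavefront(list(range(5)), list(range(5)), lambda x, y, d: min(x, y, d) + 1)
print(M[4][4])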

31 What if your application doesn’t fit a regular pattern?

32 Another Old Idea: Make

part1 part2 part3: input.data split.py
	./split.py input.data

out1: part1 mysim.exe
	./mysim.exe part1 > out1

out2: part2 mysim.exe
	./mysim.exe part2 > out2

out3: part3 mysim.exe
	./mysim.exe part3 > out3

result: out1 out2 out3 join.py
	./join.py out1 out2 out3 > result
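Handing a rules file like this to Makeflow instead of Make is what the following slides describe. A typical invocation might look like the line below, assuming the cctools makeflow command and its -T option for selecting a batch system (the flag spelling and the workflow file name here are assumptions that may vary by version):

makeflow -T condor simulation.makeflow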

Makeflow: First Try. [Diagram: Makeflow reads the Makefile and the local files and programs, then submits jobs directly to an Xgrid cluster, a campus Condor pool, a private SGE cluster, or a public cloud provider.]

Problems with the First Try Software engineering: too many batch systems with too many slight differences. Performance: starting a new job or a VM takes seconds. Stability: an accident could result in purchasing thousands of cores! Solution: overlay our own work management system on top of multiple clouds, a technique used widely in the grid world. 34

Makeflow: Second Try. [Diagram: Makeflow reads the Makefile and the local files and programs and submits tasks to hundreds of workers (W) in a personal cloud; workers are started on a private cluster via ssh, on the campus Condor pool via condor_submit_workers, on a private SGE cluster via sge_submit_workers, and on a public cloud provider.]

36 Makeflow: Second Try, detail of a single worker. The makeflow master queues tasks (for example, the rule bfile: afile prog with command prog afile > bfile) and dispatches them to 100s of workers in the cloud. Master-worker protocol: put prog, put afile, exec prog afile > bfile, get bfile. Two optimizations: cache inputs and outputs at the worker, and dispatch tasks to nodes that already have the data.
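The same put/exec/get pattern is what the Work Queue master API exposes directly. A minimal sketch in Python, assuming the cctools work_queue binding; the class and method names (WorkQueue, Task, specify_input_file, specify_output_file) are written from memory and may differ across versions:

from work_queue import WorkQueue, Task   # cctools Work Queue Python binding (assumed)

q = WorkQueue(port=9123)                  # master listens here; workers connect from the cloud

t = Task("./prog afile > bfile")          # the command a worker will exec
t.specify_input_file("prog")              # shipped to the worker, then cached there
t.specify_input_file("afile")
t.specify_output_file("bfile")            # fetched back when the task completes
q.submit(t)

while not q.empty():                      # wait for tasks and collect results
    done = q.wait(5)
    if done:
        print("task returned", done.return_status)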

Makeflow Applications

The best part is that I don’t have to learn anything about cloud computing! - Anonymous Student 38

I would like to posit that computing’s central challenge, “how not to make a mess of it,” has not yet been met. - Edsger Dijkstra 39

Too much concurrency! Vendors of multi-core processors are pushing everyone to make their code multi-core. Hence, many applications now attempt to use all the cores at their disposal, without regard to RAM, I/O, disk… Two apps running on the same machine almost always conflict in bad ways. Opinion: keep the apps simple and sequential, and let the framework handle concurrency. 40

The 0, 1 … N attitude. Code designed for a single machine doesn’t worry about resources, because there isn’t any alternative. (Virtual memory.) But in the cloud, you usually scale until some resource is exhausted! App developers are rarely trained to deal with this problem. (Can malloc or close fail?) Opinion: all software needs to do a better job of advertising and limiting its resource use. (Frameworks could exploit this.) 41

To succeed, get used to failure. Any system of 1000s of parts has failures, and many of them are pernicious: –Black holes, white holes, deadlock… To discover failures, you need to have a reasonably detailed model of success: –Output format, run time, resources consumed. Need to train coders in classic engineering: –Damping, hysteresis, control systems. Keep failures at a sufficient trickle, so that everyone must confront them. 42

Cloud Federation? 10 years ago, we had (still have) multiple independent computing grids, each centered on a particular institution. Grid federation was widely desired, but never widely achieved, for many technical and social reasons. But, users ended up developing frameworks to harness multiple grids simultaneously, which was nearly as good. Same story with clouds? 43

Cloud Reliability? High reliability needed for something like a widely shared read-write DB in the cloud. But in many cases, a VM is one piece in a large system with higher level redundancy. End to end principle: The topmost layer is ultimately responsible for reliability. Opinion: There is a significant need for resources of modest reliability at low cost. 44

Many hard problems remain… The cloud forces end users to think about their operational budget at a fine scale. Can software frameworks help with this? Data locality is an unavoidable reality that cannot be hidden. (We tried.) Can we expose it to users in ways that help, rather than confuse them? 45

46 A Team Effort. Grad students: Hoang Bui, Li Yu, Peter Bui, Michael Albrecht, Peter Sempolinski, Dinesh Rajan. Undergrads: Rachel Witty, Thomas Potthast, Brenden Kokosza, Zach Musgrave, Anthony Canino. Faculty: Patrick Flynn, Scott Emrich, Jesus Izaguirre, Nitesh Chawla, Kenneth Judd. NSF Grants CCF, CNS, and CNS.

For More Information The Cooperative Computing Lab – Prof. Douglas Thain 47