
1 Scaling Up Data Intensive Science to Campus Grids. Douglas Thain, University of Notre Dame. Clemson University, 25 September 2009.

3 The Cooperative Computing Lab
– We collaborate with people who have large scale computing problems.
– We build new software and systems to help them achieve meaningful goals.
– We run a production computing system used by people at ND and elsewhere.
– We conduct computer science research, informed by real world experience, with an impact upon problems that matter.

4 What is a Campus Grid?
A campus grid is an aggregation of all available computing power found in an institution:
– Idle cycles from desktop machines.
– Unused cycles from dedicated clusters.
Examples of campus grids:
– 700 CPUs at the University of Notre Dame
– ?,000 CPUs at Clemson University
– 20,000 CPUs at Purdue University

5 Provides robust batch queueing on a complex distributed system.
Resource owners control consumption:
– “Only run jobs on this machine at night.”
– “Prefer biology jobs over physics jobs.”
End users express needs:
– “Only run this job where RAM > 2 GB.”
– “Prefer to run on machines...”

11 Clusters, clouds, and grids give us access to unlimited CPUs. How do we write programs that can run effectively in large systems?

12 Example: Biometrics Research
Goal: design a robust face comparison function F.
(Figure: F applied to two pairs of face images, returning similarity scores of 0.05 and 0.97.)

13 Similarity Matrix Construction
Challenge workload: 60,000 iris images, 1 MB each; 0.02 s per evaluation of F; 833 CPU-days; 600 TB of I/O.
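
As a sanity check, the CPU time follows directly from the numbers above: 60,000^2 comparisons × 0.02 s ≈ 7.2 × 10^7 s ≈ 833 CPU-days.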

14 I have 60,000 iris images acquired in my research lab. I want to reduce each one to a feature space, and then compare all of them to each other. I want to spend my time doing science, not struggling with computers. I have a laptop. I own a few machines. I can buy time from Amazon or TeraGrid. Now what?

15 Non-Expert User Using 500 CPUs
Try 1: Each F is a batch job. Failure: dispatch latency >> F runtime.
Try 2: Each row is a batch job. Failure: too many small operations on the file system.
Try 3: Bundle all files into one package. Failure: everyone loads 1 GB at once.
Try 4: User gives up and attempts to solve an easier or smaller problem.
(Figure: each try drawn as a head node (HN) dispatching F tasks to CPUs.)
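
To make Try 1's failure concrete: each F runs for only 0.02 s, while dispatching a single batch job typically costs on the order of seconds (an illustrative figure, not from the slides), so nearly all of the system's time would be spent scheduling rather than computing. Tries 2 and 3 fail for the opposite reason: batching coarsely enough to hide dispatch latency shifts the bottleneck onto the shared file system.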

16 Observation
In a given field of study, many people repeat the same pattern of work many times, making slight changes to the data and algorithms.
If the system knows the overall pattern in advance, then it can do a better job of executing it reliably and efficiently.
If the user knows in advance what patterns are allowed, then they have a better idea of how to construct their workloads.

17 Abstractions for Distributed Computing
Abstraction: a declarative specification of the computation and data of a workload.
A restricted pattern, not meant to be a general purpose programming language.
Uses data structures instead of files.
Provides users with a bright path.
Regular structure makes it tractable to model and predict performance.

18 Working with Abstractions
(Figure: a compact data structure A1, A2, ..., An and a function F are handed to AllPairs( A, B, F ); a custom workflow engine runs the resulting workload on a cloud or grid.)

19 All-Pairs Abstraction
AllPairs( set A, set B, function F ) returns matrix M where M[i][j] = F( A[i], B[j] ) for all i,j.
Command-line form: allpairs A B F.exe
(Figure: sets A and B laid out as the rows and columns of the matrix, with F evaluated at every cell.)
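
A minimal sequential Python sketch of these semantics (illustrative only; the real AllPairs engine distributes the evaluations as described on the following slides):

    def allpairs(A, B, F):
        # Build M with M[i][j] = F(A[i], B[j]) for every i, j.
        return [[F(a, b) for b in B] for a in A]

    # Toy example: an exact-match "similarity" function.
    M = allpairs([1, 2, 3], [1, 2, 3], lambda x, y: 1.0 if x == y else 0.0)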

20 How Does the Abstraction Help?
The custom workflow engine:
– Chooses the right data transfer strategy.
– Chooses the right number of resources.
– Chooses the blocking of functions into jobs.
– Recovers from a large number of failures.
– Predicts overall runtime accurately.
All of these tasks are nearly impossible for arbitrary workloads, but are tractable (not trivial) to solve for a specific abstraction.

22 Choose the Right # of CPUs

23 Resources Consumed

24 All-Pairs in Production
Our All-Pairs implementation has provided over 57 CPU-years of computation to the ND biometrics research group over the last year.
Largest run so far: 58,396 irises from the Face Recognition Grand Challenge. This is the largest experiment ever run on publicly available data.
Competing biometric research relies on samples of images, which can miss important population effects.
Reduced computation time from 833 days to 10 days, making it feasible to repeat multiple times for a graduate thesis. (We can go faster yet.)

27 Are there other abstractions?

28 All-Pairs Abstraction (recap)
AllPairs( set A, set B, function F ) returns matrix M where M[i][j] = F( A[i], B[j] ) for all i,j.

29 Wavefront Abstraction
Wavefront( matrix M, function F(x,y,d) ) returns matrix M such that M[i,j] = F( M[i-1,j], M[i,j-1], M[i-1,j-1] ).
(Figure: the matrix is filled outward from its initialized first row and first column; each cell depends on three neighbors, so all cells on an anti-diagonal can be computed in parallel.)
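
A sequential Python sketch of the recurrence (illustrative; the distributed engine exploits the anti-diagonal parallelism noted above):

    def wavefront(M, F, n):
        # M is an n-by-n grid whose first row and first column hold boundary data.
        # Fill each remaining cell from its three already-computed neighbors.
        for i in range(1, n):
            for j in range(1, n):
                M[i][j] = F(M[i - 1][j], M[i][j - 1], M[i - 1][j - 1])
        return M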

30 Some-Pairs Abstraction
SomePairs( set A, list L of pairs (i,j), function F(x,y) ) returns the list of F( A[i], A[j] ) for each (i,j) in L.
Example pair list: (1,2) (2,1) (2,3) (3,3).
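
The corresponding Python sketch is a one-liner (again illustrative):

    def somepairs(A, L, F):
        # Evaluate F only on the listed index pairs, not the full cross product.
        return [F(A[i], A[j]) for (i, j) in L]

    results = somepairs(["a0", "a1", "a2", "a3"],
                        [(1, 2), (2, 1), (2, 3), (3, 3)],
                        lambda x, y: x + y)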

31 What if your application doesn’t fit a regular pattern?

32 Makeflow

part1 part2 part3: input.data split.py
	./split.py input.data

out1: part1 mysim.exe
	./mysim.exe part1 > out1

out2: part2 mysim.exe
	./mysim.exe part2 > out2

out3: part3 mysim.exe
	./mysim.exe part3 > out3

result: out1 out2 out3 join.py
	./join.py out1 out2 out3 > result
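
Assuming the cctools software is installed (an assumption; the slide shows only the file itself), the same file runs unchanged on a single machine or on a batch system, e.g. makeflow -T local example.makeflow on a desktop versus makeflow -T condor example.makeflow on a Condor-based campus grid (file name illustrative).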

33 Makeflow Implementation
(Figure: the makeflow master keeps a queue of tasks and dispatches them over a work queue to 100s of workers in the cloud.)
Detail of a single worker executing the rule "bfile: afile prog / prog afile > bfile":
put prog
put afile
exec prog afile > bfile
get bfile
Two optimizations: cache inputs and outputs; dispatch tasks to nodes with the data.
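
A self-contained Python sketch of that four-step worker exchange, simulated with a local sandbox directory (the names and structure are illustrative, not the actual Work Queue protocol code):

    import os, shutil, subprocess, tempfile

    def run_on_worker(prog, afile, bfile):
        sandbox = tempfile.mkdtemp(prefix="worker-")      # worker's private scratch space
        shutil.copy(prog, sandbox)                        # "put prog"
        shutil.copy(afile, sandbox)                       # "put afile"
        with open(os.path.join(sandbox, bfile), "w") as out:
            subprocess.run(                               # "exec prog afile > bfile"
                ["./" + os.path.basename(prog), os.path.basename(afile)],
                cwd=sandbox, stdout=out, check=True)
        shutil.copy(os.path.join(sandbox, bfile), bfile)  # "get bfile"

Caching inputs between tasks, as the slide notes, amounts to skipping the "put" steps when the sandbox already holds the file.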

34 Experience with Makeflow
Reusing a good old idea in a new way.
Easy to test and debug on a desktop machine or a multicore server.
The workload says nothing about the distributed system. (This is good.)
Graduate students in bioinformatics got their codes running at production speeds on hundreds of nodes in less than a week.
A student from Clemson got a complex biometrics workload running in a few weeks.

35 Putting It All Together
(Figure: a web portal and a data repository feed data X, Y, Z into an abstraction applying function F, which executes on the campus grid.)

36 BXGrid Schema
General metadata: fileid, size = 300K, type = jpg, sum = abc123...
Immutable replicas: replicaid=423 state=ok; replicaid=105 state=ok; replicaid=293 state=creating; replicaid=102 state=deleting
Scientific metadata:
Type  Subject  Eye    Color  FileID
Iris  S100     Right  Blue   10486
Iris  S100     Left   Blue   10487
Iris  S203     Right  Brown  24304
Iris  S203     Left   Brown  24305

40 Results from Campus Grid

41 Biocompute

43 Parallel BLAST Makeflow

45 Abstractions as a Social Tool
Collaboration with outside groups is how we encounter the most interesting, challenging, and important problems in computer science.
However, often neither side understands which details are essential or non-essential:
– Can you deal with files that have upper case letters?
– Oh, by the way, we have 10 TB of input, is that ok?
– (A little bit of an exaggeration.)
An abstraction is an excellent chalkboard tool:
– Accessible to anyone with a little bit of mathematics.
– Makes it easy to see what must be plugged in.
– Forces out essential details: data size, execution time.

46 Conclusion
Grids, clouds, and clusters provide enormous computing power, but are very challenging to use effectively.
An abstraction provides a robust, scalable solution to a narrow category of problems; each requires different kinds of optimizations.
Limiting expressive power results in systems that are usable, predictable, and reliable.
Portal + Repository + Abstraction + Grid = New Science Capabilities

47 Acknowledgments
Cooperative Computing Lab
Grad students: Chris Moretti, Hoang Bui, Li Yu, Mike Olson, Michael Albrecht.
Faculty: Patrick Flynn, Nitesh Chawla, Kenneth Judd, Scott Emrich.
NSF Grants CCF and CNS and CNS.
Undergrads: Mike Kelly, Rory Carmichael, Mark Pasquier, Christopher Lyon, Jared Bulosan, Kameron Srimoungach, Rachel Witty, Ryan Jansen, Joey Rich.