Using Abstractions to Scale Up Applications to Campus Grids
Douglas Thain, University of Notre Dame
28 April 2009

Outline
What is a Campus Grid?
Challenges of Using Campus Grids
Solution: Abstractions
Examples and Applications
–All-Pairs: Biometrics
–Wavefront: Economics
–Assembly: Genomics
Combining Abstractions Together

What is a Campus Grid?
A campus grid is an aggregation of all available computing power found in an institution:
–Idle cycles from desktop machines.
–Unused cycles from dedicated clusters.
Examples of campus grids:
–600 CPUs at the University of Notre Dame
–2,000 CPUs at the University of Wisconsin
–13,000 CPUs at Purdue University

Provides robust batch queueing on a complex distributed system.
Resource owners control consumption:
–“Only run jobs on this machine at night.”
–“Prefer biology jobs over physics jobs.”
End users express needs:
–“Only run this job where RAM > 2 GB.”
–“Prefer to run on machines

The Assembly Language of Campus Grids
User Interface:
–N x { run program X with files F and G }
System Properties:
–Wildly varying resource availability.
–Heterogeneous resources.
–Unpredictable preemption.
Effect on Applications:
–Jobs can’t run for too long...
–But, they can’t run too quickly, either!
–Use file I/O for inter-process communication.
–Bad choices cause chaos on the network and heartburn for system administrators.

I have 10,000 iris images acquired in my research lab. I want to reduce each one to a feature space, and then compare all of them to each other. I want to spend my time doing science, not struggling with computers.
I have a laptop. I own a few machines. I can get cycles from ND and Purdue.
Now what?

Observation
In a given field of study, a single person may repeat the same pattern of work many times, making slight changes to the data and algorithms.
If we knew in advance the intended pattern, then we could do a better job of mapping a complex application to a complex system.

Abstractions for Distributed Computing
Abstraction: a declarative specification of the computation and data of a workload.
–A restricted pattern, not meant to be a general-purpose programming language.
–Uses data structures instead of files.
–Provides users with a bright path.
–Regular structure makes it tractable to model and predict performance.

All-Pairs Abstraction
AllPairs( set A, set B, function F ) returns matrix M where M[i][j] = F( A[i], B[j] ) for all i,j
[Figure: every element of set A is compared against every element of set B by F, filling in the result matrix.]
Invoked as: allpairs A B F.exe
Moretti, Bulosan, Flynn, Thain, “All-Pairs: An Abstraction…”, IPDPS 2008
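To make the semantics concrete, here is a minimal sequential reference sketch in Python. This is my own illustration of what the abstraction computes, not the distributed implementation behind the allpairs tool; the function names and sample data are invented.

# Sequential reference for AllPairs: M[i][j] = F(A[i], B[j]) for all i, j.
def allpairs_seq(A, B, F):
    return [[F(a, b) for b in B] for a in A]

# Illustrative comparison function: Hamming distance between feature vectors.
def hamming(x, y):
    return sum(1 for xi, yi in zip(x, y) if xi != yi)

A = [(0, 1, 1), (1, 1, 0)]
B = [(0, 1, 0), (1, 1, 1), (0, 0, 0)]
M = allpairs_seq(A, B, hamming)   # a 2 x 3 matrix of scores
print(M)                          # [[1, 1, 2], [1, 1, 2]]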

Example Application
Goal: Design a robust face comparison function.
[Figure: F scores two example image pairs, producing 0.05 for one pair and 0.97 for the other.]

Similarity Matrix Construction
Current Workload:
–4,000 images, 256 KB each, 10s per F (five days)
Future Workload:
–images, 1 MB each, 1s per F (three months)

Non-Expert User on a Campus Grid
Try 1: Each F is a batch job.
–Failure: Dispatch latency >> F runtime.
Try 2: Each row is a batch job.
–Failure: Too many small ops on the FS.
Try 3: Bundle all files into one package.
–Failure: Everyone loads 1 GB at once.
Try 4: User gives up and attempts to solve an easier or smaller problem.

All-Pairs Abstraction
AllPairs( set A, set B, function F ) returns matrix M where M[i][j] = F( A[i], B[j] ) for all i,j

% allpairs compare.exe adir bdir

Distribute Data Via Spanning Tree

An Interesting Twist
Send the absolute minimum amount of data needed to each of N nodes from a central server:
–Each job must run on exactly 1 node.
–Data distribution time: O( D sqrt(N) )
Send all data to all N nodes via spanning tree distribution:
–Any job can run on any node.
–Data distribution time: O( D log(N) )
It is both faster and more robust to send all data to all nodes via the spanning tree.
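A rough back-of-the-envelope model (my own sketch, not from the talk) shows where the two bounds come from. Assume the total dataset has size D, the link bandwidth is B, and the N workers tile the All-Pairs matrix as a sqrt(N) x sqrt(N) grid, so each worker's block needs about 2D/sqrt(N) of input (its slice of A plus its slice of B):

T_central ≈ N · (2D / sqrt(N)) / B = 2 · D · sqrt(N) / B = O( D sqrt(N) )   (every byte serialized through the one server)
T_tree    ≈ (D / B) · log2(N) = O( D log(N) )                               (full copies relayed in parallel down a tree of depth log2 N)

For large N the full-replication strategy wins, and it is also more robust, since any job can then run on any node.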

Choose the Right # of CPUs

What is the right metric?

What’s the right metric?
Speedup?
–Sequential Runtime / Parallel Runtime
Parallel Efficiency?
–Speedup / N CPUs
Neither works, because the number of CPUs varies over time and between runs.
Better Choice: Cost Efficiency
–Work Completed / Resources Consumed
–Cars: Miles / Gallon
–Planes: Person-Miles / Gallon
–Results / CPU-hours
–Results / $$$
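As formulas, with T_seq and T_par the sequential and parallel runtimes, N the number of CPUs, R the results produced, and C the CPU-hours consumed:

Speedup = T_seq / T_par
Parallel Efficiency = Speedup / N
Cost Efficiency = R / C   (results per CPU-hour; results per dollar is the analogous measure when money is the scarce resource)

A hypothetical worked example (numbers invented for illustration): a run that produces 1,000,000 comparisons while consuming 2,000 CPU-hours achieves 500 results per CPU-hour, and that figure stays comparable across runs even if the pool grew and shrank while the job ran.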

All-Pairs Abstraction

Wavefront ( R[x,0], R[0,y], F(x,y,d) )
[Figure: given the boundary row R[x,0] and boundary column R[0,y], each interior cell R[x,y] is computed by F from its left, lower, and diagonal neighbors, so results sweep across the matrix as a diagonal wave.]
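A minimal sequential sketch of the recurrence, in Python (my own illustration of the semantics, not the distributed tool; all names are invented):

# Fill R[x][y] = F( R[x-1][y], R[x][y-1], R[x-1][y-1] ),
# given the boundary row R[x][0] and boundary column R[0][y].
def wavefront_seq(row0, col0, F):
    n = len(row0)                       # assume a square n x n problem
    R = [[None] * n for _ in range(n)]
    for x in range(n):
        R[x][0] = row0[x]
    for y in range(n):
        R[0][y] = col0[y]
    for x in range(1, n):
        for y in range(1, n):
            R[x][y] = F(R[x - 1][y], R[x][y - 1], R[x - 1][y - 1])
    return R

# Illustrative F: any function of the left (x), lower (y), and diagonal (d) neighbors.
R = wavefront_seq(list(range(5)), list(range(5)), lambda x, y, d: x + y - d)

The distributed implementation exploits the fact that all cells on the same anti-diagonal are independent of one another and can run concurrently.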

% wavefront func.exe infile outfile

The Performance Problem
Dispatch latency really matters: a delay in one job holds up all of its children.
If we dispatch larger sub-problems:
–Concurrency on each node increases.
–Distributed concurrency decreases.
If we dispatch smaller sub-problems:
–Concurrency on each node decreases.
–More time is spent waiting for jobs to be dispatched.
So: model the system to choose the block size, and build a fast-dispatch execution system.
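One way to see the tradeoff is a rough model (my own sketch, simpler than the model used in the talk): for an n x n wavefront cut into b x b blocks, with per-cell compute time t, per-job dispatch latency L, and enough workers to keep every ready block running, the critical path crosses 2(n/b) - 1 blocks, each costing roughly L + (b^2)·t, so

T(b) ≈ ( 2n/b − 1 ) · ( L + (b^2)·t )

A small b pays the dispatch latency L too many times; a large b serializes too much work inside each block. Minimizing T over b yields the block size, which is the knob the block-size parameter on the following slides controls.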

Wavefront ( R[x,0], R[0,y], F(x,y,d) ), Block Size = 2
[Figure: the same wavefront, but dispatched as 2x2 blocks of cells rather than as individual cells.]

Model of 1000x1000 Wavefront

[Figure: the Work Queue architecture. A wavefront master keeps a queue of ready tasks and marks them done as results return; 100s of workers dispatched to Notre Dame, Purdue, and Wisconsin each run the same exchange: put F.exe, put in.txt, exec F.exe <in.txt >out.txt, get out.txt.]
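For readers who want to see that put/exec/get protocol driven from code, here is a minimal master written against the CCTools Work Queue Python bindings. The bindings postdate this 2009 talk and their names vary a little between CCTools releases, so treat this as an illustrative sketch; the file names and port number are placeholders.

# Minimal Work Queue master: farm out one F.exe invocation per input file.
# Workers attach with:  work_queue_worker <master-host> 9123
from work_queue import WorkQueue, Task, WORK_QUEUE_INPUT, WORK_QUEUE_OUTPUT

q = WorkQueue(port=9123)

for i in range(10):
    t = Task("./F.exe < in.%d.txt > out.%d.txt" % (i, i))
    t.specify_file("F.exe", "F.exe", WORK_QUEUE_INPUT, cache=True)          # "put F.exe" (cached on the worker)
    t.specify_file("in.%d.txt" % i, "in.%d.txt" % i, WORK_QUEUE_INPUT)      # "put in.txt"
    t.specify_file("out.%d.txt" % i, "out.%d.txt" % i, WORK_QUEUE_OUTPUT)   # "get out.txt"
    q.submit(t)

while not q.empty():
    t = q.wait(5)                  # wait up to 5 seconds for a finished task
    if t:
        print("task %d exited with status %d" % (t.id, t.return_status))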

500x500 Wavefront on ~200 CPUs

Wavefront on a 200-CPU Cluster

Wavefront on a 32-Core CPU

The Genome Assembly Problem
[Figure: chemical sequencing breaks a long genome sequence into millions of overlapping “reads,” each 100s of bytes long; computational assembly must piece the reads back together into the original sequence.]

Sample Genomes
                        Reads    Data     Pairs    Sequential Time
A. gambiae scaffold     101K     80 MB    738K     12 hours
A. gambiae complete     180K     1.4 GB   12M      6 days
S. bicolor              7.9M     5.7 GB   84M      30 days

Assemble( set S, Test(), Align(), Assm() )
[Figure: the assembly pipeline. Test scans the sequence data for candidate pairs of reads that look similar (e.g. “0 is similar to 1”, “1 is similar to 3”); Align computes an alignment for each candidate pair; Assm combines the list of alignments into the assembled sequence. The stages stress different resources: Test is I/O bound, Align is CPU bound, and Assm is RAM bound.]
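A sequential sketch of that pipeline's shape, in Python (my own illustration of the data flow; the distributed version farms the Align stage out to workers as shown on the following slide, and all names here are invented):

# Assemble(set S, Test(), Align(), Assm()): filter, align, then assemble.
def assemble(reads, test, align, assm):
    candidates = test(reads)                       # I/O-heavy: scan reads for likely overlaps
    alignments = [align(reads[i], reads[j])        # CPU-heavy: align each candidate pair
                  for (i, j) in candidates]
    return assm(reads, alignments)                 # RAM-heavy: stitch alignments into a sequence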

Distributed Genome Assembly
[Figure: the test and assemble stages run alongside an align master that keeps a queue of candidate-pair tasks; 100s of workers dispatched to Notre Dame, Purdue, and Wisconsin each receive align.exe and in.txt, execute the alignment, and return out.txt.]

Small Genome (101K reads)

Medium Genome (180K reads)

Large Genome (7.9M reads)

From Workstation to Grid

What’s the Upshot?
–We can do full-scale assemblies as a routine matter on existing conventional machines.
–Our solution is faster (in wall-clock time) than the next-fastest assembler running on a 1024-node BG/L.
–You could almost certainly do better with a dedicated cluster and a fast interconnect, but such systems are not universally available.
–Our solution opens up research in assembly to labs with “NASCAR” rather than “Formula One” hardware.

What Other Abstractions Might Be Useful?
–Map( set S, F(s) )
–Explore( F(x), x: [a….b] )
–Minimize( F(x), delta )
–Minimax( state s, A(s), B(s) )
–Search( state s, F(s), IsTerminal(s) )
–Query( properties ) -> set of objects
–FluidFlow( V[x,y,z], F(v), delta )

How do we connect multiple abstractions together?
–Need a meta-language, perhaps with its own atomic operations for simple tasks.
–Need to manage (possibly large) intermediate storage between operations.
–Need to handle data type conversions between almost-compatible components.
–Need type reporting and error checking to avoid expensive errors.
If abstractions are feasible to model, then it may be feasible to model entire programs.

Connecting Abstractions in BXGrid
S = Select( color=“brown” )
B = Transform( S, F )
M = AllPairs( A, B, F )
[Figure: Select pulls the brown-eyed iris records out of the repository, Transform applies F to each selected image, and AllPairs compares the sets with F; the resulting matrix feeds an ROC curve.]
Bui, Thomas, Kelly, Lyon, Flynn, Thain, “BXGrid: A Repository and Experimental Abstraction…”, poster at IEEE eScience 2008
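A toy sketch of how such a chain might look as a small script, with the intermediate results of each stage materialized explicitly (my own illustration; every function here is an invented stand-in, not the BXGrid API):

# Chain three abstractions: Select -> Transform -> AllPairs.
def select_images(repo, color):
    return [img for (img, c) in repo if c == color]

def transform(img, F):
    return F(img)

def allpairs(A, B, compare):
    return [[compare(a, b) for b in B] for a in A]

repo = [((0, 1, 1), "brown"), ((1, 0, 0), "blue"), ((1, 1, 0), "brown")]
S = select_images(repo, "brown")                                      # Select( color="brown" )
B = [transform(s, lambda v: v) for s in S]                            # Transform( S, F )
M = allpairs(B, B, lambda x, y: sum(a != b for a, b in zip(x, y)))    # AllPairs over the selection
print(M)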


Implementing Abstractions
S = Select( color=“brown” )
–Runs against the relational database / DBMS (2x).
B = Transform( S, F )
–Runs on the active storage cluster (16x).
M = AllPairs( A, B, F )
–Runs on the Condor pool (500x).

Largest Combination so Far
–Complete Select/Transform/All-Pairs biometric experiment on 58,396 irises from the Face Recognition Grand Challenge.
–To our knowledge, the largest experiment ever run on publicly available data.
–Competing biometric research relies on samples of images, which can miss important population effects.
–Reduced computation time from 800 days to 10 days, making it feasible to repeat multiple times for a graduate thesis.

Abstractions Redux
–Campus grids provide enormous computing power, but are very challenging to use effectively.
–An abstraction provides a robust, scalable solution to a category of problems.
–Multiple abstractions can be chained together to solve very large problems.
–Could a menu of abstractions cover a significant fraction of the application mix?

Acknowledgments
Cooperative Computing Lab
Grad Students:
–Chris Moretti
–Hoang Bui
–Li Yu
–Mike Olson
–Michael Albrecht
Undergrads:
–Mike Kelly
–Rory Carmichael
–Mark Pasquier
–Christopher Lyon
–Jared Bulosan
Faculty:
–Patrick Flynn
–Nitesh Chawla
–Kenneth Judd
–Scott Emrich
NSF Grants CCF, CNS