Real-World Barriers to Scaling Up Scientific Applications
Douglas Thain, University of Notre Dame
Trends in HPDC Workshop, Vrije Universiteit, March 2012



The Cooperative Computing Lab
We collaborate with people who have large-scale computing problems in science, engineering, and other fields. We operate computer systems at the scale of O(1000) cores: clusters, clouds, and grids. We conduct computer science research in the context of real people and problems. We release open source software for large-scale distributed computing.

Our Application Communities
Bioinformatics – "I just ran a tissue sample through a sequencing device. I need to assemble 1M DNA strings into a genome, then compare it against a library of known human genomes to find the differences."
Biometrics – "I invented a new way of matching iris images from surveillance video. I need to test it on 1M hi-resolution images to see if it actually works."
Molecular Dynamics – "I have a new method of energy sampling for ensemble techniques. I want to try it out on 100 different molecules at 1000 different temperatures for 10,000 random trials."

The Good News: Computing is Plentiful!



greencloud.crc.nd.edu

Superclusters by the Hour

The Bad News: It is inconvenient.

I have a standard, debugged, trusted application that runs on my laptop. A toy problem completes in one hour; a real problem will take a month (I think). Can I get a single result faster? Can I get more results in the same time? Last year, I heard about this grid thing. This year, I heard about this cloud thing. What do I do next?

What users want. What they get.

The Traditional Application Model?
"Every program attempts to grow until it can read mail." - Jamie Zawinski

What goes wrong? Everything!
– Scaling up from 10 to 10,000 tasks violates ten different hard-coded limits in the kernel, the filesystem, the network, and the application.
– Failures are everywhere! Exposing error messages is confusing, but hiding errors causes unbounded delays.
– The user didn't know that the program relies on 1TB of configuration files, all scattered around the home filesystem.
– The user discovers that the program only runs correctly on Blue Sock Linux!
– The user discovers that the program generates different results when run on different machines.

Abstractions for Scalable Apps

Work Queue API

    wq_task_create( files and program );
    wq_task_submit( queue, task );
    wq_task_wait( queue ) -> task

C implementation, with Python and Perl bindings.
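The create/submit/wait loop above is the entire programming model. As a rough illustration of that pattern only (this is not the real Work Queue library; a thread pool stands in for remote workers, and all names here are hypothetical), a self-contained Python sketch:

```python
from concurrent.futures import ThreadPoolExecutor

class Queue:
    """Stand-in for a Work Queue master: submit tasks, wait for results."""

    def __init__(self, max_workers=4):
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self.pending = []

    def submit(self, task):
        # Analogue of wq_task_submit: hand the task to a free worker.
        self.pending.append(self.pool.submit(task))

    def wait(self):
        # Analogue of wq_task_wait: block until a submitted task finishes.
        return self.pending.pop(0).result()

def simulate(part):
    # In the real system this would be a command line plus input/output files.
    return f"out{part}"

q = Queue()
for part in (1, 2, 3):
    q.submit(lambda p=part: simulate(p))

results = [q.wait() for _ in range(3)]
print(results)  # ['out1', 'out2', 'out3']
```

The key property the sketch shares with Work Queue is that the application never knows how many workers exist or where they run; it only submits tasks and waits for whichever results come back.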

Work Queue Apps
[Figure: Two Work Queue applications. Left: replica-exchange ensemble molecular dynamics, stepping an ensemble through temperatures T=10K, 20K, 30K, 40K. Right: SAND sequence assembly, aligning hundreds of raw sequence reads (e.g. AGTCACACTGTACGTA…) into a fully assembled genome.]

An Old Idea: Make

    part1 part2 part3: input.data split.py
        ./split.py input.data

    out1: part1 mysim.exe
        ./mysim.exe part1 > out1

    out2: part2 mysim.exe
        ./mysim.exe part2 > out2

    out3: part3 mysim.exe
        ./mysim.exe part3 > out3

    result: out1 out2 out3 join.py
        ./join.py out1 out2 out3 > result
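Workflows of this shape are usually generated by a small script rather than written by hand, since a real run has thousands of parts, not three. A minimal sketch (the file and program names follow the slide; the generator itself is hypothetical) that emits the same split/simulate/join workflow for any number of parts:

```python
def make_workflow(nparts):
    """Emit Make-style rules for a split -> simulate -> join pipeline."""
    parts = [f"part{i}" for i in range(1, nparts + 1)]
    outs = [f"out{i}" for i in range(1, nparts + 1)]
    rules = []
    # One rule splits the input into parts.
    rules.append(f"{' '.join(parts)}: input.data split.py\n\t./split.py input.data")
    # One independent rule per part -- these can all run in parallel.
    for p, o in zip(parts, outs):
        rules.append(f"{o}: {p} mysim.exe\n\t./mysim.exe {p} > {o}")
    # One rule joins the results.
    rules.append(f"result: {' '.join(outs)} join.py\n\t./join.py {' '.join(outs)} > result")
    return "\n\n".join(rules)

print(make_workflow(3))
```

Because every rule declares its inputs and outputs explicitly, the runtime can infer the dependency graph, run the independent rules concurrently, and ship exactly the named files to remote workers.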

Makeflow Applications

Why Users Like Makeflow
– Use existing applications without change.
– Use an existing language everyone knows. (Some apps are already in Make.)
– Via workers, harness all available resources: desktop to cluster to cloud.
– Transparent fault tolerance means you can harness unreliable resources.
– Transparent data movement means no shared filesystem is required.

Hundreds of Workers in a Personal Cloud
[Diagram: An application built on the Work Queue API, with local files and programs, submits tasks to a Work Queue overlay of hundreds of workers. Workers are started on a campus Condor pool via condor_submit_workers, on a shared SGE cluster via sge_submit_workers, and on a private cluster and public cloud provider via ssh.]

The Elastic Application Stack
[Diagram: Custom apps and the All-Pairs, Wavefront, and Makeflow abstractions sit atop the Work Queue library, which drives hundreds of workers in a personal cloud spanning a private cluster, a campus Condor pool, a shared SGE cluster, and a public cloud provider.]

The Elastic Application Curse
Just because you can run on a thousand cores… doesn't mean you should! The user must make an informed decision about scale, cost, and performance efficiency. (Obligatory halting problem reference.) Can the computing framework help the user to make an informed decision?

[Diagram: An elastic application consumes input files and resources from a service provider; the user must choose the scale of the system, which in turn determines runtime and cost.]
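The scale/cost tradeoff can be sketched with a naive back-of-the-envelope model (assumptions: perfectly parallel, uniform tasks; a hypothetical flat per-core-hour price; a fixed startup overhead). Runtime falls as workers are added while cost stays nearly flat, until startup overhead makes extra workers a pure loss:

```python
import math

def estimate(tasks, task_hours, workers, price_per_core_hour, startup_hours=0.05):
    """Naive model of runtime and cost for `tasks` uniform, independent tasks."""
    rounds = math.ceil(tasks / workers)          # waves of tasks across workers
    runtime = rounds * task_hours + startup_hours
    cost = workers * runtime * price_per_core_hour
    return runtime, cost

for w in (10, 100, 1000):
    runtime, cost = estimate(tasks=1000, task_hours=0.5, workers=w,
                             price_per_core_hour=0.10)
    print(f"{w:5d} workers: {runtime:6.2f} h, ${cost:8.2f}")
```

Even this crude model shows the shape of the curse: going from 10 to 1000 workers cuts runtime by nearly 100x but slightly increases total cost, and the user, not the framework, must decide which point on that curve is worth paying for.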

Abstractions for Scalable Apps

Different Techniques
– For regular graphs (All-Pairs), almost everything is known and accurate predictions can be made.
– For irregular graphs (Makeflow), input data and cardinality are known, but time and intermediate data are not.
– For dynamic programs (Work Queue), prediction is not possible in the general case. (Runtime controls are still possible.)
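For the regular case, the prediction is simple arithmetic: an All-Pairs workload comparing every row against every column has exactly n_rows * n_cols tasks, known before anything runs. A sketch (the per-comparison time would come from a short benchmark in practice; here it is an assumed constant, and ideal speedup is assumed):

```python
def allpairs_prediction(n_rows, n_cols, secs_per_compare, workers):
    """All-Pairs is a regular graph: the task count is fully known up front."""
    tasks = n_rows * n_cols
    total_cpu_secs = tasks * secs_per_compare
    wall_secs = total_cpu_secs / workers   # ideal parallel wall-clock time
    return tasks, wall_secs

tasks, wall = allpairs_prediction(1000, 1000, 0.25, 500)
print(tasks, wall)  # 1000000 250.0... -> one million comparisons, 500.0 s
```

No such closed form exists for a Makeflow graph, where task durations and intermediate file sizes are discovered as the run unfolds, and none at all for a dynamic Work Queue program, where even the task count depends on earlier results.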

Is there a Chomsky Hierarchy for Distributed Computing?

    Language Construct          | Required Machine
    Regular Expressions         | Finite State Machine
    Context Free Grammar        | Pushdown Automaton
    Context Sensitive Grammar   | Linear Bounded Automaton
    Unrestricted Grammar        | Turing Machine

Papers, Software, Manuals, …
This work was supported by NSF Grants CCF, CNS, and CNS.