1 Condor Compatible Tools for Data Intensive Computing Douglas Thain University of Notre Dame Condor Week 2011.



The Cooperative Computing Lab
We collaborate with people who have large scale computing problems in science, engineering, and other fields.
We operate computer systems on the scale of 1,200 cores. (Small)
We conduct computer science research in the context of real people and problems.
We publish open source software that captures what we have learned.

What is Condor Compatible?
Work right out of the box with Condor.
Respect the execution environment.
Interoperate with public Condor interfaces.

And the “challenging” users…
I submitted 10 jobs yesterday, and that worked, so I submitted 10M this morning!
And then I write the output into 10,000 files of 1KB each. Per job.
Did I mention each one reads the same input file of 1TB?
Sorry, am I reading that file twice? What do you mean, sequential access?
Condor is nice, but I also want to use my cluster, and SGE, and Amazon and…

Idea: Get the end user to tell us more about their data needs. In exchange, give them workflow portability and resource management.

Makeflow

part1 part2 part3: input.data split.py
    ./split.py input.data

out1: part1 mysim.exe
    ./mysim.exe part1 > out1

out2: part2 mysim.exe
    ./mysim.exe part2 > out2

out3: part3 mysim.exe
    ./mysim.exe part3 > out3

result: out1 out2 out3 join.py
    ./join.py out1 out2 out3 > result

Douglas Thain and Christopher Moretti, Abstractions for Cloud Computing with Condor, in Syed Ahson and Mohammad Ilyas (eds.), Cloud Computing and Software Services: Theory and Techniques, CRC Press, July 2010.
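The rule structure above can be illustrated with a toy sketch, not the actual Makeflow implementation: parse Make-style rules into a dependency graph, then emit commands in an order where every rule runs after the rules that produce its inputs. The parsing here is a simplified assumption (one indented command per rule).

    def parse_rules(text):
        """Parse Make-style rules: a 'targets: sources' line followed by
        one indented command line. Returns (targets, sources, command) triples."""
        rules = []
        lines = [l for l in text.splitlines() if l.strip()]
        i = 0
        while i < len(lines):
            head, cmd = lines[i], lines[i + 1].strip()
            targets, sources = head.split(":")
            rules.append((targets.split(), sources.split(), cmd))
            i += 2
        return rules

    def run_order(rules):
        """Return commands in a topological order of the rule DAG."""
        producer = {}  # output file -> index of the rule that creates it
        for i, (targets, _, _) in enumerate(rules):
            for t in targets:
                producer[t] = i
        done, order = set(), []
        def visit(i):
            if i in done:
                return
            done.add(i)
            for s in rules[i][1]:
                if s in producer:
                    visit(producer[s])  # run upstream rule first
            order.append(rules[i][2])
        for i in range(len(rules)):
            visit(i)
        return order

    makeflow = """
    part1 part2 part3: input.data split.py
    \t./split.py input.data
    out1: part1 mysim.exe
    \t./mysim.exe part1 > out1
    result: out1
    \t./join.py out1 > result
    """
    order = run_order(parse_rules(makeflow))

Running this on the (abbreviated) example yields the split command first and the join command last, which is exactly the ordering guarantee a workflow engine must provide.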

Makeflow Core Logic
Makeflow = Make + Workflow
Abstract System Interface: Local, Condor, SGE, Hadoop, Work Queue
Transaction Log

Makeflow Core Logic
Abstract System Interface: Local, Condor, SGE, Hadoop, Work Queue; Transaction Log.
The same workflow runs on a shared-nothing cluster running Condor (with NFS), a shared-disk cluster running SGE, and distributed disks running Hadoop.
Michael Albrecht, Patrick Donnelly, Peter Bui, and Douglas Thain, Makeflow: A Portable Abstraction for Cluster, Cloud, and Grid Computing.

Andrew Thrasher, Rory Carmichael, Peter Bui, Li Yu, Douglas Thain, and Scott Emrich, Taming Complex Bioinformatics Workflows with Weaver, Makeflow, and Starch, Workshop on Workflows in Support of Large Scale Science, pages 1-6, November.

Weaver

# Weaver Code
jpgs = [str(i) + '.jpg' for i in range(1000)]
conv = SimpleFunction('convert', out_suffix='png')
pngs = Map(conv, jpgs)

# Makeflow Code
0.png: 0.jpg /usr/bin/convert
    /usr/bin/convert 0.jpg 0.png
1.png: 1.jpg /usr/bin/convert
    /usr/bin/convert 1.jpg 1.png
...
999.png: 999.jpg /usr/bin/convert
    /usr/bin/convert 999.jpg 999.png

Peter Bui, Li Yu, and Douglas Thain, Weaver: Integrating Distributed Computing Abstractions into Scientific Workflows using Python, CLADE, June 2010.

Makeflow and Work Queue

bfile: afile prog
    prog afile > bfile

The makeflow master keeps a queue of tasks and marks them done as 100s of workers, dispatched to the cloud, complete them. Detail of a single worker:
    put prog
    put afile
    exec prog afile > bfile
    get bfile
Two optimizations: Cache inputs and outputs. Dispatch tasks to nodes with the data.
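Those two optimizations can be sketched with a toy scheduling model (this is an illustration, not the Work Queue implementation; the worker names and file names are made up):

    def pick_worker(task_inputs, workers):
        """Prefer the worker whose cache already holds the most input files,
        so the fewest files must be transferred (dispatch to nodes with data)."""
        return min(workers, key=lambda w: len(task_inputs - workers[w]))

    def dispatch(task_inputs, workers):
        """Send only the missing inputs to the chosen worker, and leave them
        in that worker's cache for later tasks (cache inputs and outputs)."""
        w = pick_worker(task_inputs, workers)
        sent = task_inputs - workers[w]
        workers[w] |= task_inputs  # worker now caches these files
        return w, sent

    # workers map a (hypothetical) worker name to the set of files cached there
    workers = {"w1": {"prog", "afile"}, "w2": set()}
    w, sent = dispatch({"prog", "bfile"}, workers)

Here the task lands on w1, which already caches prog, so only bfile is transferred; a fresh worker would have needed both files.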

Makeflow and Work Queue
A Makefile drives Makeflow over local files and programs; Makeflow submits tasks to hundreds of workers in a personal cloud, started with sge_submit_workers on a private SGE cluster, condor_submit_workers on a campus Condor pool, ssh on a private cluster, or for $$$ on a public cloud provider.

SAND – Scalable Assembler
The Scalable Assembler drives 100s of alignments at a time through the Work Queue API, turning raw sequence data into a fully assembled genome.
Christopher Moretti, Michael Olson, Scott Emrich, and Douglas Thain, Highly Scalable Genome Assembly on Campus Grids, Many-Task Computing on Grids and Supercomputers (MTAGS), November 2009.

Replica Exchange on WQ
Replica Exchange runs replicas at temperatures T=10K, T=20K, T=30K, and T=40K through the Work Queue API.
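The exchange step that this application performs between neighboring temperature levels can be sketched in a few lines (a minimal toy that keeps only energies and temperatures, in units with k_B = 1; the real application runs full simulations as Work Queue tasks):

    import math, random

    def swap_probability(E_i, T_i, E_j, T_j):
        """Metropolis acceptance probability for exchanging configurations
        between replicas at temperatures T_i and T_j."""
        delta = (1.0 / T_i - 1.0 / T_j) * (E_i - E_j)
        return min(1.0, math.exp(delta))

    def try_swap(replicas, i, j, rng=random.random):
        """Attempt to exchange the configurations (here, just energies) of
        two replicas; each replica is a dict with 'E' and 'T'."""
        p = swap_probability(replicas[i]["E"], replicas[i]["T"],
                             replicas[j]["E"], replicas[j]["T"])
        if rng() < p:
            replicas[i]["E"], replicas[j]["E"] = replicas[j]["E"], replicas[i]["E"]
            return True
        return False

A swap that moves a lower-energy configuration to a colder temperature is always accepted; the reverse move is accepted with the Boltzmann-weighted probability, which is what lets hot replicas ferry configurations over energy barriers.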

Connecting Condor Jobs to Remote Data Storage

Parrot and Chirp
Parrot – A User Level Virtual File System
–Connects apps to remote data services: HTTP, FTP, Hadoop, iRODS, XrootD, Chirp.
–No special privileges to install or use.
Chirp – A Personal File Server
–Exports existing file services beyond the cluster: local disk, NFS, AFS, HDFS.
–Adds rich access control features.
–No special privileges to install or use.

A Unix app under Parrot (speaking HDFS, HTTP, FTP, Chirp, or xrootd), running on a CPU in the Condor distributed batch system, reads disk files directly from the Hadoop Distributed File System.
Problem: this requires consistent Java and Hadoop libraries installed everywhere.

Instead, the app under Parrot speaks the Chirp protocol to a Chirp server in front of the Hadoop storage cloud, attaching the campus Condor pool to Hadoop without Hadoop libraries on every node.
Patrick Donnelly, Peter Bui, and Douglas Thain, Attaching Cloud Storage to a Campus Grid Using Parrot, Chirp, and Hadoop, IEEE Cloud Computing Technology and Science, November.

Putting it All Together


Computer Science Challenges
With multicore everywhere, we want to run multiple apps per machine, but the local OS is still very poor at managing resources.
How many workers does a workload need? Can we even tell when we have too many or too few?
How do we automatically partition a data intensive DAG across multiple multicore machines?
$$$ is now part of the computing interface. Does it make sense to expose it inside the workflow and/or API?
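The "how many workers" question can be made concrete with a simple throughput model (an illustrative assumption, not a result from the talk): if the master needs t_dispatch seconds to feed each task and each task occupies a worker for t_exec seconds, then beyond roughly t_exec / t_dispatch workers the extras sit idle.

    def useful_workers(t_exec, t_dispatch, tasks_remaining):
        """Estimate how many workers can be kept busy: the master hands out
        one task every t_dispatch seconds, and each task holds a worker for
        t_exec seconds, so the master saturates at t_exec / t_dispatch busy
        workers. Never more workers than tasks left to run."""
        saturation = max(1, int(t_exec / t_dispatch))
        return min(saturation, tasks_remaining)

For example, 60-second tasks with a 0.5-second dispatch cost saturate around 120 workers; adding more only increases cost, which matters once $$$ is part of the interface.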

What is Condor Compatible?
Work right out of the box with Condor.
–makeflow –T condor
–condor_submit_workers
Respect the execution environment.
–Accept eviction and failure as normal.
–Put data in the right place so it can be cleaned up automatically by Condor.
Interoperate with public Condor interfaces.
–Servers run happily under the condor_master.
–Compatible with Chirp I/O via the Starter.

A Team Effort
Grad Students: Hoang Bui, Li Yu, Peter Bui, Michael Albrecht, Patrick Donnelly, Peter Sempolinski, Dinesh Rajan
Undergrads: Rachel Witty, Thomas Potthast, Brenden Kokosza, Zach Musgrave, Anthony Canino
Faculty: Patrick Flynn, Scott Emrich, Jesus Izaguirre, Nitesh Chawla, Kenneth Judd
NSF Grants CCF, CNS, and CNS

For More Information
The Cooperative Computing Lab
Condor-Compatible Software: Makeflow, Work Queue, Parrot, Chirp, SAND
Prof. Douglas Thain