Data-driven Workflow Planning in Cluster Management Systems
Srinath Shankar, David J. DeWitt
Department of Computer Sciences, University of Wisconsin-Madison, USA

Data explosion in science
Scientific applications have traditionally been considered compute-intensive, but recent years have brought a data explosion:
- Astronomy – hundreds of TB: the Sloan Digital Sky Survey; LIGO, the Laser Interferometer Gravitational-Wave Observatory
- Bioinformatics: BIRN, the Biomedical Informatics Research Network; SwissProt, a protein database

Scientific workflows and files
- Jobs with dependencies are organized into directed acyclic graphs (DAGs)
- A large number of similar DAGs make up a workflow
- (Example DAG figure: A, B, C and D are programs; File1 and File2 are pipeline (intermediate) files; FileInput is a batch input file common to all DAGs)

Distributed scientific computing
- Scientists have exploited distributed computing to run their programs and workflows
- One popular distributed computing system is Condor
- Condor harvests idle CPU cycles on machines in a network
- Condor has been installed on roughly 113,000 machines across 1,600 clusters around the world

But …
Several advances have been made since the development of Condor in the 1980s.
- Machines are getting cheaper
  - Organizations no longer rely solely on idle desktop machines for computing cycles
  - The proportion of machines dedicated to Condor computing in a cluster is increasing
- Disk capacities are increasing
  - A single machine may have 500 GB of disk space
  - Thus, desktop machines may also have a lot of free disk space
- Both dedicated and desktop machines have unused disk space: half a petabyte of disk space spread over a modest cluster of 1000 machines

Focus
The volume of data processed by scientific applications is increasing. How can we leverage distributed disk space to improve data management in cluster computing systems (like Condor)?
- Step 1: Store workflow data across the disks of machines in a cluster
- Step 2: Schedule workflows based on data location, exploiting disk space to improve workflow execution times

Overview of Condor
(Architecture diagram: the planner matches jobs with machines using job info from the submit machine and machine info from the execute machines; user input data flows from the submit machine to the user process on the execute machine, and output data flows back. Data flow and control flow are shown separately.)

Job and workflow submission
- To submit a job, the user provides a “submit” file containing:
  - A complete job description – the input, output and error files, when to transfer these files, etc.
  - Machine preferences such as OS, CPU speed and memory
- Workflows are managed in a separate layer:
  - The user specifies dependencies between jobs in a separate “DAG description” file
  - A DAG manager process (DAGMan) on the submit machine continuously monitors job completion events
  - DAGMan submits a job only when all of its parents have completed
(An example of both files follows.)
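For concreteness, here is a minimal sketch of the two files described above. Only the keywords are standard Condor/DAGMan syntax; the job names, file names and requirement values are hypothetical.

    # job_b.sub -- hypothetical submit file for job B
    executable              = analyze
    transfer_input_files    = File1
    output                  = job_b.out
    error                   = job_b.err
    log                     = job_b.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    requirements            = (OpSys == "LINUX") && (Memory >= 1024)
    queue

    # workflow.dag -- hypothetical DAG description file handed to DAGMan
    JOB A job_a.sub
    JOB B job_b.sub
    JOB C job_c.sub
    PARENT A CHILD B
    PARENT B CHILD C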

Limitations of Condor
- The “source” of files in Condor is the submit machine, or perhaps a shared or third-party file system
  - Inefficient handling of files during workflow execution
  - Files are always transferred to and from the submit machine
- The planner only handles single jobs
  - It has no direct knowledge of job dependencies
  - It only sees a job after DAGMan submits it

Distributed file caching
- Keep the files of a job on the disks of machines after execution
- Utilize local disks on execute machines as sources of files
- Schedule dependent jobs on the same machine whenever feasible
- Avoid network file transfers
- Reduce overall workflow execution time

Disk-aware planning
- Goal: reduce workflow execution time by minimizing file transfers
- The planner must be aware of the locations of cached files
- This requires a planner that is also aware of workflow structure

Two-phase planning algorithm
- AssignDAGs: each DAG in a workflow is tentatively assigned to the best machine based on disk cache contents
  - But assigning whole DAGs ignores inter-job parallelism
- Parallelize: exploit parallelism within a DAG to distribute load
  - Cost-benefit analysis is used when scheduling dependent jobs on different machines

Planning example
Suppose we have 4 machines available to run the sample workflow shown in the figure: 6 DAGs, each built from jobs A, B and C connected by pipeline files F1 and F2.
(Figure: sample DAG and sample workflow of 6 DAGs.)

Assignment of DAGs
- For each DAG in the workflow, we determine the machine that will result in the earliest completion time for that DAG, and assign the DAG to that machine
- DAG runtime = sum of job runtimes and file transfer times
  - File transfer times depend on the cache contents of the machine
- Effectively, each DAG is treated like a single job in this phase

Schedule after AssignDAGs
(Figure: schedule across machines M1–M4; jobs in the same DAG are of the same color.)
The schedule produced after AssignDAGs entails no transfer of intermediate files.

Assignment phase (contd.)
- While DAGs are being assigned, a cumulative runtime is maintained for each machine
- Once a DAG has been scheduled on a machine, we assume that machine caches the workflow batch input (common to all DAGs)
- Thus, batch input transfer times are not included in the runtime calculations of other DAGs on that machine
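A minimal Python sketch of the assignment phase described on the last few slides. The File/Job/DAG/Machine structures, the helper names and the bandwidth constant are illustrative assumptions; the actual planner works off the database state rather than in-memory objects.

    from dataclasses import dataclass, field
    from typing import List, Set

    @dataclass
    class File:
        name: str
        size: int                     # bytes

    @dataclass
    class Job:
        runtime: float                # estimated seconds

    @dataclass
    class DAG:
        jobs: List[Job]
        input_files: List[File]       # per-DAG inputs, excluding the batch input

    @dataclass
    class Machine:
        cache: Set[str] = field(default_factory=set)
        cumulative_runtime: float = 0.0
        assigned_dags: List[DAG] = field(default_factory=list)

    NET_BW = 100e6 / 8                # bytes/sec, matching the 100 Mbps testbed

    def transfer_time(files, machine):
        """Transfer time for the files not already cached on this machine."""
        return sum(f.size for f in files if f.name not in machine.cache) / NET_BW

    def assign_dags(dags, batch_inputs, machines):
        """Phase 1: treat each DAG as a single job and assign it to the machine
        giving the earliest estimated completion time."""
        for dag in dags:
            def finish_time(m):
                est = (sum(j.runtime for j in dag.jobs)
                       + transfer_time(dag.input_files + batch_inputs, m))
                return m.cumulative_runtime + est
            best = min(machines, key=finish_time)
            best.cumulative_runtime = finish_time(best)
            best.assigned_dags.append(dag)
            # Assume the batch input is now cached on this machine, so later
            # DAGs assigned here do not pay that transfer cost again.
            best.cache.update(f.name for f in batch_inputs)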

Parallelization of DAGs
- After the assignment phase, the load on machines is uneven
  - There are “extra” DAGs on a few heavily loaded machines
  - There are some machines with a much lighter load
- Exploit inter-job parallelism to distribute load
  - The “extra” DAGs are examined in turn
  - If two jobs in a DAG can run in parallel, we try to move one of them to a lightly loaded machine

Parallelization – Costs and benefits
- Cost of parallelization: when you move a job to a different machine than its parents and children, its input and output files have to be transferred to and from that machine
  - Cost = (input_size + output_size) / net_BW
  - input_size and output_size are the sizes of the job’s input and output files
  - net_BW is the network bandwidth
  - The cost is the time taken to perform data transfers to and from the different machines
- Benefit = time saved due to parallel execution of jobs

Final schedule
(Figure: schedule across machines M1–M4 with network file transfers marked.)
In the final schedule, files are transferred from M2 to M1 and from M4 to M3.

Parallelization (contd.)
- In the formula for the cost of parallelization, input_size and output_size are adjusted for files already cached on either machine
- If a job being considered for parallelization has no children, output_size is taken as 0, since its output files do not need to be transferred back
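A sketch of the cost/benefit test from the last few slides, reusing the structures from the earlier assignment-phase sketch and assuming Job also carries input_files, output_files, children and runtime. The benefit estimate is an illustrative simplification of "time saved due to parallel execution", and the output-side cache discount is omitted for brevity.

    def parallelization_cost(job, dst, net_bw):
        """Cost = (input_size + output_size) / net_BW, with input_size discounted
        for files already cached on the destination machine."""
        input_size = sum(f.size for f in job.input_files
                         if f.name not in dst.cache)
        # A job with no children keeps its outputs where they are produced.
        output_size = 0 if not job.children else sum(f.size for f in job.output_files)
        return (input_size + output_size) / net_bw

    def should_move(job, src, dst, net_bw):
        """Move `job` from the heavily loaded machine `src` to the lightly
        loaded machine `dst` only if the time saved outweighs the transfers."""
        benefit = min(job.runtime, src.cumulative_runtime - dst.cumulative_runtime)
        return benefit > parallelization_cost(job, dst, net_bw)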

Implementation
The main feature is a database used to store:
- File information – checksums, sizes, file types, file locations
- Job information – files used by jobs, job dependencies
- Workflow schedules – produced by the planner
The Condor daemons were modified to connect directly to the database and perform inserts, updates and queries.
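The slides do not give the schema, so the sketch below shows only one plausible layout, written with Python's sqlite3 to make the three kinds of stored state concrete; the table and column names are hypothetical.

    import sqlite3

    conn = sqlite3.connect("planner.db")
    conn.executescript("""
    -- File information: checksums double as version identifiers
    CREATE TABLE IF NOT EXISTS files (
        checksum  TEXT,
        name      TEXT,
        size      INTEGER,      -- bytes
        file_type TEXT,         -- executable / batch input / pipeline / output
        location  TEXT,         -- machine whose cache holds a copy
        PRIMARY KEY (checksum, location)
    );
    -- Job information: which files a job uses, and DAG dependencies
    CREATE TABLE IF NOT EXISTS job_files (
        job_id   INTEGER,
        checksum TEXT,
        role     TEXT           -- 'input' or 'output'
    );
    CREATE TABLE IF NOT EXISTS job_deps (
        parent_id INTEGER,
        child_id  INTEGER
    );
    -- Workflow schedules produced by the planner
    CREATE TABLE IF NOT EXISTS schedules (
        job_id  INTEGER,
        machine TEXT
    );
    """)
    conn.commit()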

Role of the database
(Architecture diagram: the submit machine sends workflow and file info to the database; execute machines report file cache info; the planner reads workflow, file and cache info and writes schedules back to the database; user data moves between the submit machine and the file caches on the execute machines.)

Implementation – versioning
- Versions of input files and executables are determined by checksums computed at submission time
- The versions of intermediate and output files are “derived” from the versions of the inputs and executables that produce them
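One way to realize these derived versions is to hash together the versions of the producing executable and its inputs; the sketch below illustrates the idea (the hash choice and function names are assumptions, not the system's actual scheme).

    import hashlib

    def file_version(path, chunk_size=1 << 20):
        """Version of an input file or executable: a checksum computed at submit time."""
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk_size), b""):
                h.update(block)
        return h.hexdigest()

    def derived_version(executable_version, input_versions, output_name):
        """Version of an intermediate or output file, derived from the versions
        of the executable and inputs that produce it."""
        h = hashlib.sha1()
        h.update(executable_version.encode())
        for v in sorted(input_versions):          # order-independent
            h.update(v.encode())
        h.update(output_name.encode())
        return h.hexdigest()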

Implementation – Distributed storage
- Before a job executes on a machine, its input files are retrieved
  - Files available in the machine’s local cache are used directly
  - Unavailable files are retrieved from other machines in the cluster; any machine can serve as a file server
- After a job completes, its executable, input and output files are saved in the execute machine’s disk cache
- Once a job has completed, the database is updated with the new status and cache information
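The staging and completion steps above can be sketched as follows; local_cache, peer_locations, fetch_from_peer and the database helper are hypothetical stand-ins for the machinery the slide describes.

    def stage_inputs(job, local_cache, peer_locations, fetch_from_peer):
        """Before the job runs: use locally cached files directly, fetch the rest
        from whichever cluster machine holds them (any machine can serve files)."""
        for f in job.input_files:
            if f.name in local_cache:
                continue                             # already on this machine's disk
            source_machine = peer_locations[f.name]  # looked up via the database
            fetch_from_peer(source_machine, f.name)
            local_cache.add(f.name)

    def complete_job(job, local_cache, db, machine_name):
        """After the job completes: keep its executable, inputs and outputs in the
        local disk cache and record the new status and cache contents."""
        for f in job.input_files + job.output_files + [job.executable]:
            local_cache.add(f.name)
        db.record_completion(job, machine=machine_name, cached=sorted(local_cache))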

Implementation – Workflow submission
- An entire workflow is submitted at one time
- The workflow submission tools directly update the database with job and workflow information
  - This information includes the files used by the workflow as well as the job dependencies within the workflow
- The planner works directly from the information in the database, so:
  - It has knowledge of job dependencies during planning
  - It has knowledge of the locations of the relevant files during planning

Performance testing
Comparison of three systems:
- ORIG – the original Condor system
- DAG-C – our caching and DAG-oriented planning framework
- Job-C
  - Same caching mechanism as DAG-C
  - No DAG-based planning: when a job is ready, it is matched to the machine that caches the most of its input data

Description of setup
- Tested on BLAST and on synthetic workflows with varying branch-in factor and pipeline volume
- Cluster of 25 execute machines, all on the same network
- Two submit machines
- Network bandwidth was 100 Mbps
- No shared file system was used
- All experiments were run with initially clean disk caches

The BLAST workflow
BLAST is a sequence alignment workflow. Given a protein sequence “seq”, blastall checks a database of known proteins for any similarities; proteins with similar sequences are expected to have similar properties. Javawrap converts the results into CSV and binary format for later use.
(Workflow figure: batch input of ~4 GB, including nr_db.psq (986 MB), nr_db.pin (23 MB), nr_db.phr and nr.gz; pipeline volume of ~2 MB, including seq.blast; executables blastall (3.1 MB) and javawrap (1 KB); outputs seq.csv and seq.bin.)

BLAST results

Sensitivity to pipeline volume
- F1, F2, G1 and G2 are distinct files
- 10 minutes per job
- Size per file varied: 100 MB, 1 GB, 1.5 GB, 2 GB
- 50 DAGs per workflow

Pipeline I/O results

DAG breadth
- Files F_i and G_i are distinct
- Branching factor (n) varied from 3 to 6
- 10 minutes per job
- Tested a 50-DAG workflow with 1 GB per file

DAG breadth results (1 GB)

Varying computation time
- Size of each file set to 1 GB
- Time per job varied from 10 to 30 minutes (i.e., time per DAG from 80 to 240 minutes)
- Tested a 50-DAG workflow

Increasing computation

Results – Summary
- Job-C and DAG-C are better than ORIG
  - In ORIG, all file traffic goes through the submit machine
  - In Job-C and DAG-C, files can be retrieved from multiple locations
  - Thus, caching helps
- DAG-C is significantly better than Job-C when pipeline volume and branching factor are high
  - In Job-C, parent jobs often run on different machines
  - Output files have to be transferred to the machine where their child executes
  - Thus, DAG-oriented planning helps

Distributed file caching – other benefits
- Scientists frequently reuse files (such as executables); these can be used directly at their stored locations
- Maintaining user data:
  - “What were the programs run to obtain this output?”
  - “When did I last use a particular version of a file?”

Ongoing work
- Planning
  - Evaluating planning overhead and its dependence on database size
  - Making the planning scheme more responsive to job failure and machine failure
- A cache replacement policy based on an LRFU scheme has been implemented, but not validated (see the paper for details). Ongoing work includes:
  - Validating the cache replacement policy and determining the best policy for a workflow depending on the user’s submission pattern
  - Including the time needed to generate a file in estimates of its “cache-worthiness”
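As background on the LRFU scheme mentioned above: standard LRFU scores each cached item with a combined recency-and-frequency value, weighting every past reference by an exponentially decaying function. The sketch below shows that scoring; the lambda value and the eviction loop are assumptions, and the paper's own variant (which also weighs regeneration time) may differ.

    def lrfu_score(reference_times, now, lam=0.1):
        """Combined recency and frequency value of a cached file: each reference
        at time t contributes (1/2) ** (lam * (now - t)). Small lam behaves like
        LFU, large lam like LRU."""
        return sum(0.5 ** (lam * (now - t)) for t in reference_times)

    def choose_victim(cache, now):
        """Evict the file with the lowest LRFU score. `cache` maps file names to
        their lists of past reference times."""
        return min(cache, key=lambda name: lrfu_score(cache[name], now))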

Related work
- ZOO, GridDB – data-centric workflow management systems
- Thain et al. – Pipeline and batch sharing in Grid workloads – HPDC 2003
- Romosan et al. – Coscheduling of computation and data on computer clusters – SSDBM 2005
- Bright et al. – Efficient scheduling and execution of scientific workflow tasks – SSDBM 2005

Questions?