Migratory File Services for Batch-Pipelined Workloads

Presentation transcript:

Migratory File Services for Batch-Pipelined Workloads
John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny
WiND and Condor Projects, 6 May 2003

Hello... This is joint work between the WiND and Condor groups. John just described batch-pipelined applications, a common application model that has a regular structure and significant I/O needs. In this talk, I will describe migratory file services, a system model for executing batch-pipelined workloads.

How to Run a Batch-Pipelined Workload?
[Diagram: three pipelines read a common set of shared data; each job a1-a3 writes pipe.1-pipe.3, which jobs b1-b3 consume to produce out.1-out.3.]

Cluster-to-Cluster Computing
[Diagram: a home system with an archive, connected over the Internet to remote clusters of many nodes running Grid Engine, PBS, and Condor.]

How to Run a Batch-Pipelined Workload?
"Remote I/O":
- Submit jobs to a remote batch system.
- Let all I/O come directly home.
- Inefficient if re-use is common. (But perfect if there is no data sharing!)
"FTP-Net":
- User finds remote clusters.
- Manually stages data in.
- Submits jobs, deals with failures.
- Pulls data out.
- Lather, rinse, repeat.

Hawk: A Migratory File Service for Batch-Pipelined Workloads
- Automatically deploys a "task force" across multiple wide-area systems.
- Manages applications from a high level, using knowledge of process interactions.
- Provides dependable performance with peer-to-peer techniques. (Locality is key!)
- Understands and reacts to failures using knowledge of the system and workloads.

Dangers
Failures:
- Physical: networks fail, disks crash, CPUs halt.
- Logical: out of space/memory, lease expired.
- Administrative: you can't use cluster X today.
Dependencies:
- A comes before C and D, which are simultaneous.
- What do we do if the output of C is lost?
Risk vs. reward:
- A gamble: staging input data to a remote CPU.
- A gamble: leaving output data at a remote CPU.
These are the dangers associated with distributed computing. Remote I/O avoids many of them -- you never lose data remotely -- but it is extraordinarily slow.

Hawk In Action
[Diagram: Hawk spreads a batch-pipelined workload from the home system across the Grid Engine, PBS, and Condor clusters. Pipelines (a1-c1, a2-c2, a3-c3) run on remote nodes, with inputs i1-i3 staged from the home archive and outputs o1-o3 returned to it.]

Workflow Language 1 (Start With Condor DAGMan)

    job a a.condor
    job b b.condor
    job c c.condor
    job d d.condor
    parent a child c
    parent b child d

(Dependency graph: a is the parent of c; b is the parent of d.)

The key to Hawk is that it allows the user to specify a complex batch-pipelined workload in an abstract manner. The workflow is then mapped onto the system. I'm going to begin by showing you the Hawk workflow language.
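The per-job submit files referenced above (a.condor, b.condor, and so on) are not shown in the talk. As a rough illustration only, a minimal vanilla-universe Condor submit file for job a might look like the following; the executable and file names here are assumptions, not taken from the slides.

    # a.condor -- illustrative sketch; the actual submit files are not shown.
    universe   = vanilla
    executable = a                 # binary for job a (assumed name)
    output     = a.out
    error      = a.err
    log        = workflow.log
    queue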

Workflow Language 2

    volume v1 ftp://archive/mydata
    mount v1 a /data
    mount v1 b /data
    volume v2 scratch
    mount v2 a /tmp
    mount v2 c /tmp
    volume v3 scratch
    mount v3 b /tmp
    mount v3 d /tmp

(Diagram: read volume v1, backed by the archive's mydata, is mounted at /data by jobs a and b; scratch volume v2 links a to c, and scratch volume v3 links b to d, both mounted at /tmp.)

Workflow Language 3

    extract v2 x ftp://home/out.1
    extract v3 x ftp://home/out.2

(Diagram: file x in scratch volume v2, written by job c, is extracted to out.1 on the home system; file x in v3, written by job d, is extracted to out.2.)
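Putting the three slides together, a complete workflow description for this example is a single file combining the job, dependency, volume, mount, and extract declarations. A consolidated sketch, built verbatim from the fragments above (whether the language requires any particular ordering of declarations is not stated in the talk):

    job a a.condor
    job b b.condor
    job c c.condor
    job d d.condor
    parent a child c
    parent b child d

    volume v1 ftp://archive/mydata
    mount v1 a /data
    mount v1 b /data
    volume v2 scratch
    mount v2 a /tmp
    mount v2 c /tmp
    volume v3 scratch
    mount v3 b /tmp
    mount v3 d /tmp

    extract v2 x ftp://home/out.1
    extract v3 x ftp://home/out.2

Note that nothing in the description names a particular cluster or storage server; the mapping to concrete resources is described next.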

Mapping the Workflow to the Migratory File System
Abstract jobs:
- Become jobs in a batch system.
- May start, stop, fail, checkpoint, restart...
Logical "scratch" volumes:
- Become temporary containers on a scratch disk.
- May be created, replicated, and destroyed...
Logical "read" volumes:
- Become blocks in a cooperative proxy cache.
- May be created, cached, and evicted...

System Components
[Diagram: the starting point: a PBS cluster and a Condor pool of plain nodes, plus the home system's archive, Condor matchmaker (MM), Condor schedd, and workflow manager.]

Gliding In
[Diagram: a glide-in job deploys a Condor master, startd, and proxy onto each node of the PBS cluster and the Condor pool, bringing them under the control of the home system's Condor MM and schedd.]

System Components
[Diagram: after gliding in, the remote nodes run startds and proxies; the home system runs the archive, Condor MM, Condor schedd, and workflow manager.]

Cooperative Proxies
[Diagram: the proxies on the remote nodes cooperate with one another, forming a shared cache between the execution nodes and the home archive.]

Batch Execution System
[Diagram: the Condor MM and schedd at home schedule jobs onto the glided-in startds on the remote nodes.]

System Components
[Diagram: detail of a single execution site: a startd and proxy on a remote node, together with the home system's archive, Condor MM, Condor schedd, and workflow manager.]

Workflow Manager Detail
[Diagram: the workflow manager at home drives the Condor MM and schedd and communicates with the archive and with the remote proxy and startd.]

Create Container
[Diagram: the workflow manager asks the proxy to create scratch container 120 on the execution site's local-area network; the proxy's cooperative block input cache fronts the archive's /mydata (blocks d15, d16) across the wide-area network.]

POSIX Library Interface
[Diagram: the job calls open("/data/d15") and creat("/tmp/outfile") through the POSIX library interface. The interposition agent maps /tmp to cont://host5/120 (the scratch container) and /data to cache://host5/archive/mydata (the cooperative block input cache), so the read is served through the cache and outfile is written into container 120. A sketch of this remapping follows.]
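To make the interposition step concrete, here is a minimal C sketch of the kind of prefix-based path remapping the agent performs, using the two mappings shown on the slide. The helper names and the fixed mount table are hypothetical; the real agent intercepts the POSIX calls themselves and consults the mounts assigned by the workflow manager.

    /* Sketch of logical-to-physical path remapping (hypothetical helpers).
     * Mappings mirror the slide:
     *   /tmp  -> cont://host5/120                (scratch container)
     *   /data -> cache://host5/archive/mydata    (cooperative block cache)
     */
    #include <stdio.h>
    #include <string.h>

    struct mount_entry {
        const char *logical;   /* prefix seen by the job          */
        const char *physical;  /* target in the proxy's namespace */
    };

    static const struct mount_entry mounts[] = {
        { "/tmp",  "cont://host5/120" },
        { "/data", "cache://host5/archive/mydata" },
    };

    /* Rewrite a logical path; returns 0 on success, -1 if no mount matches. */
    static int remap_path(const char *logical, char *out, size_t len)
    {
        for (size_t i = 0; i < sizeof(mounts) / sizeof(mounts[0]); i++) {
            size_t n = strlen(mounts[i].logical);
            if (strncmp(logical, mounts[i].logical, n) == 0) {
                snprintf(out, len, "%s%s", mounts[i].physical, logical + n);
                return 0;
            }
        }
        return -1;  /* no mapping: fall through to the local filesystem */
    }

    int main(void)
    {
        char buf[256];
        if (remap_path("/data/d15", buf, sizeof buf) == 0)
            printf("open  -> %s\n", buf);   /* cache://host5/archive/mydata/d15 */
        if (remap_path("/tmp/outfile", buf, sizeof buf) == 0)
            printf("creat -> %s\n", buf);   /* cont://host5/120/outfile */
        return 0;
    }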

Extract Output
[Diagram: the job has completed; the workflow manager extracts outfile from container 120 across the wide-area network and stores it in the archive's /mydata as out65.]

Delete Container
[Diagram: with the output safely extracted, the workflow manager instructs the proxy to delete container 120.]

Container Deleted
[Diagram: the container is gone; the archive's /mydata now holds d15, out65, and d16.]

Fault Detection and Repair
The proxy, startd, and agent detect failures:
- Job evicted by machine owner.
- Network disconnection between job and proxy.
- Container evicted by storage owner.
- Out of space at proxy.
The workflow manager knows the consequences:
- Job D couldn't perform its I/O.
- Check: are volumes V1 and V3 still in place?
- Aha: volume V3 was lost -> run B to re-create it.
(A sketch of this recovery check follows.)
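As a rough sketch of that recovery reasoning (the names and data structures below are hypothetical, not from the talk): when a job reports an I/O failure, the workflow manager checks each volume the job mounts and re-runs the producer of any scratch volume that has been lost before retrying the job.

    /* Hypothetical sketch of the workflow manager's recovery check. */
    #include <stdbool.h>
    #include <stdio.h>

    struct volume {
        const char *name;
        bool        alive;     /* is the volume still in place?     */
        const char *producer;  /* job that can re-create it, if any */
    };

    static void recover(const char *failed_job, struct volume *vols, int nvols)
    {
        for (int i = 0; i < nvols; i++) {
            if (!vols[i].alive && vols[i].producer) {
                printf("volume %s lost: re-run job %s, then retry job %s\n",
                       vols[i].name, vols[i].producer, failed_job);
            }
        }
    }

    int main(void)
    {
        /* Job d failed its I/O; it mounts v1 (archive data) and v3 (scratch). */
        struct volume vols[] = {
            { "v1", true,  NULL },  /* read volume: still served by the cache */
            { "v3", false, "b"  },  /* scratch volume: lost, producer is b    */
        };
        recover("d", vols, 2);
        return 0;
    }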

Performance Testbed
Controlled "remote" cluster:
- 32 cluster nodes at UW.
- Hawk submitter also at UW.
- Connected by a restricted 800 Kb/s link.
Also some preliminary tests on uncontrolled systems:
- Hawk over a PBS cluster at Los Alamos.
- Hawk over a Condor system at INFN Italy.

Batch-Pipelined Applications

    Name    Stages   Load             Remote (jobs/hr)   Hawk (jobs/hr)
    BLAST   1        Batch Heavy      4.67               747.40
    CMS     2        Batch and Pipe   33.78              1273.96
    HF      3        Pipe             40.96              3187.22

Failure Recovery
[Graph: failure-recovery experiment, showing rollback and a cascading failure.]

A Little Bit of Philosophy
- Most systems are built from the bottom up: "This disk must have five nines, or else!"
- MFS works from the top down: "If this disk fails, we know what to do."
- By working from the top down, we finesse many of the hard problems in traditional filesystems.

Future Work
- Integration with Stork.
- P2P aspects: discovery & replication.
- Optional knowledge: size & time.
- Delegation and disconnection.
- Names, names, names: Hawk (a migratory file service) and Hawkeye (a system monitoring tool).

Feeling overwhelmed by jobs and data? Let Hawk juggle your work!