The Difficulties of Distributed Data
Douglas Thain, Condor Project, University of Wisconsin

The Condor Project
Established in 1985. Software for high-throughput cluster computing, on sites ranging from 10 to 1000s of nodes.
Example installations: 643 CPUs at UW-Madison in the CS building (computer architecture simulations); 264 CPUs at INFN, all across Italy (CMS simulations).
Serves two communities: production software and computer science research.

No Repository Here!
No master source of anyone's data at UW-CS Condor! But there is a large amount of buffer space: 128 * 10 GB + 64 * 30 GB (about 3.2 TB in total).
The ultimate store is at other sites: NCSA mass store, CERN LHC repositories.
We concentrate on software for loading, buffering, caching, and producing output efficiently.

The Challenges of Large-Scale Data Access are…
1 - Correctness! Single stage: crashed machines, lost connections, missing libraries, wrong permissions, expired proxies… End-to-end: a job is not "complete" until its output has been verified and written to disk.
2 - Heterogeneity. By design: aggregated clusters. By situation: disk layout, buffer capacity, net load.

Your Comments:
- Jobs need scripts that check the readiness of the system before execution (Tim Smith); a minimal sketch of such a check follows below.
- Single-node failures are not worth investigating: reboot, reimage, replace. (Steve DuChene)
- "A cluster is a large error amplifier." (Chuck Boeheim)
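As an illustration of Tim Smith's point, here is a minimal readiness-check sketch. It is not from the talk: the input file, scratch directory, free-space threshold, and submit host are hypothetical placeholders, and a real check would be tailored to the job.

    #!/usr/bin/env python3
    """Minimal pre-execution readiness check (illustrative sketch only).

    Covers a few of the failure modes mentioned in the talk: missing input,
    full disks, and an unreachable submit/storage site.
    """
    import os
    import shutil
    import socket
    import sys

    REQUIRED_FILES = ["input.dat"]               # hypothetical input files
    SCRATCH_DIR = "/tmp"                         # hypothetical scratch area
    MIN_FREE_BYTES = 1 * 1024**3                 # require at least 1 GB free
    SUBMIT_HOST = ("submit.example.edu", 9618)   # hypothetical submit node

    def ready() -> bool:
        # 1. Required inputs must exist and be non-empty.
        for f in REQUIRED_FILES:
            if not os.path.isfile(f) or os.path.getsize(f) == 0:
                print(f"not ready: missing or empty input {f}")
                return False
        # 2. Enough scratch space to buffer output.
        if shutil.disk_usage(SCRATCH_DIR).free < MIN_FREE_BYTES:
            print(f"not ready: less than {MIN_FREE_BYTES} bytes free in {SCRATCH_DIR}")
            return False
        # 3. The submit/storage site must be reachable.
        try:
            with socket.create_connection(SUBMIT_HOST, timeout=5):
                pass
        except OSError as e:
            print(f"not ready: cannot reach {SUBMIT_HOST[0]}: {e}")
            return False
        return True

    if __name__ == "__main__":
        # Exit non-zero so a wrapper or the scheduler can reschedule the job
        # instead of letting it fail later with an obscure error.
        sys.exit(0 if ready() else 1)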

Data Management in Condor (production -> research): Remote I/O, DAGMan, Kangaroo.
Common denominators: hide errors from jobs -- they cannot deal with "connection refused" or "network down." Propagate failures first to the scheduler, and perhaps later to the user.

Remote I/O
Relink the job with the Condor C library. I/O is performed along a TCP connection to the submit site: either fine-grained RPCs or whole-file staging.
(Diagram: many execution sites, each running a job, connected back to a single submit site.)
Some failures: NFS down, DNS down, node rebooting, missing input.
On any failure: 1 - kill -9 the job; 2 - log the event; 3 - notify the user?; 4 - reschedule.
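As a concrete, simplified illustration of this model: a standard-universe job is relinked with condor_compile and submitted with universe = standard, so its I/O calls are shipped back to the submit machine rather than touching the execution site's filesystem. The program and file names below are hypothetical; this is a sketch, not a submit file from the talk.

    # Relink the application against the Condor C library
    # (hypothetical program name):
    #   condor_compile gcc -o my_analysis my_analysis.c

    # Submit description file (sketch):
    universe   = standard          # I/O is trapped and sent to the submit site
    executable = my_analysis
    arguments  = -in input.dat -out output.dat
    log        = my_analysis.log   # Condor records failures and reschedules here
    queue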

DAGMan (Directed Acyclic Graph Manager)
A persistent 'make' for distributed computing. Handles dependencies and failures in multi-job tasks, including CPU and data movement.
(Diagram: Begin DAG -> Stage Input -> Run Remote Job -> Stage Output -> Check Output -> DAG Complete. If the results are bogus, retry the job up to 10 times; if a transfer fails, retry it up to 5 times.)
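One way (among several) to express the retry behaviour on this slide is directly in a DAG description file. The node names, submit files, and verification script below are hypothetical placeholders; here the "Check Output" step is written as a POST script so that a bogus result causes the job node itself to be retried.

    # Sketch of a DAG for this workflow (illustrative names only)
    JOB  StageIn   stage_in.sub
    JOB  RunJob    run_remote.sub
    JOB  StageOut  stage_out.sub

    PARENT StageIn CHILD RunJob
    PARENT RunJob  CHILD StageOut

    # Verify the results; a non-zero exit fails the node, and RETRY
    # reruns the job -- "if the results are bogus, retry up to 10 times."
    SCRIPT POST RunJob verify_results.sh
    RETRY  RunJob 10

    # "If a transfer fails, retry up to 5 times."
    RETRY  StageOut 5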

Kangaroo
Simple idea: use all available network, memory, and disk to buffer data, and "hop" it to its destination. A background process, not the job, is responsible for handling both faults and variations. Allows overlap of CPU and I/O.
(Diagram: the application on the execution site writes through a local disk buffer; a chain of Kangaroo (K) data-movement servers hops the data along to the storage site.)
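Kangaroo is a system service rather than an application library, but the core idea can be sketched in a few lines of Python: the job "writes" by handing data to a local buffer and keeps computing, while a background mover retries delivery toward the storage site and absorbs transient faults. The queue, backoff policy, and file-based stand-in for the network transfer are illustrative assumptions, not the Kangaroo implementation.

    import queue
    import threading
    import time

    buffer_q = queue.Queue()   # stands in for Kangaroo's memory/disk buffer

    def app_write(data: bytes) -> None:
        """Called by the job: returns as soon as the data is buffered locally."""
        buffer_q.put(data)

    def deliver(data: bytes) -> None:
        """One 'hop' toward the storage site.

        Stand-in: append to a local file. In the real system this is a
        network transfer that can fail with 'connection refused' and the like.
        """
        with open("destination.dat", "ab") as dst:
            dst.write(data)

    def mover() -> None:
        """Background mover: drains the buffer and retries with backoff,
        so faults are absorbed here and never surfaced to the job."""
        delay = 1.0
        while True:
            data = buffer_q.get()
            while True:
                try:
                    deliver(data)
                    delay = 1.0
                    break                       # this chunk has landed; take the next
                except OSError:
                    time.sleep(delay)           # transient fault: wait and try again
                    delay = min(delay * 2, 60)

    threading.Thread(target=mover, daemon=True).start()

    # The job overlaps computation with output movement:
    app_write(b"partial results...")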

I/O Models
(Diagram: timelines contrasting "Stage Output" with "Kangaroo Output." With staged output, the input, CPU, and output phases run serially and the results are pushed only after the job finishes; with Kangaroo output, the push overlaps with the CPU phase, so the whole run completes sooner.)

In Summary…
Correctness is a major obstacle to high-throughput cluster computing. Jobs must be protected from all of the possible errors in data access.
Handle failures in two ways: abort and inform the scheduler (not the user), or fall back to an alternate resource.
Pleasant side effect: higher throughput!