Using Application Structure to Handle Failures and Improve Performance in a Migratory File Service. John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny.

Presentation transcript:

Using Application Structure to Handle Failures and Improve Performance in a Migratory File Service John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, and Miron Livny WiND and Condor Project 14 April 2003

Disclaimer We have a lot of stuff to describe, so hang in there until the end!

Outline
– Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
– Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
– Evaluation
  – Performance
  – Failure
– Philosophizing

CPU Bound etc...
– Excellent application of distributed computing.
– KB of data, days of CPU time.
– Efficient to do tiny I/O on demand.
Supporting Systems:
– Condor
– BOINC
– Google Toolbar
– Custom software.

I/O Bound: D-Zero data analysis
– Excellent application for cluster computing.
– GB of data, seconds of CPU time.
– Efficient to compute whenever data is ready.
Supporting Systems:
– Fermi SAM
– High-throughput document scanning
– Custom software.

Batch-Pipelined Applications
[Diagram: several pipelines (a1 → b1 → c1, a2 → b2 → c2, a3 → b3 → c3) run side by side across the batch width. Within each pipeline, stages exchange pipeline-shared data (e.g. xyz); all pipelines read the same batch-shared data (e.g. data).]
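To make the sharing taxonomy concrete, the sketch below (file names invented for illustration, not taken from any real workload) shows the I/O pattern of a single stage in one pipeline: it reads batch-shared data that every pipeline in the batch also reads, reads pipeline-shared data left by its predecessor, and writes pipeline-shared data for its successor.

    /* Sketch of one stage of one pipeline in a batch-pipelined workload.
       File names are invented for illustration. */
    #include <stdio.h>

    int main(void) {
        /* batch-shared data: the same input read by every pipeline in the batch */
        FILE *batch = fopen("/data/reference.db", "r");
        /* pipeline-shared data: written by the previous stage of this pipeline only */
        FILE *in = fopen("/tmp/stage_a.out", "r");
        /* pipeline-shared output: read only by the next stage of this pipeline */
        FILE *out = fopen("/tmp/stage_b.out", "w");
        if (!batch || !in || !out) {
            perror("fopen");
            return 1;
        }

        /* ... compute over batch + in, write results to out ... */

        fclose(out);
        fclose(in);
        fclose(batch);
        return 0;
    }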

Example: AMANDA
[Diagram: a four-stage pipeline, corsika → corama → mmc → amasim. corsika reads corsika_input.txt (4 KB) and the batch-shared files NUCNUCCS, GLAUBTAR, EGSDATA3.3, and QGSDATA4 (1 MB) and produces DAT (23 MB); corama produces corama.out (26 MB); mmc reads mmc_input.txt and produces mmc_output.dat (126 MB); amasim reads amasim_input.dat, ice tables (3 files, 3 MB), and experiment geometry (100s of files, 500 MB) and produces amasim_output.txt (5 MB).]

Computing Environment
Clusters dominate:
– Similar configurations.
– Fast interconnects.
– Single administrative domain.
– Underutilized commodity storage.
– En masse, quite unreliable.
Users wish to harness multiple clusters, but have jobs that are both I/O and CPU intensive.

Ugly Solutions
“FTP-Net”
– User finds remote clusters.
– Manually stages data in.
– Submits jobs, deals with failures.
– Pulls data out.
– Lather, rinse, repeat.
“Remote I/O”
– Submit jobs to a remote batch system.
– Let all I/O come back to the archive.
– Return in several decades.

What We Really Need
Access resources outside my domain.
– Assemble your own army.
Automatic integration of CPU and I/O access.
– Forget optimal: save administration costs.
– Replacing remote with local always wins.
Robustness to failures.
– Can’t hire babysitters for New Year’s Eve.

Hawk: A Migratory File Service
– Automatically deploys a “task force” across an existing distributed system.
– Manages applications from a high level, using knowledge of process interactions.
– Provides dependable performance through peer-to-peer techniques.
– Understands and reacts to failures using knowledge of the system and workloads.

Philosophy of Hawk
“In allocating resources, strive to avoid disaster, rather than attempt to obtain an optimum.” - Butler Lampson

Why not AFS+Make?
Quick answer:
– Distributed filesystems provide an unnecessarily strong abstraction that is unacceptably expensive to provide in the wide area.
Better answer after we explain what Hawk is and how it works.

Outline
– Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
– Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
– Evaluation
  – Performance
  – Failure
– Philosophizing

Workflow Language 1
    job a a.sub
    job b b.sub
    job c c.sub
    job d d.sub
    parent a child c
    parent b child d
[Diagram: two independent pipelines, a → c and b → d.]

Workflow Language 2
    volume v1 ftp://home/mydata
    mount v1 a /data
    mount v1 b /data
    volume v2 scratch
    mount v2 a /tmp
    mount v2 c /tmp
    volume v3 scratch
    mount v3 b /tmp
    mount v3 d /tmp
[Diagram: volume v1 is backed by mydata on home storage; v2 and v3 are scratch volumes, one per pipeline.]

Workflow Language 3
    extract v2 x ftp://home/out.1
    extract v3 x ftp://home/out.2
[Diagram: file x from scratch volume v2 is extracted to out.1 on home storage, and x from v3 to out.2.]

Mapping Logical to Physical
Abstract jobs
– Physical jobs in a batch system.
– May run more than once!
Logical “scratch” volumes
– Temporary containers on a scratch disk.
– May be created, replicated, and destroyed.
Logical “read” volumes
– Striped across cooperative proxy caches.
– May be created, cached, and evicted.

Starting System
[Diagram: the components present before Hawk is deployed: a workflow manager, match maker, batch queue, and archive, plus an existing PBS cluster (head node and nodes) and a Condor pool of nodes.]

Gliding In
[Diagram: a Glide-In Job deploys a Master, StartD, and Proxy onto each node of the PBS cluster and the Condor pool, alongside the match maker, batch queue, archive, and workflow manager.]

Hawk Architecture
[Diagram: the workflow manager (holding a system model and the application flow) works with the match maker, batch queue, and archive. On the remote nodes, each StartD runs a job agent next to a proxy; the proxies form cooperative caches, and wide-area caching connects the cooperative caches back to the archive.]

I/O Interactions
[Diagram: a job agent runs under a StartD and issues ordinary calls such as creat(“/tmp/outfile”) and open(“/data/d15”). Its POSIX library interface maps /tmp to container://host5/120 and /data to cache://host5/archive/data and forwards the operations over the local area network to the proxy, which serves /tmp from containers (Cont. 119, Cont. 120) on local disk and /data from the cooperative block cache, backed by other proxies and the archive. Files named in the diagram include foo, outfile, tmpfile, bar, and baz; the match maker, batch queue, and workflow manager also appear.]
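The redirection above can be pictured as a prefix rewrite inside the I/O library. The sketch below is not Hawk’s actual interface; it is a minimal, hypothetical LD_PRELOAD-style interposition of open() that rewrites path prefixes the way the slide’s mount table does, with the proxy-backed target paths invented for illustration.

    /* Hypothetical sketch of prefix-based path redirection via library
       interposition; not Hawk's real POSIX library interface. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <fcntl.h>
    #include <stdarg.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    /* Rewrite a logical path to a (made-up) proxy-backed physical path. */
    static const char *redirect(const char *path, char *buf, size_t len) {
        if (strncmp(path, "/tmp/", 5) == 0) {       /* /tmp -> container */
            snprintf(buf, len, "/proxy/container/120/%s", path + 5);
            return buf;
        }
        if (strncmp(path, "/data/", 6) == 0) {      /* /data -> cooperative cache */
            snprintf(buf, len, "/proxy/cache/archive/data/%s", path + 6);
            return buf;
        }
        return path;                                /* other paths pass through */
    }

    int open(const char *path, int flags, ...) {
        static int (*real_open)(const char *, int, ...);
        if (!real_open)
            real_open = (int (*)(const char *, int, ...)) dlsym(RTLD_NEXT, "open");

        char buf[4096];
        const char *target = redirect(path, buf, sizeof(buf));

        mode_t mode = 0;
        if (flags & O_CREAT) {      /* the mode argument exists only with O_CREAT */
            va_list ap;
            va_start(ap, flags);
            mode = va_arg(ap, mode_t);
            va_end(ap);
        }
        return real_open(target, flags, mode);
    }

Under this kind of scheme the application links against an unmodified POSIX interface, and the proxy decides whether a rewritten path is served from a local container or from the cooperative cache.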

Cooperative Proxies
[Diagram: proxies A, B, and C each serve local job agents and coordinate with the match maker, batch queue, archive, and workflow manager. As proxy C is discovered, each proxy updates a hash map from paths to proxies; the diagram shows the proxy lists changing over time (t1: BC, t2: CBA, t3: CB, t4: ...).]
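The slide’s hash map from paths to proxies can be read as each proxy hashing a batch-data path to decide which peer is responsible for caching it. The following is a minimal sketch under that assumption; the proxy names and the choice of hash are illustrative, not taken from the system.

    /* Hypothetical sketch: map a batch-data path to the cooperative proxy
       responsible for caching it. Proxy names and hash choice are illustrative. */
    #include <stdio.h>

    static const char *proxies[] = { "proxyA", "proxyB", "proxyC" };
    static const unsigned nproxies = 3;

    /* djb2 string hash */
    static unsigned long hash_path(const char *path) {
        unsigned long h = 5381;
        for (const char *p = path; *p; p++)
            h = h * 33 + (unsigned char)*p;
        return h;
    }

    /* Every proxy evaluates the same function, so they agree on who should
       cache a given path without central coordination. */
    static const char *responsible_proxy(const char *path) {
        return proxies[hash_path(path) % nproxies];
    }

    int main(void) {
        printf("/archive/data/d15 -> %s\n", responsible_proxy("/archive/data/d15"));
        return 0;
    }

One caveat of plain modular hashing is that responsibilities shift wholesale when the proxy set changes, which is why the mapping has to be refreshed as proxies are discovered or lost, as the time-ordered lists on the slide suggest.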

Summary
– Archive: sources input data, chooses coordinator.
– Glide-in: deploys a “task force” of components.
– Cooperative proxies: provide dependable batch read-only data.
– Data containers: fault-isolated pipeline data.
– Workflow manager: directs the operation.

Outline
– Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
– Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
– Evaluation
  – Performance
  – Failure
– Philosophizing

Performance Testbed
Controlled testbed:
– MHz dual-CPU cluster machines, 1 GB, SCSI disks, 100 Mb/s Ethernet.
– Simulated WAN: archive storage restricted across a router to 800 KB/s.
Also some preliminary tests on uncontrolled systems:
– MFS over a PBS cluster at Los Alamos.
– MFS over a Condor system at INFN Italy.

Synthetic Apps
[Diagram: three two-stage synthetic workloads (a → b): pipe-intensive (10 MB of pipeline data), mixed (5 MB of batch data and 5 MB of pipeline data), and batch-intensive (10 MB of batch data). System configurations compared: Local, Co-Locate Data, Don’t Co-Locate, and Remote.]

Pipeline Optimization

Everything Together

Network Consumption

Failure Handling

Real Applications
– BLAST: search tool for proteins and nucleotides in genomic databases.
– CMS: simulation of a high-energy physics experiment scheduled to begin operation at CERN.
– H-F: simulation of the non-relativistic interactions between nuclei and electrons.
– AMANDA: simulation of a neutrino detector buried in the ice of the South Pole.

Application Throughput
[Table: throughput of BLAST, CMS, HF, and AMANDA (4 stages) under Remote I/O versus Hawk; columns are Name, Stages, Remote, Hawk.]

Outline
– Data Intensive Applications
  – Batch and Pipeline Sharing
  – Example: AMANDA
– Hawk: A Migratory File Service
  – Application Structure
  – System Architecture
  – Interactions
– Evaluation
  – Performance
  – Failure
– Philosophizing

Related Work
– Workflow management
– Dependency managers: TREC, make
– Private namespaces: UFO, database views
– Cooperative caching: no writes.
– P2P systems: wrong semantics.
– Filesystems: overly strong abstractions.

Why Not AFS+Make?
– Namespaces: constructed per-process at submit time.
– Consistency: enforced at the workflow level.
– Selective commit: everything is tossed unless explicitly saved.
– Fault awareness: CPUs and data can be lost at any point.
– Practicality: no special permission required.

Conclusions
– Traditional systems are built from the bottom up: this disk must have five nines, or we’re in big trouble!
– MFS builds from the top down: application semantics drive system structure.
– By posing the right problem, we solve the traditional hard problems of file systems.

For More Info...
– Paper in progress...
– Application study: “Pipeline and Batch Sharing in Grid Workloads”, to appear in HPDC.
– Talk to us! Questions now?