1 CHEP 2003 Arie Shoshani Experience with Deploying Storage Resource Managers to Achieve Robust File replication Arie Shoshani Alex Sim Junmin Gu Scientific.

Presentation transcript:

1 CHEP 2003 Arie Shoshani
Experience with Deploying Storage Resource Managers to Achieve Robust File Replication
Arie Shoshani, Alex Sim, Junmin Gu
Scientific Data Management Group, Lawrence Berkeley National Laboratory

2 Outline
- File replication problem: motivation
- What are Storage Resource Managers
- General Analysis Scenario and the use of SRMs
- SRM functionality
- SRMs use for file replication: robustness
- Advantages of using SRMs for file replication
- File monitoring tool
- Analysis of file replication

3 Motivation
- Multi-file replication: why is it a problem?
  - Tedious task: many files, repetitious
  - Lengthy task: long transfer time, can take days
  - Error prone: need to monitor scripts
  - Error recovery: need to restart file transfers
  - Stage and archive from MSS: limited concurrency, down time, transient failures
  - Use of FTP: large windows, concurrent transfer
  - Security: both for local MSS and the network
  - Firewalls: transfer from/to MSS must be internal to the site

4 What are Storage Resource Managers?
- Grid architecture needs to include reservation & scheduling of:
  - Compute resources
  - Storage resources
  - Network resources
- Storage Resource Managers (SRMs) role in the data grid architecture
  - Shared storage resource allocation & scheduling
  - Especially important for data intensive applications
  - Often files are archived on a mass storage system (MSS)
  - Wide area networks: minimize transfers
  - Large scientific collaborations (100's of nodes, 1000's of clients): opportunities for file sharing
  - File replication and caching may be used
  - Need to support non-blocking (asynchronous) requests

5 General Analysis Scenario
[Architecture diagram. Labels recovered from the figure:]
- Client's site: client, logical query, result files
- Request Interpreter, using the Metadata catalog and Replica catalog; request planning turns a set of logical files into an execution plan and site-specific files (Execution DAG)
- Request Executer, using the Network Weather Service; issues requests for data placement and remote computation over the network
- Site 1, Site 2, ..., Site N: Storage Resource Manager, Compute Resource Manager, Compute Engine, Disk Cache, MSS

6 SRM is a Service
- SRM functionality
  - Manage space
    - Negotiate and assign space to users
    - Manage “lifetime” of spaces
  - Manage files on behalf of a user
    - Pin files in storage till they are released
    - Manage “lifetime” of files
    - Manage action when pins expire (depends on file types)
  - Manage file sharing
    - Policies on what should reside on a storage resource at any one time
    - Policies on what to evict when space is needed
  - Get files from remote locations when necessary
    - Purpose: to simplify client’s task
  - Manage multi-file requests
    - A brokering function: queue file requests, pre-stage when possible
  - Provide grid access to/from mass storage systems
    - HPSS (LBNL, ORNL, BNL), Enstore (Fermi), JasMINE (Jlab), Castor (CERN), MSS (NCAR), …
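The pinning and eviction behavior listed above can be illustrated with a small sketch. This is not the SRM implementation or its API: the class name `PinnedCache`, its methods, and the simple "evict any unpinned file" policy are all assumptions made for illustration.

```python
import time


class PinnedCache:
    """Toy disk cache: files stay pinned until released or until the pin expires."""

    def __init__(self, capacity_bytes, clock=time.monotonic):
        self.capacity = capacity_bytes
        self.clock = clock            # injectable clock, so lifetimes can be tested
        self.files = {}               # name -> (size, pin expiry time or None)

    def used(self):
        return sum(size for size, _ in self.files.values())

    def _is_pinned(self, name):
        _, expiry = self.files[name]
        return expiry is not None and expiry > self.clock()

    def add(self, name, size, pin_lifetime):
        """Evict unpinned files until the new file fits, then store and pin it."""
        for victim in list(self.files):
            if self.used() + size <= self.capacity:
                break
            if not self._is_pinned(victim):
                del self.files[victim]
        if self.used() + size > self.capacity:
            return False              # only pinned files remain; caller must wait
        self.files[name] = (size, self.clock() + pin_lifetime)
        return True

    def release(self, name):
        """Drop the pin; the file stays cached but becomes evictable."""
        size, _ = self.files[name]
        self.files[name] = (size, None)
```

A file whose pin has expired is treated here exactly like a released file. The slide notes that the real action on pin expiry depends on the file type, which this sketch does not model.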

7 Types of SRMs
- Types of storage resource managers
  - Disk Resource Manager (DRM)
    - Manages one or more disk resources
  - Tape Resource Manager (TRM)
    - Manages access to a tertiary storage system (e.g. HPSS)
  - Hierarchical Resource Manager (HRM = TRM + DRM)
    - An SRM that stages files from tertiary storage into its disk cache
- SRMs and file transfers
  - SRMs DO NOT perform file transfer
  - SRMs DO invoke file transfer service if needed (GridFTP, FTP, HTTP, …)
  - SRMs DO monitor transfers and recover from failures
    - TRM: from/to MSS
    - DRM: from/to network
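The division of labor above (the SRM invokes a transfer service, monitors it, and recovers from failures) might look roughly like the following sketch. The function `transfer_with_recovery`, the exception type, and the exponential backoff are hypothetical choices; the slides do not say which retry policy the HRMs use.

```python
import time


class TransientTransferError(Exception):
    """Raised by a transfer callable for failures that are worth retrying."""


def transfer_with_recovery(transfer, source_url, target_url,
                           max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Supervise an external transfer service and retry on transient failures.

    `transfer` is any callable taking (source_url, target_url), for example a
    wrapper around a GridFTP client. This function never moves bytes itself.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transfer(source_url, target_url)
        except TransientTransferError:
            if attempt == max_attempts:
                raise                                   # give up, report failure
            sleep(base_delay * 2 ** (attempt - 1))      # backoff is an assumption
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.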

8 Uniformity of Interface → Compatibility of SRMs
[Diagram. Labels recovered from the figure: Client (USER/APPLICATIONS); Grid Middleware; SRM interfaces in front of Enstore, JASMine, dCache, CASTOR, and a disk cache]

9 SRMs use in STAR for Robust Multi-file replication
[Diagram. Labels recovered from the figure:]
- HRM-Client command-line interface (anywhere): HRM-COPY (thousands of files)
- LBNL: HRM (performs writes), disk cache; get list of files; SRM-GET (one file at a time); archive files
- BNL: HRM (performs reads), disk cache; stage files
- Network transfer: GridFTP GET (pull mode)
- Recovers from staging failures
- Recovers from file transfer failures
- Recovers from archiving failures

10 Detailed sequence of actions
[Sequence diagram. Components: HRM-Client command-line interface (anywhere); LBNL HRM (performs writes) with disk cache; BNL HRM (performs reads) with disk cache; web-based file monitoring tool]
Client request: srmCopy {(sourceURL=hpss.bnl.gov/xyz/file_x, targetURL=hpss.lbnl.gov/uvw/file_y)}; get list of files from directory; request files
For each file being replicated:
1 Allocate space
2 srmGet (sourceURL)
3 Allocate space
4 Stage file
5 File staged (BNL’s diskURL)
6 GridFTP GET (pull mode)
7 Transfer complete
8 Release space
9 Call_back: file on disk
- Archive file
- Call_back: file on tape
11 Release space
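The per-file sequence can be walked through with a toy simulation. This is one reading of the diagram, not the HRM code: `ToyHRM`, `replicate_file`, the `disk://` URL form, and the assignment of each step to a site are assumptions, and the actions are only logged rather than performed.

```python
class ToyHRM:
    """Stand-in for an HRM; records each action instead of touching storage."""

    def __init__(self, site, log):
        self.site = site
        self.log = log

    def _record(self, action, url):
        self.log.append(f"{self.site}: {action} {url}")

    def allocate_space(self, url):
        self._record("allocate space for", url)

    def srm_get(self, url):
        """Allocate cache space, stage the file from tape, return its disk URL."""
        self.allocate_space(url)
        self._record("stage", url)
        disk_url = f"disk://{self.site}/{url.rsplit('/', 1)[-1]}"
        self._record("file staged at", disk_url)
        return disk_url

    def archive(self, url):
        self._record("archive", url)

    def release_space(self, url):
        self._record("release space for", url)


def replicate_file(source_url, target_url, source_hrm, target_hrm, log):
    """Replicate one file: stage at the source, pull, then archive at the target."""
    target_hrm.allocate_space(target_url)
    disk_url = source_hrm.srm_get(source_url)
    log.append(f"{target_hrm.site}: pull {disk_url}")   # GridFTP GET in the slides
    source_hrm.release_space(source_url)
    log.append("callback: file on disk")
    target_hrm.archive(target_url)
    target_hrm.release_space(target_url)
    log.append("callback: file on tape")
```

Calling `replicate_file` with a BNL source and an LBNL target fills `log` with the ordered actions, which makes it easy to see that source cache space is released as soon as the pull completes, well before archiving at the target finishes.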

11 Web-Based File Monitoring Tool
Shows:
- Files already transferred
- Files during transfer
- Files to be transferred
Also shows for each file:
- Source URL
- Target URL
- Transfer rate

12 Tracking multi-file replication performance
[Plot of per-file events over time. Legend:]
- FILE_REQUEST_FAILED
- Notified_Client
- Migration_Finished
- Migration_Requested
- Transfered_to_PDSF_from_BNL
- Staging_finished_at_BNL
- Staging_started_at BNL
- Staging_requested_at_BNL
- File replication request start
Helped discover hard-to-find bug
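Per-file event timestamps like those in the legend can be turned into per-stage durations to locate the slowest phase. The sketch below is illustrative only: the stage names are modeled on the legend (with underscores normalized), and the input format, a mapping from stage name to a timestamp in seconds, is an assumption about how such a log might be represented.

```python
# Stage names modeled on the plot legend, in chronological order.
STAGES = [
    "Staging_requested_at_BNL",
    "Staging_started_at_BNL",
    "Staging_finished_at_BNL",
    "Transfered_to_PDSF_from_BNL",
    "Migration_Requested",
    "Migration_Finished",
    "Notified_Client",
]


def stage_durations(events):
    """Return seconds spent reaching each stage from the previous recorded one.

    `events` maps stage name -> timestamp in seconds for a single file.
    Stages missing from `events` are skipped.
    """
    seen = [(name, events[name]) for name in STAGES if name in events]
    return {
        name: t - prev_t
        for (_, prev_t), (name, t) in zip(seen, seen[1:])
    }


def slowest_stage(events):
    """Name of the stage that took longest to reach."""
    durations = stage_durations(events)
    return max(durations, key=durations.get)
```

If `Migration_Finished` repeatedly comes out as the slowest stage across files, archiving at the target is the limiting step.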

13 File tracking helps to identify bottlenecks
[Plot] Shows that archiving is the bottleneck

14 File tracking shows recovery from transient failures
[Plot] Total: 45 GB

15 File tracking shows network slowdown and recovery
[Plot] Total: 53 GB

16 Conclusion: Key advantages of using SRMs for file replication
- All HRM communications are part of HRM functionality
  - No changes required to HRMs
- Can replicate files from multiple sites
  - In a single command to one target
- Recovers from transient failures
  - For staging and archiving from MSS
  - For network
- Uses disk buffers to keep multiple files
  - Pre-stage in case of slow network
  - Hold files in case of slow archiving
- Concurrent transfers
  - Concurrent staging, concurrent archiving from/to MSS
  - Concurrent transfers over the network
  - Concurrency limited by parameter setup
- Automatic cleanup of buffers (garbage collection)
- Can replicate files between different MSSs (Enstore, Jasmine, HPSS, Castor, …)
- On-line monitoring, summary generated
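The "concurrency limited by parameter setup" point can be sketched with a bounded worker pool. This is a generic illustration using Python's standard library, not how the HRMs are implemented; `replicate_all`, `replicate_one`, and `max_concurrent` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def replicate_all(files, replicate_one, max_concurrent=4):
    """Replicate many files with at most `max_concurrent` in flight at once.

    `replicate_one` is any callable taking a file name. Returns
    (succeeded, failed), where `failed` maps a file name to its exception.
    """
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {name: pool.submit(replicate_one, name) for name in files}
        for name, future in futures.items():
            try:
                future.result()
                succeeded.append(name)
            except Exception as exc:   # keep going; one bad file must not stop the rest
                failed[name] = exc
    return succeeded, failed
```

Collecting failures instead of aborting matches the multi-file setting described in the slides, where a request covers thousands of files and individual failures are expected.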

17 Final note
- BNL–LBNL file replication for STAR has been in production for 9 months (nearly daily use, replicating 1000s of files per day)
- More on SRMs: Thursday at 1:30 pm (Category 3)