Any Data, Anytime, Anywhere
Dan Bradley, representing the AAA Team
OSG All Hands Meeting, March 2013, Indianapolis

Slide 2: AAA Project Goal
Use resources more effectively through remote data access in CMS.
Sub-goals:
- Low-ceremony, low-latency access to any single event
- Reduce the data access error rate
- Overflow jobs from busy sites to less busy ones
- Use opportunistic resources
- Make life at T3s easier

Slide 3: xrootd: Federating Storage Systems
Step 1: deploy a seamless global storage interface, while preserving site autonomy:
- An xrootd plugin maps the global logical filename to the physical filename at each site. The mapping is typically trivial in CMS: /store/* → /store/*
- An xrootd plugin reads from the site storage system (example: HDFS)
- User authentication is also pluggable, but we use standard GSI + lcmaps + GUMS
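As an illustration of the name-translation step (the real plugin is written against the xrootd plugin API, not in Python), here is a minimal sketch of the logical-to-physical mapping a site might configure. The local prefix /hdfs/cms is an assumed example, not a real site configuration.

```python
# Hypothetical illustration of the LFN -> PFN mapping performed by an
# xrootd name-translation plugin at a CMS site. The local prefix
# "/hdfs/cms" is an assumed example.
LOCAL_PREFIX = "/hdfs/cms"

def lfn_to_pfn(lfn: str) -> str:
    """Map a global logical filename to a site-local physical filename."""
    if not lfn.startswith("/store/"):
        raise ValueError("not a CMS logical filename: %s" % lfn)
    # In CMS the mapping is typically trivial: /store/* -> <local prefix>/store/*
    return LOCAL_PREFIX + lfn

print(lfn_to_pfn("/store/data/Run2012A/MinimumBias/AOD/file.root"))
# -> /hdfs/cms/store/data/Run2012A/MinimumBias/AOD/file.root
```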

Slide 4: Status of CMS Federation
US: T1 (disk) + 7/7 T2s federated
- Covers 100% of the data for analysis
- Does not cover files that exist only on tape
World: 2 T1s + 1/3 T2s accessible
- Monitored, but not yet a "turns your site red" service

Slide 5: WAN xrootd traffic

Slide 6: Opening Files

Slide 7: Microscopic View

Slide 8: Problem
Access via xrootd can overload a site's storage system. Florida to the federation: "We are seceding!"
Terms of the Feb 2013 treaty:
- Addition of local xrootd I/O load monitoring
- Sites can configure automatic throttles: when load is too high, new transfer requests are rejected
- The end user only sees an error if the file is unavailable elsewhere in the federation
But these policies are intended for the exception, not the norm, because...

Slide 9: Regulation of Requests
To first order, jobs still run at the sites that host the data:
- ~0.25 GB/s average remote read rate
- O(10) GB/s average local read rate
- ~1.5 GB/s PhEDEx transfer rate
Cases where data is read remotely:
- Interactive: limited by the number of humans
- Fallback: limited by the error rate when opening files
- Overflow: limited by scheduling policy
- Opportunistic: limited by scheduling policy
- T3: watching this
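To make the throttle policy from slide 8 concrete, here is a minimal sketch (not the actual xrootd monitoring plugin): measure recent local I/O load and refuse new transfer requests above a configurable limit. The load metric and threshold value are assumptions for illustration.

```python
# Hypothetical sketch of the slide-8 throttle: reject new xrootd transfer
# requests when recent local I/O load exceeds a site-chosen limit.
# The load source (os.getloadavg) and threshold are illustrative assumptions.
import os

MAX_IO_LOAD = 20.0  # assumed site-specific limit

def accept_new_transfer() -> bool:
    """Return True if a new remote read request should be admitted."""
    load_1min, _, _ = os.getloadavg()
    return load_1min < MAX_IO_LOAD

# A rejected client does not necessarily fail: the federation redirects it,
# and the end user only sees an error if no other site can serve the file.
```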

Slide 10: At the Campus Scale
Some sites are using xrootd for access to data from across a campus grid.
Examples: Nebraska, Purdue, Wisconsin

Slide 11: More on Fallback
On a file open error, CMS software can retry via an alternate location/protocol:
- Configured by the site admin
- We fall back to the regional xrootd federation (US, EU)
- Could also have inter-region fallback; we have not configured this ... yet
- Can recover from a missing-file error, but not from a missing block within a file (more on this later)
Fallback has more uses than just error recovery...
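A minimal sketch of the fallback idea on slide 11 (the real mechanism lives in the CMS software's site-local configuration and I/O layer; this is only an illustration). The local prefix and redirector hostname are assumptions.

```python
# Hypothetical sketch of slide 11's fallback: prefer the local replica,
# otherwise read remotely through a regional xrootd redirector.
# LOCAL_PREFIX and REGIONAL_REDIRECTOR are illustrative assumptions.
import os

LOCAL_PREFIX = "/hdfs/cms"
REGIONAL_REDIRECTOR = "root://regional-redirector.example.org/"

def choose_source(lfn: str) -> str:
    """Return the path or URL a job should open for this logical filename."""
    local_path = LOCAL_PREFIX + lfn
    if os.path.exists(local_path):
        return local_path                      # normal case: local replica
    # Fallback case: the federation locates another site holding the file.
    return REGIONAL_REDIRECTOR + lfn
```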

Slide 12: More about Overflow
GlideinWMS scheduling policy:
- Candidates for overflow: idle jobs with wait time above a threshold (6h) whose desired data is available in a region supporting overflow
- Regulation of overflow: a limited number of overflow glideins is submitted per source site
Data access:
- No reconfiguration of the job is required
- Uses the fallback mechanism: try local access, fall back to remote access on failure
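As a toy model of the overflow candidate selection above (not the actual glideinWMS frontend configuration), a short sketch; the job attribute names are assumptions for illustration.

```python
# Hypothetical sketch of the slide-12 overflow policy: which idle jobs are
# candidates to run away from their data. Field names and the threshold
# constant mirror the slide but are illustrative assumptions.
OVERFLOW_WAIT_THRESHOLD_H = 6

def overflow_candidates(idle_jobs, overflow_regions):
    """Select idle jobs eligible to overflow to another site."""
    selected = []
    for job in idle_jobs:
        waited_long = job["hours_idle"] > OVERFLOW_WAIT_THRESHOLD_H
        data_reachable = job["data_region"] in overflow_regions
        if waited_long and data_reachable:
            selected.append(job)
    return selected
```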

Slide 13: Overflow
Small but steady overflow in the US region.

Slide 14: Running Opportunistically
To run CMS jobs at non-CMS sites, we need:
- Outbound network access
- Access to CMS data files: xrootd remote access
- Access to conditions data: an HTTP proxy
- Access to CMS software: CVMFS (also needs the HTTP proxy)
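A minimal preflight sketch of these requirements as a worker-node check; the redirector hostname is a placeholder, while the CVMFS path and xrootd port are standard values, and the proxy environment variables are an assumption about how the job is configured.

```python
# Hypothetical preflight check for the slide-14 requirements on an
# opportunistic worker node. The redirector hostname is a placeholder.
import os
import socket

def check_outbound_network(host="xrootd-redirector.example.org", port=1094) -> bool:
    """Can we reach an xrootd endpoint? (1094 is the usual xrootd port)"""
    try:
        socket.create_connection((host, port), timeout=10).close()
        return True
    except OSError:
        return False

def check_cvmfs(repo="/cvmfs/cms.cern.ch") -> bool:
    """Is the CMS CVMFS repository mounted (or made visible via Parrot)?"""
    return os.path.isdir(repo)

def check_http_proxy() -> bool:
    """Conditions data and CVMFS both need a configured HTTP proxy."""
    return bool(os.environ.get("http_proxy") or os.environ.get("HTTP_PROXY"))
```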

Slide 15: CVMFS Anywhere
But non-CMS sites might not happen to mount the CMS CVMFS repository, so run the job under Parrot (from cctools):
- Can now access CVMFS without a FUSE mount
- Also gives us identity boxing: privilege separation between the glidein and the user job
- Has worked well for guinea-pig analysis users; working on extending it to more users
What about in the cloud? If you control the VM image, just mount CVMFS.
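As a rough illustration only (exact options vary by cctools version, so treat everything beyond the bare parrot_run invocation, including the proxy value and environment variable, as assumptions), a glidein-side wrapper might launch the user payload under Parrot roughly like this:

```python
# Hypothetical wrapper that launches a user job under Parrot so that
# /cvmfs/cms.cern.ch is visible without a FUSE mount. The proxy value and
# environment variable name are assumptions for illustration.
import os
import subprocess

def run_under_parrot(job_cmd, http_proxy="http://squid.example.org:3128"):
    env = dict(os.environ)
    env["HTTP_PROXY"] = http_proxy   # CVMFS content is fetched over HTTP
    # parrot_run intercepts the job's system calls and serves /cvmfs paths.
    return subprocess.call(["parrot_run"] + job_cmd, env=env)

# Example: run_under_parrot(["sh", "cms_job.sh"])
```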

Slide 16: Fallback++
Today we can recover when a file is missing from the local storage system, but missing blocks within files still cause jobs to fail:
- The job may come back and fail again...
- An admin may need to intervene to recover the data
- The user may need to resubmit the job
Can we do better?

Slide 17: Yes, We Hope
Concept:
- Fall back on read error
- Cache remotely read data
- Insert the downloaded data back into the storage system

Slide 18: File Healing

Slide 19: File Healing Status
- Currently working via whole-file caching, still only triggered by a file open error
- Plans to support partial-file healing; will need to fall back to a local xrootd proxy on all read failures
- The current implementation is HDFS-specific (it modifies the HDFS client to do the fallback to xrootd), but it is not CMS-specific
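The actual implementation is a modified HDFS client; purely to illustrate the whole-file-healing flow of slides 17-19, here is a sketch in Python. The cache directory, redirector, and re-insertion command are assumptions.

```python
# Hypothetical sketch of whole-file healing (slides 17-19): on open failure,
# fetch the whole file from the federation, cache it, and re-insert it into
# the local storage system. Paths and the re-insertion command are
# illustrative assumptions; the real code is a modified HDFS client.
import os
import subprocess

REDIRECTOR = "root://regional-redirector.example.org/"
CACHE_DIR = "/var/cache/file-healing"

def heal_and_open(lfn: str, local_path: str):
    try:
        return open(local_path, "rb")                # normal local open
    except FileNotFoundError:
        cached = os.path.join(CACHE_DIR, os.path.basename(lfn))
        # Fetch the whole file from the federation (xrdcp is the xrootd copy tool).
        subprocess.check_call(["xrdcp", REDIRECTOR + lfn, cached])
        # Re-insert the healed copy into local storage (assumed helper command).
        subprocess.check_call(["hdfs", "dfs", "-put", cached, local_path])
        return open(cached, "rb")
```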

Slide 20: Cross-site Replication
Once we have partial-file healing, we could reduce the HDFS replication level from 2 to 1 and use cross-site redundancy instead:
- Would need to enforce the replication policy at a higher level
- May not be a good idea for hot data
- Need to consider the impact on performance

Slide 21: Performance
Mostly CMS application-specific work:
- Improved remote read performance by combining multiple reads into vector reads, which eliminates many round trips
- Working on BitTorrent-like capabilities in the CMS application: read from multiple xrootd sources, balance load away from the slower source, and react on an O(1) minute time frame
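To show why vector reads help over a high-latency WAN (this is a toy model, not the CMS application's actual I/O layer), a sketch that coalesces many small reads into one request; remote_read is an assumed stand-in for a vectored-read interface.

```python
# Toy illustration of the vector-read optimization on slide 21.
# remote_read(chunks) is an assumed stand-in: it takes a list of
# (offset, length) pairs, performs one network round trip, and returns
# the corresponding list of byte blocks.

def read_scattered(remote_read, chunks):
    """One request (and one WAN round trip) per chunk: n_chunks * RTT."""
    return [remote_read([chunk])[0] for chunk in chunks]

def read_vectored(remote_read, chunks):
    """One request carrying every (offset, length) pair: about 1 RTT total."""
    return remote_read(chunks)
```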

Slide 22: HTCondor Integration
Improved vanilla-universe file transfer scheduling and monitoring in 7.9:
- Used to have one file transfer queue: one misconfigured workflow could starve everything else, and it was difficult to diagnose
- Now there is one queue per user, or per arbitrary attribute (e.g. target site)
- Equal sharing between transfer queues in case of contention
- Reporting of transfer status, bandwidth usage, disk load, and network load
- And now you can condor_rm those malformed jobs that are transferring GBs of files :)
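As a toy model of the per-user transfer queues with equal sharing described above (HTCondor's real scheduling lives in the schedd and is configured through its own settings; this is only an illustration):

```python
# Toy model of slide 22's per-user transfer queues: when transfer slots are
# contended, serve users round-robin so one misconfigured workflow cannot
# starve everyone else. This is an illustration, not HTCondor's code.
from collections import defaultdict, deque

class TransferQueues:
    def __init__(self):
        self.queues = defaultdict(deque)   # one FIFO of transfers per user
        self.rotation = deque()            # users awaiting their turn

    def enqueue(self, user, transfer):
        if not self.queues[user] and user not in self.rotation:
            self.rotation.append(user)     # user (re)joins the rotation
        self.queues[user].append(transfer)

    def next_transfer(self):
        """Equal sharing: serve users round-robin while they have work."""
        while self.rotation:
            user = self.rotation.popleft()
            if self.queues[user]:
                transfer = self.queues[user].popleft()
                if self.queues[user]:
                    self.rotation.append(user)   # keep a place in line
                return user, transfer
        return None
```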

Slide 23: Summary
The xrootd storage federation is rapidly expanding and proving useful within CMS. We hope to do more:
- Automatic error recovery
- Opportunistic usage
- Improving performance