Data access and Storage

Presentation transcript:

Data access and Storage: some new directions
From the xrootd and Scalla perspective
Fabrizio Furano, CERN IT/GS
02-June-08, Elba SuperB meeting
http://savannah.cern.ch/projects/xrootd
http://xrootd.slac.stanford.edu

Outline
- So many new directions: the unleashed fantasy of designers and users
- What is Scalla
- The "many" storage element paradigm
- Direct WAN data access
- Clusters globalization
- Virtual Mass Storage System and 3rd-party fetches
- Xrootd File System + a flavor of SRM integration
- Conclusion

What is Scalla?
- The evolution of the BaBar-initiated xrootd project
  - Data access designed with HEP requirements in mind
  - Nevertheless a very generic platform
- Structured Cluster Architecture for Low Latency Access
  - Low-latency access to data via xrootd servers
    - POSIX-style byte-level random access
    - By default, arbitrary data organized as files
    - Hierarchical directory-like name space
    - Protocol includes high-performance features
  - Structured clustering provided by cmsd servers (formerly olbd)
    - Exponentially scalable and self-organizing
- Tools and methods to cluster, harmonize, connect, …

Scalla design points
- High-speed access to experimental data
  - Small-block sparse random access (e.g., ROOT files)
  - High transaction rate with rapid request dispatching (fast concurrent opens)
- Wide usability
  - Generic Mass Storage System interface (HPSS, RALMSS, Castor, etc.)
  - Full POSIX access
  - Server clustering (up to ~200K servers per site) for linear scalability
- Low setup cost
  - High-efficiency data server (low CPU/byte overhead, small memory footprint)
  - Very simple configuration requirements
  - No 3rd-party software needed (avoids messy dependencies)
- Low administration cost and robustness
  - Non-assisted fault tolerance (the jobs recover from failures, no crashes; any factor of redundancy possible on the server side)
  - Self-organizing servers remove the need for configuration changes
  - No database requirements (high performance, no backup/recovery issues)

Single-point performance
- Very carefully crafted, heavily multithreaded code
- Server side: built for speed and scalability
  - High level of internal parallelism + stateless
  - Exploits OS features (e.g., async I/O, polling, selecting)
  - Many speed- and scalability-oriented features
  - Supports thousands of client connections per server
- Client side:
  - Handles the state of the communication
  - Reconstructs everything to present a simple interface
  - Fast data path
  - Network pipeline coordination + latency hiding
  - Supports connection multiplexing + intelligent server cluster crawling
- Server and client exploit multi-core CPUs natively

Fault tolerance
- Server side
  - If servers go down, the overall functionality can be fully preserved
    - Redundancy, MSS staging of replicas, …
  - "Can" means that weird deployments can give it up
    - E.g., storing the physical endpoint address of each file in a DB: generally a bad idea
- Client side (+ protocol)
  - The application never notices errors
    - Totally transparent, until they become fatal, i.e., when it becomes really impossible to reach a working endpoint and resume the activity
  - Typical tests (try it!)
    - Disconnect/reconnect network cables
    - Kill/restart servers

Authentication (courtesy of Gerardo Ganis, CERN PH-SFT)
- Flexible, multi-protocol system
  - Abstract protocol interface: XrdSecInterface
  - Protocols implemented as dynamic plug-ins
  - Architecturally self-contained: no weird code/library dependencies (requires only OpenSSL)
  - High-quality, highly optimized code; great work by Gerri Ganis
- Embedded protocol negotiation
  - Servers define the list, clients make the choice (a configuration sketch follows below)
  - Server lists may depend on host / domain
- One handshake per process-server connection
  - Reduced overhead: # of handshakes ≤ # of servers contacted
  - Exploits multiplexed connections, no matter the number of file opens per process-server pair
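
A hedged sketch of what the server-side part of this might look like in an xrootd configuration file. The directive names follow the xrootd convention, but the library name and the protocol choices are illustrative assumptions, not taken from the slides:

    # Load the security framework and advertise two protocols (illustrative).
    # Clients receive this list during the handshake and pick one of them.
    xrootd.seclib libXrdSec.so
    sec.protocol krb5
    sec.protocol gsi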

Available protocols (courtesy of Gerardo Ganis, CERN PH-SFT)
- Password-based (pwd)
  - Either system or dedicated password file
  - User account not needed
- GSI (gsi)
  - Handles GSI proxy certificates
  - VOMS support should be OK now (Andreas, Gerri)
  - No need for Globus libraries (and super-fast!)
- Kerberos IV, V (krb4, krb5)
  - Ticket forwarding supported for krb5
- Fast ID (unix, host), to be used with authorization
- ALICE security tokens
- Emphasis on ease of setup and performance

The "many" paradigm
- Creating big clusters scales linearly
  - Both throughput and size grow, while latency stays very low
- We like the idea of a disk-based cache
  - The bigger (and faster), the better
- So, why not use the disk of every worker node (WN)?
  - In a dedicated farm: 500 GB × 1000 WN → 500 TB
  - The additional CPU usage is quite low anyway
- Can be used to set up a huge cache in front of an MSS
  - No need to buy a bigger MSS, just lower the miss rate!
- Adopted at BNL for STAR (up to 6-7 PB online)
  - See Pavel Jakl's (excellent) thesis work
  - They also optimize MSS access, nearly doubling the staging performance
- Quite similar to the PROOF approach to storage
  - Only for storage: PROOF is very different on the computing side

The "many" paradigm (cont.)
- This big disk cache
  - Shares the computing power of the WNs
  - Shares the network of the WN pool
    - i.e., no SAN-like bottlenecks (… and reduced costs)
    - Exploits a complete graph of connections (not just 1-2), handled by the farm's network switch
- The performance boost varies, depending on:
  - Total disk cache size
  - Total "working set" size
    - It is very well known that most accesses touch only a fraction of the repository at a time; in HEP the data locality principle holds, so caches work!
  - Throughput of a single application
    - There can be many types of jobs/apps

WAN direct access – motivation
- We want to make WAN data analysis convenient
  - A process does not always read every byte in a file
  - Even if it does… no problem
- The typical way in which HEP data is processed is (or can be) often known in advance
  - TTreeCache does an amazing job for this (see the sketch below)
- xrootd: fast and scalable server side
  - Makes things run quite smoothly
  - Gives room for improvement at the client side, about WHEN to transfer the data
    - There might be better moments to trigger a chunk transfer than the moment it is needed
    - The app does not have to wait while it receives data… it computes in parallel
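
As a concrete illustration, a minimal ROOT macro that reads a tree straight over the WAN through xrootd with the read cache enabled. This is a sketch only; the host name, file path and tree name are hypothetical:

    // read_remote.C -- minimal sketch; file, path and tree name are hypothetical
    #include "TFile.h"
    #include "TTree.h"

    void read_remote()
    {
       // Open the file over the xrootd protocol (root:// URL)
       TFile *f = TFile::Open("root://alirdr.cern.ch//alice/data/example.root");
       if (!f || f->IsZombie()) return;

       TTree *t = 0;
       f->GetObject("exampleTree", t);
       if (!t) return;

       // TTreeCache: prefetch the needed baskets in a few large, vectored
       // requests instead of many small synchronous reads over the WAN
       t->SetCacheSize(30 * 1024 * 1024);   // 30 MB read cache
       t->AddBranchToCache("*", kTRUE);     // cache all branches read below

       Long64_t n = t->GetEntries();
       for (Long64_t i = 0; i < n; ++i)
          t->GetEntry(i);                   // the analysis work would go here

       f->Close();
    }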

WAN direct access – hiding latency
[Diagram comparing three access patterns, each drawn as a data-processing vs. data-access timeline:]
- Pre-transfer data "locally": overhead, need for potentially useless replicas, and huge bookkeeping! But easy to understand.
- Plain remote access: latency, wasted CPU cycles.
- Remote access + latency hiding: interesting! Efficient and practical (a back-of-the-envelope comparison follows below).
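
A back-of-the-envelope model of why this matters (a sketch; the symbols are not from the slides): for N small synchronous requests over a link with round-trip time RTT, total data volume V and available bandwidth B,

    T_dumb   ≈ N × RTT + V/B    (every request pays the full round trip)
    T_hidden ≈ RTT + V/B        (requests are pipelined and overlap with computation)

The "expected time" estimate on the next slide is essentially the N × RTT latency term plus the V/B data term.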

Multiple streams
- Clients still see one physical connection per server
- [Diagram: an application with several clients (Client1, Client2, Client3) talking to a server through one TCP control connection plus parallel TCP data connections]
- Async data automatically gets split across the data streams

Multiple streams (cont.)
- It is not a copy-only tool to move data
  - Can be used to speed up access to remote repositories
  - Transparent to apps making use of *_async requests: the app computes WHILE getting data
  - xrdcp uses it (-S option); results comparable to other cp-like tools (see the example below)
- For now only reads fully exploit it
  - Writes (by default) use it to a lower degree: not easy to keep the client-side fault tolerance with writes
- Automatic agreement of the TCP window size
  - You configure the servers to support the WAN mode; if requested, it is fully automatic
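
For example, a WAN copy using multiple streams might look like the line below. The -S option appears on this slide and the 15-stream setting on the "Smart WAN access" slide; the source URL and paths are hypothetical:

    xrdcp -S 15 root://someserver.example.org//store/bigfile.root /tmp/bigfile.root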

Dumb WAN access*
- Setup: client at CERN, data at SLAC
  - 164 ms RTT, available bandwidth < 100 Mb/s
  - Smart features switched OFF (xrootd with WAN optimizations disabled)
- Test 1: read a large ROOT tree (~300 MB, ~200k interactions)
  - Expected time: 38000 s (latency) + 750 s (data) + CPU → 10 hrs!
  - No time to waste precisely measuring this!
- Test 2: draw a histogram from that tree data (~6k interactions)
  - Measured time: 20 min
* Federico Carminati, The ALICE Computing Status and Readiness, LHCC, November 2007

Smart WAN access*
- Smart features switched ON
  - ROOT TTreeCache + XrdClient async mode + 15-stream multistreaming
- Test 1 actual time: 60-70 seconds
  - Compared to 30 seconds using a Gb LAN
  - Very favorable for sparsely used files… in the end, even much better than certain always-overloaded SEs
- Test 2 actual time: 7-8 seconds
  - Comparable to LAN performance (5-6 s)
  - 100x improvement over dumb WAN access (was 20 minutes)
* Federico Carminati, The ALICE Computing Status and Readiness, LHCC, November 2007

Cluster globalization
- Up to now, xrootd clusters could be populated
  - With xrdcp from an external machine
  - By writing to the backend store (e.g., CASTOR/DPM/HPSS, etc.)
  - E.g., FTD in ALICE now uses the first option. It "works"…
    - Load and resource problems: all the external traffic of the site goes through one machine close to the destination cluster
- If a file is missing or lost
  - Due to a disk and/or catalog screw-up
  - Job failure… manual intervention needed
  - With 10^7 online files, finding the source of a problem can be VERY tricky

Virtual MSS
- Purpose:
  - A request for a missing file arrives at cluster X
  - X assumes that the file ought to be there, and tries to get it from the collaborating clusters, starting with the fastest one
  - Note that X itself is part of the game, and it is composed of many servers
- The idea:
  - Each cluster considers the set of ALL the others as a very big online MSS
- This is much easier than it seems
  - The tests so far report high robustness…
  - Very promising; still in alpha test, but not for much longer

Many pieces
- The global redirector acts as a WAN xrootd meta-manager
- Local clusters subscribe to it
  - And declare the path prefixes they export
- Local clusters (without a local MSS) treat the globality as a very big MSS
  - Coordinated by the global redirector
  - Load balancing with negligible load
    - Priority to files which are already online somewhere
    - Priority to fast, least-loaded sites
  - Fast file location
- True, robust, real-time collaboration between storage elements! Very attractive for Tier-2s.

Cluster globalization… an example
- ALICE global redirector (alirdr), an xrootd/cmsd pair configured with:
  - all.role meta manager
  - all.manager meta alirdr.cern.ch:1312
  - Reachable as root://alirdr.cern.ch/ and including the CERN, GSI and other xrootd clusters
- Site clusters (GSI, CERN, Prague, NIHAM, … any other), each an xrootd/cmsd pair configured with:
  - all.role manager
  - all.manager meta alirdr.cern.ch:1312
- Meta-managers can be geographically replicated
  - There can be several in different places for region-aware load balancing
- The same layout is expressed as configuration fragments below
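
A sketch assembled only from the directives shown above; the exported path prefix is an illustrative assumption:

    # ALICE global redirector (alirdr.cern.ch) -- the meta-manager
    all.role meta manager
    all.manager meta alirdr.cern.ch:1312
    all.export /alice

    # Each site's local redirector (CERN, GSI, Prague, ...) subscribing to it
    all.role manager
    all.manager meta alirdr.cern.ch:1312
    all.export /alice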

The virtual MSS realized
- Same layout as before: the ALICE global redirector (all.role meta manager, all.manager meta alirdr.cern.ch:1312) plus the site clusters (GSI, CERN, Prague, NIHAM, … any other; all.role manager, all.manager meta alirdr.cern.ch:1312)
- Local clients work normally
- But missing a file?
  - Ask the global meta-manager
  - Get it from any other collaborating cluster
- Note: the security hats could require you to use the xrootd native proxy support

Virtual MSS – the vision
- A powerful mechanism to increase reliability
  - The data replication load is widely distributed
  - Multiple sites are available for recovery
- Allows virtually unattended operation
  - Automatic restore after a server failure
  - Missing files in one cluster are fetched from another, typically the fastest one which really has the file online
  - No costly out-of-time DB lookups
- File (pre)fetching on demand
  - Can be transformed into a 3rd-party GET (by asking for a specific source)
- Practically no need to track file location
  - But it does not remove the need for metadata repositories

Problems? Not yet.
- There should be no architectural problems
  - Striving to keep code quality at the maximum level
  - Awesome collaboration (CERN, SLAC, GSI)
- BUT… the architecture can prove itself to be ultra-bandwidth-efficient
  - Or greedy, as you prefer
- Need for a way to coordinate the remote connections, in and out
  - We designed the Xrootd BWM and the Scalla DSS

The Scalla DSS and the BWM
- Directed Support Services architecture (DSS)
  - A clean way to associate external xrootd-based services, via 'artificial' but meaningful pathnames
  - A simple way for a client to ask for a service
- Like an intelligent queueing service for WAN transfers, which we called the BWM
  - Just an xrootd server with a queueing plugin
  - Can be used to queue incoming and outgoing traffic, in a cooperative and symmetrical manner
  - So, clients ask to be queued for transfers
- Work in progress!

Virtual MSS (cont.)
- The mechanism is there, once it is correctly boxed: work in progress…
- A (good) side effect:
  - Pointing an app to the "area" global redirector gives a complete, load-balanced, low-latency view of the whole repository
  - An app using the "smart" mode can just run
    - Probably a full-scale production won't, for now, but what about a small interactive analysis on a laptop?
- After all, HEP sometimes just copies everything, useful or not
  - I cannot say that in some years we will not have a more powerful WAN infrastructure
  - But using it to copy more useless data just looks ugly
  - If a web browser can do it, why not a HEP app? It only looks a little more difficult.
  - Better if used with a clear design in mind

Data system vs. file system
- Scalla is a data access system
  - Some users/applications want file system semantics
  - More transparent, but much less scalable (transactional namespace)
- For years users have asked: can Scalla create a file system experience?
- The answer is: it can, to a degree that may be good enough
  - We relied on FUSE to show how
- Users have to decide for themselves whether they actually need a huge, multi-PB unique filesystem
  - If so, probably there is something else which is "strange"

What is FUSE?
- Filesystem in Userspace
  - Used to implement a file system in a user-space program
  - Linux 2.4 and 2.6 only
  - Refer to http://fuse.sourceforge.net/
- FUSE can be used to provide xrootd access
  - Looks like a mounted file system (see the sketch below)
  - Several people have xrootd-based versions of this
    - Wei Yang at SLAC: tested and fully functional (used to provide SRM access for ATLAS)
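
Because the cluster shows up as an ordinary mounted file system, plain POSIX I/O works on it unchanged. A trivial sketch, where the mount point and file path are hypothetical:

    #include <fstream>
    #include <iostream>
    #include <string>

    int main()
    {
       // "/xrootdfs" is a hypothetical XrootdFS mount point; any POSIX tool or
       // library call (ls, tar, std::ifstream, ...) sees a normal directory tree.
       std::ifstream in("/xrootdfs/atlas/user/example.txt");
       std::string line;
       while (std::getline(in, line))
          std::cout << line << '\n';
       return 0;
    }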

XrootdFS (Linux/FUSE/Xrootd)
- [Diagram: on the client host, an application goes through the POSIX file system interface to the FUSE kernel module and the FUSE/Xroot interface, which uses the xrootd POSIX client (opendir, create, mkdir, mv, rm, rmdir); requests reach the redirector host, which runs the redirector (xrootd:1094) and the name space (xrootd:2094)]
- Should run cnsd on the data servers to capture non-FUSE events and keep the FS namespace consistent!

Why XrootdFS?
- Makes some things much simpler
  - Most SRM implementations run transparently
  - Avoids pre-load library worries
- But impacts other things
  - Performance is limited
    - Kernel-FUSE interactions are not cheap
    - The implementation is OK but quite simple-minded
    - Rapid file creation (e.g., tar) is limited
    - Remember that the comparison is with a plain xrootd cluster
  - FUSE must be administratively installed to be used
    - Difficult if it involves many machines (e.g., batch workers)
    - Easier if it involves an SE node (i.e., an SRM gateway)
- So, it's good for the SRM side of a repository, but not for the job side

Conclusion
- Many new ideas are reality or are coming, typically dealing with:
  - True real-time data storage distribution
  - Interoperability (Grid, SRMs, file systems, WANs…)
  - Enabling interactivity (and storage is not the only part of it)
- The next close-to-completion big step is setup encapsulation
  - Will proceed by degrees, trying to avoid common mistakes
  - Both manual and automated setups are honourable and to be honoured!

Acknowledgements
- Old and new software collaborators
  - Andy Hanushevsky, Fabrizio Furano (client side), Alvise Dorigo
  - ROOT: Fons Rademakers, Gerri Ganis (security), Bertrand Bellenot (Windows porting)
  - ALICE: Derek Feichtinger, Andreas Peters, Guenter Kickinger
  - STAR/BNL: Pavel Jackl, Jerome Lauret
  - GSI: Kilian Schwartz
  - Cornell: Gregory Sharp
  - SLAC: Jacek Becla, Tofigh Azemoon, Wilko Kroeger, Bill Weeks
  - Peter Elmer
- Operational collaborators
  - BNL, CERN, CNAF, FZK, INFN, IN2P3, RAL, SLAC