
Slide 1: Storage Architecture for Tier 2 Sites
Andrew Hanushevsky, Stanford Linear Accelerator Center, Stanford University
8-May-07, INFN Tier 2 Workshop
http://xrootd.slac.stanford.edu

Slide 2: Outline
- The quintessential Tier 2 and the average SE
- Constructing an SE today: Classic, GPFS, DPM, dCache, Castor, Scalla
- Requirements, limitations, and considerations
- The current SLAC approach
- Future possibilities
- Conclusion

Slide 3: The Quintessential Tier 2 Site
- Lumpy or constrained resources
  - Largely compute oriented
  - Enough storage for compute tasks
- Limited human resources
  - Few administrators
  - Narrow expertise in most cases
- Large number of sites
  - Creates an additional support burden

Slide 4: Quintessential Generalization
- Many reasons for being a Tier 2
  - Obvious resource limitations, human or hardware
  - Pre-existing contractual obligations
  - Resources split into too many fractions
  - Financial constraints (just the way money flows)
- There are Tier 2 sites that could be Tier 1 (for example, SLAC)

Slide 5: That Said...
- If we assume typical Tier 2s are limited in hardware or people, a set of implied requirements emerges
  - These may be good for Tier 1s
  - They are absolutely necessary for Tier 2s
- This talk concentrates on the storage element

Slide 6: Storage Element Aspects
- Provide data to the compute farm (CE): random access
- Interface with the grid: sequential access, data in and data out
- Integrate with other systems: offline storage (i.e., MSS)

Slide 7: General Tier 2 Architecture
[Diagram: the compute farm (a.k.a. CE) performs random application I/O against the storage system (a.k.a. SE); grid access moves bulk data with sequential I/O.]
Two scaling aspects:
- Client scaling: what happens as you add more clients
- I/O scaling: what happens as you add SE disk servers

Slide 8: Closer Look At The SE
[Diagram of the SE protocol stack:]
- Pa: protocol for application random I/O (dcap, gpfs, nfs, rfio, xroot)
- Pg: protocol for grid sequential bulk I/O (gridFTP, bbftp, bbcp, xroot)
- SRM to coordinate Grid-SE transfers

Slide 9: Implied SE Requirements
- Easy to install and configure (sometimes root access is restricted)
- Easy to maintain
  - Configuration as hardware changes
  - Across OS upgrades, now more common due to security patches and shortened lifecycles
- Easy to debug: integrated monitoring and extensive, understandable logging
- Easy to scale upwards as well as downwards
- Low machine resources
  - More machines mean more administrative effort
  - Maximize use of a constrained CE
- High degree of fault tolerance

Slide 10: Constructing an SE
- The choices du jour: Classic, GPFS, DPM, dCache, Castor, Scalla (i.e., xrootd)
- The following is based on my observations

Slide 11: Classic SE Architecture
[Diagram: clients access a single data server via NFS and gridFTP.]

Slide 12: Classic SE Considerations
- Hardware: almost anything that can support the client load
- Software: not much except for gridFTP prerequisites (e.g., Globus)
- Data access: NFS and gridFTP (typical), rfio and xroot (extra)
- Other considerations
  - MSS integration is a local matter; SRM is not supported
  - Configuration and maintenance is easy; no database requirements
  - I/O scaling and client scaling are low (limited applicability)

Slide 13: GPFS SE Architecture
[Diagram: GPFS cluster, which can be SAN based.]
Sources: Architectural and Design Issues in the General Parallel File System, Benny Mandler (mandler@il.ibm.com), May 2005; Dominique A. Heger, Fortuitous Technology (dom@fortuitous.com), Austin, TX, 2006, http://www.fortuitous.com/docs/primers/Cluster_Filesystems_Intro.pdf

Slide 14: GPFS SE Considerations
- Hardware: 1-2 nodes + n disk server nodes (0 if SAN based); 2GHz processors with at least 1GB
- Software: Linux portability layer (limited kernel versions), OpenSSL
- Data access: local VFS, NFS v3, gridFTP, SRM via StoRM
- Other considerations
  - MSS integration supported via DMAPI (currently only HPSS)
  - Configuration and maintenance is of average to average+ difficulty; no database requirements
  - I/O scaling very good; client scaling average (500-1,000 nodes documented)

Slide 15: DPM SE Architecture
[Diagram: disk servers store the physical files behind a standard storage interface; a mySQL or Oracle database (very important to back up!) holds the namespace, authorization data, replicas, DPM configuration, and all requests (SRM, transfers, ...). Everything can be installed on a single machine.]
Source: DPM Administration for Tier2s, Sophie Lemaitre (Sophie.Lemaitre@cern.ch) and Jean-Philippe Baud (Jean-Philippe.Baud@cern.ch), Tier2s tutorial, 15 Jun 2006

Slide 16: DPM SE Considerations
- Hardware: 2 nodes + n disk server nodes; 2GHz processors with 0.5-1GB
- Software: Java, Globus, mySQL or Oracle; several others available via the LCG distribution
- Data access: rfio, integrated xroot, modified gridFTP, SRM
- Other considerations
  - No MSS support (though in plan)
  - Configuration and maintenance is of average difficulty
  - Needs a database backup/restore plan
  - I/O scaling average; client scaling likely below average

Slide 17: dCache SE Architecture
[Diagram: see https://twiki.grid.iu.edu/pub/Documentation/StorageDcacheOverviewHardwareLayout/OsgTier2DcacheInstall.png]

Slide 18: dCache SE Considerations
- Hardware: 6 nodes + n disk server nodes; 2GHz multi-core processors with at least 4-8GB
- Software: Java 1.4.2, PostgreSQL 8+, possibly Globus MDS (LCG SE)
- Data access: dcap, gridFTP, minimal xroot, SRM
- Other considerations
  - MSS integration supported
  - Configuration and maintenance is of medium to hard difficulty
  - Needs a database backup/restore plan
  - I/O scaling average; client scaling likely below average

Slide 19: Castor SE Architecture
[Diagram: see http://castor.web.cern.ch/castor/images/Castor_architecture.png]

Slide 20: Castor SE Considerations
- Hardware: 2-4 nodes + n disk server nodes + m tape server nodes; 2GHz multi-core processors with at least 2-4GB
- Software: Oracle, LSF or Maui, many co-requisite Castor RPMs (fortunately there is LCG YAIM)
- Data access: rfio, gridFTP, integrated xroot, SRM
- Other considerations
  - MSS integration included
  - Configuration and maintenance is relatively hard
  - Needs a database backup/restore plan
  - I/O scaling average; client scaling likely below average

Slide 21: Scalla SE Architecture
[Diagram: clients (Linux, MacOS, Solaris, Windows) speak xroot to manager nodes that redirect them to the data servers; an optional gridFTP gateway and an optional firewall sit in front.]

Slide 22: Scalla SE Considerations
- Hardware: 1-3 nodes + n disk server nodes; 1-2GHz processors with 0.5-1GB
- Software: self-contained, no additional software needed (modulo gridFTP and SRM based access)
- Data access: gridFTP (trivially in proxy mode), xroot
- Other considerations
  - MSS integration supported
  - Configuration and maintenance is easy; no database requirements
  - I/O scaling and client scaling very good

Slide 23: Quick Summary
In terms of hardware resources and administrative effort, from lightweight to heavyweight:
Classic¹, Scalla  →  DPM, dCache, Castor, GPFS
¹ Lightweight for very small instances; otherwise it quickly becomes administratively heavy

Slide 24: The Heavy in Heavyweight
- Typified by the IEEE P1244 Storage Reference Model
  - A well intentioned but misguided effort, 1990-2002
  - I'm allowed to say that, as I got an IEEE award for my contribution; yes, I was part of the problem
- Feature rich, mostly in managerial elements
  - File placement, access control, statistics, etc.
- All components are deconstructed into services
  - Name Service, Bit File Service, Storage Service, etc.
- LFN and PFN are intrinsically divorced from each other
  - Requires a database to piece the individual elements together

Slide 25: The Heavyweight Price
- Deconstruction allows great flexibility, but it may lead to architectural inconsistencies
- Invariably, a database is placed in the data path
  - Databases are not inherently bad, and are needed; it's the placement that creates the problem
  - Potential for a bottleneck and a single point of failure that stops everything
- High latency design leads to speed matching problems
  - Implies more hardware, schema evolution, and constant tuning
[Diagram: clients funnel through a database-fronted server, creating a speed mismatch problem; the result is fewer simultaneous jobs and file opens.]
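To make the speed-matching point concrete, here is a back-of-the-envelope sketch (not from the talk; the latencies and concurrency limit are illustrative assumptions) comparing the sustainable file-open rate when the open path is a kernel call versus when it includes database round-trips, using Little's law:

```python
# Illustrative only: sustained open rate = concurrency / per-open latency (Little's law).
# The latencies and concurrency limit below are assumptions, not measurements from the talk.

def open_rate(concurrent_opens: int, latency_s: float) -> float:
    """Opens per second a service can sustain at a given per-open latency."""
    return concurrent_opens / latency_s

lightweight = open_rate(concurrent_opens=64, latency_s=0.002)  # ~kernel/filesystem path
heavyweight = open_rate(concurrent_opens=64, latency_s=0.100)  # database round-trips in the path

print(f"lightweight: ~{lightweight:.0f} opens/s")  # ~32000 opens/s
print(f"heavyweight: ~{heavyweight:.0f} opens/s")  # ~640 opens/s
```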

Slide 26: The Light in Lightweight
- Typified by a lean file system
- Relies on OS primitives for most core functions
  - Latency follows kernel performance
- Feature lean to limit overhead, mostly in managerial elements
- Components are deconstructed only when necessary
- LFN and PFN are intrinsically tied to each other
  - Does not require a persistent internal database; external databases may be needed, but they are not in the data path
- Naturally supports high transaction rates
  - High number of simultaneous jobs and file opens
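As an illustration of the LFN/PFN contrast between the two slides above (a sketch only, not the actual Scalla name-mapping code; the prefix, paths, and catalog are made up), a lightweight system can derive the physical name directly from the logical name, while a heavyweight system must consult a catalog on every open:

```python
# Sketch: lightweight LFN->PFN mapping by simple composition, versus a heavyweight
# catalog lookup. All paths and the example catalog are illustrative assumptions.

LOCAL_PREFIX = "/data/atlas"  # assumed local storage prefix

def pfn_lightweight(lfn: str) -> str:
    """LFN and PFN are tied together: the PFN is derivable from the LFN alone."""
    return LOCAL_PREFIX + lfn

# Heavyweight: the LFN/PFN relationship lives only in a database, so every
# open needs a catalog round-trip and the catalog must be kept consistent.
catalog = {"/user/aod/run1234.root": "/pool07/vol3/a1b2c3d4"}

def pfn_heavyweight(lfn: str) -> str:
    return catalog[lfn]  # raises KeyError if the catalog is stale

print(pfn_lightweight("/user/aod/run1234.root"))  # /data/atlas/user/aod/run1234.root
print(pfn_heavyweight("/user/aod/run1234.root"))  # /pool07/vol3/a1b2c3d4
```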

Slide 27: The Lightweight Price
- Fewer features
  - Less intricate space management controls
  - More challenging to support multiple experiments (typically solved by physical partitioning)
- Less overhead for more performance
  - Job start-up and file open time vastly reduced
  - Avoids pile-up and meltdown
- More bang for the euro
  - Less hardware and fewer people are required
- Sorry, there is no free lunch here

Slide 28: The SLAC Approach
- We prefer lightweight over heavyweight
  - Allows us to scale in a natural way
  - Minimizes "tuning" effort, with a much more predictable outcome
- We do deploy heavyweight systems (e.g., HPSS), but they are always kept out of the critical data path
- Maximizes batch farm utilization
- Minimizes people effort (we have a very lean staff head count)
- Hence, we plan to use Scalla for Tier 2 Atlas support

Slide 29: Heavy Plus Light At SLAC
[Diagram: clients talk to an xroot cluster (xrootd/olbd managers and disk servers); HPSS tape servers with DB2 provide the mass store; a dCache door with Postgres, SRM, and gridFTP handles IN2P3 transfers. The pieces run as independent, speed-matched systems with complete fault tolerance.]

Slide 30: What SLAC Gains
- Leveraging our infrastructure reduces overall cost
  - We learned vast amounts from BaBar
  - More Tier 2s running Scalla further justifies our commitment (hence a bit of a sales pitch follows)
- We now have a high performance storage system
  - It can handle over 3,000 simultaneous jobs
  - It delivers performance limited only by the hardware
  - It requires remarkably few people and only mundane hardware
- Multiple experiments are handled via partitioning
  - Not a real issue, because experiments are almost always active
  - Partitioning is the only practical way to keep one group from impacting another
  - For us, hardware is cheaper than full time people

Slide 31: The Bump In The Road
- Scalla does not currently have SRM support
  - gridFTP is supported via a POSIX preload library
  - uberftp support is virtually complete for pull applications
- Likely to change, but there is no fixed time line
  - Sidelined integration with the FNAL SRM: the current code structure creates significant maintenance issues, and there is no committed FNAL support for a standalone SRM
  - Brookhaven is integrating the LBL SRM/DRM; it's a challenging project
  - Actively working on StoRM integration
- The real impact is not clear: opinions on SRM relevance to Tier 2 Atlas conflict

Slide 32: The Driving Force
- Scalla is a new generation, forward looking design that enables and encourages new opportunities
- Performance and fault tolerance are synergistic with a PetaCache
  - Allows new ways to analyze physics data in real time
  - Sub-millisecond access is relevant to PROOF as well as ad hoc analysis
- Structured peer-to-peer methodology for exponential scaling
  - From a single server to production clusters of 500+ SEs (BNL/STAR)
- Protocol blind, directed security model for multiple authentication modes
  - Just in time protocol conversion can eliminate GSI overhead
- Intrinsic WAN integration is synergistic with new tier models
  - Explosive increase in data flow routings to reduce failure modes
- Let's just explore this for a bit...

Slide 33: But, Isn't The WAN An Enemy?
- Enormous effort is spent on bulk transfer
  - Requires significant SE resources near the CEs
  - Can waste large amounts of network bandwidth unless most of the data is used multiple times
  - Still leaves the "missing file" problem, which requires significant bookkeeping effort
  - Large job startup delays until all required data arrives
- This stems from a historical view of the WAN: too high latency, unstable and unpredictable
  - Much of this is really no longer the case
- Still, many are resistant to even considering real-time WAN access

Slide 34: WAN Real Time Access?
- Real time WAN access is "equivalent" to LAN access when RTT/p <= CPU/event, where p is the number of pre-fetched events
- Some assumptions here
  - Pre-fetching is possible
  - Analysis is structured to be asynchronous (at the very least, the framework needs to be asynchronous)
  - Firewall problems are addressed, for instance by using proxy servers
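A quick numerical sketch of that condition (the round-trip time and pre-fetch depths are illustrative assumptions, not figures from the talk):

```python
# Break-even CPU time per event for hiding WAN latency with pre-fetching.
# Condition from the slide: WAN ~ LAN when RTT / p <= CPU time per event.
# The RTT and pre-fetch depths below are illustrative assumptions.

rtt_s = 0.160                 # e.g., a transatlantic round trip of ~160 ms
for p in (1, 10, 100, 1000):  # number of events pre-fetched per request
    min_cpu_per_event_ms = (rtt_s / p) * 1000
    print(f"p={p:5d}: WAN looks like LAN if CPU/event >= {min_cpu_per_event_ms:.2f} ms")
```

With 100 events pre-fetched over a 160 ms path, any analysis that spends at least 1.6 ms of CPU per event hides the network latency entirely.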

Slide 35: The WAN Is Integral to Scalla
Native latency reduction protocol elements:
- Asynchronous pre-fetch: maximizes overlap between client CPU and network transfers
- Request pipelining: vastly reduces request/response latency
- Vectored reads and writes: allow multi-file and multi-offset access with one request
- Client scheduled parallel streams: remove the server from second guessing the application
- Integrated proxy server clusters: address firewalls in a scalable way
- Federated peer clusters: allow real-time search for files on the WAN
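To illustrate the asynchronous pre-fetch idea in the abstract (a protocol-agnostic sketch, not the xroot client implementation; fetch_block and process_block are hypothetical placeholders), the client keeps one request in flight while the CPU works on the data already delivered:

```python
# Sketch of asynchronous pre-fetch: request block n+1 while processing block n,
# so network latency overlaps with computation. The two helpers are placeholders.
from concurrent.futures import ThreadPoolExecutor

def fetch_block(n: int) -> bytes:
    # Placeholder: would issue a (possibly pipelined) WAN read; here it just fakes data.
    return bytes(1024)

def process_block(data: bytes) -> None:
    # Placeholder for the per-event CPU work.
    pass

def run(num_blocks: int) -> None:
    with ThreadPoolExecutor(max_workers=1) as net:
        future = net.submit(fetch_block, 0)              # prime the pipeline
        for n in range(num_blocks):
            data = future.result()                       # blocks only if the network is behind
            if n + 1 < num_blocks:
                future = net.submit(fetch_block, n + 1)  # pre-fetch the next block
            process_block(data)                          # CPU overlaps with the in-flight fetch

run(10)
```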

Slide 36: Another Tier 2 Model
- Bulk transfer only long-lived, useful data (need a way to identify this)
- Start jobs the moment enough data is present; any missing files can be found on the "net"
- LAN access to high use / high density files
- WAN access to everything else
  - Locally missing files
  - Low use or low density files
- Initiate a background bulk transfer when appropriate; switch to the local copy when it is finally present (see the sketch after this slide)
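A minimal sketch of that local-first, WAN-fallback access policy (the paths, redirector URL, and trigger_background_copy helper are hypothetical illustrations, not Scalla code):

```python
# Sketch: use the local copy if present, otherwise fall back to WAN access via a
# remote redirector and optionally queue a background transfer. Names are illustrative.
import os

LOCAL_PREFIX = "/data/atlas"                               # assumed local SE mount
REMOTE_REDIRECTOR = "root://federation.example.org//atlas"  # hypothetical federation entry point

def trigger_background_copy(lfn: str) -> None:
    # Placeholder: would hand the file to the site's bulk-transfer machinery.
    print(f"queued background copy of {lfn}")

def access_url(lfn: str, high_use: bool) -> str:
    local_path = LOCAL_PREFIX + lfn
    if os.path.exists(local_path):
        return local_path                     # LAN access to locally present files
    if high_use:
        trigger_background_copy(lfn)          # worth keeping a local copy eventually
    return REMOTE_REDIRECTOR + lfn            # WAN access in the meantime

print(access_url("/user/aod/run1234.root", high_use=True))
```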

Slide 37: A Forward Looking Architecture
- Independent Tier 2 sites cross-share data when necessary, e.g., when the local SE is unavailable or a file is missing
[Diagram: sites A, B, C, and D, each with its own CE and SE, federated as independent peer clusters.]

Slide 38: Conclusion
- There are many ways to build a Tier 2 SE; the choice depends on what needs to be accomplished
- Keep in mind that the simplest solution often works best, especially for smaller or highly distributed sites
- Alternative architectures should be considered; they may be the best way to scale production LHC analysis
  - Effort should be spent on making analysis WAN compatible
- In the end, it's the science that matters: choose the architecture that lets you do HEP analysis as fast and as inexpensively as possible

Slide 39: Acknowledgements
- Software collaborators
  - INFN/Padova: Fabrizio Furano (client-side), Alvise Dorigo
  - Root: Fons Rademakers, Gerri Ganis (security), Bertrand Bellenet (Windows)
  - Alice: Derek Feichtinger, Guenter Kickinger
  - STAR/BNL: Pavel Jackl
  - Cornell: Gregory Sharp
  - SLAC: Jacek Becla, Tofigh Azemoon, Wilko Kroeger
- Operational collaborators: BNL, CNAF, FZK, INFN, IN2P3, RAL, SLAC
- Funding: US Department of Energy Contract DE-AC02-76SF00515 with Stanford University

