Standard Interfaces to Grid Storage: DPM and LFC Update
Ricardo Rocha, Alejandro Alvarez (on behalf of the LCGDM team)
Grid Technology, CERN IT Department, CH-1211 Geneva 23, Switzerland
EMI INFSO-RI
Main Goals
Provide a lightweight, grid-aware storage solution
Simplify life of users and administrators
Improve the feature set and performance
Use standard protocols
Use standard building blocks
Allow easy integration with new tools and systems
Architecture (Reminder)
Separation between metadata and data access
– Direct data access to/from Disk Nodes
Strong authentication / authorization
Multiple access protocols
[Diagram: CLIENT, HEAD NODE and DISK NODEs; service labels: NS, DPM, GRIDFTP, NFS, HTTP/DAV, XROOT, RFIO]
Deployment & Usage
DPM is the most widely deployed grid storage system
– Over 200 sites in 50 regions
– Over 300 VOs
– ~36 PB (10 sites with > 1 PB)
LFC enjoys wide deployment too
– 58 instances at 48 sites
– Over 300 VOs
Software Availability
DPM and LFC are available via gLite and EMI
– But gLite releases stopped at
From on we’re Fedora compliant for all components, available in multiple repositories
– EMI1, EMI2 and UMD
– Fedora / EPEL
Latest production release is
Some packages will never make it into Fedora
– YAIM (Puppet? See later…)
– Oracle backend
System Evaluation
~1.5 years ago we performed a full system evaluation
– Using PerfSuite, our testing framework
– (results presented later are obtained using this framework too)
It showed the system had serious bottlenecks
– Performance
– Code maintenance (and complexity)
– Extensibility
Dependency on NS/DPM daemons
All calls to the system had to go via the daemons
– Not only user / client calls
– Also the case for our frontends (HTTP/DAV, NFS, XROOT, …)
– Daemons were a bottleneck, and did not scale well
Short term fix (available since 1.8.2)
– Improve TCP listening queue settings to prevent timeouts
– Increase number of threads in the daemon pools
    Previously statically defined to a rather low value
Medium term (available since 1.8.4, with DMLite)
– Refactor the daemon code into a library
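The short term fix comes down to two knobs: the size of the TCP accept queue and the number of worker threads. The following is a minimal Python sketch of both, not the daemons' actual code; the function names and the values 128 and 4 are assumptions chosen for illustration.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor


def make_listener(backlog, host="127.0.0.1"):
    """Open a TCP listener whose accept queue holds up to `backlog` connections."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, 0))   # port 0: let the OS pick a free port
    srv.listen(backlog)   # a longer queue absorbs bursts instead of timing clients out
    return srv


def handle(conn):
    """Toy request handler: echo the request back in upper case."""
    with conn:
        conn.sendall(conn.recv(64).upper())


def serve(srv, pool, n_requests):
    """Accept n_requests connections and hand each one to the worker pool."""
    for _ in range(n_requests):
        conn, _addr = srv.accept()
        pool.submit(handle, conn)


def run_demo(n_clients=8, workers=4, backlog=128):
    """Serve n_clients requests with a configurable pool size and backlog."""
    srv = make_listener(backlog)
    port = srv.getsockname()[1]
    replies = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        acceptor = threading.Thread(target=serve, args=(srv, pool, n_clients))
        acceptor.start()
        for _ in range(n_clients):
            with socket.create_connection(("127.0.0.1", port)) as client:
                client.sendall(b"ping")
                replies.append(client.recv(64))
        acceptor.join()
    srv.close()
    return replies
```

Calling `run_demo()` returns one reply per client. On Linux the effective backlog is additionally capped by the `net.core.somaxconn` kernel setting, so raising the value passed to `listen()` alone may not be enough.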
GET asynchronous performance
DPM used to mandate asynchronous GET calls
– Introduces significant client latency
– Useful when some preparation of the replica is needed
– But this wasn’t really our case (disk only)
Fix (available with 1.8.3)
– Allow synchronous GET requests
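To make the latency argument concrete, the sketch below counts client round trips for the two styles. `FakeDpm`, its methods and the host name are hypothetical stand-ins written for this illustration; they are not the DPM API.

```python
class FakeDpm:
    """Hypothetical stand-in for the server; counts client round trips."""

    def __init__(self, polls_until_ready=3):
        self.round_trips = 0
        self._polls_left = polls_until_ready

    def _replica(self, path):
        return "disk01.example.org:" + path

    def get_async(self, path):
        """Queue the request and return a token the client must poll with."""
        self.round_trips += 1
        return ("token", path)

    def status(self, token):
        """Return the replica location once the request is ready, else None."""
        self.round_trips += 1
        self._polls_left -= 1
        return self._replica(token[1]) if self._polls_left <= 0 else None

    def get_sync(self, path):
        """Answer in a single round trip (fine when no preparation is needed)."""
        self.round_trips += 1
        return self._replica(path)


def fetch_async(server, path):
    """Submit, then poll until the replica location is available."""
    token = server.get_async(path)
    while True:
        location = server.status(token)
        if location is not None:
            return location


def fetch_sync(server, path):
    """Single call that returns the replica location directly."""
    return server.get_sync(path)
```

With the default of three polls, the asynchronous path costs four round trips against one for the synchronous path; a real client would also sleep between polls, which adds to the latency.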
Database Access
No DB connection pooling, no bind variables
– DB connections were linked to daemon pool threads
– DB connections would be kept for the whole life of the client
Quicker fix
– Add DB connection pooling to the old daemons
– Good numbers, but needs extensive testing…
Medium term fix (available since for HTTP/DAV)
– DMLite, which includes connection pooling
– Among many other things…
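The two missing techniques can be sketched in a few lines. The example below uses SQLite from the Python standard library purely so that it is self-contained; the table name `file_metadata`, the pool class and the helper functions are assumptions for illustration and do not reflect the DPM schema or the DMLite implementation.

```python
import os
import queue
import sqlite3
import tempfile
from contextlib import contextmanager


class ConnectionPool:
    """Fixed set of connections shared by callers instead of one per client."""

    def __init__(self, database, size):
        self._idle = queue.Queue()
        for _ in range(size):
            # check_same_thread=False lets a pooled connection move between threads
            self._idle.put(sqlite3.connect(database, check_same_thread=False))

    @contextmanager
    def connection(self):
        conn = self._idle.get()      # blocks while every connection is busy
        try:
            yield conn
        finally:
            self._idle.put(conn)     # hand it back as soon as the call is done


def make_demo_pool(size=2):
    """Create a throwaway database with one hypothetical namespace entry."""
    path = os.path.join(tempfile.mkdtemp(), "ns.db")
    pool = ConnectionPool(path, size)
    with pool.connection() as conn:
        conn.execute("CREATE TABLE file_metadata (fileid INTEGER, name TEXT)")
        conn.execute("INSERT INTO file_metadata VALUES (?, ?)", (1, "data.root"))
        conn.commit()
    return pool


def lookup(pool, name):
    """Look up a file id using a bind variable (the `?` placeholder)."""
    with pool.connection() as conn:
        row = conn.execute(
            "SELECT fileid FROM file_metadata WHERE name = ?", (name,)
        ).fetchone()
    return None if row is None else row[0]
```

A connection is held only for the duration of one call, and the bind variable keeps the statement text constant, so the database can reuse the parsed statement and the value is never interpreted as SQL.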
Dependency on the SRM
SRM imposes significant latency for data access
– It has its use cases, but is a killer for regular file access
– For data access, only required for protocols not supporting redirection (file name to replica translation)
Fix (all available from 1.8.4)
– Keep SRM for space management only (usage, reports, …)
– Add support for protocols natively supporting redirection
    HTTP/DAV, NFS 4.1/pNFS, XROOT
    And promote them widely…
Investigating GridFTP redirection support (seems possible!)
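Redirection means the protocol itself performs the file name to replica translation, so no separate SRM call is needed. The sketch below models this with an HTTP-style 302 answer; the replica table, host names and function names are made up for the example.

```python
# Hypothetical catalog: logical file name -> replica URL on a disk node.
REPLICAS = {
    "/dpm/example.org/home/atlas/data.root": "http://disk01.example.org/pool1/0001",
}


def head_node(path):
    """Answer a request for a logical file name with a redirect to a replica."""
    replica = REPLICAS.get(path)
    if replica is None:
        return 404, {}
    return 302, {"Location": replica}


def disk_node(url):
    """Pretend to serve the data; report which host answered."""
    return 200, {"Served-By": url.split("/")[2]}


def client_get(path):
    """Ask the head node, then follow the redirect to the disk node."""
    status, headers = head_node(path)
    if status == 302:
        return disk_node(headers["Location"])
    return status, headers
```

The client only ever names the logical file; the head node answers with the location and the data then flows directly from the disk node.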
Other recent activities…
ATLAS LFC consolidation effort
– There was an instance at each T1 center
– They have all been merged into a single CERN one
– Similar effort was done at BNL for the US sites
HTTP Based Federations
– Lots of work in this area too…
– But there will be a separate forum event on this topic
Puppet and Nagios for easy system administration
– Available since for DPM
– Working on new manifests for DMLite based setups
– And adding LFC manifests, to be tested at CERN
Future Proof with DMLite
DMLite is our new plugin based library
Meets goals resulting from the system evaluation
– Refactoring of the existing code
– Single library used by all frontends
– Extensible, open to external contributions
– Easy integration of standard building blocks
    Apache2, HDFS, S3, …
DMLite: Interfaces
Plugins implement one or multiple interfaces
– Depending on the functionality they provide
Plugins are stacked, and called LIFO
– You can load multiple plugins for the same functionality
APIs in C/C++/Python, plugins in C++ (Python soon)
[Diagram: Namespace domain (Catalog, INode); Pool domain (PoolManager, PoolDriver, PoolHandler); I/O domain (IODriver, IOHandler); User domain (UserGroupDb)]
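The LIFO stacking can be pictured with a small Python sketch. It is an illustration of the idea only: `PluginStack`, `LegacyCatalog` and `FastCatalog` are hypothetical names, and the real DMLite plugin loading and dispatch are not shown here.

```python
class PluginStack:
    """Plugins are loaded in order; the most recently loaded is asked first."""

    def __init__(self):
        self._plugins = []

    def load(self, plugin):
        self._plugins.append(plugin)

    def call(self, method, *args):
        for plugin in reversed(self._plugins):   # LIFO: last loaded, first tried
            impl = getattr(plugin, method, None)
            if impl is not None:
                return impl(*args)
        raise NotImplementedError(method)


class LegacyCatalog:
    """Hypothetical full-featured plugin at the bottom of the stack."""

    def stat(self, path):
        return ("legacy", path)

    def read_link(self, path):
        return ("legacy", path)


class FastCatalog:
    """Hypothetical newer plugin that only implements part of the interface."""

    def stat(self, path):
        return ("fast", path)


def build_stack():
    stack = PluginStack()
    stack.load(LegacyCatalog())   # loaded first, so it acts as the fallback
    stack.load(FastCatalog())     # loaded last, so it is asked first
    return stack
```

With this stack, `call("stat", ...)` is answered by the plugin loaded last, while `call("read_link", ...)` falls through to the one loaded first.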
DMLite Plugin: Legacy
Interacts directly with the DPNS/DPM/LFC daemons
– Simply redirects calls using the existing NS and DPM APIs
For full backward compatibility
– Both for namespace and pool/filesystem management
DMLite Plugin: MySQL
Refactoring of the MySQL backend
– Properly using bind variables and connection pooling
– Huge performance improvements
Namespace traversal comes from Built-in Catalog
Proper stack setup provides fallback to Legacy Plugin
DMLite Plugin: Oracle
Refactoring of the Oracle backend
What applies to the MySQL one applies here
– Better performance with bind variables and pooling
– Namespace traversal comes from Built-in Catalog
– Proper stack setup provides fallback to Legacy Plugin
DMLite Plugin: Memcache
Memory cache for namespace requests
– Reduced load on the database
– Much improved response times
– Horizontal scalability
Can be put over any other Catalog implementation
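Putting a cache "over" another Catalog is a plain wrapper pattern. In the sketch below a Python dict stands in for the memcached server, and both classes are hypothetical; it shows the layering, not the plugin's actual code.

```python
class SlowCatalog:
    """Hypothetical database-backed catalog; counts the queries it receives."""

    def __init__(self, entries):
        self._entries = entries
        self.queries = 0

    def stat(self, path):
        self.queries += 1
        return self._entries[path]


class CachingCatalog:
    """Wraps any object with a stat() method and remembers its answers."""

    def __init__(self, inner, cache=None):
        self._inner = inner
        self._cache = {} if cache is None else cache   # dict stands in for memcached

    def stat(self, path):
        if path not in self._cache:
            self._cache[path] = self._inner.stat(path)  # miss: ask the wrapped catalog
        return self._cache[path]

    def invalidate(self, path):
        """Drop a cached entry, e.g. after the file has been modified."""
        self._cache.pop(path, None)
```

Because the wrapper only relies on the `stat()` method, it can sit on top of any catalog implementation, and repeated lookups of the same path reach the database once.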
DMLite Plugin: Hadoop/HDFS
First new pool type
HDFS pool can coexist with legacy pools, …
– In the same namespace, transparent to frontends
All HDFS goodies for free (auto data replication, …)
Catalog interface coming soon
– Exposing the HDFS namespace directly to the frontends
DMLite Plugin: S3
Second new pool type
Again, can coexist with legacy pools, HDFS, …
– In the same namespace, transparent to frontends
Main goal is to provide additional, temporary storage
– High load periods, user analysis before big conferences, …
– Evaluated against Amazon, now looking at Huawei and OpenStack
DMLite Plugin: VFS
Third new pool type (currently in development)
Exposes any mountable filesystem
– As an additional pool in an existing namespace
– Or directly exposing that namespace
Think Lustre, GPFS, …
DMLite Plugins: Even more…
Librarian
– Replica failover and retrial
– Used by the HTTP/DAV frontend for a Global Access Service
Profiler
– Boosted logging capabilities
– For every single call, logs response times per plugin
HTTP based federations
ATLAS Distributed Data Management (DDM)
– First external plugin
– Currently under development
– Will expose central catalogs via standard protocols
Writing plugins is very easy…
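Replica failover and retrial, as named for the Librarian plugin, can be sketched as a loop over the known replicas. The function, the exception type and the host names below are assumptions made for this illustration; the plugin's real behaviour may differ.

```python
class ReplicaError(Exception):
    """Raised when a single replica cannot be read."""


def read_with_failover(replicas, fetch, retries_per_replica=1):
    """Try each replica in order, retrying before moving on to the next one."""
    errors = []
    for replica in replicas:
        for _ in range(1 + retries_per_replica):
            try:
                return replica, fetch(replica)
            except ReplicaError as exc:
                errors.append((replica, str(exc)))
    raise ReplicaError("all replicas failed: %r" % (errors,))


def flaky_fetch(replica):
    """Toy transport: hosts whose name starts with 'down' always fail."""
    if replica.startswith("down"):
        raise ReplicaError("connection refused")
    return b"payload from " + replica.encode()
```

Given the replica list `["down1.example.org", "disk02.example.org"]`, the first host is tried twice and the read then succeeds from the second.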
DMLite Plugins: Development
DMLite: Demo Time
Standard Frontends
Standards based access to DPM is already available
– HTTP/DAV and NFS4.1 / pNFS
– But also XROOT (useful in the HEP context)
Lots of recent work to make them performant
– Many details were already presented
– But we needed numbers to show they are a viable alternative
We now have those numbers!
Frontends: HTTP / DAV
Frontend based on Apache2 + mod_dav
In production since
– Working with PES on deployment in front of the CERN LFC too
Can be used for both get/put style (=GridFTP) or direct access
– Some extras for full GridFTP equivalence
    Multiple streams with Range/Content-Range
    Third party copies using WebDAV COPY + Gridsite Delegation
– Random I/O
    Possible to do vector reads and other optimizations
Metalink support (failover, retrial)
With it is already DMLite based
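Both the multi-stream transfers and the vector reads rest on the standard HTTP `Range` header. The sketch below only builds the header values; the helper names are assumptions, and no claim is made about how the DPM clients split their requests.

```python
def split_ranges(size, n_streams):
    """Split `size` bytes into at most n_streams contiguous, inclusive ranges."""
    chunk = -(-size // n_streams)   # ceiling division
    ranges = []
    start = 0
    while start < size:
        end = min(start + chunk, size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_header(ranges):
    """Format inclusive (first, last) byte pairs as an HTTP Range header value."""
    return "bytes=" + ",".join("%d-%d" % r for r in ranges)
```

For a multi-stream download each stream sends one pair, for example `range_header([(0, 3)])` gives `bytes=0-3`. A vector read puts several pairs into a single request, for example `bytes=0-99,500-599`, which a server supporting multiple ranges answers with a 206 response of type `multipart/byteranges`.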
Frontends: NFS 4.1 / pNFS
Direct access to the data, with a standard NFS client
Available with DPM (read only)
– Write enabled early next year
– Not yet based on DMLite
Implemented as a plugin to the Ganesha server
Only Kerberos authentication for now
– Issue with client X509 support in Linux (server ready though)
– We’re investigating how to add this
DMLite based version in development
Frontends: XROOTD
Not really a standard, but widely used in HEP
Initial implementation in 2006
– No multi-VO support, limited authz, performance issues
New version 3.1 (rewrite) available with
– Multi-VO support
– Federation aware (already in use in ATLAS FAX federation)
– Strong auth/authz with X509, but ALICE token still available
Based on the standard XROOTD server
– Plugins for XrdOss, XrdCmsClient and XrdAccAuthorize
Soon also based on DMLite (version 3.2)
Frontends: Random I/O Performance
HTTP/DAV vs XROOTD vs RFIO
– Soon adding NFS 4.1 / pNFS to the comparison
LAN / Chunk Size: / File Size:
Protocol  N. Reads  Read Size  Read Time
HTTP      500       22,773,
HTTP      1000      46,027,
XROOT     500       22,773,
XROOT     1000      46,027,
RFIO      500       22,773,
RFIO      1000      46,027,
Frontends: Random I/O Performance
HTTP/DAV vs XROOTD vs RFIO
– Soon adding NFS 4.1 / pNFS to the comparison
LAN / Chunk Size: / File Size: / 5000 Reads
Protocol  Max. Vector  Read Size   Read Time
HTTP      8            1,166,613,
HTTP      16           2,156,423,
HTTP      24           3,211,861,
HTTP      32           4,226,877,
HTTP      64           8,535,839,
XROOT     8            1,166,613,
XROOT     16           2,156,423,
XROOT     24           3,211,861,
XROOT     32           4,226,877,
XROOT     64           8,535,839,
Frontends: Random I/O Performance
HTTP/DAV vs XROOTD vs RFIO
– Soon adding NFS 4.1 / pNFS to the comparison
WAN / Chunk Size: / File Size:
Protocol  N. Reads  Read Size  Read Time
HTTP      500       22,773,
HTTP      1000      46,027,
XROOT     500       22,773,
XROOT     1000      46,027,
RFIO      500       22,773,112
RFIO      1000      46,027,143
Frontends: Hammercloud
We’ve also run a set of Hammercloud tests
– Using ATLAS analysis jobs
– These are only a few of all the metrics we have
[Table: Events Athena(s), Event Rate(s) and Job Efficiency for Remote HTTP, Remote HTTP (TTreeCache), Staging HTTP, Remote XROOT, Remote XROOT (TTreeCache), Staging XROOT, Staging GridFTP]
Performance: Showdown
Big thanks to ShuTing and ASGC
– For doing a lot of the testing and providing the infrastructure
First recommendation is to phase down RFIO
– No more development effort on it from our side
HTTP vs XROOTD
– Performance is equivalent, up to sites/users to decide
– But we like standards… there’s a lot to gain with them
Staging vs Direct Access
– Staging not ideal… requires lots of extra space on the WN
– Direct Access is performant if used with ROOT TTreeCache
Summary & Outlook
DPM and LFC are in very good shape
– Even more lightweight, much easier code maintenance
– Open, extensible to new technologies and contributions
DMLite is our new core library
Standards, standards, …
– Protocols and building blocks
– Deployment and monitoring
– Reduced maintenance, free clients, community help
DPM Community Workshops
– Paris December 2012
– Taiwan 17 March 2013 (ISGC workshop)
A DPM Collaboration is being set up