Data Federation & Data Management for the CMS Experiment at the LHC
Frank Würthwein, SDSC/UCSD
SC16, 11/16/16
Four experimental collaborations: ATLAS, CMS, LHCb, ALICE.
[Aerial view of the LHC region between Lake Geneva and Mont Blanc, with the four detector sites marked.]
I will restrict myself to CMS.
“Big bang” in the laboratory
We gain insight by colliding protons at the highest energies possible to measure:
- Production rates
- Masses & lifetimes
- Decay rates
From this we derive the “spectroscopy” as well as the “dynamics” of elementary particles.
Progress is made by going to higher energies and more proton-proton collisions per beam crossing.
- More collisions => increased sensitivity to rare events
- More energy => probing higher masses, smaller distances & earlier times
Data Volume
- LHC data taking in 2010-12, 2015-18 & 2021-23; the data-taking rate increased x3 after Run 1.
- CMS = an 80-Mpixel camera taking a “picture” every 25 ns; roughly 10 PB of data produced per second.
- Out of the 40 MHz event rate, 1 kHz is kept in Run 2.
- ~50 PB per year archived to tape, dominated by physics-quality data & simulation after reconstruction; only primary datasets for the entire collaboration are archived.
- Centrally organized production of physics-quality data for the 2000+ physicists in each collaboration (2000+ physicists from 180 institutions in 40 countries).
Data to Manage
- Datasets => distributed globally
- Calibration Releases => distributed globally
- Software Releases => distributed globally
A typical physicist doing data analysis:
- uses custom software & configs on top of a standardized software release
- re-applies some high-level calibrations
- does so uniformly across all primary datasets used in the analysis
- produces private secondary datasets
Global Distribution
Open Science Grid
Largest national contribution is only 24% of total resources.
Software & Calibrations
Both are distributed via systems that use Squid caches.
- Calibrations: the Frontier system is backed by an Oracle DB.
- Software: CVMFS is backed by a filesystem.
Data distribution is achieved via a globally distributed caching infrastructure (a minimal caching sketch follows).
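A minimal sketch of the read-through cache hierarchy idea behind the Squid-based distribution of calibrations and software, assuming a simple two-level layout; the class names (`OriginServer`, `CacheNode`) and the key used are hypothetical illustrations, not the Frontier or CVMFS implementation.

```python
# Illustrative sketch only: a read-through cache hierarchy in the spirit of the
# Squid caches used for Frontier calibrations and CVMFS software.

class OriginServer:
    """Stands in for the central backend (Oracle DB for Frontier, filesystem for CVMFS)."""
    def __init__(self, content):
        self.content = content

    def fetch(self, key):
        return self.content[key]


class CacheNode:
    """One cache layer: serve locally if possible, otherwise ask the next layer up."""
    def __init__(self, upstream):
        self.upstream = upstream   # another CacheNode or the OriginServer
        self.store = {}

    def fetch(self, key):
        if key not in self.store:              # cache miss: go upstream once
            self.store[key] = self.upstream.fetch(key)
        return self.store[key]                 # later requests are served locally


if __name__ == "__main__":
    origin = OriginServer({"calib/run273000": b"...conditions payload..."})
    site_cache = CacheNode(upstream=origin)        # e.g. a site-level Squid
    worker_cache = CacheNode(upstream=site_cache)  # e.g. a per-node cache
    # The first access goes all the way to the origin; repeats never leave the site.
    payload = worker_cache.fetch("calib/run273000")
```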
CMS Ops for the last 6 months
[Plot: daily average core usage over the last 6 months; y-axis from 100,000 to 180,000 cores.]
Routine operations across ~50-100 clusters worldwide.
Google Compute Engine
At SC16: Google = 153.7k jobs, CMS global = 124.9k jobs.
For details, see Burt Holzman, 3:30pm in the Google booth, and
https://cloudplatform.googleblog.com/2016/11/Google-Cloud-HEPCloud-and-probing-the-nature-of-Nature.html
- 500 TB of “PU data” staged into GCE.
- Run simulation & digitization & reconstruction in one step.
- Export output files at end of job to FNAL via xrdcp (see the sketch below).
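A minimal sketch of the end-of-job export step, assuming the output is copied with the xrdcp command-line client; the destination redirector host and path below are hypothetical placeholders, not the actual FNAL endpoint.

```python
# Illustrative sketch only: push a job's output file to FNAL via xrdcp.
import subprocess

def export_output(local_file,
                  dest_prefix="root://xrootd.example.fnal.gov//store/user/out/"):
    # xrdcp <source> <destination>; check=True raises if the copy fails,
    # so the job can retry or report the failure.
    subprocess.run(["xrdcp", local_file, dest_prefix + local_file], check=True)
```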
Dataset Distribution
~50 PB per year
Disk space 2017 = 150 Petabytes
Tape space 2017 = 246 Petabytes
Dataset Distribution Strategies
- Managed pre-staging of datasets to clusters:
  - managed based on human intelligence
  - managed based on “data popularity”
- Data transfer integrated with the processing workflow: determine popularity dynamically based on pending workloads in the WMS (see the sketch after this list).
- Remote file open & reads via the data federation.
- Dynamic caching, just like for calibrations & software.
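A minimal sketch of popularity-driven pre-staging, assuming popularity is simply the number of pending jobs that request each dataset; the threshold, the job-queue structure, and the dataset names are hypothetical illustrations, not the CMS workload-management system.

```python
# Illustrative sketch only: decide which datasets to pre-stage based on
# "popularity" derived from pending workloads.
from collections import Counter

def popularity_from_pending(pending_jobs):
    """Count how many queued jobs request each dataset."""
    return Counter(job["dataset"] for job in pending_jobs)

def datasets_to_prestage(pending_jobs, already_on_disk, min_jobs=100):
    """Pre-stage datasets that many pending jobs want but that are not yet on disk."""
    popularity = popularity_from_pending(pending_jobs)
    return [ds for ds, njobs in popularity.most_common()
            if njobs >= min_jobs and ds not in already_on_disk]

if __name__ == "__main__":
    queue = [{"dataset": "/SingleMuon/Run2016B/AOD"}] * 250 + \
            [{"dataset": "/JetHT/Run2016C/AOD"}] * 40
    print(datasets_to_prestage(queue, already_on_disk=set()))
    # -> ['/SingleMuon/Run2016B/AOD']  (only the popular dataset crosses the threshold)
```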
Any Data, Any Time, Anywhere: Global Data Access for Science
http://arxiv.org/abs/1508.01443
… making the case for WAN reads.
Optimize Data Structure for Partial Reads
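A minimal sketch of why a layout optimized for partial reads matters, assuming a simple column-oriented file in which each quantity is stored contiguously; this is an illustration of the general idea, not the actual CMS/ROOT file format, and all names are hypothetical.

```python
# Illustrative sketch only: with a columnar layout, an analysis that needs only
# a few quantities can request only those byte ranges instead of reading every
# event record in full.
import struct

def write_columnar(path, columns):
    """Write each column contiguously and remember its (offset, length)."""
    index = {}
    with open(path, "wb") as f:
        for name, values in columns.items():
            data = struct.pack(f"{len(values)}d", *values)
            index[name] = (f.tell(), len(data))
            f.write(data)
    return index

def read_column(path, index, name):
    """Partial read: seek to one column's byte range and read only that."""
    offset, length = index[name]
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(length)
    return list(struct.unpack(f"{length // 8}d", data))

if __name__ == "__main__":
    idx = write_columnar("events.bin", {
        "muon_pt": [45.2, 33.1, 61.7],
        "jet_pt":  [120.4, 98.3, 210.0],
    })
    print(read_column("events.bin", idx, "muon_pt"))  # jet_pt bytes are never read
```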
Data Federation for WAN Reads
- Applications connect to a local/regional redirector.
- Redirect upwards only if the file does not exist in the tree below (see the sketch after the diagram).
- Minimizing WAN read access latency this way.
[Diagram: a global XRootd redirector above US and EU regional redirectors, each fanning out to local redirectors and XRootd data servers at many clusters in the US and EU.]
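A minimal sketch of that redirection rule, assuming each redirector first searches the subtree below it and only forwards a lookup to its parent on a miss; the class and method names are hypothetical and are not the XRootd implementation or API.

```python
# Illustrative sketch only: locate a file in a redirector tree, preferring the
# local/regional subtree and going upward (toward the global redirector) only
# when the file is not found below.

class DataServer:
    def __init__(self, files):
        self.files = set(files)

    def locate_below(self, path):
        return self if path in self.files else None


class Redirector:
    def __init__(self, children, parent=None):
        self.children = children   # lower-level redirectors and/or data servers
        self.parent = parent

    def locate_below(self, path):
        # Search only the subtree under this redirector.
        for child in self.children:
            found = child.locate_below(path)
            if found is not None:
                return found
        return None

    def locate(self, path):
        # Entry point for an application connected here: try the tree below
        # first, and only on a miss redirect upwards.
        found = self.locate_below(path)
        if found is not None:
            return found
        return self.parent.locate(path) if self.parent else None


if __name__ == "__main__":
    us_site = DataServer({"/store/data/fileA.root"})
    eu_site = DataServer({"/store/data/fileB.root"})
    global_rdr = Redirector(children=[])
    us_rdr = Redirector([us_site], parent=global_rdr)
    eu_rdr = Redirector([eu_site], parent=global_rdr)
    global_rdr.children = [us_rdr, eu_rdr]
    # A US application only reaches across the WAN after its local subtree misses.
    print(us_rdr.locate("/store/data/fileB.root") is eu_site)  # True
```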
XRootd Data Federation
- Servers can be connected into an arbitrary tree structure.
- An application can connect at any node in the tree.
- The application read pattern is a vector of byte ranges, chosen by the application IO layer for optimized read performance (see the sketch below).
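A minimal sketch of what a “vector of byte ranges” buys the IO layer, assuming nearby ranges get coalesced so that several small branch reads can be served by one larger request; the gap threshold and function name are hypothetical, not the XRootd client API.

```python
# Illustrative sketch only: coalesce a vector of (offset, length) byte ranges
# so that scattered small reads turn into a few larger requests.

def coalesce(ranges, max_gap=64 * 1024):
    """Merge ranges separated by less than max_gap bytes."""
    merged = []
    for off, length in sorted(ranges):
        if merged and off - (merged[-1][0] + merged[-1][1]) <= max_gap:
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, max(prev_len, off + length - prev_off))
        else:
            merged.append((off, length))
    return merged

# An analysis job might request a handful of small, scattered reads:
request = [(0, 4096), (5000, 2048), (10_000_000, 8192)]
print(coalesce(request))
# -> [(0, 7048), (10000000, 8192)]  nearby ranges merge, the distant one stays separate
```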
A Distributed XRootd Cache
- Applications can connect at the local or top-level cache redirector.
- Test the system as an individual or joint cache.
[Diagram: top-level cache redirector in front of the global data federation of CMS, with UCSD and Caltech redirectors and cache servers below.]
Provisioned test systems:
- UCSD: 9 systems x 12 SATA disks of 2 TB, @ 10 Gbps for each system.
- Caltech: 30 SATA disks of 6 TB + 14 SSDs of 512 GB, @ 2x40 Gbps per system.
Production goal: a distributed cache that sustains 10k clients reading simultaneously from the cache at up to 1 MB/s/client without loss of ops robustness.
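For scale, that production goal corresponds to roughly 10,000 clients x 1 MB/s = 10 GB/s, i.e. about 80 Gbps of aggregate read bandwidth, which is comparable to the network capacity provisioned across the test systems above.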
Caching behavior
- Application client requests file open.
- Cache client requests file open from a higher-level redirector if the file is not in the cache.
- Application client requests a vector of byte ranges to read.
- Cache provides the subset of bytes that exist in the cache, and fetches the rest from remote:
  - if simultaneous writes are below a configured threshold, the fetched data is written to the cache;
  - else the fetched data stays in RAM, flows through to the application, and gets discarded.
- Cache client fills in missing pieces of the file while the application processes the requested vector of bytes, as long as simultaneous writes stay below the configured threshold.
A sketch of this decision logic follows.
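A minimal sketch of the read-path decision above, assuming the cache is keyed per byte range and that fetched data is persisted only while the number of simultaneous cache writes stays under a configured threshold; all names and the threshold value are hypothetical, not the XRootd caching implementation.

```python
# Illustrative sketch only: serve cached bytes, fetch the rest remotely, and
# write fetched data to the cache only when write load is below a threshold.

WRITE_THRESHOLD = 32  # hypothetical cap on simultaneous cache writes

class CachedFile:
    def __init__(self, cached_ranges, remote, cache_writes_in_flight):
        self.cached = dict(cached_ranges)        # {offset: bytes already on cache disk}
        self.remote = remote                     # callable: (offset, length) -> bytes
        self.writes_in_flight = cache_writes_in_flight  # callable: () -> int

    def read_vector(self, ranges):
        out = []
        for offset, length in ranges:
            data = self.cached.get(offset)
            if data is None or len(data) < length:
                data = self.remote(offset, length)          # cache miss: fetch remotely
                if self.writes_in_flight() < WRITE_THRESHOLD:
                    self.cached[offset] = data              # persist to cache
                # else: data only passes through RAM to the application
            out.append(data[:length])
        return out

if __name__ == "__main__":
    remote = lambda off, length: b"x" * length              # stand-in for a WAN read
    f = CachedFile({0: b"cached bytes!"}, remote, cache_writes_in_flight=lambda: 3)
    print(f.read_vector([(0, 6), (1024, 4)]))               # [b'cached', b'xxxx']
```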
Initial Performance Tests
[Plots: write and read rates as measured at the NIC, up to ~30 Gbps, with up to 5000 clients reading from 108 SATA disks across 9 servers; for one server, disk IO vs NIC write vs NIC read.]
- By design, the cache does not always involve disk when load gets high.
- NIC write/read >> disk IO => robust serving of clients is more important than cache hits.
Caching at University Clusters
- There are ~80 US universities participating in the LHC; only a dozen have clusters dedicated to and paid for by LHC project funds.
- Ideally, many others would want to use their university shared clusters to do LHC science; data management is the big hurdle to being effective.
- Notre Dame was the first to adopt the XRootd caching technology as a solution to this problem. The ND CMS group uses the 25k+ core ND cluster for their science.
Summary & Conclusions
- The LHC experiments have changed their data management strategies over time.
- Initially there was great distrust of global networks, and thus rather static strategies; multiple FTE-years were spent debugging global end-to-end transfer performance.
- Over time they have become more and more agile, aggressively using caching & remote reads to minimize disk storage costs.