Hardware Resource Needs of CMS: Experiment Start-up in 2005
Ian Willers
Information: CMS participation in MONARC and RD45
Slides: Paolo Capiluppi, Irwin Gaines, Harvey Newman, Les Robertson, Jamie Shiers, Lucas Taylor
2 Contents
- Why is LHC computing different
- The MONARC project and proposed architecture
- An LHC offline computing facility at CERN
- A Regional Centre
- LHC data management
- The Particle Physics Data Grid
- Summary
3 CMS Structure showing sub-detectors
4 Not covered: CMS Software Professionals
- Professional software personnel ramping up to ~33 FTEs (by 2003)
- Engineers support a much larger number of physicist developers (~4 times as many)
- Shortfall: 10 FTEs (1999)
5 Not covered: Cost of Hardware
- ~120 MCHF total computing cost to 2006 inclusive, roughly consistent with the canonical 1/3 : 2/3 rule
- ~40 MCHF (Tier0, central systems at CERN)
- ~40 MCHF (Tier1, ~5 Regional Centres, each ~20% of the central systems)
- ~40 MCHF (?) (Universities, Tier2 centres, MC production, etc.)
- Figures being revised
6 LHC Computing: Different from Previous Experiment Generations
- Geographical dispersion: of people and resources
- Complexity: the detector and the LHC environment
- Scale: Petabytes per year of data
- 1800 physicists, 150 institutes, 32 countries
Major challenges associated with:
- Coordinated use of distributed computing resources
- Remote software development and physics analysis
- Communication and collaboration at a distance
R&D: new forms of distributed systems
7 Comparisons with an LHC-sized experiment in 2006: CMS at CERN [*]
[*] Total CPU: CMS or ATLAS ~ MSI95. Estimates for the disk/tape ratio will change (technology evolution).
8 CPU needs for baseline analysis process (100% efficiency, no AMS overhead)
Activity | Who | Frequency | Response time/pass | Disk I/O (MB/sec) | Total CPU Power (SI95) | Total Disk Storage (TB)
Reconstruction | Experiment | Once/Year | Days | | 116k |
Re-processing | Experiment | 3 times/Year | Month | | 100k |
Re-definition (AOD & TAG) | Experiment | Once/Month | Days | | 190k |
Selection | Groups (20) | Once/Month | Day | | 1k |
Analysis (AOD, TAG & DPD) | Individuals (500) | 4 Times/Day | Hours | | 3k |
Analysis (ESD 1%) | Individuals (500) | 4 Times/Day | 4 Hours | | 52k |
Simulation + Reconstruction | Experiment/Group | ~10^6 events/Day | ~300 Days | 10 | 1050k | ?
Total utilized | | | | | ~1400k | ~580+x
Total installed | | | | | ~1700k | ~700+y
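To make the connection between the table's columns concrete, the sketch below works through the kind of back-of-envelope arithmetic behind it: given an installed CPU power and a per-event processing cost, it estimates how long one pass over a year's data takes. The per-event cost (350 SI95-seconds) and the event count (10^9 events/year, i.e. a ~100 Hz trigger over a ~10^7 s running year) are illustrative assumptions, not figures from the slide; the 100k SI95 is the re-processing entry of the reconstructed table.

```python
# Back-of-envelope sketch (illustrative assumptions, not slide figures):
# how long does one re-processing pass take for a given installed CPU power?

EVENTS_PER_YEAR = 1e9          # assumption: ~100 Hz trigger rate x ~1e7 s of running
COST_PER_EVENT_SI95_S = 350.0  # assumption: SI95-seconds to process one event
INSTALLED_SI95 = 100e3         # re-processing row of the table (100k SI95, response ~ a month)
EFFICIENCY = 1.0               # the table assumes 100% efficiency, no AMS overhead

def pass_duration_days(events, cost_si95_s, power_si95, efficiency=1.0):
    """Time for one full pass over `events`, in days."""
    seconds = events * cost_si95_s / (power_si95 * efficiency)
    return seconds / 86400.0

if __name__ == "__main__":
    days = pass_duration_days(EVENTS_PER_YEAR, COST_PER_EVENT_SI95_S, INSTALLED_SI95)
    print(f"One re-processing pass: ~{days:.0f} days")   # of order a month
```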
9 Major activities foreseen at CERN: reality check (Les Robertson, Jan. '99)
Based on 520,000 SI95; present estimate from Les: 600,000 SI95, plus ...,000 SI95/year
10 MONARC: Common Project
Models Of Networked Analysis At Regional Centers
Caltech, CERN, Columbia, FNAL, Heidelberg, Helsinki, INFN, IN2P3, KEK, Marseilles, MPI Munich, Orsay, Oxford, Tufts
Project goals:
- Develop "Baseline Models"
- Specify the main parameters characterizing the Model's performance: throughputs, latencies
- Verify resource requirement baselines: computing, data handling, networks
Technical goals:
- Define the Analysis Process
- Define Regional Centre architectures and services
- Provide guidelines for the final Models
- Provide a simulation toolset for further Model studies
Model circa 2005:
- CERN: 520k SI95, ... TBytes disk, robot
- FNAL/BNL: 100k SI95, ... TByte disk, robot
- Tier2 centre: 20k SI95, 20 TB disk, robot
- Universities: Univ 1, Univ 2, ... Univ M
- Links: 622 Mbits/s and N x 622 Mbits/s
11 CMS Analysis Model Based on MONARC and ORCA: a "Typical" Tier1 RC
- CPU power: ~100 kSI95
- Disk space: ~200 TB
- Tape capacity: 600 TB, 100 MB/sec
- Link speed to Tier2: 10 MB/sec (1/2 of 155 Mbps)
- Raw data: 5%, 50 TB/year
- ESD data: 100%, 200 TB/year
- Selected ESD: 25%, 10 TB/year [*]
- Revised ESD: 25%, 20 TB/year [*]
- AOD data: 100%, 2 TB/year [**]
- Revised AOD: 100%, 4 TB/year [**]
- TAG/DPD: 100%, 200 GB/year
- Simulated data: 25%, 25 TB/year
[*] Covering five analysis groups, each selecting ~1% of the annual ESD or AOD data for a typical analysis
[**] Covering all analysis groups
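A quick cross-check of these figures is to add up the annual data volumes such a Tier1 would host and compare them with the quoted disk and tape capacities. The sketch below does exactly that sum, using only the numbers listed on this slide.

```python
# Sum the annual data volumes quoted for a "typical" Tier1 Regional Centre
# and compare with its disk (~200 TB) and tape (600 TB) capacities.

tier1_tb_per_year = {
    "raw (5%)":        50.0,
    "ESD (100%)":     200.0,
    "selected ESD":    10.0,
    "revised ESD":     20.0,
    "AOD":              2.0,
    "revised AOD":      4.0,
    "TAG/DPD":          0.2,
    "simulated (25%)": 25.0,
}

DISK_TB = 200.0
TAPE_TB = 600.0

total = sum(tier1_tb_per_year.values())
print(f"Total new data per year: {total:.1f} TB")
print(f"Fits on disk alone: {total <= DISK_TB}")
print(f"Years of data the tape store can hold: {TAPE_TB / total:.1f}")
```

About 310 TB of new data per year: only a working subset can sit on the ~200 TB of disk, with the rest living in the 600 TB tape store.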
12 MONARC Data Hierarchy
- Tier 0: CERN Computer Center, >20 TIPS; the Online System feeds an Offline Farm of ~20 TIPS at ~100 MBytes/sec
- Tier 1: Regional Centers, e.g. Fermilab ~4 TIPS, plus France, Italy and Germany Regional Centers; linked to CERN at ~2.4 Gbits/sec
- Tier 2: Tier2 Centers, ~1 TIPS each; linked at ~622 Mbits/sec (or air freight)
- Tier 3: Institute servers, ~0.25 TIPS; each institute has ~10 physicists working on one or more analysis "channels", and data for these channels should be cached by the institute server (physics data cache)
- Tier 4: Physicists' workstations
Scale: bunch crossings every 25 nsec; 100 triggers per second; each event is ~1 MByte; ~PBytes/sec off the detector
1 TIPS = 25,000 SpecInt95; a PC (1999) = ~15 SpecInt95
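The trigger rate, event size and unit definitions on this slide fix the basic data rates of the model. The short sketch below derives the raw-data rate and the rough PC-equivalent of one TIPS from those numbers (the 10^7 seconds of running per year is a conventional assumption, not a figure from the slide).

```python
# Derive basic rates from the hierarchy slide's figures.

TRIGGER_RATE_HZ  = 100         # 100 triggers per second
EVENT_SIZE_MB    = 1.0         # each event is ~1 MByte
SECONDS_PER_YEAR = 1e7         # assumption: a "physics year" of ~1e7 s of running

TIPS_IN_SI95 = 25_000          # 1 TIPS = 25,000 SpecInt95
PC_1999_SI95 = 15              # a 1999 PC = ~15 SpecInt95

raw_rate_mb_s = TRIGGER_RATE_HZ * EVENT_SIZE_MB
raw_volume_pb = raw_rate_mb_s * SECONDS_PER_YEAR / 1e9   # MB -> PB

print(f"Raw data rate into the offline farm: {raw_rate_mb_s:.0f} MB/s")  # ~100 MB/s, as on the slide
print(f"Raw data per year: ~{raw_volume_pb:.0f} PB")                     # the Petabytes-per-year scale
print(f"1999 PCs per TIPS: ~{TIPS_IN_SI95 / PC_1999_SI95:.0f}")
```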
13 MONARC Analysis Process Example
- Inputs: DAQ/RAW and Slow Control/Calibration data (~20, ~25)
- Individual analysis: a huge number of "small" jobs per day; chaotic activity
- Group-level selection: ~20 large jobs per month; coordinated activity
- Experiment-wide passes: ~4 times per year? (per experiment)
14 Regional Centre
- Data links: network from CERN; network from Tier 2 and simulation centers; tapes in and out
- Core fabric: tape mass storage and disk servers; database servers
- Support services: physics software development, R&D systems and testbeds, info servers, code servers, web servers, telepresence servers, training, consulting, help desk
- Production Reconstruction (Raw/Sim → Rec objs): scheduled, predictable; experiment/physics groups
- Production Analysis (selection: Rec objs → AOD & TAG): scheduled; physics groups
- Individual Analysis (selection: TAG → plots): chaotic; physicists
- Served clients: desktops, Tier 2 centres, local institutes, CERN
15 Offline Computing Facility for CMS at CERN
Purpose of the study: investigate the feasibility of building LHC computing facilities using current cluster architectures and conservative assumptions about technology evolution, looking at:
- scale & performance
- technology
- power
- footprint
- cost
- reliability
- manageability
16 Background & assumptions
Sizing:
- Data are estimates from the experiments
- MONARC analysis group working papers and presentations
Architecture:
- CERN is the Tier0 centre and also acts as a Tier1 centre
- CERN distributed architecture (in the same room and across the site)
- simplest components (hyper-sensitive to cost, aversion to complication)
- throughput (before performance)
- resilience (mostly up all of the time)
- a computing fabric for flexibility and scalability: avoid special-purpose components; everything can do anything (which does not mean that parts are not dedicated to specific applications, periods, ...)
18 Components (1)
Processors:
- the then-current low-end PC server (equivalent of the dual-CPU boards of 1999)
- 4 CPUs, each >100 SI95
- creation of AOD and analysis may need better (more expensive) processors
Assembled into clusters and sub-farms:
- according to practical considerations such as the throughput of the first-level LAN switch, rack capacity, power & cooling, ...
- each cluster comes with a suitable chunk of I/O capacity
LAN:
- no issue: since the computers are high-volume components, the computer-LAN interface is standard (then-current Ethernet!)
- higher layers need higher throughput, but only about a Tbps
Processor cluster (cluster and sub-farm sizing adjusted to fit conveniently the capabilities of network switch, racking and power distribution components):
- basic box: four 100 SI95 processors, standard network connection (~2 Gbps)
- 15% of systems configured as I/O servers (disk server, disk-tape mover, Objy AMS, ...) with an additional connection to the storage network (diagram link rates: 3 Gbps* and 1.5 Gbps towards the farm and storage networks)
- cluster: 9 basic boxes with a network switch (<10 Gbps)
- sub-farm: 4 clusters, with a second-level network switch (<50 Gbps); one sub-farm fits in one rack
- sub-farm: 36 boxes, 144 CPUs, 5 m²
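These building-block numbers make it easy to estimate how large the whole farm would be. The sketch below scales the basic box up to sub-farms and asks how many sub-farms (and how much floor space) the ~520k SI95 CERN centre of the MONARC model would need; the 520k SI95 target is taken from the earlier MONARC slide, everything else from this one.

```python
import math

# Building blocks from the processor-cluster slide.
SI95_PER_CPU         = 100
CPUS_PER_BOX         = 4
BOXES_PER_CLUSTER    = 9
CLUSTERS_PER_SUBFARM = 4
SUBFARM_FOOTPRINT_M2 = 5.0

# Target capacity for the CERN centre (MONARC model circa 2005).
TARGET_SI95 = 520_000

si95_per_box     = SI95_PER_CPU * CPUS_PER_BOX                               # 400 SI95
si95_per_subfarm = si95_per_box * BOXES_PER_CLUSTER * CLUSTERS_PER_SUBFARM   # 14,400 SI95

n_subfarms = math.ceil(TARGET_SI95 / si95_per_subfarm)
n_boxes = n_subfarms * BOXES_PER_CLUSTER * CLUSTERS_PER_SUBFARM

print(f"Sub-farms needed: {n_subfarms}")                              # ~37
print(f"Boxes: {n_boxes}, CPUs: {n_boxes * CPUS_PER_BOX}")
print(f"Floor space: ~{n_subfarms * SUBFARM_FOOTPRINT_M2:.0f} m^2")   # ~185 m^2
```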
20 Components (2)
Disks:
- inexpensive RAID arrays
- capacity per disk limited to ensure a sufficient number of independent accessors (say ~100 GB with the current size of the disk farm)
SAN (Storage Area Network):
- if this market develops into high-volume, low-cost (?), hopefully using the standard network medium
- otherwise use the current model: LAN-connected storage servers instead of special-purpose SAN-connected storage controllers
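The point about limiting disk capacity is that the number of independent disks (accessors) has to keep up with the number of active processes, not just the total number of terabytes. The sketch below makes that trade-off explicit; the total disk capacity (~540 TB) and processor count (~5,200) are rough assumptions derived from the earlier sizing slides, not figures from this one.

```python
import math

# How big may each disk be if we want at least one independent disk
# ("accessor") per active process?  Rough assumptions, not slide figures:
TOTAL_DISK_TB = 540.0      # assumed total disk capacity of the centre
N_PROCESSORS  = 5_200      # assumed: ~520k SI95 at ~100 SI95 per CPU

max_disk_gb = TOTAL_DISK_TB * 1000.0 / N_PROCESSORS
print(f"Largest usable disk size: ~{max_disk_gb:.0f} GB")   # ~100 GB, as the slide suggests

# Conversely: with 100 GB units, how many accessors do we get?
DISK_GB = 100.0
n_disks = math.ceil(TOTAL_DISK_TB * 1000.0 / DISK_GB)
print(f"Number of disks with {DISK_GB:.0f} GB units: {n_disks}")
```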
Disk sub-system
- array: two RAID controllers, dual-attached disks; the controllers connect to the storage network; sizing of the array is subject to the components available (diagram link rates: 0.8 Gbps per array, 5 Gbps towards the storage network)
- rack: an integral number of arrays, with first-level network switches
- In the main model, half-height 3.5" disks are assumed, 16 per shelf of a 19" rack. With space for 18 shelves in the rack (two-sided), half of the shelves are populated with disks, the remainder housing controllers, network switches and power distribution.
- 19" rack, 1 m deep, 1.1 m² with space for doors, 14 TB capacity
- disk size restricted to give a disk count which matches the number of processors (and thus the number of active processes)
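The 14 TB rack capacity follows directly from the shelf and disk numbers quoted above; the short check below reproduces it (assuming ~100 GB per disk, the figure from the previous slide).

```python
# Cross-check the quoted ~14 TB per disk rack.

SHELVES_PER_RACK = 18                        # two-sided 19" rack
DISK_SHELVES     = SHELVES_PER_RACK // 2     # half populated with disks
DISKS_PER_SHELF  = 16                        # half-height 3.5" disks
DISK_CAPACITY_GB = 100                       # assumption carried over from the previous slide

rack_tb = DISK_SHELVES * DISKS_PER_SHELF * DISK_CAPACITY_GB / 1000.0
print(f"Disks per rack: {DISK_SHELVES * DISKS_PER_SHELF}")   # 144
print(f"Capacity per rack: ~{rack_tb:.1f} TB")               # ~14 TB, as on the slide
```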
22 I/O models to be investigated
This is a fabric, so it should support any model.
1 - the I/O server, or Objectivity AMS, model: all I/O requests must pass through an intelligent processor; data passes across the SAN to the I/O server and then across the LAN to the application server
2 - as above, but the SAN and the LAN are the same, or the SAN is accessed via the LAN: all I/O requests pass twice across the LAN, doubling the network rates in the drawing
3 - the global shared file system: no I/O servers or database servers; all data is accessed directly from all application servers; the LAN is the SAN
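To see what the three models imply for the network fabric, the sketch below totals the LAN and SAN traffic each one generates for the same aggregate application I/O rate. The 50 GB/s aggregate rate is an arbitrary illustrative number; the doubling in model 2 is exactly the effect described above.

```python
# Compare LAN/SAN load for the three I/O models, for one aggregate
# application read rate R (illustrative value, not a slide figure).

R_GB_S = 50.0   # aggregate application I/O rate

models = {
    "1: I/O server (AMS), separate SAN": {"SAN": R_GB_S, "LAN": R_GB_S},
    "2: SAN carried over the LAN":       {"SAN": 0.0,    "LAN": 2 * R_GB_S},
    "3: global shared file system":      {"SAN": 0.0,    "LAN": R_GB_S},
}

for name, load in models.items():
    print(f"{name:40s} SAN: {load['SAN']:5.1f} GB/s   LAN: {load['LAN']:5.1f} GB/s")
```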
23 Components (3)
Tapes (unpopular in computer centres; new technology by 2004?):
- conservative assumption: 100 GB per cartridge; 20 MB/sec per drive, with 25% achievable (robot, load/unload, position/rewind, retry, ...)
- let's hope that all of the active data can be held on disk
- tape needed as an archive and for shipping
[Figure: view of part of the magnetic tape vault at CERN's computer centre. STK robot: 6000 cartridges x 50 GB = 300 TB]
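The sketch below turns these conservative tape assumptions into the quantities that matter operationally: effective drive throughput, cartridges per year, and the capacity of the existing STK robot. The 1 PB/year archive volume used for the cartridge count is an assumption (the raw-data scale from the hierarchy slide), not a figure from this slide.

```python
# Operational consequences of the conservative tape assumptions.

CARTRIDGE_GB     = 100    # 100 GB per cartridge (assumed 2004-era media)
DRIVE_MB_S       = 20     # 20 MB/s nominal per drive
DRIVE_EFFICIENCY = 0.25   # only 25% achievable (robot, load/unload, position, retry, ...)

ARCHIVE_PB_PER_YEAR = 1.0 # assumption: ~1 PB/year of raw data to archive

effective_mb_s = DRIVE_MB_S * DRIVE_EFFICIENCY
cartridges = ARCHIVE_PB_PER_YEAR * 1e6 / CARTRIDGE_GB     # PB -> GB, then per cartridge

print(f"Effective throughput per drive: {effective_mb_s:.0f} MB/s")                 # 5 MB/s
print(f"Cartridges per year for {ARCHIVE_PB_PER_YEAR:.0f} PB: {cartridges:,.0f}")   # 10,000

# Today's STK robot for comparison: 6000 cartridges x 50 GB.
print(f"Current STK robot capacity: {6000 * 50 / 1000:.0f} TB")                     # 300 TB
```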
24 Problems?
- We hope the local area network will not be a problem: the CPU-to-I/O requirement is modest, a few Gbps at the computer node, and suitable switches should be available in a couple of years
- The disk system is probably not an issue: buy more disk than we currently predict, to have enough accessors
- Tapes: already talked about that
- Space: OK, thanks to the visionaries of 1970
- Power & cooling: not a problem, but a new cost
25 The real problem
Management:
- installation
- monitoring
- fault determination
- re-configuration
- integration
All this must be fully automated, while retaining simplicity and flexibility.
Make sure the full cost of ownership is considered: the current industry cost of ownership of a PC is 10'000 CHF/year vs. a 3'000 CHF purchase price.
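The ownership-versus-purchase ratio is what drives the automation requirement, and a simple projection makes the scale clear. The sketch below applies the slide's per-PC figures to a farm of a given size over a few years; the farm size (~1,300 boxes, the order of the sub-farm estimate above) and the hoped-for automated cost are assumptions, while the 3'000 CHF purchase and 10'000 CHF/year industry figures come from the slide.

```python
# How much does "industry-style" cost of ownership hurt at farm scale?
# Purchase price and industry cost/year are the slide's figures;
# the farm size and the automated-ownership target are assumptions.

PURCHASE_CHF            = 3_000
INDUSTRY_OWNERSHIP_CHF  = 10_000   # per PC per year, unautomated
AUTOMATED_OWNERSHIP_CHF = 1_000    # hypothetical target with fully automated management

N_BOXES = 1_300                    # assumed farm size
YEARS   = 5

def total_cost(n_boxes, years, ownership_per_year):
    """Purchase plus cumulative ownership cost for the whole farm, in CHF."""
    return n_boxes * (PURCHASE_CHF + years * ownership_per_year)

print(f"Industry-style ownership over {YEARS} years: "
      f"{total_cost(N_BOXES, YEARS, INDUSTRY_OWNERSHIP_CHF) / 1e6:.0f} MCHF")
print(f"With automated management: "
      f"{total_cost(N_BOXES, YEARS, AUTOMATED_OWNERSHIP_CHF) / 1e6:.0f} MCHF")
```

At this scale an unautomated, industry-style cost of ownership would dwarf the hardware budget, which is the slide's point.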
27 Regional Centre dataflow
The same Regional Centre architecture, now annotated with the dataflow:
- Data import and data export via the networks from CERN and from Tier 2 and simulation centers, and via tapes
- Robotic mass storage feeds a central disk cache on the mass storage & disk servers and database servers; local disk caches sit with the processing farms
- Production Reconstruction (Raw/Sim → Rec objs): scheduled, predictable; experiment/physics groups
- Production Analysis (selection: Rec objs → TAG): scheduled; physics groups
- Individual Analysis (selection: TAG → plots): chaotic; physicists
- Output to desktops, Tier 2 centres, local institutes and CERN; support services as before (physics software development, R&D systems and testbeds, info/code/web/telepresence servers, training, consulting, help desk)
28 LHC Data Management
The 4 experiments together mean:
- >10 PB/year in total
- 100 MB/s to 1.5 GB/s
- ~20 years of running
- ~5000 physicists
- ~250 institutes
- ~500 concurrent analysis jobs, 24x7
Solutions must:
- work at CERN and outside
- be scalable from 1-10 users to ...
- support laptops through large servers with 100 GB-1 TB and HSM
- cover everything from MB/GB/TB? (private data) to many PB
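These headline numbers compound over the lifetime of the machine, which is what makes the data management problem hard. The sketch below is a trivial projection of the total archive and the sustained rates, using only the figures on this slide.

```python
# Project the LHC-wide data management scale from the headline figures.

PB_PER_YEAR   = 10        # >10 PB/year across the 4 experiments
YEARS_RUNNING = 20        # ~20 years of running
RATE_MB_S_MIN = 100       # sustained rates quoted on the slide
RATE_GB_S_MAX = 1.5

total_pb = PB_PER_YEAR * YEARS_RUNNING
print(f"Archive over the experiment lifetime: >{total_pb} PB")
print(f"Sustained data rates: {RATE_MB_S_MIN} MB/s to {RATE_GB_S_MAX} GB/s, 24x7")
print("Served to ~5000 physicists at ~250 institutes, ~500 concurrent analysis jobs")
```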
29 Objectivity/Database Architecture: CMS baseline solution
- Application host: application + Objectivity client
- Application & data server: application, Objectivity client and Objectivity server on the same host
- Data server + HSM: Objectivity server with an HSM client, talking to an HSM server that manages the files
- Objectivity lock server: can run on any host in the network
30 Objectivity/Database Architecture: CMS baseline solution
Object ID (OID): 8 bytes
- 64K databases (files) on any host in the network
- 32K containers per database
- 64K logical pages per container (4 GB per container for a 64 KB page size; 0.5 GB for an 8 KB page size)
- 64K object slots per page
Theoretical limit: 10,000 PB (= 10 EB), assuming database files of 128 TB
Maximum practical file size ~10 GB (time to stage: seconds; tape capacity)
Pending architectural changes for VLDBs: multi-file databases (e.g. map containers to files)
Storage hierarchy: Federation → Database → Container → Page → Object
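The addressing limits follow directly from how the 8-byte OID is split into fields, and it is worth seeing the arithmetic once. The sketch below multiplies the field ranges out; it reproduces the ~128 TB per database file and the theoretical federation size of order 10,000 PB quoted above (the field widths are taken as stated on the slide).

```python
# Work out the Objectivity addressing limits from the OID fields
# as quoted on the slide: 64K databases, 32K containers, 64K pages, 64 KB page.

DATABASES   = 64 * 1024    # databases (files) per federation
CONTAINERS  = 32 * 1024    # containers per database
PAGES       = 64 * 1024    # logical pages per container
PAGE_SIZE_B = 64 * 1024    # 64 KB page size

container_bytes  = PAGES * PAGE_SIZE_B
database_bytes   = CONTAINERS * container_bytes
federation_bytes = DATABASES * database_bytes

TB, PB = 1e12, 1e15
print(f"Container:  {container_bytes / 1e9:.0f} GB")    # ~4 GB, as on the slide
print(f"Database:   {database_bytes / TB:.0f} TB")      # ~141 TB decimal = 128 TiB, the slide's "128 TB"
print(f"Federation: {federation_bytes / PB:,.0f} PB")   # ~9,200 PB, of order the quoted 10,000 PB (~10 EB)
```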
31 Particle Physics Data Grid (PPDG)
DOE/NGI Next Generation Internet project: ANL, BNL, Caltech, FNAL, JLAB, LBNL, SDSC, SLAC, U.Wisc/CS
First-year goal: optimized cached read access to 1-10 GBytes, drawn from a total data set of order one Petabyte
- Site-to-Site Data Replication Service: primary site (data acquisition, CPU, disk, tape robot) to secondary site (CPU, disk, tape robot) at 100 MBytes/sec
- Multi-Site Cached File Access Service: primary site (DAQ, tape, CPU, disk, robot), satellite sites (tape, CPU, disk, robot), universities (CPU, disk, users) and university users
32 PPDG: Architecture for Reliable High-Speed Data Delivery
- Object-based and file-based application services
- Cache manager
- File access service
- Matchmaking service
- Cost estimation
- File fetching service
- File replication index
- End-to-end network services
- Mass storage manager
- Resource management
- File mover
- Site boundary / security domain
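To give a feel for how these pieces fit together, here is a small, purely illustrative sketch of the decision the matchmaking and cost-estimation services make: satisfy a read from the local cache if possible, otherwise consult the replication index and fetch from the replica with the lowest estimated cost. All names, sites and cost numbers are hypothetical; PPDG's actual interfaces are not defined on this slide.

```python
# Illustrative sketch of cached, cost-driven replica selection (hypothetical API).

local_cache = {"run42/events.db"}                 # files already cached at this site

replica_index = {                                 # file -> sites holding a replica
    "run42/esd.db": ["CERN", "FNAL", "SLAC"],
}

def estimated_cost(site: str) -> float:
    """Hypothetical cost estimate: seconds to deliver a 2 GB file over made-up link speeds."""
    link_mb_s = {"CERN": 10.0, "FNAL": 30.0, "SLAC": 20.0}
    file_size_mb = 2_000.0
    return file_size_mb / link_mb_s[site]

def fetch(filename: str) -> str:
    """Return where the file will be read from."""
    if filename in local_cache:
        return "local cache"
    sites = replica_index.get(filename, [])
    if not sites:
        raise FileNotFoundError(filename)
    best = min(sites, key=estimated_cost)   # cheapest replica wins
    local_cache.add(filename)               # cache it for subsequent reads
    return f"replica at {best}"

print(fetch("run42/events.db"))             # -> local cache
print(fetch("run42/esd.db"))                # -> replica at FNAL (lowest estimated cost)
```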
33 Distributed Data Delivery and LHC Software Architecture
Architectural flexibility: the GRID will allow resources to be used efficiently
- I/O requests known up-front; data driven; respond to an ensemble of changing cost estimates
- Code movement as well as data movement
- Loosely coupled, dynamic: e.g. an agent-based implementation
34 Summary - Data Issues
- Development of a robust PB-scale networked data access and analysis system is mission-critical
- An effective partnership exists, HENP-wide, through many R&D projects
- An aggressive R&D programme is required to develop the necessary systems for reliable data access, processing and analysis across a hierarchy of networks
- Solutions could be widely applicable to data problems in other scientific fields and industry by LHC start-up
35 Conclusions
- CMS has a first-order estimate of the needed resources and costs
- CMS has identified the key issues concerning these resources
- CMS is doing a lot of focused R&D work to refine the estimates
- A lot of integration work is needed for the software and hardware architecture
- We have positive feedback from different institutions on regional centres and development