1
The CMS-HI Computing Plan
Charles F. Maguire, Vanderbilt University, for the CMS-HI Group
Post-review version with back-up slide comparisons
June 2, 2010, DOE-NP On-Site Review at Vanderbilt
2
Outline of the Review Talks
- Dennis Hall: Support of Vanderbilt University for CMS-HI
- Bolek Wyslouch: Overview of CMS-HI Physics Plans
- Charles Maguire: Detailed View of CMS-HI Computing
- Markus Klute: T0 Operations in Support of CMS-HI
- Lothar Bauerdick: FNAL T1 Operations for CMS-HI
- Raphael Granier de Cassagnac: Contribution of non-US T2s
- Edward Wenger: HI Software Status Within CMS Model
- Alan Tackett: The Role of ACCRE in CMS-HI Computing
- Esfandiar Zafar: ITS Support of ACCRE in CMS-HI
- Alan Tackett: An Inspection Tour of the Proposed T2 Site
3
Summary of Updated Computing Proposal
- CMS-HI Computing Plans Follow Very Closely the CMS Model
  - Extensive use of world-wide CMS computing resources and working groups
  - Continuous oversight by upper level CMS computing management
- A Complete Prompt Reconstruction Pass to be Done at the T0
- Alignment, Calibration, and On-Line DQM at the T0
- Archival Storage at the T0 and Standard Transfer to FNAL T1 Site
- Secondary Archival Storage at the FNAL T1
- Transfer of Files to Disk Storage at a New Vanderbilt T2 Site
- Vanderbilt Site Will:
  - Perform analysis passes on the prompt reco data set
  - Distribute reco and analyzed data sets to MIT and to four non-US T2 HI sites
  - Complete multiple reconstruction and analysis re-passes on the data sets
  - Contain a T3 component which will host US CMS-HI participants (MIT already does)
- An Enhanced Role of the MIT HI Site for Simulation and Analyses
- Non-US T2 Sites Contribute a Significant Fraction of Analysis Base
4
What is New and Different This Year
- Sustained Use of the T0 for Prompt Reconstruction
  - Confirmed by new, rigorous simulations monitoring time and memory use
  - Each annual HI data set can be promptly reconstructed in time at the T0: just a few days for the 2010 HI min bias data; 18 - 25 days for later HLT data
- Standard Transfer of HI Files from the T0 to the FNAL T1
  - Supervised by the T0 data operations group; simply one more month of duty
  - Further details are contained in Markus Klute's presentation
  - Strongly recommended by CMS computing management at the Bologna workshop
  - No adverse impact is seen for the pp program
  - More description is found in Lothar Bauerdick's presentation
- CPU Requirements for CMS-HI Calibrated in HS06 Units
  - Standard measure for the LHC experiments used throughout CMS
  - CMS-HI software all runs in the latest, official CMSSW framework
  - Large investment of effort in determining processing times, memory use, and file sizes
- Careful Inventory of Potential non-US T2 HI Contributions
  - Full accounting is tabulated in Raphael Granier de Cassagnac's presentation
5
Overview of HI Computing: DAQ and T0
- Anticipated Start of HI Running
  - Change to Pb + Pb acceleration expected on November 1
  - Two weeks of set-up time
  - Physics data taking expected to begin in mid-November
  - Pb + Pb running for 4 weeks
- Special DAQ Considerations (see Markus Klute's talk)
  - The SST, ECAL, and HCAL are read out without zero suppression, leading to ~12 MBytes/event
  - Zero suppression is done off-line at the T0, leading to ~3 MB per raw event
  - Work flows and CPU requirements for this task are to be completed in summer 2010
- Data Processing
  - AlCa and DQM to be performed on-line, as in the pp running
    - DQM for HI is being advanced by Julia Velkovska (at Vanderbilt) and Pelin Kurt (at CERN)
    - A "live" demonstration during pp running is scheduled at the post-review ROC tour
  - Prompt reco of the zero-suppressed files at the T0, < 7 days total
    - See the next set of slides for how the CPU requirements were determined
    - Prompt reco output is ~1.9 MByte/event; in total, ~400 TB of raw data and prompt reco files are transferred to the FNAL T1 (a quick cross-check follows below)
  - Processing after FNAL should be at Vanderbilt (but see the back-up "Plan B" later)
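As a quick cross-check of the ~400 TB transfer volume quoted above, the per-event sizes and the 80M-event upper estimate can simply be multiplied out. The snippet below is only that back-of-the-envelope arithmetic in decimal units; it is not part of the official CMS bookkeeping.

```python
# Back-of-the-envelope check of the 2010 raw + prompt-reco volume quoted above.
# Inputs are the numbers on this slide; decimal units (1 TB = 1e6 MB) are assumed.

EVENTS_2010 = 80e6          # upper estimate of 2010 min-bias events
RAW_MB_PER_EVENT = 3.0      # raw event size after off-line zero suppression at the T0
RECO_MB_PER_EVENT = 1.9     # prompt reconstruction output per event

raw_tb = EVENTS_2010 * RAW_MB_PER_EVENT / 1e6
reco_tb = EVENTS_2010 * RECO_MB_PER_EVENT / 1e6
print(f"raw: {raw_tb:.0f} TB, prompt reco: {reco_tb:.0f} TB, "
      f"total to FNAL T1: {raw_tb + reco_tb:.0f} TB")   # ~240 + ~152 = ~390 TB (the ~400 TB above)
```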
6
Determination of the CMS-HI Computing Requirements in a Processor-Independent Fashion
7
Method of Determining CPU Requirements
- All CMS-HI Software is in the Standard CMSSW Release Schedule
  - A major accomplishment since the last review (see Edward Wenger's talk)
  - Ensures that the validation tests are done and the timing results are correct
- Simulation Data Testing (done by MIT and Vanderbilt graduate students)
  - Ran with CMS simulation configuration files modified especially for HI use
  - Looked at minimum bias events for modeling the 2010 HI run
  - Looked at central events (< 10% centrality) for modeling future HLT runs
  - Each stage of the simulation process was specifically checked
    - GEANT4 tracking stage in the CMS detector (by far the most time-consuming step)
    - Reconstruction step, using simulated raw data files from the first step
  - CPU times, memory consumption, and file output sizes were all recorded
  - Tests were done on processors with an already known HS06 rating
- Estimate of CPU Requirements Was Made in a Data-Driven Fashion (sketched below)
  - Scaled according to the projected kind and annual number of events (see next slide)
  - Provision was made for a standard number of reco and analysis re-passes annually
  - Experience-based assumptions were made for analysis and simulation demands
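The bookkeeping described above, measuring per-event CPU time in CMSSW on nodes with a known HS06 rating and then scaling by the projected event samples and re-pass counts, can be written down compactly. The sketch below is purely illustrative: the per-event time and node rating are placeholder values, not the measured CMS-HI numbers.

```python
# Illustrative sketch of the data-driven HS06 estimate described above.
# The numeric inputs are placeholders, NOT the measured CMS-HI values.

def hs06_sec_per_event(cpu_sec_per_event: float, hs06_per_core: float) -> float:
    """CPU seconds per event on a benchmarked core, expressed in HS06-seconds."""
    return cpu_sec_per_event * hs06_per_core

def annual_load(n_events: float, hs06_sec_event: float, n_passes: float) -> float:
    """Total annual compute load in HS06-seconds."""
    return n_events * hs06_sec_event * n_passes

per_event = hs06_sec_per_event(cpu_sec_per_event=25.0, hs06_per_core=8.5)    # placeholders
load = annual_load(n_events=80e6, hs06_sec_event=per_event, n_passes=1 + 3)  # prompt + 3 re-passes
print(f"{load:.2e} HS06-sec")
```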
8
Annual Raw Data Volume Projections
Table 4 (page 20): Projected luminosity and up-time profile for CMS-HI runs

Year | Integ. Luminosity | Events Taken (10^6)  | Raw Data (TByte)
2010 | 10 µb^-1          | 40 - 80 (Min Bias)   | ~250
2011 | 20 µb^-1          | 50 (HLT/mid-central) | 150
2013 | 0.5 nb^-1         | 75 (HLT/central)     | 300
2014 | -                 | -                    | -

Notes:
1) First-year figures are relatively uncertain; CMS requests we use the maximum events value to ensure adequate disk and tape storage space is available
2) HI collision event size characteristics are completely unexplored at this energy
3) The LHC down year is in 2012
9
HI Data Reconstruction Estimates
Table 5 (page 21): Projected raw data reconstruction computing effort

Year | Trigger  | Events (10^6) | One reco pass (10^10 HS06-sec) | T0 Reco (days) | Re-passes | Time at VU (days/re-pass)
2010 | Min Bias | 80            | 1.7                            | 4              | 3         | 72
2011 | HLT      | 50            | 8.5                            | 18             | 1.3       | 133
2012 | No Beam  | (from above)  | -                              | None           | 2.7       | 65
2013 | HLT      | 75            | 12.8                           | 25             | 2         | 79
2014 | -        | -             | -                              | -              | -         | -

Notes:
1) Column six assumes the four-year annual growth of HS06 power at VU to be 3268, 8588, 17708, and 23028 (the Table 1 totals), to be divided into "rcrs" and "rcas" fractions ("RCF")
2) Column six also assumes VU reco fractions of 0.65, 0.55, 0.55, 0.45, and 0.45
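The "days" columns in this table follow from dividing a compute load by the HS06 power devoted to the task. The sketch below shows that relation in its simplest form; it ignores scheduling efficiency and the "rcrs"/"rcas" split, and the T0 allocation used is an assumed placeholder, so it will not reproduce every table entry exactly.

```python
# Simplest form of the elapsed-time estimate behind the "days" columns above:
# days = compute load [HS06-sec] / (HS06 power devoted to the task * 86400 s/day).
# Scheduling efficiency and the rcrs/rcas split are ignored here.

def elapsed_days(load_hs06_sec: float, power_hs06: float) -> float:
    return load_hs06_sec / (power_hs06 * 86400.0)

# 2010 min-bias prompt reco: 1.7e10 HS06-sec (table above), with an ASSUMED
# ~5e4 HS06 of T0 capacity devoted to the task while it runs.
print(f"{elapsed_days(1.7e10, 5.0e4):.1f} days")   # ~3.9 days, i.e. the ~4 days in column five
```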
10
HI Simulation Estimates
Table 7 (page 23): Projected Simulation Computing Load

Year | Event Type | Simulated Events (10^6) | HS06-sec/Event (10^4) | Total Compute Load (10^10 HS06-sec)
2010 | Min Bias   | 5.6                     | 1.41                  | 7.9
2011 | Central    | 1.0                     | 5.83                  | 5.8
2012 | -          | -                       | -                     | -
2013 | -          | 1.5                     | -                     | 8.7
2014 | -          | -                       | -                     | -

Notes: The simulations assume rough proportionality in the number of simulated and collected events. The number of simulated events is much smaller than the number of collected events; this is roughly consistent with the numbers typical at RHIC. We plan to re-use events and do embedding.
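The last column of the table is simply the product of the two preceding columns (simulated events times HS06-sec per event); the snippet below reproduces the 2010 and 2011 entries.

```python
# Total compute load = simulated events * HS06-sec per event.
# Units: events in raw counts, cost in HS06-sec/event, load printed in 1e10 HS06-sec.
rows = {
    2010: (5.6e6, 1.41e4),   # min bias
    2011: (1.0e6, 5.83e4),   # central
}
for year, (n_events, hs06_sec) in rows.items():
    print(year, f"{n_events * hs06_sec / 1e10:.1f}e10 HS06-sec")
# 2010 -> 7.9, 2011 -> 5.8, matching the table.
```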
11
HI Data Analysis Estimates
Table 8 (page 24): Integrated T2 Computing Load Compared to Available Resources

Year | Analysis + Simulation Need (10^11 HS06-sec) | Vanderbilt T2 | Total T2 Base | Ratio: Available/Need
2010 | 1.47                                        | 0.29          | 1.52          | 104%
2011 | 2.54                                        | 0.98          | 2.45          | 92%
2012 | 3.73*                                       | 2.01          | 3.73          | 95%*
2013 | 4.71                                        | 3.20          | 5.15          | 111%
2014 | -                                           | -             | -             | 126%

Notes:
1) Column three VU T2 values are computed using the HS06 growth model in slide 9
2) Column four total T2 base assumes an MIT HS06 growth model of (already in place, see slide 31), 2423, 3135, 5035, and 7885
3) Column four total T2 base also assumes 3000 HS06 from non-US T2 HI sites; the current estimate is that 2000 - 5000 HS06 will be available overseas

*SEE THE BACK-UP SLIDES FOR CORRECTION OF THE ERRONEOUS 3.7 ANALYSIS RE-PASS NUMBER
12
Determination of the Storage Volume Requirements
13
Disk Storage Considerations
- General Experience
  - Not having enough disk capacity for analysis is an often-heard complaint
  - It is important that T3 users have sufficient disk space at their local sites
  - This amount, ~TBs, of local disk space is relatively inexpensive to support
  - Used in the final stage of data analysis, e.g. re-scans over the last set of NTUPLEs
- Constraints and Policies for the CMS-HI Computing Center at ACCRE
  - Must store both the raw data files and the prompt reco output from the T0; this leads to hundreds of TB needed immediately (November 2010)
  - We do not plan for frequent re-reads of the data files from the FNAL T1 archive
    - This would be complex to arrange, and it would interfere with the FNAL pp mission
  - We must also provide space for analysis output and reco re-pass output
  - The conservative strategy is to overwrite some fraction of the prompt reco disk space as the reco re-pass cycle takes place; some of this output will be shipped to other T2s
14
Annual Event Size Projections
Table 9 (page 25): Event sizes for the HI data processing stages

Year | Event Type       | Events (10^6) | Reco Event (MBytes) | AOD Event (MBytes) | PAT Event (MBytes)
2010 | Min Bias         | 80            | 1.93                | 0.33               | 0.10
2011 | HLT              | 50            | 7.72                | 1.54               | 0.50
2012 | HLT (from above) | -             | -                   | -                  | -
2013 | -                | 75            | -                   | -                  | -
2014 | -                | -             | -                   | -                  | -

Notes (repeated from slide 8):
1) First-year figures are relatively uncertain
2) HI collision event size characteristics are completely unexplored at this energy
3) The LHC down year is in 2012
15
Sample Calculation for First Year Volume
- Input Files Which Must Be Stored on Disk Immediately at ACCRE
  - Minimum bias raw data files from the T0, by way of FNAL
    - Upper estimate of 250 TB, from 80M events at ~3 MB/event
    - As usual, the first-year estimates have a large uncertainty at all data stages
    - CMS rule: plan for the worst case, for obvious reasons
  - Prompt reco files from the T0, by way of FNAL
    - Present size estimate is 150 TB (80M events at 1.9 MB/event)
    - These files will be the input source for all user analyses
    - A large fraction of the reco files will be exported to MIT and non-US T2 sites
  - Immediate store minimum = 400 TB (this also goes into the tape store calculations)
- Output Files Which Will Be Produced on Disk at ACCRE
  - AOD and user output files are estimated at 35 TB
  - Additional space for one cycle of reco re-pass is estimated at 50 TB
  - Minimum space needed in 2010 - 2011 at ACCRE = 485 TB (see the sketch below)
- Actual Future-Year CPU vs. Disk Acquisitions Will Be Done Dynamically
  - The experience at RHIC is to adjust next year's ratio based on current-year utilization along with projections of next year's data volumes (also true for tape purchases)
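The 485 TB minimum quoted above is just the sum of the four items on this slide; the snippet below spells out that arithmetic.

```python
# First-year ACCRE disk requirement, term by term (all values from this slide).
raw_from_t0      = 250   # TB, min-bias raw data (worst case: 80M events at ~3 MB/event)
prompt_reco      = 150   # TB, prompt reco files (80M events at 1.9 MB/event)
aod_and_user     = 35    # TB, AOD and user analysis output
one_repass_cycle = 50    # TB, one cycle of reconstruction re-pass output

immediate_store = raw_from_t0 + prompt_reco                                # 400 TB (also drives tape)
print(immediate_store, immediate_store + aod_and_user + one_repass_cycle)  # 400 485
```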
16
Annual Total Data Volume Projections: Disk and Tape Storage
Table 10 (page 25): Disk and Tape Storage for Vanderbilt and FNAL

Year | Event Type       | Events (10^6) | Vanderbilt Disk (PBytes) | FNAL Tape (PBytes)
2010 | Min Bias         | 80            | 0.49                     | 0.56
2011 | HLT              | 50            | 0.77                     | 0.96
2012 | HLT (from above) | -             | 1.07                     | 0.45
2013 | -                | 75            | 1.20                     | 1.44
2014 | -                | -             | -                        | -

These are very lean estimates, and they assume that non-US T2s will be contributing a minimum of 200 TB to CMS-HI analysis work. It is also assumed (see the later T3 slides) that all the US institutions in CMS-HI will have reasonable (~10 TB) local disk space storage which can receive analysis output from the various T2s in CMS-HI.
17
Vanderbilt Local Infrastructure: ACCRE and ITS
18
ACCRE and ITS Infrastructure for CMS-HI
- ACCRE (Advanced Computing Center for Research and Education)
  - A consortium of faculty researchers from many different Vanderbilt divisions
  - A Faculty Advisory Board makes recommendations for general ACCRE policies
  - The Associate Provost for Research at Vanderbilt has direct authority over ACCRE
  - ACCRE was established in 2003 with an $8.5M venture grant from the University
    - 120 dual-core Linux nodes; the successor to the Physics Department "VAMPIRE" cluster
    - It now contains over 2000 processors, hosting several major research projects
    - Transitioning to the regular University operating budget as of 2011, no longer a "venture"
  - ACCRE will buy, install, and maintain the CMS-HI computing hardware
    - The VU group is not tasked to fix broken hardware or batch operating software
    - The VU groups (RHI and HEP) do support the CMSSW-specific software at ACCRE
  - Further details on the role of ACCRE in research support are in Alan Tackett's talk
- ITS (Information Technology Services)
  - Supports all other computing services for the University not contained in ACCRE
  - Provides rack space, power, and air-conditioning services for the ACCRE hardware
  - Manages the external network links for Vanderbilt (including 10 Gbps to ACCRE)
  - Internal network links using the PhEDEx CMS file transfer software are supported by ACCRE/HEP
  - Further details on the role of ITS at Vanderbilt are in E. Zafar's talk this afternoon
19
Example of Network Commissioning Test: 40 TB Ingested in 2 Days from 4 CMS Sites
- Commissioning tests have also been done successfully with the MIT and GRIF (France) T2s
[Transfer-rate plot; sources: FNAL T1, Florida T2, Nebraska T2, Moscow T2]
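For scale, 40 TB ingested in two days corresponds to an average rate just under 2 Gbps, comfortably within the 10 Gbps ACCRE link mentioned on the previous slide; the one-liner below is that arithmetic (decimal units).

```python
# Average ingest rate implied by the commissioning test: 40 TB in 2 days.
print(f"{40e12 * 8 / (2 * 86400) / 1e9:.2f} Gbps")   # ~1.85 Gbps
```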
20
Example Site Monitoring Test by CMS: Daily Service Availability Monitoring (SAM) Check
- Checks that all the critical CMS software is working at a given T2 or T3 site
21
MIT Heavy Ion Analysis Center
22
CMS HI at MIT Bates Facility
- Computing for HI at MIT
  - Since about 2004, MIT has been building computing resources for the CMS HI program
  - DOE and MIT university funding were used to establish the simulation and analysis facility
  - Most of the studies for the Physics Technical Design Report were done at MIT
  - MIT presently provides most of the computing resources for the CMS HI group worldwide
- HI Simulation and Analysis Center and the CMS Tier-2 at MIT
  - The MIT HI and HEP CMS groups proposed and are maintaining a Tier-2 facility funded mostly by HEP
  - The MIT Tier-2 is tightly integrated with the computers for the CMS HI program and with computers from other MIT/LNS groups (CDF, Neutrinos, Olympus, STAR, ...)
  - The center infrastructure is identical for all the groups, with full access to all CMS and grid tools
  - As of today the HI fraction of the center is about 20% of CPU and 30% of disk space
  - It is relatively easy to add new machines to the center: the recent addition of about 2000 HS06 and 200 TB of disk space took 2-3 weeks; there is access to a peak capacity of over … HS06 today
- New Infrastructure
  - The MIT administration built a new computer center at the MIT Bates Laboratory to host scientific computation farms
  - Up to 70 racks with 700 kW of cooling power; CMS is planning to use about 30 racks, with up to 6 reserved for heavy ions; access to a 10 Gbps external network
  - System administration and hardware support are paid via "rack charges", which pay for 1.5 FTE at the lab
  - MIT HI group members contribute crucially to the running of the center
23
HI Computing Operations
24
Overview of Heavy Ion Computing Operations
- Guiding Philosophy
  - Wherever possible, HI workflows will be identical to those of pp workflows
  - When changes are necessary, these will be as minimal as possible
  - In this manner, the much larger pp group in CMS can effectively support CMS-HI
- Examples
  - On-line DQM operations during the HI run will use the same software as for pp
    - Some irrelevant histograms (e.g. MET) will be ignored for the HI data
    - Two new histograms in use: centrality determination and reaction plane
    - Existing DQM software has been put to work for HI simulated events (Pelin Kurt)
    - The learning curve for doing DQM is already in progress at Vanderbilt (ROC)
  - T0 operations will be carried out by both pp and HI people in both pp and HI runs
    - Similar to the PHENIX case when there is a switch between Au and proton running
  - HI-specific software is fully integrated into CMSSW development cycles
  - HI simulation production is now carried out by the central CMS data operations group
  - HI file transfers from the T0 to the FNAL T1 will be managed by the data ops group
  - HI work flows for doing reconstruction re-passes will be based on FNAL T1 software
    - VU HI people will be at FNAL in summer 2010 to learn T1 operations
25
Overview of Heavy Ion Computing Operations
- Current Analysis Stage for CMS-HI
  - HI analysis is organized into five Physics Interest Groups (PInGs)
  - These interest groups are currently developing analysis software using either 1) central simulation production files, or 2) smaller private generator files
  - Analysis group members are largely using the services of the MIT HI analysis center, or other analysis centers in CMS (FNAL, VU, ...)
- Analysis Policies and Procedures in CMS for T3 Users
  - CMS computing management strongly recommends that the HI analysis system be the same as that for the pp data as far as using large-volume production files
    - A separate analysis system for the HI group cannot be supported with CMS people
  - For large production work, the CMS system is to use the CRAB framework
    - CRAB (CMS Remote Analysis Builder) is a remote job submission tool with central support
    - It makes optimum, transparent use of the Grid system and the CMS file database
    - CRAB jobs are submitted from any T3-enabled site in the CMS grid, to work at T2s
    - Output can be returned to the user's local T3 site, or staged to a T2 area where the user has disk space
  - MIT and Vanderbilt host T3 sites available to all CMS-HI users
  - All CMS-HI sites in the US should each have ~10 TBytes to receive files from the T2s (a rough sizing check follows below)
  - It is much more efficient for graphical analysis to have event displays on a local cluster
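As a rough check that ~10 TB of local T3 disk is a sensible target, the per-event sizes from Table 9 (slide 14) can be applied to the full 2010 min-bias sample; the snippet below is only that back-of-the-envelope estimate, and the conclusions drawn in the comments are interpretations rather than CMS requirements.

```python
# Rough T3 sizing check using the Table 9 per-event sizes and the 2010 sample.
events_2010 = 80e6     # min-bias events (upper estimate)
pat_mb = 0.10          # PAT event size in MB (Table 9)
aod_mb = 0.33          # AOD event size in MB (Table 9)

print(f"full PAT-level skim: {events_2010 * pat_mb / 1e6:.0f} TB")  # ~8 TB, fits within ~10 TB locally
print(f"full AOD copy:       {events_2010 * aod_mb / 1e6:.0f} TB")  # ~26 TB, better kept at a T2
```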
26
HI View of Personnel Responsibilities
1) It is important that all the CMS experts in each system shown on the left (Detector, Readout, ...) realize that their expertise will be critically needed during the HI run. For some systems these experts will be almost completely in charge, while in others (RECO, SIM, ...) the HI people will take the lead in giving directions.

2) In April we experienced the loss of a key HI person with knowledge of the T0 operations. Data operations experts in CMS have stepped in to help us cover this loss. These CMS experts will guide HI students and post-docs this summer in completing the needed workflows for HI data operations at the T0.
27
HI Computing Organization
28
Proposal Budget
29
Development of Vanderbilt T2 Center From Table 1 (page 14)
Category                    | 2010     | 2011     | 2012     | 2013     | 2014     | Total
New CPU (HS06)              | 3268     | 5320     | 9120     | -        | -        | 23028
Total CPU (HS06)            | 3268     | 8588     | 17708    | -        | -        |
New Disk (TBytes)           | 485      | 280      | 300      | 135      | -        | 1200
Total Disk (TBytes)         | 485      | 765      | 1065     | -        | -        |
Hardware Cost               | $258,850 | $252,000 | $309,000 | $127,255 | -        | $947,105
Staffing Cost (To DOE)      | $180,476 | $188,285 | $195,816 | $202,649 | $211,795 | $980,021
Total Cost (To Vanderbilt)  | $439,326 | $440,285 | $504,816 | $330,904 | -        | $1,927,126

Note: Hardware and staffing cost assumptions are given on the following slide
30
Development of Vanderbilt T2 Center From Table 1 (page 14)
Category                            | 2010     | 2011     | 2012     | 2013     | 2014     | Total
Hardware Cost                       | $258,850 | $252,000 | $309,000 | $127,255 | -        | $947,105
Staffing Cost (To DOE)              | $180,476 | $188,285 | $195,816 | $202,649 | $211,795 | $980,021
Total Cost (To Vanderbilt)          | $439,326 | $440,285 | $504,816 | $330,904 | -        | $1,927,126
Cost per 3 GB core (9.5 HS06/core)  | $400*    | $350     | $275     | $200     | $125     | NA
Cost per TByte                      | $250     | $150     | $113     | $75      | -        |
Total FTE                           | 3        | -        | -        | -        | -        | 15
Cost per FTE                        | $120,317 | $125,523 | $130,544 | $135,766 | $141,197 |

This is the fully loaded cost picture at Vanderbilt.

*SEE BACK-UP SLIDE 48 FOR AN IN-DEPTH COST COMPARISON ANALYSIS
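The 2010 hardware line can be reproduced exactly from the unit costs in this table together with the first-year capacities (3268 HS06 and 485 TB) from the table above; the snippet below is that arithmetic. Later years presumably fold in additional networking and infrastructure items, so they do not factor quite as cleanly.

```python
# 2010 Vanderbilt hardware cost from the unit-cost assumptions above.
new_hs06 = 3268                  # first-year CPU capacity
new_tb   = 485                   # first-year disk capacity
cores     = new_hs06 / 9.5       # 9.5 HS06 per 3 GB core -> 344 cores
cpu_cost  = cores * 400          # $400 per core in 2010
disk_cost = new_tb * 250         # $250 per TByte in 2010
print(f"${cpu_cost + disk_cost:,.0f}")   # $258,850, matching the Hardware Cost row
```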
31
Enhancement of MIT HI Analysis Center From Table 3 (page 15)
Category               | 2010     | 2011     | 2012     | 2013     | 2014
New CPU (HS06)         | 1900     | 500      | 700      | -        | 2800
Total CPU (HS06)       | -        | 2400     | 3100     | 5000     | 7800
New Disk (TBytes)      | 200      | 13       | -        | 150      | -
Total Disk (TBytes)    | -        | 213      | 225      | 375      | 525
Hardware Cost          | $135,000 | $20,000  | -        | $77,000* | $65,000
Staffing Cost (To DOE) | $10,000  | $25,000  | $30,000  | $35,000  | $40,000
Total Cost             | $145,000 | $45,000  | $50,000  | $112,000 | $105,000
Cumulative             | -        | $190,000 | $240,000 | $352,000 | $457,000

Notes:
1) The $135,000 spent in 2010 included a network card purchase
2) Personnel costs are increasing with the number of computers and occupied racks; only one-half is charged in 2010 due to the Bates setup. Members of the HI group contribute support to the running of the computers
32
Cost of the FNAL T1 Tape Archive From Table 2 (page 15)
Category         | 2010    | 2011     | 2012    | 2013     | 2014     | Total
Tape Volume (PB) | 0.6     | 1.0      | 0.5     | 1.4      | -        | 4.9
Cost to DOE      | $94,000 | $103,000 | $40,000 | $116,000 | $120,000 | $473,000

Notes:
1) The incremental cost of hosting the HI data at FNAL is estimated to be $110 per tape slot, including overhead
2) Current technology at FNAL uses LTO4 tape drives at 800 GB/slot (90% fill)
3) From 2012 onward FNAL will use LTO5 technology at double the capacity
4) A dedicated tape drive will be needed for CMS-HI, at a cost of $25,000, which will be charged in the second year when LTO5 becomes available
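The notes above translate into a simple slot count times $110; the sketch below applies it to the 2010 volume and comes within a few percent of the quoted cost. Later years fold in the LTO5 transition and the $25,000 dedicated drive, so they are not reproduced by this one-liner.

```python
# Rough check of the 2010 tape cost: LTO4 slots of 800 GB at 90% fill, $110 per slot.
volume_gb = 0.6e6                # 0.6 PB archived in 2010
slots = volume_gb / (800 * 0.9)  # ~833 slots
print(f"{slots:.0f} slots -> ${slots * 110:,.0f}")   # ~$92k vs the quoted $94,000
```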
33
Project Management
34
Project Oversight: Choice of Project Center
Choices to the DOE:
- The CMS computing project can be funded as a center out of the MIT CMS project
  - The administrative costs for doing this have not been included yet
  - The charge would be 68% for the first $25K = $17K
  - A subcontract would be established between Vanderbilt and MIT
  - The Vanderbilt accounting office would monitor the subcontract expenses
- The Vanderbilt T2 installation can be funded directly as a separate contract
  - There would not be any new administrative cost for doing this
  - Accounting reports would be filed by the Physics Department financial office working with the Grants Accounting Office of the University
  - The financial officer at ACCRE will file purchase reports to the Physics Office, which would be forwarded to Vanderbilt grants accounting
- The FNAL tape archive component would be set up as a separate subcontract with either MIT or Vanderbilt
35
Summary of Responses to First Review Comments
Based on the Computing Plan Update Already Presented
36
Comments on Resource Requirements
Comment: Justification of the formal performance requirements of the proposed CMS HI computing center(s) that integrates the CERN Tier-0 obligations, the service level requirements including end users, and the resources of non-DOE participants and other countries.
Response: The present CMS-HI computing plan adheres to the general CMS computing model much more closely than previously. There are significant uses of the CERN T0 and the FNAL T1, as well as vital contributions from non-US T2 centers. The personnel involvement of the rest of CMS computing is also strong.

Comment: The analysis and storage model articulated in the CMS HI Collaboration proposal differs from the one described in the CMS Computing Technical Design Report (TDR). In the TDR, heavy ion data would be partially processed at the CERN Tier-0 center during the heavy ion run.
Response: The heavy ion data will in fact be promptly and completely processed at the T0 each year, just as is happening for the pp data. Further processing of the HI data, or of the pp data, will not be done at the T0, which lacks the infrastructure and personnel support to do such T1 and T2 tasks.
37
Comments on Resource Requirements
Comment: The US CMS HI computing resources should be driven by technical specifications that are independent of specific hardware choices. Well-established performance metrics should be used to quantify the needs in a way that can be mapped to any processor technology the market may be offering over the course of the next few years. Expected improvements of the CPU processor cost/performance ratio suggest that the requested budget for CPU hardware could be high.
Response: The CPU power requirements have been completely reformulated into HS06 units. The requirements are also data-driven, according to the best estimates of the data volumes in each year of the proposal. The timing and memory estimates were performed with the most up-to-date versions of the CMS-HI software, which has been fully integrated into the normal release cycles of the CMSSW framework. The cost/performance quotes that we have monitored since the first proposal was originally submitted remain consistent with our less aggressive assumptions about expected improvements in the CPU processor cost/performance ratio. On the other hand, the change in the LHC schedule, which moved data taking into 2011 from the previously planned 2012, has inexorably pushed our cost estimates higher. It is understandable that short-term fluctuations, such as the LHC schedule or the recent economic recession and recovery, will have an important effect on the budgets.
38
Comment on Quality of Service Burden
Comment: The relationship between DOE NP supported grid resources (VU and MIT) for heavy ion research and the grid resources available to the larger CMS HEP collaboration needs to be clarified. Also, the formal arrangements with respect to NP pledged resources to the CERN Worldwide LHC Computing Grid (WLCG) need to be defined.
Response: The CMS computing management regards the HI data taking as a simple one-month continuation of the pp running in terms of supporting the file transfers from the T0. The related MoUs, stating the expected performance for HI data by Vanderbilt and MIT, will be submitted to the WLCG upon request by the national CMS leader in the US (currently Joel Butler), as per CMS policy. In turn these MoUs, as well as the MoU between ACCRE and CMS-HI, will be written when it becomes clear what the hardware capabilities of these HI centers are.
39
Comment on Quality of Service Burden
Comment: The Tier-1 quality-of-service (QoS) requirements were specifically developed for the HEP LHC experiments, but they might be relaxed for the CMS HI effort in view of its spooled approach to the grid. CMS HI and the WLCG are encouraged to examine the possibility of tailoring the Tier-1 requirements to reflect the needs of the U.S. heavy ion research community.
Response: This concern has been alleviated by the recommendation of CMS computing management that the FNAL T1 be used for immediate receipt and secondary tape archival storage of the HI files from the T0. In retrospect, this recommendation makes perfect sense, since it would be grossly inefficient to duplicate a T1 network and tape archive capability which would be needed only one month of the year. Moreover, the various workflows that handle file transfers out of the T0 would not have to be substantially modified to accommodate a different T1 site.
40
Comments on HI Computing Operations
Comment: The management, interaction, and coordination model between the Tier-2 center(s) and Tier-3 clients is not well formulated. It will be important to document that user institutions will have sufficient resources to access the VU computing center and how the use of the facility by multiple U.S. and international clients will be managed.
Response: All CMS-HI users will be given accounts on the ACCRE gateway computers, which already function as a T3 in CMS. Just as they can do already at the MIT HI analysis center, these users will be able to access production output from the Vanderbilt T2 using standard CMS job submission and database tools. No separate HI analysis system will be needed.

Comment: A detailed plan for external oversight of the responsiveness and quality of operation of the computing center(s) should be developed.
Response: Vanderbilt will become another CMS T2. As a T2 it will be subject to the same automatic monitoring as other T2s. Availability, reliability, and production output will be tested by standard tools (see slide 19). Based on that information, the performance will be reviewed by the CMS computing management and reported to CERN and the funding agencies. In addition, the US-CMS-HI project manager will directly review the performance statistics.
41
Comments on HI Computing Operations
Comment: Draft Memoranda of Understanding (MoUs) should be prepared between the appropriate parties, separately for the VU and the Bates computing facilities, that clearly define management, operations, and service-level responsibilities.
Response: Examples of the existing MoUs between the US T2s and the FNAL T1 are being studied for relevance to the HI program. The DOE-NP has also been queried as to what conditions it expects to see in such MoUs. As we stated earlier, we would not expect to have other parties sign MoUs with us detailing their commitments until it becomes clear what our own capabilities are.

Comment: The size of the workforce associated with data transport, production, production re-passes, calibration and Monte Carlo simulation efforts, and the general challenges of running in a grid environment should be carefully examined and documented.
Response: In the model where the HI data and simulation production are a straightforward extension of the pp production, the concerns about the small size of the HI group are less troubling. The same CMS group of pp and HI people (including VU graduate students and post-docs) will handle all T0 operations. The receipt of HI files from the T0 will be largely supported by the FNAL T1 group. Simulation production for HI use is now completely done, on request, by the central data operations group in CMS.
42
Comment on ACCRE Infrastructure
Comment: ACCRE should articulate its plans and the facility investments needed to support the development of a custom infrastructure suited to the needs of the US CMS HI collaboration and, conversely, to what extent the US CMS HI collaboration will need to adapt to the existing ACCRE infrastructure (e.g. the use of L-Store at Vanderbilt, when other CMS computing centers use dCache).
Response: There are now at least three major storage systems (dCache, Hadoop, and Lustre) used at CMS T2 centers in the US. For example, Caltech is using Hadoop while MIT is using dCache. This indicates that the dCache storage system is not an intrinsic part of the CMS computing environment. During the US-T2 workshop held at FNAL this past March, several future alternatives to the use of dCache were under discussion. The CMS production system is transparent to the choice of the underlying storage system.
43
Back-up Plans for 2010
- Calendar Constraints for Developing the Vanderbilt T2
  - ACCRE staff has worked out an installation schedule (see Alan Tackett's slides)
  - If a full bidding process is required, it could take as long as 15 weeks from the date of contract authorization to install all of the first year's hardware
  - It could be as short as 10 weeks, or as short as 6 weeks if no bidding is needed
  - Assume a contract authorization date of August 1, 2010
    - The hardware would be ready for testing (e.g. file transfers from FNAL to VU) between September 15 and November 15
    - Full testing could take 2 months
    - The Vanderbilt T2 site could be ready on November 15, or as late as January 15
- Back-up Plan (November 2010 to February 2011)
  - No files are shipped to Vanderbilt, but a subset of the prompt reco files are shipped to the MIT HI center for analysis, and to other non-US T2 centers
  - MIT could advance hardware from 2011 into 2010; the cost is losing the Moore's-law benefit
  - A minimal set of physics analyses is done for the QM 2011 (May) conference
  - Some raw data files could be re-reconstructed at FNAL
    - This would be a policy decision; there is no technical issue preventing it
    - Newly reconstructed files can be re-analyzed at MIT or other T2 centers
  - Vanderbilt would start to receive files from FNAL in February 2011
44
Summary
- The updated CMS-HI computing plan has been greatly revised
  - Far more integration with the rest of the CMS computing resources is achieved
  - The plan follows very closely the model of pp data processing
  - Raw and prompt reco file transfers from the T0 are far more robust than before
  - The exposure to new time-critical file transfer mandates is much reduced
- The nature of the proposed Vanderbilt CMS-HI center has changed
  - It is no longer a full-fledged T1 site with those critical attendant responsibilities
  - It will function as an enhanced T2 center, approved by CMS management
  - Vanderbilt will receive raw data files and prompt reco files from the T0 via FNAL
  - Vanderbilt will serve as a normal T2 center for the analysis of these reco files
  - There will be a normal cycle of reconstruction re-passes on the raw data files
  - Reconstruction re-pass output files will be served to other CMS-HI T2 sites
- Significant new amounts of non-US T2 resources have been pledged
  - Network links to these resources have been tested satisfactorily from Vanderbilt
  - These resources constitute an important component of the analysis base for CMS-HI
- The already partially funded MIT HI analysis center is fulfilling its role well
  - Large simulation production at MIT is being launched by the central CERN data ops team
  - The MIT HI T3 center presents an excellent model for doing analysis in CMS-HI
45
Post-Review Backup Slides
46
HI Data Analysis Estimates
Table 8 (page 24): Integrated T2 Computing Load Compared to Available Resources

Year | Analysis + Simulation Need (10^11 HS06-sec) | Vanderbilt T2 | Total T2 Base | Ratio: Available/Need
2010 | 1.47                                        | 0.29          | 1.52          | 104%
2011 | 2.54                                        | 0.98          | 2.45          | 92%
2012 | 2.88*                                       | 2.01          | 3.73          | 124%*
2013 | 4.71                                        | 3.20          | 5.15          | 111%
2014 | -                                           | -             | -             | 126%

Notes:
1) Column three VU T2 values are computed using the HS06 growth model in slide 9
2) Column four total T2 base assumes an MIT HS06 growth model of (already in place, see slide 31), 2423, 3135, 5035, and 7885
3) Column four total T2 base also assumes 3000 HS06 from non-US T2 HI sites; the current estimate is that 2000 - 5000 HS06 will be available overseas

*THESE VALUES CORRECTLY USE 2.7 ANALYSIS RE-PASSES IN 2012
47
HI Data Analysis Estimates: “Systematic Error” Example
Table 8 (page 24): Integrated T2 Computing Load Compared to Available Resources

Year | Analysis + Simulation Need (10^11 HS06-sec) | Vanderbilt T2 | Total T2 Base | Ratio: Available/Need
2010 | 1.47                                        | 0.29          | 1.52          | 104%
2011 | 2.54                                        | 0.98          | 2.45          | 92%
2012 | 3.13*                                       | 1.79*         | 3.33*         | 106%*
2013 | 4.71                                        | 3.20          | 5.15          | 111%
2014 | -                                           | -             | -             | 126%

Notes:
1) Column three VU T2 values are computed using the HS06 growth model in slide 9, except that the VU site has a 0.60 instead of 0.55 reco fraction to keep to the same number of reco days
2) Column four total T2 base assumes an MIT HS06 growth model of (already in place, see slide 31), 2423, 3135, 5035, and 7885
3) Column four total T2 base also assumes 3000 HS06 from non-US T2 HI sites; the current estimate is that 2000 - 5000 HS06 will be available overseas

*THESE VALUES ASSUME 3.0 RECO AND ANALYSIS RE-PASSES IN 2012
48
Cost-Comparison Analysis for 2010
- Proposed Nehalem System
  - Base price for an 8-core Nehalem E5520 node with 24 GB = $2800
  - Networking and miscellaneous costs = $400
  - Total cost = $3200, or $400/core
- Suggested Alternative System
  - Base price for an 8-core Nehalem node with 24 GB = $2800
  - Upgrade to dual Westmere L5640 with 12 cores = $1246
  - Add an extra 12 GB to maintain the 3 GB/core requirement = $450
  - Total cost = $4896, or $408/core
- Comparison of 1K-Quantity Pricing
  - Nehalem E5520, 2.26 GHz, 8 MB cache, 4-core = $373
  - Westmere E5620, 2.4 GHz, 12 MB cache, 4-core = $387
  - Westmere L5640, 2.26 GHz, 12 MB cache, 6-core = $996
  - E5520 -> L5640 upgrade = $623
- Conclusion: No cost advantage for the Westmere choice
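The two per-core figures follow from dividing each configuration's total cost by its core count; note that reaching the quoted $4896 for the Westmere option requires including the same $400 networking and miscellaneous item, which is an inference from the totals on this slide rather than an explicit statement.

```python
# Per-core cost comparison from the figures above.
nehalem_total,  nehalem_cores  = 2800 + 400, 8                # E5520 node + networking/misc
westmere_total, westmere_cores = 2800 + 1246 + 450 + 400, 12  # L5640 upgrade + extra RAM + networking/misc
print(f"Nehalem:  ${nehalem_total / nehalem_cores:.0f}/core")    # $400/core
print(f"Westmere: ${westmere_total / westmere_cores:.0f}/core")  # $408/core
```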