DOE-NSF Comprehensive Review of US-LHC Computing January 19, 2007 UT Arlington CLOSEOUT SESSION.


ATLAS CLOSEOUT: MANAGEMENT

Recommendations from 2006  Observations:  The recommendations from last year generally have been addressed in commendable ways  We note the lack of reporting on an end-to-end cosmic ray test. Such tests are crucial to expose integration issues with the software and computing.  Recommendations  US Atlas, working with Atlas, is urged to focus on a full system test with well defined milestones.

Management, Findings  Findings and Observations:  The committee appreciates the responsiveness and candor of the US Atlas management team.  The management structure is effective and appropriate with problems identified and addressed. We observe that the US efforts are tightly integrated with International Atlas and the US is well represented in the management structures.  Committees are an excellent way to achieve consensus, which is necessary in an international collaboration; however, this can be a very slow process.  We commend US ATLAS S&C for their usage of the change control board.  There is an established plan for transition from development to operations. It is difficult to assess the planning for the transition and whether it will address the problems that will occur.  The Computing Model was redefined for resource requirements. The assumptions for the model are still under debate. This has a budget impact in the out-years.

Management, Recommendations  US Atlas should work with Atlas committees to get timely delivery of reports such that risks can be identified and addressed expeditiously.  The year before data taking is an especially stressful time when problems must be addressed as they come up and we recommend retaining some flexibility to address such problems.  Due to the highly interdependent nature of integration tasks, it is crucial to have public and published milestones which reflect the needed system functionality required to support the experiment and data collection at each point in time.

Analysis Readiness: Observations  US ATLAS is taking all reasonable steps to educate the users and build consensus to ensure that US collaborators are ready and able to do physics analysis.  The Jamborees are successful  The feedback from the collaborators is positive and enables fine-tuning and improving the process  The US analysis readiness has been internally reviewed by US Atlas  We commend them for addressing the comments from last year’s review.  Locating the complete copy of the ESD at the US Tier 1 is likely to make analysis easier at the risk of an increased single point of failure. We note that they have considered mitigations for this risk, for example, by partnering with other Tier 1 sites and having software which is flexible.  There is a plan for adding new collaborators.  Incremental costs are not well understood; however, there is a model  The US Atlas management takes responsibility for ensuring that new collaborators can contribute effectively to Atlas.

PanDA  Observations:  PanDA has proven to be effective and is gaining international acceptance.  Having a leadership role in order to have the ability to adapt to changing grid technologies and techniques is valuable.  There is an intent to spread the operational load with the full collaboration.  Recommendations  For PanDA, maintain the development effort in the US and pursue the plans to increase the international support for operations.

DDM  Observations  We note that the Distributed Data Management tool issues are a risk for Atlas success.  The ATLAS collaboration has given this issue a high priority and has recently reviewed the DDM project.  Recommendations  We recommend that US ATLAS should work with ATLAS to execute the Distributed Data Management plan with some urgency.  We encourage the US ATLAS management to ensure that a plan is in place to solve both short term needs and long term issues in DDM and to make sure sufficient resources are put in place to guarantee the success of this plan.  US ATLAS should work with ATLAS to develop concrete milestones that will guarantee a working DDM as soon as feasible and functional within This may require temporary solutions for some DDM system components for use in early data taking until the final solutions been proved."

10% Cut Scenario  Observations:  As outlined, a 10% funding cut would have drastic consequences  Even with full funding (including the proposed call on the management reserve), US Atlas S&C has limited flexibility to reassign resources in order to address the pressing issues which will arise in the early data collection period.  Any funding cut would remove that limited flexibility leading to extreme risk for meeting essential functionality.  Recommendations  Focusing on existing US Atlas responsibilities and avoiding expanding scope will make the best use of the resources.

ATLAS CLOSEOUT: FACILITIES, GRIDS, NETWORKING, AND INFRASTRUCTURE

Comments on last review recommendations:  Demonstrated progress on all recommendations. There remains a concern with dCache performance for chaotic data analysis.

Computing Models  Observations:  There is a plan to test data analysis at the Tier2 sites in advance of data taking.  Good progress in utilizing the facilities, not only at Tier1, but also at the Tier2; getting the Tier2 online for simulation production has been successful.  Full copy of ESD at Tier1 and full copy of AOD in each of the US Tier2 site is beneficial for data analysis capabilities of US groups. If ESD and AOD sizes remain much larger than foreseen in the computing plan this approach may not be possible.  Due to the fact that the BNL Tier1 is the only T1 with the full ESD it has been observed that it is an attractive repository for international ATLAS access, potentially impacting US access.  Recommendations  US ATLAS should assure that the ATLAS task force on ESD and AOD event formats reports back by early summer 2007 to the US Atlas management with a plan to address the event size problem  Checksum techniques should be adopted for all data transfers to ensure data integrity.

Deployment of US Tier-1, Tier-2 centers  Observations:  Tier0 to Tier1 integration has been demonstrated; Tier1 to Tier2 not really exercised so far as a result of problems encountered in using DDM at Tier2 sites.  Tier3 integration plan via the OSG is reasonable, but desktop Tier3 sites seems inappropriate.  A new Tier1 cost profile was presented with significant increase of total costs for the years 2009 to 2011; this resulted from several factors.  For Tier2 the cost profiles has not changed, but the projected capacities are significantly lower than the new target in 2010 and  Recommendation  USATLAS should develop a plan by the next Agencies review for the Tier2 shortfall in 2010, 2011.

Infrastructure and Operations  Observations  Jobs at Tier2 centers continued to run during an 8 hour stand-down of BNL which took the Tier1 facility off-line.

Usability of grid-based software  Observations  Good progress in using the GRID; the use of PanDA demonstrates significant progress in using the GRID effectively by US physicists.  USATLAS demonstrated appropriate management links between OSG and WLCG. They showed that, using PanDA, they could submit jobs to EGEE and OSG.

Cybersecurity  Observations  At the Tier1 facility there is an understanding of responsibilities in cybersecurity at BNL.  US ATLAS now has a cybersecurity officer in place.  They are leveraging OSG expertise to address cybersecurity issues at the Grid level.  Recommendations  Work with cybersecurity officer to put in place a plan to mitigate the effects of cybersecurity incidents

Networks  Observations  The current plan to add a fully redundant diverse path from BNL to ESNet budgeted for 07 on top of the existing 20 gbps links seem adequate for the initial running of the LHC; ATLAS US model which foresees storage of the full ESD sample at BNL means each T2 requires only good connectivity to BNL.  Connectivity between T1 and T2 looks good (existing or planned 10 gbps links)

ATLAS CLOSEOUT: CORE SOFTWARE AND ANALYSIS SUPPORT

Software - 1  Observation: 2007 will bring a large burden on US ATLAS core software and there is a chance that US ATLAS support may suffer while the needs of the greater good are addressed.  Recommendation:  Press International ATLAS for an integration schedule with milestones (however fluid) to define the year’s activities  Observation:  ATLAS recognized the need for a user-defined ntuple, and the work is essentially done for allowing physics groups and individuals to create them from ATHENA. This is to be commended.  Recommendation:  Show at the next review how the DPD effort turned out: how well the structure worked and how well it was adopted by the collaboration.

Software - 2  Observation:  US ATLAS plans to devote 1 new FTE to developing vATLAS, at the request of the ATLAS technical coordinator with online needs as the primary driver. This is seen as an opportunity to leverage graphics expertise in the US and provide a collaboration wide tool.  Recommendation:  We believe more effort than 1 FTE will be needed, and it should be pursued in International ATLAS. Similarly for PanDA, develop a strategy for ATLAS-wide adoption.  Observation:  US support groups are potentially exposed to high levels of requests from across the collaboration. On one hand, this is a sign of a job well done in terms of being recognized experts.  Recommendation:  Assess the potential workload and develop a mitigation strategy to limit the exposure. Feed back questions posed to the US groups to the UK workbook effort to minimize repeat questions. Note that there was no response this year to last year’s recommendation on regular assessments of support load.

Software - 3  Observation:  Release validation appears woefully inadequate. We saw no system-wide testing. This is shown graphically by the failure rate, which spikes with each new release. Even still, there is no measure of ongoing algorithm quality.  Recommendations:  Press International ATLAS to develop a culture of code QA and testing, and to develop system tests this year. This should reduce the failure rate from software failures as well. The target failure rate as reported in the Answers should be clarified; it appears to quote the job failure rate due to software failure, not the per-event probability.  Observation:  The software demo was very satisfactory showing that the software, at this stage, is in a functional state, if not elegant.  Observation:  As always, we commend US ATLAS for their level of responsibility in International ATLAS, and for the key projects they develop and support.
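The distinction drawn above between a job failure rate and a per-event failure probability can be made concrete with a small Python sketch. It assumes, purely for illustration, that events fail independently with the same probability, so that a job of N events fails with probability 1 - (1 - p)^N. The function names and the numbers in the example are hypothetical and are not taken from the review.

```python
import math


def job_failure_probability(per_event_probability, events_per_job):
    """Probability that at least one event in a job fails.

    Assumes independent events with identical failure probability p:
    P_job = 1 - (1 - p)**N, evaluated with log1p/expm1 to stay accurate
    when p is very small.
    """
    if not 0.0 <= per_event_probability < 1.0:
        raise ValueError("per_event_probability must be in [0, 1)")
    return -math.expm1(events_per_job * math.log1p(-per_event_probability))


def per_event_failure_probability(job_probability, events_per_job):
    """Invert the relation above: p = 1 - (1 - P_job)**(1/N)."""
    if not 0.0 <= job_probability < 1.0:
        raise ValueError("job_probability must be in [0, 1)")
    return -math.expm1(math.log1p(-job_probability) / events_per_job)


if __name__ == "__main__":
    # Illustrative values only: a one-in-a-million per-event failure
    # probability and jobs of 1000 events each.
    p_job = job_failure_probability(1e-6, 1000)
    print(f"job failure probability: {p_job:.3e}")
    print(f"recovered per-event probability: "
          f"{per_event_failure_probability(p_job, 1000):.3e}")
```

Under these assumptions the two quantities differ by roughly a factor of the number of events per job, which is why a quoted target needs to state which one it refers to.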

CMS CLOSEOUT: MANAGEMENT

Recommendations from 2006  Observations:  The recommendations from last year generally have been addressed in commendable ways  We commend US CMS for their commitment to the success of the experiment.  Recommendations  Even in light of the increased pressures of turn-on, US CMS S&C must work with CMS to establish well-defined US deliverables.

Management: Findings, Observations  The committee appreciates the responsiveness of the US CMS S&C management team to the committee’s questions  The new CMS management structure is targeted towards moving to an operational mode, and appears to be converging. The organization looks plausible although many appointments remain open which is a concern given the time critical nature of having a stable and effective organization. We observe that the US is well represented in the management structures. A goal of the CMS management structure is to provide well defined US CMS S&C deliverables.  US CMS S&C management has been effective in allocating resources to address changes in strategies.  US CMS S&C is focusing on achieving a sustainable operations model.

Management, Recommendations  US CMS S&C should continue to work with CMS management to fill the open positions in the new CMS Computing organization in a timely way.  We recommend that US CMS S&C work with CMS to define and reward physicist participation in software and computing operation tasks.  We recommend a timely process for including software and offline in the Memoranda of Agreement  The year before data taking is an especially stressful time when problems must be addressed as they come up and we recommend retaining flexibility within US CMS S&C responsibilities to address such problems.  Due to the highly interdependent nature of integration tasks, it is crucial to have public and published milestones which reflect the needed system functionality required to support the experiment and data collection at each point in time.

Analysis Readiness: Observations  We commend the completion of the CMS Physics TDR (and the contributions of US physicists) and encourage continued and proportionate participation by US CMS physicists, including FNAL scientific staff.  There are active processes to continue to enable the LPC to be an effective organization for US physicists and to use the LPC-CAF to provide resources for US CMS physicists.

10% Cut Scenario  Observations:  US CMS S&C has analyzed a 10% funding cut, and a mitigation plan exists and the impact is understood.  While the impact is understood, any funding cut would remove needed flexibility leading to risk for addressing unanticipated problems which are bound to arise.  Recommendations  We recommend that US CMS S&C work to specify their scope and reconcile it with their resources. This should be expressed within the MOA.

CMS CLOSEOUT: FACILITIES, GRIDS, NETWORKING, AND INFRASTRUCTURE

Computing Models  Observations:  They have shown that the Tier-2 sites can participate in analysis during the CSA06.  The AOD use in analysis has achieved sufficient performance to make user based analysis effective.  Recommendation  USCMS should follow its stated plan to scale up exercise of analysis at Tier-2 sites to meet capability targets.  Recommend that global CMS adopt checksum techniques for all data transfers to ensure data integrity.

Infrastructure and Operations  Observations:  LHCNet is not provisioned to serve Tier-2 to Tier-I links. There is a risk that the incumbent networks are likely to have inadequate capacity.  If any Tier-1 center goes offline for an extended period of time, there would be significant impact on CMS computing operations.  Scalability and reliability of operators' tracking and fixing of errors for grid-submitted jobs was not explicitly demonstrated.  Recommend  USCMS work with the appropriate agency offices and network providers to ensure that CMS' computing model matches available trans-atlantic network bandwidth.  Recommend CMS develop and deploy monitoring and diagnostic tools that allow operators to manage the predicted scale of job submissions for real data, and demonstrate these tools for CSA07.

Deployment of US Tier-1, Tier-2 centers  Observation:  Current Tier-1and Tier-2 center resources are on target for computing capacities, network, cpu and disk for LHC startup.

Usability of Grid Based Software  Observations:  The grid is a key element of the software tools being used and has made good progress in terms of usability and readiness.  The ProdAgent production system is able to fully utilize the OSG and EGEE grids.  A substantial portion of the data analysis is running on the grid using CRAB.

Cybersecurity  Observations:  The procedures and policies for grid level cybersecurity are in place and have been exercised for a few minor “alarms”. No intrusion was detected in any of these cases.  The cybersecurity responsibilities at the Tier-1, Tier-2 and OSG are established.

Networks  Observations:  Demonstrated a good start on provisioning of Tier-2 links in their network topology.  The US distributed network successfully handled the Tier-1 to Tier-2 transfers with a factor of two clearance compared to the 2008 need.  Recommendation  USCMS should develop a plan to address end-to-end data transfer issues between US Tier-2 sites and non-US Tier-1 sites, presentable at the next agency review.

Comments on Last Year’s Recommendations  USCMS demonstrated adequate progress on all recommendations from last year.

CMS CLOSEOUT: CORE SOFTWARE AND ANALYSIS SUPPORT

 The review committee would like to commend the US CMS S&C SW group on the achievements during 2006 and for the careful planning for 2007  The group appears generally well prepared to handle the upcoming challenges  The demonstration was well done and useful to understand the state of the system  The group responded carefully and clearly to the questions of the review committee  The presentations were clear and concise.

Software Support  Findings:  An effective user support organization is in place and will be further extended by leveraging the effort of CMS physicists.  The US effort is well integrated into the global CMS organization.  Observations:  The group is making efficient use of the available resources.  The close integration into the global CMS organization strengthens the whole collaboration.  Recommendations:  The sharing of the user support load should be carefully monitored so as to ensure that the US CMS group will not be unfairly burdened.

Software Failures  Findings:  CSA06 has shown a 1/million failure rate of jobs due to generic software problems.  Observations:  We commend the group for this achievement.  How the failure rate was measured was not shown. Judging from the demo, this might be hard for very large data samples in the absence of QA tools.  The rate is in the right ballpark, but it is tested only on MC events. The failure rates are likely to increase when real data with real problems arrive. This is where the QA tools will be urgently needed.  Recommendations:  The rate of generic SW failures must continue to decrease even for real data. This should be carefully watched.  Appropriate QA tools must be delivered to assure this result.

Implementation of analysis model  Findings:  AODs can be analyzed rapidly (>1 kHz rate) with simple algorithms. Some significant overhead due to dCache (not optimized) was evident.  Observations:  The analysis rates for very simple tasks are appropriate but the observed dCache overhead is too large.  Recommendations:  Analysis rates should be further optimized, in particular for the use of more complicated algorithms.  The dCache performance must be optimized.

Data Management/Placement for T2  Findings:  In CSA06 data management has been successfully exercised between T0 and T1. The upcoming “full chain” exercise will extend this to the HLT/T0 interface and CSA07 will extend it to the T1/T2 interfaces.  The strategy relies on the decisions and manual operation by a few experts.  Tools are being developed to monitor data access patterns.  Observations:  The strategy to place data at the T2 level manually according to preferences and usage patterns appears wise at the initial stage of the experiment.  Recommendations:  CSA07 should be used to gain as much experience as possible on the T2 access patterns and validate the functionality of the data access pattern monitoring tools