1
DOE-NSF Comprehensive Review of US-LHC Computing
January 19, 2007
UT Arlington
CLOSEOUT SESSION
2
ATLAS CLOSEOUT: MANAGEMENT
3
Recommendations from 2006
Observations:
- The recommendations from last year have generally been addressed in commendable ways.
- We note the lack of reporting on an end-to-end cosmic ray test. Such tests are crucial for exposing integration issues with the software and computing.
Recommendations:
- US ATLAS, working with ATLAS, is urged to focus on a full system test with well-defined milestones.
4
Management: Findings and Observations
- The committee appreciates the responsiveness and candor of the US ATLAS management team.
- The management structure is effective and appropriate, with problems identified and addressed.
- We observe that the US efforts are tightly integrated with International ATLAS and the US is well represented in the management structures.
- Committees are an excellent way to achieve consensus, which is necessary in an international collaboration; however, this can be a very slow process.
- We commend US ATLAS S&C for their use of the change control board.
- There is an established plan for the transition from development to operations. It is difficult to assess this planning and whether it will address the problems that will occur.
- The Computing Model was redefined for resource requirements. The assumptions for the model are still under debate, which has a budget impact in the out-years.
5
Management: Recommendations
- US ATLAS should work with ATLAS committees to get timely delivery of reports such that risks can be identified and addressed expeditiously.
- The year before data taking is an especially stressful time when problems must be addressed as they come up; we recommend retaining some flexibility to address such problems.
- Due to the highly interdependent nature of integration tasks, it is crucial to have public, published milestones which reflect the system functionality needed to support the experiment and data collection at each point in time.
6
Analysis Readiness: Observations
- US ATLAS is taking all reasonable steps to educate the users and build consensus to ensure that US collaborators are ready and able to do physics analysis.
- The Jamborees are successful. The feedback from the collaborators is positive and enables fine-tuning and improving the process.
- The US analysis readiness has been internally reviewed by US ATLAS. We commend them for addressing the comments from last year's review.
- Locating the complete copy of the ESD at the US Tier 1 is likely to make analysis easier, at the risk of an increased single point of failure. We note that they have considered mitigations for this risk, for example by partnering with other Tier 1 sites and having flexible software.
- There is a plan for adding new collaborators. Incremental costs are not well understood; however, there is a model. US ATLAS management takes responsibility for ensuring that new collaborators can contribute effectively to ATLAS.
7
PanDA
Observations:
- PanDA has proven to be effective and is gaining international acceptance.
- A leadership role is valuable because it provides the ability to adapt to changing grid technologies and techniques.
- There is an intent to spread the operational load across the full collaboration.
Recommendations:
- For PanDA, maintain the development effort in the US and pursue the plans to increase the international support for operations.
8
DDM
Observations:
- We note that the Distributed Data Management tool issues are a risk for ATLAS success. The ATLAS collaboration has given this issue a high priority and has recently reviewed the DDM project.
Recommendations:
- We recommend that US ATLAS work with ATLAS to execute the Distributed Data Management plan with some urgency.
- We encourage US ATLAS management to ensure that a plan is in place to solve both short-term needs and long-term issues in DDM, and to make sure sufficient resources are put in place to guarantee the success of this plan.
- US ATLAS should work with ATLAS to develop concrete milestones that will guarantee a DDM that is working as soon as feasible and functional within 2007. This may require temporary solutions for some DDM system components for use in early data taking until the final solutions have been proved.
9
10% Cut Scenario
Observations:
- As outlined, a 10% funding cut would have drastic consequences.
- Even with full funding (including the proposed call on the management reserve), US ATLAS S&C has limited flexibility to reassign resources in order to address the pressing issues which will arise in the early data collection period. Any funding cut would remove that limited flexibility, leading to extreme risk for meeting essential functionality.
Recommendations:
- Focusing on existing US ATLAS responsibilities and avoiding expanding scope will make the best use of the resources.
10
ATLAS CLOSEOUT: FACILITIES, GRIDS, NETWORKING, AND INFRASTRUCTURE
11
Comments on Last Review Recommendations
- Demonstrated progress on all recommendations.
- There remains a concern with dCache performance for chaotic data analysis.
12
Computing Models
Observations:
- There is a plan to test data analysis at the Tier2 sites in advance of data taking.
- Good progress in utilizing the facilities, not only at the Tier1 but also at the Tier2 sites; getting the Tier2 sites online for simulation production has been successful.
- A full copy of the ESD at the Tier1 and a full copy of the AOD at each of the US Tier2 sites is beneficial for the data analysis capabilities of US groups. If ESD and AOD sizes remain much larger than foreseen in the computing plan, this approach may not be possible.
- Because the BNL Tier1 is the only Tier1 with the full ESD, it has been observed to be an attractive repository for international ATLAS access, potentially impacting US access.
Recommendations:
- US ATLAS should ensure that the ATLAS task force on ESD and AOD event formats reports back to US ATLAS management by early summer 2007 with a plan to address the event size problem.
- Checksum techniques should be adopted for all data transfers to ensure data integrity (an illustrative sketch follows below).
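To make the checksum recommendation concrete, a minimal sketch of verifying a transferred file against a checksum recorded at the source. This is not the ATLAS DDM implementation; the function names, the choice of SHA-256, and the example path are assumptions for illustration only.

```python
# Minimal sketch (assumed names and algorithm): verify a transferred file against
# the checksum recorded at the source before registering the copy at the destination.
import hashlib

def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Compute the checksum of a file, reading in chunks to bound memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def transfer_is_intact(source_checksum, destination_path):
    """Return True if the destination copy matches the recorded source checksum."""
    return file_checksum(destination_path) == source_checksum

# Hypothetical usage: a mismatch would trigger a re-transfer rather than
# registering a corrupted replica in the data catalog.
# ok = transfer_is_intact(recorded_checksum, "/data/esd/some_file.root")
```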
13
Deployment of US Tier-1, Tier-2 Centers
Observations:
- Tier0 to Tier1 integration has been demonstrated; Tier1 to Tier2 integration has not really been exercised so far, as a result of problems encountered in using DDM at Tier2 sites.
- The Tier3 integration plan via the OSG is reasonable, but desktop Tier3 sites seem inappropriate.
- A new Tier1 cost profile was presented with a significant increase in total costs for the years 2009 to 2011; this resulted from several factors.
- For Tier2, the cost profiles have not changed, but the projected capacities are significantly lower than the new targets in 2010 and 2011.
Recommendation:
- US ATLAS should develop a plan for the Tier2 shortfall in 2010 and 2011 by the next agencies review.
14
Infrastructure and Operations
Observations:
- Jobs at Tier2 centers continued to run during an 8-hour stand-down of BNL which took the Tier1 facility offline.
15
Usability of Grid-Based Software
Observations:
- Good progress in using the grid; the use of PanDA demonstrates significant progress by US physicists in using the grid effectively.
- US ATLAS demonstrated appropriate management links between OSG and WLCG, and showed that, using PanDA, they could submit jobs to both EGEE and OSG.
16
Cybersecurity
Observations:
- At the Tier1 facility there is an understanding of the responsibilities for cybersecurity at BNL.
- US ATLAS now has a cybersecurity officer in place.
- They are leveraging OSG expertise to address cybersecurity issues at the grid level.
Recommendations:
- Work with the cybersecurity officer to put in place a plan to mitigate the effects of cybersecurity incidents.
17
Networks
Observations:
- The current plan to add a fully redundant, diverse path from BNL to ESnet, budgeted for 2007 on top of the existing 20 Gbps links, seems adequate for the initial running of the LHC.
- The US ATLAS model, which foresees storage of the full ESD sample at BNL, means each Tier2 requires only good connectivity to BNL. Connectivity between the Tier1 and the Tier2 sites looks good (existing or planned 10 Gbps links).
18
ATLAS CLOSEOUT: CORE SOFTWARE AND ANALYSIS SUPPORT
19
Software - 1
Observation: 2007 will bring a large burden on US ATLAS core software, and there is a chance that US ATLAS support may suffer while the needs of the greater good are addressed.
Recommendation: Press International ATLAS for an integration schedule with milestones (however fluid) to define the year's activities.
Observation: ATLAS recognized the need for a user-defined ntuple, and the work to allow physics groups and individuals to create them from ATHENA is essentially done. This is to be commended.
Recommendation: Show at the next review how the DPD effort turned out: how well the structure worked and how well it was adopted by the collaboration.
20
Software - 2
Observation: US ATLAS plans to devote 1 new FTE to developing vATLAS, at the request of the ATLAS technical coordinator, with online needs as the primary driver. This is seen as an opportunity to leverage graphics expertise in the US and provide a collaboration-wide tool.
Recommendation: We believe more effort than 1 FTE will be needed, and it should be pursued in International ATLAS. Similarly for PanDA, develop a strategy for ATLAS-wide adoption.
Observation: US support groups are potentially exposed to high levels of requests from across the collaboration. On one hand, this is a sign of a job well done in terms of being recognized experts.
Recommendation: Assess the potential workload and develop a mitigation strategy to limit the exposure. Feed questions posed to the US groups back to the UK workbook effort to minimize repeat questions. Note that there was no response this year to last year's recommendation on regular assessments of support load.
21
Software - 3
Observation: Release validation appears woefully inadequate. We saw no system-wide testing. The failure rate is shown graphically to spike with each new release. Even so, there is no measure of ongoing algorithm quality.
Recommendations: Press International ATLAS to develop a culture of code QA and testing, and to develop system tests this year. This should reduce the failure rate from software failures as well. The target failure rate as reported in the Answers should be clarified; it appears to quote the job failure rate due to software failure, not the per-event probability (see the sketch below for how the two quantities relate).
Observation: The software demo was very satisfactory, showing that the software, at this stage, is in a functional state, if not elegant.
Observation: As always, we commend US ATLAS for their level of responsibility in International ATLAS, and for the key projects they develop and support.
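To make the distinction between the two failure-rate quotes concrete, a small sketch of the arithmetic relating them, assuming failures are independent across events. The event count per job (10,000) is a hypothetical figure, not taken from the review.

```python
# Sketch only: relate a per-event failure probability to a per-job failure rate,
# assuming independent failures. The 10,000 events per job is a hypothetical value.

def job_failure_probability(p_event, events_per_job):
    """Probability that at least one event in a job hits a software failure."""
    return 1.0 - (1.0 - p_event) ** events_per_job

# A per-event probability of 1e-6 over a 10,000-event job already gives roughly a
# 1% per-job failure rate, so quoting one quantity in place of the other shifts
# the target by orders of magnitude.
print(job_failure_probability(1e-6, 10_000))  # ~0.00995
```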
22
CMS CLOSEOUT: MANAGEMENT
23
Recommendations from 2006
Observations:
- The recommendations from last year have generally been addressed in commendable ways.
- We commend US CMS for their commitment to the success of the experiment.
Recommendations:
- Even in light of the increased pressures of turn-on, US CMS S&C must work with CMS to establish well-defined US deliverables.
24
Management: Findings and Observations
- The committee appreciates the responsiveness of the US CMS S&C management team to the committee's questions.
- The new CMS management structure is targeted towards moving to an operational mode and appears to be converging. The organization looks plausible, although many appointments remain open, which is a concern given the time-critical nature of having a stable and effective organization.
- We observe that the US is well represented in the management structures.
- A goal of the CMS management structure is to provide well-defined US CMS S&C deliverables.
- US CMS S&C management has been effective in allocating resources to address changes in strategies.
- US CMS S&C is focusing on achieving a sustainable operations model.
25
Management: Recommendations
- US CMS S&C should continue to work with CMS management to fill the open positions in the new CMS Computing organization in a timely way.
- We recommend that US CMS S&C work with CMS to define and reward physicist participation in software and computing operations tasks.
- We recommend a timely process for including software and offline in the Memoranda of Agreement.
- The year before data taking is an especially stressful time when problems must be addressed as they come up; we recommend retaining flexibility within US CMS S&C responsibilities to address such problems.
- Due to the highly interdependent nature of integration tasks, it is crucial to have public, published milestones which reflect the system functionality needed to support the experiment and data collection at each point in time.
26
Analysis Readiness: Observations
- We commend the completion of the CMS Physics TDR (and the contributions of US physicists) and encourage continued and proportionate participation by US CMS physicists, including FNAL scientific staff.
- There are active processes to continue to enable the LPC to be an effective organization for US physicists and to use the LPC-CAF to provide resources for US CMS physicists.
27
10% Cut Scenario
Observations:
- US CMS S&C has analyzed a 10% funding cut; a mitigation plan exists and the impact is understood.
- While the impact is understood, any funding cut would remove needed flexibility, leading to risk for addressing the unanticipated problems which are bound to arise.
Recommendations:
- We recommend that US CMS S&C work to specify their scope and reconcile it with their resources. This should be expressed within the MOA.
28
CMS CLOSEOUT: FACILITIES, GRIDS, NETWORKING, AND INFRASTRUCTURE
29
Computing Models
Observations:
- They have shown that the Tier-2 sites can participate in analysis, as demonstrated during CSA06.
- AOD use in analysis has achieved sufficient performance to make user-based analysis effective.
Recommendations:
- US CMS should follow its stated plan to scale up the exercise of analysis at Tier-2 sites to meet capability targets.
- We recommend that global CMS adopt checksum techniques for all data transfers to ensure data integrity.
30
Infrastructure and Operations
Observations:
- LHCNet is not provisioned to serve Tier-2 to Tier-1 links; there is a risk that the incumbent networks will have inadequate capacity.
- If any Tier-1 center goes offline for an extended period of time, there would be a significant impact on CMS computing operations.
- The scalability and reliability of operators' tracking and fixing of errors for grid-submitted jobs was not explicitly demonstrated.
Recommendations:
- Recommend US CMS work with the appropriate agency offices and network providers to ensure that the CMS computing model matches the available transatlantic network bandwidth.
- Recommend CMS develop and deploy monitoring and diagnostic tools that allow operators to manage the predicted scale of job submissions for real data, and demonstrate these tools for CSA07.
31
Deployment of US Tier-1, Tier-2 Centers
Observation: Current Tier-1 and Tier-2 center resources are on target for LHC startup in computing capacities: network, CPU, and disk.
32
Usability of Grid-Based Software
Observations:
- The grid is a key element of the software tools being used and has made good progress in terms of usability and readiness.
- The ProdAgent production system is able to fully utilize the OSG and EGEE grids.
- A substantial portion of the data analysis is running on the grid using CRAB.
33
Cybersecurity
Observations:
- The procedures and policies for grid-level cybersecurity are in place and have been exercised for a few minor "alarms". No intrusion was detected in any of these cases.
- The cybersecurity responsibilities at the Tier-1, Tier-2, and OSG levels are established.
34
Networks
Observations:
- Demonstrated a good start on provisioning of Tier-2 links in their network topology.
- The US distributed network successfully handled the Tier-1 to Tier-2 transfers with a factor of two clearance compared to the 2008 need.
Recommendation:
- US CMS should develop a plan to address end-to-end data transfer issues between US Tier-2 sites and non-US Tier-1 sites, presentable at the next agency review.
35
Comments on Last Year's Recommendations
US CMS demonstrated adequate progress on all recommendations from last year.
36
CMS CLOSEOUT: CORE SOFTWARE AND ANALYSIS SUPPORT
37
The review committee would like to commend the US CMS S&C software group on the achievements during 2006 and for the careful planning for 2007.
- The group appears generally well prepared to handle the upcoming challenges.
- The demonstration was well done and useful for understanding the state of the system.
- The group responded carefully and clearly to the questions of the review committee.
- The presentations were clear and concise.
38
Software Support
Findings:
- An effective user support organization is in place and will be further extended by leveraging the effort of CMS physicists.
- The US effort is well integrated into the global CMS organization.
Observations:
- The group is making efficient use of the available resources.
- The close integration into the global CMS organization strengthens the whole collaboration.
Recommendations:
- The sharing of the user support load should be carefully monitored so as to ensure that the US CMS group will not be unfairly burdened.
39
Software Failures
Findings:
- CSA06 has shown a 1-per-million failure rate of jobs due to generic software problems.
Observations:
- We commend the group for this achievement.
- How the failure rate was measured was not shown. Judging from the demo, this might be hard for very large data samples in the absence of QA tools.
- The rate is in the right ballpark, but it has been tested only on MC events. The failure rates are likely to increase when real data with real problems arrive. This is where the QA tools will be urgently needed.
Recommendations:
- The rate of generic software failures must continue to decrease, even for real data. This should be carefully watched.
- Appropriate QA tools must be delivered to assure this result.
40
Implementation of the Analysis Model
Findings:
- AODs can be analyzed rapidly (>1 kHz event rate) with simple algorithms; a minimal sketch of measuring such a rate follows below. Some significant overhead due to dCache (not optimized) was evident.
Observations:
- The analysis rate for very simple tasks is appropriate, but the observed dCache overhead is too large.
Recommendations:
- Analysis rates should be further optimized, in particular for the use of more complicated algorithms. The dCache performance must be optimized.
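As a way of making the >1 kHz figure concrete, a minimal sketch of how an event-processing rate might be measured over an event loop. The event source and the per-event work are placeholders, not the CMS analysis framework, and a real measurement would also reflect dCache read overhead.

```python
# Sketch only: measure events processed per second for a given per-event callable.
# The event source and analysis step are placeholders for illustration.
import time

def measure_event_rate(events, analyze):
    """Return the number of events processed per second."""
    start = time.perf_counter()
    count = 0
    for event in events:
        analyze(event)
        count += 1
    elapsed = time.perf_counter() - start
    return count / elapsed if elapsed > 0 else float("inf")

# Hypothetical usage with a trivial stand-in workload:
# rate = measure_event_rate(range(100_000), lambda e: e * e)
# print(f"{rate:.0f} events/s")  # compare against the ~1 kHz target for simple algorithms
```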
41
Data Management/Placement for T2
Findings:
- In CSA06, data management was successfully exercised between T0 and T1. The upcoming "full chain" exercise will extend this to the HLT/T0 interface, and CSA07 will extend it to the T1/T2 interfaces.
- The strategy relies on decisions and manual operation by a few experts. Tools are being developed to monitor data access patterns.
Observations:
- The strategy of placing data at the T2 level manually, according to preferences and usage patterns, appears wise at the initial stage of the experiment.
Recommendations:
- CSA07 should be used to gain as much experience as possible on the T2 access patterns and to validate the functionality of the data access pattern monitoring tools.