Ian Bird, WLCG MB, 16th February 2016



Introduction
- Two days devoted to medium-term (Run 2-3) and longer-term (Run 4) concerns
- ~140 people registered
- Aimed for a discussion format rather than presentations
- (Informal) feedback from many said this was useful
  - Some aspects probably needed a bit more preparation to be more successful


Security, AAA, etc.
- Fully address traceability:
  - Freeze deployment of glexec: keep it supported for existing use, but there is no point in expending further effort on deployment
  - Need full traceability solutions and tools:
    - Experiment frameworks: work needed to get VO traceability info into the CERN SoC (or to …)
    - Use of VMs/containers helps
    - Invest in deploying "big data" tools for managing traceability data (SoC capability, appliances?)
  - Traceability working group needed?
  - This reflects trusting the VOs themselves now, rather than trying to trust many individuals
- Incident response & dealing with threats (against people)
  - Invest in (coordinated within WLCG) better intelligence/trust with other communities and with vendors
- Federated IDs: long-term goal
  - eduGAIN etc.; develop use cases within the AARC2 project
- Policy work
  - Data, privacy, etc.
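The pilot-job traceability point above can be made concrete with a toy sketch: site-level logs identify only the pilot, so the VO framework's own records are needed to map a pilot back to the individual users whose payloads it ran. All record shapes and field names below are hypothetical, not any real framework's schema.

```python
def trace_payload(pilot_logs, vo_records, pilot_id):
    """Toy traceability join (hypothetical data layout):
    combine the site's view of a pilot job with the VO
    framework's record of which users' payloads it executed."""
    site_record = pilot_logs[pilot_id]            # what the site can see
    payload_users = vo_records.get(pilot_id, [])  # what only the VO knows
    return {
        "pilot": pilot_id,
        "site_record": site_record,
        "payload_users": payload_users,
    }
```

The point of the sketch is the dependency, not the code: without the VO-side records, the site record alone cannot attribute activity to a user.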

Compute
- Lots of discussion of clouds and cloud use
  - Models of how to provision and access commercial clouds are evolving
    - HNSciCloud will explore more aspects
  - Many models of using clouds, containers, VMs (vac, batch queues, etc.)
    - Sharing experience in the GDB is probably the correct way to proceed for the moment
- Lots of discussion on the use of HPC
  - Useful in certain specific circumstances or as opportunistic resources
  - Significant effort is expended in this area for a few % gain in resources
    - Not to be ignored, but can we gain more in other areas for a similar effort?
  - What should our strategy be here: generalise to opportunistic resources more broadly?
  - Issues of IP connectivity, lack of storage access, etc. (seen in HPC, clouds, etc.)
    - Addressing these fully would benefit our entire operation
    - Long-standing concern over connectivity and its implications at sites

Data
- Object storage: multiple motivations
  - Scalability (exploiting less metadata) as embedded storage
  - Also nicer/more modern tools
- Roles of smaller sites (or those with little support effort)
  - Demos: describe scenarios, ask for supporters, drop the rest
  - Catalogued cache (e.g. DPM)
  - Proxy cache (e.g. XRootD)
  - Rob (largely non-HEP-specific components, trust?)
  - BOINC (no operator, no shared storage)
- Common questions
  - Prove the simulation use case
  - Analysis at small sites will be compressed
  - Estimate the impact on T1s (e.g. WAN stage-out)
- Federation of storage, desired by some experiments
  - Prefer a single storage endpoint to aggregate resources across several sites
  - Maintain redundancy and replica locality
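The catalogued/proxy cache idea for small sites boils down to read-through caching: serve locally if the file is already cached, otherwise fetch it from the origin storage and keep a copy. A minimal sketch, illustrative only; a real deployment would use DPM or an XRootD proxy cache, not this toy class, and the local directories here merely stand in for remote storage.

```python
import shutil
from pathlib import Path

class ReadThroughCache:
    """Toy read-through cache for a small site (illustration only).
    origin_dir stands in for the remote storage element."""

    def __init__(self, origin_dir, cache_dir):
        self.origin = Path(origin_dir)
        self.cache = Path(cache_dir)
        self.cache.mkdir(parents=True, exist_ok=True)
        self.hits = 0
        self.misses = 0

    def open(self, name: str) -> bytes:
        local = self.cache / name
        if local.exists():
            self.hits += 1            # cache hit: no WAN traffic
        else:
            self.misses += 1          # cache miss: pull from origin
            shutil.copy(self.origin / name, local)
        return local.read_bytes()
```

The same logic applies whether the cache is catalogued (the site knows what it holds) or a transparent proxy; the difference is who keeps track of the contents.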


Info sys, accounting, etc.
- Information system:
  - Too much "by hand" information in too many places: error-prone
  - Does WLCG need an IS? (my impression is "yes")
    - But it should be kept as simple as possible, focused on service discovery
    - Benchmark data should be separated out
  - Suggestion (and work done) to use AGIS for this
    - Needs agreement before we proceed further
    - The alternative is to do nothing and let experiments gather info directly from GOCDB, OIM, etc.
- Benchmarking: we need
  - A real benchmark (HS06 or an update) for:
    - Procurement, reporting, expressing requirements, etc.
  - A fast "calibration" benchmark to run e.g. at the start of every pilot
    - Needed for understanding the environment
    - Essential for opportunistic resources or cloud use
    - Ideal if we could agree on a single such fast benchmark for everyone
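The fast "calibration" benchmark could be as small as timing a fixed CPU kernel at pilot start-up and reporting a score relative to a reference machine. A hedged sketch: both the kernel and the reference constant below are made-up placeholders, not the HS06 or DB12 definitions.

```python
import time

def fast_calibration_score(loops: int = 200_000) -> float:
    """Toy pilot start-up benchmark (illustrative only):
    time a fixed floating-point kernel and normalise against
    an assumed runtime on a reference node."""
    start = time.perf_counter()
    x = 0.0
    for i in range(1, loops + 1):
        x += 1.0 / (i * i)            # fixed CPU-bound workload
    elapsed = time.perf_counter() - start
    REFERENCE_SECONDS = 0.05          # assumed reference-node runtime
    return REFERENCE_SECONDS / elapsed  # >1.0 means faster than reference
```

A score like this is cheap enough to run in every pilot, which is exactly what makes it usable on opportunistic or cloud resources where the hardware is unknown in advance.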

Accounting, cont.
- Accounting
  - Work has been done on cloud accounting in EGI
  - Not clear how to publish accounting from commercial clouds or HPC (or non-pledged resources in general)
  - Wallclock vs CPU-time reporting: not discussed
- We should review formally what each stakeholder needs from accounting
  - Experiments, funding agencies, sites, etc.
  - What data should be gathered and reported?
  - Today's accounting system has grown over 10 years: time to review what we want from it and how to manage it
    - Also to manage expectations of the data itself
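One concrete question behind the wallclock-vs-CPU-time point is how efficiency would actually be computed from accounting records. A toy sketch with a hypothetical record layout (not the real APEL/EGI schema): efficiency is CPU time consumed divided by wallclock time multiplied by the cores held.

```python
def cpu_efficiency(records) -> float:
    """Aggregate CPU efficiency over accounting records.
    Each record is a dict with hypothetical fields:
    cpu_seconds, wall_seconds, cores."""
    cpu = sum(r["cpu_seconds"] for r in records)
    # Wallclock must be weighted by cores held, or multi-core
    # jobs look artificially efficient.
    wall = sum(r["wall_seconds"] * r["cores"] for r in records)
    return cpu / wall if wall else 0.0
```

Even this toy version shows why the reporting choice matters: a 4-core job that uses 300 CPU-seconds in 100 wallclock seconds is 75% efficient, not 300% efficient.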


Observations
- A general lack of preparedness to think about the longer term
  - People are tightly focused on immediate concerns and Run 2
- Probably a lack of clarity over what the situation for the Phase 2 upgrades will be:
  - In terms of requirements: what is the real scale of the problem? We need better estimates
  - What we can really expect from technology
  - An understanding of the real limitations of the system we have today
- We should also bear in mind that while we potentially need to instigate revolutionary changes in computing models, we will nevertheless face an evolutionary deployment
- Concerns over software and efficiency (in all aspects) will be a significant area of work
- Commonalities may be possible in new tools/services, or in the next generation of existing ones
- Propose a number of activities to address some of these aspects

1) Definition of the upgrade problem
- Set up a study group to:
  - Establish and update estimates of actual computing requirements for HL-LHC, more realistic than previous estimates:
    - What are the baseline numbers for data volumes/rates, CPU needs, etc.?
  - Look at the long-term evolution of computing models and large-scale infrastructure
    - We need both visionary "revolutionary" model(s) that challenge assumptions and "evolutionary" alternatives
  - Explore possible models that address:
    - Today's shortcomings
    - Making the best use of evolving technologies
    - Expectations of how the environment may evolve: large-scale joint procurements, clouds, interaction with other HEP/astroparticle/other sciences
    - Possible convergence of (the next generation of) the main toolsets
  - Build a realistic cost model of LHC computing to help evaluate the various models and proposals: this will be key to guiding the direction of solutions
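A cost model of the kind proposed here could start as nothing more than resource requirements multiplied by unit prices, refined later with real procurement data and evolution factors. A deliberately naive sketch: every unit price below is a placeholder assumption, not a real figure.

```python
def yearly_cost(cpu_hs06_years: float, disk_pb: float, tape_pb: float,
                price_per_hs06_year: float = 0.5,
                price_per_disk_pb: float = 80_000.0,
                price_per_tape_pb: float = 15_000.0) -> float:
    """Toy first-order cost model (all unit prices are
    made-up placeholders): yearly cost as a linear
    combination of CPU, disk and tape requirements."""
    return (cpu_hs06_years * price_per_hs06_year
            + disk_pb * price_per_disk_pb
            + tape_pb * price_per_tape_pb)
```

Even a linear model like this is enough to compare proposals ("more tape, less disk" vs "more CPU, recompute on demand") once the unit prices are grounded in actual procurements.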

2) Software-related activities
- Strengthen the HSF:
  - "Improve software performance":
    - Need to define what the goals are here
    - Need to define metrics for performance, e.g. time to completion vs throughput vs cost
    - Continue the concurrency forum/HSF activities, but try to promote them more
    - And other initiatives, such as reconstruction algorithms
  - Techlab:
    - Expand it as a larger-scale facility under the HSF umbrella
    - Include support tools (profilers, compilers, memory, etc.), along with support and training; CERN openlab can also help here
    - Should be collaborative: CERN plus other labs
  - Technology review:
    - "PASTA": reform the activity and make it ongoing, updating the report every ~2 years, with a broad group of interested experts
    - Also under the HSF umbrella; strongly related to the above activities

3) Performance evaluation/"modelling"
- Investigate real-world performance of today's systems:
  - Why is performance so far from simple estimates of what it should be?
  - At different granularities/scales:
    - Application on a machine
    - Site level: bottlenecks, large-scale performance (different scales of site, different workflows)
    - Overall distributed system: at which level? Are data models and workflows appropriate?
- Once we have a better handle on actual performance, can we derive some useful models/parameterisations?
  - Useful enough to guide choices of computing models; they don't have to be perfect or complete
- This feeds into any cost models
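A first parameterisation of site performance might capture only whether a workflow is CPU-bound or I/O-bound: the achievable event rate is the smaller of the two limits. A minimal sketch with hypothetical parameters; real modelling would of course need many more terms.

```python
def event_throughput(cores: int, events_per_core_s: float,
                     io_mb_s: float, mb_per_event: float) -> float:
    """Toy site-level throughput model (parameters are
    hypothetical): events/s is limited either by CPU
    capacity or by storage/network bandwidth."""
    cpu_limit = cores * events_per_core_s   # CPU-bound rate
    io_limit = io_mb_s / mb_per_event       # I/O-bound rate
    return min(cpu_limit, io_limit)
```

Even this crude form answers a useful question: if a site delivers far less than `min(cpu_limit, io_limit)`, the gap itself is the "why is performance so far from simple estimates" problem the slide asks about.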

4) Prototyping (demonstrators)
- Specific prototyping of some of the ideas that arise from the above activities
- For example:
  - Data or storage management
    - Storage federations, caches rather than "SEs"
    - Etc.
  - Optimisation of sites with little effort or expertise
    - "Site in a box" appliance, BOINC, vac, etc.
    - What about caching, stage-out, etc.?
  - Others as ideas arise
- A common activity here would help us eventually evolve common solutions into production

How to progress – 1
- Study group
  - Set up now; needs a chairperson and experts from each experiment
  - Initial tasks:
    - Understand likely realistic performance and resource requirements; establish a baseline
    - Look into building a realistic cost model
  - Later:
    - Look at straw-man models based on a better understanding of the above

Progress – 2
- Software and technology related
- Mandate the HSF to:
  - Define metrics for what should be achieved, and how to describe performance (see next)
  - Set up an ongoing technology review group
  - Work with the CERN Techlab team to define what that should look like (and how to expand the collaboration)
  - Consider what can be done about long-term careers and recognition of software development

Progress – 3
- Performance/modelling
  - A small team in IT is starting to work on this and consolidate existing efforts
  - Define a programme of work to look at current performance and concerns
  - Propose to mandate this team (led by Markus Schulz) to engage experts in the experiments and at sites
    - Define some initial goals

Demonstrators
- This seems a natural way to move forward
- Some could be under the umbrella of the WLCG operations team
- Each project would need to define goals and metrics, and engage volunteers/effort
- Projects can be either medium-term investigations/prototypes
  - E.g. storage federation across sites
- Or longer-term demonstrators
  - New models

Summary
- Medium term
  - A lot of work is ongoing
    - Including other aspects not discussed in Lisbon (e.g. cost of operations)
  - Useful to have (as discussed previously) a technical forum to coordinate all the activities?
    - Coordination: a chairperson, the GDB chair, one person per experiment (senior enough)
    - The GDB and operations teams are useful mechanisms for discussion/work
- Longer term
  - Three areas of work proposed
  - Should be managed by the MB; we also need to work towards a more concrete plan
- Prototypes/demonstrators
  - A useful way to explore ideas and eventually converge on common solutions?
