Slide 1: Computing Fabrics & Networking Technologies Summary Talk
Tony Cass (usual disclaimers apply!)
October 2nd 2010
Slide 2: What was covered?
- Dogs that didn't bark: Lustre/GPFS, IPv6
Slide 3: Other observations (Ian Bird, CERN)
- Availability of grid sites is hard to maintain...
- Hardware is not reliable, commodity or not: with 100 PB of disk worldwide, something is always failing
- Problems are (surprisingly?) generally not middleware related...
- Missing: (global) prioritisation, ACLs, quotas, ...
- Actual use cases today are far simpler than what the grid middleware attempted to provide for
  - e.g. the advent of "pilot jobs" changes the need for brokering
- Deployment of upgrades/new services is very slow
- Inter-dependencies between application, middleware and OS are a nightmare
- Have we (HEP) really understood how to use a distributed architecture?
- Providing reliable data management is still an outstanding problem
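The "pilot jobs" point above can be sketched in a few lines: instead of a broker matching jobs to sites in advance, a lightweight pilot starts on a worker node and pulls real work from a central queue only once it is actually running. This is a minimal illustrative sketch, not any experiment's actual framework; all names are hypothetical, and real pilots run concurrently on many worker nodes rather than one after another as here.

```python
# Hypothetical sketch of the "pilot job" pattern: pilots pull work from a
# central queue instead of having a broker push jobs to sites in advance.
import queue

def run_pilot(task_queue, results, pilot_id):
    """A pilot drains tasks until the queue is empty, then exits."""
    while True:
        try:
            task = task_queue.get_nowait()
        except queue.Empty:
            return  # no more work: the pilot simply terminates
        # A real pilot would validate its environment before fetching work;
        # here the "payload" is just doubling a number.
        results.append((pilot_id, task, task * 2))

tasks = queue.Queue()
for t in range(6):
    tasks.put(t)

results = []
for pid in ("pilot-a", "pilot-b"):   # sequential here; concurrent in reality
    run_pilot(tasks, results, pid)

print(len(results))  # -> 6: every task processed, with no central brokering
```

The design point is that late binding of work to resources removes the need for the broker to predict which site a job should land on.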
Slide 4: Fabric Management
Contributions: PS11-1-006, PS11-3-282, PS11-4-374, PO-WED-040
- Run complex large-scale reconfigurations (Castor, Batch)
- Improve performance and security measures
- Allow end-users to reconfigure their fabrics
- Introduce dynamic clusters
- Enable web-based administration
Slide 5: Fabric Management (continued)
Same slide as slide 4, with annotations:
- Many presentations and interesting plots... BUT... do we keep re-inventing the wheel?
- What's hot? Puppet. What's not? quattor.
Slide 6: Other observations (continued)
Same slide as slide 3 (Ian Bird, CERN), with annotations:
- Reliability is being addressed, but hands are often tied... [PS11-2-072]
- We have the technology to deploy rapidly at sites; slow deployment is a communication problem, not a technical one
- "lots of" / "too much" callouts against individual points
Slide 7: Storage
Slide 8: Evolution of Data Management (Ian Bird, CERN)
- 1st workshop held in June
  - Recognition that the network, as a very reliable resource, can optimise the use of the storage and CPU resources; the strict hierarchical MONARC model is no longer necessary
  - Simplification of the use of tape and the interfaces
  - Use disk resources more as a cache
  - Recognise that not all data has to be local at a site for a job to run: allow remote access (or fetch to a local cache); it is often faster to fetch a file from a remote site than from local tape
- Data management software will evolve
  - A number of short-term prototypes have been proposed
  - Simplify the interfaces where possible; hide details from end-users
- Experiment models will evolve
  - To accept that information in a distributed system cannot be fully up-to-date; use remote access to data and caching mechanisms to improve overall robustness
- Timescale: 2013 LHC run
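The "fetch to a local cache" idea above amounts to a read-through cache: a job asks for a file by name; on a miss the file is fetched from the remote site and kept locally for subsequent reads. The sketch below is purely illustrative (the dict-backed "remote site" and all names are hypothetical) and does not reflect any actual WLCG data-management tool.

```python
# Illustrative read-through cache for the "fetch to a local cache" idea.
# remote_fetch() stands in for a WAN copy from another site; in the model
# above this is often faster than recalling the file from local tape.
import os
import tempfile

def make_cache(remote_fetch, cache_dir):
    """Return an open_file(name) that fetches through the remote on a miss."""
    def open_file(name):
        local = os.path.join(cache_dir, name)
        if not os.path.exists(local):          # cache miss: go to the remote
            data = remote_fetch(name)
            with open(local, "wb") as f:
                f.write(data)
        with open(local, "rb") as f:           # cache hit path
            return f.read()
    return open_file

# Toy "remote site": a dict, with a counter so misses vs hits are visible.
store = {"events.root": b"binary payload"}
calls = {"n": 0}
def remote_fetch(name):
    calls["n"] += 1
    return store[name]

cache_dir = tempfile.mkdtemp()
open_file = make_cache(remote_fetch, cache_dir)
open_file("events.root")   # miss: triggers one remote fetch
open_file("events.root")   # hit: served from local disk
print(calls["n"])          # -> 1
```

The same structure is what lets "not all data has to be local" hold: the job's read interface is identical whether the byte stream comes from cache or from the wide-area network.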
Slide 9: PS35-35-304 & PS52-1-303
Slide 10
No comparable presentations from dCache in the track, but the team is also working on "demonstrators" after the WLCG data management jamboree.
Slide 11: Evolution of Data Management (continued)
Same slide as slide 8, annotated: "Work in progress".
Slide 12: Evolution and sustainability (Ian Bird, CERN)
- Making what we have today more sustainable (software, operations, etc.) is a challenge
- Data issues:
  - Data management and access
  - How to make reliable systems from commodity (or expensive!) hardware
  - Fault tolerance
  - Data preservation and open access
- Need to adapt to changing technologies:
  - Use of many-core CPUs (and other processor types?)
  - Global filesystems (soon...)
  - Virtualisation
- Network infrastructure
  - This is the most reliable service we have
  - Invest in networks and make full use of the distributed system
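The fault-tolerance point above is that with unreliable hardware, reliability has to come from the software layer, typically by retrying against replicas rather than trusting any single disk or server. A minimal sketch of that pattern, with hypothetical site names and a stand-in fetch function:

```python
# Minimal failover sketch: when any single disk or server may fail,
# reliability comes from trying replicas in turn, not from the hardware.
def read_with_failover(replicas, fetch):
    """Try each replica in order; raise only if every one fails."""
    errors = []
    for site in replicas:
        try:
            return fetch(site)
        except IOError as exc:          # commodity hardware: failures expected
            errors.append((site, exc))
    raise IOError("all replicas failed: %r" % errors)

def flaky_fetch(site):
    """Stand-in for a real transfer; the first replica is down."""
    if site == "site-a":
        raise IOError("disk failure")
    return "data from " + site

print(read_with_failover(["site-a", "site-b"], flaky_fetch))
# -> data from site-b
```

The same shape applies whether the replicas are disks in a RAID-less storage pool or whole sites in a distributed system; what changes is only the cost of each retry.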
Slide 13: CPU Technology
Best summarised by Sverre Jarp in the plenary: software needs to adapt to fully exploit chip potential.
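One concrete form of "adapting to fully exploit chip potential" is making per-event work run across all cores instead of in a serial loop. The sketch below uses plain Python multiprocessing as a generic stand-in; the `reconstruct` function and its workload are hypothetical, and real HEP frameworks take very different approaches.

```python
# Generic many-core sketch: independent per-event work fans out across a
# process pool instead of running in a serial loop on one core.
# This is plain Python multiprocessing, not any HEP framework's code.
from multiprocessing import Pool

def reconstruct(event):
    """Stand-in for per-event reconstruction: CPU-bound, independent work."""
    return sum(i * i for i in range(event % 7 + 1))

def process_events(events, workers=4):
    with Pool(processes=workers) as pool:
        return pool.map(reconstruct, events)   # events spread across cores

if __name__ == "__main__":
    print(process_events(range(8), workers=2))
    # -> [0, 1, 5, 14, 30, 55, 91, 0]
```

Because the events are independent, the result is identical to the serial loop; the adaptation is purely in how the work is scheduled onto the chip.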
Slide 14: Virtualisation (Batch)
Contributions: PS29-1-015, PS29-2-143, PS29-3-278, PS13-5-124
Slide 15: Evolution and sustainability (continued)
Same slide as slide 12, with annotations:
- "You have been warned..."
- "...is coming to a site near you (if it isn't already there!)"
- "Conspicuous by absence"
Slide 16: Areas for evolution (2) (Ian Bird, CERN)
- Grid middleware
  - Complexity of today's middleware compared to the actual use cases
  - Evolve by using more "standard" technologies: e.g. message brokers and monitoring systems are first steps
- Global AAI:
  - SSO
  - Evolution/replacement/hiding of today's X509
  - Use existing ID federations?
  - Integrate with commercial/open-source software?
- Fabric
  - Are we using our resources as effectively as possible? (power is an issue)
  - Use of remote data centres
Slide 17: Virtualisation (Infrastructure)
PO-WED-26; and also: PO-WED-041, PS35-1-375 & PS35-2-141
Slide 18: Storage Efficiency
PS46-3-310, PO-MON-062; also: TReqS (PO-WED-028), PS52-2-460
Slide 19: Areas for evolution (2) (continued)
Same slide as slide 16, with annotations:
- Being addressed.
- No mention, but... CERNVM-FS for software distribution @ PIC [PO-MON-016]
Slide 20: Infrastructure
- PS05-5-509: one infrastructure talk. Improved PUE, but only 65 kW. Is it cost-effective long term to add capacity in such small chunks?
- Interesting development! [PS11-2-072]
Slide 21: So what do we need from networks? (Ian Bird, CERN)
- Bandwidth, reliability, connectivity
  - Not forgetting that we have collaborators on six continents, including the outer edges of Europe
  - A group has been set up to express these requirements in conjunction with the network communities
- But we also need a service:
  - Monitoring: largely missing today; we have a hard time understanding whether there is a network problem
  - Operational support: a complex problem with many organisations involved. Who owns the problem? How can we (the users) track progress?
Slide 22: Networking
PS17-1-150, PS17-3-337
Slide 23: Networking
PS23-2-198: network aware, but policy awareness is also needed.
Slide 24: PS52-3-356
Slide 25: So what do we need from networks? (continued)
Same slide as slide 21, with annotations:
- Bandwidth will be there, but expect to pay.
- There is monitoring: lots shown here at WAN level. But is (LAN-level) monitoring adequate to resolve the problems seen, and is it available to the right people?
- No answers here.
Slide 26: Implications for networks (Ian Bird, CERN)
- The hierarchy of Tier 0, 1, 2 is no longer so important; Tier 1 and Tier 2 may become more equivalent for the network
- Traffic could flow between countries as well as within them (already the case for CMS)
- Network bandwidth (rather than disk) will need to scale more with users and data volumes
- Data placement will be driven by demand for analysis, not by pre-placement
Annotation: "Tier2s!!!"
Slide 27: Virtualisation and "clouds"
...another hype / marketing / diversion??? Yes, but:
- Virtualisation is already helping in several areas:
  - Breaking the dependency nightmare
  - Improving system management; provision of services on demand
  - Potential to help use resources more effectively and efficiently (many of us have power/cooling limitations)
  - Use of remote computer centres
- Cloud technology
  - Let's not forget why we have and need a "grid"; much of this cannot be provided by today's "cloud" offerings:
    - Collaboration (VOs), worldwide AAI and trust, dispersed resources (hardware and people), ...
  - Although we should be able to make use of commercial clouds transparently
Refs: PS23-2-198, PO-MON-009
28
Summary Summarised Fabrics working well! Many interesting presentations Well attended –Thanks to all those who braved the rain! Virtualisation topics split across 3 tracks –Dedicated track for CHEP ’12? »or will it all be routine by then? We seem to be addressing many of Ian’s concerns but… –wheels are often reinvented –developments sometimes occur in isolation Still scope for improved collaboration between sites and between different work areas. 28