Computing on the Grid and in the Clouds Rocío Rama Ballesteros CERN IT-SDC Support for Distributed Computing Group

Overview: The computational problem – The challenge – Grid computing – The WLCG – Operational experience – Future perspectives

The Computational Problem – Where does it come from?

The Source of all Data – Delivering collisions at 40 MHz

A Collision

An Event
Raw data:
– Was a detector element hit?
– ADC counts
– Time signals
Reconstructed data:
– Momentum of tracks (4-vectors)
– Origin
– Energy in clusters (jets)
– Particle type
– Calibration information
– …
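
To make the raw/reconstructed distinction concrete, here is a minimal sketch in Python of how such an event record might be modelled; the field and class names are illustrative and are not taken from any experiment's actual event data model.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class RawHit:
        """One detector element readout: was it hit, with what charge, and when."""
        detector_id: int
        adc_counts: int      # digitised pulse height
        time_ns: float       # time signal relative to the bunch crossing

    @dataclass
    class RecoTrack:
        """A reconstructed track: a 4-vector plus its origin and particle hypothesis."""
        px: float
        py: float
        pz: float
        energy: float
        origin: tuple        # (x, y, z) of the reconstructed vertex
        particle_type: str   # e.g. "muon", "pion"

    @dataclass
    class Event:
        event_number: int
        raw_hits: List[RawHit] = field(default_factory=list)
        tracks: List[RecoTrack] = field(default_factory=list)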

Data Acquisition – 1 GB/s

Data Flow – Data flow to permanent storage: 4-6 GB/s (individual streams of 1-2 GB/s, up to 4 GB/s)

Reconstruction and Archival – First reconstruction, data quality, calibration

An event's lifetime – Anna Sfyrla, Summer Student Lecture: From Raw Data to Physics. Monte Carlo (MC) simulation takes as much computing as the rest!

The Computing Challenge – Scale: data, computing, complexity

Data Volume & Rates 30+PB per year + simulation Preservation – for ∞ Processing – 340k + cores Log scale Understood when we started

Big Data! Raw data is duplicated; simulated data and many derived data products are added; datasets are recreated as the software improves and replicated so that physicists everywhere can access them. A few PB of raw data becomes ~100 PB!
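
A back-of-the-envelope sketch of how a few PB of raw data can grow to ~100 PB; the multipliers below are illustrative assumptions for this example, not the experiments' real numbers.

    raw_pb = 5                                   # a few PB of raw data per year (assumed)
    raw_copies = raw_pb * 2                      # raw data duplicated for safety          -> 10 PB
    simulation = raw_pb * 3                      # simulated data rivals real data         -> 15 PB
    derived = (raw_copies + simulation) * 1.5    # derived products, recreated as software improves -> 37.5 PB
    extra_replicas = derived * 1.0               # replicate derived data for worldwide access      -> 37.5 PB

    total = raw_copies + simulation + derived + extra_replicas
    print(f"{raw_pb} PB of raw data -> ~{total:.0f} PB in total")   # ~100 PB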

Large Distributed Community – and they all have computers and storage.

LHC Users' Computer Centres

Overview: The computational problem – The challenge – Grid computing – The WLCG – Operational experience – Future perspectives

Go Distributed! Why? Technical and political/financial reasons:
– No single center could provide ALL the computing (buildings, power, cooling, cost, …)
– The community is distributed: computing is already available at all institutes
– Funding for computing is also distributed
How do you distribute it all?
– With big data
– With hundreds of computing centers
– With a global user community
– And there is always new data!

The Grid – “Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations” (Ian Foster and Carl Kesselman). Share computing resources and storage resources: many computers act together as a single one!

Main Ideas – Multi-institutional organizations: Site 1, Site 2, Site 3, each with different services, different policies, different AAA (Authentication, Authorisation, Accounting), and different scale and expertise.

Virtual Organizations – The users from organizations A and B create a Virtual Organization; each user keeps a unique identity but also carries the identity of the VO. Organizations A and B support the Virtual Organization by placing “grid” interfaces at the organizational boundary. These map the generic “grid” functions, information and credentials to the local security functions, information and credentials. This is the basis of multi-institutional e-Science infrastructures.
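
One traditional way such a boundary mapping is realised is a "grid-mapfile" that translates a user's global certificate identity (the distinguished name in their X.509 certificate) into a local account at the site. The sketch below is a simplified Python illustration of the idea; the DNs, VO names and local account names are made up.

    # Illustrative mapping from (certificate DN, VO) to a local account at one site.
    GRID_MAP = {
        ("/DC=ch/DC=cern/CN=Jane Doe", "atlas"): "atlasprd01",
        ("/DC=org/DC=example/CN=John Smith", "cms"): "cmsusr07",
    }

    def map_to_local_account(dn: str, vo: str) -> str:
        """Translate a global grid identity into this site's local credentials."""
        try:
            return GRID_MAP[(dn, vo)]
        except KeyError:
            raise PermissionError(f"{dn} ({vo}) is not authorised at this site")

    print(map_to_local_account("/DC=ch/DC=cern/CN=Jane Doe", "atlas"))  # -> atlasprd01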

The Grid – multi-institutional organizations. Sites have to trust each other, VOs have to trust sites, and sites have to trust VOs. For simplicity: sites deal with VO permissions, VOs deal with users, and sites can override VO decisions. Trust each other? Security!

Public Key Based Security – How would you exchange secret keys among 340 sites (global, each with hundreds of nodes), 200 user communities (non-local) and users worldwide, and keep them secret?
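
A quick sanity check of why shared secret keys do not scale: with symmetric keys, every pair of parties needs its own secret, while with public-key cryptography each party keeps only one private key. The numbers below just use the 340 sites from the slide as an illustration.

    n_sites = 340

    # Symmetric keys: one shared secret per pair of sites.
    pairwise_secrets = n_sites * (n_sites - 1) // 2
    print(pairwise_secrets)   # 57630 secrets, each known to two parties

    # Public key: one key pair per site; only the private half must stay secret,
    # and it never has to be exchanged with anyone.
    private_keys = n_sites
    print(private_keys)       # 340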

The Grid – multi-institutional organizations plus security. How does all of this work? Middleware!

Middleware – Software in the middle that makes communication between users and services possible: sophisticated and diverse back-end services behind potentially simple, heterogeneous front-end services. It deals with the diversity of services (storage systems, batch systems, …) and is integrated across multiple organizations: no centralized control, geographical distribution, different policy environments, international issues.
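
As a toy illustration of "software in the middle", the sketch below defines one storage interface and two hypothetical back-ends behind it; the class names are invented for this example and do not correspond to real grid middleware components.

    from abc import ABC, abstractmethod

    class StorageBackend(ABC):
        """The uniform front-end interface that users and tools program against."""
        @abstractmethod
        def copy_in(self, local_path: str, remote_path: str) -> None: ...

    class DCacheLikeBackend(StorageBackend):
        def copy_in(self, local_path, remote_path):
            print(f"dcache-style transfer of {local_path} to {remote_path}")

    class ObjectStoreBackend(StorageBackend):
        def copy_in(self, local_path, remote_path):
            print(f"object-store upload of {local_path} as key {remote_path}")

    def upload(backend: StorageBackend, local_path: str, remote_path: str):
        # Callers never see which storage technology the site actually runs.
        backend.copy_in(local_path, remote_path)

    upload(DCacheLikeBackend(), "event.root", "/store/data/event.root")
    upload(ObjectStoreBackend(), "event.root", "store/data/event.root")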

Original Grid Services
Security Services: Certificate Management Service, VO Membership Service, Authentication Service, Authorization Service
Information Services: Information System, Messaging Service, Site Availability Monitor, Accounting Service, Monitoring tools (experiment dashboards, site monitoring)
Data Management Services: Storage Element, File Catalogue Service, File Transfer Service, Grid file access tools, GridFTP service, Database and DB Replication Services, POOL Object Persistency Service
Job Management Services: Compute Element, Workload Management Service, VO Agent Service, Application Software Install Service, Pilot Factory
Experiments invested considerable effort into integrating their software with grid services and hiding complexity from users.

Managing Jobs on the Grid – In the original push model, every VO/experiment's workload management submits jobs to a site Computing Element, which passes them to the batch system for scheduling on the worker nodes. In the pilot model, a Pilot Factory submits pilots through the Computing Element and batch system instead; once a pilot is running on a worker node it requests a real job from the experiment/VO workload management task queue, which sends the job to it.
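
The pilot side of that model can be sketched in a few lines: the pilot lands on a worker node and pulls real work from the VO's central task queue until none is left. The HTTP endpoint and payload format below are hypothetical, not any experiment's real API.

    import subprocess
    import requests   # assumed to be available on the worker node

    TASK_QUEUE = "https://taskqueue.example.org/api"   # hypothetical VO task queue

    def run_pilot():
        while True:
            # Ask the experiment's workload management system for a matching job.
            reply = requests.get(f"{TASK_QUEUE}/nextjob", params={"vo": "myvo"})
            if reply.status_code == 204:          # queue empty: the pilot simply exits
                break
            job = reply.json()                    # e.g. {"id": 42, "cmd": ["sim.py", "--events", "1000"]}
            result = subprocess.run(job["cmd"], capture_output=True)
            # Report the outcome back so the task queue can mark the job done or retry it.
            requests.post(f"{TASK_QUEUE}/jobs/{job['id']}/status",
                          json={"exit_code": result.returncode})

    if __name__ == "__main__":
        run_pilot()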

The Brief History of WLCG
– MONARC project: defined the initial hierarchical architecture
– Growing interest in Grid technology: the HEP community was a main driver in launching the DataGrid project
– EU DataGrid project: middleware & testbed for an operational grid
– LHC Computing Grid: deploying the results of DataGrid for the LHC experiments
– EU EGEE project phase 1: a shared production infrastructure building upon the LCG
– EU EGEE project phase 2: focus on scale, stability, interoperations/interoperability
– EU EGEE project phase 3: efficient operations with less central coordination
– EGI and EMI: sustainability

WLCG – the Worldwide LHC Computing Grid: an international collaboration to distribute, store and analyze LHC data. It links computer centres worldwide that provide computing and storage resources into a single infrastructure accessible by all LHC physicists, building on the EGI, OSG and NDGF infrastructures. The biggest scientific grid project in the world.

A Tiered Architecture
– Tier-0 (CERN, 15%): data recording, initial data reconstruction, data distribution
– Tier-1 (13 centres, 40%): permanent storage, re-processing, analysis; connected by 10 Gb/s fibres
– Tier-2 (~160 centres, 45%): simulation, end-user analysis

LHC Networking – Relies upon OPN, GÉANT, ESnet, and NRENs & other national and international providers.

Computing Model Evolution
Original model: static strict hierarchy, multi-hop data flows, lesser demands on Tier-2 networking, the virtue of simplicity; designed for <~2.5 Gb/s within the hierarchy.
Today: bandwidths of Gb/s, not limited to the hierarchy; flatter, mostly a mesh; sites contribute based on capability; greater flexibility and efficiency; available resources are used more fully.

WLCG Infrastructure – sites in nearly 40 countries, ~8000 users; 1.5 PB/week recorded; 2-3 GB/s from CERN; global data movement of 15 GB/s; 2 M jobs/day; 200 PB of storage; resources distributed across CERN, the Tier-1s and the Tier-2s.

Operations – cooperation and collaboration between sites, and between sites and experiments!

Operations – Not everything is provided by WLCG directly: WLCG links the services provided by the underlying infrastructures (EGI, OSG, NDGF) and ensures that they are compatible. EGI provides some central services, such as user support (GGUS) and accounting (APEL and its portal).

Shared Infrastructures: EGI – a few hundred VOs from several scientific domains:
– Astronomy & Astrophysics
– Civil Protection
– Computational Chemistry
– Computational Fluid Dynamics
– Computer Science/Tools
– Condensed Matter Physics
– Earth Sciences
– Fusion
– High Energy Physics
– Life Sciences
Further applications are joining all the time; recently fisheries (iMarine).

Production Grids – WLCG relies on a production-quality infrastructure, which requires standards of availability/reliability, performance and manageability, and is used 365 days a year. It is vital that we build a fault-tolerant and reliable system that can deal with individual sites being down and recover. Monitoring and operational tools and procedures are as important as the middleware.

Global Grid User Support – GGUS: a web-based portal handling about 1000 tickets per month; grid-security aware; interfaces to regional/national support structures.

From Software To Services – Services require fabric, management, networking, security, monitoring, user support, problem tracking, accounting, service support, SLAs, … but now on a global scale, respecting the autonomy of sites and linking the different infrastructures (NDGF, EGI, OSG). Focus here: monitoring.

Types of Monitoring
Passive monitoring – measure the real computing activity: data transfers, job processing, …
Active monitoring – check the sites by probing them: availability/reliability, performance, …; functional testing and stress testing.
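
An active probe is just a small test run against each site on a schedule. The sketch below checks whether a site endpoint answers within a timeout; the endpoint URLs and thresholds are placeholders, not the real SAM tests.

    import time
    import urllib.request

    SITES = {                                  # placeholder endpoints
        "SITE-A": "https://se.site-a.example.org/ping",
        "SITE-B": "https://se.site-b.example.org/ping",
    }

    def probe(url: str, timeout: float = 10.0):
        """Return (ok, seconds) for one availability/performance probe."""
        start = time.time()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as reply:
                ok = reply.status == 200
        except OSError:
            ok = False
        return ok, time.time() - start

    for site, url in SITES.items():
        ok, elapsed = probe(url)
        print(f"{site}: {'OK' if ok else 'CRITICAL'} in {elapsed:.1f}s")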

Monitoring Framework – WLCG computing activities have been monitored since 2006 through a common monitoring framework developed at CERN. Multiple targets: passive and active. Different perspectives: users, experts, sites.

Passive: Data Transfers – 10 GB/s

Passive: Data and Job Flow

Active: HammerCloud

Active: SAM3 Tools

Quality over Time

The Future – Everything is working, but at a cost. Goals and challenges for the future: more, more, much more computing, but not more people and cash. Use common technologies; lower operations costs (clouds, both private and commercial); opportunistic resources; optimization of code and workflows (a ~factor improvement is needed).

Scale of challenge – The computing challenge roughly “doubles” for this run, then explodes thereafter with the experiment upgrades and high luminosity. Two solutions: more efficient usage (better algorithms, better data management) and more resources (opportunistic, volunteer, moving with technology: clouds, processor architectures). Projections over a 10-year horizon, assuming the historical growth of 25%/year (in MHS06 and PB), show the room for improvement needed across ALICE, ATLAS, CMS and LHCb.
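
The gap is easy to quantify: if flat budgets only buy the historical ~25% more capacity per year, ten years of that growth gives roughly a factor 9, which is why software improvements matter as much as new hardware. A back-of-the-envelope check, with the growth rate and horizon taken from the slide:

    growth_per_year = 1.25      # historical ~25%/year capacity growth at flat cost
    years = 10                  # the 10-year horizon on the slide

    capacity_factor = growth_per_year ** years
    print(f"Hardware alone gives a factor {capacity_factor:.1f} over {years} years")  # ~9.3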

Motivation for Clouds – Clouds offer flexibility: user workloads and system requirements are decoupled, resources are allocated dynamically, and there are commercial and non-commercial providers. They are based on established, open technology and protocols: expertise is widely available, products and tools evolve rapidly, and there are commercial and non-commercial users. They have proven scalability, from small in-house systems to worldwide distributed systems.

Clouds for LHC – CERN and many WLCG sites are now using cloud technologies to provision their compute clusters; many are deploying OpenStack, which has a global community. Cloud provisioning gives better cluster management and flexibility, and existing grid services can run on top. The LHC experiments also manage their HLT farms with OpenStack, which allows them to switch between DAQ and processing.
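
For illustration, this is roughly what provisioning a single worker VM looks like with the openstacksdk Python client; the cloud name, image, flavor and network below are placeholders, and real WLCG sites drive this through their provisioning layers rather than by hand.

    import openstack   # pip install openstacksdk

    # Credentials come from a clouds.yaml entry named "mysite" (placeholder).
    conn = openstack.connect(cloud="mysite")

    image = conn.compute.find_image("worker-node-image")     # placeholder image name
    flavor = conn.compute.find_flavor("m1.large")            # placeholder flavor
    network = conn.network.find_network("internal")          # placeholder network

    server = conn.compute.create_server(
        name="grid-worker-001",
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    server = conn.compute.wait_for_server(server)
    print(f"{server.name} is {server.status}")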

What do Clouds provide? SaaS, PaaS, IaaS – VMs on demand.

Grid vs Clouds
Grids: provide abstraction for services (batch, storage, … – high level, with a huge variety of services); provide management of communities (Virtual Organisations, VOs); provider-centric: monitoring, accounting, security model, quotas, …
Clouds: provide abstraction for infrastructure (IaaS) and low-level services (CPU, object store, …); provide no management of communities, so high-level services have to be provided by the VOs (workflow, accounting, quotas, security); user-centric: users have to organise workflows, accounting, contextualisation, monitoring, sharing, …

High-level View – On the grid, the Pilot Factory submits pilots through the Computing Element and batch system to the worker nodes, where the running pilot requests a job from the workload management task queue, which sends the job. On a cloud, the Pilot Factory instead requests a resource through the virtual machine interface; the cloud instantiates a VM, the pilot runs inside it and requests a job from the same task queue.

Functional Areas – Image management, capacity management, monitoring, accounting, pilot job framework, supporting services. Clouds are cool, but no magic bullet: lots of additional tasks move to user-land. Image management provides the job environment and must balance pre- and post-instantiation operations. Capacity management requires a specific component with some intelligence: do I need to start a VM, and if so where? Do I need to stop a VM, and if so where? Are the VMs that I started OK?
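
That "component with some intelligence" for capacity management can be surprisingly small in outline: compare queued work with running VMs and decide whether to start or stop machines. The sketch below is a hypothetical skeleton, not any experiment's actual provisioner; the returned actions would be fed to a cloud API such as the one sketched above.

    def decide(queued_jobs: int, running_vms: int, idle_vms: int,
               jobs_per_vm: int = 8, max_vms: int = 100):
        """Return a list of (action, count) decisions for this provisioning cycle."""
        decisions = []
        wanted = -(-queued_jobs // jobs_per_vm)          # ceiling division: VMs needed for the queue
        if wanted > running_vms and running_vms < max_vms:
            decisions.append(("start", min(wanted - running_vms, max_vms - running_vms)))
        elif queued_jobs == 0 and idle_vms > 0:
            decisions.append(("stop", idle_vms))          # nothing queued: release idle capacity
        return decisions

    # One cycle of the loop: 50 queued jobs, 3 VMs running, none idle.
    print(decide(queued_jobs=50, running_vms=3, idle_vms=0))   # [('start', 4)]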

Volunteer Computing

It would have been impossible to release physics results so quickly without the outstanding performance of the Grid (including the CERN Tier-0). This includes MC production and user and group analysis at CERN, 10 Tier-1s and ~70 Tier-2 federations (> 80 sites), with around 100k concurrent ATLAS jobs in January-July 2012 and > 1500 distinct ATLAS users doing analysis on the Grid. Available resources were fully used and stressed (beyond pledges in some cases); 8 TeV Monte Carlo samples were produced massively; and a very effective and flexible computing model and operations team accommodated high trigger rates and pile-up, intense MC simulation, and analysis demands from worldwide users (through, e.g., dynamic data placement).

Conclusions – Grid computing and WLCG have proven themselves during the first run of LHC data taking: grid computing works for our community and has a future. The model changed from a tree to a mesh structure, because networks improved much faster than CPUs. There is a shift from resource provider to user community, bringing new tasks, new responsibilities and new tool-chains. Lots of challenges for our generation!