1
Computing on the Grid and in the Clouds
Rocío Rama Ballesteros, CERN IT-SDC (Support for Distributed Computing Group)
2
Overview
– The computational problem
– The challenge
– Grid computing
– The WLCG
– Operational experience
– Future perspectives
3
The Computational Problem
Where does it come from?
4
The Source of all Data
The LHC delivers collisions at 40 MHz.
5
A Collision
6
An Event
Raw data:
– Was a detector element hit?
– ADC counts
– Time signals
Reconstructed data:
– Momentum of tracks (4-vectors)
– Origin
– Energy in clusters (jets)
– Particle type
– Calibration information
– …
7
Data Acquisition
(Diagram of the data acquisition chain; data flow of ~1 GB/s.)
8
Data Flow to Permanent Storage
(Diagram: per-experiment rates ranging from 200-400 MB/s up to ~4 GB/s, with 1-2 GB/s typical; combined data flow to permanent storage of 4-6 GB/s.)
9
Reconstruction and Archival
First reconstruction, data quality, calibration.
10
An Event's Lifetime
See Anna Sfyrla's Summer Student Lecture "From Raw Data to Physics".
Monte Carlo simulation takes as much computing as everything else!
11
The Computing Challenge
Scale:
– Data
– Computing
– Complexity
12
Data Volume & Rates 30+PB per year + simulation Preservation – for ∞ Processing – 340k + cores Log scale rocio.rama.ballesteros@cern.ch12 Understood when we started
13
Big Data!
– Duplicate raw data
– Simulated data
– Many derived data products
– Recreated as the software improves
– Replicated so that physicists everywhere can access it
A few PB of raw data becomes ~100 PB!
14
Large Distributed Community
And they all have computers and storage.
15
LHC Users' Computer Centres
16
Overview
– The computational problem
– The challenge
– Grid computing
– The WLCG
– Operational experience
– Future perspectives
17
Go Distributed!
Why? Technical and political/financial reasons:
– No single centre could provide ALL the computing (buildings, power, cooling, cost, …)
– The community is distributed; computing is already available at all institutes
– Funding for computing is also distributed
How do you distribute it all?
– With big data
– With hundreds of computing centres
– With a global user community
– And there is always new data!
18
The Grid
"Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations" – Ian Foster and Carl Kesselman
Share:
– Computing resources
– Storage resources
Many computers act together as a single one!
19
Main Ideas
Multi-institutional organizations: each site (Site 1, Site 2, Site 3, …) has
– Different services
– Different policies
– Different AAA (Authentication, Authorisation, Accounting)
– Different scale and expertise
20
Virtual Organizations
The users from A and B create a Virtual Organization (VO):
– Users have a unique identity, but also the identity of the VO
Organizations A and B support the Virtual Organization:
– They place "grid" interfaces at the organizational boundary
– These map the generic "grid" functions/information/credentials to the local security functions/information/credentials (see the sketch below)
This is the basis of multi-institutional e-Science infrastructures.
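A minimal sketch of that boundary mapping: the site's grid interface translates a user's grid identity and VO role into a local account. Everything below (identities, roles, account names) is invented for illustration; real grids map X.509 certificate subjects and VOMS attributes rather than plain strings.

```python
# Hypothetical illustration of the "grid -> local" credential mapping done at
# a site boundary. Names and roles are made up; real middleware maps X.509
# DNs and VOMS roles to site-local Unix accounts in a similar spirit.
VO_ROLE_TO_LOCAL_ACCOUNT = {
    ("atlas", "production"): "atlasprd",
    ("atlas", "user"):       "atlas001",
    ("cms",   "user"):       "cms001",
}

def map_to_local_account(grid_identity: str, vo: str, role: str) -> str:
    """Translate a grid identity plus VO role into a site-local account."""
    try:
        local = VO_ROLE_TO_LOCAL_ACCOUNT[(vo, role)]
    except KeyError:
        raise PermissionError(f"VO {vo!r} with role {role!r} is not supported at this site")
    print(f"{grid_identity} ({vo}/{role}) -> local account {local}")
    return local

map_to_local_account("/DC=ch/DC=cern/CN=Jane Doe", "atlas", "user")
```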
21
The Grid - Multi-institutional Organizations
– Sites have to trust each other
– VOs have to trust sites
– Sites have to trust VOs
For simplicity:
– Sites deal with VO permissions
– VOs deal with users
– Sites can override VO decisions
Trust each other? That means security!
22
Public Key Based Security
How would you exchange secret keys between
– 340 sites (global), each with hundreds of nodes,
– 200 user communities (non-local),
– 10,000 users (global),
and keep them secret? You can't – which is why the grid uses public-key based security (see the counting sketch below).
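To see why pairwise secret keys do not scale, just count them: with symmetric keys every pair of parties needs its own shared secret, while with public-key security each party needs only one key pair and certificate. A small sketch (the party count simply adds the numbers from the slide, ignoring the individual nodes inside each site):

```python
from math import comb

# Parties from the slide: 340 sites + 10,000 users (per-node keys would
# only make the numbers worse).
parties = 340 + 10_000

pairwise_secret_keys = comb(parties, 2)   # one shared secret per pair
certificates_needed  = parties            # one public/private key pair each

print(f"{pairwise_secret_keys:,} pairwise secrets vs {certificates_needed:,} certificates")
# ~53 million shared secrets vs ~10 thousand certificates
```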
23
The Grid
Multi-institutional organizations plus security… how does all of this work? Middleware!
24
Middleware
Software in the middle that makes communication between users and services possible!
– Sophisticated and diverse back-end services
– Potentially simple, heterogeneous front-end services
It deals with the diversity of services (storage systems, batch systems, …) and is integrated across multiple organizations:
– Lack of centralized control
– Geographical distribution
– Different policy environments, international issues
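A toy sketch of the middleware idea under heavy simplification: one uniform "submit job" interface hiding different site batch systems. The classes are invented for illustration and the commands are only returned as strings; real grid middleware (computing elements, workload management) is far richer.

```python
# Toy illustration: a uniform interface over heterogeneous batch systems.
from abc import ABC, abstractmethod

class BatchSystem(ABC):
    @abstractmethod
    def submit(self, script: str) -> str: ...

class HTCondorSite(BatchSystem):
    def submit(self, script: str) -> str:
        return f"condor_submit {script}"      # what this site would run locally

class SlurmSite(BatchSystem):
    def submit(self, script: str) -> str:
        return f"sbatch {script}"

def grid_submit(job_script: str, site: BatchSystem) -> str:
    """The user sees one interface; the site-specific backend is hidden."""
    return site.submit(job_script)

for site in (HTCondorSite(), SlurmSite()):
    print(grid_submit("analysis.sh", site))
```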
25
Original Grid Services
Security services: Certificate Management Service, VO Membership Service, Authentication Service, Authorization Service
Information services: Information System, Messaging Service, Site Availability Monitor, Accounting Service, monitoring tools (experiment dashboards, site monitoring)
Data management services: Storage Element, File Catalogue Service, File Transfer Service, grid file access tools, GridFTP service, Database and DB Replication Services, POOL Object Persistency Service
Job management services: Compute Element, Workload Management Service, VO Agent Service, Application Software Install Service, Pilot Factory
Experiments invested considerable effort into integrating their software with grid services and hiding the complexity from users.
26
Managing Jobs on the Grid
Original model: every VO/experiment's workload management submits the job to a Computing Element, which schedules it on the site batch system and runs it on a worker node.
Pilot model: a Pilot Factory submits pilots through the Computing Element and batch system; once a pilot is running on a worker node it requests real work from the experiment's central task queue, and the workload management sends it a job (see the sketch below).
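A minimal sketch of the pilot idea under simplifying assumptions: the in-process queue, the payloads and the function names are invented, and a real pilot framework also handles matching, authentication, monitoring and error recovery over the network.

```python
import queue, subprocess, time

# Stand-in for the experiment's central task queue. In reality this is a
# remote service the pilot contacts over the network; here it is in-process.
task_queue = queue.Queue()
for i in range(3):
    task_queue.put(["python3", "-c", f"print('payload job {i} done')"])

def pilot(max_idle_s: float = 2.0) -> None:
    """Runs on a worker node after the batch system starts the pilot.
    It pulls real jobs from the task queue until there is nothing left."""
    while True:
        try:
            payload = task_queue.get(timeout=max_idle_s)   # "request job"
        except queue.Empty:
            print("No more work; pilot exits and frees the worker node.")
            return
        subprocess.run(payload, check=False)               # run the payload
        time.sleep(0.1)                                    # pretend to clean up

pilot()
```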
27
The Brief History of WLCG
– 1999 – MONARC project: defined the initial hierarchical architecture
– 2000 – Growing interest in Grid technology: the HEP community was a main driver in launching the DataGrid project
– 2001-2004 – EU DataGrid project: middleware & testbed for an operational grid
– 2002-2005 – LHC Computing Grid: deploying the results of DataGrid for the LHC experiments
– 2004-2006 – EU EGEE project phase 1: a shared production infrastructure building upon the LCG
– 2006-2008 – EU EGEE project phase 2: focus on scale, stability, interoperations/interoperability
– 2008-2010 – EU EGEE project phase 3: efficient operations with less central coordination
– 2010-201x – EGI and EMI: sustainability
28
WLCG – Worldwide LHC Computing Grid
– An international collaboration to distribute, store and analyse LHC data
– Links computer centres worldwide that provide computing and storage resources into a single infrastructure accessible by all LHC physicists
– The biggest scientific grid project in the world
– Built on top of the EGI, OSG and NDGF infrastructures
29
A Tiered Architecture
– Tier-0 (CERN, ~15% of resources): data recording, initial data reconstruction, data distribution
– Tier-1 (13 centres, ~40%): permanent storage, re-processing, analysis; connected by 10 Gb fibres
– Tier-2 (~160 centres, ~45%): simulation, end-user analysis
30
LHC Networking
Relies upon:
– The LHC OPN, GÉANT, ESnet
– NRENs & other national & international providers
31
Computing Model Evolution
Original model:
– Static, strict hierarchy
– Multi-hop data flows
– Lesser demands on Tier-2 networking
– Virtue of simplicity
– Designed for <~2.5 Gb/s within the hierarchy
Today:
– Bandwidths of 10-100 Gb/s, not limited to the hierarchy
– Flatter, mostly a mesh
– Sites contribute based on capability
– Greater flexibility and efficiency; available resources more fully utilized
32
WLCG Infrastructure
– 170 sites in nearly 40 countries, ~8000 users
– 1.5 PB/week recorded; 2-3 GB/s from CERN; global data movement of 15 GB/s
– 200 PB of storage
– 250,000 CPU-days/day; 2 M jobs/day
– Resources distributed across CERN, the Tier-1s and the Tier-2s
34
Operations
Cooperation and collaboration between sites, and between sites and experiments!
35
Operations
Not everything is provided by WLCG directly:
– WLCG links the services provided by the underlying infrastructures (EGI, OSG, NDGF) and ensures that they are compatible
– EGI provides some central services: user support (GGUS), accounting (APEL & its portal)
36
Shared Infrastructures: EGI
A few hundred VOs from several scientific domains:
– Astronomy & Astrophysics
– Civil Protection
– Computational Chemistry
– Computational Fluid Dynamics
– Computer Science/Tools
– Condensed Matter Physics
– Earth Sciences
– Fusion
– High Energy Physics
– Life Sciences
– …
Further applications are joining all the time – recently fisheries (iMarine).
37
Production Grids
WLCG relies on a production-quality infrastructure:
– It requires standards of availability/reliability, performance and manageability
– It is used 365 days a year
It is vital that we build a fault-tolerant and reliable system – one that can deal with individual sites being down, and recover.
Monitoring and operational tools and procedures are as important as the middleware.
38
Global Grid User Support
GGUS: a web-based portal
– About 1000 tickets per month
– Grid-security aware
– Interfaces to regional/national support structures
39
From Software to Services
Services require:
– Fabric, management, networking, security
– Monitoring, user support, problem tracking
– Accounting, service support, SLAs, …
But now on a global scale:
– Respecting the autonomy of sites
– Linking the different infrastructures (NDGF, EGI, OSG)
In the following: focus on monitoring.
40
Types of Monitoring
Passive monitoring – measure the real computing activity:
– Data transfers
– Job processing
– …
Active monitoring – check the sites by probing them (availability/reliability, performance, …):
– Functional testing and stress testing (see the sketch below)
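As an illustration of active monitoring, here is a minimal sketch of a functional probe: it checks a site service endpoint and records OK/FAILED results. The endpoints and the plain HTTP check are assumptions for illustration; real WLCG probes (e.g. SAM tests) exercise actual grid services such as job submission and storage.

```python
import time
import urllib.request

# Hypothetical site endpoints to probe; real tests target grid services
# (compute elements, storage), not plain web pages.
ENDPOINTS = {
    "SITE-A": "https://site-a.example.org/health",
    "SITE-B": "https://site-b.example.org/health",
}

def probe(url: str, timeout: float = 10.0) -> bool:
    """One active test: the service is OK if it answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def run_probes() -> dict[str, bool]:
    results = {site: probe(url) for site, url in ENDPOINTS.items()}
    for site, ok in results.items():
        print(f"{time.strftime('%H:%M')} {site}: {'OK' if ok else 'FAILED'}")
    return results

run_probes()   # in production this would run on a schedule and feed a dashboard
```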
41
Monitoring Framework
– Monitoring WLCG computing activities since 2006
– A common monitoring framework developed at CERN
– Multiple targets: passive and active
– Different perspectives: users, experts, sites
42
Passive: Data Transfers
(Plot: global transfer rates of ~10 GB/s.)
43
Passive: Data and Job Flow
44
Active: HammerCloud
45
Active: SAM3 Tools
46
Quality over Time
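The quality-over-time plots are built from active test results like those above. Here is a minimal sketch of how availability and reliability are typically derived from probe outcomes; the hourly samples are invented, and the definitions follow the common WLCG convention that availability counts all time while reliability excludes scheduled downtime.

```python
# Hypothetical hourly probe results for one site over one day:
# "OK", "FAILED", or "DOWNTIME" (scheduled maintenance).
samples = ["OK"] * 20 + ["FAILED"] * 2 + ["DOWNTIME"] * 2

ok        = samples.count("OK")
scheduled = samples.count("DOWNTIME")
total     = len(samples)

availability = ok / total                   # failures and downtime both count against it
reliability  = ok / (total - scheduled)     # scheduled downtime is excluded

print(f"availability = {availability:.1%}, reliability = {reliability:.1%}")
# availability = 83.3%, reliability = 90.9%
```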
47
The Future
Everything is working – but at a cost…
Goals and challenges for the future:
– More, more, much more data and computing – but not more people and cash!
– Use common technologies and lower operations costs
– Clouds: private/commercial clouds, opportunistic resources
– Optimization of code and workflows
We need roughly a factor 10-20 improvement!
48
10 Year Horizon
Computing challenge:
– Roughly "double" this run, then explode thereafter (experiment upgrades, high luminosity)
Two solutions:
– More efficient usage: better algorithms, better data management
– More resources: opportunistic, volunteer; move with technology (clouds, processor architectures)
(Plot: projected CPU [MHS06] and storage [PB] needs of ALICE, ATLAS, CMS and LHCb from 2010 to 2023, against a historical resource growth of 25%/year; "we are here" marks the present, with room for improvement. See the arithmetic sketch below.)
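To connect the 25%/year historical growth with the "factor 10-20" goal on the previous slide, a quick sketch of how far flat-budget technology growth alone gets you over a ten-year horizon (the 25% figure is from the slide; the rest is simple compounding):

```python
growth_per_year = 0.25   # historical capacity growth at roughly flat cost (from the slide)
years = 10

capacity_factor = (1 + growth_per_year) ** years
print(f"x{capacity_factor:.1f} capacity after {years} years")   # ~x9.3
# Technology evolution alone gives roughly a factor 10 over ten years, so a
# 10-20x demand increase also needs better software and additional resources.
```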
49
Motivation for Clouds
Clouds offer flexibility:
– User workloads and system requirements are decoupled
– Dynamic allocation of resources
– Commercial and non-commercial providers
Based on established, open technology and protocols:
– Expertise is widely available
– Products and tools evolve rapidly
– Commercial and non-commercial users
Proven scalability:
– From small in-house systems to worldwide distributed systems
50
Clouds for LHC
CERN and many WLCG sites are now using cloud technologies to provision their compute clusters:
– Many are deploying OpenStack, which has a global community
Cloud provisioning:
– Better cluster management and flexibility
– Existing grid services can run on top
The LHC experiments also manage their HLT farms with OpenStack:
– This allows them to switch between DAQ and offline processing
51
What Do Clouds Provide?
– SaaS (Software as a Service)
– PaaS (Platform as a Service)
– IaaS (Infrastructure as a Service): VMs on demand (see the sketch below)
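For the IaaS layer, requesting a VM on demand looks roughly like the following sketch, written against the openstacksdk Python client since OpenStack is mentioned on the previous slide. The cloud entry, image, flavor and network names are placeholders; treat this as an illustrative outline rather than a production recipe.

```python
import openstack

# Connect using credentials defined in clouds.yaml under the entry "my-cloud"
# (the cloud, image, flavor and network names below are placeholders).
conn = openstack.connect(cloud="my-cloud")

image   = conn.compute.find_image("worker-node-image")
flavor  = conn.compute.find_flavor("m1.large")
network = conn.network.find_network("private")

# "VMs on demand": ask the IaaS layer to start a worker node.
server = conn.compute.create_server(
    name="grid-worker-001",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)
print(f"{server.name} is {server.status}")   # e.g. ACTIVE

# Later, when the capacity is no longer needed:
# conn.compute.delete_server(server)
```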
52
Grid vs Clouds
Grids:
– Provide an abstraction for services (batch, storage, …): high level, a huge variety of services
– Provide management of communities: Virtual Organisations (VOs)
– Provider-centric: monitoring, accounting, security model, quotas, …
Clouds:
– Provide an abstraction for infrastructure (IaaS): low-level services (CPU, object store, …)
– Provide no management of communities; high-level services have to be provided by the VOs (workflow, accounting, quotas, security)
– User-centric: users have to organise workflows, accounting, conceptualisation, monitoring, sharing, …
53
High-level View
On the grid, the Pilot Factory submits a pilot through the Computing Element and batch system to a worker node; the running pilot then requests a job from the task queue, and the workload management sends it work.
On a cloud, the Pilot Factory instead requests resources through the cloud interface, which instantiates a virtual machine; the pilot running inside the VM then requests and receives jobs in exactly the same way.
54
Functional Areas
Clouds are cool, but no magic bullet – lots of additional tasks move to user-land:
– Image management: provides the job environment; balance pre- and post-instantiation operations
– Capacity management: requires a specific component with some intelligence – do I need to start a VM, and if so where? Do I need to stop a VM, and if so where? Are the VMs that I started OK? (see the sketch below)
– Monitoring, accounting, pilot job framework, supporting services
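A minimal sketch of the capacity-management decision described above, with made-up thresholds and no real cloud API: look at the task queue and the running VMs, then decide whether to start, stop or replace machines. A real component would also consider quotas, multiple clouds and accounting.

```python
from dataclasses import dataclass

@dataclass
class VM:
    name: str
    healthy: bool   # e.g. heartbeat from the pilot inside the VM
    busy: bool      # currently running a payload job

def manage_capacity(queued_jobs: int, vms: list[VM],
                    jobs_per_vm: int = 4) -> list[str]:
    """Return a list of hypothetical actions for one decision cycle."""
    actions = []

    # 1. Are the VMs I started OK? Replace anything unhealthy.
    for vm in vms:
        if not vm.healthy:
            actions.append(f"terminate {vm.name} (unhealthy)")

    healthy = [vm for vm in vms if vm.healthy]

    # 2. Do I need to start a VM? Only if the queue outgrows current capacity.
    wanted = -(-queued_jobs // jobs_per_vm)        # ceiling division
    if wanted > len(healthy):
        actions.append(f"start {wanted - len(healthy)} new VM(s)")

    # 3. Do I need to stop a VM? Idle machines cost money on a cloud.
    for vm in healthy:
        if queued_jobs == 0 and not vm.busy:
            actions.append(f"stop {vm.name} (idle, empty queue)")

    return actions

print(manage_capacity(9, [VM("vm-1", True, True), VM("vm-2", False, False)]))
```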
55
Volunteer Computing
http://lhcathome.web.cern.ch
56
It would have been impossible to release physics results so quickly without the outstanding performance of the Grid (including the CERN Tier-0).
(Plot: number of concurrent ATLAS jobs, Jan-July 2012, reaching ~100k; includes MC production and user and group analysis at CERN, the 10 Tier-1s, and ~70 Tier-2 federations with >80 sites.)
– >1500 distinct ATLAS users do analysis on the Grid
– Available resources fully used/stressed (beyond pledges in some cases)
– Massive production of 8 TeV Monte Carlo samples
– A very effective and flexible computing model and operations team accommodate high trigger rates and pile-up, intense MC simulation, and analysis demands from worldwide users (e.g. through dynamic data placement)
57
Conclusions
– Grid computing and WLCG have proven themselves during the first run of LHC data-taking
– Grid computing works for our community and has a future
– The model changed from a tree to a mesh structure: networks improved much faster than CPUs
– A shift from the resource provider to the user community: new tasks, new responsibilities, new tool-chains
– Lots of challenges for our generation!