Computing on the Grid and in the Clouds
Rocío Rama Ballesteros, CERN IT-SDC (Support for Distributed Computing group)
Overview
– The computational problem
– The challenge
– Grid computing
– The WLCG
– Operational experience
– Future perspectives
The Computational Problem
Where does it come from?
The Source of all Data
Delivering collisions at 40 MHz
A Collision
An Event
Raw data:
– Was a detector element hit?
– ADC counts
– Time signals
Reconstructed data:
– Momentum of tracks (4-vectors)
– Origin
– Energy in clusters (jets)
– Particle type
– Calibration information
– …
Data Acquisition 1 GB/s
Data flows of 1–2 GB/s, up to 4 GB/s
Data flow to permanent storage: 4–6 GB/s
Reconstruction and Archival
First reconstruction, data quality, calibration
An event's lifetime (see Anna Sfyrla's summer student lecture: From Raw Data to Physics)
Monte Carlo simulation takes as much computing as all the rest!
The Computing Challenge
Scale:
– Data
– Computing
– Complexity
Data Volume & Rates 30+PB per year + simulation Preservation – for ∞ Processing – 340k + cores Log scale Understood when we started
Big Data!
– Duplicate raw data
– Simulated data
– Many derived data products
– Recreated as the software improves
– Replicated to allow physicists to access it
A few PB of raw data becomes ~100 PB! (A back-of-the-envelope sketch follows.)
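To make that multiplication concrete, here is a minimal back-of-the-envelope sketch. Every factor in it is an illustrative assumption, not an official WLCG figure; it only shows how copies, simulation, derived products and replicas compound a few PB of raw data into something of order 100 PB.

```python
# Back-of-the-envelope only: all factors are illustrative assumptions, not
# official WLCG numbers. They show how copies, simulation, derived products
# and replicas compound a few PB of raw data into ~100 PB to be managed.
raw_pb = 15           # assumed raw data per year, in PB
raw_copies = 2        # raw data is duplicated for safe keeping
sim_factor = 1.5      # simulated data comparable to (or more than) real data
derived_factor = 2.0  # reprocessing passes and derived data products
replica_factor = 2.0  # replicas placed where physicists can access them

total_pb = raw_pb * raw_copies                         # safeguarded raw data
total_pb += raw_pb * sim_factor                        # simulation output
total_pb += raw_pb * derived_factor * replica_factor   # derived data and replicas

print(f"~{total_pb:.0f} PB managed from {raw_pb} PB of raw data")
```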
Large Distributed Community
And all have computers and storage!
LHC Users' Computer Centres
Overview
– The computational problem
– The challenge
– Grid computing
– The WLCG
– Operational experience
– Future perspectives
Go Distributed! Why?
Technical and political/financial reasons:
– No single center could provide ALL the computing: buildings, power, cooling, cost, …
– The community is distributed: computing is already available at all institutes
– Funding for computing is also distributed
How do you distribute it all?
– With big data
– With hundreds of computing centers
– With a global user community
– And there is always new data!
The Grid
“Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations” (Ian Foster and Carl Kesselman)
Share:
– Computing resources
– Storage resources
Many computers act together as a single one!
Main Ideas
Multi-institutional organizations: each site has
– Different services
– Different policies
– Different AAA (Authentication, Authorisation, Accounting)
– Different scale and expertise
Virtual Organizations
The users from A and B create a Virtual Organization
– Users have a unique identity but also the identity of the VO
Organizations A and B support the Virtual Organization
– Place “grid” interfaces at the organizational boundary
– These map the generic “grid” functions/information/credentials to the local security functions/information/credentials (sketched below)
Multi-institutional e-Science infrastructures
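To make the mapping idea concrete, here is a minimal sketch (toy Python, not real grid middleware) of how a generic grid identity plus VO membership can be mapped onto a site-local account. The DNs, VO names and pool accounts are invented for illustration; in reality the VO membership comes from a dedicated service rather than a dictionary.

```python
# A minimal sketch of the "grid interface" at an organization's boundary:
# a generic grid identity (certificate subject, DN) plus VO membership is
# mapped onto a local account that the site itself controls.
# All DNs, VO names and pool accounts below are made up for illustration.

VO_TO_LOCAL_GROUP = {
    "atlas": "atlasgrid",
    "cms":   "cmsgrid",
}

VO_MEMBERS = {  # in reality provided by the VO membership service, not a dict
    "/DC=ch/DC=cern/CN=Jane Physicist": "atlas",
    "/DC=org/DC=example/CN=John Analyst": "cms",
}

def map_to_local_account(dn: str) -> str:
    """Map a grid DN to a local pool account via its VO, or refuse access."""
    vo = VO_MEMBERS.get(dn)
    if vo is None:
        raise PermissionError(f"{dn} is not a member of any supported VO")
    group = VO_TO_LOCAL_GROUP[vo]
    return f"{group}001"   # e.g. hand out a pool account of that group

print(map_to_local_account("/DC=ch/DC=cern/CN=Jane Physicist"))  # atlasgrid001
```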
The Grid: multi-institutional organizations
Trust each other? Security!
– Sites have to trust each other
– VOs have to trust sites
– Sites have to trust VOs
For simplicity:
– Sites deal with VO permissions
– VOs deal with users
– Sites can override VO decisions
Public Key Based Security
How to exchange secret keys, and keep them secret?!
– 340 sites (global), with hundreds of nodes each
– 200 user communities (non-local)
– Users (global)
(A rough scaling argument is sketched below.)
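The scaling problem can be put into numbers: pairwise shared secrets grow with the square of the number of parties, while certificates issued by trusted authorities grow only linearly. The site and community counts come from the slide; the user count is an assumed order of magnitude for illustration.

```python
# Rough arithmetic for why pairwise shared secrets do not scale. With shared
# secrets, every pair of parties needs its own key; with a public-key
# infrastructure, each party needs one certificate signed by a trusted CA.
sites = 340
user_communities = 200
users = 10_000            # assumed order of magnitude, for illustration only

parties = sites + user_communities + users
pairwise_keys = parties * (parties - 1) // 2   # one secret per pair of parties
pki_credentials = parties                      # one certificate per party

print(f"pairwise secrets: ~{pairwise_keys:,}")   # tens of millions of keys
print(f"PKI credentials:  {pki_credentials:,}")  # one per site/community/user
```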
The Grid
Multi-institutional organizations, security… How does all of this work? Middleware!
Middleware
Software in the middle that makes communication between users and services possible!
– Sophisticated and diverse back-end services
– Potentially simple, heterogeneous front-end services
Deals with the diversity of services:
– Storage systems, batch systems…
Integrated across multiple organizations:
– Lack of centralized control
– Geographical distribution
– Different policy environments
– International issues
Original Grid Services
– Security Services: Certificate Management Service, VO Membership Service, Authentication Service, Authorization Service
– Information Services: Information System, Messaging Service, Site Availability Monitor, Accounting Service, monitoring tools (experiment dashboards, site monitoring)
– Data Management Services: Storage Element, File Catalogue Service, File Transfer Service, grid file access tools, GridFTP service, Database and DB Replication Services, POOL Object Persistency Service
– Job Management Services: Compute Element, Workload Management Service, VO Agent Service, Application Software Install Service, Pilot Factory
Experiments invested considerable effort into integrating their software with grid services and hiding complexity from users.
Managing Jobs on the Grid
Classic model: each VO/experiment's workload management submits jobs to a site's Computing Element, which schedules them through the batch system onto worker nodes.
Pilot model: a Pilot Factory submits pilots through the Computing Element and batch system; the pilot running on a worker node requests a job from the experiment/VO workload management (task queue), which sends it the job. (A toy sketch of the pilot loop follows.)
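The pilot model can be sketched in a few lines of toy Python. This is not any experiment's real pilot framework, just the core idea: a pilot is an "empty" job submitted to the site batch system which, once running on a worker node, checks its environment and pulls real work from the VO's central task queue.

```python
# A minimal sketch of the pilot model (toy code, not a production framework).
import queue

task_queue = queue.Queue()            # stands in for the VO's workload management
for i in range(3):
    task_queue.put(f"analysis-job-{i}")

def environment_ok() -> bool:
    """Placeholder sanity check: software available, enough disk space, etc."""
    return True

def run_pilot() -> None:
    """What the pilot does after the batch system starts it on a worker node."""
    if not environment_ok():
        return                        # die quietly; the factory will send more pilots
    while True:
        try:
            job = task_queue.get_nowait()   # ask the task queue for matching work
        except queue.Empty:
            break                           # nothing left to do, release the slot
        print(f"running {job} on this worker node")

run_pilot()
```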
The Brief History of WLCG
– MONARC project: defined the initial hierarchical architecture
– Growing interest in Grid technology: the HEP community was a main driver in launching the DataGrid project
– EU DataGrid project: middleware & testbed for an operational grid
– LHC Computing Grid: deploying the results of DataGrid for the LHC experiments
– EU EGEE project, phase 1: a shared production infrastructure building upon the LCG
– EU EGEE project, phase 2: focus on scale, stability, interoperations/interoperability
– EU EGEE project, phase 3: efficient operations with less central coordination
– EGI and EMI: sustainability
WLCG: Worldwide LHC Computing Grid
– An international collaboration to distribute, store and analyze LHC data
– Links computer centers worldwide that provide computing and storage resources into a single infrastructure accessible by all LHC physicists
– The biggest scientific Grid project in the world, built on the EGI, OSG and NDGF infrastructures
A Tiered Architecture
Tier-0 (CERN), 15%:
– Data recording
– Initial data reconstruction
– Data distribution
Tier-1 (13 centres), 40%:
– Permanent storage
– Re-processing
– Analysis
– Connected by 10 Gb/s fibre links
Tier-2 (~160 centres), 45%:
– Simulation
– End-user analysis
LHC Networking Relies upon – OPN, GEANT, ESNet – NRENs & other national & international providers
Computing Model Evolution
Original model:
– Static, strict hierarchy
– Multi-hop data flows
– Lesser demands on Tier-2 networking
– Virtue of simplicity
– Designed for <~2.5 Gb/s within the hierarchy
Today:
– Bandwidths of Gb/s, not limited to the hierarchy
– Flatter, mostly a mesh
– Sites contribute based on capability
– Greater flexibility and efficiency; available resources used more fully
WLCG Infrastructure
– ~8000 users, sites in nearly 40 countries
– 1.5 PB/week recorded, 2–3 GB/s from CERN
– Global data movement: 15 GB/s
– 2 M jobs/day
– 200 PB of storage
(Plots: CPU days/day and resource distribution across CERN, the Tier-1s and the Tier-2s.)
Operations
Cooperation and collaboration between sites, and between sites and experiments!
Operations
Not all is provided by WLCG directly
– WLCG links the services provided by the underlying infrastructures (EGI, OSG, NDGF) and ensures that they are compatible
– EGI provides some central services: user support (GGUS), accounting (APEL & portal)
Shared Infrastructures: EGI
A few hundred VOs from several scientific domains:
– Astronomy & Astrophysics
– Civil Protection
– Computational Chemistry
– Computational Fluid Dynamics
– Computer Science/Tools
– Condensed Matter Physics
– Earth Sciences
– Fusion
– High Energy Physics
– Life Sciences
Further applications are joining all the time, most recently fisheries (iMarine).
Production Grids
WLCG relies on a production-quality infrastructure:
– Requires standards of availability/reliability, performance and manageability
– Used 365 days a year
It is vital that we build a fault-tolerant and reliable system that can deal with individual sites being down, and recover.
Monitoring and operational tools and procedures are as important as the middleware.
Global Grid User Support
GGUS: a web-based portal
– About 1000 tickets per month
– Grid-security aware
– Interfaces to regional/national support structures
From Software To Services
Services require:
– Fabric
– Management
– Networking
– Security
– Monitoring
– User support
– Problem tracking
– Accounting
– Service support
– SLAs
– …
But now on a global scale:
– Respecting the autonomy of sites
– Linking the different infrastructures (NDGF, EGI, OSG)
Focus here: monitoring.
Types of Monitoring
Passive monitoring: measure the real computing activity
– Data transfers
– Job processing
– …
Active monitoring: check the sites by probing them
– Availability/reliability
– Performance
– …
– Functional testing and stress testing
(A minimal active probe is sketched below.)
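To give a flavour of active monitoring, here is a minimal Python sketch: probe a service, then derive an availability figure from the results. The endpoint name is invented and the probe history is faked; real WLCG tests (e.g. the SAM probes) are far richer, but the principle is the same.

```python
# A minimal sketch of active monitoring: probe a site service periodically and
# compute availability from the outcomes. Endpoint and history are illustrative.
import socket

def probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if we can open a TCP connection to the service."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Pretend these are the results of probes run regularly over one day:
results = [probe("storage.example-site.org", 2811) for _ in range(3)]  # hypothetical endpoint
results += [True] * 20 + [False] * 1                                   # faked history, for illustration

availability = sum(results) / len(results)
print(f"availability over the period: {availability:.1%}")
```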
Monitoring Framework
– Monitoring WLCG computing activities since 2006
– A common monitoring framework developed at CERN
– Multiple targets: passive and active
– Different perspectives: users, experts, sites
Passive: Data Transfers (10 GByte/s)
Passive: Data and Job Flow
Active: HammerCloud
Active: SAM3 Tools
Quality over Time
The Future
Everything is working, but at a cost…
Goals and challenges for the future:
– More, more, much more! But not more people and cash!
– Use common technologies
– Lower operations costs
– Clouds: private/commercial clouds, opportunistic resources
– Optimization of code and workflows: need a ~ factor improvement!
Scale of challenge
Computing challenge:
– Roughly “double” this run, then explode thereafter (experiment upgrades, high luminosity)
Two solutions:
– More efficient usage: better algorithms, better data management
– More resources: opportunistic, volunteer; move with technology (clouds, processor architectures)
(Plot: 10-year horizon of CPU (MHS06) and disk (PB) needs for ALICE, ATLAS, CMS and LHCb against historical growth of 25%/year, with room for improvement; “we are here” marks today. A quick growth calculation follows.)
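The gap in that plot can be put into rough numbers. The 25%/year technology growth comes from the slide; the assumed factor-50 growth in need is purely illustrative.

```python
# Quick arithmetic behind the "10 year horizon": if resources keep growing at
# the historical ~25% per year, how much capacity is that after a decade?
growth_per_year = 1.25
years = 10

capacity_factor = growth_per_year ** years
print(f"flat-budget technology growth over {years} years: x{capacity_factor:.1f}")  # ~x9.3

# Illustrative assumption only: suppose upgrades and high luminosity push the
# needed computing up by a factor ~50 over the same period.
assumed_need_factor = 50
print(f"remaining gap to close with better algorithms, data management, "
      f"opportunistic and cloud resources: x{assumed_need_factor / capacity_factor:.1f}")
```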
Motivation for Clouds
Clouds offer flexibility:
– User workloads and system requirements are decoupled
– Dynamic allocation of resources
– Commercial and non-commercial providers
Based on established, open technology and protocols:
– Expertise is widely available
– Products and tools evolve rapidly
– Commercial and non-commercial users
Proven scalability:
– From small in-house systems to worldwide distributed systems
Clouds for LHC
CERN and many WLCG sites are now using cloud technologies to provision their compute clusters
– Many are deploying OpenStack (a global community)
Cloud provisioning:
– Better cluster management and flexibility
– Can run existing grid services on top
The LHC experiments also manage their HLT farms with OpenStack
– Allows them to switch between DAQ and processing
What do Clouds provide?
SaaS, PaaS, IaaS: VMs on demand
Grid vs Clouds
Grids:
– Provide abstraction for services (batch, storage…): high-level, with a huge variety of services
– Provide management of communities: Virtual Organisations (VOs)
– Provider-centric: monitoring, accounting, security model, quotas…
Clouds:
– Abstraction for infrastructure (IaaS): low-level services (CPU, object store…)
– Provide no management of communities: high-level services have to be provided by the VOs (workflow, accounting, quotas, security)
– User-centric: users have to organise workflows, accounting, conceptualisation, monitoring, sharing…
High-level View
On the grid: the Pilot Factory submits a pilot to the Computing Element, the batch system schedules it onto a worker node, and the pilot requests a job from the experiment/VO workload management (task queue), which sends it the job.
On a cloud: the Pilot Factory requests a resource from the cloud's virtual machine interface, which instantiates a VM; the pilot running in the VM requests a job from the task queue and the workload management sends it the job.
Functional Areas
– Image management: provides the job environment; balance pre- and post-instantiation operations
– Capacity management: requires a specific component with some intelligence: do I need to start a VM, and if so where? Do I need to stop a VM, and if so where? Are the VMs that I started OK? (Sketched below.)
– Monitoring
– Accounting
– Pilot job framework
– Supporting services
Clouds are cool, but no magic bullet: lots of additional tasks move to user-land.
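A minimal sketch of the capacity-management "intelligence", assuming a toy queue-length heuristic; the thresholds and data structures are invented, and a real system would drive an IaaS API (e.g. OpenStack) rather than return a dictionary of actions.

```python
# Toy decision loop for capacity management: decide what to start, stop or
# replace from the task-queue length and the state of the VMs already running.
from dataclasses import dataclass

@dataclass
class VM:
    vm_id: str
    healthy: bool
    busy: bool

def capacity_step(queued_jobs: int, vms: list[VM], max_vms: int = 10) -> dict:
    """One pass of the decision loop: what to start, stop, or replace."""
    actions = {"start": 0, "stop": [], "replace": []}

    # Are the VMs that I started OK? Replace the ones that are not.
    actions["replace"] = [vm.vm_id for vm in vms if not vm.healthy]

    healthy = [vm for vm in vms if vm.healthy]
    idle = [vm for vm in healthy if not vm.busy]

    if queued_jobs > len(healthy) and len(healthy) < max_vms:
        # Do I need to start a VM? Yes: more queued work than running capacity.
        actions["start"] = min(queued_jobs - len(healthy), max_vms - len(healthy))
    elif queued_jobs == 0 and idle:
        # Do I need to stop a VM? Yes: no queued work and some VMs sit idle.
        actions["stop"] = [vm.vm_id for vm in idle]

    return actions

vms = [VM("vm-1", True, True), VM("vm-2", True, False), VM("vm-3", False, False)]
print(capacity_step(queued_jobs=5, vms=vms))   # start more VMs, replace vm-3
```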
Volunteer Computing
“It would have been impossible to release physics results so quickly without the outstanding performance of the Grid (including the CERN Tier-0)”
– Includes MC production, user and group analysis at CERN, 10 Tier-1s, ~70 Tier-2 federations, >80 sites
– ~100k concurrent ATLAS jobs (Jan–July 2012)
– >1500 distinct ATLAS users do analysis on the Grid
– Available resources fully used/stressed (beyond pledges in some cases)
– Massive production of 8 TeV Monte Carlo samples
– A very effective and flexible computing model and operations team accommodate high trigger rates and pile-up, intense MC simulation, and analysis demands from worldwide users (e.g. through dynamic data placement)
Conclusions
– Grid Computing and WLCG have proven themselves during the first run of LHC data-taking
– Grid Computing works for our community and has a future
– The model changed from a tree to a mesh structure: networks improved much faster than CPUs
– Shift from resource provider to user community: new tasks, new responsibilities, new tool-chains
– Lots of challenges for our generation!