ATLAS Grid Data Processing: system evolution and scalability
D Golubkov, B Kersevan, A Klimentov, A Minaenko, P Nevski, A Vaniachine and R Walker for the ATLAS Collaboration
©2011 IEEE

Abstract: The production system for Grid Data Processing (GDP) handles petascale ATLAS data reprocessing and Monte Carlo activities. The production system has empowered further data processing steps on the Grid performed by dozens of ATLAS physics groups, with coordinated access to computing resources worldwide, including additional resources sponsored by regional facilities. The system provides knowledge management of configuration parameters for massive data processing tasks, reproducibility of results, scalable database access, orchestrated workflow and performance monitoring, dynamic workload sharing, automated fault tolerance and petascale data integrity control.

Conclusions: The GDP production system fully satisfies the requirements of ATLAS data reprocessing, simulations and production by physics groups. The next LHC shutdown provides an opportunity for refactoring the production system, whilst retaining those core capabilities valued most by the users. We collected the requirements and identified core design principles for the next generation production system. Prototyping is in progress to scale up the production system for a growing number of tasks processing data for physics analysis and other main ATLAS activities.

The design of the next generation production system architecture is driven by the following principles:

Flexibility: Because the requirements update process works well, the list of requirements and use cases continues to grow. To accommodate that growth, the next generation production system architecture must be flexible by design.

Isolation: To avoid the inherent fragility of monolithic systems, we adopted another core design principle: isolation. Separating core concerns, the production system logic layer will be decoupled from the presentation layer, so that users will have a familiar but improved interface for task requests.

Redundancy: A core concern is scalability, required to assure eventual consistency of the distributed petascale data processing. Since in a scalable system the top layers should not trust the layers below, we retain redundancy.

[Figure: DEfT, the Database Engine for Tasks. Task-request operations shown: Speed up & Force-finish; Forking & Extensions; Scout Probing; Transient & Intermediate; Recovery.]

Requirements: The production system evolves to accommodate a growing number of users and new requirements from our contacts in the main ATLAS areas: Trigger, Physics, Data Preparation, and Software & Computing.

Jobs: Splitting a large data processing task into jobs (small data processing tasks) is similar to splitting a large file into smaller TCP/IP packets during an FTP data transfer. Users do not care how many TCP/IP packets were dropped during the file transfer; similarly, users do not care about transient job failures when petascale data processing delivers six sigma quality performance in production, corresponding to event losses below the level.

Performance: Automatic job resubmission avoids event losses at the expense of the CPU time used by the failed jobs. The distribution of tasks, ordered by the CPU time used to recover transient failures, is not uniform: most of the CPU time required for recovery was used in a small fraction of tasks. In the 2010 reprocessing, the CPU time used to recover transient failures was 6% of the CPU time used for reconstruction; in the 2011 reprocessing it was reduced to 4%.

[Figure: CPU-hours used to recover transient failures vs. task order.]
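The recovery mechanism described above can be illustrated with a short sketch. The following Python fragment is a hypothetical illustration only, not the ProdSys or PanDA code: the retry limit, failure rate and all names are assumptions. It resubmits transiently failed jobs up to a fixed number of attempts and reports the recovery overhead as the ratio of CPU time spent on failed attempts to CPU time spent on successful jobs, the same ratio quoted above as 6% (2010) and 4% (2011).

```python
import random

# Hypothetical sketch only: retry transiently failed jobs and measure the
# CPU-time overhead of the recovery, analogous to the 6% (2010) and 4% (2011)
# reprocessing figures quoted above. Values below are illustrative assumptions.
MAX_ATTEMPTS = 3               # assumed resubmission limit per job
TRANSIENT_FAILURE_RATE = 0.05  # assumed probability of a transient failure


def run_job(job_id):
    """Pretend to run one reconstruction job; return (succeeded, cpu_hours)."""
    cpu_hours = random.uniform(1.0, 3.0)
    succeeded = random.random() > TRANSIENT_FAILURE_RATE
    return succeeded, cpu_hours


def process_task(n_jobs):
    """Run all jobs of a task with automatic resubmission of transient failures."""
    good_cpu, recovery_cpu, lost_jobs = 0.0, 0.0, 0
    for job_id in range(n_jobs):
        for attempt in range(1, MAX_ATTEMPTS + 1):
            ok, cpu = run_job(job_id)
            if ok:
                good_cpu += cpu
                break
            recovery_cpu += cpu   # CPU time spent on the failed attempt
        else:
            lost_jobs += 1        # retries exhausted: these events would be lost
    overhead = recovery_cpu / good_cpu if good_cpu else 0.0
    return overhead, lost_jobs


if __name__ == "__main__":
    overhead, lost = process_task(n_jobs=10_000)
    print(f"recovery overhead: {overhead:.1%}, jobs lost: {lost}")
```

In production the resubmission is automatic and bounded, so residual event losses stay small while the recovery overhead remains a few per cent of the reconstruction CPU time.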
The production system architecture grew organically, and legacy requirements constrained its design and implementation.

[Figure: Task Evolution; Tag Definition; Task Request.]

The next LHC shutdown provides an opportunity to rethink the architecture, making the implementation more maintainable, consistent and scalable. Behind a familiar Task Request interface, the DEfT architecture will emerge.

Tasks: In the ATLAS production system, the task (a collection of jobs) is used as the main unit of computation. The ATLAS job workload management system PanDA provides transparency of data and processing. Demonstrating system scalability, the rate of production task requests grows exponentially.
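To make the task/job hierarchy concrete, here is a minimal sketch of a task as a collection of jobs. The data model, field names and the splitting policy are illustrative assumptions and do not reflect the actual production system or PanDA schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of the task/job hierarchy described above;
# all field names and values are illustrative assumptions.

@dataclass
class Job:
    job_id: int
    input_files: List[str]
    status: str = "defined"      # e.g. defined -> submitted -> finished/failed

@dataclass
class Task:
    task_id: int
    transformation: str          # processing step, e.g. a reconstruction transform
    parameters: dict             # configuration parameters kept for reproducibility
    jobs: List[Job] = field(default_factory=list)

    def split(self, input_files: List[str], files_per_job: int) -> None:
        """Split the task's input dataset into jobs, the unit handled by the workload manager."""
        for i in range(0, len(input_files), files_per_job):
            self.jobs.append(Job(job_id=len(self.jobs),
                                 input_files=input_files[i:i + files_per_job]))

# Usage: a task over 10 input files, 2 files per job -> 5 jobs.
task = Task(task_id=1, transformation="reconstruction", parameters={"release": "example"})
task.split([f"data.{n:04d}.RAW" for n in range(10)], files_per_job=2)
print(len(task.jobs))  # 5
```

In the real system the splitting, bookkeeping and submission are handled by the production system and PanDA; the sketch only shows the task-to-jobs relationship that makes the task the unit of computation.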