Job Processing Database consolidation Task recovery De-cronification


Job Processing
- Database consolidation
- Task recovery
- De-cronification
- T1 overflow primary reco
- Panda

Database consolidation
- Task_req ~ etask
  - Currently synced back and forth, e.g. status
  - T0 has no task req: data-driven prod is 'whim-driven'
  - Status in requested, pending, submitting, running, ...
- EJobExe ~ Panda (activated/archived jobs)
  - Biggest table, mostly duplicate info
- EjobDef?
  - Dynamic job definition to process every file
  - Some for MP (500 evt), some for single core (50)?
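The dynamic job definition idea can be illustrated with a minimal sketch that splits one input file into event ranges. The 500 and 50 events-per-job figures come from the slide; the function name, the mode labels and the half-open range convention are assumptions made for illustration, not the actual prodsys schema.

```python
# Events per job by job type; the numbers are taken from the slide.
EVENTS_PER_JOB = {"mp": 500, "single": 50}


def define_jobs(n_events, mode):
    """Split a file with n_events into half-open (first, last) event ranges.

    Hypothetical helper: every event in the file is covered exactly once,
    and the last job takes whatever remainder is left.
    """
    size = EVENTS_PER_JOB[mode]
    return [(start, min(start + size, n_events))
            for start in range(0, n_events, size)]


if __name__ == "__main__":
    print(define_jobs(1200, "mp"))
    print(len(define_jobs(1200, "single")))
```

Defining jobs per file at submission time, rather than storing a fixed job list, is what would let the same task mix multi-process and single-core jobs.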

Task Recovery
- Examples
  - repro of some events fails: wait for new cache
  - AOD corrupted during merge
  - lost files
  - turn off HIST to save RAM
- Create recovery task to (re)produce missing files
  - Attach to original container
- What about the TAG?
  - New merged AOD: update TAG guid? No.
  - Verify AOD before upload (in merge TRF) and replicate.
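A recovery task amounts to a set difference between the expected outputs and the outputs that are present and intact. The sketch below assumes a task is a plain dictionary with hypothetical `id` and `container` keys; it is not the real task schema.

```python
def missing_outputs(expected, produced, corrupted=()):
    """Return the output files that must be (re)produced, sorted by name."""
    good = set(produced) - set(corrupted)
    return sorted(set(expected) - good)


def make_recovery_task(original_task, expected, produced, corrupted=()):
    """Build a recovery task attached to the original container.

    Returns None when nothing is missing. The dictionary layout is an
    assumption made for this example.
    """
    files = missing_outputs(expected, produced, corrupted)
    if not files:
        return None
    return {
        "parent_task": original_task["id"],
        "container": original_task["container"],  # attach to original container
        "files": files,
    }


if __name__ == "__main__":
    task = {"id": 1, "container": "example.container"}
    print(make_recovery_task(task, ["a", "b", "c"], ["a", "b"], corrupted=["b"]))
```

Treating corrupted files the same as lost files keeps a single recovery path for both failure modes listed above.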

De-cronification
- AKTR and jobdef have many crons
  - process bulk tasks, tasks (pending, submitting), more jobs, add attempts, abort jobs, stats
  - create output datasets, freeze them, add to containers
  - subscribe datasets, delete them
- Representation of desired state and engines to get there
- Messaging triggers between different actors
  - e.g. if output datasets exist, then create jobs
  - poll for dataset, or trigger: do both
- Delay between task definition and job creation is artificial
  - partly due to crons, and partly deliberate to allow user back-out
  - 'Submit Task': are you sure (yes/no)? Can always abort (cf. qsub)
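The "desired state plus engines" idea, with both triggers and polling, can be sketched as an idempotent reconcile step. All class and method names below are hypothetical; the only rule taken from the slide is that jobs are created once the output datasets exist.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    datasets_exist: bool = False
    jobs_created: bool = False


def reconcile(task):
    """Move a task one step towards the desired state; return the action taken."""
    if not task.datasets_exist:
        task.datasets_exist = True
        return "create_datasets"
    if not task.jobs_created:  # only reached once the datasets exist
        task.jobs_created = True
        return "create_jobs"
    return None  # desired state reached, nothing to do


class Engine:
    def __init__(self):
        self.tasks = {}
        self.log = []

    def submit(self, task):
        self.tasks[task.name] = task
        self.on_message(task.name)  # trigger: act immediately, no cron delay

    def on_message(self, name):
        # Keep reconciling until the task needs no further action.
        while (action := reconcile(self.tasks[name])) is not None:
            self.log.append((name, action))

    def poll(self):
        # Safety net: catches anything a lost message would have triggered.
        for name in list(self.tasks):
            self.on_message(name)


if __name__ == "__main__":
    engine = Engine()
    engine.submit(Task("t1"))
    engine.poll()  # no-op here, because the trigger already did the work
    print(engine.log)
```

Because `reconcile` is idempotent, running the trigger and the poll together is harmless, which is why "do both" is a safe answer.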

Example: Task Request
- Task (chain) request web form (click submit)
- Sanity check (inputs, release, ...), accept task
- Create output datasets and containers
- Create jobs for bamboo, 1st task
- Monitor loop
  - more jobs, increase maxattempt, abort jobs, notify user, jobs for next task
  - requestor can have active input: abort, tobefinished, more RAM
  - clickable actions for user analysis too?
- Task done? Freeze datasets, subscribe to wherever
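The request flow above can be read as a small state machine. The following is a sketch only: the state and event names are invented for illustration (apart from `abort` and `tobefinished`, which appear on the slide), and the real system may use different states.

```python
# Allowed transitions: state -> {event: next_state}. Names are assumptions.
TRANSITIONS = {
    "requested": {"accept": "accepted", "reject": "rejected"},
    "accepted": {"datasets_created": "defined"},
    "defined": {"jobs_created": "running"},
    "running": {
        "all_jobs_done": "done",
        "tobefinished": "finishing",  # user action from the slide
        "abort": "aborted",           # user action from the slide
    },
    "finishing": {"all_jobs_done": "done", "abort": "aborted"},
    "done": {"freeze": "frozen"},
}


def step(state, event):
    """Return the next state, or raise ValueError if the event is not allowed."""
    try:
        return TRANSITIONS[state][event]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state!r}") from None


def run(events, state="requested"):
    """Apply a sequence of events starting from the given state."""
    for event in events:
        state = step(state, event)
    return state


if __name__ == "__main__":
    print(run(["accept", "datasets_created", "jobs_created",
               "all_jobs_done", "freeze"]))
```

An explicit transition table makes it easy to see which user actions are valid at each point, for example that a frozen task can no longer be aborted.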

T1 overflow primary reco
- Expect p-p reco to keep up at T0 for this year
- IF backlog approaches the 10-day disk buffer, may need to use T1s
  - should prepare and test?
- T0 reco and T1 repro have differences
  - data-driven prodsys task definition
- Test run was HI, cut short, but monitoring was criticized
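The overflow condition can be written as a simple threshold check. The 10-day buffer is from the slide; the 0.8 safety fraction (what "approaches" means) and the function names are assumptions chosen for this sketch.

```python
def backlog_days(unprocessed_tb, processed_tb_per_day):
    """Estimate how many days of processing the current backlog represents."""
    if processed_tb_per_day <= 0:
        raise ValueError("processing rate must be positive")
    return unprocessed_tb / processed_tb_per_day


def should_overflow_to_t1(days, buffer_days=10.0, safety_fraction=0.8):
    """True when the backlog approaches the disk buffer.

    safety_fraction is an assumed margin, not a value from the slides.
    """
    return days >= buffer_days * safety_fraction


if __name__ == "__main__":
    days = backlog_days(unprocessed_tb=450.0, processed_tb_per_day=50.0)
    print(days, should_overflow_to_t1(days))
```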

Panda
- Break cloud boundaries
  - small clouds can trap big hi-prio task
  - 50k cores process high priority task
  - 10 smaller tasks better?
  - transfer to destination T1 is bottleneck
- Merge HITS at T2 before transfer (or MP)
  - Prodsys or DDM? More general if DDM does it
  - Compulsory generic merging post-prod step for users

Random stuff
- Feedback of actual walltime/RAM usage from scout jobs
- Task chain monitoring