GLAST Data Handling Pipeline
Daniel Flath, Stanford Linear Accelerator Center, for the GLAST Collaboration

Overview

The GLAST Data Handling Pipeline (the Pipeline) provides a uniform interface to the diverse data handling needs of the GLAST collaboration. Its goal is to generically process graphs of dependent tasks while maintaining a full record of its state, history and data products. It will be used to process the down-linked data acquired from the satellite instrument after launch in 2007, acquiring the data from the Mission Operations Center and delivering science products to the GLAST collaboration and the Science Support Center. It is currently used to perform Monte Carlo simulations and analysis of data taken from the instrument during integration and commissioning. This historical information serves as the base from which formal data catalogs will be built. By cataloging the relationships between data, analysis results and the software versions that produced them, along with statistics of the processing (memory and CPU usage), we are able to fully track the provenance of all data products. We can then investigate and diagnose problems in software versions and determine a solution, which may include reprocessing of data.

Data reprocessing is inherently simple with the Pipeline. A representation of the processing chain and data flow is stored internally in Oracle and specified by the user in an XML file, which is uploaded to the Pipeline with an ancillary utility. Minor changes to the XML file allow similar processing to be retried using a patched version of the code. The bulk of the code is written in Oracle's PL/SQL language as stored procedures compiled into the database. A middle tier of code linking the database to disk access and the SLAC LSF batch farm (3000+ CPUs) is implemented in Perl. An automatically generated library of Perl subroutines wraps each stored procedure and provides a simple entry-point interface between the two code bases. We will address lessons learned in operating the Pipeline in terms of needed configuration flexibility and long-term maintenance.
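To make the task-definition workflow concrete, the fragment below is a purely illustrative sketch of what an uploaded task definition might contain. The poster does not reproduce the actual XSD-backed format, so every element and attribute name here is an assumption; only the kinds of information carried (executable paths, execution order, input/output datasets, LSF batch queue and group) come from the description later in this transcript.

    <!-- Hypothetical task definition; element and attribute names are illustrative only. -->
    <task name="example-mc-task" type="MonteCarlo">
      <process name="simulate" order="1"
               executable="/nfs/farm/g/glast/example/bin/runSim.pl"
               batchQueue="long" batchGroup="glast-mc"/>
      <process name="reconstruct" order="2"
               executable="/nfs/farm/g/glast/example/bin/runRecon.pl"
               batchQueue="long" batchGroup="glast-mc"/>
      <dataset name="sim-events" fileType="ROOT" datasetType="events"
               location="/nfs/farm/g/glast/example/data/sim"/>
      <dataset name="recon-events" fileType="ROOT" datasetType="events"
               location="/nfs/farm/g/glast/example/data/recon"/>
    </task>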
Perl PL/SQL Wrappers

    # FUNCTION: CREATETASK
    # RETURNS:  NUMBER(22)
    # INPUT:    TASKTYPE_FK_IN     NUMBER(22) IN
    # INPUT:    TASKNAME_IN        VARCHAR2   IN
    # INPUT:    GLASTUSER_FK_IN    NUMBER(22) IN
    # INPUT:    NOTATION_IN        VARCHAR2   IN
    # INPUT:    BASEFILEPATH_IN    VARCHAR2   IN
    # INPUT:    RUNLOGFILEPATH_IN  VARCHAR2   IN
    sub CREATETASK {
        my $self = shift;
        # input arguments as hashref with keys=parameters
        my $args = shift;
        my $dbh  = $self->{dbh};

        # function return value:
        my $func_result = undef;

        # PL/SQL function call
        my $sth1 = $dbh->prepare('
            BEGIN
                :func_result := DPF.CREATETASK(:TASKTYPE_FK_IN, :TASKNAME_IN,
                                               :GLASTUSER_FK_IN, :NOTATION_IN,
                                               :BASEFILEPATH_IN, :RUNLOGFILEPATH_IN);
            END;
        ');

        # bind function return value:
        $sth1->bind_param_inout(":func_result", \$func_result, 0);

        # bind parameters:
        $sth1->bind_param(":TASKTYPE_FK_IN",    $args->{TASKTYPE_FK_IN});
        $sth1->bind_param(":TASKNAME_IN",       $args->{TASKNAME_IN});
        $sth1->bind_param(":GLASTUSER_FK_IN",   $args->{GLASTUSER_FK_IN});
        $sth1->bind_param(":NOTATION_IN",       $args->{NOTATION_IN});
        $sth1->bind_param(":BASEFILEPATH_IN",   $args->{BASEFILEPATH_IN});
        $sth1->bind_param(":RUNLOGFILEPATH_IN", $args->{RUNLOGFILEPATH_IN});

        # execute
        $sth1->execute() || die $!;

        # return stored function result:
        return $func_result;
    }

Stored Procedures & Functions

    PROCEDURE deleteTaskProcessByPK(TaskProcess_PK_in IN int) IS
        RecordInfoList CURSORTYPE;
        RI_PK_iter     RecordInfo.RecordInfo_PK%TYPE;
    BEGIN
        -- Delete the RI_TaskProcess and RecordInfo rows:
        OPEN RecordInfoList FOR
            select RecordInfo_FK from RI_TaskProcess
             where TaskProcess_FK = TaskProcess_PK_in;
        LOOP
            FETCH RecordInfoList INTO RI_PK_iter;
            EXIT WHEN RecordInfoList%NOTFOUND;
            delete from RI_TaskProcess where RecordInfo_FK = RI_PK_iter;
            delete from RecordInfo     where RecordInfo_PK = RI_PK_iter;
        END LOOP;
        CLOSE RecordInfoList;

        -- Delete TP_DS rows:
        delete from TP_DS where TaskProcess_FK = TaskProcess_PK_in;

        -- Delete the TaskProcess:
        delete from TaskProcess where TaskProcess_PK = TaskProcess_PK_in;
    END;

Database Scheduling / Processing Logic

(State diagram, summarized: run states in blue are Waiting, Running, End_Success and End_Failure; job states in gold are Prepared, Submitted, Started, Ended and Failed; utility scripts in pink are createRun and rollBackFailedRun. One run owns a sequence of jobs. createRun creates a run marked ready for scheduling; any job failure marks the run failed; success of the final job marks the run End_Success, while success of an earlier job marks the run for scheduling of the next job in the sequence; rollBackFailedRun deletes the failed job and returns the run to the ready-for-scheduling state.)

1. New run(s) created with the createRun utility.
2. Scheduler wakes up and collects waiting runs:
   - Run promoted to the Running state.
   - Job prepared and passed to the BatchSub facility (see the N. Golpayegani poster).
   - An error results in job failure.
3. BatchSub scheduler wakes up and collects waiting jobs:
   - Job submitted to the LSF compute farm.
   - A submission error results in job failure, or
   - the Pipeline is notified via callback and marks the job Submitted.
4. LSF scheduler decides to run the job:
   - Job assigned to a farm machine.
   - An LSF failure results in job failure, or
   - BatchSub is notified and invokes a callback to the Pipeline, which marks the job Started.
5. Job finishes or runs out of its CPU allocation:
   - BatchSub is notified and invokes a callback to the Pipeline with the job's exit status.
   - On failure, the Pipeline marks the job Failed, marks the run End_Failure, and processing halts until a user intervenes (see step 6).
   - On success, the Pipeline marks the job Ended:
     - if it was the last job in the sequence, the run is marked End_Success (processing done), or
     - if more jobs remain in the sequence, the run is marked Waiting for scheduling of the next job (return to step 2).
6. Failed run(s) rescheduled with the rollBackFailedRun utility:
   - The failed job and all its data are deleted.
   - The run is marked Waiting, as if the last successful job had just finished (return to step 2).
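The run and job bookkeeping in steps 5 and 6 reduces to a small set of state transitions. The Perl fragment below is a minimal sketch of that logic, not the Pipeline's actual scheduler code; all subroutine names and hash keys are illustrative.

    use strict;
    use warnings;

    # Step 5: decide run/job states from a finished job's exit status.
    sub handle_job_completion {
        my ($run, $job, $exit_status) = @_;
        if ($exit_status != 0) {
            $job->{state} = 'Failed';
            $run->{state} = 'End_Failure';   # halts until a user intervenes (step 6)
            return;
        }
        $job->{state} = 'Ended';
        if ($job->{sequence} == $run->{last_sequence}) {
            $run->{state} = 'End_Success';   # last job in the sequence: run is done
        } else {
            $run->{state} = 'Waiting';       # scheduler will pick up the next job
        }
    }

    # Step 6: rollBackFailedRun-style recovery.
    sub roll_back_failed_run {
        my ($run, $failed_job) = @_;
        # In the real Pipeline the failed job's records and data files are deleted
        # via stored procedures; here we only reset the in-memory state.
        $failed_job->{state} = 'Deleted';
        $run->{state}        = 'Waiting';    # as if the last successful job just finished
    }

    # Example: a run whose second of three jobs has just exited successfully.
    my $run = { state => 'Running', last_sequence => 3 };
    my $job = { state => 'Started', sequence => 2 };
    handle_job_completion($run, $job, 0);    # $run->{state} is now 'Waiting'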
Did you know?

- Oracle has an internal language called PL/SQL.
- PL/SQL is a strongly typed language, much like Pascal, that blends pure SQL with conditional and program logic.
- PL/SQL supports both stored procedures that perform an operation and stored functions that return a scalar value or a cursor containing table rows.
- This powerful language allows SQL query results to be manipulated before they are returned, or bookkeeping to be performed to log user access history.
- The code is compiled and stored in the database for rapid execution.
- The GLAST Pipeline includes over 130 stored procedures and functions and over 2000 lines of PL/SQL code.
- The Pipeline does no direct manipulation of database tables; all database access goes through these native subroutines.
- Stored PL/SQL can be invoked from Perl, Python, Java and even JSP via the appropriate database driver (Perl DBI, JDBC); see Stored Procedures & Functions.

We have developed a utility to generate Perl wrapper subroutines for every stored PL/SQL routine used by the Pipeline. These subroutines incorporate all of the tedious Perl DBI calls needed to translate data types, communicate with the database and handle exceptions. The user only has to include the generated Perl module and call its subroutines to gain all of the functionality that the PL/SQL code provides.
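As an illustration of how client code might use such a generated wrapper: the argument keys and the CREATETASK call below follow the wrapper shown earlier, but the module name (GLAST::DPF), its constructor, and all paths and values are assumptions made up for this sketch.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use DBI;
    use GLAST::DPF;   # hypothetical name for the generated wrapper module

    # Connect to the pipeline database (connection string and credentials are placeholders).
    my $dbh = DBI->connect('dbi:Oracle:GLASTDB', 'pipeline', 'secret',
                           { RaiseError => 1, AutoCommit => 0 });

    # Assumed constructor that stashes the database handle for the wrappers to use.
    my $dpf = GLAST::DPF->new(dbh => $dbh);

    # Call the stored function DPF.CREATETASK through its generated wrapper;
    # the return value is the primary key of the newly created task.
    my $task_pk = $dpf->CREATETASK({
        TASKTYPE_FK_IN    => 1,
        TASKNAME_IN       => 'example-mc-task',
        GLASTUSER_FK_IN   => 42,
        NOTATION_IN       => 'illustrative example only',
        BASEFILEPATH_IN   => '/nfs/farm/g/glast/example',
        RUNLOGFILEPATH_IN => '/nfs/farm/g/glast/example/logs',
    });
    $dbh->commit;
    print "Created task with primary key $task_pk\n";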
The schema is normalized to provide efficient data access for both:
- management and administration of active processing, and
- retrieval during data mining to support analysis (a query sketch appears at the end of this transcript).

The schema contains three distinct table types:
- Regular tables store the bulk of the records.
- Enumerated tables allow rows in regular tables to specify one of several possible metadata values without duplicating strings; this helps query speed and promotes consistency.
- Relational tables allow many-to-many record relationships between two regular tables.

The schema consists of two tiers.

Management tables, for defining pipeline tasks:
- The Task table defines the attributes of a task, such as real-data analysis or simulation.
- The TaskProcess table defines the sequence of processing steps for each task.
- The Dataset table defines the data products of a task, including the locations where they should be stored.
- TP_DS entries link TaskProcess and Dataset records to define the read/write relationships between processing and data for a task.

Instance tables:
- Each Run table record describes a specific (named) instantiation of processing for some task; there are many runs for every task. For example, a Monte Carlo task may contain 1000 runs, each simulating and reconstructing 5000 particle/detector interaction events.
- For every run, a TPInstance record is created for each TaskProcess in the run's parent task. A TPInstance entry records statistics of the compute job that performed the processing defined in the corresponding TaskProcess.
- Similarly, for every run, DSInstance records (recording disk file locations and statistics) are created for each file corresponding to a Dataset record in the task that owns the run.
- TPI_DSI records relate the TPInstance and DSInstance processing/data history for a run.

Access to data:

Management tables are administered by users via an XML upload/download web-based utility:
- Tasks are defined in an explicitly specified XML format (backed by an extensible XSD schema) that specifies paths to executables, the order of execution, the required input and output data files, the LSF batch queue and batch group, and other metadata ("TaskType", "Dataset Type", "Dataset File Type").
- Existing tasks can be downloaded in this XML format, used as a template, and the modifications re-uploaded to define a new task; this saves significant time.

Instance tables are accessed by both the pipeline software and a web front end:
- Pipeline software access is only through stored procedures and functions, via the pipeline scheduler and command-line utility scripts. The scheduler manages and records processing; the utility scripts allow users to create new runs, reschedule failed runs, and so on.
- The web front end uses read-only access to provide users with a view of processing for monitoring. Processing log files can be viewed, and if users notice a problem they can administrate using the command-line utility scripts. Eventually these utilities will be made available directly from the front-end interface.

Web Based User Interface Provides World-wide Administration

Pipeline monitoring is performed by humans via a web-based user interface consisting of three main views.

The Task Summary View provides an overview of processing for all tasks (leftmost picture):
- Each task is listed with the status of the runs belonging to it.
- Filtering can be performed by task name to reduce the number of displayed tasks.
- The results can be sorted by name or by status counts.
- Clicking a task name drills down to the Run Summary View for that task.

The Run Summary View displays more specific information about the ongoing processing for a task (middle picture):
- Each stage of processing (TaskProcess) within the task is displayed along with its job instances, totaled by processing state.
- Clicking on a TaskProcess name brings up the detail view for jobs of that type.
- Clicking on one of the totals brings up the detail view for jobs of that TaskProcess type in only the corresponding processing state.

The Individual Runs View displays details about the jobs selected in the Run Summary View (rightmost picture):
- Statistics for each job are displayed, including its current status.
- FTP links to processing log files are provided and can be clicked to display the corresponding log. These include:
  - Log: job output (stdout/stderr)
  - Files: a list of links to all log files for download
  - Out: standard output of the pipeline decision logic during job management and scheduling
  - Err: standard error of the pipeline decision logic, if any
- This interface will soon be updated to include buttons that invoke administrative utilities currently available only on the command line, including deletion of runs, rescheduling of failed runs, and scheduling of new runs.
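To illustrate the data-mining side of the normalized schema described above, the query below is a hedged sketch of a provenance lookup for a single run. The table names (Run, TPInstance, TaskProcess, TPI_DSI, DSInstance) come from this poster; every column name is an assumption following the _PK/_FK convention visible in the stored procedures, and the real schema may differ.

    -- Hypothetical provenance query: which files did a given run produce,
    -- which processing step wrote them, and what did each job cost?
    -- All column names are assumptions; only the table names appear on the poster.
    SELECT ds.FilePath,
           tp.TaskProcessName,
           tpi.CpuSeconds,
           tpi.MemoryUsage
      FROM Run r
      JOIN TPInstance  tpi ON tpi.Run_FK        = r.Run_PK
      JOIN TaskProcess tp  ON tp.TaskProcess_PK = tpi.TaskProcess_FK
      JOIN TPI_DSI     lnk ON lnk.TPInstance_FK = tpi.TPInstance_PK
      JOIN DSInstance  ds  ON ds.DSInstance_PK  = lnk.DSInstance_FK
     WHERE r.RunName = 'example-run-0001';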