Adaptable Enterprise Pipeline for the Computational Proteomics Analysis System (CPAS)

Brendan MacLean (1), Parag Mallick (2)
(1) LabKey Software, Seattle, WA, USA
(2) Spielberg Family Center for Applied Proteomics, Los Angeles, CA, USA
ASMS 2008

Overview

High-throughput, distributed analyses of proteomics data are a critical aspect of proteomics research. To meet escalating challenges in data processing, research institutions have been expanding their computational infrastructures. However, huge volumes of data, rapid evolution of software tools, and complex software configuration present serious problems for both individual researchers and high-throughput proteomics facilities. Full automation of processing that ranges across multiple computers and many tools, both custom and licensed, remains elusive.

CPAS has long supported a number of ways for software developers to create fully automated pipelines that can be started and then monitored from a web browser. This framework has allowed an active open-source community to create full-featured pipelines for MS/MS peptide identification and quantification, running Mascot [2], SEQUEST [3], X! Tandem [4], msInspect-Q3 [5], and tools from the Trans-Proteomic Pipeline on systems incorporating over 100 computers. These pipelines, however, required a good deal of programming skill to create and even to modify.

Introduction

Our objective was to build a more reliable foundation on which distributed proteomics data processing pipelines could be created and modified without code compilation. As part of developing the open-source software system CPAS [1], we have significantly reduced the complexity of creating, modifying, and sharing automated proteomics data processing pipelines that scale from a single machine to vast computational clusters and distributed services. We achieved this reduction in complexity by integrating best-of-breed, open-source Enterprise Software tools developed through years of investment in distributed services for large businesses and financial institutions, as well as the Globus computational grid project. This new architecture allows proteomics data processing pipelines, and the programs in them, to be configured after installation of a CPAS server. We present a general introduction to the software modifications that enable this Adaptable Enterprise Pipeline, supply example configuration XML, and discuss plans for future work.

Methods

Three critical areas of change enabled the move from pipeline processing built mostly on unstructured custom code to a structured system where pipelines and tools can be declared in configuration files and run anywhere.

First, we split the monolithic PipelineJob class, leaving the PipelineJob itself as simply the data model, to be saved as XML and transferred to any number of computers participating in the data processing. We introduced the Task as the primary processing unit, with access to the context information of the PipelineJob data model, and the TaskPipeline as a progression of Tasks to be performed.
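Because the PipelineJob travels between machines as XML, its serialized form is the contract between the web server and the remote task runners. The fragment below is a purely hypothetical sketch of what a saved job might carry; every element and attribute name here is our own illustration, not the actual CPAS serialization format.

    <!-- Hypothetical serialized PipelineJob state; all names in this     -->
    <!-- fragment are illustrative assumptions, not the real CPAS schema. -->
    <pipelineJob id="ms1-run-042" taskPipeline="ms1AnalysisPipeline">
        <!-- Which Task in the TaskPipeline progression runs next. -->
        <activeTask name="msInspectTask" status="WAITING"/>
        <!-- Context information every Task can read. -->
        <inputFile>/data/experiment42/sample.mzXML</inputFile>
        <parameter name="msinspect findpeptides, start scan">1</parameter>
    </pipelineJob>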

Second, we chose to put a message queue at the center of our distributed communications [Figure 1]: an atomic, durable queue of PipelineJobs in various states, receiving processing time from the computers designated to run their active Task. Use of the Mule Enterprise Service Bus standardizes communication with the message queue and supports easy configuration of stand-alone Mule servers for remote task processing. Globus WS GRAM provides a layer of abstraction that will allow Tasks to be run on a variety of computer clusters and grids.

Figure 1 – A LabKey Server configured to run processing tools on remote computers and a computational cluster, using a message queue for distributed communication. (Diagram components: LabKey Server, Database Server, File Storage, Message Queue, Mule-based Remote Task Servers, and a cluster Scheduler reached through Globus GRAM.)
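To make this wiring concrete, the sketch below declares a JMS connection factory as a Spring bean that the LabKey Server and each Mule server could share. The poster does not name a specific JMS provider; Apache ActiveMQ, the bean id, and the broker URL are assumptions for illustration.

    <!-- Minimal sketch: every participating server points at the same   -->
    <!-- durable message broker. ActiveMQ and the URL shown are          -->
    <!-- illustrative assumptions; any JMS provider fills the same role. -->
    <bean id="messageQueueConnectionFactory"
          class="org.apache.activemq.ActiveMQConnectionFactory">
        <property name="brokerURL" value="tcp://pipeline-broker.example.org:61616"/>
    </bean>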
Third, we implemented FileAnalysisTaskPipeline and CommandTask as configurable extensions of TaskPipeline and Task for the most common processing method of running existing programs on data files. We chose the Spring bean configuration format for configuring these classes [Figures 2 & 3], which proved sufficiently portable that it could be used to configure the same code to run on the LabKey Server, on Mule Server instances, and in processes running on cluster nodes. We put considerable effort into creating a rich configuration language for the critical task of exposing pipeline parameters in the BioML format used in the MS2 pipelines, complete with directives to convert these input parameters and files to command-line arguments that drive the pipeline processing [Table 1 & Figure 3].

Figure 2 – An XML definition for an MS1 pipeline.
Figure 3 – An XML task definition for msInspect.

Table 1 – msInspect usage

    --findPeptides [--out=filename] [--start=scan] [--count=scanCount] … [--noAccurateMass] mzxmlfile

    Parameter                                   Description
    msinspect findpeptides, start scan          Minimum scan number (default 1)
    msinspect findpeptides, scan count          Number of scans to search, if not all (default )
    msinspect findpeptides, no accurate mass    Do not attempt mass-accuracy adjustment after the default peak-finding strategy (default false)
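Figures 2 and 3 did not survive transcription, so the fragment below offers a hypothetical reconstruction of their general shape in Spring bean syntax: a FileAnalysisTaskPipeline chaining the three MS1 tasks, and a CommandTask definition that renders a Table 1-style parameter as a command-line switch. The package names, bean ids, property names, and placeholder syntax are all illustrative assumptions, not the actual CPAS configuration schema.

    <!-- Hypothetical sketch in the spirit of Figure 2: an MS1 pipeline -->
    <!-- declared as a progression of tasks. Names are assumptions.     -->
    <bean id="ms1AnalysisPipeline" class="org.labkey.pipeline.FileAnalysisTaskPipeline">
        <!-- Tasks run in declared order over the chosen input files. -->
        <property name="taskProgression">
            <list>
                <ref bean="peakabooTask"/>
                <ref bean="msInspectTask"/>
                <ref bean="pepmatchTask"/>
            </list>
        </property>
    </bean>

    <!-- Hypothetical sketch in the spirit of Figure 3: one external      -->
    <!-- program wrapped as a task, with a parameter-to-switch directive. -->
    <bean id="msInspectTask" class="org.labkey.pipeline.CommandTask">
        <property name="switches">
            <list>
                <!-- "msinspect findpeptides, start scan" becomes --start=<value>. -->
                <value>--start=${msinspect findpeptides, start scan}</value>
            </list>
        </property>
    </bean>

At run time, values entered on the CPAS web pages are written to a BioML parameter file of the kind used by the MS2 pipelines. BioML is itself simple XML, so a sketch with example values might look like this:

    <bioml>
        <!-- Example values only; labels follow the Table 1 naming convention. -->
        <note label="msinspect findpeptides, start scan" type="input">1</note>
        <note label="msinspect findpeptides, no accurate mass" type="input">no</note>
    </bioml>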
Results

We were able to use the XML configuration exposed by this new architecture to implement two MS1 processing pipelines that we released with CPAS version 8.1. The MS1 pipelines may be started by choosing input files and entering parameters using CPAS web pages [Figure 4]. These inputs are successfully converted to command-line arguments for three separate processing programs run in succession:

1. Peakaboo – peak extraction
2. msInspect [6] – feature finding
3. Pepmatch – matching features to MS2 peptide IDs

Pipeline state is tracked in CPAS, and the output from the programs is logged. Both may be viewed from a web browser, in real time, as the pipeline progresses. The Peakaboo program was not included in the default configurations because it has not yet been released. The pipelines, however, may be externally reconfigured to run Peakaboo for labs where the program is available. In fact, we have successfully added arbitrary, unanticipated processing steps, complete with program-specific parameters, to existing pipelines through the same external configuration.

A small amount of custom code was necessary to create the FuGE [7] based experiment description used by CPAS to display a processing workflow graph [Figure 5], and to direct the loading of result data into CPAS. We believe, however, that the new architecture will make it possible to generate this description dynamically based on the pipeline definition.

The existing modular data loading and viewing architecture made supporting the new result file types easy. We added an ExperimentDataHandler class to load each of the new file types (.peaks.xml, .peptides.tsv, and .pepmatch.tsv), as well as custom data viewers [Figure 6] for these types. As pipeline configuration becomes even simpler in the future, we expect custom data loaders and viewers to become the primary focus of new development in CPAS.

Finally, using the same Java archives, and changing only external XML configuration files, we were able to run msInspect on a computational cluster with Globus GRAM, and to run Pepmatch on a remote Mule Server. The message queue allowed process control to pass seamlessly between the LabKey Server and the remote machines designated to run each task. State changes and logging viewed through the web interface were indistinguishable from the same processing run on a single computer, though multi-computer configurations certainly make it possible to process massive amounts of data with little performance impact on the web server.

Figure 4 – Web pages in CPAS allow investigators to choose files, enter command parameters, and start processing.
Figure 5 – This experiment page shows a process graph for an MS1 pipeline configured and run on the new software.
Figure 6 – Custom viewing pages in CPAS show the result data from an MS1 pipeline run.

Conclusions

- We present the early implementation of an Adaptable Enterprise Pipeline in CPAS.
- We created a new structured architecture for CPAS PipelineJobs that supports configuration of distributed automated pipeline processing across any number of computers.
- We present declarative XML configuration capable of exposing rich new automated pipeline processing options on CPAS servers.
- We achieved Mule Server and Globus configurations that run powerful distributed processing.
- Future work includes:
  - Releasing a version of CPAS with full support for remote servers and computational clusters.
  - Generating pipeline experiment descriptions dynamically based on configuration.
  - Investigating pipeline customization at runtime through a web user interface.
  - Improving cross-lab reproducibility of pipeline processing with a sharable pipeline configuration archive.

References

1) Rauch A, Bellew M, et al. J Proteome Res. 2006;5(1).
2) Perkins DN, Pappin DJ, Creasy DM, Cottrell JS. Electrophoresis. 1999 Dec;20(18).
3) Eng JK, McCormack AL, Yates JR 3rd. J Am Soc Mass Spectrom. 1994;5:976-989.
4) Craig R, Beavis RC. Bioinformatics. 2004 Jun 12;20(9).
5) Faca V, et al. J Proteome Res. 2006;5(8).
6) Bellew M, Coram M, Fitzgibbon M, Igra M, et al. Bioinformatics. 2006;22.
7) Jones A, et al. Bioinformatics. 2004;20(10).

This work was funded by the Canary Fund and the Spielberg Family Foundation.