Computing the Smith-Waterman Algorithm on the Illinois Bio-Grid
Dave S. Angulo 1, Nigel M. Parsad 2, Tom Goodale 3, Gabrielle Allen 3, Ed Seidel 3
1 The School of Computer Science, Telecommunications and Information Systems, DePaul University | 2 Kurt Rossman Labs, The University of Chicago | 3 Albert Einstein Institute, Golm (AEI/MPG)
Motivation: To exploit the prodigious computational resources of the Illinois Bio-Grid (IBG) by simultaneously querying multiple protein sequences against multiple protein sequence databases for homology. The Smith-Waterman algorithm will be used because it guarantees the optimal local pairwise alignment between homologous sequences. The efficiency gained by distributing both the database query and the dynamic-programming load in parallel should be substantially greater than that of the single-sequence/single-database search that is the current computational biology standard.

Task Farming Basics on the Grid: Smith-Waterman Task Farming (SWTask) on the IBG

Machines Involved:
A. An N-processor Grid that dynamically allocates resources for client processes.
B. One processor is designated the Master Task Farm Manager, TFM(0).
C. M processors are designated Worker Task Farm Managers, TFM(1).

Data Involved:
A. P source data files (an estimated 140) from the sequence database.
B. Each data file holds perhaps 100,000 sequence strings, with potentially 4,000 characters per string.
C. P can be broken into subsets P', P'', etc.

Tasks Involved:
A. Download the P source data files. The total number of characters to compare is approximately 56 billion (140 source files x 100,000 sequence strings x 4,000 characters per string).
B. Complete a W x W matrix of comparisons: for two source files P1 and P2, compute P1 x P2.
C. Since P1 x P2 == P2 x P1, only the upper triangle of the comparison matrix is computed.
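A minimal sketch of the Smith-Waterman kernel that each worker would run on a pair of sequences. The scoring parameters here (match +2, mismatch -1, gap -1) are illustrative assumptions, not values taken from the poster:

```python
# Hedged sketch of a score-only Smith-Waterman local alignment.
# Scoring values are assumed for illustration.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 term is what makes the alignment local: a negative-scoring
            # prefix is dropped and a new alignment can restart anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

This version returns only the optimal score; a production SWTask module would also keep traceback information to recover the alignment itself, and would use a substitution matrix (e.g. BLOSUM) rather than a flat match/mismatch score for proteins.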
Task Management Scenario:
A. TFM(0) gives P TFM(1) processors individual directives to download, process, and "own" one source data file each:
 i. Each TFM(1) processor downloads a source data file, strips off non-essential annotations, and stores the annotations on local disk.
 ii. Each TFM(1) processor keeps the resulting stripped source file in memory for sequence-alignment analysis using Smith-Waterman.
 iii. Each TFM(1) processor remains prepared to send stripped source files to, and receive them from, its TFM(1) peers on the Grid.
 iv. TFM(0) keeps track of which TFM(1) processor owns which source file.
B. TFM(0) gives T TFM(1) processors directives to obtain a second source data file (whole or partial) from a TFM(1) peer:
 i. Each TFM(1) processor asks a TFM(1) peer for the second stripped source file.
 ii. Each TFM(1) processor then performs a pairwise sequence comparison of the two files in memory.
 iii. Each TFM(1) processor then requests more work from TFM(0), which may direct it to ask a peer for a third file in a second thread.
C. TFM(0) tracks and dynamically manages:
 i. TFM(1) progress.
 ii. TFM(1) task distribution, based on workload sharing and processor speed (via completion requests).

[Figure: TFM(0) and TFM(1) are implemented in Cactus; Cactus modules are used for starting remote TFM(1)s. The design targets the Grid, and tasks can be anything; here they are the computations of a bioinformatics application.]

Task Manager Hierarchy: In the traditional Master/Slave task-manager architecture, there are problems with slave startup and with communication between master and slave. Specific issues include authentication/authorization to start remote jobs, queues on remote resources, and firewalls between resources. A three-level hierarchy provides solutions to these issues:
Level 1: The Task Farm Manager (0), a.k.a. TFM(0), farms out tasks to remote resources on the Grid; it corresponds to the Master in the traditional Master/Slave architecture.
Level 2: A Task Farm Manager (1), a.k.a. TFM(1), is started on a queue for each remote resource assigned a task.
Level 3: The specific computational task; it corresponds to the Slave in the traditional Master/Slave architecture.

Task Manager Module Structure: The Task Farm Manager (TFM) uses the generic ASCA task-farm module together with the Task Farm Logic Manager (TFLM) module. For TFM(0), ASCA(0) requests information from TFLM(0) regarding the minimum number of tasks that can be run (MinTasks), how many tasks are desired (DesiredTasks), and how many processors and how much memory each task requires (TaskRequirements). When a TFM(1) requests a task, TFM(0) calls GetMoreTasks, which manages a list of task IDs for uncompleted tasks. Then, for each task, the TFM(1) calls GetInputFile, which provides the required parameters for the specific source files to be processed. The SWLM module is the logic manager specific to Smith-Waterman applications: it specifies which tasks to start and which parameters to run with for each input file. The SWTask module (not shown) communicates with the SWLM to get and process files on the task end. [Figure: the ASCA/TFLM layer is generic; SWLM and SWTask are application-specific.]

Strategy: To develop and implement a Smith-Waterman software toolkit (SWTask) that runs in the distributed environment of the IBG. This toolkit will be part of a larger IBG Bioinformatics Workbench whose modules will also allow Grid-enabled computation of the FASTA and BLAST algorithms. SWTask will include task-farming, data-acquisition, and Smith-Waterman software modules.

[Figure: an N-processor Grid computes N x M pairwise protein alignments with Smith-Waterman against M protein sequence databases.]
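The module structure above can be illustrated with a hypothetical sketch of the TFLM/SWLM callback interface. The names MinTasks, DesiredTasks, TaskRequirements, GetMoreTasks, and GetInputFile come from the poster, but every signature, return type, and resource value below is an assumption for illustration, not the actual Cactus API:

```python
# Hypothetical sketch of the Smith-Waterman logic manager (SWLM) side of the
# TFLM interface. Tasks are the upper triangle of file pairs, since
# P1 x P2 == P2 x P1; for 140 files that is 140 * 141 / 2 = 9870 comparisons.
class SWLogicManager:
    def __init__(self, source_files):
        self.files = source_files
        # Upper-triangle task list: pair (i, j) with i <= j.
        self.pending = [(i, j)
                        for i in range(len(source_files))
                        for j in range(i, len(source_files))]

    def MinTasks(self):
        # Smallest allocation that is still useful: a single comparison.
        return 1

    def DesiredTasks(self):
        # Ideally one task per pending pairwise comparison.
        return len(self.pending)

    def TaskRequirements(self):
        # Processors and memory per task (values assumed, not from the poster).
        return {"processors": 1, "memory_mb": 512}

    def GetMoreTasks(self, n):
        # Hand out IDs of up to n uncompleted tasks, TFM(0) side.
        out, self.pending = self.pending[:n], self.pending[n:]
        return out

    def GetInputFile(self, task):
        # Parameters naming the two source files a TFM(1) must process.
        i, j = task
        return self.files[i], self.files[j]
```

A TFM(1) worker loop would then alternate `GetMoreTasks` / `GetInputFile` calls with runs of the alignment kernel, reporting completions back to TFM(0) so it can rebalance load across faster and slower processors.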