

How Parallelism Is Used In Bioinformatics Presented by: Laura L. Neureuter April 9, 2001 Using: Three Complementary Approaches to Parallelization of Local BLAST Service on Workstation Clusters - Braun, Pedretti, Casavant, Scheetz, Birkett, Roberts

Overview 1. A Unique Approach to Information Gathering 2. Types of Architecture Used 3. Software Packages Used 4. How Parallelism Is Used in BLAST a. Background b. Granularity c. Sequence to Sequence Comparison d. Parallelization of Single Query across Partitioned Database e. Partitioned Set of Queries across Set of Servers

A Unique Approach to Information Gathering ASK SOMEONE !!

Alejandro Shaffer Says… Parallel Computing is Used to Analyze: Protein Sequence Data DNA Sequence Data Protein Structure Data Genetic Inheritance Data - Among Others

Alejandro Shaffer Says… Parallel Bioinformatics Computations are Run on the Following Architectures: Small Shared Memory Multiprocessor Loosely coupled network of processors

Alejandro Shaffer Says… The assembly of the Human Genome was done on a loosely coupled network of computers.

Alejandro Shaffer Says… Two Software Packages Used by Bioinformaticists That Run On Parallel Computers: BLAST FASTLINK

BLAST Analyzes Protein or DNA Sequences: Takes input sequences and searches large databases for similar sequences.

FASTLINK Used to hunt the approximate chromosomal location of disease-causing genes. - leaving this topic open for someone else to research.

BLAST Basic Local Alignment Search Tool The most common Sequence Comparison tool.

BLAST Three Parallel Components to BLAST 1. Sequence to Sequence Comparison Level 2. Parallelization of a single query across a distributed database 3. A set of queries is partitioned across a set of servers with either a replicated or partitioned database.

BLAST At the time the paper was published – December 15, 1999 – the only completed implementation was the third step: Parallelizing Batch Requests

First – Some Background “The basic nature of the entire process of gene discovery is highly parallel, heterogenous, and distributed.”

Background When the paper was published, the mode used by 90% of researchers was to submit single queries for comparison of sequence data ( chars) against one or more databases (GenBank).

Background The paper predicted that once the human genome was finished, the frequency and intensity of inquiries against the data would increase exponentially. We’ve all seen the graph (several times) that proves this is true.

Background Problems: 1) The cluster of servers continues to diminish in its ability to serve the increasing number of requests. 2) Network traffic is becoming intolerable. 3) The database is growing at an increasing rate. 4) Single queries are time consuming.

Refresher … Granularity Defined as the size of the computation between communication or synchronization points. Coarse – Each process contains a large number of sequential instructions and takes a substantial time to execute. Fine – Each process consists of a few, or even one instruction. Medium – Middle ground.

Refresher Granularity - Granularity is related to the number of processors being used. Metric: computation/communication ratio = t_comp / t_comm. It is important to maximize the ratio while maintaining parallelism.
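
As a rough illustration of the computation/communication ratio, the Python sketch below evaluates it for two workloads. The function name and the timing values are hypothetical, chosen only to contrast fine and coarse granularity; they are not measurements from the paper.

```python
def comp_comm_ratio(t_comp: float, t_comm: float) -> float:
    """Computation/communication ratio t_comp / t_comm (both in seconds)."""
    if t_comm <= 0:
        raise ValueError("t_comm must be positive")
    return t_comp / t_comm


# Hypothetical timings, not measurements from the paper.
fine_grained = comp_comm_ratio(t_comp=0.002, t_comm=0.010)  # little work per message
coarse_grained = comp_comm_ratio(t_comp=120.0, t_comm=0.5)  # a whole query per message

print(f"fine: {fine_grained:.2f}, coarse: {coarse_grained:.2f}")
```

A ratio well below 1 means the processes spend most of their time communicating, which is why fine-grained schemes depend so heavily on the interconnect.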

Three Levels of Parallelism Exploitable in BLAST Fine grained: subject = 1 sequence; target = 1 sequence; parallelism = multiple alignments on single sequence pairs. Medium grained: subject = 1 sequence; target = M sequences (in database); parallelism = partition database, multiple targets examined at once. Coarse grained: subject = N sequences (batch request); target = M sequences (in database); parallelism = replicate database, partition input sets.

BLAST BLAST is a heuristic search algorithm Heuristic: Process of elimination and compromise by using the “what if” theory. An educated guess that reduces or limits the search for solutions. A method of solving problems by intelligent trial and error.

BLAST Five variations of BLAST: blastn, blastx, tblastx, blastp, tblastn

BLAST blastn Compares a nucleotide sequence against a nucleotide database (Relatively quick)

BLAST blastx Compares a nucleotide sequence against a protein database. The nucleotide “subject” needs to be translated into a peptide sequence; since there are 6 possible translations (reading frames), the basic BLAST algorithm must be applied 6 times.

BLAST tblastx Compares a nucleotide sequence to a nucleotide database, but each is first translated (in all 6 reading frames) into a peptide sequence before blasting. This is the most computationally intensive BLAST algorithm: it must be invoked 36 times for each sequence-to-sequence comparison.

BLAST blastp Compares a peptide sequence to a peptide database (Relatively quick)

BLAST tblastn Compares a peptide sequence against a nucleotide database Requires 6 calls to BLAST
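
The counts of 6 and 36 follow from six-frame translation: three reading frames on each strand, for the query, the database, or both. The Python sketch below is an illustration rather than BLAST's own code. It translates a nucleotide string in all six frames using the standard genetic code and derives the number of core searches per program. The names `six_frame_translations` and `core_searches` are made up for this example.

```python
# Standard genetic code, bases ordered T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> list:
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(s[offset:]) for s in strands for offset in range(3)]


# Number of reading frames searched on the query side and the database side.
QUERY_FRAMES = {"blastn": 1, "blastx": 6, "tblastx": 6, "blastp": 1, "tblastn": 1}
DB_FRAMES = {"blastn": 1, "blastx": 1, "tblastx": 6, "blastp": 1, "tblastn": 6}


def core_searches(program: str) -> int:
    return QUERY_FRAMES[program] * DB_FRAMES[program]


for program in QUERY_FRAMES:
    print(program, core_searches(program))
```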

BLAST Benefits of Parallelizing Local BLAST Reduces processing time in relation to number of compute nodes utilized. Reduces costs by utilizing commodity workstations and PCs. A locally-scheduled parallel algorithm allows prioritization and control over individual searches.

Types Of Parallelism I. Pairwise Multiple Alignment Fancy term for earlier description of variations on BLAST algorithm. Since the comparisons are mutually independent, the parallelization of the comparisons is potentially very efficient. Of greatest importance would be a high-speed, low-latency interconnection network to allow rapid selection and scoring of the best possible alignment. Effective implementation would greatly benefit from specialized hardware.

Types of Parallelism II. Database Partitioning Distributing chunks of the database across a collection of compute nodes. Master node coordinates the scheduling of jobs and collates the results from each submission.
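
A minimal sketch of the database-partitioning idea, assuming a toy in-memory database and a thread pool standing in for compute nodes. The scoring function (shared 3-letter words) is only a stand-in for a real BLAST search, and every name here is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq: str, k: int = 3) -> set:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def search_chunk(query: str, chunk: list) -> list:
    """Worker: score every sequence in one database chunk against the query."""
    query_words = kmers(query)
    return [(name, len(query_words & kmers(seq))) for name, seq in chunk]


def partition(database: list, n_chunks: int) -> list:
    """Split the database into n_chunks roughly equal slices."""
    return [database[i::n_chunks] for i in range(n_chunks)]


def master_search(query: str, database: list, n_nodes: int = 3) -> list:
    """Master: hand one chunk to each worker, then collate the partial results."""
    chunks = partition(database, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partial = list(pool.map(lambda chunk: search_chunk(query, chunk), chunks))
    hits = [hit for part in partial for hit in part]
    # Best score first; ties broken by name so the output is deterministic.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Toy database of (name, sequence) pairs, invented for this example.
DATABASE = [("seqA", "MATPLW"), ("seqB", "MATWLQ"), ("seqC", "GGGGGG"), ("seqD", "ATWLKK")]
RESULTS = master_search("MATWL", DATABASE)
print(RESULTS)
```

In a real deployment each chunk would live on a different node's disk, so the master would send only the query and receive only the hit list.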

Types of Parallelism III. Batch Mode Scheduling sets of queries, while keeping full copies of the database stored on each compute node. This type of parallelism is currently in place and being used.

Batch Mode The foundation of the local batch BLAST system is the Portable Batch System (PBS) developed for NASA. PBS comprises three parts: the Job Server, the Scheduler, and the Compute Nodes.

Batch Mode The Job Server Responsible for managing two queues of incoming jobs – one for batch blast jobs, the other for jobs interactively submitted to local BLAST through a web interface.

Batch Mode The Scheduler Applies job scheduling algorithm to allocate compute nodes to jobs in the two incoming job queues. Some nodes have several CPUs and can handle more than one simultaneous blast job. The scheduler assigns multiple jobs to such nodes.

Batch Mode Compute Nodes Each node has a monitor that communicates with the job server. Each node has its own set of sequence databases.

Job Types 1)Batch jobs – can be executed at any time and restarted if necessary. 2) Interactive jobs – time critical and should have priority over batch jobs.

Job Types When the paper was published, the implementation worked as follows: 75% of compute nodes execute batch jobs; 25% are always available for interactive web jobs; if there are no batch jobs, all 100% are available for web jobs. Neither type of job is starved of resources with this approach.
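
The 75/25 policy can be sketched as a small allocation function. This is an illustration under stated assumptions, not the PBS scheduler itself: the function names are hypothetical, the reserved share is rounded up to a whole node, and each node is assumed to run one job at a time.

```python
import math
from collections import deque


def node_limits(total_nodes: int, batch_waiting: int, reserved_fraction: float = 0.25):
    """Return (max nodes for batch jobs, max nodes for interactive jobs)."""
    reserved = math.ceil(total_nodes * reserved_fraction)  # rounding up is an assumption
    if batch_waiting == 0:
        return 0, total_nodes  # no batch work: every node may serve web jobs
    return total_nodes - reserved, reserved


def dispatch(batch_queue: deque, web_queue: deque, total_nodes: int):
    """Pop as many jobs from each queue as the policy allows on an idle cluster."""
    batch_limit, web_limit = node_limits(total_nodes, len(batch_queue))
    web_jobs = [web_queue.popleft() for _ in range(min(web_limit, len(web_queue)))]
    batch_jobs = [batch_queue.popleft() for _ in range(min(batch_limit, len(batch_queue)))]
    return batch_jobs, web_jobs


batch = deque(f"batch-{i}" for i in range(10))
web = deque(["web-0", "web-1", "web-2"])
started_batch, started_web = dispatch(batch, web, total_nodes=8)
print(started_batch, started_web)
```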

Issues with Batch Mode All replicated databases must be updated periodically to reflect the most recent contents of the globally shared database. All node copies must be consistent with one another. Otherwise, the results of a query would depend on which compute node processed it.
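
One simple way to detect inconsistent replicas is to compare a digest of each node's copy against the reference copy. The sketch below assumes the database contents are available as bytes. The paper does not say how consistency was checked, so this is only one possible approach, and the names are hypothetical.

```python
import hashlib


def db_digest(db_bytes: bytes) -> str:
    """SHA-256 digest of a database image."""
    return hashlib.sha256(db_bytes).hexdigest()


def stale_nodes(reference_copy: bytes, node_copies: dict) -> list:
    """Names of nodes whose replica differs from the reference copy."""
    expected = db_digest(reference_copy)
    return sorted(
        node for node, data in node_copies.items() if db_digest(data) != expected
    )


# Invented example data: node2 still holds an older release of the database.
reference = b">seq1\nMATWL\n>seq2\nMATPL\n"
replicas = {"node1": reference, "node2": b">seq1\nMATWL\n", "node3": reference}
print(stale_nodes(reference, replicas))
```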

Considerations… A Networked File System is being considered where there would be several I/O servers in a system, each with a complete copy of database. Compute nodes would rely on these I/O servers for access to database.

Next Step… The partitioned database implementation will utilize many of the concepts developed for the coarse-grained implementation, but the scheduler would need to know which nodes had which section of the database.

Next Step… Outputs would then need to be combined into a single output file. This is non-trivial: the merge program must parse, sort, and correct data from the nodes, and E values must be corrected to reflect the larger database size.
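
A hedged sketch of the merge step. An E value grows roughly in proportion to the size of the database searched, so a hit found in one partition can be rescaled by the ratio of the full database size to that partition's size. This ignores the edge-effect corrections a real BLAST applies. The data layout and names are assumptions for illustration, and parsing of actual BLAST output is omitted.

```python
def merge_partition_hits(partition_results: list, total_db_size: int) -> list:
    """Combine per-partition hits into one list sorted by corrected E value.

    partition_results: list of (partition_size, hits), where each hit is
    (name, score, e_value) computed against that partition alone.
    Sizes are in residues (letters).
    """
    merged = []
    for partition_size, hits in partition_results:
        scale = total_db_size / partition_size  # linear E-value rescaling
        for name, score, e_value in hits:
            merged.append((name, score, e_value * scale))
    return sorted(merged, key=lambda hit: hit[2])


# Invented example: two partitions of 1000 and 3000 residues.
results = [
    (1000, [("hitA", 50, 1e-5)]),
    (3000, [("hitB", 60, 1e-6), ("hitC", 20, 0.3)]),
]
for name, score, e_value in merge_partition_hits(results, total_db_size=4000):
    print(name, score, f"{e_value:.2e}")
```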

Questions ??? One of My Own: Since this paper was published in 1999, have all three levels of parallelism described here been exploited by now? - I haven’t found the answer.