Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal and MULTICLUSTAL
Arunesh Mishra, CMSC 838 Presentation
Authors: Dmitri Mikhailov, Haruna Cofer, Roberto Gomperts (SGI)

CMSC 838T – Presentation

Problem Statement
- Multiple Sequence Alignment (MSA)
  - Basis for phylogenetic analysis: infer homology relationships
  - Building protein families: a conserved region may imply common function
  - Aids in function/structure prediction of new proteins
  - Global MSA: Clustal W
  - Is it computationally expensive? Yes, for 100 sequences.
- Goal: parallelize Clustal W
  - Clustal W takes hours for 100 or more sequences
  - The algorithm can be parallelized
- Contribution of the paper
  - Parallel Clustal W
    - Parallel version of basic Clustal W
  - HT Clustal
    - Parallelizes heterogeneous Multiple Sequence Alignment problems
  - MULTICLUSTAL
    - Parallel version of an optimization on Clustal W

Talk Overview
- Overview of talk
  - Motivation
  - Background
    - Sequential Clustal W
  - Parallel Clustal W
  - HT Clustal
    - Problem Statement
    - Optimizations
  - MULTICLUSTAL
    - Sequential Algorithm
    - Optimizations
  - Observations

Introduction
- Sequential Clustal W Algorithm
  - Given N sequences of length M each
  - Pairwise Alignment (PW)
    - Creates an N x N distance matrix based on pairwise alignment scores
    - Each entry represents an evolutionary distance
  - Guide Tree (GT) construction (phylogenetic tree)
    - Uses the neighbor-joining algorithm
  - Progressive Multiple Alignment (PA)
    - Use the guide tree to align closely related pairs of sequences
    - Progressively align the next sequence to the existing alignment
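The first stage can be pictured with a minimal sketch. It assumes a toy distance (the fraction of mismatched positions between equal-length sequences) in place of the alignment-based score that Clustal W computes; `toy_distance` and `distance_matrix` are illustrative names, not taken from the presentation.

```python
from itertools import combinations


def toy_distance(a: str, b: str) -> float:
    """Fraction of mismatching positions (a stand-in, not Clustal W's score)."""
    if len(a) != len(b):
        raise ValueError("toy distance needs equal-length sequences")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def distance_matrix(seqs):
    """Fill a symmetric N x N matrix from the N(N-1)/2 unordered pairs."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d[i][j] = d[j][i] = toy_distance(seqs[i], seqs[j])
    return d


seqs = ["ACGT", "ACGA", "TCGA"]
matrix = distance_matrix(seqs)
```

Only the upper triangle is computed; the lower triangle is mirrored, which is why the pair count is N(N-1)/2 rather than N squared.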

Parallel Clustal W
- Problem Statement
  - Parallelize the sequential Clustal W
- Execution time breakdown (figure)
  - Legend: PW = pairwise alignment, GT = guide tree, PA = progressive alignment

Parallel Clustal W
- Pairwise Alignment Stage
  - N(N-1)/2 pairwise alignments
  - Send them randomly to different processors
    - Random, because the jobs differ in load
    - Random assignment also produces a statistically uniform distribution (over a large set of jobs)
  - 1.8X speedup achieved on a 1000 sequence MSA with 8 CPUs
- Guide Tree Stage
  - Parallelize "find closest neighbors from distance matrix"
  - This step is used in the neighbor-joining algorithm
    - Find the minimum element of each row concurrently
    - Use these row minima to find the minimum element of the matrix
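Both ideas on this slide can be sketched briefly. The code below is an illustration under assumptions, not the authors' implementation: `assign_pairs_randomly` and `closest_pair` are hypothetical names, worker "processors" are modeled as plain lists, and the thread pool only shows the structure of the row-wise reduction (CPython threads would not speed up this CPU-bound step).

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def assign_pairs_randomly(n_seqs, n_workers, seed=0):
    """Scatter the N(N-1)/2 pairwise jobs over workers uniformly at random."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(n_workers)]
    for pair in combinations(range(n_seqs), 2):
        buckets[rng.randrange(n_workers)].append(pair)
    return buckets


def row_minimum(indexed_row):
    """Smallest off-diagonal entry to the right of the diagonal in one row."""
    i, row = indexed_row
    candidates = [(row[j], i, j) for j in range(i + 1, len(row))]
    return min(candidates) if candidates else None


def closest_pair(dist):
    """Two-step reduction: per-row minima (independent), then a global minimum."""
    with ThreadPoolExecutor() as pool:
        minima = [m for m in pool.map(row_minimum, enumerate(dist)) if m is not None]
    return min(minima)


buckets = assign_pairs_randomly(n_seqs=10, n_workers=4)
dist = [[0, 5, 2], [5, 0, 4], [2, 4, 0]]
value, i, j = closest_pair(dist)
```

Each row minimum depends only on its own row, so the first step has no shared state; only the final `min` over the row results is sequential.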

Parallel Clustal W
- Progressive Alignment Stage
  - Values of a function score(I,J) are precomputed in parallel
    - score(I,J) is the alignment score of sequences I and J
  - Not much parallelization is possible in the third stage
- Overall Speedup
  - Speedup of 10x for a 600-sequence MSA using 16 CPUs
  - Time reduced from 1 hr 7 minutes to 6.5 minutes
  - Relative scaling is better for larger inputs
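The quoted speedup can be checked directly from the two timings on the slide. The helper names below are illustrative, and parallel efficiency (speedup divided by CPU count) is a standard derived quantity rather than a figure reported in the presentation.

```python
def speedup(serial_minutes: float, parallel_minutes: float) -> float:
    """Ratio of serial to parallel wall-clock time."""
    return serial_minutes / parallel_minutes


def efficiency(observed_speedup: float, cpus: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return observed_speedup / cpus


s = speedup(67.0, 6.5)   # 1 hr 7 min versus 6.5 min
e = efficiency(s, 16)    # 16 CPUs, as stated on the slide
```

The ratio comes out slightly above 10, consistent with the "10x" on the slide, and corresponds to roughly two thirds of ideal scaling on 16 CPUs.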

HT Clustal
- Problem Statement
  - Calculate large numbers of MSAs of various sizes (independent problems)
  - Such problems arise in high-throughput (HT) research environments
  - Representative problem (from the paper):
    - Perform independent MSAs over 100 sets of sequences
    - Each set has between 20 and 100 sequences, with an average of 60
    - Average sequence length = 390

HT Clustal - Optimizations
- Basic Idea
  - Each MSA operation (on one set of sequences) is independent of the others
  - Run Clustal W as a uniprocessor job on one MSA problem
  - Launch multiple Clustal W jobs on different processors
- Job Scheduling
  - Jobs differ in duration, depending on the sequence set
  - Two scheduling options explored:
    - Schedule dynamically: when a processor is free, give it an MSA job chosen at random
    - Schedule dynamically, with the sequence sets presorted (by file size)
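The two scheduling options can be compared with a small simulation. This is a sketch under assumptions: job durations are taken as known numbers, "presorted" is assumed to mean largest job first (the slide does not state the sort direction), and `makespan` is a hypothetical helper that hands the next job in the list to whichever CPU becomes free first.

```python
import heapq
import random


def makespan(durations, n_cpus):
    """Finish time when each job, in list order, goes to the earliest-free CPU."""
    free_at = [0.0] * n_cpus          # time at which each CPU becomes idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)


# Small deterministic case: one long job among short ones, two CPUs.
long_job_last = makespan([1, 1, 1, 1, 4], n_cpus=2)
long_job_first = makespan([4, 1, 1, 1, 1], n_cpus=2)

# Larger synthetic case shaped like the representative problem (100 jobs, 32 CPUs).
rng = random.Random(1)
jobs = [rng.uniform(1, 10) for _ in range(100)]
random_order = makespan(jobs, 32)
presorted = makespan(sorted(jobs, reverse=True), 32)
```

In the small case the long job started last leaves one CPU idle at the end, while starting it first lets the short jobs fill in around it. The synthetic durations are invented for illustration and are not the paper's data.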

HT Clustal – Performance Numbers
- Speedups
  - Almost linear speedups
  - 31x on 32 CPUs for the representative MSA problem
  - 116X on 128 CPUs for a larger test case
    - Solution time reduced from 18.5 hours to 9.5 minutes
  - Speedup shown for the example MSA set (figure)

HT Clustal – Effect of Presorting
- Effect of presorting
  - Figure shows the effect of presorting for the example MSA set (32 CPUs, 100 sets, ~3 jobs per CPU)
  - If the average number of jobs per CPU is below 5, presorting helps
  - For larger numbers of jobs per CPU, statistical averaging reduces load imbalance

MULTICLUSTAL
- MULTICLUSTAL Algorithm
  - A Perl script that generates a high quality MSA with little user intervention
  - Searches for the best combination of Clustal W input parameters
    - To reduce gaps and increase clustering
  - Parameters to vary:
    - Scoring matrices: pairwise and multiple
    - Gap open and extension penalties (pairwise and multiple)
  - Sequential Algorithm:
      1. Until all parameters are sufficiently varied {
      2.     alignment = Run Clustal W ()
      3.     Calculate quality of alignment
      4.     Change parameters
         }
  - Quality of alignment
    - A numerical quantity based on
      - Identical amino acid matches
      - Conservative amino acid substitutions
      - Gap events and amino acid islands, i.e. -X-, -XX-, -XXX-, -XXXX-
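The sequential loop above can be sketched as a plain parameter search. Everything named here is an assumption for illustration: the exhaustive grid stands in for "sufficiently varied", `fake_align` is a canned stand-in for invoking Clustal W, the parameter names are not real Clustal W options, and `toy_quality` scores only identical columns and gaps rather than the full quality measure the slide describes.

```python
from itertools import product


def search_parameters(align, quality, grid):
    """Try every combination in grid; return (best_score, best_params, best_alignment)."""
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        alignment = align(params)            # step 2: run the aligner
        score = quality(alignment)           # step 3: score the result
        if best is None or score > best[0]:  # step 4 is the next loop iteration
            best = (score, params, alignment)
    return best


def toy_quality(alignment):
    """+1 per fully identical non-gap column, -1 per gap character (stand-in only)."""
    identical = sum(
        1 for col in zip(*alignment) if len(set(col)) == 1 and col[0] != "-"
    )
    gaps = sum(row.count("-") for row in alignment)
    return identical - gaps


def fake_align(params):
    """Hypothetical aligner: a high gap-open penalty yields a gap-free alignment."""
    if params["gap_open"] >= 10:
        return ["ACGT", "ACGA"]
    return ["AC-GT", "ACG-A"]


grid = {"matrix": ["blosum", "pam"], "gap_open": [5, 10]}
best_score, best_params, best_alignment = search_parameters(fake_align, toy_quality, grid)
```

Passing the aligner and the quality function in as callables keeps the search loop independent of how Clustal W is actually launched.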

MULTICLUSTAL Optimizations
- Optimization on MULTICLUSTAL
  - Run Clustal W once
  - Reuse the tree generated in the PW/GT stages
    - Guide tree calculated only once for multiple runs
    - Results in speedups from 1.5X to 3X
  - Use Parallel Clustal W for each run of Clustal W
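A rough cost model shows why reusing the guide tree pays off. It assumes every run spends a fixed time `t_tree` on the PW/GT stages and a fixed time `t_pa` on progressive alignment; both symbols and the example timings are hypothetical, not measurements from the presentation.

```python
def time_without_reuse(t_tree: float, t_pa: float, runs: int) -> float:
    """Every run recomputes pairwise alignment and the guide tree."""
    return runs * (t_tree + t_pa)


def time_with_reuse(t_tree: float, t_pa: float, runs: int) -> float:
    """The tree is built once; only progressive alignment repeats."""
    return t_tree + runs * t_pa


def reuse_speedup(t_tree: float, t_pa: float, runs: int) -> float:
    return time_without_reuse(t_tree, t_pa, runs) / time_with_reuse(t_tree, t_pa, runs)


# Hypothetical timings: tree stages cost twice as much as progressive alignment.
example = reuse_speedup(t_tree=2.0, t_pa=1.0, runs=10)
```

Under this model the gain approaches (t_tree + t_pa) / t_pa as the number of runs grows, so the benefit depends on how much of each run the PW/GT stages account for.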

Observations
- Parallelizability
  - First (pairwise alignment) and second (guide tree) stages are parallelizable
  - Third stage is mostly sequential, so speedup is limited
- Are 100 sequence MSAs possible?
  - PIR at NBRF (Georgetown University) takes a maximum of 20 sequences for MSA
  - Speedup improves user response time; for 20 sequences a PC would be sufficient
- Probable applications:
  - Research environments?
  - PIR servers?
- Speedup only on shared memory SGI 3000 workstation?
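The limit imposed by the mostly sequential third stage is the usual Amdahl's law bound. The presentation does not cite Amdahl's law or give a parallel fraction, so the value of `parallel_fraction` below is purely hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, cpus: int) -> float:
    """Amdahl's law: speedup when only part of the runtime scales with CPUs."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the CPU count grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)


# Hypothetical: 90% of the runtime (PW and GT stages) parallelizes perfectly.
on_16_cpus = amdahl_speedup(0.9, 16)
ceiling = amdahl_limit(0.9)
```

Even with a generous parallel fraction, the sequential remainder caps the achievable speedup no matter how many CPUs are added.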