Parallel Computational Biochemistry
Frank Dehne, www.dehne.net

Proteins, DNA, etc.
– DNA encodes the information necessary to produce proteins
– Proteins are the main molecular building blocks of life (for example, structural proteins, enzymes)

Proteins, DNA, etc.
– Proteins are formed from a chain of molecules called amino acids

Proteins, DNA, etc.
– The DNA sequence encodes the amino acid sequence that constitutes the protein

Proteins, DNA, etc.
– There are twenty amino acids found in proteins, denoted by A, C, D, E, F, G, H, I, ...

Multiple Sequence Alignment

Databases of Biological Sequences
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
– NCBI: 14,976,310 sequences; 15,849,921,438 nucleotides
– Swiss-Prot: 104,559 sequences; 38,460,707 residues
– PDB: 17,175 structures
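The database entry on this slide is in FASTA layout: a header line starting with ">" followed by lines of residues. As a minimal sketch of how such records might be read, the Python below joins the residue lines under each header. The function name `parse_fasta` is hypothetical and not part of any tool named in the slides.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {header: sequence}.

    Hypothetical helper for illustration; assumes headers are unique.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()  # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # residue line of the current record
    return {h: "".join(parts) for h, parts in records.items()}


# First 20 residues of the entry shown on the slide, split over two lines.
example = (
    ">BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.\n"
    "MYSFPNSFRF\n"
    "GWSQAGFQSE\n"
)
sequences = parse_fasta(example)
```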

Sequence comparison
– Compare one sequence (target) to many sequences (database search)
– Compare more than two sequences simultaneously

Applications
– Phylogenetic analysis
– Identification of conserved motifs and domains
– Structure prediction


Phylogenetic Analysis

Structure Prediction
Genomic sequences
> RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Protein sequences
Protein structures

Our Contributions
– Parallel min vertex cover for improved sequence alignments (to appear in Journal of Computer and System Sciences)
– Parallel Clustal W (ICCSA 2003)
– In progress: “Clustal XP” portal at

Clustal W

Progressive Alignment
[Figure: pairwise distance matrix and guide tree for S. cerevisiae [1], C. elegans [2], Drosophila [3], Human [4], Mouse [5]]
1. Do pairwise alignment of all sequences and calculate the distance matrix
2. Create a guide tree based on this pairwise distance matrix
3. Align progressively following the guide tree
– start by aligning the most closely related pairs of sequences
– at each step, align two sequences or one sequence to an existing subalignment
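Steps 1 and 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Clustal W's actual procedure. The distance is a stand-in (one minus the `difflib` similarity ratio, not an alignment score). The tree is built by simple average-linkage clustering rather than the tree method Clustal W uses. The names `distance`, `distance_matrix`, and `guide_tree` are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance(a, b):
    """Stand-in pairwise distance: 1 - similarity ratio (not an alignment score)."""
    return 1.0 - SequenceMatcher(None, a, b, autojunk=False).ratio()


def distance_matrix(seqs):
    """Step 1: distance for every unordered pair of sequence names."""
    return {
        frozenset((x, y)): distance(seqs[x], seqs[y])
        for x, y in combinations(list(seqs), 2)
    }


def guide_tree(seqs):
    """Step 2: merge the two closest clusters until one tree remains.

    Returns nested tuples; the nesting order is the order in which a
    progressive aligner would combine sequences and subalignments.
    """
    d = distance_matrix(seqs)
    clusters = [(name, [name]) for name in seqs]  # (subtree, leaf names)

    def average(c1, c2):
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(d[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0][0]


# Toy input: two near-identical sequences and one unrelated sequence.
toy = {"human": "MYSFPNSFRFGWSQ", "mouse": "MYSFPNSFRFGWSA", "yeast": "QQQQLLLLKKKKTT"}
tree = guide_tree(toy)
```

With this toy input the two near-identical sequences are joined first, so they would be aligned before the third sequence is added.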

Parallel Clustal
– Parallel pairwise (PW) alignment matrix
– Parallel guide tree calculation
– Parallel progressive alignment
[Figure: pairwise distance matrix and guide tree for S. cerevisiae [1], C. elegans [2], Drosophila [3], Human [4], Mouse [5]]
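The slides do not say how the pairwise alignments are divided among processors. One simple possibility, shown here purely as an assumption, is to deal the n(n-1)/2 sequence pairs out round-robin. The function name `partition_pairs` is hypothetical.

```python
from itertools import combinations


def partition_pairs(n, p):
    """Assign each of the n*(n-1)/2 sequence pairs to one of p workers.

    Round-robin dealing is an assumed scheme for illustration; it keeps
    the number of pairs per worker within one of each other.
    """
    buckets = [[] for _ in range(p)]
    for idx, pair in enumerate(combinations(range(n), 2)):
        buckets[idx % p].append(pair)
    return buckets


# Five sequences (as in the figure) shared among three workers.
work = partition_pairs(5, 3)
```

Each worker would then align only the pairs in its own bucket and report the resulting distances for assembly into the full matrix.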

Relative Speedup

Clustal XP vs. SGI
SGI data taken from “Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL” by Dmitri Mikhailov, Haruna Cofer, and Roberto Gomperts

Parallel Clustal - Improvements
Optimization of input parameters
– scoring matrices, gap penalties
– requires many repetitive Clustal W calculations with various input parameters
Minimum Vertex Cover
– use minimum vertex cover to remove erroneous sequences and to identify clusters of highly similar sequences

Minimum Vertex Cover
Conflict Graph
– vertex: sequence
– edge: conflict (e.g. alignment with very poor score)
TASK: remove the smallest number of gene sequences that eliminates all conflicts
NP-complete
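The conflict graph can be sketched as follows, assuming pairwise scores are already available and that a "conflict" means a score below some threshold. The exact rule is not given in the slides. The names `conflict_graph` and `is_vertex_cover`, and the toy scores, are hypothetical.

```python
def conflict_graph(scores, threshold):
    """Return the set of conflict edges.

    scores maps frozenset({a, b}) -> pairwise alignment score; a pair is
    treated as a conflict when its score is below the threshold (assumed rule).
    """
    return {pair for pair, score in scores.items() if score < threshold}


def is_vertex_cover(edges, removed):
    """True if removing the given sequences eliminates every conflict."""
    return all(pair & removed for pair in edges)


# Toy scores: sequence "c" aligns poorly with both "a" and "b".
toy_scores = {
    frozenset({"a", "b"}): 50,
    frozenset({"a", "c"}): 5,
    frozenset({"b", "c"}): 8,
}
conflicts = conflict_graph(toy_scores, threshold=10)
```

In this toy case removing the single sequence "c" eliminates both conflicts, which is the kind of outlier the vertex cover step is meant to find.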

FPT Algorithms
Phase 1: Kernelization
– Reduce problem to size f(k)
Phase 2: Bounded Tree Search
– Exhaustive tree search; exponential in f(k)

Kernelization
Buss's Algorithm for k-vertex cover
– Let G=(V,E) and let S be the subset of vertices with degree greater than k (every such vertex must be in any k-vertex cover).
– Remove S and all incident edges: G -> G', k -> k' = k - |S|.
– IF G' has more than k x k' edges THEN no k-vertex cover exists ELSE start bounded tree search on G'
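A minimal sketch of this kernelization step, assuming a simple graph given as a list of edges (each edge listed once, no self-loops). It uses the strict threshold, degree greater than k: a vertex of degree exactly k may still be left out of a cover by taking all k of its neighbours. The function name `buss_kernel` and its return shape are hypothetical.

```python
def buss_kernel(edges, k):
    """Buss-style kernelization for k-vertex cover.

    Returns (kernel_edges, k_prime, forced) or None when no k-vertex
    cover can exist. `forced` holds the high-degree vertices that any
    k-vertex cover must contain.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # A vertex with more than k neighbours must be in the cover.
    forced = {v for v, d in degree.items() if d > k}
    k_prime = k - len(forced)
    if k_prime < 0:
        return None

    # G': remove the forced vertices and all their incident edges.
    kernel = [(u, v) for u, v in edges if u not in forced and v not in forced]

    # Each remaining vertex covers at most k edges, so k' vertices
    # cover at most k * k' edges.
    if len(kernel) > k * k_prime:
        return None
    return kernel, k_prime, forced


# Star centred on vertex 0 plus one separate edge, with k = 2.
result = buss_kernel([(0, 1), (0, 2), (0, 3), (0, 4), (5, 6)], 2)
```

Here vertex 0 is forced into the cover, leaving a kernel with the single edge (5, 6) and a remaining budget of 1 for the tree search.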

Bounded Tree Search

Case 1: simple path of length 3
– remove selected vertices from G'
– k' -= 2

Case 2: 3-cycle
– remove selected vertices from G'
– k' -= 2

Case 3: simple path of length 2
– remove v1, v2 from G'
– k' -= 1

Case 4: simple path of length 1
– remove v, v1 from G'
– k' -= 1

Sequential Tree Search
Depth-first search
– backtrack when k' = 0 and G' is not empty ("dead end")
– stop when a solution is found (G' = {}, k' >= 0)
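The backtracking and stopping rules can be illustrated with a deliberately simpler branching rule than the four cases in the slides. The sketch picks any remaining edge and tries each endpoint in turn, since every cover must contain one of them. This is a swapped-in textbook variant, not the authors' code-s. The function name `vertex_cover_search` is hypothetical.

```python
def vertex_cover_search(edges, k):
    """Depth-first bounded search for a vertex cover of size <= k.

    Simplified 2-way branching (not the four-case rule of the slides).
    Returns the cover as a set, or None if no cover of size <= k exists.
    """
    if not edges:
        return set()      # solution found: G' is empty and k' >= 0
    if k == 0:
        return None       # dead end: edges remain but the budget is spent
    u, v = edges[0]
    for chosen in (u, v):  # any cover must contain u or v
        remaining = [e for e in edges if chosen not in e]
        sub = vertex_cover_search(remaining, k - 1)
        if sub is not None:
            return sub | {chosen}
    return None            # both branches failed: backtrack


# Path on four vertices: needs two vertices to cover its three edges.
path = [(1, 2), (2, 3), (3, 4)]
cover = vertex_cover_search(path, 2)
```

The search depth is at most k, so the tree has at most 2^k leaves under this branching rule. The case analysis in the slides reduces that bound by removing more than one vertex per step where possible.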

Parallel Tree Search
Basic Idea:
– Build the top log p levels of the search tree (T')
– every processor starts depth-first search at one leaf of T'
– randomize depth-first search by selecting a random child
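The three ideas above can be simulated sequentially in Python. The sketch assumes a simple 2-way branching rule (pick an edge, try each endpoint) instead of the four-case rule of the slides, and it visits the leaves of T' in a loop where a real implementation would hand one leaf to each MPI process. All function names are hypothetical.

```python
import random


def expand_frontier(edges, k, levels):
    """Build the top `levels` levels of the search tree; return its leaves
    as (remaining_edges, remaining_budget, vertices_chosen_so_far)."""
    frontier = [(list(edges), k, frozenset())]
    for _ in range(levels):
        next_frontier = []
        for es, budget, chosen in frontier:
            if not es or budget == 0:
                next_frontier.append((es, budget, chosen))  # cannot branch further
                continue
            for w in es[0]:  # branch on both endpoints of one edge
                rest = [e for e in es if w not in e]
                next_frontier.append((rest, budget - 1, chosen | {w}))
        frontier = next_frontier
    return frontier


def randomized_dfs(edges, k, rng):
    """Depth-first search that visits the children in random order."""
    if not edges:
        return set()
    if k == 0:
        return None
    children = list(edges[0])
    rng.shuffle(children)
    for w in children:
        sub = randomized_dfs([e for e in edges if w not in e], k - 1, rng)
        if sub is not None:
            return sub | {w}
    return None


def parallel_search(edges, k, p, seed=0):
    """Sequential simulation: each leaf of T' stands for one processor."""
    rng = random.Random(seed)
    levels = (p - 1).bit_length()  # ceil(log2 p) for p >= 1
    for es, budget, chosen in expand_frontier(edges, k, levels):
        sub = randomized_dfs(es, budget, rng)
        if sub is not None:
            return set(chosen) | sub
    return None


# 5-cycle: its minimum vertex cover has size 3.
cycle = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
found = parallel_search(cycle, 3, p=4)
```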

Analysis: Balls-in-bins
sequential depth-first search path
– total length: L, #solutions: m
– expected sequential time (rand. distr.): L/(m+1)
parallel search path
– expected parallel time (rand. distr.): p + L/(p(m+1))
– expected speedup: p / (1 + (m+1)/L)
– if m << L then expected speedup = p
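The sequential estimate can be checked on small cases. The sketch assumes the model that the slide's "rand. distr." appears to describe: the m solutions sit at distinct positions 1..L along the search path, and every placement is equally likely. Under that assumption the exact mean position of the first solution is (L+1)/(m+1), which is approximately L/(m+1) for large L. The function name `expected_first_hit` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def expected_first_hit(L, m):
    """Exact mean position of the first solution on a path of length L
    when m solutions occupy distinct, uniformly random positions."""
    placements = list(combinations(range(1, L + 1), m))
    total = sum(min(positions) for positions in placements)
    return Fraction(total, len(placements))


# Brute-force check on a small instance (enumeration is only feasible for small L).
small = expected_first_hit(10, 3)
```

The enumeration is exponential in m, so it is useful only as a sanity check of the closed form, not as a way to evaluate it at L = 1,000,000.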

Simulation Experiment
L = 1,000,000

Implementation
test platform:
– 32 node HPCVL Beowulf cluster
– each node: dual 1.4 GHz Intel Xeon, 512 MB RAM, 60 GB disk
– gcc and LAM/MPI on LINUX Redhat 7.2
code-s: Sequential k-vertex cover
code-p: Parallel k-vertex cover

Test Data
Protein sequences
– Same protein from several hundred species
– Each protein sequence a few hundred amino acid residues in length
– Obtained from the National Center for Biotechnology Information (

Test Data
Somatostatin
– neuropeptide involved in the regulation of many functions in different organ systems
– Clustal Threshold = 10, |V| = 559, |E| = 33652, k = 273, k' = 255

Test Data
WW
– small protein domain that binds proline-rich sequences in other proteins and is involved in cellular signaling
– Clustal Threshold = 10, |V| = 425, |E| = 40182, k = 322, k' = 318

Test Data
Kinase
– large family of enzymes involved in cellular regulation
– Clustal Threshold = 16, |V| = 647, |E| = , k = 497, k' = 397

Test Data
SH2 (src-homology domain 2)
– involved in targeting proteins to specific sites in cells by binding to phosphotyrosine
– Clustal Threshold = 10, |V| = 730, |E| = 95463, k = 461, k' = 397

Test Data
Thrombin
– protease involved in the blood coagulation cascade; promotes blood clotting by converting fibrinogen to fibrin
– Clustal Threshold = 15, |V| = 646, |E| = 62731, k = 413, k' = 413

Test Data
PHD (pleckstrin homology domain)
– involved in cellular signaling
– Clustal Threshold = 10, |V| = 670, |E| = , k = 603, k' = 603

Test Data
Random Graph
– |V| = 220, |E| = 2155, k = 122, k' = 122
Grid Graph
– |V| = 289, |E| = 544, k = 145, k' = 145

Test Data
– |VC| ~ |V| / 2
– k' = k

Sequential Times
Kinase, SH2, Thrombin: n/a

Code-p on Virtual Proc.

Parallel Times

Speedup: Somatostatin

Speedup: WW

Speedup: Rand. Graph

Speedup: Grid Graph

Clustal XP (in progress)
Components: Clustal W, Parallel Clustal, …, Parallel FPT MVC, Clustal XP Web Portal
X: Extended
P: Parallel

Clustal XP
