Whole Genome Alignment using Multithreaded Parallel Implementation Hyma S Murthy CMSC 838 Presentation.


Presentation transcript:

Whole Genome Alignment using Multithreaded Parallel Implementation Hyma S Murthy CMSC 838 Presentation

CMSC 838T – Presentation

Talk Overview
- Organization of the paper
  - Motivation
  - Technique: Pairwise Sequence Comparison using Dynamic Programming
  - EARTH Execution Model
  - Evaluation
  - Result Graphs
  - Conclusions
  - Related Work (MUMmer)

Motivation
- Importance of genome alignment:
  - Identify important matched and mismatched regions
  - “Matches” represent homolog pairs, conserved regions, or long repeats
  - “Mismatches” represent foreign fragments inserted by transposition, sequence reversal, or lateral transfer
  - Detect functional differences between pathogenic and non-pathogenic strains, evolutionary distance, mutations leading to disease, phenotypes, etc.
- Problems:
  - Requires large computational power, memory, and execution time
  - Existing algorithms apply dynamic programming only to subsequences
  - Applying it to whole sequences is computationally intensive (O(n^2))
  - Thus applicable only to closely related genomes

Solution
- Multithreaded parallel implementation of a sequence alignment algorithm to align whole genomes
  - Parallel implementation of the dynamic programming technique
  - Uses the collective memory of several nodes
  - Uses multithreading to overlap computation and communication
  - Applicable to closely related as well as less similar genomes
  - Reliable output in reasonable time

Pairwise Sequence Comparison using Dynamic Programming
- Basic idea:
  - Quantify the similarity between pairs of symbols of the target sequences
  - Associate a score with each possible arrangement
  - Similarity is given by the highest score
  - Example:
      sequence x   A T A A G T
      sequence y   A T G C A G T
      SCORE        –1 –1 –1 –1        TOTAL = -3

      sequence x   A T A - A G T
      sequence y   A T G C A G T
      SCORE        –                  TOTAL = 2
  - Model mutation by “gaps” (gaps indicate evolution of one sequence into another)
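The scoring idea above can be sketched in a few lines of Python. The scoring values (match = +1, mismatch = -1, gap = -2) and the function name `alignment_score` are assumptions chosen for illustration; the slide does not state its scoring scheme. With these assumed values the gapped arrangement of the example sums to 2.

```python
def alignment_score(row_x, row_y, match=1, mismatch=-1, gap=-2):
    """Sum the per-column scores of two aligned rows ('-' marks a gap).

    The default scores are illustrative assumptions, not values from the slides.
    """
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap        # one symbol aligned against a gap
        elif a == b:
            total += match      # identical symbols
        else:
            total += mismatch   # substitution
    return total


# Gapped arrangement from the example: columns score 1, 1, -1, -2, 1, 1, 1.
print(alignment_score("ATA-AGT", "ATGCAGT"))  # 2
```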

Dynamic Programming
- Smith and Waterman approach:
  - Aligns subsequences of the given sequences
  - Involves (a) calculation of scores indicating similarity and (b) identification of the alignment(s) corresponding to the score
  - Builds the solution using previous solutions for smaller subsequences
  - Constructs a two-dimensional array, the “Similarity Matrix”, to store scores corresponding to partial results
  - The matrix represents all possible alignments of the input sequences
  - Recurrence equation:
      SM[i, j] = max { SM[i, j-1] + gp,  SM[i-1, j-1] + ss,  SM[i-1, j] + gp,  0 }

Dynamic Programming (contd.)
- Each element of the matrix is the maximum of the following four values: the left element + gap (gp), the upper-left element + the score of replacing the vertical symbol with the horizontal symbol (ss), the upper element + gap (gp), and 0.
- Consider the following example:
    T G A T G G A G G T A T A G G G
    2 = max{0 + (-2), 1 + (1), 0 + (-2), 0}
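A minimal sketch of filling the similarity matrix with this recurrence follows. The function name `similarity_matrix` and the scores (match = +1, mismatch = -1, gap = -2) are assumptions; the gap value of -2 and match value of +1 are only inferred from the worked entry above. The two input strings are the symbol runs that appear in the example. Under these assumptions the cell for the second symbol of each sequence evaluates to max{0 + (-2), 1 + 1, 0 + (-2), 0} = 2.

```python
def similarity_matrix(x, y, match=1, mismatch=-1, gap=-2):
    """Fill SM for sequences x (rows) and y (columns).

    Row 0 and column 0 stay at 0, so a local alignment may start anywhere.
    Scores are illustrative assumptions.
    """
    sm = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            ss = match if x[i - 1] == y[j - 1] else mismatch
            sm[i][j] = max(
                sm[i][j - 1] + gap,     # left element + gap
                sm[i - 1][j - 1] + ss,  # upper-left element + substitution score
                sm[i - 1][j] + gap,     # upper element + gap
                0,                      # never drop below zero
            )
    return sm


sm = similarity_matrix("TGATAGG", "TGATGGAGG")
for row in sm:
    print(" ".join(f"{v:2d}" for v in row))
```

The double loop visits every cell once, which is the O(n^2) cost that motivates the parallel implementation.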

Identifying alignments
- Alignments with a score above a given threshold are reported
- Start at the end of the alignment and move backwards to the beginning
[Figure: traceback through the similarity matrix for the sequences T G A T G G A G G and T G A T A G G]
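The backward walk can be sketched as follows. This is a simplified illustration under assumed scores (match = +1, mismatch = -1, gap = -2): it reports only the first highest-scoring cell rather than every alignment above a threshold, and the name `local_align` is hypothetical.

```python
def local_align(x, y, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_x, aligned_y) for one best local alignment."""
    n, m = len(x), len(y)
    sm = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ss = match if x[i - 1] == y[j - 1] else mismatch
            sm[i][j] = max(sm[i][j - 1] + gap, sm[i - 1][j - 1] + ss,
                           sm[i - 1][j] + gap, 0)
            if sm[i][j] > best:          # remember where the best alignment ends
                best, bi, bj = sm[i][j], i, j

    # Walk backwards from the end of the alignment until a zero cell is reached.
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and sm[i][j] > 0:
        ss = match if x[i - 1] == y[j - 1] else mismatch
        if sm[i][j] == sm[i - 1][j - 1] + ss:    # came from the upper-left
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif sm[i][j] == sm[i - 1][j] + gap:     # came from above: gap in y
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:                                    # came from the left: gap in x
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(local_align("TGATAGG", "TGATGGAGG"))
```

To report every alignment above a threshold, the same walk would be started from each cell whose score exceeds it.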

EARTH Execution Model
- A program is viewed as a collection of threads
  - Execution order is determined by data and control dependencies
- Threads are further divided into fibers
  - Fibers are non-preemptive
  - All data is ready before a fiber executes
- Each node in EARTH has
  - an execution unit
  - a synchronization unit
  - queues linking the two (RQ and EQ)
  - local memory
  - an interface to the interconnection network

EARTH Architecture
[Figure: EARTH node architecture. Labels: PE, EU, SU, RQ, EQ, local memory, memory bus, interconnection network]
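The fiber model described above (a fiber starts only once all of its data is ready, then runs without preemption) can be illustrated with a toy scheduler. This is a loose sketch and not the EARTH runtime or its API: the class `Fiber`, the function `run`, and the use of two deques to stand in for the RQ and EQ are all assumptions made for illustration.

```python
from collections import deque


class Fiber:
    """A unit of work that may start only after `sync_count` signals arrive."""

    def __init__(self, name, sync_count, action):
        self.name = name
        self.pending = sync_count   # signals still awaited
        self.action = action        # callable returning names of fibers to signal


def run(fibers, initial_signals):
    """Toy scheduler: a synchronization step feeds a ready queue, an execution step drains it."""
    by_name = {f.name: f for f in fibers}
    events = deque(initial_signals)  # stand-in for the EQ
    ready = deque()                  # stand-in for the RQ
    trace = []
    while events or ready:
        # Synchronization step: count down signals, enable fibers that reach zero.
        while events:
            target = by_name[events.popleft()]
            target.pending -= 1
            if target.pending == 0:
                ready.append(target)
        # Execution step: run one ready fiber to completion (no preemption).
        if ready:
            fiber = ready.popleft()
            trace.append(fiber.name)
            events.extend(fiber.action())
    return trace


fibers = [
    Fiber("load_x", 1, lambda: ["compute"]),
    Fiber("load_y", 1, lambda: ["compute"]),
    Fiber("compute", 2, lambda: []),   # needs both inputs before it can start
]
print(run(fibers, ["load_x", "load_y"]))
```

The `compute` fiber is never scheduled until both loaders have signalled it, which mirrors the requirement that all data be ready before execution.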

Multithreaded parallel implementation
- Divide the scoring matrix as follows:
  - into horizontal strips (each element of input sequence X)
  - strips into rectangular blocks
- Blocks are calculated by two fibers within a thread
  - Only one fiber is active at any given time
- Each thread is assigned to one horizontal strip
  - The computation is done by even/odd fibers within the thread
- Initialization delay of reading sequences from the server is minimized
  - Each thread needs only the piece of the input sequence it grabs, not the whole of sequence X
  - After computing a block, a fiber sends a piece of sequence Y, among other information, to the fiber beneath it
- The computation of the anti-diagonal elements of the matrix is as shown

Computation of similarity matrix on EARTH
[Figure: threads A and B, each with even (E) and odd (O) fibers; labels P1 to P4; legend: inactive fiber, active fiber, ack, sync, data]
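The block decomposition can be illustrated with a sequential simulation of the anti-diagonal (wavefront) order. This is a sketch under assumptions, not the ATGC code: the strip height, block width, scores, and function names are hypothetical, and the blocks of one wave are processed in a plain loop where the real system would run them on different nodes. A block depends only on its left, upper, and upper-left neighbours, all of which lie on earlier waves.

```python
def cell(sm, x, y, i, j, match, mismatch, gap):
    """Apply the Smith-Waterman recurrence to one cell."""
    ss = match if x[i - 1] == y[j - 1] else mismatch
    return max(sm[i][j - 1] + gap, sm[i - 1][j - 1] + ss, sm[i - 1][j] + gap, 0)


def sequential_fill(x, y, match=1, mismatch=-1, gap=-2):
    """Reference row-by-row fill."""
    sm = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            sm[i][j] = cell(sm, x, y, i, j, match, mismatch, gap)
    return sm


def wavefront_fill(x, y, strip_height=2, block_width=3,
                   match=1, mismatch=-1, gap=-2):
    """Fill the matrix block by block along anti-diagonals of the block grid."""
    n, m = len(x), len(y)
    sm = [[0] * (m + 1) for _ in range(n + 1)]
    # Row ranges of the horizontal strips and column ranges of the blocks.
    strips = [(r, min(r + strip_height, n + 1)) for r in range(1, n + 1, strip_height)]
    blocks = [(c, min(c + block_width, m + 1)) for c in range(1, m + 1, block_width)]
    waves = []
    for d in range(len(strips) + len(blocks) - 1):
        # All (strip, block) pairs with strip + block == d are mutually independent.
        wave = [(s, d - s) for s in range(len(strips)) if 0 <= d - s < len(blocks)]
        waves.append(wave)
        for s, b in wave:
            (r0, r1), (c0, c1) = strips[s], blocks[b]
            for i in range(r0, r1):
                for j in range(c0, c1):
                    sm[i][j] = cell(sm, x, y, i, j, match, mismatch, gap)
    return sm, waves


sm, waves = wavefront_fill("TGATAGG", "TGATGGAGG")
assert sm == sequential_fill("TGATAGG", "TGATGGAGG")
for d, wave in enumerate(waves):
    print(f"wave {d}: blocks {wave}")
```

The number of blocks per wave grows and then shrinks, so the available parallelism is lowest at the start and end of the computation.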

Evaluation
- Experimental environment
  - Beowulf implementation of EARTH
  - Uses a Beowulf machine consisting of 64 nodes, each containing two 200MHz Pentium Pro processors (a total of 128 processors and 128MB of memory)
  - Sequences of lengths ranging from 30K to 900K were tested
  - Execution times for the sequential and parallel implementations of the Smith and Waterman algorithm are given below:

      Implementation          Time
      Seq. Smith-Waterman     53 hours
      ATGC on 16 nodes        3.3 hours
      ATGC on 32 nodes        2.1 hours
      ATGC on 64 nodes        1.3 hours
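The reported times translate into speedups of roughly 16x, 25x, and 41x over the sequential run. The short calculation below makes that arithmetic explicit; the function name `speedups` is hypothetical, and the per-node efficiency figure assumes the sequential baseline is comparable to a single node, which the slides do not state.

```python
def speedups(sequential_hours, parallel_hours_by_nodes):
    """Return {nodes: (speedup, per-node efficiency)} from wall-clock times."""
    result = {}
    for nodes, hours in parallel_hours_by_nodes.items():
        speedup = sequential_hours / hours
        result[nodes] = (speedup, speedup / nodes)
    return result


# Times taken from the table above.
table = speedups(53.0, {16: 3.3, 32: 2.1, 64: 1.3})
for nodes, (speedup, efficiency) in table.items():
    print(f"{nodes} nodes: speedup {speedup:.1f}x, per-node efficiency {efficiency:.2f}")
```

Efficiency falls as nodes are added, which is consistent with the wavefront having fewer independent blocks than nodes near the start and end of the matrix.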

Evaluation (contd.)
- The multithreaded parallel implementation is named ATGC (Another Tool for Genomic Comparison)
- The experiment aligns
  - human and mouse mitochondrial genomes
  - human and Drosophila mitochondrial genomes
- Reason for selection
  - Human and mouse are closely related; the other pair is less similar
- The results were confirmed with MUMmer, another whole genome alignment tool
- Result graphs show that ATGC is more accurate than MUMmer (verified using NCBI BLAST)

Result Graphs

Result Graphs (contd.)

Conclusions
- Comparison of whole genomes requires high computation and memory
- It is made practical by a multithreaded parallel implementation of dynamic programming on a cluster of PCs
- Accurate results are obtained in a reasonable amount of time
- Aligns closely related as well as less similar genomes
- Slower, but plays an important role where high accuracy is needed (as seen in the comparison with MUMmer for the human and Drosophila mitochondrial genomes)

Related work: MUMmer (Maximal Unique Match)
- Given genomes A and B:
  - find all maximal, unique, matching subsequences (MUMs)
  - extract the longest possible set of matches that occur in the same order in both genomes
  - close the gaps
  - output the alignment
- Maximal unique match (MUM):
  - occurs exactly once in both genomes A and B
  - is not contained in any longer MUM
- The key idea in identifying MUMs is to build a suffix tree for genomes A and B
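The second step, keeping the largest set of matches that appear in the same order in both genomes, is a longest-increasing-subsequence problem. The sketch below is a simplified illustration and not MUMmer's implementation: it assumes each match is given as a hypothetical `(pos_a, pos_b)` pair of start positions, counts matches rather than weighting them by length, and uses a quadratic dynamic program for clarity.

```python
def longest_ordered_chain(matches):
    """Return the largest subset of (pos_a, pos_b) pairs increasing in both coordinates."""
    ms = sorted(matches)                 # order by position in genome A
    if not ms:
        return []
    best_len = [1] * len(ms)             # best chain length ending at each match
    prev = [-1] * len(ms)                # back-pointer for reconstruction
    for k in range(len(ms)):
        for p in range(k):
            in_order = ms[p][0] < ms[k][0] and ms[p][1] < ms[k][1]
            if in_order and best_len[p] + 1 > best_len[k]:
                best_len[k] = best_len[p] + 1
                prev[k] = p
    # Rebuild the chain from the match with the longest chain ending at it.
    k = max(range(len(ms)), key=best_len.__getitem__)
    chain = []
    while k != -1:
        chain.append(ms[k])
        k = prev[k]
    return chain[::-1]


# Made-up positions: (5, 20) is out of order relative to the others and is dropped.
example = [(1, 1), (5, 20), (10, 8), (15, 12), (30, 25)]
print(longest_ordered_chain(example))
# [(1, 1), (10, 8), (15, 12), (30, 25)]
```

The regions between consecutive matches in the chain are the gaps that the "close the gaps" step would then align.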