Repetitive DNA Detection and Classification Vijay Krishnan Masters Student Computer Science Department.

2 Repetitive DNA  Refers to substrings of the genome that repeat multiple times.  Different instances of the repeat element can differ slightly in sequence.  Highly prevalent in eukaryotes (organisms whose cells have a nucleus, as opposed to bacteria). About 50% of the human genome is repetitive DNA.

3 Why detect repetitive DNA?  Repeats drive evolution in diverse ways (Kazazian, 2004).  Repetitive DNA is generally not found to have any function.  Homology searches need repeat masking to avoid an explosion of unnecessary results.  Repeats also contain information about parentage.

4 Hit  Defined as a local alignment between two regions Q and T.  Q and T are called images of the hit.  Q = partner(T) with respect to the hit.  Completely defined by the endpoint coordinates of Q and T.  Endpoints of Q referred to as start(Q) and end(Q).

5 Dispersed Families (DF)  Often comprise mobile elements like transposons and retrotransposons.  Signature induced by a Dispersed Family: Images(x) = {A1, A2}, Images(y) = {A1, A3}, Images(z) = {A2, A3}.

6 Tandem Arrays (TA)  The repeating element is called a “Satellite”. “Pyramidal” Signature Induced by a Tandem Array.

7 Other Repeat Families  Pseudo-Satellites: Intermediate between Satellites and Dispersed Families.  Tandem Repeat: Often defined to be the same as TA. The PILER paper defines it as images with size bases, separated by 50 to bases.

8 De novo identification of repeat families  Input: The Genome sequence  Output: The repeat families and the positions where they occur in the Genome.

PILER: identification and classification of genomic repeats Robert C. Edgar and Eugene W. Myers

10 Finding Local Alignments (Hits)  Pairwise Alignment of Long Sequences (PALS) software used as a black box.  Used to find local alignments of minimum length λ and minimum identity μ.  Additional optimization: banded search, which looks only for alignments between regions separated by at most distance β.
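The three parameters on this slide can be illustrated with a small filter over hits. This is only a sketch: PALS applies these constraints during its own search, whereas the code below checks them after the fact. The `Image` and `Hit` classes, the `identity` field, and measuring the band as the distance between image start coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    start: int  # first base of the interval (inclusive)
    end: int    # last base of the interval (inclusive)

    def length(self) -> int:
        return self.end - self.start + 1


@dataclass(frozen=True)
class Hit:
    q: Image         # image Q of the hit
    t: Image         # image T of the hit, i.e. partner(Q)
    identity: float  # assumed: fraction of identical columns in the alignment


def keep_hit(hit, min_length, min_identity, max_distance=None):
    """Return True if the hit passes the length (lambda), identity (mu),
    and optional band (beta) filters."""
    if min(hit.q.length(), hit.t.length()) < min_length:
        return False
    if hit.identity < min_identity:
        return False
    # Assumed band definition: distance between the two start coordinates.
    if max_distance is not None and abs(hit.t.start - hit.q.start) > max_distance:
        return False
    return True


hit = Hit(Image(10, 109), Image(500, 599), identity=0.95)
print(keep_hit(hit, min_length=100, min_identity=0.9))                    # no band
print(keep_hit(hit, min_length=100, min_identity=0.9, max_distance=400))  # banded
```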

11 Pile  Suppose we are given a list of N hits.  This corresponds to 2N images (intervals).  A pile is a list of all images covering a maximal contiguous region. “Merge” overlapping images and “erase” the boundaries between adjacent images.  Let images = { [1,3], [2,4], [3,6], [8,9], [9,13] } Pile boundaries = { [1,6], [8,13] }. Pile Images = { {[1,3], [2,4], [3,6]}, {[8,9], [9,13]} }
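The pile definition above can be sketched as a sort-and-merge over intervals. This is one straightforward way to obtain the piles, not necessarily how PILER builds them. Images are assumed to be closed `(start, end)` tuples, and two images are merged when they share at least one coordinate; the function name `build_piles` is hypothetical.

```python
def build_piles(images):
    """Group closed intervals (start, end) into piles.

    Returns a list of (boundary, members) pairs, where boundary is the
    merged interval and members are the images it covers.
    """
    piles = []
    for start, end in sorted(images):
        if piles and start <= piles[-1][0][1]:
            # Overlaps or touches the current pile: extend it.
            (p_start, p_end), members = piles[-1]
            piles[-1] = ((p_start, max(p_end, end)), members + [(start, end)])
        else:
            # Gap before this image: start a new pile.
            piles.append(((start, end), [(start, end)]))
    return piles


images = [(1, 3), (2, 4), (3, 6), (8, 9), (9, 13)]
for boundary, members in build_piles(images):
    print(boundary, members)
# (1, 6) [(1, 3), (2, 4), (3, 6)]
# (8, 13) [(8, 9), (9, 13)]
```

The result matches the pile boundaries and pile images given on the slide.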

12 Construction of Piles (Example)  Images = { [1,3], [2,4], [6,7] } [Tables of Index/Value pairs showing the construction steps; the cell values are not recoverable from the transcript.]
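The Index/Value layout of this slide suggests an array indexed by genome position. Since the table contents are lost, the following is only a guess at that style of construction: a per-base coverage count, with piles read off as maximal runs of covered positions. Coordinates are assumed to be 1-based and inclusive, and `genome_length` is a hypothetical parameter.

```python
def pile_boundaries_by_coverage(images, genome_length):
    """Return pile boundaries from a per-base coverage array."""
    # Positions 0 and genome_length + 1 are sentinels that stay uncovered.
    coverage = [0] * (genome_length + 2)
    for start, end in images:
        for pos in range(start, end + 1):
            coverage[pos] += 1

    boundaries = []
    pile_start = None
    for pos in range(1, genome_length + 2):
        if coverage[pos] > 0 and pile_start is None:
            pile_start = pos                          # entering a covered run
        elif coverage[pos] == 0 and pile_start is not None:
            boundaries.append((pile_start, pos - 1))  # leaving a covered run
            pile_start = None
    return boundaries


print(pile_boundaries_by_coverage([(1, 3), (2, 4), (6, 7)], genome_length=8))
# [(1, 4), (6, 7)]
```

Note that with this representation two images that are merely adjacent, such as [1,3] and [4,6], also end up in one pile, because no uncovered base separates them.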

13 PILER-DF  Let G be a graph with one node for each pile, and no edges.  is-global-image(Q) is true if: #bases in Q >= g * (#bases in pile(Q))  For each pile p in P: For each image Q in p: Let T = partner(Q) if is-global-image(Q) and is-global-image(T ): – Add edge p−pile(T ) to G  Find connected components of G of order ≥ t.  t >= 3 to avoid segmental duplication.  Each Connected Component is a DF.
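The PILER-DF steps above can be sketched end to end as follows. This is an illustration under stated assumptions rather than the PILER implementation: hits are pairs of closed `(start, end)` intervals, piles are built by a simple sort-and-merge, and the value of `g` in the example (0.9) is chosen only for demonstration, since the slide does not give one.

```python
from collections import defaultdict


def find_dispersed_families(hits, g, t=3):
    """hits: list of (Q, T) pairs, each image a closed (start, end) interval.

    Returns a list of families, each a sorted list of pile boundaries.
    """
    # 1. Build piles from all images and remember each image's pile.
    piles = []     # list of [start, end]
    pile_of = {}   # image -> index into piles
    for start, end in sorted({img for hit in hits for img in hit}):
        if piles and start <= piles[-1][1]:
            piles[-1][1] = max(piles[-1][1], end)
        else:
            piles.append([start, end])
        pile_of[(start, end)] = len(piles) - 1

    def is_global_image(img):
        p_start, p_end = piles[pile_of[img]]
        return (img[1] - img[0] + 1) >= g * (p_end - p_start + 1)

    # 2. Link two piles when a hit has global images in both.
    adj = defaultdict(set)
    for q, target in hits:
        a, b = pile_of[q], pile_of[target]
        if a != b and is_global_image(q) and is_global_image(target):
            adj[a].add(b)
            adj[b].add(a)

    # 3. Keep connected components containing at least t piles.
    seen, families = set(), []
    for node in range(len(piles)):
        if node in seen:
            continue
        seen.add(node)
        stack, component = [node], []
        while stack:
            cur = stack.pop()
            component.append(cur)
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        if len(component) >= t:
            families.append(sorted(tuple(piles[i]) for i in component))
    return families


a1, a2, a3 = (100, 199), (1000, 1099), (5000, 5099)
hits = [(a1, a2), (a1, a3), (a2, a3), ((8000, 8499), (9000, 9499))]
print(find_dispersed_families(hits, g=0.9))
# [[(100, 199), (1000, 1099), (5000, 5099)]]
```

The last hit links only two piles, so with t = 3 it is discarded, which is the segmental-duplication case the slide mentions.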

14 PILER-PS  Similar to the problem of finding DFs, except that PSs are typically closer to one another.  Algorithm identical to PILER-DF except for banded search to identify hits.  Banded Search: Ensures that the PSs are clustered. Allows a faster and more sensitive search for hits.

15 PILER-TA  TAs have pyramids as signatures.  We can avoid comparing every pair of hits since: Hits in a pyramid belong to the same pile. The images should be separated by at most distance β (banded search).  Define first(h) = image in h with smaller start coordinate.  Define last(h) = image in h with larger start coordinate.

16 PILER-TA  For each pile p: Create a graph G with one node for each hit in the pile, and no edges. For each pair of hits (h1, h2) in p: Set shorter_length = min(|h1|, |h2|). Set longer_length = max(|h1|, |h2|). Set Q1 = first(h1) … here (B1,B2,B3). Set T1 = last(h1) … here (B2,B3,B4). Set Q2 = first(h2) … here (B1,B2). Set T2 = last(h2) … here (B3,B4). Set dS = (start(Q2) − start(Q1)) / shorter_length … here 0 − 0 = 0. Set dE = (end(T2) − end(T1)) / shorter_length … here 4 − 4 = 0. If shorter_length / longer_length > 0.5 and |dS| < m and |dE| < m: add edge h1 − h2 to G.  Each connected component of G is a TA.  0 <= m <= 1. By default m = 0.05.
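The pairwise test on this slide can be sketched with a union-find over the hits of one pile. Several details are assumptions made for illustration: hits are pairs of closed `(start, end)` intervals, the length |h| of a hit is taken to be the length of its first image, and the block coordinates in the example (blocks B1 to B4 of 100 bases each) are invented to mirror the slide's figure.

```python
from itertools import combinations


def tandem_arrays(hits, m=0.05):
    """Group the hits of one pile into connected components (candidate TAs)."""
    def first(h):
        return min(h, key=lambda img: img[0])   # image with smaller start

    def last(h):
        return max(h, key=lambda img: img[0])   # image with larger start

    def length(h):
        q = first(h)
        return q[1] - q[0] + 1                  # assumed definition of |h|

    parent = list(range(len(hits)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]       # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(hits)), 2):
        h1, h2 = hits[i], hits[j]
        shorter = min(length(h1), length(h2))
        longer = max(length(h1), length(h2))
        d_s = (first(h2)[0] - first(h1)[0]) / shorter
        d_e = (last(h2)[1] - last(h1)[1]) / shorter
        if shorter / longer > 0.5 and abs(d_s) < m and abs(d_e) < m:
            parent[find(i)] = find(j)           # add edge h1 - h2

    groups = {}
    for i, h in enumerate(hits):
        groups.setdefault(find(i), []).append(h)
    return list(groups.values())


h1 = ((1, 300), (101, 400))      # (B1,B2,B3) aligned to (B2,B3,B4)
h2 = ((1, 200), (201, 400))      # (B1,B2) aligned to (B3,B4)
h3 = ((1000, 1100), (1200, 1300))
print(tandem_arrays([h1, h2, h3]))
```

Here h1 and h2 share start and end points (dS = dE = 0) and end up in one component, while h3 stays alone. Singleton components are returned too; a caller would presumably filter them out.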

17 PILER-TR  Identify and mask Satellites and PSs.  Two pass method: Pass1: perform banded search for TR candidates. Pass2: Find hits that align TR pairs to each other.

18 Library Construction  Use MUSCLE (Edgar, 2004a,b) to create multiple alignments of the family members found by PILER, and use these alignments to find consensus sequences.  This library can be used by BLAST or RepeatMasker to find intact and partial instances.
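Once family members have been aligned, a consensus can be read off column by column. The sketch below assumes the aligned rows are already available as equal-length strings with `-` for gaps (it does not call MUSCLE), and it uses a plain majority vote, which is an assumption and may differ from the consensus rule PILER applies.

```python
from collections import Counter


def consensus(aligned_rows, gap="-"):
    """Majority-vote consensus of equal-length aligned sequences."""
    if len({len(row) for row in aligned_rows}) != 1:
        raise ValueError("all rows must have the same length")
    out = []
    for column in zip(*aligned_rows):
        symbol, _ = Counter(column).most_common(1)[0]
        if symbol != gap:          # drop columns where the gap wins the vote
            out.append(symbol)
    return "".join(out)


rows = ["ACGT-A", "ACGTTA", "AGGT-A"]
print(consensus(rows))  # ACGTA
```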

19 Satellites and PSs in A. thaliana

De novo identification of repeat families in large genomes Alkes L. Price Neil C. Jones Pavel A. Pevzner

21 The RepeatScout Algorithm  Improves on the RECON algorithm (Bao and Eddy, 2002).  Builds repeat families using high-frequency L-mers as seeds.  Input: DNA sequences $S_1, \dots, S_n$, each of which contains a similar repeat element and extends past the repeat element on either side.  Output: Substrings $R_1, \dots, R_n$ that give the repeat element boundaries, and a consensus sequence Q.
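The seeding idea can be illustrated by counting L-mers and cutting a window around each occurrence of the most frequent one. This is a toy sketch, not RepeatScout's procedure: the function names, the `flank` parameter, and the alphabetical tie-break are assumptions, and details such as overlapping occurrences or reverse complements are ignored.

```python
from collections import defaultdict


def lmer_positions(genome, l):
    """Map every l-mer to the list of positions where it starts."""
    positions = defaultdict(list)
    for i in range(len(genome) - l + 1):
        positions[genome[i:i + l]].append(i)
    return positions


def seed_windows(genome, l, flank):
    """Return the most frequent l-mer and the windows S_1..S_n around it."""
    positions = lmer_positions(genome, l)
    # Ties are broken alphabetically so the result is deterministic.
    seed = max(positions, key=lambda k: (len(positions[k]), k))
    windows = [genome[max(0, i - flank): i + l + flank] for i in positions[seed]]
    return seed, windows


genome = "TTACGTAGGACGTACCACGTAT"
seed, windows = seed_windows(genome, l=5, flank=2)
print(seed)     # ACGTA
print(windows)  # ['TTACGTAGG', 'GGACGTACC', 'CCACGTAT']
```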

22 RepeatScout (contd)  Q is defined to be the sequence that maximizes $A(Q; S_1, \dots, S_n) = \sum_{k=1}^{n} \max\{a(Q, S_k), 0\} - c\,|Q|$,  where $a(Q, S_k)$ can be any reasonable sequence alignment score.  The penalty term $c|Q|$ discourages long Qs; c can be thought of as the minimum number of repeat elements that must align with each given position of Q.
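A small numeric example may help with the objective. The function below simply evaluates the formula from precomputed alignment scores; the scores and parameter values in the example are made up.

```python
def objective(alignment_scores, consensus_length, c):
    """A(Q; S_1..S_n): sum of non-negative alignment scores minus c * |Q|."""
    return sum(max(score, 0) for score in alignment_scores) - c * consensus_length


# Four windows, one of which aligns badly and so contributes nothing.
print(objective([12, 7, -3, 9], consensus_length=5, c=3))  # 13
```

The negative score is clipped to zero, so a window that does not contain the repeat neither helps nor hurts; the result is 12 + 7 + 0 + 9 − 3·5 = 13.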

23 Choice of a(Q,S)  Local Alignment Score:  Fit Alignment Score (Waterman, 1995) Boundaries of Q shared by all segments. Strict constraint on Q.
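The difference between the two scores can be seen with a pair of small dynamic programs. This is a sketch with assumed scoring values (match +1, mismatch −1, gap −1, linear gaps); the scoring scheme used in the paper may differ. The local score lets both Q and S be trimmed, while the fit score forces all of Q to align to some substring of S.

```python
def local_score(q, s, match=1, mismatch=-1, gap=-1):
    """Best score of any substring of q against any substring of s."""
    prev = [0] * (len(s) + 1)
    best = 0
    for i in range(1, len(q) + 1):
        cur = [0] * (len(s) + 1)
        for j in range(1, len(s) + 1):
            sub = match if q[i - 1] == s[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


def fit_score(q, s, match=1, mismatch=-1, gap=-1):
    """Best score of all of q against any substring of s."""
    prev = [0] * (len(s) + 1)                  # q may start anywhere in s
    for i in range(1, len(q) + 1):
        cur = [prev[0] + gap] + [0] * len(s)   # q[:i] aligned to nothing
        for j in range(1, len(s) + 1):
            sub = match if q[i - 1] == s[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return max(prev)                           # q may end anywhere in s


s = "TTACGTTT"
print(local_score("ACGT", s), fit_score("ACGT", s))          # 4 4
print(local_score("ACGTGGGG", s), fit_score("ACGTGGGG", s))  # 4 0
```

When Q overshoots the repeat (second line), the local score ignores the extra bases, whereas the fit score is pulled down by them. This is the sense in which the fit score places a strict constraint on Q.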

24 Fit-Preferred Alignment Score

25 Comparison of Alignment Scores

26 Optimizing $A(Q; S_1, \dots, S_n)$  Even dynamic programming for the optimal solution is intractable: the problem would be n-dimensional, so both time and space requirements are exponential in n.  Greedy heuristic: Suppose L is the high-frequency l-mer and $S_1, \dots, S_n$ surround its exact matches. Initialize $Q_0$ to L and greedily extend Q.

27 Optimizing $A(Q; S_1, \dots, S_n)$  N ∈ {A, C, G, T}.  Choose $Q_{t+1} = Q_t N$ (that is, $Q_t$ with N appended), where N maximizes $A(Q_t N; S_1, \dots, S_n)$.  Alignment scores from the previous iteration can be reused when computing the alignment scores for iteration t+1.  Terminate when a set number of iterations gives no improvement.  Use this procedure to extend to the right, and then to the left.
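The greedy extension can be sketched as a short loop. To keep the example small, the alignment score a(Q, S) is replaced by a deliberately simple stand-in (the best ungapped placement of Q inside S), and every candidate is rescored from scratch instead of reusing scores from the previous iteration. The names `ungapped_fit_score`, `extend_right`, and `patience` are hypothetical, and only rightward extension is shown.

```python
def ungapped_fit_score(q, s):
    """Stand-in for a(Q, S): best ungapped placement of q in s (+1 / -1)."""
    if len(q) > len(s):
        return 0
    return max(
        sum(1 if a == b else -1 for a, b in zip(q, s[i:i + len(q)]))
        for i in range(len(s) - len(q) + 1)
    )


def objective(q, windows, c):
    return sum(max(ungapped_fit_score(q, s), 0) for s in windows) - c * len(q)


def extend_right(seed, windows, c, patience=3):
    """Greedily append bases; stop after `patience` steps without improvement."""
    best_q, best_val = seed, objective(seed, windows, c)
    q, stale = seed, 0
    while stale < patience:
        # Try all four bases and keep the one with the highest objective.
        q = max((q + n for n in "ACGT"), key=lambda cand: objective(cand, windows, c))
        val = objective(q, windows, c)
        if val > best_val:
            best_q, best_val, stale = q, val, 0
        else:
            stale += 1
    return best_q


windows = ["GGACGTACCTTA", "TTACGTACCTGG", "CAACGTACCTCA"]
print(extend_right("ACGT", windows, c=2))  # ACGTACCT
```

All three windows share ACGTACCT and then diverge, so with c = 2 every further base is supported by at most one window and the objective stops improving.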

28 Optimizing $A(Q; S_1, \dots, S_n)$  Prevent redundancy in finding consensus sequences.  After identifying Q, locate its occurrences and reduce the counts of L-mers corresponding to those locations.  Algorithm terminates when we have no L-mers with effective count of at least m.  Refine Q after the optimal alignment boundaries are determined.  More details of parameter settings in the paper.
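The count-reduction step can be illustrated as follows. This sketch locates occurrences of Q by exact string matching, which is a simplifying assumption (approximate occurrences would need an alignment-based search), and the function name `effective_counts` is hypothetical.

```python
from collections import Counter


def effective_counts(genome, l, consensus):
    """Count l-mers, then discount those lying inside occurrences of consensus."""
    counts = Counter(genome[i:i + l] for i in range(len(genome) - l + 1))

    # Collect start positions of l-mers fully inside an exact occurrence of Q.
    covered = set()
    start = genome.find(consensus)
    while start != -1:
        covered.update(range(start, start + len(consensus) - l + 1))
        start = genome.find(consensus, start + 1)

    for i in covered:
        counts[genome[i:i + l]] -= 1
    return counts


genome = "ACGTACCT" + "GG" + "ACGTACCT" + "TT" + "ACGTAAAA"
counts = effective_counts(genome, l=4, consensus="ACGTACCT")
print(counts["ACGT"])         # 1
print(max(counts.values()))   # 1
```

ACGT starts with a raw count of 3, but two of its occurrences lie inside copies of the consensus. With a threshold of m = 2, no L-mer remains eligible as a seed, so the loop described on the slide would stop.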

29 Results

30 Results

31 Results

32 Conclusions  Both PILER and RepeatScout address DNA repeats.  PILER focuses more on finding diverse kinds of repeat families and uses MUSCLE to find the consensus sequences.  RepeatScout focuses more on finding the consensus sequence given members of a repeat family.

33 Thank You! Questions?