De novo identification of repeat families in large genomes Alkes L. Price, Neil C. Jones and Pavel A. Pevzner June 28, 2005.


What is a repeat family?
A repeat family is a collection of similar sequences which appear many times in a genome. For example, the Alu repeat family has over 1 million approximate occurrences in the human genome.

Identifying repeat families: problem formulation
INPUT: Genome containing approximate Alu occurrences
OUTPUT: 282bp Alu consensus sequence GGCCGGGCGCGGTGGCTCACG………..GCGAGACTCCGTCTC + consensus sequences of all other repeat families in genome

Identifying repeat families: an easy problem?
(Figure: Alu occurrences and the Alu consensus)
Difficulties:
- Regions containing repeat occurrences are not known a priori
- Repeat boundaries are not known a priori
- Many repeat occurrences appear as partial copies

Identifying repeat families: a difficult problem
“The problem of automated repeat sequence family classification is inherently messy and ill-defined and does not appear to be amenable to a clean algorithmic attack.” (Bao and Eddy, 2002)
In this talk, we present a simple and efficient algorithm for solving this problem.

Why is identifying repeat families important?
1. Repeats are biologically meaningful
Repeats are drivers of genome evolution (Kazazian, 2004) which can play a beneficial (rather than parasitic) role (Holmes, 2002). In particular, repeats have been implicated in:
- Genome rearrangements (Kazazian, 2004)
- Drift to new biological function (Kidwell and Lisch, 2001)
- Increased rate of evolution under stress (Capy et al., 2000)

Why is identifying repeat families important?
2. Repeat masking
Repeats need to be masked prior to performing most single-species or multi-species analyses.
“Every time we compare two species that are closer to each other than either is to humans, we get nearly killed by unmasked repeats.” Webb Miller (personal communication)

Why is identifying repeat families important?
If repeat families are known, repeats can be masked using RepeatMasker.
(Figure: GENE1, GENE2)
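
To make the idea of masking concrete, here is a minimal sketch of what masking does to a sequence once repeat coordinates are known. It is not RepeatMasker itself; the function name `mask_repeats`, the half-open interval convention and the choice of `N` as the masking character are assumptions made for illustration.

```python
def mask_repeats(sequence, intervals, mask_char="N"):
    """Return a copy of `sequence` with each [start, end) interval masked.

    `intervals` is a hypothetical list of half-open (start, end) pairs,
    e.g. coordinates reported by some repeat annotation step.
    """
    chars = list(sequence)
    for start, end in intervals:
        # Clamp to the sequence bounds so the length never changes.
        start = max(0, start)
        end = min(len(chars), end)
        for i in range(start, end):
            chars[i] = mask_char
    return "".join(chars)


masked = mask_repeats("ACGTACGTAC", [(2, 5)])
print(masked)  # ACNNNCGTAC
```

Downstream comparisons can then skip the masked positions, so repeat copies no longer produce spurious matches between otherwise unrelated regions.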

Identifying repeat families: manual approaches
For widely studied genomes such as human and mouse, libraries of repeat families have been manually curated:
- Repbase Update library
- RepeatMasker library

Identifying repeat families: algorithmic approaches
Many, many new genomes are being assembled. How to identify the repeat families present in these genomes? Clearly, algorithmic approaches are needed.

Identifying repeat families: algorithmic approaches
All existing algorithms for de novo identification of repeat families rely on a set of pairwise similarities:
- Single-linkage clustering (Agarwal and States, 1994)
- REPuter (Kurtz et al., 2000)
- RepeatFinder (Volfovsky et al., 2001)
- RECON (Bao and Eddy, 2002)
- RepeatGluer (Pevzner et al., 2004)
- PILER (Edgar and Myers, 2005)

Identifying repeat families: algorithmic approaches
Disadvantages of using pairwise similarities:
- Computational intractability (human genome: ~$10^6$ Alus => ~$10^{12}$ pairwise alignments)
- Difficulty defining repeat boundaries: “Local sequence alignments do not usually correspond to the biological boundaries … Difficulty in defining element boundaries causes problems in clustering related elements into families.” (Bao and Eddy, 2002)
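
The order-of-magnitude estimate for the number of pairwise alignments can be checked with a short calculation, assuming each unordered pair of the roughly $10^6$ Alu occurrences is compared once:

```latex
\binom{n}{2} = \frac{n(n-1)}{2},
\qquad
n = 10^{6} \;\Rightarrow\;
\binom{10^{6}}{2} = \frac{10^{6}\,(10^{6}-1)}{2} \approx 5 \times 10^{11} \sim 10^{12}
```

The count grows quadratically in the number of occurrences, which is why approaches built on all-against-all similarities become expensive for high-copy families.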

Identifying repeat families: algorithmic approaches
Disadvantages of using pairwise similarities:
- Computational intractability
- Difficulty defining repeat boundaries
Our RepeatScout algorithm uses an efficient method of similarity search which enables a rigorous definition of repeat boundaries.

RepeatScout: the main idea
Consider a repeat family with many occurrences in a genome. Equivalently, we have:
TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA
GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA
TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT
TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC
ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT
CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG

RepeatScout: the main idea
Idea: greedily extend 1 bp at a time from short l-mer seed
Discard a sequence after it stops aligning to consensus
Stop extending when most sequences no longer align
Consensus: ?
Consensus: CAACGTCTGC
Consensus: CAACGTCTGCT
Consensus: CAACGTCTGCTC
Consensus: CAACGTCTGCTCA
Consensus: CAACGTCTGCTCAC
Consensus: CAACGTCTGCTCACG
Consensus: CAACGTCTGCTCACGG
Consensus: CAACGTCTGCTCACGGA
Consensus: CAACGTCTGCTCACGGAC
Consensus: CAACGTCTGCTCACGGACG
Consensus: CAACGTCTGCTCACGGACGT

RepeatScout: the main idea
Consensus: CAACGTCTGCTCACGGACGTACGGT
Idea: greedily extend 1 bp at a time from short l-mer seed
Discard a sequence after it stops aligning to consensus
Stop extending when most sequences no longer align
Note: pairwise alignment is a poor boundary criterion.

RepeatScout: the main idea
Consensus: AGGCGCCTCGCAACGTCTGCTCACGGACGT
Idea: greedily extend 1 bp at a time from short l-mer seed
Discard a sequence “after it stops aligning to consensus”
Stop extending “when most sequences no longer align”
First extend right, then extend left in similar manner
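
The extension idea can be illustrated on the six example sequences with a deliberately simplified sketch. It is not the RepeatScout implementation: it assumes ungapped alignment anchored at the seed, uses a plain majority vote per column, omits the step that discards individual sequences, and stops when the most common base is supported by at most half of the sequences. The names `extend` and `min_support` are hypothetical.

```python
from collections import Counter

SEQS = [
    "TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA",
    "GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA",
    "TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT",
    "TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC",
    "ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT",
    "CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG",
]
SEED = "CAACGTCTGC"


def extend(seqs, seed, direction, min_support=0.5):
    """Extend `seed` 1 bp at a time by majority vote (ungapped sketch).

    direction > 0 extends to the right, direction < 0 to the left.
    Extension stops when the winning base is backed by at most
    `min_support` of all sequences (an assumed stopping rule).
    """
    anchors = [s.find(seed) for s in seqs]
    if min(anchors) < 0:
        raise ValueError("every sequence must contain the seed")
    added = []
    while True:
        offset = len(added)
        column = []
        for s, a in zip(seqs, anchors):
            i = a + len(seed) + offset if direction > 0 else a - 1 - offset
            if 0 <= i < len(s):
                column.append(s[i])
        if not column:
            break
        base, votes = Counter(column).most_common(1)[0]
        if votes <= min_support * len(seqs):
            break
        added.append(base)
    ext = "".join(added)
    # Left extensions are collected outward from the seed, so reverse them.
    return ext if direction > 0 else ext[::-1]


right = extend(SEQS, SEED, +1)
left = extend(SEQS, SEED, -1)
print(left + SEED + right)  # AGGCGCCTCGCAACGTCTGCTCACGGACGT
```

On this toy input the majority vote happens to recover the consensus shown on the slide. Real repeat copies contain insertions and deletions, so an ungapped vote like this would not be sufficient in practice.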

Repeat boundaries: the objective function
Let $S_1, \dots, S_n$ be strings containing occurrences of a repeat family which share a short l-mer seed. We define the consensus sequence $Q$ of the repeat family to be the sequence which maximizes
$A(Q; S_1, \dots, S_n) = \sum_k a(Q, S_k) - c\,|Q|$
where $a(Q, S_k)$ is a fit-preferred alignment score and $c$ is a repeat frequency threshold.

Repeat boundaries: the objective function
$A(Q; S_1, \dots, S_n) = \sum_k a(Q, S_k) - c\,|Q|$
Optimizing the objective function:
- Start with Q = short l-mer seed
- Greedily extend Q to the right (left) 1 bp at a time.
- Stop when many consecutive iterations fail to improve upon the optimal Q.
The optimal Q defines the consensus sequence of the repeat family. This provides a rigorous definition of repeat boundaries.
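
The optimization loop can be sketched as follows, under loudly stated assumptions. The score used here is a stand-in, not the fit-preferred alignment score: it is ungapped, anchored at the seed, gives +1 per match and -1 per mismatch, and credits each sequence with its best-scoring prefix of the extension. The values `c=2` and `patience=5`, and the names `prefix_score`, `objective` and `greedy_right`, are illustrative choices. Only rightward extension is shown.

```python
SEQS = [
    "TAGCACCTTAGGGCGTCTCGCAACGTCTGCCCACGAACGTTAATCAGTAA",
    "GATTATCATGAAGCGCTTCGCAACGTCTGCAGCTGTCCAGACCGCTGTCA",
    "TATATCCGGTAATCGCCCCGCAACGTCTGCTAACGGGCGTACGGTCGAAT",
    "TGACCTGCTCAGGAGCCTTGCAACGTCTGCTCGCGGATGTGTATGCACGC",
    "ATCCATGCTCGGTATGAATCCAACGTCTGCTCATGGACATCTCATGACGT",
    "CGATCCTCTGAGGCACCTCACAACGTCTGCTCACTGACGCACGGTTGCTG",
]
SEED = "CAACGTCTGC"


def prefix_score(ext, seq, start):
    """Best running score of `ext` against seq[start:] (stand-in for a(Q, S_k))."""
    best = run = 0
    for i, base in enumerate(ext):
        j = start + i
        run += 1 if j < len(seq) and seq[j] == base else -1
        best = max(best, run)
    return best


def objective(seed, ext, seqs, c):
    """A(Q) = sum_k a(Q, S_k) - c * |Q| for Q = seed + ext."""
    total = 0
    for s in seqs:
        start = s.find(seed) + len(seed)  # assumes the seed occurs in s
        total += len(seed) + prefix_score(ext, s, start)
    return total - c * (len(seed) + len(ext))


def greedy_right(seed, seqs, c=2, patience=5):
    """Extend right 1 bp at a time; stop after `patience` non-improving steps."""
    best_ext, best_val = "", objective(seed, "", seqs, c)
    ext, fails = "", 0
    while fails < patience:
        # Try all four bases and keep the one with the highest objective.
        ext = max((ext + b for b in "ACGT"),
                  key=lambda e: objective(seed, e, seqs, c))
        val = objective(seed, ext, seqs, c)
        if val > best_val:
            best_ext, best_val, fails = ext, val, 0
        else:
            fails += 1
    return seed + best_ext, best_val


consensus, value = greedy_right(SEED, SEQS)
print(consensus, value)  # CAACGTCTGCTCACGGACGT 51
```

The penalty term is what fixes the boundary: an extra base pays for itself only while more than `c` sequences still gain score from it, so the search keeps the longest Q reached before the gains fall below that level.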

Repeat boundaries: the objective function
Consensus: AGGCGCCTCGCAACGTCTGCTCACGGACGT
Greedily extend right/left to optimize $A(Q; S_1, \dots, S_n)$

RepeatScout: finding all repeat families
To find all repeat families in a genome, we could apply this procedure to extend all frequent l-mers.
However, each repeat family spawns a large number of frequent l-mers and could be repeatedly rediscovered.
To address this, we dynamically adjust l-mer frequencies to exclude contributions from repeat families we have already identified.
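
One way to picture the frequency adjustment is the sketch below. It assumes the occurrences of an already identified family are available as half-open coordinate intervals and removes the l-mers lying wholly inside them from the counts. The helper names, the toy genome and the interval representation are hypothetical. The slides do not say how RepeatScout locates occurrences or updates its table.

```python
from collections import Counter


def count_lmers(genome, l):
    """Count every l-mer (substring of length l) in the genome."""
    return Counter(genome[i:i + l] for i in range(len(genome) - l + 1))


def exclude_family(counts, genome, occurrences, l):
    """Return new counts without l-mers inside the given [start, end) occurrences."""
    adjusted = counts.copy()
    for start, end in occurrences:
        for i in range(start, end - l + 1):
            adjusted[genome[i:i + l]] -= 1
    # Unary plus keeps only l-mers whose count is still positive.
    return +adjusted


# Toy genome: three copies of ACGTAC separated by short spacers.
genome = "ACGTAC" + "GG" + "ACGTAC" + "TT" + "ACGTAC"
counts = count_lmers(genome, 4)
adjusted = exclude_family(counts, genome, [(0, 6), (8, 14), (16, 22)], 4)

print(counts["ACGT"], counts["CGTA"], counts["GTAC"])  # 3 3 3
print("ACGT" in adjusted, adjusted["TACG"])            # False 2
```

After the adjustment, the three l-mers contributed by the identified family no longer look frequent, so they would not seed a second discovery of the same family.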

RepeatScout: postprocessing
We discard very short “repeat families” arising from spurious frequent l-mers.
We discard repeat families with fewer than 10 copies.
We may further wish to distinguish between:
- Low-complexity repeat families
- Tandem repeat families
- Multicopy exon families
- Segmental duplication units
- Transposon families
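
The two discard rules amount to a simple filter, sketched below. The 10-copy cutoff comes from the slide. The length cutoff `min_length=50` is an assumed placeholder, since the slide does not say what counts as "very short", and the `Family` record is a hypothetical data structure.

```python
from dataclasses import dataclass


@dataclass
class Family:
    consensus: str  # consensus sequence of the candidate family
    copies: int     # number of copies found in the genome


def postprocess(families, min_length=50, min_copies=10):
    """Keep families that are long enough and have enough copies."""
    return [
        f for f in families
        if len(f.consensus) >= min_length and f.copies >= min_copies
    ]


candidates = [
    Family("ACGT" * 70, 1200),  # long and frequent: kept
    Family("ACGTACGTAC", 500),  # very short: discarded
    Family("TTGACA" * 20, 4),   # fewer than 10 copies: discarded
]
kept = postprocess(candidates)
print(len(kept))  # 1
```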

Results: the human Alu family
Input: Genome containing approximate Alu occurrences
Desired Output: 282bp Alu consensus sequence GGCCGGGCGCGGTGGCTCACG………..GCGAGACTCCGTCTC
RepeatScout Output (on human X chr): 282bp sequence GGCCGGGCGCGGTGGCTCACG………..GCGAGACTCCGTCTC

Results: C. briggsae
We benchmarked RepeatScout using the 108Mb C. briggsae genome (Stein et al., 2003), which Stein et al. analyzed using the RECON algorithm (Bao and Eddy, 2002).
We ran RepeatMasker using either the RECON repeat library or the RepeatScout library as input, and compared the results:

Results: C. briggsae
Sequence masked using each library:
RECON library only: 2.0 Mb
Both libraries: 23.1 Mb
RepeatScout library only: 4.8 Mb

Results: human, mouse, rat
We ran RepeatScout on human, mouse and rat X chromosomes.
We filtered out:
- Low-complexity repeat families
- Tandem repeat families
- Multicopy exon families
- Known segmental duplication units
We ran RepeatMasker using either the RepeatMasker library or the RepeatScout library as input, and compared the results:

Results: human X chromosome
Sequence masked using each library:
RepeatMasker library only: 8.3 Mb
Both libraries: 53.5 Mb
RepeatScout library only: 2.4 Mb

Results: mouse X chromosome (results presented in our paper)
Sequence masked using each library:
Repbase Update library only: 2.7 Mb
Both libraries: 43.2 Mb
RepeatScout library only: 6.4 Mb

Results: mouse X chromosome (latest results)
Sequence masked using each library:
RepeatMasker library only: 5.3 Mb
Both libraries: 47.6 Mb
RepeatScout library only: 3.3 Mb

Running times
              3.0 Mb (human)   9.0 Mb (human)   X chr (human)
RECON         4 hours *        39 hours *       --
RepeatScout   6 min †          21 min †         8 hours †
* on a single 1.7 GHz Intel Xeon processor
† on a single 0.5 GHz DEC Alpha processor

Future Directions
- Distinguish segmental duplications from transposons
- Unify fragmented repeat families
- Improve sensitivity via inexact or noncontiguous l-mer seeds
- Run RepeatScout on entire mammalian genomes

RepeatScout web site
Google search on RepeatScout:
- RepeatScout source code and documentation
- RepeatScout repeat libraries
- Slides of this talk

Acknowledgements
We are grateful to:
- Lincoln Stein for providing RECON C. briggsae output.
- Evan Eichler for providing segmental duplication annotations for human, mouse and rat X chromosomes.
- Arian Smit, Robert Hubley and Brian Haas for testing RepeatScout and offering numerous helpful comments and suggestions.