

1 Repeats!

2 Introduction • A repeat family is a collection of repeats that appear multiple times in a genome. • Our objective is to identify all families of interspersed repeats within a single genome.

3 Challenges when identifying repeat families • Regions containing repeat occurrences are not known a priori • Repeat boundaries are not known a priori • Many repeat occurrences appear as partial copies

4 Why are repeats important? • Repeats have been implicated in: • Genome rearrangements (Kazazian, 2004; Achaz et al., 2003) • Accelerated loss of gene order (Rocha et al., 2003) • Creation of novel biological functions (Lynch et al., 2002) • Increased rate of evolution under stress (Capy et al., 2000)

5 Identifying repeats de novo • Suppose we obtain a new genome about which we know nothing. We can: • Use a database of known repeats (RepeatMasker/RepBase), but novel repeat elements may not be in the database and repetitive gene families are never in the database • Identify repeats de novo using sequence analysis

6 Existing methods for detection of repeat families • Nearly all existing algorithms for de novo identification of repeat families rely on a set of pairwise similarities: • REPuter (Kurtz et al., 2000) • RepeatFinder (Volfovsky et al., 2001) • RECON (Bao and Eddy, 2002) • RepeatGluer (Pevzner et al., 2004) • PILER (Edgar and Myers, 2005) • RepeatScout (Price et al., 2005)

7 Mutational forces at play • Over time, indels & substitutions will affect copies of repeat families:
AGGCTACCCCTTTAGGCTAGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT
AGGCTGCCCCTTTAGGCTDGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT
AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCCTATT
AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCDTATT
AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCTATT
• Alignments (with gaps) are required to attempt to reconstruct the true repeat boundaries

8 de novo repeat detection • One approach: self-search with a pairwise local-alignment tool such as BLAST • The number of pairwise alignments grows as O(r^2) in the copy number r of the repeat • Inherent difficulty defining repeat boundaries among collections of pairwise alignments
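To make the quadratic growth concrete, here is a small worked example. It assumes the worst case in which every copy of a repeat aligns to every other copy; the function name is hypothetical and not part of any tool named on the slides.

```python
import math


def pairwise_alignment_count(copy_number: int) -> int:
    """Number of unordered pairs among `copy_number` repeat copies.

    Assumption: every copy aligns to every other copy, so the count is
    r * (r - 1) / 2, which grows as O(r^2).
    """
    return math.comb(copy_number, 2)


if __name__ == "__main__":
    for r in (10, 100, 1000):
        # A multiple alignment of the same family needs only r rows.
        print(r, "copies ->", pairwise_alignment_count(r), "pairwise alignments")
```

By contrast, a single multiple alignment of the same family has one row per copy, which is the motivation for the local multiple alignment approach discussed next.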

9 Alternative methods? • Local multiple alignment: a single local multiple alignment uses O(N) space for a genome of length N. An example local multiple alignment:
1. AACAAGCA-A-ACTTTTATCCATGGTCGTGGTACAGAGGGGTC
2. AACAAGCA-A-ACTTTTGTCCATGGTCGTGGTACAGAGTGGTC
3. AACATGCAGA-ACTTTTATCCATGGTCGTCGTACAGAGGGGT-
4. AACAAGCAGACACTTTTATCCATGGTCGTGGTAC
5. AACAAGCA----CTTTTATCCATAGTCGTGGTA
6.             CTTTTATCCATGGTCGTGGTACAGAGGGGTC

10 Local multiple alignment • Local multiple alignment has the inherent potential to avoid pitfalls associated with pairwise alignment. • But multiple alignment under the sum-of-pairs (SP) objective function remains intractable… • Progressive alignment heuristics offer excellent speed and accuracy (e.g. MUSCLE). • So why not directly construct a multiple alignment?

11

12 Steps 1-3: Chaining seeds from the input sequence. The method incorporates three novel ideas: (1) palindromic spaced seed patterns to match both DNA strands simultaneously, (2) seed extension (chaining) in order of decreasing multiplicity, and (3) procrastination when low-multiplicity matches are encountered.
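The first two ideas can be illustrated with a minimal sketch. The seed pattern `1101011`, the function names, and the dictionary-based index are all assumptions made for illustration; they are not taken from the presented software. Because the pattern reads the same in both directions, the key extracted from a window on the reverse strand is the reverse complement of the key on the forward strand, so storing the smaller of the two puts both strands in the same bucket. Procrastination is not modeled here.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def seed_matches(seq: str, pattern: str = "1101011") -> dict:
    """Group window start positions by canonical spaced-seed key.

    `pattern` is a hypothetical palindromic spaced seed: '1' positions are
    compared, '0' positions are ignored.
    """
    if pattern != pattern[::-1]:
        raise ValueError("pattern must be palindromic")
    care = [i for i, c in enumerate(pattern) if c == "1"]
    index = defaultdict(list)
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        key = "".join(window[i] for i in care)
        # Canonical form: identical for a window and its reverse complement.
        index[min(key, revcomp(key))].append(start)
    # Keep only keys that occur more than once (candidate repeats).
    return {k: v for k, v in index.items() if len(v) > 1}


def by_decreasing_multiplicity(matches: dict) -> list:
    """Order seed matches so the most frequent are processed first."""
    return sorted(matches.items(), key=lambda kv: -len(kv[1]))


if __name__ == "__main__":
    # A 7-mer, a spacer, then the 7-mer's reverse complement.
    genome = "ACGTTGA" + "GG" + revcomp("ACGTTGA")
    for key, positions in by_decreasing_multiplicity(seed_matches(genome)):
        print(key, positions)
```

Processing matches in order of decreasing multiplicity means the largest families are chained first, which is the ordering the slide describes.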

13 Step 4: Gapped Extension • After chaining a seed match, we must perform gapped extension to approximate the true repeat boundaries • This is an essential step to consider, assuming we would like to improve repeat boundary predictions • But how can this be done efficiently?

14 Our approach to gapped extension
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA.
GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

15 HMM approach to gapped extension: dynamically calculate the extension window l = 70 * e^(-0.01 * |M_i|); for example, |M_i| = 200 gives l ≈ 10
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA.
GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
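The window formula on this slide can be checked numerically. The sketch below only evaluates the expression 70 * e^(-0.01 * |M_i|) as written; the function and parameter names are hypothetical, and rounding up to a whole number of columns is an assumption chosen because it reproduces the slide's l = 10 for |M_i| = 200.

```python
import math


def extension_window(match_size: int, scale: float = 70.0, rate: float = 0.01) -> float:
    """Evaluate l = scale * exp(-rate * |M_i|) from the slide."""
    return scale * math.exp(-rate * match_size)


def extension_window_columns(match_size: int) -> int:
    """Window length in whole columns (assumption: round up)."""
    return math.ceil(extension_window(match_size))


if __name__ == "__main__":
    for size in (2, 50, 200):
        print(size, round(extension_window(size), 2), extension_window_columns(size))
```

The window shrinks as |M_i| grows, so larger matches are extended in smaller steps.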

16 HMM approach to gapped extension: use MUSCLE to perform alignment of extension window
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA.
GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

17 HMM approach to gapped extension: use HMM to detect & unalign unrelated sequence
ACAAGGGCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
TACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
TTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACGGCCC-TCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
AGGCCGGCCC-TGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
AACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGAAAGCCCC-TGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
ATTCCGCCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA.
ATTCGGCCCC-CGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

18 HMM approach to gapped extension Extension successful, continue extending

19 HMM approach to gapped extension
ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA.
ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

20 HMM approach to gapped extension: use HMM to detect & unalign unrelated sequence
ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA.
ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

21 HMM approach to gapped extension Finished leftward extension, now to the right…

22 HMM approach to gapped extension

23 HMM approach to gapped extension Perform MUSCLE alignment on window

24 HMM approach to gapped extension Use HMM to detect & unalign unrelated sequence

25 HMM approach to gapped extension Extension successful, continue extending

26 HMM approach to gapped extension Use MUSCLE to perform alignment of extension window

27 HMM approach to gapped extension Use HMM to detect & unalign unrelated sequence

28 HMM approach to gapped extension Extension failed, stop extending
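The extend, align, trim, repeat cycle walked through on the preceding slides can be summarized as a control-flow sketch. Everything here is hypothetical: `align_window` stands in for a MUSCLE run on the next window, `homologous_length` stands in for the HMM step that unaligns unrelated sequence, and the success test (the whole window is kept as homologous) is an assumption, since the slides do not state the exact criterion.

```python
from typing import Any, Callable


def extend_one_direction(
    window_size: int,
    align_window: Callable[[int, int], Any],
    homologous_length: Callable[[Any], int],
    max_rounds: int = 100,
) -> int:
    """Return the total number of alignment columns added in one direction.

    align_window(offset, window_size): hypothetical aligner call for the
        window starting `offset` columns beyond the current boundary.
    homologous_length(alignment): hypothetical HMM step returning how many
        leading columns of that window are kept as homologous.
    """
    total = 0
    for _ in range(max_rounds):
        alignment = align_window(total, window_size)
        kept = homologous_length(alignment)
        total += kept
        if kept < window_size:
            # Assumed failure test: the boundary lies inside this window.
            break
    return total


if __name__ == "__main__":
    # Toy stand-ins: homology ends 23 columns past the seed match.
    TRUE_BOUNDARY = 23
    fake_align = lambda offset, size: (offset, size)
    fake_hmm = lambda aln: max(0, min(aln[1], TRUE_BOUNDARY - aln[0]))
    print(extend_one_direction(10, fake_align, fake_hmm))
```

The same loop would be run once leftward and once rightward from the chained seed match, as the slides show.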

29 Wait a moment... • The MUSCLE alignment software always reports a global multiple alignment of the input sequences, regardless of common ancestry. • As a result, it is likely that this method forcibly aligns unrelated sequence. • Use HMMs to detect alignments of unrelated sequence.

30 Step 5: detecting unrelated sequence • The HMM consists of two hidden states, Homologous and Unrelated. • The observable states are the pairwise alignment columns, which are all possible pairs in {A,G,C,T,-} with strand and species symmetry, e.g. AG = GA = TC = CT. • The emission probabilities for each possible pair of aligned nucleotides were extracted from the HOXD substitution matrix presented by Chiaromonte et al.
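A two-state HMM of this kind is typically decoded with the Viterbi algorithm, which is used in the sketch below; the slides do not say which decoding method the authors used. The emission and transition values are toy numbers chosen for illustration, not the HOXD-derived probabilities, and gap columns are left out for brevity.

```python
import math

STATES = ("H", "U")  # Homologous, Unrelated


def emit_h(col: str) -> float:
    """Toy Homologous emissions: matches likely (4*0.22 + 12*0.01 = 1)."""
    return 0.22 if col[0] == col[1] else 0.01


def emit_u(col: str) -> float:
    """Toy Unrelated emissions: all 16 nucleotide pairs equally likely."""
    return 1.0 / 16.0


EMIT = {"H": emit_h, "U": emit_u}
TRANS = {"H": {"H": 0.9, "U": 0.1}, "U": {"H": 0.1, "U": 0.9}}  # assumed
START = {"H": 0.5, "U": 0.5}  # assumed


def viterbi(columns: list) -> str:
    """Most probable H/U label for each pairwise alignment column."""
    score = {s: math.log(START[s]) + math.log(EMIT[s](columns[0])) for s in STATES}
    backpointers = []
    for col in columns[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[best_prev] + math.log(TRANS[best_prev][s])
                            + math.log(EMIT[s](col)))
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    row_a = "ACGTACGTAC" + "AAAAAAAAAA"
    row_b = "ACGTACGTAC" + "CCGGTTCCGG"
    print(viterbi([a + b for a, b in zip(row_a, row_b)]))
```

Columns labelled U at the edge of an extension window are the ones that would be unaligned.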

31 [Figure: two-state HMM diagram (U = Unrelated, H = Homologous) with an example state path; label 0.5] • Compute emission frequencies for the Unrelated state of our HMM using the background frequencies of G/C and A/T, assuming strand and species symmetry:
U_AA = U_AT = U_TA = U_TT = (f_AT / 2) * (f_AT / 2)
U_CC = U_CG = U_GC = U_GG = (f_GC / 2) * (f_GC / 2)
U_AC = U_AG = U_TC = U_TG = (f_AT / 2) * (f_GC / 2)
U_CA = U_CT = U_GA = U_GT = (f_GC / 2) * (f_AT / 2)
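As a worked example of this slide's formulas, the sketch below derives the Unrelated emission frequencies from a single G+C fraction. The function name is hypothetical, and gap columns are not included, so the 16 nucleotide-pair frequencies sum to 1 on their own.

```python
def unrelated_emissions(gc_fraction: float) -> dict:
    """Emission frequency of each ungapped nucleotide pair in the Unrelated state.

    Under strand symmetry, A and T each have frequency f_AT / 2 and
    G and C each have frequency f_GC / 2; an unrelated pair is the product
    of the two independent base frequencies.
    """
    f_gc = gc_fraction
    f_at = 1.0 - gc_fraction
    base = {"A": f_at / 2, "T": f_at / 2, "G": f_gc / 2, "C": f_gc / 2}
    return {x + y: base[x] * base[y] for x in "ACGT" for y in "ACGT"}


if __name__ == "__main__":
    # 48% G+C, the composition of the E. coli region mentioned on the slides.
    table = unrelated_emissions(0.48)
    for pair in ("AA", "CC", "AG", "GA"):
        print(pair, round(table[pair], 4))
```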

32 [Figure: two-state HMM diagram (U = Unrelated, H = Homologous) with an example state path; label 0.5] • To empirically estimate gap-open and gap-extend values for the Unrelated state, align a 10-kb, 48% G+C content region taken from E. coli CFT073 (Accession AF , coordinates 37,300-38,300) with an unrelated sequence.

33 [Figure: two-state HMM diagram (U = Unrelated, H = Homologous) with an example state path; label 0.5] • Align the unrelated sequences with MUSCLE and count the number of gap-open and gap-extend columns in the resulting alignment.
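Counting gap columns in a pairwise alignment could look like the sketch below. The definitions are assumptions: a gap-open column is the first column of a run of gaps in one row, a gap-extend column is any later column of the same run, and a gap that switches rows counts as a new open. The function name is hypothetical.

```python
def count_gap_columns(row_a: str, row_b: str) -> tuple:
    """Return (gap_open_columns, gap_extend_columns) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    opens = extends = 0
    prev_gap = None  # which row was gapped in the previous column, if any
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # all-gap column carries no pairwise information
        gap = "a" if a == "-" else "b" if b == "-" else None
        if gap is not None:
            if gap == prev_gap:
                extends += 1
            else:
                opens += 1
        prev_gap = gap
    return opens, extends


if __name__ == "__main__":
    opens, extends = count_gap_columns("AC--GT-A", "ACTTG-CA")
    columns = len("AC--GT-A")
    print("gap-open frequency:", opens / columns)
    print("gap-extend frequency:", extends / columns)
```

Dividing the counts by the number of alignment columns gives empirical frequencies that can serve as gap emission values for the corresponding HMM state.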

34 [Figure: two-state HMM diagram (U = Unrelated, H = Homologous) with an example state path that switches from U to H; label 0.5] • Gap-open and gap-extend frequencies for the Homologous state were estimated by constructing an alignment of 10 kb of orthologous sequence shared between a pair of divergent organisms.

35 [Figure: two-state HMM diagram (U = Unrelated, H = Homologous) with an example state path that switches from U to H; label 0.5]