1 Repeats!

2 Introduction
- A repeat family is a collection of repeats that appear multiple times in a genome.
- Our objective is to identify all families of interspersed repeats within a single genome.

3 Challenges when identifying repeat families
- Regions containing repeat occurrences are not known a priori
- Repeat boundaries are not known a priori
- Many repeat occurrences appear as partial copies

4 Why are repeats important?
Repeats have been implicated in:
- Genome rearrangements (Kazazian, 2004; Achaz et al., 2003)
- Accelerated loss of gene order (Rocha et al., 2003)
- Creation of novel biological functions (Lynch et al., 2002)
- Increased rate of evolution under stress (Capy et al., 2000)

5 Identifying repeats de novo
Assume we get a new genome and know nothing about it. We can:
- Use a database of known repeats (RepeatMasker/RepBase)
  - novel repeat elements may not be in the database
  - repetitive gene families are never in the database
- Identify repeats de novo using sequence analysis

6 Existing methods for detection of repeat families
Nearly all existing algorithms for de novo identification of repeat families rely on a set of pairwise similarities:
- REPuter (Kurtz et al., 2000)
- RepeatFinder (Volfovsky et al., 2001)
- RECON (Bao and Eddy, 2002)
- RepeatGluer (Pevzner et al., 2004)
- PILER (Edgar and Myers, 2005)
- RepeatScout (Price et al., 2005)

7 Mutational forces at play
Over time, indels & substitutions will affect copies of repeat families:
AGGCTACCCCTTTAGGCTAGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT
AGGCTGCCCCTTTAGGCTDGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT
AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCCTATT
AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCDTATT
AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCTATT
- Require alignments (& gaps) to attempt to reconstruct true repeat boundaries

8 de novo repeat detection
- One approach: self-search with a pairwise local-alignment tool such as BLAST
- The number of pairwise alignments grows as $O(r^2)$ in the copy number $r$ of the repeat
- Inherent difficulty defining repeat boundaries among collections of pairwise alignments
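As a worked count of the quadratic growth mentioned on this slide, assuming (as an upper bound) that every pair of copies in a family produces one pairwise alignment:

```latex
\binom{r}{2} \;=\; \frac{r(r-1)}{2} \;=\; O(r^2),
\qquad \text{e.g. } r = 100 \;\Rightarrow\; \binom{100}{2} = \frac{100 \cdot 99}{2} = 4950 .
```

A family with 100 copies can therefore yield thousands of pairwise alignments whose boundaries all have to be reconciled.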

9 Alternative methods?
- Local multiple alignment
  - A single local multiple alignment uses O(N) space for a genome of length N
An example local multiple alignment:
1. AACAAGCA-A-ACTTTTATCCATGGTCGTGGTACAGAGGGGTC
2. AACAAGCA-A-ACTTTTGTCCATGGTCGTGGTACAGAGTGGTC
3. AACATGCAGA-ACTTTTATCCATGGTCGTCGTACAGAGGGGT-
4. AACAAGCAGACACTTTTATCCATGGTCGTGGTAC---------
5. AACAAGCA----CTTTTATCCATAGTCGTGGTA----------
6. ------------CTTTTATCCATGGTCGTGGTACAGAGGGGTC

10 Local multiple alignment
- Local multiple alignment has the inherent potential to avoid pitfalls associated with pairwise alignment.
- But multiple alignment under the sum-of-pairs (SP) objective function remains intractable…
- Progressive alignment heuristics offer excellent speed and accuracy (e.g. MUSCLE).
- So why not directly construct a multiple alignment?
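For readers unfamiliar with the SP objective, here is a minimal sketch of how an SP score is computed for a given multiple alignment. The scoring values (`MATCH`, `MISMATCH`, `GAP`) and the function names are illustrative assumptions, not values from the slides. Scoring a fixed alignment is cheap. It is finding the alignment that maximizes this score that is intractable.

```python
from itertools import combinations

# Hypothetical scoring scheme for illustration only (not from the slides).
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a: str, b: str) -> int:
    """Score one pair of aligned characters; a gap facing a gap scores 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sp_score(rows: list[str]) -> int:
    """Sum pair_score over every pair of rows in every alignment column."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all alignment rows must have the same length")
    return sum(
        pair_score(a, b)
        for column in zip(*rows)
        for a, b in combinations(column, 2)
    )


if __name__ == "__main__":
    toy_alignment = ["ACG-T", "ACGAT", "A-GAT"]
    print(sp_score(toy_alignment))
```

Each column contributes one term per pair of rows, so the cost of scoring is quadratic in the number of rows and linear in the alignment length.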

12 Steps 1-3: Chaining seeds from the input sequence
The method incorporated three novel ideas:
(1) palindromic spaced seed patterns to match both DNA strands simultaneously,
(2) seed extension (chaining) in order of decreasing multiplicity, and
(3) procrastination when low-multiplicity matches are encountered.
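Idea (1) can be illustrated with a small sketch. It assumes a seed pattern written as a string of `1` (position must match) and `0` (don't care). The pattern `1101011`, the `*` masking character, and the function names are hypothetical choices for illustration, since the slides do not give the actual patterns. Because the pattern reads the same in both directions, a window and its reverse complement are masked at the same positions, so one canonical key covers both strands.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def seed_key(window: str, pattern: str) -> str:
    """Keep bases at '1' positions of the pattern and mask the rest with '*'."""
    return "".join(b if p == "1" else "*" for b, p in zip(window, pattern))


def index_seeds(seq: str, pattern: str) -> dict:
    """Map each canonical spaced-seed key to its (position, strand) hits."""
    if pattern != pattern[::-1]:
        raise ValueError("pattern must be palindromic")
    k = len(pattern)
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        fwd = seed_key(window, pattern)
        rev = seed_key(reverse_complement(window), pattern)
        # The smaller of the two keys is the canonical one, so a repeat copy
        # on either strand lands in the same bucket.
        strand = "+" if fwd <= rev else "-"
        index[min(fwd, rev)].append((i, strand))
    return index


if __name__ == "__main__":
    # Hypothetical toy sequence: a 7-mer, a spacer, and a reverse-strand copy
    # that differs only at the two don't-care positions.
    toy = "AAGCTCC" + "TTTT" + "GGTGATT"
    for key, hits in index_seeds(toy, "1101011").items():
        if len(hits) > 1:
            print(key, hits)
```

Buckets with more than one hit are candidate seed matches. Their size gives the multiplicity that idea (2) would use to order extension.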

13 Step 4: Gapped Extension
- After chaining a seed match, we must perform gapped extension to approximate the true repeat boundaries
- This is an essential step to consider, assuming we would like to improve repeat boundary predictions
- But how can this be done efficiently?

14 Our approach to gapped extension
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

15 HMM approach to gapped extension
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
Dynamically calculate the extension window: $l = 70\,e^{-0.01\,|M_i|}$
Example: $|M_i| = 200$, $l = 10$
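The window formula on this slide can be checked numerically with a short sketch. The function and parameter names are hypothetical, and `match_size` stands in for whatever quantity |M_i| denotes, which the slide does not define. For |M_i| = 200 the formula gives roughly 9.47, which agrees with the slide's l = 10 if the value is rounded up. That rounding rule is an assumption.

```python
import math


def extension_window(match_size: float, scale: float = 70.0,
                     decay: float = 0.01) -> float:
    """Evaluate l = scale * exp(-decay * match_size) from the slide."""
    return scale * math.exp(-decay * match_size)


if __name__ == "__main__":
    for size in (2, 50, 200):
        raw = extension_window(size)
        # Rounding up to whole columns is an assumption, not stated on the slide.
        print(size, round(raw, 2), math.ceil(raw))
```

The window shrinks exponentially as |M_i| grows, which limits how many columns are handed to the aligner at each extension step for large matches.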

16 HMM approach to gapped extension
TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
Use MUSCLE to perform alignment of extension window

17 HMM approach to gapped extension
ACAAGGGCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
TACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
TTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACGGCCC-TCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
AGGCCGGCCC-TGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
AACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
ATTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGAAAGCCCC-TGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
ATTCCGCCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
ATTCGGCCCC-CGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
Use HMM to detect & unalign unrelated sequence

18 HMM approach to gapped extension
Extension successful, continue extending

19 HMM approach to gapped extension
ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

20 HMM approach to gapped extension
ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA
ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA
TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC
AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG
AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA
ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC
TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC
GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG
ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA
ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
Use HMM to detect & unalign unrelated sequence

21 HMM approach to gapped extension
Finished leftward extension, now to the right…

22 HMM approach to gapped extension

23 HMM approach to gapped extension
Perform MUSCLE alignment on window

24 HMM approach to gapped extension
Use HMM to detect & unalign unrelated sequence

25 HMM approach to gapped extension
Extension successful, continue extending

26 HMM approach to gapped extension
Use MUSCLE to perform alignment of extension window

27 HMM approach to gapped extension
Use HMM to detect & unalign unrelated sequence

28 HMM approach to gapped extension
Extension failed, stop extending

29 Wait a moment...
- The MUSCLE alignment software reports the highest-scoring global multiple alignment of the input sequences, regardless of common ancestry.
- As a result, it is likely that this method forcibly aligns unrelated sequence.
- Solution: use HMMs to detect alignments of unrelated sequence.

30 Step 5: detecting unrelated sequence
- The HMM consists of two hidden states, Homologous and Unrelated.
- The observable states are the pairwise alignment columns: all possible pairs over {A,G,C,T,-}, with strand and species symmetry, i.e. AG=GA=TC=CT.
- The emission probabilities for each possible pair of aligned nucleotides were extracted from the HOXD substitution matrix presented by Chiaromonte et al.
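A minimal sketch of decoding such a two-state HMM with the Viterbi algorithm follows. It simplifies the observation alphabet to three column classes (match, mismatch, gap) instead of the per-pair HOXD-derived emissions the slide describes. Every probability below is a made-up placeholder, except that the 0.5 start probability is borrowed from the label in the state diagram, which is itself an interpretation.

```python
import math

STATES = ("H", "U")  # Homologous, Unrelated

# Hypothetical parameters for illustration only; not the values used by the method.
START = {"H": 0.5, "U": 0.5}
TRANS = {"H": {"H": 0.95, "U": 0.05}, "U": {"H": 0.05, "U": 0.95}}
EMIT = {
    "H": {"match": 0.80, "mismatch": 0.15, "gap": 0.05},
    "U": {"match": 0.25, "mismatch": 0.55, "gap": 0.20},
}


def classify(a: str, b: str) -> str:
    """Reduce one pairwise alignment column to a coarse observation class."""
    if a == "-" or b == "-":
        return "gap"
    return "match" if a == b else "mismatch"


def viterbi(row_a: str, row_b: str) -> str:
    """Return the most probable H/U state path for a pairwise alignment."""
    obs = [classify(a, b) for a, b in zip(row_a, row_b)]
    if not obs:
        return ""
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    backpointers = []
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    # Toy pair: ten mismatching columns followed by twenty matching columns.
    print(viterbi("A" * 30, "C" * 10 + "A" * 20))
```

Columns labelled `U` in the decoded path would be the ones to unalign. The boundary between the `U` and `H` runs is the estimated repeat boundary.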

31 [Figure: HMM state diagram; labels: U H UUUU 0.5]
- Compute emission frequencies for the Unrelated state of our HMM using the background frequencies of G/C and A/T, assuming strand and species symmetry:
$U_{AA} = U_{AT} = U_{TA} = U_{TT} = \frac{f_{AT}}{2} \cdot \frac{f_{AT}}{2}$
$U_{CC} = U_{CG} = U_{GC} = U_{GG} = \frac{f_{GC}}{2} \cdot \frac{f_{GC}}{2}$
$U_{AC} = U_{AG} = U_{TC} = U_{TG} = \frac{f_{AT}}{2} \cdot \frac{f_{GC}}{2}$
$U_{CA} = U_{CT} = U_{GA} = U_{GT} = \frac{f_{GC}}{2} \cdot \frac{f_{AT}}{2}$
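The Unrelated-state emission frequencies on this slide amount to multiplying per-base background frequencies, with A and T each taking half of the A/T fraction and G and C each taking half of the G/C fraction. A minimal sketch follows. The function name is hypothetical, and the 0.48 G+C value is only an example input taken from the 48% G+C region mentioned on the next slide. Gap columns are not covered here.

```python
from itertools import product


def unrelated_emissions(gc_content: float) -> dict[str, float]:
    """Emission frequency of each ordered nucleotide pair under the Unrelated state."""
    if not 0.0 <= gc_content <= 1.0:
        raise ValueError("gc_content must be between 0 and 1")
    f_gc = gc_content
    f_at = 1.0 - gc_content
    # Strand symmetry: A and T share f_AT equally, G and C share f_GC equally.
    base_freq = {"A": f_at / 2, "T": f_at / 2, "G": f_gc / 2, "C": f_gc / 2}
    # Unrelated bases are treated as independent draws from the background.
    return {x + y: base_freq[x] * base_freq[y] for x, y in product("ACGT", repeat=2)}


if __name__ == "__main__":
    emissions = unrelated_emissions(0.48)
    for pair in ("AA", "CC", "AC", "CA"):
        print(pair, round(emissions[pair], 4))
    print("total", round(sum(emissions.values()), 6))
```

The sixteen values sum to $(f_{AT} + f_{GC})^2 = 1$, which is a quick sanity check on the four equation groups.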

32 [Figure: HMM state diagram; labels: UU H UUUUUU 0.5]
- To empirically estimate gap-open and gap-extend values for the Unrelated state, align a 10-kb, 48% G+C content region taken from E. coli CFT073 (Accession AF447814.1, coordinates 37,300-38,300) with an unrelated sequence.

33 [Figure: HMM state diagram; labels: UU H UUUUUUUUUUUU 0.5]
- We aligned the unrelated sequences with MUSCLE and counted the number of gap-open and gap-extend columns in the resulting alignment.
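The counting step could be sketched as below for one pairwise alignment. The slides do not define gap-open and gap-extend columns precisely, so this assumes that the first column of a gap run in a given row is a gap-open column and every further column of the same run is a gap-extend column. The function name and the toy rows are hypothetical.

```python
def gap_column_counts(row_a: str, row_b: str) -> tuple[int, int]:
    """Count (gap_open, gap_extend) columns in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    opens = extends = 0
    prev_gap_row = None  # row that carried a gap in the previous column, if any
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # ignore all-gap columns; they carry no pairwise information
        gap_row = "a" if a == "-" else "b" if b == "-" else None
        if gap_row is not None:
            if gap_row == prev_gap_row:
                extends += 1  # same gap run continues
            else:
                opens += 1  # a new gap run starts in this row
        prev_gap_row = gap_row
    return opens, extends


if __name__ == "__main__":
    opens, extends = gap_column_counts("AC--GT-A", "ACGTG-TA")
    columns = len("AC--GT-A")
    print(opens, extends, opens / columns, extends / columns)
```

Dividing the counts by the number of alignment columns gives empirical gap-open and gap-extend frequencies of the kind the slide describes.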

34 [Figure: HMM state diagram; labels: UU H UUUUUUUUUUUUUUUUUUHHHHHHHHHHHHHHHHHHHHHHHHHHHHH 0.5]
- Gap-open and gap-extend frequencies for the Homologous state were estimated by constructing an alignment of 10 kb of orthologous sequence shared among a pair of divergent organisms.

35 [Figure: HMM state diagram; labels: UU H UUUUUUUUUUUUUUUUUUHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH 0.5]

