Transposable Elements (TE) in genomic sequence Mina Rho.

Presentation on theme: "Transposable Elements (TE) in genomic sequence Mina Rho."— Presentation transcript:

1 Transposable Elements (TE) in genomic sequence Mina Rho

2 Contents
- Definition
- De novo identification of repeat families in large genomes (RepeatScout)
  Alkes L. Price, Neil C. Jones and Pavel A. Pevzner
- Combined Evidence Annotation of Transposable Elements in Genome Sequences
  Hadi Quesneville, Casey M. Bergman, Olivier Andrieu, Delphine Autard, Danielle Nouaud, Michael Ashburner, Dominique Anxolabehere

3 Mobile element/Transposable element
Transposon
- a segment of DNA that can move to different positions in the genome of a single cell.
- cut out of its location and inserted into a new location (cut and paste).
- the mobile intermediate consists of DNA.
Retrotransposon
- copied and pasted into a new location.
- the copy is made of RNA and reverse-transcribed back into DNA by reverse transcriptase.
- LTR retrotransposons have long terminal repeats (LTRs) at their ends.
=> Studying TEs is expected to yield information about evolution, mutation, and changes in the amount of DNA in the genome.

6 RepeatScout

7 Definition
Repeat family: a collection of similar sequences that appear many times in a genome.
– the Alu repeat family has over 1 million approximate occurrences in the human genome
– repeats make up ~50% of the human genome
l-mer: a substring of length l

8 Background
The current status of identification methods for repeat families
– Given an existing library of repeat families
  RepeatMasker
– De novo identification
  REPuter (Kurtz et al., 2000)
  RepeatFinder (Volfovsky et al., 2001)
  RECON (Bao and Eddy, 2002)
  RepeatGluer (Pevzner et al., 2004)
  PILER (Edgar and Myers, 2005)
  RepeatScout

9 Overview of RepeatScout
Method
– Builds a table of high-frequency l-mers as seeds
– Extends each seed to a longer consensus sequence
Main advantage
– an efficient method of similarity search that enables a rigorous definition of repeat boundaries.

10 How to create l-mer table
[Figure: the sequence is scanned position by position (i, i+1, i+2, ..., j, k); each l-mer (l-mer 1 to l-mer 6) is an entry in a hash table that stores its frequency and the position of its last occurrence.]

11 Output of l-mer table
l-mer            frequency  position of last occurrence
AAAAAAAAAAAGATA          8  2920943
AAAAAAAGGAAAGAA          5  2468525
AGGCTTGAACAATGG          3  1425014
AAAAAAAAGAAAGAA         62  3009663
GTTGGTTTCAAAGAA          7  2855871
AAAAAAAATTTTTTT         22  2992836
ATTCAAGTTAAATGG          4  1473342
ATTCAATGTAACCAC          3  1463008
ATGCATGCAATGCAT          9  1788944
ATGCATTTAAAAGAA          3  1464381
AAAAAACTCACTCCA          5  1489159
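The table-building step can be sketched in Python as follows. This is only a minimal illustration, not RepeatScout's actual implementation: the function names `build_lmer_table` and `high_frequency_lmers` are hypothetical, positions are 0-based, and every occurrence is counted (including overlapping ones), which is an assumption rather than something the slides state.

```python
def build_lmer_table(sequence, l):
    """Map each l-mer to [frequency, position of last occurrence].

    Assumption: every occurrence is counted, overlapping ones included.
    """
    table = {}
    for i in range(len(sequence) - l + 1):
        lmer = sequence[i:i + l]
        entry = table.get(lmer)
        if entry is None:
            table[lmer] = [1, i]      # first time this l-mer is seen
        else:
            entry[0] += 1             # one more occurrence
            entry[1] = i              # remember the latest position
    return table


def high_frequency_lmers(table, min_count):
    """Return (l-mer, frequency, last position) rows, most frequent first."""
    rows = [(lmer, freq, pos) for lmer, (freq, pos) in table.items()
            if freq >= min_count]
    return sorted(rows, key=lambda row: -row[1])


if __name__ == "__main__":
    table = build_lmer_table("ACGTACGTAC", 4)
    for lmer, freq, pos in high_frequency_lmers(table, 2):
        print(lmer, freq, pos)
```

A dictionary keyed by the l-mer string plays the role of the hash table here; for genome-scale input a more compact encoding of the l-mer would likely be needed.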

12 How to build all positions of repeats
[Figure: the hash table of l-mers (l-mer 1 to l-mer 6), with each l-mer linked to the positions in the sequence (i, i+1, i+2, ..., j, k) at which it occurs.]

13 [Figure: sequences S1 to S5 aligned around a high-frequency l-mer (l-mer 1). The query sequence Q (Q1 to Q4) is extended one nucleotide at a time, maximizing the objective function.]

14 Objective Function
|Q|: the length of Q
C: minimum threshold on the number of repeat elements
a(Q, S_k): a pairwise fit-preferred alignment score
p: incomplete-fit penalty
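In terms of these symbols, the objective can be written as below. This form is taken from the RepeatScout paper (Price et al., 2005) rather than from the slide itself, so treat the exact expression as an assumption; the penalty p does not appear explicitly because it is applied inside the alignment score a(Q, S_k).

```latex
A(Q) \;=\; \sum_{k} \max\bigl\{\, a(Q, S_k),\; 0 \,\bigr\} \;-\; C\,|Q|
```

Read this way, extending Q by one nucleotide costs C, so an extension only pays off when enough of the S_k still align well with the extended region.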

15 Output of optimized Q
>R=0
GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACTTGAGGTCAGGAGTTC
GAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGCCGGGCGTGGTGGCGCGCGCCTG
TAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAGGCGGAGGTTGCAGTGAGCCGAGATCGCG
CCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAA
>R=1
AAAAGGCAGCAGAAACCTCTGCAGACTTAAATGTCCCTGTCTGACAGCTTTGAAGAGAGTAGTGGTTCTCCCAGCACGCA
GCTGGAGATCTGAGAACGGACAGACTGCCTCCTCAAGTGGATCCCTGACCCCCGAGTAGCCTAACTGGGAGGCACCCCCC
AGTAGGGGCAGACTGACACCTCACACGGCCAGGTACTCCTCTGAGAAAAAACTTCCAGAGGAACAATCAGGCAGCAACAT
TTGCTGCTCACCAATATCCACTGTTCTGCAGCCTCTGCTGCTGATACCCAGGCAAACAGGGTCTGGAGTGGACCTCCAGC
AAACTCCAACAGACCTGCAGCTGAGGGTCCTGTCTGTTAGAAGGAAAACTAACAAACAGAAAGGACATCCACACCAAAAA
CCCATCTGTACGTCACCATCATCAAAGACCAAAAGTAGATAAAACCACAAAGATGGGGAAAAAACAGAGCAGAAAAACTG
GAAACTCTAAAAAGCAGAGCGCCTCTCCTCCTCCAAAGGAACGCAGCTCCTCACCAGCAACGGAACAAAGCTGGACGGAG
AATGACTTTGATGAGTTGAGAGAAGAAGGCTTCAGATGATCAAACTACTCCAAGCTAAAGGAGGAAATTCAAACCCATGG
CAAAGAAGTTAAAAACCTTGAAAAAAAATTAGACGAATGGATAACTAGAATAACCAATGCAGAGAAGTCCTTAAAGGAGC
TGATGGAGCTGAAAACCAAGGCTCGAGAACTACGTGAAGAATGCACAAGCCTCAGGAGCCGATGCGATCAACTGGAAGAA
AGGGTATCAGTGATGGAAGATCAAATGAATGAAATGAAGTGAGAAGAGAAGTTTAGAGAAAAAAGAATAAAAAGAAATGA
>R=2
TTTTTTTTTTTTTTTAGATGCGGGGTGTCACTGTGTTGCTCAGGCTGGTCTCAAACTCCTGGGCTCAAGTGATCCTCCCA
CCTCAGCCTCTTTAATAGATGCGATTA
>R=3
TTTTTATACATGCTGTAGACAATCAATTCACACCTGTACTTTTTTTTAAGGTTGTGTTATTGCACTTTTATACCTCTTGA
CTGGTAGCTGATTTCCTTGAATACCTGTAAGGTAATCACCGGCTCACCAATGAATGTGGTTTTAACAATGGCTCACAGTG
GCTTGGAAAGCCCTCATGGGAAGTATTTCTGAGGAAAAGTGGAGAGTGTGCAGGAATAGTTTTGAAAAACAGAGACAACC
GATGTCCTCCTTCCCTCCCTTGCCTCTCCTCATGTGCCAGGTTTTCTGTTTTCTCCACTATTACAGAATCACCATGTTGT
ATCCTGTGATGAAAAGTTTTTATCTCTTTAATCATCCCATTTCGTCCTCCAGACCTTTTTTTTTCTGGAAGGGTTGTAAG
CAGAAGGGACGAAACATCTTCAGAAAAACACATTATGATATAAACTTAGTGAAAAGATTCATCATATTTAAGAAATGGAC
AGGATGAAATCCTGAATTCATAAAAATTTTAAAAATCAGTTTACATAACATCCATCCCTTTTGTCTCTATCCCTTATCCA

16 Parameter setting and post processing
Parameter setting
– Recommended smallest l: 15
– For the arbitrary length L,
– Q is extended up to 10,000 bp on each side
– Remove repeat families with |Q| < 50 bp
Postprocessing
– Tandem Repeats Finder, NSEG: remove repeat families with >50% of their length annotated as low-complexity and tandem repeats
– RepeatMasker: mask the repeat families based on the library
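The two removal rules above (consensus shorter than 50 bp, or more than 50% of the length annotated as low-complexity or tandem repeat) can be expressed as a small filter. The sketch below is illustrative only: `keep_family` and its parameters are hypothetical names, and it assumes the number of annotated bases has already been obtained from tools such as Tandem Repeats Finder and NSEG.

```python
def keep_family(consensus_length, annotated_bases,
                min_length=50, max_annotated_fraction=0.5):
    """Decide whether a repeat family survives post-processing.

    consensus_length: length of the consensus Q in bp.
    annotated_bases:  bases of Q annotated as low-complexity or tandem
                      repeat (assumed to come from an external tool).
    """
    if consensus_length < min_length:
        return False                      # consensus too short
    fraction = annotated_bases / consensus_length
    return fraction <= max_annotated_fraction  # drop if more than 50%


if __name__ == "__main__":
    # Hypothetical families: (name, consensus length, annotated bases)
    families = [("R=0", 309, 27), ("R=2", 107, 15), ("short", 40, 0)]
    kept = [name for name, length, annotated in families
            if keep_family(length, annotated)]
    print(kept)
```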

17 Benchmark
C. briggsae genome (108 Mb): 7 h on a single 0.5 GHz DEC Alpha processor

18 Combined evidence model of TE

19 Overview
Query sequences: Drosophila melanogaster (fruit fly) Release 3, 4
Combined evidence model: pipeline of RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, and TE-HMM
- Methods for the annotation of known TE families
- Methods for the annotation of anonymous TE families
Benchmark: FlyBase Release 3.1 annotation
- Sensitivity and specificity, characteristics of boundaries

20 Tools
BLASTER
– Compares query sequences against a subject databank.
– Launches one of the BLAST programs (BLASTN, TBLASTN, BLASTX, TBLASTX).
– Cuts long sequences before launching BLAST and reassembles the results.
MATCHER
– Maps match results onto query sequences by filtering overlapping hits.
– Keeps the match results with E-value 20
– Chains the remaining matches by dynamic programming.
GROUPER
– Gathers similar sequences into groups
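To illustrate what chaining matches by dynamic programming can look like, here is a generic sketch. It is not MATCHER's actual algorithm: the `Hit` record, the compatibility rule (a hit may follow another only if it starts after it on both query and subject), and the plain sum-of-scores objective with no gap penalty are all assumptions made for the example.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """A hypothetical match record on one strand (half-open coordinates)."""
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    score: float


def best_chain(hits: List[Hit]) -> List[Hit]:
    """Return the highest-scoring chain of collinear, non-overlapping hits."""
    ordered = sorted(hits, key=lambda h: (h.q_start, h.q_end))
    n = len(ordered)
    if n == 0:
        return []
    best = [h.score for h in ordered]   # best chain score ending at hit j
    prev = [-1] * n                     # back-pointer to the previous hit
    for j in range(n):
        for i in range(j):
            a, b = ordered[i], ordered[j]
            compatible = a.q_end <= b.q_start and a.s_end <= b.s_start
            if compatible and best[i] + b.score > best[j]:
                best[j] = best[i] + b.score
                prev[j] = i
    # Trace back from the best-scoring end point.
    j = max(range(n), key=best.__getitem__)
    chain = []
    while j != -1:
        chain.append(ordered[j])
        j = prev[j]
    chain.reverse()
    return chain


if __name__ == "__main__":
    hits = [Hit(0, 100, 0, 100, 50), Hit(120, 200, 110, 190, 40),
            Hit(90, 150, 300, 360, 30), Hit(210, 300, 50, 140, 45)]
    for hit in best_chain(hits):
        print(hit)
```

The quadratic loop keeps the example short; a production tool would likely use a faster chaining strategy and penalize gaps between chained hits.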

21 Measures
For each nucleotide:
TP: correctly annotated as belonging to a TE
FP: falsely predicted as belonging to a TE
TN: correctly annotated as not belonging to a TE
FN: falsely predicted as not belonging to a TE
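These per-nucleotide counts are typically turned into sensitivity and specificity. The sketch below assumes the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), which the slides do not spell out; the function names and the representation of annotations as per-base boolean masks are choices made for the example.

```python
def confusion_counts(predicted, reference):
    """Count TP, FP, TN, FN over two per-nucleotide boolean masks.

    predicted[i] / reference[i] is truthy when base i is annotated as TE.
    """
    if len(predicted) != len(reference):
        raise ValueError("masks must cover the same number of nucleotides")
    tp = fp = tn = fn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1
        elif p and not r:
            fp += 1
        elif not p and r:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Fraction of true TE nucleotides that were predicted as TE."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Fraction of non-TE nucleotides that were predicted as non-TE."""
    return tn / (tn + fp) if (tn + fp) else 0.0


if __name__ == "__main__":
    predicted = [1, 1, 0, 0, 1, 0]
    reference = [1, 0, 0, 1, 1, 0]
    tp, fp, tn, fn = confusion_counts(predicted, reference)
    print(tp, fp, tn, fn, sensitivity(tp, fn), specificity(tn, fp))
```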

24 Method for the Annotation of known TE families
- BLASTER using BLASTN and MATCHER (BLRn)
- RepeatMasker (RM)
- RepeatMasker with MATCHER (RMm)

25 Method for the Annotation of known TE families
- BLASTER using BLASTN and MATCHER (BLRn)
- RepeatMasker (RM)
- RepeatMasker with MATCHER (RMm)
- RepeatMasker-BLASTER (RMBLR): combines the hits from both BLRn and RM and passes them to MATCHER

26 Method for the Annotation of anonymous TE families
- all-by-all comparison with BLASTER using BLASTN, MATCHER, and GROUPER
- RECON
- BLASTER using TBLASTX and MATCHER
- HMM

27 What they (we) learned
Overall, BLRn outperforms RM in the precise determination of TE boundaries.
RM is more sensitive for the detection of small and divergent TEs.
The differences between BLRn and RM make them complementary for TE annotation.
A combined-evidence framework can improve the quality and confidence of TE annotation.

28 Pipeline structure
TE detection software: BLASTER, RepeatMasker, TE-HMM, and RECON
Tandem repeat detection software: RepeatMasker, Tandem Repeats Finder (TRF), Mreps
Database: MySQL
Job scheduling: Open Portable Batch System
The whole genomic sequence was segmented into chunks of 200 kb overlapping by 10 kb.
The results from the different tools were stored in the database.
An XML file is generated from the stored results and loaded into the Apollo genome annotation tool.
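The segmentation step (200 kb chunks overlapping by 10 kb) can be sketched as follows. The function name `chunk_coordinates` and the use of 0-based, half-open coordinates are assumptions for the example, not details from the pipeline.

```python
def chunk_coordinates(sequence_length, chunk_size=200_000, overlap=10_000):
    """Return (start, end) pairs covering the sequence with overlapping chunks.

    Coordinates are 0-based and half-open; consecutive chunks share
    `overlap` bases, so each new chunk starts chunk_size - overlap bases
    after the previous one.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if sequence_length <= 0:
        return []
    step = chunk_size - overlap
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, sequence_length)
        chunks.append((start, end))
        if end == sequence_length:   # last chunk reaches the end
            break
        start += step
    return chunks


if __name__ == "__main__":
    for start, end in chunk_coordinates(450_000):
        print(start, end)
```

The overlap means a TE that straddles a chunk boundary is still seen whole in at least one chunk, provided it is shorter than the overlap; results from overlapping regions then need to be merged when they are mapped back to genome coordinates.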

29 The Annotation Pipeline

