Transposable Elements (TE) in genomic sequence Mina Rho.


Contents
- Definition
- De novo identification of repeat families in large genomes (RepeatScout). Alkes L. Price, Neil C. Jones and Pavel A. Pevzner
- Combined Evidence Annotation of Transposable Elements in Genome Sequences. Hadi Quesneville, Casey M. Bergman, Olivier Andrieu, Delphine Autard, Danielle Nouaud, Michael Ashburner, Dominique Anxolabehere

Mobile element / Transposable element
Transposon
- a segment of DNA that can move to different positions in the genome of a single cell.
- cut and paste: it is cut out of its location and inserted into a new location.
- the element that moves consists of DNA.
Retrotransposon
- copy and paste: a copy is inserted into a new location.
- the copy is made of RNA and reverse-transcribed back into DNA by reverse transcriptase.
- LTR retrotransposons carry long terminal repeats (LTRs) at their ends.
=> Studying TEs is expected to give information about evolution, mutation, and changes in the amount of DNA in the genome.

RepeatScout

Definition
Repeat family: a collection of similar sequences that appear many times in a genome.
- the Alu repeat family has over 1 million approximate occurrences in the human genome
- repeats make up ~50% of the human genome
l-mer: a substring of length l

Background: current methods for identifying repeat families
- Given an existing library of repeat families
  - RepeatMasker
- De novo identification
  - REPuter (Kurtz et al., 2000)
  - RepeatFinder (Volfovsky et al., 2001)
  - RECON (Bao and Eddy, 2002)
  - RepeatGluer (Pevzner et al., 2004)
  - PILER (Edgar and Myers, 2005)
  - RepeatScout

Overview of RepeatScout
Method
- Builds a table of high-frequency l-mers as seeds
- Extends each seed to a longer consensus sequence
Main advantage
- an efficient method of similarity search that enables a rigorous definition of repeat boundaries.

How to create the l-mer table
[Figure: the sequence is scanned position by position (i, i+1, i+2, ..., j, k). Each l-mer (l-mer 1 to l-mer 6) is an entry in a hash table that stores its frequency and the position of its last occurrence.]
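The table described on this slide can be sketched in a few lines of Python. This is an illustration only, not RepeatScout's actual code: the function name `build_lmer_table` and the `skip_overlaps` switch are assumptions. The sketch assumes the stored last position is used to avoid counting overlapping occurrences, and it ignores reverse complements.

```python
def build_lmer_table(seq, l, skip_overlaps=True):
    """Map each l-mer to [frequency, position of last counted occurrence].

    Hypothetical helper: with skip_overlaps=True, an occurrence that
    overlaps the previously counted one is ignored (an assumption about
    why the last position is kept in the table).
    """
    table = {}
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        if lmer not in table:
            table[lmer] = [1, i]          # first time this l-mer is seen
            continue
        entry = table[lmer]
        if skip_overlaps and i < entry[1] + l:
            continue                      # overlaps the last counted copy
        entry[0] += 1                     # one more occurrence
        entry[1] = i                      # remember where it was
    return table


table = build_lmer_table("ACGTACGTAC", 4)
print(table["ACGT"])  # frequency and last position of ACGT
```

Keeping only the frequent entries of this table gives the seed l-mers that the next step extends.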

Output of l-mer table
AAAAAAAAAAAGATA
AAAAAAAGGAAAGAA
AGGCTTGAACAATGG
AAAAAAAAGAAAGAA
GTTGGTTTCAAAGAA
AAAAAAAATTTTTTT
ATTCAAGTTAAATGG
ATTCAATGTAACCAC
ATGCATGCAATGCAT
ATGCATTTAAAAGAA
AAAAAACTCACTCCA

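How to build all positions of repeats
[Figure: the sequence is scanned again (i, i+1, i+2, ..., j, k), and the hash table of l-mers (l-mer 1 to l-mer 6) is used to collect, for each l-mer, the list of all positions where it occurs.]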
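A minimal sketch of this second pass, assuming a simple dictionary of position lists stands in for the hash table. The name `lmer_positions` and the `min_freq` cutoff are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def lmer_positions(seq, l, min_freq=2):
    """Return {l-mer: [all start positions]} for l-mers seen >= min_freq times."""
    positions = defaultdict(list)
    for i in range(len(seq) - l + 1):
        positions[seq[i:i + l]].append(i)
    # keep only the l-mers frequent enough to serve as seeds
    return {lmer: pos for lmer, pos in positions.items() if len(pos) >= min_freq}


print(lmer_positions("ACGTACGTAC", 4))
```

Each position list marks the genomic copies around which a seed would be extended.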

Extending the consensus Q
[Figure: genomic copies S1 to S5 each contain the high-frequency l-mer (query sequence with l-mer 1). The consensus is extended through Q1, Q2, Q3, Q4, one nucleotide at a time, choosing at each step the extension that maximizes the objective function.]

Objective Function
- |Q|: the length of Q
- C: minimum threshold on the number of repeat elements
- a(Q, S_k): a pairwise fit-preferred alignment score
- p: incomplete-fit penalty
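The formula itself is not in the transcript. As a hedged sketch, assuming the form given in the RepeatScout paper (sum over copies of max(a(Q, S_k), 0), minus C times |Q|), the objective can be evaluated from precomputed alignment scores as follows. The function name `objective` is hypothetical, and the fit-preferred alignment (where the penalty p enters) is not implemented here.

```python
def objective(scores, q_len, c):
    """Objective for a candidate consensus Q (assumed form, see lead-in).

    scores : fit-preferred alignment scores a(Q, S_k), one per copy S_k
    q_len  : |Q|, the length of the candidate consensus
    c      : minimum threshold on the number of repeat elements
    """
    # copies that align badly contribute 0 rather than a negative score
    return sum(max(s, 0) for s in scores) - c * q_len


print(objective([10, -3, 7], q_len=4, c=2))
```

Under this form, each extra nucleotide of Q costs C, so an extension only pays off when enough copies gain from it.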

Output of optimized Q
>R=0
GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACTTGAGGTCAGGAGTTC
GAGACCAGCCTGGCCAACATGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGCCGGGCGTGGTGGCGCGCGCCTG
TAATCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATCGCTTGAACCCGGGAGGCGGAGGTTGCAGTGAGCCGAGATCGCG
CCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAAAAAAAAAAAA
>R=1
AAAAGGCAGCAGAAACCTCTGCAGACTTAAATGTCCCTGTCTGACAGCTTTGAAGAGAGTAGTGGTTCTCCCAGCACGCA
GCTGGAGATCTGAGAACGGACAGACTGCCTCCTCAAGTGGATCCCTGACCCCCGAGTAGCCTAACTGGGAGGCACCCCCC
AGTAGGGGCAGACTGACACCTCACACGGCCAGGTACTCCTCTGAGAAAAAACTTCCAGAGGAACAATCAGGCAGCAACAT
TTGCTGCTCACCAATATCCACTGTTCTGCAGCCTCTGCTGCTGATACCCAGGCAAACAGGGTCTGGAGTGGACCTCCAGC
AAACTCCAACAGACCTGCAGCTGAGGGTCCTGTCTGTTAGAAGGAAAACTAACAAACAGAAAGGACATCCACACCAAAAA
CCCATCTGTACGTCACCATCATCAAAGACCAAAAGTAGATAAAACCACAAAGATGGGGAAAAAACAGAGCAGAAAAACTG
GAAACTCTAAAAAGCAGAGCGCCTCTCCTCCTCCAAAGGAACGCAGCTCCTCACCAGCAACGGAACAAAGCTGGACGGAG
AATGACTTTGATGAGTTGAGAGAAGAAGGCTTCAGATGATCAAACTACTCCAAGCTAAAGGAGGAAATTCAAACCCATGG
CAAAGAAGTTAAAAACCTTGAAAAAAAATTAGACGAATGGATAACTAGAATAACCAATGCAGAGAAGTCCTTAAAGGAGC
TGATGGAGCTGAAAACCAAGGCTCGAGAACTACGTGAAGAATGCACAAGCCTCAGGAGCCGATGCGATCAACTGGAAGAA
AGGGTATCAGTGATGGAAGATCAAATGAATGAAATGAAGTGAGAAGAGAAGTTTAGAGAAAAAAGAATAAAAAGAAATGA
>R=2
TTTTTTTTTTTTTTTAGATGCGGGGTGTCACTGTGTTGCTCAGGCTGGTCTCAAACTCCTGGGCTCAAGTGATCCTCCCA
CCTCAGCCTCTTTAATAGATGCGATTA
>R=3
TTTTTATACATGCTGTAGACAATCAATTCACACCTGTACTTTTTTTTAAGGTTGTGTTATTGCACTTTTATACCTCTTGA
CTGGTAGCTGATTTCCTTGAATACCTGTAAGGTAATCACCGGCTCACCAATGAATGTGGTTTTAACAATGGCTCACAGTG
GCTTGGAAAGCCCTCATGGGAAGTATTTCTGAGGAAAAGTGGAGAGTGTGCAGGAATAGTTTTGAAAAACAGAGACAACC
GATGTCCTCCTTCCCTCCCTTGCCTCTCCTCATGTGCCAGGTTTTCTGTTTTCTCCACTATTACAGAATCACCATGTTGT
ATCCTGTGATGAAAAGTTTTTATCTCTTTAATCATCCCATTTCGTCCTCCAGACCTTTTTTTTTCTGGAAGGGTTGTAAG
CAGAAGGGACGAAACATCTTCAGAAAAACACATTATGATATAAACTTAGTGAAAAGATTCATCATATTTAAGAAATGGAC
AGGATGAAATCCTGAATTCATAAAAATTTTAAAAATCAGTTTACATAACATCCATCCCTTTTGTCTCTATCCCTTATCCA

Parameter setting and postprocessing
Parameter setting
- Recommended smallest l = 15
- For a sequence of arbitrary length L, l is set as a function of L (the formula is missing from this transcript)
- Q is extended up to 10,000 bp on each side
- Remove repeat families with |Q| < 50 bp
Postprocessing
- Tandem Repeats Finder, NSEG
  - Remove repeat families with >50% of their length annotated as low-complexity or tandem repeats
- RepeatMasker
  - Mask the repeat families based on the library
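The two removal rules above can be written as a small filter. This is a sketch under stated assumptions: `keep_family` is a hypothetical name, and the number of low-complexity or tandem-repeat bases per family is assumed to come from a separate run of Tandem Repeats Finder and NSEG.

```python
def keep_family(length, low_complexity_bp, min_len=50, max_lc_fraction=0.5):
    """Decide whether a repeat family consensus survives filtering.

    length            : consensus length |Q| in bp
    low_complexity_bp : bases annotated as low-complexity or tandem repeat
                        (assumed to be computed elsewhere)
    """
    if length < min_len:
        return False                      # too short: |Q| < 50 bp
    # drop families that are more than 50% low-complexity / tandem repeat
    return low_complexity_bp / length <= max_lc_fraction


families = [("R=0", 309, 27), ("short", 40, 0), ("simple", 100, 60)]
print([name for name, n, lc in families if keep_family(n, lc)])
```

The lengths and counts in `families` are made-up values for illustration, not results from the paper.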

Benchmark
C. briggsae genome (108 Mb): 7 h on a single 0.5 GHz DEC Alpha processor

Combined evidence model of TE

Overview
Query sequences: Drosophila melanogaster (fruit fly), Release 3 and 4
Combined evidence model: a pipeline of RepeatMasker, BLASTER, TBLASTX, all-by-all BLASTN, RECON, and TE-HMM
- Methods for the annotation of known TE families
- Methods for the annotation of anonymous TE families
Benchmark: FlyBase Release 3.1 annotation
- Sensitivity and specificity, characteristics of boundaries

Tools
BLASTER
- Compares query sequences against a subject databank.
- Launches one of the BLAST programs (BLASTN, TBLASTN, BLASTX, TBLASTX).
- Cuts long sequences before launching BLAST and reassembles the results.
MATCHER
- Maps match results onto query sequences by filtering overlapping hits.
- Keeps the match results with E-value 20
- Chains the remaining matches by dynamic programming.
GROUPER
- Gathers similar sequences into groups.

Measures
For each nucleotide:
- TP: correctly annotated as belonging to a TE
- FP: falsely predicted as belonging to a TE
- TN: correctly annotated as not belonging to a TE
- FN: falsely predicted as not belonging to a TE
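As an illustration, the four counts can be tallied from two per-nucleotide masks and turned into sensitivity and specificity. The sketch assumes the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the slides do not spell the formulas out, and the function names are hypothetical.

```python
def confusion_counts(predicted, reference):
    """Count TP, FP, TN, FN from per-nucleotide masks (True = inside a TE)."""
    tp = fp = tn = fn = 0
    for pred, ref in zip(predicted, reference):
        if pred and ref:
            tp += 1
        elif pred and not ref:
            fp += 1
        elif not pred and not ref:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


pred = [True, True, False, False, True]
ref = [True, False, False, False, True]
tp, fp, tn, fn = confusion_counts(pred, ref)
print(sensitivity(tp, fn), specificity(tn, fp))
```

Counting per nucleotide rather than per element means a partly detected TE contributes to both TP and FN.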

Methods for the annotation of known TE families
- BLASTER using BLASTN and MATCHER (BLRn)
- RepeatMasker (RM)
- RepeatMasker with MATCHER (RMm)
- RepeatMasker-BLASTER (RMBLR): hits from both BLRn and RM are combined and given to MATCHER

Methods for the annotation of anonymous TE families
- all-by-all comparison with BLASTER using BLASTN, MATCHER, and GROUPER
- RECON
- BLASTER using TBLASTX and MATCHER
- HMM

What they (we) learned
- Overall, BLRn outperforms RM in the precise determination of TE boundaries.
- RM is more sensitive for the detection of small and divergent TEs.
- The differences between BLRn and RM make them complementary for TE annotation.
- A combined-evidence framework can improve the quality and confidence of TE annotation.

Pipeline structure
- TE detection software: BLASTER, RepeatMasker, TE-HMM, and RECON
- Tandem repeat detection software: RepeatMasker, Tandem Repeats Finder (TRF), Mreps
- Database: MySQL
- Job scheduling: Open Portable Batch System
- The whole genomic sequence was segmented into chunks of 200 kb overlapping by 10 kb.
- The results from the different tools were stored in the database.
- An XML file is generated from the stored results and loaded into the Apollo genome annotation tool.
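The segmentation step can be sketched as follows, assuming 0-based half-open coordinates and that the final chunk is simply shorter than the others. The name `chunk_coords` is hypothetical; the pipeline's real implementation is not shown in the slides.

```python
def chunk_coords(genome_len, chunk=200_000, overlap=10_000):
    """Return (start, end) pairs covering a sequence with overlapping chunks."""
    step = chunk - overlap                # each chunk starts 190 kb after the last
    coords = []
    start = 0
    while True:
        end = min(start + chunk, genome_len)
        coords.append((start, end))
        if end == genome_len:             # reached the end of the sequence
            break
        start += step
    return coords


print(chunk_coords(450_000))
```

The overlap lets a TE that spans a chunk boundary be seen whole in at least one chunk, provided it is shorter than the overlap.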

The Annotation Pipeline