Homology Based Analysis of the Human/Mouse lncRNome

Cédric Notredame, Giovanni Bussotti
Comparative Bioinformatics lab, CRG

Part 1: GENCODE v10 lncRNA screening vs human and mouse genomes
Strategy: PipeR one2many homolog assignment
Template: PipeR
Parameters:
Blast
- Freyhult parametrization
- Lower case masking
- Low complexity masking
Exonerate
- est2genome model
- 70% coverage required
- seed extension 2X (the span of the genomic size of the query on both sides)
Query dataset:
genes                                    10840
transcripts                              17547
exons                                    58857
sum of mature transcript length (nt)     16927027
real coverage (nt)                       13083478
non overlapping loci                     7428
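The two Exonerate-side parameters above can be illustrated with a small sketch. This is not PipeR's code: the function names, the half-open coordinates and the clamping to chromosome bounds are assumptions made for illustration. It only shows what "seed extension 2X" and "70% coverage required" could mean in practice.

```python
def extend_seed(hit_start, hit_end, query_genomic_span, chrom_length, factor=2):
    """Pad a Blast hit on both sides by factor * the genomic span of the query.

    Coordinates are assumed 0-based, half-open; the window is clamped to
    the chromosome (an assumption, the slides do not describe edge handling).
    """
    pad = factor * query_genomic_span
    start = max(0, hit_start - pad)
    end = min(chrom_length, hit_end + pad)
    return start, end


def passes_coverage(aligned_query_bases, query_length, min_coverage=0.70):
    """True when the alignment covers at least min_coverage of the query."""
    return aligned_query_bases / query_length >= min_coverage


# Hypothetical hit: a 400 nt anchor for a query whose locus spans 3000 nt.
window = extend_seed(hit_start=50_000, hit_end=50_400,
                     query_genomic_span=3_000, chrom_length=1_000_000)
print(window)                                   # region handed to the spliced aligner
print(passes_coverage(aligned_query_bases=720, query_length=1_000))
```

The extended window is what a spliced aligner would be run against, so that exons lying far from the Blast anchor can still be recovered.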

PipeR: a pipeline for mapping lncRNAs
Blast-Exonerate based framework to map lncRNAs against target genomes
Algorithm used (figure): lncRNA query -> Blast hits on the target chromosome -> mapping extension -> Exonerate spliced transcript

PipeR: lncRNA Homology Mapping
GENCODE lncRNAs vs complete genomes
- Anchor points: ENCODE vs Mouse with tuned Blast
- Extension: Exonerate
- Filtering: identity and coverage
- Output: GFF file
- Validation of the GFF annotation: overlap with annotation, overlap with Cufflinks models, RPKM on target genome
- Further mapping: parameter space exploration using experimental evidence
Notredame, Bussotti

Mapping overview
(Figure: genes A and B of the query species, with transcripts 1 to 3, are mapped to multiple homologs in the target species, homologs 1 to 4; for some transcripts Blast/Exonerate failed.)
Homolog labels: best reciprocal, conserved exon number, high repeat coverage, overlap with protein
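The "best reciprocal" label can be made concrete with a short sketch. The slides do not say how PipeR ranks homologs, so ranking by a single score per (query, target) pair is an assumption here, and all identifiers are hypothetical.

```python
def best_hit(hits):
    """Map each query to its highest-scoring target.

    hits: iterable of (query_id, target_id, score) tuples.
    """
    best = {}
    for query, target, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best(forward_hits, reverse_hits):
    """Pairs (query, target) that are each other's best hit in both directions."""
    forward = best_hit(forward_hits)
    reverse = best_hit(reverse_hits)
    return sorted((q, t) for q, t in forward.items() if reverse.get(t) == q)


# Hypothetical human -> mouse and mouse -> human mappings.
human_to_mouse = [("hs_T1", "mm_H1", 90), ("hs_T1", "mm_H2", 60),
                  ("hs_T2", "mm_H2", 70)]
mouse_to_human = [("mm_H1", "hs_T1", 88), ("mm_H2", "hs_T1", 65),
                  ("mm_H2", "hs_T2", 50)]
print(reciprocal_best(human_to_mouse, mouse_to_human))
```

In this toy input only hs_T1/mm_H1 is reciprocal: mm_H2 is the best hit of hs_T2, but its own best hit back in human is hs_T1.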

GENCODEv10 vs human genome
- mapped 17327 transcripts out of 17547
- many lncRNAs found in multiple copies (lncRNA families)
- found 144566 homologs corresponding to 501355 exons
Annotations of discovered homologs are readily available

Homolog repeat coverage
About 10% of all our homolog predictions are fully covered by repeats

Homolog repeat coverage
The homologs can be sub-grouped into 3 sets according to their repeat coverage: <= 20%, <= 80%, <= 100%
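A minimal sketch of how such sets could be built. The counts reported in the mapping statistics grow from one threshold to the next, which suggests the sets are cumulative; that reading, the half-open coordinates, and the assumption that neither exons nor repeats overlap among themselves are mine, not the slides'.

```python
def repeat_coverage(exons, repeats):
    """Percent of exonic bases overlapped by repeats.

    exons, repeats: lists of (start, end), 0-based half-open, on one sequence.
    Assumes exons do not overlap each other and repeats do not overlap each other.
    """
    exon_bases = sum(end - start for start, end in exons)
    covered = 0
    for e_start, e_end in exons:
        for r_start, r_end in repeats:
            covered += max(0, min(e_end, r_end) - max(e_start, r_start))
    return 100.0 * covered / exon_bases


def cumulative_sets(coverages, thresholds=(20, 80, 100)):
    """Number of homologs at or below each repeat-coverage threshold."""
    return {t: sum(1 for c in coverages if c <= t) for t in thresholds}


# Hypothetical homolog with two exons and two repeat annotations.
print(repeat_coverage([(0, 100), (200, 300)], [(50, 120), (290, 400)]))
print(cumulative_sets([0.0, 30.0, 95.0, 100.0]))
```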

Mapping statistics: HUMAN
Repeat coverage                                                   <= 20%   <= 80%   <= 100%
genV10 mapped genes                                               6088     10425    10698
genV10 mapped transcripts                                         9318     16856    17327
Total homologs                                                    35399    102250   144566
Homologs whose exons overlap protein coding exons (same strand)   3621     5076     8988

GENCODEv10 vs mouse genome
- mapped 3190 transcripts out of 17547, representing 2249 human genes
- many lncRNAs found in multiple copies (lncRNA families)
- found 14936 homologs corresponding to 38910 exons
Annotations of discovered homologs are readily available

Exon Number Conservation Human/Mouse
Difference between the number of exons in the human transcripts and in the mouse homologs
- “0” means that the exon number is the same
- Negative bins (human < mouse) indicate mouse homologs having more exons than the human query
- Positive bins (human > mouse) indicate mouse homologs having fewer exons than the human query
- 1160 GENCODE v10 transcripts find at least 1 homolog in mouse with the same exon number
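The binning described on this slide amounts to a simple subtraction per query/homolog pair. The sketch below uses made-up exon counts and hypothetical transcript names; it only illustrates the sign convention and the "at least 1 homolog with the same exon number" count.

```python
from collections import Counter


def exon_difference_histogram(pairs):
    """Histogram of (query exon count - homolog exon count) over all pairs."""
    return Counter(query_exons - homolog_exons for query_exons, homolog_exons in pairs)


def queries_with_conserved_exon_number(homologs_by_query):
    """Queries with at least one homolog having the same number of exons.

    homologs_by_query: {query_id: (query_exon_count, [homolog exon counts])}
    """
    return sorted(q for q, (n, counts) in homologs_by_query.items() if n in counts)


# Hypothetical (human exons, mouse exons) pairs.
pairs = [(3, 3), (2, 4), (5, 2), (1, 1)]
print(exon_difference_histogram(pairs))   # the -2 bin holds the pair where mouse has more exons

example = {"T1": (3, [3, 5]), "T2": (2, [4]), "T3": (1, [])}
print(queries_with_conserved_exon_number(example))
```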

Homolog repeat coverage
The homologs can be sub-grouped into 3 sets according to their repeat coverage: <= 20%, <= 80%, <= 100%

Mapping statistics: MOUSE
Repeat coverage                                                   <= 20%   <= 80%   <= 100%   Reciprocal homologs
genV10 mapped genes                                               1867     2172     2249      1445
genV10 mapped transcripts                                         2586     3076     3190      1966
Total homologs                                                    6108     11141    14936
Homologs whose exons overlap protein coding exons (same strand)   1611     2290     3177      497
Homologs with conserved number of exons                           1534     2407     2958      689
Best Candidates: There are 148 transcripts that have < 20% repeat coverage, conserved exon structure, do not overlap protein coding exons and are best reciprocal homologs with the human queries
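The "Best Candidates" selection is a conjunction of four criteria, which can be sketched as a single predicate. The record layout and field names below are hypothetical; the strict "< 20%" follows the wording of the slide, while the table column is labelled "<= 20%".

```python
from dataclasses import dataclass


@dataclass
class Homolog:
    query_id: str               # human transcript the homolog was found with
    repeat_coverage: float      # percent of exonic bases covered by repeats
    exon_difference: int        # query exon count minus homolog exon count
    overlaps_coding_exon: bool  # same-strand overlap with a protein coding exon
    best_reciprocal: bool       # best hit in both mapping directions


def is_best_candidate(h, max_repeat=20.0):
    """All four criteria of the slide must hold at once."""
    return (h.repeat_coverage < max_repeat
            and h.exon_difference == 0
            and not h.overlaps_coding_exon
            and h.best_reciprocal)


def best_candidate_queries(homologs):
    """Distinct query transcripts with at least one homolog passing every test."""
    return sorted({h.query_id for h in homologs if is_best_candidate(h)})


homologs = [
    Homolog("T1", 5.0, 0, False, True),    # passes
    Homolog("T2", 45.0, 0, False, True),   # too repeat-rich
    Homolog("T3", 0.0, 1, False, True),    # exon number differs
    Homolog("T4", 0.0, 0, True, True),     # overlaps a coding exon
]
print(best_candidate_queries(homologs))
```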

BlastR vs The World

Figure 2: Exon read support.
a) Venn diagram indicating the number of exons detected by the different methods (numbers in parentheses) and their intersection (transcripts annotated identically by the three methods): blastnOpt (12487), blastn (8749), blastr (12093), all (7492).
b) Average number of reads per exon.
c) Percentage of reads covered by at least one exon.

Part 2: Ensembl.v65 lncRNAs screening vs human and mouse genomes
Strategy: PipeR one2many homolog assignment
Template: PipeR
Parameters:
Blast
- Freyhult parametrization
- Lower case masking
- Low complexity masking
Exonerate
- est2genome model
- 70% coverage required
- seed extension 2X (the span of the genomic size of the query on both sides)
Query dataset:
genes                                    3845
transcripts                              5669
exons                                    18353
sum of mature transcript length (nt)     7279679
real coverage (nt)                       6091050
non overlapping loci                     2790
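The slides do not define "real coverage" or "non overlapping loci". One plausible reading, assumed here, is that overlapping exons (for example from alternative transcripts of one gene) are merged so each genomic base is counted once, and that each merged block of overlapping features counts as one locus. The sketch below shows that merge on a single sequence and strand.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals, 0-based half-open."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:      # overlaps the previous block
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(block) for block in merged]


def summed_length(intervals):
    """Total length, counting overlapping bases once per interval."""
    return sum(end - start for start, end in intervals)


# Hypothetical exons; the first two overlap by 50 nt.
exons = [(100, 200), (150, 300), (500, 600)]
blocks = merge_intervals(exons)
print(summed_length(exons))    # redundant sum over all exons
print(summed_length(blocks))   # each base counted once
print(len(blocks))             # number of non-overlapping blocks
```

Under this reading the gap between the summed transcript length and the real coverage reflects bases shared between overlapping transcripts.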

Ensembl.v65 vs human genome
- mapped 1187 transcripts out of 5669
- many lncRNAs found in multiple copies (lncRNA families)
- found 13193 homologs corresponding to 46770 exons
Annotations of discovered homologs are readily available

Ensembl.v65 vs mouse genome
- mapped 5622 transcripts out of 5669
- many lncRNAs found in multiple copies (lncRNA families)
- found 41005 homologs corresponding to 121515 exons
Annotations of discovered homologs are readily available

Exon Number Conservation Mouse/Human
Difference between the number of exons in the mouse transcripts and in the human homologs
- “0” means that the exon number is the same
- Negative bins (mouse < human) indicate human homologs having more exons than the mouse query
- Positive bins (mouse > human) indicate human homologs having fewer exons than the mouse query
- 481 Ensembl.v65 transcripts find at least 1 homolog in human with the same exon number

Homolog repeat coverage
No peak of homolog predictions fully covered by repeats was observed

Ensembl.v65 and GENCODEv10 repeat coverage
Input lncRNA datasets have similar repeat distributions

Mapping statistics
Target genome                                                               HUMAN    MOUSE
ensV65 mapped genes                                                         879      3815
ensV65 mapped transcripts                                                   1187     5622
Total homologs                                                              13193    41005
Homologs whose exons overlap protein coding exons (same strand)             3642     10086
Homologs whose exons do not overlap any gencode v10 element (same strand)   6085
Homologs with conserved number of exons                                     4925

Part 3: GENCODE v10 lncRNA coding potential check
Strategies:
1) GeneID ORF score comparison between mRNAs and lncRNAs
2) BlastX against human proteins (Ensembl 65)
3) Overlap with protein coding gene exon annotations (GENCODE v10)
4) PipeR filtering routines

1) ORF scores as returned by GeneID
2) blastX against human proteins indicates that 1202 GENCODE v10 lncRNAs match proteins
Parameters: seg low complexity filtering, repeat filtering, evalue 10e-10, search just the plus strand
Database: human Ensembl 65 protein set
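Counting the lncRNAs that "match proteins" is a matter of collecting distinct query identifiers whose hits pass the e-value cutoff. The sketch below assumes the search results are available as 12-column tab-separated lines (the layout BLAST+ writes with -outfmt 6, where column 11 is the e-value); the slides do not state which output format was used, and the identifiers are made up.

```python
def queries_with_protein_hit(tabular_lines, max_evalue=10e-10):
    """Distinct query IDs having at least one hit with e-value <= max_evalue.

    Expects 12 tab-separated columns per line:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    matched = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            matched.add(query_id)
    return matched


# Hypothetical hits: lnc1 passes the cutoff, lnc2 does not.
lines = [
    "lnc1\tprotA\t98.0\t120\t2\t0\t1\t360\t1\t120\t1e-30\t250",
    "lnc1\tprotB\t60.0\t80\t30\t1\t10\t250\t5\t84\t2e-05\t48",
    "lnc2\tprotC\t55.0\t70\t31\t0\t1\t210\t1\t70\t3e-04\t40",
]
print(sorted(queries_with_protein_hit(lines)))
```

Note that "10e-10" as written on the slide equals 1e-9.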

3) Checked the overlap between GENCODE v10 lncRNA exons and GENCODE v10 protein coding exons
- Found 846 lncRNAs having at least one exon overlapping a protein coding gene exon
(Figure: Example 1, Example 2)
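The overlap check in 3) can be sketched as an interval test between exons on the same chromosome and strand. Requiring the same strand is an assumption carried over from the mapping statistics, where overlaps are reported "same strand"; the record layout and names are hypothetical, and the all-against-all loop is only suitable for small inputs.

```python
from collections import namedtuple

# 0-based, half-open coordinates (an assumption for this sketch).
Exon = namedtuple("Exon", ["chrom", "strand", "start", "end"])


def exons_overlap(a, b):
    """True when two exons share at least one base on the same chromosome and strand."""
    return (a.chrom == b.chrom and a.strand == b.strand
            and a.start < b.end and b.start < a.end)


def lncrnas_overlapping_coding(lncrna_exons, coding_exons):
    """IDs of lncRNAs with at least one exon overlapping a protein coding exon.

    lncrna_exons: {transcript_id: [Exon, ...]}
    """
    return sorted(
        tid for tid, exons in lncrna_exons.items()
        if any(exons_overlap(e, c) for e in exons for c in coding_exons)
    )


coding = [Exon("chr1", "+", 1000, 1200)]
lncrnas = {
    "lncA": [Exon("chr1", "+", 900, 1050)],    # overlaps on the same strand
    "lncB": [Exon("chr1", "-", 900, 1050)],    # antisense, not counted
    "lncC": [Exon("chr1", "+", 1200, 1300)],   # adjacent, no shared base
}
print(lncrnas_overlapping_coding(lncrnas, coding))
```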

4) Extensive filtering
7813 GENCODE v10 transcripts passed *ALL* PipeR filtering routines
Filtering rules:
- overlap with protein coding exons
- GeneID ORF score similar to the ones of mRNA
- blastX to uniprot database (50% redundancy)
- blastX to nr database
- rpsBlast to pfam domain families
- blast against Rfam