Using RNA-seq data to improve gene annotation

The GENCODE consortium
HAVANA: manual annotation. Ensembl: computational annotation. Partner groups: annotation hints, experimental and computational validation.
The gene annotation is supported by computational and wet-lab groups who QC our work and feed back on it; we in turn feed back to improve their pipelines. Their predictions highlight regions of interest in the genome to be followed up by manual annotation and identify potential features missing from the annotation. The annotated transcripts are experimentally validated, and the results are fed back to the computational groups to help improve their pipelines. The same data are used to QC transcripts.
The GENCODE geneset is the default annotation in the Ensembl and UCSC browsers and is dynamic (>95% of Ensembl).

Gene models
HAVANA produces manual gene annotation for the human genome. This is the GENCODE[1] reference gene model annotation, which is used in production of whole exome sequence (WES) arrays[2] and provides the default gene models in the Ensembl and UCSC genome browsers.
[1] Harrow J, Frankish A, Gonzalez JM et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012 Sep;22(9):1760-74.
[2] Coffey AJ, Kokocinski F, Calafato MS et al. The GENCODE exome: sequencing the complete human exome. Eur J Hum Genet. 2011 Jul;19(7):827-31.

Olfr RNAseq analysis workflow
fastq files
Align to reference with TopHat2 or STAR
BAM files
Merge BAMs
Run Cufflinks
Merge Cufflinks models
Cufflinks models
Filter for ORs using HMM
QC for best models
Filtered Cufflinks models
Upload to GENCODE db from GTF
Copy across CDS and biotype data
Add to GENCODE db
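The workflow above passes Cufflinks models along as GTF before they are loaded into the database. As a rough illustration only (not the in-house loader, which the slides do not describe), the sketch below groups GTF exon records into per-transcript models. The function names are hypothetical, and it assumes standard nine-column GTF with `transcript_id` attributes and no semicolons inside attribute values.

```python
from collections import defaultdict


def parse_gtf_attributes(field):
    """Parse a GTF attribute column such as: gene_id "X"; transcript_id "Y";"""
    attrs = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        # Each attribute is a key, a space, then a (usually quoted) value.
        key, _, value = part.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def exons_by_transcript(gtf_lines):
    """Group exon records by transcript_id, sorted by start coordinate."""
    models = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue  # keep exon features only
        transcript_id = parse_gtf_attributes(cols[8]).get("transcript_id")
        if transcript_id is None:
            continue
        # Store (chromosome, start, end, strand); GTF coordinates are 1-based, inclusive.
        models[transcript_id].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return {tid: sorted(exons, key=lambda e: e[1]) for tid, exons in models.items()}
```

A dictionary keyed by transcript makes the later steps (filtering, QC, copying CDS and biotype data) straightforward to apply per model.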

QC system for filtered OR Cufflinks models – using an in-house web server with MySQL and IGV

Change in gene coverage for mouse olfactory receptor annotation

Comparison of FPKM values for human ORs (Olender vs. Logan)
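For readers unfamiliar with the unit, FPKM is fragments per kilobase of transcript per million mapped fragments. The sketch below shows the basic definition only; it is not how Cufflinks estimates FPKM (Cufflinks uses a statistical model over isoforms), and the function and parameter names are illustrative.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)
```

For example, 100 fragments on a 1,000 bp transcript in a library of one million mapped fragments gives an FPKM of 100.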

Intron spanning reads from Intropolis
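An intron-spanning read is one whose alignment skips over an intron. In SAM/BAM alignments this is recorded as an `N` operation in the CIGAR string. The sketch below shows how intron coordinates can be recovered from a single alignment; it illustrates the general idea and makes no claim about how Intropolis was built. The function name is hypothetical.

```python
import re

# CIGAR operations that advance the position on the reference sequence.
REFERENCE_CONSUMING = set("MDN=X")


def introns_from_cigar(pos, cigar):
    """Return (start, end) of each intron implied by a spliced alignment.

    pos is the 1-based leftmost reference position of the alignment;
    returned coordinates are 1-based and inclusive.
    """
    introns = []
    ref_pos = pos
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op == "N":
            # A skipped region on the reference, i.e. the intron.
            introns.append((ref_pos, ref_pos + length - 1))
        if op in REFERENCE_CONSUMING:
            ref_pos += length
    return introns
```

Counting how many reads support each (start, end) pair gives per-junction evidence that can be compared against annotated introns.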

Early infantile epileptic encephalopathies (EIEE)
EIEE: early onset seizures (< 1 year), developmental delay, potentially fatal, comorbidities e.g. cerebral palsy.
Include Dravet, Ohtahara, West syndrome (infantile spasms), etc.
3–5 per 10,000 live births.
Pilot study of 70 genes (66 from GOSH). Clinical significance already demonstrated.
About 31% of children have a diagnosis through genetic studies: are we looking in all the right places?
These are severe disorders characterised by chaotic brain activity called hypsarrhythmia; they often evolve into other syndromes. We are constantly finding new genes, but are the current gene models correct?

“Deep diving” using next-generation-derived data from brain
PacBio and RNA CaptureSeq: adult brain
Synthetic long-read RNA sequencing (SLRseq): adult brain
Paired Illumina RNAseq, 6 life stages from brain (Jaffe et al., Nat Neurosci 2015)
We wanted to see how complete our geneset is, looking specifically at transcripts expressed in brain. We have state-of-the-art data sets that allow a much more in-depth study of gene structures and better functional characterisation. The three data sets we used are PacBio, SLRseq, and an Illumina short-read data set from 6 life stages in brain, from which we annotated exclusively foetal and infant brain transcripts.
Tilgner et al., Nat Biotech 2015; Mercer et al., Nature Protocols 2014; Trapnell et al., Nat Biotech 2010

Genome annotation improvements
We are now moving from exome to genome sequence, so we can query any variant in the genome for its functional significance. We use CAGE, PolyA-seq, RP and mass spec data to enrich our annotation. Our feeling is that the more detailed the annotation, the better. An exome is frozen at the date given by e.g. Agilent, but we can add new features regularly, and genome sequence can still be analysed 10 years down the line. While large-scale annotation is time consuming, small-scale projects are much quicker than automated.
ENCODE 3: tissue specificity. Matthias Uhlén, BodyMap from Illumina. The Human Protein Atlas: body atlas.

Genome annotation improvements: NMD, retained intron

Addition of many novel alternatively-spliced transcripts
These genes were already well annotated, but the number of annotated transcripts more than doubled. RefSeq has about 10% of these transcripts. We annotated more than 1000 transcripts, all of them supported by transcriptional evidence.
Figure labels: GENCODE, RefSeq

Significant increase in exonic coverage
Total number of novel transcripts: 1092
New exons: 706
New introns: 1132
SSJs: 224
New exon coverage: 128,817 bp
SSJ coverage: 12,402 bp
UTR/transcripts: 125,936 bp
Extra coding sequence coverage: 15,283 bp
Total amount of new sequence: 141,219 bp
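The coverage figures on this slide can be cross-checked with simple arithmetic: the total new sequence splits in two ways, by feature (new exons plus SSJs) and by region (UTR/transcript plus coding). The short sketch below just restates the slide's numbers; the variable names are illustrative.

```python
# Figures taken from the slide, in base pairs.
new_exon_bp = 128_817
ssj_bp = 12_402
utr_bp = 125_936
coding_bp = 15_283
total_new_bp = 141_219

# Two independent breakdowns of the same total.
total_by_feature = new_exon_bp + ssj_bp
total_by_region = utr_bp + coding_bp

# Share of the new sequence that is coding, as a percentage.
coding_percent = 100 * coding_bp / total_new_bp

print(total_by_feature, total_by_region, round(coding_percent, 1))
```

Both breakdowns sum to 141,219 bp, and roughly 10.8% of the new sequence is coding.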

Cross-species conservation of coding sequence
CE = constrained elements from alignment of 39 mammalian genomes from Ensembl