Using RNA-seq data to improve gene annotation


1 Using RNA-seq data to improve gene annotation

2 The GENCODE consortium
HAVANA: manual annotation
Ensembl: computational annotation
Annotation hints, experimental and computational validation
GENCODE geneset: default annotation in the Ensembl and UCSC browsers, and dynamic (>95% of Ensembl)
The gene annotation is supported by computational and wet lab groups who feed back on and QC our work; in turn, we feed back to help improve their pipelines. Their predictions highlight regions of interest in the genome to be followed up by manual annotation, identify potential features missing from the annotation, and are used to QC transcripts. Annotated transcripts are also experimentally validated, with the results fed back to the computational groups to help improve pipelines.

3 Gene models
HAVANA produces the GENCODE[1] reference gene model annotation:
used in production of whole exome sequence (WES) arrays[2]
default gene models in the Ensembl and UCSC genome browsers
Manual gene annotation for the human genome
[1] Harrow J, Frankish A, Gonzalez JM et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res Sep;22(9):
[2] Coffey AJ, Kokocinski F, Calafato MS et al. The GENCODE exome: sequencing the complete human exome. Eur J Hum Genet Jul;19(7):827-31

4

5 Olfr RNAseq analysis workflow
fastq files
Align to reference with TopHat2 or STAR
BAM files
Merge BAMs
Run Cufflinks
Merge Cufflinks models
Cufflinks models
Filter for ORs using HMM
QC for best models
Filtered Cufflinks models
Upload to GENCODE db from GTF
Copy across CDS and biotype data
Add to GENCODE db
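The GTF upload step in the workflow above relies on reading transcript models out of GTF files. The following is a minimal sketch of that kind of parsing, not the pipeline's actual code: the function names `parse_gtf_attributes` and `exons_by_transcript` are hypothetical, and the example records are made up for illustration. It assumes standard nine-column, tab-separated GTF with `key "value";` attribute pairs and no semicolons inside quoted values.

```python
from collections import defaultdict


def parse_gtf_attributes(field):
    """Parse the ninth GTF column (key "value"; pairs) into a dict."""
    attrs = {}
    for part in field.strip().split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def exons_by_transcript(gtf_lines):
    """Group exon records by transcript_id, sorted by start coordinate."""
    models = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue  # keep only well-formed exon records
        transcript_id = parse_gtf_attributes(cols[8]).get("transcript_id")
        if transcript_id is None:
            continue
        # (seqname, start, end, strand); GTF coordinates are 1-based, inclusive
        models[transcript_id].append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    for exons in models.values():
        exons.sort(key=lambda exon: exon[1])
    return dict(models)


# Made-up records for illustration only (not from the presentation).
example_gtf = [
    "\t".join(["chr7", "Cufflinks", "exon", "600", "900", ".", "+", ".",
               'gene_id "CUFF.1"; transcript_id "CUFF.1.1";']),
    "\t".join(["chr7", "Cufflinks", "exon", "100", "300", ".", "+", ".",
               'gene_id "CUFF.1"; transcript_id "CUFF.1.1";']),
]

models = exons_by_transcript(example_gtf)
print(models)
```

Grouping exons per transcript before loading keeps each model intact, which is what a reviewer would need when checking filtered models one at a time.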

6 QC system for filtered OR Cufflinks models – using an in-house web server with MySQL and IGV

7 QC system for filtered OR Cufflinks models – using an in-house web server with MySQL and IGV

8 Change in gene coverage for mouse olfactory receptor annotation

9

10 Comparison of FPKM values for human ORs (Olender vs Logan)
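For readers unfamiliar with the unit, FPKM is fragments per kilobase of transcript per million mapped fragments. The slide does not show how the compared values were computed, so the following is only a sketch of the standard definition; the function name `fpkm` and the input numbers are illustrative assumptions.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Illustrative numbers only: 500 fragments on a 1,000 bp transcript
# in a library of 20 million mapped fragments.
value = fpkm(500, 1000, 20_000_000)
print(value)
```

Because FPKM normalises for both transcript length and library size, values from two studies are only roughly comparable when the underlying gene models and mapping choices are similar.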

11

12 Intron spanning reads from Intropolis

13 Early infantile epileptic encephalopathies (EIEE)
EIEE: early onset seizures (< 1 year), developmental delay, potentially fatal, comorbidities e.g. cerebral palsy
Includes Dravet, Ohtahara and West syndrome (infantile spasms), etc.
3–5 per 10,000 live births
Pilot study of 70 genes (66 from GOSH)
Clinical significance already demonstrated
Around 31% of children have a diagnosis through genetic studies; are we looking in all the right places? These are severe disorders characterised by chaotic brain activity called hypsarrhythmia, and they often evolve into other syndromes. We are constantly finding new genes, but are the current gene models correct?

14 “Deep diving” using next generation derived data from brain
PacBio and RNA CaptureSeq: adult brain
Synthetic long-read RNA sequencing (SLRseq): adult brain
Paired Illumina RNAseq, 6 life stages from brain (Jaffe et al., Nat Neurosci 2015)
We wanted to see how complete our geneset is, looking specifically at transcripts expressed in brain. We have state-of-the-art data sets that allow much more in-depth study of gene structures and better functional characterisation. The three data sources we used are PacBio, SLRseq, and an Illumina short-read data set from 6 life stages in brain, from which we annotated exclusively foetal and infant brain transcripts.
Tilgner et al., Nat Biotech 2015; Mercer et al., Nature Protocols 2014; Trapnell et al., Nat Biotech 2010

15 Genome annotation improvements
We are now moving from exome to genome sequence, so we can query any variant in the genome for its functional significance. We use CAGE, PolyAseq, RP and mass spec data to enrich our annotation. Our feeling is that the more detailed the annotation, the better. The exome is frozen at the date given by e.g. Agilent, but we can add new features regularly, and we can take genome sequence and analyse it 10 years down the line. While large scale annotation is time consuming, small scale projects are much quicker than automated.
Other resources noted: ENCODE 3; tissue specificity (Matthias Uhlen); BodyMap from Illumina; the Human Protein Atlas (body atlas).

16 Genome annotation improvements

17 Genome annotation improvements

18 Genome annotation improvements

19 Genome annotation improvements

20 Genome annotation improvements
Slide labels: NMD; Retained Intron

21 Addition of many novel alternatively-spliced transcripts
These genes were already well annotated, but the number of transcripts more than doubled. RefSeq has about 10% of these transcripts. We annotated more than 1000 transcripts, all of them supported by transcriptional evidence.
(Chart labels: GENCODE, RefSeq)

22 Significant increase in exonic coverage
Total number of novel transcripts: 1092
New exons: 706
New introns: 1132
SSJs: 224
New exon coverage: 128,817 bp
SSJ coverage: 12,402 bp
UTR/transcripts: 125,936 bp
Extra coding sequence coverage: 15,283 bp
Total amount of new sequence: 141,219 bp
(Chart labels: GENCODE, RefSeq)
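The coverage figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes the flattened numbers pair with their labels in the order they appear, and that the total is split two ways: by feature (new exons plus SSJs) and by sequence type (UTR/transcript plus coding). Variable names are illustrative.

```python
# Figures as read from the slide (base pairs); the label pairing is an assumption.
new_exon_bp = 128_817
ssj_bp = 12_402
utr_transcript_bp = 125_936
extra_cds_bp = 15_283
total_new_bp = 141_219

# Split by feature: new exons plus SSJ coverage.
by_feature = new_exon_bp + ssj_bp
# Split by sequence type: UTR/transcript plus extra coding sequence.
by_type = utr_transcript_bp + extra_cds_bp

print(by_feature, by_type, total_new_bp)
# Fraction of the new sequence that is coding.
print(round(extra_cds_bp / total_new_bp, 3))
```

Both splits sum to the stated total, which supports reading the table this way; on that reading, coding sequence makes up roughly a tenth of the newly annotated sequence.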

23 Cross-species conservation of coding sequence
CE = constrained elements from alignment of 39 mammalian genomes from Ensembl

