1 Ensembl use of RNASeq, Steve Searle

2 Introduction
Ways we use RNASeq data in Ensembl:
- Build a complete gene set from scratch for individual or pooled RNASeq data sets
- Incorporate into a new Ensembl gene set
  - Add novel models into a gene set
  - UTR
  - Filtering models
- Improve old gene sets

3 RNASeq pipeline: Building genes from RNASeq

4 RNASeq Pipeline: Alignment and Initial Processing
- Reads are aligned to the genome with a quick ungapped alignment using BWA
- Transcriptome reads are split over introns, and we need to allow for this: align with up to 50% mismatches so that intron-spanning reads still align
- The alignments are then processed to collapse overlapping reads into blocks representing exons
- Read pairing is used (if available) to group the exon blocks into approximate transcript structures
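
The collapsing of overlapping reads into exon blocks can be illustrated with a short interval-merging sketch. This is not the Ensembl pipeline code: the function name `collapse_reads` and the (start, end) tuple representation are assumptions made for illustration, and all reads are taken to lie on one sequence and strand.

```python
def collapse_reads(reads):
    """Merge overlapping read alignments into exon-like blocks.

    `reads` is an iterable of (start, end) tuples, 1-based inclusive,
    all on the same sequence (hypothetical representation).
    """
    blocks = []
    for start, end in sorted(reads):
        if blocks and start <= blocks[-1][1]:
            # Read overlaps the current block: extend the block if needed.
            blocks[-1][1] = max(blocks[-1][1], end)
        else:
            # No overlap: start a new block.
            blocks.append([start, end])
    return [tuple(block) for block in blocks]
```

Grouping the resulting blocks into rough transcripts by read pairing would be a separate step and is not shown here.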

6 RNASeq Pipeline: Intron Alignment
- We align split reads using Exonerate, which has a good splice model but is not a short-read aligner
- Intron alignment is made faster in 2 ways:
  - Don't realign all the reads: introns are resolved by realigning only the partially aligned reads. The Exonerate word length is used to define which reads to realign
  - Align to a single transcript: reads are realigned either to the rough transcript sequence or to the genomic span of the rough transcript
- Limiting the search space allows us to do a more sophisticated Exonerate alignment with a splice model and a shorter word length
- Aligning to the genomic span of the transcript can identify small exons that were missed by the BWA alignment, which can then be incorporated into the final model
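
The slide does not spell out how the word length selects reads, so the following is only one plausible reading, sketched under stated assumptions: a read is worth realigning when its unaligned remainder is at least one Exonerate word long, since a shorter remainder could not seed an alignment. The function name, the input mapping and the rule itself are all hypothetical.

```python
def reads_to_realign(reads, word_length):
    """Return names of reads whose unaligned portion could seed an alignment.

    `reads` maps a read name to (read_length, aligned_length); both the
    representation and the selection rule are assumptions for illustration.
    """
    selected = []
    for name, (read_length, aligned_length) in reads.items():
        unaligned = read_length - aligned_length
        # Only a remainder at least one word long can seed a new alignment.
        if unaligned >= word_length:
            selected.append(name)
    return selected
```

With a shorter word length more reads qualify, which matches the slide's point that a restricted search space makes a shorter word length affordable.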

7 Exonerate spliced alignment (figure labels: Partially aligned reads; Split reads; Collapsed intron features; Final models)
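
Collapsing split reads into intron features can be sketched as counting how many spliced alignments support each gap. This is an illustration only, not the pipeline's implementation; the function name `collapse_introns` and the block-list representation of a spliced alignment are assumptions.

```python
from collections import Counter


def collapse_introns(split_reads):
    """Collapse spliced read alignments into intron features with read counts.

    Each element of `split_reads` is a list of (start, end) aligned blocks,
    1-based inclusive and sorted by position (hypothetical representation).
    Returns a dict mapping (intron_start, intron_end) to supporting reads.
    """
    support = Counter()
    for blocks in split_reads:
        for (_, left_end), (right_start, _) in zip(blocks, blocks[1:]):
            # The intron is the gap between two consecutive aligned blocks.
            support[(left_end + 1, right_start - 1)] += 1
    return dict(support)
```

Reads with different aligned ends but the same splice junctions collapse onto one intron feature, and the count gives a simple measure of support.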

8 BLASTP Coverage (PE12)

9 Website Display of RNASeq pipeline results
Data visible in Ensembl:
- Transcript models
- Intron features
- BAM files of BWA alignments

10 Human gene ZMPSTE24 (figure tracks: RNASeq introns by tissue; RNASeq models by tissue & merged; CCDS; GENCODE transcript)

12 Nile tilapia: BAM files

13 Nile tilapia: BAM files

14 RNASeq Volume
- We are collecting more and more RNASeq
- We now have sizeable RNASeq sets for 12+ species
- Pipeline is now being used in production
- Further automation has allowed us to speed up model building:
  - Process spreadsheet data to automate the pipeline setup and configuration
  - Parse metadata out of spreadsheets into the final BAM files
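
One way spreadsheet metadata might end up in a BAM file is as `@RG` read-group lines in the header. The sketch below builds such lines from a sample sheet exported as CSV. The column names (`sample_id`, `species`, `tissue`) and the choice of tags are assumptions for illustration; the slides do not say which fields the Ensembl pipeline records or where it stores them.

```python
import csv
import io


def read_group_lines(csv_text):
    """Build SAM @RG header lines from a sample sheet exported as CSV.

    The column names (sample_id, species, tissue) are hypothetical.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        fields = [
            "@RG",
            "ID:" + row["sample_id"],
            "SM:" + row["tissue"],
            "DS:" + row["species"] + " " + row["tissue"],
        ]
        # SAM header fields are tab-separated TAG:VALUE pairs.
        lines.append("\t".join(fields))
    return lines
```

The same parsed rows could also drive pipeline configuration, so that a single sheet is the only thing edited by hand for each new data set.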

15 Using RNASeq in the Ensembl genebuild pipeline

16 Using RNASeq in the Ensembl genebuild pipeline
- Some species have little species-specific data, e.g. Nile tilapia: 131 proteins in UniProt, 35 cDNAs, 119,531 ESTs
- Rely on data from related species
- RNASeq supplements the above data:
  - Species-specific
  - Fills gaps, alternate splice sites, faster genebuild

17 Genebuild process (flowchart labels: Raw Computes; Targeted stage; Similarity stage; cDNAs/ESTs; UTR addition; Final gene set; Filtering; Filtering; TranscriptConsensus; LayerAnnotation; Annotation Projection (primates))

18 Genebuild process (flowchart labels: Raw Computes; Targeted stage; Similarity stage; cDNAs/ESTs; UTR addition; Final gene set; Filtering; Filtering; Merged RNA-Seq models; Annotation Projection (primates))

19 RNASeq helps with: 1. Choice of splice site (figure tracks: RNASeq; Similarity models; Ensembl model)

20 RNASeq helps with: 2. UTR addition (figure tracks: RNASeq model; Similarity model; Ensembl model)

21 RNASeq helps with: 3. New models (figure tracks: RNASeq introns; RNASeq model; Similarity model; Ensembl model)

22 Species with RNASeq used in generating the Ensembl gene set
- Released: Zebrafish, Tasmanian Devil, Coelacanth, Tilapia
- In progress: Dog, Turtle, Rat, Cat, Chicken, Platyfish
- So RNASeq is becoming a central part of the genebuild process, with many species having RNASeq components going forward

23 Gene set update pipeline using RNASeq

24 Gene set Update Pipeline using RNASeq
1. RNA-Seq: the RNA-Seq pipeline is highly automated; many species take around a week to process
2. Split core gene set into single-transcript genes
3. Transcript scoring / filtering: UTR addition done at the same time
4. Layering: avoiding pseudogenes; gap filling with fragments
5. Rebuild core set
6. Transfer pseudogenes + ncRNAs
The gene set update pipeline is fast and uses existing code in a novel way with very few alterations
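
The slides do not describe the layering rule in detail, so the following is a generic priority-layering sketch rather than Ensembl's LayerAnnotation logic. It assumes models are plain (start, end) spans on one sequence and strand, and that a model from a lower-priority layer is kept only where it overlaps nothing kept from a higher layer, which is how lower layers end up filling gaps. The function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end) inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def layer_models(layers):
    """Pick models layer by layer, highest priority first.

    `layers` is a list of lists of (start, end) model spans on one sequence
    and strand (hypothetical representation). A model from a lower layer is
    kept only if it overlaps nothing kept from a higher layer.
    """
    kept = []
    for layer in layers:
        # Compare only against higher layers, so models within one layer
        # do not exclude each other.
        higher = list(kept)
        for model in layer:
            if not any(overlaps(model, other) for other in higher):
                kept.append(model)
    return kept
```

A pseudogene layer could be handled in the same framework by using it only as a filter, so that models overlapping pseudogenes are dropped; that variant is not shown.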

25 Filter and add UTRs (figure tracks: RNASeq model; Ensembl models; RNASeq Introns)

26 Add ‘UTR’; Extend CDS (figure tracks: RNASeq models; Ensembl models; RNASeq Introns)

31 Results
Monodelphis (Genes / Transcripts): 19,466 / 32,541; 21,324 / 22,307; 132
Platypus (Genes / Transcripts): 17,951 / 26,836; 21,695 / 23,581; 204
Row labels, in slide order: before merge; after merge; joined genes

32 Gene set update pipeline - Summary
- Quick, straightforward method of tidying up gene sets
- Add species-specific models into gene sets that were previously mostly based on proteins from other species
- Much more efficient than a new genebuild
Future work:
- Lots of other species we could apply this to
- See what effect it has on primates / projection builds (in progress)

33 Ensembl Use of NHPRT data
- Primates in Ensembl currently: Chimp, Gorilla, Rhesus macaque, Marmoset, Mouse lemur*, Squirrel monkey+, Baboon+, Orangutan, Gibbon, Tarsier* (+ = Pre!, * = 2x)
- Run RNASeq pipeline on NHPRT primates in Ensembl to generate:
  - Transcript models
  - Introns
  - BAM files of alignments (would like individual tissue RNASeq data for this)
- Use NHPRT RNASeq in Ensembl gene builds on new species, e.g. Baboon
- Use NHPRT RNASeq to improve existing Ensembl gene sets, e.g. Rhesus macaque
- Consider other uses:
  - Targeted improvement of models for ‘important’ genes (disease related)
  - Long non-coding genes
  - Alignment to human

34 Acknowledgements
- Steve Searle, Bronwen Aken, Daniel Barrell, Susan Fairley, Carlos Garcia Giron, Thibaut Hourlier, Andreas Kahari, Rishi Nag, Magali Ruffier, Amy Tang, Jan-Hinnerk Vogel, Amonida Zadissa
- John E Collins, Stephen Keenan, Henrik Kaessman, Jessica Alfoldi
- Illumina (Human Body Map data)

