Published by Daisy Newman. Modified over 9 years ago.
1 of 34
Ensembl use of RNASeq
Steve Searle
2 of 34
Introduction
Ways we use RNASeq data in Ensembl:
- Build a complete gene set from scratch for individual or pooled RNASeq data sets
- Incorporate into a new Ensembl gene set
- Add novel models into a gene set
- UTR
- Filtering models
- Improve old gene sets
3 of 34
RNASeq pipeline
Building genes from RNASeq
4 of 34
RNASeq Pipeline: Alignment and Initial Processing
- Reads are aligned to the genome with a quick ungapped alignment using BWA.
- Transcriptome reads are split over introns, so we need to allow for this: align with up to 50% mismatches so that intron-spanning reads still align.
- The alignments are then processed to collapse overlapping reads into blocks representing exons.
- Read pairing is used (if available) to group the exon blocks into approximate transcript structures.
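The step that collapses overlapping reads into exon blocks can be pictured as a standard interval merge. The following is a minimal sketch, not the Ensembl pipeline code: the function name `collapse_reads`, the 1-based inclusive coordinate convention, and the example coordinates are all assumptions made for illustration.

```python
from typing import List, Tuple

# (start, end) on one sequence region; 1-based inclusive (assumed convention)
Interval = Tuple[int, int]


def collapse_reads(alignments: List[Interval]) -> List[Interval]:
    """Merge overlapping read alignments into blocks (candidate exons)."""
    blocks: List[Interval] = []
    for start, end in sorted(alignments):
        if blocks and start <= blocks[-1][1]:
            # Read overlaps the current block: extend the block if needed.
            blocks[-1] = (blocks[-1][0], max(blocks[-1][1], end))
        else:
            # No overlap: this read opens a new block.
            blocks.append((start, end))
    return blocks


# Made-up read alignments for illustration only.
reads = [(100, 175), (150, 225), (400, 475), (460, 535), (900, 975)]
exon_blocks = collapse_reads(reads)
print(exon_blocks)
```

Grouping the resulting blocks into rough transcripts would then use read pairing, which this sketch does not attempt.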
6 of 34
RNASeq Pipeline: Intron Alignment
We align split reads using Exonerate, which has a good splice model but is not a short-read aligner.
Intron alignment is made faster in two ways:
- Don't realign all the reads: introns are resolved by realigning only the partially aligned reads. The Exonerate word length is used to define which reads to realign.
- Align to a single transcript: reads are realigned either to the rough transcript sequence or to the genomic span of the rough transcript.
Limiting the search space allows a more sophisticated Exonerate alignment, with a splice model and a shorter word length.
Aligning to the genomic span of the transcript can identify small exons that were missed by the BWA alignment, which can then be incorporated into the final model.
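One plausible reading of "use the Exonerate word length to define which reads to realign" is that a read is only worth realigning if its poorly matched portion is at least one seed word long, since a shorter overhang could not seed an alignment on the far side of an intron. The sketch below encodes that reading only; the `ReadAlignment` class, its fields, the word length of 12, and the example reads are hypothetical, and the actual selection rule in the pipeline may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReadAlignment:
    name: str
    read_length: int
    matched_bases: int  # length of the well-matched portion of the read (assumed field)


def select_for_realignment(reads: List[ReadAlignment], word_length: int) -> List[ReadAlignment]:
    """Keep reads whose unmatched portion is long enough to seed a realignment."""
    selected = []
    for read in reads:
        unmatched = read.read_length - read.matched_bases
        if unmatched >= word_length:
            selected.append(read)
    return selected


# Made-up reads: fully matched, half matched, and almost fully matched.
reads = [
    ReadAlignment("r1", 76, 76),
    ReadAlignment("r2", 76, 50),
    ReadAlignment("r3", 76, 70),
]
to_realign = select_for_realignment(reads, word_length=12)
print([r.name for r in to_realign])
```

Filtering this way keeps the expensive spliced alignment restricted to the small subset of reads that can actually resolve an intron.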
7 of 34
Figure labels: Exonerate spliced alignment; partially aligned reads; split reads; collapsed intron features; final models
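Collapsing split reads into intron features amounts to counting how many reads support each distinct intron. A minimal sketch follows, assuming each spliced read alignment has already been reduced to the intron it spans; the `Intron` tuple, the `min_support` parameter, and the coordinates are illustrative assumptions rather than the pipeline's own data model.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Intron(NamedTuple):
    seq_region: str
    start: int
    end: int
    strand: int


def collapse_introns(split_reads: Iterable[Intron], min_support: int = 1) -> List[Tuple[Intron, int]]:
    """Collapse identical introns from many reads into (intron, read count) features."""
    counts = Counter(split_reads)
    return sorted((intron, n) for intron, n in counts.items() if n >= min_support)


# Made-up introns, one per spliced read.
observed = [
    Intron("chr1", 1200, 1850, 1),
    Intron("chr1", 1200, 1850, 1),
    Intron("chr1", 1200, 1850, 1),
    Intron("chr1", 2100, 2600, 1),
]
for intron, depth in collapse_introns(observed):
    print(intron, depth)
```

The read count per intron gives a simple support score that later steps could use when choosing between alternative splice sites.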
8 of 34
BLASTP Coverage (PE12)
9 of 34
Website Display of RNASeq pipeline results
Data visible in Ensembl:
- Transcript models
- Intron features
- BAM files of BWA alignments
10 of 34
Human gene ZMPSTE24
Track labels: RNASeq introns by tissue; RNASeq models by tissue & merged; CCDS; GENCODE transcript
12 and 13 of 34
Nile tilapia: BAM files
14 of 34
RNASeq Volume
- We are collecting more and more RNASeq data.
- We now have sizeable RNASeq sets for 12+ species.
- The pipeline is now being used in production.
- Further automation has allowed us to speed up model building:
  – Process spreadsheet data to automate the pipeline setup and configuration
  – Parse metadata out of spreadsheets into the final BAM files
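Carrying spreadsheet metadata into BAM files typically means writing it into `@RG` (read group) header lines, whose `ID`, `SM` and `DS` tags are defined by the SAM specification. The sketch below shows one way this could look; the column names `run_id`, `tissue` and `description`, the sample rows, and the choice of tags are assumptions, not the layout of the Ensembl spreadsheets.

```python
import csv
import io
from typing import List

# Made-up spreadsheet export; column names are hypothetical.
SHEET = """run_id,tissue,description
run1,brain,paired-end 76bp
run2,liver,paired-end 76bp
"""


def read_group_lines(csv_text: str) -> List[str]:
    """Turn each spreadsheet row into a SAM @RG header line."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        fields = [
            "@RG",
            f"ID:{row['run_id']}",       # read group identifier
            f"SM:{row['tissue']}",       # sample name (tissue used here by assumption)
            f"DS:{row['description']}",  # free-text description
        ]
        lines.append("\t".join(fields))
    return lines


for line in read_group_lines(SHEET):
    print(line)
```

The generated lines could then be supplied to whichever tool writes or re-headers the final BAM files.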
15 of 34
Using RNASeq in the Ensembl genebuild pipeline
16 of 34
Using RNASeq in the Ensembl genebuild pipeline
- Some species have little species-specific data, e.g. Nile tilapia:
  – 131 proteins in UniProt
  – 35 cDNAs, 119,531 ESTs
- Rely on data from related species
- RNASeq supplements the above data:
  – Species-specific
  – Fills gaps, alternate splice sites, faster genebuild
17 of 34
Genebuild process
Diagram labels: Raw Computes; Targeted stage; Similarity stage; cDNAs/ESTs; UTR addition; Final gene set; Filtering; Filtering; TranscriptConsensus; LayerAnnotation; Annotation; Projection (primates)
18 of 34
Genebuild process
Diagram labels: Raw Computes; Targeted stage; Similarity stage; cDNAs/ESTs; UTR addition; Final gene set; Filtering; Filtering; Merged RNA-Seq models; Annotation; Projection (primates)
19 of 34
RNASeq helps with: 1. Choice of splice site
Figure labels: RNASeq; Similarity models; Ensembl model
20 of 34
RNASeq helps with: 2. UTR addition
Figure labels: RNASeq model; Similarity model; Ensembl model
21 of 34
RNASeq helps with: 3. New models
Figure labels: RNASeq introns; RNASeq model; Similarity model; Ensembl model
22 of 34
Species with RNASeq used in generating Ensembl gene set
Released: Zebrafish, Tasmanian Devil, Coelacanth, Tilapia
In progress: Dog, Turtle, Rat, Cat, Chicken, Platyfish
RNASeq is becoming a central part of the genebuild process, with many species' gene sets including RNASeq components going forward.
23 of 34
Gene set update pipeline using RNASeq
24 of 34
Gene set Update Pipeline using RNASeq
1. RNA-Seq: the RNA-Seq pipeline is highly automated; many species take around a week to process
2. Split core gene set into single-transcript genes
3. Transcript scoring / filtering (UTR addition done at the same time)
4. Layering: avoiding pseudogenes, gap filling with fragments
5. Rebuild core set
6. Transfer pseudogenes + ncRNAs
The gene set update pipeline is fast and uses existing code in a novel way, with very few alterations.
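The layering step can be thought of as a priority scheme: models from a higher-priority layer are kept, and models from lower layers are only added where they do not overlap anything already accepted from a layer above. The sketch below illustrates that general idea under simplifying assumptions (one sequence region, strand ignored, overlap judged on whole-model spans); it is not the Ensembl LayerAnnotation code, and the `Model` tuple and example names are hypothetical.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    start: int
    end: int


def overlaps(a: Model, b: Model) -> bool:
    """True if the two models share at least one base (inclusive coordinates)."""
    return a.start <= b.end and b.start <= a.end


def layer_models(layers: List[List[Model]]) -> List[Model]:
    """Accept models layer by layer, highest priority first."""
    accepted: List[Model] = []
    for layer in layers:
        # Compare only against models accepted from higher-priority layers,
        # so models within the same layer never exclude each other.
        higher = list(accepted)
        for model in layer:
            if not any(overlaps(model, kept) for kept in higher):
                accepted.append(model)
    return accepted


# Made-up example: one existing model and two RNASeq models.
layers = [
    [Model("existing_1", 100, 500)],
    [Model("rnaseq_1", 300, 700), Model("rnaseq_2", 1000, 1500)],
]
print([m.name for m in layer_models(layers)])
```

In this toy example the second RNASeq model fills a gap, while the first is dropped because a higher-priority model already covers that locus.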
25 of 34
Figure labels: RNASeq model; Ensembl models; RNASeq introns; Filter and add UTRs
26 of 34
Figure labels: Add ‘UTR’; Extend CDS; RNASeq models; Ensembl models; RNASeq introns
31 of 34
Results

Monodelphis     Genes     Transcripts
before merge    19,466    32,541
after merge     21,324    22,307
joined genes    132

Platypus        Genes     Transcripts
before merge    17,951    26,836
after merge     21,695    23,581
joined genes    204
32 of 34
Gene set update pipeline - Summary
- Quick, straightforward method of tidying up gene sets
- Adds species-specific models into gene sets that were previously based mostly on proteins from other species
- Much more efficient than a new genebuild
Future work:
- Lots of other species we could apply this to
- See what effect it has on primates / projection builds (in progress)
33 of 34
Ensembl Use of NHPRT data
Primates in Ensembl currently: Chimp, Gorilla, Rhesus macaque, Marmoset, Mouse lemur*, Squirrel monkey+, Baboon+, Orangutan, Gibbon, Tarsier* (+ = Pre!, * = 2x)
- Run RNASeq pipeline on NHPRT primates in Ensembl to generate:
  – Transcript models
  – Introns
  – BAM files of alignments (would like individual tissue RNASeq data for this)
- Use NHPRT RNASeq in Ensembl gene builds on new species, e.g. Baboon
- Use NHPRT RNASeq to improve existing Ensembl gene sets, e.g. Rhesus macaque
- Consider other uses:
  – Targeted improvement of models for ‘important’ genes (disease related)
  – Long non-coding genes
  – Alignment to human
34 of 34
Acknowledgements
Steve Searle, Bronwen Aken, Daniel Barrell, Susan Fairley, Carlos Garcia Giron, Thibaut Hourlier, Andreas Kahari, Rishi Nag, Magali Ruffier, Amy Tang, Jan-Hinnerk Vogel, Amonida Zadissa
John E Collins, Stephen Keenan, Henrik Kaessman, Jessica Alfoldi
Illumina (Human Body Map data)