Download presentation
Presentation is loading. Please wait.
1
Canadian Bioinformatics Workshops
2
Module #: Title of Module
2
3
Module 6 Metatranscriptomics
John Parkinson Analysis of Metagenomic Data June 22-24, 2016
4
Learning Objectives of Module
At the end of this module the student will have an appreciation of the opportunities and challenges of metatranscriptomics by: Understanding the capabilities of metatranscriptomics Gaining an appreciation of sample collection and experimental design Learning important steps in data processing Being able to process a simple metatranscriptomic dataset
5
Overview What is metatranscriptomics – how does it relate to RNA-Seq?
Experimental design, sample collection and preparation Processing of reads Filters Assembly Dereplication Functional annotation Taxonomic annotation Statistical analysis Visualization Consequtive basepairs
6
Metagenomics and metatranscriptomics reveal function
16S rRNA surveys (“Who is there?”) have been widely applied but yield only limited mechanistic insights – cause or consequence? Metagenomics (“What can they do?”) have revealed dramatic differences in community composition but with conserved microbiome functions Metatranscriptomics (“Who is doing what?”) examines microbiome activity Frank et al Cell Host Microbe 2008 / The human microbiome consortium Nature 2012/ Xiong et al PLoS ONE 2012
7
Metatranscriptomics focuses on community activity
Metatranscriptomics exploits RNA-Seq to determine which genes and pathways are being actively expressed within a community Genes involved in pathways associated with cell wall biogenesis Metatranscriptomics can reveal active functions It can also reveal which taxa are responsible for the active functions Consequtive basepairs Relative abundance Relative abundance and contributing taxa
8
Similar microbiomes can express different functions
Distribution of Taxa (n=4) Perilipin2 KO mice fed a high fat diet exhibited a similar microbiome as WT mice fed the same diet…. Acetolactate synthatse …however metabolic pathways are differentially expressed – host genotype impacts microbiome function Isopropylmalate isomerase Xiong et al In Preparation
9
Metatranscriptomics through RNA Seq
Species B Species D Species A Species C Extract RNA Fragment and sequence RNA-Seq is the unbiased sequencing of an RNA sample to yield a digital readout of the relative expression of transcripts within a sample Typically applied to organisms with a reference (sequenced) genome, microbiome applications face a number of challenges Gene A1 6 3 1 2 4 Gene A2 Gene A3 Gene A4 Gene B1 Gene B2 Gene B3 Gene C1 Consequtive basepairs Align reads to known transcripts to obtain relative expression Gene D4
10
Metatranscriptomics: Challenges
In a typical RNA-Seq experiment applied to a single eukaryotic organism, mRNA is isolated. After fragmentation and sequencing, reads are mapped to a reference genome using standard software such as MAQ and BWA to provide: 1) support that the transcript is expressed; 2) the relative abundance of the transcript; and 3) the presence and abundance of isoforms For microbiome samples we have the following problems: Lack of a polyA signal makes it difficult to isolate bacterial mRNA and resulting in (massive) rRNA contamination Environmental microbiome samples lack reference genomes making it difficult to map reads back to their source transcripts Consequtive basepairs
11
A typical metatranscriptomic analytical pipeline
1. Obtain RNA 2. Prep for sequencing 3. Generate Reads 4. Remove low quality 5. Remove rRNA reads 6. Remove host reads 7. Assemble 7. Identify bacterial transcript 8. Map to pathways Gene A Gene C 9. Sample comparisons Gene B Gene D
12
1. Sample collection and RNA extraction
RNA quality deteriorates rapidly – Method of storage and preparation can impact taxa recovered and has yet to be standardized Best(?): Process immediately to extract RNA then store at -80 Next best(?): Snap freeze in liquid nitrogen and store at -80 Avoid use of RNALater – it lyses some cells and can interfere with RNA extraction kits (e.g Ambion RiboPure Bacteria Kit / LifeTech Trizol Plus) Number of biological replicates “At least 2!” Power analyses challenging Cost per sample (20 million reads) ~$300-$400
13
2. Preparing sample for sequencing
Bacterial mRNA’s lack a polyA tail so how to remove abundant rRNA species? Once RNA has been extracted, several kits are available to remove rRNA – need 500ng-2.5ug RNA/sample Ribo-Zero (Illumina) provides reasonable success mRNA Host rRNA Adaptor / Low quality Host mRNAs can also prove challenging – can also be informative!
14
2. Preparing sample for sequencing
Temperature (4oC > -20oC) Kits (RiboZero > Mxpress) Storage Days (Fresh > 1 week) RNA later (+ > -) Comparisons of storage and RNA treatments
15
3. Generating reads How many reads are “enough”? While 454 and MiSeq provide long reads useful for annotation HiSeq (or NextSeq) provide sequencing depth and offer possibility of multiplexing ~5 million mRNA reads provide 90-95% of ECs in a microbiome With kits yielding mRNA read rates of ~25%, this suggests 20 million/sample mRNA
16
4. Analysing the data Metatranscriptomics is a relatively new field
Assembly methods? Annotation methods? Analysis methods? Processing methods? Least established Most established Metatranscriptomics is a relatively new field Robust software standards and methods still to be developed New tools continually created – can swap in as they prove more effective Celaj et al Microbiome 2014 Jiang et al Microbiome 2016
17
Read processing - filtering
To identify reads derived from mRNA bioinformatics pipelines need to be in place that remove contaminating reads: Low quality - Trimmomatic Adaptors – Trimmomatic Host – BWA / BLAT rRNA – BLAT / Infernal Assembly methods? Annotation methods? Analysis methods? Processing methods? Least established Most established Trimmomatic uses a sliding window approach from the 5` end to identify low quality regions which are then trimmed from the 3` end. Reads < 36 bp are discarded
18
Read processing - Assembly
contig1 contig2 Assembly methods? Annotation methods? Analysis methods? Processing methods? Least established Most established attagcggcgattttcggcgatcttatcttgatctgggcgcgtatcggtagcgtagcgattcgtagc Assembly improves annotation accuracy Trinity appears to provide best performance in terms of reads that can be annotated Chimera’s, misassembled contigs, can become a problem due to reads derived from orthologs from different species
19
Read processing – functional annotation
Assembly methods? Annotation methods? Analysis methods? Processing methods? Least established Most established Functional annotations rely on the identification of sequence similarity searches using tools such as: BWA, BLAT and BLAST While fast, BWA and BLAT rely on near perfect matches to reference genomes However sequence diversity is huge! Every time a new strain of S. agalactae was sequenced, new genes are discovered that were not present in existing strains Diversity at the nucleotide level is >> protein level
20
Read processing – functional annotation
One solution is to work in peptide space and use BLASTX to search protein databases - this is very time consuming and requires cloud/cluster computing Other solutions include DIAMOND and USEARCH (issues over quality and cost) Even with BLAST many reads remain unannotated Can be improved with longer read length
21
Quality of BLASTX matches
Typical match to a 71bp read E-Value = 39 (NOT e-39) Species looks about right though % of Read Length 55 60 65 70 75 80 85 90 95 100 35 0.0 40 45 50 0.1 0.2 0.4 0.3 0.5 0.7 1.6 0.9 1.3 1.5 2.2 1.7 3.8 2.0 2.4 4.9 1.1 0.8 1.8 2.7 3.3 0.6 1.9 4.0 3.7 4.8 2.9 8.6 12.1 Summary for all matches % ID of match
22
Read processing – converting mappings to expression
To normalize expression levels to account for differences in gene length, read counts are converted to Reads per kilobase of transcript mapped (RPKM) Expression is biased for gene length (longer transcripts should have more reads) to normalize, reads are converted to Reads per Kilobase of transcript per million reads mapped RPKMgeneA = 109 CgeneA / NL CgeneA = number of reads mapped to geneA N = total number of reads L = length of transcript in units of Kb # RPKM 8 1 24 6 8 12 Several software tools available to do mapping and calculate normalized expression measurements across different samples including Bowtie and Cufflinks
23
Read processing – taxonomic annotation
Previous studies have shown that microbiomes can vary significantly in taxonomic contributions while yielding similar functionality Assembly methods? Annotation methods? Analysis methods? Processing methods? Least established Most established Knowing which species are providing which functions may not be that important - On the other hand, assigning RNA reads to taxa may reveal critical functions contributed by keystone taxa, can also help in binning for assembly How do we assign taxonomic information? The human microbiome consortium 2012
24
Read processing – taxonomic annotation
Alignment based methods such as BLAST and BWA can fail where we lack suitable reference genomes – particularly for short read datasets where assignments may be ambiguous Compositional methods (e.g. nt frequency, codon bias) offer alternative strategies NBC Nearest neighbours methods then try to assign a sequence to the genome with the closest distribution CLARK Here a sequences is classified into frequencies of 3-mers KRAKEN
25
Taxonomic assignment through Gist
mRNA genomes (*.ffn) taxonomy database 16S counts other abundances classifier construction environment description taxonomic assignment class definitions taxonomy tags reshape into strain-specific records (DELIN) generate class distributions AA NB AA CD AA 1NN AA GMM BWA NT NB NT CD NT 1NN NT GMM class priors log normalization infer protein sequence FragGeneScan amino acid sequences Classification 1 select candidate hits for detailed processing for each read Classification 2 select best hits final confidence measurements (T-test for hit specificity) final assignment scores CalcTaxonInclusion initial confidence measurements with T-test for hit specificity Gist is a computational pipeline for accurate assignment of reads to individual species Integrates several methods, but uniquely assigns different weights to methods for each genome Can also take in expected sequence distributions (e.g. based on 16S rRNA surveys)
26
GIST exploits several orthogonal methods to define genome signatures
Naïve Bayes average + variance Nearest Neighbor exact gene composition Gaussian mixture subclasses of genes Expected Codelta signature features 19:20389:F:275+18M2D19M M2 BWA exact alignment hit FragGeneScan Protein translation
27
Assigning genome specific weights for each method
Increasing weight of method used to score genome Methods A key feature of Gist is defining combinations of weights for each feature / method that best discriminate between different genomes For example, some genomes may perform better using alignment methods than compositional methods
28
Gist performance against a mouse metatranscriptome
The performance of Gist was compared with two other classifiers - MetaCV and NBC using three mouse metatranscriptome datasets. Each mouse was raised under germ free conditions and seeded with altered Schaedler flora (ASF – comprises 8 taxa). MetaCV is unable to assign ~30-40% of reads, while NBC assigns reads to ~15 main taxa. Gist assigns reads to ~9 main taxa, closely reflecting ASF taxa.
29
mRNA expression is not equivalent to rRNA abundance
Comparisons between rRNA and mRNA read distributions reveal subtle differences in read abundance Artefact due to biases in rRNA sequencing? Abundance != expression?
30
Visualizing results Assembly methods? Annotation methods? Analysis methods? Processing methods? Least established Most established Once reads have been assigned to transcripts, we can explore their function through e.g. functional categories such as Gene Ontology or COG However these types of analyses tend not to be very informative as the categories are very broad Transcription Energy production and metabolism Lipid transport and metabolism
31
Visualizing results Beyond focusing on broad functional categories, we can also start to undertake systems based analyses Genes and proteins do not operate in isolation but form parts of interconnected functional modules Signalling networks Metabolic pathways By placing a bacterial transcripts within these functional contexts, we can understand how the microbiome functions at a molecular level Protein complexes
32
Visualizing results By mapping abundance of reads that map to E. coli homologs, protein interaction networks can be used as a template to provide more mechanistic insights – can we identify modules of genes that are coordinately expressed
33
Visualizing results How do we identify those systems which are differentially regulated across samples (e.g. systems up-regulated in disease vs. health)? Development of a statistically robust platform (1) Gene set enrichment analysis (GSEA) requires discrete groups of functionally related genes (2) Network approach; identify groups of genes based on functional interactions AND their differential expression GSEA may define trpA-E as a single group of related genes, but group as a whole may not be significantly upregulated A network based approach may define a group based on their interactions AND expression
34
Visualizing results Metabolic pathways are among the most highly conserved and best characterized systems MG-RAST and MEGAN are automated metagenomic annotation tools that rely on KEGG A major problem with KEGG pathway definitions is that the boundaries of pathways are arbitrarily defined and links between pathways (i.e. functional relationships) can be lost
35
Metabolic network of a mouse gut microbiome
Visualizing results Combining taxonomic and functional information allows us to identify contributions of individual taxa within a microbiome Metabolic network of a mouse gut microbiome Piecharts indicate relative contribution of each taxon to an enzymatic activity The interconversion of fructose-1,6 bisphosphate to glyceraldehyde 3-phosphate is largely mediated by bacteroides
36
Visualizing results In addition to representing protein interaction or metabolic relationships through network visualization, Cytoscape offers the ability to represent nodes (proteins) as pie charts or donuts, allowing the visualization of taxonomic contributions to discrete functions
37
Statistical considerations
There is no dedicated software or statistical tool for statistical comparisons of metatranscriptomic datasets Number of biological replicates? (at least two, preferably at least three but these are expensive experiments) Power analyses? Differential expression of individual genes Gene set enrichment analyses Ultimately metatranscriptomics could be viewed as hypothesis generating requiring subsequent targeted validation
38
Statistical considerations
While there are no dedicated tools for metatranscriptomics analyses, tools used for RNA Seq offer potential DESeq, EdgeR, ALDEx Alternatively simply rely on fold change Challenges include which genes to include (minimum RPKM?) Differentially expressed genes can be subsequently analysed through Gene Set Enrichment Approaches based on (for example) hypergeometric distributions
39
We are on a Coffee Break & Networking Session
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.