Canadian Bioinformatics Workshops www.bioinformatics.ca.

Slides:

Advertisements

Similar presentations

Advertisements

Using phylogenetic profiles to predict protein function and localization As discussed by Catherine Grasso.

Metabarcoding 16S RNA targeted sequencing

Peter Tsai Bioinformatics Institute, University of Auckland

Systems Biology Existing and future genome sequencing projects and the follow-on structural and functional analysis of complete genomes will produce an.

Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005 David Kelley.

Xiaole Shirley Liu STAT115, STAT215, BIO298, BIST520

Transcriptomics Jim Noonan GENE 760.

Review of important points from the NCBI lectures. –Example slides Review the two types of microarray platforms. –Spotted arrays –Affymetrix Specific examples.

Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.

The Sorcerer II Global ocean sampling expedition Katrine Lekang Global Ocean Sampling project (GOS) Global Ocean Sampling project (GOS) CAMERA CAMERA METAREP.

High Throughput Sequencing

mRNA-Seq: methods and applications

341: Introduction to Bioinformatics Dr. Natasa Przulj Deaprtment of Computing Imperial College London

Metagenomics Binning and Machine Learning

Metagenomic Analysis Using MEGAN4

Development of Bioinformatics and its application on Biotechnology

Discussion on Metagenomic Data for ANGUS Course Adina Howe.

H = -Σp i log 2 p i. SCOPI Each one of the many microbial communities has its own structure and ecosystem, depending on the body environment it exists.

Transcriptome analysis With a reference – Challenging due to size and complexity of datasets – Many tools available, driven by biomedical research – GATK.

RNAseq analyses -- methods

BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.

Analysis of Complex Proteomic Datasets Using Scaffold Free Scaffold Viewer can be downloaded at:

Advancing Science with DNA Sequence Metagenome definitions: a refresher course Natalia Ivanova MGM Workshop September 12, 2012.

The iPlant Collaborative

CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.

RNA surveillance and degradation: the Yin Yang of RNA RNA Pol II AAAAAAAAAAA AAA production destruction RNA Ribosome.

Metagenomic Analysis Using MEGAN4 Peter R. Hoyt Director, OSU Bioinformatics Graduate Certificate Program Matthew Vaughn iPlant, University of Texas Super.

Fea- ture Num- ber Feature NameFeature description 1 Average number of exons Average number of exons in the transcripts of a gene where indel is located.

RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.

Central dogma: the story of life RNA DNA Protein.

Analyzing Time Course Data: How can we pick the disappearing needle across multiple haystacks? IEEE-HPEC Bioinformatics Challenge Day Dr. C. Nicole Rosenzweig.

Introduction to RNAseq

Alternative Splicing (a review by Liliana Florea, 2005) CS 498 SS Saurabh Sinha 11/30/06.

Accurate estimation of microbial communities using 16S tags

The iPlant Collaborative

No reference available

Discovering functional interaction patterns in Protein-Protein Interactions Networks   Authors: Mehmet E Turnalp Tolga Can Presented By: Sandeep Kumar.

CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.

Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops

Shotgun sequencing reveals transkingdom alterations in immunodeficiency associated enteropathy Xiaoxi Dong (Oregon State University), Jialu Hu (Oregon.

MEGAN analysis of metagenomic data Daniel H. Huson, Alexander F. Auch, Ji Qi, et al. Genome Res

Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops

Discussion on Genomic/Metagenomic Data for ANGUS Course Adina Howe.

Canadian Bioinformatics Workshops

Gene prediction in metagenomic fragments: A large scale machine learning approach Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern.

RNA-Seq with the Tuxedo Suite Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 2015 Workshop.

bacteria and eukaryotes

Canadian Bioinformatics Workshops

Alastair Grant Environmental Sciences, University of East Anglia

Metagenomic Species Diversity.

RNA Quantitation from RNAseq Data

The Transcriptional Landscape of the Mammalian Genome

Networks and Interactions

Quality Control & Preprocessing of Metagenomic Data

Dr. Christoph W. Sensen und Dr. Jung Soh Trieste Course 2017

Gene expression from RNA-Seq

RNA-Seq analysis in R (Bioconductor)

Metagenomic assembly Cedric Notredame

Canadian Bioinformatics Workshops

Gene expression.

H = -Σpi log2 pi.

Summary and Recommendations

Summary and Recommendations

Sequence Analysis - RNA-Seq 2

Toward Accurate and Quantitative Comparative Metagenomics

Presentation transcript:

Canadian Bioinformatics Workshops

2Module #: Title of Module

Module 5 Metatranscriptomics

Module 5 bioinformatics.ca Learning Objectives of Module At the end of this module the student will have an appreciation of the opportunities and challenges of metatranscriptomics by: Understanding the capabilities of metatranscriptomics Gaining an appreciation of sample collection and experimental design Learning important steps in data processing Being able to process a simple metatranscriptomic dataset

Module 5 bioinformatics.ca Overview What is metatranscriptomics and RNA-Seq? Experimental design, sample collection and preparation Processing of reads – Filters – Assembly – Dereplication – Functional annotation – Taxonomic annotation Statistical analysis Visualization

Module 5 bioinformatics.ca Metagenomics is typically focussed on community makeup, identifying the proportion of sequences (DNA or 16S rRNA) that map to different taxa Although metagenomics can inform on the species present it has several drawbacks Does not provide much in the way of functional context, i.e. which genes are important Different sets of species have been shown to result in similar functional potential Relying on 16S sequencing provides species level information, but due to HGT the gene complements of different strains of the same species can vary considerably Although metagenomics can inform on the species present it has several drawbacks Does not provide much in the way of functional context, i.e. which genes are important Different sets of species have been shown to result in similar functional potential Relying on 16S sequencing provides species level information, but due to HGT the gene complements of different strains of the same species can vary considerably Qin et al Nature 2010 Metagenomics focuses on community makeup

Module 5 bioinformatics.ca Metatranscriptomics exploits RNA-Seq to determine which genes and pathways are being actively expressed within a community Metatranscriptomics can reveal active functions (knowing the taxa responsible is unimportant) It can also reveal which taxa are responsible for the active functions Metatranscriptomics can reveal active functions (knowing the taxa responsible is unimportant) It can also reveal which taxa are responsible for the active functions Metatranscriptomics focuses on community activity Genes involved in pathways associated with cell wall biogenesis Relative abundance Relative abundance and contributing taxa

Module 5 bioinformatics.ca RNA-Seq is the unbiased sequencing of an RNA sample to yield a digital readout of the relative expression of transcripts within a sample Typically applied to organisms with a reference (sequenced) genome, microbiome applications face a number of challenges RNA-Seq is the unbiased sequencing of an RNA sample to yield a digital readout of the relative expression of transcripts within a sample Typically applied to organisms with a reference (sequenced) genome, microbiome applications face a number of challenges Metatranscriptomics through RNA Seq Extract RNA Fragment and sequence IIIIV Species A Species B Species C Species D Align reads to known transcripts to obtain relative expression Gene A1 Gene A2 Gene A3 Gene A4 Gene B1 Gene B2 Gene B3 Gene C1 Gene D

Module 5 bioinformatics.ca Metatranscriptomics: Challenges IIIIV For microbiome samples we have the following problems: a)Lack of a polyA signal makes it difficult to isolate bacterial mRNA and resulting in (massive) rRNA contamination b)Environmental microbiome samples lack reference genomes making it difficult to map reads back to their source transcripts In a typical RNA-Seq experiment applied to a single eukaryotic organism, mRNA is isolated. After fragmentation and sequencing, reads are mapped to a reference genome using standard software such as MAQ and BWA to provide: 1) support that the transcript is expressed; 2) the relative abundance of the transcript; and 3) the presence and abundance of isoforms

Module 5 bioinformatics.ca A typical metatranscriptomic analytical pipeline 1. Obtain RNA 2. Prep for sequencing 3. Generate Reads 4. Remove low quality 5. Remove rRNA reads 6. Remove host reads 7. Identify bacterial transcript Gene B Gene A 8. Map to pathways 9. Sample comparisons 7. Assemble Gene D Gene C

Module 5 bioinformatics.ca 1. Sample collection and RNA extraction RNA quality deteriorates rapidly Best: Process immediately to extract RNA then store at -80 Next best: Snap freeze in liquid nitrogen and store at -80 Avoid use of RNALater – it lyses some cells and can interfere with RNA extraction kits (e.g Ambion RiboPure Bacteria Kit / LifeTech Trizol Plus) RNA quality deteriorates rapidly Best: Process immediately to extract RNA then store at -80 Next best: Snap freeze in liquid nitrogen and store at -80 Avoid use of RNALater – it lyses some cells and can interfere with RNA extraction kits (e.g Ambion RiboPure Bacteria Kit / LifeTech Trizol Plus) Cost per sample (20 million reads) ~$300-$400 Number of biological replicates “At least 2!” Power analyses challenging

Module 5 bioinformatics.ca 2. Preparing sample for sequencing Once RNA has been extracted, several kits are available to remove rRNA – need 500ng-2.5ug RNA/sample Ribo-Zero (Illumina) provides reasonable success Once RNA has been extracted, several kits are available to remove rRNA – need 500ng-2.5ug RNA/sample Ribo-Zero (Illumina) provides reasonable success Bacterial mRNA’s lack a polyA tail so how to remove abundant rRNA species? Host mRNAs can also prove challenging – can also be informative! mRNA rRNA Host Adaptor / Low quality

Module 5 bioinformatics.ca 2. Preparing sample for sequencing Comparisons of storage and RNA treatments Temperature (4 o C > -20 o C) Kits (RiboZero > Mxpress) Storage Days (Fresh > 1 week) RNA later (+ > -)

Module 5 bioinformatics.ca 3. Generating reads How many reads are “enough”? ~5 million mRNA reads provide 90-95% of ECs in a microbiome With kits yielding mRNA read rates of ~25%, this suggests 20 million/sample mRNA ~5 million mRNA reads provide 90-95% of ECs in a microbiome With kits yielding mRNA read rates of ~25%, this suggests 20 million/sample mRNA While 454 and MiSeq provide long reads useful for annotation HiSeq (or NextSeq) provide sequencing depth and offer possibility of multiplexing

Module 5 bioinformatics.ca Read processing - filtering To identify reads derived from mRNA bioinformatics pipelines need to be in place that remove contaminating reads: Low quality - Trimmomatic Adaptors – Trimmomatic & Cross_Match Host - BWA rRNA – BLAT / Infernal To identify reads derived from mRNA bioinformatics pipelines need to be in place that remove contaminating reads: Low quality - Trimmomatic Adaptors – Trimmomatic & Cross_Match Host - BWA rRNA – BLAT / Infernal Trimmomatic uses a sliding window approach from the 5` end to identify low quality regions which are then trimmed from the 3` end. Reads < 36 bp are discarded

Module 5 bioinformatics.ca 6. Read processing - Assembly Chimera’s, misassembled contigs, can become a problem due to reads derived from orthologs from different species Assembly improves annotation accuracy Trinity appears to provide best performance in terms of reads that can be annotated contig1 contig2 attagcggcgattttcggcgatcttatcttgatctgggcgcgtatcggtagcgtagcgattcgtagc

Module 5 bioinformatics.ca 7. Read processing – functional annotation Functional annotations rely on the identification of sequence similarity searches using tools such as: BWA, BLAT and BLAST While fast, BWA and BLAT rely on near perfect matches to reference genomes However sequence diversity is huge! Functional annotations rely on the identification of sequence similarity searches using tools such as: BWA, BLAT and BLAST While fast, BWA and BLAT rely on near perfect matches to reference genomes However sequence diversity is huge! Every time a new strain of S. galactae was sequenced, new genes are discovered that were not present in existing strains Diversity at the nucleotide level is >> protein level

Module 5 bioinformatics.ca 7. Read processing – functional annotation One solution is to work in peptide space and use BLASTX to search protein databases - this is very time consuming and requires cloud/cluster computing Other solutions include MBLAST and USEARCH (issues over quality and cost) One solution is to work in peptide space and use BLASTX to search protein databases - this is very time consuming and requires cloud/cluster computing Other solutions include MBLAST and USEARCH (issues over quality and cost) Even with BLAST many reads remain unannotated Can be improved with longer read length

Module 5 bioinformatics.ca 7. Quality of BLASTX matches Typical match to a 71bp read E-Value = 39 (NOT e -39 ) Species looks about right though % of Read Length % ID of match Summary for all matches

Module 5 bioinformatics.ca 7. Read processing – converting mappings to expression To normalize expression levels to account for differences in gene length, read counts are converted to Reads per kilobase of transcript mapped (RPKM) # RPKM Expression is biased for gene length (longer transcripts should have more reads) to normalize, reads are converted to Reads per Kilobase of transcript per million reads mapped RPKM geneA = 10 9 C geneA / NL C geneA = number of reads mapped to geneA N = total number of reads L = length of transcript in units of Kb RPKM geneA = 10 9 C geneA / NL C geneA = number of reads mapped to geneA N = total number of reads L = length of transcript in units of Kb Several software tools available to do mapping and calculate normalized expression measurements across different samples including Bowtie and Cufflinks

Module 5 bioinformatics.ca 8. Read processing – taxonomic annotation Previous studies have shown that microbiomes can vary significantly in taxonomic contributions while yielding similar functionality Previous studies have shown that microbiomes can vary significantly in taxonomic contributions while yielding similar functionality The human microbiome consortium 2012 How do we assign taxonomic information? Knowing which species are providing which functions may not be that important On the other hand, assigning RNA reads to taxa may reveal critical functions contributed by keystone taxa, can also help in binning for assembly

Module 5 bioinformatics.ca 8. Read processing – taxonomic annotation Alignment based methods such as BLAST and BWA can fail where we lack suitable reference genomes – particularly for short read datasets where assignments may be ambiguous Compositional methods (e.g. nt frequency, codon bias) offer alternative strategies Alignment based methods such as BLAST and BWA can fail where we lack suitable reference genomes – particularly for short read datasets where assignments may be ambiguous Compositional methods (e.g. nt frequency, codon bias) offer alternative strategies Here a sequences is classified into frequencies of 3-mers Nearest neighbours methods then try to assign a sequence to the genome with the closest distribution MetaCV NBC RITA

Module 5 bioinformatics.ca 8. Read processing – Gist Gist is a computational pipeline for accurate assignment of reads to individual species Integrates several methods, but uniquely assigns different weights to methods for each genome Can also take in expected sequence distributions (e.g. based on 16S rRNA surveys) mRNA genomes (*.ffn) taxonomy database 16S counts other abundances classifier construction environment description taxonomic assignment class definitions taxonomy tags reshape into strain-specific records (DELIN) generate class distributions AA NBAA CD AA 1NN AA GMM BWANT NBNT CD NT 1NN NT GMM class priors log normalization infer protein sequence FragGeneSca n amino acid sequences Classification 1 select candidate hits for detailed processing for each read Classification 2 select best hits final confidence measurements (T-test for hit specificity) final assignment scores CalcTaxonInclu sion initial confidence measurements with T-test for hit specificity CalcTaxonInclu sion

Module 5 bioinformatics.ca 8. Read processing – taxonomic annotation A key advance in Gist is finding the set of weights for each feature/method that best discriminates between different genomes – here C. diff 630 is reliant on Naïve Bayes analysis of amino acid frequencies, whereas S. agalactiae A909 relies more on a Gaussian mixture model of amino acid frequencies

Module 5 bioinformatics.ca 8. Read processing – taxonomic annotation Gist outperforms MetaCV which features a large number of ‘unclassifiable’ NBC works very well if you have reference genomes, but due to reliance on large kmers, can be prone to genetic diversity

Module 5 bioinformatics.ca mRNA expression is not equivalent to rRNA abundance Comparisons between rRNA and mRNA read distributions reveal subtle differences in read abundance -Artefact due to biases in rRNA sequencing? -Abundance != expression? Comparisons between rRNA and mRNA read distributions reveal subtle differences in read abundance -Artefact due to biases in rRNA sequencing? -Abundance != expression?

Module 5 bioinformatics.ca 9. Visualizing results Energy production and metabolism Lipid transport and metabolism Transcription Once reads have been assigned to transcripts, we can explore their function through e.g. functional categories such as Gene Ontology or COG However these types of analyses tend not to be very informative as the categories are very broad Once reads have been assigned to transcripts, we can explore their function through e.g. functional categories such as Gene Ontology or COG However these types of analyses tend not to be very informative as the categories are very broad

Module 5 bioinformatics.ca 9. Visualizing results Protein complexes Genes and proteins do not operate in isolation but form parts of interconnected functional modules Beyond focusing on broad functional categories, we can also start to undertake systems based analyses By placing a bacterial transcripts within these functional contexts, we can understand how the microbiome functions at a molecular level Metabolic pathways Signalling networks

Module 5 bioinformatics.ca 9. Visualizing results By mapping abundance of reads that map to E. coli homologs, protein interaction networks can be used as a template to provide more mechanistic insights – can we identify modules of genes that are coordinately expressed

Module 5 bioinformatics.ca 9. Visualizing results How do we identify those systems which are differentially regulated across samples (e.g. systems up-regulated in disease vrs. health)? Development of a statistically robust platform (1) Gene set enrichment analysis (GSEA) requires discrete groups of functionally related genes (2) Network approach; identify groups of genes based on functional interactions AND their differential expression GSEA may define trpA-E as a single group of related genes, but group as a whole may not be significantly upregulated A network based approach may define a group based on their interactions AND expression

Module 5 bioinformatics.ca 9. Visualizing results MG-RAST and MEGAN are automated metagenomic annotation tools that rely on KEGG A major problem with KEGG pathway definitions is that the boundaries of pathways are arbitrarily defined and links between pathways (i.e. functional relationships) can be lost MG-RAST and MEGAN are automated metagenomic annotation tools that rely on KEGG A major problem with KEGG pathway definitions is that the boundaries of pathways are arbitrarily defined and links between pathways (i.e. functional relationships) can be lost Metabolic pathways are among the most highly conserved and best characterized systems

Module 5 bioinformatics.ca 9. Visualizing results Combining taxonomic and functional information allows us to identify contributions of individual taxa within a microbiome Nucleotide metabolism Amino acid metabolism Carbohydrate metabolism Lipid metabolism The interconversion of pyruvate to acetyl CoA is largely mediated by proteobacteria Piecharts indicate relative contribution of each taxon to an enzymatic activity Metabolic network of a mouse gut microbiome

Module 5 bioinformatics.ca 9. Visualizing results In addition to representing protein interaction or metabolic relationships through network visualization, Cytoscape offers the ability to represent nodes (proteins) as pie charts or donuts, allowing the visualization of taxonomic contributions to discrete functions Genes involved in pathways associated with cell wall biogenesis Relative abundance Relative abundance and contributing taxa

Module 5 bioinformatics.ca 10. Statistical considerations There is no dedicated software or statistical tool for statistical comparisons of metatranscriptomic datasets -Number of biological replicates? (at least two, preferably at least three but these are expensive experiments) -Power analyses? -Differential expression of individual genes -Gene set enrichment analyses Ultimately metatranscriptomics could be viewed as hypothesis generating requiring subsequent targeted validation There is no dedicated software or statistical tool for statistical comparisons of metatranscriptomic datasets -Number of biological replicates? (at least two, preferably at least three but these are expensive experiments) -Power analyses? -Differential expression of individual genes -Gene set enrichment analyses Ultimately metatranscriptomics could be viewed as hypothesis generating requiring subsequent targeted validation

Module 5 bioinformatics.ca 10. Statistical considerations While there are no dedicated tools for metatranscriptomics analyses, tools used for RNA Seq offer potential -DESeq, EdgeR, ALDEx -Alternatively simply rely on fold change -Challenges include which genes to include (minimum RPKM?) Differentially expressed genes can be subsequently analysed through Gene Set Enrichment Approaches based on (for example) hypergeometric distributions While there are no dedicated tools for metatranscriptomics analyses, tools used for RNA Seq offer potential -DESeq, EdgeR, ALDEx -Alternatively simply rely on fold change -Challenges include which genes to include (minimum RPKM?) Differentially expressed genes can be subsequently analysed through Gene Set Enrichment Approaches based on (for example) hypergeometric distributions

Module 5 bioinformatics.ca We are on a Coffee Break & Networking Session