

Schedule change
- Day 2, AM: Introduction to RNA-Seq (and a touch of miRNA-Seq)
- Day 2, PM: RNA-Seq practical (TopHat + Cuffdiff pipeline on Galaxy)
- Day 3, AM: Introduction to Exome Sequencing and Variant Discovery
- Day 3, PM: Exome sequence analysis practical (Galaxy)
- Galaxy server going down for maintenance on Thursday

Quick Recap
- NGS data production is becoming commonplace
- Many applications -> research intent determines the choice of technology platform
- High-volume data BUT error prone
- FASTQ is the accepted format standard
- Must assess quality scores before proceeding
- ‘Bad’ data can be rescued
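To make "assess quality scores before proceeding" concrete, here is a minimal Python sketch that decodes FASTQ quality strings and flags low-quality reads. It assumes the Phred+33 encoding and plain four-line FASTQ records; the example reads and the cutoff of 20 are made up for illustration and are not taken from the course data.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line, offset=33):
    """Average Phred score of one read."""
    scores = phred_scores(quality_line, offset)
    return sum(scores) / len(scores)


def parse_fastq(lines):
    """Yield (read_id, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        sequence = next(it).strip()
        next(it)  # the '+' separator line
        quality = next(it).strip()
        yield header[1:], sequence, quality


# Hypothetical records: read1 is high quality, read2 is very poor.
example = ["@read1", "ACGT", "+", "IIII", "@read2", "ACGT", "+", "####"]
low_quality = [rid for rid, _, q in parse_fastq(example) if mean_quality(q) < 20]
print(low_quality)
```

In practice a dedicated QC tool would be used on real data; the sketch only shows what the quality line encodes.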

Introduction to RNAseq

The Central Dogma of Molecular Biology
Reverse Transcription

RNAseq Protocols
- cDNA is sequenced, not RNA
- Types of libraries available:
  – Total RNA sequencing (not advised)
  – polyA+ RNA sequencing
  – Small RNA sequencing (specific size range targeted)

cDNA Synthesis

Genome-scale Applications
- Transcriptome analysis
- Identifying new transcribed regions
- Expression profiling
- Resequencing to find genetic polymorphisms:
  – SNPs, micro-indels
  – CNVs
  – Question: Why even bother with exome sequencing then?

Sequencing details
- Standard sequencing
  – polyA/total RNA
  – Size selection
  – Primers and adapters
  – Single- and paired-end sequencing
- Strand-specific sequencing (still immature tech)
  – Sequencing only the + or - strand
  – Mostly paired-end

What about microarrays??!!!
- Assumes we know all transcribed regions and that spliceforms are not important
- Cannot find anything novel
- BUT may be the best choice depending on the QUESTION

Arrays vs RNAseq (1)
- Correlation of fold change between arrays and RNAseq is similar to the correlation between array platforms (0.73)
- Technical replicates are almost identical
- Extra analysis: prediction of alternative splicing, SNPs
- Low- and high-expressed genes do not match

RNA-Seq promises/pitfalls
- Can, in a single assay:
  – reveal new genes
  – reveal splice variants
  – quantify genome-wide gene expression
- BUT
  – Data is voluminous and complex
  – Need scalable, fast and mathematically principled analysis software and LOTS of computing resources

Experimental considerations
- Comparative conditions must make biological sense
- Biological replicates are always better than technical ones
- Aim for at least 3 replicates per condition
- ISOLATE the target mRNA species you are after
- NOT looking for new transcripts can bias expression estimates

Analysis strategies
De novo assembly of transcripts:
  + reconstructs actual spliced transcripts
  + does not require a genome sequence
  + easier to work with post-transcriptional modifications
  - requires huge computational resources (RAM)
  - low sensitivity: hard to capture low-abundance transcripts
Alignment to the genome => transcript assembly:
  + computationally feasible
  + high sensitivity
  + easier to annotate using genomic annotations
  - need to take special care of splice junctions

Basic analysis flowchart
Illumina reads
- Remove artifacts (AAA..., ...N...)
- Clip adapters (small RNA)
- Pre-filter: low complexity, synthetic (count and discard)
- "Collapse" identical reads
- Align to the genome
  – un-mapped: re-align with a different number of mismatches etc.
  – mapped: assemble contigs (exons) + connectivity
- Filter out low-confidence contigs (singletons)
- Annotate
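Two of the early flowchart steps, removing artifact reads and collapsing identical reads, can be sketched in a few lines of Python. This is an illustration only: the function names, the zero-N rule and the 90% single-base threshold are assumptions chosen for the example, not settings from any particular tool.

```python
from collections import Counter


def is_artifact(read, max_n=0, single_base_fraction=0.9):
    """Flag reads with uncalled bases (N) or dominated by one base (e.g. AAAA...)."""
    if not read:
        return True
    if read.count("N") > max_n:
        return True
    # Count of the most frequent base in the read.
    top_count = Counter(read).most_common(1)[0][1]
    return top_count / len(read) >= single_base_fraction


def collapse_reads(reads):
    """Drop artifact reads, then collapse identical sequences into counts."""
    return Counter(read for read in reads if not is_artifact(read))


# Hypothetical reads: one duplicate pair, one poly-A read, one read with an N.
reads = ["ACGTACGT", "ACGTACGT", "AAAAAAAA", "ACGNACGT", "TTGCATGC"]
collapsed = collapse_reads(reads)
print(collapsed)
```

Collapsing keeps one copy of each sequence plus its count, so the aligner sees every distinct read only once.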

Software
- Short-read aligners: Stampy, BWA, Novoalign, Bowtie, TopHat
- Data preprocessing: FASTX-Toolkit, SAMtools
- Expression studies: Cufflinks package, R packages (DESeq, edgeR, more…)
- Alternative splicing: Cufflinks, Augustus

The ‘Tuxedo’ protocol: TopHat + Cufflinks
- TopHat aligns reads to the genome and discovers splice sites
- Cufflinks predicts the transcripts present in the dataset
- Cuffdiff identifies differential expression
- Very widely adopted suite
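Outside Galaxy, the protocol is a short chain of command-line calls. The Python sketch below only builds the argument lists and does not run anything. All file and directory names are hypothetical, there is one FASTQ file per condition for simplicity, and the options used (`-p`, `-o`, `-g`, `-s`, `-L`) should be checked against the manual of the installed versions.

```python
def tuxedo_commands(bowtie_index, genome_fasta, reference_gtf, samples, threads=4):
    """Build TopHat -> Cufflinks -> Cuffmerge -> Cuffdiff command lines.

    samples maps a condition label to a single FASTQ path (a simplification).
    Returns the command lists and the assembly paths that should be written,
    one per line, to the file given to cuffmerge.
    """
    commands, bams, assemblies = [], [], []
    for label, fastq in samples.items():
        tophat_dir = f"tophat_{label}"
        cufflinks_dir = f"cufflinks_{label}"
        # Spliced alignment of the reads against the indexed genome.
        commands.append(["tophat", "-p", str(threads), "-o", tophat_dir,
                         bowtie_index, fastq])
        bam = f"{tophat_dir}/accepted_hits.bam"
        # Per-sample transcript assembly from the alignments.
        commands.append(["cufflinks", "-p", str(threads), "-o", cufflinks_dir, bam])
        bams.append(bam)
        assemblies.append(f"{cufflinks_dir}/transcripts.gtf")
    # Merge the per-sample assemblies with the reference annotation.
    commands.append(["cuffmerge", "-p", str(threads), "-o", "merged",
                     "-g", reference_gtf, "-s", genome_fasta, "assemblies.txt"])
    # Test for differential expression between the labelled conditions.
    commands.append(["cuffdiff", "-p", str(threads), "-o", "cuffdiff_out",
                     "-L", ",".join(samples), "merged/merged.gtf", *bams])
    return commands, assemblies


cmds, assembly_list = tuxedo_commands(
    "genome_index", "genome.fa", "genes.gtf",
    {"q1": "q1.fastq", "q2": "q2.fastq"},
)
for cmd in cmds:
    print(" ".join(cmd))
```

The returned `assembly_list` would need to be written to `assemblies.txt` before running the cuffmerge step.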

‘Tuxedo’ protocol limitations
- Uses short-read data (Illumina OR SOLiD)
- Requires a sequenced genome
- No GUI
- Versions implemented in Galaxy are old(ish)

Read alignment with TopHat

Splice junctions
- In humans, terminal exons are ~1 kb long, and since mRNAs are ~2 kb, ~half of the reads should originate in initial and internal exons
- Initial and internal exons are ~200 bp long => for 75-mer reads, ~20% of reads are expected to cross splice junctions
[Figure labels: L, R, exon; RNA vs. Genome]
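The ~20% figure can be checked with back-of-the-envelope arithmetic. The Python sketch below assumes read start positions are uniform along the transcript, that a read crosses a junction whenever it starts within the last (read length - 1) bases of a ~200 bp exon, and that reads starting in the ~1 kb terminal exon never cross one. These are simplifications made for the example, not part of the original slide.

```python
def fraction_crossing_junction(exon_length, read_length):
    """Fraction of start positions in an exon whose read runs past the exon's 3' end."""
    # A read starting in the last (read_length - 1) bases extends into the next exon.
    crossing_starts = min(read_length - 1, exon_length)
    return crossing_starts / exon_length


def expected_spliced_fraction(exon_length=200, read_length=75, share_in_short_exons=0.5):
    """Overall fraction of reads expected to span a splice junction."""
    return share_in_short_exons * fraction_crossing_junction(exon_length, read_length)


print(round(expected_spliced_fraction(), 3))
```

With the slide's numbers this gives 0.5 × 74/200 = 0.185, in line with the quoted ~20%.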

Splice junctions strategies
- Create a splice junction database by joining donors and acceptors
- Typically, use known (annotated) splice junctions or known splice sites
- TopHat: uses putative exons from mapped reads; the database is built from canonical splice sites around the putative exons

Read alignment with TopHat (2)
- Uses the Bowtie aligner to align reads to the genome
- Bowtie cannot deal with large gaps (introns)
- TopHat splits the reads that remain unaligned into segments
- The smaller segments mostly end up aligning

Read alignment with TopHat (3)
- A large gap between segments of the same read -> probable INTRON
- TopHat uses this to build an index of probable splice sites
- Allows accurate measurement of spliceform expression
- Possibility of detecting gene fusion events
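The segment idea can be illustrated with a toy Python example. This is not TopHat's algorithm: it uses exact string matching instead of Bowtie, assumes each segment occurs only once in the genome, and assumes the junction falls exactly on a segment boundary. The sequences are invented for the example.

```python
def find_introns_by_segments(read, genome, segment_length):
    """Split a read into segments, place each by exact match, report gaps.

    Returns a list of (start, end) genome coordinates (end exclusive) for the
    gaps between consecutive segments, i.e. putative introns.
    """
    segments = [read[i:i + segment_length] for i in range(0, len(read), segment_length)]
    positions = [genome.find(segment) for segment in segments]
    gaps = []
    for i in range(len(segments) - 1):
        left, right = positions[i], positions[i + 1]
        if left == -1 or right == -1:
            continue  # one of the two segments could not be placed
        expected = left + len(segments[i])
        if right > expected:
            gaps.append((expected, right))
    return gaps


# Toy genome: flank + exon1 + intron (GT...AG) + exon2 + flank.
exon1, intron, exon2 = "ACGTTGCAAC", "GTAAGTCCCTTTTCAG", "GGATCCTAGC"
genome = "TTTT" + exon1 + intron + exon2 + "AAAA"
spliced_read = exon1 + exon2  # the read comes from the spliced transcript

introns = find_introns_by_segments(spliced_read, genome, segment_length=10)
print(introns, [genome[s:e] for s, e in introns])
```

The whole read does not align contiguously to the genome, but its two halves do, and the distance between them marks the candidate intron.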

Cufflinks package
- Cufflinks:
  – Expression value calculation
  – De novo transcript assembly
- Cuffcompare:
  – Transcript comparison (de novo/genome annotation)
- Cuffdiff:
  – Differential expression analysis

Cufflinks: Transcript assembly
- Assembles individual transcripts based on aligned reads
- Infers the likely spliceforms of each gene
- Builds ‘transfrags’
- Reports the smallest number of spliceforms that can explain the data
- NOTE: assembly errors do occur -> sequencing depth helps

Cufflinks: Transcript assembly (2)
- Quantifies the expression level of each transfrag
- Filters out those likely to be premature terminations, non-mature mRNAs, etc.

Cuffmerge
- Merges transfrags into transcripts where appropriate
- Also performs a reference-based assembly of transcripts using known transcripts
- Produces a single annotation file, which aids downstream analysis

Cuffdiff: Differential expression
- Calculates expression level in two or more samples
- Expression level relates to read abundance
- Because of sources of bias, Cuffdiff tries to model the variance in its significance calculation
- What else is important?

FPKM (RPKM): Expression Values
- Fragments (Reads) Per Kilobase of exon model per Million mapped fragments
- Mortazavi A et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008.
- RPKM = 10^9 × C / (N × L), where:
  – C = the number of reads mapped onto the gene's exons
  – N = total number of reads in the experiment
  – L = the sum of the exons in base pairs
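As a worked example, RPKM as defined by Mortazavi et al. is 10^9 × C / (N × L), using the C, N and L given above. The Python sketch below applies that formula to made-up numbers; the function name and the example counts are illustrative only.

```python
def rpkm(reads_on_exons, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    reads_on_exons     -> C, reads mapped onto the gene's exons
    total_mapped_reads -> N, total number of reads in the experiment
    exon_length_bp     -> L, summed exon length in base pairs
    """
    return 1e9 * reads_on_exons / (total_mapped_reads * exon_length_bp)


# Hypothetical gene: 500 reads on 2 kb of exons, in a 10-million-read experiment.
value = rpkm(reads_on_exons=500, exon_length_bp=2000, total_mapped_reads=10_000_000)
print(value)
```

For paired-end data the same formula is applied to fragments rather than individual reads, which is what the F in FPKM refers to.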

Cufflinks (Expression analysis)
Output columns: gene_id, bundle_id, chr, left, right, FPKM, FPKM_conf_lo, FPKM_conf_hi, status
[Example output table: 12 genes with Ensembl (ENSG) IDs, all with status OK]

Cuffdiff (differential expression)
- Pairwise or time series comparison
- Normal distribution of read counts
- Fisher’s test
Output columns: test_id, gene, locus, sample_1, sample_2, status, value_1, value_2, ln(fold_change), test_stat, p_value, significant
[Example output table, samples q1 vs q2: TSPAN6 (chrX), TNMD (chrX) and DPM1 (chr20) with status NOTEST and significant = no; SCYL3 (chr1) with status OK and significant = yes]
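A typical next step is to pull the significant genes out of the Cuffdiff table. The Python sketch below assumes a tab-delimited table with the column names shown on the slide. The gene names and all values in the example are invented, and the function name is hypothetical.

```python
import csv
import io


def significant_genes(table_text):
    """Return (gene, ln fold change) for rows that were tested and called significant."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    return [
        (row["gene"], float(row["ln(fold_change)"]))
        for row in reader
        if row["status"] == "OK" and row["significant"] == "yes"
    ]


# Made-up rows using the slide's column names; values are not real results.
columns = ["test_id", "gene", "locus", "sample_1", "sample_2", "status",
           "value_1", "value_2", "ln(fold_change)", "test_stat", "p_value", "significant"]
rows = [
    ["ID_1", "GENE_A", "chrX:1-100", "q1", "q2", "NOTEST", "0", "0", "0", "0", "1", "no"],
    ["ID_2", "GENE_B", "chr1:1-100", "q1", "q2", "OK", "10", "20", "0.693", "2.5", "0.001", "yes"],
]
table = "\n".join("\t".join(fields) for fields in [columns] + rows)

print(significant_genes(table))
```

Rows with status NOTEST are skipped because no test was performed for them.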

Visualization: Genome Viewers
- Web based:
  – UCSC Genome Browser
- Standalone:
  – Integrative Genomics Viewer (IGV)

RNAseq hands-on practical (Galaxy)
- Data QC and trimming
- Aligning reads to the reference genome
- Running Cufflinks and looking at some transcripts using the UCSC Genome Browser
- Finding differentially expressed genes with Cuffdiff