1
NESCENT : NGS : Measuring expression
Jen Taylor Bioinformatics Team CSIRO Plant Industry
2
Measuring Expression
What & Why
  What is expression and why do we care?
How
  Platforms / Technology
    Closed approaches – Microarray
    Open approaches – Sequencing
  Experimental Design
  Analysis
  Biases
  Bioinformatics
  Statistical Issues and Analysis
In action
  Workshop – Detection of Differential Expression
  Case Studies in Plant functional genomics
CSIRO. Nescent August Measuring Expression
3
What is expression / transcriptome?
Diagram labels: DNA, mRNA, rRNA, tRNA, siRNA, microRNA, piRNA, tasiRNA, lncRNA
4
Beyond the Genome
1995: Human Genome sequencing begins in earnest, “Mapping the Book of Life” = approx. 140,000 genes
First Draft = 30,000 – 40,000 genes ??
Essential Completion = 24,195 genes !!!???
Image: Commemorative stained glass window for F.C. Crick, designed by Maria McClafferty (Photograph: Paul Forster). Gonville & Caius College, Cambridge, UK.
5
“The failure of the human genome”
“despite more than 700 genome-scanning publications and nearly $100bn spent, geneticists still had not found more than a fractional genetic basis for human disease”
Manolio et al., Nature, 2009
“The most likely explanation for why genes for common diseases have not been found is that, with few exceptions, they do not exist. …., if inherited genes are not to blame for our commonest illnesses, can we find out what is?”
Guardian, 2011
6
Beyond the Genome: Gene Number ≠ Complexity
Diagram labels: Transcriptome, Complexity, Regulation, Gene
7
Why the expression?
High-throughput friendly
Predicts Biology
Diagram labels: Genome, Regulatory network **, Transcriptome, Context dependent, Proteome
** Li et al., 2004
8
Measuring Expression?
Comparisons
  Population-level
  Between genomes
Parts Description
  Function?
  Interconnectedness?
9
Measuring Expression?
What are important members of a transcriptome?
mRNA
  polyadenylated, coding
  alternatively spliced
Noncoding RNA (small RNA)
  varying lengths, functions (18 – 32 bases)
  microRNA, siRNA, piRNA, tasiRNA, long non-coding RNA
“Dark” RNA
  transcription outside of annotated genes
  Non-polyadenylated
  Anti-sense transcription
10
Measuring Expression?
How does the transcriptome vary to give rise to phenotype?
Changes in Abundance
  Abundance = Rate of Transcription – Rate of Decay
Changes in Function
  Availability for function – polyadenylation, silencing, localisation
  Suitability for function – alternate splicing
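One way to read the abundance relation is as a simple rate equation. The sketch below assumes a constant transcription rate and first-order decay; the symbols m (transcript abundance), k_s (synthesis rate) and k_d (decay rate constant) are illustrative and do not appear on the slides.

```latex
\frac{dm}{dt} = k_s - k_d\, m,
\qquad
m^{*} = \frac{k_s}{k_d} \quad \text{(steady state, where } dm/dt = 0\text{)}
```

Under these assumptions, the same steady-state abundance can arise from different combinations of transcription and decay rates, so abundance alone does not say which of the two changed.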
11
How to measure Expression
PLATFORMS / TECHNOLOGY
12
Measuring Expression: platforms
Closed systems – microarray
Probes immobilised on a substrate profile target species in the transcriptome
13
CSIRO. Nescent August 2011 - Measuring Expression
14
Single and two colour arrays
Workflow diagram labels: Probe Library, Array Manufacture, Labelling, Hybridisation, Array Scanning
Two colour labelling: Control, Experimental
Single colour labelling: Sample A
15
Array profiling: Affymetrix Array Targets
Arabidopsis Genome: 24,000
C. elegans Genome: ,500
Drosophila Genome: ,500
E. coli Genome: ,366
Human Genome U133 Plus: 47,000
Mouse Genome: 39,000
Yeast Genome: S. cerevisiae 5,841; S. pombe 5,031
Rat Genome: 30,000
Zebrafish: 14,900
Plasmodium / Anopheles: P. falciparum 4,300; A. gambiae 14,900
Other arrays: Barley (25,500), Soybean (37, ,300 pathogen), Grape (15,700), Canine (21,700), Bovine (23,000), B. subtilis (5,000), S. aureus (3,300 ORFs), Xenopus (14,400)
18
Closed System – Microarray
Pros
  High-throughput
  Targeted profiling
  Inexpensive – “population friendly”
  Analytical methods are standardised
Cons
  “Closed system”: novel = invisible
  Difficult to see allele-specific expression
  Biases due to hybridisation
    SNPs
    Competitive and non-specific hybridisation
19
Open systems – RNA Sequencing
Technology: Illumina, SOLiD, IonTorrent, 454
Pros:
  Transcript discovery
  Allelic expression
  High resolution abundance measures
Cons:
  Analysis can be complex
  Expensive
  Sensitivity depends on sequencing depth
20
RNA Sequencing
Figure: Mortazavi et al., 2008
21
RNASeq - Correspondence
Range > 5 orders of magnitude
Better detection of low abundance transcripts
Marioni et al., 2009
22
Platform Choice / Sample Preparation Choice
What do you want to profile?
  Polyadenylated: PolyA RNA extraction
  Small RNA (< 100 bases): Size filtering by gel
  Strand-specific
  RNA – Protein Interactions: RNA Immunoprecipitation (IP)
23
RNASeq - Workflow
Workflow diagram labels:
  Sample
  Total RNA, PolyA RNA, Small RNA
  Library Construction
  Sequencing
  Base calling & QC
  Mapping to Genome / Assembly to Contigs
  Differential Expression, SNP detection, Transcript structure, Secondary structure, Targets or Products
24
Illumina RNASeq : TruSeq
25
Small RNA sequencing
Small RNA separation: PAGE (gel size markers 134, 110, 75, 25; small RNA < 35 bp)
Control of:
  Adaptor removal (contaminants)
  Theoretical distributions (GC contents)
  Sequencing artefacts
  Low quality and low complexity sequence removal
  Base call reliability (PHRED score)
  Collapsing sequence redundancy
Resolved issues:
  Adaptor version Illumina (one codon difference, 3 nt)
    ATCTCGTATGCCGTCTTCTGCTTG  v1.5
    TCGTATGCCGTCTTCTGCTTG  SRA
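Two of the control steps listed above, adaptor removal and collapsing sequence redundancy, can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the talk: it assumes exact adaptor matches only, uses the first adaptor sequence quoted on the slide, and the function names, the `min_overlap` value, the 18 to 35 base size window and the example insert are all hypothetical.

```python
from collections import Counter

# 3' adaptor sequence quoted on the slide; which kit a library used is an assumption.
ADAPTOR = "ATCTCGTATGCCGTCTTCTGCTTG"


def trim_adaptor(read, adaptor=ADAPTOR, min_overlap=6):
    """Return the read with the 3' adaptor removed (exact matching only)."""
    pos = read.find(adaptor)
    if pos != -1:
        return read[:pos]  # full adaptor found: keep what precedes it
    # Otherwise look for the longest adaptor prefix that ends the read.
    for k in range(min(len(adaptor) - 1, len(read)), min_overlap - 1, -1):
        if read.endswith(adaptor[:k]):
            return read[:-k]
    return read


def collapse(reads, min_len=18, max_len=35):
    """Trim each read, keep inserts in the size window, count identical inserts."""
    counts = Counter()
    for read in reads:
        insert = trim_adaptor(read)
        if min_len <= len(insert) <= max_len:
            counts[insert] += 1
    return counts


# Hypothetical 22-base insert used only for illustration.
INSERT = "TGAGGTAGTAGGTTGTATAGTT"
reads = [
    INSERT + ADAPTOR,        # full adaptor present
    INSERT + ADAPTOR,        # identical read, collapsed with the one above
    INSERT + ADAPTOR[:8],    # read ends part-way through the adaptor
    "ACGT" + ADAPTOR,        # insert too short, filtered out
]
collapsed = collapse(reads)
```

Real data usually needs mismatch-tolerant matching and quality filtering as well, which is why dedicated trimming tools are normally used for this step.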
26
Strand-specificity
Approaches: using adaptors; using chemical modification
  Ligation: 3’ and 5’ adaptors added sequentially
  dUTP: addition and removal after selection
  SMART: addition of C’s on 5’ end
Levin et al., 2010
27
Levin et al., 2010
28
Non-polyA methods
Total RNA extraction
  Ribosomal RNA and tRNA > 95-97% of total RNA
Ribosomal reduction methods
  Subtractive hybridisation with rRNA probes
  Exonuclease cleavage of rRNA
  NuGen – “proprietary combination of reverse transcriptase and primers in the Ovation RNA-Seq System”
cDNA normalisation methods
  Partial digestion of any highly abundant species (Evrogen)
29
Platform Choice / Sample Preparation Choice
What do you want to profile?
  Polyadenylated: PolyA RNA extraction
  Small RNA (< 100 bases): Size filtering by gel
  Strand-specific
  RNA – Protein Interactions: RNA Immunoprecipitation (IP)
  Non-PolyA: rRNA reduction
30
EXPERIMENTAL DESIGN and ANALYSIS
31
RNASeq Experimental Design
Issues:
  sequencing depth – how much?
  number of replicates – how many?
Aims of the data:
  Transcriptome assembly / transcript characterisation: maximise depth
  Detection of differential expression (de novo or reference): balance depth and replication
CSIRO. Sequencing Depth V.S. Number of Replicates
32
Defining Replicates: Technical Replicates and Biological Replicates
Diagram labels:
  Multiplex: Library 1 to Library 4 on Lane 1 (L1, L2, L3, L4), 25% lane / sample
  Technical Replicates: Individual; Library 1, Library 2; Lane 1 to Lane 4
  Biological Replicates: Individual 1, Individual 2; Library 1, Library 2; Lane 1, Lane 2
  Depth = 2 x 100% lane / sample; 100% lane / sample
34
Coverage Depth
35
Number of Replicates
# Reps:   2     4     6     8     10    12
False P:  0.03
False N:  0.84  0.72  0.64  0.59  0.54  0.50
True P:   0.16  0.28  0.36  0.41  0.46
True N:   0.97
edgeR <= 0.01, DESeq <= 0.01
More information in biological replicates than depth for differential expression
36
RNASeq Analysis
Overall Aim: To get an accurate measurement of transcript abundance, structure and identity
Biases and Compositions
Alignment: TopHat / Cufflinks
Assembly: ABySS
37
Assumptions
Every transcript / k-mer has equal chance of being sequenced
No. sequences observed ≈ transcript abundance
Gene A = z Reads / million
Gene B = y Reads / million
z = 2 x y, so Gene A > Gene B
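The reads-per-million comparison can be made concrete with a small calculation. This is a minimal sketch under the stated assumptions (every transcript equally likely to be sequenced); the gene names and counts are made up for illustration.

```python
def reads_per_million(counts):
    """Scale raw read counts to reads per million mapped reads."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


# Hypothetical library of one million reads.
counts = {"geneA": 200, "geneB": 100, "everything_else": 999_700}
rpm = reads_per_million(counts)
fold = rpm["geneA"] / rpm["geneB"]  # z / y in the slide's notation
```

Here z = 2 x y, which is read as Gene A being twice as abundant as Gene B only if the equal-chance assumption holds.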
38
Length Bias
Oshlack and Wakefield, 2009
39
Alignment Bias
41
Sequencing Bias
Hansen et al., 2010
42
Bias
Every transcript / k-mer has equal chance of being sequenced
No. sequences observed ≈ transcript abundance
Gene A = z Reads / million / kb
Gene B = y Reads / million / kb
Weighting schemas (e.g. Cufflinks):
  Mapability
  kmer / fragment frequencies
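Dividing by transcript length as well as library size gives reads per kilobase per million. The sketch below shows why the length term matters; the function name, gene lengths and counts are hypothetical, and it does not attempt the mapability or fragment-frequency weighting mentioned above.

```python
def rpkm(count, length_bp, library_size):
    """Reads per kilobase of transcript per million mapped reads."""
    return count * 1e9 / (length_bp * library_size)


LIBRARY_SIZE = 2_000_000  # hypothetical number of mapped reads

# Two hypothetical genes with identical raw counts but different lengths.
long_gene = rpkm(count=400, length_bp=4000, library_size=LIBRARY_SIZE)
short_gene = rpkm(count=400, length_bp=1000, library_size=LIBRARY_SIZE)
```

With equal raw counts, the shorter gene ends up with four times the length-normalised value, because a longer transcript yields more fragments at the same molar abundance.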
43
Bias
Every transcript / k-mer has equal chance of being sequenced
No. sequences observed ≈ transcript abundance
Sample A vs Sample B
Gene A1 = z Reads per million
Gene A2 = y Reads per million
z = 2 x y
44
Read density variability
45
RNASeq – Compositional properties
Depth of Sequence
Sequence count ≈ Transcript Abundance
Majority of the data can be dominated by a small number of highly abundant transcripts
Ability to observe transcripts of smaller abundance is dependent upon sequence depth
Fixed budget of reads
46
A simple example – compositional bias
Sequencing budget / depth: 4000 reads
sample I: transcripts A, B, C, D; expected counts 1000
sample II: transcripts A, B; expected counts 2000
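The arithmetic behind this example can be checked directly. The sketch assumes each expressed transcript has the same true abundance in both samples and that reads are shared in proportion to abundance; the function and variable names are illustrative.

```python
def expected_counts(abundance, depth):
    """Share a fixed read budget among transcripts in proportion to abundance."""
    total = sum(abundance.values())
    return {t: depth * a / total for t, a in abundance.items()}


DEPTH = 4000  # sequencing budget from the slide

# Assumed true abundances: A and B are unchanged; C and D are absent in sample II.
sample_1 = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 1.0}
sample_2 = {"A": 1.0, "B": 1.0}

counts_1 = expected_counts(sample_1, DEPTH)
counts_2 = expected_counts(sample_2, DEPTH)
apparent_fold_change_A = counts_2["A"] / counts_1["A"]
```

Transcript A appears two-fold higher in sample II although its abundance has not changed, purely because C and D no longer consume part of the read budget.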
47
Soil diversity by phylogenetic analysis - Phylum level
454-sequence analysis of bacterial 16S rRNA gene, ~410,000 sequences
Chart: % distribution (0% to 100%) of recognized bacterial phyla in samples A, B, C
A. Richardson, CSIRO
48
RNASeq Bioinformatics Analysis
Aims: To get an accurate measurement of transcript abundance, structure and identity
Biases and Compositions
  Relative abundances NOT absolute
Alignment: TopHat
Assembly: ABySS
49
RNA Sequencing analysis
Workflow diagram labels: Sequence Data, Genome?, Assembly, Alignment, Contigs, Read Density, Differential Expression, SNPs, Transcript Characterisation
50
RNASeq – Alignment Considerations
Reads with multiple locations
  Discard / Random Allocation
  Clustering – local coverage
  Weighting
Reads Spanning Exons
  Make and align to exon junction libraries
  De novo junction detection
Summarisation of counts
  Exons
  Transcript boundaries
  Inferred read boundaries
51
TopHat
Multimapping: ≤10 sites
Assembly: consensus ‘island’ exon
Trapnell et al., 2009; Roberts et al., 2011
52
TopHat / Cufflinks
Heuristics:
  “Correct” errors in low coverage areas
  Grabs 45 bp either side of islands to capture splice sites
  Collapses small islands
  Looks for junctions within larger, highly covered islands
Cufflinks: calculates the probability of observing a certain fragment within a given transcript, given surrounding fragments.
Trapnell et al., 2009; Roberts et al., 2011
53
Alignment
Great if you have a fully annotated reference
Okay if you have a partially annotated reference
“Different” if you have a big bunch of ESTs
Options:
  Align to a neighbouring genome or EST library
  De novo transcriptome assembly
Tools: ABySS, Mira, Trinity, HT-Seq, SAMtools
55
De novo transcriptome assembly
Assemblers: ABySS, MIRA, Trinity, Velvet, AllPaths, Soap-denovo, Euler, CABOG, Edena, SHARCGS, VCAKE, SSAKE, CAP3
Criteria:
  Will run on reasonable computer resources for large genomes (e.g. < 1 TB of RAM)
  Paired end data handling
  Platform flexible
  Handles haplotype complexity and polyploid genomes
56
De novo transcriptome assembly
Assemblers: ABySS, MIRA, Trinity, Velvet, AllPaths, Soap-denovo, Euler, CABOG, Edena, SHARCGS, VCAKE, SSAKE, CAP3
Criteria:
  Will run on reasonable computer resources for large genomes (e.g. < 1 TB of RAM)
  Handles paired end data
  Handles data from all platforms
  Handles haplotype complexity and polyploid genomes
57
Assembly – Kmer graphs
K = 4
Miller et al., 2010
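A k-mer graph with K = 4 can be built in a few lines. This is a minimal sketch of the general idea (nodes are k-mers, edges join k-mers that are consecutive in a read and so overlap by K - 1 bases), not the data structure of any particular assembler; the reads and function name are made up for illustration.

```python
from collections import defaultdict


def kmer_graph(reads, k=4):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)  # consecutive k-mers overlap by k - 1 bases
    return graph


# Two hypothetical overlapping reads from the sequence AACCGGTT.
reads = ["AACCGG", "CCGGTT"]
graph = kmer_graph(reads, k=4)

# Follow the single unbranched path through the graph to rebuild the sequence.
node = "AACC"
contig = node
while len(graph.get(node, ())) == 1:
    node = next(iter(graph[node]))
    contig += node[-1]
```

The walk stops at the first node with zero or several successors, which is where sequencing errors, polymorphisms and repeats complicate real graphs.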
58
Assembly – Kmer graphs
Spurs: Sequencing error
Bubbles: Sequencing error, Polymorphism
Frayed Rope / Cycles: Repeats
Miller et al., 2010
60
ABySS & TransABySS
User specifies k
Optimal k depends on sequencing depth
61
ABySS & TransABySS
Sequencing depth is relative to transcript abundance
Iterate over multiple k and merge
Contigs contained within a large contig are “buried”
62
Assessing assembly quality?
Comparisons between assembly algorithms
Contig summary statistics
Comparisons to known resources (e.g. ESTs)
Trial on Rice Transcriptome: 120 million 75 bp single-end Illumina reads (embryo)
ABySS:
  Number of contigs = 6,804
  Contig length range = 38 – 2,818 [mean = 203]
Database comparisons:
  Rice public cDNA sequences: 67,393
  Contigs with high quality matches to cDNA: 6,555 (96%)
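The contig summary statistics and the match rate quoted above are simple to compute. The sketch below uses a made-up list of contig lengths (only the minimum and maximum are taken from the slide) and hypothetical function names; the matched fraction uses the two counts reported for the rice trial.

```python
def contig_summary(lengths):
    """Basic summary statistics for a list of contig lengths."""
    return {
        "n": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


def percent_matched(n_matched, n_contigs):
    """Percentage of contigs with a high quality match to a known sequence."""
    return 100.0 * n_matched / n_contigs


# Hypothetical contig lengths for illustration only.
summary = contig_summary([38, 150, 203, 2818])

# Counts reported on the slide: 6,555 matched contigs out of 6,804.
matched = percent_matched(6555, 6804)
```

The matched percentage rounds to the 96% reported on the slide.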
63
RNASeq Bioinformatics Analysis
Aims: To get an accurate measurement of transcript abundance, structure and identity
Biases and Compositions
  Relative abundances NOT absolute
Alignment
Assembly
64
STATISTICAL ISSUES
65
Measuring Expression – Statistical Issues
Data elements
Normalisation
Detection of Differential Expression
66
Count Data: of what?
67
Count Data: of what?
Garber et al., 2011
68
Statistical analysis of RNASeq
Count data
Distribution is positively skewed, not normal
Between sample variability in counts – normalisation
69
Normalization is required
Two scenarios:
  1. Different sizes of total reads (library size)
  2. Fixed library size, subset of highly expressed reads in 1 sample.
Both reduce sequencing budget available for the majority of transcripts
70
Normalisation
Assume the majority of log ratios = 0 [No change]
TMM: Trimmed Mean of M values (log ratios)
Adjust TMM to be equal between samples
Robinson and Oshlack, 2010
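The idea behind TMM can be sketched with a simplified version. This is not the edgeR implementation: it trims on M values only and takes an unweighted mean, whereas the published method also trims on absolute expression and weights each gene. The function name, the 30% trim and the counts are assumptions for illustration.

```python
import math


def simple_tmm_factor(sample, reference, trim_percent=30):
    """Simplified TMM: trimmed mean of log2 ratios of count proportions."""
    n_sample, n_ref = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((s / n_sample) / (r / n_ref))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0  # log ratios are undefined for zero counts
    )
    cut = len(m_values) * trim_percent // 100  # genes dropped from each tail
    kept = m_values[cut:len(m_values) - cut]
    return 2 ** (sum(kept) / len(kept))


# Hypothetical counts: nine unchanged genes plus one gene that is highly
# expressed in the sample and absorbs half of its reads.
reference = [100] * 10
sample = [100] * 9 + [1100]

factor = simple_tmm_factor(sample, reference)
effective_library_size = sum(sample) * factor
```

The factor comes out at 0.5, so the effective library size of the sample matches the reference (1000 reads) and the nine unchanged genes have log ratios of 0 after adjustment.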
71
DE genes with and without TMM normalization
72
RNASeq data – Poisson Distributions
Poisson distributions are used when things are counted
The probability of seeing n events in a fixed time or space
  The number of lions on a 1 day safari
  The number of raindrops on a tennis court
  The number of flying elephants in a year
Requires λ: rate of events
Variance = mean = λ
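The property Variance = mean = λ can be verified numerically from the Poisson probability mass function, P(N = n) = e^(-λ) λ^n / n!. The rate λ = 3 and the truncation of the sum at 60 events are arbitrary choices for this sketch.

```python
import math


def poisson_pmf(n, lam):
    """Probability of seeing exactly n events when the expected number is lam."""
    return math.exp(-lam) * lam ** n / math.factorial(n)


lam = 3.0
support = range(60)  # the probability mass beyond 60 events is negligible for lam = 3

mean = sum(n * poisson_pmf(n, lam) for n in support)
variance = sum((n - mean) ** 2 * poisson_pmf(n, lam) for n in support)
```

Both `mean` and `variance` come out equal to λ up to floating point error.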
73
RNASeq data – Negative Binomial
RNASeq data is more variable than Poisson
  Variance > mean = λ
  Less prominent for large mean
Over-dispersed Poisson
Noise types
  Shot noise: unavoidable, prominent for low mean
  Technical noise: small, hopefully; can be managed
  Biological noise: sample differences
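One common way to write the extra variance is the negative binomial form, variance = mean + dispersion x mean^2. The sketch below assumes that parameterisation (the slides do not state one) and uses an arbitrary dispersion of 0.1 to show how the shot-noise share of the variability shrinks as the mean grows.

```python
def nb_variance(mean, dispersion):
    """Negative binomial variance: Poisson (shot) part plus overdispersion part."""
    return mean + dispersion * mean ** 2


def squared_cv(mean, dispersion):
    """Squared coefficient of variation, equal to 1 / mean + dispersion."""
    return nb_variance(mean, dispersion) / mean ** 2


DISPERSION = 0.1  # hypothetical value for illustration

low = squared_cv(10, DISPERSION)     # shot noise (1/10) is as large as the dispersion
high = squared_cv(1000, DISPERSION)  # shot noise (1/1000) is almost negligible
```

At a mean of 10 the squared coefficient of variation is 0.2, half of it from shot noise; at a mean of 1000 it is 0.101, almost entirely from the dispersion term.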
74
RNA Seq: Variance also depends on the mean
Anders, 2010
75
RNASeq Model
The total counts for a transcript in sample j from condition c:
  Equation terms labelled on the slide: Library normalisation, Mean Value, Fitted Variance (overdispersion)
For a given gene, test for a difference in counts between conditions.
Is mean c1 + mean c2 statistically different to mean c1 + mean c1?
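The slide's equation did not survive extraction. One formulation consistent with its three labels is the negative binomial model of DESeq (Anders and Huber, 2010); it is given here as an assumption about what was shown, with K_{ij} the count for transcript i in sample j, s_j the library normalisation (size) factor, q_{i,c} the mean value for condition c, and v_{i,c} the fitted variance term.

```latex
K_{ij} \sim \mathrm{NB}\!\left(\mu_{ij},\, \sigma_{ij}^{2}\right),
\qquad
\mu_{ij} = s_j \, q_{i,c},
\qquad
\sigma_{ij}^{2} = \mu_{ij} + s_j^{2} \, v_{i,c}
```

Under this reading, the test for a given gene i asks whether the data are compatible with q_{i,c_1} = q_{i,c_2}.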
76
RNASeq DE Testing
DESeq – Anders and Huber, 2010
edgeR – Robinson et al., 2009 – R
baySeq – Hardcastle and Kelly, 2010 – R
DEGSeq – Wang et al., 2010 – R
NBP – Di et al., 2011
LOX – Zhang et al., 2010
  Infers expression measures allowing for incorporation of noise from different methodologies in the one experimental design
78
Thank you
Plant Industry
Jennifer M Taylor, Bioinformatics Leader
Acknowledgements: Jose Robles, Stuart Stephen, Hua Ying, Andrew Spriggs, Alexie Pa, NESCENT Funding