Transcriptome Analysis by High-Throughput Sequencing (RNA-Seq) Mark Reimers Virginia Institute for Psychiatric and Behavioral Genetics.

Outline Technology What RNA-Seq data look like Normalization: ensuring comparability Testing for differential expression Issues specific to brain genomics

Illumina (Solexa) Genome Analyzer and Flow Cell

Illumina (Solexa) Sequencing

Advantages & Drawbacks PRO –Very high throughput –Most widespread technology so that comparisons seem easier CON –Sequencing representation biases, especially at beginning –Slow – up to a week for a run

SOLiD Sequencing by Oligonucleotide Ligation and Detection

SOLiD Color Coding Scheme Blue is the color of homopolymer runs If color reads are translated directly into base reads, a single error in the color calls corrupts every base call downstream of it, much like a frame-shift It is therefore best to convert the reference sequence into color-space and align there There is one unambiguous conversion of a base reference sequence into color-space, but there are four possible conversions of a color string into base strings, one for each choice of first base
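The color-space conversion described on this slide can be sketched in a few lines. This is a minimal illustration, assuming the commonly described two-bit base codes (A=0, C=1, G=2, T=3) with the color of a base pair given by the XOR of the two codes, so that color 0 (blue) marks a repeated base. The function names and the example sequence are made up for illustration.

```python
# Two-base (color-space) encoding sketch. Assumption: bases are coded A=0, C=1,
# G=2, T=3 and each color is the XOR of two adjacent base codes.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"

def encode_colors(bases):
    """Convert a base string into its color string (one color per adjacent pair)."""
    codes = [BASE_TO_BITS[b] for b in bases]
    return [x ^ y for x, y in zip(codes, codes[1:])]

def decode_colors(first_base, colors):
    """Rebuild bases from a known first base; each color is XORed onto the previous base."""
    code = BASE_TO_BITS[first_base]
    out = [first_base]
    for c in colors:
        code ^= c
        out.append(BITS_TO_BASE[code])
    return "".join(out)

reference = "ATGGCAAT"                      # made-up example sequence
colors = encode_colors(reference)           # bases -> colors is unambiguous
candidates = [decode_colors(b, colors) for b in "ACGT"]  # colors -> four base strings

# A single wrong color call changes every base downstream of it.
bad = list(colors)
bad[2] ^= 1
corrupted = decode_colors(reference[0], bad)
```

Here the homopolymer pairs GG and AA both encode to color 0, and `corrupted` agrees with `reference` only up to the position of the altered color, which is why aligning in color-space is preferred over translating reads to bases first.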

Advantages & Drawbacks PRO –Very high throughput –Di-base ligation ensures built-in accuracy check Low error rate for low-coverage –Can handle repetitive regions easily CON –Strong cycle-dependent biases (can be modeled and partly overcome – see Wu et al, Nature Methods, 2011) –Low quality color calls (Phred < 20) are common –Reported problems with paired ends – most mapped tags don’t map to the same chromosome

RNA-Seq Data

Raw and Aligned Reads
Raw data is a (large) set of sequences. Typical file format is FASTQ:
    @HWI-EAS255_4_FC2010Y_1_43_110_790    (read identifier)
    TTAATCTACAGAATAGATAGCTAGCATATATTT     (bases called)
    +
    hhhhhhhhhhhhhhhdhhhhhhhhhhhdRehdh     (base quality codes)
Alignment to the genome is done by efficient indexing of seed sequences.
Aligned reads in SAM format:
    @HWI-… 163 chr19 9900 10000 16M2I25M
Labels on the slide: read identifier (@HWI-…); where this read matched (chr19); start and end positions (9900, 10000); codes for match (16M2I25M: 16 matches, 2 extra,…)
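As a rough illustration of how these two formats are read by software, the sketch below splits the FASTQ record from this slide into its parts and summarizes the match code 16M2I25M. The quality offset of 64 is an assumption (it corresponds to an older Illumina encoding and is not stated on the slide), and the function names are illustrative.

```python
import re

fastq_record = """@HWI-EAS255_4_FC2010Y_1_43_110_790
TTAATCTACAGAATAGATAGCTAGCATATATTT
+
hhhhhhhhhhhhhhhdhhhhhhhhhhhdRehdh"""

def parse_fastq_record(text, phred_offset=64):
    """Split one four-line FASTQ record. phred_offset=64 is an assumption, not from the slide."""
    header, bases, _plus, quals = text.strip().split("\n")
    scores = [ord(ch) - phred_offset for ch in quals]
    return header[1:], bases, scores

def parse_cigar(cigar):
    """Return (length, operation) pairs, e.g. '16M2I25M' -> [(16, 'M'), (2, 'I'), (25, 'M')]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def cigar_lengths(cigar):
    """Bases consumed on the read and on the reference by a match code."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in "MIS=X")   # operations that use read bases
    ref_len = sum(n for n, op in ops if op in "MDN=X")    # operations that use reference bases
    return read_len, ref_len

read_id, bases, scores = parse_fastq_record(fastq_record)
read_len, ref_len = cigar_lengths("16M2I25M")
```

The two inserted bases count toward the read length but not toward the span on the reference, so an end position cannot be read off the start position and read length alone.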

RNA-Seq data is often represented by ‘pile-up’ diagrams From: The ENCODE Project Consortium (2011) A User's Guide to the Encyclopedia of DNA Elements (ENCODE). PLoS Biol 9(4): e1001046.

Basic Statistics Distribution of counts in one sample typically follows a power law Variation in read counts from replicate samples should follow a Poisson distribution This is almost true if replicates are done by the same lab in the same batch on the same machine from the same library Biologically distinct samples are ‘over-dispersed’

Distribution of counts within one sample often follows a power law If we plot the number of genes with a given count against that count on a log-log plot, we see something close to a straight line (with a bump at 0)

Distribution of Reads Within Genes Most reads fall into coding exons, or UTR UTR coverage is ragged… often several initiation or termination sites Sometimes higher-than-background read depth from a specific intron

Inter-Genic Reads Many reads reflect unannotated genes SEQC detects about 100,000 transcriptionally active regions (TARs) Lindblad-Toh et al (Nature, 2011) found 2,000 unannotated exons through conservation and public RNA-seq data

Summarization of RNA Reads Count how many reads per gene (or exon) Several software packages, including Bioconductor, provide facilities to count how many reads map within each annotated gene (or exon) Need to choose gene models (NCBI, GENCODE or ENSEMBL)

Summarized Data Example From Recount DB: data from (Marioni et al, Gen. Res., 2008)

Normalization of RNA-Seq Data

RPKM – A Simple Normalization Samples differ in their total number of counts (sequencing depth) Divide counts in a region of interest (a genomic region, a gene or an exon) by the total count in millions (reads per million reads, RPM) Genes have different lengths: divide also by the length of the gene in kilobases Obtain RPKM (reads per kilobase of exon per million reads) –Some use FPKM (fragments per kilobase per million reads)
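The RPKM arithmetic can be written out directly. The sketch below assumes one sample, with the total taken over the genes supplied. The gene names, counts and lengths are made up for illustration.

```python
def rpkm(counts, lengths_bp):
    """RPKM per gene: reads / (gene length in kb) / (total reads in millions).

    counts: reads per gene in one sample; lengths_bp: exonic length per gene in bases.
    """
    total_reads = sum(counts.values())
    # 1e9 combines the two scalings: 1e3 (bases -> kilobases) and 1e6 (reads -> millions).
    return {g: 1e9 * counts[g] / (total_reads * lengths_bp[g]) for g in counts}

counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}     # hypothetical counts
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 4000}   # hypothetical lengths (bp)
values = rpkm(counts, lengths)
```

In this toy example geneA and geneB end up with the same RPKM even though geneB has three times as many reads, because geneB is three times as long.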

Biases in RPKM Normalization Robinson et al noticed that most genes appeared less expressed in some liver samples in a landmark study Fig 1 from Robinson & Oshlack, Genome Biology 2010 Their TMM procedure centers trimmed log-ratios between samples
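The idea of centering trimmed log-ratios can be illustrated with a deliberately simplified sketch. It is not the published TMM procedure, which also trims on average abundance and weights genes by precision. Here the log2 ratios of read proportions are sorted, a fixed fraction is dropped from each tail, and the remainder is averaged. The function name, the trimming fraction and the counts are assumptions for illustration.

```python
import math

def trimmed_mean_log_ratio(sample, reference, trim=0.3):
    """Scaling factor for `sample` relative to `reference` from a trimmed mean of log2 ratios.

    Simplified sketch: unweighted, trimming on the log ratios only.
    """
    n_s, n_r = sum(sample), sum(reference)
    log_ratios = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0               # a zero count has no finite log ratio
    )
    k = int(len(log_ratios) * trim)      # number of genes dropped from each tail
    kept = log_ratios[k:len(log_ratios) - k]
    return 2 ** (sum(kept) / len(kept))

reference = [100] * 10                   # hypothetical counts for ten genes
sample = [100] * 9 + [1100]              # same counts, plus one very highly expressed gene
factor = trimmed_mean_log_ratio(sample, reference)
```

In this toy case the one highly expressed gene doubles the library size, so the nine unchanged genes look half as abundant per million reads. The trimmed mean ignores the outlier and returns a factor of 0.5, which rescales the library size so that the unchanged genes line up again.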

Alternate Normalizations In most samples, roughly half of genes are expressed –Reads mapped to the other genes are ‘noise’ We could standardize the median of expressed genes, which is roughly the 75th percentile of all genes Many factors contribute to technical differences; we could sweep them all under the rug and force the distributions to be the same by quantile normalization Critique: we give up linearity
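Both alternatives on this slide are short to sketch. The code below is a minimal illustration, assuming a nearest-rank 75th percentile that is non-zero and arbitrary tie-breaking in the quantile step. Function names and counts are made up.

```python
def upper_quartile_scale(counts):
    """Divide one sample's counts by its 75th percentile (assumed non-zero)."""
    ordered = sorted(counts)
    q75 = ordered[int(0.75 * (len(ordered) - 1))]
    return [c / q75 for c in counts]

def quantile_normalize(samples):
    """Force every sample onto one distribution: the rank-by-rank mean of the sorted samples."""
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    target = [sum(s[i] for s in sorted_samples) / len(samples) for i in range(n)]
    out = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])   # gene indices, smallest count first
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = target[rank]             # ties are broken by position
        out.append(normalized)
    return out

sample_a = [0, 0, 10, 20, 40]      # hypothetical counts; half the genes are not expressed
sample_b = [0, 0, 30, 60, 120]     # same profile sequenced three times deeper
scaled_a = upper_quartile_scale(sample_a)
scaled_b = upper_quartile_scale(sample_b)
qn_a, qn_b = quantile_normalize([sample_a, sample_b])
```

Upper-quartile scaling keeps each sample a linear rescaling of its counts, whereas quantile normalization replaces the counts by rank-based values, which is the loss of linearity the slide criticizes.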

Ratio-Average Plots Show Non-Linear Effects in Gene Counts These may reflect differences in the nonlinearities in the amplification step Sequencing depth affects apparent expression levels (RPKM) in a non-linear fashion Figure: Sample 5 from Bottomly Data plotted against average of other samples (square root scale)

GC Content May Affect Read Counts From Hansen et al 2011 Figure: boxplots of abundances (log RPKM; white: sample 1, grey: sample 2) and of the log ratio (sample 1/sample 2), showing apparent differential expression

Approaches to Compensating for Technical Covariates Regress RPKM values on technical covariates –GC content, fragment length, average intensity –Typically 10%-40% of variance can be so explained –Result: an adjusted RPKM Estimate a bias factor for count data –Use as an offset for counts in a GLM

New Analyses Possible with RNA-Seq

New Kinds of Analysis with RNA-Seq Allele-specific expression Alternate initiation sites –Select 5’ capped RNA fragments Alternate termination protocols –Select 3’ poly-A tails Splice variation –Between tissues –In disease

Allelic Expression Differences It is possible to compare allele-specific expression counts Replicate samples P-values for binomial tests of equality About half show evidence of differential expression!
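A binomial test of equality for one heterozygous site can be sketched as follows. This assumes reads are independent draws that come from either allele with probability 0.5 under the null. The function name and read counts are illustrative.

```python
from math import comb

def binomial_two_sided_p(ref_reads, alt_reads):
    """Exact two-sided binomial test that both alleles are equally represented (p = 0.5)."""
    n = ref_reads + alt_reads
    observed_dev = abs(ref_reads - n / 2)
    # Under p = 0.5 the distribution is symmetric, so add up every outcome
    # at least as far from n/2 as the observed one.
    tail = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= observed_dev)
    return tail / 2 ** n

p_balanced = binomial_two_sided_p(5, 5)    # 1.0
p_skewed = binomial_two_sided_p(8, 2)      # 112/1024
p_extreme = binomial_two_sided_p(10, 0)    # 2/1024
```

Because the test treats every read as an independent draw, any extra technical or biological variation between reads will make its p-values look smaller than they should.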

Mapping Start Sites Select capped mRNA Sequence selectively Identify proportions of initiation sites Issue - bias

Mapping Termination From Mangone et al, Nature, 2011

Detecting Splice Variation Deep sequencing shows clear variation in exon usage Wang et al Nature 2008

Tissue Map of Splice Variation Brain is most distinctive Individuals seem to differ Cell lines seem to have distinct splice patterns From Wang et al

Issues in Mapping Spliced Reads Not all splice forms are known The most common alternative is splicing into or from within an exon Not all exons are known (2,000 just discovered by conservation) Until recently reads have been too short to uniquely identify both exons, especially if the junction is near one end of the read

GSNAP Algorithm Wu T D, Nacu S Bioinformatics 2010;26:873-881 © The Author(s) 2010. Published by Oxford University Press.

Measuring Alternate Splicing Could count most direct measures: exon-specific reads, ~80% of reads –Exons are different lengths –Genes are expressed at different levels Could try to estimate isoform abundance Pavel Mazin has developed a binomial test for exon inclusion

Exon Inclusion Ratios Definition: fraction of transcripts including a particular exon Reasonable approximation: –RPKM exon / RPKM gene –Some ratios bigger than 1 Counts for exons are lower than for genes In principle GLM more appropriate Biological differences confounded with technical artifacts
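The approximation RPKM exon / RPKM gene is simple to compute, and a small example shows how a ratio above 1 can arise. The sketch assumes both counts come from the same sample, so the per-million factor cancels. The counts and lengths are made up.

```python
def reads_per_kb(count, length_bp):
    """Read density: reads per kilobase of sequence."""
    return count * 1000 / length_bp

def inclusion_ratio(exon_count, exon_len_bp, gene_count, gene_len_bp):
    """Approximate exon inclusion as RPKM(exon) / RPKM(gene) within one sample."""
    return reads_per_kb(exon_count, exon_len_bp) / reads_per_kb(gene_count, gene_len_bp)

# Hypothetical gene: 2000 bp of exons, 400 reads in total.
half_included = inclusion_ratio(10, 100, 400, 2000)   # exon density is half the gene's
above_one = inclusion_ratio(25, 100, 400, 2000)       # exon density exceeds the gene's
```

With only 10 to 25 reads on the exon, sampling noise alone can push the ratio past 1, which is one reason a count-based model such as a GLM is more appropriate in principle.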

Issues in Detecting Splice Variants Counts in exons reflect technical biases (as yet uncharacterized) as well as actual abundance Reads that bridge splice junctions would be definitive but only a small proportion of reads cross junctions –Mapping is difficult All possible splice junctions are not known –Hard to even search through the known ones

RNA-Seq Significance Testing

Approaches to Significance Continuous (easy) –At current read depths (>50M) most genes of interest are well above the threshold for continuity approximation Discrete (hard) –All data are counts, and many are quite low, well below the acceptable n > 5 for continuous approximation –Cost-effective studies will use multiplexing and so counts will remain low

Issues with Continuous Approximation Data are NOT anywhere near Gaussian Discrete counts under five may be poorly approximated by continuous distributions Select only those with mean at least five Ad-hoc fix: Winsorize data and do t-tests Typically there are excess zeroes resulting in extreme values

Models for Count Data Poisson model –Standard model for count data Negative Binomial Model –Higher variance than Poisson Zero-inflated (mixture) model –Allows excess 0 counts beyond either above

Poisson Model Describes counts of independent events where each has a small probability of occurring, such as reads aligning to one gene Technical replicates of RNA-Seq under identical conditions follow close to the Poisson law Poisson distributions with various means
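One quick check of the Poisson claim is the ratio of variance to mean across replicates, which should be near 1 for Poisson counts. The sketch below is illustrative only: both sets of counts are made up, with the first meant to resemble technical replicates of one gene and the second biologically distinct samples.

```python
from math import exp, factorial
from statistics import mean, variance

def poisson_pmf(k, lam):
    """Probability of observing k events when the expected count is lam."""
    return exp(-lam) * lam ** k / factorial(k)

def dispersion_index(counts):
    """Sample variance divided by sample mean; close to 1 for Poisson-like data."""
    return variance(counts) / mean(counts)

technical = [88, 112, 95, 104, 109, 92]     # hypothetical technical replicates
biological = [40, 180, 95, 260, 70, 130]    # hypothetical distinct samples

index_technical = dispersion_index(technical)
index_biological = dispersion_index(biological)
total_probability = sum(poisson_pmf(k, 10.0) for k in range(60))
```

An index far above 1, as in the second set, is what is meant by over-dispersion relative to the Poisson model.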

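Variance Proportional to Mean Early RNA-seq studies found that count variance across samples was proportional to count mean Among standard parametric distributions for counts, the negative binomial has this property when p is held fixed and r varies, since its variance is then mean/(1-p); with a fixed over-dispersion parameter the variance instead grows quadratically with the mean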

Negative Binomial Distribution Generalization of the geometric distribution Repeat independent Bernoulli trials Count the number of non-selected outcomes (each with probability p) before the r-th selected outcome Figure: Negative Binomial distributions for p = 0.2, and various r Note: r = 0.5 is defined by analogy, extending the formula to non-integer r

Alternate Parameterization by Mean and Over-Dispersion Negative Binomial may also be parameterized by mean and variance: $\mu = pr/(1-p)$, $\sigma^2 = pr/(1-p)^2$ Over-dispersion parameter $\alpha$: $\sigma^2 = \mu + \alpha\mu^2$, with $\alpha = 1/r$ and $p = \alpha\mu/(1+\alpha\mu)$ If $\alpha = 0$ (the limit $r \to \infty$), Poisson Figure: Negative Binomial distributions for $\mu = 10$, and various $\alpha$
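The mean and over-dispersion parameterization can be checked numerically. The sketch below assumes the convention implied by the mean formula pr/(1-p), namely that p is the probability of a counted outcome. It writes the over-dispersion parameter as `alpha`, a name chosen here, and uses the log-gamma function so that non-integer r is allowed.

```python
from math import exp, lgamma, log

def nb_pmf(k, mu, alpha):
    """Negative binomial probability of count k, given mean mu and over-dispersion alpha."""
    r = 1.0 / alpha
    p = mu / (mu + r)                       # probability of a counted outcome
    log_pmf = (lgamma(k + r) - lgamma(k + 1) - lgamma(r)
               + k * log(p) + r * log(1.0 - p))
    return exp(log_pmf)

mu, alpha = 10.0, 0.5
support = range(2000)                       # far enough into the tail for these parameters
probs = [nb_pmf(k, mu, alpha) for k in support]
numeric_mean = sum(k * q for k, q in zip(support, probs))
numeric_var = sum((k - numeric_mean) ** 2 * q for k, q in zip(support, probs))
```

For these values the numerical mean comes out at 10 and the variance at 60, matching mean + alpha × mean².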

Using the Negative Binomial Model to Test for Differential Expression Assume dispersion parameters are identical between samples Test for difference of means using Likelihood Ratio Test –LRT statistic: $2 \log\left( P(x_1 \mid \mu_1, \alpha)\, P(x_2 \mid \mu_2, \alpha) \,/\, P((x_1, x_2) \mid \mu, \alpha) \right) \sim \chi^2$ Can also use a t-test if the covariance matrix of the parameters is estimated Issue: library sizes differ
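A likelihood ratio test of this kind can be sketched for a single gene under strong simplifying assumptions: the over-dispersion `alpha` is known and shared, library sizes are equal (so the issue noted on the slide is ignored), and every group mean is positive. Under those assumptions the maximum-likelihood mean of each group is its sample mean, and the statistic is compared to a chi-squared distribution with one degree of freedom. Names and counts are illustrative.

```python
from math import erfc, lgamma, log, sqrt

def nb_loglik(counts, mu, alpha):
    """Negative binomial log-likelihood of counts, given mean mu and over-dispersion alpha."""
    r = 1.0 / alpha
    p = mu / (mu + r)
    return sum(lgamma(k + r) - lgamma(k + 1) - lgamma(r)
               + k * log(p) + r * log(1.0 - p) for k in counts)

def nb_lrt(group1, group2, alpha):
    """Likelihood ratio test for equal means; assumes equal library sizes and known alpha."""
    mu1 = sum(group1) / len(group1)
    mu2 = sum(group2) / len(group2)
    pooled = list(group1) + list(group2)
    mu0 = sum(pooled) / len(pooled)
    stat = 2.0 * (nb_loglik(group1, mu1, alpha) + nb_loglik(group2, mu2, alpha)
                  - nb_loglik(pooled, mu0, alpha))
    stat = max(stat, 0.0)                    # guard against tiny negative rounding error
    p_value = erfc(sqrt(stat / 2.0))         # chi-squared tail probability, 1 degree of freedom
    return stat, p_value

stat_diff, p_diff = nb_lrt([10, 12, 9, 11], [40, 38, 45, 41], alpha=0.1)   # hypothetical counts
stat_same, p_same = nb_lrt([10, 12, 9, 11], [10, 12, 9, 11], alpha=0.1)
```

Handling unequal library sizes would require scaling each sample's mean by its library size, which this sketch leaves out.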

Issues with Negative Binomial A Negative Binomial variable with mean $\mu > 10$ and moderate over-dispersion cannot have many zeros In some large studies excess zeros are common Hence the negative binomial model does not appear to fit these data very well
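To make the zero-count claim concrete, the probability of a zero under the negative binomial can be worked out from the mean and over-dispersion parameterization. The symbols are assumptions of this note: $\mu$ is the mean, $\alpha = 1/r$ is the over-dispersion parameter, and $1 - p = 1/(1 + \alpha\mu)$. The numerical values are illustrative.

```latex
P(X = 0) = (1 - p)^{r} = (1 + \alpha\mu)^{-1/\alpha},
\qquad
\mu = 10,\ \alpha = 0.5:\quad P(X = 0) = 6^{-2} \approx 0.028,
\qquad
\alpha \to 0:\quad P(X = 0) \to e^{-10} \approx 4.5 \times 10^{-5}.
```

So at a mean of 10 and moderate over-dispersion, fewer than 3% of samples would be expected to show a zero count, and a gene with many more zeros than that points toward a zero-inflated model.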

Issues in RNA-Seq for Brain Genomics

Tissue Admixture Brain tissue is made of four major cell types intertwined: neurons (10%-20%) and glia: astrocytes (50%), oligodendrocytes (20%), microglia (macrophages; 10%) Cerebellum has a lower proportion of glia Interneurons have very distinct profiles Most differences between cerebellum and cortex reflect differences in cell-type composition more than differences in neural gene expression profiles

How to Distinguish Differences within Neurons from Differences in Proportions of Neurons In principle a factor-analysis approach should be able to recover profiles of individual cell types No-one has made this work so far for brain Some fairly specific markers are known for oligodendrocytes and microglia; astrocytes are harder to tell apart from neurons by gene expression profile.
