1
Lecture 12: RNA-seq analysis
2
Some background
RNA-seq (RNA sequencing), also called whole transcriptome shotgun sequencing (WTSS), is a technology that uses the capabilities of next-generation sequencing to reveal a snapshot of the presence and quantity of RNA at a given moment in time. RNA sequencing is a high-throughput tool for investigating gene expression, made possible by rapid advances in the speed and efficiency of sequencing technologies. Unlike microarrays, RNA-seq benefits from a wide dynamic range of signal detection, identifying both rare and common transcripts with no a priori knowledge of the organism’s genome or transcriptome. The additional information captured in RNA-seq libraries has revolutionized our understanding of cancer, stem cell differentiation, and plant genetics.
3
Next Generation Sequencing
Next-generation sequencing (NGS), also known as high-throughput sequencing, is the catch-all term for a number of different modern sequencing technologies, including:
Illumina (Solexa) sequencing
Roche 454 sequencing
Ion Torrent: Proton / PGM sequencing
SOLiD sequencing
NGS has REALLY revolutionized genome sequencing, because far more sequencing can be done in far less time.
4
More Background
An RNA-seq run reads and quantifies the transcriptome (the complete set of mRNA) in a single sequencing run. RNA is extracted from tissue, cleaved into fragments a few hundred nucleotides long, and then converted to a complementary DNA (cDNA) library (Wilhelm & Landry, 2009). Sequencing adaptors are ligated to both ends of each fragment, and the products are sequenced using any high-throughput method such as 454, SOLiD, or Ion Torrent.
5
Comparison with Microarrays: advantages
New sequences can be discovered. Microarrays can only detect sequences for which probes were designed in advance; RNA-seq, on the other hand, determines all sequences empirically. This has proved invaluable in non-model species with large genomes.
False positives from cross-hybridization are not an issue in RNA-seq.
Quantification is possible even at extremely low and high expression levels. Whereas microarrays have a dynamic range of one to a few hundred fold, RNA-seq boasts a dynamic range of >8,000 fold (Wang, Gerstein, & Snyder, 2009).
6
Comparison with Microarrays: disadvantages
Considerably more processing power is required to handle millions of RNA-seq reads, and chemical manipulation of RNA and cDNA can introduce artifacts. Slower than microarrays when the genome is known. But as sequencing costs have plummeted and computing power has increased, RNA-seq is now the transcriptomics method of choice for most applications.
7
Pictures
8
Data structure
Here you have a set of sequences, and for each sequence you have the number of READS. The data is the COUNT of the sequences read. Counts are discrete and non-negative, NOT continuous like microarray expression data, so the normal and related continuous distributions are not appropriate. The basic model for such data is the Poisson distribution.
9
Poisson Distribution
Generally used to model count data.
The mass function is given by $P(Y=y) = f(y) = \frac{e^{-m} m^{y}}{y!}$, for $y = 0, 1, 2, \dots$
Properties:
Takes integer values ranging from 0 to positive infinity.
Mean: $E(Y) = m$
Variance: $Var(Y) = m$
Hence, the mean and variance are the same.
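The equality of mean and variance can be checked numerically. The following is a minimal Python sketch, assuming the standard Poisson mass function $e^{-m} m^{y} / y!$; the function names and the value $m = 4.5$ are illustrative choices, not part of the lecture.

```python
import math

def poisson_pmf(y: int, m: float) -> float:
    """P(Y = y) for a Poisson distribution with mean m."""
    # Work on the log scale to avoid overflow in m**y and y!.
    return math.exp(-m + y * math.log(m) - math.lgamma(y + 1))

def poisson_moments(m: float, y_max: int = 200) -> tuple[float, float]:
    """Mean and variance computed from the mass function, truncated at y_max."""
    probs = [poisson_pmf(y, m) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

mean, var = poisson_moments(4.5)  # m = 4.5 is an arbitrary illustrative value
print(f"mean = {mean:.4f}, variance = {var:.4f}")
```

Both printed values should agree with $m$ up to truncation and rounding error.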
10
Issues with Poisson
The property that the mean and variance are the same is problematic for RNA-seq data, where the variance is often much larger than the mean. This is called the over-dispersion problem. It is also common in litter studies, where over-dispersion is induced by auto-correlation within litters.
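A quick way to spot over-dispersion is to compare the sample variance with the sample mean. The sketch below is a hedged illustration in Python: the counts are made-up numbers for a single hypothetical gene, and `dispersion_index` is a name introduced here, not a function from any RNA-seq package.

```python
from statistics import mean, variance

def dispersion_index(counts: list[int]) -> float:
    """Sample variance divided by sample mean; close to 1 for Poisson data."""
    return variance(counts) / mean(counts)

# Hypothetical read counts for one gene across six samples (made-up numbers).
counts = [12, 30, 7, 55, 18, 40]
print(f"mean = {mean(counts):.1f}, variance = {variance(counts):.1f}, "
      f"index = {dispersion_index(counts):.2f}")
```

An index well above 1, as for these counts, is what makes the Poisson model a poor fit.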
11
Solutions: The NB Distribution
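To address this problem, one distribution that is used is the Negative Binomial (NB) distribution. It models the number of trials until the rth success and is related to the geometric distribution (the case $r = 1$).
Model: $P(Y=y) = f(y) = \binom{y-1}{r-1} p^{r} (1-p)^{y-r}$, for $y = r, r+1, \dots$, where $p$ is the probability of success on each trial.
For count data the equivalent form that counts the failures before the rth success is used: $P(Y=y) = \frac{\Gamma(y+r)}{\Gamma(r)\, y!} p^{r} (1-p)^{y}$, for $y = 0, 1, 2, \dots$, which also allows non-integer $r$.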
12
Properties of the NB distribution
So, the mean and variance are related by a proportionality constant
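The relation can be written out explicitly. As a sketch, assuming the count form of the NB distribution, where $Y$ is the number of failures before the rth success and $p$ is the success probability (this choice of parametrization is an assumption made here):

```latex
\begin{align*}
E(Y) &= \frac{r(1-p)}{p}, \\
\mathrm{Var}(Y) &= \frac{r(1-p)}{p^{2}} = \frac{1}{p}\,E(Y).
\end{align*}
```

Because $0 < p < 1$, the constant $1/p$ is greater than 1, so under this form the variance always exceeds the mean.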
13
Theoretical Background
To model over-dispersion in Poisson regression one generally adds a random effect $q_i$ to represent the unobserved heterogeneity. The conditional distribution of $Y_i$ given $q_i$ is then Poisson with mean and variance $m_i q_i$. The idea is: if we knew and observed $q_i$, the data would be Poisson. But we don't observe it, so we assume that $q_i$ has a gamma distribution with both parameters $a = b = 1/s^2$, so that $q_i$ has mean 1 and $s^2$ represents the variance of the unobserved effect. Then the unconditional distribution is given by:
$$P(Y_i = y) = \frac{\Gamma(y + a)}{y!\,\Gamma(a)} \left(\frac{b}{m_i + b}\right)^{a} \left(\frac{m_i}{m_i + b}\right)^{y}, \quad y = 0, 1, 2, \dots$$
14
Theory: The form is a NB distribution with r=a, p= b/(m+b)
The mean and variance are related by a proportionality constant. This is the form used in the Anders and Huber paper laying out the basic theory for DESeq.
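The mean-variance relation for this form can be checked numerically. The Python sketch below assumes the NB mass function with $r = a$ and $p = b/(m+b)$, where $a = b = 1/s^2$; the closed form $m(1 + s^2 m)$ for the variance follows from the NB variance $r(1-p)/p^2$ with those parameters. The function names and the values of `m` and `s2` are illustrative only.

```python
import math

def nb_pmf(y: int, m: float, s2: float) -> float:
    """NB mass function with r = a and p = b / (m + b), where a = b = 1 / s2."""
    a = b = 1.0 / s2
    p = b / (m + b)
    log_pmf = (math.lgamma(y + a) - math.lgamma(a) - math.lgamma(y + 1)
               + a * math.log(p) + y * math.log(1.0 - p))
    return math.exp(log_pmf)

def nb_moments(m: float, s2: float, y_max: int = 2000) -> tuple[float, float]:
    """Mean and variance computed from the mass function, truncated at y_max."""
    probs = [nb_pmf(y, m, s2) for y in range(y_max + 1)]
    mean = sum(y * q for y, q in enumerate(probs))
    var = sum((y - mean) ** 2 * q for y, q in enumerate(probs))
    return mean, var

m, s2 = 10.0, 0.5  # illustrative values, not taken from any data set
mean, var = nb_moments(m, s2)
print(f"mean = {mean:.3f} (expected m = {m})")
print(f"variance = {var:.3f} (expected m * (1 + s2 * m) = {m * (1 + s2 * m)})")
```

The mean stays at $m$, while the variance is inflated by the factor $1 + s^2 m$, which is the over-dispersion the Poisson model cannot capture.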
15
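DESeq Theory
The DESeq2 library uses empirical Bayes ideas for differential expression, that is, for looking at differences in genes across conditions.
The idea: let $K_{ij}$ be the count associated with the ith gene and the jth sample. The assumption is:
$K_{ij} \sim NB(m_{ij}, a_i)$
where $m_{ij} = s_j q_{ij}$
and $\log_2(q_{ij}) = x_j b_i$.
Here $m_{ij}$ is the mean, $a_i$ is the gene-specific dispersion, $s_j$ is the sample-specific size factor, $x_j$ is the sample-specific design (a row of the design matrix), and $b_i$ (beta) holds the gene-specific parameters.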
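To make the mean structure concrete, here is a small Python sketch of $m_{ij} = s_j q_{ij}$ with $\log_2(q_{ij}) = x_j b_i$. It is plain Python, not DESeq2 code; the function name, the two-column design (intercept plus treatment indicator), and all numbers are hypothetical.

```python
def expected_count(size_factor: float, design_row: list[float], beta: list[float]) -> float:
    """m_ij = s_j * q_ij, with log2(q_ij) = x_j . b_i."""
    log2_q = sum(x * b for x, b in zip(design_row, beta))
    return size_factor * 2.0 ** log2_q

# Hypothetical gene: intercept of 5 on the log2 scale, log2 fold change of 1 for treatment.
beta_i = [5.0, 1.0]
control = expected_count(size_factor=1.0, design_row=[1.0, 0.0], beta=beta_i)
treated = expected_count(size_factor=1.2, design_row=[1.0, 1.0], beta=beta_i)
print(control, treated)
```

The treatment coefficient doubles $q_{ij}$ (one unit on the log2 scale), and the size factor then rescales the expected count for that sample.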
16
DESeq2 package: contrasts
Contrasts can be calculated for a DESeqDataSet object for which the GLM coefficients have already been fit using the Wald test steps (DESeq with test="Wald" or using nbinomWaldTest). The vector of coefficients is left multiplied by the contrast vector c to form the numerator of the test statistic. The denominator is formed by multiplying the covariance matrix for the coefficients on either side by the contrast vector c. The square root of this product is an estimate of the standard error for the contrast. The contrast statistic is then compared to a normal distribution, as are the Wald statistics in the DESeq2 package.
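The arithmetic behind the contrast statistic can be sketched in a few lines. The Python below is an illustration of the calculation described above, not DESeq2's implementation; the function name `wald_contrast`, the coefficient vector, and the covariance matrix are all hypothetical.

```python
import math

def wald_contrast(beta, cov, c):
    """Return (estimate, standard error, z statistic, two-sided p-value) for contrast c."""
    estimate = sum(ci * bi for ci, bi in zip(c, beta))                   # c' beta
    cov_c = [sum(row[k] * c[k] for k in range(len(c))) for row in cov]   # Sigma c
    variance = sum(ci * vi for ci, vi in zip(c, cov_c))                  # c' Sigma c
    se = math.sqrt(variance)
    z = estimate / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))                         # 2 * (1 - Phi(|z|))
    return estimate, se, z, p_value

# Hypothetical fit: coefficients (intercept, treatment A, treatment B) on the log2 scale.
beta = [5.0, 1.5, 0.5]
cov = [[0.04, -0.02, -0.02],
       [-0.02, 0.09, 0.02],
       [-0.02, 0.02, 0.08]]
c = [0.0, 1.0, -1.0]  # compares treatment A with treatment B
estimate, se, z, p_value = wald_contrast(beta, cov, c)
print(f"estimate = {estimate:.2f}, se = {se:.3f}, z = {z:.2f}, p = {p_value:.4f}")
```

Note that the covariance between the two treatment coefficients enters the standard error, so the contrast cannot be tested from the two individual standard errors alone.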