1
DEG Mi-kyoung Seo
2
RNA-seq for DEG
Sequencing: FASTQ
Data quality control: FastQC / FASTX-Toolkit
Mapping: TopHat2 (HTSeq for read counting)
Transcript assembly: Cufflinks
Final transcript assembly
Differential expression analysis: Cuffdiff, or in R: DESeq, edgeR
Visualization: CummeRbund
3
RNA-Seq versus microarrays
Fig. 2 (Sultan M et al. 2008). RNA-Seq versus microarrays. Evaluation of the sensitivity of RNA-Seq over microarrays on the same RNA source, based on 13,118 genes represented on the array.
(A) Comparison of the number of expressed genes detected by RNA-Seq and microarrays. Values for relaxed (at least one read) and stringent (at least five reads) RNA-Seq parameters are in bold or in brackets, respectively.
(B) Distribution of the RNA-Seq NEs and the proportion of genes detected on microarrays. Genes missed by microarrays are shown with gray (HEK) and black (B cells) bars. Genes detected by microarrays are shown with light red (HEK) and dark red (B cells) bars.
4
RNA-seq vs. microarray From Sonia Tarazona
5
Definition From Sonia Tarazona
6
Sources of variability
Between-lane normalization: library size (sequencing depth)
Within-lane normalization: gene-specific biases (length, GC content)
Mappability of reads
Differences in the count distribution among samples
Count data with RNA-seq biases -> Normalization -> DEG
7
DEG: differentially expressed gene
A gene is declared differentially expressed if an observed difference or change in read counts between two experimental conditions is statistically significant.
Statistical framework for RNA-seq
8
Variance depends strongly on the mean
Technical replicates: Poisson distribution, $v = \mu$
Biological replicates: negative binomial distribution
Poisson + constant CV: $v = \mu + \alpha\mu^2$ (edgeR)
Poisson + local regression: $v = \mu + f(\mu^2)$ (DESeq)
9
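RNA-seq within a library (sample)
Gene 1: length $L_{g1} = 6$, read count $Y_{g1} = 6$, expression = 1
Gene 2: length $L_{g2} = 3$, read count $Y_{g2} = 3$, expression = 1
Read count is proportional to the expression of a given gene and to transcript length.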
10
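RNA-seq within different libraries (comparison of two samples)
For gene 1, with length $L_{g1} = 6$:
Library 1: library size $L_{l1} = 600$, read count $Y_{l1} = 6$, expression of gene 1 = 1
Library 2: library size $L_{l2} = 1200$, read count $Y_{l2} = 12$, expression of gene 1 = 1
Read count is proportional to the expression of a given gene, to transcript length, and to library size.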
11
RPKM: Reads Per Kilobase per Million mapped reads
FPKM: Fragments Per Kilobase per Million mapped fragments, which is suitable for paired-end reads (Garber et al. 2011)
$$\mathrm{RPKM} = \frac{\text{number of reads in the region}}{(\text{length of region}/10^{3}) \times (\text{total number of mapped reads}/10^{6})}$$
$$\mathrm{RPKM}(X) = \frac{10^{9} \times C}{N \times L}$$
C is the number of mappable reads on the feature (transcript, exons, ...)
N is the total number of mappable reads in the experiment
L is the sum of the exon lengths (in bp)
Equivalently, RPKM = C / (N x L) when N is given in millions of reads and L in kb.
Mortazavi et al. (2008) Nature Methods
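As a minimal sketch of the RPKM formula, assuming C and N are raw read counts and L is the feature length in base pairs. The function name `rpkm` and the first set of example numbers are illustrative and not from the slides; the second example reuses the toy values for gene 1 in the two-library slide.

```python
import math

def rpkm(count: int, length_bp: int, total_mapped_reads: int) -> float:
    """RPKM = 10^9 * C / (N * L), with N in reads and L in base pairs."""
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return 1e9 * count / (total_mapped_reads * length_bp)

# Illustrative numbers: 1,000 reads on a 2 kb transcript, 10 million mapped reads.
example = rpkm(1000, 2000, 10_000_000)

# Toy values for gene 1 (length 6): 6 reads in a library of 600 reads,
# 12 reads in a library of 1,200 reads.
lib1 = rpkm(6, 6, 600)
lib2 = rpkm(12, 6, 1200)

print(example)
print(math.isclose(lib1, lib2))  # same expression after normalization
```

Doubling the library size doubles the raw count for gene 1, but the normalized value is unchanged, which is the point of dividing by both length and library size.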
12
RPKM's drawback
A small number of highly expressed genes can generate a large portion of the total reads (Bullard et al., 2010), which complicates normalization.
Even after normalization based on length (e.g., RPKM), longer transcripts or genes are still more likely to be called differentially expressed than shorter ones when using a t-test (Oshlack and Wakefield, 2009).
13
Gene length bias
[Figure: differential expression as a function of transcript length, for sequencing and array data; panels for the 33% of highest expressed genes and the 33% of lowest expressed genes. Oshlack and Wakefield (2009) Biology Direct.]
14
Gene length bias
Let X be the measured number of reads in a library mapping to a specific transcript, modeled as a Poisson random variable:
$\mu = E(X) = cNL$
$\mathrm{Var}(X) = \mu = cNL$
N: the total number of transcripts
L: the length of the gene
c: proportionality constant
DEG between two samples of the same library size: use a t-test to test whether the difference D in counts for a particular gene between the two samples is significantly different from zero.
$E(D)/\mathrm{S.E.}(D) = \delta$
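The step behind the statistic δ can be written out as follows. This is a sketch following the argument of Oshlack and Wakefield (2009); the subscripts 1 and 2 for the two samples, the per-sample constants $c_1, c_2$, and the independence of the two counts are notation and assumptions introduced here, with the same N in both samples.

```latex
\begin{align*}
X_i &\sim \mathrm{Poisson}(\mu_i), \qquad \mu_i = c_i N L, \quad i = 1, 2 \\
D &= X_1 - X_2, \qquad E(D) = (c_1 - c_2) N L, \qquad \mathrm{Var}(D) = (c_1 + c_2) N L \\
\delta &= \frac{E(D)}{\mathrm{S.E.}(D)}
        = \frac{(c_1 - c_2) N L}{\sqrt{(c_1 + c_2) N L}}
        = \frac{(c_1 - c_2)\sqrt{N}}{\sqrt{c_1 + c_2}} \, \sqrt{L}
        \;\propto\; \sqrt{L}
\end{align*}
```

For the same change in expression, δ grows with the square root of transcript length, so longer transcripts are easier to call as differentially expressed.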
15
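Gene length bias: dividing by gene length
Dividing the count by gene length gives $\mu' = E(X/L) = cN$ and $\mathrm{Var}(X/L) = cN/L$. The distribution is no longer Poisson, because the variance no longer equals the mean $\mu'$.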
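A small numeric check of this point, under the Poisson model of the previous slide. The function names and the values of c1, c2 and N are hypothetical. The sketch computes E(D)/S.E.(D) for raw counts and for length-divided counts, assuming independent samples of equal library size.

```python
import math

def delta_raw(c1: float, c2: float, n: float, length: float) -> float:
    """E(D)/S.E.(D) for D = X1 - X2 with Xi ~ Poisson(ci * n * length)."""
    mean_diff = (c1 - c2) * n * length
    var_diff = (c1 + c2) * n * length  # Poisson variances add for independent counts
    return mean_diff / math.sqrt(var_diff)

def delta_length_divided(c1: float, c2: float, n: float, length: float) -> float:
    """The same statistic after dividing each count by the gene length."""
    mean_diff = (c1 - c2) * n          # E(X/L) = c * n
    var_diff = (c1 + c2) * n / length  # Var(X/L) = c * n / L
    return mean_diff / math.sqrt(var_diff)

c1, c2, n = 2e-6, 1e-6, 1_000_000  # hypothetical constants and library size
for length in (500, 2000, 8000):
    print(length,
          round(delta_raw(c1, c2, n, length), 3),
          round(delta_length_divided(c1, c2, n, length), 3))
```

The two columns agree for every length and both increase with the square root of L, so under this model dividing by length changes the scale of the data but does not remove the length dependence of the test.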
16
Technical and biological replicates
Nagalakshmi et al. (2008) found that:
counts for the same gene from different technical replicates have a variance equal to the mean (Poisson);
counts for the same gene from different biological replicates have a variance exceeding the mean (overdispersion).
Marioni et al. (2008) confirmed the first finding: "We find that the sequencing data are highly reproducible, with few systematic differences among technical replicates. Statistically, we find that the variation across technical replicates can be captured using a Poisson model, with only a small proportion (∼0.5%) of genes showing clear deviations from this model."
17
RNA-Seq as draws from an infinite urn
Imagine taking N colored balls from an urn that contains >> N balls.
The colors are genes, and the balls are fragments in the library.
A column of the count matrix is then multinomial(N, p).
[Figure labels: BRCA1, BRCA2; library (sample)]
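The urn picture can be sketched in a few lines. Drawing with replacement is used here to stand in for an urn that holds far more than N balls; the gene proportions, the "other" category, the seed and the function name are hypothetical.

```python
import random
from collections import Counter

def sample_library(proportions: dict[str, float], n_fragments: int,
                   rng: random.Random) -> Counter:
    """Draw n_fragments balls (fragments) with replacement; colors are genes.

    The resulting counts follow multinomial(n_fragments, p).
    """
    genes = list(proportions)
    weights = [proportions[g] for g in genes]
    draws = rng.choices(genes, weights=weights, k=n_fragments)
    return Counter(draws)

# Hypothetical proportions p; BRCA1 and BRCA2 are the genes named on the slide.
p = {"BRCA1": 0.2, "BRCA2": 0.1, "other": 0.7}
rng = random.Random(0)
column = sample_library(p, 10_000, rng)  # one column of the count matrix
print(dict(column))
```

Each call produces one library, that is, one column of the count matrix; the counts in a column always sum to N.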
18
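Binomial
The binomial distribution has two parameters: the number of trials n and the success probability p.
That X follows a binomial distribution with parameters n and p is written X ~ B(n, p).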
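To connect the binomial with the Poisson model used for read counts: when n is large and p is small, B(n, p) is close to a Poisson distribution with mean np. A brief sketch, with hypothetical values of n and p and illustrative function names:

```python
import math

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k: int, mean: float) -> float:
    """P(X = k) for X ~ Poisson(mean)."""
    return math.exp(-mean) * mean**k / math.factorial(k)

n, p = 10_000, 0.0003       # hypothetical: many fragments, one rare gene
mean = n * p                # binomial mean np
variance = n * p * (1 - p)  # binomial variance np(1 - p), close to the mean
for k in range(7):
    print(k, round(binomial_pmf(k, n, p), 5), round(poisson_pmf(k, mean), 5))
```

The two probability columns are nearly identical, and the binomial variance is almost equal to its mean, which is the Poisson property v = μ.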
19
Problems with Poisson
Poisson distribution: $v = \mu$
Negative binomial distribution:
Poisson + constant CV: $v = \mu + \alpha\mu^2$ (edgeR)
Poisson + local regression: $v = \mu + f(\mu^2)$ (DESeq)
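The difference between the two variance models can be illustrated with a small simulation. This is a sketch using only the standard library: the Poisson sampler is Knuth's multiplication method, the negative binomial is drawn as a gamma-Poisson mixture, and the values of μ, α, the sample size and the seed are hypothetical.

```python
import math
import random
import statistics

def poisson_draw(mean: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for small means."""
    threshold = math.exp(-mean)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def negative_binomial_draw(mean: float, alpha: float, rng: random.Random) -> int:
    """Gamma-Poisson mixture with variance mean + alpha * mean**2."""
    # Gamma with shape 1/alpha and scale alpha * mean has mean `mean`
    # and variance alpha * mean**2.
    rate = rng.gammavariate(1.0 / alpha, alpha * mean)
    return poisson_draw(rate, rng)

rng = random.Random(42)
mu, alpha, n = 10.0, 0.2, 20_000  # hypothetical values
technical = [poisson_draw(mu, rng) for _ in range(n)]
biological = [negative_binomial_draw(mu, alpha, rng) for _ in range(n)]

print("Poisson mean, variance:", statistics.mean(technical), statistics.variance(technical))
print("NB mean, variance:     ", statistics.mean(biological), statistics.variance(biological))
print("NB expected variance:  ", mu + alpha * mu**2)
```

The Poisson sample has variance close to its mean, while the gamma-Poisson sample has the same mean but a variance near μ + αμ², the overdispersion that the negative binomial model is meant to capture.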
20
DEG tools
The basic idea is that the count data are overdispersed and are modeled using a negative binomial distribution.
Poisson distribution (mean = variance) + overdispersion => negative binomial distribution
DEG tools for RNA-seq:
DEGSeq (Wang et al.): Poisson distribution
edgeR (Robinson et al., 2010): exact test based on the negative binomial distribution
DESeq (Anders and Huber, 2010): exact test based on the negative binomial distribution
21
RNA-seq for DEG
22
Tuxedo protocol: align the RNA-seq reads to the genome
Condition A: C1_R1_1.fq C1_R1_2.fq C1_R2_1.fq C1_R2_2.fq C1_R3_1.fq C1_R3_2.fq
Condition B: C2_R1_1.fq C2_R1_2.fq C2_R2_1.fq C2_R2_2.fq C2_R3_1.fq C2_R3_2.fq
1| Map the reads for each sample to the reference genome:
$ tophat -p 8 -G genes.gtf -o C1_R1_thout genome C1_R1_1.fq C1_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R2_thout genome C1_R2_1.fq C1_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C1_R3_thout genome C1_R3_1.fq C1_R3_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R1_thout genome C2_R1_1.fq C2_R1_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R2_thout genome C2_R2_1.fq C2_R2_2.fq
$ tophat -p 8 -G genes.gtf -o C2_R3_thout genome C2_R3_1.fq C2_R3_2.fq
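The six TopHat calls differ only in the sample name, so they can be generated rather than typed. A minimal sketch, assuming the file naming used on the slide (`C<condition>_R<replicate>_<mate>.fq`); the function name is hypothetical, and the script only prints the commands (a dry run) rather than executing them.

```python
def tophat_command(condition: int, replicate: int, threads: int = 8) -> list[str]:
    """Build the TopHat call for one paired-end sample, as written on the slide."""
    sample = f"C{condition}_R{replicate}"
    return [
        "tophat", "-p", str(threads), "-G", "genes.gtf",
        "-o", f"{sample}_thout", "genome",
        f"{sample}_1.fq", f"{sample}_2.fq",
    ]

commands = [tophat_command(c, r) for c in (1, 2) for r in (1, 2, 3)]
for cmd in commands:
    print(" ".join(cmd))  # dry run: print only, nothing is executed
```

If TopHat is installed, each list could be passed to `subprocess.run(cmd, check=True)` instead of being printed.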
23
Tuxedo protocol: assemble expressed genes and transcripts
2| Assemble transcripts for each sample:
$ cufflinks -p 8 -o C1_R1_clout C1_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R2_clout C1_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C1_R3_clout C1_R3_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R1_clout C2_R1_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R2_clout C2_R2_thout/accepted_hits.bam
$ cufflinks -p 8 -o C2_R3_clout C2_R3_thout/accepted_hits.bam
3| Create a file called assemblies.txt that lists the assembly file for each sample. The file should contain the following lines:
./C1_R1_clout/transcripts.gtf
./C2_R2_clout/transcripts.gtf
./C1_R2_clout/transcripts.gtf
./C2_R1_clout/transcripts.gtf
./C1_R3_clout/transcripts.gtf
./C2_R3_clout/transcripts.gtf
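The contents of assemblies.txt follow directly from the Cufflinks output directories, so the file can be generated. A minimal sketch with hypothetical function names; the paths are listed by condition and replicate, which is a different order from the slide but contains the same six entries.

```python
from pathlib import Path

def assembly_paths(conditions=(1, 2), replicates=(1, 2, 3)) -> list[str]:
    """Paths to the Cufflinks transcripts.gtf of every sample."""
    return [f"./C{c}_R{r}_clout/transcripts.gtf"
            for c in conditions for r in replicates]

def write_assemblies(path: Path) -> None:
    """Write one assembly path per line, the layout shown for assemblies.txt."""
    path.write_text("\n".join(assembly_paths()) + "\n")

for line in assembly_paths():
    print(line)
```

Calling `write_assemblies(Path("assemblies.txt"))` in the working directory would create the file itself; the sketch above only prints the lines.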
24
Tuxedo protocol: assemble expressed genes and transcripts
4| Run Cuffmerge on all your assemblies to create a single merged transcriptome annotation:
$ cuffmerge -g genes.gtf -s genome.fa -p 8 assemblies.txt
Identify differentially expressed genes and transcripts
5| Run Cuffdiff by using the merged transcriptome assembly along with the BAM files from TopHat for each replicate:
$ cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf ./C1_R1_thout/accepted_hits.bam,./C1_R2_thout/accepted_hits.bam,./C1_R3_thout/accepted_hits.bam ./C2_R1_thout/accepted_hits.bam,./C2_R3_thout/accepted_hits.bam,./C2_R2_thout/accepted_hits.bam
Example of condition labels with replicate BAM files: -L Cancer,Normal C1.bam,C2.bam,C3.bam N1.bam,N2.bam,N3.bam
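The Cuffdiff call groups replicates by condition: BAM files of one condition are joined with commas and the conditions are separated by spaces, in the same order as the labels given to -L. A minimal sketch that builds the command from the slide; the function names are hypothetical and the script prints the command rather than running it.

```python
def bam_list(condition: int, replicates=(1, 2, 3)) -> str:
    """Comma-separated replicate BAMs for one condition (no spaces)."""
    return ",".join(f"./C{condition}_R{r}_thout/accepted_hits.bam"
                    for r in replicates)

def cuffdiff_command(labels=("C1", "C2"), threads: int = 8) -> list[str]:
    """Build the Cuffdiff call from the slide: one BAM group per label."""
    return [
        "cuffdiff", "-o", "diff_out", "-b", "genome.fa", "-p", str(threads),
        "-L", ",".join(labels), "-u", "merged_asm/merged.gtf",
        bam_list(1), bam_list(2),
    ]

cmd = cuffdiff_command()
print(" ".join(cmd))  # dry run only
```

Keeping each condition as a single argument makes it harder to introduce a stray space inside a replicate group, which would otherwise be read as an additional condition.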