Presentation is loading. Please wait.

Presentation is loading. Please wait.

Next – generation Transcriptome Analysis Workshop

Similar presentations


Presentation on theme: "Next – generation Transcriptome Analysis Workshop"— Presentation transcript:

1 Next – generation Transcriptome Analysis Workshop
Lecture 4.1

2 Differential gene expression analysis
DESeq2 and edgeR Reference Transcriptome De novo Transcriptome HTSeq ??? R Statistical Analysis Package Differentially Expressed Genes MA plots Volcano plots DESeq2 EdgeR Counts

3 Differential gene expression analysis
The R Project What is R??? How to run R Unix command line Rstudio plots are a little more convenient R console OK slightly less convenient for plots and help On PC, run as administrator makes installing packages easier

4 Differential gene expression analysis
Bioconductor load Bioconductor first, then use biocLite to load other packages DESeq2 edgeR > source(" Bioconductor version 3.1 (BiocInstaller ), ?biocLite for help > biocLite("DESeq2") BioC_mirror: Using Bioconductor version 3.1 (BiocInstaller ), R version Installing package(s) ‘DESeq2’ trying URL ' Content type 'application/zip' length bytes (3.2 MB) downloaded 3.2 MB package ‘DESeq2’ successfully unpacked and MD5 sums checked The downloaded binary packages are in C:\Users\Gribskov Admin\AppData\Local\Temp\RtmpeErDxa\downloaded_packages Old packages: 'MASS', 'S4Vectors' Update all/some/none? [a/s/n]: a There are binary versions available but the source versions are later: binary source needs_compilation MASS TRUE S4Vectors TRUE Binaries will be installed trying URL ' Content type 'application/zip' length bytes (1.0 MB) downloaded 1.0 MB trying URL ' Content type 'application/zip' length bytes (1.6 MB) downloaded 1.6 MB package ‘MASS’ successfully unpacked and MD5 sums checked package ‘S4Vectors’ successfully unpacked and MD5 sums checked

5 Differential gene expression analysis
DESeq2 source(" biocLite("DESeq2") library("DESeq2") # set a working directory – this is on my desktop PC directory <- "c:/users/mgribsko/Desktop/htseq" setwd( directory) # make a list of the count files produced by htseq, my files from htseq are called SRR<something>.count files <- grep("SRR.*count",list.files(directory),value=TRUE) # list a variable by typing its name files [1] "SRR count" "SRR count" "SRR count" "SRR count" "SRR count" "SRR count" [7] "SRR count" "SRR count" "SRR count" #alternatively, just list the file names by hand files <- c("SRR count","SRR count","SRR count","SRR count","SRR count","SRR count", +"SRR count","SRR count","SRR count") # make a list of the sample names by extracting the substring before .count samples <- sub("(*).count","\\1",files) samples [1] "SRR " "SRR " "SRR " "SRR " "SRR " "SRR " "SRR " "SRR " "SRR " # create a dataframe, called matrix, with the sample names, files, and conditions matrix <- data.frame(sampleName=samples, fileName=files, condition=c("wt","wt","wt","gpf1","gpf1","gpf1","cnf2","cnf2","cnf2")) matrix sampleName fileName condition 1 SRR SRR count wt 2 SRR SRR count wt 3 SRR SRR count wt 4 SRR SRR count gpf1 5 SRR SRR count gpf1 6 SRR SRR count gpf1 7 SRR SRR count cnf2 8 SRR SRR count cnf2 9 SRR SRR count cnf2 # read data into DESeq2, this is a special method designed to work with HTSeq files ds<-DESeqDataSetFromHTSeqCount(sampleTable=matrix, directory=directory, design= ~ condition)

6 Differential gene expression analysis
DESeq2 some rows(genes) have no counts! # reorder the "levels" in the data so wt is first. levels are the sample types colData(ds)$condition<-factor(colData(ds)$condition,levels=c("wt","gpf1","cnf2")) ds class: DESeqDataSet dim: exptData(0): assays(1): counts rownames(12709): XLOC_ XLOC_ XLOC_ XLOC_012709 rowData metadata column names(0): colnames(9): SRR SRR SRR SRR colData names(1): condition #look at the first 10 rows counts(ds[1:10,]) SRR SRR SRR SRR SRR SRR SRR SRR SRR XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ # total counts in each sample colSums(counts(ds[,1:9])) plot(counts(ds[,1]),counts(ds[,2]),pch=20) plot(log10(counts(ds[,1])),log10(counts(ds[,2])),pch=20) 12709 rows, 9 columns

7 Differential gene expression analysis
DESeq2 # DESeq2 is run in a single command ds <- DESeq(ds) estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing #results are extracted using the results function res_gpf1_wt <- results(ds,contrast=c("condition","wt","gpf1")) res_gpf1_wt log2 fold change (MAP): condition wt vs gpf1 Wald test p-value: condition wt vs gpf1 DataFrame with rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> XLOC_ NA XLOC_ NA NA NA NA NA XLOC_ NA XLOC_ XLOC_ NA XLOC_ NA NA NA NA NA XLOC_ NA NA NA NA NA XLOC_ NA NA NA NA NA XLOC_ XLOC_ NA

8 Differential gene expression analysis
DESeq2 should be approximately log-normal, we see a big tail on the left (low expression) side DESeq incorporates a test for outliers (Cook's cutoff) # show in order of adjusted P Value (FDR) res_gpf1_wt[order(res_gpf1_wt$padj),] log2 fold change (MAP): condition wt vs gpf1 Wald test p-value: condition wt vs gpf1 DataFrame with rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> XLOC_ e e+00 XLOC_ e e+00 XLOC_ e e-296 XLOC_ e e-280 XLOC_ e e-272 XLOC_ NA NA NA NA NA XLOC_ NA NA NA NA NA XLOC_ NA NA NA NA NA XLOC_ NA NA NA NA NA XLOC_ NA # histogram of mean expression levels hist(log(res_gpf1_wt[,"baseMean"]),breaks=seq(-5,15,by=0.25),main="gpf1 vs wt, no cutoff")

9 Differential gene expression analysis
DESeq2 # Apply Cook's cutoff cut_gpf1_wt <- results(ds,contrast=c("condition","wt","gpf1"),cooksCutoff=1) par(mfrow=c(2,1)) hist(log(cut_gpf1_wt[!is.na(cut_gpf1_wt[,"padj"]),"baseMean"]),breaks=seq(-5,15,by=0.25),main="gpf1 vs wt,cutoff=1") hist(log(res_gpf1_wt[,"baseMean"]),breaks=seq(-5,15,by=0.25),main="gpf1 vs wt,no cutoff")

10 Differential gene expression analysis
DESeq2 # remove rows with adjusted P Value = NA (Cook's cutoff) cut_gpf1_wt <- cut_gpf1_wt[!is.na(cut_gpf1_wt[,"padj"]),] cut_gpf1_wt log2 fold change (MAP): condition wt vs gpf1 Wald test p-value: condition wt vs gpf1 DataFrame with 8519 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> XLOC_ e e-01 XLOC_ e e-01 XLOC_ e e-02 XLOC_ e e-01 XLOC_ e e-14 XLOC_ e e-01 XLOC_ e e-07 XLOC_ e e-01 XLOC_ e e-01 XLOC_ e e-01 # Simple PCA to check whether the dat make sense, i.e., do the replicates seem more like each other than the different samples vsd=varianceStabilizingTransformation(cut_gpf1_wt) plotPCA(vsd,intgroup=c("condition")) 12709 rows -> 8519

11 Differential gene expression analysis
DESeq2 # MA plot is a traditional way to look at DGE results plotMA(cut_gpf1_wt,ylim=c(-8,8)) abline(h=2,col="blue") abline(h=-2,col="blue") # volcano plots are another traditional method with(cut_gpf1_wt,plot(log2FoldChange,-log10(padj),pch=20)) with(subset(cut_gpf1_wt,padj<0.01),points(log2FoldChange,-log10(padj),pch=20,col="red")) with(subset(cut_gpf1_wt,abs(log2FoldChange)>3 & padj<0.01),points(log2FoldChange,-log10(padj),pch=20,col="green"))

12 Differential gene expression analysis
DESeq2 # plot estimated dispersion plotDispEsts(ds) # select significant genes with large fold change select=cut_gpf1_wt[cut_gpf1_wt[,"padj"]<0.01 & abs(cut_gpf1_wt[,"log2FoldChange"])>4,] select=select[order(select$log2FoldChange),] # write results to file write.table( select,file="c:/users/mgribsko/Desktop/htseq/edger.selected",row.names=T,col.names=T,sep="\t")

13 Differential gene expression analysis
edgeR >ds # starting with the counts from htseq that were read into DESeq2 class: DESeqDataSet dim: exptData(0): assays(1): counts rownames(12709): XLOC_ XLOC_ XLOC_ XLOC_012709 rowRanges metadata column names(0): colnames(9): SRR SRR SRR SRR colData names(1): condition mm = counts(ds) # extract the count matrix head(mm) SRR SRR SRR SRR SRR SRR SRR SRR SRR XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ > colSums( mm ) # Library Sizes # some simple exploratory plots hist(log(mm[,1]),breaks=100)

14 Differential gene expression analysis
edgeR plot(log(mm[,1]),log(mm[,2]),pch=20, col=rgb(1,0,0,0.01)) pairs(log(mm[,1:9]),pch=20,cex=0.5)

15 Differential gene expression analysis
edgeR # make the DGEList (data structure used by edgeR) group=c("wt","wt","wt","gpf1","gpf1","gpf1","cnf2","cnf2","cnf2") group [1] "wt" "wt" "wt" "gpf1" "gpf1" "gpf1" "cnf2" "cnf2" "cnf2" cds = DGEList( mm, group=group ) cds An object of class "DGEList" $counts SRR SRR SRR SRR SRR SRR SRR SRR SRR XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ 12704 more rows ... $samples group lib.size norm.factors SRR wt SRR wt SRR wt SRR gpf SRR gpf SRR gpf SRR cnf SRR cnf SRR cnf # filter counts that are too low to be reliable, the edgeR author's suggest something like the following # cds <- cds[rowSums(1e+06 * cds$counts/expandAsMatrix(cds$samples$lib.size, dim(cds)) > 1) >= 3, ] # I'll do something a little more complex, I want to keep any gene for which one or more samples have at least # two values with at least 30 counts cds=cds[(rowSums(mm[,1:3]>=30)>2|rowSums(mm[,4:6]>=30)>2|rowSums(mm[,7:9]>=30)>2),] dim(cds) [1] # 7620 of pass this filter

16 Differential gene expression analysis
edgeR # get library normalization factors cds <- calcNormFactors( cds ) cds$samples group lib.size norm.factors SRR wt SRR wt SRR wt SRR gpf SRR gpf SRR gpf SRR cnf SRR cnf SRR cnf # do my replicates seem closer to each other than the samples plotMDS( cds , main = "MDS Plot for Count Data", pch=20,cex=5,col=c(rep("red",3),rep("green",3),rep("blue",3)),top=1000 ) # estimate the dispersions cds <- estimateCommonDisp(cds) cds <- estimateTagwiseDisp( cds )

17 Differential gene expression analysis
edgeR An object of class "DGEList" $counts SRR SRR SRR SRR SRR SRR SRR SRR SRR XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ $samples group lib.size norm.factors SRR wt SRR wt SRR wt SRR gpf SRR gpf SRR gpf SRR cnf SRR cnf SRR cnf $common.dispersion [1] $pseudo.counts XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ $pseudo.lib.size [1] $AveLogCPM [1] 7615 more elements ... $prior.n [1] $tagwise.dispersion [1]

18 Differential gene expression analysis
edgeR - dispersion meanVarPlot <- plotMeanVar( cds , show.raw.vars=T,show.tagwise.vars=T,pch=19,show.binned.common.disp.vars=F,show.ave.raw.vars=F, + nbins = 100 ,xlab ="Mean Expression (Log10 Scale)" ,ylab = "Variance (Log10 Scale)" ,main = "Mean-Variance Plot")

19 Differential gene expression analysis
edgeR # the significance test gpf_wt=exactTest(cds,pair=c("wt","gpf1")) # select the top DEGs sel=abs(res[,"logFC"])>5 & res[,"PValue"]<0.01 selres=res[sel,] selres = selres[order(selres[,"logFC"]),] head(selres) logFC logCPM PValue XLOC_ e-233 XLOC_ e-28 XLOC_ e-26 XLOC_ e-71 XLOC_ e-45 XLOC_ e-21 # save the top DEGs write.table( selres,file="c:/users/mgribsko/Desktop/edger.selected",row.names=T,col.names=T,sep="\t")

20 Differential gene expression analysis
edgeR - volcano plot head(res) logFC logCPM PValue XLOC_ e+00 XLOC_ e+00 XLOC_ e-245 XLOC_ e-233 XLOC_ e-221 XLOC_ e-202 with(res,plot(logFC,-log10(PValue),pch=20)) with(subset(res,PValue<0.01),points(logFC,-log10(PValue),pch=20,col="red")) with(subset(res,abs(logFC)>5),points(logFC,-log10(PValue),pch=20,col="orange")) with(subset(res,abs(logFC)>5 & PValue<0.01),points(logFC,-log10(PValue),pch=20,col="green"))

21 Differential gene expression analysis
edgeR - MA plot (edgeR) plot(res[,"logCPM"],res[,"logFC"],pch=20) points(selres[,"logCPM"],selres[,"logFC"],pch=20,col="red")

22 Differential gene expression analysis
DESeq2 vs edgeR

23 Trinity - De novo transcriptome assembly
DEG analysis of Trinity results Use RSEM to get counts run RSEM on individual samples gather in to counts matrix # prepare reference # rsem-prepare-reference [options] reference_fasta_file(s) reference_name $TRINITY/trinity-plugins/rsem /rsem-prepare-reference \ --transcript-to-gene-map gene_map \ --bowtie2 \ ../trinity_out_dir/Trinity.fasta \ magnaporthe.ref # rsem-calculate-expression [options] upstream_read_file(s) reference_name sample_name $TRINITY/trinity-plugins/rsem /rsem-calculate-expression -p 20 \ --bowtie2 --bowtie2-sensitivity-level very_sensitive \ ../../cleaned/SRR trimmomatic.fq \ magnaporthe.ref \ magnaporte.wt.0.rsem abundance_estimates_to_matrix.pl --est_method RSEM --out magnaporthe.trinity.rsem.gene \ magnaporte.wt.0.rsem.genes.results \ magnaporte.wt.1.rsem.genes.results \ magnaporte.wt.2.rsem.genes.results \ magnaporte.gpf1.0.rsem.genes.results \ magnaporte.gpf1.1.rsem.genes.results \ magnaporte.gpf1.2.rsem.genes.results \ magnaporte.cnf2.0.rsem.genes.results \ magnaporte.cnf2.1.rsem.genes.results \ magnaporte.cnf2.2.rsem.genes.results

24 Trinity - De novo transcriptome assembly
biocLite("DESeq2") library("DESeq2") directory <- "c:/users/mgribsko/Desktop/trinity" setwd( directory) t <- read.delim("magnaporthe.trinity.rsem.gene.counts.matrix",header=T,sep="\t",row.names=1) head(t) magnaporte.wt.0.rsem magnaporte.wt.1.rsem magnaporte.wt.2.rsem magnaporte.gpf1.0.rsem magnaporte.gpf1.1.rsem TR8197|c0_all TR4360|c0_all TR16310|c0_all TR4052|c0_all TR9094|c0_all TR21273|c0_all magnaporte.gpf1.2.rsem magnaporte.cnf2.0.rsem magnaporte.cnf2.1.rsem magnaporte.cnf2.2.rsem TR8197|c0_all TR4360|c0_all TR16310|c0_all TR4052|c0_all TR9094|c0_all TR21273|c0_all # rename the columns to be a little more convenient colnames(t) <- sub("magnaporte\\.(.*)\\.rsem","\\1",colnames(t)) # set up sample metadata matrix matrix=data.frame(sampleName=colnames(t),condition=c("wt","wt","wt","gpf1","gpf1","gpf1","cnf2","cnf2","cnf2")) # read into DESeq data set count=DESeqDataSetFromMatrix(round(t),colData=matrix,design=~condition,tidy=F)

25 Trinity - De novo transcriptome assembly
vsd=varianceStabilizingTransformation(count) -- note: fitType='parametric', but the dispersion trend was not well captured by the function: y = a/x + b, and a local regression fit was automatically substituted. specify fitType='local' or 'mean' to avoid this message next time. plotPCA(vsd,intgroup=c("condition))

26 Trinity - De novo transcriptome assembly
hist(log10(t[,1]),breaks=seq(-5,15,by=0.05)) pairs(log(t[,1:9]),pch=20,cex=0.5) plot(log(t[,7]),log(t[,6]),pch=20,cex=0.5)

27 Trinity - De novo transcriptome assembly
count <- DESeq(count) estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing gpf1_wt=results(count,contrast=c("condition","gpf1","wt"),cooksCutoff=0.5) table(is.na(gpf1_wt[,"padj"])) FALSE TRUE hist(log(gpf1_wt[!is.na(gpf1_wt[,"padj"]),"baseMean"]),breaks=seq(-5,15,by=0.25),main="gpf1 vs wt,cutoff=0.5")

28 Trinity - De novo transcriptome assembly
cut_gpf1_wt <- gpf1_wt[!is.na(gpf1_wt[,"padj"]),] cut_gpf1_wt log2 fold change (MAP): condition gpf1 vs wt Wald test p-value: condition gpf1 vs wt DataFrame with rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> TR16310|c0_all e e-01 TR4052|c0_all e e-02 TR9094|c0_all e e-01 TR21273|c0_all e e-19 TR2841|c0_all e e-66 TR10766|c0_all TR12328|c0_all TR18453|c0_all TR14941|c0_all TR11481|c0_all plotMA(cut_gpf1_wt,ylim=c(-8,8)) abline(h=2,col="blue") abline(h=-2,col="blue") # select DEGs select=cut_gpf1_wt[cut_gpf1_wt[,"padj"]<0.01 & abs(cut_gpf1_wt[,"log2FoldChange"])>4,] select=select[order(select$log2FoldChange),] # write results to file write.table( select,file="c:/users/mgribsko/Desktop/trinity/trinity_deseq.selected",row.names=T,col.names=T,sep="\t")

29 Trinity - De novo transcriptome assembly
reference

30 Trinity - De novo transcriptome assembly
edger_gpf1_wt=exactTest(cds,pair=c("wt","gpf1")) edger_gpf1_wt An object of class "DGEExact" $table logFC logCPM PValue TR16310|c0_all e-01 TR4052|c0_all e-02 TR21273|c0_all e-22 TR2841|c0_all e-37 TR6873|c0_all e-01 10114 more rows ... $comparison [1] "wt" "gpf1" $genes NULL sel = edger_gpf1_wt$table dim(sel) [1] sel=sel[abs(sel[,"logFC"])>5 & sel[,"PValue"]<0.01,] [1] sel=sel[order(sel[,"logFC"]),] head(sel) logFC logCPM PValue TR21909|c0_all e-181 TR24805|c0_all e-18 TR14368|c0_all e-16 TR6143|c0_all e-45 TR20597|c0_all e-04 TR20018|c0_all e-13 t["TR24805|c0_all",] wt.0 wt.1 wt.2 gpf1.0 gpf1.1 gpf1.2 cnf2.0 cnf2.1 cnf2.2 TR24805|c0_all write.table(sel,file="c:/users/mgribsko/Desktop/trinity_edger.selected",row.names=T,col.names=T,sep="\t")


Download ppt "Next – generation Transcriptome Analysis Workshop"

Similar presentations


Ads by Google