Next – generation Transcriptome Analysis Workshop

Next – generation Transcriptome Analysis Workshop
Lecture 4.1

Differential gene expression analysis
DESeq2 and edgeR Reference Transcriptome De novo Transcriptome HTSeq ??? R Statistical Analysis Package Differentially Expressed Genes MA plots Volcano plots DESeq2 EdgeR Counts

The R Project What is R??? How to run R Unix command line Rstudio plots are a little more convenient R console OK slightly less convenient for plots and help On PC, run as administrator makes installing packages easier

Bioconductor load Bioconductor first, then use biocLite to load other packages DESeq2 edgeR > source(" Bioconductor version 3.1 (BiocInstaller ), ?biocLite for help > biocLite("DESeq2") BioC_mirror: Using Bioconductor version 3.1 (BiocInstaller ), R version Installing package(s) ‘DESeq2’ trying URL ' Content type 'application/zip' length bytes (3.2 MB) downloaded 3.2 MB package ‘DESeq2’ successfully unpacked and MD5 sums checked The downloaded binary packages are in C:\Users\Gribskov Admin\AppData\Local\Temp\RtmpeErDxa\downloaded_packages Old packages: 'MASS', 'S4Vectors' Update all/some/none? [a/s/n]: a There are binary versions available but the source versions are later: binary source needs_compilation MASS TRUE S4Vectors TRUE Binaries will be installed trying URL ' Content type 'application/zip' length bytes (1.0 MB) downloaded 1.0 MB trying URL ' Content type 'application/zip' length bytes (1.6 MB) downloaded 1.6 MB package ‘MASS’ successfully unpacked and MD5 sums checked package ‘S4Vectors’ successfully unpacked and MD5 sums checked

DESeq2 source(" biocLite("DESeq2") library("DESeq2") # set a working directory – this is on my desktop PC directory <- "c:/users/mgribsko/Desktop/htseq" setwd( directory) # make a list of the count files produced by htseq, my files from htseq are called SRR<something>.count files <- grep("SRR.*count",list.files(directory),value=TRUE) # list a variable by typing its name files [1] "SRR count" "SRR count" "SRR count" "SRR count" "SRR count" "SRR count" [7] "SRR count" "SRR count" "SRR count" #alternatively, just list the file names by hand files <- c("SRR count","SRR count","SRR count","SRR count","SRR count","SRR count", +"SRR count","SRR count","SRR count") # make a list of the sample names by extracting the substring before .count samples <- sub("(*).count","\\1",files) samples [1] "SRR " "SRR " "SRR " "SRR " "SRR " "SRR " "SRR " "SRR " "SRR " # create a dataframe, called matrix, with the sample names, files, and conditions matrix <- data.frame(sampleName=samples, fileName=files, condition=c("wt","wt","wt","gpf1","gpf1","gpf1","cnf2","cnf2","cnf2")) matrix sampleName fileName condition 1 SRR SRR count wt 2 SRR SRR count wt 3 SRR SRR count wt 4 SRR SRR count gpf1 5 SRR SRR count gpf1 6 SRR SRR count gpf1 7 SRR SRR count cnf2 8 SRR SRR count cnf2 9 SRR SRR count cnf2 # read data into DESeq2, this is a special method designed to work with HTSeq files ds<-DESeqDataSetFromHTSeqCount(sampleTable=matrix, directory=directory, design= ~ condition)

DESeq2 some rows(genes) have no counts! # reorder the "levels" in the data so wt is first. levels are the sample types colData(ds)$condition<-factor(colData(ds)$condition,levels=c("wt","gpf1","cnf2")) ds class: DESeqDataSet dim: exptData(0): assays(1): counts rownames(12709): XLOC_ XLOC_ XLOC_ XLOC_012709 rowData metadata column names(0): colnames(9): SRR SRR SRR SRR colData names(1): condition #look at the first 10 rows counts(ds[1:10,]) SRR SRR SRR SRR SRR SRR SRR SRR SRR XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ # total counts in each sample colSums(counts(ds[,1:9])) plot(counts(ds[,1]),counts(ds[,2]),pch=20) plot(log10(counts(ds[,1])),log10(counts(ds[,2])),pch=20) 12709 rows, 9 columns

DESeq2 # DESeq2 is run in a single command ds <- DESeq(ds) estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing #results are extracted using the results function res_gpf1_wt <- results(ds,contrast=c("condition","wt","gpf1")) res_gpf1_wt log2 fold change (MAP): condition wt vs gpf1 Wald test p-value: condition wt vs gpf1 DataFrame with rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> XLOC_ NA XLOC_ NA NA NA NA NA XLOC_ NA XLOC_ XLOC_ NA XLOC_ NA NA NA NA NA XLOC_ NA NA NA NA NA XLOC_ NA NA NA NA NA XLOC_ XLOC_ NA

DESeq2 should be approximately log-normal, we see a big tail on the left (low expression) side DESeq incorporates a test for outliers (Cook's cutoff) # show in order of adjusted P Value (FDR) res_gpf1_wt[order(res_gpf1_wt$padj),] log2 fold change (MAP): condition wt vs gpf1 Wald test p-value: condition wt vs gpf1 DataFrame with rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> XLOC_ e e+00 XLOC_ e e+00 XLOC_ e e-296 XLOC_ e e-280 XLOC_ e e-272 XLOC_ NA NA NA NA NA XLOC_ NA NA NA NA NA XLOC_ NA NA NA NA NA XLOC_ NA NA NA NA NA XLOC_ NA # histogram of mean expression levels hist(log(res_gpf1_wt[,"baseMean"]),breaks=seq(-5,15,by=0.25),main="gpf1 vs wt, no cutoff")

DESeq2 # Apply Cook's cutoff cut_gpf1_wt <- results(ds,contrast=c("condition","wt","gpf1"),cooksCutoff=1) par(mfrow=c(2,1)) hist(log(cut_gpf1_wt[!is.na(cut_gpf1_wt[,"padj"]),"baseMean"]),breaks=seq(-5,15,by=0.25),main="gpf1 vs wt,cutoff=1") hist(log(res_gpf1_wt[,"baseMean"]),breaks=seq(-5,15,by=0.25),main="gpf1 vs wt,no cutoff")

DESeq2 # remove rows with adjusted P Value = NA (Cook's cutoff) cut_gpf1_wt <- cut_gpf1_wt[!is.na(cut_gpf1_wt[,"padj"]),] cut_gpf1_wt log2 fold change (MAP): condition wt vs gpf1 Wald test p-value: condition wt vs gpf1 DataFrame with 8519 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> XLOC_ e e-01 XLOC_ e e-01 XLOC_ e e-02 XLOC_ e e-01 XLOC_ e e-14 XLOC_ e e-01 XLOC_ e e-07 XLOC_ e e-01 XLOC_ e e-01 XLOC_ e e-01 # Simple PCA to check whether the dat make sense, i.e., do the replicates seem more like each other than the different samples vsd=varianceStabilizingTransformation(cut_gpf1_wt) plotPCA(vsd,intgroup=c("condition")) 12709 rows -> 8519

DESeq2 # MA plot is a traditional way to look at DGE results plotMA(cut_gpf1_wt,ylim=c(-8,8)) abline(h=2,col="blue") abline(h=-2,col="blue") # volcano plots are another traditional method with(cut_gpf1_wt,plot(log2FoldChange,-log10(padj),pch=20)) with(subset(cut_gpf1_wt,padj<0.01),points(log2FoldChange,-log10(padj),pch=20,col="red")) with(subset(cut_gpf1_wt,abs(log2FoldChange)>3 & padj<0.01),points(log2FoldChange,-log10(padj),pch=20,col="green"))

DESeq2 # plot estimated dispersion plotDispEsts(ds) # select significant genes with large fold change select=cut_gpf1_wt[cut_gpf1_wt[,"padj"]<0.01 & abs(cut_gpf1_wt[,"log2FoldChange"])>4,] select=select[order(select$log2FoldChange),] # write results to file write.table( select,file="c:/users/mgribsko/Desktop/htseq/edger.selected",row.names=T,col.names=T,sep="\t")

edgeR >ds # starting with the counts from htseq that were read into DESeq2 class: DESeqDataSet dim: exptData(0): assays(1): counts rownames(12709): XLOC_ XLOC_ XLOC_ XLOC_012709 rowRanges metadata column names(0): colnames(9): SRR SRR SRR SRR colData names(1): condition mm = counts(ds) # extract the count matrix head(mm) SRR SRR SRR SRR SRR SRR SRR SRR SRR XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ > colSums( mm ) # Library Sizes # some simple exploratory plots hist(log(mm[,1]),breaks=100)

edgeR plot(log(mm[,1]),log(mm[,2]),pch=20, col=rgb(1,0,0,0.01)) pairs(log(mm[,1:9]),pch=20,cex=0.5)

edgeR # make the DGEList (data structure used by edgeR) group=c("wt","wt","wt","gpf1","gpf1","gpf1","cnf2","cnf2","cnf2") group [1] "wt" "wt" "wt" "gpf1" "gpf1" "gpf1" "cnf2" "cnf2" "cnf2" cds = DGEList( mm, group=group ) cds An object of class "DGEList" $counts SRR SRR SRR SRR SRR SRR SRR SRR SRR XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ 12704 more rows ... $samples group lib.size norm.factors SRR wt SRR wt SRR wt SRR gpf SRR gpf SRR gpf SRR cnf SRR cnf SRR cnf # filter counts that are too low to be reliable, the edgeR author's suggest something like the following # cds <- cds[rowSums(1e+06 * cds$counts/expandAsMatrix(cds$samples$lib.size, dim(cds)) > 1) >= 3, ] # I'll do something a little more complex, I want to keep any gene for which one or more samples have at least # two values with at least 30 counts cds=cds[(rowSums(mm[,1:3]>=30)>2|rowSums(mm[,4:6]>=30)>2|rowSums(mm[,7:9]>=30)>2),] dim(cds) [1] # 7620 of pass this filter

edgeR # get library normalization factors cds <- calcNormFactors( cds ) cds$samples group lib.size norm.factors SRR wt SRR wt SRR wt SRR gpf SRR gpf SRR gpf SRR cnf SRR cnf SRR cnf # do my replicates seem closer to each other than the samples plotMDS( cds , main = "MDS Plot for Count Data", pch=20,cex=5,col=c(rep("red",3),rep("green",3),rep("blue",3)),top=1000 ) # estimate the dispersions cds <- estimateCommonDisp(cds) cds <- estimateTagwiseDisp( cds )

edgeR An object of class "DGEList" $counts SRR SRR SRR SRR SRR SRR SRR SRR SRR XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ $samples group lib.size norm.factors SRR wt SRR wt SRR wt SRR gpf SRR gpf SRR gpf SRR cnf SRR cnf SRR cnf $common.dispersion [1] $pseudo.counts XLOC_ XLOC_ XLOC_ XLOC_ XLOC_ $pseudo.lib.size [1] $AveLogCPM [1] 7615 more elements ... $prior.n [1] $tagwise.dispersion [1]

edgeR - dispersion meanVarPlot <- plotMeanVar( cds , show.raw.vars=T,show.tagwise.vars=T,pch=19,show.binned.common.disp.vars=F,show.ave.raw.vars=F, + nbins = 100 ,xlab ="Mean Expression (Log10 Scale)" ,ylab = "Variance (Log10 Scale)" ,main = "Mean-Variance Plot")

edgeR # the significance test gpf_wt=exactTest(cds,pair=c("wt","gpf1")) # select the top DEGs sel=abs(res[,"logFC"])>5 & res[,"PValue"]<0.01 selres=res[sel,] selres = selres[order(selres[,"logFC"]),] head(selres) logFC logCPM PValue XLOC_ e-233 XLOC_ e-28 XLOC_ e-26 XLOC_ e-71 XLOC_ e-45 XLOC_ e-21 # save the top DEGs write.table( selres,file="c:/users/mgribsko/Desktop/edger.selected",row.names=T,col.names=T,sep="\t")

edgeR - volcano plot head(res) logFC logCPM PValue XLOC_ e+00 XLOC_ e+00 XLOC_ e-245 XLOC_ e-233 XLOC_ e-221 XLOC_ e-202 with(res,plot(logFC,-log10(PValue),pch=20)) with(subset(res,PValue<0.01),points(logFC,-log10(PValue),pch=20,col="red")) with(subset(res,abs(logFC)>5),points(logFC,-log10(PValue),pch=20,col="orange")) with(subset(res,abs(logFC)>5 & PValue<0.01),points(logFC,-log10(PValue),pch=20,col="green"))

edgeR - MA plot (edgeR) plot(res[,"logCPM"],res[,"logFC"],pch=20) points(selres[,"logCPM"],selres[,"logFC"],pch=20,col="red")

DESeq2 vs edgeR

Trinity - De novo transcriptome assembly
DEG analysis of Trinity results Use RSEM to get counts run RSEM on individual samples gather in to counts matrix # prepare reference # rsem-prepare-reference [options] reference_fasta_file(s) reference_name $TRINITY/trinity-plugins/rsem /rsem-prepare-reference \ --transcript-to-gene-map gene_map \ --bowtie2 \ ../trinity_out_dir/Trinity.fasta \ magnaporthe.ref # rsem-calculate-expression [options] upstream_read_file(s) reference_name sample_name $TRINITY/trinity-plugins/rsem /rsem-calculate-expression -p 20 \ --bowtie2 --bowtie2-sensitivity-level very_sensitive \ ../../cleaned/SRR trimmomatic.fq \ magnaporthe.ref \ magnaporte.wt.0.rsem abundance_estimates_to_matrix.pl --est_method RSEM --out magnaporthe.trinity.rsem.gene \ magnaporte.wt.0.rsem.genes.results \ magnaporte.wt.1.rsem.genes.results \ magnaporte.wt.2.rsem.genes.results \ magnaporte.gpf1.0.rsem.genes.results \ magnaporte.gpf1.1.rsem.genes.results \ magnaporte.gpf1.2.rsem.genes.results \ magnaporte.cnf2.0.rsem.genes.results \ magnaporte.cnf2.1.rsem.genes.results \ magnaporte.cnf2.2.rsem.genes.results

biocLite("DESeq2") library("DESeq2") directory <- "c:/users/mgribsko/Desktop/trinity" setwd( directory) t <- read.delim("magnaporthe.trinity.rsem.gene.counts.matrix",header=T,sep="\t",row.names=1) head(t) magnaporte.wt.0.rsem magnaporte.wt.1.rsem magnaporte.wt.2.rsem magnaporte.gpf1.0.rsem magnaporte.gpf1.1.rsem TR8197|c0_all TR4360|c0_all TR16310|c0_all TR4052|c0_all TR9094|c0_all TR21273|c0_all magnaporte.gpf1.2.rsem magnaporte.cnf2.0.rsem magnaporte.cnf2.1.rsem magnaporte.cnf2.2.rsem TR8197|c0_all TR4360|c0_all TR16310|c0_all TR4052|c0_all TR9094|c0_all TR21273|c0_all # rename the columns to be a little more convenient colnames(t) <- sub("magnaporte\\.(.*)\\.rsem","\\1",colnames(t)) # set up sample metadata matrix matrix=data.frame(sampleName=colnames(t),condition=c("wt","wt","wt","gpf1","gpf1","gpf1","cnf2","cnf2","cnf2")) # read into DESeq data set count=DESeqDataSetFromMatrix(round(t),colData=matrix,design=~condition,tidy=F)

vsd=varianceStabilizingTransformation(count) -- note: fitType='parametric', but the dispersion trend was not well captured by the function: y = a/x + b, and a local regression fit was automatically substituted. specify fitType='local' or 'mean' to avoid this message next time. plotPCA(vsd,intgroup=c("condition))

hist(log10(t[,1]),breaks=seq(-5,15,by=0.05)) pairs(log(t[,1:9]),pch=20,cex=0.5) plot(log(t[,7]),log(t[,6]),pch=20,cex=0.5)

count <- DESeq(count) estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing gpf1_wt=results(count,contrast=c("condition","gpf1","wt"),cooksCutoff=0.5) table(is.na(gpf1_wt[,"padj"])) FALSE TRUE hist(log(gpf1_wt[!is.na(gpf1_wt[,"padj"]),"baseMean"]),breaks=seq(-5,15,by=0.25),main="gpf1 vs wt,cutoff=0.5")

cut_gpf1_wt <- gpf1_wt[!is.na(gpf1_wt[,"padj"]),] cut_gpf1_wt log2 fold change (MAP): condition gpf1 vs wt Wald test p-value: condition gpf1 vs wt DataFrame with rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue padj <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> TR16310|c0_all e e-01 TR4052|c0_all e e-02 TR9094|c0_all e e-01 TR21273|c0_all e e-19 TR2841|c0_all e e-66 TR10766|c0_all TR12328|c0_all TR18453|c0_all TR14941|c0_all TR11481|c0_all plotMA(cut_gpf1_wt,ylim=c(-8,8)) abline(h=2,col="blue") abline(h=-2,col="blue") # select DEGs select=cut_gpf1_wt[cut_gpf1_wt[,"padj"]<0.01 & abs(cut_gpf1_wt[,"log2FoldChange"])>4,] select=select[order(select$log2FoldChange),] # write results to file write.table( select,file="c:/users/mgribsko/Desktop/trinity/trinity_deseq.selected",row.names=T,col.names=T,sep="\t")

reference

edger_gpf1_wt=exactTest(cds,pair=c("wt","gpf1")) edger_gpf1_wt An object of class "DGEExact" $table logFC logCPM PValue TR16310|c0_all e-01 TR4052|c0_all e-02 TR21273|c0_all e-22 TR2841|c0_all e-37 TR6873|c0_all e-01 10114 more rows ... $comparison [1] "wt" "gpf1" $genes NULL sel = edger_gpf1_wt$table dim(sel) [1] sel=sel[abs(sel[,"logFC"])>5 & sel[,"PValue"]<0.01,] [1] sel=sel[order(sel[,"logFC"]),] head(sel) logFC logCPM PValue TR21909|c0_all e-181 TR24805|c0_all e-18 TR14368|c0_all e-16 TR6143|c0_all e-45 TR20597|c0_all e-04 TR20018|c0_all e-13 t["TR24805|c0_all",] wt.0 wt.1 wt.2 gpf1.0 gpf1.1 gpf1.2 cnf2.0 cnf2.1 cnf2.2 TR24805|c0_all write.table(sel,file="c:/users/mgribsko/Desktop/trinity_edger.selected",row.names=T,col.names=T,sep="\t")

Next – generation Transcriptome Analysis Workshop

Similar presentations

Presentation on theme: "Next – generation Transcriptome Analysis Workshop"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Next – generation Transcriptome Analysis Workshop

Similar presentations

Presentation on theme: "Next – generation Transcriptome Analysis Workshop"— Presentation transcript:

Similar presentations

About project

Feedback