Analysis of genomes and transcriptomes using RNA-seq and ChIP-seq
Practical session
Leonardo Mariño-Ramírez, PhD
NCBI / NLM / NIH
ICGEB – Practical Course "Bioinformatics: Computer Methods in Molecular Biology"
June 26-30, 2017
RNA-seq workflow for the tutorial
Slides ftp://ftp.ncbi.nlm.nih.gov/pub/marino/teaching/ICGEB/2016/
Differential gene expression from RNA-Seq data
1. Change to the RNA-Seq directory and launch R

user0@head:~$ cd marino-data/RNA-Seq/
user0@head:~/marino-data/RNA-Seq$ ll
total 344372
-rw-r----- 1 user0 user0 348194816 Apr 7 20:58 GSE27003GPL9115_DGE_22ba48b764533b15733122c3e8e01ae1.db
-rw-r----- 1 user0 user0 4438128 Apr 7 20:58 GSE27003GPL9115_DGE_RNASeq_22ba48b764533b15733122c3e8e01ae1_report.pdf
user0@head:~/marino-data/RNA-Seq$ R

R version 3.3.0 (2016-05-03) -- "Supposedly Educational"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
2. Load the cummeRbund library (http://compbio.mit.edu/cummeRbund/) and perform basic statistical operations

> library(cummeRbund)
Loading required package: BiocGenerics
Attaching package: 'BiocGenerics'
The following object(s) are masked from 'package:stats':
    xtabs
The following object(s) are masked from 'package:base':
    Filter, Find, Map, Position, Reduce, anyDuplicated, cbind,
    colnames, duplicated, eval, get, intersect, lapply, mapply, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, rbind, rep.int,
    rownames, sapply, setdiff, table, tapply, union, unique
Loading required package: RSQLite
Loading required package: DBI
Loading required package: ggplot2
Loading required package: reshape2
Loading required package: fastcluster
Attaching package: 'fastcluster'
    hclust
Loading required package: rtracklayer
Loading required package: GenomicRanges
Loading required package: IRanges
Loading required package: Gviz
Loading required package: grid
>
> list.files()
[1] "GSE27003GPL9115_DGE_22ba48b764533b15733122c3e8e01ae1.db"
[2] "GSE27003GPL9115_DGE_RNASeq_22ba48b764533b15733122c3e8e01ae1_report.pdf"
> cuff <- readCufflinks(dbFile="GSE27003GPL9115_DGE_22ba48b764533b15733122c3e8e01ae1.db")
> dens <- csDensity(genes(cuff))
> dens
Warning messages:
1: Removed 1997 rows containing non-finite values (stat_density).
2: Removed 4973 rows containing non-finite values (stat_density).
>

The density plot shows the distribution of gene-level FPKM values for each condition. FPKM is a normalized expression estimate, not a raw read count. The warnings report rows that were dropped because their transformed value is non-finite, which typically happens for genes with an FPKM of 0 when the values are plotted on a log scale.
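The "non-finite values" warnings can be reproduced without the course database. The sketch below is only an illustration in base R: the FPKM vector is simulated, and it assumes the values are log10-transformed before the density is estimated.

```r
# Simulated FPKM values (hypothetical, not taken from the course dataset):
# 1000 expressed genes plus 50 genes with no expression at all.
set.seed(1)
fpkm <- c(rlnorm(1000, meanlog = 1, sdlog = 2), rep(0, 50))

# log10(0) is -Inf, which a density estimate cannot use.
log_fpkm <- log10(fpkm)
n_dropped <- sum(!is.finite(log_fpkm))

# Estimate the density from the finite values only,
# mirroring what the plotting layer does when it drops rows.
dens_sim <- density(log_fpkm[is.finite(log_fpkm)])

n_dropped  # number of genes that would trigger the warning
```

Here exactly the 50 zero-FPKM genes are dropped, so a large count in the warning mostly reflects how many genes are not expressed in a condition.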
3. Display a boxplot of the expression values and a volcano plot

> b <- csBoxplot(genes(cuff))
> b
>
> v <- csVolcanoMatrix(genes(cuff))
> v
>
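A volcano plot places each gene by its log2 fold change (x axis) and its significance, usually -log10 of the p-value (y axis). The following base R sketch builds those coordinates from simulated numbers; the data, the 0.05 cut-off and the column names are assumptions for illustration, and cummeRbund's own plot may differ in detail.

```r
# Simulated test results (hypothetical): one fold change and p-value per gene.
set.seed(42)
n <- 200
log2_fc <- rnorm(n, sd = 2)
p_value <- runif(n)

# Volcano coordinates: effect size on x, -log10(p) on y.
volcano <- data.frame(
  log2_fc     = log2_fc,
  neg_log10_p = -log10(p_value),
  significant = p.adjust(p_value, method = "BH") < 0.05
)

# Draw the plot to a temporary PDF so the sketch also runs without a display.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)
plot(volcano$log2_fc, volcano$neg_log10_p,
     col = ifelse(volcano$significant, "red", "black"),
     xlab = "log2 fold change", ylab = "-log10 p-value")
invisible(dev.off())
```

Genes far from zero on the x axis and high on the y axis are the candidates for differential expression.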
4. Extract the differentially expressed genes and plot a heatmap for both conditions

> mySigGeneIds <- getSig(cuff,alpha=0.05,level='genes')
> myGenes <- getGenes(cuff,mySigGeneIds)
Getting gene information:
    FPKM
    Differential Expression Data
    Annotation Data
    Replicate FPKMs
    Counts
Getting isoforms information:
Getting CDS information:
Getting TSS information:
Getting promoter information:
    distData
Getting splicing information:
Getting relCDS information:
>
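The alpha=0.05 argument is a significance cut-off applied after multiple-testing correction. As a rough sketch of that idea in base R (the p-values and gene names are made up, and Benjamini-Hochberg correction is assumed here rather than taken from the cummeRbund source):

```r
# Hypothetical raw p-values for six genes.
p_values <- c(0.0001, 0.004, 0.019, 0.03, 0.2, 0.7)
gene_ids <- paste0("gene", seq_along(p_values))

# Benjamini-Hochberg adjustment controls the false discovery rate.
q_values <- p.adjust(p_values, method = "BH")

# Keep the genes whose adjusted p-value falls below alpha.
alpha <- 0.05
sig_ids <- gene_ids[q_values < alpha]
sig_ids
```

With these numbers the first four genes pass the cut-off; the adjusted values are 0.0006, 0.012, 0.038 and 0.045.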
> h.rep <- csHeatmap(myGenes,cluster='both',replicates=FALSE)
Using tracking_id, sample_name as id variables
Using as id variables
> h.rep
>
> h.rep <- csHeatmap(myGenes,cluster='both',replicates=TRUE)
Using tracking_id, sample_name as id variables
Using as id variables
> h.rep
>
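cluster='both' means that rows (genes) and columns (samples) are both reordered by clustering. A minimal base R sketch of that reordering follows; the expression matrix is simulated, and Euclidean distance is used purely as an assumption, since cummeRbund may use a different distance measure.

```r
# Simulated expression matrix (hypothetical): 8 genes by 5 samples.
set.seed(7)
expr <- matrix(rlnorm(40), nrow = 8,
               dimnames = list(paste0("gene", 1:8), paste0("sample", 1:5)))

# Log-transform with a pseudocount so that zeros stay finite.
log_expr <- log10(expr + 1)

# Cluster rows and columns separately and take the resulting leaf order.
row_order <- hclust(dist(log_expr))$order
col_order <- hclust(dist(t(log_expr)))$order

# The reordered matrix is what a clustered heatmap displays.
clustered <- log_expr[row_order, col_order]
```

Similar genes and similar samples end up next to each other, which is what makes blocks of co-regulated genes visible in the heatmap.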
ChIP-seq analysis with DiffBind

This package is useful for manipulating ChIP-seq signal in R, for comparing signal across files, and for performing tests of differential binding.

user0@head:~$ cd workspace/chipseq/extra/
user0@head:~/workspace/chipseq/extra$ ls
config.csv  peaks  reads  tamoxifen_allfields.csv  tamoxifen.csv  tamoxifen_GEO.csv  tamoxifen_GEO.R  testdata

The dataset for this example consists of ChIPs against the transcription factor ERα in five breast cancer cell lines. Three of these cell lines are responsive to tamoxifen treatment, while the other two are resistant. There are at least two replicates for each cell line, with one cell line having three replicates, for a total of eleven sequenced libraries. Of the five cell lines, two are based on MCF7 cells: the regular tamoxifen-responsive line, and MCF7 cells treated with tamoxifen until a tamoxifen-resistant line was obtained.
user0@head:~/workspace/chipseq/extra$ R

R version 3.3.0 (2016-05-03) -- "Supposedly Educational"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(DiffBind)
> list.files()
[1] "config.csv"              "peaks"
[3] "reads"                   "tamoxifen_allfields.csv"
[5] "tamoxifen_GEO.csv"       "tamoxifen_GEO.R"
[7] "tamoxifen.csv"           "testdata"
> read.csv("tamoxifen.csv")
   SampleID Tissue Factor  Condition  Treatment Replicate
1    BT4741  BT474     ER  Resistant Full-Media         1
2    BT4742  BT474     ER  Resistant Full-Media         2
3     MCF71   MCF7     ER Responsive Full-Media         1
4     MCF72   MCF7     ER Responsive Full-Media         2
5     MCF73   MCF7     ER Responsive Full-Media         3
6     T47D1   T47D     ER Responsive Full-Media         1
7     T47D2   T47D     ER Responsive Full-Media         2
8    MCF7r1   MCF7     ER  Resistant Full-Media         1
9    MCF7r2   MCF7     ER  Resistant Full-Media         2
10    ZR751   ZR75     ER Responsive Full-Media         1
11    ZR752   ZR75     ER Responsive Full-Media         2
                     bamReads ControlID                   bamControl
1  reads/Chr18_BT474_ER_1.bam    BT474c reads/Chr18_BT474_input.bam
2  reads/Chr18_BT474_ER_2.bam    BT474c reads/Chr18_BT474_input.bam
3   reads/Chr18_MCF7_ER_1.bam     MCF7c  reads/Chr18_MCF7_input.bam
4   reads/Chr18_MCF7_ER_2.bam     MCF7c  reads/Chr18_MCF7_input.bam
5   reads/Chr18_MCF7_ER_3.bam     MCF7c  reads/Chr18_MCF7_input.bam
6   reads/Chr18_T47D_ER_1.bam     T47Dc        reads/T47D_input.bam
7   reads/Chr18_T47D_ER_2.bam     T47Dc        reads/T47D_input.bam
8   reads/Chr18_TAMR_ER_1.bam     TAMRc        reads/TAMR_input.bam
9         reads/TAMR_ER_2.bam     TAMRc        reads/TAMR_input.bam
10  reads/Chr18_ZR75_ER_1.bam     ZR75c        reads/ZR75_input.bam
11  reads/Chr18_ZR75_ER_2.bam     ZR75c        reads/ZR75_input.bam
                     Peaks PeakCaller
1  peaks/BT474_ER_1.bed.gz        bed
2  peaks/BT474_ER_2.bed.gz        bed
3   peaks/MCF7_ER_1.bed.gz        bed
4   peaks/MCF7_ER_2.bed.gz        bed
5   peaks/MCF7_ER_3.bed.gz        bed
6   peaks/T47D_ER_1.bed.gz        bed
7   peaks/T47D_ER_2.bed.gz        bed
8   peaks/TAMR_ER_1.bed.gz        bed
9   peaks/TAMR_ER_2.bed.gz        bed
10  peaks/ZR75_ER_1.bed.gz        bed
11  peaks/ZR75_ER_2.bed.gz        bed
>
> ta <- dba(sampleSheet="tamoxifen.csv")
BT4741 BT474 ER Resistant Full-Media 1 bed
BT4742 BT474 ER Resistant Full-Media 2 bed
MCF71 MCF7 ER Responsive Full-Media 1 bed
MCF72 MCF7 ER Responsive Full-Media 2 bed
MCF73 MCF7 ER Responsive Full-Media 3 bed
T47D1 T47D ER Responsive Full-Media 1 bed
T47D2 T47D ER Responsive Full-Media 2 bed
MCF7r1 MCF7 ER Resistant Full-Media 1 bed
MCF7r2 MCF7 ER Resistant Full-Media 2 bed
ZR751 ZR75 ER Responsive Full-Media 1 bed
ZR752 ZR75 ER Responsive Full-Media 2 bed
> ta
11 Samples, 2845 sites in matrix (3795 total):
       ID Tissue Factor  Condition  Treatment Replicate Caller Intervals
1  BT4741  BT474     ER  Resistant Full-Media         1    bed      1080
2  BT4742  BT474     ER  Resistant Full-Media         2    bed      1122
3   MCF71   MCF7     ER Responsive Full-Media         1    bed      1556
4   MCF72   MCF7     ER Responsive Full-Media         2    bed      1046
5   MCF73   MCF7     ER Responsive Full-Media         3    bed      1339
6   T47D1   T47D     ER Responsive Full-Media         1    bed       527
7   T47D2   T47D     ER Responsive Full-Media         2    bed       373
8  MCF7r1   MCF7     ER  Resistant Full-Media         1    bed      1438
9  MCF7r2   MCF7     ER  Resistant Full-Media         2    bed       930
10  ZR751   ZR75     ER Responsive Full-Media         1    bed      2346
11  ZR752   ZR75     ER Responsive Full-Media         2    bed      2345
> pdf("Correlation-occupancy-data.pdf")
> plot(ta)
> dev.off()
null device
          1
>
Go to: http://23.251.138.125/~user0/workspace/chipseq/extra/Correlation-occupancy-data.pdf
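The occupancy correlation plot compares samples by which sites carry a peak. One way to picture this, as a base R sketch on a made-up matrix: it assumes a simple present/absent coding of peaks, whereas DiffBind itself may work from peak scores, and the sample names A1, A2 and B1 are hypothetical.

```r
# Hypothetical occupancy matrix: rows are merged sites, columns are samples,
# 1 means a peak was called at that site in that sample.
occupancy <- matrix(c(1, 1, 0,
                      1, 1, 0,
                      0, 1, 1,
                      1, 0, 1,
                      0, 0, 1,
                      1, 1, 0),
                    ncol = 3, byrow = TRUE,
                    dimnames = list(paste0("site", 1:6), c("A1", "A2", "B1")))

# Pairwise Pearson correlation between the sample columns.
sample_cor <- cor(occupancy)
round(sample_cor, 2)
```

Samples that share most of their peaks (A1 and A2 here) correlate positively and cluster together, which is the pattern to look for among replicates in the PDF.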
> data(tamoxifen_counts)
> ta2 <- tamoxifen
> ta2 <- dba.contrast(ta2, categories=DBA_CONDITION)
> ta2 <- dba.analyze(ta2)
converting counts to integer mode
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
> pdf("Correlation-significantly-differentially-bound.pdf")
> plot(ta2, contrast=1)
> dev.off()
null device
          1
> ta2
11 Samples, 2845 sites in matrix:
       ID Tissue Factor  Condition  Treatment Replicate Caller Intervals FRiP
1  BT4741  BT474     ER  Resistant Full-Media         1 counts      2845 0.16
2  BT4742  BT474     ER  Resistant Full-Media         2 counts      2845 0.15
3   MCF71   MCF7     ER Responsive Full-Media         1 counts      2845 0.27
4   MCF72   MCF7     ER Responsive Full-Media         2 counts      2845 0.17
5   MCF73   MCF7     ER Responsive Full-Media         3 counts      2845 0.23
6   T47D1   T47D     ER Responsive Full-Media         1 counts      2845 0.10
7   T47D2   T47D     ER Responsive Full-Media         2 counts      2845 0.06
8  MCF7r1   MCF7     ER  Resistant Full-Media         1 counts      2845 0.20
9  MCF7r2   MCF7     ER  Resistant Full-Media         2 counts      2845 0.13
10  ZR751   ZR75     ER Responsive Full-Media         1 counts      2845 0.32
11  ZR752   ZR75     ER Responsive Full-Media         2 counts      2845 0.22

1 Contrast:
     Group1 Members1     Group2 Members2 DB.DESeq2
1 Resistant        4 Responsive        7       677
>
Go to: http://23.251.138.125/~user25/workspace/chipseq/extra/Correlation-significantly-differentially-bound.pdf
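The FRiP column is the fraction of reads in peaks: reads overlapping the peak set divided by all reads in the library. A tiny arithmetic sketch in R, with read totals invented for illustration (sampleA and sampleB are hypothetical, not samples from the tamoxifen data):

```r
# Hypothetical read tallies for two libraries.
reads_in_peaks <- c(sampleA = 160000, sampleB = 60000)
total_reads    <- c(sampleA = 1000000, sampleB = 1000000)

# FRiP = reads overlapping peaks / all reads in the library.
frip <- round(reads_in_peaks / total_reads, 2)
frip
```

A higher FRiP generally indicates better enrichment, so a library at 0.06 deserves a closer look than one at 0.32.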
> tadb <- dba.report(ta2)
> tadb
GRanges object with 677 ranges and 6 metadata columns:
       seqnames               ranges strand |      Conc Conc_Resistant
          <Rle>            <IRanges>  <Rle> | <numeric>      <numeric>
  1291    chr18 [34597700, 34598200]      * |      5.33           0.02
  2452    chr18 [64490684, 64491184]      * |      6.36           1.39
  2571    chr18 [69433116, 69433616]      * |      4.57          -0.79
  2771    chr18 [74536113, 74536613]      * |      3.93          -0.79
   976    chr18 [26860992, 26861492]      * |       7.3            3.1
   ...      ...                  ...    ... .       ...            ...
  1405    chr18 [38482733, 38483233]      * |      3.23           0.99
  1695    chr18 [45053220, 45053720]      * |      2.77           0.81
  1650    chr18 [43648626, 43649126]      * |      3.88           2.31
  1702    chr18 [45489315, 45489815]      * |      1.54          -0.22
  1506    chr18 [41736699, 41737199]      * |      1.84              0
       Conc_Responsive      Fold   p-value       FDR
             <numeric> <numeric> <numeric> <numeric>
  1291            5.97     -5.95  1.24e-10  3.21e-07
  2452               7     -5.61  2.26e-10  3.21e-07
  2571            5.21        -6  3.59e-09  3.41e-06
  2771            4.57     -5.35  6.56e-09  4.67e-06
   976            7.92     -4.82  8.74e-09  4.97e-06
   ...             ...       ...       ...       ...
  1405            3.76     -2.77    0.0116    0.0489
  1695            3.28     -2.47    0.0117    0.0492
  1650            4.34     -2.03    0.0117    0.0492
  1702            2.03     -2.25    0.0118    0.0494
  1506            2.34     -2.34    0.0118    0.0494
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
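In the report, the Fold column matches the difference between the two group concentrations, Conc_Resistant minus Conc_Responsive. The check below copies the first five rows shown above into a plain data frame (the lower-case column names are chosen here for illustration) and recomputes it:

```r
# First five sites from the report above, typed in by hand.
report <- data.frame(
  site            = c(1291, 2452, 2571, 2771, 976),
  conc_resistant  = c(0.02, 1.39, -0.79, -0.79, 3.10),
  conc_responsive = c(5.97, 7.00, 5.21, 4.57, 7.92),
  fold            = c(-5.95, -5.61, -6.00, -5.35, -4.82)
)

# Recompute the fold column as the difference of the group concentrations.
report$fold_check <- report$conc_resistant - report$conc_responsive

# Differences of up to 0.01 are expected because the display is rounded.
max_gap <- max(abs(report$fold_check - report$fold))
max_gap
```

All of these folds are negative, meaning binding at these sites is weaker in the resistant group than in the responsive group.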
> counts <- dba.report(ta2, bCounts=TRUE)
> x <- mcols(counts)[1,-c(1:6)]
> x <- unlist(x)
> (xord <- x[match(ta2$samples$SampleID, names(x))])
BT4741 BT4742  MCF71  MCF72  MCF73  T47D1  T47D2 MCF7r1 MCF7r2  ZR751  ZR752
  1.70   0.56  36.00  23.57  52.47  14.71  11.02   0.59   1.21 156.08 144.10
> cond <- factor(ta2$samples[,"Condition"])
> condcomb <- factor(paste(ta2$samples[,"Condition"], ta2$samples[,"Tissue"]))
> pdf("Counts-over-the-conditions.pdf")
> par(mar=c(15,5,2,2))
> stripchart(log2(xord) ~ condcomb, method="jitter", vertical=TRUE, las=2, ylab="log2 normalized counts")
> dev.off()
pdf
  2
>
Go to: http://23.251.138.125/~user25/workspace/chipseq/extra/Counts-over-the-conditions.pdf
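The normalized counts plotted here belong to the top site of the report (1291), and they appear to tie back to its concentration columns: the log2 of each group's mean count reproduces the displayed Conc_Resistant (0.02) and Conc_Responsive (5.97). This is an observation from the printed numbers, not a statement about DiffBind internals. A base R check using the values shown in the session:

```r
# Normalized counts for the top site, copied from the console output above.
norm_counts <- c(BT4741 = 1.70, BT4742 = 0.56, MCF71 = 36.00, MCF72 = 23.57,
                 MCF73 = 52.47, T47D1 = 14.71, T47D2 = 11.02, MCF7r1 = 0.59,
                 MCF7r2 = 1.21, ZR751 = 156.08, ZR752 = 144.10)

# Condition of each sample, in the same order as the sample sheet.
condition <- factor(c("Resistant", "Resistant",
                      "Responsive", "Responsive", "Responsive",
                      "Responsive", "Responsive",
                      "Resistant", "Resistant",
                      "Responsive", "Responsive"))

# log2 of the mean normalized count within each condition.
group_conc <- log2(tapply(norm_counts, condition, mean))
round(group_conc, 2)

# Their difference is the log2 fold change for this site.
fold <- group_conc[["Resistant"]] - group_conc[["Responsive"]]
```

The four resistant samples all sit near zero on the log2 scale while the seven responsive samples sit several units higher, which is the separation the strip chart shows.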