JOINING THE DARK SIDE: HTS DATA ANALYSIS WITH R-BIOCONDUCTOR
Pieta Schofield
Barton Group Talk
“aRrgh”: a bit of chatter on the mailing list
  “Ignorance is bliss”: the less I have to know about R, the more blissful I will be
  or
  “Knowledge is power”: the more I know R, the more powerful I find it is
WARNING! This talk contains snippets of R code that may be distressing to those of a nervous disposition
R – Software environment for doing data analysis and statistics
“It’s like Marmite, you love it or hate it” … and like Marmite, I have to confess I love it.
  Extensible
  Scriptable
  Terminal based
  Free
  Open source
  Multiplatform (multiprocessor)
  … Bioconductor
…but then I also still love vim, so what do I know?
Bioconductor
  Constantly growing set of R packages
  Focused on biological data analysis
  Common installation method
  Relatively easy package management
  Attempts at some common coding, testing and documentation standards
  Common (reusable/reused) data structures
Over the years I have made a living writing code in
  FORTRAN (IV, 77, 95)
  RPL-Filetab (RapidGen decision table language)
  APL (functional array processing language)
  Modula-2 (Pascal dialect)
  RPL-RS/1 (BBN Statistics Language)
  Basic / VisualBasic
  C / C++ / VisualC++
  Objective-C
  ActionScript
  Matlab
  Java
  Perl
  Python
  R
Learn a language's strengths and play to them
I am too old to keep swapping syntaxes!
I use R exclusively (with a bit of bash)
  Microarrays
    affy: for Affymetrix gene chips
    limma: DE tool for microarrays
  “Great RNA-seq Experiment”
    DE tools are mainly R-Bioconductor packages
    edgeR, DESeq, BBSeq, SAMR, limma, BitSeq… too many to mention
  HTS data analysis
    ChIP-seq, MNase-seq, SNP calling
What is available for HTS data analysis?
  Lots, and growing every day
  A few core packages for efficient storage and processing of sequence-based data and annotations
  These are continually being refined and improved
  Sometimes at the cost of backward compatibility
  Best to keep R and Bioconductor packages up to date
  Integrated tools are being built on top of these core packages for specific analysis and visualisation
Which packages are worth the investment in learning?
Are any packages worth the investment of learning/switching to R?
Biostrings
  Set of classes for representing large biological sequences (DNA/RNA/amino acids)
    Base class type XString (BString) => DNAString, RNAString, AAString
    Collection class XStringSet…
    Pairwise and multiple sequence alignments
  Set of methods for manipulation and computation on these classes
  Set of methods for sequence matching and pairwise alignments
# Code to demonstrate the use of Biostrings

require(Biostrings)

dna <- DNAString("TCAACGTTGAATAGCGTACCG")
#> dna
# 21-letter "DNAString" instance
#seq: TCAACGTTGAATAGCGTACCG

# translate the forward strand in frame 1
aa <- AAString(translate(dna))
#> aa
# 7-letter "AAString" instance
#seq: STLNSVP

# translate all three forward reading frames;
# adj trims each frame so its length is a multiple of 3
orfs <- AAStringSet(lapply(1:3,
  function(x){
    adj <- c(0,2,1)
    AAString(translate(dna[x:(length(dna)-adj[x])]))
  }
))
#> orfs
# A AAStringSet instance of length 3
#    width seq
#[1]     7 STLNSVP
#[2]     6 QR*IAY
#[3]     6 NVE*RT
require(Biostrings)

dna <- DNAString("TCAACGTTGAATAGCGTACCGAACGTTGAATATCGTTGAATAG")
p1 <- DNAString("AACGTT")

# count each of the 16 possible dinucleotides
dinucFreq <- dinucleotideFrequency(dna)
#> dinucFreq
#AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
#

# count each of the 64 possible trinucleotides
trinucFreq <- trinucleotideFrequency(dna)
#> trinucFreq
#AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC CAG CAT
#
#CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC GAG GAT GCA GCC GCG GCT
#
#GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG TAT TCA TCC TCG TCT TGA TGC TGG TGT
#
#TTA TTC TTG TTT
#

# count exact occurrences of the pattern
cpm <- countPattern(p1, dna)
#> cpm
#[1] 2
# Read in a FASTA file and a GTF features file
# Find chromosomes/contigs with features
# Generate a FASTA file with only those chromosomes/contigs
# containing features

require(Biostrings)

# read in FASTA file
gg4 <- readDNAStringSet("~/data/ensembl/gg4/Gg4_73.fa")

# read in GTF
gtf <- read.delim("~/data/ensembl/gg4/Galgal4.gtf", sep="\t", header=FALSE)

# get chromosomes that exist in GTF file
# (unique() works whether or not the column was read as a factor)
chrs <- unique(as.character(gtf$V1))

# subset the strings: the sequence id is the first word of each FASTA header
seqIds <- sapply(strsplit(names(gg4), " "), `[`, 1)
gg4.trim <- gg4[seqIds %in% chrs]

# write out new FASTA file
writeXStringSet(gg4.trim,
                filepath="~/data/ensembl/hg19_73/gg4_73_trimmed.fa",
                format="fasta", width=256)
# exact matches of the pattern (continues the countPattern example)
mpm <- matchPattern(p1, dna)
#> mpm
# Views on a 43-letter DNAString subject
#subject: TCAACGTTGAATAGCGTACCGAACGTTGAATATCGTTGAATAG
#views:
#    start end width
#[1]                 [AACGTT]
#[2]                 [AACGTT]

# allow up to one mismatch
mpm1 <- matchPattern(p1, dna, max.mismatch = 1)
#> mpm1
# Views on a 43-letter DNAString subject
#subject: TCAACGTTGAATAGCGTACCGAACGTTGAATATCGTTGAATAG
#views:
#    start end width
#[1]                 [AACGTT]
#[2]                 [AACGTT]
#[3]                 [ATCGTT]

# match a short pattern against every sequence in gg4.trim (continues the FASTA example)
dna <- DNAString("TCAACGTTGAAT")
print(date())
#[1] "Sun Nov 10 14:45: "
v1 <- vmatchPattern(dna, gg4.trim)
print(date())
#[1] "Sun Nov 10 14:45: "
> v1
MIndex object of length
$`1 dna:chromosome chromosome:Galgal4:1:1: :1 REF`
IRanges of length 4
    start end width
[1]
[2]
[3]
[4]

$`10 dna:chromosome chromosome:Galgal4:10:1: :1 REF`
IRanges of length 1
    start end width
[1]

$`11 dna:chromosome chromosome:Galgal4:11:1: :1 REF`
IRanges of length

> v1[[1]]
IRanges of length 4
    start end width
[1]
[2]
[3]
[4]
Range Data
  Slight annoyance: data structure proliferation, similar but different
    Sometimes there are easy conversion methods
    Sometimes there are not!
  GenomicRanges: GRanges & GRangesList
  IRanges uses RLE (run-length encoding)
Run-length encoding
  Raw: AAAACAAAAATTGTGGGG
  RLE: A4C1A5T2G1T1G4
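The run-length encoding example above can be reproduced with a few lines of base R. This is only a sketch of the idea using the base rle() function; IRanges has its own Rle class, which is not used here, and the variable names are illustrative.

```r
# Run-length encode the example sequence with base R's rle()
raw <- "AAAACAAAAATTGTGGGG"

# split into single characters and compute runs of identical letters
runs <- rle(strsplit(raw, "")[[1]])

# paste each letter together with its run length
encoded <- paste0(runs$values, runs$lengths, collapse = "")
print(encoded)
#[1] "A4C1A5T2G1T1G4"

# the encoding is lossless: inverse.rle() restores the original letters
decoded <- paste(inverse.rle(runs), collapse = "")
print(identical(decoded, raw))
#[1] TRUE
```

The saving comes from long runs: coverage vectors along a chromosome often contain very long stretches of the same value, which is where this representation pays off.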
GenomicRanges
  GRanges class
    It is a named list of ranges.
    Coordinate data: seqname, range, start, end, strand
    Metadata: user-specified data fields
  Aligned read classes
    Gapped aligned reads: GAlignments
    Gapped aligned read pairs: GAlignmentPairs
  Importing reads from BAM files
    Front end to Rsamtools
    Tools for iterative access to large files
  SummarizedExperiment
    Managing a matrix of ranges and samples
gr <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
              ranges = IRanges(1:10, end = 7:16, names = head(letters, 10)),
              strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
              score = rnorm(10, 20, 2),
              GC = seq(1, 0, length = 10),
              group = c(rep("S", 4), rep("T", 2), rep("S", 4)))

seqlengths(gr) <- c( , , )
gr

GRanges with 10 ranges and 3 metadata columns:
    seqnames    ranges strand | score GC group
                              |
  a     chr1  [ 1,  7]      - |              S
  b     chr2  [ 2,  8]      + |              S
  c     chr2  [ 3,  9]      + |              S
  d     chr2  [ 4, 10]      * |              S
  e     chr1  [ 5, 11]      * |              T
  f     chr1  [ 6, 12]      + |              T
  g     chr3  [ 7, 13]      + |              S
  h     chr3  [ 8, 14]      + |              S
  i     chr3  [ 9, 15]      - |              S
  j     chr3  [10, 16]      - |              S
seqlengths:
  chr1 chr2 chr
IRanges & GRanges methods
  Access: seqnames(), ranges(), strand(), mcols(), start(), end()
  Manipulate: split(), unlist(), c()
  Lots of range set operators: reduce(), disjoin(), shift(), flank(), union(), intersect(), setdiff(), gaps(), restrict()
  Find: findOverlaps(), nearest(), precede(), follow()
  Calculate: coverage(), summarizeOverlaps()
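A few of these methods in action, as a minimal sketch. It assumes the GenomicRanges package is installed; the ranges and the object names (gr, query, merged, hits) are made up for illustration.

```r
library(GenomicRanges)

# three ranges on one chromosome; the first two overlap each other
gr <- GRanges(seqnames = "chr1",
              ranges = IRanges(start = c(1, 5, 20), end = c(10, 15, 30)),
              strand = "*")

# access: pull out the coordinates
start(gr)
end(gr)

# set operator: reduce() merges overlapping ranges into one
merged <- reduce(gr)
merged

# manipulate: shift() moves every range by a fixed offset
shifted <- shift(gr, 100)

# find: which ranges in gr does the query range overlap?
query <- GRanges(seqnames = "chr1", ranges = IRanges(start = 8, end = 22))
hits <- findOverlaps(query, gr)
subjectHits(hits)
```

Here reduce() collapses the first two ranges into a single range from 1 to 15 and leaves the third untouched, while the query from 8 to 22 overlaps all three original ranges.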
# required packages
require(GenomicRanges)
require(Gviz)

# make a GRanges object
ducks <- GRanges(rep("chrW", 6),
                 IRanges(start = c(50, 180, 260, 800, 600, 1240),
                         width = c(15, 20, 40, 100, 500, 20)),
                 strand = rep("*", 6),
                 group = rep(c("Huey", "Dewey", "Louie"), c(1, 3, 2)))

# make an annotation track of the object
duckTrack <- AnnotationTrack(ducks,
                             genome = "gg4",
                             name = "Ducks")

# make an annotation track of the reduced object
duckRed <- AnnotationTrack(reduce(ducks),
                           genome = "gg4",
                           name = "Ducks Reduce")

# save it as a pdf
outFile <- "/homes/pschofield/scratch/NOBACK/ducks.pdf"
pdf(outFile, width = 7, height = 2)
plotTracks(list(duckTrack, duckRed), showId = TRUE)
dev.off()
Visualisation
Gviz
“Gviz uses the biomaRt and the rtracklayer packages to perform live annotation queries to Ensembl and UCSC and translates this to e.g. gene/transcript structures in viewports of the grid graphics package. This results in genomic information plotted together with your data”
Programmatically produced “IGB style” plots
Many track classes
  AlignedReadTrack
  AnnotationTrack
  BiomartGeneRegionTrack
  DataTrack
  GeneRegionTrack
  GenomeAxisTrack
  IdeogramTrack
  NumericTrack
  RangeTrack
  ReferenceTrack
  SequenceTrack
  StackedTrack
  UcscTrack
Fastq files
Quality Assessment
Alignment
Feature Processing
  Peak calling (ChIP-seq)
  Read allocation (RNA-seq)
  SNP calling
Annotational Processing
  Gene-set, functional analysis
  Differential Expression analysis
  SNP mapping
Quality Assessment
  ShortRead, htSeqTools
Alignment
  Rsubread, gmapR, GSNAP, Rbowtie, Biostrings
Accessing Alignments
  ShortRead, Rsamtools
Annotation
  BSgenome, annotate, biomaRt, topGO, GenomicFeatures…
“I’ve been vaguely aware of biomaRt for a few years. Inexplicably, I’ve only recently started to use it. It’s one of the most useful applications I’ve ever used.”
Neil Saunders, What You’re Doing Is Rather Desperate
Differential Expression
  edgeR, DESeq, limma, samr …
ChIP-seq
  ChIPpeakAnno, QuasR, PICS, nucleR
Motif discovery
  rGADEM, MotIV, seqLogo
SNP
  VariantAnnotation, snpStats
Visualization
  GenomeGraphs, ggbio, Gviz, rtracklayer, biovizBase
“Easy” Parallelisation
  Although it has improved, try to avoid for loops; R is designed for processing lists
    lapply(), mapply() and relatives
    do.call()
    Map()
    endoapply(), mendoapply() (*IRanges)
  parallel package
    mclapply(), parLapply(), mcMap()
  The overheads of parallel processing will eventually limit speed-up (diminishing returns)
  Some R libraries are already multithreaded and compiled using OpenMP, be aware.
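As a rough illustration of how small the step from lapply() to mclapply() is, here is a sketch using only the parallel package that ships with R. The function slow_square and the choice of two cores are placeholders, not a recommendation; mclapply() relies on forking, so on Windows it only runs with mc.cores = 1.

```r
library(parallel)

# a stand-in for some per-item work (hypothetical function)
slow_square <- function(x) {
  Sys.sleep(0.01)
  x^2
}

# serial version: apply the function over a list instead of writing a for loop
serial <- lapply(1:8, slow_square)

# parallel version: same call shape, work is split across forked processes
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
par_res <- mclapply(1:8, slow_square, mc.cores = n_cores)

# both give the same list of results
print(identical(serial, par_res))

# do.call() collapses the list of results into a single vector
squares <- do.call(c, par_res)
print(squares)
```

For work this small the cost of forking can outweigh the gain, which is the diminishing-returns point: parallelisation pays off when each item takes long enough to dwarf the overhead.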
That’s all, thank you for listening.
If you have tips and techniques for using R I would be very pleased to hear them.
Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e doi: /journal.pcbi