Joining the Dark Side: HTS Data Analysis with R-Bioconductor. Pieta Schofield, Barton Group Talk.



“aRrgh”: a bit of chatter on the mail list
“Ignorance is bliss”: the less I have to know about R, the more blissful I will be
or
“Knowledge is power”: the more I know R, the more powerful I find it is

WARNING! This talk contains snippets of R code that may be distressing to those of a nervous disposition.

R: software environment for doing data analysis and statistics
“It’s like Marmite, you love it or hate it” … and like Marmite, I have to confess I love it.
Extensible
Scriptable
Terminal based
Free
Open source
Multiplatform (multiprocessor)
…
! Bioconductor !
…but then I also still love vim, so what do I know?

Bioconductor
Constantly growing set of R packages
Focused on biological data analysis
Common installation method
Relatively easy package management
Attempts at some common coding, testing and documentation standards
Common (reusable/reused) data structures

Over the years I have made a living writing code in:
FORTRAN (IV, 77, 95)
RPL-Filetab (RapidGen decision table language)
APL (functional array processing language)
Modula-2 (Pascal dialect)
RPL-RS/1 (BBN Statistics Language)
Basic / VisualBasic
C / C++ / VisualC++
Objective-C
ActionScript
Matlab
Java
Perl
Python
R
Learn a language's strengths and play to them.

I am too old to keep swapping syntaxes!
I use R exclusively (with a bit of bash)
Microarrays
affy: for Affymetrix gene-chips
limma: DE tool for microarrays
“Great RNA-seq Experiment”
DE tools are mainly R-Bioconductor packages: edgeR, DESeq, BBSeq, SAMR, limma, BitSeq… too many to mention
HTS data analysis
ChIP-seq, MNase-seq, SNP calling

What is available for HTS data analysis?
Lots, and growing every day
A few core packages for efficient storage and processing of sequence-based data and annotations
These are continually being refined and improved, sometimes at the cost of backward compatibility
Best to keep R and Bioconductor packages up to date
Integrated tools are being built on top of these core packages for specific analysis and visualisation

Which packages are worth the investment in learning? Are any packages worth the investment of learning/switching to R?

Biostrings
Set of classes for representing large biological sequences (DNA/RNA/amino acids)
Base class type XString (BString) => DNAString, RNAString, AAString
Collection class XStringSet…
Pairwise and multiple sequence alignments
Set of methods for manipulation and computation on these classes
Set of methods for sequence matching and pairwise alignments

    # Code to demonstrate the use of Biostrings

    require(Biostrings)

    dna <- DNAString("TCAACGTTGAATAGCGTACCG")
    #> dna
    # 21-letter "DNAString" instance
    #seq: TCAACGTTGAATAGCGTACCG

    # translate the whole sequence in the first reading frame
    aa <- AAString(translate(dna))
    #> aa
    # 7-letter "AAString" instance
    #seq: STLNSVP

    # translate all three forward reading frames;
    # adj trims each frame so its length is a multiple of 3
    orfs <- AAStringSet(lapply(1:3,
      function(x){
        adj <- c(0,2,1)
        AAString(translate(dna[x:(length(dna)-adj[x])]))
      }
    ))
    #> orfs
    # A AAStringSet instance of length 3
    #    width seq
    #[1]     7 STLNSVP
    #[2]     6 QR*IAY
    #[3]     6 NVE*RT

    require(Biostrings)

    dna <- DNAString("TCAACGTTGAATAGCGTACCGAACGTTGAATATCGTTGAATAG")
    p1 <- DNAString("AACGTT")

    # count every overlapping dinucleotide in the sequence
    dinucFreq <- dinucleotideFrequency(dna)
    #> dinucFreq
    #AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
    #

    # count every overlapping trinucleotide in the sequence
    trinucFreq <- trinucleotideFrequency(dna)
    #> trinucFreq
    #AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC CAG CAT
    #
    #CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC GAG GAT GCA GCC GCG GCT
    #
    #GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG TAT TCA TCC TCG TCT TGA TGC TGG TGT
    #
    #TTA TTC TTG TTT
    #

    # count exact occurrences of the pattern
    cpm <- countPattern(p1, dna)
    #> cpm
    #[1] 2

    # Read in a FASTA file and a GTF features file
    # Find chromosomes/contigs with features
    # Generate a FASTA file with only those chromosomes/contigs
    # containing features

    require(Biostrings)

    # read in FASTA file
    gg4 <- readDNAStringSet("~/data/ensembl/gg4/Gg4_73.fa")

    # read in GTF
    gtf <- read.delim("~/data/ensembl/gg4/Galgal4.gtf", sep="\t", header=FALSE)

    # get chromosomes that exist in GTF file
    # (unique() works whether or not V1 was read in as a factor)
    chrs <- unique(as.character(gtf$V1))

    # subset the strings: the sequence ID is the first
    # space-delimited token of each FASTA header
    gg4.trim <- gg4[which(sapply(names(gg4),
      function(x){
        unlist(strsplit(x," "))[1]
      }
    ) %in% chrs)]

    # write out new FASTA file
    writeXStringSet(gg4.trim, format="fasta",
      "~/data/ensembl/hg19_73/gg4_73_trimmed.fa", width=256)

    # exact matches of the pattern
    mpm <- matchPattern(p1, dna)
    #> mpm
    # Views on a 43-letter DNAString subject
    #subject: TCAACGTTGAATAGCGTACCGAACGTTGAATATCGTTGAATAG
    #views:
    #    start end width
    #[1]                 [AACGTT]
    #[2]                 [AACGTT]

    # allow up to one mismatch
    mpm1 <- matchPattern(p1, dna, max.mismatch = 1)
    #> mpm1
    # Views on a 43-letter DNAString subject
    #subject: TCAACGTTGAATAGCGTACCGAACGTTGAATATCGTTGAATAG
    #views:
    #    start end width
    #[1]                 [AACGTT]
    #[2]                 [AACGTT]
    #[3]                 [ATCGTT]

    # time a search for a pattern across every sequence in a set
    dna <- DNAString("TCAACGTTGAAT")
    print(date())
    #[1] "Sun Nov 10 14:45: "
    v1 <- vmatchPattern(dna, gg4.trim)
    print(date())
    #[1] "Sun Nov 10 14:45: "

    > v1
    MIndex object of length
    $`1 dna:chromosome chromosome:Galgal4:1:1: :1 REF`
    IRanges of length 4
        start end width
    [1]
    [2]
    [3]
    [4]

    $`10 dna:chromosome chromosome:Galgal4:10:1: :1 REF`
    IRanges of length 1
        start end width
    [1]

    $`11 dna:chromosome chromosome:Galgal4:11:1: :1 REF`
    IRanges of length

    > v1[[1]]
    IRanges of length 4
        start end width
    [1]
    [2]
    [3]
    [4]

Range Data
Slight annoyance: data structure proliferation, similar but different
Sometimes there are easy conversion methods
Sometimes there are not!
GenomicRanges: GRanges & GRangesList
IRanges: use RLE (run-length encoding)
Run-length encoding
Raw: AAAACAAAAATTGTGGGG
RLE: A4C1A5T2G1T1G4
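The run-length encoding on this slide can be reproduced in a few lines. This is a minimal sketch that uses base R's rle() as a stand-in for the Rle class that IRanges and GenomicRanges use internally. The variable names are illustrative only.

```r
# Run-length encode the DNA string from the slide using base R's rle().
raw <- "AAAACAAAAATTGTGGGG"
bases <- strsplit(raw, "")[[1]]   # one character per base
runs <- rle(bases)                # runs$values and runs$lengths

# Collapse the runs into the compact "A4C1..." notation used on the slide.
encoded <- paste0(runs$values, runs$lengths, collapse = "")
print(encoded)

# inverse.rle() rebuilds the original vector, so the encoding is lossless.
decoded <- paste(inverse.rle(runs), collapse = "")
stopifnot(identical(decoded, raw))
```

The saving grows with the length of the runs, which is why the representation suits coverage vectors and repeated seqnames or strand values.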

GenomicRanges
GRanges class: a named list of ranges
Coordinate data: seqname, range, start, end, strand
Metadata: user-specified data fields
Aligned read classes
Gapped aligned reads: GAlignments
Gapped aligned read pairs: GAlignmentPairs
Importing reads from BAM files
Front end to Rsamtools
Tools for iterative access to large files
SummarizedExperiment
Managing a matrix of ranges and samples

    gr <- GRanges(seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
                  ranges = IRanges(1:10, end = 7:16, names = head(letters, 10)),
                  strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
                  score = rnorm(10,20,2),
                  GC = seq(1, 0, length=10),
                  group = c(rep("S",4),rep("T",2),rep("S",4)))

    seqlengths(gr) <- c( , , )
    gr

    GRanges with 10 ranges and 3 metadata columns:
      seqnames   ranges strand | score GC group
                               |
    a     chr1 [ 1,  7]      - |          S
    b     chr2 [ 2,  8]      + |          S
    c     chr2 [ 3,  9]      + |          S
    d     chr2 [ 4, 10]      * |          S
    e     chr1 [ 5, 11]      * |          T
    f     chr1 [ 6, 12]      + |          T
    g     chr3 [ 7, 13]      + |          S
    h     chr3 [ 8, 14]      + |          S
    i     chr3 [ 9, 15]      - |          S
    j     chr3 [10, 16]      - |          S

    seqlengths:
    chr1 chr2 chr3

IRanges & GRanges methods
Access: seqnames(), ranges(), strand(), mcols(), start(), end()
Manipulate: split(), unlist(), c()
Lots of range set operators: reduce(), disjoin(), shift(), flank(), union(), intersect(), setdiff(), gaps(), restrict()
Find: findOverlaps(), nearest(), precede(), follow()
Calculate: coverage(), summarizeOverlaps()
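A few of these methods can be tried on toy data. The sketch below assumes the GenomicRanges package is installed. The reads and genes objects are hypothetical examples, not data from the talk, and countOverlaps() is an addition that does not appear on the slide.

```r
library(GenomicRanges)

# Hypothetical toy data: three reads and two genes on one chromosome.
reads <- GRanges("chr1", IRanges(start = c(1, 5, 20), end = c(10, 15, 30)))
genes <- GRanges("chr1", IRanges(start = c(8, 40), end = c(25, 50)))

# reduce() merges overlapping ranges into a minimal non-overlapping set.
merged <- reduce(reads)

# findOverlaps() returns a Hits object pairing query and subject indices.
hits <- findOverlaps(reads, genes)

# countOverlaps() (not on the slide) tallies overlapping reads per gene.
reads_per_gene <- countOverlaps(genes, reads)

# coverage() returns per-position depth as a run-length encoded RleList.
cov <- coverage(reads)

print(merged)
print(hits)
print(reads_per_gene)
print(cov[["chr1"]])
```

Because coverage() returns Rle vectors, long stretches of constant depth cost almost nothing to store.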

    # required packages
    require(GenomicRanges)
    require(Gviz)

    # make a GRanges object
    ducks <- GRanges(rep("chrW",6),
                     IRanges(start = c(50, 180, 260, 800, 600, 1240),
                             width = c(15, 20, 40, 100, 500, 20)),
                     strand = rep("*",6),
                     group = rep(c("Huey", "Dewey", "Louie"), c(1, 3, 2)))

    # make an annotation track of the object
    duckTrack <- AnnotationTrack(ducks,
                                 genome="gg4",
                                 name="Ducks")

    # make an annotation track of the reduced object
    duckRed <- AnnotationTrack(reduce(ducks),
                               genome="gg4",
                               name="Ducks Reduce")

    # save it as a pdf
    outFile <- "/homes/pschofield/scratch/NOBACK/ducks.pdf"
    pdf(outFile, width=7, height=2)
    plotTracks(list(duckTrack, duckRed), showId=TRUE)
    dev.off()

Visualisation: Gviz
“Gviz uses the biomaRt and the rtracklayer packages to perform live annotation queries to Ensembl and UCSC and translates this to e.g. gene/transcript structures in viewports of the grid graphics package. This results in genomic information plotted together with your data”
Programmatically produced “IGB style” plots

Many track classes: AlignedReadTrack, AnnotationTrack, BiomartGeneRegionTrack, DataTrack, GeneRegionTrack, GenomeAxisTrack, IdeogramTrack, NumericTrack, RangeTrack, ReferenceTrack, SequenceTrack, StackedTrack, UcscTrack

Fastq files
Quality assessment
Alignment
Feature processing: peak calling (ChIP-seq), read allocation (RNA-seq), SNP calling
Annotational processing: gene-set and functional analysis, differential expression analysis, SNP mapping

Quality assessment: ShortRead, htSeqTools
Alignment: Rsubread, gmapR, GSNAP, Rbowtie, Biostrings
Accessing alignments: ShortRead, Rsamtools
Annotation: BSgenome, annotate, biomaRt, topGO, GenomicFeatures…
“I’ve been vaguely aware of biomaRt for a few years. Inexplicably, I’ve only recently started to use it. It’s one of the most useful applications I’ve ever used.” Neil Saunders, What You’re Doing Is Rather Desperate

Differential expression: edgeR, DESeq, limma, samr…
ChIP-seq: ChIPpeakAnno, QuasR, PICS, nucleR
Motif discovery: rGADEM, MotIV, seqLogo
SNP: VariantAnnotation, snpStats
Visualization: GenomeGraphs, ggbio, Gviz, rtracklayer, biovizBase

“Easy” Parallelisation
Although it has improved, try to avoid for loops; R is designed for processing lists
lapply(), mapply() and relatives
do.call()
Map()
endoapply(), mendoapply() (from IRanges)
parallel package: mclapply(), parLapply(), mcMap()
The overheads of parallel processing will eventually limit speed-up (diminishing returns)
Some R libraries are already multithreaded and compiled using OpenMP; be aware.
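Because lapply() and mclapply() share a calling convention, moving from serial to parallel is often a one-line change. The sketch below uses only the parallel package that ships with R. The gc_fraction() helper and the seqs list are hypothetical, and the choice of two cores is an assumption rather than a recommendation.

```r
library(parallel)

# Hypothetical workload: GC fraction of each sequence in a named list.
seqs <- list(a = "ACGTGGCC", b = "ATATATAT", c = "GGGGCCCC")
gc_fraction <- function(s) {
  bases <- strsplit(s, "")[[1]]
  mean(bases %in% c("G", "C"))
}

# Serial version: apply the function to every list element.
serial <- lapply(seqs, gc_fraction)

# Forked version: same call shape, work spread over child processes.
# Forking is unavailable on Windows, so fall back to one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
forked <- mclapply(seqs, gc_fraction, mc.cores = n_cores)

# Both routes should give the same answer.
stopifnot(isTRUE(all.equal(serial, forked)))
print(unlist(forked))
```

For a workload this small the cost of starting workers outweighs any gain, which is the diminishing-returns point made on the slide.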

That’s all, thank you for listening. If you have tips and techniques for using R I would be very pleased to hear them.
Lawrence M, Huber W, Pagès H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e doi: /journal.pcbi