Introduction to R: A Language and Environment for Statistical Computing, Graphics & Bioinformatics (Lecture 4)

1 Introduction to R: A Language and Environment for Statistical Computing, Graphics & Bioinformatics. Lecture 4. Michael Shmoish, Bioinformatics Knowledge Unit, The Lorry I. Lokey Interdisciplinary Center for Life Sciences and Engineering, Technion - IIT

2 R as a set of statistical tables
Distribution     R name   Additional arguments
Normal           norm     mean, sd
Uniform          unif     min, max
Hypergeometric   hyper    m, n, k
Poisson          pois     lambda
Student's t      t        df, ncp
...

3 4 functions: d-, p-, q-, r-
• For each probability distribution available in R there are at least 4 basic functions whose names differ only in the first character:
• dnorm, dunif, dhyper, … - density (probability mass for discrete distributions)
• pnorm, punif, phyper, … - cumulative distribution function (CDF); p- stands for probability, and these are the functions used to compute p-values
• qnorm, qunif, qhyper, … - quantile (inverse of the CDF)
• rnorm, runif, rhyper, … - random (simulate random deviates)
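As a minimal sketch of the four prefixes, the lines below apply them to the standard normal distribution. The variable names and the seed are illustrative choices, not part of the lecture.

```r
# The four function families, shown for the standard normal distribution.
density_at_0 <- dnorm(0)       # height of the density curve at x = 0
cdf_at_0     <- pnorm(0)       # cumulative probability P(X <= 0)
upper_q      <- qnorm(0.975)   # the x for which P(X <= x) = 0.975

set.seed(1)                    # arbitrary seed, only for reproducibility
draws <- rnorm(5, mean = 0, sd = 1)   # five simulated deviates

# q- is the inverse of p-: going through both returns the input probability.
round_trip <- pnorm(qnorm(0.3))

print(c(density = density_at_0, cdf = cdf_at_0, quantile = upper_q))
print(draws)
```

The same pattern holds for the other rows of the table, for example dunif/punif/qunif/runif with min and max.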

4 Hypergeometric: enrichment

5 Example 1
Given 100 genomic sequences, of which 50 are of viral origin and 50 are non-viral. Searching ('grep'!) for a motif of interest, a researcher found it in 25 of the 100 sequences, and 18 of those 25 are viral. Is there evidence that the motif is over-represented in viral genes?
> dhyper(18, 50, 50, 25)   ### probability of getting exactly 18 white balls when drawing 25 balls blindly from a basket with 50 white and 50 black balls (without replacement, unlike the binomial case)
[1] …
### Is that all? Not enough: we have to check the probability of getting '18 or more'.

6 Hypergeometric: dhyper and phyper
To check the probability of getting '18 or more':
> dhyper(18:25, 50, 50, 25)   ### use 'round(…, 4)' to get nice numbers
[1] …
> sum(dhyper(18:25, 50, 50, 25))
[1] …
What we have just computed is an upper-tail probability, 1 - CDF(17), which phyper returns directly:
> phyper(17, 50, 50, 25, lower.tail = FALSE)   ### p-value for getting more than 17 (i.e., '18 or more')
[1] …
> barplot(dhyper(0:25, 50, 50, 25), ylim = c(0, 0.2), main = "Prob. of Overlap")
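The equivalence between summing dhyper and calling phyper with lower.tail = FALSE can be checked directly. The sketch below also compares against a one-sided Fisher exact test on the 2x2 table implied by Example 1 (18 viral and 7 non-viral sequences with the motif, 32 and 43 without). The comparison with fisher.test is an addition for illustration, not something the slides use.

```r
m <- 50   # white balls: viral sequences
n <- 50   # black balls: non-viral sequences
k <- 25   # balls drawn: sequences containing the motif
x <- 18   # white balls drawn: viral sequences containing the motif

tail_by_sum    <- sum(dhyper(x:k, m, n, k))                   # P(X >= 18), term by term
tail_by_phyper <- phyper(x - 1, m, n, k, lower.tail = FALSE)  # P(X > 17)
tail_by_cdf    <- 1 - phyper(x - 1, m, n, k)                  # 1 - CDF(17)

# The same question posed as a one-sided Fisher exact test.
motif_table <- matrix(c(18, 32, 7, 43), nrow = 2,
                      dimnames = list(motif  = c("yes", "no"),
                                      origin = c("viral", "non-viral")))
fisher_p <- fisher.test(motif_table, alternative = "greater")$p.value

print(c(sum = tail_by_sum, phyper = tail_by_phyper,
        cdf = tail_by_cdf, fisher = fisher_p))
```

Note the argument x - 1: with lower.tail = FALSE, phyper returns P(X > q), so '18 or more' needs q = 17.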

7 Hypergeometric: dhyper

8 Exercise: dhyper and phyper
1) Under the conditions of Example 1:
a) What is the probability of getting exactly 13 viral genes?
b) What is the probability of getting exactly 13 non-viral genes?
c) What is the probability of getting '5 or fewer' viral genes? Compute it both with phyper and with sum(dhyper(…)), then compare.
2) Generate 100 uniform random values in the range [0, 1], keep them in a vector runi, and plot them as a time series.

9 Statistical tests

10 Student’s t-test

11 Student's t-test
> x <- rnorm(50)
> y <- runif(30)
> t.test(x, y)   ### by default: unpaired, unequal variance (Welch), two-sided
Welch Two Sample t-test
data: x and y
t = …, df = …, p-value = …
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: …
> names(t.test(x, y))
[1] "statistic" "parameter" "p.value" "conf.int" "estimate"
[6] "null.value" "alternative" "method" "data.name"
> t.test(x, y)$p.val
[1] …
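A small sketch of how the defaults noted above can be changed and how the result components are extracted. The seed is an arbitrary choice so that the run is reproducible; the slides do not set one.

```r
set.seed(42)       # arbitrary seed, only for reproducibility
x <- rnorm(50)
y <- runif(30)

welch  <- t.test(x, y)                      # default: Welch, unpaired, two-sided
pooled <- t.test(x, y, var.equal = TRUE)    # classical Student test, pooled variance
lower  <- t.test(x, y, alternative = "less")  # one-sided: mean(x) < mean(y)

# The result is a list; store it once instead of re-running the test per field.
print(welch$method)
print(welch$p.value)
print(welch$conf.int)
print(pooled$parameter)   # degrees of freedom: 50 + 30 - 2
```

Storing the result in a variable avoids calling t.test repeatedly, which matters when x and y are regenerated randomly between calls.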

12 Kolmogorov-Smirnov test

13 Kolmogorov-Smirnov test
> x <- rnorm(50)
> y <- runif(30)
> kst = ks.test(x, y)   # Do x and y come from the same distribution?
> kst
Two-sample Kolmogorov-Smirnov test
data: x and y
D = 0.56, p-value = 6.303e-06
alternative hypothesis: two-sided
> names(kst)
[1] "statistic" "p.value" "alternative" "method" "data.name"
> kst$p.val
[1] …e-06
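The statistic D reported by the two-sample test is the largest vertical distance between the two empirical CDFs. The sketch below recomputes it by hand with ecdf as a cross-check; the seed and the helper variable names are illustrative assumptions.

```r
set.seed(42)       # arbitrary seed, only for reproducibility
x <- rnorm(50)
y <- runif(30)

kst <- ks.test(x, y)

# Evaluate both empirical CDFs at every observed point and take the
# largest absolute difference; for tie-free data this is the statistic D.
pooled   <- c(x, y)
d_manual <- max(abs(ecdf(x)(pooled) - ecdf(y)(pooled)))

print(unname(kst$statistic))
print(d_manual)
print(kst$p.value)
```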

14 Overview of R-package installation
• Open the R console
• Open the 'Packages' drop-down menu in RGui
• Choose 'Set CRAN mirror', then choose a mirror and click OK:
> chooseCRANmirror()   ### appears automatically
• Choose repositories: CRAN is the default, and one usually adds 'BioC software' etc.; click OK (clicking 'Cancel' prompts the dialog within the console):
> setRepositories()   ### appears automatically
• Install a package from the repositories (could take time!)
• Update packages
• Install a package from a local zip file
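The menu steps above have console equivalents. The sketch below wraps installation in a hypothetical helper, install_if_missing, which is not part of R or of the lecture; the mirror URL is one common choice and the zip path in the comments is a placeholder. The calls that need a network connection are left commented out.

```r
# Hypothetical helper: install a package only when it is not yet available.
install_if_missing <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)   # downloads from the chosen mirror
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' ships with R, so nothing is downloaded here.
already_there <- install_if_missing("stats")
print(already_there)

# Console counterparts of the remaining menu entries (run interactively):
# install_if_missing("seqinr")                            # install from a repository
# update.packages(ask = FALSE)                            # update installed packages
# install.packages("path/to/package.zip", repos = NULL)   # local file, placeholder path
```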

15 Package installation (mirror)

16 Package installation (mirror)

17 Package installation (repository)

18 Package installation (repository)

19 Package installation (install)

20 Package installation (seqinR)

21 Package loading (seqinR)
> library(seqinr)   ### raises an error if the package is not installed
OR
> require(seqinr)   ### returns FALSE (with a warning) if the package is not installed; designed for use inside functions
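The difference between the two loaders shows up only when a package is missing. The sketch below uses a made-up package name, assumed not to be installed, to trigger both behaviours.

```r
pkg <- "notARealPackage123"   # hypothetical name, assumed not to be installed

# require() reports failure through its return value (plus a warning).
loaded <- suppressWarnings(
  require(pkg, character.only = TRUE, quietly = TRUE)
)

# library() stops with an error instead, so it has to be caught.
library_result <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "attached"
  },
  error = function(e) "error"
)

print(loaded)           # FALSE
print(library_result)   # "error"
```

Inside a function, the FALSE returned by require can be tested with if, whereas at the top of a script the hard failure of library is usually what one wants.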

22 Package loading

23 Some R-packages for Bioinformatics
• 'limma', 'affy', 'marray' (Bioconductor project), 'lumi', 'beadarray' - microarray analysis
• 'ape' - phylogenetics
• 'seqinr' - manipulation of biosequences
• 'BioNet' - network integration
• 'mseq', 'DEGseq' - next-generation sequencing

24 Getting help with R-packages
• "Task Views" on CRAN: ClinicalTrials, Genetics, Cluster, Pharmacokinetics, etc.
• 'sos' package: its function findFn produces an HTML page of results per keyword (e.g. "protein"):
> library(sos)
> findFn("protein", maxPages = 2)

25 seqinR PACKAGE

26 seqinR : read Fasta files

27 seqinR: read FASTA files (cont.): 'read.fasta'
> ff <- system.file("sequences/someORF.fsa", package = "seqinr")
> fs <- read.fasta(file = ff)
> names(fs)
[1] "YAL001C" "YAL002W" "YAL003W" "YAL005C" "YAL007C" "YAL008W" "YAL009W"
> count(fs[[1]], 2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
…
> seqAA <- read.fasta(file = system.file("sequences/seqAA.fasta", package = "seqinr"), seqtype = "AA")
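To make the shape of the result concrete, here is a base-R sketch of a FASTA reader that returns a named list of character vectors, one per record. It is not how seqinr implements read.fasta (which also attaches attributes to each sequence); the function read_fasta_sketch and the toy records are hypothetical.

```r
# Write a tiny two-record FASTA file to a temporary location.
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">seq1 toy record", "acgg", "gtac", ">seq2", "ttga"), fasta_path)

# Hypothetical minimal reader; assumes every header is followed by sequence lines.
read_fasta_sketch <- function(path) {
  lines     <- readLines(path)
  lines     <- lines[nzchar(lines)]          # drop empty lines
  is_header <- startsWith(lines, ">")
  record_id <- cumsum(is_header)             # record number of each line
  ids  <- sub("^>([^[:space:]]+).*$", "\\1", lines[is_header])
  seqs <- tapply(lines[!is_header], record_id[!is_header],
                 paste, collapse = "")       # join wrapped sequence lines
  out  <- lapply(unname(seqs), function(s) strsplit(tolower(s), "")[[1]])
  names(out) <- ids
  out
}

fs_sketch <- read_fasta_sketch(fasta_path)
print(names(fs_sketch))
print(fs_sketch[["seq1"]])
```

As with read.fasta, individual sequences are then reached with double brackets, e.g. fs_sketch[[1]].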

28 seqinR: 'c2s' and 's2c' functions
• Given a sequence (character string), how do we get a vector of individual characters? The generic R solution is non-intuitive: unlist(strsplit(…, ""))
• In the seqinR package this is very simple:
> s2c("acgggtacggtcccatcgaa")
[1] "a" "c" "g" "g" "g" "t" "a" "c" "g" "g" "t" "c" "c" "c" "a" "t" "c" "g" "a" "a"
> a <- s2c("acgggtacggtcccatcgaa")
> a
[1] "a" "c" "g" "g" "g" "t" "a" "c" "g" "g" "t" "c" "c" "c" "a" "t" "c" "g" "a" "a"
• The inverse operation is done by the function 'c2s':
> c2s(a)
[1] "acgggtacggtcccatcgaa"
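For comparison, the generic base-R route mentioned above looks like this. The variable names are illustrative; no seqinr functions are needed.

```r
dna_string <- "acgggtacggtcccatcgaa"

# String -> vector of single characters (what s2c does).
chars <- unlist(strsplit(dna_string, ""))

# Vector of characters -> string (what c2s does).
joined <- paste(chars, collapse = "")

print(chars)
print(joined)
print(identical(joined, dna_string))
```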

29 seqinR: 'comp' function
> rev(a)   # a function from package base
[1] "a" "a" "g" "c" "t" "a" "c" "c" "c" "t" "g" "g" "c" "a" "t" "g" "g" "g" "c" "a"
> c2s(rev(a))
[1] "aagctaccctggcatgggca"
> ar = c2s(rev(a))
• How do we get a reverse complement? 'comp' needs a vector of characters, not a string:
> comp(ar)
Error in s2n(seq) : sequence is not a vector of chars
> comp(rev(a))
[1] "t" "t" "c" "g" "a" "t" "g" "g" "g" "a" "c" "c" "g" "t" "a" "c" "c" "c" "g" "t"
> print( arc <- c2s(comp(rev(a))) )   ### both assignment and printing
[1] "ttcgatgggaccgtacccgt"
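The same reverse complement can be sketched in base R with chartr, which maps each base to its partner. The function name reverse_complement is a hypothetical helper, and the sketch handles only the four unambiguous bases.

```r
# Hypothetical helper: reverse complement of a DNA string (a, c, g, t only).
reverse_complement <- function(dna) {
  chars        <- strsplit(tolower(dna), "")[[1]]
  complemented <- chartr("acgt", "tgca", chars)   # a<->t, c<->g
  paste(rev(complemented), collapse = "")
}

arc_base <- reverse_complement("acgggtacggtcccatcgaa")
print(arc_base)   # "ttcgatgggaccgtacccgt", as on the slide

# Applying the operation twice returns the original sequence.
print(reverse_complement(arc_base))
```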

30 seqinR: 'count' function
> a <- s2c("acgggtacggtcccatcgaa")
# To count dinucleotide occurrences in sequence a:
> count(a, 2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
…
# To count trinucleotide occurrences in sequence a, in frame 2 (frame counting starts from 0):
> count(a, 3, 2)
aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att
caa cac cag cat cca ccc ccg cct cga cgc cgg cgt cta ctc ctg ctt
gaa gac gag gat gca gcc gcg gct gga ggc ggg ggt gta gtc gtg gtt
taa tac tag tat tca tcc tcg tct tga tgc tgg tgt tta ttc ttg ttt
…

31 seqinR: 'count' function
• To count dinucleotide frequencies in sequence a:
> count(a, 2, freq = TRUE)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
…
> round(count(a, 2, freq = TRUE), 3)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
…
> ?permutation
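What counting overlapping words with a start offset means can be sketched in base R. The function kmer_count below is a hypothetical stand-in written for illustration, not seqinr's implementation; it slides a window one position at a time, beginning at the given 0-based start, and reports every possible word including those with zero occurrences.

```r
# Hypothetical helper: overlapping k-mer counts from a vector of characters.
kmer_count <- function(chars, k, start = 0, freq = FALSE,
                       alphabet = c("a", "c", "g", "t")) {
  n_words <- max(0, length(chars) - k + 1 - start)
  first   <- start + seq_len(n_words)            # 1-based window starts
  words   <- vapply(first,
                    function(i) paste(chars[i:(i + k - 1)], collapse = ""),
                    character(1))
  # All possible words over the alphabet, in alphabetical order.
  grid      <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  all_words <- sort(do.call(paste0, unname(as.list(grid))))
  counts    <- table(factor(words, levels = all_words))
  if (freq) counts / sum(counts) else counts
}

a_chars <- strsplit("acgggtacggtcccatcgaa", "")[[1]]

di_counts  <- kmer_count(a_chars, 2)                 # dinucleotides, frame 0
di_freqs   <- kmer_count(a_chars, 2, freq = TRUE)    # relative frequencies
tri_frame2 <- kmer_count(a_chars, 3, start = 2)      # trinucleotides, frame 2

print(di_counts)
print(round(di_freqs, 3))
```

A sequence of 20 bases contains 19 overlapping dinucleotides, so the frequencies are the counts divided by 19.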

32 seqinR: 'AAstat' function
> seqAA <- read.fasta(file = system.file("sequences/seqAA.fasta", package = "seqinr"), seqtype = "AA")
The function AAstat of package 'seqinr' returns a list with simple information about a protein sequence: the number of residues, the percentage of each physico-chemical class, and the theoretical isoelectric point.
> AAstat(seqAA[[1]])
$Compo
A C D E F G H I K L M N P Q R S T V W Y
…
$Prop$Aliphatic
[1] …
$Prop$Aromatic
[1] …
$Pi
[1] …

33 seqinr ‘AAstat’ function (cont.)

34 Exercise
1) Read your favorite DNA sequence
a) Find the dinucleotide composition in the natural frame
b) Find the frequencies of dinucleotides in the 1st frame
c) Do the same for trinucleotides
2) Read your favorite AA sequence and learn its composition with the AAstat function