A Brief Introduction to R Programming Darren J. Fitzpatrick, PhD The Bioinformatics Support Team 27/08/2015.

1 A Brief Introduction to R Programming Darren J. Fitzpatrick, PhD The Bioinformatics Support Team fitzpadj@tcd.ie 27/08/2015

2 Overview
• What is R? Why might it be useful?
• An Overview of RStudio
• A First Program
• Basic Syntax of R
• Indexing Rows and Columns
• Exploratory Data Analysis using R/RStudio

3 Trinity College Dublin, The University of Dublin
What is R and why bother?
• R is fundamentally a programming language suitable for data analysis.
• R has ~4000 packages enabling advanced data analytics, exploration and visualisation.
• Bioconductor, a suite of specialised tools for biological data analysis, integrates with R.
• R has a learning curve, but once the basics are mastered it offers the flexibility to tackle a very wide range of analytics problems.

4 What can be done?

5 An Overview of RStudio
• Inbuilt text editor for writing and saving R code
• Console/Interpreter for running R code
• Plots, Packages and HELP!

6 A First Program
Write code, select it and press “Run”.
R executes the code.

7 Basic Syntax of R
> print('hello world')
[1] "hello world"
• print() is an inbuilt R function.
• Function calls are always of the form function().
• Arguments are passed to a function inside the brackets.
• 'hello world' is an argument.
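As a small sketch of the function-call syntax described above: round() and the variable names below are illustrative choices and do not appear in the original slides.

```r
# A function call is the function's name followed by its arguments
# inside round brackets.
print("hello world")

# Arguments can be passed by position or by name.
rounded_by_position <- round(3.14159, 2)          # value, then number of digits
rounded_by_name     <- round(3.14159, digits = 2) # same call, named argument
print(rounded_by_name)

# The result of a call can be stored in a variable using <-
greeting <- "hello world"
print(greeting)
```

Naming arguments tends to make longer calls easier to read, which becomes useful once plotting functions with many options are introduced.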

8 Basic Syntax of R
R has many useful inbuilt functions, some of which we will use today. Examples include the following:
sum()          add numbers together
mean()         calculate the mean of a set of numbers
sd()           calculate the standard deviation of a set of numbers
t.test()       perform a Student's t-test
wilcox.test()  perform a Wilcoxon/Mann-Whitney test
fisher.test()  perform a Fisher's exact test
chisq.test()   perform a Chi-squared test
plot()         basic plotting function
hist()         plot a histogram
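A quick sketch of the first three functions in use. The vector heights holds made-up values purely for illustration; it is not data from the workshop.

```r
# c() combines values into a vector; these numbers are invented.
heights <- c(1.62, 1.75, 1.80, 1.68, 1.71)

total   <- sum(heights)   # add the numbers together
average <- mean(heights)  # arithmetic mean
spread  <- sd(heights)    # sample standard deviation

print(total)
print(average)
print(spread)
```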

9 The Iris Data Set
We will explore Fisher's famous Iris data set, which is available with R. The data is stored in a data structure called a data frame. A data frame is a tabular representation of data using rows and columns.
Let's look at the data!
> attach(iris) # Attach the data (iris ships with R)
> x <- as.matrix(iris[,-5]) # Drop the Species column, convert to a matrix
> heatmap(x, cexCol=0.7) # Make an ugly heatmap

10 The Iris Data Set
> nrow(iris) # No. of rows
[1] 150
> ncol(iris) # No. of columns
[1] 5
> dim(iris) # The dimensions
[1] 150 5
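Beyond counting rows and columns, a few other base R functions give a quick overview of a data frame. The slides do not use head(), str() or summary(); they are shown here only as a suggested next step.

```r
# iris is available in a standard R session without loading anything.
dims <- dim(iris)    # number of rows, then number of columns
print(dims)

print(head(iris, 3)) # the first three rows
str(iris)            # column names, types and a preview of the values
print(summary(iris)) # per-column summary statistics
```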

11 The Iris Data Set
A nicer heatmap! We will learn to make these plots in an extended R workshop.

12 Indexing Rows and Columns
Data frames have a matrix structure comprising rows and columns. To access rows and columns we use indexing. Indexing is of the form:
dataset[from row:to row, from col:to col]
Some examples:
> iris[1,] # The first row of the data
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa

13 Indexing Rows and Columns
> iris[1:5,] # The first 5 rows of the data
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
> iris[1:5, 1] # The first 5 rows of the first column
[1] 5.1 4.9 4.7 4.6 5.0
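The same square-bracket indexing accepts more than contiguous ranges. The variations below are a sketch of standard base R behaviour and go slightly beyond what the slides cover; all variable names are illustrative.

```r
# Pick arbitrary rows with c(): rows 1, 51 and 101 (one from each species).
one_per_species <- iris[c(1, 51, 101), ]
print(one_per_species)

# Select columns by name instead of by number.
named_cols <- iris[1:3, c("Sepal.Length", "Species")]
print(named_cols)

# A negative index drops a column: everything except column 5 (Species).
measurements_only <- iris[1:3, -5]
print(measurements_only)

# Selecting a single column returns a plain vector by default;
# drop = FALSE keeps the result as a one-column data frame.
first_col_vector <- iris[1:5, 1]
first_col_frame  <- iris[1:5, 1, drop = FALSE]
```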

14 Indexing Rows and Columns
Find the species
> names(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
> unique(iris[,5]) # Using column number
[1] setosa versicolor virginica
> unique(iris$Species) # Using $ and column name
Extract Sepal.Length values for the setosa species
> setosa_sepal_length <- iris[which(iris$Species=='setosa'), 1]
How would you extract the sepal length data for the virginica species?
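To see how the row filter works, it can help to build it in two steps. This sketch uses setosa only; the intermediate variable is_setosa and the other names are illustrative and not from the slides.

```r
# The comparison yields one TRUE/FALSE value per row of the data frame.
is_setosa <- iris$Species == "setosa"
print(table(is_setosa))

# A logical vector can be used directly as the row index...
setosa_by_logical <- iris[is_setosa, "Sepal.Length"]

# ...or converted to row numbers first with which(), as on the slide.
setosa_by_which <- iris[which(is_setosa), 1]

# Both approaches select the same values.
print(identical(setosa_by_logical, setosa_by_which))
```

One practical difference: which() silently drops missing values in the comparison, whereas a logical index containing NA produces NA rows. The iris data has no missing species labels, so here the two are interchangeable.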

15 Exploratory Data Analysis
> virginica_sepal_length <- iris[which(iris$Species=='virginica'), 1]
We have now defined two variables containing sepal length data for two species:
setosa_sepal_length
virginica_sepal_length
How do we begin to explore this data?
• Calculate the means of both data sets
• Calculate the standard deviation for both data sets
• Plot histograms of both data sets
• Perform statistics to ask if sepal length differs between species

16 Exploratory Data Analysis
> mean(setosa_sepal_length)
> mean(virginica_sepal_length)
> sd(setosa_sepal_length)
> sd(virginica_sepal_length)
Do the means and standard deviations differ? Would you expect the distributions of the data to differ?
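As an alternative to extracting each species by hand, the summaries can be computed for all species in one call. tapply() is not used in the slides; this is a sketch of one common base R approach, with illustrative variable names.

```r
# tapply(values, groups, function) applies the function to each group.
species_means <- tapply(iris$Sepal.Length, iris$Species, mean)
species_sds   <- tapply(iris$Sepal.Length, iris$Species, sd)

print(species_means)
print(species_sds)
```

The setosa and virginica entries of species_means should match the values from mean(setosa_sepal_length) and mean(virginica_sepal_length).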

17 Exploratory Data Analysis
R can render nice descriptive plots such as boxplots, various flavours of scatterplots and histograms. These require additional knowledge, so today we will keep it simple.
Code for the plots here can be found in the ‘Additional_Plots.R’ file on http://bioinf.gen.tcd.ie/workshops/R

18 Exploratory Data Analysis
Look up the hist() function using the help manual. R help always gives the following:
• The arguments that a function can take
• A description (not always clear!) of what those arguments are
Try the following:
> hist(setosa_sepal_length)
> hist(setosa_sepal_length, breaks=10, main='Sepal Length (Setosa)', col='darkred', xlab='Sepal Length')
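A short sketch of how to reach the help page and how to inspect what hist() actually computed. The plot = FALSE argument is part of hist() but is not used in the slides; it is included here only to look at the bins without drawing anything.

```r
# In an interactive session, either of these opens the help page:
#   ?hist
#   help("hist")

setosa_sepal_length <- iris[which(iris$Species == "setosa"), 1]

# With plot = FALSE, hist() returns the binning instead of drawing it.
h <- hist(setosa_sepal_length, breaks = 10, plot = FALSE)
print(h$breaks)  # bin edges chosen by R
print(h$counts)  # number of observations in each bin
```

Note that a single number passed to breaks is treated as a suggestion, so the number of bins drawn may not be exactly 10.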

19 Exploratory Data Analysis
You should see something like this!

20 Exploratory Data Analysis
Use the hist() function to plot the sepal lengths for the virginica species.
• Change the title of the graph
• Change the colour (darkgreen, darkslategrey, purple)
Tell R to plot two histograms side by side:
> par(mfrow=c(1,2))
Now, run your histogram code for both data sets.

21 Exploratory Data Analysis
You should see something like this!
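One possible solution to the side-by-side exercise is sketched below. The virginica title and colour are arbitrary picks from the suggestions in the exercise, and saving and restoring the par() settings is an addition not shown in the slides.

```r
setosa_sepal_length    <- iris[which(iris$Species == "setosa"), 1]
virginica_sepal_length <- iris[which(iris$Species == "virginica"), 1]

# par() returns the previous settings, so they can be restored afterwards.
old_par <- par(mfrow = c(1, 2))  # 1 row, 2 columns of plots

hist(setosa_sepal_length, breaks = 10, main = "Sepal Length (Setosa)",
     col = "darkred", xlab = "Sepal Length")
hist(virginica_sepal_length, breaks = 10, main = "Sepal Length (Virginica)",
     col = "darkgreen", xlab = "Sepal Length")

par(old_par)  # back to one plot per page
```

Without the final par(old_par), every later plot in the same session would continue to be drawn in the two-panel layout.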

22 Hypothesis Testing
We want to test whether the mean sepal lengths of Setosa and Virginica differ.
H0: mean(setosa) = mean(virginica)
H1: mean(setosa) ≠ mean(virginica)
Use the help utility to work out how to do a two-sample unpaired t-test.
Is there a significant difference in sepal lengths between the two species?

23 Hypothesis Testing
> t.test(setosa_sepal_length, virginica_sepal_length)
        Welch Two Sample t-test
data:  setosa_sepal_length and virginica_sepal_length
t = -15.386, df = 76.516, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.78676 -1.37724
sample estimates:
mean of x mean of y
    5.006     6.588
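The printed report is convenient to read, but the individual numbers can also be pulled out of the result for further use. This is a sketch; the variable names welch and student are illustrative. By default t.test() runs the Welch version, which does not assume equal variances; the classic Student's t-test is requested with var.equal = TRUE.

```r
setosa_sepal_length    <- iris[which(iris$Species == "setosa"), 1]
virginica_sepal_length <- iris[which(iris$Species == "virginica"), 1]

# Default: Welch two-sample t-test (unequal variances allowed).
welch <- t.test(setosa_sepal_length, virginica_sepal_length)
print(welch$statistic)  # the t value
print(welch$parameter)  # degrees of freedom
print(welch$p.value)    # the p-value as a plain number
print(welch$conf.int)   # 95% confidence interval for the difference in means
print(welch$estimate)   # the two sample means

# Student's t-test, assuming equal variances in both groups.
student <- t.test(setosa_sepal_length, virginica_sepal_length, var.equal = TRUE)
print(student$parameter)
```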

24 Resources
• The official R website for downloading software and help: https://cran.r-project.org
• A free online book, “Statistics in R Using Biological Examples”: https://cran.r-project.org/doc/contrib/Seefeld_StatsRBio.pdf
• Quick-R, a site with nice examples of how to do various analyses in R: http://www.statmethods.net
• Bioconductor, a suite of R packages for biological data analysis: http://www.bioconductor.org

25 Conclusions
You have been briefly introduced to the RStudio environment and coding in R.
You are familiar with the basics of variables, data frames, indexing, plotting and hypothesis testing.
A more comprehensive R course planned for the near future will include topics such as:
• Coding in R: writing functions, loops and scripts
• Further exploratory data analysis
• Further hypothesis testing (Fisher's exact, Chi-squared, Mann-Whitney)
• Statistical modelling (linear regression, ANOVA)
• Biological data analysis: GWAS, differential expression, your interests!

26 Thank You

