Statistical Programming Using the R Language Lecture 1 Basic Concepts I Darren J. Fitzpatrick, Ph.D April 2016.


1 Statistical Programming Using the R Language Lecture 1 Basic Concepts I Darren J. Fitzpatrick, Ph.D April 2016

2 Trinity College Dublin, The University of Dublin Preliminaries
The course will run for 2 hours per day for 5 days (10:00 – 12:00), broadly divided into:
  1 hour lecture
  1 hour practical/problem sheet
Course website: http://bioinf.gen.tcd.ie/workshops/R
Prior to each lecture, notes, problem sheets and data will be available at the above address. Solutions will be posted after each lecture.
I can be contacted by e-mail: fitzptrd@tcd.ie

3 Course Overview I
  Lecture 1 – Basic Concepts I: MAC, RStudio, syntax, functions, files and data structures
  Lecture 2 – Basic Concepts II: more syntax, plotting, exploratory data analysis
  Lecture 3 – Hypothesis Testing: concepts, normality, F-test, t-test, Wilcoxon test, correlation
  Lecture 4 – Experimental Design & ANOVA: power calculations, ANOVA
  Lecture 5 – Introducing Multivariate Analysis: clustering, hierarchical, k-means, heatmaps

4 Course Overview II The course is NOT intended to provide comprehensive coverage of either R or statistics. It is intended to provide: Adequate fluency in the R language such that users can easily learn additions relevant to their needs. Familiarity in applying statistics to datasets and interpreting the results. Take the mystery out of using code. Mistakes are encouraged – unlike cells, R will neither starve nor die!

5 Lecture 1 - Overview
  1. R – what, where, why?
  2. Getting to grips with:
     – MAC
     – RStudio
  3. Beginning R programming: variables, data types, operators, functions, getting help, dealing with files, data structures

6 R – what, why, where?
  R is fundamentally a programming language suitable for data analysis.
  R has ~4000 packages enabling advanced data analytics, exploration and visualisation.
  Bioconductor, a suite of specialised tools for biological data analysis, integrates with R.
  R has a learning curve, but once the basics are mastered it offers the flexibility to deal with any imaginable analytics problem.

7 R – what, why, where?

8 Using a MAC (briefly!)
The main differences between a MAC and a PC:
  cmd instead of control (e.g. cmd-C for copying)
  right click mouse: ctrl-click
  # character: alt-3
  switch between applications: cmd-tab
  Spotlight (magnifying glass, top right): finds files/programs
  Apple symbol (top left): for logging out, preferences, etc.

9 Resources
  The R Website: https://www.r-project.org
  Statistics in R Using Biological Examples: https://cran.r-project.org/doc/contrib/Seefeld_StatsRBio.pdf
  Statistics: An Introduction Using R: http://www.amazon.co.uk/Statistics-An-Introduction-Using-R/dp/1118941098/
  Bioconductor: https://www.bioconductor.org

10 Finding RStudio using Spotlight

11 Overview of RStudio I Inbuilt text editor for writing and saving R code Console/Interpreter for running R Code Plots, Packages and HELP!

12 Overview of RStudio II Write code, press “run” For multiple lines of code, select and "run" (Ctrl-R) R executes code

13 Basic Syntax of R
  > print('hello world')
  [1] "hello world"
print() is an inbuilt R function.
Functions are always of the form function().
Arguments are passed to a function inside the brackets; here 'hello world' is the argument.

14 Basic Syntax of R
R has many useful inbuilt functions, some of which we will use today. Examples include the following:
  sum()          add numbers together
  mean()         calculate the mean of a set of numbers
  sd()           calculate the standard deviation of a set of numbers
  t.test()       perform a Student's t-test
  wilcox.test()  perform a Wilcoxon/Mann-Whitney test
  fisher.test()  perform a Fisher's exact test
  chisq.test()   perform a Chi-squared test
  plot()         basic plotting function
  hist()         plot a histogram
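As a minimal sketch of how such inbuilt functions are called, the lines below pass a small vector to three of them. The vector x and its values are made up for illustration and are not part of the course data.

```r
# Hypothetical data, made up for illustration
x <- c(2, 4, 6, 8)

total   <- sum(x)    # adds the numbers together
average <- mean(x)   # arithmetic mean
spread  <- sd(x)     # sample standard deviation

print(total)
print(average)
print(spread)
```

Each call follows the same pattern: the function name, then the argument inside the brackets.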

15 Getting Help Type the function name. R will auto suggest. The help pages may or may not be helpful. Sometimes you have to play around with functions to figure out how to use them.
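Besides auto-suggestion, the help pages can be opened from the console. The following is a rough sketch using functions that ship with base R; which pages turn out to be useful will depend on the function.

```r
# Open the help page for a function (two equivalent forms)
?mean
help("sd")

# Show the arguments a function accepts without opening the help page
args(sd)

# Run the examples listed at the bottom of a help page
example(mean)
```

Running the examples from a help page is often the quickest way to "play around" with an unfamiliar function.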

16 Variables & Data Types
A variable is a name given to 'something' held in computer memory.
  some_name <- 5e-2
<- is the assignment operator, i.e., it associates a value with the variable name (shortcut: alt and - ).
To retrieve/reuse data in computer memory, it must be assigned to a variable.
Create the following variables in RStudio:
  a <- 2
  b <- 10
  c <- "my_name"
To recall a variable, just type the variable name!

17 Variables & Data Types
A variable can hold different data types.
  Variable        class()    R's response
  a <- 2          class(a)   [1] "numeric"
  b <- 10         class(b)   [1] "numeric"
  c <- "my_name"  class(c)   [1] "character"
There are other data types in R, but we will ignore these for now.

18 Operators Operators in computer science are inbuilt functions that perform an 'operation' of some kind. They can be arithmetic: +, -, *, /, ^ They can be comparative: Equal: == Not equal: != Greater than (or equal): > (>=) Less than (or equal): < (<=) They can be logical: AND: & OR: | NOT: !
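A short sketch of the three operator families in use. The variables x and y and their values are hypothetical and chosen only so that the results are easy to check by hand.

```r
# Hypothetical values, chosen for illustration
x <- 7
y <- 3

x + y      # arithmetic: 10
x ^ 2      # arithmetic: 49
x == y     # comparative: FALSE
x >= y     # comparative: TRUE

# Logical operators combine the results of comparisons
(x > y) & (y > 5)    # AND: FALSE, because y > 5 is FALSE
(x > y) | (y > 5)    # OR: TRUE, because x > y is TRUE
!(x == y)            # NOT: TRUE
```

The brackets around each comparison are not strictly required, but they make the order of evaluation explicit.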

19 Operators Exercise
Using the two variables a and b, try the following:
  Arithmetic:   a + b    a * b    a^2
                a - b    a / b    a^2 + b^2
  Comparative:  a == b   a < b    b <= a
                a != b   a > b    b >= a
  Logical:      a < b & b == 0    a < b | b == 0
                a < b & b != 0    a < b | b != 0

20 Reading Files I First we need some data! Data for each lecture will be available on the course website http://bioinf.gen.tcd.ie/workshops/R Using the screen shot as a guide, create a folder in your Documents and call it R_Course. From the course webpage, download the file entitled, gene_expression_disease_sex.txt Drag file from the Downloads folder to the R_Course folder.

21 Reading Files II In order to read a file, a computer must know the file's location. A file location is usually specified as a path, e.g. /home/User/Desktop/R_Course/file.txt R/RStudio can be directed to point to a particular location on your machine. This is called the working directory (wd). To set the wd, follow the above and navigate to the R_Course folder you created.
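The working directory can also be inspected and set from the console. This is a minimal sketch using base R functions; tempdir() stands in for the real folder only so that the example runs on any machine, and in practice the path to your own R_Course folder would be used instead.

```r
# Where is R currently pointing?
old_wd <- getwd()
print(old_wd)

# Point R at another folder. tempdir() is a placeholder used for illustration;
# in the course this would be the path to the R_Course folder.
setwd(tempdir())
print(getwd())

# List the files in the working directory
list.files()

# Return to the original location
setwd(old_wd)
```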

22 Reading Files III
Once that's done, reading files is simple.
  df <- read.table('gene_expression_disease_sex.txt', header = TRUE)
You have read in a file using the read.table() function and assigned it the variable name df. The argument header = TRUE tells R that the first line of the file holds the column names.
If you type df in the console, the file contents should flash before your eyes.
We will come back to this data later.
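To see what read.table() does without the course file, the sketch below writes a tiny tab-delimited file and reads it back. The file contents and the names toy_file and toy are made up, and the layout (a header with one field fewer than the data rows) is an assumption about how such files are commonly arranged, not a description of the course file.

```r
# Write a tiny, made-up tab-delimited file so the example is self-contained.
# The header has one field fewer than the data rows, so read.table() uses
# the first column as row names.
toy_file <- tempfile(fileext = ".txt")
writeLines(c("gene_a\tgene_b\tstatus",
             "ind_1\t0.3\t0.8\tU",
             "ind_2\t0.6\t2.8\tA"),
           toy_file)

toy <- read.table(toy_file, header = TRUE)

head(toy)   # first rows only, rather than the whole table
str(toy)    # column names and types at a glance
```

For larger files, head() and str() are usually more practical than printing the whole table to the console.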

23 Data Structures A data structure is a way of organising data in computer memory such that it can be used for some purpose. There are many different kinds of data structures in computer languages – graphs (networks), lists, tables, etc. The most relevant in R are: The vector The matrix The dataframe The list (- not covered in this course)

24 The Vector
A vector is a one-dimensional sequence of elements of a single type, e.g. all numbers or all strings.
Vectors have a length ( length() ).
Elements can be accessed by indexing ( vec1[1] ); indexing starts at 1.
When a vector has a character element, all elements become characters.
  > vec1 <- c(10, 35, 67, 3)
  > length(vec1)   # vector length
  [1] 4
  > vec1[1]        # indexing a vector
  [1] 10
  > vec1[4]
  [1] 3
  > class(vec1[4])
  [1] "numeric"
  > vec2 <- c(10, 35, 67, 3, 'string')
  > vec2
  [1] "10"     "35"     "67"     "3"      "string"
  > class(vec2)
  [1] "character"
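Indexing is not limited to a single position. The sketch below, using a made-up vector called heights, shows a few further forms of indexing and what happens when a character vector is converted back to numbers.

```r
heights <- c(10, 35, 67, 3)   # made-up numbers

heights[2:3]            # a range of elements: 35 67
heights[c(1, 4)]        # elements picked by position: 10 3
heights[heights > 20]   # elements that satisfy a condition: 35 67

# Mixing in a string coerces every element to character ...
mixed <- c(heights, "string")
class(mixed)

# ... and as.numeric() converts back, giving NA (with a warning)
# for anything that is not a number
recovered <- suppressWarnings(as.numeric(mixed))
recovered
```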

25 The Matrix
Matrices are two-dimensional collections of data (arrays generalise them to more dimensions).
  > mat <- matrix(rnorm(4), 2, 2)   # 2 x 2 matrix of random normal values
  > mat
             [,1]      [,2]
  [1,] 0.02908084 1.1467495
  [2,] 0.60354861 0.5619637
  > mat[1, 1]   # indexing mat[rows, cols]
  [1] 0.02908084
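Because rnorm() gives different numbers on every run, a matrix with fixed contents makes the indexing easier to follow. This is a small sketch with a hypothetical matrix m filled with the numbers 1 to 6.

```r
# Fill a 2 x 3 matrix column by column (R's default) with the numbers 1 to 6
m <- matrix(1:6, nrow = 2, ncol = 3)
m

dim(m)      # 2 rows, 3 columns
m[2, 3]     # single element: row 2, column 3 -> 6
m[1, ]      # entire first row: 1 3 5
m[, 2]      # entire second column: 3 4
```

Leaving the row or column position empty selects everything along that dimension.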

26 The Dataframe I
The dataframe is the heart of the R programming language. It is a way of representing/structuring data such that the data set can be easily used and modified for analysis.
A quick view of a dataframe – similar to Excel:
         Gene_a  Gene_b  ...  Gene_n  Status  Sex
  Ind_1  0.3     0.8     ...  1.2     U       M
  Ind_2  0.6     2.8     ...  0.4     A       F
  Ind_n  0.1     0.09    ...  0.19    A       M

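27 The Dataframe II
A quick view of a data frame:
         Gene_a  Gene_b  ...  Gene_n  Status  Sex
  Ind_1  0.3     0.8     ...  1.2     U       M
  Ind_2  0.6     2.8     ...  0.4     A       F
  Ind_n  0.1     0.09    ...  0.19    A       M
Labels on the slide: rows; columns; row names (Ind_1 ... Ind_n); numeric data (Gene_a ... Gene_n); category data (factors) (Status, Sex).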

28 The Dataframe III
So, a data frame is a tabular (rows and columns) representation of data that organises data of different types.
         Gene_a  Gene_b  ...  Gene_n  Status  Sex
  Ind_1  0.3     0.8     ...  1.2     U       M
  Ind_2  0.6     2.8     ...  0.4     A       F
  Ind_n  0.1     0.09    ...  0.19    A       M
R has various functions for accessing the attributes of a data frame:
  dim()        dimensions (rows x cols)
  nrow()       no. of rows
  ncol()       no. of cols
  names()      header names
  colnames()   header names
  row.names()  row names
Use the above functions to explore the data set (df) that you previously read in, e.g., dim(df)

29 The Dataframe IV
  > dim(df)         # dimensions (rows by columns)
  [1] 20 12
  > nrow(df)        # number of rows
  [1] 20
  > ncol(df)        # number of columns
  [1] 12
  > names(df)       # header names, same as colnames(df)
   [1] "gene_a" "gene_b" "gene_c" "gene_d" "gene_e" "gene_f" "gene_g" "gene_h"
   [9] "gene_i" "gene_j" "status" "sex"
  > row.names(df)   # row names
   [1] "ind_1"  "ind_2"  "ind_3"  "ind_4"  "ind_5"  "ind_6"  "ind_7"  "ind_8"
   [9] "ind_9"  "ind_10" "ind_11" "ind_12" "ind_13" "ind_14" "ind_15" "ind_16"
  [17] "ind_17" "ind_18" "ind_19" "ind_20"
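A data frame can also be built by hand, which may help in seeing how the pieces fit together. The sketch below creates a small, made-up data frame called toy that mimics the layout shown on the earlier slides; it is not the course data.

```r
# A made-up data frame: two numeric columns and two factor columns
toy <- data.frame(
  gene_a = c(0.3, 0.6, 0.1),
  gene_b = c(0.8, 2.8, 0.09),
  status = factor(c("U", "A", "A")),
  sex    = factor(c("M", "F", "M")),
  row.names = c("ind_1", "ind_2", "ind_3")
)

dim(toy)          # 3 rows, 4 columns
names(toy)        # column names
row.names(toy)    # row names
str(toy)          # the type of each column
```

Each column has a single type, but different columns can have different types, which is what distinguishes a data frame from a matrix.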

30 Indexing I
Dataframes are not much use unless you can access the elements. Similar to the matrix, we can access the elements of a dataframe by indexing.
Try the following (what are they doing?):
  df[1, ]       df[, 1:5]           unique(df[, 11])
  df[1:3, ]     df[1, 1:5]          unique(df[, 12])
  df[1:3, 1]    df[1:nrow(df), 1]

31 Indexing II
You can also refer to columns of a dataframe directly. You can attach() a dataframe and refer to columns by name or you can use the df$column_name notation.
Try the following (what are they doing?):
  attach(df)        df$sex
  gene_a            df$gene_i
  unique(status)    unique(df$status)
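The different ways of reaching the same column can be compared side by side. This is a sketch on a made-up data frame called toy; the detach() call at the end is an addition not mentioned on the slide, included because an attached data frame otherwise stays on R's search path.

```r
# A made-up data frame for illustration
toy <- data.frame(gene_a = c(0.3, 0.6, 0.1),
                  status = c("U", "A", "A"))

toy$gene_a           # column by name with $
toy[, "gene_a"]      # the same column by name inside the brackets
toy[, 1]             # the same column by position

# attach() makes the columns visible as ordinary variables;
# detach() undoes this, which avoids name clashes later on
attach(toy)
gene_a
detach(toy)
```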

32 Problem I
Our data comprises gene expression information for affected (A) and unaffected (U) individuals. Create two new dataframes named affected and unaffected containing only the gene expression data for those groups.
  affected   <- df[which(df$status == 'A'), 1:10]
  unaffected <- df[which(df$status == 'U'), 1:10]
What do you think the above code is doing?
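One way to see what this code is doing is to run it one step at a time on a small example. The data frame toy and the intermediate names is_affected, rows and affected_toy are hypothetical and used only for illustration.

```r
# A made-up data frame: two expression columns and a status column
toy <- data.frame(gene_a = c(0.3, 0.6, 0.1, 0.9),
                  gene_b = c(0.8, 2.8, 0.09, 1.1),
                  status = c("U", "A", "A", "U"))

is_affected <- toy$status == "A"   # one TRUE/FALSE per row
is_affected

rows <- which(is_affected)         # positions of the TRUE values: 2 3
rows

# Keep those rows and only the two expression columns
affected_toy <- toy[rows, 1:2]
affected_toy
```

The comparison produces a TRUE/FALSE value per row, which() turns those into row numbers, and the square brackets then select those rows and the requested columns.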

33 The Environment Window The environment window (top right) keeps track of the variables stored in memory. Opposite, it tells us that the df variable contains 20 observations (rows) and 12 variables (columns). You can also use the ls() function in the console to list the variables currently in memory.

34 The Environment Window Clicking on the variable name displays the data in the top right.

35 The History Window The history window keeps a record of executed commands. You can highlight code, click "to source" and the code will appear in your R script.

36 Problem II
There are two additional dataframes held in memory containing gene expression data for the affected and unaffected individuals. Compute the mean gene expression for genes a-j for both groups separately.
  mean_a <- apply(affected, 2, mean)
  mean_u <- apply(unaffected, 2, mean)
What do you think the above code is doing? This is a bit tricky!
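The second argument of apply() is the part that tends to cause confusion. The sketch below uses a made-up data frame toy with values chosen so that the means are easy to verify by hand; colMeans() is a base R function not mentioned on the slide and is shown only for comparison.

```r
# A made-up data frame with easy-to-check values
toy <- data.frame(gene_a = c(1, 2, 3),
                  gene_b = c(10, 20, 60))

# The second argument chooses the direction:
# 2 applies the function to each column, 1 to each row
col_means <- apply(toy, 2, mean)
row_means <- apply(toy, 1, mean)

col_means   # gene_a = 2, gene_b = 30
row_means   # 5.5, 11, 31.5

# colMeans() is a built-in shortcut for the column case
colMeans(toy)
```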

37 Problem III
Now we have computed the mean gene expression for each gene within each group. Combine mean_a and mean_u into a single table (rbind() stacks the two vectors as the rows of a matrix) and write out a new file.
  sample_means <- rbind(mean_a, mean_u)
  write.table(sample_means, 'sample_means.txt', sep = '\t', row.names = TRUE, col.names = TRUE, quote = FALSE)
The file sample_means.txt should be located in your working directory.
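A quick way to check what was written is to read the file back in. The following sketch does this with made-up means; the names mean_x, mean_y, combined, out_file and check are hypothetical, and the file is written to a temporary folder rather than the working directory so that nothing is overwritten.

```r
# Made-up group means for two genes
mean_x <- c(gene_a = 0.4, gene_b = 1.2)
mean_y <- c(gene_a = 0.5, gene_b = 0.9)

combined <- rbind(mean_x, mean_y)   # one row per vector
is.matrix(combined)                 # rbind() on vectors gives a matrix
combined

# Write to a temporary folder (an assumption made so the example is harmless)
out_file <- file.path(tempdir(), "toy_means.txt")
write.table(combined, out_file, sep = "\t",
            row.names = TRUE, col.names = TRUE, quote = FALSE)

# Reading the file back is a quick check that it was written as expected
check <- read.table(out_file, header = TRUE, sep = "\t")
check
```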

38 Saving a Script Save your script as lecture_examples.R

39 Saving a Session Save the session as lecture_examples.RData
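Saving can also be done from the console. This is a rough sketch, assuming the base R functions save() and load(); the object results and the file name toy_session.RData are made up, and a temporary folder is used so the example does not touch your own files.

```r
# A made-up object to save
results <- c(gene_a = 0.4, gene_b = 1.2)

session_file <- file.path(tempdir(), "toy_session.RData")
save(results, file = session_file)   # save.image() would save every object

rm(results)            # remove the object from memory

load(session_file)     # restore it from the file
results
```

save() stores only the objects named in the call, whereas save.image() stores everything currently in memory.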

40 Summary Today we covered a lot: RStudio, variables, operators, data types, data structures, inbuilt functions. We came across a few inbuilt functions in R (the unintuitive ones are worth looking up in the help pages!): read.table(), which(), apply(), rbind(), write.table(). Tomorrow, we will look at more advanced aspects of R syntax and basic plotting.

41 Lecture 1 – problem sheet A problem sheet entitled lecture_1_problems.pdf is located on the course website (http://bioinf.gen.tcd.ie/workshops/R). All the code required for the problem sheet has been covered in this lecture. Please attempt the problems for the next 30-45 mins. We will be on hand to help out. Solutions will be posted this afternoon.

42 Thank You

