John D. Cook M. D. Anderson Cancer Center

Slides:



Advertisements
Similar presentations
1 1 Chapter 5: Multiple Regression 5.1 Fitting a Multiple Regression Model 5.2 Fitting a Multiple Regression Model with Interactions 5.3 Generating and.
Advertisements

1 More Regression Information. 2 3 On the previous slide I have an Excel regression output. The example is the pizza sales we saw before. The first thing.
Examining Relationship of Variables  Response (dependent) variable - measures the outcome of a study.  Explanatory (Independent) variable - explains.
Nemours Biomedical Research Statistics April 2, 2009 Tim Bunnell, Ph.D. & Jobayer Hossain, Ph.D. Nemours Bioinformatics Core Facility.
Lecture 24: Thurs. Dec. 4 Extra sum of squares F-tests (10.3) R-squared statistic (10.4.1) Residual plots (11.2) Influential observations (11.3,
RESEARCH STATISTICS Jobayer Hossain Larry Holmes, Jr November 6, 2008 Examining Relationship of Variables.
7/2/ Lecture 51 STATS 330: Lecture 5. 7/2/ Lecture 52 Tutorials  These will cover computing details  Held in basement floor tutorial lab,
Crime? FBI records violent crime, z x y z [1,] [2,] [3,] [4,] [5,]
Baburao Kamble (Ph.D) University of Nebraska-Lincoln
Multiple Linear Regression A method for analyzing the effects of several predictor variables concurrently. - Simultaneously - Stepwise Minimizing the squared.
Statistical hypothesis testing – Inferential statistics II. Testing for associations.
What is R Muhammad Omer. What is R  R is the programing language software for statistical computing and data analysis  The R language is extensively.
How to plot x-y data and put statistics analysis on GLEON Fellowship Workshop January 14-18, 2013 Sunapee, NH Ari Santoso.
Example of Simple and Multiple Regression
Regression Harry R. Erwin, PhD School of Computing and Technology University of Sunderland.
Multiple Linear Regression - Matrix Formulation Let x = (x 1, x 2, …, x n )′ be a n  1 column vector and let g(x) be a scalar function of x. Then, by.
BIOL 582 Lecture Set 19 Matrices, Matrix calculations, Linear models using linear algebra.
One-Factor Experiments Andy Wang CIS 5930 Computer Systems Performance Analysis.
Quantitative Methods Heteroskedasticity.
 Combines linear regression and ANOVA  Can be used to compare g treatments, after controlling for quantitative factor believed to be related to response.
Diploma in Statistics Introduction to Regression Lecture 2.21 Introduction to Regression Lecture Review of Lecture 2.1 –Homework –Multiple regression.
A very brief introduction to using R & MX - Matthew Keller Some material cribbed from: UCLA Academic Technology Services Technical Report Series (by Patrick.
Data Handling & Analysis BD7054 Scatter Plots Andrew Jackson
Lecture 3: Inference in Simple Linear Regression BMTRY 701 Biostatistical Methods II.
Chapter 12 Multiple Linear Regression Doing it with more variables! More is better. Chapter 12A.
1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 4b, February 20, 2015 Lab: regression, kNN and K- means results, interpreting and evaluating models.
Multiple Regression Petter Mostad Review: Simple linear regression We define a model where are independent (normally distributed) with equal.
Regression Chapter 16. Regression >Builds on Correlation >The difference is a question of prediction versus relation Regression predicts, correlation.
© Copyright McGraw-Hill Correlation and Regression CHAPTER 10.
Simple Linear Regression (SLR)
Simple Linear Regression (OLS). Types of Correlation Positive correlationNegative correlationNo correlation.
Tutorial 4 MBP 1010 Kevin Brown. Correlation Review Pearson’s correlation coefficient – Varies between – 1 (perfect negative linear correlation) and 1.
1 KFPA Critical Design Review – Fri., Jan. 30, 2009 KFPA Data Pipeline Bob Garwood- NRAO-CV.
Multiple Regression. Simple Regression in detail Y i = β o + β 1 x i + ε i Where Y => Dependent variable X => Independent variable β o => Model parameter.
28. Multiple regression The Practice of Statistics in the Life Sciences Second Edition.
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 11: Models Marshall University Genomics Core Facility.
Tutorial 5 Thursday February 14 MBP 1010 Kevin Brown.
Bursts modelling Using WinBUGS Tim Watson May 2012 :diagnostics/ :transformation/ :investment planning/ :portfolio optimisation/ :investment economics/
Lecture 11: Simple Linear Regression
EXCEL: Multiple Regression
Peter Fox and Greg Hughes Data Analytics – ITWS-4600/ITWS-6600
Data Analytics – ITWS-4600/ITWS-6600
Modeling in R Sanna Härkönen.
Scripting Languages Info derived largely from Programming Language Pragmatics, by Michael Scott.
Résolution de l’ex 1 p40 t=c(2:12);N=c(55,90,135,245,403,665,1100,1810,3000,4450,7350) T=data.frame(t,N,y=log(N));T; > T t N y
Generalized regression techniques
CS101 Introduction to Computing Lecture 19 Programming Languages
A451 Theory – 7 Programming 7A, B - Algorithms.
Basics of Group Analysis
Correlation and regression
Statistics for the Social Sciences
Chapter 5 Introduction to Factorial Designs
Introduction to Computer Programming
Correlation and Regression
Everyone thinks they know this stuff
Python for Scientific Computing
Data Analytics – ITWS-4600/ITWS-6600/MATP-4450
CHAPTER 29: Multiple Regression*
© University of Wisconsin, CS559 Spring 2004
Chapter 12 Inference on the Least-squares Regression Line; ANOVA
Console Editeur : myProg.R 1
Welcome to the class! set.seed(843) df <- tibble::data_frame(
Model Comparison: some basic concepts
5.4 General Linear Least-Squares
ITWS-4600/ITWS-6600/MATP-4450/CSCI-4960
Quadrat sampling Quadrat shape Quadrat size Lab Regression and ANCOVA
ICT Programming Lesson 1:
One-Factor Experiments
Essentials of Statistics for Business and Economics (8e)
Correlation and Simple Linear Regression
Presentation transcript:

John D. Cook M. D. Anderson Cancer Center www.JohnDCook.com Why and how people use R John D. Cook M. D. Anderson Cancer Center www.JohnDCook.com

What is R? Open source statistical language De facto standard for statistical research Grew out of Bell Labs’ S (1976, 1988) Influenced by Scheme, Fortran Quirky, flawed, and an enormous success Fortran: unit offset vectors, column-major storage, cryptic names. As Dennis Ritchie said about C.

R in data analysis Languages used in Kaggle.com data analysis competition 2011 Source: http://r4stats.com/popularity

R in bioinformatics (2012) Bioinformatics means a lot of different things. Imagine that those using Perl are not doing statistics. http://bioinfsurvey.org/analysis/programming_languages/

So what is using R like?

"Using R is a bit akin to smoking. The beginning is difficult, one may get headaches and even gag the first few times. But in the long run, it becomes pleasurable and even addictive. Yet, deep down, for those willing to be honest, there is something not fully healthy in it.“ -- Francois Pinard

“… R has a unique and somewhat prickly syntax and tends to have a steeper learning curve than other languages.” Drew Conway John Myles White

The R Inferno by Patrick Burns http://www.burns-stat.com/ pages/Tutor/R_inferno.pdf

What about speed? Refer to Julia slides

Maybe 100x slower than C++, though it varies greatly.

What about tool support?

Limited compared to, for example, first release of Visual Studio (1995).

So why do statisticians use R? “The best thing about R is that it was written by statisticians. The worst thing about R ...” Bo Cowgill, Google

R is a DSL To understand a DSL, start with D, not L. The alternative to R isn’t Python or C#, it’s SAS. People love their DSL, and will use it outside of its domain.

What are statisticians like? Different priorities than software developers Different priorities than mathematicians Learn bits of R in parallel with statistics

Why a statistical DSL? Statistical functions easily accessible Convenient manipulation of tables Vector operations Smooth handling of missing data Patterns for common tasks Bruce Payette: It’s a shell!

Some advantages of R Batteries included, one namespace Contrast Python + matplotlib + SciPy + IPython Designed for interactive data analysis Easier to program than, e.g., SAS Open source, interpreted, portable Succinct notation for querying and filtering Succinct notation for linear regression Open source: i.e. can avoid purchasing bureaucracy, students and colleagues in poor areas can use same software

Examples Set all NA elements of x to 0. x[ is.na(x) ] <- 10 Fit a linear regression model to w as a function of x, y, and z, including a constant term and all first order interaction terms except xz. model ~ lm(w ~ (x + y + z)^2 – x:z)

Simple regression growth tannin 12 0 10 1 8 2 11 3 6 4 7 5 2 6 3 7 3 8

Regression example > data <- read.table("regression.txt", header=T) > attach(data) > names(data) [1] "growth" "tannin" > model <- lm( growth ~ tannin ) > summary(model) ... Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 11.7556 1.0408 11.295 9.54e-06 *** tannin -1.2167 0.2186 -5.565 0.000846 *** Residual standard error: 1.693 on 7 degrees of freedom Multiple R-squared: 0.8157, Adjusted R-squared: 0.7893 F-statistic: 30.97 on 1 and 7 DF, p-value: 0.0008461

Recycling Strange, dangerous, or convenient, depending on your perspective If you add two vectors of different lengths, the shorter will be recycled to be as long as the longer vector. Example: > 1:6 + c(10, 100) 11 102 13 104 15 106

Recycling cont. > x <- matrix(1:6, ncol = 2) > x [,1] [,2] [1,] 1 4 [2,] 2 5 [3,] 3 6 > x + c(10, 100) [1,] 11 104 [2,] 102 15 [3,] 13 106

Lessons from R for language designers Users may have very different priorities than computer scientists. Users will put up with a lot of crap to get their work done. Users will a familiar tool over a better tool if at all feasible.

Resources http://www.r-project.org/ http://www.johndcook.com/ R_language_for_programmers.html “The Art of R Programming” by Normal Matloff @RLangTip