1 Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 2b, February 5, 2016 Lab exercises: beginning to work with data: filtering, distributions, populations,

Presentation transcript:

Peter Fox
Data Analytics – ITWS-4600/ITWS-6600
Week 2b, February 5, 2016
Lab exercises: beginning to work with data: filtering, distributions, populations, testing and models.

Assignment 1 – discuss.
Review/critique (business case, area of application, approach/methods, tools used, results, actions, benefits).
What did you choose, why, and what are your review comments?

Today - lab
–Exploring data and their distributions
–Fitting distributions, joint and conditional
–Constructing models and testing their fitness

Objectives for today
Get familiar with:
–R
–Distributions
–Populations
–Fitting
–Filtering
–Testing

Table: Matlab/R/scipy-numpy

Gnu R - load this first: http://lib.stat.cmu.edu/R/CRAN/
R Studio – (see R-intro.html too) (desktop version)
–Manuals
–Libraries – at the command line – library(), or select the packages tab, and check/uncheck as needed
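A minimal sketch of the command-line route for libraries. The `stats` package is used only because it ships with base R; the choice of package is an assumption for illustration.

```r
# Load a package by name; 'stats' ships with every R installation.
library(stats)

# search() lists the attached packages and environments on the search path.
attached <- search()
print(attached)

# requireNamespace() checks whether a package is installed without attaching it.
has_stats <- requireNamespace("stats", quietly = TRUE)
print(has_stats)
```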

Files
This is where the files for assignments and exercises are:
–data and code fragments…

Exercises – getting data in
Rstudio
–Read in csv file (two ways to do this) - GPW3_GRUMP_SummaryInformation_2010.csv
–Read in excel file (directly or by csv convert) - EPI_data.xls (EPI2010_all countries or EPI2010_onlyEPIcountries tabs)
–Plot some variables
–Commonalities among them?
Also for other datasets, enter these in the R command window pane or cmd line:
> data()
> help(data)
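The "two ways" of reading a csv file might look like the sketch below. It writes a tiny stand-in file to a temporary location so it runs anywhere; the column names and values are hypothetical and are not taken from the GPW3_GRUMP file.

```r
# Write a tiny stand-in CSV to a temporary file (hypothetical columns and values).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Country,Population,Landlock",
             "A,1200,0",
             "B,,1",
             "C,530,0"), csv_path)

# Way 1: read.csv() assumes a header row and comma separators.
d1 <- read.csv(csv_path)

# Way 2: read.table() with the same settings spelled out.
d2 <- read.table(csv_path, header = TRUE, sep = ",")

str(d1)                      # column types; the empty numeric field becomes NA
print(all.equal(d1, d2))     # compare the two results
```

For the lab files, replace `csv_path` with the location of the downloaded csv.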

Files
2010EPI_data.xls – with missing values changed to suit your application (EPI2010_all countries or EPI2010_onlyEPIcountries tabs)
See Lab2b_2016_data.R
Get the data read in as you did last week (e.g. use "EPI_data" for the object in R):
> EPI_data <- read.csv("<path>/2010EPI_data.csv")
> # Note: replace default data frame name – cannot start with numbers!
> View(EPI_data)

Tips (in R)
> attach(EPI_data) # sets the 'default' object
> fix(EPI_data) # launches a simple data editor
> EPI # prints out values EPI_data$EPI
[1] NA NA 36.3 NA 71.4 NA NA NA
[18] NA NA NA
[35] NA 76.8 NA NA NA
[52] NA NA NA 78.2
[69] NA NA NA 44.4 NA NA NA 54.0 NA
[86] NA 59.2 NA NA
[103] NA NA NA 50.1 NA NA 63.7
[120] NA NA 65.6 NA NA
[137] NA NA NA NA 59.3 NA 37.6 NA NA 66.4
[154] NA NA NA NA
[171] NA NA 48.9 NA NA NA
[188] NA NA NA NA 64.6 NA NA
[205] 38.4 NA NA NA NA NA NA 62.9
[222] NA NA 59.0 NA NA NA
> tf <- is.na(EPI) # records True values if the value is NA
> E <- EPI[!tf] # filters out NA values, new array
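The NA-filtering idiom can be tried on a small vector first. The sketch below uses a short hypothetical vector in place of EPI_data$EPI.

```r
# Hypothetical scores with missing entries, standing in for EPI_data$EPI.
scores <- c(NA, 36.3, NA, 71.4, 54.0, NA, 59.2)

tf <- is.na(scores)      # TRUE where a value is missing
E  <- scores[!tf]        # keep only the non-missing values

print(sum(tf))           # number of missing values
print(E)
print(mean(E))           # same result as mean(scores, na.rm = TRUE)
```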

Exercise 1: exploring the distribution
> summary(EPI) # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
> fivenum(EPI,na.rm=TRUE)
[1]
> stem(EPI) # stem and leaf plot
> hist(EPI)
> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI,na.rm=TRUE,bw=1.)) # or try bw="SJ"
> rug(EPI)
> # Use help( ), e.g.
> help(stem)
See Lab2b_2016_summary.R
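A sketch of the numeric side of this exploration on simulated data. The sample below is generated and is not the real EPI variable; its mean, spread and number of NAs are assumptions chosen only to resemble scores in this range.

```r
set.seed(1)
# Simulated scores with a few NAs appended (not the real data).
x <- c(rnorm(150, mean = 60, sd = 10), rep(NA, 20))

print(summary(x))                    # includes a count of NA's
five <- fivenum(x, na.rm = TRUE)     # min, lower hinge, median, upper hinge, max
print(five)

# Kernel density estimates with a fixed and a data-driven bandwidth.
d_fixed <- density(x, na.rm = TRUE, bw = 1)
d_sj    <- density(x, na.rm = TRUE, bw = "SJ")
print(c(fixed = d_fixed$bw, SJ = d_sj$bw))
```

Comparing the two bandwidths gives a feel for how much smoothing `bw=1.` applies relative to the "SJ" selector.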

Save your plots, name them. Save the commands you used to generate them.
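One way to save a named plot from the command line is to open a file device, draw, and close it. This is a sketch on simulated data; the file name `epi_hist.pdf` and the temporary directory are assumptions for illustration.

```r
set.seed(2)
x <- rnorm(100, mean = 60, sd = 10)   # simulated stand-in data

# Open a named PDF file, draw into it, then close the device to write the file.
plot_file <- file.path(tempdir(), "epi_hist.pdf")
pdf(plot_file, width = 6, height = 4)
hist(x, main = "Histogram of simulated scores", xlab = "score")
rug(x)
invisible(dev.off())

print(file.exists(plot_file))
```

Keeping these lines in a script file also records the commands that generated the plot.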

Exercise 1: fitting a distribution beyond histograms
Cumulative distribution function?
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
Quantile-Quantile?
> par(pty="s")
> qqnorm(EPI); qqline(EPI)
Simulated data from t-distribution:
> x <- rt(250, df = 5)
> qqnorm(x); qqline(x)
Make a Q-Q plot against the generating distribution by:
> x <- seq(30,95,1)
> qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
> qqline(x)
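The quantities behind these plots can also be inspected directly. The sketch below, on a simulated t-distributed sample, evaluates the empirical CDF as a function and extracts the Q-Q pairs without drawing. Using the correlation of the pairs as a rough straightness check is an illustrative choice, not part of the lab.

```r
set.seed(3)
x <- rt(250, df = 5)                      # simulated t-distributed sample

# Empirical CDF: ecdf() returns a function giving the fraction of data <= q.
Fn <- ecdf(x)
print(Fn(0))                              # fraction of the sample at or below 0

# Theoretical vs. sample quantiles, computed without drawing.
qq <- qqplot(qt(ppoints(length(x)), df = 5), x, plot.it = FALSE)

# If the sample follows the reference distribution, the points lie near a line.
print(cor(qq$x, qq$y))
```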

Exercise 1: fitting a distribution
Your exercise: do the same exploration and fitting for another 2 variables in the EPI_data, i.e. primary variables (DALY, WATER_H, …)
Try fitting other distributions – i.e. as ecdf or qq-

Comparing distributions
> boxplot(EPI,DALY)
> t.test(EPI,DALY)
Welch Two Sample t-test
data: EPI and DALY
t = , df = , p-value =
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean of x mean of y

> qqplot(EPI,DALY)

But there is more
Your exercise – intercompare: EPI, ENVHEALTH, ECOSYSTEM, DALY, AIR_H, WATER_H, AIR_E, WATER_E, BIODIVERSITY ** (subject to possible filtering…)

Distributions
See archive "alldist.zip"
–Excel files for different distributions…
–Expanded in the folder: e.g. lognorm.xls
In R:
> help(distributions)

Exercise 2: filtering (populations)
Conditional filtering:
> EPILand <- EPI[!Landlock]
> ELand <- EPILand[!is.na(EPILand)]
> hist(ELand)
> hist(ELand, seq(30., 95., 1.0), prob=TRUE)
Repeat exercise 1…
Also look at: No_surface_water, Desert and High_Population_Density
Your exercise: how to filter on EPI_regions or GEO_subregion?
E.g. EPI_South_Asia <- EPI[ … ]
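A sketch of both kinds of filter on a hypothetical miniature data frame. The values and the region labels ("South Asia", "Europe") are made up; check the actual labels in EPI_data before adapting this, and treat the region filter as one possible approach to the exercise.

```r
# Hypothetical miniature of the data frame; values and region labels are made up.
df <- data.frame(
  EPI         = c(36.3, NA, 71.4, 54.0, 59.2, NA),
  Landlock    = c(1, 0, 0, 1, 0, 0),
  EPI_regions = c("South Asia", "Europe", "Europe",
                  "South Asia", "South Asia", "Europe")
)

# Conditional filter: keep non-landlocked rows, then drop missing values.
EPILand <- df$EPI[!df$Landlock]          # !0 is TRUE, !1 is FALSE
ELand   <- EPILand[!is.na(EPILand)]
print(ELand)

# Filtering on a region label uses a comparison instead of a 0/1 flag.
EPI_South_Asia <- df$EPI[df$EPI_regions == "South Asia"]
print(EPI_South_Asia)
```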

Exercise 3: testing the fits
> shapiro.test(EPI)
> ks.test(EPI,seq(30.,95.,1.0))
You should be thinking about how to interpret the results (we covered some of this earlier this week and will cover more in the next few classes).
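To get a feel for interpreting these tests, it can help to run them on data whose distribution is known. The sketch below uses a simulated normal sample. The one-sample `ks.test` call against `"pnorm"` is an added variant for comparison and is not on the slide.

```r
set.seed(4)
x <- rnorm(100, mean = 60, sd = 10)       # simulated, normal by construction

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution.
sw <- shapiro.test(x)
print(sw$p.value)

# One-sample KS test against a normal CDF with parameters estimated from the data.
# Estimating the parameters from the same sample makes this p-value only approximate.
ks1 <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
print(ks1$p.value)

# Two-sample form, as on the slide: compare x with an evenly spaced grid of values.
ks2 <- ks.test(x, seq(30, 95, 1))
print(ks2$p.value)
```

A small p-value is evidence against the null hypothesis (normality for Shapiro-Wilk, a common distribution for KS); a large one means only that the test found no evidence against it.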

Exercise 4: joint distributions
> t.test(EPI,DALY)
Welch Two Sample t-test
data: EPI and DALY
t = , df = , p-value =
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean of x mean of y
> var.test(EPI,DALY)
F test to compare two variances
data: EPI and DALY
F = , num df = 162, denom df = 191, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
sample estimates:
ratio of variances
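The pieces of this output can be pulled out of the returned objects by name. The sketch below uses two simulated samples, not EPI or DALY; their sizes, means and spreads are assumptions. Note that the degrees of freedom reported by `var.test` are one less than each sample's count of non-missing values.

```r
set.seed(5)
a <- rnorm(160, mean = 58, sd = 12)       # simulated stand-ins, not EPI or DALY
b <- rnorm(190, mean = 54, sd = 25)

tt <- t.test(a, b)                        # Welch test: does not assume equal variances
vt <- var.test(a, b)                      # F test on the ratio var(a) / var(b)

print(tt$estimate)                        # the two sample means
print(tt$conf.int)                        # 95% CI for the difference in means
print(vt$estimate)                        # ratio of sample variances
print(unname(vt$parameter))               # numerator and denominator df: n - 1 each
```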

Kolmogorov-Smirnov - KS test
> ks.test(EPI,DALY)
Two-sample Kolmogorov-Smirnov test
data: EPI and DALY
D = , p-value =
alternative hypothesis: two-sided
Warning message:
In ks.test(EPI, DALY) : p-value will be approximate in the presence of ties
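The D statistic is the largest vertical gap between the two empirical CDFs, which can be checked by hand. This is a sketch on simulated, rounded data; rounding is used here only to create ties like those in the warning, and whether the warning appears may depend on the R version.

```r
set.seed(6)
# Rounding creates repeated values (ties), as in real scores reported to one decimal.
a <- round(rnorm(150, mean = 60, sd = 10), 1)
b <- round(rnorm(150, mean = 55, sd = 20), 1)

# With ties present, ks.test() may warn that the p-value is approximate.
ks <- suppressWarnings(ks.test(a, b))

print(ks$statistic)                       # D: largest gap between the two ECDFs
print(ks$p.value)

# D can be reproduced directly from the two empirical CDFs.
grid <- sort(c(a, b))
D_manual <- max(abs(ecdf(a)(grid) - ecdf(b)(grid)))
print(D_manual)
```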

GPW3_GRUMP – repeat…
The reading in, then exploration/summary, plots, histograms, distributions, filtering, tests…

water_treatment
water-treatment.csv

Objectives for today
Get familiar with:
–Distributions
–Populations
–Fitting
–Filtering
–Testing