1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2b, February 6, 2015 Lab exercises: beginning to work with data: filtering, distributions, populations,


Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 2b, February 6, 2015
Lab exercises: beginning to work with data: filtering, distributions, populations, testing and models.

The DATUM project
This class is included in the DATUM project for assessment and evaluation.
Here’s what that means…

Assignment 1 – discuss
Review/critique (business case, area of application, approach/methods, tools used, results, actions, benefits).
What did you choose, why, and what are your review comments?

Today - lab
– Exploring data and their distributions
– Fitting distributions, joint and conditional
– Constructing models and testing their fitness

Objectives for today
Get familiar with:
– R
– Distributions
– Populations
– Fitting
– Filtering
– Testing

Table: Matlab/R/scipy-numpy

Files
EPI_data.xls – with missing values changed to suit your application (EPI2010_all countries or EPI2010_onlyEPIcountries tabs)
Get the data read in as you did last week (e.g., I will use “EPI_data” for the object in R):
> EPI_data <- read.csv("<path>/2010EPI_data.csv")
> View(EPI_data)
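A minimal sketch of reading a CSV while mapping a missing-value marker to NA. It parses a small inline table rather than the real file; the column values and the "-9999" marker are invented for illustration and are not taken from EPI_data.

```r
# Hypothetical stand-in for the EPI file: three rows, two columns,
# with "-9999" used as an assumed missing-value marker.
csv_text <- "Country,EPI\nA,36.3\nB,-9999\nC,71.4"

# read.csv() accepts inline text; na.strings maps the marker to NA on input.
toy <- read.csv(text = csv_text, na.strings = c("NA", "-9999"))

str(toy)               # inspect the column types
sum(is.na(toy$EPI))    # number of missing EPI values
```

With a real file, the same na.strings argument can be passed alongside the file path.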

Tips (in R)
> attach(EPI_data) # sets the ‘default’ object
> fix(EPI_data) # launches a simple data editor
> EPI # prints out values of EPI_data$EPI
[1] NA NA 36.3 NA 71.4 NA NA NA
[18] NA NA NA
[35] NA 76.8 NA NA NA
[52] NA NA NA 78.2
[69] NA NA NA 44.4 NA NA NA 54.0 NA
[86] NA 59.2 NA NA
[103] NA NA NA 50.1 NA NA 63.7
[120] NA NA 65.6 NA NA
[137] NA NA NA NA 59.3 NA 37.6 NA NA 66.4
[154] NA NA NA NA
[171] NA NA 48.9 NA NA NA
[188] NA NA NA NA 64.6 NA NA
[205] 38.4 NA NA NA NA NA NA 62.9
[222] NA NA 59.0 NA NA NA
> tf <- is.na(EPI) # TRUE where the value is NA
> E <- EPI[!tf] # filters out NA values, new array
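The NA filter above can be tried on a small vector first. This is only a sketch: the vector `epi` is a made-up stand-in for EPI_data$EPI.

```r
# Toy vector standing in for EPI_data$EPI (values are illustrative).
epi <- c(NA, 36.3, NA, 71.4, 54.0, NA)

tf <- is.na(epi)      # logical mask: TRUE where a value is missing
E  <- epi[!tf]        # keep only the non-missing values

length(epi)           # 6 values in total
sum(tf)               # 3 of them are NA
E                     # 36.3 71.4 54.0

# na.omit() drops the same values in one step (it also attaches an
# attribute recording which positions were removed).
na.omit(epi)
```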

Exercise 1: exploring the distribution
> summary(EPI) # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
> fivenum(EPI,na.rm=TRUE)
[1]
> stem(EPI) # stem and leaf plot
> hist(EPI)
> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI,na.rm=TRUE,bw=1.)) # or try bw="SJ"
> rug(EPI)
Use help(), e.g.
> help(stem)
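As a rough illustration of what these summaries return, the sketch below runs them on simulated data (not the real EPI values) and inspects the histogram and density objects without drawing them. The mean, standard deviation and number of NAs are arbitrary assumptions.

```r
set.seed(1)                                  # reproducible toy data
# Simulated scores with a few NAs; not the real EPI values.
x <- c(rnorm(150, mean = 58, sd = 12), rep(NA, 10))

summary(x)                                   # includes an NA count
fivenum(x, na.rm = TRUE)                     # min, lower hinge, median, upper hinge, max

# Build the histogram object without drawing it, to inspect the bins.
h <- hist(x, breaks = 20, plot = FALSE)
sum(h$counts)                                # NAs are dropped: 150 values binned

# Kernel density estimate; bw can be a number or a rule such as "SJ".
d <- density(x, na.rm = TRUE, bw = "SJ")
d$bw                                         # bandwidth chosen by the rule
```

Dropping plot = FALSE and calling lines(d) afterwards overlays the density on the histogram, provided the histogram was drawn with prob=TRUE.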

Save your plots, name them. Save the commands you used to generate them.

Exercise 1: fitting a distribution beyond histograms
Cumulative distribution function?
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
Quantile-Quantile?
> par(pty="s")
> qqnorm(EPI); qqline(EPI)
Simulated data from t-distribution:
> x <- rt(250, df = 5)
> qqnorm(x); qqline(x)
Make a Q-Q plot against the generating distribution by:
> x<-seq(30,95,1)
> qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
> qqline(x)
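One way to see what the ECDF and Q-Q plots are built from is to compute their coordinates without drawing. The sketch below does this for a simulated t-distributed sample; using the correlation of the paired quantiles as a rough straightness measure is an illustrative choice here, not something prescribed by the lab.

```r
set.seed(2)
x <- rt(250, df = 5)                 # simulated heavy-tailed sample

Fn <- ecdf(x)                        # empirical CDF as a step function
Fn(0)                                # fraction of the sample <= 0
Fn(max(x))                           # equals 1 at the largest observation

# Q-Q coordinates without drawing: theoretical vs. sample quantiles.
qq_norm <- qqnorm(x, plot.it = FALSE)
qq_t    <- qqplot(qt(ppoints(250), df = 5), x, plot.it = FALSE)

# Correlation of the paired quantiles: values nearer 1 mean a straighter line.
cor(qq_norm$x, qq_norm$y)
cor(qq_t$x, qq_t$y)
```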

Exercise 1: fitting a distribution
Your exercise: do the same exploration and fitting for another 2 variables in the EPI_data, e.g. primary variables (DALY, WATER_H, …)
Try fitting other distributions – i.e. as ecdf or qq-

Comparing distributions
> boxplot(EPI,DALY)
> t.test(EPI,DALY)
Welch Two Sample t-test
data: EPI and DALY
t = , df = , p-value =
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean of x mean of y
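The fields printed by t.test() can also be read from the returned object. A minimal sketch on two simulated samples follows; the sample sizes, means and spreads are arbitrary and do not reproduce the EPI or DALY results.

```r
set.seed(3)
# Two simulated samples with different means and spreads (illustrative only).
a <- rnorm(160, mean = 58, sd = 12)
b <- rnorm(190, mean = 53, sd = 25)

res <- t.test(a, b)          # var.equal = FALSE by default, i.e. Welch's test

res$statistic                # t value
res$parameter                # Welch-Satterthwaite degrees of freedom (non-integer)
res$p.value                  # two-sided p-value
res$conf.int                 # 95% CI for the difference in means
res$estimate                 # the two sample means
```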

> qqplot(EPI,DALY)

But there is more
Your exercise – intercompare: EPI, ENVHEALTH, ECOSYSTEM, DALY, AIR_H, WATER_H, AIR_E, WATER_E, BIODIVERSITY ** (subject to possible filtering…)

Exercise 2: filtering (populations)
Conditional filtering:
> EPILand<-EPI[!Landlock]
> ELand <- EPILand[!is.na(EPILand)]
> hist(ELand)
> hist(ELand, seq(30., 95., 1.0), prob=TRUE)
Repeat exercise 1…
Also look at: No_surface_water, Desert and High_Population_Density
Your exercise: how to filter on EPI_regions or GEO_subregion?
E.g. EPI_South_Asia <- EPI[…]
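Conditional filtering can be sketched on a toy data frame. The column names mirror the slides, but every value is invented, and the region label "Europe" is an assumption about how regions might be coded rather than a value confirmed from the data.

```r
# Toy data frame; column names mirror the slides, values are invented.
df <- data.frame(
  EPI         = c(36.3, NA, 71.4, 54.0, 59.2, NA, 63.7),
  Landlock    = c(1, 0, 0, 1, 0, 0, 0),
  EPI_regions = c("South Asia", "Europe", "Europe", "Europe",
                  "Europe", "South Asia", "South Asia"),
  stringsAsFactors = FALSE
)

# Not landlocked: Landlock is 0/1, so !Landlock is TRUE where it is 0.
EPILand <- df$EPI[!df$Landlock]
ELand   <- EPILand[!is.na(EPILand)]
ELand

# One possible way to filter on a text column: compare against a label
# and drop the NAs in the same logical index.
EPI_Europe <- df$EPI[df$EPI_regions == "Europe" & !is.na(df$EPI)]
EPI_Europe
```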

Exercise 3: testing the fits
> shapiro.test(EPI)
> ks.test(EPI,seq(30.,95.,1.0))
You should be thinking about how to interpret the results (we covered some of this earlier this week and will cover more in the next few classes).
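To get a feel for how these tests behave, it can help to run them on samples whose distribution is known. The sketch below uses simulated data; the one-sample form of ks.test() against "pnorm" is shown as an alternative to comparing against a grid of values, and its parameters are the ones used in the simulation.

```r
set.seed(4)
x <- rnorm(100, mean = 58, sd = 12)     # simulated, roughly normal sample

sw <- shapiro.test(x)                   # H0: the sample comes from a normal distribution
sw$p.value                              # a small p-value is evidence against normality

# One-sample KS test against a fully specified normal distribution.
# If mean and sd were estimated from x instead, the reported p-value
# would no longer be exact.
ks1 <- ks.test(x, "pnorm", mean = 58, sd = 12)
ks1$statistic                           # D: largest gap between the ECDF and the normal CDF
ks1$p.value

# A clearly non-normal sample for contrast.
y <- rexp(100, rate = 1)
shapiro.test(y)$p.value
```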

Exercise 4: joint distributions
> t.test(EPI,DALY)
Welch Two Sample t-test
data: EPI and DALY
t = , df = , p-value =
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean of x mean of y
> var.test(EPI,DALY)
F test to compare two variances
data: EPI and DALY
F = , num df = 162, denom df = 191, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
sample estimates:
ratio of variances
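The pieces of the var.test() output can be checked on simulated samples. In this sketch the sample sizes are chosen only so that the degrees of freedom match those shown above; the means and spreads are arbitrary. Note that the F test assumes both samples are normally distributed.

```r
set.seed(5)
a <- rnorm(163, mean = 58, sd = 12)   # sample sizes chosen so that the degrees
b <- rnorm(192, mean = 53, sd = 25)   # of freedom come out as 162 and 191

vt <- var.test(a, b)                  # H0: ratio of population variances equals 1

vt$estimate                           # ratio of sample variances, var(a) / var(b)
vt$parameter                          # numerator and denominator df: n - 1 each
vt$p.value
vt$conf.int                           # 95% CI for the variance ratio

# The F statistic is just the ratio of the two sample variances.
all.equal(unname(vt$statistic), var(a) / var(b))
```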

Kolmogorov-Smirnov (KS) test
> ks.test(EPI,DALY)
Two-sample Kolmogorov-Smirnov test
data: EPI and DALY
D = , p-value =
alternative hypothesis: two-sided
Warning message:
In ks.test(EPI, DALY) : p-value will be approximate in the presence of ties
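The two-sample D statistic is the largest vertical gap between the two empirical CDFs, which can be verified by hand. A minimal sketch on simulated samples (arbitrary sizes and parameters, no ties):

```r
set.seed(6)
a <- rnorm(120, mean = 58, sd = 12)
b <- rnorm(150, mean = 53, sd = 25)

ks2 <- ks.test(a, b)                  # H0: both samples come from the same distribution
ks2$statistic                         # D: largest vertical gap between the two ECDFs
ks2$p.value

# D can be reproduced by evaluating both ECDFs at every observed value.
grid <- sort(c(a, b))
D_manual <- max(abs(ecdf(a)(grid) - ecdf(b)(grid)))
all.equal(unname(ks2$statistic), D_manual)

# Rounded data contain ties, the situation the warning above refers to.
suppressWarnings(ks.test(round(a, 1), round(b, 1)))$p.value
```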

Objectives for today
Get familiar with:
– Distributions
– Populations
– Fitting
– Filtering
– Testing