1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 2, 2016, LALLY 102 Data and Information Resources, Role of Hypothesis, Exploration and.

Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 2, 2016, LALLY 102 Data and Information Resources, Role of Hypothesis, Exploration and Distributions

Contents
Data sources
– Cyber
– Human
“Munging”
Exploring
– Distributions…
– Summaries
– Visualization
Testing and evaluating the results (beginning)

Lower layers in the Analytics Stack

“Cyber Data” …

“Human Data” …

Data Prepared for Analysis = Munging
Missing values, null values, etc.
E.g. in the EPI_data – they use “--”
– Most data applications provide built-ins for these higher-order functions – in R, “NA” is used, and functions such as is.na(var) provide powerful filtering options (we’ll cover these on Friday)
Of course, different variables are often missing “different” values
In R – higher-order functions such as Reduce, Filter, Map, Find, Position and Negate will become your enemies and then your friends: 3/higher-order-functions-in-r/
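A minimal sketch of the munging steps named on this slide. The four-row table is a made-up stand-in for EPI_data (the country labels and values are hypothetical); only the “--” missing-value marker, is.na and the higher-order function names come from the slide.

```r
# Hypothetical stand-in for a few rows of an EPI-style table; "--" marks missing values.
raw <- "country,EPI,DALY
A,61.2,55.0
B,--,72.4
C,48.9,--
D,70.3,88.1"

# na.strings tells read.csv to convert "--" to NA on input.
epi_demo <- read.csv(text = raw, na.strings = "--")

# is.na() gives a logical mask; negate it to keep only the observed values.
epi_ok <- epi_demo$EPI[!is.na(epi_demo$EPI)]

# The same filter written with the higher-order functions Filter() and Negate().
epi_ok2 <- Filter(Negate(is.na), epi_demo$EPI)

# Map() applies a function to each column; Reduce() folds the per-column counts into a total.
na_per_col <- Map(function(col) sum(is.na(col)), epi_demo)
na_total <- Reduce(`+`, na_per_col)
```

Note that each variable is missing in a different row here, which is why filtering per variable (rather than dropping whole rows) is often the first choice.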

Getting started – summarize data
Summary statistics
– Ranges, “hinges”
– Tukey’s five numbers
Look for a distribution match
Tests…for…
– Normality – Shapiro-Wilk – returns a statistic (W!) and a p-value – what is the null hypothesis here?
> shapiro.test(EPI_data$EPI)
Shapiro-Wilk normality test
data: EPI_data$EPI
W = , p-value =
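A small sketch of what shapiro.test returns, using simulated samples rather than EPI_data (the sample sizes, means and seed are assumptions for illustration). The null hypothesis of the Shapiro-Wilk test is that the sample was drawn from a normal distribution.

```r
set.seed(1)
x_norm <- rnorm(200, mean = 60, sd = 10)  # simulated, normal by construction
x_skew <- rexp(200, rate = 1 / 60)        # simulated, strongly right-skewed

res_norm <- shapiro.test(x_norm)
res_skew <- shapiro.test(x_skew)

# The result is a list of class "htest"; the pieces can be pulled out by name.
res_norm$statistic  # W
res_norm$p.value
res_skew$statistic
res_skew$p.value
```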

Accept or Reject?
Reject the null hypothesis if the p-value is less than the level of significance.
You fail to reject the null hypothesis if the p-value is greater than or equal to the level of significance.
Typical significance level: 0.05 (!)

Another variable in EPI
> shapiro.test(EPI_data$DALY)
Shapiro-Wilk normality test
data: EPI_data$DALY
W = , p-value = 1.891e-07
Accept or reject?
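The decision rule from the previous slide can be written as a one-line helper and applied to the p-value reported above. The function name decide is hypothetical; the 0.05 level is the typical value quoted on the slide.

```r
# Hypothetical helper: apply the reject / fail-to-reject rule at level alpha.
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject H0" else "fail to reject H0"
}

decide(1.891e-07)  # the DALY p-value from the slide
decide(0.05)       # p equal to alpha does not reject
```

For DALY the p-value is far below 0.05, so the hypothesis of normality is rejected.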

Distribution tests
Binomial, … most distributions have tests
Wilcoxon (Mann-Whitney)
– Compares two populations, rather than a sample against a distribution
Kolmogorov-Smirnov (KS)
…
It got out of control when people realized they could name the test after themselves, v. someone else…

Getting started – look at the data
Visually
– What is the improvement in the understanding of the data as compared to the situation without visualization?
– Which visualization techniques are suitable for one's data?
Scatter plot diagrams
Box plots (min, 1st quartile, median, 3rd quartile, max)
Stem and leaf plots
Frequency plots
Grouped frequency distribution plots
Cumulative frequency plots
Distribution plots

Why visualization?
Reducing amount of data, quantization
Patterns
Features
Events
Trends
Irregularities
Leading to presentation of data, i.e. information products
Exit points for analysis

Exploring the distribution
> summary(EPI) # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
> boxplot(EPI)
> fivenum(EPI,na.rm=TRUE)
[1]
Tukey: min, lower hinge, median, upper hinge, max
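Since the numeric output did not survive in this transcript, here is a sketch of the same calls on simulated data. The vector epi_sim is a hypothetical stand-in for EPI, and the labels attached to the five numbers follow Tukey's ordering quoted on the slide.

```r
set.seed(42)
# Hypothetical stand-in for the EPI column: simulated scores plus three NAs.
epi_sim <- c(rnorm(150, mean = 58, sd = 12), NA, NA, NA)

summary(epi_sim)  # min, quartiles, median, mean, max and the NA count

fn <- fivenum(epi_sim, na.rm = TRUE)
names(fn) <- c("min", "lower hinge", "median", "upper hinge", "max")
fn

# boxplot.stats() exposes the numbers that boxplot() draws, without plotting.
bs <- boxplot.stats(epi_sim)
bs$stats  # lower whisker end, lower hinge, median, upper hinge, upper whisker end
bs$out    # points beyond the whiskers, if any
```

The hinges are close to, but not always identical to, the quartiles printed by summary, because the two are computed by slightly different rules.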

Stem and leaf plot
> stem(EPI) # like a histogram
The decimal point is 1 digit(s) to the right of the |
Note that the scale of the stem is 10, so read the output carefully.

Grouped Frequency Distribution aka binning
> hist(EPI) # defaults
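A sketch of what binning does under the hood, on simulated data standing in for EPI (an assumption). With plot = FALSE, hist returns the break points and the counts per bin instead of drawing them, which makes the effect of the bin width easy to inspect.

```r
set.seed(42)
epi_sim <- rnorm(150, mean = 58, sd = 12)  # hypothetical stand-in for EPI

# Default binning (Sturges' rule, rounded to "pretty" break points).
h_default <- hist(epi_sim, plot = FALSE)

# Explicit bins of width 2 that are guaranteed to span the data.
brk <- seq(floor(min(epi_sim)), ceiling(max(epi_sim)) + 2, by = 2)
h_narrow <- hist(epi_sim, breaks = brk, plot = FALSE)

h_default$breaks
h_default$counts
length(h_narrow$counts)  # many more, much noisier, bins
```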

Distributions
Shape
Character
Parameter(s)
Which one fits?

> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI,na.rm=TRUE,bw=1.))
> rug(EPI)
or
> lines(density(EPI,na.rm=TRUE,bw="SJ"))

> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI,na.rm=TRUE,bw="SJ"))

Why are histograms so unsatisfying?

> xn<-seq(30,95,1)
> qn<-dnorm(xn,mean=63, sd=5,log=FALSE)
> lines(xn,qn)
> lines(xn,.4*qn)
> ln<-dnorm(xn,mean=44, sd=5,log=FALSE)
> lines(xn,.26*ln)
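The overlay above scales two normal curves by eye. One possible way to formalize the same idea is a two-component mixture whose weights sum to 1, sketched below. The means and standard deviations are taken from the slide; the weights 0.6 and 0.4 are assumptions chosen only for illustration and are not the slide's scale factors.

```r
xn <- seq(30, 95, 1)

# Hypothetical mixture weights; they must sum to 1 for the result to be a density.
w <- c(0.6, 0.4)

mix_density <- function(x) {
  w[1] * dnorm(x, mean = 63, sd = 5) + w[2] * dnorm(x, mean = 44, sd = 5)
}

mix <- mix_density(xn)  # values that could be passed to lines(xn, mix)

# Numerical check that the mixture integrates to about 1 (fine-grid Riemann sum).
x_fine <- seq(0, 120, by = 0.1)
area <- sum(mix_density(x_fine)) * 0.1
area
```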

Exploring the distribution
> summary(DALY) # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
> fivenum(DALY,na.rm=TRUE)
[1]
[Figure: EPI and DALY]

Stem and leaf plot
> stem(DALY)
The decimal point is 1 digit(s) to the right of the |

Beyond histograms
Cumulative distribution function: probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x.
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
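A short sketch of the empirical CDF as a function, using a simulated sample (an assumption) in place of EPI. ecdf returns a function Fn, and Fn(q) is simply the fraction of observations less than or equal to q.

```r
set.seed(7)
x <- rnorm(100, mean = 58, sd = 12)  # hypothetical stand-in for EPI

Fn <- ecdf(x)  # Fn is itself a function

q <- 60
Fn(q)           # empirical P(X <= 60)
mean(x <= q)    # the same fraction computed by hand

Fn(min(x) - 1)  # below every observation
Fn(max(x))      # at the largest observation
Fn(median(x))   # half of the sample lies at or below the median
```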

Beyond histograms
Quantile ~ inverse cumulative distribution function – points taken at regular intervals from the CDF, e.g. 2-quantiles=median, 4-quantiles=quartiles
Quantile-Quantile (versus default=normal dist.)
> par(pty="s")
> qqnorm(EPI); qqline(EPI)
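A sketch of the quantile idea on a simulated sample (an assumption; EPI itself is not used). It checks that the 2-quantile is the median and pulls out the point pairs that qqnorm would plot, using plot.it = FALSE so nothing is drawn.

```r
set.seed(7)
x <- rnorm(100, mean = 58, sd = 12)  # hypothetical stand-in for EPI

q2 <- quantile(x, probs = 0.5)                # 2-quantile
q4 <- quantile(x, probs = c(0.25, 0.5, 0.75)) # 4-quantiles (quartiles)
q2
q4

# qqnorm() pairs theoretical normal quantiles (x) with the sample values (y).
qq <- qqnorm(x, plot.it = FALSE)
r <- cor(qq$x, qq$y)  # close to 1 when the sample is roughly normal
r
```

A Q-Q plot that bends away from the reference line in the tails is usually the first visual hint that the normal assumption is doubtful.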

Beyond histograms
Simulated data from t-distribution (random):
> x <- rt(250, df = 5)
> qqnorm(x); qqline(x)

Beyond histograms
Q-Q plot against the generating distribution:
> x <- seq(30,95,1)
> qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
> qqline(x)
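A sketch of a Q-Q comparison against the generating t distribution, under the assumption that the sample of interest is the simulated rt(250, df = 5) draw from the previous slide (the variable names y and theo are hypothetical). With plot.it = FALSE, qqplot returns the sorted pairs instead of drawing them.

```r
set.seed(3)
y <- rt(250, df = 5)              # simulated sample from a t distribution
theo <- qt(ppoints(250), df = 5)  # matching theoretical t quantiles

qq <- qqplot(theo, y, plot.it = FALSE)
head(cbind(theoretical = qq$x, sample = qq$y))
```

Dropping plot.it = FALSE draws the plot. By default qqline uses normal quantiles for its reference line; a line matched to the t distribution can be requested with qqline(y, distribution = function(p) qt(p, df = 5)).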

But if you are not sure it is normal
> wilcox.test(EPI,DALY)
Wilcoxon rank sum test with continuity correction
data: EPI and DALY
W = 15970, p-value =
alternative hypothesis: true location shift is not equal to 0
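A sketch of what the W statistic measures, using two simulated samples (assumptions, not EPI and DALY). For data without ties, W is the number of pairs in which a value from the first sample exceeds a value from the second.

```r
set.seed(11)
a <- rnorm(80, mean = 58, sd = 12)  # hypothetical first population
b <- rnorm(80, mean = 70, sd = 12)  # hypothetical second population, shifted upward

wt <- wilcox.test(a, b)
wt$statistic  # W
wt$p.value

# W by hand: count the (a_i, b_j) pairs with a_i > b_j.
w_manual <- sum(outer(a, b, ">"))
w_manual
```

Because the test works on ranks, it makes no assumption that either sample is normally distributed.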

Comparing the CDFs
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
> plot(ecdf(DALY), do.points=FALSE, verticals=TRUE, add=TRUE)
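Comparing two empirical CDFs by eye has a numerical counterpart in the two-sample Kolmogorov-Smirnov test, whose statistic D is the largest vertical gap between the two curves. The sketch below uses simulated samples (assumptions) and computes D both by hand and with ks.test.

```r
set.seed(11)
a <- rnorm(80, mean = 58, sd = 12)  # hypothetical stand-in for one variable
b <- rnorm(80, mean = 70, sd = 12)  # hypothetical stand-in for the other

Fa <- ecdf(a)
Fb <- ecdf(b)

# The largest gap can only occur at an observed data point.
grid <- sort(c(a, b))
d_manual <- max(abs(Fa(grid) - Fb(grid)))

ks <- ks.test(a, b)
ks$statistic  # D
ks$p.value
d_manual
```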

More munging
Bad values, outliers, corrupted entries, thresholds …
Noise reduction – low-pass filtering, binning
Modal filtering
REMEMBER: when you munge you MUST record what you did (and why) and save copies of the data pre- and post-operation…
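One possible way to follow the REMEMBER rule in practice is sketched below. The data, the validity thresholds 0 and 100, the log layout and the file names are all hypothetical; the point is only that the untouched copy, the filtered copy and a note of what was done are kept together.

```r
set.seed(5)
# Hypothetical raw column with two corrupted entries appended.
raw <- c(rnorm(50, mean = 58, sd = 8), 999, -1)

lo <- 0    # hypothetical lower validity threshold
hi <- 100  # hypothetical upper validity threshold

pre <- raw                     # untouched copy, never modified
keep <- raw >= lo & raw <= hi  # threshold filter
post <- raw[keep]

# Record what was done and why, next to the data.
munge_log <- data.frame(
  step = "range filter",
  rule = sprintf("keep values in [%g, %g]", lo, hi),
  reason = "values outside the valid score range are treated as corrupted",
  n_before = length(pre),
  n_after = length(post)
)

# Save both versions (written to a temporary directory in this sketch).
pre_file <- file.path(tempdir(), "epi_pre.rds")
post_file <- file.path(tempdir(), "epi_post.rds")
saveRDS(pre, pre_file)
saveRDS(post, post_file)
munge_log
```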

Populations within populations
In the EPI example:
– Geographic regions (GEO_subregion)
– EPI_regions
– Eco-regions (EDC v. LEDC – know what that is?)
– Primary industry(ies)
– Climate region
What would you do to start exploring?
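A sketch of one way to start: split a variable by a grouping column and summarize each sub-population separately. The data frame below is simulated; the column name EPI_regions is taken from the slide, while the region labels and values are hypothetical.

```r
set.seed(9)
# Hypothetical data: three made-up regions with different typical scores.
demo <- data.frame(
  EPI_regions = rep(c("Region A", "Region B", "Region C"), each = 20),
  EPI = c(rnorm(20, 70, 5), rnorm(20, 55, 5), rnorm(20, 45, 5))
)

# Median per region.
med_by_region <- tapply(demo$EPI, demo$EPI_regions, median)
med_by_region

# Tukey's five numbers per region, one column per sub-population.
by_region <- split(demo$EPI, demo$EPI_regions)
five_by_region <- sapply(by_region, fivenum)
five_by_region
```

The same grouping can be passed to a plot, for example boxplot(EPI ~ EPI_regions, data = demo), to compare the sub-populations side by side.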

Or, a twist – n=1 but many attributes? The item of interest in relation to its attributes

Summary: explore
Going from preliminary to initial analysis…
Determining whether one or more common distributions are involved – i.e. parametric statistics (assumes or asserts a probability distribution)
Fitting that distribution -> provides a model!
Or NOT
– A hybrid or
– Non-parametric (statistics) approaches are needed – more on this to come

Goodness of fit
And we cannot take the models at face value; we must assess how well they fit:
– Chi-Square
– One-sided and two-sided Kolmogorov-Smirnov tests
– Lilliefors tests
– Ansari-Bradley tests
– Jarque-Bera tests
Just a preview…
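As a preview of the first item, here is a minimal chi-square goodness-of-fit sketch. It assumes simulated data and a fully specified candidate model, Normal(60, 10), with hand-picked bin edges; all of these are illustrative choices. When the parameters are estimated from the same data, the degrees of freedom need adjusting, which this sketch does not do.

```r
set.seed(21)
x <- rnorm(500, mean = 60, sd = 10)  # simulated sample

# Hypothetical bin edges; the open-ended outer bins catch the tails.
edges <- c(-Inf, 40, 50, 60, 70, 80, Inf)

# Observed counts per bin.
obs <- as.vector(table(cut(x, breaks = edges)))

# Probability of each bin under the candidate model Normal(60, 10).
p_exp <- diff(pnorm(edges, mean = 60, sd = 10))

gof <- chisq.test(x = obs, p = p_exp)
gof$statistic
gof$parameter  # degrees of freedom: number of bins minus 1
gof$p.value
```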

Summary
Cyber and Human data; quality, uncertainty and bias – you will often spend a lot of time with the data
Distributions – the common and not-so-common ones, and how cyber and human data can have distinct distributions
How simple statistical distributions can mislead us
Populations and samples, and how inferential statistics will lead us to model choices (no, we have not actually done that yet in detail)
Munging toward exploratory analysis
Toward models!

How are the software installs going?
R
Data exercises?
– You can try some of the examples from today on the EPI dataset
More on Friday… and other datasets.