Data Analytics – ITWS-4600/ITWS-6600/MATP-4450

Slides:

Advertisements

Similar presentations

Statistical Techniques I

Advertisements

Biomedical Statistics Testing for Normality and Symmetry Teacher:Jang-Zern Tsai ( 蔡章仁 ) Student: 邱瑋國.

Lecture (11,12) Parameter Estimation of PDF and Fitting a Distribution Function.

Copyright © 2011 Pearson Education, Inc. Statistical Tests Chapter 16.

IB Math Studies – Topic 6 Statistics.

The controlled assessment is worth 25% of the GCSE The project has three stages; 1. Planning 2. Collecting, processing and representing data 3. Interpreting.

Chapter 7: Statistical Applications in Traffic Engineering

PSYC512: Research Methods PSYC512: Research Methods Lecture 8 Brian P. Dyre University of Idaho.

Use of Quantile Functions in Data Analysis. In general, Quantile Functions (sometimes referred to as Inverse Density Functions or Percent Point Functions)

Statistical Methods II

The Chi-square Statistic. Goodness of fit 0 This test is used to decide whether there is any difference between the observed (experimental) value and.

Exploratory Data Analysis. Computing Science, University of Aberdeen2 Introduction Applying data mining (InfoVis as well) techniques requires gaining.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2b, February 6, 2015 Lab exercises: beginning to work with data: filtering, distributions, populations,

The paired sample experiment The paired t test. Frequently one is interested in comparing the effects of two treatments (drugs, etc…) on a response variable.

Statistical Techniques I EXST7005 Review. Objectives n Develop an understanding and appreciation of Statistical Inference - particularly Hypothesis testing.

University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 21/09/2015 7:46 PM 1 Two-sample comparisons Underlying principles.

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 1b, January 30, 2015 Introductory Statistics/ Refresher and Relevant software installation.

Instrumentation (cont.) February 28 Note: Measurement Plan Due Next Week.

1 STATISTICAL HYPOTHESIS Two-sided hypothesis: H 0 :  = 50H 1 :   only here H 0 is valid all other possibilities are H 1 One-sided hypothesis:

1 STAT 500 – Statistics for Managers STAT 500 Statistics for Managers.

Introduction to Statistics Alastair Kerr, PhD. Think about these statements (discuss at end) Paraphrased from real conversations: – “We used a t-test.

1 Results from Lab 0 Guessed values are biased towards the high side. Judgment sample means are biased toward the high side and are more variable.

Descriptive statistics Petter Mostad Goal: Reduce data amount, keep ”information” Two uses: Data exploration: What you do for yourself when.

Data Science and Big Data Analytics Chap 3: Data Analytics Using R

Exploratory Spatial Data Analysis (ESDA) Analysis through Visualization.

1 Peter Fox Data Analytics – ITWS-4600/ITWS-6600 Week 2b, February 5, 2016 Lab exercises: beginning to work with data: filtering, distributions, populations,

1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 2, 2016, LALLY 102 Data and Information Resources, Role of Hypothesis, Exploration and.

Appendix I A Refresher on some Statistical Terms and Tests.

Statistics and probability Dr. Khaled Ismael Almghari Phone No:

Data Analytics – ITWS-4963/ITWS-6965

Modeling and Simulation CS 313

STAT 312 Chapter 7 - Statistical Intervals Based on a Single Sample

Introduction to Data Science Lecture 6 Exploratory Data Analysis

Introductory Statistics/ Refresher

Non-parametric test ordinal data

Chapter 16: Exploratory data analysis: numerical summaries

Probability and Statistics for Computer Scientists Second Edition, By: Michael Baron Chapter 8: Introduction to Statistics CIS Computational Probability.

Lab exercises: beginning to work with data: filtering, distributions, populations, significance testing… Peter Fox and Greg Hughes Data Analytics – ITWS-4600/ITWS-6600.

STATISTICS FOR SCIENCE RESEARCH

Modeling and Simulation CS 313

Statistics in psychology

Assumption of normality

Hypothesis Testing and Confidence Intervals (Part 1): Using the Standard Normal Lecture 8 Justin Kern October 10 and 12, 2017.

Evaluating Univariate Normality

Description of Data (Summary and Variability measures)

Y - Tests Type Based on Response and Measure Variable Data

StatCrunch Workshop Hector Facundo.

Inferential statistics,

Data Analytics – ITWS-4600/ITWS-6600/MATP-4450

Georgi Iskrov, MBA, MPH, PhD Department of Social Medicine

Group 1 Lab 2 exercises and Assignment 2

Lecture 11 Nonparametric Statistics Introduction

Distributions (Chapter 1) Sonja Swanson

Hypothesis Theory examples.

Laboratory in Oceanography: Data and Methods

Goodness-of-Fit Tests Applications

Types of (random) variables

When You See (This), You Think (That)

Writing the IA Report: Analysis and Evaluation

ITWS-4600/ITWS-6600/MATP-4450/CSCI-4960

Honors Statistics Review Chapters 4 - 5

(-4)*(-7)= Agenda Bell Ringer Bell Ringer

Nonparametric Statistics

Higher National Certificate in Engineering

Advanced Algebra Unit 1 Vocabulary

Analyzing and Interpreting Quantitative Data

Georgi Iskrov, MBA, MPH, PhD Department of Social Medicine

Group 1 Lab 2 exercises and Assignment 2

PSYCHOLOGY AND STATISTICS

Introductory Statistics

Presentation transcript:

Data Analytics – ITWS-4600/ITWS-6600/MATP-4450 Data and Information Resources, Role of Hypothesis, Exploration and Distributions Peter Fox Data Analytics – ITWS-4600/ITWS-6600/MATP-4450 Group 1 Module 3, ~ January 23, 2018

Contents Data sources “Munging” Exploring Cyber Human “Munging” Exploring Distributions… Summaries Visualization Testing and evaluating the results (beginning)

Lower layers in the Analytics Stack

“Cyber Data” … http://i.istockimg.com/file_thumbview_approve/9668319/2/stock-photo-9668319-stock-market-data.jpg

“Human Data” … http://images.sciencedaily.com/2013/10/131007151731-large.jpg http://www.downsidehedge.com/wp-content/uploads/2013/09/130915StockTwitsSPX.png

Data Prepared for Analysis = Munging Missing values, null values, etc. E.g. in the EPI_data – they use “--” Most data applications provide built ins for these higher-order functions – in R “NA” is used and functions such as is.na(var), etc. provide powerful filtering options (we’ll cover these on Friday) Of course, different variables often are missing “different” values In R – higher-order functions such as: Reduce, Filter, Map, Find, Position and Negate will become your enemies and then your friends: http://www.johnmyleswhite.com/notebook/2010/09/23/higher-order-functions-in-r/

Getting started – summarize data Summary statistic Ranges, “hinges” Tukey’s five numbers Look for a distribution match Tests…for… Normality – shapiro-wilks – returns a statistic (W!) and a p-value – what is the null hypothesis here? > shapiro.test(EPI_data$EPI) Shapiro-Wilk normality test data: EPI_data$EPI W = 0.9866, p-value = 0.1188

Accept or Reject? Reject the null hypothesis if the p-value is less than the level of significance. You will fail to reject the null hypothesis if the p-value is greater than or equal to the level of significance. Typical significance 0.05 (!)

Another variable in EPI > shapiro.test(EPI_data$DALY) Shapiro-Wilk normality test data: EPI_data$DALY W = 0.9365, p-value = 1.891e-07 Accept or reject?

Distribution tests Binomial, …. most distributions have tests Wilcoxon (Mann-Whitney) Comparing populations – versus to a distribution Kolmogorov-Smirnov (KS) … It got out of control when people realized they can name the test after themselves, v. someone else…

Getting started – look at the data Visually What is the improvement in the understanding of the data as compared to the situation without visualization? Which visualization techniques are suitable for one's data? Scatter plot diagrams Box plots (min, 1st quartile, median, 3rd quartile, max) Stem and leaf plots Frequency plots Group Frequency Distributions plot Cumulative Frequency plots Distribution plots

Why visualization? Reducing amount of data, quantization Patterns Features Events Trends Irregularities Leading to presentation of data, i.e. information products Exit points for analysis

Exploring the distribution > summary(EPI) # stats Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 32.10 48.60 59.20 58.37 67.60 93.50 68 > boxplot(EPI) > fivenum(EPI,na.rm=TRUE) [1] 32.1 48.6 59.2 67.6 93.5 Tukey: min, lower hinge, median, upper hinge, max

Stem and leaf plot > stem(EPI) # like-a histogram The decimal point is 1 digit(s) to the right of the | - but the scale of the stem is 10… watch carefully.. 3 | 234 3 | 66889 4 | 00011112222223344444 4 | 5555677788888999 5 | 0000111111111244444 5 | 55666677778888999999 6 | 000001111111222333344444 6 | 5555666666677778888889999999 7 | 000111233333334 7 | 5567888 8 | 11 8 | 669 9 | 4

Grouped Frequency Distribution aka binning > hist(EPI) #defaults

Distributions Shape Character Parameter(s) Which one fits?

> hist(EPI, seq(30., 95., 1.0), prob=TRUE) > lines (density(EPI,na.rm=TRUE,bw=1.)) > rug(EPI) or > lines (density(EPI,na.rm=TRUE,bw=“SJ”))

> hist(EPI, seq(30., 95., 1.0), prob=TRUE) > lines (density(EPI,na.rm=TRUE,bw=“SJ”))

Why are histograms so unsatisfying?

> xn<-seq(30,95,1) > qn<-dnorm(xn,mean=63, sd=5,log=FALSE) > lines(xn,qn) > lines(xn,.4*qn) > ln<-dnorm(xn,mean=44, sd=5,log=FALSE) > lines(xn,.26*ln)

Exploring the distribution > summary(DALY) # stats Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 0.00 37.19 60.35 53.94 71.97 91.50 39 > fivenum(DALY,na.rm=TRUE) [1] 0.000 36.955 60.350 72.320 91.500 EPI DALY

Stem and leaf plot > stem(DALY) # The decimal point is 1 digit(s) to the right of the | 0 | 0000111244 0 | 567899 1 | 0234 1 | 56688 2 | 000123 2 | 5667889 3 | 00001134 3 | 5678899 4 | 00011223444 4 | 555799 5 | 12223344 5 | 556667788999999 6 | 0000011111222233334444 6 | 6666666677788889999 7 | 00000000223333444 7 | 66888999 8 | 1113333333 8 | 555557777777777799999 9 | 22

Beyond histograms Cumulative distribution function: probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x. > plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)

Beyond histograms Quantile ~ inverse cumulative density function – points taken at regular intervals from the CDF, e.g. 2-quantiles=median, 4-quantiles=quartiles Quantile-Quantile (versus default=normal dist.) > par(pty="s") > qqnorm(EPI); qqline(EPI)

Beyond histograms Simulated data from t-distribution (random): > x <- rt(250, df = 5) > qqnorm(x); qqline(x)

Beyond histograms Q-Q plot against the generating distribution: x<-seq(30,95,1) > qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn") > qqline(x)

But if you are not sure it is normal > wilcox.test(EPI,DALY) Wilcoxon rank sum test with continuity correction data: EPI and DALY W = 15970, p-value = 0.7386 alternative hypothesis: true location shift is not equal to 0

Comparing the CDFs > plot(ecdf(EPI), do.points=FALSE, verticals=TRUE) > plot(ecdf(DALY), do.points=FALSE, verticals=TRUE, add=TRUE)

More munging Bad values, outliers, corrupted entries, thresholds … Noise reduction – low-pass filtering, binning Modal filtering REMEMBER: when you munge you MUST record what you did (and why) and save copies of pre- and post- operations…

Populations within populations In the EPI example: Geographic regions (GEO_subregion) EPI_regions Eco-regions (EDC v. LEDC – know what that is?) Primary industry(ies) Climate region What would you do to start exploring?

Or, a twist – n=1 but many attributes? The item of interest in relation to its attributes

Summary: explore Going from preliminary to initial analysis… Determining if there is one or more common distributions involved – i.e. parametric statistics (assumes or asserts a probability distribution) Fitting that distribution -> provides a model! Or NOT A hybrid or Non-parametric (statistics) approaches are needed – more on this to come

Goodness of fit And, we cannot take the models at face value, we must assess how fit they may be: Chi-Square One-sided and two-sided Kolmogorov-Smirnov tests Lilliefors tests Ansari-Bradley tests Jarque-Bera tests Just a preview…

Summary Cyber and Human data; quality, uncertainty and bias – you will often spend a lot of time with the data Distributions – the common and not-so common ones and how cyber and human data can have distinct distributions How simple statistical distributions can mislead us Populations and samples and how inferential statistics will lead us to model choices (no we have not actually done that yet in detail) Munging toward exploratory analysis Toward models!

Lab this week - software installs Data exercises! You can try some of the examples from today on the EPI or EPI2016 datasets More in the lab … and other datasets.