1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 3a, February 4, 2014, SAGE 3101 Preliminary Analysis, Interpretation, Detailed Analysis, Assessment.



Contents
Preliminary data analysis (PDA)
Interpreting what you get back (the stats, the plots)
Detailed analyses/fitting – a start
How to assess/intercompare 2

Preliminary Data Analysis
Relates to the sample vs. population (for Big Data) discussion last week
Also called exploratory data analysis (EDA)
–EDA is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there (John Tukey)
Distribution analysis and comparison, visual analysis, model testing, i.e. pretty much the things you did last Friday!
Thus we are going to review those results 3

Patterns and Relationships
Stepping from elementary/distribution analysis to algorithm-based analysis
I.e. pattern detection via data mining: classification, clustering, rules; machine learning; support vector machines; non-parametric models
Relations: associations between/among populations
Outcome: a model and an evaluation of its fitness for purpose 4

Models
Assumptions are often used when considering models, e.g. that they are representative of the population, since they are so often derived from a sample. This should be starting to make sense (a bit)
Two key topics:
–N=all and the open world assumption
–Model of the thing of interest versus model of the data (data model; structural form)
All models are wrong but some are useful (generally attributed to the statistician George Box) 5

Conceptual, logical and physical models 6
Applied to a database. However, our models will be mathematical, statistical, or a combination.
The concept of the model comes from the hypothesis.
The implementation of the physical model comes from the data ;-)

Art or science?
The hypothesis is built into the model, and so determines its form
Thus, as much art as science, because it depends both on your world view and on what the data is telling you (or not)
We will, however, be giving the models nice mathematical properties: orthogonal/orthonormal basis functions, etc. 7

Exploring the distribution
> summary(EPI) # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
> boxplot(EPI)
> fivenum(EPI,na.rm=TRUE) # Tukey: min, lower hinge, median, upper hinge, max
8
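The numeric output of these calls did not survive in the transcript, so here is a minimal sketch of the same three calls on simulated data. The vector `epi_sim` is a made-up stand-in (an assumption), not the EPI column used in class.

```r
# Assumption: simulated stand-in for the EPI column, with three missing values
set.seed(42)
epi_sim <- c(rnorm(160, mean = 58, sd = 12), NA, NA, NA)

s  <- summary(epi_sim)                # min, quartiles, median, mean, max, NA count
f5 <- fivenum(epi_sim, na.rm = TRUE)  # Tukey: min, lower hinge, median, upper hinge, max
q  <- quantile(epi_sim, c(0.25, 0.75), na.rm = TRUE)

print(s)
print(f5)
# Hinges and quartiles are defined differently, so they can differ slightly
print(f5[c(2, 4)] - q)

if (interactive()) boxplot(epi_sim)   # the box edges are drawn at the hinges
```

Note that `summary()` reports quartiles while `fivenum()` reports hinges, which is why the two outputs need not agree in the last digits.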

Stem and leaf plot
> stem(EPI) # like a histogram
The decimal point is 1 digit(s) to the right of the |, but the scale of the stem is 10, so watch carefully.
(Stem-and-leaf display: leaf digits not recoverable.) 9

Histogram > hist(EPI)#defaults 10

Distributions Shape Character Parameter(s) Which one fits? 11

12 > hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI,na.rm=TRUE,bw=1.))
> rug(EPI)
or
> lines(density(EPI,na.rm=TRUE,bw="SJ"))

13 > hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI,na.rm=TRUE,bw="SJ"))
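A minimal sketch of the two bandwidth choices, again on a simulated stand-in (`epi_sim` is an assumption). It shows that `bw` takes either a number or the name of a selector given as a quoted string such as `"SJ"` (Sheather-Jones); an unquoted `SJ` would be looked up as a variable and fail.

```r
# Assumption: simulated stand-in for the EPI column
set.seed(1)
epi_sim <- rnorm(160, mean = 58, sd = 12)

d_fixed <- density(epi_sim, bw = 1)     # fixed bandwidth of 1
d_sj    <- density(epi_sim, bw = "SJ")  # Sheather-Jones selector, passed as a string

cat("fixed bw:", d_fixed$bw, " SJ bw:", d_sj$bw, "\n")

if (interactive()) {
  # Unit-width bins spanning the data, plotted as a density (prob = TRUE)
  hist(epi_sim, breaks = seq(floor(min(epi_sim)), ceiling(max(epi_sim)), 1), prob = TRUE)
  lines(d_fixed)         # wiggly: small bandwidth
  lines(d_sj, lty = 2)   # smoother: data-driven bandwidth
  rug(epi_sim)
}
```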

Why are histograms so unsatisfying? 14

> xn<-seq(30,95,1)
> qn<-dnorm(xn,mean=63,sd=5,log=FALSE)
> lines(xn,qn)
> lines(xn,.4*qn)
> ln<-dnorm(xn,mean=44,sd=5,log=FALSE)
> lines(xn,.26*ln) 15
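The overlay above scales each normal curve by hand. As a rough sketch of why scaling matters: a normal density multiplied by a weight w encloses area w, so a two-component overlay behaves like a proper density only when the weights sum to 1. The weights 0.6 and 0.4 below are hypothetical, chosen only for illustration; they are not the 0.4 and 0.26 tried on the slide.

```r
xn <- seq(30, 95, 1)

# Hypothetical weights that sum to 1 (assumption, not fitted values)
w1 <- 0.6
w2 <- 0.4

comp1 <- w1 * dnorm(xn, mean = 63, sd = 5)  # upper component
comp2 <- w2 * dnorm(xn, mean = 44, sd = 5)  # lower component
mix   <- comp1 + comp2                      # two-component mixture density

# Riemann sum with grid spacing 1 approximates the area under the mixture
area <- sum(mix) * 1
print(area)

if (interactive()) {
  plot(xn, mix, type = "l")
  lines(xn, comp1, lty = 2)
  lines(xn, comp2, lty = 3)
}
```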

ELand ~ EPI where !Landlock (not landlocked)
> hist(ELand, seq(30., 95., 1.0), prob=TRUE); lines … 16

No surface water 17

EPIreg<- EPI_data$EPI[EPI_data$EPI_regions=="Europe"] 18
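The filters on the last few slides all use logical indexing. A minimal sketch on a toy data frame follows; `epi_toy`, its values, and the 0/1 coding of `Landlock` are assumptions made for illustration, with column names chosen to echo the ones on the slides.

```r
# Assumption: hypothetical miniature data frame, not the class dataset
epi_toy <- data.frame(
  Country     = c("A", "B", "C", "D", "E"),
  EPI         = c(71.2, 55.0, NA, 63.4, 48.9),
  EPI_regions = c("Europe", "South Asia", "Europe", "Europe", "South Asia"),
  Landlock    = c(0, 1, 0, 1, 0)
)

# Keep EPI values for one region, then drop missing values
epi_europe <- epi_toy$EPI[epi_toy$EPI_regions == "Europe"]
epi_europe <- epi_europe[!is.na(epi_europe)]

# Combine two conditions: not landlocked and not missing
epi_coastal <- epi_toy$EPI[epi_toy$Landlock == 0 & !is.na(epi_toy$EPI)]

print(epi_europe)
print(epi_coastal)
```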

Exploring other distributions
> summary(DALY) # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
> fivenum(DALY,na.rm=TRUE)
(Figure labels: EPI, DALY)

Stem and leaf plot
> stem(DALY)
The decimal point is 1 digit(s) to the right of the |
(Stem-and-leaf display: leaf digits not recoverable.) 20

DALY
> hist(DALY, seq(0., 99., 1.0), prob=TRUE)
> lines(density(DALY,na.rm=TRUE,bw=1.))
> lines(density(DALY,na.rm=TRUE,bw="SJ")) 21

Beyond histograms
Cumulative distribution function: the probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x, i.e. F(x) = P(X ≤ x).
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE) 22
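A small sketch of what `ecdf()` returns: a step function that, evaluated at x, gives the fraction of observations less than or equal to x. The sample `x` is simulated (an assumption).

```r
# Assumption: simulated sample standing in for EPI
set.seed(7)
x <- rnorm(100, mean = 58, sd = 12)

Fn <- ecdf(x)          # Fn is itself a function

# The ECDF at a point equals the proportion of observations at or below it
print(Fn(58))
print(mean(x <= 58))

if (interactive()) plot(Fn, do.points = FALSE, verticals = TRUE)
```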

Beyond histograms
Quantile ~ inverse cumulative distribution function: points taken at regular intervals from the CDF, e.g. the 2-quantile is the median, the 4-quantiles are the quartiles
Quantile-Quantile (versus default = normal dist.)
> par(pty="s")
> qqnorm(EPI); qqline(EPI) 23
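To make the "inverse of the CDF" reading concrete, here is a sketch on simulated data (an assumption). `quantile()` with `type = 1` inverts the empirical CDF directly, so evaluating the ECDF at each returned quantile gives at least the requested probability.

```r
# Assumption: simulated sample standing in for EPI
set.seed(7)
x <- rnorm(101, mean = 58, sd = 12)

probs <- c(0.25, 0.50, 0.75)
q  <- quantile(x, probs = probs, type = 1)  # type 1: inverse of the empirical CDF
Fn <- ecdf(x)

print(q)
print(Fn(q))   # each value is at least the corresponding entry of probs

# The default quantile type at probability 0.5 agrees with median()
print(unname(quantile(x, 0.5)) - median(x))

if (interactive()) {
  par(pty = "s")           # square plotting region, as on the slide
  qqnorm(x); qqline(x)
}
```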

Beyond histograms
Simulated data from t-distribution (random):
> x <- rt(250, df = 5)
> qqnorm(x); qqline(x) 24

Beyond histograms
Q-Q plot against the generating distribution:
> x<- seq(30,95,1)
> qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
> qqline(x) 25
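As a sketch of the idea in the title, the following compares a sample drawn from a t-distribution with the theoretical quantiles of that same distribution. This is an illustration on freshly simulated data, not a reproduction of the slide's plot; when the reference distribution is right, the points lie close to a straight line, which the correlation of the paired quantiles summarises roughly.

```r
set.seed(123)
x    <- rt(250, df = 5)                 # simulated heavy-tailed sample
theo <- qt(ppoints(250), df = 5)        # theoretical t quantiles at 250 plotting positions

# plot.it = FALSE returns the sorted pairs without drawing
qq <- qqplot(theo, x, plot.it = FALSE)
print(cor(qq$x, qq$y))                  # near 1 when the reference distribution fits

if (interactive()) {
  qqplot(theo, x, xlab = "Theoretical t quantiles (df = 5)", ylab = "Sample quantiles")
  # Reference line through the quartiles of the t distribution rather than the normal
  qqline(x, distribution = function(p) qt(p, df = 5))
}
```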

DALY (ecdf and qqplot) 26

Weibull qqplot…….. 27

Testing the fits
> shapiro.test(EPI) # null hypothesis: the data are normally distributed
Shapiro-Wilk normality test
data: EPI
W = , p-value =
Interpretation: W and the p-value. Reject the null hypothesis or not? Here: ~no.
DALY: W = , p-value = 1.891e-07 (reject) 28
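A minimal sketch of reading Shapiro-Wilk output, using two simulated samples (assumptions): one drawn from a normal distribution and one strongly skewed. W close to 1 together with a large p-value gives no grounds to reject normality; a small p-value does.

```r
set.seed(2014)
normal_sample <- rnorm(150, mean = 58, sd = 12)  # assumption: simulated, normal
skewed_sample <- rexp(150, rate = 1 / 20)        # assumption: simulated, right-skewed

sw_normal <- shapiro.test(normal_sample)
sw_skewed <- shapiro.test(skewed_sample)

# Each result is a list; the statistic W and the p-value can be pulled out by name
cat("normal: W =", sw_normal$statistic, " p =", sw_normal$p.value, "\n")
cat("skewed: W =", sw_skewed$statistic, " p =", sw_skewed$p.value, "\n")

alpha <- 0.05
cat("reject normality for skewed sample:", sw_skewed$p.value < alpha, "\n")
```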

Kolmogorov–Smirnov
One-sided or two-sided:
> ks.test(EPI,seq(30.,95.,1.0))
Two-sample Kolmogorov-Smirnov test
data: EPI and seq(30, 95, 1)
D = , p-value =
alternative hypothesis: two-sided
Warning message: In ks.test(EPI, seq(30, 95, 1)) : p-value will be approximate in the presence of ties
D = maximum distance between the ECDF (blue) of the sample and the CDF (red) in the one-sample case: but p-value is important – accept if p-value>
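A sketch of both forms of the test on simulated data (an assumption). In the two-sample form, D is the largest vertical gap between the two empirical CDFs, which the code recomputes by hand. In the one-sample form the second argument names a reference CDF; note that estimating its parameters from the same sample, as done here for brevity, makes the reported p-value too optimistic.

```r
set.seed(99)
a <- rnorm(150, mean = 58, sd = 12)   # assumption: simulated sample
b <- rnorm(150, mean = 40, sd = 12)   # assumption: second sample, shifted downward

# Two-sample test: compares the two empirical CDFs
ks_two <- ks.test(a, b)

# D recomputed by hand: largest gap between the ECDFs over all observed points
pooled   <- sort(c(a, b))
D_manual <- max(abs(ecdf(a)(pooled) - ecdf(b)(pooled)))
cat("D =", ks_two$statistic, " manual D =", D_manual, " p =", ks_two$p.value, "\n")

# One-sample test: compares the ECDF of a with a normal CDF
ks_one <- ks.test(a, "pnorm", mean = mean(a), sd = sd(a))
cat("one-sample D =", ks_one$statistic, "\n")
```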

Variability in normal distributions 30

F-test 31
$F = S_1^2 / S_2^2$, where $S_1^2$ and $S_2^2$ are the sample variances. The more this ratio deviates from 1, the stronger the evidence for unequal population variances.

> var.test(EPI,DALY)
F test to compare two variances
data: EPI and DALY
F = , num df = 162, denom df = 191, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
sample estimates: ratio of variances
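A sketch that checks `var.test()` against the definition of F, on simulated samples (assumptions) sized so that the degrees of freedom match those shown above (n - 1 for each sample).

```r
set.seed(5)
x <- rnorm(163, mean = 58, sd = 12)   # assumption: simulated, 163 values -> 162 df
y <- rnorm(192, mean = 50, sd = 30)   # assumption: simulated, 192 values -> 191 df

vt <- var.test(x, y)

# F is the ratio of the two sample variances
F_manual <- var(x) / var(y)

# Two-sided p-value: twice the smaller tail probability of the F distribution
p_lower  <- pf(F_manual, df1 = length(x) - 1, df2 = length(y) - 1)
p_manual <- 2 * min(p_lower, 1 - p_lower)

cat("F =", vt$statistic, " manual F =", F_manual, "\n")
cat("p =", vt$p.value, " manual p =", p_manual, "\n")
```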

T-test 33

Comparing distributions
> t.test(EPI,DALY)
Welch Two Sample t-test
data: EPI and DALY
t = , df = , p-value =
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates: mean of x mean of y
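By default `t.test()` with two samples runs the Welch version, which does not assume equal variances. The sketch below, on simulated samples (assumptions), recomputes the t statistic and the Welch-Satterthwaite degrees of freedom by hand, which also explains why the reported df is usually not an integer.

```r
set.seed(11)
x <- rnorm(163, mean = 58, sd = 12)   # assumption: simulated sample
y <- rnorm(192, mean = 54, sd = 30)   # assumption: simulated sample, larger spread

tt <- t.test(x, y)                    # Welch two-sample t-test (var.equal = FALSE)

nx <- length(x); ny <- length(y)
vx <- var(x) / nx                     # squared standard error of mean(x)
vy <- var(y) / ny                     # squared standard error of mean(y)

t_manual  <- (mean(x) - mean(y)) / sqrt(vx + vy)
df_manual <- (vx + vy)^2 / (vx^2 / (nx - 1) + vy^2 / (ny - 1))

cat("t =", tt$statistic, " manual t =", t_manual, "\n")
cat("df =", tt$parameter, " manual df =", df_manual, "\n")
```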

Comparing distributions > boxplot(EPI,DALY) 35

CDF for EPI and DALY 36
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
> plot(ecdf(DALY), do.points=FALSE, verticals=TRUE, add=TRUE)

qqplot(EPI,DALY) 37

Oops, did we forget? 38

Goal?
Find the single most important factor in increasing the EPI in a given region
Preceding table gives a nested conceptual model
Examine distributions down to the leaf nodes and build up an EPI model 39

boxplot(ENVHEALTH,ECOSYSTEM) 40

qqplot(ENVHEALTH,ECOSYSTEM) 41

ENVHEALTH/ECOSYSTEM
> shapiro.test(ENVHEALTH)
Shapiro-Wilk normality test
data: ENVHEALTH
W = , p-value = 1.083e
Reject.
> shapiro.test(ECOSYSTEM)
Shapiro-Wilk normality test
data: ECOSYSTEM
W = , p-value =
~reject 42

Kolmogorov-Smirnov (KS) test
> ks.test(EPI,DALY)
Two-sample Kolmogorov-Smirnov test
data: EPI and DALY
D = , p-value =
alternative hypothesis: two-sided
Warning message: In ks.test(EPI, DALY) : p-value will be approximate in the presence of ties 43


How are the software installs going?
R/Scipy (et al)/Matlab – getting comfortable?
Data infrastructure …
(Matlab, R, scipy/numpy) table comparison: http://hyperpolyglot.org/numerical-analysis 45

Tentative assignments
Assignment 2: Datasets and data infrastructures – lab assignment. Held in week 3 (Feb. 7). 10% (lab; individual);
Assignment 3: Preliminary and Statistical Analysis. Due ~ week 4. 15% (15% written and 0% oral; individual);
Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ week 5. 15% (10% written and 5% oral; individual);
Assignment 5: Term project proposal. Due ~ week 6. 5% (0% written and 5% oral; individual);
Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual);
Term project. Due ~ week % (25% written, 5% oral; individual). 46

Admin info (keep/print this slide)
Class: ITWS-4963/ITWS-6965
Hours: 12:00pm-1:50pm Tuesday/Friday
Location: SAGE 3101
Instructor: Peter Fox
Instructor contact: (do not leave a
Contact hours: Monday** 3:00-4:00pm (or by appt)
Contact location: Winslow 2120 (sometimes Lally 207A announced by )
TA: Lakshmi Chenicheri
Web site:
–Schedule, lectures, syllabus, reading, assignments, etc. 47