Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 3a, February 4, 2014, SAGE 3101
Preliminary Analysis, Interpretation, Detailed Analysis, Assessment
Contents
  PDA
  Interpreting what you get back (the stats, the plots)
  Detailed analyses/fitting – a start
  How to assess/intercompare
Preliminary Data Analysis
  Relates to the sample v. population (for Big Data) discussion last week
  Also called Exploratory DA
    –EDA is "an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe will be there" (John Tukey)
  Distribution analysis and comparison, visual analysis, model testing, i.e. pretty much the things you did last Friday!
  Thus we are going to review those results
Patterns and Relationships
  Stepping from elementary/distribution analysis to algorithmic-based analysis
  I.e. pattern detection via data mining: classification, clustering, rules; machine learning; support vector machines, non-parametric models
  Relations – associations between/among populations
  Outcome: model and an evaluation of its fitness for purpose
Models
  Assumptions are often made when considering models, e.g. that a model is representative of the population, since models are so often derived from a sample. This should be starting to make sense (a bit)
  Two key topics:
    –N=all and the open world assumption
    –Model of the thing of interest versus model of the data (data model; structural form)
  "All models are wrong but some are useful" (generally attributed to the statistician George Box)
Conceptual, logical and physical models
  Applied to a database:
  However, our models will be mathematical, statistical, or a combination.
  The concept of the model comes from the hypothesis
  The implementation of the physical model comes from the data ;-)
Art or science?
  The hypothesis, once incorporated into the model, determines the model's form
  Thus, as much art as science, because it depends both on your world view and on what the data is telling you (or not)
  We will, however, be giving the models nice mathematical properties: orthogonal/orthonormal basis functions, etc.
Exploring the distribution
> summary(EPI) # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
> boxplot(EPI)
> fivenum(EPI,na.rm=TRUE)
[1] Tukey: min, lower hinge, median, upper hinge, max
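The calls above can be tried without the course data. A minimal sketch, assuming a simulated stand-in vector: `epi_sim`, its mean, its spread and its three NAs are hypothetical and are not the real EPI values.

```r
# Hypothetical stand-in for EPI: simulated values plus a few missing entries
set.seed(1)
epi_sim <- c(rnorm(160, mean = 58, sd = 12), NA, NA, NA)

# summary() reports min, quartiles, mean, max and the NA count
print(summary(epi_sim))

# Tukey's five numbers: min, lower hinge, median, upper hinge, max
tukey <- fivenum(epi_sim, na.rm = TRUE)
print(tukey)

# boxplot() uses the same hinges and median; plot = FALSE returns the stats only.
# Rows 1 and 5 are the whisker ends, which differ from min/max when outliers exist.
bp <- boxplot(epi_sim, plot = FALSE)
print(bp$stats[, 1])
```

The hinges from `fivenum()` can differ slightly from the quartiles printed by `summary()`, because the two use different interpolation rules.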
Stem and leaf plot
> stem(EPI) # like a histogram
  The decimal point is 1 digit(s) to the right of the |
  (but the scale of the stem is 10… watch carefully)
  [Stem-and-leaf display of EPI]
Histogram
> hist(EPI) # defaults
Distributions
  Shape
  Character
  Parameter(s)
  Which one fits?
> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI,na.rm=TRUE,bw=1.))
> rug(EPI)
or
> lines(density(EPI,na.rm=TRUE,bw="SJ"))
> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI,na.rm=TRUE,bw="SJ"))
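The two bandwidth choices can be compared on simulated data. This is a sketch under stated assumptions: `epi_sim` is a hypothetical two-component stand-in, not the EPI data, and the break points are derived from its range rather than taken from the slide.

```r
# Hypothetical bimodal stand-in for EPI
set.seed(2)
epi_sim <- c(rnorm(120, mean = 63, sd = 5), rnorm(40, mean = 44, sd = 5))

# Unit-width bins that are guaranteed to cover the data
brk <- seq(floor(min(epi_sim)), ceiling(max(epi_sim)), by = 1)
hist(epi_sim, breaks = brk, prob = TRUE)

# Fixed bandwidth versus the Sheather-Jones selector.
# The selector name is a string; an unquoted SJ would be looked up as an object.
d_fixed <- density(epi_sim, bw = 1)
d_sj    <- density(epi_sim, bw = "SJ")
lines(d_fixed)
lines(d_sj, lty = 2)
rug(epi_sim)

# The bandwidth actually used is stored in the result
print(c(fixed = d_fixed$bw, SJ = d_sj$bw))
```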
Why are histograms so unsatisfying?
> xn<-seq(30,95,1)
> qn<-dnorm(xn,mean=63,sd=5,log=FALSE)
> lines(xn,qn)
> lines(xn,.4*qn)
> ln<-dnorm(xn,mean=44,sd=5,log=FALSE)
> lines(xn,.26*ln)
ELand ~ EPI !Landlock
> hist(ELand, seq(30., 95., 1.0), prob=TRUE); lines …
No surface water
> EPIreg<-EPI_data$EPI[EPI_data$EPI_regions=="Europe"]
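The regional subset above can be illustrated on a toy table. A minimal sketch: only the column names `EPI` and `EPI_regions` and the value "Europe" come from the slide; the rows, the second region name and the NA handling are hypothetical.

```r
# Hypothetical miniature of EPI_data (values invented for illustration)
EPI_data <- data.frame(
  EPI = c(70.1, 55.3, NA, 62.8, 48.9),
  EPI_regions = c("Europe", "South Asia", "Europe", "Europe", "South Asia")
)

# Logical indexing keeps the EPI values whose region is Europe
EPIreg <- EPI_data$EPI[EPI_data$EPI_regions == "Europe"]

# Missing values survive the subset, so drop them before plotting or testing
EPIreg <- EPIreg[!is.na(EPIreg)]
print(EPIreg)
```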
Exploring other distributions
> summary(DALY) # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
> fivenum(DALY,na.rm=TRUE)
[1]
[Figure labels: EPI, DALY]
Stem and leaf plot
> stem(DALY)
  The decimal point is 1 digit(s) to the right of the |
  [Stem-and-leaf display of DALY]
DALY
> hist(DALY, seq(0., 99., 1.0), prob=TRUE)
> lines(density(DALY,na.rm=TRUE,bw=1.))
> lines(density(DALY,na.rm=TRUE,bw="SJ"))
Beyond histograms
  Cumulative distribution function: the probability that a real-valued random variable X with a given probability distribution will be found at a value less than or equal to x.
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
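The definition above can be checked numerically. A sketch on simulated data: the sample `s` and the reference value 58 are hypothetical choices.

```r
# Hypothetical sample standing in for EPI
set.seed(3)
s <- rnorm(200, mean = 58, sd = 12)

# ecdf() returns a step function Fn, where Fn(v) estimates P(X <= v)
Fn <- ecdf(s)
p_ecdf   <- Fn(58)
p_direct <- mean(s <= 58)   # same quantity computed directly
print(c(p_ecdf, p_direct))

# Empirical CDF with the generating normal CDF overlaid (dashed)
plot(Fn, do.points = FALSE, verticals = TRUE)
curve(pnorm(x, mean = 58, sd = 12), add = TRUE, lty = 2)
```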
Beyond histograms
  Quantile ~ inverse cumulative distribution function – points taken at regular intervals from the CDF, e.g. 2-quantiles=median, 4-quantiles=quartiles
  Quantile-Quantile (versus default=normal dist.)
> par(pty="s")
> qqnorm(EPI); qqline(EPI)
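The "inverse of the CDF" reading of a quantile can be made concrete. A sketch on a hypothetical simulated sample `s`; `type = 1` is chosen because it is the quantile definition that directly inverts the empirical CDF.

```r
# Hypothetical sample standing in for EPI
set.seed(4)
s <- rnorm(200, mean = 58, sd = 12)

# The 2-quantile is the median; the 4-quantiles are the quartiles
q_med     <- quantile(s, 0.5, names = FALSE)
quartiles <- quantile(s, c(0.25, 0.5, 0.75))
print(quartiles)

# type = 1 returns the smallest observed value whose ECDF is at least p
q30    <- quantile(s, 0.30, type = 1, names = FALSE)
p_back <- ecdf(s)(q30)      # mapping back through the ECDF recovers at least 0.30
print(c(q30 = q30, p_back = p_back))
```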
Beyond histograms
  Simulated data from t-distribution (random):
> x <- rt(250, df = 5)
> qqnorm(x); qqline(x)
Beyond histograms
  Q-Q plot against the generating distribution:
> x<-seq(30,95,1)
> qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
> qqline(x)
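A Q-Q plot against the generating distribution can be sketched as follows. This version keeps the simulated t sample as `x` and draws the reference line through t quantiles rather than normal ones; both choices are assumptions about the intent, not the slide's exact code.

```r
# Simulated sample from a t-distribution with 5 degrees of freedom
set.seed(5)
x <- rt(250, df = 5)

# Theoretical t(5) quantiles at 250 evenly spread probability points
theo <- qt(ppoints(250), df = 5)

# Sample quantiles against theoretical quantiles
qqplot(theo, x,
       xlab = "Theoretical t(5) quantiles",
       ylab = "Sample quantiles")

# qqline() defaults to normal quantiles; pass the t quantile function instead
qqline(x, distribution = function(p) qt(p, df = 5))

# A straight-line pattern shows up as a high correlation between the two
r_qq <- cor(sort(x), theo)
print(r_qq)
```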
DALY (ecdf and qqplot)
Weibull qqplot…
Testing the fits
> shapiro.test(EPI) # null hypothesis – normal?
Shapiro-Wilk normality test
data: EPI
W = , p-value =
  Interpretation: W and probability-value
  Reject null hypothesis or not? Here.. ~ NO.
  DALY: W = , p-value = 1.891e-07 (reject)
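The reject/do-not-reject reading of W and the p-value can be practised on simulated data. A sketch under stated assumptions: both samples and the 0.05 threshold `alpha` are hypothetical choices, not values from the course data.

```r
set.seed(7)
normal_sample <- rnorm(150, mean = 58, sd = 12)   # drawn from a normal
skewed_sample <- rexp(150, rate = 1 / 30)         # strongly right-skewed

sw_norm <- shapiro.test(normal_sample)
sw_skew <- shapiro.test(skewed_sample)

# W close to 1 is consistent with normality; a small p-value rejects the null
alpha <- 0.05
results <- data.frame(
  sample  = c("normal", "skewed"),
  W       = c(sw_norm$statistic, sw_skew$statistic),
  p.value = c(sw_norm$p.value, sw_skew$p.value)
)
results$reject <- results$p.value < alpha
print(results)
```

Failing to reject does not prove normality; it only means the sample gives no strong evidence against it.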
Kolmogorov–Smirnov
  One-sample or two-sample:
> ks.test(EPI,seq(30.,95.,1.0))
Two-sample Kolmogorov-Smirnov test
data: EPI and seq(30, 95, 1)
D = , p-value =
alternative hypothesis: two-sided
Warning message:
In ks.test(EPI, seq(30, 95, 1)) : p-value will be approximate in the presence of ties
  D = distance between the ECDF (blue) of the sample and the CDF (red) for the one-sample test; but the p-value is important – accept if p-value>
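The two forms of the test, and what D measures, can be sketched on simulated data. The samples `x` and `y` and their parameters are hypothetical; the manual computation of D is included only to show that it is the largest vertical gap between two CDFs.

```r
set.seed(8)
x <- rnorm(150, mean = 58, sd = 12)
y <- rnorm(150, mean = 45, sd = 20)

# One-sample form: compare the sample with a fully specified reference CDF
ks_one <- ks.test(x, "pnorm", mean = 58, sd = 12)

# Two-sample form: compare two samples with each other
ks_two <- ks.test(x, y)

# D for the two-sample test is the largest gap between the two ECDFs,
# evaluated at every observed point
pooled   <- sort(c(x, y))
D_manual <- max(abs(ecdf(x)(pooled) - ecdf(y)(pooled)))

print(c(D_test = unname(ks_two$statistic), D_manual = D_manual))
print(c(p_one = ks_one$p.value, p_two = ks_two$p.value))
```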
Variability in normal distributions
F-test 31 F = S 1 2 / S 2 2 where S 1 and S 2 are the sample variances. The more this ratio deviates from 1, the stronger the evidence for unequal population variances.
> var.test(EPI,DALY) F test to compare two variances data: EPI and DALY F = , num df = 162, denom df = 191, p-value < 2.2e-16 alternative hypothesis: true ratio of variances is not equal to 1 95 percent confidence interval: sample estimates: ratio of variances
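The F statistic reported by `var.test()` can be reproduced by hand. A sketch on simulated data: the samples are hypothetical, and their sizes (163 and 192) are chosen only so that the degrees of freedom match the 162 and 191 shown above.

```r
set.seed(9)
x <- rnorm(163, mean = 58, sd = 12)
y <- rnorm(192, mean = 50, sd = 25)

# F is the ratio of the two sample variances
F_manual <- var(x) / var(y)

# Degrees of freedom are n - 1 for numerator and denominator
df1 <- length(x) - 1
df2 <- length(y) - 1

# Two-sided p-value: twice the smaller tail of the F distribution
tail_low <- pf(F_manual, df1, df2)
p_manual <- 2 * min(tail_low, 1 - tail_low)

vt <- var.test(x, y)
print(c(F_manual = F_manual, F_test = unname(vt$statistic)))
print(c(p_manual = p_manual, p_test = vt$p.value))
```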
T-test
Comparing distributions
> t.test(EPI,DALY)
Welch Two Sample t-test
data: EPI and DALY
t = , df = , p-value =
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean of x mean of y
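The Welch statistic and its fractional degrees of freedom can be recomputed from the sample means and variances. A sketch on hypothetical simulated samples; the formulas are the standard Welch t and Welch-Satterthwaite approximation.

```r
set.seed(10)
x <- rnorm(163, mean = 58, sd = 12)
y <- rnorm(192, mean = 50, sd = 25)

# t.test() defaults to the Welch form, which does not assume equal variances
tt <- t.test(x, y)

# Per-sample variance of the mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch t statistic: difference in means over its standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite degrees of freedom (generally not an integer)
df_manual <- (vx + vy)^2 /
  (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

print(c(t_manual = t_manual, t_test = unname(tt$statistic)))
print(c(df_manual = df_manual, df_test = unname(tt$parameter)))
```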
Comparing distributions
> boxplot(EPI,DALY)
CDF for EPI and DALY
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
> plot(ecdf(DALY), do.points=FALSE, verticals=TRUE, add=TRUE)
qqplot(EPI,DALY)
Oops, did we forget?
Goal?
  Find the single most important factor in increasing the EPI in a given region
  Preceding table gives a nested conceptual model
  Examine distributions down to the leaf nodes and build up an EPI model
boxplot(ENVHEALTH,ECOSYSTEM)
qqplot(ENVHEALTH,ECOSYSTEM)
ENVHEALTH/ECOSYSTEM
> shapiro.test(ENVHEALTH)
Shapiro-Wilk normality test
data: ENVHEALTH
W = , p-value = 1.083e
  Reject.
> shapiro.test(ECOSYSTEM)
Shapiro-Wilk normality test
data: ECOSYSTEM
W = , p-value =
  ~reject
Kolmogorov-Smirnov (KS) test
> ks.test(EPI,DALY)
Two-sample Kolmogorov-Smirnov test
data: EPI and DALY
D = , p-value =
alternative hypothesis: two-sided
Warning message:
In ks.test(EPI, DALY) : p-value will be approximate in the presence of ties
How are the software installs going?
  R/Scipy (et al)/Matlab – getting comfortable?
  Data infrastructure …
  (Matlab, R, scipy/numpy) table comparison: http://hyperpolyglot.org/numerical-analysis
Tentative assignments
  Assignment 2: Datasets and data infrastructures – lab assignment. Held in week 3 (Feb. 7). 10% (lab; individual)
  Assignment 3: Preliminary and Statistical Analysis. Due ~ week 4. 15% (15% written and 0% oral; individual)
  Assignment 4: Pattern, trend, relations: model development and evaluation. Due ~ week 5. 15% (10% written and 5% oral; individual)
  Assignment 5: Term project proposal. Due ~ week 6. 5% (0% written and 5% oral; individual)
  Assignment 6: Predictive and Prescriptive Analytics. Due ~ week 8. 15% (15% written and 5% oral; individual)
  Term project. Due ~ week % (25% written, 5% oral; individual)
Admin info (keep/print this slide)
  Class: ITWS-4963/ITWS-6965
  Hours: 12:00pm-1:50pm Tuesday/Friday
  Location: SAGE 3101
  Instructor: Peter Fox
  Instructor contact: (do not leave a
  Contact hours: Monday** 3:00-4:00pm (or by appt)
  Contact location: Winslow 2120 (sometimes Lally 207A announced by )
  TA: Lakshmi Chenicheri
  Web site:
    –Schedule, lectures, syllabus, reading, assignments, etc.