Data Analytics – ITWS-4963/ITWS-6965

Lab exercises: beginning to work with data: filtering, distributions, populations, testing and models.
Peter Fox
Week 2b, January 31, 2014

Assignment 1 – discuss
Review/critique (business case, area of application, approach/methods, tools used, results, actions, benefits). What did you choose, why, and what are your review comments?

Today
Exploring data and their distributions
Fitting distributions, joint and conditional
Constructing models and testing their fitness

Files
http://escience.rpi.edu/data/DA
2010EPI_data.xls – with missing values changed to suit your application (EPI2010_all countries or EPI2010_onlyEPIcountries tabs)
Read the data in as you did last week. I will use "EPI" for the object (in R).

Tips (in R)
> attach(EPI) # sets the 'default' object
> fix(EPI) # launches a simple data editor
> EPI # prints out values
[1] NA NA NA NA NA NA
[18] NA NA NA
[35] NA NA NA NA
[52] NA NA NA 78.2
[69] NA NA NA NA NA NA NA
[86] NA NA NA
[103] NA NA NA NA NA 63.7
[120] NA NA NA NA
[137] NA NA NA NA NA NA NA 66.4
[154] NA NA NA NA
[171] NA NA NA NA NA
[188] NA NA NA NA NA NA
[205] NA NA NA NA NA NA 62.9
[222] NA NA NA NA NA
> tf <- is.na(EPI) # records TRUE where the value is NA
> E <- EPI[!tf] # filters out NA values, new array
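For those following along in Python (the Scipy/numpy route listed later in these slides), the same NA filter can be sketched with a boolean mask. The array below is a short made-up stand-in for the EPI column, with `np.nan` playing the role of R's `NA`; the names `epi`, `tf` and `e` are only illustrative.

```python
import numpy as np

# Made-up stand-in for the EPI column; np.nan plays the role of R's NA.
epi = np.array([np.nan, 78.2, np.nan, 63.7, 66.4, np.nan, 62.9])

tf = np.isnan(epi)   # True where the value is missing, like is.na(EPI)
e = epi[~tf]         # keep only the non-missing values, like EPI[!tf]

print(int(tf.sum()), "missing values; kept:", e)
```

As in R, the mask is computed once and then negated for indexing, so the original array is left untouched.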

Exercise 1: exploring the distribution
> summary(EPI) # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
> fivenum(EPI,na.rm=TRUE)
[1]
> stem(EPI) # stem and leaf plot
> hist(EPI)
> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI,na.rm=TRUE,bw=1.)) # or try bw="SJ"
> rug(EPI)
Use help(<command>), e.g.
> help(stem)
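A rough NumPy counterpart of the summary and density-histogram steps is sketched below. The data are randomly generated, not the real EPI values, and the mean, spread and bin count are arbitrary assumptions. Note that percentiles are used here, which can differ slightly from the hinges that R's `fivenum` reports.

```python
import numpy as np

rng = np.random.default_rng(0)
epi = rng.normal(58.0, 12.0, size=200)   # synthetic, NOT the real EPI values
epi[::10] = np.nan                       # inject some missing values

# Minimum, quartiles and maximum, ignoring missing values
q = np.nanpercentile(epi, [0, 25, 50, 75, 100])
print("min, Q1, median, Q3, max:", np.round(q, 1))
print("mean:", round(float(np.nanmean(epi)), 1),
      " NA's:", int(np.isnan(epi).sum()))

# Density-scaled histogram, comparable to hist(..., prob=TRUE)
clean = epi[~np.isnan(epi)]
density, edges = np.histogram(clean, bins=20, density=True)
```

With `density=True` the bar heights times the bin widths sum to 1, which is what makes the histogram comparable to a density curve.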

Save your plots, name them.
Save the commands you used to generate them.

Exercise 1: fitting a distribution beyond histograms
Cumulative distribution function?
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
Quantile-Quantile?
> par(pty="s")
> qqnorm(highEPI); qqline(highEPI)
Simulated data from t-distribution:
> x <- rt(250, df = 5)
> qqnorm(x); qqline(x)
Make a Q-Q plot against the generating distribution by:
> x<-seq(30,95,1)
> qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
> qqline(x)
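The empirical CDF and the Q-Q comparison can also be computed without plotting, which makes it easier to see what the plots are built from. This is a minimal sketch using `scipy.stats.probplot` on simulated t-distributed data (the seed and sample size are arbitrary); it returns the theoretical and ordered sample quantiles plus a least-squares line through them.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.standard_t(df=5, size=250)   # simulated data, like rt(250, df = 5)

# Empirical CDF: sorted values against the fraction of points <= each value
xs = np.sort(x)
ecdf = np.arange(1, xs.size + 1) / xs.size

# Q-Q against the normal: theoretical quantiles (osm) vs ordered sample (osr)
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")

# Q-Q against the generating distribution, a t with 5 degrees of freedom
_, (slope_t, intercept_t, r_t) = stats.probplot(x, sparams=(5,), dist="t")

print("correlation with normal quantiles:", round(float(r), 4))
print("correlation with t(5) quantiles:  ", round(float(r_t), 4))
```

The closer the points lie to the fitted line (correlation near 1), the better the candidate distribution matches the sample.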

Exercise 1: fitting a distribution
Your exercise: do the same exploration and fitting for another 2 variables in the EPI_data, i.e. primary variables.
Try fitting other distributions – i.e. as ecdf or qq-

Distributions functions
Scipy: R: Matlab:

Comparing distributions
> boxplot(EPI,DALY)
> t.test(EPI,DALY)
Welch Two Sample t-test
data: EPI and DALY
t = , df = , p-value =
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean of x mean of y
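The Welch two-sample t-test has a direct SciPy counterpart, `scipy.stats.ttest_ind` with `equal_var=False`. The sketch below runs it on two synthetic samples (means, spreads and sizes are arbitrary assumptions, not the EPI and DALY columns) and also spells out the statistic by hand.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
epi = rng.normal(58.0, 12.0, 160)    # synthetic stand-in for EPI
daly = rng.normal(54.0, 25.0, 190)   # synthetic stand-in for DALY

# Welch's t-test: does not assume equal population variances
t, p = stats.ttest_ind(epi, daly, equal_var=False)

# The same statistic by hand: difference in means over its standard error
se = np.sqrt(epi.var(ddof=1) / epi.size + daly.var(ddof=1) / daly.size)
t_manual = (epi.mean() - daly.mean()) / se

print("t =", round(float(t), 3), " p-value =", float(p))
```

If the real columns contain missing values, they would need to be filtered out first, or `nan_policy="omit"` passed to `ttest_ind`.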

qqplot(EPI,DALY)

But there is more
Your exercise – intercompare: EPI, ENVHEALTH, ECOSYSTEM, DALY, AIR_H, WATER_H, AIR_E, WATER_E, BIODIVERSITY ** (subject to possible filtering…)

Exercise 2: filtering (populations)
Conditional filtering:
> EPILand<-EPI[!Landlock]
> ELand <- EPILand[!is.na(EPILand)]
> hist(ELand)
> hist(ELand, seq(30., 95., 1.0), prob=TRUE)
Repeat exercise 1…
Also look at: No_surface_water, Desert and High_Population_Density
Your exercise: how to filter on EPI_regions or GEO_subregion?
E.g. EPI_South_Asia <- EPI[<what is this>]
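The Landlock filter follows the same two-step pattern in NumPy: first select a sub-population with a boolean condition, then drop the missing values. The values and the `landlock` flags below are invented for illustration.

```python
import numpy as np

# Invented values: an EPI-like column and a 0/1 landlocked flag per country
epi = np.array([55.1, np.nan, 62.9, 48.3, 70.2, np.nan])
landlock = np.array([1, 0, 0, 1, 0, 1], dtype=bool)

epi_land = epi[~landlock]                 # like EPI[!Landlock]
e_land = epi_land[~np.isnan(epi_land)]    # like EPILand[!is.na(EPILand)]

print(e_land)
```

The order matters only for readability here; combining both conditions in one mask, `epi[~landlock & ~np.isnan(epi)]`, gives the same result.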

Exercise 3: testing the fits
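> shapiro.test(EPI)
> ks.test(EPI, "pnorm", mean(EPI, na.rm=TRUE), sd(EPI, na.rm=TRUE)) # ks.test needs a reference distribution or a second sample
How to interpret (we’ll cover this in the next few classes)?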
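In SciPy the two normality checks are `scipy.stats.shapiro` and `scipy.stats.kstest`. The sketch below uses a synthetic sample and standardizes it before the one-sample KS test; because the mean and standard deviation are estimated from the same data, the KS p-value should be read as approximate only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
epi = rng.normal(58.0, 12.0, 160)   # synthetic, NOT the real EPI values

# Shapiro-Wilk: W close to 1 is consistent with normality
w, p_sw = stats.shapiro(epi)

# One-sample KS against the standard normal, after standardizing the sample
z = (epi - epi.mean()) / epi.std(ddof=1)
d, p_ks = stats.kstest(z, "norm")

print("Shapiro-Wilk W =", round(float(w), 4), " p =", round(float(p_sw), 4))
print("KS D =", round(float(d), 4), " p =", round(float(p_ks), 4))
```

In both tests a small p-value is evidence against the normal hypothesis; a large one only means the test found no evidence against it.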

Variability in normal distributions

F-test
$F = S_1^2 / S_2^2$, where $S_1^2$ and $S_2^2$ are the sample variances.
The more this ratio deviates from 1, the stronger the evidence for unequal population variances.
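The ratio can be computed directly and referred to an F distribution with $n_1 - 1$ and $n_2 - 1$ degrees of freedom. The sketch below does this for two synthetic normal samples (sizes and spreads are arbitrary), using the two-sided p-value convention of doubling the smaller tail.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 40)   # synthetic sample 1
y = rng.normal(0.0, 2.0, 50)   # synthetic sample 2, larger spread

s1_sq = x.var(ddof=1)          # sample variance S_1^2
s2_sq = y.var(ddof=1)          # sample variance S_2^2
f = s1_sq / s2_sq              # F = S_1^2 / S_2^2

df1, df2 = x.size - 1, y.size - 1
# Two-sided p-value: twice the smaller of the two tail probabilities
p = 2 * min(stats.f.cdf(f, df1, df2), stats.f.sf(f, df1, df2))

print("F =", round(float(f), 3), " df =", (df1, df2), " p-value =", float(p))
```

The F-test assumes both samples come from normal populations, so it is sensitive to departures from normality.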

T-test

Exercise 4: joint distributions
> t.test(EPI,DALY)
Welch Two Sample t-test
data: EPI and DALY
t = , df = , p-value =
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
sample estimates:
mean of x mean of y
> var.test(EPI,DALY)
F test to compare two variances
F = , num df = 162, denom df = 191, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
ratio of variances

But if you are not sure it is normal
> wilcox.test(EPI,DALY)
Wilcoxon rank sum test with continuity correction
data: EPI and DALY
W = 15970, p-value =
alternative hypothesis: true location shift is not equal to 0
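The Wilcoxon rank sum test is the same procedure as the Mann-Whitney U test, which SciPy provides as `scipy.stats.mannwhitneyu`. A minimal sketch on synthetic, deliberately non-normal samples (the distributions and sizes are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
epi = rng.exponential(20.0, 160) + 30.0   # synthetic, skewed sample
daly = rng.exponential(25.0, 190) + 30.0  # synthetic, skewed sample

# Rank-based comparison of location; no normality assumption
u, p = stats.mannwhitneyu(epi, daly, alternative="two-sided")

print("U =", float(u), " p-value =", float(p))
```

Because the test works on ranks, it is unaffected by monotone transformations of the data, such as taking logarithms of both samples.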

Comparing the CDFs
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
> plot(ecdf(DALY), do.points=FALSE, verticals=TRUE, add=TRUE)

Kolmogorov-Smirnov (KS) test
> ks.test(EPI,DALY)
Two-sample Kolmogorov-Smirnov test
data: EPI and DALY
D = , p-value =
alternative hypothesis: two-sided
Warning message:
In ks.test(EPI, DALY) : p-value will be approximate in the presence of ties
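The two-sample KS statistic D is the largest vertical gap between the two empirical CDFs, which ties this test to the CDF comparison plot. The sketch below computes D by hand and with `scipy.stats.ks_2samp` on synthetic samples (not the real EPI and DALY columns).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
epi = rng.normal(58.0, 12.0, 160)    # synthetic stand-in for EPI
daly = rng.normal(54.0, 25.0, 190)   # synthetic stand-in for DALY

d, p = stats.ks_2samp(epi, daly)

# By hand: evaluate both ECDFs at every observed value, take the largest gap
pooled = np.sort(np.concatenate([epi, daly]))
f_epi = np.searchsorted(np.sort(epi), pooled, side="right") / epi.size
f_daly = np.searchsorted(np.sort(daly), pooled, side="right") / daly.size
d_manual = np.max(np.abs(f_epi - f_daly))

print("D =", round(float(d), 4), " p-value =", float(p))
```

Unlike the t-test, which compares only means, the KS test responds to any difference between the distributions, including spread and shape.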

Objective
Distributions
Populations
Fitting
Filtering
Testing

Scipy/numpy
numpy.ma.array (masked array)
np.histogram and then plt.bar, plt.show
scipy.stats.probplot (quantile-quantile)
Etc.
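As a small illustration of the masked-array route, the sketch below hides a sentinel value and then bins what remains. The sentinel `-99.0` and the data values are hypothetical; use whatever missing-value code your copy of the spreadsheet contains.

```python
import numpy as np

# Hypothetical raw column where -99.0 marks a missing value
raw = np.array([-99.0, 78.2, 63.7, -99.0, 66.4, 62.9])

epi = np.ma.masked_values(raw, -99.0)   # masked entries are skipped by statistics
print("valid values:", epi.count(), " mean:", round(float(epi.mean()), 2))

# np.histogram does not honour the mask, so pass only the unmasked data
counts, edges = np.histogram(epi.compressed(), bins=np.arange(30, 96, 5))
print(counts)
```

The `counts` and `edges` arrays are what would then be handed to `plt.bar` to draw the histogram.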

Matlab
In Matlab – use tf=ismissing(EPI,"--")
hist
boxplot
QQPlot
Etc.

Admin info (keep/ print this slide)
Class: ITWS-4963/ITWS 6965
Hours: 12:00pm-1:50pm Tuesday/Friday
Location: SAGE 3101
Instructor: Peter Fox
Instructor contact: (do not leave a msg)
Contact hours: Monday** 3:00-4:00pm (or by appt)
Contact location: Winslow 2120 (sometimes Lally 207A announced by )
TA: Lakshmi Chenicheri
Web site: Schedule, lectures, syllabus, reading, assignments, etc.
