Summarizing Data
Statistics
Probability vs. statistics: probability reasons from the population to the sample (sampling); statistics reasons from the sample back to the population (inference).
Distribution ?
Distribution: a mathematical way to represent the diversity of characteristics in a group. The group may be a sample or a population.
Distribution of a sample: realistic, built from data.
Population distribution: imaginary, described by theory (a model).
Statistics links the two.
Statistics starts from data.
Data are clues to the truth and tell us something about it. Data are not just sets of numbers.
The 1st principle of statistics: the sample is not the same as the population, but it represents the population sufficiently well.
Sample ≈ population
Datawork
Woodwork & Datawork
Woodwork: from the forest, making timber, inspecting the wood grain, cutting, structuring, finishing.
Datawork: from the real world, collecting data, exploring data, reducing data, modeling, evaluating.
Craft & Endeavor
Tools & Skills
Statistical tools:
Paper, pencil & calculator
Spreadsheet software (Excel)
Minitab, SPSS, SAS, R
DBMS (Access, Oracle, …)
C/C++, Java, Python, …
You need skill to use these.
You also need craft and experience. The more important point in datawork, however, is to work at gaining perspective on the data in front of you.
There is no standard recipe for good datawork. Think, think, and think! That's the only way.
Datawork is not magic. It's hard work. Salagadoola mechicka boola, bibbidi-bobbidi-boo --
Wood grain ?
Grain of data ?
Seeing the grain of data ≈ Exploratory Data Analysis
Exploratory Data Analysis (EDA): the step of checking the basic properties of the data using basic statistical methods. Through EDA we aim to develop insight into the data, as a first step toward more specific analysis.
Basic Statistical Methods for qualitative variables: frequency table, cross-tabulation (contingency table), bar chart, pie chart, …
Basic Statistical Methods for quantitative variables: (cumulative) frequency distribution, histogram, dot plot, stem-and-leaf diagram, scatter plot, box plot, …
Example Data: Credit_Card_Bank (p22 of SVV)
12 variables and 100 observations.
Many types of 'offer' are made to cardholders.
Goal: to find the type of 'offer' that increases cardholders' usage the most.
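Variables, as listed by names(dsv):
[1] "Offer.Status" (Categorical)
[2] "Charges.Aug.2008" (Quantitative)
[3] "Charges.Sept.2008" (Quantitative)
[4] "Charges.Oct.2008" (Quantitative)
[5] "Marketing.Segment" (Categorical)
[6] "Industry.Segment" (Categorical)
[7] "Spendlift.After.Promotion" (Quantitative)
[8] "Pre.Promotion.Avg.Spend" (Quantitative)
[9] "Post.Promotion.Avg.Spend" (Quantitative)
[10] "Retail.Customer" (Yes, No)
[11] "Enrolled.in.Program" (Yes, No)
[12] "Spendlift.Positive" (Yes, No)
Working variables: oct08, mseg, iseg, and loct08 = log(oct08).
data.svv <- dir("c:/temp/text")                          # file names in the data folder
dfile.svv <- paste("c:/temp/text/", data.svv, sep="")    # full paths
dsv <- read.table(dfile.svv[1], header=TRUE, sep="\t")   # tab-separated file with a header row
names(dsv)
oct08 <- dsv[,4]; loct08 <- log(oct08)    # assumed reconstruction: column 4 is Charges.Oct.2008
xoct08 <- loct08[oct08 > 0]               # assumed from the later use of xoct08: drop the log(0) = -Inf cases
mseg <- dsv[,5]; iseg <- dsv[,6]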
log(oct08), rounded to the 2nd decimal: round(loct08,2)
Zero charges give log(0) = -Inf.
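The -Inf entries come from cardholders whose October charges are zero. A minimal sketch of that filtering step, using a small made-up vector (`charges` is hypothetical, not the SVV data):

```r
# Hypothetical charges; two cardholders spent nothing.
charges <- c(0, 12.5, 250, 0, 1800)

log_charges <- log(charges)        # log(0) evaluates to -Inf
n_dropped   <- sum(charges == 0)   # number of -Inf cases

# Keep only the finite log values for later plots and summaries.
x <- log_charges[charges > 0]
round(x, 2)
```

Filtering on the original scale (`charges > 0`) rather than on the log values keeps the rule easy to read and to reuse for other variables.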
Sorted values of log(oct08) after deleting the 7 cases of -Inf: round(sort(xoct08),2)
iseg
[1] B B A T A T B A A T B T T B B B B B T B R A T A A
[26] R B B R T T T A A B T B A R B B A T B B R T T A A
[51] B A B B T A A T B A B A B R B A A R A T B T T B R
[76] T A T A A B B B T R T T R T B A A A A A A B T A T
Levels: A B R T
The meanings of the levels are not known.
levels(mseg) <- c("M","H","L","A","B")    # short codes, in the alphabetical order of the original labels
mseg <- factor(mseg, levels=c("L","B","M","A","H"), ordered=TRUE)    # ordered=TRUE is needed to print "L < B < ..."
mseg
[1] M L L M B A L A M H M L A M M B L B H L
[21] H B L H H M A B H L A H A B L H L B A A
[41] A H A L L H L A B A A B A B B A M A B L
[61] L B B H B A B A B L B A H L M L L M A B
[81] A L L M H A H H L A H L B A H A L L L H
Levels: L < B < M < A < H
L: low, B: below medium, M: medium, A: above medium, H: high
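The recoding can be tried on a toy vector. This is a sketch only: `seg` and its six values are made up, and the full labels are assumed to match the segment names used in the frequency table later in the deck.

```r
# Hypothetical segment labels as they might appear in the raw file.
seg <- factor(c("Average Spender", "High Spender", "Low Spender",
                "Med High Spender", "Med Low Spender", "Low Spender"))

# levels() lists the labels alphabetically, so the short codes follow that order.
levels(seg) <- c("M", "H", "L", "A", "B")

# Re-declare the factor with levels in spending order; ordered = TRUE
# makes R treat (and print) them as L < B < M < A < H.
seg <- factor(seg, levels = c("L", "B", "M", "A", "H"), ordered = TRUE)
table(seg)
```

Checking `levels(seg)` before assigning the short codes is a cheap safeguard, because the assignment is purely positional.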
Histogram of log(oct08) (x-axis: loct08, y-axis: Frequency): hist(xoct08,col="grey")
Stem-and-leaf display of log(oct08): stem(xoct08). Each row shows a stem and its leaves; for example, stem 2 with leaf 5 reads as 2.5.
Stem-and-leaf display with leaf unit = 1: stem(10*xoct08). Here stem 2 with leaf 5 reads as 25.
5-number summary of log(oct08) (Min., Q1, Median, Q3, Max.), together with the IQR: summary(xoct08)
Quartiles: Q1, Q2, Q3
Q1: the value ranked at 25% from the lowest
Q2: the value ranked at 50% from the lowest
Q3: the value ranked at 75% from the lowest
Median = Q2
IQR (Inter-Quartile Range) = Q3 - Q1
How to find Q1, Q2, Q3
Compute the rank c: Q1: c = 0.25*(n+1), Q2: c = 0.5*(n+1), Q3: c = 0.75*(n+1)
If c is an integer, take the c-th ranked value x[c].
If c is not an integer, take (x[c-] + x[c+])/2, where c- is the largest integer below c and c+ is the smallest integer above c.
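The rank rule above can be sketched as a small function. The name `slide_quartile` and the data are hypothetical, and the function assumes no missing values and n of at least 3. R's built-in `quantile()` uses a different interpolation rule by default, so its results may differ slightly.

```r
# Quartile by the (n + 1) rank rule described above.
# p is 0.25, 0.5 or 0.75; x is a numeric vector without missing values.
slide_quartile <- function(x, p) {
  x <- sort(x)
  pos <- p * (length(x) + 1)                 # rank c = p * (n + 1)
  if (pos == floor(pos)) {
    x[pos]                                   # integer rank: take that value
  } else {
    (x[floor(pos)] + x[ceiling(pos)]) / 2    # average the two neighbours
  }
}

x  <- c(7, 1, 5, 3, 9, 11, 13, 15)    # hypothetical data, n = 8
q1 <- slide_quartile(x, 0.25)         # c = 2.25 -> (x[2] + x[3]) / 2
q2 <- slide_quartile(x, 0.50)         # c = 4.5  -> (x[4] + x[5]) / 2
q3 <- slide_quartile(x, 0.75)         # c = 6.75 -> (x[6] + x[7]) / 2
c(Q1 = q1, Q2 = q2, Q3 = q3, IQR = q3 - q1)
```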
Sorted values of log(oct08) after deleting the 7 cases of -Inf: n = 93, so the ranks are 0.25*94 = 23.5 (Q1), 0.5*94 = 47 (Q2), 0.75*94 = 70.5 (Q3).
Dot plot of loct08
Box plot of oct08: boxplot(oct08)
Box plot of log(oct08): boxplot(xoct08)
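Anatomy of a box plot: the box runs from Q1 to Q3 (its length is the IQR) with Q2 marked inside. Each whisker ends at the most extreme non-outlier, e.g. min(non-outlier), within 1.5 IQR of the box. Points beyond the whiskers are marked * as mild or extreme outliers.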
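The 1.5 IQR rule can be checked numerically with base R's `boxplot.stats()`. This is a sketch on made-up data; note that R builds the box from Tukey's hinges, which may differ slightly from the (n+1) rank rule for quartiles.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)   # hypothetical data with one large value

bs <- boxplot.stats(x, coef = 1.5)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # points lying more than 1.5 * IQR beyond the box

# The same fences computed by hand from the hinges.
h <- fivenum(x)[c(2, 4)]
fences <- c(lower = h[1] - 1.5 * diff(h), upper = h[2] + 1.5 * diff(h))
fences
```

The upper whisker stops at the largest value inside the fence, not at the fence itself.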
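Frequency table of mseg
Columns: freq, %freq, cum. freq, %cum. freq
Rows: Low Spender, Med Low Spender, Average Spender, Med High Spender, High Spender, Total
table(mseg)                         # freq
table(mseg)/length(mseg)            # relative freq (multiply by 100 for %)
cumsum(table(mseg))                 # cum. freq
cumsum(table(mseg))/length(mseg)    # relative cum. freq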
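The four columns can be collected into one table. A minimal sketch with a made-up factor `seg` (hypothetical values, not the 100 cardholders):

```r
# Hypothetical segment codes in spending order.
seg <- factor(c("L", "B", "M", "A", "H", "L", "L", "H", "A", "B"),
              levels = c("L", "B", "M", "A", "H"))

freq <- table(seg)
n    <- length(seg)

freq_table <- data.frame(
  freq         = as.vector(freq),
  pct_freq     = 100 * as.vector(freq) / n,
  cum_freq     = cumsum(as.vector(freq)),
  pct_cum_freq = 100 * cumsum(as.vector(freq)) / n,
  row.names    = names(freq)
)
freq_table
```

Cumulative columns are only meaningful here because the levels have a natural order from low to high.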
Bar chart of log(oct08), binned into the intervals (2,3], (3,4], (4,5], (5,6], (6,7], (7,8], (8,9], (9,10], (10,11]
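One way to obtain such bins is `cut()` followed by `table()`. This is a sketch on made-up values; the slide does not show which code produced the chart.

```r
# Hypothetical log-charge values.
x <- c(2.5, 3.1, 3.9, 4.2, 5.0, 5.7, 6.3, 6.4, 7.8, 10.2)

bins   <- cut(x, breaks = 2:11)   # right-closed intervals (2,3], (3,4], ..., (10,11]
counts <- table(bins)
counts

# barplot(counts) would then draw one bar per interval.
```

Because the intervals are closed on the right, the value 5.0 falls in (4,5], not in (5,6].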
Histogram & Bar chart
Histogram: for quantitative variables; the bars are connected.
Bar chart: for categorical variables; the bars are disconnected.
Contingency table of mseg (rows: L, B, M, A, H, Total) and iseg (columns: A, B, R, T, Total)
table(mseg,iseg)
apply(table(mseg,iseg),1,sum)    # row totals
apply(table(mseg,iseg),2,sum)    # column totals
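Base R's `addmargins()` attaches both sets of totals to the table in one step. A minimal sketch with made-up factors `m` and `i` (hypothetical stand-ins for mseg and iseg):

```r
# Hypothetical marketing and industry segments for 8 cardholders.
m <- factor(c("L", "L", "B", "M", "A", "H", "H", "L"),
            levels = c("L", "B", "M", "A", "H"))
i <- factor(c("A", "B", "A", "T", "R", "B", "B", "T"),
            levels = c("A", "B", "R", "T"))

tab <- table(mseg = m, iseg = i)
with_totals <- addmargins(tab)   # appends a "Sum" row and a "Sum" column
with_totals

rowSums(tab)   # same as apply(tab, 1, sum)
colSums(tab)   # same as apply(tab, 2, sum)
```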
Pie chart of iseg (slices: A, B, R, T): pie(table(iseg),col=c("red","light green","green","blue"))
Segmented bar chart of (mseg, iseg), stacked (bars: A, B, R, T): barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"))
Segmented bar chart of (mseg, iseg), side by side (bars: A, B, R, T): barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"),beside=TRUE)
Mosaic plot of iseg (A, B, R, T) by mseg (L, B, M, A, H): mosaicplot(~iseg+mseg,col=rainbow(5))
Box plot of log(oct08) by mseg (groups: L, B, M, A, H): boxplot(loct08[oct08>0]~mseg[oct08>0])
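The same subset-then-group pattern gives numeric summaries to read alongside the box plot. A sketch with made-up values (`spend` and `seg` are hypothetical):

```r
# Hypothetical charges and segment codes for 8 cardholders.
spend <- c(0, 20, 35, 150, 400, 90, 0, 1200)
seg   <- factor(c("L", "L", "B", "M", "A", "B", "L", "H"),
                levels = c("L", "B", "M", "A", "H"))

keep      <- spend > 0          # drop zero charges before taking logs
log_spend <- log(spend[keep])
grp       <- seg[keep]          # subset the grouping factor the same way

med <- tapply(log_spend, grp, median)   # median log-charge per segment
med
```

Applying the same logical index to both the values and the grouping factor keeps the two vectors aligned, which is what the `[oct08>0]` on both sides of `~` does in the slide's call.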
Thank you !!