
Summarizing Data. Statistics statistics probability probability vs. statistics sampling inference.

Presentation on theme: "Summarizing Data. Statistics statistics probability probability vs. statistics sampling inference."— Presentation transcript:

1 Summarizing Data

2 Statistics

3 statistics probability probability vs. statistics sampling inference

4 Distribution ?

5 Distribution: a mathematical way to represent the diversity of characteristics within a group. The group may be a sample or a population, which gives the distribution of a sample and the population distribution.

6 Distribution of a sample (realistic; data) vs. population distribution (imaginary; theory or model): statistics

7

8 Statistics starts from data.

9 Data are clues to the truth and tell us about it. Data are not just sets of numbers.

10 The 1st principle of statistics: the sample is not the same as the population, but it represents the population sufficiently well.

11

12 Datawork

13 Woodwork & Datawork
Woodwork: from forest, making timber, inspecting wood grain, cutting, structuring, finishing.
Datawork: from real world, data collecting, exploring data, reducing data, modeling, evaluating.

14 Craft & Endeavor

15 Tools & Skills

16 Statistical tools: paper, pencil & calculator; spreadsheet software (Excel); Minitab, SPSS, SAS, R; DBMS (Access, Oracle, …); C/C++, Java, Python, … You need skill to use these.

17 You also need craft and experience. However, the more important point in datawork is trying to gain perspective on the data at hand.

18 There is no standard recipe for good datawork. Think, think, and think! That is the only way.

19

20 Datawork is not magic. It's hard work. Salagadoola mechicka boola, bibbidi-bobbidi-boo --

21 Wood grain ?

22 Grain of data ?

23 Seeing the grain of data ≈ Exploratory Data Analysis

24 Exploratory Data Analysis (EDA): the step of checking the basic properties of data using basic statistical methods. Through EDA, we aim to develop insight into the data as a first step toward more specific analysis.

25 Basic Statistical Methods for a qualitative variable: frequency table, crosstabulation (contingency table), bar chart, pie chart, …

26 Basic Statistical Methods for a quantitative variable: (cumulative) frequency distribution, histogram, dot plot, stem-and-leaf diagram, scatter plot, box plot, …

27 Example Data: Credit_Card_Bank (p22 of SVV). 12 variables and 100 observations. Cardholders receive many types of 'offer'. Goal: to find the type of 'offer' that increases cardholders' usage the most.

28 Variables:
[1] "Offer.Status" (Categorical)
[2] "Charges.Aug.2008" (Quantitative)
[3] "Charges.Sept.2008" (Quantitative)
[4] "Charges.Oct.2008" (Quantitative)
[5] "Marketing.Segment" (Categorical)
[6] "Industry.Segment" (Categorical)
[7] "Spendlift.After.Promotion" (Quantitative)
[8] "Pre.Promotion.Avg.Spend" (Quantitative)
[9] "Post.Promotion.Avg.Spend" (Quantitative)
[10] "Retail.Customer" (Yes, No)
[11] "Enrolled.in.Program" (Yes, No)
[12] "Spendlift.Positive" (Yes, No)
Working variables: oct08, mseg, iseg, and loct08 = log(oct08)
data.svv<-dir("c:/temp/text")
dfile.svv<-paste("c:/temp/text/",data.svv,sep="")
dsv<-read.table(dfile.svv[1],header=TRUE,sep="\t")
names(dsv)
# xoct08: the log values with the zero-charge cases dropped
oct08<-dsv[,4]; loct08<-log(oct08); xoct08<-loct08[oct08>0]
mseg<-dsv[,5]; iseg<-dsv[,6]

29 log(oct08), rounded to 2 decimal places. Note that log(0) = -Inf.
round(loct08,2)
[1] -Inf 6.21 3.96 3.84 6.96 6.95 7.89 7.35 3.97 5.97
[11] 5.50 8.00 6.30 3.13 4.58 8.89 3.81 7.00 8.37 7.85
[21] 7.42 6.86 8.45 6.12 5.62 8.21 6.91 6.87 7.15 5.46
[31] 6.71 6.12 -Inf 7.68 9.08 5.91 3.42 6.12 8.05 7.03
[41] 6.02 2.51 7.20 3.29 7.44 5.88 6.33 6.24 4.33 5.93
[51] 5.25 7.85 8.76 7.15 7.95 7.13 -Inf 7.13 8.11 8.05
[61] 9.11 5.56 8.24 -Inf 7.47 6.70 7.52 6.53 8.33 4.63
[71] 6.80 5.72 7.54 3.48 7.57 8.42 8.16 4.67 7.16 5.61
[81] -Inf 10.42 8.73 4.85 -Inf 6.63 5.48 4.89 8.35 4.65
[91] 5.56 7.39 3.11 3.90 5.72 7.10 -Inf 7.58 8.15 6.30
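
The -Inf entries above come from zero charges. A minimal sketch of the same effect and of one way to drop those cases, using a small hypothetical vector `charges` rather than the deck's data:

```r
# Hypothetical charges (not the Credit_Card_Bank data); two cardholders spent nothing
charges <- c(0, 500, 52.5, 0, 1050)

lc <- log(charges)        # log(0) evaluates to -Inf in R
keep <- charges > 0       # logical mask of the positive charges
x <- lc[keep]             # finite log values only

print(round(lc, 2))
print(sum(!keep))         # number of cases removed
```

Filtering on `charges > 0` rather than on the logged values keeps the rule tied to the raw data, which is the same condition the deck uses later in `loct08[oct08>0]`.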

30 Sorted values of log(oct08), after deleting the 7 cases of -Inf
round(sort(xoct08),2)
[1] 2.51 3.11 3.13 3.29 3.42 3.48 3.81 3.84 3.90 3.96
[11] 3.97 4.33 4.58 4.63 4.65 4.67 4.85 4.89 5.25 5.46
[21] 5.48 5.50 5.56 5.56 5.61 5.62 5.72 5.72 5.88 5.91
[31] 5.93 5.97 6.02 6.12 6.12 6.12 6.21 6.24 6.30 6.30
[41] 6.33 6.53 6.63 6.70 6.71 6.80 6.86 6.87 6.91 6.95
[51] 6.96 7.00 7.03 7.10 7.13 7.13 7.15 7.15 7.16 7.20
[61] 7.35 7.39 7.42 7.44 7.47 7.52 7.54 7.57 7.58 7.68
[71] 7.85 7.85 7.89 7.95 8.00 8.05 8.05 8.11 8.15 8.16
[81] 8.21 8.24 8.33 8.35 8.37 8.42 8.45 8.73 8.76 8.89
[91] 9.08 9.11 10.42

31 iseg (the meaning of the levels is not known)
iseg
[1] B B A T A T B A A T B T T B B B B B T B R A T A A
[26] R B B R T T T A A B T B A R B B A T B B R T T A A
[51] B A B B T A A T B A B A B R B A A R A T B T T B R
[76] T A T A A B B B T R T T R T B A A A A A A B T A T
Levels: A B R T

32 mseg, with levels L: low, B: below medium, M: medium, A: above medium, H: high
levels(mseg)<-c("M","H","L","A","B")
mseg<-factor(mseg, levels=c("L","B","M","A","H"))
mseg
[1] M L L M B A L A M H M L A M M B L B H L
[21] H B L H H M A B H L A H A B L H L B A A
[41] A H A L L H L A B A A B A B B A M A B L
[61] L B B H B A B A B L B A H L M L L M A B
[81] A L L M H A H H L A H L B A H A L L L H
Levels: L < B < M < A < H
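
The printed `Levels: L < B < M < A < H` is how R displays an ordered factor. The slide's `factor()` call does not show `ordered=TRUE`, so the following is only a sketch of how such an ordering could be declared, on a hypothetical vector `seg`:

```r
# Hypothetical segment labels (not the deck's data)
seg <- c("M", "L", "H", "B", "A", "L")

# Declare the level order explicitly and mark the factor as ordered
mseg_toy <- factor(seg, levels = c("L", "B", "M", "A", "H"), ordered = TRUE)

print(mseg_toy)                 # levels are printed with "<" between them
below_medium <- mseg_toy < "M"  # comparisons are defined for ordered factors
print(table(mseg_toy))          # counts appear in level order, not alphabetical
```

With an ordered factor, tables and plots follow the low-to-high order instead of the alphabetical one.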

33 Histogram of loct08 (x-axis: loct08; y-axis: Frequency)
hist(xoct08,col="grey")

34 Stem and leaf display: leaf unit = 0.1
stem(xoct08)
 2 | 5
 3 | 11345889
 4 | 003667789
 5 | 3555666677999
 6 | 0011122333567789999
 7 | 000111122244445556678999
 8 | 0001222234444789
 9 | 11
10 | 4
The number left of the bar is a stem and each digit right of it is a leaf; stem 2 with leaf 5 reads as 2.5.

35 Stem and leaf display: leaf unit = 1
stem(10*xoct08)
 2 | 5
 3 | 11345889
 4 | 003667789
 5 | 3555666677999
 6 | 0011122333567789999
 7 | 000111122244445556678999
 8 | 0001222234444789
 9 | 11
10 | 4
Here stem 2 with leaf 5 reads as 25.

36 5 number summary of log(oct08)
summary(xoct08)
Min.    Q1      Median  Q3      Max.
2.509   5.563   6.864   7.682   10.420
IQR = 2.119

37 Quartiles: Q1, Q2, Q3
Q1: the value ranked at 25% from the lowest
Q2: the value ranked at 50% from the lowest
Q3: the value ranked at 75% from the lowest
Median = Q2
IQR (Inter-Quartile Range) = Q3 - Q1

38 How to find Q1, Q2, Q3
Q1: c = 0.25*(n+1); Q2: c = 0.5*(n+1); Q3: c = 0.75*(n+1)
If c is an integer, take the c-th ranked value x[c].
If c is not an integer, take (x[c-] + x[c+])/2,
where c- is the largest integer below c and c+ is the smallest integer above c.
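
The (n+1) rule above can be sketched as a small function. The name `quartile_rule` and the data `toy` are hypothetical, and the clamping of the rank to 1..n is an added safeguard for very small samples, not part of the slide:

```r
# Quartile by the (n+1) rule: p = 0.25, 0.5 or 0.75
quartile_rule <- function(x, p) {
  x <- sort(x)
  n <- length(x)
  pos <- p * (n + 1)        # target rank c
  lo <- max(floor(pos), 1)  # c-: largest integer not above c (clamped to 1)
  hi <- min(ceiling(pos), n)  # c+: smallest integer not below c (clamped to n)
  if (lo == hi) x[lo] else (x[lo] + x[hi]) / 2
}

toy <- c(7, 1, 3, 9, 5, 11, 13, 15)   # hypothetical data, n = 8
q <- sapply(c(0.25, 0.5, 0.75), quartile_rule, x = toy)
print(q)
```

For `toy`, the ranks are 2.25, 4.5 and 6.75, so each quartile is the average of two neighbouring sorted values.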

39 Sorted values of log(oct08), after deleting the 7 cases of -Inf
[1] 2.51 3.11 3.13 3.29 3.42 3.48 3.81 3.84 3.90 3.96
[11] 3.97 4.33 4.58 4.63 4.65 4.67 4.85 4.89 5.25 5.46
[21] 5.48 5.50 5.56 5.56 5.61 5.62 5.72 5.72 5.88 5.91
[31] 5.93 5.97 6.02 6.12 6.12 6.12 6.21 6.24 6.30 6.30
[41] 6.33 6.53 6.63 6.70 6.71 6.80 6.86 6.87 6.91 6.95
[51] 6.96 7.00 7.03 7.10 7.13 7.13 7.15 7.15 7.16 7.20
[61] 7.35 7.39 7.42 7.44 7.47 7.52 7.54 7.57 7.58 7.68
[71] 7.85 7.85 7.89 7.95 8.00 8.05 8.05 8.11 8.15 8.16
[81] 8.21 8.24 8.33 8.35 8.37 8.42 8.45 8.73 8.76 8.89
[91] 9.08 9.11 10.42
n = 93, 0.25*94 = 23.5, 0.5*94 = 47, 0.75*94 = 70.5
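
As a worked check of these ranks, the arithmetic below reads the needed values straight off the sorted listing (x[23], x[24], x[47], x[70], x[71]); it is an illustration with the rounded values, not the deck's own code:

```r
n <- 93
pos <- c(Q1 = 0.25, Q2 = 0.5, Q3 = 0.75) * (n + 1)   # ranks 23.5, 47, 70.5

q1 <- (5.56 + 5.56) / 2   # rank 23.5: average of x[23] and x[24]
q2 <- 6.86                # rank 47 is an integer: take x[47]
q3 <- (7.68 + 7.85) / 2   # rank 70.5: average of x[70] and x[71]

print(pos)
print(c(Q1 = q1, Q2 = q2, Q3 = q3))
```

The Q3 from this rule differs slightly from the 7.682 reported by `summary(xoct08)`, because R's default quantile method interpolates differently from the (n+1) rule.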

40 Dot plot of loct08

41 Box plot of oct08 (axis from 0 to 30000) and box plot of log(oct08) (axis from 4 to 10)
boxplot(oct08)
boxplot(xoct08)

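42 Reading a box plot: the box runs from Q1 to Q3 (its length is the IQR) with Q2 marked inside; the whiskers end at the most extreme non-outlier values, e.g. min(non-outlier), within 1.5 IQR of the box; points beyond are drawn as * and labeled mild outlier or extreme outlier.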
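
The 1.5 IQR rule can be sketched numerically with the Q1 and Q3 reported for log(oct08). The 3 IQR cut-off for "extreme" is a common convention assumed here, since the slide does not state it, and `classify` is a hypothetical helper:

```r
q1 <- 5.563; q3 <- 7.682                   # from the 5 number summary of log(oct08)
iqr <- q3 - q1                             # 2.119

inner <- c(q1 - 1.5 * iqr, q3 + 1.5 * iqr) # beyond these: at least a mild outlier
outer <- c(q1 - 3 * iqr, q3 + 3 * iqr)     # assumption: beyond these is "extreme"

classify <- function(v) {
  ifelse(v < outer[1] | v > outer[2], "extreme",
         ifelse(v < inner[1] | v > inner[2], "mild", "none"))
}

print(round(inner, 4))
# 2.51 and 10.42 are the observed min and max; 11.5 and 15 are made-up values
print(classify(c(2.51, 10.42, 11.5, 15)))
```

The observed minimum and maximum both fall inside the inner fences, so under these cut-offs log(oct08) has no outliers.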

43 Frequency table
                   freq   rel. freq   cum. freq   cum. rel. freq
Low Spender          26        0.26          26             0.26
Med Low Spender      20        0.20          46             0.46
Average Spender      11        0.11          57             0.57
Med High Spender     25        0.25          82             0.82
High Spender         18        0.18         100             1.00
------------------------------------------------------------
Total               100        1.00
table(mseg)
table(mseg)/length(mseg)
cumsum(table(mseg))
cumsum(table(mseg))/length(mseg)
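
The four columns of the frequency table can be assembled in one data frame. This sketch starts from the counts printed on the slide rather than from `mseg` itself; the column names are illustrative:

```r
# Counts as printed in the frequency table
counts <- c("Low Spender" = 26, "Med Low Spender" = 20, "Average Spender" = 11,
            "Med High Spender" = 25, "High Spender" = 18)
n <- sum(counts)

ftab <- data.frame(freq         = counts,
                   rel.freq     = counts / n,
                   cum.freq     = cumsum(counts),
                   cum.rel.freq = cumsum(counts) / n)
print(ftab)
```

Cumulative columns are only meaningful here because the categories run from low to high spenders.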

44 Bar chart of log(oct08). Bins: (2,3], (3,4], (4,5], (5,6], (6,7], (7,8], (8,9], (9,10], (10,11]; y-axis: frequency, 0 to 20.

45 Histogram & Bar chart
Histogram: for quantitative variables; the bars are connected.
Bar chart: for categorical variables; the bars are disconnected.
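
A quantitative variable can be turned into categories for a bar chart by binning it. A minimal sketch with `cut()`, using a few values taken from the sorted listing of log(oct08); the choice of `breaks = 2:11` is an assumption that matches the intervals (2,3] to (10,11]:

```r
x <- c(2.51, 3.11, 5.56, 6.86, 7.68, 10.42)  # a handful of the sorted log values

bins <- cut(x, breaks = 2:11)   # right-closed intervals (2,3], (3,4], ..., (10,11]
freq <- table(bins)             # one count per interval, empty bins included
print(freq)
```

Passing `freq` to `barplot()` would draw disconnected bars, whereas `hist(x, breaks = 2:11)` would draw the connected version of the same counts.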

46 Contingency table of mseg (rows) and iseg (columns)
          A    B    R    T   Total
L         5   13    0    8      26
B        11    8    0    1      20
M         2    4    2    3      11
A         8    7    2    8      25
H         5    0    6    7      18
Total    31   32   10   27     100
table(mseg,iseg)
apply(table(mseg,iseg),1,sum)
apply(table(mseg,iseg),2,sum)
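
The margins of the contingency table can also be produced in one call. This sketch rebuilds the table from the printed counts instead of from `mseg` and `iseg`; the row percentages are an added illustration, not something shown on the slide:

```r
# Cell counts as printed in the contingency table (rows: mseg, columns: iseg)
tab <- as.table(matrix(c( 5, 13, 0, 8,
                         11,  8, 0, 1,
                          2,  4, 2, 3,
                          8,  7, 2, 8,
                          5,  0, 6, 7),
                       nrow = 5, byrow = TRUE,
                       dimnames = list(mseg = c("L", "B", "M", "A", "H"),
                                       iseg = c("A", "B", "R", "T"))))

print(addmargins(tab))            # appends the row and column totals
row_prop <- prop.table(tab, 1)    # share of each iseg level within an mseg row
print(round(row_prop, 2))
```

Row proportions make the segments comparable even though their sizes differ, for example 18 high spenders against 26 low spenders.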

47 Pie chart of iseg (A: 31, B: 32, R: 10, T: 27)
pie(table(iseg),col=c("red","light green","green","blue"))

48 Segmented bar chart of (mseg, iseg), serial (stacked); x-axis: iseg levels A, B, R, T; y-axis: 0 to 30
barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"))

49 Segmented bar chart of (mseg, iseg), parallel (side by side); x-axis: iseg levels A, B, R, T; y-axis: 0 to 12
barplot(table(mseg,iseg),col=c("red","light green","green","blue","purple"),beside=TRUE)

50 Mosaic plot of iseg (A, B, R, T) by mseg (L, B, M, A, H)
mosaicplot(~iseg+mseg,col=rainbow(5))

51 Box plot of log(oct08) by mseg (groups L, B, M, A, H; axis from 4 to 10)
boxplot(loct08[oct08>0]~mseg[oct08>0])

52 ABCDEF 1011033 7171559 2021712315 141126522 141634315 121413616 101725113 231715110 171935326 202105226 14712624 13 444

53

54 Thank you !!

