Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics.


1 Data Mining: What is All That Data Telling Us? Dave Dickey NCSU Statistics

2 What I know: What "they" can do; how "they" can do it. What I don't know: What is some particular entity doing? How safe is your particular information? Is big brother watching me right now?

3 * Data being created at lightning pace * Moore’s law: (doubling / 2 years – transistors on integrated circuits) Internet “hits” Scanner cards e-mails Intercepted messages Credit scores Environmental Monitoring Satellite Images Weather Data Health & Birth Records

4 So we have some data – now what?? Predict defaults, dropouts, etc. Find buying patterns Segment your market Detect SPAM (or others) Diagnose handwriting Cluster ANALYZE IT !!!!

5 Data Mining - What is it? Large datasets Fast methods Not significance testing Topics –Trees (recursive splitting) –Nearest Neighbor –Neural Networks –Clustering –Association Analysis

6 Trees A “divisive” method (splits) Start with “root node” – all in one group Get splitting rules Response often binary Result is a “tree” Example: Framingham Heart Study

7 Recursive Splitting X1 = Debt To Income Ratio, X2 = Age. (Scatter plot of borrowers, defaults marked "D" among non-defaults "x"; recursive splits carve the plane into rectangles with Pr{default} = 0.0001, 0.003, 0.006, 0.007, and 0.012.)

8 Some Actual Data Framingham Heart Study First Stage Coronary Heart Disease –P{CHD} = Function of: Age (no drug for that yet!), Cholesterol, Systolic BP

9 Example of a "tree" (diagram): root node contains all 1615 patients; Split #1 is on Age; a further split uses Systolic BP; the branches end in "terminal nodes".

10 How to make splits? Which variable to use? Where to split? –Cholesterol > ____ –Systolic BP > _____ Goal: Pure “leaves” or “terminal nodes” Ideal split: Everyone with BP>x has problems, nobody with BP<x has problems

11 Where to Split? Maximize "dependence" statistically. We use "contingency tables":

DEPENDENT                Heart Disease
                         No     Yes
          Low BP         95      5     100
          High BP        55     45     100

INDEPENDENT              Heart Disease
                         No     Yes
          Low BP         75     25
          High BP        75     25

12 Measuring Dependence Expect 100(150/200) = 75 in the upper left if independent (etc., e.g. 100(50/200) = 25). Observed (expected):

                         Heart Disease
                         No        Yes
          Low BP         95 (75)    5 (25)    100
          High BP        55 (75)   45 (25)    100
                        150        50         200

How far from expectations is "too far" (significant dependence)?

13 χ² Test Statistic

          Low BP         95 (75)    5 (25)    100
          High BP        55 (75)   45 (25)    100
                        150        50         200

χ² = Σ (observed - expected)²/expected = 2(400/75) + 2(400/25) = 42.67. So what?

14 Use Probability! “P-value” “Significance Level” (0.05)

15 Measuring "Worth" of a Split P-value is the probability of a χ² as great as that observed if independence is true. Pr{χ² > 42.67} = 0.000000000064. P-values this small are hard to read, so use Logworth = -log10(p-value) = 10.19. Best chi-square ⇔ max logworth.
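As a quick check on these numbers, here is a minimal Python sketch (not part of the original slides; it assumes NumPy and SciPy are available) that reproduces the chi-square statistic, its p-value, and the logworth for the table above.

```python
import numpy as np
from scipy.stats import chi2

# Observed 2x2 table: rows = Low BP / High BP, columns = No / Yes heart disease
observed = np.array([[95.0, 5.0],
                     [55.0, 45.0]])

# Expected counts under independence: (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()      # [[75, 25], [75, 25]]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq = ((observed - expected) ** 2 / expected).sum()   # 42.67

# Upper-tail p-value on 1 degree of freedom, then logworth = -log10(p-value)
p_value = chi2.sf(chi_sq, df=1)                          # ~6.4e-11
logworth = -np.log10(p_value)                            # ~10.19
print(chi_sq, p_value, logworth)
```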

16 Logworth for Age Splits Age 47 maximizes logworth

17 How to make splits? Which variable to use? Where to split? –Cholesterol > ____ –Systolic BP > _____ Idea – pick the BP cutoff that minimizes the p-value for χ². But what does "significance" mean now?

18 Multiple testing 50 different BPs in the data, so 49 ways to split. Sunday football highlights always look good! If he shoots enough baskets, even a 95% free-throw shooter will miss some. We tried 49 splits, and each has a 5% chance of declaring significance even if there is no relationship.

19 Multiple testing α = Pr{falsely reject hypothesis 1} α = Pr{falsely reject hypothesis 2} Pr{falsely reject one or the other} < 2α Desired: 0.05 probability or less Solution: use α = 0.05/2 Or – compare 2(p-value) to 0.05

20 Multiple testing 50 different BPs in data, m = 49 ways to split Multiply p-value by 49 Stop splitting if minimum p-value is large (logworth is small). For m splits, logworth becomes -log10(m*p-value)
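A small sketch of that adjustment (Python; the p-value is the one computed for the best split above, and m = 49 is the number of candidate cutoffs):

```python
import numpy as np

p_value = 6.4e-11   # best single-split p-value from the chi-square above
m = 49              # number of candidate split points that were tried

adjusted_logworth = -np.log10(m * p_value)   # penalize for trying 49 splits
print(adjusted_logworth)                     # 10.19 - log10(49), roughly 8.5
```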

21 Other Split Evaluations Gini Diversity Index –{ E E E E G E G G L G } –Pick 2, Pr{different} = 1 - Pr{EE} - Pr{GG} - Pr{LL} = 1 - [10 + 6 + 0]/45 = 29/45 = 0.64 –{ E E G L G E E G L L } 1 - [6 + 3 + 3]/45 = 33/45 = 0.73 → MORE DIVERSE, LESS PURE Shannon Entropy –Larger → more diverse (less pure) –Entropy = -Σ_i p_i log2(p_i) {0.5, 0.4, 0.1} → 1.36 {0.4, 0.2, 0.3} → 1.51 (more diverse)
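Both diversity measures are easy to verify. A minimal Python sketch (standard library only; not part of the slides) that reproduces the values above:

```python
from math import comb, log2
from collections import Counter

def gini_diversity(labels):
    # Pr{two items drawn without replacement have different labels}
    n = len(labels)
    same = sum(comb(c, 2) for c in Counter(labels).values())
    return 1 - same / comb(n, 2)

def shannon_entropy(probs):
    # -sum p*log2(p): larger means more diverse (less pure)
    return -sum(p * log2(p) for p in probs if p > 0)

print(gini_diversity(list("EEEEGEGGLG")))   # 29/45 ≈ 0.644
print(gini_diversity(list("EEGLGEEGLL")))   # 33/45 ≈ 0.733 (more diverse)
print(shannon_entropy([0.5, 0.4, 0.1]))     # ≈ 1.36
```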

22 Goals Split if diversity in the parent "node" > summed diversities in the child nodes. Observations should be –Homogeneous (not diverse) within leaves –Different between leaves (the set of leaves should be diverse) The Framingham tree used Gini for splits.

23 Cross validation Traditional stats – small dataset, need all observations to estimate parameters of interest. Data mining – loads of data, can afford “holdout sample” Variation: n-fold cross validation –Randomly divide data into n sets –Estimate on n-1, validate on 1 –Repeat n times, using each set as holdout.
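A rough sketch of the n-fold partitioning in Python (assumes NumPy; the 1615-observation count is borrowed from the Framingham slide, and the model-fitting step is only a comment):

```python
import numpy as np

def n_fold_indices(n_obs, n_folds, seed=0):
    # Randomly partition the observation indices into n_folds holdout sets
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_obs), n_folds)

folds = n_fold_indices(n_obs=1615, n_folds=10)
for holdout in folds:
    train = np.setdiff1d(np.arange(1615), holdout)
    # fit the tree (or any model) on `train`, then evaluate it on `holdout`
```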

24 Pruning Grow bushy tree on the “fit data” Classify holdout data Likely farthest out branches do not improve, possibly hurt fit on holdout data Prune non-helpful branches. What is “helpful”? What is good discriminator criterion?

25 Goals Want diversity in parent “node” > summed diversities in child nodes Goal is to reduce diversity within leaves Goal is to maximize differences between leaves Use same evaluation criteria as for splits Costs (profits) may enter the picture for splitting or evaluation.

26 Accounting for Costs Pardon me (sir, ma’am) can you spare some change? Say “sir” to male +$2.00 Say “ma’am” to female +$5.00 Say “sir” to female -$1.00 (balm for slapped face) Say “ma’am” to male -$10.00 (nose splint)

27 Including Probabilities Leaf has Pr(M) = 0.7, Pr(F) = 0.3.

                      True Gender
   You say:           M            F
   "Sir"              0.7 (2)      0.3 (-1)
   "Ma'am"            0.7 (-10)    0.3 (5)

Expected profit is 2(0.7) - 1(0.3) = $1.10 if I say "sir". Expected profit is -7 + 1.5 = -$5.50 (a loss) if I say "ma'am". Weight leaf profits by leaf size (# obsns.) and sum. Prune (and split) to maximize profits.
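A small Python sketch of that expected-profit calculation for one leaf (profits and probabilities are the slide's; the dictionary layout is just one possible encoding):

```python
# Profit matrix: (what you say, true gender) -> dollars
profit = {("sir",   "M"):   2.0, ("sir",   "F"): -1.0,
          ("ma'am", "M"): -10.0, ("ma'am", "F"):  5.0}

def expected_profit(action, p_male):
    # Weight each outcome's profit by the leaf's gender probabilities
    return p_male * profit[(action, "M")] + (1 - p_male) * profit[(action, "F")]

p_male = 0.7   # this leaf: Pr(M) = 0.7, Pr(F) = 0.3
for action in ("sir", "ma'am"):
    print(action, expected_profit(action, p_male))   # sir: 1.10, ma'am: -5.50
```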

28 Additional Ideas Forests – Draw samples with replacement (bootstrap) and grow multiple trees. Random Forests – Randomly sample the “features” (predictors) and build multiple trees. Classify new point in each tree then average the probabilities, or take a plurality vote from the trees

29 * Lift Chart - Go from leaf of most to least response. - Lift is cumulative proportion responding.

30 Regression Trees Continuous response (not just a class). The predicted response is constant within each region. (Diagram: rectangles in the X1-X2 plane labeled Predict 20, 50, 80, 100, 130.) Example cell: responses {47, 51, 57, 45}; predict their mean, 50.

31 Predict P_i in cell i (it is the cell mean). Y_ij = jth response in cell i. Split to minimize Σ_i Σ_j (Y_ij - P_i)² [the sum of squared deviations from the cell means]. Example cell {47, 51, 57, 45}: deviations from the mean are {-3, 1, 7, -5}, so SSq = 9 + 1 + 49 + 25 = 84.
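A minimal NumPy sketch of that criterion (the helper name and the loop over cutoffs are illustrative, not from the slides):

```python
import numpy as np

y = np.array([47.0, 51.0, 57.0, 45.0])
print(((y - y.mean()) ** 2).sum())   # mean 50, deviations {-3, 1, 7, -5}, SSq = 84

def total_ssq(y, x, cutoff):
    # Within-cell sum of squared deviations after splitting on x <= cutoff
    total = 0.0
    for cell in (y[x <= cutoff], y[x > cutoff]):
        if cell.size:
            total += ((cell - cell.mean()) ** 2).sum()
    return total

# Evaluate total_ssq at every candidate cutoff of x and keep the minimizer.
```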

32 Predict P_i in cell i. Y_ij = jth response in cell i. Split to minimize Σ_i Σ_j (Y_ij - P_i)².

33 Logistic Regression Logistic – another classifier Older – "tried & true" method Predict probability of response from input variables ("features") Need to ensure 0 < probability < 1

34

35 Example: Shuttle Missions O-rings failed in Challenger disaster Low temperature Prior flights “erosion” and “blowby” in O-rings Feature: Temperature at liftoff Target: problem (1) - erosion or blowby vs. no problem (0)

36 We can easily "fit" lines, but lines exceed 1 and fall below 0. Model L as linear in temperature: L = a + b(temp). Convert: p = e^L / (1 + e^L) = e^(a+b(temp)) / (1 + e^(a+b(temp))).

37

38 Example: Ignition Flame exposure time = X. Ignited Y=1, did not ignite Y=0. –Y=0: X = 3, 5, 9, 10, 13, 16 –Y=1: X = 11, 12, 14, 15, 17, 25, 30 Probability of our data is "Q". With the observations listed in order of X, Q = (1-p)(1-p)(1-p)(1-p)pp(1-p)pp(1-p)ppp. The p's are all different: p = f(exposure). Find a, b to maximize Q(a,b).
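One way to carry out the "find a, b to maximize Q" step is to minimize -log Q numerically. A hedged Python sketch (assumes SciPy; the data are exactly those listed above):

```python
import numpy as np
from scipy.optimize import minimize

# Ignition data: exposure time x and ignition indicator y
x = np.array([3, 5, 9, 10, 13, 16, 11, 12, 14, 15, 17, 25, 30], dtype=float)
y = np.array([0, 0, 0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1], dtype=float)

def neg_log_Q(params):
    a, b = params
    p = 1.0 / (1.0 + np.exp(-(a + b * x)))   # p_i = e^L/(1+e^L), L = a + b*x_i
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = minimize(neg_log_Q, x0=[0.0, 0.0])
print(fit.x)   # roughly a = -2.59, b = 0.23, matching the SAS output on the next slide
```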

39 Likelihood function Q(a, b) (surface plot): the maximum is near a ≈ -2.6, b ≈ 0.23.

40 IGNITION DATA
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
                          Standard       Wald
Parameter  DF  Estimate      Error  Chi-Square  Pr > ChiSq
Intercept   1   -2.5879     1.8469      1.9633      0.1612
TIME        1    0.2346     0.1502      2.4388      0.1184

Association of Predicted Probabilities and Observed Responses
Percent Concordant  79.2    Somers' D  0.583
Percent Discordant  20.8    Gamma      0.583
Percent Tied         0.0    Tau-a      0.308
Pairs               48      c          0.792

41 (Classification results from the fitted model: 5 right, 4 wrong; 4 right, 1 wrong.)

42 Example: Framingham X=age Y=1 if heart trouble, 0 otherwise

43

44 Framingham
The LOGISTIC Procedure
Analysis of Maximum Likelihood Estimates
                          Standard       Wald
Parameter  DF  Estimate      Error  Chi-Square  Pr > ChiSq
Intercept   1   -5.4639     0.5563     96.4711      <.0001
age         1    0.0630     0.0110     32.6152      <.0001
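Plugging those estimates into the logistic conversion from slide 36 gives predicted probabilities at any age. A small sketch (the example ages are arbitrary):

```python
import numpy as np

a, b = -5.4639, 0.0630   # intercept and age coefficient from the output above

def p_chd(age):
    # Predicted probability of first-stage CHD at a given age
    L = a + b * age
    return np.exp(L) / (1 + np.exp(L))

for age in (40, 55, 70):
    print(age, round(p_chd(age), 3))   # rises with age: roughly 0.05, 0.12, 0.26
```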

45

46

47 Neural Networks Very flexible functions. "Hidden layers" – the "Multilayer Perceptron". The output is a logistic function of logistic functions of the data (inputs → hidden layer → output).

48 Arrows represent linear combinations of "basis functions," e.g. logistic functions. Example: Y = a + b1*p1 + b2*p2 + b3*p3, such as Y = 4 + p1 + 2*p2 - 4*p3

49 Should always use holdout sample Perturb coefficients to optimize fit (fit data) Eliminate unnecessary arrows using holdout data.

50 Terms Train: estimate coefficients Bias: intercept a in Neural Nets Weights: coefficients b Radial Basis Function: Normal density Score: Predict (usually Y from new Xs) Activation Function: transformation to target Supervised Learning: Training data has response.

51 Hidden Layer L1 = -1.87 - 0.27*Age - 0.20*SBP22 H11 = exp(L1)/(1+exp(L1)) ← the "Activation Function" L2 = -20.76 - 21.38*H11 Pr{first_chd} = exp(L2)/(1+exp(L2))
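That hidden-layer arithmetic is just a tiny forward pass. A Python sketch (the SBP22 input is written simply as sbp here; the coefficients are the ones on the slide):

```python
import numpy as np

def logistic(L):
    # The "activation function": squashes any L into (0, 1)
    return np.exp(L) / (1 + np.exp(L))

def p_first_chd(age, sbp):
    L1 = -1.87 - 0.27 * age - 0.20 * sbp    # linear combination feeding the hidden unit
    H11 = logistic(L1)                      # hidden-unit output
    L2 = -20.76 - 21.38 * H11               # linear combination feeding the output
    return logistic(L2)                     # predicted Pr{first_chd}
```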

52

53 Unsupervised Learning We have the "features" (predictors) We do NOT have the response, even on a training data set (UNsupervised) Clustering –Agglomerative: start with each point separated –Divisive: start with all points in one cluster, then split

54 Clustering – political (hypothetical) 300 people, each asked to "mark a line to indicate concern" (---------0--------------) on three issues: X1: economy, X2: war in Iraq, X3: health care. 1st person: (2.2, -3.1, 0.9); 2nd person: (-1.6, 1, 0.6); etc.
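As one illustration of clustering such scores, here is a short sketch using SciPy's agglomerative (hierarchical) clustering on made-up data of this shape (the slides do not name a particular algorithm or package, so this is only an example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical concern scores: one row per person, columns = economy, war, health care
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))

Z = linkage(X, method="ward")                    # agglomerative: repeatedly merge closest groups
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the tree into 3 clusters
print(np.bincount(labels)[1:])                   # cluster sizes (labels start at 1)
```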

55 Clusters as Created

56 As Clustered

57 Association Analysis Market basket analysis –What they're doing when they scan your "VIP" card at the grocery –People who buy diapers tend to also buy _________ (beer?) –Just a matter of accounting, but with new terminology (of course) –Examples from SAS Appl. DM Techniques, by Sue Walsh:

58 Terminology Baskets: ABC, ACD, BCD, ADE, BCE

Rule (X => Y)   Support = Pr{X and Y}   Confidence = Pr{Y|X}
A => D          2/5                     2/3
C => A          2/5                     2/4
B&C => D        1/5                     1/3
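The accounting behind those columns is easy to reproduce. A small Python sketch over the five baskets above:

```python
baskets = [set("ABC"), set("ACD"), set("BCD"), set("ADE"), set("BCE")]

def support_confidence(lhs, rhs):
    # Support = Pr{X and Y}; Confidence = Pr{Y | X}, counted over the basket list
    lhs, rhs = set(lhs), set(rhs)
    has_lhs = sum(lhs <= b for b in baskets)
    has_both = sum((lhs | rhs) <= b for b in baskets)
    return has_both / len(baskets), has_both / has_lhs

print(support_confidence("A", "D"))    # (2/5, 2/3)
print(support_confidence("C", "A"))    # (2/5, 2/4)
print(support_confidence("BC", "D"))   # (1/5, 1/3)
```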

59 Don't be Fooled! Lift = Confidence / Expected Confidence if Independent

                      Checking
Saving                No (1500)   Yes (8500)   (10000)
No                    500         3500          4000
Yes                   1000        5000          6000

SVG => CHKG: expect 8500/10000 = 85% if independent. Observed confidence is 5000/6000 = 83%. Lift = 83/85 < 1. Savings account holders are actually LESS likely than others to have a checking account!
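And the lift calculation for this table, as a quick check (the counts are the ones shown):

```python
n_total = 10000
n_checking = 8500      # people with a checking account
n_savings = 6000       # people with a savings account
n_both = 5000          # people with both

expected_conf = n_checking / n_total    # 85% if independent
observed_conf = n_both / n_savings      # 5000/6000, about 83%
lift = observed_conf / expected_conf    # < 1: savings holders LESS likely to have checking
print(expected_conf, observed_conf, lift)
```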

60 Summary Data mining – a set of fast stat methods for large data sets Some new ideas, many old or extensions of old Some methods: –Decision Trees –Nearest Neighbor –Neural Nets –Clustering –Association

