Slide 1: Collective Intelligence, Week 7: Decision Trees
Old Dominion University, Department of Computer Science
CS 795/895, Spring 2009
Michael L. Nelson, 2/25/09
Slide 2: Decision Trees
[Figure: a decision tree for classifying fruit]
Slide 3: Scenario: Predicting Subscriptions
Your web site offers premium content. You run a promotion offering free subscriptions for some period of time, and you collect mostly HTTP-level information about the people signing up. Can we predict who will sign up for basic or premium service at the end of the trial period, based on the data we've collected?
Slide 4: User Data
The user data is Table 7-1 in the book. Add this row to the table to make it match the data/code:

Referrer   Location  Read FAQ  Pages viewed  Service chosen
slashdot   UK        no        21            None
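For reference, here is the full data set as a Python list (reconstructed from the divideset transcript on the next slide, so the row order may differ from the book's treepredict.my_data listing):

# columns: referrer, location, read FAQ, pages viewed, service chosen
my_data = [
    ['slashdot', 'USA', 'yes', 18, 'None'],
    ['google', 'France', 'yes', 23, 'Premium'],
    ['digg', 'USA', 'yes', 24, 'Basic'],
    ['kiwitobes', 'France', 'yes', 23, 'Basic'],
    ['google', 'UK', 'no', 21, 'Premium'],
    ['(direct)', 'New Zealand', 'no', 12, 'None'],
    ['(direct)', 'UK', 'no', 21, 'Basic'],
    ['google', 'USA', 'no', 24, 'Premium'],
    ['slashdot', 'France', 'yes', 19, 'None'],
    ['digg', 'USA', 'no', 18, 'None'],
    ['google', 'UK', 'no', 18, 'None'],
    ['kiwitobes', 'UK', 'no', 19, 'None'],
    ['digg', 'New Zealand', 'yes', 12, 'Basic'],
    ['slashdot', 'UK', 'no', 21, 'None'],
    ['google', 'UK', 'yes', 18, 'Basic'],
    ['kiwitobes', 'France', 'yes', 19, 'Basic']]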
Slide 5: Is Reading the FAQ a Good Predictor for Subscription?

>>> import treepredict
>>> treepredict.divideset(treepredict.my_data,2,'yes')
([['slashdot', 'USA', 'yes', 18, 'None'],
  ['google', 'France', 'yes', 23, 'Premium'],
  ['digg', 'USA', 'yes', 24, 'Basic'],
  ['kiwitobes', 'France', 'yes', 23, 'Basic'],
  ['slashdot', 'France', 'yes', 19, 'None'],
  ['digg', 'New Zealand', 'yes', 12, 'Basic'],
  ['google', 'UK', 'yes', 18, 'Basic'],
  ['kiwitobes', 'France', 'yes', 19, 'Basic']],
 [['google', 'UK', 'no', 21, 'Premium'],
  ['(direct)', 'New Zealand', 'no', 12, 'None'],
  ['(direct)', 'UK', 'no', 21, 'Basic'],
  ['google', 'USA', 'no', 24, 'Premium'],
  ['digg', 'USA', 'no', 18, 'None'],
  ['google', 'UK', 'no', 18, 'None'],
  ['kiwitobes', 'UK', 'no', 19, 'None'],
  ['slashdot', 'UK', 'no', 21, 'None']])

line breaks & spaces added for clarity (cf. Table 7-2)
Eyeballing the result, it doesn't appear that reading the FAQ is a good predictor.
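divideset itself isn't shown on the slide; here is a minimal sketch of what it does, assuming the book's behavior (numeric columns split on >=, everything else on equality):

def divideset(rows, column, value):
    # numeric values split on >=, categorical values on equality
    if isinstance(value, int) or isinstance(value, float):
        split_function = lambda row: row[column] >= value
    else:
        split_function = lambda row: row[column] == value
    set1 = [row for row in rows if split_function(row)]      # rows matching the criterion
    set2 = [row for row in rows if not split_function(row)]  # everything else
    return (set1, set2)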
Slide 6: Gini Impurity
A measure of set homogeneity; explanation/code on p. 147; 0 = a homogeneous set.

>>> treepredict.giniimpurity(treepredict.my_data)
>>> set1,set2=treepredict.divideset(treepredict.my_data,2,'yes')
>>> treepredict.giniimpurity(set1)
>>> treepredict.giniimpurity(set2)
>>> set1
[['slashdot', 'USA', 'yes', 18, 'None'],
 ['google', 'France', 'yes', 23, 'Premium'],
 ['digg', 'USA', 'yes', 24, 'Basic'],
 ['kiwitobes', 'France', 'yes', 23, 'Basic'],
 ['slashdot', 'France', 'yes', 19, 'None'],
 ['digg', 'New Zealand', 'yes', 12, 'Basic'],
 ['google', 'UK', 'yes', 18, 'Basic'],
 ['kiwitobes', 'France', 'yes', 19, 'Basic']]

For set1 (2 None, 5 Basic, 1 Premium):
IG = (#none/#total)(1 - #none/#total) + (#basic/#total)(1 - #basic/#total) + (#premium/#total)(1 - #premium/#total)
   = (2/8)(6/8) + (5/8)(3/8) + (1/8)(7/8)
   = 0.53125
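A sketch of the computation, written to match the formula above (the book's p. 147 version sums p_i*p_j over pairs of classes, which is mathematically equivalent). uniquecounts is the book's helper that tallies the outcome column:

def uniquecounts(rows):
    # count of each outcome (last column) in the set
    results = {}
    for row in rows:
        r = row[len(row)-1]
        results[r] = results.get(r, 0) + 1
    return results

def giniimpurity(rows):
    # sum of p*(1-p) over the outcome classes; 0 for a homogeneous set
    total = len(rows)
    counts = uniquecounts(rows)
    imp = 0.0
    for k in counts:
        p = float(counts[k]) / total
        imp += p * (1 - p)
    return imp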
Slide 7: Entropy
A measure of set disorder; explanation/code on p. 148; 0 = a homogeneous set.

>>> treepredict.entropy(treepredict.my_data)
>>> set1,set2=treepredict.divideset(treepredict.my_data,2,'yes')
>>> treepredict.entropy(set1)
>>> treepredict.entropy(set2)
>>> set1
[['slashdot', 'USA', 'yes', 18, 'None'],
 ['google', 'France', 'yes', 23, 'Premium'],
 ['digg', 'USA', 'yes', 24, 'Basic'],
 ['kiwitobes', 'France', 'yes', 23, 'Basic'],
 ['slashdot', 'France', 'yes', 19, 'None'],
 ['digg', 'New Zealand', 'yes', 12, 'Basic'],
 ['google', 'UK', 'yes', 18, 'Basic'],
 ['kiwitobes', 'France', 'yes', 19, 'Basic']]

For set1 (2 None, 5 Basic, 1 Premium):
H = -(2/8)log2(2/8) - (5/8)log2(5/8) - (1/8)log2(1/8) = 1.2988
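A matching sketch for entropy (cf. p. 148), reusing the uniquecounts helper sketched earlier:

from math import log

def entropy(rows):
    # H = -sum(p * log2(p)) over the outcome classes
    results = uniquecounts(rows)
    ent = 0.0
    for r in results:
        p = float(results[r]) / len(rows)
        ent -= p * log(p, 2)
    return ent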
Slide 8: Building a Decision Tree
At each node, choose the split that maximizes information gain, the difference between the entropy of the current set and the weighted-average entropy of the two new groups: max(H - H(i)). Recursively repeat on each branch of the tree until the best information gain is <= 0, i.e., stop when splitting would only create more disorder. A sketch of the recursion follows.
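This is a condensed sketch modeled on the book's buildtree and decisionnode; details may differ from the actual treepredict code. It reuses the divideset, entropy, and uniquecounts sketches above:

class decisionnode:
    def __init__(self, col=-1, value=None, results=None, tb=None, fb=None):
        self.col = col          # column index tested at this node
        self.value = value      # value the column is compared against
        self.results = results  # outcome counts (leaf nodes only)
        self.tb = tb            # branch taken when the test is true
        self.fb = fb            # branch taken when the test is false

def buildtree(rows, scoref=entropy):
    if len(rows) == 0:
        return decisionnode()
    current_score = scoref(rows)
    best_gain, best_criteria, best_sets = 0.0, None, None
    for col in range(len(rows[0]) - 1):              # every column but the outcome
        for value in set(row[col] for row in rows):  # every observed value
            set1, set2 = divideset(rows, col, value)
            p = float(len(set1)) / len(rows)
            gain = current_score - p * scoref(set1) - (1 - p) * scoref(set2)
            if gain > best_gain and len(set1) > 0 and len(set2) > 0:
                best_gain, best_criteria, best_sets = gain, (col, value), (set1, set2)
    if best_gain > 0:                                # keep splitting while gain is positive
        return decisionnode(col=best_criteria[0], value=best_criteria[1],
                            tb=buildtree(best_sets[0], scoref),
                            fb=buildtree(best_sets[1], scoref))
    return decisionnode(results=uniquecounts(rows))  # leaf: no useful split left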
Slide 9: Building the Tree

>>> treepredict.entropy(treepredict.my_data)
>>> set1,set2=treepredict.divideset(treepredict.my_data,0,'slashdot')
>>> treepredict.entropy(set1)
0.0
>>> set1
[['slashdot', 'USA', 'yes', 18, 'None'], ['slashdot', 'France', 'yes', 19, 'None'], ['slashdot', 'UK', 'no', 21, 'None']]
>>> treepredict.entropy(set2)
>>> set1,set2=treepredict.divideset(treepredict.my_data,0,'digg')
>>> set1,set2=treepredict.divideset(treepredict.my_data,0,'(direct)')
>>> treepredict.entropy(set1)
1.0
>>> set1,set2=treepredict.divideset(treepredict.my_data,0,'kiwitobes')
>>> set1,set2=treepredict.divideset(treepredict.my_data,0,'google')
>>> set1
[['google', 'France', 'yes', 23, 'Premium'], ['google', 'UK', 'no', 21, 'Premium'], ['google', 'USA', 'no', 24, 'Premium'], ['google', 'UK', 'no', 18, 'None'], ['google', 'UK', 'yes', 18, 'Basic']]
>>> set2
[['slashdot', 'USA', 'yes', 18, 'None'], ['digg', 'USA', 'yes', 24, 'Basic'], ['kiwitobes', 'France', 'yes', 23, 'Basic'], ['(direct)', 'New Zealand', 'no', 12, 'None'], ['(direct)', 'UK', 'no', 21, 'Basic'], ['slashdot', 'France', 'yes', 19, 'None'], ['digg', 'USA', 'no', 18, 'None'], ['kiwitobes', 'UK', 'no', 19, 'None'], ['digg', 'New Zealand', 'yes', 12, 'Basic'], ['slashdot', 'UK', 'no', 21, 'None'], ['kiwitobes', 'France', 'yes', 19, 'Basic']]

Weighted-average entropies H(i) for each referrer split (entropies rounded; set1/set2 contents and cardinalities for digg, (direct), and kiwitobes not shown in the Python transcript):

slashdot  = 0.00*(3/16) + 1.53*(13/16) = 1.24
digg      = 0.91*(3/16) + 1.53*(13/16) = 1.41
(direct)  = 1.00*(2/16) + 1.53*(14/16) = 1.46
kiwitobes = 0.91*(3/16) + 1.53*(13/16) = 1.41
google    = 1.37*(5/16) + 0.99*(11/16) = 1.11

With H = entropy(my_data) ≈ 1.51, i='google' minimizes H(i) and therefore gives max(H - H(i)).
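The H(i) table above can be reproduced with a short loop over the referrer values (a sketch; assumes the book's treepredict module is importable):

>>> H = treepredict.entropy(treepredict.my_data)
>>> for ref in ['slashdot', 'digg', '(direct)', 'kiwitobes', 'google']:
...     s1, s2 = treepredict.divideset(treepredict.my_data, 0, ref)
...     p = float(len(s1)) / len(treepredict.my_data)
...     # weighted average of the two branch entropies
...     Hw = p * treepredict.entropy(s1) + (1 - p) * treepredict.entropy(s2)
...     print('%-9s H(i)=%.2f gain=%.2f' % (ref, Hw, H - Hw))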
Slide 10: Viewing the Tree

>>> tree=treepredict.buildtree(treepredict.my_data)
>>> treepredict.printtree(tree)
0:google?
 T-> 3:21?
  T-> {'Premium': 3}
  F-> 2:yes?
   T-> {'Basic': 1}
   F-> {'None': 1}
 F-> 0:slashdot?
  T-> {'None': 3}
  F-> 2:yes?
   T-> {'Basic': 4}
   F-> 3:21?
    T-> {'Basic': 1}
    F-> {'None': 3}
>>> treepredict.drawtree(tree, jpeg='treeview.jpg')

Reading the printout: '0:google?' asks whether column 0 (referrer) equals 'google'; '3:21?' asks whether column 3 (pages viewed) is >= 21. T->/F-> are the true/false branches, and leaves show outcome counts.
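Classifying a new observation is just a walk from the root to a leaf; a sketch modeled on the book's classify:

def classify(observation, tree):
    if tree.results is not None:   # leaf: return the outcome counts
        return tree.results
    v = observation[tree.col]
    if isinstance(v, int) or isinstance(v, float):
        branch = tree.tb if v >= tree.value else tree.fb
    else:
        branch = tree.tb if v == tree.value else tree.fb
    return classify(observation, branch)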
Slide 11: Pruning the Tree
The tree can become overfitted to the training data. Pruning checks pairs of leaves with a common parent to see if merging them would increase the entropy by less than a threshold (mingain), and merges them if so.

>>> treepredict.printtree(tree)
0:google?
 T-> 3:21?
  T-> {'Premium': 3}
  F-> 2:yes?
   T-> {'Basic': 1}
   F-> {'None': 1}
 F-> 0:slashdot?
  T-> {'None': 3}
  F-> 2:yes?
   T-> {'Basic': 4}
   F-> 3:21?
    T-> {'Basic': 1}
    F-> {'None': 3}
>>> treepredict.prune(tree,0.1)
(same tree)
>>> treepredict.prune(tree,0.5)
(same tree)
>>> treepredict.prune(tree,0.75)
(same tree)
>>> treepredict.prune(tree,0.90)
>>> treepredict.printtree(tree)
0:google?
 T-> (unchanged)
 F-> {'None': 6, 'Basic': 5}
>>> treepredict.drawtree(tree,jpeg='pruned-tree.jpeg')
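A sketch of the pruning pass, modeled on the book's prune: recurse to the bottom of the tree, then merge sibling leaves whenever the entropy increase from merging is below mingain. It reuses the entropy and uniquecounts sketches above:

def prune(tree, mingain):
    # recurse until we reach nodes whose children are leaves
    if tree.tb.results is None:
        prune(tree.tb, mingain)
    if tree.fb.results is None:
        prune(tree.fb, mingain)
    # if both branches are now leaves, test whether merging them helps
    if tree.tb.results is not None and tree.fb.results is not None:
        tb, fb = [], []
        for v, c in tree.tb.results.items():
            tb += [[v]] * c
        for v, c in tree.fb.results.items():
            fb += [[v]] * c
        # entropy increase from merging vs. the average of the two leaves
        delta = entropy(tb + fb) - (entropy(tb) + entropy(fb)) / 2
        if delta < mingain:
            tree.tb, tree.fb = None, None
            tree.results = uniquecounts(tb + fb)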
Slide 12: Missing Data

>>> # reminder: referer,location,FAQ,pages
>>> treepredict.mdclassify(['google',None,'yes',None],tree)
{'Premium': 2.25, 'Basic': 0.25}
>>> treepredict.mdclassify(['google','France',None,None],tree)
{'None': 0.125, 'Premium': 2.25, 'Basic': 0.125}
>>> treepredict.mdclassify(['google',None,None,'14'],tree)
{'None': 0.5, 'Basic': 0.5}

ex1: location & pages unknown
FAQ=yes, so the false branch of pages has 1 outcome: faq_weight = 1/1; basic = 1 * 1.0
if pages > 20 then 3 outcomes, else 1 outcome:
pages_true_weight = 3/4, pages_false_weight = 1/4
premium = 3 * 3/4 = 2.25; basic = 1.0 * 1/4 = 0.25

ex2: FAQ & pages unknown
if FAQ then 1 outcome, else 1 outcome:
faq_true_weight = faq_false_weight = 1/2
none = 1 * 0.5; basic = 1 * 0.5
if pages > 20 then 3 outcomes, else 2 outcomes (each with weight 0.5):
pages_true_weight = 3/4, pages_false_weight = 1/4
premium = 3 * 3/4 = 2.25; basic = 0.5 * 1/4 = 0.125; none = 0.5 * 1/4 = 0.125

ex3: pages is passed as the string '14', which is compared by equality rather than >=, so it takes the false branch of 3:21; FAQ unknown then splits the weight: none = 1 * 0.5, basic = 1 * 0.5
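The weights in the walkthroughs above come from following both branches when a value is missing; a sketch modeled on the book's mdclassify:

def mdclassify(observation, tree):
    if tree.results is not None:   # leaf
        return tree.results
    v = observation[tree.col]
    if v is None:
        # missing value: follow both branches, weight each by its share of outcomes
        tr = mdclassify(observation, tree.tb)
        fr = mdclassify(observation, tree.fb)
        tcount = sum(tr.values())
        fcount = sum(fr.values())
        tw = float(tcount) / (tcount + fcount)
        fw = float(fcount) / (tcount + fcount)
        result = {}
        for k in tr:
            result[k] = tr[k] * tw
        for k in fr:
            result[k] = result.get(k, 0) + fr[k] * fw
        return result
    if isinstance(v, int) or isinstance(v, float):
        branch = tree.tb if v >= tree.value else tree.fb
    else:
        branch = tree.tb if v == tree.value else tree.fb
    return mdclassify(observation, branch)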
Slide 13: Numerical, Not Categorical Outcomes
height (in) = {56, 59, 59, 61, 62, 74, 76, 76, 78}
This list could be categorized, e.g. short (< 65") and tall (> 72"), or we could use the integers as values directly. For numerical outcomes we use variance as our measure of dispersion, not Gini impurity or entropy.
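A sketch of the variance score (cf. the book's variance function), which plugs into the tree builder via buildtree(rows, scoref=variance):

def variance(rows):
    # variance of the (numeric) outcome column
    if len(rows) == 0:
        return 0
    data = [float(row[len(row)-1]) for row in rows]
    mean = sum(data) / len(data)
    return sum((d - mean) ** 2 for d in data) / len(data)

Applied to the heights above (each height as a one-column row):

>>> heights = [[h] for h in [56, 59, 59, 61, 62, 74, 76, 76, 78]]
>>> round(variance(heights), 2)
71.28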
Slide 14: Zillow API
I could not get the Cambridge, MA example to work, so I did my street. N.B.: the house numbers are nonconsecutive; Zillow will just make up results for nonexistent houses.

>>> import zillow
>>> housedata=zillow.getpricelist()
510 Rhode Island
511 Rhode Island
516 Rhode Island
517 Rhode Island
519 Rhode Island
520 Rhode Island
523 Rhode Island
524 Rhode Island
527 Rhode Island
530 Rhode Island
532 Rhode Island
535 Rhode Island
536 Rhode Island
539 Rhode Island
>>> import treepredict
>>> housetree=treepredict.buildtree(housedata,scoref=treepredict.variance)
>>> treepredict.drawtree(housetree,'norfolk.jpeg')
>>> # zip,type,yearbuilt,bathrooms,bedrooms,rooms(always 1),est. value
>>> housedata
[(u'23508', u'SingleFamily', 1918, 2.0, 3, 1, u'281500'),
 (u'23508', u'SingleFamily', 1925, 2.0, 5, 1, u'408000'),
 (u'23508', u'SingleFamily', 1918, 1.0, 3, 1, u'367000'),
 (u'23508', u'SingleFamily', 1920, 1.0, 3, 1, u'317500'),
 (u'23508', u'SingleFamily', 1932, 2.0, 4, 1, u'329500'),
 (u'23508', u'SingleFamily', 1923, 1.0, 3, 1, u'239500'),
 (u'23508', u'SingleFamily', 1923, 2.5, 3, 1, u'262000'),
 (u'23508', u'SingleFamily', 1918, 1.5, 3, 1, u'272000'),
 (u'23508', u'SingleFamily', 1918, 2.0, 4, 1, u'279500'),
 (u'23508', u'SingleFamily', 1914, 2.0, 4, 1, u'306500'),
 (u'23508', u'SingleFamily', 1913, 1.0, 3, 1, u'266500'),
 (u'23508', u'Quadruplex', 1920, 4.0, 8, 1, u'541500'),
 (u'23508', u'SingleFamily', 1927, 2.0, 3, 1, u'321000'),
 (u'23508', u'SingleFamily', 1918, 1.0, 3, 1, u'229500')]
>>> treepredict.mdclassify(['23508','SingleFamily',1920,2.0,4,1,None],housetree)
{u'279500': 1}
>>> treepredict.mdclassify(['23508','SingleFamily',1920,1.5,4,1,None],housetree)
{u'317500': 1}

Single-house leaves like these: not enough training data?
Slide 15: Hot or Not?
Get 500 random profiles; for each profile, get its data (gender, age, region, rating), then build and prune a tree using variance as the score function.

>>> import hotornot
>>> l1=hotornot.getrandomratings(500)
>>> len(l1)
440
>>> pdata=hotornot.getpeopledata(l1)
>>> pdata[0]
(u'male', 18, 'Mid Atlantic', 9)
>>> pdata[1]
(u'male', 25, 'West', 9)
>>> pdata[2]
(u'male', 25, 'Midwest', 7)
>>> hottree=treepredict.buildtree(pdata,scoref=treepredict.variance)
>>> treepredict.drawtree(hottree,'hottree1.jpeg')
>>> treepredict.prune(hottree,0.5)
>>> treepredict.drawtree(hottree,'hottree2.jpeg')
>>> south=treepredict.mdclassify((None,None,'South'),hottree)
>>> ne=treepredict.mdclassify((None,None,'New England'),hottree)
>>> south[10]/sum(south.values())
…e-05
>>> ne[10]/sum(ne.values())
…
>>> south
{8: …, 9: …, 10: …, 6: …, 7: …}
>>> ne
{5: …, 6: …, 7: …, 8: …, 9: …, 10: …}
>>> south[9]/sum(south.values())
…
>>> south[9]/sum(ne.values())
…
>>> ne[9]/sum(ne.values())
…
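The expressions above turn mdclassify's weighted counts into conditional probabilities by normalizing by the total weight; a small helper to the same effect (hypothetical, not part of the book's code):

def probability(counts, score):
    # fraction of the weighted outcomes that equal `score`
    return counts.get(score, 0) / float(sum(counts.values()))

>>> probability(south, 10)   # P(rating = 10 | region = 'South')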
Slide 16: Decision Trees Summary

Pros:
- easy to interpret
- both predictive and descriptive
- handles categorical + numerical data
- can have nodes with many outcomes (probabilistic outcomes)

Cons:
- only <, > operators on numerical data
- doesn't handle many inputs/outcomes well
- can't uncover complex relationships between inputs