Machine Learning Reading: Chapter 18
2 Text Classification: Is text i a finance news article? Positive / Negative
3 20 attributes (word, count)
Investors 2, Dow 2, Jones 2, Industrial 1, Average 3, Percent 5, Gain 6, Trading 8, Broader 5, stock 5,
Indicators 6, Standard 2, Rolling 1, Nasdaq 3, Early 10, Rest 12, More 13, first 11, Same 12, The 30
4 20 attributes Men’s Basketball Championship UConn Huskies Georgia Tech Women Playing Crown Titles Games Rebounds All-America early rolling Celebrates Rest More First The same
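Example  stock  rolling  the  class
1        0      3        40   other
2        6      8        35   finance
3        7      7        25   other
4        5      7        14   other
5        8      2        20   finance
6        9      4        25   finance
7        5      6        20   finance
8        0      2        35   other
9        (attribute values not recoverable)  finance
10       (attribute values not recoverable)  other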
6 Constructing the Decision Tree
Goal: Find the smallest decision tree consistent with the examples
- Find the attribute that best splits the examples
- Form a tree with root = best attribute
- For each value v_i (or range) of the best attribute:
  - Select those examples with best = v_i
  - Construct subtree_i by recursively calling the decision tree algorithm with the subset of examples and all attributes except best
  - Add a branch to the tree with label = v_i and subtree = subtree_i
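The recursive procedure above can be sketched in Python. This is a minimal sketch under assumptions the slide does not state: "best" is taken to mean highest information gain, two stopping rules (all examples share a class, or no attributes remain) are added so the recursion terminates, and the four training examples and the names `build_tree` and `classify` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(examples, attribute, target):
    """Entropy of the whole set minus the weighted entropy of each branch."""
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy([e[target] for e in examples]) - remainder


def build_tree(examples, attributes, target):
    labels = [e[target] for e in examples]
    # Assumed stopping rules (not spelled out on the slide).
    if len(set(labels)) == 1:
        return labels[0]                               # pure subset: leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]    # majority class leaf
    # Root = attribute that best splits the examples.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({e[best] for e in examples}):
        subset = [e for e in examples if e[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return (best, branches)


def classify(tree, example):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[example[attribute]]
    return tree


# Hypothetical toy data, discretized into "low"/"high" for illustration only.
examples = [
    {"stock": "high", "rolling": "low", "class": "finance"},
    {"stock": "high", "rolling": "high", "class": "finance"},
    {"stock": "low", "rolling": "low", "class": "other"},
    {"stock": "low", "rolling": "high", "class": "other"},
]
tree = build_tree(examples, ["rolling", "stock"], "class")
print(tree)  # ('stock', {'high': 'finance', 'low': 'other'})
```

On this toy data a single split on stock separates the classes. `classify` raises `KeyError` for an attribute value that never appeared in training; handling that case is left out of the sketch.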
7 Choosing the Best Attribute: Binary Classification
- Want a formal measure that returns a maximum value when the attribute makes a perfect split and a minimum when it makes no distinction
- Information theory (Shannon and Weaver, 1949)
  - Entropy: a measure that characterizes the impurity of a collection of examples
  - Information gain: the expected reduction in entropy caused by partitioning the examples according to this attribute
8 Formula for Entropy
$H(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i)$, where $P(v)$ = probability of $v$
Examples:
- Suppose we have a collection of 10 examples, 5 positive and 5 negative: $H(1/2, 1/2) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1$ bit
- Suppose we have a collection of 100 examples, 1 positive and 99 negative: $H(1/100, 99/100) = -0.01\log_2 0.01 - 0.99\log_2 0.99 \approx 0.08$ bits
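The two worked examples can be checked numerically. The function name `entropy` is an illustrative choice, and the sketch assumes the input probabilities sum to 1.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits of a discrete distribution.

    Terms with probability 0 are skipped, since p * log2(p) tends to 0 as p -> 0.
    """
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


balanced = entropy([5 / 10, 5 / 10])    # 5 positive, 5 negative
skewed = entropy([1 / 100, 99 / 100])   # 1 positive, 99 negative
print(f"{balanced:.2f} bits, {skewed:.2f} bits")  # 1.00 bits, 0.08 bits
```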
9 Choosing the Best Attribute: Information Gain
- Information gain (from an attribute test) = difference between the original information requirement and the new requirement
- $\mathrm{Gain}(A) = H\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \mathrm{Remainder}(A)$, where H = entropy
- H is highest when the set is equally divided between positive (p) and negative (n) examples: (.5, .5) gives a value of 1
- H is lower as the set becomes more unbalanced, e.g., (.9, .1)
Information based on attributes = Remainder(A)
The 10 examples are split evenly between the two classes (p = n), so H(1/2, 1/2) = 1 bit
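The formula for Remainder(A) did not survive in the text above. Assuming the usual definition, in which attribute A splits the examples into subsets i = 1, ..., v and subset i holds p_i positive and n_i negative examples (the symbols v, p_i, and n_i are assumptions chosen to match the p and n used for Gain), it can be written as:

```latex
\mathrm{Remainder}(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\,
  H\!\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)
```

Each branch contributes its own entropy, weighted by the fraction of examples that fall into it.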
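Candidate splits (example numbers in each branch):
- stock: <5 -> 1, 8, 9, 10; 5-10 -> 2, 3, 4, 5, 6, 7
- rolling: <5 -> 1, 5, 6, 8; 5-10 -> 2, 3, 4, 7; third branch (label not recoverable, presumably above 10) -> 9, 10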
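The gain of the two candidate splits can be computed directly from the branch memberships. This sketch assumes the class column of the example table reads other, finance, other, other, finance, finance, finance, other for examples 1 to 8, and that the trailing "finance other" gives the classes of examples 9 and 10; the helper names are illustrative.

```python
import math


def entropy(counts):
    """Entropy in bits from a list of class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


# Class of each example; the labels for 9 and 10 are an assumption (see lead-in).
labels = {1: "other", 2: "finance", 3: "other", 4: "other", 5: "finance",
          6: "finance", 7: "finance", 8: "other", 9: "finance", 10: "other"}

# Example numbers that fall into each branch of the two candidate splits.
splits = {
    "stock": [[1, 8, 9, 10], [2, 3, 4, 5, 6, 7]],
    "rolling": [[1, 5, 6, 8], [2, 3, 4, 7], [9, 10]],
}


def class_counts(examples):
    finance = sum(1 for e in examples if labels[e] == "finance")
    return [finance, len(examples) - finance]


def gain(branches):
    """Entropy before the split minus the weighted entropy after it."""
    everything = [e for branch in branches for e in branch]
    remainder = sum(len(b) / len(everything) * entropy(class_counts(b))
                    for b in branches)
    return entropy(class_counts(everything)) - remainder


for attribute, branches in splits.items():
    print(f"{attribute}: gain = {gain(branches):.3f}")
```

Under these assumptions stock yields a gain of roughly 0.125 bits, while every branch of rolling stays evenly mixed and the gain is 0, so stock is the better root.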
14 The algorithm as specified so far is designed for binary classification and attributes with discrete values
Attributes:
- Outlook: sunny, overcast, rain
- Temperature: hot, mild, cool
- Humidity: normal, high
- Wind: weak, strong
Classification:
- PlayTennis?: Yes, No
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No
Entropy of the full set S before any split: E = .940 (9/14 yes)
Candidate splits and branch entropies:
- Humidity: High E = .985, Normal E = .592
- Wind: Weak E = .811, Strong E = 1.0
- Outlook: Sunny, Overcast, Rain
- Temperature: Hot, Mild, Cool
Gain(Humidity) = .940 - (7/14).985 - (7/14).592
Gain(Wind) = .940 - (8/14).811 - (6/14)1.0
Gain(Outlook) = .940 - (4/14)0 - (5/14).971 - (5/14).971
Gain(S,Outlook) = .246, Gain(S,Humidity) = .151, Gain(S,Wind) = .048, Gain(S,Temperature) = .029
Outlook is selected because it has the highest gain
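The four gains can be reproduced from the 14 PlayTennis rows. This is a minimal sketch; the function names are illustrative, and the results agree with the slide's figures to within rounding because the slide rounds intermediate entropies.

```python
import math
from collections import Counter

COLUMNS = ["Outlook", "Temperature", "Humidity", "Wind", "PlayTennis"]
ROWS = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
DATA = [dict(zip(COLUMNS, line.split())) for line in ROWS.strip().splitlines()]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain(rows, attribute, target="PlayTennis"):
    """Entropy of the full set minus the weighted entropy of each branch."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder


gains = {a: gain(DATA, a) for a in COLUMNS[:-1]}
best = max(gains, key=gains.get)
for attribute, value in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"Gain(S, {attribute}) = {value:.3f}")
print("Selected root:", best)  # Selected root: Outlook
```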
19 Extending the algorithm for continuous-valued attributes
- Dynamically define new discrete-valued attributes that partition the continuous attribute into a discrete set of intervals
- For continuous A, create A_c that is true if A < c, false otherwise
- How to select the best value for threshold c?
  - Sort examples by the continuous attribute
  - Identify adjacent examples that differ in target classification
  - Generate a set of candidate thresholds midway between the corresponding values of A
  - Choose the threshold c that maximizes information gain
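20 Example: temperature as continuous value
[Table: Temp vs. Play tennis? Individual values not recovered; the class changes from No to Yes between 48 and 60, and from Yes to No between 80 and 90.]
Two candidate thresholds: (48+60)/2 = 54 and (80+90)/2 = 85
Information gain is greater for Temperature > 54 than for Temperature > 85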
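The threshold search can be sketched as follows. The boundary values 48, 60, 80, and 90 come from the slide; the other two temperatures (40 and 72) and the full label sequence are assumed filler values chosen only so that the class changes at those boundaries. Function names are illustrative.

```python
import math


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return sum(-(labels.count(c) / total) * math.log2(labels.count(c) / total)
               for c in set(labels))


def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose classes differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, label_a), (b, label_b) in zip(pairs, pairs[1:])
            if label_a != label_b and a != b]


def gain_for_threshold(values, labels, c):
    """Information gain of the boolean test value < c."""
    below = [l for v, l in zip(values, labels) if v < c]
    above = [l for v, l in zip(values, labels) if v >= c]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (below, above) if part)
    return entropy(labels) - remainder


# 40 and 72 are assumed values; 48, 60, 80, 90 appear on the slide.
temps = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]

thresholds = candidate_thresholds(temps, play)
best = max(thresholds, key=lambda c: gain_for_threshold(temps, play, c))
print(thresholds, best)  # [54.0, 85.0] 54.0
```

With these assumed values the split at 54 isolates a pure group of No examples, which is why it beats the split at 85.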
21 Other cases What if class is discrete valued, not binary? What if an attribute has many values (e.g., 1 per instance)?
22 Training vs. Testing
A learning algorithm is good if it uses its learned hypothesis to make accurate predictions on unseen data
- Collect a large set of examples (with classifications)
- Divide into two disjoint sets: the training set and the test set
- Apply the learning algorithm to the training set, generating hypothesis h
- Measure the percentage of examples in the test set that are correctly classified by h
- Repeat for different sizes of training sets and different randomly selected training sets of each size
24 Overfitting
- Learning algorithms may use irrelevant attributes to make decisions
  - For news: the day published and the newspaper
- When else can overfitting occur?
- Solution #1: Decision tree pruning
  - Prune away attributes with low information gain
  - Use statistical significance to test whether the gain is meaningful
25 K-fold Cross Validation
Solution #2, to reduce overfitting:
- Run k experiments
- Use a different 1/k of the data for testing each time
- Average the results
- Common choices: 5-fold, 10-fold, leave-one-out
26 Cross-Validation Model
Labeled data (1566) -> split into 10 folds -> train on 9 folds (approx. 1409), evaluate on 1 fold (approx. 157) -> lather, rinse, repeat (10 times) -> report average
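The 10-fold loop can be sketched in plain Python. The dataset size 1566 comes from the slide; everything else is an assumption for illustration: the synthetic labels, the majority-class "learner", and the names `k_fold_indices` and `cross_validate`.

```python
import random
from collections import Counter


def k_fold_indices(n_examples, k, seed=0):
    """Shuffle the example indices and deal them into k disjoint folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(examples, k, train, evaluate, seed=0):
    """Train on k-1 folds, evaluate on the held-out fold, average the k scores."""
    scores = []
    for held_out in k_fold_indices(len(examples), k, seed):
        held = set(held_out)
        training = [examples[i] for i in range(len(examples)) if i not in held]
        testing = [examples[i] for i in held_out]
        scores.append(evaluate(train(training), testing))
    return sum(scores) / k


# Hypothetical stand-ins for a real learner: always predict the majority class.
def train_majority(training):
    return Counter(label for _, label in training).most_common(1)[0][0]


def accuracy(model, testing):
    return sum(1 for _, label in testing if label == model) / len(testing)


# Synthetic labeled data, same size as on the slide.
examples = [(i, "finance" if i % 3 == 0 else "other") for i in range(1566)]

fold_sizes = sorted({len(f) for f in k_fold_indices(len(examples), 10)})
print(fold_sizes)  # [156, 157]
average = cross_validate(examples, 10, train_majority, accuracy)
print(f"average accuracy over 10 folds: {average:.3f}")
```

Every example is used for testing exactly once, which is what makes the averaged score a fairer estimate than a single train/test split.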
28 Ensemble Learning Learn from a collection of hypotheses Majority voting Enlarges the hypothesis space
29 Boosting
- Uses a weighted training set
  - Each example has an associated weight w_j ≥ 0
  - Higher-weighted examples have higher importance
- Initially, w_j = 1 for all examples
- Next round: increase the weights of misclassified examples, decrease the other weights
- From the new weighted set, generate hypothesis h_2
- Continue until M hypotheses have been generated
- Final ensemble hypothesis = weighted-majority combination of all M hypotheses
  - Weight each hypothesis according to how well it did on the training data
30 AdaBoost
- Assume the input learning algorithm L is a weak learning algorithm: L always returns a hypothesis whose weighted error on the training set is slightly better than random guessing
- Then AdaBoost returns a hypothesis that classifies the training data perfectly for large enough M
- Boosts the accuracy of the original learning algorithm on the training data
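A small sketch shows how boosting weak hypotheses can reach perfect training accuracy. It uses the common exponential-weight formulation of AdaBoost with one-dimensional decision stumps as the weak learner. These are assumptions, not details from the slides: labels are coded as +1/-1, weights start at 1/n (a normalized version of the slide's w_j = 1), and the ten data points are a hypothetical toy set on which no single stump is perfect.

```python
import math


def stump_predict(x, threshold, polarity):
    """Predict `polarity` when x >= threshold, otherwise the opposite class."""
    return polarity if x >= threshold else -polarity


def train_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1 / n] * n                      # uniform starting weights
    ensemble = []
    for _ in range(rounds):
        error, threshold, polarity = train_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # hypothesis weight
        ensemble.append((alpha, threshold, polarity))
        # Raise weights of misclassified examples, lower the rest, renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted-majority vote of all stumps."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 3):
    ensemble = adaboost(xs, ys, rounds)
    mistakes = sum(1 for x, y in zip(xs, ys) if predict(ensemble, x) != y)
    print(f"M = {rounds}: {mistakes} training mistakes")
```

On this toy set a single stump misclassifies three examples, while the weighted vote of three stumps classifies all ten correctly.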