Machine Learning Reading: Chapter 18

2 Text Classification
  Is text i a finance news article?
  Positive / Negative

3 20 attributes (word, count)
  Investors 2, Dow 2, Jones 2, Industrial 1, Average 3, Percent 5, Gain 6, Trading 8, Broader 5, stock 5, Indicators 6, Standard 2, Rolling 1, Nasdaq 3, Early 10, Rest 12, More 13, first 11, Same 12, The 30

4 20 attributes
  Men’s, Basketball, Championship, UConn Huskies, Georgia Tech, Women, Playing, Crown, Titles, Games, Rebounds, All-America, early, rolling, Celebrates, Rest, More, First, The, same

Example  stock  rolling  the  class
1        0      3        40   other
2        6      8        35   finance
3        7      7        25   other
4        5      7        14   other
5        8      2        20   finance
6        9      4        25   finance
7        5      6        20   finance
8        0      2        35   other
9        0      11       25   finance
10       0      15       28   other

6 Constructing the Decision Tree
  Goal: Find the smallest decision tree consistent with the examples
  Find the attribute that best splits examples
  Form tree with root = best attribute
  For each value v_i (or range) of best attribute
    Select those examples with best = v_i
    Construct subtree_i by recursively calling decision tree with subset of examples, all attributes except best
    Add a branch to tree with label = v_i and subtree = subtree_i
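The recursive procedure above might be sketched as follows. This is a minimal illustration, not the course's implementation: the function names (`entropy`, `gain`, `build_tree`), the dictionary representation of the tree, and the four toy examples are all assumptions made here. "Best" is taken to mean highest information gain, which the later slides define.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def gain(examples, attr, target):
    """Expected reduction in entropy from splitting the examples on attr."""
    remainder = 0.0
    for value in {e[attr] for e in examples}:
        subset = [e[target] for e in examples if e[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy([e[target] for e in examples]) - remainder

def build_tree(examples, attributes, target):
    labels = [e[target] for e in examples]
    # Stop when the subset is pure or no attributes remain; return a leaf label.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    # Only values that occur in the examples get a branch, so no subset is empty.
    for value in sorted({e[best] for e in examples}):
        subset = [e for e in examples if e[best] == value]
        tree[best][value] = build_tree(subset, remaining, target)
    return tree

# Hypothetical toy data, not taken from the slides.
examples = [
    {"stock": "high", "rolling": "low", "class": "finance"},
    {"stock": "high", "rolling": "high", "class": "finance"},
    {"stock": "low", "rolling": "low", "class": "other"},
    {"stock": "low", "rolling": "high", "class": "other"},
]
tree = build_tree(examples, ["rolling", "stock"], "class")
print(tree)  # {'stock': {'high': 'finance', 'low': 'other'}}
```

In this toy set, stock separates the classes perfectly while rolling carries no information, so the sketch puts stock at the root and stops after one level.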

7 Choosing the Best Attribute: Binary Classification
  Want a formal measure that returns a maximum value when the attribute makes a perfect split and a minimum when it makes no distinction
  Information theory (Shannon and Weaver, 1949)
    Entropy: a measure that characterizes the impurity of a collection of examples
    Information gain: the expected reduction in entropy caused by partitioning the examples according to this attribute

8 Formula for Entropy
  $H(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i)$, where P(v) = probability of v
  Examples:
    Suppose we have a collection of 10 examples, 5 positive, 5 negative:
      $H(1/2, 1/2) = -\tfrac{1}{2} \log_2 \tfrac{1}{2} - \tfrac{1}{2} \log_2 \tfrac{1}{2} = 1$ bit
    Suppose we have a collection of 100 examples, 1 positive and 99 negative:
      $H(1/100, 99/100) = -0.01 \log_2 0.01 - 0.99 \log_2 0.99 = 0.08$ bits
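As a quick check of the two worked examples, the entropy formula can be written in a few lines. The function name `entropy` and the convention that a zero probability contributes nothing are assumptions of this sketch.

```python
import math

def entropy(probabilities):
    """H(P(v_1), ..., P(v_n)) in bits; terms with P = 0 are skipped."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))              # 1.0
print(round(entropy([0.01, 0.99]), 2))  # 0.08
```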

9 Choosing the Best Attribute: Information Gain
  Information gain (from attribute test) = difference between the original information requirement and the new requirement
  $\mathrm{Gain}(A) = H\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \mathrm{Remainder}(A)$
  H = entropy
    Highest when the set is equally divided between positive (p) and negative (n) examples: H(.5, .5) = 1
    Lower as the set becomes more unbalanced (e.g., (.9, .1))

Information based on attributes = Remainder(A)
  p = n = 5 (10 examples), so H(1/2, 1/2) = 1 bit
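The formula for Remainder(A) itself did not survive in this transcript. The usual definition, which agrees with the gain calculations worked later in these slides, is sketched below. The symbols v (number of distinct values of A) and p_i, n_i (positive and negative examples in the subset selected by the i-th value) are notation introduced here.

```latex
\mathrm{Remainder}(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\,
    H\!\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right),
\qquad
\mathrm{Gain}(A) = H\!\left(\frac{p}{p+n}, \frac{n}{p+n}\right) - \mathrm{Remainder}(A)
```

Each subset's entropy is weighted by the fraction of examples that fall into it, so Remainder(A) is the expected entropy left after testing A.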

11 Text Classification
  Is text i a finance news article?
  Positive / Negative


Splits on stock and on rolling (example numbers in each branch):
stock
  <5: 1, 8, 9, 10
  5-10: 2, 3, 4, 5, 6, 7
  ≥10: (none)
rolling
  <5: 1, 5, 6, 8
  5-10: 2, 3, 4, 7
  ≥10: 9, 10
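To compare the two splits numerically, the sketch below computes the information gain of each on the ten examples. The data tuples are a reading of the flattened example table (checked against the branch memberships listed for the two splits), and the bin boundaries in `bucket` are an assumption, since no count in these two columns equals 10 and the exact boundary cannot be recovered.

```python
import math
from collections import Counter

# (stock, rolling, class) for examples 1-10, as read from the example table.
DATA = [
    (0, 3, "other"), (6, 8, "finance"), (7, 7, "other"), (5, 7, "other"),
    (8, 2, "finance"), (9, 4, "finance"), (5, 6, "finance"), (0, 2, "other"),
    (0, 11, "finance"), (0, 15, "other"),
]

def bucket(count):
    """Assumed binning of a word count into the three ranges used by the split."""
    if count < 5:
        return "<5"
    return "5-10" if count < 10 else ">=10"

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def gain(column):
    """Information gain of splitting DATA on the binned value in the given column."""
    groups = {}
    for row in DATA:
        groups.setdefault(bucket(row[column]), []).append(row[2])
    remainder = sum(len(g) / len(DATA) * entropy(g) for g in groups.values())
    return entropy([row[2] for row in DATA]) - remainder

print(f"Gain(stock)   = {gain(0):.3f}")
print(f"Gain(rolling) = {gain(1):.3f}")
```

Under these assumptions, every rolling branch is an even mix of finance and other, so its gain is zero, while stock yields a gain of roughly 0.12 bits and would be preferred.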

14 Algorithm as specified so far is designed for binary classification and attributes with discrete values
  Attributes:
    Outlook: sunny, overcast, rain
    Temperature: hot, mild, cool
    Humidity: normal, high
    Wind: weak, strong
  Classification
    PlayTennis?: Yes, No

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Entropy of the full set S: E = .940 (9/14 yes)
  Humidity (E = .940): High E = .985, Normal E = .592
  Wind (E = .940): Weak E = .811, Strong E = 1.0
  Outlook (E = .940): Overcast, Sunny, Rain
  Temperature (E = .940): Cool, Mild, Hot

Gain(humidity) = .940 - (7/14).985 - (7/14).592 = .151
Gain(wind) = .940 - (8/14).811 - (6/14)1.0 = .048
Gain(outlook) = .940 - (4/14)0 - (5/14).971 - (5/14).971 = .246

Gain(S,Outlook) = .246, Gain(S,Humidity) = .151, Gain(S,Wind) = .048, Gain(S,Temperature) = .029
Outlook is selected because it has the highest gain
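The four gains can be recomputed directly from the PlayTennis table. The sketch below does so; the names `ROWS`, `ATTRIBUTES`, `entropy`, and `gain` are choices made here, not part of the slides.

```python
import math
from collections import Counter

ATTRIBUTES = ("Outlook", "Temperature", "Humidity", "Wind")
# Rows D1-D14 as (Outlook, Temperature, Humidity, Wind, PlayTennis).
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def gain(attribute):
    """Entropy of the full set minus the weighted entropy of each value's subset."""
    col = ATTRIBUTES.index(attribute)
    groups = {}
    for row in ROWS:
        groups.setdefault(row[col], []).append(row[-1])
    remainder = sum(len(g) / len(ROWS) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in ROWS]) - remainder

gains = {a: gain(a) for a in ATTRIBUTES}
for name, value in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"Gain(S, {name}) = {value:.3f}")
best = max(gains, key=gains.get)
```

Computed at full precision, Outlook and Humidity come out near .247 and .152; the slide figures of .246 and .151 are consistent with rounding the intermediate entropies first. The ranking is the same either way, with Outlook first.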


19 Extending the algorithm for continuous valued attributes
  Dynamically define new discrete-valued attributes that partition the continuous attribute into a discrete set of intervals
    For continuous A, create A_c that is true if A < c, false otherwise
  How to select the best value for threshold c?
    Sort examples by continuous attribute
    Identify adjacent examples that differ in target classification
    Generate a set of candidate thresholds midway between corresponding values of A
    Choose threshold c that maximizes information gain

20 Example: temperature as continuous value
  Temp          40  48  60   72   80   90
  Play tennis?  No  No  Yes  Yes  Yes  No
  Two candidate thresholds:
    (48+60)/2 = 54
    (80+90)/2 = 85
  Information gain greater for Temperature >54 than for Temperature >85
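The threshold-selection steps can be sketched on this temperature example. The label sequence used below (No, No, Yes, Yes, Yes, No) is the one implied by the two candidate thresholds; the function names are assumptions of this sketch.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a[0] + b[0]) / 2 for a, b in zip(pairs, pairs[1:]) if a[1] != b[1]]

def gain_for_threshold(values, labels, c):
    """Information gain of the boolean test value < c."""
    below = [lab for v, lab in zip(values, labels) if v < c]
    above = [lab for v, lab in zip(values, labels) if v >= c]
    n = len(labels)
    remainder = sum(len(side) / n * entropy(side) for side in (below, above) if side)
    return entropy(labels) - remainder

temps = [40, 48, 60, 72, 80, 90]
play = ["No", "No", "Yes", "Yes", "Yes", "No"]

candidates = candidate_thresholds(temps, play)
best = max(candidates, key=lambda c: gain_for_threshold(temps, play, c))
print(candidates)  # [54.0, 85.0]
print(best)        # 54.0
```

Only boundaries where the class changes are considered, which keeps the candidate set small without missing the gain-maximizing threshold.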

21 Other cases
  What if class is discrete valued, not binary?
  What if an attribute has many values (e.g., 1 per instance)?

22 Training vs. Testing
  A learning algorithm is good if it uses its learned hypothesis to make accurate predictions on unseen data
    Collect a large set of examples (with classifications)
    Divide into two disjoint sets: the training set and the test set
    Apply the learning algorithm to the training set, generating hypothesis h
    Measure the percentage of examples in the test set that are correctly classified by h
    Repeat for different sizes of training sets and different randomly selected training sets of each size
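The evaluation procedure above could be sketched as a small learning-curve routine. Everything named here is hypothetical: `learn_majority` is a trivial stand-in learner that always predicts the most common training label, and the 100 synthetic examples exist only to make the sketch runnable. Any function that maps a training set to a hypothesis could be passed in its place.

```python
import random
from collections import Counter

def split(examples, n_train, rng):
    """Randomly divide the examples into disjoint training and test sets."""
    shuffled = examples[:]
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def learn_majority(training_set):
    """Hypothetical stand-in learner: always predicts the commonest training label."""
    label = Counter(y for _, y in training_set).most_common(1)[0][0]
    return lambda x: label

def accuracy(h, test_set):
    """Fraction of test examples that hypothesis h classifies correctly."""
    return sum(h(x) == y for x, y in test_set) / len(test_set)

def learning_curve(examples, sizes, trials, learn, seed=0):
    """Average test accuracy for each training-set size over random splits."""
    rng = random.Random(seed)
    curve = {}
    for n_train in sizes:
        scores = []
        for _ in range(trials):
            train, test = split(examples, n_train, rng)
            scores.append(accuracy(learn(train), test))
        curve[n_train] = sum(scores) / trials
    return curve

# Synthetic (input, label) pairs, for illustration only.
examples = [(i, "finance" if i % 3 == 0 else "other") for i in range(100)]
curve = learning_curve(examples, sizes=(20, 50, 80), trials=10, learn=learn_majority)
for n_train, score in curve.items():
    print(n_train, round(score, 3))
```

In this sketch the test set is whatever remains after drawing the training set, so it shrinks as the training set grows; a fixed held-out test set is an equally reasonable design.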

24 Overfitting
  Learning algorithms may use irrelevant attributes to make decisions
    For news, day published and newspaper
  When else can overfitting occur?
  Solution #1: Decision tree pruning
    Prune away attributes with low information gain
    Use statistical significance to test whether gain is meaningful

25 K-fold Cross Validation
  Solution #2: To reduce overfitting
    Run k experiments
      Use a different 1/k of data for testing each time
      Average the results
    5-fold, 10-fold, leave-one-out

26 Cross-Validation Model
  Labeled data (1566)
  Split into 10 folds
  Train on 9 folds (approx. 1409), evaluate on 1 fold (approx. 157)
  Lather, rinse, repeat (10 times)
  Report average
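A minimal sketch of the fold bookkeeping, using the 1566 labeled examples and 10 folds mentioned above. The function names and the `evaluate` callback are hypothetical, and the sketch assumes the data have already been shuffled.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n))
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_validate(n, k, evaluate):
    """Run k experiments and report the average score.

    evaluate(train, test) is a hypothetical callback that trains on the
    train indices and returns accuracy on the test indices.
    """
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / k

folds = list(k_fold_indices(1566, 10))
print(sorted({len(test) for _, test in folds}))    # [156, 157]
print(sorted({len(train) for train, _ in folds}))  # [1409, 1410]
```

Because 1566 is not divisible by 10, six folds hold 157 examples and four hold 156, which matches the approximate sizes quoted above. Setting k equal to n gives leave-one-out.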

27 Example

28 Ensemble Learning
  Learn from a collection of hypotheses
  Majority voting
  Enlarges the hypothesis space
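A small sketch of majority voting, with three hypothetical one-dimensional hypotheses chosen to illustrate the last point: each hypothesis is a single threshold test or a constant, yet their majority vote labels exactly the interval between 2 and 8 as positive, a concept no single threshold can express.

```python
from collections import Counter

def majority_vote(hypotheses, x):
    """Return the label predicted by the most hypotheses for input x."""
    votes = Counter(h(x) for h in hypotheses)
    return votes.most_common(1)[0][0]

# Hypothetical hypotheses over a single numeric feature.
hypotheses = [
    lambda x: x > 2,   # positive to the right of 2
    lambda x: x < 8,   # positive to the left of 8
    lambda x: False,   # always negative
]

for x in (1, 5, 9):
    print(x, majority_vote(hypotheses, x))
```

With an odd number of binary hypotheses there are no ties; with an even number or more classes, a tie-breaking rule would need to be chosen.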

29 Boosting
  Uses a weighted training set
    Each example has an associated weight w_j ≥ 0
    Higher weighted examples have higher importance
  Initially, w_j = 1 for all examples
  Next round: increase weights of misclassified examples, decrease other weights
  From the new weighted set, generate hypothesis h_2
  Continue until M hypotheses generated
  Final ensemble hypothesis = weighted-majority combination of all M hypotheses
    Weight each hypothesis according to how well it did on training data

30 AdaBoost
  If the input learning algorithm L is a weak learning algorithm
    L always returns a hypothesis with weighted error on the training set slightly better than random
  then AdaBoost returns a hypothesis that classifies the training data perfectly for large enough M
  Boosts the accuracy of the original learning algorithm on training data
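The boosting loop can be sketched with single-threshold decision stumps as the weak learner. Several choices here are assumptions rather than anything stated in the slides: the stump learner, the one-dimensional toy data, the labels +1/-1, and the normalization of weights to sum to 1 (so each starts at 1/N rather than 1, which only rescales). Each hypothesis gets vote weight log((1 - error) / error), a common choice for AdaBoost.

```python
import math

def learn_stump(xs, ys, w):
    """Assumed weak learner: the single-threshold rule with least weighted error."""
    points = sorted(set(xs))
    best = None
    for lo, hi in zip(points, points[1:]):
        t = (lo + hi) / 2
        for sign in (1, -1):
            # Rule: predict `sign` when x < t, otherwise `-sign`.
            err = sum(wj for x, y, wj in zip(xs, ys, w)
                      if (sign if x < t else -sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, M):
    n = len(xs)
    w = [1.0 / n] * n                 # equal weights, normalized to sum to 1
    ensemble = []                     # (vote weight z, threshold t, sign)
    for _ in range(M):
        err, t, sign = learn_stump(xs, ys, w)
        if err == 0:                  # a perfect stump can decide alone
            return [(1.0, t, sign)]
        # Shrink the weights of correctly classified examples, then renormalize,
        # which raises the relative weight of the misclassified ones.
        w = [wj * err / (1 - err) if (sign if x < t else -sign) == y else wj
             for x, y, wj in zip(xs, ys, w)]
        total = sum(w)
        w = [wj / total for wj in w]
        ensemble.append((math.log((1 - err) / err), t, sign))
    return ensemble

def predict(ensemble, x):
    """Weighted-majority vote of the stumps."""
    score = sum(z * (sign if x < t else -sign) for z, t, sign in ensemble)
    return 1 if score > 0 else -1

# Toy data: no single threshold separates the classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for M in (1, 3):
    ensemble = adaboost(xs, ys, M)
    correct = sum(predict(ensemble, x) == y for x, y in zip(xs, ys))
    print(M, correct / len(xs))
```

On this toy set a single stump gets 7 of 10 training examples right, while three boosted stumps classify all ten correctly, which illustrates the claim that training accuracy improves as M grows.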
