Machine Learning Reading: Chapter 18

2 Text Classification
- Is text i a finance news article? Positive / Negative

3 20 attributes
- Investors 2, Dow 2, Jones 2, Industrial 1, Average 3, Percent 5, Gain 6, Trading 8, Broader 5, stock 5, Indicators 6, Standard 2, Rolling 1, Nasdaq 3, Early 10, Rest 12, More 13, first 11, Same 12, The 30

4 20 attributes
- Men's, Basketball, Championship, UConn Huskies, Georgia Tech, Women, Playing, Crown, Titles, Games, Rebounds, All-America, early, rolling, Celebrates, Rest, More, First, The, same

Example

  #    stock  rolling  the   class
  1      0      3       40   other
  2      6      8       35   finance
  3      7      7       25   other
  4      5      7       14   other
  5      8      2       20   finance
  6      9      4       25   finance
  7      5      6       20   finance
  8      0      2       35   other
  9      …      …        …   finance
  10     …      …        …   other

6 Constructing the Decision Tree
- Goal: find the smallest decision tree consistent with the examples
- Find the attribute that best splits the examples
- Form a tree with root = best attribute
- For each value v_i (or range) of the best attribute:
  - Select those examples with best = v_i
  - Construct subtree_i by recursively calling the decision-tree procedure with this subset of examples and all attributes except best
  - Add a branch to the tree with label = v_i and subtree = subtree_i
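A minimal Python sketch of this recursive loop, assuming examples are dicts of attribute -> value plus a "class" key, and that a choose_best_attribute function (e.g. highest information gain, see the snippets below) is supplied; both are assumptions of this sketch, not part of the slides.

```python
# Recursive decision-tree construction, following the loop described above.
from collections import Counter

def build_tree(examples, attributes, choose_best_attribute):
    classes = [e["class"] for e in examples]
    if len(set(classes)) == 1:                 # all examples agree: leaf node
        return classes[0]
    if not attributes:                         # no attributes left: majority class
        return Counter(classes).most_common(1)[0][0]
    best = choose_best_attribute(examples, attributes)
    tree = {best: {}}
    for v in {e[best] for e in examples}:      # one branch per observed value of best
        subset = [e for e in examples if e[best] == v]
        remaining = [a for a in attributes if a != best]
        tree[best][v] = build_tree(subset, remaining, choose_best_attribute)
    return tree
```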

7 Choosing the Best Attribute: Binary Classification
- We want a formal measure that returns a maximum value when an attribute makes a perfect split and a minimum value when it makes no distinction
- Information theory (Shannon and Weaver 1949) provides two such measures:
  - Entropy: a measure that characterizes the impurity of a collection of examples
  - Information gain: the expected reduction in entropy caused by partitioning the examples according to an attribute

8 Formula for Entropy
H(P(v_1), …, P(v_n)) = Σ_{i=1..n} -P(v_i) log2 P(v_i), where P(v_i) is the probability of value v_i.
Examples:
- A collection of 10 examples, 5 positive and 5 negative: H(1/2, 1/2) = -1/2 log2 1/2 - 1/2 log2 1/2 = 1 bit
- A collection of 100 examples, 1 positive and 99 negative: H(1/100, 99/100) = -.01 log2 .01 - .99 log2 .99 ≈ .08 bits
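A small runnable sketch of this formula; the two calls reproduce the worked examples above.

```python
# Entropy of a distribution: H(P(v1), ..., P(vn)) = sum of -P(vi) * log2 P(vi).
from math import log2

def entropy(probabilities):
    return sum(-p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))      # 1.0 bit
print(entropy([0.01, 0.99]))    # ~0.08 bits
```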

9 Choosing the Best Attribute: Information Gain
- Information gain (from an attribute test) = the difference between the original information requirement and the new requirement
- Gain(A) = H(p/(p+n), n/(p+n)) - Remainder(A), where H is entropy
- H is highest (value 1) when the set is equally divided between positive (p) and negative (n) examples, i.e. (.5, .5)
- H is lower as the set becomes more unbalanced (e.g., (.9, .1))

Information based on attributes = Remainder(A) = Σ_i (p_i + n_i)/(p + n) · H(p_i/(p_i + n_i), n_i/(p_i + n_i)), the expected entropy remaining after testing attribute A.
Here p = n = 10, so H(1/2, 1/2) = 1 bit.
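A sketch of Gain(A) = H(p/(p+n), n/(p+n)) - Remainder(A). It assumes examples are stored as (attribute-dict, label) pairs with two label values; the "finance" default for the positive class is an assumption of this sketch.

```python
# Information gain of an attribute over a set of labeled examples.
from collections import defaultdict
from math import log2

def entropy_of(labels, positive):
    p = sum(1 for label in labels if label == positive)
    n = len(labels) - p
    return sum(-c / len(labels) * log2(c / len(labels)) for c in (p, n) if c > 0)

def information_gain(examples, attribute, positive="finance"):
    labels = [label for _, label in examples]
    # Remainder(A): entropy of each branch, weighted by the branch's share of examples.
    branches = defaultdict(list)
    for attrs, label in examples:
        branches[attrs[attribute]].append(label)
    remainder = sum(len(b) / len(examples) * entropy_of(b, positive)
                    for b in branches.values())
    return entropy_of(labels, positive) - remainder
```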

11 Text Classification
- Is text i a finance news article? Positive / Negative


[Figure: candidate splits of the training examples. Splitting on stock: < 5 → examples {1, 8, 9, 10}; 5-10 → examples {2, 3, 4, 5, 6, 7}. Splitting on rolling: < 5 → examples {1, 5, 6, 8}; 5-10 → examples {2, 3, 4, 7}; examples 9 and 10 fall in a remaining branch.]

14 The algorithm as specified so far is designed for binary classification and attributes with discrete values.
Attributes:
- Outlook: sunny, overcast, rain
- Temperature: hot, mild, cool
- Humidity: normal, high
- Wind: weak, strong
Classification:
- PlayTennis?: Yes, No

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

[Figure: the four candidate root attributes for the PlayTennis data. The full set S has entropy E = .940 (9/14 yes).
- Humidity: High → E = .985, Normal → E = .592; Gain(S, Humidity) = .940 - (7/14)(.985) - (7/14)(.592) = .151
- Wind: Weak → E = .811, Strong → E = 1.0; Gain(S, Wind) = .940 - (8/14)(.811) - (6/14)(1.0) = .048
- Outlook: Sunny → E = .971, Overcast → E = 0, Rain → E = .971; Gain(S, Outlook) = .940 - (5/14)(.971) - (4/14)(0) - (5/14)(.971) = .246
- Temperature: Gain(S, Temperature) = .029
Outlook is selected as the root because it has the highest gain.]
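A quick check of the Outlook and Humidity gains above; the per-branch yes/no counts are read off the PlayTennis table.

```python
# Verify two of the information-gain figures above.
from math import log2

def H(p, n):
    # Entropy of a set containing p "yes" and n "no" examples.
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c > 0)

base = H(9, 5)                                            # ~ .940
# Outlook: Sunny (2 yes, 3 no), Overcast (4 yes, 0 no), Rain (3 yes, 2 no)
gain_outlook = base - (5/14)*H(2, 3) - (4/14)*H(4, 0) - (5/14)*H(3, 2)
# Humidity: High (3 yes, 4 no), Normal (6 yes, 1 no)
gain_humidity = base - (7/14)*H(3, 4) - (7/14)*H(6, 1)
print(round(gain_outlook, 3), round(gain_humidity, 3))    # 0.247 0.152 (the slide's
                                                          # .246 / .151, up to rounding)
```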

19 Extending the algorithm for continuous-valued attributes
- Dynamically define new discrete-valued attributes that partition the continuous attribute into a discrete set of intervals
- For a continuous attribute A, create A_c that is true if A < c, false otherwise
- How to select the best value for the threshold c?
  - Sort the examples by the continuous attribute
  - Identify adjacent examples that differ in their target classification
  - Generate a set of candidate thresholds midway between the corresponding values of A
  - Choose the threshold c that maximizes information gain

20 Example: temperature as a continuous value
- Sorted by Temperature, the PlayTennis labels run No … Yes … No
- Two candidate thresholds: (48+60)/2 = 54 and (80+90)/2 = 85
- Information gain is greater for Temperature > 54 than for Temperature > 85
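A sketch of the candidate-threshold procedure: sort by the continuous attribute, find adjacent examples whose labels differ, and propose the midpoint between their values. The temperature column below is a hypothetical reconstruction; only 48, 60, 80, and 90 are implied by the slide's two thresholds.

```python
# Candidate split thresholds for a continuous attribute.
def candidate_thresholds(values, labels):
    pairs = sorted(zip(values, labels))            # sort examples by attribute value
    thresholds = []
    for (v1, l1), (v2, l2) in zip(pairs, pairs[1:]):
        if l1 != l2:                               # class changes between neighbours
            thresholds.append((v1 + v2) / 2)       # midpoint is a candidate threshold
    return thresholds

temps  = [40, 48, 60, 72, 80, 90]                  # hypothetical temperatures
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, labels))         # [54.0, 85.0]
```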

21 Other cases
- What if the class is discrete valued but not binary?
- What if an attribute has many values (e.g., one per instance)?

22 Training vs. Testing
- A learning algorithm is good if it uses its learned hypothesis to make accurate predictions on unseen data
- Collect a large set of examples (with classifications)
- Divide it into two disjoint sets: the training set and the test set
- Apply the learning algorithm to the training set, generating hypothesis h
- Measure the percentage of examples in the test set that are correctly classified by h
- Repeat for different sizes of training sets and different randomly selected training sets of each size
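A hedged sketch of this protocol using scikit-learn's decision tree; the bundled Iris data stands in for "a large set of examples".

```python
# Train on one split, measure accuracy on the held-out test split.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)
h = DecisionTreeClassifier(criterion="entropy").fit(X_train, y_train)
print("test accuracy:", h.score(X_test, y_test))
```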


24 Overfitting
- Learning algorithms may use irrelevant attributes to make decisions (for news: the day published and the newspaper)
- When else can overfitting occur?
- Solution #1: decision tree pruning
  - Prune away attributes with low information gain
  - Use statistical significance to test whether the gain is meaningful
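The statistical-significance idea can be sketched with a chi-square test on a candidate split's per-branch class counts. The counts below are hypothetical, and this is one possible way to implement the test rather than necessarily the slide's exact method.

```python
# Decide whether a candidate split is statistically meaningful or should be pruned.
from scipy.stats import chi2_contingency

# Rows = branches of the candidate split, columns = (positive, negative) counts.
branch_class_counts = [
    [8, 7],   # branch 1: 8 positive, 7 negative
    [7, 8],   # branch 2: 7 positive, 8 negative
]

chi2, p_value, dof, expected = chi2_contingency(branch_class_counts)
if p_value > 0.05:
    print(f"p = {p_value:.2f}: split looks like chance -> prune it")
else:
    print(f"p = {p_value:.2f}: split is significant -> keep it")
```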

25 K-fold Cross Validation
- Solution #2 to reduce overfitting
- Run k experiments
  - Use a different 1/k of the data for testing each time
  - Average the results
- 5-fold, 10-fold, leave-one-out
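A short sketch of 10-fold cross-validation with scikit-learn (Iris again standing in for the labeled data).

```python
# Average test accuracy over 10 folds.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(criterion="entropy"), X, y, cv=10)
print("mean accuracy over 10 folds:", scores.mean())
```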

26 Cross-Validation Model
[Figure: the 1566 labeled examples are split into 10 folds; in each round, 9 folds (approx. 1409 examples) are used to train and 1 fold (approx. 157 examples) to evaluate; lather, rinse, repeat 10 times and report the average.]

27 Example

28 Ensemble Learning
- Learn from a collection of hypotheses
- Majority voting
- Enlarges the hypothesis space

29 Boosting
- Uses a weighted training set
  - Each example has an associated weight w_j ≥ 0
  - Higher-weighted examples have higher importance
- Initially, w_j = 1 for all examples
- Generate hypothesis h_1 from this weighted set
- Next round: increase the weights of misclassified examples, decrease the other weights
- From the new weighted set, generate hypothesis h_2
- Continue until M hypotheses have been generated
- Final ensemble hypothesis = weighted-majority combination of all M hypotheses
  - Weight each hypothesis according to how well it did on the training data
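A minimal AdaBoost-style sketch of this loop, using scikit-learn decision stumps as the weak learner. It assumes X and y are NumPy arrays with labels in {-1, +1}; the stump choice, the normalized initial weights (1/n rather than 1), and the exact weight-update rule are assumptions of this sketch.

```python
# Boosting: reweight examples each round, then combine hypotheses by weighted majority.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, M=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # initially every example weighs the same
    hypotheses, alphas = [], []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)       # train on the weighted set
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()     # weighted training error
        if err >= 0.5:                         # weak learner must beat random guessing
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w *= np.exp(-alpha * y * pred)         # up-weight mistakes, down-weight hits
        w /= w.sum()
        hypotheses.append(stump)
        alphas.append(alpha)
    return hypotheses, alphas

def predict(hypotheses, alphas, X):
    # Weighted-majority vote of all hypotheses.
    votes = sum(a * h.predict(X) for h, a in zip(hypotheses, alphas))
    return np.sign(votes)
```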

30 AdaBoost
- If the input learning algorithm L is a weak learning algorithm (L always returns a hypothesis whose weighted error on the training data is slightly better than random), then AdaBoost returns a hypothesis that classifies the training data perfectly for large enough M
- That is, it boosts the accuracy of the original learning algorithm on the training data