Machine Learning Reading: Chapter 18. 2 Machine Learning and AI  Improve task performance through observation, teaching  Acquire knowledge automatically.

Slides:



Advertisements
Similar presentations
Learning from Observations Chapter 18 Section 1 – 3.
Advertisements

Random Forest Predrag Radenković 3237/10
Decision Trees Decision tree representation ID3 learning algorithm
CPSC 502, Lecture 15Slide 1 Introduction to Artificial Intelligence (AI) Computer Science cpsc502, Lecture 15 Nov, 1, 2011 Slide credit: C. Conati, S.
Decision Tree Approach in Data Mining
Indian Statistical Institute Kolkata
Chapter 7 – Classification and Regression Trees
Chapter 7 – Classification and Regression Trees
Learning from Observations Copyright, 1996 © Dale Carnegie & Associates, Inc. Chapter 18 Spring 2004.
Learning from Observations Copyright, 1996 © Dale Carnegie & Associates, Inc. Chapter 18 Fall 2005.
18 LEARNING FROM OBSERVATIONS
Learning From Observations
Decision Tree Algorithm
Learning from Observations Copyright, 1996 © Dale Carnegie & Associates, Inc. Chapter 18 Fall 2004.
Learning from Observations Copyright, 1996 © Dale Carnegie & Associates, Inc. Chapter 18.
Data Mining with Decision Trees Lutz Hamel Dept. of Computer Science and Statistics University of Rhode Island.
Three kinds of learning
LEARNING DECISION TREES
1 MACHINE LEARNING TECHNIQUES IN IMAGE PROCESSING By Kaan Tariman M.S. in Computer Science CSCI 8810 Course Project.
Machine Learning Reading: Chapter Text Classification  Is text i a finance new article? PositiveNegative.
MACHINE LEARNING. What is learning? A computer program learns if it improves its performance at some task through experience (T. Mitchell, 1997) A computer.
Machine Learning Chapter 3. Decision Tree Learning
Machine Learning CPS4801. Research Day Keynote Speaker o Tuesday 9:30-11:00 STEM Lecture Hall (2 nd floor) o Meet-and-Greet 11:30 STEM 512 Faculty Presentation.
Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.
Data Mining Joyeeta Dutta-Moscato July 10, Wherever we have large amounts of data, we have the need for building systems capable of learning information.
Short Introduction to Machine Learning Instructor: Rada Mihalcea.
Processing of large document collections Part 2 (Text categorization) Helena Ahonen-Myka Spring 2006.
INTRODUCTION TO MACHINE LEARNING. $1,000,000 Machine Learning  Learn models from data  Three main types of learning :  Supervised learning  Unsupervised.
Learning CPSC 386 Artificial Intelligence Ellen Walker Hiram College.
Machine Learning1 Machine Learning: Summary Greg Grudic CSCI-4830.
Chapter 9 – Classification and Regression Trees
Lecture 7. Outline 1. Overview of Classification and Decision Tree 2. Algorithm to build Decision Tree 3. Formula to measure information 4. Weka, data.
ASSESSING LEARNING ALGORITHMS Yılmaz KILIÇASLAN. Assessing the performance of the learning algorithm A learning algorithm is good if it produces hypotheses.
Learning from observations
Learning from Observations Chapter 18 Through
CHAPTER 18 SECTION 1 – 3 Learning from Observations.
1 Learning Chapter 18 and Parts of Chapter 20 AI systems are complex and may have many parameters. It is impractical and often impossible to encode all.
Today Ensemble Methods. Recap of the course. Classifier Fusion
Ensemble Learning Spring 2009 Ben-Gurion University of the Negev.
For Wednesday No reading Homework: –Chapter 18, exercise 6.
Copyright R. Weber Machine Learning, Data Mining INFO 629 Dr. R. Weber.
For Monday No new reading Homework: –Chapter 18, exercises 3 and 4.
CS 8751 ML & KDDDecision Trees1 Decision tree representation ID3 learning algorithm Entropy, Information gain Overfitting.
ASSESSING LEARNING ALGORITHMS Yılmaz KILIÇASLAN. Assessing the performance of the learning algorithm A learning algorithm is good if it produces hypotheses.
ASSESSING LEARNING ALGORITHMS Yılmaz KILIÇASLAN. Assessing the performance of the learning algorithm A learning algorithm is good if it produces hypotheses.
CS 5751 Machine Learning Chapter 3 Decision Tree Learning1 Decision Trees Decision tree representation ID3 learning algorithm Entropy, Information gain.
Decision Trees, Part 1 Reading: Textbook, Chapter 6.
Classification (slides adapted from Rob Schapire) Eran Segal Weizmann Institute.
1 Tournament Not complete Processing will begin again tonight, 7:30PM until wee hours Friday, 8-5. Extra Credit 5 points for passing screening, in tournament.
DECISION TREE Ge Song. Introduction ■ Decision Tree: is a supervised learning algorithm used for classification or regression. ■ Decision Tree Graph:
Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer.
Chapter 18 Section 1 – 3 Learning from Observations.
Machine Learning Lecture 1: Intro + Decision Trees Moshe Koppel Slides adapted from Tom Mitchell and from Dan Roth.
Learning From Observations Inductive Learning Decision Trees Ensembles.
SUPERVISED AND UNSUPERVISED LEARNING Presentation by Ege Saygıner CENG 784.
DATA MINING TECHNIQUES (DECISION TREES ) Presented by: Shweta Ghate MIT College OF Engineering.
1 Machine Learning Lecture 8: Ensemble Methods Moshe Koppel Slides adapted from Raymond J. Mooney and others.
Decision Tree Learning CMPT 463. Reminders Homework 7 is due on Tuesday, May 10 Projects are due on Tuesday, May 10 o Moodle submission: readme.doc and.
Machine Learning Reading: Chapter Classification Learning Input: a set of attributes and values Output: discrete valued function Learning a continuous.
Learning from Observations
Learning from Observations
Chapter 18 From Data to Knowledge
Introduce to machine learning
Decision Trees (suggested time: 30 min)
Presented By S.Yamuna AP/CSE
Machine Learning Ensemble Learning: Voting, Boosting(Adaboost)
Statistical Learning Dong Liu Dept. EEIS, USTC.
Learning from Observations
Learning from Observations
Machine Learning: Methodology Chapter
Presentation transcript:

Machine Learning Reading: Chapter 18

2 Machine Learning and AI  Improve task performance through observation, teaching  Acquire knowledge automatically for use in a task  Learning as a key component in intelligence

3 What does it mean to learn?

4 Different kinds of Learning  Rote learning  Learning from instruction  Learning by analogy  Learning from observation and discovery  Learning from examples

5 Inductive Learning  Input: x, f(x)  Output: a function h that approximates f  A good hypothesis, h, is a generalization or learned rule

6 How do systems learn?  Supervised  Unsupervised  Reinforcement

7 Three Types of Learning  Rule induction  E.g., decision trees  Knowledge based  E.g., using a domain theory  Statistical  E.g., Naïve bayes, Nearest neighbor, support vector machines

8 Applications  Language/speech  Machine translation  Summarization  Grammars  IR  Text categorization, relevance feedback  Medical  Assessment of illness severity  Vision  Face recognition, digit recognition, outdoor scene recognition  Security  Intrusion detection, network traffic, credit fraud  Social networks  traffic  To think about: applications to systems, computer engineering, software?

9 Language Tasks  Text summarization  Task: given a document which sentences could serve as the summary  Training data: summary + document pairs  Output: rules which extract sentences given an unseen document  Grammar induction  Task: produce a tree representing syntactic structure given a sentence  Training data: set of sentences annotated with parse tree  Output: rules which generate a parse tree given an unseen sentence

10 IR Task  Text categorization Task: given a web page, is it news or not?  Binary classification (yes, no)  Classify as one of business&economy,news&media, computer Training data: documents labeled with category Output: a yes/no response for a new document; a category for a new document

11 Medical  Task: Does a patient have heart disease (on a scale from 1 to 4)  Training data:  Age, sex,cholesterol, chest pain location, chest pain type, resting blood pressure, smoker?, fasting blood sugar, etc.  Characterization of heart disease (0,1-4)  Output:  Given a new patient, classification by disease

12 General Approach  Formulate task  Prior model (parameters, structure)  Obtain data  What representation should be used? (attribute/value pairs)  Annotate data  Learn/refine model with data (training)  Use model for classification or prediction on unseen data (testing)  Measure accuracy

13 Issues  Representation  How to map from a representation in the domain to a representation used for learning?  Training data  How can training data be acquired?  Amount of training data  How well does the algorithm do as we vary the amount of data?  Which attributes influence learning most?  Does the learning algorithm provide insight into the generalizations made?

14 Classification Learning  Input: a set of attributes and values  Output: discrete valued function  Learning a continuous valued function is called regression  Binary or boolean classification: category is either true or false

15 Learning Decision Trees  Each node tests the value of an input attribute  Branches from the node correspond to possible values of the attribute  Leaf nodes supply the values to be returned if that leaf is reached

16 Example  ml ml  Iris Plant Database  Which of 3 classes is a given Iris plant?  Iris Setosa  Iris Versicolour  Iris Virginica  Attributes  Sepal length in cm  Sepal width in cm  Petal length in cm  Petal width in cm

17 Summary Statistics: Min Max Mean SD ClassCorrelation sepal length: sepal width: petal length: (high!) petal width: (high!)  Rules to learn  If sepal length > 6 and sepal width > 3.8 and petal length < 2.5 and petal width < 1.5 then class = Iris Setosa  If sepal length > 5 and sepal width > 3 and petal length >5.5 and petal width >2 then class = Iris Versicolour  If sepal length 3 and petal length  2.5 and ≤ 5.5 and petal width  1.5 and ≤ 2 then class = Iris Virginica

Data S-lengthS-widthP-lengthP-widthClass Versicolour Setosa Verginica Verginica Versicolour Setosa Setosa Verginica Versicolour

Data S-lengthS-widthP-lengthClass Versicolour Setosa Verginica Verginica Versicolour Setosa Setosa Verginica Versicolour

20 Constructing the Decision Tree  Goal: Find the smallest decision tree consistent with the examples  Find the attribute that best splits examples  Form tree with root = best attribute  For each value v i (or range) of best attribute  Selects those examples with best=v i  Construct subtree i by recursively calling decision tree with subset of examples, all attributes except best  Add a branch to tree with label=v i and subtree=subtree i

21 Construct example decision tree

22 Tree and Rules Learned S-length <5.5  5.5 3,4,8P-length  5.6 ≤2.4 1,5,9 2,6,7 If S-length < 5.5., then Verginica If S-length  5.5 and P-length  5.6, then Versicolour If S-length  5.5 and P-length ≤ 2.4, then Setosa

23 Comparison of Target and Learned If S-length < 5.5., then Verginica If S-length  5.5 and P-length  5.6, then Versicolour If S-length  5.5 and P-length ≤ 2.4, then Setosa If sepal length > 6 and sepal width > 3.8 and petal length < 2.5 and petal width < 1.5 then class = Iris Setosa If sepal length > 5 and sepal width > 3 and petal length >5.5 and petal width >2 then class = Iris Versicolour If sepal length 3 and petal length  2.5 and ≤ 5.5 and petal width  1.5 and ≤ 2 then class = Iris Virginica

24 Text Classification  Is text i a finance new article? PositiveNegative

25 20 attributes  Investors 2  Dow 2  Jones 2  Industrial 1  Average 3  Percent 5  Gain 6  Trading 8  Broader 5  stock 5  Indicators 6  Standard 2  Rolling 1  Nasdaq 3  Early 10  Rest 12  More 13  first 11  Same 12  The 30

26 20 attributes  Men’s  Basketball  Championship  UConn Huskies  Georgia Tech  Women  Playing  Crown  Titles  Games  Rebounds  All-America  early  rolling  Celebrates  Rest  More  First  The  same

27 Example stockrollingtheclass 10340other 26835finance 37725other 45714other 58220finance 69425finance 75620finance 80235other finance other

28 Issues  Representation  How to map from a representation in the domain to a representation used for learning?  Training data  How can training data be acquired?  Amount of training data  How well does the algorithm do as we vary the amount of data?  Which attributes influence learning most?  Does the learning algorithm provide insight into the generalizations made?

29 Constructing the Decision Tree  Goal: Find the smallest decision tree consistent with the examples  Find the attribute that best splits examples  Form tree with root = best attribute  For each value v i (or range) of best attribute  Selects those examples with best=v i  Construct subtree i by recursively calling decision tree with subset of examples, all attributes except best  Add a branch to tree with label=v i and subtree=subtree i

30 Choosing the Best Attribute: Binary Classification  Want a formal measure that returns a maximum value when attribute makes a perfect split and minimum when it makes no distinction  Information theory (Shannon and Weaver 49)  H(P(v 1 ),…P(v n ))=∑-P(v i )log 2 P(v i ) H(1/2,1/2)=-1/2log 2 1/2-1/2log 2 1/2=1 bit H(1/100,1/99)=-.01log log 2.99=.08 bits n i=1

31 Information based on attributes = Remainder (A) P=n=10, so H(1/2,1/2)= 1 bit

32 Information Gain  Information gain (from attribute test) = difference between the original information requirement and new requirement  Gain(A)=H(p/p+n,n/p+n)- Remainder(A)

33 Example stockrollingtheclass 10340other 26835finance 37725other 45714other 58220finance 69425finance 75620finance 80235other finance other

stockrolling <55-10  <5 1,8,9,102,3,4,5,6,7 1,5,6,8 2,3,4,7 9,10 Gain(stock)=1-[4/10H(1/10,3/10)+6/10H(4/10,2/10)]= 1-[.4((-.1* -3.32)-(.3*-1.74))+.6((-.4*-1.32)-(.2*-2.32))]= 1-[ ]=.105 Gain(rolling)=1-[4/10H(1/2,1/2)+4/10H(1/2,1/2)+2/10H(1/2,1/2)]=0

35 Other cases  What if class is discrete valued, not binary?  What if an attribute has many values (e.g., 1 per instance)?

36 Training vs. Testing  A learning algorithm is good if it uses its learned hypothesis to make accurate predictions on unseen data  Collect a large set of examples (with classifications)  Divide into two disjoint sets: the training set and the test set  Apply the learning algorithm to the training set, generating hypothesis h  Measure the percentage of examples in the test set that are correctly classified by h  Repeat for different sizes of training sets and different randomly selected training sets of each size.

37

38 Division into 3 sets  Inadvertent peeking Parameters that must be learned (e.g., how to split values) Generate different hypotheses for different parameter values on training data Choose values that perform best on testing data Why do we need to do this for selecting best attributes?

39 Overfitting  Learning algorithms may use irrelevant attributes to make decisions For news, day published and newspaper  Decision tree pruning Prune away attributes with low information gain Use statistical significance to test whether gain is meaningful

40 K-fold Cross Validation  To reduce overfitting  Run k experiments Use a different 1/k of data for testing each time Average the results  5-fold, 10-fold, leave-one-out

41 Ensemble Learning  Learn from a collection of hypotheses  Majority voting  Enlarges the hypothesis space

42 Boosting  Uses a weighted training set Each example as an associated weight w j 0 Higher weighted examples have higher importance  Initially, w j =1 for all examples  Next round: increase weights of misclassified examples, decrease other weights  From the new weighted set, generate hypothesis h 2  Continue until M hypotheses generated  Final ensemble hypothesis = weighted-majority combination of all M hypotheses  Weight each hypothesis according to how well it did on training data

43 AdaBoost  If input learning algorithm is a weak learning algorithm L always returns a hypothesis with weighted error on training slightly better than random  Returns hypothesis that classifies training data perfectly for large enough M  Boosts the accuracy of the original learning algorithm on training data

44 Issues  Representation  How to map from a representation in the domain to a representation used for learning?  Training data  How can training data be acquired?  Amount of training data  How well does the algorithm do as we vary the amount of data?  Which attributes influence learning most?  Does the learning algorithm provide insight into the generalizations made?