Albert Gatt Corpora and Statistical Methods Lecture 13

Supervised methods Part 2

General characterisation
Training set: documents labelled with one or more classes, encoded using some representation model.
Typical data representation model: every document is represented as a vector of real-valued measurements plus a class label.
- the vector may represent word counts
- the class is given, since this is supervised learning

Data sources: the Reuters-21578 collection
- a large collection of Reuters newswire texts, categorised by topic
- topics include: earn(ings), grain, wheat, acq(uisitions), …

Reuters dataset: two example texts

Text 1 (17-MAR :07:22.82, topic: earn)
AMRE INC <AMRE> 3RD QTR JAN 31 NET
DALLAS, March 17 - Shr five cts vs one ct
Net 196,986 vs 37,966
Revs 15.5 mln vs 8,900,000
Nine mths Shr 52 cts vs 22 cts
Net two mln vs 874,000
Revs 53.7 mln vs 28.6 mln
Reuter

Text 2 (17-MAR :26:47.36, topic: acq)
DEVELOPMENT CORP OF AMERICA <DCA> MERGED
HOLLYWOOD, Fla., March 17 - Development Corp of America said its merger with Lennar Corp <LEN> was completed and its stock no longer existed. Development Corp of America, whose board approved the acquisition last November for 90 mln dlrs, said the merger was effective today and its stock now represents the right to receive 15 dlrs a share. The American Stock Exchange said it would provide further details later.
Reuter

Representing documents: vector representations
Suppose we select k = 20 keywords that are diagnostic of the "earnings" category (this can be done using chi-square, topic signatures, etc.).
Each document j is represented as a vector containing a term weight for each of the k terms:

weight(i, j) = round( 10 * (1 + log(tf_ij)) / (1 + log(l_j)) )

where tf_ij is the number of times term i occurs in document j and l_j is the length of document j (terms that do not occur in the document get weight 0).

Why use a log weighting scheme?
A formula like 1 + log(tf) dampens the raw frequency.
Example: let d be a document of 89 words.
- profit occurs 6 times: tf(profit) = 6; 10 * (1 + log(tf(profit))) / (1 + log(89)) = 6
- cts ("cents") occurs 3 times: tf(cts) = 3; 10 * (1 + log(tf(cts))) / (1 + log(89)) = 5
So we avoid overestimating the importance of profit relative to cts: profit is more important than cts, but not twice as important.
Log weighting schemes are common in information retrieval.
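To make the weighting concrete, here is a minimal Python sketch of the scheme above (base-10 logarithms, which reproduce the worked values; rounding to whole numbers is assumed from the rounded results shown):

    import math

    def term_weight(tf, doc_length):
        # Log-dampened weight: 10 * (1 + log tf) / (1 + log l), rounded;
        # a term that does not occur gets weight 0.
        if tf == 0:
            return 0
        return round(10 * (1 + math.log10(tf)) / (1 + math.log10(doc_length)))

    print(term_weight(6, 89))  # profit occurs 6 times in an 89-word doc -> 6
    print(term_weight(3, 89))  # cts occurs 3 times -> 5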

Decision trees

Form of a decision tree
Example: the probability of belonging to the category "earnings", given that s(cts) > 2, is 0.116.
[Figure: a decision tree. The root, node 1 (p(c|n1) = 0.3), splits on cts at value 2 (cts < 2 vs cts ≥ 2). Node 2 splits on net at value 1 (net < 1 vs net ≥ 1), with children nodes 3 and 4. Node 5 (p(c|n5) = 0.9) splits on vs at value 2, with children nodes 6 and 7.]

Form of a decision tree
A decision tree is equivalent to a formula in disjunctive normal form, e.g.:
(cts < 2 & net < 1 & …) V (cts ≥ 2 & net ≥ 1 & …)
Each complete path through the tree is one conjunction.
[Same tree as on the previous slide.]
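To make the path-as-conjunction point concrete, here is a small sketch of the example tree as nested conditionals; the split features and thresholds are the ones in the figure, but the leaf probabilities are placeholders, since the slides do not give values for nodes 3, 4, 6 and 7:

    def p_earnings(cts, net, vs):
        # Each root-to-leaf path is one conjunct of the DNF formula,
        # e.g. (cts < 2 & net < 1) or (cts >= 2 & vs >= 2).
        if cts < 2:              # node 2: split on net at 1
            if net < 1:
                return 0.05      # node 3 (placeholder probability)
            return 0.20          # node 4 (placeholder probability)
        else:                    # node 5: split on vs at 2
            if vs < 2:
                return 0.60      # node 6 (placeholder probability)
            return 0.95          # node 7 (placeholder probability)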

How to grow a decision tree
Typical procedure:
- grow a very large tree
- prune it
Pruning avoids overfitting the training data: a tree can contain several branches that are based on accidental properties of the training set, e.g. only 1 document in category "earnings" contains both "dlrs" and "pct".

Growing the tree
- Splitting criterion: identifies the feature a and the value on which a node is split.
- Stopping criterion: determines when to stop splitting, e.g. stop when all elements at a node have an identical representation (equal vectors for all keywords).

Growing the tree: Splitting criterion
Information gain: do we reduce uncertainty if we split node n into two when attribute a has value y?
Let t be the class distribution at node n. This is equivalent to comparing the entropy of t with the entropy of t given a, i.e. the entropy of t versus the entropy of its child nodes if we split:

G(a, y) = H(t) - (p_L * H(t_L) + p_R * H(t_R))

where the entropies of the child nodes are summed, weighted by the proportion p of the items from n that fall into each child (left and right).

Information gain example
- at node 1: P(c|n1) = 0.3, H = 0.6
- at node 2: P(c|n2) = 0.1, H = 0.35
- at node 5: P(c|n5) = 0.9, H = 0.22
gain = H at node 1 minus the weighted sum of the entropies of nodes 2 and 5.
[Same tree as on the earlier slides.]
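A small Python sketch of the computation (natural logarithms, which match the H ≈ 0.6 quoted for node 1; the slide does not give the node sizes, so the child proportions below are illustrative assumptions):

    import math

    def entropy(p):
        # Binary entropy (in nats) of belonging to the category with probability p
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log(p) + (1 - p) * math.log(1 - p))

    h1 = entropy(0.3)       # node 1: ~0.61, matching the 0.6 above
    h2, h5 = 0.35, 0.22     # H values quoted on the slide for nodes 2 and 5

    # Assumed (illustrative) proportions of node 1's items in nodes 2 and 5
    p2, p5 = 0.8, 0.2

    gain = h1 - (p2 * h2 + p5 * h5)
    print(round(gain, 2))   # ~0.29 under these assumed proportions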

Leaf nodes
Suppose node n3 contains 1500 "earnings" documents, plus documents from other categories. Where do we classify a new document d that reaches n3? E.g. estimate p(c|n3) using the MLE with add-one smoothing.
[Same tree as on the earlier slides; n3 and n4 are the leaf children of node n2.]
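A one-line sketch of the smoothed estimate at that leaf; the number of non-"earnings" documents at n3 is not given on the slide, so it is a placeholder here:

    earnings_docs = 1500
    other_docs = 20       # placeholder: the slide does not give this count
    num_classes = 2       # "earnings" vs. everything else

    # MLE with add-one (Laplace) smoothing
    p_earnings = (earnings_docs + 1) / (earnings_docs + other_docs + num_classes)
    print(round(p_earnings, 3))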

Pruning the tree
Pruning proceeds by removing leaf nodes one by one, until the tree is empty. At each step, we remove the leaf node expected to be least helpful.
This requires a pruning criterion, i.e. a measure of "confidence" indicating what evidence we have that the node is useful.
Each pruning step gives us a new tree (the old tree minus one node), for a total of n trees if the original tree had n nodes.
Which of these trees do we select as our final classifier?

Pruning the tree: held-out data
To select the best tree, we can use held-out data: at each pruning step, evaluate the resulting tree on the held-out data and record its success rate, then keep the tree that scores best.
Since holding out data reduces the training set, it is better to perform cross-validation.
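A sketch of the selection step, assuming we already have the sequence of progressively pruned trees and a scoring function for held-out documents (candidate_trees and accuracy_on are hypothetical names, not part of the lecture):

    def select_best_tree(candidate_trees, heldout_docs, heldout_labels, accuracy_on):
        # Keep the pruned tree that classifies the held-out data most accurately
        best_tree, best_acc = None, -1.0
        for tree in candidate_trees:
            acc = accuracy_on(tree, heldout_docs, heldout_labels)
            if acc > best_acc:
                best_tree, best_acc = tree, acc
        return best_tree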

When are decision trees useful?
Some disadvantages:
- a decision tree is a complex classification device, with many parameters
- it splits the training data into very small chunks, and small sets will display regularities that don't generalise (overfitting)
Main advantage: very easy to understand!

Maximum entropy for text classification

A reminder from lecture 9
The MaxEnt distribution is a log-linear model: the probability of a category c and document d is computed as a weighted multiplication of feature values, normalised by a constant:

p(d, c) = (1/Z) · Π_i α_i^f_i(d, c)

Each feature f_i imposes a constraint on the model: its expected value under the model must equal its empirical value in the training data.

A reminder from lecture 9
The MaxEnt principle dictates that we find the simplest model p* satisfying the constraints:

p* = argmax_{p ∈ P} H(p)

where P is the set of possible distributions consistent with the constraints. p* is unique and has the log-linear form given earlier.
The weights for the features can be found using Generalised Iterative Scaling.

Application to text categorisation
Example: we've identified 20 keywords which are diagnostic of the "earnings" category in Reuters; each keyword is a feature.

"Earnings" features (from M&S '99)
[Table: for each feature word f_j — cts, profit, net, loss, dlrs, pct, is — the table gives its weight α_j and log α_j. The words near the top (cts, profit, …) are very salient/diagnostic features with higher weights; those near the bottom (e.g. is) are less important features.]

Classifying with the maxent model
Recall that p(d, c) is the normalised, weighted product of feature values given earlier.
As a decision criterion we can use: classify a new document as "earnings" if P("earnings" | d) > P(¬"earnings" | d).
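As an illustration, here is a small two-class sketch of that decision rule in Python. The weights and the bias term are hypothetical (the α values in the table above are not reproduced in this transcript), and the features are taken to be binary indicators of whether a keyword occurs in the document; with two classes, comparing P("earnings"|d) to P(¬"earnings"|d) reduces to a logistic form:

    import math

    # Hypothetical log-weights (log alpha_j); higher values favour "earnings"
    LOG_WEIGHTS = {"cts": 1.2, "profit": 0.9, "net": 0.8, "loss": 0.6,
                   "dlrs": 0.5, "pct": 0.4, "is": 0.1}
    BIAS = -2.0  # hypothetical prior log-odds for "earnings"

    def p_earnings(document):
        # Sum the log-weights of the keyword features active in the document
        words = set(document.lower().split())
        score = BIAS + sum(w for kw, w in LOG_WEIGHTS.items() if kw in words)
        return 1 / (1 + math.exp(-score))   # normalise over the two classes

    def classify(document):
        p = p_earnings(document)
        return "earnings" if p > 1 - p else "not earnings"

    print(classify("net profit rose five cts per share vs one ct"))  # -> earnings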

K nearest neighbour classification

Rationale
Simple nearest neighbour (1NN):
- Given: a new document d
- Find: the document in the training set that is most similar to d
- Classify d with the same category as that document
Generalisation (kNN): compare d to its k nearest neighbours.
The crucial thing is the similarity measure.

Example: 1NN + cosine similarity
Given: a document d
Goal: categorise d based on the training set T
Define the similarity between two document vectors as the cosine of the angle between them:

cos(d, d') = (d · d') / (|d| |d'|)

Find the subset T' of T containing the training document(s) most similar to d:

T' = { d' ∈ T : cos(d, d') is maximal }

and assign d the category of the document(s) in T'.

Generalising to k > 1 neighbours
Choose the k nearest neighbours, applying the same similarity computation to each, and weight each neighbour by its similarity to d.
Decide on a classification based on the (weighted) majority class among these neighbours.
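A minimal sketch of similarity-weighted kNN with cosine similarity, assuming documents have already been converted to equal-length keyword-weight vectors; all names and the toy data are illustrative:

    import math
    from collections import defaultdict

    def cosine(u, v):
        # Cosine of the angle between two document vectors
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def knn_classify(d, training_set, k=3):
        # training_set: list of (vector, category) pairs.
        # Each of the k most similar neighbours votes for its category,
        # weighted by its cosine similarity to d.
        neighbours = sorted(training_set,
                            key=lambda item: cosine(d, item[0]),
                            reverse=True)[:k]
        votes = defaultdict(float)
        for vec, category in neighbours:
            votes[category] += cosine(d, vec)
        return max(votes, key=votes.get)

    # Tiny illustrative example with 3-dimensional keyword vectors
    train = [([6, 5, 0], "earnings"), ([5, 4, 1], "earnings"), ([0, 1, 7], "acq")]
    print(knn_classify([4, 6, 0], train, k=2))   # -> earnings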