1 SIMS 290-2: Applied Natural Language Processing Barbara Rosario October 4, 2004
2 Today
Algorithms for Classification
- Binary classification: Perceptron, Winnow, Support Vector Machines (SVM), Kernel Methods
- Multi-class classification: Decision Trees, Naïve Bayes, k Nearest Neighbor
3 Binary Classification: examples
- Spam filtering (spam, not spam)
- Customer service message classification (urgent vs. not urgent)
- Information retrieval (relevant, not relevant)
- Sentiment classification (positive, negative)
Sometimes it can be convenient to treat a multi-way problem as a binary one: one class versus all the others, for all classes.
4 Binary Classification
Given: some data items that belong to a positive (+1) or a negative (-1) class
Task: train the classifier and predict the class for a new data item
Geometrically: find a separator
5 Linear versus Non Linear algorithms Linearly separable data: if all the data points can be correctly classified by a linear (hyperplanar) decision boundary
6 Linearly separable data
(Figure: two classes, Class1 and Class2, separated by a linear decision boundary.)
7 Non linearly separable data
(Figure: two classes, Class1 and Class2, that no linear boundary can separate.)
8 Non linearly separable data
(Figure: the same two classes, Class1 and Class2, separated by a non-linear classifier.)
9 Linear versus Non Linear algorithms
Linearly or non linearly separable data? We can find out only empirically.
Linear algorithms (algorithms that find a linear decision boundary):
- When we think the data is linearly separable
- Advantages: simpler, fewer parameters
- Disadvantages: high dimensional data (like for NLP) is usually not linearly separable
- Examples: Perceptron, Winnow, SVM
- Note: we can use linear algorithms also for non linear problems (see Kernel methods)
10 Linear versus Non Linear algorithms
Non linear algorithms:
- When the data is not linearly separable
- Advantages: more accurate
- Disadvantages: more complicated, more parameters
- Example: Kernel methods
- Note: the distinction between linear and non linear also applies to multi-class classification (we'll see this later)
11 Simple linear algorithms
Perceptron and Winnow algorithms:
- Linear, binary classification
- Online (process data sequentially, one data point at a time)
- Mistake driven
- Simple single-layer Neural Networks
12 From Gert Lanckriet, Statistical Learning Theory Tutorial
Linear binary classification
Data: {(x_i, y_i)}, i = 1...n, where x in R^d is the feature vector (a vector in d-dimensional space) and y in {-1, +1} is the label (class, category).
Question: design a linear decision boundary wx + b = 0 (the equation of a hyperplane) such that the classification rule associated with it has minimal probability of error.
Classification rule: y = sign(wx + b), which means:
- if wx + b > 0 then y = +1
- if wx + b < 0 then y = -1
(A small sketch of this rule follows below.)
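To make the rule concrete, here is a minimal sketch of y = sign(wx + b) in Python (assuming NumPy is available; the weight vector and bias are hand-picked illustration values, not learned ones):

```python
import numpy as np

def decide(x, w, b):
    """Linear classification rule: y = sign(w.x + b)."""
    return 1 if np.dot(w, x) + b > 0 else -1

# Toy 2-dimensional example with hand-picked (not learned) parameters.
w = np.array([1.0, -2.0])   # weight vector: orientation of the hyperplane
b = 0.5                     # bias: shifts the hyperplane off the origin
print(decide(np.array([3.0, 1.0]), w, b))   # 3 - 2 + 0.5 = 1.5 > 0  ->  +1
print(decide(np.array([0.0, 2.0]), w, b))   # 0 - 4 + 0.5 = -3.5 < 0 ->  -1
```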
13 From Gert Lanckriet, Statistical Learning Theory Tutorial
Linear binary classification
Find a good hyperplane (w, b) in R^(d+1) that correctly classifies data points as much as possible.
In online fashion: one data point at a time, update weights as necessary.
(Figure: separating hyperplane wx + b = 0; classification rule y = sign(wx + b).)
14 From Gert Lanckriet, Statistical Learning Theory Tutorial
Perceptron algorithm
Initialize: w_1 = 0
Updating rule, for each data point x:
- If class(x) != decision(x, w) then
  w_{k+1} <- w_k + y_i x_i
  k <- k + 1
- else
  w_{k+1} <- w_k
Function decision(x, w): if wx + b > 0 return +1, else return -1
(Figure: after a mistake on a +1 point, the hyperplane w_k x + b = 0 is rotated to w_{k+1} x + b = 0.)
(A code sketch follows below.)
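A minimal sketch of the mistake-driven update above, assuming NumPy and an invented toy dataset (the bias is folded into the weight vector, a common convenience not shown on the slide):

```python
import numpy as np

def perceptron_train(X, y, epochs=10):
    """Mistake-driven perceptron: on each error, w <- w + y_i * x_i.
    The bias b is handled by appending a constant 1 feature to each x."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # augment with bias term
    w = np.zeros(Xb.shape[1])                   # initialize w_1 = 0
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if np.sign(np.dot(w, xi)) != yi:    # misclassified -> update
                w += yi * xi
    return w[:-1], w[-1]                        # weights, bias

# Toy linearly separable data: roughly, +1 when x1 + x2 is large.
X = np.array([[2.0, 2.0], [1.5, 0.5], [0.0, 0.0], [-1.0, 0.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))   # reproduces y on this toy set
```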
15 From Gert Lanckriet, Statistical Learning Theory Tutorial
Perceptron algorithm
Online: can adjust to a changing target over time
Advantages:
- Simple and computationally efficient
- Guaranteed to learn a linearly separable problem (convergence, global optimum)
Limitations:
- Only linear separations
- Only converges for linearly separable data
- Not really efficient with many features
16 From Gert Lanckriet, Statistical Learning Theory Tutorial
Winnow algorithm
Another online algorithm for learning perceptron weights: f(x) = sign(wx + b)
Linear, binary classification
Update rule: again mistake-driven, but multiplicative (instead of additive)
17 From Gert Lanckriet, Statistical Learning Theory Tutorial
Winnow algorithm
Initialize: w_1 = 0
Updating rule, for each data point x:
- If class(x) != decision(x, w) then
  w_{k+1} <- w_k + y_i x_i (Perceptron)
  w_{k+1} <- w_k * exp(y_i x_i) (Winnow)
  k <- k + 1
- else
  w_{k+1} <- w_k
Function decision(x, w): if wx + b > 0 return +1, else return -1
(Figure: as for the perceptron, the hyperplane moves from w_k x + b = 0 to w_{k+1} x + b = 0 after a mistake.)
(A code sketch follows below.)
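A sketch of the multiplicative update shown on the slide, w <- w * exp(y_i x_i), with a learning rate added. Two assumptions are mine, not the slide's: the weights start from all ones (a multiplicative update starting from 0 would stay 0), and the bias is again folded in as a constant feature. Classical Winnow variants additionally keep weights positive and compare against a threshold; this sketch only mirrors the slide's rule:

```python
import numpy as np

def winnow_train(X, y, eta=0.5, epochs=10):
    """Multiplicative (Winnow-style) update following the slide:
    on a mistake, w <- w * exp(eta * y_i * x_i), elementwise."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # fold the bias into the weights
    w = np.ones(Xb.shape[1])                    # start from 1, not 0 (assumption)
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if np.sign(np.dot(w, xi)) != yi:    # mistake -> multiplicative update
                w *= np.exp(eta * yi * xi)
    return w[:-1], w[-1]
```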
18 From Gert Lanckriet, Statistical Learning Theory Tutorial
Perceptron vs. Winnow
Assume N available features, of which only K are relevant, with K << N.
Perceptron: number of mistakes O(K N)
Winnow: number of mistakes O(K log N)
Winnow is more robust to high-dimensional feature spaces.
19 From Gert Lanckriet, Statistical Learning Theory Tutorial
Perceptron vs. Winnow
Perceptron
- Online: can adjust to a changing target over time
- Advantages: simple and computationally efficient; guaranteed to learn a linearly separable problem
- Limitations: only linear separations; only converges for linearly separable data; not really efficient with many features
Winnow
- Online: can adjust to a changing target over time
- Advantages: simple and computationally efficient; guaranteed to learn a linearly separable problem; suitable for problems with many irrelevant attributes
- Limitations: only linear separations; only converges for linearly separable data
- Used in NLP
20 Winnow in Weka
21 From Gert Lanckriet, Statistical Learning Theory Tutorial
Another family of linear algorithms
Intuition (Vapnik, 1965): if the classes are linearly separable,
- Separate the data
- Place the hyperplane "far" from the data: large margin
- Statistical results guarantee good generalization
Large margin classifier
(Figure: a separating hyperplane placed close to the data, labeled BAD.)
22 From Gert Lanckriet, Statistical Learning Theory Tutorial
Maximal Margin Classifier
Intuition (Vapnik, 1965): if linearly separable,
- Separate the data
- Place the hyperplane "far" from the data: large margin
- Statistical results guarantee good generalization
Large margin classifier
(Figure: a separating hyperplane far from both classes, labeled GOOD.)
23 From Gert Lanckriet, Statistical Learning Theory Tutorial
If not linearly separable:
- Allow some errors
- Still, try to place the hyperplane "far" from each class
Large margin classifier
24 Large Margin Classifiers
Advantages: theoretically better (better error bounds)
Limitations: computationally more expensive; training requires solving a large quadratic programming problem
25 From Gert Lanckriet, Statistical Learning Theory Tutorial
Support Vector Machine (SVM)
Large Margin Classifier, linearly separable case
Goal: find the hyperplane that maximizes the margin
(Figure: separating hyperplane w^T x + b = 0 with margin M; the margin hyperplanes w^T x_a + b = 1 and w^T x_b + b = -1 pass through the support vectors.)
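As a hedged illustration (not from the slides; it assumes scikit-learn is installed), a linear SVM can be fit on a toy dataset and its support vectors inspected like this:

```python
import numpy as np
from sklearn.svm import SVC

# Invented, linearly separable toy data.
X = np.array([[2.0, 2.0], [1.5, 2.5], [3.0, 1.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e3)   # a large C approximates the hard-margin case
clf.fit(X, y)

print(clf.coef_, clf.intercept_)    # w and b of the maximum-margin hyperplane
print(clf.support_vectors_)         # the training points that define the margin
print(clf.predict([[1.0, 1.0]]))    # classify a new point
```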
26 From Gert Lanckriet, Statistical Learning Theory Tutorial
Support Vector Machine (SVM): applications
- Text classification
- Hand-writing recognition
- Computational biology (e.g., micro-array data)
- Face detection
- Facial expression recognition
- Time series prediction
27 Non Linear problem
28 Non Linear problem
29 From Gert Lanckriet, Statistical Learning Theory Tutorial
Non Linear problem: Kernel methods
- A family of non-linear algorithms
- Transform the non linear problem into a linear one (in a different feature space)
- Use linear algorithms to solve the linear problem in the new space
30 Main intuition of Kernel methods
(Copy here from blackboard)
31 From Gert Lanckriet, Statistical Learning Theory Tutorial
Basic principle of kernel methods
Map the input into a higher-dimensional space: Φ : R^d -> R^D (D >> d)
Example: X = [x, z], Φ(X) = [x^2, z^2, xz]
A linear classifier in the new space, w^T Φ(X) + b = 0, corresponds to f(X) = sign(w_1 x^2 + w_2 z^2 + w_3 xz + b) in the original space.
(A code sketch of this mapping follows below.)
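A minimal sketch of the explicit mapping Φ(X) = [x^2, z^2, xz] above, applied to data that is separable by a quadratic (circle-like) boundary but not by a line; the data and the hand-picked linear separator in the mapped space are illustrative assumptions:

```python
import numpy as np

def phi(X):
    """Explicit feature map Phi([x, z]) = [x^2, z^2, x*z]."""
    x, z = X[:, 0], X[:, 1]
    return np.column_stack([x**2, z**2, x * z])

# Points inside the unit circle are +1, outside are -1: not linearly
# separable in the original (x, z) space, but linearly separable after phi,
# e.g. by w = (-1, -1, 0), b = 1, i.e. sign(1 - x^2 - z^2).
X = np.array([[0.1, 0.2], [-0.3, 0.4], [0.5, -0.1],
              [1.5, 0.0], [-1.2, 1.0], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

w, b = np.array([-1.0, -1.0, 0.0]), 1.0
print(np.sign(phi(X) @ w + b))   # matches y
```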
32 From Gert Lanckriet, Statistical Learning Theory Tutorial
Basic principle of kernel methods
- Linear separability: more likely in high dimensions
- Mapping: Φ maps the input into a high-dimensional feature space
- Classifier: construct a linear classifier in the high-dimensional feature space
- Motivation: an appropriate choice of Φ leads to linear separability
- We can do this efficiently!
33 Basic principle of kernel methods
We can use the linear algorithms seen before (Perceptron, SVM) for classification in the higher-dimensional space.
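As a hedged illustration (assuming scikit-learn; not from the slides), an SVM with a degree-2 polynomial kernel separates the same circle-like toy data as the feature-map sketch above, without ever building Φ(x) explicitly:

```python
import numpy as np
from sklearn.svm import SVC

# Same circle-like toy data as in the feature-map sketch: +1 inside, -1 outside.
X = np.array([[0.1, 0.2], [-0.3, 0.4], [0.5, -0.1],
              [1.5, 0.0], [-1.2, 1.0], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# A degree-2 polynomial kernel corresponds (up to constants) to a quadratic
# feature map; the SVM works with kernel values only.
clf = SVC(kernel="poly", degree=2, coef0=1.0, C=1e3).fit(X, y)
print(clf.predict(X))   # should reproduce y on this tiny training set
```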
34 Multi-class classification
Given: some data items that belong to one of M possible classes
Task: train the classifier and predict the class for a new data item
Geometrically: a harder problem, no more simple geometry
35 Multi-class classification
36 Multi-class classification: Examples Author identification Language identification Text categorization (topics)
37 (Some) Algorithms for Multi-class classification
Linear:
- Parallel class separators: Decision Trees
- Non parallel class separators: Naïve Bayes
Non Linear:
- k-nearest neighbors
38 Linear, parallel class separators (ex: Decision Trees)
39 Linear, NON parallel class separators (ex: Naïve Bayes)
40 Non Linear (ex: k Nearest Neighbor)
41 http://dms.irb.hr/tutorial/tut_dtrees.php
Decision Trees
A decision tree is a classifier in the form of a tree structure, where each node is either:
- a leaf node, which indicates the value of the target attribute (class) of examples, or
- a decision node, which specifies a test to be carried out on a single attribute value, with one branch and sub-tree for each possible outcome of the test.
A decision tree can be used to classify an example by starting at the root of the tree and moving through it until a leaf node is reached, which provides the classification of the instance.
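A minimal sketch of the classification procedure just described, using a hypothetical nested-tuple tree representation (names and encoding are my own, not from the slides); a usage example with the PlayTennis tree appears a few slides below:

```python
def classify(tree, example):
    """Walk a decision tree until a leaf is reached.
    Decision nodes look like ("attribute", {value: subtree, ...});
    leaves are plain class labels."""
    while isinstance(tree, tuple):           # still at a decision node
        attribute, branches = tree
        tree = branches[example[attribute]]  # follow the branch for this value
    return tree                              # leaf: the predicted class
```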
42 Training Examples Goal: learn when we can play Tennis and when we cannot
43 www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
Decision Tree for PlayTennis
Outlook?
- Sunny -> Humidity? (High -> No, Normal -> Yes)
- Overcast -> Yes
- Rain -> Wind? (Strong -> No, Weak -> Yes)
44 www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
Decision Tree for PlayTennis
(Figure: the Outlook / Humidity part of the tree above.)
- Each internal node tests an attribute
- Each branch corresponds to an attribute value
- Each leaf node assigns a classification
45 www.math.tau.ac.il/~nin/Courses/ML04/DecisionTreesCLS.pp
Decision Tree for PlayTennis
New instance: Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Weak
Following the tree: Outlook = Sunny -> test Humidity; Humidity = High -> PlayTennis = No
(See the code sketch below.)
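Using the classify() sketch from the Decision Trees slide above, the PlayTennis tree and this slide's instance can be encoded as follows (an illustrative encoding of the standard example, not code from the lecture):

```python
play_tennis_tree = (
    "Outlook", {
        "Sunny":    ("Humidity", {"High": "No", "Normal": "Yes"}),
        "Overcast": "Yes",
        "Rain":     ("Wind", {"Strong": "No", "Weak": "Yes"}),
    },
)

instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Weak"}
print(classify(play_tennis_tree, instance))   # -> "No"
```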
46 Foundations of Statistical Natural Language Processing, Manning and Schuetze
Decision Tree for Reuters classification
47 Foundations of Statistical Natural Language Processing, Manning and Schuetze
Decision Tree for Reuters classification
48 Building Decision Trees Given training data, how do we construct them? The central focus of the decision tree growing algorithm is selecting which attribute to test at each node in the tree. The goal is to select the attribute that is most useful for classifying examples. Top-down, greedy search through the space of possible decision trees. That is, it picks the best attribute and never looks back to reconsider earlier choices.
49 Building Decision Trees
Splitting criterion:
- Finding the features and the values to split on: for example, why test first "cts" and not "vs"? Why test "cts < 2" and not "cts < 5"?
- Choose the split that gives us the maximum information gain (i.e., the maximum reduction of uncertainty); a small sketch of this computation follows below.
Stopping criterion:
- When all the elements at one node have the same class, there is no need to split further.
In practice, one first builds a large tree and then prunes it back (to avoid overfitting).
See Foundations of Statistical Natural Language Processing, Manning and Schuetze, for a good introduction.
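A small sketch of the split-selection computation (entropy and information gain) described above. The example counts assume the standard 14-example PlayTennis training set (9 Yes, 5 No, split by Outlook) that this lecture's example is based on:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = - sum_c p(c) * log2 p(c)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(labels, groups):
    """Reduction in entropy from splitting `labels` into `groups`
    (one list of labels per branch of the candidate split)."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

# Splitting the 9 "Yes" / 5 "No" PlayTennis labels on Outlook.
labels = ["Yes"] * 9 + ["No"] * 5
outlook_groups = [["Yes"] * 2 + ["No"] * 3,   # Sunny
                  ["Yes"] * 4,                # Overcast
                  ["Yes"] * 3 + ["No"] * 2]   # Rain
print(round(information_gain(labels, outlook_groups), 3))   # about 0.247
```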
50 http://dms.irb.hr/tutorial/tut_dtrees.php
Decision Trees: Strengths
- Decision trees are able to generate understandable rules.
- Decision trees perform classification without requiring much computation.
- Decision trees are able to handle both continuous and categorical variables.
- Decision trees provide a clear indication of which features are most important for prediction or classification.
51 http://dms.irb.hr/tutorial/tut_dtrees.php
Decision Trees: weaknesses
- Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
- Decision trees can be computationally expensive to train: all candidate splits must be compared, and pruning is also expensive.
- Most decision-tree algorithms only examine a single field at a time. This leads to rectangular classification boxes that may not correspond well with the actual distribution of records in the decision space.
52 Decision Trees in Weka
53 Naïve Bayes
More powerful than Decision Trees
(Figure: decision boundaries of Decision Trees versus Naïve Bayes.)
54 Naïve Bayes Models
Graphical Models: graph theory plus probability theory
- Nodes are variables
- Edges are conditional probabilities
(Figure: node A with edges to B and C; P(A), P(B|A), P(C|A).)
55 Naïve Bayes Models
Graphical Models: graph theory plus probability theory
- Nodes are variables
- Edges are conditional probabilities
- Absence of an edge between nodes implies independence between the variables of the nodes
(Figure: with an edge from B to C, C would depend on both A and B, i.e. P(C|A,B) instead of P(C|A).)
56 Foundations of Statistical Natural Language Processing, Manning and Schuetze Naïve Bayes for text classification
57 Naïve Bayes for text classification
(Figure: topic node "earn" connected to word nodes such as Shr, 34, cts, vs, shr, per.)
58 Naïve Bayes for text classification
- The words depend on the topic: P(w_i | Topic), e.g. P(cts | earn) > P(tennis | earn)
- Naïve Bayes assumption: all words are independent given the topic
- From the training set we learn the probabilities P(w_i | Topic) for each word and for each topic
(Figure: topic node connected to word nodes w_1, w_2, ..., w_n.)
59 Naïve Bayes for text classification
To classify a new example, calculate P(Topic | w_1, w_2, ..., w_n) for each topic.
Bayes decision rule: choose the topic T' for which P(T' | w_1, w_2, ..., w_n) > P(T | w_1, w_2, ..., w_n) for all T != T'.
(Figure: topic node connected to word nodes w_1, w_2, ..., w_n.)
60 Naïve Bayes: Math
Naïve Bayes defines a joint probability distribution:
P(Topic, w_1, w_2, ..., w_n) = P(Topic) * ∏_i P(w_i | Topic)
We learn P(Topic) and P(w_i | Topic) in training.
Test: we need P(Topic | w_1, w_2, ..., w_n):
P(Topic | w_1, w_2, ..., w_n) = P(Topic, w_1, w_2, ..., w_n) / P(w_1, w_2, ..., w_n)
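A compact sketch of these formulas for text classification, working in log space and with add-one smoothing (the smoothing and the tiny corpus are my additions, not part of the slide):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (topic, list_of_words). Collects counts for P(Topic) and P(w | Topic)."""
    topic_counts = Counter(t for t, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for topic, words in docs:
        word_counts[topic].update(words)
        vocab.update(words)
    return topic_counts, word_counts, vocab, len(docs)

def classify_nb(words, topic_counts, word_counts, vocab, n_docs):
    """argmax_T  log P(T) + sum_i log P(w_i | T), with add-one smoothing."""
    best, best_score = None, -math.inf
    for topic, tcount in topic_counts.items():
        score = math.log(tcount / n_docs)
        total = sum(word_counts[topic].values())
        for w in words:
            score += math.log((word_counts[topic][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = topic, score
    return best

# Invented toy corpus in the spirit of the Reuters "earn" example.
docs = [("earn", ["shr", "cts", "vs", "profit"]),
        ("earn", ["cts", "net", "shr"]),
        ("sports", ["tennis", "match", "win"])]
model = train_nb(docs)
print(classify_nb(["shr", "cts"], *model))   # -> "earn"
```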
61 Naïve Bayes: Strengths
- Very simple model: easy to understand, very easy to implement
- Very efficient: fast training and classification
- Modest storage requirements
- Widely used because it works really well for text categorization
- Linear, but non parallel decision boundaries
62 Naïve Bayes: weaknesses
The Naïve Bayes independence assumption has two consequences:
- The linear ordering of the words is ignored (bag of words model)
- The words are treated as independent of each other given the class, which is false: "president" is more likely to occur in a context that contains "election" than in a context that contains "poet"
The Naïve Bayes assumption is inappropriate if there are strong conditional dependencies between the variables.
(But even if the model is not "right", Naïve Bayes models do well in a surprisingly large number of cases, because often we are interested in classification accuracy and not in accurate probability estimates.)
63 Naïve Bayes in Weka
64 k Nearest Neighbor Classification
Nearest Neighbor classification rule: to classify a new object, find the object in the training set that is most similar, then assign the category of this nearest neighbor.
k Nearest Neighbor (kNN): consult the k nearest neighbors and base the decision on the majority category among them. More robust than k = 1.
An example of a similarity measure often used in NLP is cosine similarity.
(A code sketch follows below.)
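A minimal sketch of kNN with cosine similarity (NumPy assumed; the toy term-count vectors, labels, and k are illustrative):

```python
import numpy as np
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def knn_classify(x, X_train, y_train, k=3):
    """Find the k training points most similar to x (cosine similarity)
    and return the majority label among them."""
    sims = np.array([cosine(x, xi) for xi in X_train])
    top_k = np.argsort(sims)[-k:]                # indices of the k most similar
    return Counter(y_train[i] for i in top_k).most_common(1)[0][0]

# Toy term-count vectors for two topics.
X_train = np.array([[3.0, 0.0, 1.0], [2.0, 1.0, 0.0], [4.0, 0.0, 0.0],
                    [0.0, 3.0, 2.0], [1.0, 4.0, 0.0], [0.0, 2.0, 3.0]])
y_train = ["earn", "earn", "earn", "sports", "sports", "sports"]
print(knn_classify(np.array([2.0, 0.5, 0.5]), X_train, y_train, k=3))  # "earn"
```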
65 1-Nearest Neighbor
66 1-Nearest Neighbor
67 3-Nearest Neighbor
68 3-Nearest Neighbor
Assign the category of the majority of the neighbors. But one neighbor may be much closer than the others: we can weight neighbors according to their similarity (see the sketch below).
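Continuing the kNN sketch above (it reuses the cosine() helper and NumPy import from that sketch), majority voting can be replaced by similarity-weighted voting, so that closer neighbors count for more; this variant is illustrative, not from the slides:

```python
from collections import defaultdict

def knn_classify_weighted(x, X_train, y_train, k=3):
    """Each of the k nearest neighbors votes with weight equal to its
    cosine similarity to x, instead of one vote each."""
    sims = np.array([cosine(x, xi) for xi in X_train])
    top_k = np.argsort(sims)[-k:]
    votes = defaultdict(float)
    for i in top_k:
        votes[y_train[i]] += sims[i]
    return max(votes, key=votes.get)
```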
69 k Nearest Neighbor Classification
Strengths:
- Robust
- Conceptually simple
- Often works well
- Powerful (arbitrary decision boundaries)
Weaknesses:
- Performance is very dependent on the similarity measure used (and, to a lesser extent, on the number of neighbors k)
- Finding a good similarity measure can be difficult
- Computationally expensive
70 Summary
Algorithms for Classification: linear versus non linear classification
- Binary classification: Perceptron, Winnow, Support Vector Machines (SVM), Kernel Methods
- Multi-class classification: Decision Trees, Naïve Bayes, k nearest neighbor
On Wednesday: Weka