CSA3180: Natural Language Processing


CSA3180: Natural Language Processing. Classification I (Text Processing III, October 2005). Topics: discovering word associations; text classification; TF.IDF; clustering/data mining; linear and non-linear classification; binary classification; multi-class classification.

Introduction. Slides partly based on lectures by Barbara Rosario and Preslav Nakov. This lecture covers classification and text categorization (and other applications); various issues regarding classification (clustering vs. classification, binary vs. multi-way, flat vs. hierarchical classification, ...); and the steps necessary for a classification task: define classes, label text, choose features, train and evaluate a classifier.

Classification. Goal: assign 'objects' from a universe to two or more classes or categories. Examples (problem / object / categories): tagging (word / POS); sense disambiguation (word / the word's senses); information retrieval (document / relevant or not relevant); sentiment classification (document / positive or negative); author identification (document / authors); language identification (document / languages); text classification (document / topics).

Author Identification. Two excerpts: (1) "They agreed that Mrs. X should only hear of the departure of the family, without being alarmed on the score of the gentleman's conduct; but even this partial communication gave her a great deal of concern, and she bewailed it as exceedingly unlucky that the ladies should happen to go away, just as they were all getting so intimate together." (2) "Gas looming through the fog in divers places in the streets, much as the sun may, from the spongey fields, be seen to loom by husbandman and ploughboy. Most of the shops lighted two hours before their time--as the gas seems to know, for it has a haggard and unwilling look. The raw afternoon is rawest, and the dense fog is densest, and the muddy streets are muddiest near that leaden-headed old obstruction, appropriate ornament for the threshold of a leaden-headed old corporation, Temple Bar."

Author Identification. The excerpts are by Jane Austen (1775-1817), Pride and Prejudice, and Charles Dickens (1812-70), Bleak House.

Author Identification: the Federalist papers. 77 short essays written in 1787-1788 by Hamilton, Jay and Madison to persuade New York to ratify the US Constitution, published under a pseudonym. The authorship of 12 papers was in dispute (the "disputed papers"). In 1964 Mosteller and Wallace solved the problem: they identified 70 function words as good candidates for authorship analysis and, using statistical inference, concluded that the author was Madison.

Author Identification: function words (figure slides showing the candidate function words).

Language Identification. "Tutti gli esseri umani nascono liberi ed eguali in dignità e diritti. Essi sono dotati di ragione e di coscienza e devono agire gli uni verso gli altri in spirito di fratellanza." "Alle Menschen sind frei und gleich an Würde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt und sollen einander im Geist der Brüderlichkeit begegnen." (The Universal Declaration of Human Rights, available from the UN in 363 languages.)

Language Identification: égaux - French; eguali - Italian; iguales - Spanish; edistämään - Finnish; għ - Maltese.

Text Classification. Example: http://news.google.com/. The Reuters-21578 collection: 21,578 newswire documents, used as a standard text collection for research, to compare systems and algorithms; 135 valid topic categories.

Reuters Newswire Corpus (figure slide).

Reuters Sample:
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798"> <DATE> 2-MAR-1987 16:51:43.42</DATE> <TOPICS><D>livestock</D><D>hog</D></TOPICS> <TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE> <DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter </BODY></TEXT></REUTERS>

Classification vs. Clustering. Classification assumes labeled data: we know how many classes there are and we have examples for each class. Classification is supervised. In clustering we do not have labeled data; we just assume that there is a natural division in the data, and we may not know how many divisions (clusters) there are. Clustering is unsupervised.

Classification (figure slides illustrating classification with two labeled classes, Class1 and Class2).

Clustering (figure slides illustrating clustering of unlabeled points into natural groups).

Categories (Labels, Classes). Labeling data raises two problems: (1) deciding on the possible classes (which ones, and how many), which is domain- and application-dependent (see e.g. http://news.google.com); (2) labeling the text, which is difficult and time consuming, with inconsistency between annotators.

Reuters. Recall the sample document above, labeled with the topics livestock and hog. Why not topic = policy?

Binary vs. Multi-Class Classification. Binary classification: two classes. Multi-way classification: more than two classes. Sometimes it is convenient to treat a multi-way problem as a set of binary ones: one class versus all the others, for each class.

Flat vs. Hierarchical Classification. Flat classification: the relations between the classes are undetermined. Hierarchical classification: a hierarchy where each node is a sub-class of its parent node.

Single vs. Multi-Category Classification. In single-category text classification each text belongs to exactly one category. In multi-category text classification, each text can have zero or more categories.

LabeledText class in NLTK:
>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)
>>> labeled_text.text()
'Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix.'
>>> labeled_text.label()
'sport'

NLTK Classifier Interface. classify determines which label is most appropriate for a given text token, and returns a labeled text token with that label. labels returns the list of category labels that are used by the classifier.
>>> token = Token("The World Health Organization is recommending more importance be attached to the prevention of heart disease and other cardiovascular ailments rather than focusing on treatment.")
>>> my_classifier.classify(token)
"The World Health Organization is recommending more importance be attached to the prevention of heart disease and other cardiovascular ailments rather than focusing on treatment." / 'health'
>>> my_classifier.labels()
('sport', 'health', 'world', ...)

Features.
>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)
Here the classification takes the whole string as input. What is the problem with that? What features could be useful for this example?

Features. Feature: an aspect of the text that is relevant to the task. Some typical features: words present in the text; frequency of words; capitalization; presence of named entities (NEs); WordNet information; others?

Features. Feature: an aspect of the text that is relevant to the task. Feature value: the realization of the feature in the text. Words present in the text: Kerry, Schumacher, China, ... Frequency of words: Kerry (10), Schumacher (1), ... Are there dates? yes/no. Are there PERSONs? yes/no. Are there ORGANIZATIONs? yes/no. WordNet: holonyms (China is part of Asia), synonyms (China, People's Republic of China, mainland China).

Features: Boolean (or binary) features. Features that generate boolean (binary) values. Boolean features are the simplest and the most common type of feature. Example: f1(text) = 1 if the text contains "Kerry", 0 otherwise; f2(text) = 1 if the text contains a PERSON, 0 otherwise.

Features: Integer features. Features that generate integer values. Integer features can give classifiers access to more precise information about the text. Example: f1(text) = number of times the text contains "Kerry"; f2(text) = number of times the text contains a PERSON.
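
As a concrete illustration, here is a minimal Python sketch of boolean and integer feature functions in the spirit of f1 and f2 above. The function names and the PERSON name list are illustrative assumptions, and a real PERSON feature would need a named-entity recogniser.

def contains_kerry(text):
    # Boolean feature: 1 if the text mentions "Kerry", else 0.
    return 1 if "Kerry" in text.split() else 0

def count_kerry(text):
    # Integer feature: number of times the text mentions "Kerry".
    return text.split().count("Kerry")

def contains_person(text, person_names=("Kerry", "Schumacher")):
    # Boolean stand-in for the PERSON feature, using an assumed name list.
    return 1 if any(name in text.split() for name in person_names) else 0

text = "Kerry met Schumacher and Kerry spoke first"
print(contains_kerry(text), count_kerry(text), contains_person(text))  # 1 2 1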

Features in NLTK: feature detectors. Features can be defined using feature detector functions, which map LabeledTexts to values. Method: detect, which takes a labeled text and returns a feature value.
>>> def ball(ltext): return ("ball" in ltext.text())
>>> fdetector = FunctionFeatureDetector(ball)
>>> document1 = "John threw the ball over the fence".split()
>>> fdetector.detect(LabeledText(document1))
1
>>> document2 = "Mary solved the equation".split()
>>> fdetector.detect(LabeledText(document2))
0

Features. Linguistic features: words (lowercased? should we convert to lowercase? normalized, e.g. "texts" -> "text"?); phrases; word-level n-grams; character-level n-grams; punctuation; part of speech. Non-linguistic features: document formatting; informative character sequences (e.g. &lt).

When do we need Feature Selection? If the algorithm cannot handle all possible features, e.g. language identification for 100 languages using all words, or text classification using n-grams. Good features can result in higher accuracy. But why feature selection at all: what if we just keep all features? Even the unreliable features can be helpful, but we need to weight them; in the extreme case, the bad features can have a weight of 0 (or very close), which is... a form of feature selection!

Why do we need Feature Selection? Not all features are equally good! Bad features are best removed: terms that are too infrequent (unlikely to be met again; co-occurrence with a class can be due to chance), too frequent (mostly function words), or uniform across all categories. Good features should be kept: they co-occur with a particular category and do not co-occur with other categories. The rest are also good to keep.

What types of Feature Selection? Feature selection reduces the number of features, usually by eliminating, weighting, or normalizing features, and sometimes by transforming parameters (e.g. Latent Semantic Indexing using Singular Value Decomposition). The method may depend on the problem type; for classification and filtering, information from example documents may be used to guide selection.

What types of Feature Selection? Task-independent methods: Document Frequency (DF), Term Strength (TS). Task-dependent methods: Information Gain (IG), Mutual Information (MI), the χ² statistic (CHI). These were empirically compared by Yang & Pedersen (1997).

Document Frequency (DF). DF: the number of documents a term appears in; based on Zipf's Law. Remove the rare terms (met 1-2 times): they are non-informative, unreliable (can be just noise), not influential in the final decision, and unlikely to appear in new documents. (What about the frequent terms? See stop word removal below.) Plus: easy to compute; task independent, so we do not need to know the classes. Minus: an ad hoc criterion; rare terms can be good discriminators (e.g. in IR); and what counts as a "rare" term?
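
A minimal sketch of DF-based pruning, assuming documents are already tokenised; the thresholds are illustrative choices rather than values prescribed in the lecture.

from collections import Counter

def df_filter(documents, min_df=2, max_df_ratio=0.5):
    # Keep terms whose document frequency is neither too low nor too high.
    # documents: list of token lists; the thresholds are illustrative.
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        for term in set(doc):          # count each term once per document
            df[term] += 1
    return {t for t, f in df.items()
            if f >= min_df and f / n_docs <= max_df_ratio}

docs = [["pork", "congress", "livestock"],
        ["pork", "prices", "rise"],
        ["pork", "exports", "livestock"],
        ["grain", "exports", "fall"],
        ["grain", "prices", "congress"],
        ["oil", "prices", "rise"]]
print(sorted(df_filter(docs)))         # rare terms like "fall" and "oil" are dropped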

Stop Word Removal. Remove common words from a predefined list, mostly from closed-class categories (unlikely to have a new word added): auxiliaries, conjunctions, determiners, prepositions, pronouns, articles; but also some open-class words like numerals. They are bad discriminators, uniformly spread across all classes, and can be safely removed from the vocabulary. Is this always a good idea? (e.g. author identification)

Term Weighting. In the study just shown, terms were (mainly) treated as binary features: if a term occurred in a document it was assigned 1, else 0. It is often useful to weight the selected features. Standard technique: tf.idf.

TF.IDF Term Weighting. TF: term frequency. Definition: TF = tf_ij, the frequency of term i in document j. Purpose: makes the words that are frequent within a document more important. IDF: inverted document frequency. Definition: IDF = log(N/n_i), where n_i is the number of documents containing term i and N is the total number of documents. Purpose: makes words that are rare across documents more important. TF.IDF definition: tf_ij × log(N/n_i).
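
A small Python sketch of this weighting, using raw counts for TF and the natural logarithm for IDF (the base is not specified above); the toy corpus is invented for illustration.

import math
from collections import Counter

def tf_idf(documents):
    # tf.idf = tf_ij * log(N / n_i) for every (term, document) pair.
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        for term in set(doc):
            df[term] += 1
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({term: count * math.log(n_docs / df[term])
                        for term, count in tf.items()})
    return weights

docs = [["pork", "pork", "congress"],
        ["pork", "prices"],
        ["grain", "prices", "congress"]]
for w in tf_idf(docs):
    print({t: round(v, 3) for t, v in w.items()})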

χ² statistic (CHI). The χ² statistic (pronounced "kai square") is the most commonly used method of comparing proportions. It checks whether there is a relationship between being in one of two groups and a characteristic under study. Example: let us measure the dependency between a term t and a category c. The groups would be (1) the documents from a category c_i and (2) all other documents; the characteristic would be "document contains term t".

χ² statistic (CHI). Is "jaguar" a good predictor for the "auto" class? Observed counts:
              Term = jaguar   Term ≠ jaguar
Class = auto        2              500
Class ≠ auto        3             9500
We want to compare the observed distribution above with the null hypothesis that jaguar and auto are independent.

χ² statistic (CHI). Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect? We would have Pr(j,a) = Pr(j) × Pr(a), so there would be N × Pr(j,a), i.e. N × Pr(j) × Pr(a). Pr(j) = (2+3)/N; Pr(a) = (2+500)/N; N = 2+3+500+9500 = 10005. This gives N × (5/N) × (502/N) = 2510/N = 2510/10005 ≈ 0.25.

χ² statistic (CHI). The same table, with the expected counts f_e (in parentheses) next to the observed counts f_o:
              Term = jaguar   Term ≠ jaguar
Class = auto     2 (0.25)       500 (502)
Class ≠ auto     3 (4.75)      9500 (9498)

χ² statistic (CHI). χ² sums (f_o − f_e)²/f_e over all table entries: here χ² = (2−0.25)²/0.25 + (500−502)²/502 + (3−4.75)²/4.75 + (9500−9498)²/9498 ≈ 12.9. The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the critical value for .999 confidence).
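
For completeness, a small sketch that computes the statistic directly from the 2x2 table; with exact (unrounded) expected counts it comes out slightly below the 12.9 quoted above.

def chi_square_2x2(o11, o12, o21, o22):
    # Chi-square statistic for a 2x2 contingency table: observed counts
    # o11..o22, expected counts derived from the marginals.
    n = o11 + o12 + o21 + o22
    row1, row2 = o11 + o12, o21 + o22
    col1, col2 = o11 + o21, o12 + o22
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    observed = [o11, o12, o21, o22]
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

# jaguar/auto table from the slide: rows = class, columns = term.
# Prints 12.8; the 12.9 above apparently comes from rounding the expected counts.
print(round(chi_square_2x2(2, 500, 3, 9500), 1))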

χ² statistic (CHI). How do we use χ² for multiple categories? Compute χ² for each (term, category) pair and then combine the scores. If we require a term to discriminate well across all categories, we take the expected value of χ² over the categories; if we only require it to discriminate well for a single category, we take the maximum over the categories (formulas below).
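
The combination formulas themselves did not survive the slide export; the standard definitions, as in Yang & Pedersen (1997), are:

\chi^2_{avg}(t) = \sum_{i=1}^{m} \Pr(c_i)\, \chi^2(t, c_i)
\qquad
\chi^2_{max}(t) = \max_{i=1}^{m} \chi^2(t, c_i)

where c_1, ..., c_m are the categories.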

χ² statistic (CHI). Pros: normalized and thus comparable across terms; χ²(t,c) is 0 when t and c are independent; can be compared to the χ² distribution with 1 degree of freedom. Cons: unreliable for low-frequency terms; computationally expensive.

Term Normalization. Combine different words into a single representation. Stemming/morphological analysis: bought, buy, buys -> buy. General word categories: $23.45, 5.30 Yen -> MONEY; 1984, 10,000 -> DATE, NUM; PERSON; ORGANIZATION (covered in the Information Extraction segment). Generalize with lexical hierarchies: WordNet, MeSH (covered later in the semester).

Stemming and Lemmatization. Purpose: conflate morphological variants of a word to a single index term. Stemming normalizes to a pseudoword, e.g. "more" and "morals" become "mor" (Porter stemmer). Lemmatization converts to the root form, e.g. "more" and "morals" become "more" and "moral". Plus: vocabulary size reduction, data sparseness reduction. Minus: loses important features (even to_lowercase() can be bad!), and its utility is questionable (maybe just "-s", "-ing" and "-ed"?).
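
A quick way to experiment with both, sketched here with NLTK's PorterStemmer and WordNetLemmatizer; the class names and the WordNet data download reflect a current NLTK install (an assumption, not part of the lecture), and the outputs may differ from the slide's illustrative forms.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# nltk.download("wordnet")  # needed once for the lemmatizer's data

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["more", "morals", "buying", "bought"]:
    # Stemming normalises to a pseudoword; lemmatization to a dictionary form.
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))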

Practical Approach. Feature selection: infrequent term removal (infrequent across the whole collection, i.e. low DF, or met in only a single document); most frequent term removal (i.e. stop words). Normalization: stemming (often); word classes (sometimes). Feature weighting: TF.IDF or IDF. Dimensionality reduction (occasionally).
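
Most of this recipe is available off the shelf today. As one illustration, using scikit-learn rather than the course's NLTK tooling (so treat the API as an assumption), rare-term removal, stop-word removal and tf.idf weighting can be combined in a single vectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

# min_df drops infrequent terms, max_df and stop_words drop overly frequent
# ones, and the remaining features are tf.idf weighted -- the recipe above.
vectorizer = TfidfVectorizer(min_df=2, max_df=0.9, stop_words="english")

docs = ["pork congress kicks off in Indianapolis",
        "pork prices rise as exports grow",
        "grain exports fall while prices rise"]
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.shape)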

Classification. Linear versus non-linear classification. Binary classification: Perceptron, Winnow, Support Vector Machines (SVM), kernel methods (covered in the statistics lectures). Multi-class classification (covered in the statistics lectures): decision trees, Naïve Bayes, k nearest neighbor.

Binary Classification. Examples: spam filtering (spam, not spam); customer service message classification (urgent vs. not urgent); information retrieval (relevant, not relevant); sentiment classification (positive, negative). Sometimes it is convenient to treat a multi-way problem as a binary one: one class versus all the others, for each class.

Binary Classification. Given: some data items that belong to a positive (+1) or a negative (-1) class. Task: train the classifier and predict the class for a new data item. Geometrically: find a separator.

Linear vs. Non-Linear. Linearly separable data: all the data points can be correctly classified by a linear (hyperplanar) decision boundary.

Linear vs. Non-Linear (figure slides: a linear decision boundary; Class1 and Class2 data that are not linearly separable; a non-linear classifier).

Linear vs. Non-Linear. Linearly or non-linearly separable data? We can find out only empirically. Linear algorithms (algorithms that find a linear decision boundary) are appropriate when we think the data is linearly separable. Advantages: simpler, fewer parameters. Disadvantages: high-dimensional data (as in NLP) is usually not linearly separable. Examples: Perceptron, Winnow, SVM. Note: we can also use linear algorithms for non-linear problems (see kernel methods).

Linear vs. Non-Linear. Non-linear algorithms are appropriate when the data is not linearly separable. Advantages: more accurate. Disadvantages: more complicated, more parameters. Example: kernel methods. Note: the distinction between linear and non-linear also applies to multi-class classification (we'll see this later).

Simple Linear Algorithms. The Perceptron and Winnow algorithms: linear, binary classification; online (process the data sequentially, one data point at a time); mistake-driven; simple single-layer neural networks.

Simple Linear Algorithms. Data: {(x_i, y_i)}, i = 1...n, where x in R^d is the feature vector (a vector in d-dimensional space) and y in {-1,+1} is the label (class, category). Question: design a linear decision boundary wx + b = 0 (the equation of a hyperplane) such that the classification rule associated with it has minimal probability of error. Classification rule: y = sign(wx + b), which means: if wx + b > 0 then y = +1; if wx + b < 0 then y = -1.

Simple Linear Algorithms. Find a good hyperplane (w, b) in R^(d+1) that correctly classifies as many data points as possible, in online fashion: one data point at a time, updating the weights as necessary. The hyperplane is wx + b = 0 and the classification rule is y = sign(wx + b).

Perceptron Algorithm.
Initialize: w_1 = 0
For each data point x_i:
  if class(x_i) != decision(x_i, w_k):
    w_(k+1) <- w_k + y_i * x_i
    k <- k + 1
  else:
    w_(k+1) <- w_k
decision(x, w): return +1 if wx + b > 0, else -1
(Figure: the old boundary w_k x + b = 0 and the updated boundary w_(k+1) x + b = 0, with points labeled +1 and -1.)
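
A runnable sketch of this update in Python/NumPy; folding the bias b into the weight vector via a constant feature is a convenience assumed here, not something stated on the slide, and the toy data is invented.

import numpy as np

def perceptron_train(X, y, epochs=10):
    # Mistake-driven update w <- w + y_i * x_i, with the bias folded in
    # as the last weight component.
    X = np.hstack([X, np.ones((len(X), 1))])   # append constant 1 for the bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if np.sign(w @ xi) != yi:          # mistake: update
                w += yi * xi
    return w

def perceptron_predict(X, w):
    X = np.hstack([X, np.ones((len(X), 1))])
    return np.where(X @ w > 0, 1, -1)

# Tiny linearly separable example (two features per point).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
print(w, perceptron_predict(X, w))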

Perceptron Algorithm. Online: can adjust to a changing target over time. Advantages: simple and computationally efficient; guaranteed to learn a linearly separable problem (convergence to a global optimum). Limitations: only linear separations; only converges for linearly separable data; not really efficient with many features.

Winnow Algorithm. Another online algorithm for learning perceptron weights f(x) = sign(wx + b); linear, binary classification. The update rule is again error-driven, but multiplicative instead of additive.

Winnow Algorithm.
Initialize: w_1 = 0
For each data point x_i:
  if class(x_i) != decision(x_i, w_k):
    w_(k+1) <- w_k + y_i * x_i        (Perceptron)
    w_(k+1) <- w_k * exp(y_i * x_i)   (Winnow)
    k <- k + 1
  else:
    w_(k+1) <- w_k
decision(x, w): return +1 if wx + b > 0, else -1
(Figure: the boundary w_k x + b = 0 and the updated boundary w_(k+1) x + b = 0.)
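
A sketch of the multiplicative rule above, meant as a runnable illustration rather than the canonical Winnow (which keeps positive weights and promotes/demotes them by a fixed factor). The all-ones initialisation and the fixed threshold B are assumptions, since a multiplicative update starting from w_1 = 0 would never move.

import numpy as np

B = -1.0   # fixed decision threshold (an assumption; the slide leaves b unspecified)

def winnow_like_train(X, y, epochs=10):
    # Mistake-driven training with the slide's multiplicative update
    # w <- w * exp(y_i * x_i), applied componentwise.
    w = np.ones(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if np.sign(w @ xi + B) != yi:     # mistake: multiplicative update
                w *= np.exp(yi * xi)
    return w

# Binary features; only the first one is actually relevant to the class.
X = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 0, 1]], dtype=float)
y = np.array([1, 1, -1, -1])
w = winnow_like_train(X, y)
print(w, np.sign(X @ w + B))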

Perceptron vs. Winnow. Assume N available features but only K relevant ones, with K << N. Perceptron: number of mistakes O(K N). Winnow: number of mistakes O(K log N). Winnow is more robust to high-dimensional feature spaces.

Perceptron vs. Winnow. Perceptron: online, can adjust to a changing target over time. Advantages: simple and computationally efficient; guaranteed to learn a linearly separable problem. Limitations: only linear separations; only converges for linearly separable data; not really efficient with many features. Winnow: online, can adjust to a changing target over time. Advantages: simple and computationally efficient; guaranteed to learn a linearly separable problem; suitable for problems with many irrelevant attributes. Limitations: only linear separations; only converges for linearly separable data; not really efficient with many features. Used in NLP.

Support Vector Machine (SVM). A large-margin classifier; linearly separable case. Goal: find the hyperplane w^T x + b = 0 that maximizes the margin M. (Figure: the margin boundaries w^T x_a + b = 1 and w^T x_b + b = -1 pass through the support vectors.)

Support Vector Machine (SVM). Applications: text classification, hand-writing recognition, computational biology (e.g. micro-array data), face detection, facial expression recognition, time series prediction.
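
As a quick illustration of SVMs for text classification, here is a sketch using scikit-learn's LinearSVC on toy documents; this is not course-provided code, and the example data is invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["pork prices rise at the congress",
        "livestock exports grow strongly",
        "the striker scored in the final minute",
        "the champion won the grand prix"]
labels = ["economy", "economy", "sport", "sport"]

# tf.idf features + a linear max-margin classifier.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
clf = LinearSVC(C=1.0)
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["exports of pork rise"])))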

Classification II: non-linear algorithms; kernel methods; multi-class classification; decision trees; Naïve Bayes. Last topic for today: k nearest neighbour.

k Nearest Neighbour. Nearest-neighbour classification rule: to classify a new object, find the object in the training set that is most similar, and assign the category of this nearest neighbour. k nearest neighbours (KNN): consult the k nearest neighbours and base the decision on the majority category among them; this is more robust than k = 1. An example of a similarity measure often used in NLP is cosine similarity.
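
A compact NumPy sketch of this rule with cosine similarity over toy term-count vectors; the vocabulary, counts and labels are invented for illustration.

import numpy as np
from collections import Counter

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def knn_classify(query, train_vectors, train_labels, k=3):
    # Label the query with the majority category of its k most similar
    # training vectors under cosine similarity.
    sims = [cosine(query, x) for x in train_vectors]
    top = np.argsort(sims)[-k:]                 # indices of the k nearest
    votes = Counter(train_labels[i] for i in top)
    return votes.most_common(1)[0][0]

# Toy term-count vectors over the vocabulary [pork, prices, striker, goal].
train = np.array([[3, 1, 0, 0], [2, 2, 0, 0], [0, 0, 2, 1], [0, 1, 3, 2]], dtype=float)
labels = ["economy", "economy", "sport", "sport"]
query = np.array([1.0, 2.0, 0.0, 0.0])
print(knn_classify(query, train, labels, k=3))  # majority vote of the 3 nearest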

(Figure slides: 1-nearest-neighbour and 3-nearest-neighbour examples.)

3 Nearest Neighbour (figure). But one of the neighbours is closer than the others: instead of simply assigning the majority category of the neighbours, we can weight the neighbours according to their similarity.

k Nearest Neighbour. Strengths: robust; conceptually simple; often works well; powerful (arbitrary decision boundaries). Weaknesses: performance is very dependent on the similarity measure used (and to a lesser extent on the number of neighbours k); finding a good similarity measure can be difficult; computationally expensive.