CSC 9010: Text Mining Applications: GATE Machine Learning. Dr. Paula Matuszek. ©2012 Paula Matuszek.

Information Extraction in GATE
- In the last assignment we looked at how to set up GATE to do information extraction "by hand", using gazetteers and JAPE rules.
- We modified the system to add UPenn and Villanova to the universities to be extracted.
- Not hard for two entities, but this can get tricky. Is there a better way?

Machine Learning
- We would like to give GATE examples of universities and let it figure out for itself how to identify them.
- This is basically a classification problem: given an entity, is it a university?
- This sounds familiar! The classification methods we looked at in NLTK can be considered forms of supervised machine learning.

Machine Learning in GATE
- GATE has machine learning processing resources: the CREOLE Learning plugin.
- These are the same classification algorithms we saw in NLTK.
- But both the features to be used and the class to be learned can be richer, using PRs like ANNIE.

Machine Learning (University of Sheffield NLP, gate.ac.uk/sale/talks/gate-course-july09/ml.ppt)
- We have data items comprising labels and features. E.g. an instance of "cat" has features "whiskers=1" and "fur=1"; a "stone" has "whiskers=0" and "fur=0".
- A machine learning algorithm learns a relationship between the features and the labels, e.g. "if whiskers=1 then cat".
- This is used to label new data: given a new instance with features "whiskers=1" and "fur=1", is it a cat or not?
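A minimal sketch of this idea in code, using the cat/stone toy data from the slide (scikit-learn is assumed here; it is not part of GATE):

```python
# Toy sketch: learn a relationship between features and labels.
# Assumes scikit-learn is installed; the data mirrors the cat/stone example.
from sklearn.tree import DecisionTreeClassifier

# Each row is an instance: [whiskers, fur]
X_train = [[1, 1],   # cat
           [1, 1],   # cat
           [0, 0],   # stone
           [0, 0]]   # stone
y_train = ["cat", "cat", "stone", "stone"]

model = DecisionTreeClassifier()   # learns something like "if whiskers=1 then cat"
model.fit(X_train, y_train)

# Label a new, unseen instance with whiskers=1 and fur=1
print(model.predict([[1, 1]]))     # -> ['cat']
```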

ML in Information Extraction (University of Sheffield NLP)
- We have annotations (classes) and we have features (words, context, word features, etc.).
- Can we learn how features map to classes using ML?
- Once obtained, the ML representation can do our annotation for us based on features in the text: pre-annotation, automated systems.
- Possibly a good alternative to knowledge engineering approaches: no need to write the rules. However, we need to prepare training data.

ML in Information Extraction (University of Sheffield NLP)
- Central to ML work is evaluation: we need to try different methods and different parameters to obtain a good result.
- Precision: how many of the annotations we identified are correct?
- Recall: how many of the annotations we should have identified did we identify?
- F-score: F = 2 × (precision × recall) / (precision + recall).
- Testing requires an unseen test set.
  - Hold out a test set: a simple approach, but data may be scarce.
  - Cross-validation: split the training data into e.g. 10 sections, take turns using each "fold" as the test set, and average the scores across the 10.
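These evaluation ideas can be sketched with scikit-learn (the random data and the linear SVM below are placeholders for illustration, not GATE output):

```python
# Sketch of precision/recall/F-score, hold-out testing, and k-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import precision_score, recall_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hold-out evaluation: simple, but data may be scarce
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
pred = LinearSVC().fit(X_tr, y_tr).predict(X_te)
p = precision_score(y_te, pred)
r = recall_score(y_te, pred)
print("precision", p, "recall", r, "F", 2 * p * r / (p + r))

# 10-fold cross-validation: each fold takes a turn as the test set; average the scores
scores = cross_val_score(LinearSVC(), X, y, cv=10, scoring="f1")
print("mean F over 10 folds:", scores.mean())
```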

More on SVMs
- The primary machine learning engine in GATE is the Support Vector Machine.
  - Handles very large feature sets well.
  - Has been shown empirically to be effective in a variety of unstructured text learning tasks.
- We covered this briefly earlier; more detail will help with using SVMs in GATE.

Basic Idea Underlying SVMs
- Find a line, or a plane, or a hyperplane, that separates our classes cleanly.
  - This is the same concept as we have seen in regression.
- Do so by finding the greatest margin separating the classes.

Linear Classifiers (from Andrew W. Moore's tutorials, © 2001, 2003)
[figure: 2-D data, with points marked +1 and -1] How would you classify this data?

Linear Classifiers (from Andrew W. Moore's tutorials, © 2001, 2003)
[figure: several candidate separating lines] Any of these would be fine... but which is best?

Classifier Margin (from Andrew W. Moore's tutorials, © 2001, 2003)
Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Maximum Margin (from Andrew W. Moore's tutorials, © 2001, 2003)
The maximum margin linear classifier is the linear classifier with the maximum margin. It is called the Linear Support Vector Machine (SVM).

Maximum Margin, continued (from Andrew W. Moore's tutorials, © 2001, 2003)
Support vectors are the datapoints that the margin pushes up against.
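A small sketch of fitting a maximum-margin linear classifier and inspecting its support vectors (scikit-learn and the toy points are assumptions for illustration):

```python
# Fit a (nearly) hard-margin linear SVM on toy 2-D data and inspect the
# support vectors and the margin width. Assumes scikit-learn and numpy.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # a very large C approximates a hard margin
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)   # the points the margin pushes against
w = clf.coef_[0]
print("margin width:", 2 / np.linalg.norm(w))       # distance between the two margin boundaries
```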

Messy Data
- This is all good so far.
- But suppose our data aren't that neat:

Soft Margins
- Intuitively, it still looks like we can make a decent separation here.
  - We can't make a clean margin, but we can almost do so if we allow some errors.
- A soft margin is one which lets us make some errors in order to get a wider margin.
- There is a tradeoff between a wide margin and classification errors.

Messy Data [figure only]

Slack Variables and Cost
- In order to find a soft margin, we allow slack variables, which measure the degree of misclassification.
  - This takes into account the number of misclassified instances and their distance from the margin.
- We then modify this by a cost (C) for the misclassified instances.
  - High cost: narrow margins that won't generalize well.
  - Low cost: broader margins, but more data misclassified.
- How much we want it to cost to misclassify instances depends on our domain, i.e. on what we are trying to do.
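A hedged sketch of the cost tradeoff using scikit-learn (the overlapping blobs are invented "messy" data):

```python
# High C penalizes misclassification heavily (narrower margin, risks overfitting);
# low C accepts some errors in exchange for a wider margin.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=3.0, random_state=0)

for C in (100.0, 0.01):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: training accuracy={clf.score(X, y):.2f}, "
          f"support vectors={len(clf.support_vectors_)}")
# Typically the low-C model keeps more support vectors and tolerates more
# training errors, which often generalizes better on noisy data.
```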

Non-Linearly-Separable Data
- Suppose even with a soft margin we can't get a good linear separation of our data?
- Allowing non-linearity will give us much better modeling of many data sets. In SVMs, we do this by using a kernel.
- A kernel is a function which maps our data into a higher-dimensional feature space where we can find a separating hyperplane.

Hard 1-Dimensional Dataset (from Andrew W. Moore's tutorials)
[figure: points of both classes lying on a single axis around x=0] What can be done about this?

The Kernel Trick!
- If we can't do a linear separation, add higher-order features and combinations of features.
- Combinations of features may allow separation where either feature alone will not. This is known as the kernel trick.
- Typical kernels are polynomial and RBF (Radial Basis Function, i.e. Gaussian).
- GATE supports linear and polynomial kernels.
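A sketch of the trick on a hard 1-dimensional dataset like Moore's: points near x=0 belong to one class and points further out to the other, so no single threshold on x separates them (scikit-learn is assumed; the numbers are invented):

```python
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([ 1,    1,   -1,  -1,  -1,   1,   1 ])   # not linearly separable in 1-D

# Option 1: add the higher-order feature x^2 by hand; in (x, x^2) space a line separates them
X_mapped = np.column_stack([x, x ** 2])
print(SVC(kernel="linear").fit(X_mapped, y).score(X_mapped, y))

# Option 2: let a kernel do the mapping implicitly
X = x.reshape(-1, 1)
print(SVC(kernel="poly", degree=2).fit(X, y).score(X, y))   # polynomial kernel
print(SVC(kernel="rbf").fit(X, y).score(X, y))              # RBF (Gaussian) kernel
# Expect perfect training accuracy on this toy data in all three cases.
```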

What If We Want Three Classes?
- Suppose our task involves more than two classes: universities, cities, businesses?
- Reduce the multiple-class problem to multiple binary-class problems:
  - one-versus-all (one-vs-others)
  - one-versus-one (one-vs-another)
- GATE will do this automatically if there are more than two classes.
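A sketch of both reductions with scikit-learn (random data; the three labels stand in for university/city/business):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=150, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)   # one-versus-all: one binary model per class
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)    # one-versus-one: one model per pair of classes
print(len(ovr.estimators_), "one-vs-rest models;", len(ovo.estimators_), "one-vs-one models")
```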

Doing This in GATE
- GATE has a Learning plugin available in CREOLE. The PR available from it is Batch Learning.
  - The newest machine learning PR.
  - Focused on chunk recognition, text classification, and relation extraction.
  - Its primary algorithm is an SVM.
  - Can also interface to WEKA for Naive Bayes, KNN, and decision trees.
- There is also an older Machine_Learning plugin, which contains wrappers for several different external learning systems.

Using the Batch Learning PR
- The Batch Learning PR is basically learning new annotations (the class) from previous annotations (the features).
- We need three things:
  - annotated training documents,
  - possibly preprocessing to get the annotations we want to learn from,
  - an XML configuration file (which is external to the IDE).
- The PR treats the parent directory of the config file as its working directory.

ML Applications in GATE (University of Sheffield NLP)
- The Batch Learning PR has three modes: evaluation, training, and application.
- It runs after all other PRs: it must be the last PR.
- It is configured via an XML file.
- A single directory holds the generated features, the models, and the config file.

Instances, Attributes, Classes (University of Sheffield NLP)
Example: "California Governor Arnold Schwarzenegger proposes deep cuts.", annotated with Sentence and Token spans, Entity.type=Location over "California", and Entity.type=Person over "Arnold Schwarzenegger".
- Instances: any annotation; Tokens are often convenient.
- Attributes: any annotation feature relative to the instances, e.g. Token.String, Token.category (POS), Sentence.length.
- Class: the thing we want to learn; a feature on an annotation.
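Very roughly, this amounts to turning each instance annotation into a feature vector plus a class label. The sketch below uses a hypothetical, simplified stand-in for GATE Token annotations (plain Python dicts and scikit-learn, not GATE's actual API):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# One instance per Token, with attributes drawn from annotation features
instances = [
    {"Token.string": "California",     "Token.category": "NNP"},
    {"Token.string": "Governor",       "Token.category": "NNP"},
    {"Token.string": "Arnold",         "Token.category": "NNP"},
    {"Token.string": "Schwarzenegger", "Token.category": "NNP"},
    {"Token.string": "proposes",       "Token.category": "VBZ"},
    {"Token.string": "deep",           "Token.category": "JJ"},
    {"Token.string": "cuts",           "Token.category": "NNS"},
]
# Class: the thing we want to learn for each instance
classes = ["Location", "null", "Person", "Person", "null", "null", "null"]

X = DictVectorizer().fit_transform(instances)   # sparse one-hot feature vectors
clf = LinearSVC().fit(X, classes)
print(clf.classes_)
```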

Surround Mode (University of Sheffield NLP)
- A learned class may cover more than one instance: e.g. Entity.type=Person spans several Tokens in "California Governor Arnold Schwarzenegger proposes deep cuts."
- This becomes begin/end boundary learning.
- It is dealt with by the API (surround mode) and is transparent to the user.

Multi-Class to Binary (University of Sheffield NLP)
- In "California Governor Arnold Schwarzenegger proposes deep cuts." there are three classes, including null: Entity.type=Person, Entity.type=Location, and null.
- Many algorithms are binary classifiers.
  - One against all (one against others): LOC vs PERS+NULL / PERS vs LOC+NULL / NULL vs LOC+PERS.
  - One against one (one against another one): LOC vs PERS / LOC vs NULL / PERS vs NULL.
- Dealt with by the API (multClassification2Binary) and transparent to the user.

The XML File
- The root element is ML-CONFIG.
- It contains two required elements: DATASET and ENGINE.
- And some optional settings, described in the manual; the most important are covered next.

The Configuration File (University of Sheffield NLP)
- Verbosity: 0, 1, or 2.
- Surround mode: set true for entities, false for relations.
- Filtering: e.g. remove instances distant from the hyperplane.

Thresholds (University of Sheffield NLP)
- Control the selection of boundaries and classes in post-processing.
- The defaults given will work; experiment, and see the documentation.
  <PARAMETER name="thresholdProbabilityEntity" value="0.3"/>
  <PARAMETER name="thresholdProbabilityBoundary" value="0.5"/>
  <PARAMETER name="thresholdProbabilityClassification" value="0.5"/>

Multiclass and Evaluation (University of Sheffield NLP)
- Multi-class: one-vs-others or one-vs-another.
- Evaluation: kfold (runs gives the number of folds) or holdout (ratio gives the training/test split).

The Learning Engine (University of Sheffield NLP)
- Specifies the learning algorithm and implementation-specific options.
- SVM: a Java implementation of LibSVM; uneven margins are set with -tau.
  <ENGINE nickname="SVM" implementationName="SVMLibSvmJava" options=" -c 0.7 -t 1 -d 3 -m 100 -tau 0.6"/>
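For intuition only, the LibSVM-style options in that example map roughly onto scikit-learn's SVC (which also wraps LibSVM); this is an assumption-laden illustration, not GATE code, and -tau (uneven margins) is a GATE extension with no direct SVC counterpart:

```python
from sklearn.svm import SVC

clf = SVC(
    C=0.7,            # -c 0.7 : cost of misclassification (soft-margin tradeoff)
    kernel="poly",    # -t 1   : polynomial kernel (0 would be linear)
    degree=3,         # -d 3   : polynomial degree
    cache_size=100,   # -m 100 : kernel cache size in MB
)
# clf.fit(X, y) would then train on whatever feature vectors are generated.
```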

The Dataset (University of Sheffield NLP)
- Defines the instance annotation, the class, and the mapping from annotation features to instance attributes.

Parameters in the IDE
- Path to the config file: given when you create the new PR.
- Run-time parameters:
  - corpus
  - input annotation set name: the features and the class
  - output annotation set name: where results go (same as input when evaluating)
  - learningMode

LearningMode
- A run-time parameter; the default is TRAINING.
- The primary modes are:
  - TRAINING: the PR learns from the data and saves its models into learnedModels.save.
  - APPLICATION: reads the model from learnedModels.save and applies it to the data.
  - EVALUATION: k-fold or hold-out evaluation; output goes to GATE Developer.
- There are additional modes for incremental (adaptive) training and for displaying outcomes.

Running the Batch Learning PR
- Multiple PRs: training and evaluation modes update the model and other data files, so don't run more than one in the same working directory.
- Processing order:
  - For training and evaluation modes, it needs to go last. If there is post-processing to be done, make a separate application.
  - For application mode, order is controlled by BATCH-APP-INTERVAL:
    - set to 1: acts like a normal PR;
    - set higher: must go last (may be more efficient).

Summary
- GATE supports a very rich set of classifier-type machine learning capabilities.
  - Current: the Learning plugin has the Batch Learner.
  - Older: the Machine_Learning plugin has wrappers for various ML systems.
- Both the class to be learned and the features to learn from are document annotations.
- These are the same algorithms as we saw in NLTK, but their use can be very different:
  - classifying chunks of text within documents;
  - an alternative to engineering JAPE rules by hand.