
Slide 1: CSC 9010: Text Mining Applications: GATE Machine Learning
Dr. Paula Matuszek
Paula.Matuszek@villanova.edu, Paula.Matuszek@gmail.com, (610) 647-9789
©2012 Paula Matuszek. GATE information based on http://gate.ac.uk/sale/tao/splitch18.html

Slide 2: Information Extraction in GATE
- In the last assignment we looked at how to set up GATE to do information extraction "by hand":
  - Gazetteers
  - JAPE rules
- We modified the system to add UPenn and Villanova to the universities to be extracted.
- Not hard for two entities, but this can get tricky. Is there a better way?

Slide 3: Machine Learning
- We would like to:
  - give GATE examples of universities
  - let it figure out for itself how to identify them
- This is basically a classification problem: given an entity, is it a university?
- This sounds familiar! The classification methods we looked at in NLTK can be considered forms of supervised machine learning.

Slide 4: Machine Learning in GATE
- GATE has machine learning processing resources: the CREOLE Learning plugin.
- Same classification algorithms we saw in NLTK.
- But both the features to be used and the class to be learned can be richer, using PRs like ANNIE.

Slide 5: Machine Learning (University of Sheffield NLP, gate.ac.uk/sale/talks/gate-course-july09/ml.ppt)
- We have data items comprising labels and features. E.g., an instance of "cat" has features "whiskers=1" and "fur=1"; a "stone" has "whiskers=0" and "fur=0".
- A machine learning algorithm learns a relationship between the features and the labels, e.g. "if whiskers=1 then cat".
- This is used to label new data: given a new instance with features "whiskers=1" and "fur=1", is it a cat or not?
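The cat/stone example can be sketched in a few lines of plain Python. This is a toy illustration only, nothing to do with GATE's internals; the one-feature "stump" learner here is a hypothetical stand-in for a real learning algorithm.

```python
# Toy sketch: learn a one-feature rule like "if whiskers=1 then cat"
# from labeled feature dicts. Purely illustrative, not GATE code.
def learn_stump(examples):
    """Find a (feature, value) pair that perfectly predicts the label 'cat'."""
    features = {f for feats, _ in examples for f in feats}
    for f in features:
        for v in (0, 1):
            if all((feats[f] == v) == (label == "cat")
                   for feats, label in examples):
                return f, v
    return None

data = [({"whiskers": 1, "fur": 1}, "cat"),
        ({"whiskers": 0, "fur": 0}, "stone")]
rule = learn_stump(data)   # e.g. ("whiskers", 1), meaning "if whiskers=1 then cat"

# Label new data with the learned rule:
new_item = {"whiskers": 1, "fur": 1}
prediction = "cat" if new_item[rule[0]] == rule[1] else "not cat"
```

Note that with only two training items either feature works equally well, which is exactly why real systems need more data and richer features.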

Slide 6: ML in Information Extraction
- We have annotations (classes) and we have features (words, context, word features, etc.). Can we learn how features match classes using ML?
- Once obtained, the ML representation can do our annotation for us based on features in the text:
  - pre-annotation
  - automated systems
- Possibly a good alternative to knowledge engineering approaches: no need to write the rules. However, we need to prepare training data.

Slide 7: ML in Information Extraction
- Central to ML work is evaluation: we need to try different methods and different parameters to obtain good results.
  - Precision: how many of the annotations we identified are correct?
  - Recall: how many of the annotations we should have identified did we?
  - F-score: F = 2 * (precision * recall) / (precision + recall)
- Testing requires an unseen test set:
  - Hold out a test set: simple approach, but data may be scarce.
  - Cross-validation: split the training data into e.g. 10 sections, take turns using each "fold" as a test set, and average the scores across the 10.
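The F-score arithmetic and the k-fold splitting above can be sketched in a few lines (illustrative Python, not part of GATE; the counts in the example are made up):

```python
def f_score(tp, fp, fn):
    """Precision, recall, and F1 from counts of correct annotations (tp),
    spurious ones (fp), and missed ones (fn)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical run: of 10 annotations produced, 8 are correct (fp=2),
# and 2 gold-standard annotations were missed (fn=2).
p, r, f = f_score(8, 2, 2)   # precision 0.8, recall 0.8, F1 0.8

def kfold_splits(n_items, k=10):
    """Yield (test_indices, train_indices) pairs for k-fold cross-validation."""
    idx = list(range(n_items))
    fold = n_items // k
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold]
        train = idx[:i * fold] + idx[(i + 1) * fold:]
        yield test, train
```

Each of the k runs trains on nine folds and scores on the held-out tenth; the reported figure is the average of the k scores.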

Slide 8: More on SVMs
- The primary machine learning engine in GATE is the Support Vector Machine:
  - handles very large feature sets well
  - has been shown empirically to be effective in a variety of unstructured text learning tasks
- We covered this briefly earlier; more detail will help with using SVMs in GATE.

Slide 9: Basic Idea Underlying SVMs
- Find a line, or a plane, or a hyperplane, that separates our classes cleanly.
  - This is the same concept as we have seen in regression.
- Do so by finding the greatest margin separating the classes.

Slide 10: Linear Classifiers (borrowed heavily from Andrew Moore's tutorials: http://www.cs.cmu.edu/~awm/tutorials; Copyright © 2001, 2003, Andrew W. Moore)
- [Figure: scatter of datapoints labeled +1 and -1.] How would you classify this data?

Slide 11: Linear Classifiers
- [Figure: the same +1/-1 datapoints.] How would you classify this data?

Slide 12: Linear Classifiers
- [Figure: several candidate separating lines through the +1/-1 data.] Any of these would be fine... but which is best?

Slide 13: Classifier Margin
- Define the margin of a linear classifier as the width that the boundary could be increased by before hitting a datapoint.

Slide 14: Maximum Margin
- The maximum margin linear classifier is the linear classifier with the maximum margin.
- This is called the Linear Support Vector Machine (SVM).

Slide 15: Maximum Margin
- The maximum margin linear classifier is the linear classifier with the maximum margin, called the Linear Support Vector Machine (SVM).
- Support vectors are those datapoints that the margin pushes up against.
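Since the slides' figures are not reproduced here, a small numeric sketch may help. The points and candidate separators below are hand-picked for illustration (they are not from the original slides); the idea is simply that the maximum-margin separator is the candidate whose nearest training point is farthest from the boundary.

```python
import math

# Illustrative points: two +1 examples and two -1 examples in the plane.
points = [((1, 3), +1), ((2, 4), +1), ((4, 1), -1), ((5, 2), -1)]

def margin(w, b):
    """Distance from the line w[0]*x + w[1]*y + b = 0 to its nearest point."""
    norm = math.hypot(w[0], w[1])
    return min(abs(w[0] * x + w[1] * y + b) / norm for (x, y), _ in points)

# Both lines separate the +1s from the -1s, but the first has the wider
# margin, so it is the one a linear SVM would prefer:
m1 = margin((1, -1), 0)   # the line x - y = 0
m2 = margin((1, -1), 1)   # the line x - y + 1 = 0
```

The points that realize the minimum distance for the winning line are its support vectors.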

Slide 16: Messy Data
- This is all good so far.
- But suppose our data aren't that neat:

Slide 17: Soft Margins
- Intuitively, it still looks like we can make a decent separation here.
  - We can't make a clean margin, but we can almost do so if we allow some errors.
- A soft margin is one which lets us make some errors in order to get a wider margin.
- There is a tradeoff between a wide margin and classification errors.

Slide 18: Messy Data
- [Figure: datapoints that cannot be cleanly separated by a line.]

Slide 19: Slack Variables and Cost
- In order to find a soft margin, we allow slack variables, which measure the degree of misclassification.
  - This takes into account both the number of misclassified instances and their distance from the margin.
- We then modify this by a cost (C) for the misclassified instances:
  - high cost: narrow margins, won't generalize well
  - low cost: broader margins, but misclassifies more data
- How much we want it to cost to misclassify instances depends on our domain, on what we are trying to do.
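The slack-plus-cost tradeoff can be written out numerically. This is a hedged sketch in plain Python (not GATE's SVM implementation): the soft-margin objective is the margin term ||w||²/2 plus C times the total slack, where each slack term measures how far a point sits on the wrong side of its margin.

```python
# Illustrative soft-margin objective for a 2-D linear classifier w·x + b.
def soft_margin_objective(w, b, C, points):
    """||w||^2 / 2 plus C times the total hinge-loss slack."""
    slack = sum(max(0.0, 1 - label * (w[0] * x + w[1] * y + b))
                for (x, y), label in points)
    return 0.5 * (w[0] ** 2 + w[1] ** 2) + C * slack

# Hand-picked data: the point ((0.5, 0), -1) is on the wrong side of the
# separator x = 0, so it contributes nonzero slack.
pts = [((2, 0), +1), ((-2, 0), -1), ((0.5, 0), -1)]
low_cost = soft_margin_objective((1, 0), 0, 1.0, pts)    # tolerates the error
high_cost = soft_margin_objective((1, 0), 0, 10.0, pts)  # punishes it heavily
```

With a high C the optimizer would rather shrink the margin than pay for the misclassified point; with a low C it accepts the error and keeps the margin wide.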

Slide 20: Non-Linearly-Separable Data
- Suppose even with a soft margin we can't do a good linear separation of our data?
- Allowing non-linearity will give us much better modeling of many data sets. In SVMs, we do this by using a kernel.
- A kernel is a function which maps our data into a higher-order feature space where we can find a separating hyperplane.

Slide 21: Hard 1-Dimensional Dataset
- [Figure: a 1-D dataset along the x axis (origin at x=0) that no single threshold can separate.] What can be done about this?

Slide 22: Hard 1-Dimensional Dataset
- [Figure: the same 1-D dataset.]

Slide 23: Hard 1-Dimensional Dataset
- [Figure: the same 1-D dataset.]

Slide 24: The Kernel Trick!
- If we can't do a linear separation:
  - add higher-order features
  - and combinations of features
- Combinations of features may allow separation where either feature alone will not. This is known as the kernel trick.
- Typical kernels are polynomial and RBF (Radial Basis Function, i.e. Gaussian).
- GATE supports linear and polynomial kernels.
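A minimal numeric illustration of the kernel idea, using made-up 1-D data (plain Python, not GATE's implementation): points with class -1 near zero and +1 far away cannot be split by a single threshold on the line, but mapping x to (x, x²) makes them separable by a straight line in the new space.

```python
# Hand-picked 1-D data: -1 near the origin, +1 farther out.
data = [(-3, +1), (-2, +1), (-1, -1), (0, -1), (1, -1), (2, +1), (3, +1)]

def phi(x):
    """Polynomial feature map from 1-D into 2-D."""
    return (x, x * x)

# In the mapped (x, x^2) space, the horizontal line x^2 = 2.5 is a linear
# separator, even though no threshold on x alone works:
def classify(x, threshold=2.5):
    return +1 if phi(x)[1] >= threshold else -1

all_correct = all(classify(x) == label for x, label in data)
```

The actual kernel trick goes one step further: the algorithm never computes phi(x) explicitly, only inner products in the mapped space, which is what makes high-order (even infinite-dimensional) feature spaces affordable.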

Slide 25: [Figure from Andrew Moore's tutorials.]

Slide 26: [Figure from Andrew Moore's tutorials.]

Slide 27: What If We Want Three Classes?
- Suppose our task involves more than two classes: universities, cities, businesses?
- Reduce the multiple-class problem to multiple binary-class problems:
  - one-versus-all (one vs. the others)
  - one-versus-one (one vs. another)
- GATE will do this automatically if there are more than two classes.
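The bookkeeping for the reduction is simple and worth seeing once (illustrative Python, not GATE code): with k classes, one-versus-all trains k binary classifiers, while one-versus-one trains k*(k-1)/2.

```python
from itertools import combinations

classes = ["University", "City", "Business"]

# one-versus-all: one binary classifier per class (that class vs. everything else)
one_vs_all = [(c, "rest") for c in classes]

# one-versus-one: one binary classifier per unordered pair of classes
one_vs_one = list(combinations(classes, 2))
```

For three classes the two schemes both happen to need three classifiers; the counts diverge as k grows (e.g. 10 classes: 10 vs. 45).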

Slide 28: Doing This in GATE
- GATE has a Learning plugin available in CREOLE. The PR available from it is Batch Learning:
  - the newest machine learning PR
  - focused on chunk recognition, text classification, and relation extraction
  - primary algorithm is an SVM
  - can also interface to WEKA for Naive Bayes, KNN, and decision trees
- There is also an older Machine_Learning plugin, which contains wrappers for several different external learning systems.

Slide 29: Using the Batch Learning PR
- The Batch Learning PR is basically learning new annotations (the class) from previous annotations (the features).
- We need:
  - annotated training documents, possibly preprocessed to get the annotations we want to learn from
  - an XML configuration file (which is external to the IDE)
- The PR treats the parent directory of the config file as its working directory.

Slide 30: ML Applications in GATE
- The Batch Learning PR supports evaluation, training, and application.
- It runs after all other PRs: it must be the last PR.
- It is configured via an XML file.
- A single directory holds the generated features, models, and config file.

Slide 31: Instances, Attributes, Classes
- Example sentence: "California Governor Arnold Schwarzenegger proposes deep cuts." (annotated with Token, Sentence, Entity.type=Person, Entity.type=Location)
- Instances: any annotation. Tokens are often convenient.
- Attributes: any annotation feature relative to instances, e.g. Token.String, Token.category (POS), Sentence.length.
- Class: the thing we want to learn; a feature on an annotation.

Slide 32: Surround Mode
- Example: "California Governor Arnold Schwarzenegger proposes deep cuts.", where Entity.type=Person spans more than one Token.
- When the learned class covers more than one instance, we need begin/end boundary learning.
- This is dealt with by the API in surround mode; it is transparent to the user.

Slide 33: Multi-Class to Binary
- Example: "California Governor Arnold Schwarzenegger proposes deep cuts.", with Entity.type=Person and Entity.type=Location: three classes, including null.
- Many algorithms are binary classifiers.
  - One against all (one against others): LOC vs PERS+NULL / PERS vs LOC+NULL / NULL vs LOC+PERS
  - One against one (one against another one): LOC vs PERS / LOC vs NULL / PERS vs NULL
- Dealt with by the API (multClassification2Binary); transparent to the user.

Slide 34: The XML File
- The root element is ML-CONFIG.
- It contains two required elements:
  - DATASET
  - ENGINE
- And some optional settings, described in the manual; the most important are covered next.

Slide 35: The Configuration File
- Verbosity: 0, 1, or 2.
- Surround mode: set true for entities, false for relations.
- Filtering: e.g. remove instances distant from the hyperplane.

Slide 36: Thresholds
- Control selection of boundaries and classes in post-processing.
- The defaults we give will work; experiment, and see the documentation.

  <PARAMETER name="thresholdProbabilityEntity" value="0.3"/>
  <PARAMETER name="thresholdProbabilityBoundary" value="0.5"/>
  <PARAMETER name="thresholdProbabilityClassification" value="0.5"/>

Slide 37: Multiclass and Evaluation
- Multi-class: one-vs-others or one-vs-another.
- Evaluation:
  - kfold: the runs attribute gives the number of folds
  - holdout: the ratio attribute gives the training/test split

Slide 38: The Learning Engine
- Specifies the learning algorithm and implementation.
- SVM: a Java implementation of LibSVM.
- Uneven margins are set with -tau.

  <ENGINE nickname="SVM" implementationName="SVMLibSvmJava" options="-c 0.7 -t 1 -d 3 -m 100 -tau 0.6"/>

Slide 39: The Dataset
- Defines:
  - the instance annotation
  - the class
  - the annotation feature to instance attribute mapping

Slide 40: Parameters in the IDE
- Path to the config file: set when you create the new PR.
- Run-time parameters:
  - corpus
  - input annotation set name: features and class name
  - output annotation set name: where the results go (same as input when evaluating)
  - learningMode

Slide 41: LearningMode
- A run-time parameter. The default is TRAINING.
- Primary modes are:
  - TRAINING: the PR learns from data and saves models into learnedModels.save
  - APPLICATION: reads the model from learnedModels.save and applies it to the data
  - EVALUATION: k-fold or hold-out evaluation; output goes to GATE Developer
- There are additional modes for incremental (adaptive) training and for displaying outcomes.

Slide 42: Running the Batch Learning PR
- Multiple PRs: training and evaluation modes update the model and other data files, so don't run more than one in the same working directory.
- Processing order:
  - For training and evaluation modes, the PR needs to go last. If there's post-processing to be done, make a separate application.
  - For application mode, order is controlled by BATCH-APP-INTERVAL:
    - set to 1: acts like a normal PR
    - set higher: must go last (may be more efficient)

Slide 43: Summary
- GATE supports a very rich set of classifier-type machine learning capabilities:
  - current: the Learning plugin has the Batch Learner
  - older: the Machine_Learning plugin has wrappers for various ML systems
- Both the class to be learned and the features to learn from are document annotations.
- Same algorithms as we saw in NLTK, but their use can be very different:
  - classifying chunks of text within documents
  - an alternative to engineering JAPE rules by hand

