University of Sheffield NLP Machine Learning in GATE Angus Roberts, Horacio Saggion, Genevieve Gorrell.

Recap
- Previous two days looked at knowledge-engineered IE
- This session looks at machine-learned IE
- Supervised learning
- Effort is shifted from language engineers to annotators

Outline
- Machine Learning and IE
- Support Vector Machines
- GATE's learning API and PR
- Learning entities – hands on
- Learning relations – demo
- (classifying sentences and documents)

Machine learning for information extraction

Machine Learning
- We have data items comprising labels and features
  - E.g. an instance of “cat” has features “whiskers=1”, “fur=1”
  - A “stone” has “whiskers=0” and “fur=0”
- A machine learning algorithm learns a relationship between the features and the labels
  - E.g. “if whiskers=1 then cat”
- This is used to label new data
  - We have a new instance with features “whiskers=1” and “fur=1”: is it a cat or not?
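
The cat/stone example can be sketched in a few lines of Python. This is only an illustration, assuming a toy learner that picks the single feature whose values best predict the label; the names learn_one_rule and classify are made up for this sketch and are not part of GATE.

```python
from collections import Counter, defaultdict

def learn_one_rule(instances):
    """Pick the single feature whose values best predict the label."""
    best = None
    for feature in instances[0][0].keys():
        # For each value of this feature, count the labels seen with it.
        counts = defaultdict(Counter)
        for attrs, label in instances:
            counts[attrs[feature]][label] += 1
        # The rule predicts the majority label for each feature value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(c.most_common(1)[0][1] for c in counts.values())
        if best is None or correct > best[2]:
            best = (feature, rule, correct)
    return best[0], best[1]

# Training data: (features, label) pairs from the slide.
training = [
    ({"whiskers": 1, "fur": 1}, "cat"),
    ({"whiskers": 0, "fur": 0}, "stone"),
]
feature, rule = learn_one_rule(training)

def classify(attrs):
    """Label a new instance with the learned rule."""
    return rule[attrs[feature]]

print(feature, rule)
print(classify({"whiskers": 1, "fur": 1}))
```

With only two training items both features predict equally well, so the sketch keeps the first one it tried; real learners need far more data to choose sensibly.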

Types of ML
- Classification
  - Training instances pre-labelled with classes
  - ML algorithm learns to classify unseen data according to attributes
- Clustering
  - Unlabelled training data
  - Clusters are determined automatically from the data
- Derive representation using ML algorithm
- Automate decision-making in the future

ML in Information Extraction
- We have annotations (classes)
- We have features (words, context, word features etc.)
- Can we learn how features match classes using ML?
- Once obtained, the ML representation can do our annotation for us based on features in the text
  - Pre-annotation
  - Automated systems
- Possibly good alternative to knowledge engineering approaches
  - No need to write the rules
  - However, need to prepare training data

ML in Information Extraction
- Central to ML work is evaluation
  - Need to try different methods, different parameters, to obtain a good result
  - Precision: how many of the annotations we identified are correct?
  - Recall: how many of the annotations we should have identified did we identify?
  - F-score: F = 2 × precision × recall / (precision + recall)
- Testing requires an unseen test set
  - Hold out a test set: a simple approach, but data may be scarce
  - Cross-validation: split the training data into e.g. 10 sections, take turns to use each “fold” as a test set, and average the score across the 10
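
The measures and the cross-validation scheme above can be sketched as follows. This is a minimal illustration, assuming annotations are compared as exact (start, end, type) triples; the function names and the example offsets are assumptions of this sketch, not GATE's evaluation code.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted annotations against gold ones using exact matching."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def k_fold_splits(items, k=10):
    """Yield (train, test) pairs, using each of the k folds once as the test set."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]

# Illustrative annotations as (start offset, end offset, type).
gold = [(0, 10, "Location"), (20, 41, "Person"), (50, 55, "Date")]
predicted = [(0, 10, "Location"), (20, 41, "Person"),
             (60, 70, "Person"), (80, 85, "Date")]
p, r, f = precision_recall_f1(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")

# 20 hypothetical documents split into 10 folds of 2.
splits = list(k_fold_splits(list(range(20)), k=10))
print(len(splits), "folds; first test fold:", splits[0][1])
```

In cross-validation the score reported is the average of the per-fold scores, so every document is used for testing exactly once.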

ML Algorithms
- Vector space models
  - Data have attributes (word features, context etc.)
  - Each attribute is a dimension
  - Data positioned in space
  - Methods involve splitting the space
  - Having learned the split, apply to new data
  - Support vector machines, K-Nearest Neighbours etc.
- Finite state models, decision trees, Bayesian classification and more …
- We will focus on support vector machines today

Support vector machines

Support Vector Machines
- Attempt to find a hyperplane that separates data
- Goal: maximize margin separating two classes
- Wider margin = greater generalisation

Support Vector Machines
- Points near decision boundary: support vectors (removing them would change boundary)
- Points far from boundary not important for decision
- What if data doesn't split?
  - Soft boundary methods exist for imperfect solutions
  - However linear separator may be completely unsuitable
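
To make "margin" and "support vectors" concrete, the sketch below measures the distance of each point from a given hyperplane and reports the closest ones. The points and the hyperplane (w, b) are made-up values chosen for illustration; a real SVM would learn w and b from the data rather than have them supplied.

```python
import math

def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

# Illustrative two-class data: (point, label) with labels -1 and +1.
points = [((1.0, 1.0), -1), ((0.0, 0.0), -1), ((3.0, 3.0), +1), ((4.0, 5.0), +1)]
w, b = (1.0, 1.0), -4.0  # candidate hyperplane x1 + x2 = 4

# Multiplying by the label makes the distance positive for correctly placed points.
distances = [label * signed_distance(w, b, x) for x, label in points]
margin = min(distances)
support_vectors = [x for (x, _), d in zip(points, distances)
                   if math.isclose(d, margin)]

print("margin:", round(margin, 3))
print("support vectors:", support_vectors)
```

Moving or deleting either of the two closest points would change where the widest-margin boundary lies, while the two distant points could be removed without any effect.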

Support Vector Machines
- What if there is no separating hyperplane? (see example figure)
- Or a class may be a globule
- Linear separators do not work in these cases!

Kernel Trick
- Map data into different dimensionality
- Now the points are separable!
- E.g. features alone may not make class linearly separable but combining features may
- Generate many new features and let algorithm decide which to use
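
A small sketch of "combining features": the XOR pattern cannot be split by a line in two dimensions, but adding the product of the two features as a third dimension makes it separable. Note that this sketch builds the new feature explicitly, whereas a kernel (e.g. a polynomial kernel) achieves a comparable effect implicitly. The data, the search grid and the weights are illustrative assumptions.

```python
from itertools import product

def separates(w, b, data):
    """True if hyperplane w.x + b = 0 puts every point on the side given by its label."""
    return all(label * (sum(wi * xi for wi, xi in zip(w, x)) + b) > 0
               for x, label in data)

# XOR-style data: the class is +1 when exactly one feature is set.
xor = [((0, 0), -1), ((1, 1), -1), ((0, 1), +1), ((1, 0), +1)]

# Search a coarse grid of 2-D hyperplanes: none of them separates the data.
grid = [i / 2 for i in range(-8, 9)]
separable_2d = any(separates((w1, w2), b, xor)
                   for w1, w2, b in product(grid, repeat=3))

def add_product_feature(x):
    """Map (x1, x2) to (x1, x2, x1*x2): a combined feature as a new dimension."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

mapped = [(add_product_feature(x), label) for x, label in xor]
separable_3d = separates((1.0, 1.0, -2.0), -0.5, mapped)

print("separable in 2-D (grid search):", separable_2d)
print("separable after mapping to 3-D:", separable_3d)
```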

Support Vector Machines
- SVMs combined with kernel trick provide a powerful technique
- Multiclass methods are a simple extension to the two-class technique (one vs. another, one vs. others)
- Widely used with great success across a range of linguistic tasks

GATE's learning API and PR

API and PRs
- User Guide 9.24: Machine Learning PR
- User Guide Chapter 11: Machine Learning API
  - Support for 3 types of learning
  - Produce features from annotations
  - Abstracts away from ML algorithms
  - Batch Learning PR: a GATE language analyser

Instances, attributes, classes
- Example: “California Governor Arnold Schwarzenegger proposes deep cuts.” (annotated with Sentence, Token, Entity.type=Location and Entity.type=Person)
- Instances: any annotation
  - Tokens are often convenient
- Attributes: any annotation feature relative to instances
  - Token.String
  - Token.category (POS)
  - Sentence.length
- Class: the thing we want to learn
  - A feature on an annotation
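
The relationship between instances, attributes and classes can be sketched in plain Python. This is an illustration only: the POS tags are assigned by hand, the attribute key names and the make_instance helper are assumptions of this sketch, and GATE derives all of this from annotations rather than from lists like these.

```python
# Tokens of the example sentence with illustrative, hand-assigned POS tags.
tokens = [
    ("California", "NNP"), ("Governor", "NNP"), ("Arnold", "NNP"),
    ("Schwarzenegger", "NNP"), ("proposes", "VBZ"), ("deep", "JJ"),
    ("cuts", "NNS"), (".", "."),
]
# Entity.type for the token indices covered by an Entity annotation.
entity_type = {0: "Location", 2: "Person", 3: "Person"}

def make_instance(i, window=1):
    """Return (attributes, class) for token i, using tokens within `window` positions."""
    attributes = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            string, category = tokens[j]
            # Attributes are taken relative to the instance (offset 0 is the token itself).
            attributes[f"string[{offset}]"] = string
            attributes[f"category[{offset}]"] = category
    # Tokens outside any Entity get the null class.
    return attributes, entity_type.get(i, "NULL")

instances = [make_instance(i) for i in range(len(tokens))]
for attributes, label in instances[:3]:
    print(label, attributes)
```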

Surround mode
- Example: “California Governor Arnold Schwarzenegger proposes deep cuts.” (Token annotations; Entity.type=Person)
- This learned class covers more than one instance
- Begin / End boundary learning
- Dealt with by API: surround mode
- Transparent to the user
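
Begin / End boundary learning can be pictured as turning each multi-token entity into two token-level labels and later pairing them up again. The sketch below assumes entities are given as (first token index, last token index, type); the helper names are hypothetical and do not reflect how the GATE API implements surround mode internally.

```python
def boundary_labels(n_tokens, entities):
    """Turn entity spans into per-token begin and end labels (None where no boundary)."""
    begin = [None] * n_tokens
    end = [None] * n_tokens
    for first, last, etype in entities:
        begin[first] = etype
        end[last] = etype
    return begin, end

def spans_from_boundaries(begin, end):
    """Rebuild spans by pairing each begin label with the next end label of the same type."""
    spans = []
    for i, etype in enumerate(begin):
        if etype is None:
            continue
        for j in range(i, len(end)):
            if end[j] == etype:
                spans.append((i, j, etype))
                break
    return spans

tokens = ["California", "Governor", "Arnold", "Schwarzenegger",
          "proposes", "deep", "cuts", "."]
entities = [(0, 0, "Location"), (2, 3, "Person")]

begin, end = boundary_labels(len(tokens), entities)
print("begin:", begin)
print("end:  ", end)
print("rebuilt:", spans_from_boundaries(begin, end))
```

The classifier then only ever has to decide, token by token, whether a token starts or ends an entity of a given type.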

Multi class to binary
- Example: “California Governor Arnold Schwarzenegger proposes deep cuts.” (Entity.type=Location, Entity.type=Person)
- Three classes, including null
- Many algorithms are binary classifiers
- One against all (one against others)
  - LOC vs PERS+NULL / PERS vs LOC+NULL / NULL vs LOC+PERS
- One against one (one against another one)
  - LOC vs PERS / LOC vs NULL / PERS vs NULL
- Dealt with by API: multClassification2Binary
- Transparent to the user
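
The two decompositions listed above can be enumerated with a short sketch. The function names are illustrative and are not the GATE API; the class names are the ones used on the slide.

```python
from itertools import combinations

def one_vs_others(classes):
    """One binary task per class: that class against all remaining classes."""
    return [(c, [o for o in classes if o != c]) for c in classes]

def one_vs_another(classes):
    """One binary task per unordered pair of classes."""
    return list(combinations(classes, 2))

classes = ["LOC", "PERS", "NULL"]

for positive, rest in one_vs_others(classes):
    print(positive, "vs", "+".join(rest))
for a, b in one_vs_another(classes):
    print(a, "vs", b)
```

For n classes, one-vs-others trains n binary classifiers while one-vs-another trains n(n-1)/2; with three classes both give three tasks, but the counts diverge quickly as n grows.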

ML applications in GATE
- Batch Learning PR
  - Evaluation
  - Training
  - Application
- Runs after all other PRs: must be last PR
- Configured via XML file
- A single directory holds generated features, models, and config file

The configuration file
- Verbosity: 0, 1, 2
- Surround mode: set true for entities, false for relations
- Filtering: e.g. remove instances distant from the hyperplane

Thresholds
- Control selection of boundaries and classes in post processing
- The defaults we give will work
- Experiment
- See the documentation
<PARAMETER name="thresholdProbabilityEntity" value="0.3"/>
<PARAMETER name="thresholdProbabilityBoundary" value="0.5"/>
<PARAMETER name="thresholdProbabilityClassification" value="0.5"/>
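
One plausible reading of how the boundary and entity thresholds could interact in post-processing is sketched below. This is an assumption made for illustration, not GATE's actual implementation: it supposes that begin and end tokens are first filtered by the boundary threshold, and that a candidate entity is then kept only if the product of its begin and end probabilities reaches the entity threshold. The probabilities and the select_entities helper are invented; check the GATE documentation for the real behaviour.

```python
THRESHOLD_BOUNDARY = 0.5  # value of thresholdProbabilityBoundary on the slide
THRESHOLD_ENTITY = 0.3    # value of thresholdProbabilityEntity on the slide

def select_entities(begin_probs, end_probs):
    """Pair confident begin tokens with the nearest confident end token (assumed scheme)."""
    begins = [i for i, p in enumerate(begin_probs) if p >= THRESHOLD_BOUNDARY]
    ends = [j for j, p in enumerate(end_probs) if p >= THRESHOLD_BOUNDARY]
    entities = []
    for i in begins:
        for j in ends:
            if j < i:
                continue
            # Assumed entity score: product of the two boundary probabilities.
            score = begin_probs[i] * end_probs[j]
            if score >= THRESHOLD_ENTITY:
                entities.append((i, j, score))
            break  # only the nearest end token is considered
    return entities

# Invented per-token probabilities for a four-token text.
begin_probs = [0.9, 0.1, 0.6, 0.1]
end_probs = [0.8, 0.1, 0.1, 0.55]

for first, last, score in select_entities(begin_probs, end_probs):
    print(f"tokens {first}..{last} kept with score {score:.2f}")
```

Under this reading, raising either threshold trades recall for precision, which is why the slide suggests experimenting.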

Multiclass and evaluation
- Multi-class
  - one-vs-others
  - one-vs-another
- Evaluation
  - kfold: runs gives number of folds
  - holdout: ratio gives training/test

The learning Engine
- Learning algorithm and implementation specific
- SVM: Java implementation of LibSVM
  - Uneven margins set with -tau
<ENGINE nickname="SVM" implementationName="SVMLibSvmJava" options=" -c 0.7 -t 1 -d 3 -m 100 -tau 0.6"/>

The dataset
- Defines
  - Instance annotation
  - Class
  - Annotation feature to instance attribute mapping

Learning entities: hands on

The Problem
- Information extraction consists of the identification of pre-specified facts in running text
- One important component of any information extraction system is a named entity identification component
- Two main approaches exist for the identification of entities in text:
  - Hand-crafted rules: you’ve seen the ANNIE system
  - Machine learning approaches: we will explore one possibility in this session using a classification system
- Manually developed rules use different sources of information: identity of tokens, parts of speech, orthography of the tokens, dictionary information (e.g. Lookup process), etc.
- ML components also rely on those sources of information, and features have to be carefully selected by the ML developer

Features for learning
- Consider the string “Alcan, Inc.” in the text: what we want the ML component to do is to annotate this whole string as a company name
- Note that the ML component will treat this problem as classification: it will transform this into the problem of classifying individual tokens in text (e.g. “Alcan” is the beginning of a company name and “.” (after Inc) is the end of the company name)
- There are several “features” one could use to recognize the string as the name of a company: the first token is an NNP (proper noun), the last token is a company designator, the first token after the string is the verb “to engage”, etc.
- We are going to consider features which can be extracted from the linguistic and semantic analysis of the text: tokenisation, part-of-speech tagging, morphological analysis, gazetteer lookup, and entity recognition
- Additionally one may use information computed by a parser, dependency relations, or syntactic information
- In some cases extra processes will be required in order to transform the result of the analysis into features the ML component can use

Exercise I
- Implement an ML component based on SVM to identify the following concepts in company profiles: company name; address; fax; phone; web site; industry type; creation date; industry sector; main products; market locations; number of employees; stock exchange listings
- Materials (under directory hand-on-resources/ml/entity-learning)
  - Training data: a set of 5 company profiles annotated with the target concepts (corpus/annotated); each document contains an annotation Mention with a feature class representing the target concept
  - Test documents (without target concepts): a set of company profiles from the same source as the training data (corpus/testing)
  - SVM configuration file learn-company.xml (experiments/company-profile-learning)

Exercise I
1. Run an experiment with the training data to check the performance of the learning component
  - Create a corpus and populate it with the training data
  - Create a Learning PR using the provided configuration file
  - Create a corpus pipeline containing the Learning PR: set the Learning PR to “evaluation” mode
  - Run the pipeline over the corpus and examine the results
2. Run an experiment with the test data and check the results of the annotation process on unseen documents
  - Create a corpus and populate it with the training data
  - Create a Learning PR using the provided configuration file
  - Create a corpus pipeline containing the Learning PR: set the Learning PR to “training” mode

Exercise I
2. (continued) Run an experiment with the test data and check the results of the annotation process on unseen documents
  - Create a corpus with the test documents
  - Annotate the documents in the corpus with ANNIE + grammar to create Entity (grammars/create_entity.jape)
  - Train the learning system using the training documents (training mode)
  - Apply the learning system (application mode) to the test documents: use your own annotation set as output
  - Examine the result of the annotation process
3. Run an experiment with the training data to check the performance of the learning component by modifying some of the parameters (follow the steps in 1.): create a working directory, copy the configuration file, modify it, and test the learning component with the modified configuration file (change for example the tau parameter from 1 to 0.5, etc.)

Exercise II
- Implement an ML component based on SVM to learn ANNIE, e.g. to learn to identify the following concepts or named entities: Location, Address, Date, Person, Organization
- Materials (under directory hand-on-resources/ml/entity-learning)
  - We will use the testing data provided in Exercise I
- Create a corpus with the test data and prepare it for learning and testing
  - Annotate the corpus with ANNIE + the Entity grammar
- Inspired by the previous exercise, create a configuration file that will learn the concept Entity and its type (you cannot use Entity as a feature for learning!)
- Run an ML experiment using your configuration file, use the “evaluation” mode over the corpus and analyse the results

Exercise II
- As a variation, separate a few documents for testing, train the learner without the separated documents, and run it in application mode over the test documents
- You may want to use the annotationDiff tool to verify in each document how the learner performed

Learning relations: demonstration

Entities, modifiers, relations, coreference
- The CLEF project
- More sophisticated indexing and querying
  - Why was a drug given?
  - What were the results of an exam?

Supervised system architecture

Previous work
- Clinical relations have usually been extracted as part of a larger clinical IE system
- Extraction has usually involved syntactic parses, domain-specific grammars and knowledge bases, often hand crafted
- In other areas of biomedicine, statistical machine learning has come to predominate
- We apply statistical techniques to clinical relations

Entity types

Relation types

System architecture
- Diagram labels: Training and test texts; GATE pipeline; Pre-process; Pair entities; Generate features; Relation model learning and application; SVM models; Relation annotations

Learning relations
- Learn relations between pairs of entities
- Create all possible pairings of entities across n sentences in the gold standard, constrained by legal entity types
  - n: e.g. the same sentence, or adjacent sentences
- Generate features describing the characteristics of these pairs
- Build SVM models from these features
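
The pairing step can be sketched as follows. Everything named here is an assumption for illustration: the entity records, the type names, the table of legal type pairs and the candidate_pairs helper are hypothetical, and the sentence window is expressed as a maximum sentence distance (0 for the same sentence, 1 to include adjacent sentences).

```python
from itertools import permutations

# Hypothetical table of (argument 1 type, argument 2 type) pairs that may be related.
LEGAL_TYPE_PAIRS = {("Drug", "Condition"), ("Investigation", "Result")}

def candidate_pairs(entities, legal_type_pairs, max_distance=0):
    """List ordered entity pairs that are close enough and have a legal type pair."""
    pairs = []
    for a, b in permutations(entities, 2):
        close_enough = abs(a["sentence"] - b["sentence"]) <= max_distance
        if close_enough and (a["type"], b["type"]) in legal_type_pairs:
            pairs.append((a["id"], b["id"]))
    return pairs

# Hypothetical gold-standard entities with the index of the sentence they occur in.
entities = [
    {"id": "e1", "type": "Drug", "sentence": 0},
    {"id": "e2", "type": "Condition", "sentence": 0},
    {"id": "e3", "type": "Investigation", "sentence": 1},
    {"id": "e4", "type": "Result", "sentence": 3},
]

print(candidate_pairs(entities, LEGAL_TYPE_PAIRS, max_distance=0))
print(candidate_pairs(entities, LEGAL_TYPE_PAIRS, max_distance=2))
```

Each resulting pair becomes one learning instance; pairs that correspond to a gold-standard relation are positive examples and the rest are negative ones.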

Configuring in GATE
- theInstanceAnnotation
- featureForIdOfArg1
- featureForIdOfArg2
- ...

Creating entity pairings
- Entity pairings provide instances
- They will therefore provide features
- A “pairing and features” PR or JAPE needs to be run before the Learning PR
- Entities and features are problem specific
- We do not have a generic “pairing and features” PR
- You currently need to write your own

Feature examples

Performance by feature set