MSM 2013 Challenge: Annotowatch
Stefan Dlugolinsky, Peter Krammer, Marek Ciglan, Michal Laclavik
Institute of Informatics, Slovak Academy of Sciences

Presentation transcript:

Approach
– Do not create a new NER method
– Combine various existing NER taggers, which are based on diverse methods, through machine learning (ML)
Candidate NER taggers and their methods:
MSM2013, Rio de Janeiro, 5/13/13

Candidate NER taggers: evaluation results
The taggers performed worse on micropost text than on the news text for which they were intended, but:
– recall was high when their results were unified (at the cost of low precision)
– together they can discover different named entities
We felt that superior performance could be achieved by combining the taggers.

Evaluation details 1/4
The taggers were evaluated on a modified MSM 2013 Challenge training dataset.
Modifications made to the original training set:
– Removed duplicate and overlapping microposts
– Removed country/nation adjectivals and demonyms (e.g. MISC/English)
The modified dataset contains 2752 unique microposts.
No tweaking or configuration was applied to the evaluated taggers prior to evaluation.

Evaluation details 2/4
S – strict comparison (exact offsets), L – lenient comparison (overlap), A – average of S and L
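The strict and lenient comparisons named on this slide can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the span layout `(start, end, class)`, the requirement that lenient matches share the same NE class, and all function names are assumptions.

```python
from typing import Callable, List, Tuple

# (start offset, end offset exclusive, NE class); this layout is an assumption.
Span = Tuple[int, int, str]


def strict_match(a: Span, b: Span) -> bool:
    """Strict (S): same class and exactly the same offsets."""
    return a == b


def lenient_match(a: Span, b: Span) -> bool:
    """Lenient (L): same class and overlapping character ranges."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def precision_recall(
    response: List[Span], gold: List[Span], match: Callable[[Span, Span], bool]
) -> Tuple[float, float]:
    """Precision and recall of a tagger response against the gold standard."""
    matched_response = sum(1 for r in response if any(match(r, g) for g in gold))
    matched_gold = sum(1 for g in gold if any(match(r, g) for r in response))
    precision = matched_response / len(response) if response else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall


def average_scores(response: List[Span], gold: List[Span]) -> Tuple[float, float]:
    """Average (A): mean of the strict and lenient scores."""
    ps, rs = precision_recall(response, gold, strict_match)
    pl, rl = precision_recall(response, gold, lenient_match)
    return (ps + pl) / 2, (rs + rl) / 2


# Made-up example data for illustration only.
gold = [(0, 8, "PER"), (20, 26, "LOC")]
response = [(0, 8, "PER"), (18, 26, "LOC"), (30, 35, "ORG")]
```

In the made-up example, the second response span has the right class but different offsets, so it counts only under the lenient comparison.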

Evaluation details 3/4

Evaluation details 4/4
Unified NER taggers’ results (evaluated over the training set)

Chosen NER taggers
– ANNIE
– Apache OpenNLP
– Illinois NER
– Illinois Wikifier
– OpenCalais
– Stanford NER
– Wikipedia Miner
These tools were not specially configured, trained or tweaked for microposts. Default settings and the most suitable official models were used. Mapping their output to the target entity classes was the only “hack” applied to these tools.
We created one special tagger, “Miscinator”, for MISC extraction of entertainment awards and sport events. It is based on a gazetteer created from MISC entities found in the training set and enhanced by Google Sets.

How to combine the taggers
– Do not simply take the best tagger for each NE class, i.e. extract LOC, MISC and ORG with OpenCalais and PER with Illinois NER (we called this the “Dummy model”)
– Transform the combination into a machine learning task
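For comparison, the “Dummy model” baseline described on this slide might look like the sketch below. The tagger-to-class mapping comes from the slide; the span layout `(start, end, class)`, the dictionary input and the function name are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, NE class); layout is an assumption

# Best tagger per NE class, as stated on the slide.
BEST_TAGGER = {
    "LOC": "OpenCalais",
    "MISC": "OpenCalais",
    "ORG": "OpenCalais",
    "PER": "Illinois NER",
}


def dummy_model(annotations_by_tagger: Dict[str, List[Span]]) -> List[Span]:
    """Keep, for each NE class, only the annotations of its best tagger."""
    result = []
    for ne_class, tagger in BEST_TAGGER.items():
        for span in annotations_by_tagger.get(tagger, []):
            if span[2] == ne_class:
                result.append(span)
    return sorted(result)


# Made-up tagger output for one micropost.
example = {
    "OpenCalais": [(0, 5, "ORG"), (10, 15, "PER")],
    "Illinois NER": [(10, 15, "PER"), (20, 26, "LOC")],
}
combined = dummy_model(example)
```

Here the PER span from OpenCalais and the LOC span from Illinois NER are discarded, because neither tagger is the designated one for that class.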

Features for ML 1/4
Micropost features:
– describe the micropost text globally (as a whole)
– consider only words with length > 3
awc – all words capitalized
awuc – all words upper case
awlc – all words lower case
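A possible reading of the three micropost features is sketched below. The slide does not define tokenization or edge cases, so the letter-only word pattern, the interpretation of awc as "every word starts with a capital letter", and returning False when no word qualifies are all assumptions.

```python
import re
from typing import Dict


def micropost_features(text: str, min_len: int = 4) -> Dict[str, bool]:
    """Global case features over words longer than 3 characters."""
    # Runs of letters only; this tokenization is an assumption.
    words = [w for w in re.findall(r"[^\W\d_]+", text) if len(w) >= min_len]
    has_words = bool(words)
    return {
        "awc": has_words and all(w[0].isupper() for w in words),
        "awuc": has_words and all(w.isupper() for w in words),
        "awlc": has_words and all(w.islower() for w in words),
    }


features = micropost_features("OMG THIS ROCKS")
```

In the example, "OMG" is ignored because it has only three characters, and the remaining words are fully upper case.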

Features for ML 2/4
Annotation features
– Annotations of the underlying NER taggers
– Considered annotation types: LOC, MISC, ORG, PER, NP, VP, OTHER
– Describe each annotation found by an underlying NER tagger (reference/ML instance annotation)
ne – annotation class
flc – first letter capital
aluc – all letters upper case
allc – all letters lower case
cw – capitalized words
wc – word count
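The annotation features could be computed from the annotated surface text roughly as follows. This is a sketch under assumptions: words are split on whitespace, cw is taken to be the number of capitalized words (the slide does not say whether it is a count or a flag), and the function name is hypothetical.

```python
from typing import Dict, Union


def annotation_features(surface: str, ne_class: str) -> Dict[str, Union[str, bool, int]]:
    """Describe one annotation found by an underlying NER tagger."""
    words = surface.split()
    letters = [c for c in surface if c.isalpha()]
    return {
        "ne": ne_class,                                    # annotation class
        "flc": surface[:1].isupper(),                      # first letter capital
        "aluc": bool(letters) and all(c.isupper() for c in letters),
        "allc": bool(letters) and all(c.islower() for c in letters),
        "cw": sum(1 for w in words if w[0].isupper()),     # assumed to be a count
        "wc": len(words),                                  # word count
    }


example = annotation_features("New York", "LOC")
```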

Features for ML 3/4
Overlap features
– Describe how the reference annotation overlaps with other annotations
ail – average intersection length of the reference annotation with other annotations of the same type*
aiia – how much of the other annotations of the same type* the reference annotation covers, on average
aiir – what percentage of the reference annotation length is covered by other annotations of the same type*, on average
* the reference annotation can be of a different type
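One way to compute the three overlap features for a single annotation type is sketched below. Several details are assumptions, since the slide gives only one-line definitions: spans are `(start, end)` character offsets with exclusive end, the averages run only over annotations that actually intersect the reference, and aiir is returned as a fraction between 0 and 1 rather than a percentage.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # (start, end exclusive); layout is an assumption


def overlap_features(ref: Range, others: List[Range]) -> Dict[str, float]:
    """Overlap of a reference annotation with other annotations of one type."""
    ref_start, ref_end = ref
    ref_len = ref_end - ref_start
    hits = []  # (intersection length, length of the other annotation)
    for start, end in others:
        inter = min(ref_end, end) - max(ref_start, start)
        if inter > 0:
            hits.append((inter, end - start))
    if not hits:
        return {"ail": 0.0, "aiia": 0.0, "aiir": 0.0}
    k = len(hits)
    return {
        "ail": sum(i for i, _ in hits) / k,             # mean intersection length
        "aiia": sum(i / n for i, n in hits) / k,        # mean share of the others covered
        "aiir": sum(i / ref_len for i, _ in hits) / k,  # mean share of the reference covered
    }


feats = overlap_features((10, 20), [(10, 15), (12, 30), (40, 50)])
```

The function would be called once per annotation type (LOC, MISC, ORG, ...), which is consistent with the footnote that the reference annotation may have a different type than the annotations it is compared with.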

Features for ML 4/4
Confidence features
– Some taggers return their confidence in an annotation (OpenCalais, Illinois NER)
E(p) – mean of the confidence values of overlapping annotations
var(p) – variance of the confidence values of overlapping annotations
Answer
– NE class of the manual annotation that overlaps the instance/reference annotation
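The two confidence features reduce to a mean and a variance over the confidence values of the overlapping annotations. The sketch below assumes population variance and a default of 0.0 when no overlapping annotation carries a confidence value; the slide specifies neither.

```python
from statistics import fmean, pvariance
from typing import List, Tuple


def confidence_features(confidences: List[float]) -> Tuple[float, float]:
    """Return (E(p), var(p)) for the confidences of overlapping annotations."""
    if not confidences:
        return 0.0, 0.0  # assumed default; not specified in the source
    return fmean(confidences), pvariance(confidences)


mean_p, var_p = confidence_features([0.8, 0.6])
```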

Model training
Algorithms tried for training a classification model:
– C4.5
– LMT (Logistic Model Trees)
– NBTree (Naive Bayes tree)
– REPTree (fast decision tree learner)
– SimpleCart
– LADTree (LogitBoost Alternating Decision Tree)
– Random Forest
– AdaBoostM1
– MultiLayer Perceptron neural network
– Bayes Network
– Bagging Tree
– FT (Functional Trees)
Input data:
– ~36,000 instances
– 200 attributes
Input data preprocessing:
– removed duplicate instances
– removed attributes whose values changed for only an insignificant number of instances
Preprocessed input data, ready for training:
– ~31,000 instances
– 100 attributes
Validation:
– 10-fold cross-validation
– holdout
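The preprocessing and validation steps listed on this slide can be illustrated with a small, library-free sketch. It is not the authors' pipeline: the row layout (class label in the last column), the `min_minority` threshold standing in for "insignificant number of instances", and the unshuffled fold assignment are all assumptions.

```python
from collections import Counter
from typing import Iterator, List, Tuple


def preprocess(rows: List[list], min_minority: int = 2) -> Tuple[List[list], List[int]]:
    """Drop duplicate instances, then drop near-constant attributes.

    The last column is assumed to hold the class label and is always kept.
    """
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    if not unique:
        return [], []
    n_cols = len(unique[0])
    keep = []
    for j in range(n_cols - 1):
        counts = Counter(row[j] for row in unique)
        majority = counts.most_common(1)[0][1]
        # Keep the attribute only if enough instances deviate from its majority value.
        if len(unique) - majority >= min_minority:
            keep.append(j)
    keep.append(n_cols - 1)
    return [[row[j] for j in keep] for row in unique], keep


def k_fold_indices(n: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation, unshuffled."""
    for fold in range(k):
        test = list(range(fold, n, k))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test


# Made-up instances: three attributes followed by the class label.
rows = [
    [1, 0, 5, "PER"],
    [1, 0, 5, "PER"],
    [0, 0, 5, "LOC"],
    [1, 0, 6, "ORG"],
    [0, 0, 5, "PER"],
]
cleaned, kept_columns = preprocess(rows)
```

In the example, the duplicate first row is removed, the constant second attribute is dropped, and the third attribute is dropped because only one instance deviates from its majority value.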

Evaluation over the MSM test dataset
– Annotowatch 1, 2 and 3 are our submissions to the challenge (v3 used special post-processing)
– RandomForest 21 and C4.5 M13 are new models trained after the submission deadline
– Dummy model is our baseline (simply built from the best tagger for each NE class)
– The test dataset was cleaned of duplicates and some corrections were made (results may differ from the official challenge results)
– The comparison of the response and the gold standard was strict

Thank you! Questions?

Annotowatch 3 post-processing
– If the annotated micropost was identical to one in the training set, we extended the detected concepts with those from the manually annotated training data (affecting three microposts)
– A gazetteer built from a list of organizations found in the training set was used to extend the ORG annotations of the model (affecting 69 microposts)
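The two post-processing rules can be sketched as below. The data structures are hypothetical: `training_annotations` maps a micropost text to its manual annotations, spans are `(start, end, class)` tuples, and gazetteer entries are matched case-sensitively on word boundaries. The slides do not describe how the matching was actually done.

```python
import re
from typing import Dict, Iterable, List, Tuple

Span = Tuple[int, int, str]  # (start, end exclusive, NE class); an assumption


def post_process(
    text: str,
    predicted: List[Span],
    training_annotations: Dict[str, List[Span]],
    org_gazetteer: Iterable[str],
) -> List[Span]:
    """Extend model output with training annotations and gazetteer ORG hits."""
    result = set(predicted)
    # Rule 1: micropost identical to a training micropost.
    result.update(training_annotations.get(text, []))
    # Rule 2: organizations listed in the gazetteer.
    for name in org_gazetteer:
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            result.add((match.start(), match.end(), "ORG"))
    return sorted(result)


# Made-up micropost and gazetteer for illustration.
text = "Apple hires ex-NASA engineer"
extended = post_process(text, [], {text: [(0, 5, "ORG")]}, ["NASA", "Apple"])
```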