
Presentation transcript:

Presenters: Arni, Sanjana

- Subtask of Information Extraction
- Identify known entity names: persons, places, organizations, etc.
- Identify the boundaries of the NE and identify its type
- Example:
  - Input: Pope Benedict XVI, the leader of the Catholic church wants to retire in Rome.
  - Output: [PER Pope Benedict XVI], the leader of [ORG the Catholic church] wants to retire in [LOC Rome].
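The bracketed output above can be produced from per-token entity labels. A minimal sketch (the helper name and BIO label scheme are our illustration, not necessarily what the presenters used) of rendering BIO-tagged tokens as bracketed spans:

```python
def bio_to_brackets(tokens, labels):
    """Render BIO-labelled tokens as bracketed entity spans, e.g. [PER ...]."""
    out, i = [], 0
    while i < len(tokens):
        label = labels[i]
        if label.startswith("B-"):
            etype, span = label[2:], [tokens[i]]
            i += 1
            # Consume continuation tokens of the same entity type
            while i < len(tokens) and labels[i] == "I-" + etype:
                span.append(tokens[i])
                i += 1
            out.append("[%s %s]" % (etype, " ".join(span)))
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

tokens = ["Pope", "Benedict", "XVI", "wants", "to", "retire", "in", "Rome", "."]
labels = ["B-PER", "I-PER", "I-PER", "O", "O", "O", "O", "B-LOC", "O"]
print(bio_to_brackets(tokens, labels))
# [PER Pope Benedict XVI] wants to retire in [LOC Rome] .
```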

- Random Forest
  - Ensemble of trained decision trees
  - Bootstrapping: randomly sample the labelled training data with replacement
- Bayesian optimization (new!)
  - Used to set hyperparameter values
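The bootstrapping step can be sketched as drawing one resample of the training data, with replacement, per tree (an illustrative sketch in plain Python; a real random forest would then fit a decision tree on each resample):

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) examples with replacement, as in bagging."""
    return [rng.choice(data) for _ in range(len(data))]

rng = random.Random(0)
train = list(range(10))
samples = [bootstrap_sample(train, rng) for _ in range(3)]  # one resample per tree
# Each resample has the original size but typically repeats some examples,
# leaving roughly 37% of the originals out of any given resample on average.
assert all(len(s) == len(train) for s in samples)
```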

- We use the Reuters corpus available in NLTK instead
- It contains 10,788 news documents totalling 1.3 million words
- We use 5,000 sentences, about 10% of the total data
- We use the NLTK NER tagger as the gold standard [noisy!]

- We extracted 11 features from our corpus
- Orthographic: capitalization, word length, digits, dashes, …
- Contextual: part-of-speech tags
- Gazetteer lists: unambiguous NEs from the training data (locations, names, corporate names)
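The orthographic features listed above can be computed per token; a minimal sketch (feature names are ours, and this covers only the orthographic subset, not the contextual or gazetteer features):

```python
def orthographic_features(word):
    """Surface features of a single token: casing, length, digits, dashes."""
    return {
        "is_capitalized": word[:1].isupper(),
        "is_all_caps": word.isupper(),
        "length": len(word),
        "has_digit": any(c.isdigit() for c in word),
        "has_dash": "-" in word,
    }

print(orthographic_features("Benedict"))
```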

- Highly unbalanced data
  - Most words are not named entities (i.e. they fall in class 0)
  - Risk of a strong bias towards class 0
- Solution:
  - Undersample the 0-class in the training data
  - Use all data from the other classes
  - This creates an extra bootstrapping parameter, B
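The undersampling step can be sketched as follows. The slides only name an extra bootstrapping parameter B; tying B to the number of entity-class examples is our assumption for illustration:

```python
import random

def undersample(examples, labels, B, rng):
    """Keep all non-zero-class examples and roughly B times as many class-0 examples.

    Assumption: B scales the retained class-0 count relative to the
    entity-class count; the slides do not define B precisely.
    """
    pos = [x for x, y in zip(examples, labels) if y != 0]
    neg = [x for x, y in zip(examples, labels) if y == 0]
    k = min(len(neg), int(B * len(pos)))
    return pos + rng.sample(neg, k)

labels = [0, 0, 0, 0, 1, 2]
examples = list("abcdef")
balanced = undersample(examples, labels, B=1.0, rng=random.Random(0))
assert len(balanced) == 4  # 2 entity tokens + 2 sampled class-0 tokens
```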

- We use 10-fold cross-validation to evaluate our model's performance
- Metrics: precision & recall, F-measure (accuracy is uninformative here)
- How do we choose hyperparameter values (# of trees, max depth, # of features, B)?
- Answer: Bayesian optimization
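Precision, recall, and F-measure follow directly from true-positive, false-positive, and false-negative counts; a minimal sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = tp/(tp+fp), recall = tp/(tp+fn), F1 = their harmonic mean."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
```

On unbalanced data like this, accuracy is uninformative because always predicting class 0 already scores very high, while precision and recall on the entity classes expose that failure.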

- Grid-search optimization is too expensive!
- Bayesian optimization greatly reduces the number of model evaluations:
  - Assumes a Gaussian process prior
  - Updates the GP posterior on each iteration
- Different heuristics exist for picking the next point; we optimize Expected Improvement
- We use the Spearmint B.O. package by Snoek et al.
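Expected Improvement has a closed form given the GP posterior mean and standard deviation at a candidate point. A minimal sketch for a maximization problem (Spearmint handles this internally; this is only the underlying formula):

```python
import math

def expected_improvement(mu, sigma, best):
    """EI of a candidate whose objective is N(mu, sigma^2) under the GP posterior,
    relative to the best value observed so far (maximization)."""
    if sigma == 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))      # standard normal CDF
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (mu - best) * cdf + sigma * pdf

# A candidate with higher posterior mean (same uncertainty) has higher EI
assert expected_improvement(1.0, 0.5, 0.8) > expected_improvement(0.5, 0.5, 0.8)
```

EI balances exploitation (high mean) against exploration (high uncertainty): even a candidate whose mean equals the incumbent has positive EI when sigma is nonzero.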

Optimal values: B = …, #features = 8, max_depth = 7

Precision-recall curve for each iteration of Bayesian optimization: an unexpected proportional relationship!

Random forest with optimal hyperparameters: sample confusion matrix

- Random forests are a feasible model for NER
- Bayesian optimization is a powerful way of optimizing hyperparameters
- This work could benefit from:
  - Better NER tags in the training data
  - More features (e.g. n-grams)
  - More training data (currently 160,000 words)