Evaluation CSCI-GA.2590 – Lecture 6A Ralph Grishman NYU

Measuring Performance
NLP systems will be based on models with simplifying assumptions and limited training, so their performance will never be perfect. To be able to improve the systems we build, we need to be able to measure the performance of individual components and of the entire system.

Accuracy
For part-of-speech tagging, accuracy is a simple and reasonable metric:
accuracy = (tokens with correct tag) / (total tokens)
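A minimal sketch of this computation in Python (not from the slides); it assumes the gold-standard and system tags are given as parallel per-token lists:

    def tag_accuracy(gold_tags, system_tags):
        # Parallel lists of per-token tags for the same text.
        correct = sum(1 for g, s in zip(gold_tags, system_tags) if g == s)
        return correct / len(gold_tags)

    # Example: 3 of 4 tokens tagged correctly -> accuracy = 0.75
    print(tag_accuracy(["NNP", "VBZ", "DT", "NN"],
                       ["NNP", "VBZ", "DT", "JJ"]))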

Accuracy can be Misleading
For tasks where one tag predominates, accuracy can overstate performance. Consider name tagging for texts where 10% of the tokens are names: a ‘baseline’ name tagger which tags every token as ‘other’ (not a name) would be rated as 90% accurate even though it finds no names.

Precision and Recall
Instead of counting the tags themselves, we count the names defined by these tags:
key = number of names in the key
response = number of names in the system response
correct = number of names in the response which exactly match (in type and extent) a name in the key
Then:
precision = correct / response
recall = correct / key

Precision & Recall (Example)
Sentence: Mary Smith runs the New York Supreme Court.
NE key = 2
NE system response = 3
NE correct = 1
recall = 1 / 2 = 50%
precision = 1 / 3 = 33%
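A minimal Python sketch of this scoring, treating each name as a (type, start, end) tuple; the specific spans below are hypothetical, chosen only to reproduce the counts on the slide (key = 2, response = 3, correct = 1):

    def score_names(key, response):
        correct = len(key & response)          # exact match in type and extent
        precision = correct / len(response)
        recall = correct / len(key)
        return precision, recall

    # Hypothetical spans consistent with the slide's counts.
    key = {("PERSON", 0, 1), ("ORGANIZATION", 4, 7)}
    response = {("PERSON", 0, 1), ("GPE", 4, 5), ("ORGANIZATION", 6, 7)}

    p, r = score_names(key, response)
    print(f"precision = {p:.0%}, recall = {r:.0%}")   # precision = 33%, recall = 50%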

F-measure
We sometimes want a single measure to compare systems. The usual choice is F-measure, the harmonic mean of recall and precision:
F = 2 × precision × recall / (precision + recall)
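Continuing the sketch above, a small helper for the harmonic mean:

    def f_measure(precision, recall):
        # Harmonic mean of precision and recall; taken as 0 when both are 0.
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(f"F = {f_measure(p, r):.2f}")   # F = 0.40 for the example above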

Honest Test Data
For honest evaluations, test data should remain ‘blind’:
– avoid training to the test
– for a corpus-trained system, set aside separate test data
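A minimal sketch of setting aside separate test data (not from the slides); the 90/10 split and fixed seed are arbitrary illustrative choices:

    import random

    def train_test_split(sentences, test_fraction=0.1, seed=0):
        # Shuffle once with a fixed seed, then set aside a fraction as blind test data;
        # the test portion should not be examined or tuned against during development.
        rng = random.Random(seed)
        shuffled = list(sentences)
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        return shuffled[n_test:], shuffled[:n_test]   # (train, test)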

Cross-Validation
[diagram: the corpus divided into segments labeled ‘train’ and ‘test’, with the test segment rotated across the splits]
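A minimal sketch of k-fold cross-validation along these lines; train_tagger and evaluate are hypothetical stand-ins for the system's training and scoring steps:

    def cross_validate(sentences, k=5):
        # Partition the corpus into k folds; each fold serves once as the test set
        # while the remaining folds are used for training.
        folds = [sentences[i::k] for i in range(k)]
        scores = []
        for i in range(k):
            test = folds[i]
            train = [s for j, fold in enumerate(folds) if j != i for s in fold]
            model = train_tagger(train)           # hypothetical training function
            scores.append(evaluate(model, test))  # hypothetical scoring function
        return sum(scores) / k                    # average score over the k folds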