1/(13) Using Corpora and Evaluation Tools Diana Maynard Kalina Bontcheva

Presentation transcript:

1/(13) Using Corpora and Evaluation Tools Diana Maynard Kalina Bontcheva March 2004

2/(13) Corpus structure
Located in gatecorpora in CVS
Each directory under gatecorpora has a corpus, e.g. gatecorpora/ace
Each corpus can have sub-parts, e.g. ace/bnews
Each (sub-)corpus has a clean and a marked directory; these are important
clean holds the unannotated versions, while marked holds the human-marked ones
There may also be a processed subdirectory; this is a datastore (unlike the other two)
Corresponding files in each subdirectory must have the same name
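The same-name requirement between clean and marked can be checked before running any evaluation. The following is a minimal sketch, not part of the GATE utilities: the function name mismatched_files is hypothetical, and it assumes a (sub-)corpus directory that contains plain clean and marked subdirectories as described on this slide.

```python
from pathlib import Path


def mismatched_files(corpus_dir):
    """Return (only_in_clean, only_in_marked) as sorted lists of file names.

    Assumes corpus_dir contains 'clean' and 'marked' subdirectories,
    following the layout described on the slide.
    """
    corpus = Path(corpus_dir)
    clean = {p.name for p in (corpus / "clean").iterdir() if p.is_file()}
    marked = {p.name for p in (corpus / "marked").iterdir() if p.is_file()}
    # Files present on one side only break the clean/marked correspondence.
    return sorted(clean - marked), sorted(marked - clean)
```

Two empty lists mean every clean document has a human-marked counterpart with the same name.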

3/(13) Tools for corpus manipulation
Many tools are available in gatecorpora/utilities and in subdirectories of each corpus
Many of the corpora, e.g. MUC and ACE, come in different formats (e.g. inline vs. standoff markup) and have been converted to GATE-style annotations
There are also tools for e.g. counting things and changing annotation names (mostly JAPE grammars)

4/(13) Corpora available
MUC7 (newswires)
MUSE (news texts from the web)
ACE
ACE Chinese
ACE Arabic
Romanian (news texts; 1984)
CMU seminars
Jobs
CONLL’03 – part of Reuters with NEs
Bulgarian – news

5/(13) MUC 7 corpus
Newswires used in the official MUC 7 evaluation
Data available in MUC format and GATE format
Annotation types: Person, Location, Organization, Money, Percent, Date, Time
Division into training and test sets

6/(13) MUSE corpus
News texts from various websites (BBC, Guardian, etc.)
Annotation types: Person, Organisation, Location, Date, Time, Money, Percent, Address
Slight differences in annotation guidelines from MUC, e.g. people’s titles are included in names
Available from gatecorpora/news in various subdirectories

7/(13) ACE corpus
3 types of text: newswire, broadcast news and newspaper
Broadcast news and newspaper available as ground truth and original (degraded) texts
Annotation types: Person, Organisation, Location, GPE, Facility
Some annotations have roles to indicate metonymous usage
Guidelines are different from MUC and MUSE
Available from gatecorpora/ace in various subdirectories

8/(13) Multilingual ACE
As for ACE, but in Chinese and Arabic
Texts are in UTF-8
No degraded versions of these texts
Available from gatecorpora/ace/ace03/Chinese/ and gatecorpora/ace/ace03/Arabic/

9/(13) CMU Seminars & Jobs
Corpora frequently used to evaluate relation extraction and wrapper induction systems
Available from gatecorpora/jobs-corpus and gatecorpora/cmu-seminars
Converted into GATE XML, ready for use

10/(13) CONLL’03 shared task
Corpus used in the CONLL’03 shared task for evaluating NE recognition
In English, part of the Reuters corpus
Markup is e.g. [...], not converted to MUSE tags
Use reuterstogate.jape to convert to MUSE tags
Available from gatecorpora/ReutersWithNamedEntities

11/(13) Annotation Diff: per-document evaluation
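Per-document evaluation compares the human-marked (key) annotations with the system's (response) annotations and reports precision, recall and F-measure. The sketch below shows that arithmetic under simplifying assumptions: annotations are represented as hypothetical (start, end, type) tuples and only exact matches count as correct, so partially overlapping annotations are not credited. It is not the Annotation Diff implementation.

```python
def prf(key, response):
    """Strict precision, recall and F1 for one document.

    key and response are iterables of (start, end, type) tuples;
    an annotation is correct only if all three fields match exactly.
    """
    key, response = set(key), set(response)
    correct = len(key & response)
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Example: one of two response annotations matches one of two key annotations.
key = {(0, 5, "Person"), (10, 16, "Location")}
response = {(0, 5, "Person"), (20, 25, "Organization")}
print(prf(key, response))
```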

12/(13) Regression Test
At corpus level: corpus benchmark tool
Tracks the system’s performance over time
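The idea of tracking performance over time can be illustrated by comparing per-document scores from a stored run with those from the current run. This is a minimal sketch with hypothetical names (regressions, baseline, current, tolerance); it assumes the scores are F-measures keyed by document name and does not reproduce the corpus benchmark tool's own output.

```python
def regressions(baseline, current, tolerance=0.0):
    """Return {doc: (old_score, new_score)} for documents whose score dropped.

    baseline and current map document names to F-measure values.
    A drop counts only if it exceeds tolerance; documents missing
    from the current run are ignored.
    """
    return {
        doc: (old, current[doc])
        for doc, old in baseline.items()
        if doc in current and old - current[doc] > tolerance
    }
```

An empty result means no document got worse than the stored run by more than the tolerance.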

13/(13) How it works
Clean, marked, and processed
corpus_tool.properties – must be in the directory from where GATE is executed
Specifies configuration information about
– What annotation types are to be evaluated
– Threshold below which to print out debug info
– Input set name and key set name
Modes
– Default – regression testing
– Human marked against already stored, processed
– Human marked against current processing results
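To make the configuration items concrete, the sketch below parses a small properties-style text and uses the threshold to pick the documents that would get debug output. The key names (annotTypes, threshold, annotSetName, outputSetName) and all values are assumptions chosen for illustration; check the GATE documentation for the names the corpus benchmark tool actually reads.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


# Hypothetical file contents; key names and values are assumptions.
EXAMPLE = """\
# annotation types to evaluate
annotTypes=Person;Organization;Location
# print debug info for documents scoring below this value
threshold=0.5
# key (human-marked) set and input (system output) set
annotSetName=Key
outputSetName=ANNIE
"""


def docs_needing_debug(scores, props):
    """Return sorted names of documents whose score is below the threshold."""
    threshold = float(props["threshold"])
    return sorted(doc for doc, score in scores.items() if score < threshold)
```

For example, docs_needing_debug({"a.xml": 0.4, "b.xml": 0.9}, parse_properties(EXAMPLE)) selects only a.xml.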

14/(13) Conclusion
This talk:
More information: