Published by Shanon Young. Modified over 9 years ago.
1/(13) Using Corpora and Evaluation Tools
Diana Maynard, Kalina Bontcheva
http://gate.ac.uk/
http://nlp.shef.ac.uk/
March 2004
2/(13) Corpus structure
Corpora are located in gatecorpora in CVS.
Each directory under gatecorpora holds a corpus, e.g. gatecorpora/ace.
Each corpus can have sub-parts, e.g. ace/bnews.
Each (sub-)corpus has a clean and a marked directory; both are important.
clean holds the unannotated versions, while marked holds the human-annotated ones.
There may also be a processed subdirectory; unlike the other two, this is a datastore.
Corresponding files in each subdirectory must have the same name.
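The naming rule above (corresponding files in clean and marked must share a name) can be checked mechanically. The following is a minimal sketch, not part of the GATE tooling: the function name compare_clean_and_marked is hypothetical, and it assumes the two directories sit directly under the (sub-)corpus directory and contain plain files.

```python
from pathlib import Path


def compare_clean_and_marked(corpus_dir):
    """Return (only_in_clean, only_in_marked) as sorted lists of file names.

    Hypothetical helper: assumes corpus_dir contains 'clean' and 'marked'
    subdirectories, as described on the slide.
    """
    corpus = Path(corpus_dir)
    clean = {p.name for p in (corpus / "clean").iterdir() if p.is_file()}
    marked = {p.name for p in (corpus / "marked").iterdir() if p.is_file()}
    # Names present on one side only break the "same name" requirement.
    return sorted(clean - marked), sorted(marked - clean)
```

Two empty lists mean every clean document has a human-annotated counterpart and vice versa. The processed subdirectory is deliberately ignored here, since it is a datastore rather than a directory of plain files.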
3/(13) Tools for corpus manipulation
Many tools are available in gatecorpora/utilities and in subdirectories of each corpus.
Many of the corpora, e.g. MUC and ACE, come in different formats (e.g. inline vs. standoff markup) and have been converted to GATE-style annotations.
There are also tools for e.g. counting annotations and changing annotation names (mostly JAPE grammars).
4/(13) Corpora available
MUC7 (newswires)
MUSE (news texts from the web)
ACE
ACE Chinese
ACE Arabic
Romanian (news texts; 1984)
CMU seminars
Jobs
CONLL’03 (part of Reuters with NEs)
Bulgarian (news)
5/(13) MUC 7 corpus
Newswires used in the official MUC 7 evaluation.
Data available in MUC format and GATE format.
Annotation types: Person, Location, Organization, Money, Percent, Date, Time.
Divided into training and test sets.
6/(13) MUSE corpus
News texts from various websites (BBC, Guardian, etc.).
Annotation types: Person, Organisation, Location, Date, Time, Money, Percent, Address.
Slight differences in annotation guidelines from MUC, e.g. people’s titles are included in names.
Available from gatecorpora/news in various subdirectories.
7/(13) ACE corpus
3 types of text: newswire, broadcast news and newspaper.
Broadcast news and newspaper are available as ground truth and original (degraded) texts.
Annotation types: Person, Organisation, Location, GPE, Facility.
Some annotations have roles to indicate metonymous usage.
Guidelines are different from MUC and MUSE.
Available from gatecorpora/ace in various subdirectories.
8/(13) Multilingual ACE
As for ACE, but in Chinese and Arabic.
Texts are in UTF-8.
No degraded versions of these texts.
Available from gatecorpora/ace/ace03/Chinese/ and gatecorpora/ace/ace03/Arabic/.
9/(13) CMU Seminars & Jobs
Corpora frequently used to evaluate relation extraction and wrapper induction systems.
Located in gatecorpora/jobs-corpus and gatecorpora/cmu-seminars.
Converted into GATE XML, ready for use.
10/(13) CONLL’03 shared task
Corpus used in the CONLL’03 shared task for evaluating NE recognition.
In English; part of the Reuters corpus.
Markup is in the original CONLL format, not converted to MUSE tags.
Use reuterstogate.jape to convert to MUSE tags.
Located in gatecorpora/ReutersWithNamedEntities.
11/(13) Annotation Diff: per-document evaluation
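As a rough illustration of what a per-document comparison measures, the sketch below computes precision, recall and F-measure between a human-marked (key) set and a system (response) set. It is an assumption-laden simplification rather than GATE's implementation: annotations are modelled as (start, end, type) tuples, the function name strict_prf is hypothetical, and only exact matches count as correct, whereas GATE's Annotation Diff also reports partially correct matches.

```python
def strict_prf(key, response):
    """Precision, recall and F1 under exact (start, end, type) matching.

    key      -- human-marked annotations, as (start, end, type) tuples
    response -- system annotations, in the same form
    """
    key_set, response_set = set(key), set(response)
    correct = len(key_set & response_set)
    precision = correct / len(response_set) if response_set else 0.0
    recall = correct / len(key_set) if key_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, with a key of one Person and one Location and a response that finds the same Person plus a spurious Date, one of two response annotations is correct and one of two key annotations is found.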
12/(13) Regression Test
At corpus level: the corpus benchmark tool tracks the system’s performance over time.
13/(13) How it works
Uses the clean, marked, and processed directories.
Corpus_tool.properties must be in the directory from which GATE is executed.
It specifies configuration information about:
– Which annotation types are to be evaluated
– The threshold below which to print out debug info
– The input set name and key set name
Modes:
– Default: regression testing
– Human-marked against already stored, processed results
– Human-marked against current processing results
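To make the configuration items above concrete, here is a small sketch of reading such a properties file and applying the debug threshold. Everything specific in it is an assumption for illustration: the key names (threshold, annotSetName, outputSetName, annotTypes), the semicolon-separated type list and the example values are not taken from the slides and should be checked against the GATE documentation; parse_properties and documents_to_debug are hypothetical helpers.

```python
# Hypothetical file contents; key names and values are assumptions.
EXAMPLE = """\
# which score triggers debug output
threshold=0.7
# key (human-marked) set and system output set
annotSetName=Key
outputSetName=ANNIE
# annotation types to evaluate
annotTypes=Person;Organization;Location
"""


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def documents_to_debug(f_scores, threshold):
    """Names of documents whose F-measure falls below the threshold."""
    return sorted(name for name, f in f_scores.items() if f < threshold)
```

With the example above, a document scoring 0.5 would be listed for debug output and one scoring 0.9 would not.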
14/(13) Conclusion
This talk: http://gate.ac.uk/sale/talks/corpora-tutorial.ppt
More information: http://gate.ac.uk/