WP2: Hellenic NERC
Vangelis Karkaletsis, Dimitra Farmakiotou
Institute of Informatics & Telecommunications, NCSR “Demokritos”
Paris, December 5-6, 2002
© NCSR, Paris, December 5-6, 2002

HNERC
- XHTML documents are converted into Ellogon documents within an Ellogon Collection
- Preprocessing:
  - Tokenization – Zoning
  - Sentence Splitting
  - Lexical Analysis (POS Tagging, Lemmatization)
  - Gazetteer Look-up
- NERC:
  - 1st pass: identification and 1st classification
  - 2nd pass: classification using classified NEs
HNERC v.2: major improvements over HNERC v.1.x

Tokenization: domain-specific tokenization problems have been solved for several names, expressions, terms, or combinations of them that appear in the text without space, punctuation, or symbol characters between them:
- 14TFT → 14 TFT
- PIII300 → PIII 300
- 1024X768X16 → 1024 X 768 X 16
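The kind of splitting described above can be sketched with a few regular-expression rules; this is an illustrative sketch only, not HNERC's actual tokenizer (which runs inside Ellogon), and the rule set is an assumption:

```python
import re

def split_glued_tokens(token):
    """Insert spaces at letter/digit boundaries and around the 'X'
    used as a dimension separator (e.g. in screen resolutions).
    Illustrative rules only; the real HNERC tokenizer differs."""
    # Split resolutions like 1024X768X16 into 1024 X 768 X 16
    token = re.sub(r'(?<=\d)X(?=\d)', ' X ', token)
    # Split digit-letter boundaries: 14TFT -> 14 TFT
    token = re.sub(r'(?<=\d)(?=[A-Za-z])', ' ', token)
    # Split letter-digit boundaries: PIII300 -> PIII 300
    token = re.sub(r'(?<=[A-Za-z])(?=\d)', ' ', token)
    return token
```

Applying the rules in this order keeps the resolution separator `X` from being re-split by the letter/digit rules.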
HNERC v.2: major improvements over HNERC v.1.x

Gazetteer Look-up: lists have been updated to include a larger number of OS, Manuf and Software names (109 more names)

NERC Patterns: new patterns have been added in the following categories:
- patterns for filtering names that are not part of laptop descriptions
- patterns for names affected by the tokenizer changes
- evaluation on the new corpus
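A gazetteer look-up of the kind used in the preprocessing stage can be sketched as a longest-match scan over the token stream; the list contents and the matching strategy below are assumptions for illustration, not HNERC's actual gazetteer:

```python
# Minimal gazetteer look-up sketch (list contents are hypothetical).
GAZETTEER = {
    "MANUF": {"Toshiba", "IBM", "Compaq"},
    "SOFT_OS": {"Windows 98", "Windows 2000", "Linux"},
}

def gazetteer_lookup(tokens, max_len=3):
    """Return (start, end, category) spans for the longest
    gazetteer match starting at each position."""
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest phrase first so multi-word names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            for cat, names in GAZETTEER.items():
                if phrase in names:
                    match = (i, i + n, cat)
                    break
            if match:
                break
        if match:
            spans.append(match)
            i = match[1]          # skip past the matched span
        else:
            i += 1
    return spans
```

Preferring the longest match ensures "Windows 2000" is tagged as one SOFT_OS name rather than leaving "2000" as a stray token.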
HNERC v.2: major improvements over HNERC v.1.x

HNERC Evaluation Results 1 (without Demarcation)

Category    Precision  Recall  F-measure
MANUF
MODEL
SOFT_OS
PROCESSOR
DATE        0.92       1.00    0.96
DURATION
TIME
HNERC v.2: major improvements over HNERC v.1.x

HNERC Evaluation Results 2 (without Demarcation)

Category    Precision  Recall  F-measure
SPEED
CAPACITY
LENGTH
RESOLUTION
MONEY
PERCENT
WEIGHT
HNERC v.2: major improvements over HNERC v.1.x

- HNERC has been tested on a corpus that presented greater diversity in terms of product description categories (more pages with many laptop products, and more pages with both laptop and non-laptop products, than the previous corpus)
- Results are comparable to those of the previous evaluation for categories that were commonly found in the corpus
- Differences were observed in the results of categories that have very low frequency in the corpora (RESOLUTION, PERCENT)
- An evaluation has also been conducted using Demarcation input; this improved results slightly for a few categories, but lowered Recall and F-measure for MONEY significantly
Demarcation Tool: Evaluation
- conducted for the Hellenic and French testing corpora that had been annotated for names and products
- page categories and their frequency in the training corpus played an important role in the performance of the tool (better performance for the most common categories)
Evaluation: Hellenic Testing Corpus (1)

            Precision  Recall  F-measure
ALL  NE
     NUMEX
     TIMEX
Evaluation: Hellenic Testing Corpus (2)

            Precision  Recall  F-measure
A1   NE
     NUMEX
     TIMEX
B1   NE
     NUMEX
     TIMEX
B2   NE
     NUMEX
     TIMEX
Evaluation: French Testing Corpus (1)

            Precision  Recall  F-measure
ALL  NE
     NUMEX
     TIMEX
Evaluation: French Testing Corpus (2)

            Precision  Recall  F-measure
A1   NE
     NUMEX
     TIMEX
B1   NE
     NUMEX
     TIMEX
B2   NE
     NUMEX
     TIMEX
HNERC v.2: Name Matching
- recognizes instances of the same name within a single laptop description using pattern matching
- matching is conducted for MANUF, MODEL, PROCESSOR, OS, CAPACITY, SPEED, MONEY names and expressions
- evaluation:
  - uses manual annotations for names and product descriptions
  - the corpus has not been manually annotated for name matching, but name annotations combined with the norm and product_no attributes have been used to determine coreferential names and to create the key collection
  - was conducted for MANUF, PROCESSOR, OS
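The idea of grouping mentions of the same name within one description can be sketched as normalization plus grouping; the normalization rule below is a hypothetical stand-in for HNERC's actual matching patterns:

```python
import re
from collections import defaultdict

def normalize(name):
    """Normalize a name mention for matching (illustrative rule:
    case-fold and drop whitespace and common separator characters)."""
    return re.sub(r'[\s\-./]+', '', name).lower()

def match_names(mentions):
    """Group (category, surface_form) mentions from one laptop
    description into chains keyed by (category, normalized form).
    A sketch of the idea only, not HNERC's pattern-based matcher."""
    chains = defaultdict(list)
    for category, surface in mentions:
        chains[(category, normalize(surface))].append(surface)
    return dict(chains)
```

Keying the chains on the category as well as the normalized string keeps, say, a MODEL mention from being merged with a coincidentally similar PROCESSOR mention.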