1
WP2: Hellenic NERC
Vangelis Karkaletsis, Dimitra Farmakiotou
Institute of Informatics & Telecommunications, NCSR “Demokritos”
Paris, December 5-6, 2002
2
HNERC

XHTML documents are converted into Ellogon documents within an Ellogon Collection.

Preprocessing
- Tokenization and Zoning
- Sentence Splitting
- Lexical Analysis (POS Tagging, Lemmatization)
- Gazetteer Look-up

NERC
- 1st Pass: Identification and 1st Classification
- 2nd Pass: Classification using the already classified NEs
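As a rough illustration of this chain, the Python sketch below runs each document of a collection through a sequence of stages. The Document record and all stage functions are hypothetical placeholders; they are not the actual Ellogon API.

    from typing import Callable, Dict, List

    # Hypothetical stand-in for an Ellogon document (illustration only).
    Document = Dict[str, object]

    def preprocess(doc: Document) -> Document:
        # Stand-in for tokenization/zoning, sentence splitting,
        # lexical analysis and gazetteer look-up.
        doc["tokens"] = str(doc["text"]).split()
        return doc

    def first_pass(doc: Document) -> Document:
        # Toy identification rule: treat all-caps tokens as candidate NEs.
        doc["nes"] = [t for t in doc["tokens"] if t.isupper()]
        return doc

    def second_pass(doc: Document) -> Document:
        # In HNERC the second pass classifies using already classified NEs;
        # here it is a no-op placeholder.
        return doc

    PIPELINE: List[Callable[[Document], Document]] = [preprocess, first_pass, second_pass]

    def run(collection: List[Document]) -> List[Document]:
        for doc in collection:
            for stage in PIPELINE:
                stage(doc)
        return collection

    print(run([{"text": "IBM ThinkPad with 14 TFT"}])[0]["nes"])  # ['IBM', 'TFT']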
3
HNERC v.2: major improvements over HNERC v.1.x

Tokenization: domain-specific tokenization problems have been solved for several names, expressions, terms, or combinations of them that appear in the text without space, punctuation, or symbol characters between them:
- 14TFT → 14 TFT
- PIII300 → PIII 300
- 1024X768X16 → 1024 X 768 X 16
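The fused-token splitting shown above can be approximated with boundary-insertion regular expressions. A minimal sketch, illustrative only; the actual HNERC tokenizer rules are domain-specific and not reproduced here.

    import re

    def split_fused(token: str) -> str:
        # Insert a space at every digit-to-letter and letter-to-digit boundary.
        token = re.sub(r"(?<=\d)(?=[A-Za-z])", " ", token)
        token = re.sub(r"(?<=[A-Za-z])(?=\d)", " ", token)
        return token

    print(split_fused("14TFT"))        # 14 TFT
    print(split_fused("PIII300"))      # PIII 300
    print(split_fused("1024X768X16"))  # 1024 X 768 X 16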
4
HNERC v.2: major improvements over HNERC v.1.x

Gazetteer Look-up: lists have been updated to include a larger number of OS, Manuf, and Software names (109 more names).

NERC Patterns: addition of new patterns in the following categories:
- patterns for filtering names that are not part of laptop descriptions
- patterns for names that have been affected by the changes of the tokenizer

Evaluation in the new corpus.
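Gazetteer look-up of this kind is typically a longest-match scan of the token stream against the name lists. A minimal sketch, with example entries that merely stand in for the project's actual lists:

    # Example entries only; the real OS/Manuf/Software lists are far larger.
    GAZETTEER = {
        ("Toshiba",): "MANUF",
        ("Windows", "98"): "OS",
        ("Microsoft", "Office"): "SOFTWARE",
    }
    MAX_LEN = max(len(entry) for entry in GAZETTEER)

    def lookup(tokens):
        # Scan left to right, preferring the longest gazetteer match.
        matches, i = [], 0
        while i < len(tokens):
            for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
                span = tuple(tokens[i:i + n])
                if span in GAZETTEER:
                    matches.append((i, i + n, GAZETTEER[span]))
                    i += n
                    break
            else:
                i += 1
        return matches

    print(lookup("The Toshiba laptop runs Windows 98".split()))
    # [(1, 2, 'MANUF'), (4, 6, 'OS')]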
5
HNERC v.2: major improvements over HNERC v.1.x

HNERC Evaluation Results 1 (without Demarcation)

               Precision  Recall  F-measure
    MANUF      0.79       0.94    0.86
    MODEL      0.71       0.71    0.71
    SOFT_OS    0.78       0.82    0.80
    PROCESSOR  0.89       0.94    0.91
    DATE       0.92       1.0     0.96
    DURATION   0.82       0.92    0.87
    TIME       1.0        1.0     1.0
6
HNERC v.2: major improvements over HNERC v.1.x

HNERC Evaluation Results 2 (without Demarcation)

                Precision  Recall  F-measure
    SPEED       0.87       0.94    0.90
    CAPACITY    0.81       0.87    0.84
    LENGTH      0.82       0.95    0.88
    RESOLUTION  0.68       0.84    0.75
    MONEY       0.73       0.89    0.80
    PERCENT     0.65       0.93    0.77
    WEIGHT      1.0        1.0     1.0
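The F-measure columns in both tables are consistent with the balanced F1 score, i.e. the harmonic mean of Precision and Recall:

    def f1(precision: float, recall: float) -> float:
        # Balanced F-measure: harmonic mean of precision and recall.
        return 2 * precision * recall / (precision + recall)

    print(round(f1(0.79, 0.94), 2))  # 0.86, matches the MANUF row
    print(round(f1(0.68, 0.84), 2))  # 0.75, matches the RESOLUTION row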
7
HNERC v.2: major improvements over HNERC v.1.x

- HNERC has been tested on a corpus that presented greater diversity in terms of product description categories (more pages with many laptop products and more pages with both laptop and non-laptop products than the previous corpus).
- Results are comparable to those of the previous evaluation for categories that were commonly found in the corpus.
- Differences were observed in the results for categories that have very low frequency in the corpora (RESOLUTION, PERCENT).
- An evaluation has also been conducted using Demarcation input; this improved results slightly for a few categories, but significantly lowered Recall and F-measure for MONEY.
8
Demarcation Tool: Evaluation

- Conducted on the Hellenic and French testing corpora that had been annotated for names and products.
- Page categories and their frequency in the training corpus played an important role in the performance of the tool (better performance for the most common categories).
9
Evaluation: Hellenic Testing Corpus (1)

            Precision  Recall  F-measure
    ALL
      NE    0.778      0.756   0.767
      NUMEX 0.722      0.780   0.75
      TIMEX 0.58       0.591   0.585
10
Evaluation: Hellenic Testing Corpus (2)

            Precision  Recall  F-measure
    A1
      NE    0.994      0.965   0.979
      NUMEX 0.782      0.860   0.819
      TIMEX 0.730      1.0     0.844
    B1
      NE    0.734      0.748   0.741
      NUMEX 0.712      0.760   0.735
      TIMEX 0.25       0.25    0.25
    B2
      NE    0.504      0.504   0.504
      NUMEX 0.594      0.677   0.633
      TIMEX 0.583      0.5     0.538
11
Evaluation: French Testing Corpus (1)

            Precision  Recall  F-measure
    ALL
      NE    0.549      0.495   0.521
      NUMEX 0.509      0.525   0.517
      TIMEX 0.444      0.385   0.412
12
Evaluation: French Testing Corpus (2)

            Precision  Recall  F-measure
    A1
      NE    0.793      0.705   0.746
      NUMEX 0.509      0.648   0.570
      TIMEX 0.645      0.625   0.634
    B1
      NE    0.736      0.705   0.720
      NUMEX 0.864      0.824   0.843
      TIMEX 1.0        0.777   0.875
    B2
      NE    0.466      0.451   0.459
      NUMEX 0.433      0.409   0.421
      TIMEX 0.75       0.75    0.75
13
HNERC v.2: Name Matching

- Recognizes instances of the same name within a single laptop description using pattern matching.
- Matching is conducted for MANUF, MODEL, PROCESSOR, OS, CAPACITY, SPEED, MONEY names and expressions.

Evaluation:
- uses manual annotations for names and product descriptions
- the corpus has not been manually annotated for name matching; instead, name annotations combined with the norm and product_no attributes have been used to determine coreferential names and to create the key collection
- was conducted for MANUF, PROCESSOR, OS
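One simple pattern-matching heuristic for this task is to treat a shorter mention as an instance of a longer name when its tokens form a contiguous subsequence of that name. The sketch below rests on that assumption and is not the actual HNERC pattern set; the product names are invented examples.

    def same_name(full: str, mention: str) -> bool:
        # Assumed heuristic: a mention corefers with a full name if its
        # tokens appear as a contiguous run within the full name's tokens.
        f, m = full.split(), mention.split()
        return any(f[i:i + len(m)] == m for i in range(len(f) - len(m) + 1))

    print(same_name("Toshiba Satellite Pro 4340", "Satellite Pro"))  # True
    print(same_name("Toshiba Satellite Pro 4340", "ThinkPad"))       # False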