WP2: Named Entity Recognition and Classification
Claire Grover, University of Edinburgh
Final Review, 31 October 2003

NERC Multilingual IE Architecture
[Architecture diagram. Recoverable labels: web pages feed the four monolingual NERC modules (ENERC, FNERC, HNERC, INERC); the Demarcator and Fact Extraction follow, populating a database; a shared domain ontology underpins the components.]

WP2: Objectives
– Specification of a language-neutral NERC architecture (month 6: D2.1).
– NERC v.1: adaptation and integration of the four existing NERC modules (month 12: D2.2).
– Specification of the corpus collection methodology.
– NERC v.2: improvement of NERC v.1, incorporating name matching (month 18: D2.3).
– NERC-based demarcation.
– NERC v.3: improvement of NERC v.2, incorporating rapid adaptation mechanisms and porting to the 2nd domain (month 26: D2.4).

Features Specific to CROSSMARC NERC
– Multilinguality: currently four languages, with the ability to add new ones.
– Web pages as input: HTML is converted to XHTML, and XML is used as the common exchange format, with a specific DTD per domain.
– Extensibility: new domains need to be added rapidly.

Shared Features of the NERC Components
– XHTML input and output, with a shared DTD.
– Shared domain ontology.
– Each reuses existing NLP tools and linguistic resources.
– Stepwise transformation of the XHTML to incrementally add mark-up, e.g. tokenisation, sentence identification, part-of-speech tagging, entity recognition (a sketch of this style of pipeline follows this list).
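
A minimal sketch of the stepwise idea, using invented element names (text, s, w) rather than the CROSSMARC DTD: each pass reads the XML produced by the previous one and adds a further layer of mark-up.

```python
# Stepwise annotation sketch; element names are illustrative only.
from xml.etree import ElementTree as ET

doc = ET.fromstring("<text>Prices start at 999 euros. Order today.</text>")

# Pass 1: sentence identification - wrap each sentence in an <s> element.
sentences = [s.strip() for s in doc.text.split(".") if s.strip()]
doc.text = None
for sent in sentences:
    ET.SubElement(doc, "s").text = sent

# Pass 2: tokenisation - replace each sentence's text with <w> tokens.
for s in list(doc):
    tokens, s.text = s.text.split(), None
    for tok in tokens:
        ET.SubElement(s, "w").text = tok

print(ET.tostring(doc, encoding="unicode"))
```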

NERC Version 2
Final version of NERC for the 1st domain. All four monolingual systems use hand-coded rule sets:
– HNERC uses the Ellogon Text Engineering Platform.
– ENERC uses the LT TTT and LT XML tools and adds XML annotations incrementally.
– INERC is implemented as a sequence of XSLT transformations of the XML document (see the sketch after this list).
– FNERC uses Lingway's XTIRP Extraction Tool, which applies a sequence of rule-based modules.
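
As an illustration of INERC's XSLT-sequence style, this toy sketch chains two invented stylesheets with lxml; the templates and element names are assumptions, not the INERC rules.

```python
# Toy sequence of XSLT transformations applied with lxml; the
# stylesheets are invented for illustration.
from lxml import etree

XSL = """<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="{match}">
    <{out}><xsl:value-of select="."/></{out}>
  </xsl:template>
</xsl:stylesheet>"""

# Step 1 wraps the document text; step 2 refines the wrapped element.
steps = [etree.XSLT(etree.XML(XSL.format(match="/doc", out="tokenised"))),
         etree.XSLT(etree.XML(XSL.format(match="/tokenised", out="entity")))]

result = etree.XML("<doc>Pentium III 800 MHz</doc>")
for step in steps:  # each pass transforms the previous pass's output
    result = step(result)
print(etree.tostring(result))
```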

NERC Version 3
Reported in D2.4. Final version of NERC, dealing with the 2nd domain. The main focus is a customisation methodology and experimentation to allow rapid adaptation to new domains. Because the monolingual components of the NERC architecture differ from one another, customisation methods are defined per component.

ENERC Customisation Methodology
– Retain the XML pipeline architecture.
– Replace the named-entity rule sets with a maximum entropy tagger.
– Experiments with the C&C tagger and OpenNLP.
– Limited human intervention (selection of appropriate features).
A sketch of the replacement step follows.
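
A hedged sketch of a maximum entropy (logistic regression) token classifier; the actual experiments used the C&C tagger and OpenNLP, and the features, tags and tiny training set below are invented for illustration.

```python
# Maxent-style token tagger sketch using scikit-learn.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(tokens, i):
    # a few typical orthographic/contextual features per token
    return {"word": tokens[i].lower(),
            "is_capitalised": tokens[i][0].isupper(),
            "has_digit": any(c.isdigit() for c in tokens[i]),
            "prev": tokens[i - 1].lower() if i else "<s>"}

sents = [(["Intel", "Pentium", "III", "processor"],
          ["MANUF", "PROCESSOR", "PROCESSOR", "O"]),
         (["Ships", "with", "Windows", "2000"],
          ["O", "O", "SOFT_OS", "SOFT_OS"])]
X = [features(toks, i) for toks, _ in sents for i in range(len(toks))]
y = [tag for _, tags in sents for tag in tags]

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

test = ["AMD", "Athlon", "with", "Windows", "98"]
print(model.predict([features(test, i) for i in range(len(test))]))
```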

FNERC Customisation Methodology
– Retain the XTIRP-based architecture and modules.
– Use machine learning to assist in the acquisition of regular-expression named-entity rules.
– The machine-learning module produces a first version of human-readable rules plus lists of examples and counter-examples.
– The human expert then modifies the rule set appropriately.
This method reduces rule-set development time to about a third. A sketch of the review loop follows.
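
A small sketch of that review loop: an invented candidate price rule is checked against examples and counter-examples before the expert edits it.

```python
# Check a machine-proposed regex rule against examples and
# counter-examples; the rule and data are invented for illustration.
import re

candidate = re.compile(r"\b\d+(?:[.,]\d+)?\s?(?:euros?|EUR)\b", re.IGNORECASE)

examples = ["999 euros", "1,299.00 EUR", "49 euro"]   # should match
counter_examples = ["euros accepted", "model 9990"]   # should not

for text in examples:
    print("MATCH   " if candidate.search(text) else "MISS    ", text)
for text in counter_examples:
    print("SPURIOUS" if candidate.search(text) else "OK      ", text)
```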

HNERC Customisation Methodology
ML-HNERC comprises:
– Token-based HNERC: operates over word tokens, treating NERC as a tagging problem; word-token classification is performed by five independent taggers, with the final tag chosen through a simple majority voter (sketched after this list).
– Phrase-based HNERC: operates over phrases which have been identified using a grammar automatically induced from the training corpus; uses a C4.5 decision-tree classifier to recognize phrases that describe entities.
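
A minimal sketch of the voting step, with invented tagger outputs.

```python
# Majority voting over five taggers' per-token tags.
from collections import Counter

def majority_vote(per_tagger_tags):
    """per_tagger_tags: one tag sequence per tagger."""
    return [Counter(column).most_common(1)[0][0]
            for column in zip(*per_tagger_tags)]

outputs = [
    ["MANUF", "O", "MODEL"],   # tagger 1
    ["MANUF", "O", "O"],       # tagger 2
    ["O",     "O", "MODEL"],   # tagger 3
    ["MANUF", "O", "MODEL"],   # tagger 4
    ["MANUF", "O", "MODEL"],   # tagger 5
]
print(majority_vote(outputs))  # ['MANUF', 'O', 'MODEL']
```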

INERC Customisation Methodology
– INERC is modular, with components that are general and reusable in new domains, so customisation can be restricted to the lexical knowledge bases.
– A statistically driven process generalises from the annotated corpus material to derive broader lexical resources: a frequency score is computed to decide which entries expand the lexical resources (a sketch follows).
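
A hedged sketch of the expansion idea: count how often each candidate token appears inside annotations of a given type, and admit those whose relative frequency clears a threshold. The scoring formula and threshold are assumptions, not the actual INERC computation.

```python
# Frequency-based lexicon expansion sketch with invented data.
from collections import Counter

# (token, annotated-type) pairs as read off a training corpus
annotated = [("Milano", "MUNICIPALITY"), ("Milano", "MUNICIPALITY"),
             ("Roma", "MUNICIPALITY"), ("Milano", "ORGANIZATION")]

inside = Counter(tok for tok, t in annotated if t == "MUNICIPALITY")
total = Counter(tok for tok, _ in annotated)

THRESHOLD = 0.6  # assumed cut-off
lexicon = {tok for tok in inside if inside[tok] / total[tok] >= THRESHOLD}
print(lexicon)  # {'Milano', 'Roma'}
```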

Evaluation Methodology
– For both domains we have a hand-annotated corpus of 100 pages per language, split 50-50 into training and testing material.
– Each monolingual NERC is evaluated against the testing corpus.
– The standard measures of precision, recall and f-measure are used (defined in the sketch below).
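
For reference, a minimal sketch of the standard definitions; the entity counts are invented, and this is not CROSSMARC's scoring script.

```python
# Precision, recall and f-measure over entity counts.
def precision_recall_f(correct, guessed, gold):
    """correct: entities the system got right; guessed: all system
    entities; gold: all reference entities."""
    p = correct / guessed if guessed else 0.0
    r = correct / gold if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

print(precision_recall_f(correct=80, guessed=100, gold=110))
# -> (0.8, 0.7272..., 0.7619...)
```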

Evaluation Summary
         Domain 1 F-score   Domain 2 F-score
ENERC    0.73               0.59
FNERC    0.77               0.75
HNERC    0.86               0.68
INERC    0.82               0.77

Conclusions
– The rule-based approach gives better results, but it is knowledge-intensive and requires significant resources for customisation to each new domain.
– The FNERC approach to rule induction is promising.
– In our experiments the machine-learning approaches give lower results, but:
  – they allow easy adaptation to new domains;
  – there is scope to improve performance;
  – more training material would give better performance.

Other WP2 Activities
– Collection and annotation of corpora for each language and domain.
– NERC-based demarcation.

Corpus Collection Methodology
For each domain the process follows two steps:
– identification of interesting characteristics of product descriptions, and collection of statistics relevant to these characteristics from at least 50 different sites per language;
– collection of pages and their separation into training and testing corpora.

Corpus Collection Principles
Domain-independent principles:
– The training and testing corpora have the same number of pages.
– The corpus size is fixed for all languages.
– The corpora are representative of the statistics found per language in the site-classification step.
Domain-specific principles:
– The maximum number of pages from one site allowed in a corpus is decided per domain.
– The testing corpus must contain X pages that come from sites not represented in the training corpus.
A sketch of a split respecting these principles follows.
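
A minimal sketch of such a split; the per-site cap, the held-out sites supplying the unseen-site test pages, and the greedy balancing are all assumptions, not the CROSSMARC procedure.

```python
# Greedy corpus split sketch honouring a per-site cap and held-out sites.
from collections import Counter

def split_corpus(pages, cap_per_site, held_out_sites):
    """pages: (site, page) pairs; held_out_sites appear only in test."""
    counts, train, test = Counter(), [], []
    for site, page in pages:
        if counts[site] >= cap_per_site:
            continue                          # enforce the per-site cap
        counts[site] += 1
        if site in held_out_sites:
            test.append((site, page))         # unseen-site test pages
        elif len(train) <= len(test):
            train.append((site, page))        # keep the halves balanced
        else:
            test.append((site, page))
    return train, test

pages = [("siteA", 1), ("siteA", 2), ("siteA", 3),
         ("siteB", 1), ("siteB", 2), ("siteC", 1)]
train, test = split_corpus(pages, cap_per_site=2, held_out_sites={"siteC"})
print(train, test, sep="\n")
```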

Annotation
– Annotation is performed using NCSR's annotation tool.
– Annotation guidelines are drawn up per domain.
– Each corpus is annotated by two separate annotators, and inter-annotator agreement is checked (a sketch of such a check follows).
– The final corpus is the result of correcting the cases of disagreement.
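
A small sketch of an agreement check, computing raw agreement and Cohen's kappa over invented annotation decisions; the deliverables define the actual procedure.

```python
# Inter-annotator agreement sketch: raw agreement plus Cohen's kappa.
from collections import Counter

a1 = ["MANUF", "MODEL", "O", "MONEY", "O", "MODEL"]
a2 = ["MANUF", "O",     "O", "MONEY", "O", "MODEL"]

observed = sum(x == y for x, y in zip(a1, a2)) / len(a1)

# chance agreement from each annotator's tag distribution
c1, c2, n = Counter(a1), Counter(a2), len(a1)
expected = sum(c1[t] * c2[t] for t in c1) / n**2

kappa = (observed - expected) / (1 - expected)
print(f"agreement={observed:.2f} kappa={kappa:.2f}")
```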

NERC-Based Demarcator
– Operates after NERC and before Fact Extraction.
– Locates the different product descriptions inside a web page.
– The current version is heuristics-based. Characteristic information:
  – 1st domain: manufacturer, model, price;
  – 2nd domain: job_title, organization, education title.
– Output: a Product_No attribute on entities.
A sketch of such a heuristic follows.
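
A hedged sketch of a heuristics-based demarcator for the 1st domain; the trigger used here (a repeated manufacturer entity starts a new description) is an assumption, not the implemented heuristic.

```python
# Demarcation sketch: assign a Product_No to each entity in page order.
def demarcate(entities):
    """entities: list of (type, text) pairs in page order."""
    product_no, seen, result = 0, set(), []
    for etype, text in entities:
        # assumed trigger: a second MANUF signals the next description
        if etype == "MANUF" and etype in seen:
            product_no += 1
            seen = set()
        seen.add(etype)
        result.append({"type": etype, "text": text,
                       "Product_No": product_no})
    return result

page = [("MANUF", "Dell"), ("MODEL", "Inspiron 8100"),
        ("MONEY", "999 euros"), ("MANUF", "Toshiba"),
        ("MODEL", "Satellite Pro"), ("MONEY", "1099 euros")]
for e in demarcate(page):
    print(e)
```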

Demarcator Evaluation
             Greek   Italian   English   French
1st domain
  NE         0.77    0.91      0.63      0.52
  NUMEX      0.75    0.84      0.54      0.52
  TIMEX      0.59    0.72      0.44      0.41
2nd domain
  NE         0.77    0.64      0.47      0.62

Results Overview
– A successful multilingual NERC system which is an integral part of a research platform for extracting information from web pages.
– An architecture that allows for new languages and swift adaptation to new domains.
– Four independent approaches, each of which provides good results.
– A well-motivated corpus collection methodology.
– Publicly distributed corpora for all languages and both domains.

Shared DTDs
Domain 1:
– NE: MANUF, MODEL, PROCESSOR, SOFT_OS
– TIMEX: TIME, DATE, DURATION
– NUMEX: LENGTH, WEIGHT, SPEED, CAPACITY, RESOLUTION, MONEY, PERCENT
Domain 2:
– NE: MUNICIPALITY, REGION, COUNTRY, ORGANIZATION, JOB_TITLE, EDU_TITLE, LANGUAGE, S/W
– TIMEX: DATE, DURATION
– NUMEX: MONEY
– TERM: SCHEDULE, ORG_UNIT

1st Domain Evaluation Results
                     ENERC   FNERC   HNERC   INERC
NE     MANUF         0.52    0.68    0.86    0.93
       MODEL         0.70    0.58    0.71    0.70
       SOFT_OS       0.76    0.90    0.80    0.94
       PROCESSOR     0.91    0.93    0.91    0.96
NUMEX  SPEED         0.78    0.84    0.90    0.88
       CAPACITY      0.90    0.85    0.84    0.96
       LENGTH        0.85    0.61    0.88    0.89
       RESOLUTION    0.96    0.83    0.75    0.89
       MONEY         0.62    0.80    0.80    0.74
       PERCENT       0.67    0.75    0.77    0.86
       WEIGHT        0.96    0.93    1.00    0.88
TIMEX  DATE          0.45    0.84    0.96    0.57
       DURATION      0.73    0.85    0.87    0.41
       TIME          0.47    0.69    1.00    -
Overall (approx.)    0.73    0.77    0.86    0.82

2nd Domain Evaluation Results
                     ENERC   FNERC   HNERC   INERC
NE     MUNICIPALITY  0.70    0.77    0.82    0.92
       REGION        0.65    0.81    0.40    0.94
       COUNTRY       0.87    0.73    0.84    0.86
       ORGANIZATION  0.56    0.58    0.50    0.71
       JOB_TITLE     0.55    0.71    0.50    0.78
       EDU_TITLE     0.36    0.57    0.67    0.82
       LANGUAGE      0.67    0.69    0.95    0.83
       S/W           0.55    0.82    0.70    0.75
NUMEX  MONEY         0.25    0.93    0.00    0.00
TIMEX  DATE          0.79    0.61    0.93    0.77
       DURATION      0.83    0.88    0.91    0.74
TERM   ORG_UNIT      0.37    0.66    0.39    0.51
       SCHEDULE      0.00    0.57    0.00    0.40
Overall              0.59    0.75    0.68    0.77

Further Evaluation: 2nd Domain ENERC Cross-Validation
                     New F-score   Previous F-score
NE     MUNICIPALITY  0.80          0.70
       REGION        0.88          0.65
       COUNTRY       0.90          0.87
       ORGANIZATION  0.71          0.56
       JOB_TITLE     0.66          0.55
       EDU_TITLE     0.36          0.36
       LANGUAGE      0.86          0.67
       S/W           0.56          0.55
NUMEX  MONEY         0.25          0.25
TIMEX  DATE          0.79          0.79
       DURATION      0.83          0.83
TERM   ORG_UNIT      0.40          0.37
       SCHEDULE      0.00          0.00
Overall              0.67          0.59

Further Evaluation: 1st Domain ENERC Cross-Validation
                     ME F-score   Rule-based F-score
NE     MANUF         0.63         0.52
       MODEL         0.53         0.70
       SOFT_OS       0.72         0.76
       PROCESSOR     0.78         0.91
NUMEX  SPEED         0.81         0.78
       CAPACITY      0.85         0.90
       LENGTH        0.92         0.85
       RESOLUTION    0.96         0.96
       MONEY         0.47         0.62
       PERCENT       0.44         0.67
       WEIGHT        0.77         0.96
TIMEX  DATE          -            0.45
       DURATION      0.65         0.73
       TIME          -            0.47
Overall              0.74         0.73

Further Evaluation: 2nd Domain ML-HNERC, Other Languages
                     ML-HNERC           ML-HNERC           ML-HNERC
                     English   ENERC    French    FNERC    Italian   INERC
NE     MUNICIPALITY  0.66      0.70     0.75      0.77     0.63      0.92
       REGION        0.54      0.65     0.58      0.81     0.00      0.94
       COUNTRY       0.83      0.87     0.82      0.73     0.20      0.86
       ORGANIZATION  0.47      0.56     0.15      0.58     0.22      0.71
       JOB_TITLE     0.40      0.55     0.47      0.71     0.47      0.78
       EDU_TITLE     0.13      0.36     0.39      0.57     0.25      0.82
       LANGUAGE      0.67      0.67     0.57      0.69     0.78      0.83
       S/W           0.47      0.55     0.62      0.82     0.65      0.75
NUMEX  MONEY         0.91      0.25     0.32      0.93     0.00      0.00
TIMEX  DATE          0.62      0.79     0.21      0.61     0.50      0.77
       DURATION      0.84      0.83     0.69      0.88     0.45      0.74
TERM   ORG_UNIT      0.31      0.37     0.06      0.66     0.37      0.51
       SCHEDULE      0.67      0.00     0.00      0.57     0.00      0.40
Overall              0.49      0.59     0.55      0.75     0.52      0.77

Further Evaluation: 1st Domain ML-HNERC
                     ML-HNERC F-score   Rule-based HNERC F-score
NE     MANUF         0.57               0.86
       MODEL         0.57               0.71
       SOFT_OS       0.79               0.80
       PROCESSOR     0.64               0.91
NUMEX  SPEED         0.75               0.90
       CAPACITY      0.61               0.84
       LENGTH        0.76               0.88
       RESOLUTION    0.62               0.75
       MONEY         0.53               0.80
       PERCENT       0.71               0.77
       WEIGHT        0.80               1.00
TIMEX  DATE          0.44               0.96
       DURATION      0.47               0.87
       TIME          -                  1.00
Overall              0.65               0.86

Further Evaluation: 3rd Domain ML-HNERC
                     ML-HNERC F-score
NE     AGENCY        -
       AREA          0.34
       CITY          0.83
       COUNTRY       0.11
       HOTEL_NAME    0.00
       PACK_TITLE    0.00
       SITE          0.37
NUMEX  MONEY         0.73
TIMEX  DATE          0.22
       DURATION      0.67
TERM   ACCOM_TYPE    0.63
Overall              0.58

Statistics for offer description types in the 2nd domain
[chart]

2nd Domain Characteristics
[chart]

Summary Statistics of the Italian Testing Corpus for the 2nd Domain
Pages                                      50
Sites                                      45
Job Offers                                 156
Job Offers per Page                        3.12
NE + NUMEX + TIMEX + TERM total            1219
NE total                                   1170
NUMEX total                                0
TIMEX total                                49
Mean names & expressions per description   7.81
Mean NEs per Job Offer                     7.5
Mean NUMEX per Job Offer                   0
Mean TIMEX per Job Offer                   0.31

Tag distribution in the Italian Job Offer Testing Corpus
[chart]