Final Review, 31 October 2003. WP2: Named Entity Recognition and Classification. Claire Grover, University of Edinburgh.

Presentation transcript:

Final Review, 31 October 2003. WP2: Named Entity Recognition and Classification. Claire Grover, University of Edinburgh.

NERC Multilingual IE Architecture
(Diagram: web pages are processed by the four monolingual NERC components (ENERC, FNERC, HNERC, INERC), then by the Demarcator and the Fact Extraction component, which feeds the Database; the Domain Ontology is shared by the components.)

WP2: Objectives
– Specification of a language-neutral NERC architecture (month 6: D2.1)
– NERC v.1: adaptation and integration of the four existing NERC modules (month 12: D2.2)
– Specification of the Corpus Collection Methodology
– NERC v.2: improvement of NERC v.1, incorporation of name matching (month 18: D2.3)
– NERC-based Demarcation
– NERC v.3: improvement of NERC v.2, incorporation of rapid adaptation mechanisms, porting to the 2nd domain (month 26: D2.4)

Features Specific to CROSSMARC NERC
– Multilinguality: currently 4 languages, but it should be possible to add new languages.
– Web pages as input: conversion of HTML to XHTML and use of XML as the common exchange format, with a specific DTD per domain.
– Extensibility to new domains: new domains need to be added rapidly.
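The slides do not show the project's conversion tool, but a minimal sketch of the kind of normalisation involved, turning scraped HTML into well-formed XHTML so that XML tools can add annotations downstream, could look like the following (the sample page is invented).

```python
# Minimal sketch (not the CROSSMARC converter): normalise messy HTML to
# well-formed XHTML so that downstream XML tools can add mark-up.
from lxml import etree, html

def html_to_xhtml(raw_html: str) -> bytes:
    doc = html.fromstring(raw_html)   # lenient HTML parsing
    html.html_to_xhtml(doc)           # move elements into the XHTML namespace
    return etree.tostring(doc, xml_declaration=True, encoding="UTF-8")

if __name__ == "__main__":
    page = "<html><body><p>Laptop X200 <b>1.2 GHz</b>, 512 MB</body></html>"
    print(html_to_xhtml(page).decode("UTF-8"))
```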

Shared Features of the NERC Components
– XHTML input and output, shared DTD.
– Shared domain ontology.
– Each reuses existing NLP tools and linguistic resources.
– Stepwise transformation of the XHTML to incrementally add mark-up, e.g. tokenisation, sentence identification, part-of-speech tagging, entity recognition.
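To make the stepwise transformation concrete, here is a small illustrative pipeline in which each stage reads the XML produced so far and adds one further layer of mark-up; the element names, the gazetteer and the sample page are all invented for the example and are not CROSSMARC's actual mark-up scheme.

```python
# Illustrative incremental annotation pipeline (element names are invented).
from lxml import etree

def tokenise(doc):
    """Replace the text of each <p> with <tok> elements (whitespace split)."""
    for p in doc.iter("p"):
        words, p.text = (p.text or "").split(), None
        for w in words:
            etree.SubElement(p, "tok").text = w
    return doc

def tag_entities(doc):
    """Mark tokens listed in a tiny gazetteer as named entities."""
    gazetteer = {"IBM": "MANUF", "Dell": "MANUF"}
    for tok in doc.iter("tok"):
        if tok.text in gazetteer:
            tok.set("ne", gazetteer[tok.text])
    return doc

page = etree.fromstring("<body><p>IBM ThinkPad T30 laptop</p></body>")
for stage in (tokenise, tag_entities):   # each stage adds one layer of mark-up
    page = stage(page)
print(etree.tostring(page, pretty_print=True).decode())
```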

NERC Version 2
Final version of NERC for the 1st domain. All four monolingual systems use hand-coded rule sets:
– HNERC uses the Ellogon Text Engineering Platform.
– ENERC uses the LT TTT and LT XML tools and adds XML annotations incrementally.
– INERC is implemented as a sequence of XSLT transformations of the XML document.
– FNERC uses Lingway’s XTIRP Extraction Tool, which applies a sequence of rule-based modules.
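For readers unfamiliar with hand-coded named-entity rules, the toy fragment below shows the general flavour: regular expressions over the text whose matches are wrapped in inline XML elements. The tag names follow the shared DTD listed later in the slides; the patterns and the sample sentence are invented and are not the project's actual rules.

```python
# Toy hand-coded rule set in the spirit of NERC v.2 (illustrative only;
# overlapping matches are not handled).
import re

RULES = [
    ("NUMEX", "SPEED",    re.compile(r"\b\d+(?:\.\d+)?\s?(?:GHz|MHz)\b")),
    ("NUMEX", "CAPACITY", re.compile(r"\b\d+\s?(?:GB|MB)\b")),
    ("NUMEX", "MONEY",    re.compile(r"[£€$]\s?\d+(?:[.,]\d+)?")),
]

def annotate(text: str) -> str:
    """Wrap every rule match in an inline element such as <NUMEX type="SPEED">."""
    spans = [(m.start(), m.end(), group, etype)
             for group, etype, pattern in RULES
             for m in pattern.finditer(text)]
    for start, end, group, etype in sorted(spans, reverse=True):
        # Insert tags right to left so earlier offsets stay valid.
        text = (text[:start]
                + f'<{group} type="{etype}">{text[start:end]}</{group}>'
                + text[end:])
    return text

print(annotate("Laptop with 1.2 GHz CPU, 512 MB RAM, only £699"))
```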

NERC Version 3
Reported in D2.4. Final version of NERC, dealing with the 2nd domain. The main focus is a customisation methodology and experimentation to allow rapid adaptation to new domains. Because the monolingual components of the NERC architecture differ from one another, customisation methods are defined per component.

ENERC Customisation Methodology
– Retain the XML pipeline architecture.
– Replace the named-entity rule sets with a maximum entropy tagger.
– Experiments with the C&C tagger and OpenNLP.
– Limited human intervention (selection of appropriate features).
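A maximum entropy tagger treats NERC as per-token classification over contextual features. The sketch below uses scikit-learn's logistic regression (a maximum entropy model) rather than the C&C or OpenNLP implementations actually used, and the features, training sentences and tags are invented.

```python
# Minimal maximum-entropy (logistic regression) token tagger sketch.
# Training data, features and labels are invented for illustration.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def features(tokens, i):
    """Very small feature set: the word, its shape, and its neighbours."""
    w = tokens[i]
    return {
        "word": w.lower(),
        "is_title": w.istitle(),
        "has_digit": any(c.isdigit() for c in w),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

train = [
    (["IBM", "ThinkPad", "T30", ",", "1.8", "GHz"],
     ["B-MANUF", "B-MODEL", "I-MODEL", "O", "B-SPEED", "I-SPEED"]),
    (["Dell", "Inspiron", "8200", "with", "512", "MB"],
     ["B-MANUF", "B-MODEL", "I-MODEL", "O", "B-CAPACITY", "I-CAPACITY"]),
]

X = [features(toks, i) for toks, tags in train for i in range(len(toks))]
y = [tag for _, tags in train for tag in tags]

tagger = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
tagger.fit(X, y)

test = ["Toshiba", "Satellite", "Pro", "runs", "at", "2.0", "GHz"]
print(list(zip(test, tagger.predict([features(test, i) for i in range(len(test))]))))
```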

FNERC Customisation Methodology
– Retain the XTIRP-based architecture and modules.
– Use machine learning to assist in the acquisition of regular-expression named-entity rules.
– The machine-learning module produces a first version of human-readable rules plus lists of examples and counter-examples.
– The human expert then modifies the rule set appropriately.
– This method reduces rule-set development time to about a third.
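The slide gives no implementation detail, so the following is only a rough sketch of the general idea: generalise annotated entity strings into candidate regular expressions and show the human expert which positive examples and counter-examples each candidate matches before the rules are hand-edited. All examples are invented and this is not Lingway's actual method.

```python
# Rough sketch of ML-assisted regex acquisition (illustrative only).
import re
from collections import defaultdict

def generalise(example: str) -> str:
    """Turn an annotated entity string into a crude character-class pattern."""
    pattern, prev = [], None
    for ch in example:
        cls = r"\d+" if ch.isdigit() else (r"\w+" if ch.isalpha() else re.escape(ch))
        if cls != prev:            # collapse runs of the same character class
            pattern.append(cls)
            prev = cls
    return "".join(pattern)

annotated_models = ["T30", "X200", "Inspiron 8200"]      # positive examples
other_text       = ["512 MB", "France", "Paris 75001"]   # counter-example pool

candidates = defaultdict(lambda: {"examples": [], "counter": []})
for ex in annotated_models:
    candidates[generalise(ex)]["examples"].append(ex)

for pat, report in candidates.items():
    rx = re.compile(f"^{pat}$")
    report["counter"] = [t for t in other_text if rx.match(t)]
    # A human expert would now keep, tighten or discard each candidate rule.
    print(pat, report)
```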

HNERC Customisation Methodology
ML-HNERC comprises:
Token-based HNERC
– operates over word tokens, treating NERC as a tagging problem;
– word-token classification is performed by five independent taggers, with the final tag chosen through a simple majority voter.
Phrase-based HNERC
– operates over phrases identified using a grammar automatically induced from the training corpus;
– uses a C4.5 decision-tree classifier to recognise phrases that describe entities.
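The token-based majority voter can be pictured as below; the five component taggers are stand-ins (the slide does not name them), their predictions are invented, and a real system would need a tie-breaking policy.

```python
# Sketch of combining independent per-token taggers by simple majority vote.
from collections import Counter

def majority_vote(predictions_per_tagger):
    """predictions_per_tagger: list of tag sequences, one per tagger."""
    combined = []
    for token_tags in zip(*predictions_per_tagger):
        tag, _count = Counter(token_tags).most_common(1)[0]
        combined.append(tag)
    return combined

# Five hypothetical taggers labelling the same three tokens.
votes = [
    ["B-MANUF", "B-MODEL", "O"],
    ["B-MANUF", "O",       "O"],
    ["O",       "B-MODEL", "O"],
    ["B-MANUF", "B-MODEL", "B-MONEY"],
    ["B-MANUF", "B-MODEL", "O"],
]
print(majority_vote(votes))   # ['B-MANUF', 'B-MODEL', 'O']
```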

INERC Customisation Methodology
– INERC is modular, with components that are general and reusable in new domains.
– Customisation can be restricted to the lexical knowledge bases.
– A statistically driven process generalises from the annotated corpus material to derive more general lexical resources.
– A frequency score is computed to expand the lexical resources.
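The frequency score itself is not specified on the slide, so the sketch below only illustrates the general idea: count how often each word form occurs inside annotations of a given type and admit forms whose relative frequency clears a threshold. The tokens and the cut-off are invented; only the tag names come from the shared DTD.

```python
# Illustrative frequency-based lexicon expansion (the actual INERC score
# is not specified on the slide; counts and threshold are invented).
from collections import Counter

# (word, entity_type_or_None) pairs as they might come from an annotated corpus.
annotated_tokens = [
    ("laureato", "EDU_TITLE"), ("diploma", "EDU_TITLE"), ("diploma", "EDU_TITLE"),
    ("azienda", None), ("diploma", None), ("inglese", "LANGUAGE"),
]

inside = Counter(w for w, t in annotated_tokens if t == "EDU_TITLE")
total  = Counter(w for w, _ in annotated_tokens)

THRESHOLD = 0.6   # invented cut-off on the relative frequency
lexicon = {w for w in inside if inside[w] / total[w] >= THRESHOLD}
print(sorted(lexicon))   # word forms added to the EDU_TITLE lexical resource
```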

Evaluation Methodology
– For both domains we have a hand-annotated corpus of 100 pages per language, split into training and testing material.
– Each monolingual NERC is evaluated against the testing corpus.
– The standard measures of precision, recall and F-measure are used (see the definitions below).
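Assuming the usual definitions in terms of true positives (TP), false positives (FP) and false negatives (FN), and the balanced F-measure, which is the standard choice:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2\,P\,R}{P + R}
```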

Evaluation Summary
(Table: F-scores for ENERC, FNERC, HNERC and INERC in Domain 1 and Domain 2.)

Conclusions
– The rule-based approach gives better results, but it is knowledge-intensive and requires significant resources for customisation to each new domain.
– The FNERC approach to rule induction is promising.
– In our experiments the machine-learning approaches give lower results, but:
  – they allow easy adaptation to new domains;
  – there is scope to improve performance;
  – more training material would give better performance.

Other WP2 Activities
– Collection and annotation of corpora for each language and domain.
– NERC-based Demarcation.

Corpus Collection Methodology
For each domain the process follows two steps:
– identification of interesting characteristics of product descriptions and collection of statistics relevant to these characteristics from at least 50 different sites per language;
– collection of pages and their separation into training and testing corpora.

Corpus Collection Principles
Domain-independent principles:
– Training and testing corpora have the same number of pages.
– Corpus size is fixed for all languages.
– Corpora are representative of the statistics found per language in the site-classification step.
Domain-specific principles:
– The maximum number of pages allowed from one site in a corpus must be decided depending on the domain.
– The testing corpus must contain X pages that come from sites not represented in the training corpus.
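As an illustration of these principles (not the project's actual tooling), a split might enforce a per-site page cap and hold some sites out entirely for testing; the cap, the number of held-out sites and the page records below are invented.

```python
# Illustrative corpus split honouring a per-site page cap and held-out sites.
import random

def split_corpus(pages, max_per_site=3, held_out_sites=2, test_fraction=0.5, seed=0):
    """pages: list of (page_id, site) pairs. Returns (training, testing) lists."""
    rng = random.Random(seed)

    # Enforce the per-site cap first.
    by_site = {}
    for page_id, site in pages:
        by_site.setdefault(site, []).append(page_id)
    capped = {s: ids[:max_per_site] for s, ids in by_site.items()}

    # Reserve some sites entirely for the testing corpus.
    sites = sorted(capped)
    rng.shuffle(sites)
    unseen = set(sites[:held_out_sites])

    testing = [(p, s) for s in unseen for p in capped[s]]
    rest = [(p, s) for s in sites if s not in unseen for p in capped[s]]
    rng.shuffle(rest)
    extra = max(0, int(test_fraction * (len(testing) + len(rest))) - len(testing))
    testing += rest[:extra]
    training = rest[extra:]
    return training, testing

pages = [(f"p{i}", f"site{i % 5}") for i in range(20)]
train, test = split_corpus(pages)
print(len(train), "training pages,", len(test), "testing pages")
```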

Annotation
– Annotation is performed using NCSR's annotation tool.
– Annotation guidelines are drawn up per domain.
– Each corpus is annotated by two separate annotators and inter-annotator agreement is checked.
– The final corpus is the result of correcting the cases of disagreement.
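The slides do not say which agreement statistic was checked; one common choice is Cohen's kappa over the two annotators' labels for the same units, sketched below with invented labels.

```python
# Cohen's kappa as one possible inter-annotator agreement measure
# (the slide does not specify which statistic CROSSMARC used).
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

annotator_1 = ["MANUF", "O", "MODEL", "O", "MONEY", "O"]
annotator_2 = ["MANUF", "O", "MODEL", "MODEL", "MONEY", "O"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```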

NERC-Based Demarcator
– Operates after NERC and before FE.
– Locates the different product descriptions inside a web page.
– The current version is heuristics-based.
– Characteristic information:
  – 1st domain: manufacturer, model, price
  – 2nd domain: job_title, organization, education title
– Output: a Product_No attribute on entities.
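One way to picture such a heuristic, as a sketch only and not the project's implementation: walk through the recognised entities in document order, start a new product description whenever an anchor entity type such as the manufacturer reappears, and write the running description number onto each entity as its Product_No. The anchor set and the sample entities are invented.

```python
# Sketch of a heuristics-based demarcator: number product descriptions by
# starting a new one each time an anchor entity type recurs.
ANCHOR_TYPES = {"MANUF"}   # 1st-domain anchor; a real demarcator would use more cues

def demarcate(entities):
    """entities: dicts with 'type' and 'text', in document order.
    Adds a 'Product_No' key to each entity and returns the list."""
    product_no = 1
    seen_in_current = set()
    for ent in entities:
        if ent["type"] in ANCHOR_TYPES and ent["type"] in seen_in_current:
            product_no += 1            # anchor repeated: a new description starts
            seen_in_current = set()
        seen_in_current.add(ent["type"])
        ent["Product_No"] = product_no
    return entities

page_entities = [
    {"type": "MANUF", "text": "IBM"},   {"type": "MODEL", "text": "T30"},
    {"type": "MONEY", "text": "£1299"}, {"type": "MANUF", "text": "Dell"},
    {"type": "MODEL", "text": "8200"},  {"type": "MONEY", "text": "£999"},
]
for e in demarcate(page_entities):
    print(e["Product_No"], e["type"], e["text"])
```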

Demarcator Evaluation
(Table: results for Greek, Italian, English and French; the 1st domain broken down into NE, NUMEX and TIMEX, the 2nd domain into NE.)

Results Overview
– A successful multilingual NERC system which is an integral part of a research platform for extracting information from web pages.
– An architecture that allows for new languages and swift adaptation to new domains.
– Four independent approaches, each of which provides good results.
– A well-motivated corpus collection methodology.
– Publicly distributed corpora for all languages and both domains.

Shared DTDs
Domain 1:
– NE: MANUF, MODEL, PROCESSOR, SOFT_OS
– TIMEX: TIME, DATE, DURATION
– NUMEX: LENGTH, WEIGHT, SPEED, CAPACITY, RESOLUTION, MONEY, PERCENT
Domain 2:
– NE: MUNICIPALITY, REGION, COUNTRY, ORGANIZATION, JOB_TITLE, EDU_TITLE, LANGUAGE, S/W
– TIMEX: DATE, DURATION
– NUMEX: MONEY
– TERM: SCHEDULE, ORG_UNIT

1st Domain Evaluation Results
(Table: F-scores per category for ENERC, FNERC, HNERC and INERC. NE: MANUF, MODEL, SOFT_OS, PROCESSOR; NUMEX: SPEED, CAPACITY, LENGTH, RESOLUTION, MONEY, PERCENT, WEIGHT; TIMEX: DATE, DURATION, TIME; plus approximate overall scores.)

2nd Domain Evaluation Results
(Table: F-scores per category for ENERC, FNERC, HNERC and INERC. NE: MUNICIPALITY, REGION, COUNTRY, ORGANIZATION, JOB_TITLE, EDU_TITLE, LANGUAGE, S/W; NUMEX: MONEY; TIMEX: DATE, DURATION; TERM: ORG_UNIT, SCHEDULE; plus overall scores.)

Further Evaluation: 2nd Domain, ENERC cross-validation
(Table: new F-score vs. previous F-score per 2nd-domain category, plus overall scores. Values shown include EDU_TITLE 0.36, MONEY 0.25, DATE 0.79, DURATION 0.83 and SCHEDULE 0.00.)

Further Evaluation: 1st Domain, ENERC cross-validation
(Table: maximum entropy (ME) F-score vs. rule-based F-score per 1st-domain category, plus overall scores. Values shown include RESOLUTION 0.96, DATE 0.45 and TIME 0.47.)

Further Evaluation: 2nd Domain, ML-HNERC on other languages
(Table: per-category F-scores for ML-HNERC applied to English, French and Italian, alongside ENERC, FNERC and INERC respectively, over the 2nd-domain NE, NUMEX, TIMEX and TERM categories, plus overall scores.)

Further Evaluation: 1st Domain, ML-HNERC
(Table: ML-HNERC F-score vs. rule-based HNERC F-score per 1st-domain category, plus overall scores. Values shown include TIME 1.00.)

Further Evaluation: 3rd Domain, ML-HNERC (F-scores)
– NE: AGENCY, AREA 0.34, CITY 0.83, COUNTRY 0.11, HOTEL_NAME 0.00, PACK_TITLE 0.00, SITE 0.37
– NUMEX: MONEY 0.73
– TIMEX: DATE 0.22, DURATION 0.67
– TERM: ACCOM_TYPE 0.63
– Overall: 0.58

Statistics for offer description types in the 2nd domain

2nd Domain Characteristics

Summary statistics of the Italian Testing Corpus for the 2nd Domain
– Pages: 50
– Sites: 45
– Job Offers: 156
– Job Offers per Page: 3.12
– NE, NUMEX, TIMEX, TERM total: 1219
– NE total: 1170
– NUMEX total: 0
– TIMEX total: 49
– Mean names & expressions per description: 7.81
– Mean NEs per Job Offer: 7.5
– Mean NUMEX per Job Offer: 0
– Mean TIMEX per Job Offer: 0.31

Tag distribution in the Italian Job Offer Testing Corpus