WP2: Hellenic NERC Vangelis Karkaletsis, Dimitra Farmakiotou Paris, December 5-6, 2002 Institute of Informatics & Telecommunications NCSR “Demokritos”


© NCSR, Paris, December 5-6, 2002
HNERC
- XHTML documents are converted into Ellogon documents within an Ellogon Collection
- Preprocessing
  - Tokenization and Zoning
  - Sentence Splitting
  - Lexical Analysis (POS Tagging, Lemmatization)
  - Gazetteer Look-up
- NERC
  - 1st Pass: Identification and 1st Classification
  - 2nd Pass: Classification using classified NEs

HNERC v.2: major improvements over HNERC v.1.x
Tokenization:
- Domain-specific tokenization problems have been solved for names, expressions, terms, or combinations of them that appear in the text with no space, punctuation, or symbol characters between them:
  - 14TFT → 14 TFT
  - PIII300 → PIII 300
  - 1024X768X16 → 1024 X 768 X 16
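The three tokenization examples on this slide can be reproduced with a small sketch. This is not the Ellogon tokenizer, whose rules the slides do not show; it only assumes that a fused token is split wherever a digit meets a Latin letter. The name `split_fused_token` is hypothetical.

```python
import re

# Zero-width boundary between a digit and a Latin letter, in either order.
# Assumption: only ASCII letters are considered; the real tokenizer may differ.
_BOUNDARY = re.compile(r"(?<=\d)(?=[A-Za-z])|(?<=[A-Za-z])(?=\d)")

def split_fused_token(token):
    """Split a token such as '14TFT' at every digit/letter boundary."""
    return _BOUNDARY.sub(" ", token).split()

for fused in ("14TFT", "PIII300", "1024X768X16"):
    print(fused, "->", " ".join(split_fused_token(fused)))
```

A rule this blunt would also split tokens that should stay together, which may be one reason the slides describe the fixes as domain specific.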

HNERC v.2: major improvements over HNERC v.1.x
Gazetteer Look-up:
- Lists have been updated to include a larger number of OS, Manuf and Software names (109 more names)
NERC Patterns:
- Addition of new patterns in the following categories:
  - patterns for filtering names that are not part of laptop descriptions
  - patterns for names that have been affected by the changes of the tokenizer
- Evaluation in the new corpus
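As a rough illustration of what gazetteer look-up does with such lists, the sketch below tags token spans by longest match against a dictionary. The entries, the function name `gazetteer_lookup`, and the longest-match policy are assumptions made for illustration; they are not taken from the HNERC lists or from the Ellogon component.

```python
def gazetteer_lookup(tokens, gazetteer, max_len=4):
    """Return (start, end, category) spans, preferring the longest match."""
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = " ".join(tokens[i:i + n]).lower()
            if key in gazetteer:
                match = (i, i + n, gazetteer[key])
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return spans

# Hypothetical entries, not the project's actual lists.
GAZETTEER = {"toshiba": "MANUF", "windows": "SOFT_OS", "windows xp": "SOFT_OS"}
tokens = ["Toshiba", "Satellite", "Windows", "XP", "PIII", "300"]
spans = gazetteer_lookup(tokens, GAZETTEER)
print(spans)
```

Preferring the longest match keeps "Windows XP" as one name instead of tagging "Windows" alone.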

HNERC v.2: major improvements over HNERC v.1.x
HNERC Evaluation Results 1 (without Demarcation)

Category    Precision  Recall  F-measure
MANUF
MODEL
SOFT_OS
PROCESSOR
DATE        0.92       1.0     0.96
DURATION
TIME
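The DATE figures above are consistent with the balanced F-measure, the harmonic mean of precision and recall. The slides do not state which F variant was used, so the default `beta = 1` below is an assumption, and `f_measure` is a hypothetical helper name.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# DATE row: precision 0.92, recall 1.0
date_f = f_measure(0.92, 1.0)
print(round(date_f, 2))
```

With these inputs the value is 1.84 / 1.92, which rounds to the 0.96 reported for DATE.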

HNERC v.2: major improvements over HNERC v.1.x
HNERC Evaluation Results 2 (without Demarcation)

Category    Precision  Recall  F-measure
SPEED
CAPACITY
LENGTH
RESOLUTION
MONEY
PERCENT
WEIGHT

HNERC v.2: major improvements over HNERC v.1.x
- HNERC has been tested on a corpus with a greater diversity of product description categories than the previous corpus (more pages with many laptop products, and more pages with both laptop and non-laptop products).
- Results are comparable to those of the previous evaluation for categories that were commonly found in the corpus.
- Differences were observed in the results for categories that have very low frequency in the corpora (RESOLUTION, PERCENT).
- An evaluation has also been conducted using Demarcation input. This improved results slightly for a few categories, but significantly lowered Recall and F-measure for MONEY.

Demarcation Tool: Evaluation
- Conducted on the Hellenic and French testing corpora, which had been annotated for names and products.
- Page categories and their frequency in the training corpus played an important role in the performance of the tool (better performance for the most common categories).

Evaluation: Hellenic Testing Corpus (1)

Page category  Name type  Precision  Recall  F-measure
ALL            NE
               NUMEX
               TIMEX

Evaluation: Hellenic Testing Corpus (2)

Page category  Name type  Precision  Recall  F-measure
A1             NE
               NUMEX
               TIMEX
B1             NE
               NUMEX
               TIMEX
B2             NE
               NUMEX
               TIMEX

Evaluation: French Testing Corpus (1)

Page category  Name type  Precision  Recall  F-measure
ALL            NE
               NUMEX
               TIMEX

Evaluation: French Testing Corpus (2)

Page category  Name type  Precision  Recall  F-measure
A1             NE
               NUMEX
               TIMEX
B1             NE
               NUMEX
               TIMEX
B2             NE
               NUMEX
               TIMEX

HNERC v.2: Name Matching
- Recognizes instances of the same name within a single laptop description using pattern matching.
  - Matching is conducted for MANUF, MODEL, PROCESSOR, OS, CAPACITY, SPEED, and MONEY names and expressions.
- Evaluation:
  - Uses manual annotations for names and product descriptions.
  - The corpus has not been manually annotated for name matching. Instead, name annotations combined with the norm and product_no attributes have been used to determine coreferential names and to create the key collection.
  - Was conducted for MANUF, PROCESSOR, and OS.
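One way to read the key-collection step: two name annotations count as coreferential when they share the same type, the same `norm` value, and the same `product_no`. The sketch below groups annotations on that basis. The dictionary layout, the sample annotations, and the function name `coreference_groups` are hypothetical; only the attribute names `norm` and `product_no` come from the slide.

```python
from collections import defaultdict

def coreference_groups(annotations):
    """Group name mentions that share type, norm and product_no."""
    groups = defaultdict(list)
    for ann in annotations:
        key = (ann["type"], ann["norm"], ann["product_no"])
        groups[key].append(ann["text"])
    # Assumption: only groups with two or more mentions are kept as keys.
    return {key: texts for key, texts in groups.items() if len(texts) > 1}

# Hypothetical annotations for two product descriptions on one page.
annotations = [
    {"text": "PIII", "type": "PROCESSOR", "norm": "Pentium III", "product_no": 1},
    {"text": "Pentium III", "type": "PROCESSOR", "norm": "Pentium III", "product_no": 1},
    {"text": "Toshiba", "type": "MANUF", "norm": "Toshiba", "product_no": 1},
    {"text": "Pentium III", "type": "PROCESSOR", "norm": "Pentium III", "product_no": 2},
]
keys = coreference_groups(annotations)
print(keys)
```

Including `product_no` in the key keeps the same processor name in two different product descriptions from being merged, in line with matching being restricted to a single laptop description.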