Institute of Informatics & Telecommunications


Institute of Informatics & Telecommunications NCSR “Demokritos” Hellenic NERC Dimitra Farmakiotou, Vangelis Karkaletsis Rome, November 15-16, 2001

Contents
  Architecture
  Lexical Preprocessing
  Gazetteer Lookup
  Identification & Classification
  Normalization
  Schedule
© IIT, NCSR “Demokritos”, Rome 15-16 November 2001

CROSSMARC Architecture
[Diagram. Labelled components: E-retailers Web sites; Multilingual NERC and Name Matching; XHTML pages; Web Pages Collection; XHTML pages annotated with NEs; Domain Ontology; Products Database; End user; Product Comparison; Multilingual and Multimedia Fact Extraction]

HNERC: Architecture

Lexical Preprocessing: Tokenization
Token categories include: HTML tokens and their subtypes, HTML entities, words belonging to different categories according to their writing, punctuation marks, symbols and numbers.
Separates raw text from HTML.
Raw text is used in the subsequent stages of lexical analysis and name recognition.
HTML information for page layout is used for zoning and sentence splitting.

Lexical Preprocessing: Tokenization
Source Code:
  <tr> <td width="259"><font face="Arial">Διαθέτει μνήμη μέχρι <b>512Mb</b>, Modem <b>56Κ</b>, DVD-ROM και τόσα άλλα χαρακτηριστικά. </font></td>
Token                   Token Type
<tr>                    HTML subtype=<tr>
<td width="259">        HTML subtype=<td width="259">
<font face="Arial">     HTML subtype=<font face="Arial">
Διαθέτει                GFW
μνήμη                   GLW
μέχρι                   GLW
<b>                     HTML subtype=<b>
,                       PUNC
512Mb                   NUMW
</b>                    HTML subtype=</b>
Modem                   EFW
<b>                     HTML subtype=<b>
56Κ                     NUMW
DVD                     EUW
-                       SYMBOL
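The token table above can be reproduced with a small sketch. This is not the HNERC tokenizer: it uses Python's `re` module, and the reading of the tags (G/E for Greek or English script; FW, LW, UW for first-capital, lower-case and upper-case words) is an assumption inferred from the examples on the slide. HTML entities are not handled, and the names `TOKEN_RE`, `word_type` and `tokenize` are hypothetical.

```python
import re

# Assumed token grammar, inferred only from the slide's example table.
TOKEN_RE = re.compile(
    r"(?P<HTML><[^>]+>)"         # an HTML tag, kept whole
    r"|(?P<NUMW>\d\w*)"          # a word that starts with a digit (512Mb, 56Κ)
    r"|(?P<WORD>[^\W\d_]+)"      # a run of letters in any script
    r"|(?P<PUNC>[.,;:!?])"       # punctuation marks
    r"|(?P<SYMBOL>[^\w\s])"      # any other non-space, non-word character
)


def is_greek(word):
    """True if the word contains a character from the Greek Unicode blocks."""
    return any("\u0370" <= ch <= "\u03ff" or "\u1f00" <= ch <= "\u1fff"
               for ch in word)


def word_type(word):
    """Assumed tag scheme: script letter (G/E) plus case class (UW/FW/LW)."""
    script = "G" if is_greek(word) else "E"
    if word.isupper():
        case = "UW"
    elif word[0].isupper():
        case = "FW"
    else:
        case = "LW"
    return script + case


def tokenize(source):
    """Return (token, token type) pairs; whitespace is skipped."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind == "HTML":
            tokens.append((text, "HTML subtype=" + text))
        elif kind == "WORD":
            tokens.append((text, word_type(text)))
        else:
            tokens.append((text, kind))
    return tokens


for token, token_type in tokenize("Διαθέτει μνήμη μέχρι <b>512Mb</b>, Modem"):
    print(f"{token:10s} {token_type}")
```

Keeping the HTML tags as tokens, rather than discarding them, matches the slide's point that layout information is reused later for zoning and sentence splitting.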

Lexical Preprocessing: Zoning
Types of zones: image, title, paragraph, list, table.
Textual part of zones is separated from HTML.
Table and list zones are annotated with information about their constituents (table cells, list elements).

Zoning: Image Annotation

Zoning: Table Annotation

Lexical Preprocessing: Sentence Splitting
NERC operates within the boundaries of a sentence.
Sentences in hypertext differ from sentences in raw text: they are smaller and usually fragmented.
Types of sentences:
  lsentence (sentence comprising a list entry)
  psentence (sentence found within a paragraph)
  titsentence (sentence within a title)
  tsentence (sentence found within a cell of a table)
  hsentence (a sentence comprising all the tsentences found in a line of a table)
  vsentence (a sentence comprising all the tsentences belonging to the same column of a table)
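The relation between the three table sentence types can be sketched as follows. The input format (a table already reduced to rows of cell texts) and the function name `table_sentences` are assumptions made for illustration; the slides do not show how the system stores these sentences.

```python
def table_sentences(rows):
    """Derive tsentences, hsentences and vsentences from a table.

    rows: list of table lines, each a list of cell texts (hypothetical input).
    """
    # One tsentence per table cell, in reading order.
    tsentences = [cell for row in rows for cell in row]
    # One hsentence per table line: all tsentences of that line.
    hsentences = [list(row) for row in rows]
    # One vsentence per column: all tsentences in that column.
    width = max((len(row) for row in rows), default=0)
    vsentences = [[row[c] for row in rows if c < len(row)]
                  for c in range(width)]
    return tsentences, hsentences, vsentences


# Invented two-by-two table, for illustration only.
t, h, v = table_sentences([["Model", "RAM"], ["X1", "512Mb"]])
print(h)
print(v)
```

Grouping cells by column as well as by line lets a header cell and the values beneath it be examined together.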

Lexical Preprocessing: Lexical Analysis
Includes POS Tagging and Lemmatization.
POS Tagger: Machine Learning based.
Lemmatizer: returns the lemma for words contained in its lexicon, and the lower-cased word otherwise.
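The lemmatizer's fallback rule can be sketched in a few lines. The lexicon here is a plain dictionary with one invented entry, and `lemmatize` is a hypothetical name; the real lexicon format is not described in the slides.

```python
def lemmatize(word, lexicon):
    """Return the lexicon's lemma for the word, or the lower-cased word."""
    return lexicon.get(word, word.lower())


toy_lexicon = {"pages": "page"}  # invented entry, for illustration only
print(lemmatize("pages", toy_lexicon))
print(lemmatize("Overdose", toy_lexicon))
```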

Gazetteer Look Up
The Gazetteer contains 706 entries of known names and terms.
Lists of names: Model Processor Manufacturer Software Operating System
Lists of terms: Video Output Sound Screen Removables Pointing Device Input/Output Card Battery

Gazetteer Look Up Annotation

Text Representation
Each sentence is represented token by token.
Each token representation contains the following:
  token as appears in source
  token type information
  POS
  lemma
  gazetteer lookup tag
  gazetteer marker
  start of token in text
  end of token in text
Example Psentence: Overdose S
Internal Representation:
  Overdose::EFW::FW::overdose::0::0::248::256
  S::EUW::FW::s::0::0::257::258
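A minimal sketch of reading and writing this representation, assuming the eight fields are always present and separated by `::`. The field names in `TokenRecord` paraphrase the list above and are not identifiers from the system. Treating the two offsets as integers is also an assumption.

```python
from typing import NamedTuple


class TokenRecord(NamedTuple):
    token: str             # token as it appears in the source
    token_type: str        # e.g. EFW, EUW
    pos: str               # part-of-speech tag
    lemma: str
    gazetteer_tag: str     # gazetteer lookup tag ("0" in the slide's example)
    gazetteer_marker: str  # gazetteer marker ("0" in the slide's example)
    start: int             # start of token in text
    end: int               # end of token in text


def parse_token(record):
    """Split one '::'-separated record into its eight fields."""
    fields = record.split("::")
    if len(fields) != 8:
        raise ValueError(f"expected 8 fields, got {len(fields)}: {record!r}")
    *head, start, end = fields
    return TokenRecord(*head, int(start), int(end))


def format_token(tok):
    """Inverse of parse_token."""
    return "::".join(str(value) for value in tok)


tok = parse_token("Overdose::EFW::FW::overdose::0::0::248::256")
print(tok.token, tok.lemma, tok.start, tok.end)
```

A token that itself contains `::` would break this naive split. The slides do not say how such a case is handled.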

NE Identification and Classification Modules
NE Module: MANUF, MODEL, PROCESSOR, SOFT_OS and TERM
NUMEX Module: SPEED, CAPACITY, LENGTH, RESOLUTION, WEIGHT, MONEY, SIMPLE
TIMEX Module: DATE, TIME, DURATION
Numeric and time expressions can be better recognized when the boundaries of Names and Terms are known.

Identification and Classification Patterns
Tcl regular expressions matched against the text representation.
The first type of pattern performs both Identification and Classification. It works for names and expressions that can be unambiguously identified.
Exclusion patterns then examine the context of names and expressions that have already been classified by the Identification and Classification patterns.
In every domain there are cases where a certain entity name may not function as such (e.g. Toshiba Software, the Intel Pentium Processor logo).
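The two pattern types can be illustrated with a rough sketch. It uses Python's `re` module in place of Tcl regular expressions, and every pattern in it is invented for illustration. The gazetteer tag value `MANUF`, the marker value `1` and the offsets in the sample records are also assumptions; the slides only show records whose gazetteer fields are `0`.

```python
import re

# Record layout: token::type::POS::lemma::gaz_tag::gaz_marker::start::end
# Records of one sentence are assumed to be joined by single spaces.

# Identification and classification in one step (hypothetical pattern):
# a numeric word ending in a capacity unit is classified as CAPACITY.
CAPACITY = re.compile(r"(?<!\S)(\d+(?:Mb|Gb))::NUMW::")

# Candidate manufacturer: a record whose gazetteer tag field is MANUF.
MANUF = re.compile(r"(?<!\S)([^:\s]+)(?:::[^:\s]+){3}::MANUF::\S+")

# Exclusion (hypothetical): the candidate is directly followed by the token
# "Software", as in the slide's "Toshiba Software" example.
EXCLUDE_AFTER = re.compile(r"\s+Software::")


def find_capacities(representation):
    return CAPACITY.findall(representation)


def find_manufacturers(representation):
    names = []
    for m in MANUF.finditer(representation):
        # Inspect the context right after the classified name.
        if EXCLUDE_AFTER.match(representation, m.end()):
            continue
        names.append(m.group(1))
    return names


excluded = ("Toshiba::EFW::FW::toshiba::MANUF::1::0::7 "
            "Software::EFW::FW::software::0::0::8::16")
kept = ("Toshiba::EFW::FW::toshiba::MANUF::1::0::7 "
        "Satellite::EFW::FW::satellite::0::0::8::17")
print(find_manufacturers(excluded), find_manufacturers(kept))
```

Matching against the flattened records lets one expression combine the surface token, its type and the gazetteer evidence.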

Identification
General Identification patterns recognize possible names and expressions that are likely to belong to different categories.
General identification patterns may recognize token strings that belong to more than one name or expression, or token sequences that belong to different types of names or expressions.
For this reason, patterns that separate names or exclude certain strings from possible names are applied.
Finally, classification rules are applied that make use of internal and contextual evidence.

Initial Results
            Speed   Capacity  Money   Processor  Model   Manufacturer
Precision   0.852   0.744     0.820   0.540      0.564   0.538
Recall      0.609   0.797     0.533   0.825      0.594   0.913
F-measure   0.710   0.770     0.646   0.653      0.578   0.677
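The slide does not state which F-measure is reported. Assuming it is the balanced F-measure (the harmonic mean of precision and recall), the sketch below recomputes it from the table's rounded precision and recall values. Under that assumption the results agree with the reported figures to within rounding.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure), copied from the table above.
results = {
    "Speed":        (0.852, 0.609, 0.710),
    "Capacity":     (0.744, 0.797, 0.770),
    "Money":        (0.820, 0.533, 0.646),
    "Processor":    (0.540, 0.825, 0.653),
    "Model":        (0.564, 0.594, 0.578),
    "Manufacturer": (0.538, 0.913, 0.677),
}

for name, (p, r, reported) in results.items():
    print(f"{name:12s} computed={f_measure(p, r):.4f} reported={reported:.3f}")
```

Small differences in the last digit are expected, because the inputs are themselves rounded to three decimals.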

Normalization
Normalization of numeric and temporal expressions is based on the laptop ontology matching heuristics

Name Matching
Conversion of classified name strings to a neutral form:
  lower case, unstressed for Greek words
  removal of punctuation marks, symbols and words commonly found in different names of a category
  e.g. Intel Pentium® II → {pentium ii}
Converted names are stored in a temporary list that is compared against lists of equivalence classes of names. These classes are created from the Laptop ontology and the Training Corpora.
  e.g. equivalence class for Pentium II: [pentium ii] [pent ii] [p ii] [p2] [pentium 2] [pent 2]
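The conversion and lookup steps can be sketched as follows. The equivalence class is the one given on the slide. The rest is assumed for illustration: the common-word list, the choice of `pentium ii` as the class key, the category label passed in, and the function names.

```python
import re
import unicodedata

# Hypothetical list of words commonly found in names of a category.
COMMON_WORDS = {"PROCESSOR": {"intel", "processor"}}

# Equivalence class from the slide; using "pentium ii" as key is an assumption.
EQUIVALENCE_CLASSES = {
    "PROCESSOR": {
        "pentium ii": {"pentium ii", "pent ii", "p ii", "p2",
                       "pentium 2", "pent 2"},
    },
}


def neutral_form(name, category):
    """Lower-case, strip stress marks, drop punctuation, symbols and common words."""
    # Decompose accented letters, then drop the combining marks (Greek tonos).
    decomposed = unicodedata.normalize("NFD", name.lower())
    unstressed = "".join(ch for ch in decomposed
                         if not unicodedata.combining(ch))
    # Replace punctuation marks and symbols with spaces.
    cleaned = re.sub(r"[^\w\s]", " ", unstressed)
    common = COMMON_WORDS.get(category, set())
    return " ".join(w for w in cleaned.split() if w not in common)


def match_name(name, category):
    """Return the key of the equivalence class containing the name, or None."""
    form = neutral_form(name, category)
    for key, variants in EQUIVALENCE_CLASSES.get(category, {}).items():
        if form in variants:
            return key
    return None


print(neutral_form("Intel Pentium® II", "PROCESSOR"))
print(match_name("P2", "PROCESSOR"))
```

Because every variant is reduced to the same neutral form before lookup, differences in case, stress marks and trademark symbols do not prevent a match.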

Work schedule
Currently working on improving:
  Identification & Classification
  Name matching
Step 2: Normalisation of temporal and numeric expressions.
Step 3: Exploitation of machine learning techniques (grammar induction, decision tree and rule induction).