Institute of Informatics & Telecommunications

NCSR “Demokritos”
Hellenic NERC
Dimitra Farmakiotou, Vangelis Karkaletsis
Rome, November 15-16, 2001

Contents
- Architecture
- Lexical Preprocessing
- Gazetteer Lookup
- Identification & Classification
- Normalization
- Schedule
© IIT, NCSR “Demokritos”, Rome November 2001

CROSSMARC Architecture
[Architecture diagram. Labelled components: E-retailers' Web sites; Multilingual NERC and Name Matching; XHTML pages; Web Pages Collection; XHTML pages annotated with NEs; Domain Ontology; Products Database; End user; Product Comparison; Multilingual and Multimedia Fact Extraction]

HNERC: Architecture

Lexical Preprocessing: Tokenization
Token categories include:
- HTML tokens and their subtypes
- HTML entities
- words, assigned to categories according to how they are written
- punctuation marks, symbols and numbers
The tokenizer separates raw text from HTML. Raw text is used in the subsequent stages of lexical analysis and name recognition. HTML information for page layout is used for zoning and sentence splitting.

Lexical Preprocessing: Tokenization
Source Code:
<tr> <td width="259">Διαθέτει μνήμη μέχρι 512Mb, Modem 56Κ, DVD-ROM και τόσα άλλα χαρακτηριστικά. </td>
(English: "It has memory up to 512Mb, Modem 56K, DVD-ROM and so many other features.")

Token               Token Type
<tr>                HTML subtype= <tr>
<td width="259">    HTML subtype= <td width="259">
Διαθέτει            GFW
μνήμη               GLW
μέχρι               GLW
512Mb               NUMW
,                   PUNC
Modem               EFW
56Κ                 NUMW
DVD                 EUW
-                   SYMBOL
(The slide lists several further rows of type "HTML subtype=" whose token text was not recoverable.)
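The token typing shown in this example can be approximated with a short Python sketch. It is not the HNERC tokenizer. The reading of the type codes (G/E for Greek/English script; FW, LW, UW for first-letter-capital, lower-case and upper-case words) is inferred from the example tokens rather than defined on the slides, and the labels `HTML` and `HTML_ENTITY` are hypothetical.

```python
import re
import unicodedata

# Order matters: tags first, then HTML entities, then alphanumeric runs,
# then any other single non-space character.
TOKEN_RE = re.compile(r"<[^>]+>|&#?\w+;|\w+|[^\w\s]")

def script_prefix(word):
    """Return 'G' if the word contains a Greek letter, else 'E' (assumed convention)."""
    for ch in word:
        if "GREEK" in unicodedata.name(ch, ""):
            return "G"
    return "E"

def token_type(tok):
    """Assign a type code modelled on the codes shown in the slide example."""
    if tok.startswith("<") and tok.endswith(">"):
        return "HTML"            # hypothetical label for HTML tokens
    if tok.startswith("&") and tok.endswith(";"):
        return "HTML_ENTITY"     # hypothetical label for HTML entities
    if tok[0].isdigit():
        return "NUMW"            # token starting with a digit, e.g. 512Mb
    if tok[0].isalpha():
        if tok.isupper():
            case = "UW"
        elif tok[0].isupper():
            case = "FW"
        else:
            case = "LW"
        return script_prefix(tok) + case
    if tok in ".,;:!?":
        return "PUNC"
    return "SYMBOL"

def tokenize(source):
    """Split HTML source into (token, type) pairs."""
    return [(tok, token_type(tok)) for tok in TOKEN_RE.findall(source)]

if __name__ == "__main__":
    sample = '<td width="259">Διαθέτει μνήμη μέχρι 512Mb, Modem 56Κ, DVD-ROM</td>'
    for tok, typ in tokenize(sample):
        print(f"{tok}\t{typ}")
```

A single regular expression keeps HTML tags intact as one token while the raw text is split into words, numbers, punctuation and symbols, which mirrors the separation of raw text from HTML described above.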

Lexical Preprocessing: Zoning
Types of zones: image, title, paragraph, list, table.
The textual part of each zone is separated from the HTML.
Table and list zones are annotated with information about their constituents (table cells, list elements).

Zoning: Image Annotation

Zoning: Table Annotation

Lexical Preprocessing: Sentence Splitting
NERC operates within the boundaries of a sentence.
Sentences in hypertext differ from sentences in raw text: they are shorter and usually fragmented.
Types of sentences:
- lsentence: a sentence comprising a list entry
- psentence: a sentence found within a paragraph
- titsentence: a sentence within a title
- tsentence: a sentence found within a table cell
- hsentence: a sentence made up of all the tsentences found in one line (row) of a table
- vsentence: a sentence made up of all the tsentences belonging to the same column of a table
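The relation between tsentences, hsentences and vsentences can be illustrated with a small Python sketch. It assumes a table has already been reduced to a list of rows, each row being a list of cell texts; the function name and the sample table are invented for illustration only.

```python
def table_sentences(rows):
    """Build hsentences (one per table line) and vsentences (one per column).

    rows: list of table lines, each a list of tsentence strings.
    Rows may have different lengths; missing cells are skipped.
    """
    hsentences = [" ".join(cells) for cells in rows]
    n_cols = max((len(cells) for cells in rows), default=0)
    vsentences = [
        " ".join(cells[col] for cells in rows if col < len(cells))
        for col in range(n_cols)
    ]
    return hsentences, vsentences

if __name__ == "__main__":
    # Invented sample data, not taken from the slides.
    table = [["Processor", "Pentium III"], ["Memory", "128 MB"]]
    h, v = table_sentences(table)
    print(h)
    print(v)
```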

Lexical Preprocessing: Lexical Analysis
Includes POS tagging and lemmatization.
POS tagger: machine-learning based.
Lemmatizer: returns the lemma for words contained in its lexicon; for all other words it returns the word in lower case.
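The lemmatizer's fallback behaviour amounts to a lexicon lookup with a lower-casing default. A minimal Python sketch, assuming the lexicon is a plain mapping from word form to lemma (the toy entry below is invented):

```python
def lemmatize(word, lexicon):
    """Return the lexicon lemma if the word is known, else the lower-cased word."""
    return lexicon.get(word, word.lower())

if __name__ == "__main__":
    toy_lexicon = {"laptops": "laptop"}  # invented entry for illustration
    print(lemmatize("laptops", toy_lexicon))
    print(lemmatize("Overdose", toy_lexicon))
```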

Gazetteer Look Up
The Gazetteer contains 706 entries of known names and terms.
Lists of names: Model Processor Manufacturer Software Operating System
Lists of terms: Video Output Sound Screen Removables Pointing Device Input/Output Card Battery

Gazetteer Look Up Annotation

Text Representation
Each sentence is represented token by token. Each token representation contains the following:
- token as it appears in the source
- token type information
- POS
- lemma
- gazetteer lookup tag
- gazetteer marker
- start of token in text
- end of token in text
Example Psentence: Overdose S
Internal Representation:
Overdose::EFW::FW::overdose::0::0::248::256 S::EUW::FW::s::0::0::257::258
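The eight `::`-separated fields of this representation can be read back into a structured record. The Python sketch below is one way to do so; the class and field names are chosen for illustration, and the two gazetteer fields are kept as strings because the slides do not explain their values. It assumes tokens are separated by whitespace and do not themselves contain `::`.

```python
from typing import NamedTuple

class TokenRecord(NamedTuple):
    token: str        # token as it appears in the source
    token_type: str   # e.g. EFW, EUW
    pos: str          # part-of-speech tag
    lemma: str
    gaz_tag: str      # gazetteer lookup tag (meaning of values not given)
    gaz_marker: str   # gazetteer marker (meaning of values not given)
    start: int        # start offset of the token in the text
    end: int          # end offset of the token in the text

def parse_record(field_string):
    """Parse one 'a::b::...::h' token representation into a TokenRecord."""
    parts = field_string.split("::")
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}: {field_string!r}")
    *text_fields, start, end = parts
    return TokenRecord(*text_fields, int(start), int(end))

def parse_sentence(line):
    """Parse a whitespace-separated sequence of token representations."""
    return [parse_record(chunk) for chunk in line.split()]

if __name__ == "__main__":
    line = "Overdose::EFW::FW::overdose::0::0::248::256 S::EUW::FW::s::0::0::257::258"
    for record in parse_sentence(line):
        print(record)
```

In the slide's example the offsets are consistent with the token lengths: `end - start` is 8 for "Overdose" and 1 for "S".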

NE Identification and Classification
Modules:
- NE Module: MANUF, MODEL, PROCESSOR, SOFT_OS and TERM
- NUMEX Module: SPEED, CAPACITY, LENGTH, RESOLUTION, WEIGHT, MONEY, SIMPLE
- TIMEX Module: DATE, TIME, DURATION
Numeric and time expressions can be recognized more reliably once the boundaries of names and terms are known.

Identification and Classification Patterns
Patterns are Tcl regular expressions matched against the text representation.
The first type of pattern performs both identification and classification. It works for names and expressions that can be identified unambiguously.
The context of names and expressions already classified by Identification and Classification patterns is then examined further by Exclusion patterns: in every domain there are cases where a certain entity name may not function as such (e.g. Toshiba Software, the Intel Pentium Processor logo).
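The interplay of an Identification and Classification pattern with an Exclusion pattern can be sketched as follows. The sketch uses Python's `re` module rather than Tcl, and the patterns are invented, not taken from HNERC. It also assumes that a gazetteer hit is recorded as the string `MANUF` in the gazetteer lookup tag field, which the slides do not confirm. The exclusion case follows the slide's "Toshiba Software" example.

```python
import re

FIELD = r"[^:\s]+"  # one '::'-separated field of a token representation

# Identification and Classification (hypothetical): any token whose
# gazetteer lookup tag is MANUF is taken to be a manufacturer name.
MANUF_RE = re.compile(
    rf"(?<!\S)(?P<tok>{FIELD})::{FIELD}::{FIELD}::{FIELD}"
    rf"::MANUF::{FIELD}::(?P<start>\d+)::(?P<end>\d+)(?!\S)"
)

# Exclusion (hypothetical): reject the candidate when the next token is
# "Software", as in "Toshiba Software".
EXCLUDE_NEXT_RE = re.compile(r"\s+Software::")

def find_manufacturers(representation):
    """Return (name, start, end) for MANUF candidates that survive exclusion."""
    found = []
    for match in MANUF_RE.finditer(representation):
        if EXCLUDE_NEXT_RE.match(representation, match.end()):
            continue  # context shows the name does not function as a manufacturer
        found.append((match.group("tok"), int(match.group("start")), int(match.group("end"))))
    return found

if __name__ == "__main__":
    kept = ("Toshiba::EFW::FW::toshiba::MANUF::1::0::7 "
            "Satellite::EFW::FW::satellite::0::0::8::17")
    dropped = ("Toshiba::EFW::FW::toshiba::MANUF::1::0::7 "
               "Software::EFW::FW::software::0::0::8::16")
    print(find_manufacturers(kept))
    print(find_manufacturers(dropped))
```

Matching against the text representation rather than the raw text lets a single pattern combine surface form, token type, POS and gazetteer evidence.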

Identification
General identification patterns recognize possible names and expressions that are likely to belong to different categories.
They may recognize token strings that span more than one name or expression, or token sequences that belong to different types of names or expressions. For this reason, patterns that separate names or exclude certain strings from possible names are applied next.
Finally, classification rules are applied that make use of internal and contextual evidence.

Initial Results
             Speed   Capacity  Money   Processor  Model   Manufacturer
Precision    0.852   0.744     0.820   0.540      0.564   0.538
Recall       0.609   0.797     0.533   0.825      0.594   0.913
F-measure    0.710   0.770     0.646   0.653      0.578   0.677
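The slides do not say which F-measure is reported, but the balanced F-measure (the harmonic mean of precision and recall) reproduces every reported value to within 0.001. The small differences are consistent with precision and recall having been rounded before publication. The Python check below uses only the figures from the results above.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F-measure) per category, from the slide.
RESULTS = {
    "Speed":        (0.852, 0.609, 0.710),
    "Capacity":     (0.744, 0.797, 0.770),
    "Money":        (0.820, 0.533, 0.646),
    "Processor":    (0.540, 0.825, 0.653),
    "Model":        (0.564, 0.594, 0.578),
    "Manufacturer": (0.538, 0.913, 0.677),
}

if __name__ == "__main__":
    for name, (p, r, reported) in RESULTS.items():
        computed = f_measure(p, r)
        print(f"{name:<13} computed={computed:.4f} reported={reported:.3f}")
```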

Normalization
Normalization of numeric and temporal expressions is based on the laptop ontology and on matching heuristics.

Name Matching
Classified name strings are converted to a neutral form:
- lower case, unstressed for Greek words
- removal of punctuation marks, symbols and words commonly found in different names of a category
e.g. Intel Pentium® II becomes {pentium ii}
Converted names are stored in a temporary list that is used for comparing them against lists of equivalence classes of names. These classes are created from the Laptop ontology and the Training Corpora.
e.g. equivalence class for Pentium II: [pentium ii] [pent ii] [p ii] [p2] [pentium 2] [pent 2]
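The conversion to a neutral form and the comparison against equivalence classes can be sketched in Python as below. The equivalence class is the one from the slide. The stop-word set (here just "intel"), the use of the first variant as the class label, and all function names are assumptions made for illustration. Stress removal is done by stripping Unicode combining marks, which also removes other diacritics.

```python
import re
import unicodedata

def strip_stress(text):
    """Remove stress marks by dropping Unicode combining characters."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)

def neutral_form(name, stop_words):
    """Lower-case, unstress, drop punctuation/symbols and category stop words."""
    text = strip_stress(name.lower())
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation and symbols such as the registered sign
    return " ".join(word for word in text.split() if word not in stop_words)

def match_name(name, classes, stop_words):
    """Return the label of the equivalence class containing the name, or None."""
    form = neutral_form(name, stop_words)
    for label, variants in classes.items():
        if form in variants:
            return label
    return None

# Equivalence class taken from the slide; the label is the first variant (assumption).
PROCESSOR_CLASSES = {
    "pentium ii": {"pentium ii", "pent ii", "p ii", "p2", "pentium 2", "pent 2"},
}
PROCESSOR_STOP_WORDS = {"intel"}  # assumed example of a word common to names of the category

if __name__ == "__main__":
    for raw in ["Intel Pentium® II", "P2", "Pent. 2", "Celeron"]:
        print(raw, "->", match_name(raw, PROCESSOR_CLASSES, PROCESSOR_STOP_WORDS))
```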

Work schedule
Currently working on improving:
- Identification & Classification
- Name matching
Step 2: Normalisation of temporal and numeric expressions.
Step 3: Exploitation of machine learning techniques (grammar induction, decision tree and rule induction).
