Institute of Informatics & Telecommunications NCSR “Demokritos” Hellenic NERC Dimitra Farmakiotou, Vangelis Karkaletsis Rome, November 15-16, 2001
Contents
- Architecture
- Lexical Preprocessing
- Gazetteer Lookup
- Identification & Classification
- Normalization
- Schedule
© IIT, NCSR “Demokritos”, Rome 15-16 November 2001

CROSSMARC Architecture
[Diagram. Components shown: E-retailers' Web sites; Web Pages Collection; XHTML pages; Multilingual NERC and Name Matching; XHTML pages annotated with NEs; Multilingual and Multimedia Fact Extraction; Domain Ontology; Products Database; Product Comparison; End user]

HNERC: Architecture [diagram]

Lexical Preprocessing: Tokenization
Token categories include: HTML tokens and their subtypes, HTML entities, words assigned to different categories according to how they are written, punctuation marks, symbols and numbers.
The tokenizer separates raw text from HTML. The raw text is used in the subsequent stages of lexical analysis and name recognition. The HTML page-layout information is used for zoning and sentence splitting.

Lexical Preprocessing: Tokenization
Source code:
<tr> <td width="259"><font face="Arial">Διαθέτει μνήμη μέχρι <b>512Mb</b>, Modem <b>56Κ</b>, DVD-ROM και τόσα άλλα χαρακτηριστικά. </font></td>
(English gloss of the Greek text: "It has memory up to 512Mb, a 56K modem, DVD-ROM and so many other features.")

Token                  Token Type
<tr>                   HTML subtype=<tr>
<td width="259">       HTML subtype=<td width="259">
<font face="Arial">    HTML subtype=<font face="Arial">
Διαθέτει               GFW
μνήμη                  GLW
μέχρι                  GLW
<b>                    HTML subtype=<b>
512Mb                  NUMW
</b>                   HTML subtype=</b>
,                      PUNC
Modem                  EFW
<b>                    HTML subtype=<b>
56Κ                    NUMW
DVD                    EUW
-                      SYMBOL

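The token/type split shown above can be approximated with a single regular expression. This is only a minimal sketch, not the HNERC tokenizer: the pattern, the function name `tokenize` and the coarse `WORD` and `ENTITY` classes are assumptions, and the finer word classes from the slide (GFW, GLW, EFW, EUW) are deliberately not reproduced.

```python
import re

# Coarse token classes only (an assumption for illustration).
TOKEN_RE = re.compile(
    r"""
    (?P<HTML><[^>]+>)        # an HTML tag, kept whole
  | (?P<ENTITY>&\#?\w+;)     # an HTML entity such as &amp;
  | (?P<NUMW>\d\w*)          # a token starting with a digit, e.g. 512Mb
  | (?P<WORD>[^\W\d_]+)      # a run of letters in any script
  | (?P<PUNC>[.,;:!?])       # punctuation marks
  | (?P<SYMBOL>[^\s\w])      # any other non-space character
    """,
    re.VERBOSE,
)


def tokenize(source):
    """Return (token, coarse_type) pairs in source order."""
    return [(m.group(), m.lastgroup) for m in TOKEN_RE.finditer(source)]


if __name__ == "__main__":
    for token, kind in tokenize("μέχρι <b>512Mb</b>, Modem <b>56Κ</b>, DVD-ROM"):
        print(f"{token:<10} {kind}")
```

Keeping each HTML tag as one token preserves the layout information that zoning and sentence splitting rely on, while the remaining tokens form the raw text.
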
Lexical Preprocessing: Zoning
Types of zones: image, title, paragraph, list, table.
The textual part of each zone is separated from the HTML.
Table and list zones are annotated with information about their constituents (table cells, list elements).

Zoning: Image Annotation [figure]

Zoning: Table Annotation [figure]

Lexical Preprocessing: Sentence Splitting
NERC is performed within the boundaries of a sentence.
Sentences in hypertext differ from sentences in raw text: they are shorter and usually fragmented.
Types of sentences:
- lsentence: a sentence comprising a list entry
- psentence: a sentence found within a paragraph
- titsentence: a sentence within a title
- tsentence: a sentence found within a table cell
- hsentence: a sentence composed of all the tsentences found in one row of a table
- vsentence: a sentence composed of all the tsentences belonging to the same column of a table

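The relation between tsentences, hsentences and vsentences can be sketched as follows. The function name `table_sentences`, the space used to join cells and the sample table are assumptions for illustration; the slides do not say how the cell texts are combined.

```python
def table_sentences(rows):
    """Build hsentences and vsentences from a table.

    rows is a list of table rows; each row is a list of cell texts
    (one tsentence per cell). Rows may have different lengths.
    """
    # One hsentence per row: all tsentences of that row.
    hsentences = [" ".join(row) for row in rows]

    # One vsentence per column: all tsentences in the same column.
    n_cols = max((len(row) for row in rows), default=0)
    vsentences = [
        " ".join(row[c] for row in rows if c < len(row))
        for c in range(n_cols)
    ]
    return hsentences, vsentences


if __name__ == "__main__":
    # Hypothetical table content, not taken from the corpus.
    table = [["Processor", "Pentium III"], ["Memory", "128 MB"]]
    h, v = table_sentences(table)
    print(h)
    print(v)
```
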
Lexical Preprocessing: Lexical Analysis
Lexical analysis includes POS tagging and lemmatization.
POS tagger: machine-learning based.
Lemmatizer: returns the lemma for words contained in its lexicon, and the lower-cased word otherwise.

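The lemmatizer's fallback behaviour amounts to a dictionary lookup with a default. In this sketch the lexicon contents and the choice to look words up exactly as written are assumptions; only the fallback to lower case comes from the slide.

```python
def lemmatize(word, lexicon):
    """Return the lexicon's lemma for word, or the lower-cased word."""
    return lexicon.get(word, word.lower())


if __name__ == "__main__":
    lexicon = {"laptops": "laptop"}  # hypothetical lexicon entry
    print(lemmatize("laptops", lexicon))   # found in the lexicon
    print(lemmatize("Overdose", lexicon))  # not found: lower-cased
```
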
Gazetteer Look Up
The Gazetteer contains 706 entries of known names and terms.
Lists of names: Model, Processor, Manufacturer, Software, Operating System
Lists of terms: Video Output Sound Screen Removables Pointing Device Input/Output Card Battery

Gazetteer Look Up Annotation [figure]

Text Representation
Each sentence is represented token by token. Each token representation contains the following fields:
- token as it appears in the source
- token type information
- POS
- lemma
- gazetteer lookup tag
- gazetteer marker
- start of token in text
- end of token in text
Example psentence: Overdose S
Internal representation:
Overdose::EFW::FW::overdose::0::0::248::256
S::EUW::FW::s::0::0::257::258

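The internal representation above can be read back into structured records with a few lines of code. This sketch assumes that fields are separated by `::`, that tokens are separated by whitespace, and that a token never contains `::` itself; the class and field names are illustrative, not taken from HNERC.

```python
from typing import NamedTuple


class Token(NamedTuple):
    surface: str           # token as it appears in the source
    token_type: str        # e.g. EFW, EUW
    pos: str               # part-of-speech tag
    lemma: str
    gazetteer_tag: str
    gazetteer_marker: str
    start: int             # start of token in text
    end: int               # end of token in text


def parse_token(field_string):
    """Parse one '::'-separated token record into a Token."""
    parts = field_string.split("::")
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}: {field_string!r}")
    *text_fields, start, end = parts
    return Token(*text_fields, int(start), int(end))


def parse_sentence(representation):
    """Parse a whitespace-separated sequence of token records."""
    return [parse_token(record) for record in representation.split()]


if __name__ == "__main__":
    rep = "Overdose::EFW::FW::overdose::0::0::248::256 S::EUW::FW::s::0::0::257::258"
    for token in parse_sentence(rep):
        print(token)
```
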
NE Identification and Classification Modules
- NE module: MANUF, MODEL, PROCESSOR, SOFT_OS and TERM
- NUMEX module: SPEED, CAPACITY, LENGTH, RESOLUTION, WEIGHT, MONEY, SIMPLE
- TIMEX module: DATE, TIME, DURATION
Numeric and time expressions can be recognized more reliably when the boundaries of names and terms are already known.

Identification and Classification Patterns
Patterns are Tcl regular expressions matched against the text representation.
The first type of pattern performs both identification and classification. It works for names and expressions that can be identified unambiguously.
Exclusion patterns then examine the context of names and expressions that have already been classified by the identification and classification patterns. In every domain there are cases where an entity name does not function as such (e.g. "Toshiba Software", "the Intel Pentium Processor logo").

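The two kinds of pattern described above can be illustrated with Python regular expressions applied to the `::`-separated text representation. Everything here is an assumption made for illustration: the actual HNERC patterns are written in Tcl and are not shown in the slides, the unit list is invented, and the POS value `X` and the offsets in the sample records are placeholders.

```python
import re

# Identification-and-classification pattern (hypothetical): a NUMW token whose
# surface form is digits followed by a capacity unit is tagged CAPACITY.
CAPACITY = re.compile(r"(?<!\S)(?P<surface>\d+(?:Mb|MB|Gb|GB))::NUMW::\S+")

# Exclusion pattern (hypothetical): a manufacturer name directly followed by
# the token "Software" is not functioning as a manufacturer name.
NOT_MANUF = re.compile(r"(?<!\S)(?P<surface>Toshiba)::\S+\s+Software::")


def find_capacities(representation):
    """Return the surface forms of tokens classified as CAPACITY."""
    return [m.group("surface") for m in CAPACITY.finditer(representation)]


def is_excluded(representation):
    """Return True if the exclusion pattern matches the representation."""
    return NOT_MANUF.search(representation) is not None


if __name__ == "__main__":
    rep = "μέχρι::GLW::X::μέχρι::0::0::20::25 512Mb::NUMW::X::512mb::0::0::26::31"
    print(find_capacities(rep))
    print(is_excluded(
        "Toshiba::EFW::X::toshiba::0::0::0::7 Software::EFW::X::software::0::0::8::16"
    ))
```

Matching against the serialized representation lets a single pattern constrain the surface form and the token type at once, which is what makes the first kind of pattern able to identify and classify in one step.
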
Identification
General identification patterns recognize possible names and expressions that may belong to different categories.
These patterns may match a token string that spans more than one name or expression, or a token sequence that belongs to different types of names or expressions. For this reason, patterns that separate names, or that exclude certain strings from candidate names, are applied next.
Finally, classification rules are applied that make use of internal and contextual evidence.

Initial Results
             Speed   Capacity   Money   Processor   Model   Manufacturer
Precision    0.852   0.744      0.820   0.540       0.564   0.538
Recall       0.609   0.797      0.533   0.825       0.594   0.913
F-measure    0.710   0.770      0.646   0.653       0.578   0.677

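The reported F-measure values are consistent, to within rounding, with the balanced F-measure, i.e. the harmonic mean of precision and recall. The slides do not state which variant was used, so treat this as an assumption. The sketch below recomputes it from the precision and recall figures in the table.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) copied from the results table.
REPORTED = {
    "Speed": (0.852, 0.609, 0.710),
    "Capacity": (0.744, 0.797, 0.770),
    "Money": (0.820, 0.533, 0.646),
    "Processor": (0.540, 0.825, 0.653),
    "Model": (0.564, 0.594, 0.578),
    "Manufacturer": (0.538, 0.913, 0.677),
}

if __name__ == "__main__":
    for name, (p, r, reported) in REPORTED.items():
        print(f"{name:<13} computed={f_measure(p, r):.3f} reported={reported:.3f}")
```
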
Normalization
Normalization of numeric and temporal expressions is based on the laptop ontology matching heuristics.

Name Matching
Classified name strings are converted to a neutral form:
- lower case, and unstressed for Greek words
- punctuation marks, symbols and words commonly found in different names of a category are removed
  e.g. Intel Pentium® II → {pentium ii}
Converted names are stored in a temporary list, which is compared against lists of equivalence classes of names. These classes are created from the Laptop ontology and the training corpora.
  e.g. equivalence class for Pentium II: [pentium ii] [pent ii] [p ii] [p2] [pentium 2] [pent 2]

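The conversion to a neutral form and the comparison against equivalence classes can be sketched as follows. The equivalence class for Pentium II is taken from the slide; the list of common words per category, the category key `"processor"` and the function names are assumptions for illustration, not the project's actual resources.

```python
import unicodedata

# Hypothetical words commonly found in names of a category.
COMMON_WORDS = {"processor": {"intel", "processor"}}

# Equivalence class from the slide, keyed by a representative form.
EQUIVALENCE_CLASSES = {
    "pentium ii": ["pentium ii", "pent ii", "p ii", "p2", "pentium 2", "pent 2"],
}
VARIANT_TO_CLASS = {
    variant: key
    for key, variants in EQUIVALENCE_CLASSES.items()
    for variant in variants
}


def neutral_form(name, category):
    """Lower-case, strip stress marks, drop punctuation, symbols and common words."""
    decomposed = unicodedata.normalize("NFD", name.lower())
    kept = []
    for ch in decomposed:
        cat = unicodedata.category(ch)
        if cat == "Mn":  # combining mark, e.g. the Greek stress accent
            continue
        # Punctuation (P*) and symbols (S*) become spaces.
        kept.append(" " if cat[0] in "PS" else ch)
    words = "".join(kept).split()
    drop = COMMON_WORDS.get(category, set())
    return " ".join(w for w in words if w not in drop)


def match_name(name, category):
    """Return the equivalence class key for name, or None if unknown."""
    return VARIANT_TO_CLASS.get(neutral_form(name, category))


if __name__ == "__main__":
    print(neutral_form("Intel Pentium® II", "processor"))
    print(match_name("P2", "processor"))
```

Decomposing with NFD before filtering is what removes the Greek stress mark: the accented letter splits into a base letter plus a combining accent, and the accent is then dropped.
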
Work schedule
Currently working on improving:
- Identification & Classification
- Name matching
Step 2: Normalisation of temporal and numeric expressions.
Step 3: Exploitation of machine learning techniques (grammar induction, decision tree and rule induction).