FNERC OVERVIEW 05/12/2002. Lingway, 05-06 of December 2002


1 FNERC OVERVIEW 05/12/2002

2 Lingway, 05-06 of December 2002
FNERC: introduction
Lingway entered the project when CDC had already worked on FNERC.
Decision to use our own tools: XTIRP Extraction Tool.
System available at: http://hugo.lingway.com/CmBin/cmCgi.exe?_rule=CrossmarcD1&_url=

3 FNERC Version 2.0
FNERC System Description
  Architecture
  NE / TIMEX, NUMEX / TERM Annotation
  Name Matching and Normalisation
Evaluation
  Ellogon
  Measures
FNERC Future Developments
  Ontology Matters

4 FNERC System Description: architecture (1)

5 FNERC System Description: architecture (2)
XTIRP To_XML module: ensures that the input is XML-conformant.
  If it is, the module processes the input into a tree structure with all tags kept.
  If it is not, the module applies a tidy-like step to create an XML-conformant structure and processes it into a tree structure.
  CROSSMARC: this module is normally not used, as the inputs of FNERC are XHTML files.
XTIRP Tokenizer: splits the input text into sequences corresponding either to
  logical structure (such as a sentence, a paragraph, a section, etc.), or
  strong tags (such as td, p, br, etc.).
  CROSSMARC: for the first domain, we decided to keep the tag splitting.
XTIRP NE / NUMEX / TIMEX Annotator:
  A set of regular-expression rules that identify patterns and add annotations (tags and attributes) to the recognized sequences.
  Use of Ontology and Lexicon.
Name Matching / Normalisation:
  Name matching: match coreferential NE, NUMEX and TIMEX.
  Normalisation: normalise the NE, NUMEX and TIMEX filling the slots.
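As an illustration only, splitting on strong tags could look like the Python sketch below. It is not XTIRP code: the function name and the regex-based approach are assumptions, and unlike the real module it drops the tags instead of keeping them in a tree.

```python
import re

# Strong tags named on the slide as examples; the real list is not given.
STRONG_TAGS = ("td", "p", "br")

# Matches an opening, closing or self-closing occurrence of a strong tag.
_SPLIT = re.compile(r"</?(?:%s)\b[^>]*>" % "|".join(STRONG_TAGS), re.IGNORECASE)


def split_on_strong_tags(xhtml):
    """Return the non-empty text sequences found between strong tags."""
    return [seg.strip() for seg in _SPLIT.split(xhtml) if seg.strip()]


if __name__ == "__main__":
    print(split_on_strong_tags("<p>Intel PIII</p><p>24 Mo<br/>garantie 2 ans</p>"))
```

Each returned sequence would then be handed to the annotator on its own, so a rule cannot match across a cell or paragraph boundary.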

6 FNERC System Description: Annotator (1)
Rule format
  [AVAILABILITY]
  RegularExpression = '[0-9\/]+ *(heures|h\.|jours|j\.|mois)'
  Tag_1 = "timex4(OA-d0e2145,OF-d0e2143, OV-d0e2141, DURATION)"
Where:
  First line: rule title
  Second line: Perl-like regular expression for what is to be annotated
  Third line: action(s) to be taken. Refers to a general action sequence, referred to as timex4.
Actions
  [Timex4]
  Name = "TIMEX"
  Attributes = "Feature=@1 Attribute=@2 Value=@3 Type=@4"
  POSITION = MATCH
Where:
  Second line: the tag name (Name="TIMEX")
  Third line: the attributes of this tag (the @1, @2, @3, @4 variables correspond to the values in the rule)
  Fourth line: the position of the tag
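A minimal Python sketch of how such a rule and its action might be applied. The regular expression, identifiers and attribute template come from the slide. The dictionary layout, the function name, the quoting of attribute values and the reading of POSITION = MATCH as "wrap the matched text" are assumptions, not XTIRP behaviour.

```python
import re

# Rule values copied from the slide; the dictionary layout is hypothetical.
RULE = {
    "title": "AVAILABILITY",
    "regex": r"[0-9\/]+ *(heures|h\.|jours|j\.|mois)",
    "action": "timex4",
    "args": ["OA-d0e2145", "OF-d0e2143", "OV-d0e2141", "DURATION"],
}

# Action values copied from the slide ([Timex4] section).
ACTIONS = {
    "timex4": {
        "name": "TIMEX",
        "attributes": "Feature=@1 Attribute=@2 Value=@3 Type=@4",
    }
}


def annotate(text, rule=RULE, actions=ACTIONS):
    """Wrap every match of the rule's regex in the tag defined by its action."""
    action = actions[rule["action"]]
    attrs = action["attributes"]
    # Substitute @1..@n with the arguments given in the rule.
    for i, value in enumerate(rule["args"], start=1):
        attrs = attrs.replace("@%d" % i, '"%s"' % value)
    name = action["name"]

    def wrap(match):
        # Assumed meaning of POSITION = MATCH: the tag encloses the match.
        return "<%s %s>%s</%s>" % (name, attrs, match.group(0), name)

    return re.sub(rule["regex"], wrap, text)


if __name__ == "__main__":
    print(annotate("Livraison sous 48 h. ou 3 jours"))
```

Keeping the action separate from the rule means many rules can share one tag definition and differ only in their arguments.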

7 FNERC System Description: Annotator (2)
Automatic generation of rules from Ontology and Lexicon
  Node information: Ontology: identifiers; Lexicon: regular expressions
  XSLT stylesheet to generate the rule file
  Manual checking of the rules in the generated rule file (corrections, addition of generic rules)
  Currently 194 rules
Ambiguity handling
  In some cases several rules can apply (e.g. NUMEX-CAPACITY, which applies to both hard disk capacity and memory capacity)
  Generation of an embedding AMBIG tag in FNERC: 24 MO
  Resolution in the FE module, using contextual information (for example, the TERM "Mémoire vive" on the left)
Terms: many terms are recognized, to be used in FE
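The contextual resolution of an AMBIG tag could be sketched as follows in Python. Everything here is hypothetical: the cue table, the reading labels and the "closest cue on the left wins" policy are assumptions for illustration, not the FE module's actual logic.

```python
# Hypothetical cue table: a TERM seen to the left selects a reading.
CONTEXT_CUES = {
    "mémoire vive": "MEMORY_CAPACITY",
    "disque dur": "HARD_DISK_CAPACITY",
}


def resolve_ambiguity(left_context, candidates, cues=CONTEXT_CUES):
    """Pick the candidate reading whose cue term is closest on the left.

    Returns None when no cue for any candidate occurs in the left context.
    """
    lowered = left_context.lower()
    best, best_pos = None, -1
    for term, reading in cues.items():
        pos = lowered.rfind(term)  # -1 when the term is absent
        if reading in candidates and pos > best_pos:
            best, best_pos = reading, pos
    return best


if __name__ == "__main__":
    readings = ["HARD_DISK_CAPACITY", "MEMORY_CAPACITY"]
    print(resolve_ambiguity("Mémoire vive : ", readings))
```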

8 FNERC: Name Matching / Normalisation
Name Matching
  Matches co-referential NE, NUMEX and TIMEX inside the same product description.
  Needs the demarcator process to run before it is applied.
  Lingway: uses the attribute "value" that we add during the FNERC module.
  Example: if, in the same product description, we annotate a PROCESSOR twice (say Intel PIII and Intel Pentium III)
    => they will have the same value Id,
    => when filling the NE – PROCESSOR slot, the module will add just one to the slot.
  Run with an XSLT stylesheet against the XHTML input file.
Normalisation
  Makes it possible to display the extracted information in the various CROSSMARC languages.
  Lingway: uses the attribute "value"; by processing the Ontology and Lexicons, displays the synonym in one or the other language.
  Run with an XSLT stylesheet against the XHTML input file.
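The slide states that name matching runs as an XSLT stylesheet; the Python sketch below only illustrates the idea of merging mentions that share a value Id. The tuple layout, the function name and the Ids in the example are invented for illustration.

```python
def fill_slots(annotations):
    """Fill slots from one product description, one entry per value Id.

    annotations: iterable of (slot, value_id, surface_text) tuples.
    Mentions with the same slot and value Id are treated as co-referential.
    """
    slots = {}
    seen = set()
    for slot, value_id, _text in annotations:
        if (slot, value_id) in seen:
            continue  # co-referential mention already recorded
        seen.add((slot, value_id))
        slots.setdefault(slot, []).append(value_id)
    return slots


if __name__ == "__main__":
    # Hypothetical value Ids; real ones come from the ontology.
    mentions = [
        ("PROCESSOR", "OV-piii", "Intel PIII"),
        ("PROCESSOR", "OV-piii", "Intel Pentium III"),
        ("MANUF", "OV-intel", "Intel"),
    ]
    print(fill_slots(mentions))
```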

9 FNERC: Evaluation
Ellogon evaluation
  Still to be done, due to format problems and delays.
  Discussion about: XHTML format, specific output vs human annotators' output.
Lingway evaluation: developments still to be done
  One-to-one comparison of test files: human-annotated file / FNERC-annotated file
  Precision
  Recall
Miscellaneous: Ontology matters
  Missing ontology items
  Missing ontology attributes
  Processing of specific information: additional textual information, fuzzy numerical values, binary values
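For reference, precision and recall over a one-to-one file comparison can be computed as in the Python sketch below. It assumes exact matching of (type, text) pairs, which is a simplification; the slides do not say how matches are to be counted, and the sample annotations are illustrative only.

```python
def precision_recall(system, gold):
    """Return (precision, recall) for system vs human annotations.

    Both arguments are iterables of hashable annotations, here
    (type, text) pairs; only exact matches count as correct.
    """
    system, gold = set(system), set(gold)
    correct = len(system & gold)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    gold = [
        ("NE-MANUF", "Fujitsu-Siemens"),
        ("NE-PROCESSOR", "Intel PIII"),
        ("NUMEX-CAPACITY", "40 Gb"),
        ("NUMEX-RESOLUTION", "1024x768"),
    ]
    system = [("NE-MANUF", "Fujitsu-Siemens"), ("NUMEX-CAPACITY", "24 MO")]
    print(precision_recall(system, gold))
```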

10 FNERC: Lingway Evaluation (1)
See XSLT file.
NE – MODEL: the current system is quite silent. Due to the lack of a general regular expression (or to noise with other rules)?
NUMEX – MONEY: some problems with format (€ in UTF-8, ISO-Latin +)
NUMEX – DATE / TIME: date / time NUMEX have not yet been implemented.
Agreement with human annotators' decisions: 1024x768 (NUMEX – RESOLUTION), 10/100 (NUMEX – SPEED) vs 1024x768 pixels and 10/100 (TERM)
Misspellings: Compact (NE – MANUF). Should the automatic system take them into account?
Redundant extraction: Fujitsu-Siemens in the title, in a secondary frame, etc., whereas human annotators tagged only the occurrence in the main table.
  Demarcator / Name Matching will handle these cases.

11 FNERC: Ontology Matters (1)
Missing ontology items:
  * Cards (F) => Graphical Card, Sound Card, Mother Card, Network Card, Controller Card
  * Memory (F) => Cache memory, Flash memory, Video memory, etc.
Missing ontology attributes:
  Hard disk
    Example: Disque dur Maxtor 40 Gb 7200 tours/s UDMA
    Proposal: DD => Type (SCSI, IDE, External), Brand (list of brands), Capacity, Speed, UDMA (yes/no), Internal/External
  Screen
    Example: écran 14.1" TFT XGA/SVGA dp (pitch) 0.25
    Proposal: Screen => (Screen Size, Screen Resolution, Screen Pitch Type)
  Removables
    Example: DVD-ROM - 17 Go - 8x - module enfichable (plug-in module)
    Proposal: Removables => Type (DVD-ROM Reader / CD-ROM Reader / CD Writer), Capacity, Speed, External/Internal

12 FNERC: Ontology Matters (2)
Processing of additional information:
  Additional textual information: moniteur NON INCLUS (screen not included); warranty details: garantie deux ans SUR SITE / RETOUR ATELIER (two-year warranty, on site / return to workshop)
  Fuzzy numerical values: garantie ILLIMITEE (unlimited warranty)
  Binary values: Mémoire flash installé(e) (max): Aucun(e) (flash memory installed (max): none); Network card: No
Ontology evolvability: pentium 4 / Pentium III-M: values not present in the ontology, which raises the question of how the ontology should evolve. Perhaps we should devise some rules to cover these cases.

