1 Italian FE Component CROSSMARC Eighth Meeting Crete 24 June 2003.


1 Italian FE Component, CROSSMARC Eighth Meeting, Crete, 24 June 2003

2 FE component overview
[Diagram. Components: WHISK_Corpus_Feeder, WHISK_Main, WHISK_Evaluator. WHISK Training: Training Set (XHTML pages, FE surrogates); WHISK Train Set (tokenised XML text, product descriptions); Ruleset. WHISK Testing: Testing Set (XHTML pages, FE surrogates); WHISK Test Set (tokenised XML text, product descriptions); WHISK Output (product descriptions); Evaluation Results.]

3 WHISK improvements
Laplacian Expected Error versus Precision:
Laplacian Expected Error, (e+1)/(n+1), expresses a good trade-off between rule precision and recall. Using this measure instead of Precision during the training phase prevents WHISK from choosing a multitude of very specific rules that apply correctly, but only in very few cases.
Once a rule has been accepted, its degree of generality no longer matters, so pure Precision must be considered.
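A small numeric sketch of the trade-off described above. It assumes, as in the usual WHISK definition, that e is the number of wrong extractions and n the total number of extractions a rule makes on the training set; the two example rules and their counts are invented for illustration.

```python
def laplacian_expected_error(errors: int, extractions: int) -> float:
    """Laplacian Expected Error (e + 1) / (n + 1); lower is better."""
    return (errors + 1) / (extractions + 1)


def precision(errors: int, extractions: int) -> float:
    """Plain precision: correct extractions over all extractions."""
    return (extractions - errors) / extractions


# Hypothetical rules, given as (errors, extractions) on the training set.
narrow_rule = (0, 2)    # always right, but fires only twice
broad_rule = (2, 50)    # two mistakes over fifty extractions

narrow_lap = laplacian_expected_error(*narrow_rule)
broad_lap = laplacian_expected_error(*broad_rule)
narrow_prec = precision(*narrow_rule)
broad_prec = precision(*broad_rule)

print(f"narrow: laplacian={narrow_lap:.3f} precision={narrow_prec:.2f}")
print(f"broad:  laplacian={broad_lap:.3f} precision={broad_prec:.2f}")
```

Precision alone ranks the narrow rule first, while the Laplacian measure ranks the broad rule first, which is the behaviour wanted during training.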

4 WHISK improvements
Some Heuristics
Preparing Ruleset
1. A ruleset is induced from instances in the training set. Every rule is characterized by its RuleType (the kind of fact that it extracts), its Laplacian Expected Error and its Precision.
2. Rules in the ruleset are sorted by Precision.
3. A threshold is set on Precision: induced rules with Precision under this threshold are discarded. Optionally, another threshold on the Laplacian Expected Error could be applied (it is not decisive for system accuracy, but it could be useful for removing overly specialized rules in order to raise system speed).
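The preparation steps above can be sketched as follows. This is only an illustration: the `Rule` class, the function name `prepare_ruleset`, the threshold values and the sample rules are assumptions, not part of the released component.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    name: str          # hypothetical identifier for the rule
    rule_type: str     # the kind of fact the rule extracts
    laplacian: float   # Laplacian Expected Error on the training set
    precision: float   # Precision on the training set


def prepare_ruleset(rules: List[Rule],
                    min_precision: float,
                    max_laplacian: Optional[float] = None) -> List[Rule]:
    """Drop rules under the Precision threshold, optionally drop rules whose
    Laplacian Expected Error is too high, then sort by Precision."""
    kept = [
        r for r in rules
        if r.precision >= min_precision
        and (max_laplacian is None or r.laplacian <= max_laplacian)
    ]
    return sorted(kept, key=lambda r: r.precision, reverse=True)


rules = [
    Rule("r1", "price", laplacian=0.05, precision=0.97),
    Rule("r2", "price", laplacian=0.33, precision=1.00),
    Rule("r3", "model", laplacian=0.20, precision=0.60),
]

by_precision = prepare_ruleset(rules, min_precision=0.8)
with_laplacian_cut = prepare_ruleset(rules, min_precision=0.8, max_laplacian=0.1)

print([r.name for r in by_precision])
print([r.name for r in with_laplacian_cut])
```

The optional second cut removes `r2`, a perfectly precise but highly specialized rule, which is the speed-oriented use of the Laplacian threshold mentioned in step 3.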

5 WHISK improvements
Some Heuristics
Applying rules
For every rule, in order of Precision, apply the rule and treat every extraction as a “candidate extraction”. Then, for every candidate extraction:
1. Delete the extraction if another candidate extraction (whichever RuleType it belongs to) exists in the same range of tokens; otherwise proceed to the next step.
2. Extract the product information from the PDemarcator tag.
3. Delete the extraction if another candidate extraction from a rule of the same RuleType exists for the same product; otherwise proceed to the next step.
4. Confirm the candidate extraction as a valid extraction.
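A minimal sketch of the four steps above, under several assumptions that the slide does not state: rules are applied from highest to lowest Precision, "another candidate extraction" means one already confirmed from a higher-ranked rule, "same range of tokens" is read as overlapping token spans, and the product identifier has already been read from the PDemarcator tag (step 2) and stored in a `product` field. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    rule_type: str    # RuleType of the rule that produced the extraction
    precision: float  # Precision of that rule
    start: int        # first token index (inclusive)
    end: int          # last token index (exclusive)
    product: str      # product id taken from the enclosing PDemarcator tag


def overlaps(a: Candidate, b: Candidate) -> bool:
    """True when the two token spans share at least one token."""
    return a.start < b.end and b.start < a.end


def select_extractions(candidates: List[Candidate]) -> List[Candidate]:
    """Confirm candidates in order of rule Precision, applying steps 1, 3, 4."""
    confirmed: List[Candidate] = []
    for cand in sorted(candidates, key=lambda c: c.precision, reverse=True):
        # Step 1: some confirmed extraction already covers these tokens.
        if any(overlaps(cand, kept) for kept in confirmed):
            continue
        # Step 3: this fact type is already filled for this product.
        if any(kept.rule_type == cand.rule_type and kept.product == cand.product
               for kept in confirmed):
            continue
        # Step 4: confirm the candidate.
        confirmed.append(cand)
    return confirmed


c1 = Candidate("price", 0.95, 10, 12, "p1")
c2 = Candidate("model", 0.90, 11, 13, "p1")   # overlaps c1
c3 = Candidate("price", 0.80, 20, 22, "p1")   # same RuleType and product as c1
c4 = Candidate("price", 0.70, 30, 32, "p2")   # different product
result = select_extractions([c3, c4, c1, c2])
print(result)
```

Because higher-precision rules are served first, a low-precision rule can only contribute where nothing better has already claimed the tokens or the product slot.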

6 WHISK improvements
WHISK_Corpus_Feeder
Multiple Product Reference
If a single named entity refers to multiple product descriptions, we use the following PDemarcator syntax to represent this phenomenon:
WHISK handles this form of the attribute and tests against every single product referred to by the tag.
We propose agreeing on this syntax and using it for new PDemarcator releases.

7 WHISK evaluation
The evaluation of the WHISK-based FE component for the four languages adopted in CROSSMARC follows:

8 Released Version Information
RTV FE comes in two releases:
A standalone version, featuring a user-friendly SWI-Prolog/XPCE GUI.
  – This version takes as input tokenized XHTML files transformed into Prolog lists.
  – The transformation is performed via a callback to a Java tokenizer.
  – It can also be invoked from the command line.
A Java application, which incorporates the html-to-token_list transformation class and wraps the Prolog engine for
  – direct integration into a Java environment
  – external calls from the command line

9 Screenshot of Prolog Standalone version (1)
[Screenshot callouts: RuleSet selection; Input/Output Pages selection; Start Processing; Change modality]

10 Screenshot of Prolog Standalone version (2)
[Screenshot callouts:
RuleSet selection
Convert HTML pages and surrogate files to Prolog lists of tokens and Prolog facts containing annotations (a Java converter is called)
Starts training process
Start the extractions to be evaluated
Clears the file containing the extractions
These buttons clear the ruleset and the file containing the testing extractions (recovery of badly interrupted training and extraction processes is provided)]

11 Concluding remarks
Two Possible Improvements
Self-adaptive cut on the ruleset:
Our FE system adopts a hierarchical extraction strategy. As a consequence, rules with low precision can be masked by others that perform better. For this reason, the contribution of a specific rule to system performance can only be tested dynamically on the system itself, by running two distinct tests, one with the rule and one without it.
A possible improvement consists of a first cut on the ruleset using a first threshold, followed by a second cut, using another threshold, that identifies a group of rules to be confirmed according to their influence on the system's performance.
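One way the proposed self-adaptive cut could look is sketched here. It assumes that the first threshold is the higher one (rules above it are accepted outright), that rules between the two thresholds are the ones to be tested, and that a rule is kept only if adding it improves a system-level score. The greedy order, the `score` callback and the toy gains are all illustrative assumptions.

```python
from typing import Callable, List, Tuple

Rule = Tuple[str, float]  # (hypothetical rule name, Precision)


def partition(rules: List[Rule], first_threshold: float, second_threshold: float):
    """Split rules into precise, to-be-tested and discarded groups."""
    precise = [r for r in rules if r[1] >= first_threshold]
    to_test = [r for r in rules if second_threshold <= r[1] < first_threshold]
    discarded = [r for r in rules if r[1] < second_threshold]
    return precise, to_test, discarded


def self_adaptive_cut(rules: List[Rule],
                      first_threshold: float,
                      second_threshold: float,
                      score: Callable[[List[Rule]], float]) -> List[Rule]:
    """Accept precise rules, then keep each tested rule only if the system
    scores better with it than without it (two runs per rule)."""
    precise, to_test, _ = partition(rules, first_threshold, second_threshold)
    accepted = list(precise)
    for rule in to_test:
        if score(accepted + [rule]) > score(accepted):
            accepted.append(rule)
    return accepted


# Toy stand-in for a full system evaluation: each rule adds a fixed gain.
toy_gain = {"a": 0.5, "b": 0.1, "c": -0.2, "d": 0.3}


def toy_score(ruleset: List[Rule]) -> float:
    return sum(toy_gain[name] for name, _ in ruleset)


rules = [("a", 0.95), ("b", 0.70), ("c", 0.65), ("d", 0.30)]
groups = partition(rules, 0.9, 0.6)
accepted = self_adaptive_cut(rules, 0.9, 0.6, toy_score)
print(accepted)
```

In a real run the score would come from evaluating the whole FE system on held-out pages, since masking between rules means a rule's effect cannot be read off its own precision.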

12 Concluding remarks
Two Possible Improvements
Addition of general-purpose rules:
To produce a sort of baseline for the system, an experiment was run using a minimal set of rules (one or a few for every type of fact to be extracted) composed only of PD tags and, in some cases, of other generic information about the specific files to be processed. These rules generally show low precision and higher recall. We expect that adding general-purpose rules to the rulesets, and marking all of them as low-precision rules, could raise Recall without a great loss in Precision, thanks to the error prevention mechanism provided by the hierarchical extraction strategy.
[Diagram: Ruleset Creation Methodology. Labels: Precise rules; Rules to be tested; Discarded rules; First Threshold; Second Threshold; Precise rules; Accepted rules; Generic rules]
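The expected effect of generic rules can be illustrated with a deliberately simplified model in which each (product, fact type) slot holds at most one value and extractions from specific rules always win over those from generic rules. The slot model, the function name and the sample values are assumptions made for this sketch only.

```python
from typing import Dict, Tuple

Slot = Tuple[str, str]  # (product id, RuleType)


def fill_with_generic(specific: Dict[Slot, str],
                      generic: Dict[Slot, str]) -> Dict[Slot, str]:
    """Generic rules only fill slots that specific rules left empty."""
    merged = dict(generic)
    merged.update(specific)  # specific extractions mask generic ones
    return merged


specific = {("p1", "price"): "999 EUR"}
generic = {("p1", "price"): "1999", ("p1", "model"): "X200"}

merged = fill_with_generic(specific, generic)
print(merged)
```

The generic rule's wrong price is masked, so it does not hurt Precision, while its model extraction fills a slot that would otherwise stay empty and so can raise Recall.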

