Information Extractors Hassan A. Sleiman
Author Cuba Spain Lebanon
Presenting Gretel
Roadmap: Introduction. What is an IE? IE classification. IE framework. Conclusions.
We are talking about: Wrapper. Form Filler. Navigator. Information Extractor. Ontologiser. Verifier. Endow data islands with APIs. Ease the implementation of web agents.
Look out! Wrappers are usually mistaken for information extractors.
The beginning DARPA Message Understanding Conferences (MUC).
Example
Message ID: MUC-0001
Message Template: Court resolution
Date of Event: April,
Charge: Terrorist attack
Person Charged: Salahuddin Amin
Person Charged: Anthony Garcia
Person Charged: Waheed Mahmood
Person Charged: Omar Khyam
…
Message ID: MUC-0002
Message Template: News
Date of Event: April,
Date of Public.: April,
Author: Jane Perlez
Location: London
Text: A British court…
…
…
Web has changed: Increasing number. Generated under user demand. Telegraphic language. HTML templates.
What is an IE system? IE is the task of identifying the specific fragments of a single document that constitute its core semantic content. IE Systems rely on a set of extraction patterns that are used in order to retrieve relevant information from each document. “Muslea” “Kushmerick”
IE in action. Input: Web pages, rules/patterns. Output: Extracted data. [Figure: a document and a set of extraction rules feed the information extractor, which outputs data. Example record: The Da Vinci Code, Dan Brown, €, 2006, Robert Langdon…, Doubleday]
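The pipeline on this slide (document plus extraction rules in, data out) can be illustrated with a minimal sketch. The HTML page, the slot names and the regular-expression rules below are assumptions made up for illustration; only the field values (title, author, year, publisher) come from the slide.

```python
import re

# Hypothetical page: the markup is invented, the values come from the slide.
DOCUMENT = """
<div class="book">
  <h1>The Da Vinci Code</h1>
  <span class="author">Dan Brown</span>
  <span class="year">2006</span>
  <span class="publisher">Doubleday</span>
</div>
"""

# Hypothetical extraction rules: one pattern per target slot,
# with the value to extract in capture group 1.
RULES = {
    "title": r"<h1>(.*?)</h1>",
    "author": r'<span class="author">(.*?)</span>',
    "year": r'<span class="year">(.*?)</span>',
    "publisher": r'<span class="publisher">(.*?)</span>',
}


def extract(document, rules):
    """Apply every rule to the document and return one record (slot -> value)."""
    record = {}
    for slot, pattern in rules.items():
        match = re.search(pattern, document, re.DOTALL)
        # A slot whose rule does not match is reported as missing (None).
        record[slot] = match.group(1).strip() if match else None
    return record


record = extract(DOCUMENT, RULES)
print(record)
```

Here the extractor itself is generic and all page-specific knowledge lives in the rules, which is the separation the slide draws between "Extraction rules" and "Information extractor".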
FormFiller + Navigator
Keywords: Learning processes. Domain. Rules. Extraction algorithm.
Our goals Compare IE techniques. A survey.
Classification categories: Input. Algorithm. Rules. Efficiency and effectiveness. User interaction. Other features.
Input features. Target pages: Free text, Semi-structured, Structured. Target slots: Page, Record, Tuple, Attribute.
Input features (2). When target slots are page, record or tuple: Multi-slot? Attribute permutation? Multi-formatted attributes? Preprocessing: Tidy, POS tagging, Zone detection, Tokenisation.
Algorithm. Degree of automation: Hand-crafted, Semi-supervised, Supervised, Unsupervised. For supervised/semi-supervised/unsupervised techniques: Number of input pages. Tagging?
Algorithm (2). Algorithm type: Logic programming, String alignment, Tree alignment, Clustering.
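To give a feel for the string alignment family, here is a minimal sketch that aligns the token sequences of two pages generated from the same template: tokens that align are treated as template, and tokens that differ are treated as candidate data. It uses Python's generic difflib.SequenceMatcher as a stand-in, not the algorithm of any particular IE tool, and both pages are hypothetical.

```python
from difflib import SequenceMatcher

# Two hypothetical pages assumed to share one HTML template.
PAGE_A = "<h1> The Da Vinci Code </h1> <p> Dan Brown </p>".split()
PAGE_B = "<h1> Example Title </h1> <p> Jane Doe </p>".split()


def variable_slots(tokens_a, tokens_b):
    """Return the token runs that differ between two aligned pages.

    Runs that align ("equal") are assumed to belong to the shared template;
    everything else is reported as a pair of candidate data values.
    """
    matcher = SequenceMatcher(a=tokens_a, b=tokens_b, autojunk=False)
    slots = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            slots.append((" ".join(tokens_a[i1:i2]), " ".join(tokens_b[j1:j2])))
    return slots


slots = variable_slots(PAGE_A, PAGE_B)
print(slots)
```

No page is tagged by a user here, which is why alignment is a natural fit for unsupervised techniques. The sketch also shows a limitation: a value that happens to be identical on both pages would be mistaken for template.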
Rules. Fixed: XPointer, Offset. Based on landmarks: Regular expressions, Context-free grammars, FOL (first-order logic), FSA (finite-state automata). Based on keywords. Tree patterns.
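The difference between a fixed offset rule and a landmark-based rule can be sketched as follows. The two pages, the landmarks and the helper names are hypothetical; the landmark rule is expressed as a regular expression built from a left and a right delimiter.

```python
import re

# Two hypothetical pages that follow the same layout.
PAGE_A = "<li><b>Title:</b> The Da Vinci Code</li><li><b>Author:</b> Dan Brown</li>"
PAGE_B = "<li><b>Title:</b> Example Title</li><li><b>Author:</b> Jane Doe</li>"

# Fixed rule: character offsets of the author, recorded on PAGE_A only.
START = PAGE_A.index("Dan Brown")
END = START + len("Dan Brown")


def offset_rule(page):
    """Return whatever sits at the recorded character positions."""
    return page[START:END]


def landmark_rule(left, right):
    """Build a regex that captures the text between two landmarks."""
    return re.compile(re.escape(left) + r"\s*(.*?)\s*" + re.escape(right))


AUTHOR_RULE = landmark_rule("<b>Author:</b>", "</li>")


def apply_landmark(rule, page):
    match = rule.search(page)
    return match.group(1) if match else None


for page in (PAGE_A, PAGE_B):
    print(repr(offset_rule(page)), repr(apply_landmark(AUTHOR_RULE, page)))
```

The offset rule only works on the page it was recorded on, because a shorter title shifts every later position, while the landmark rule keeps working as long as the delimiters around the attribute stay the same.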
Efficiency and effectiveness. Complexity. Precision. Recall. Accuracy. F-measure (β). Are there comparable results for the tool?
User interaction. Target audience: Developer. Non-technical. Interface: API. Command line. Configuration file. GUI.
Other features. Commercialisation: Commercial, Non-commercial. URL. Strong features. Weak features.
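As a reminder of how the effectiveness measures relate, the sketch below computes precision, recall and the F-measure with weight β from a set of extracted values and a set of expected values, using the standard definition F_β = (1 + β²)·P·R / (β²·P + R). The function name and the sample values are hypothetical.

```python
def precision_recall_f(extracted, expected, beta=1.0):
    """Compare extracted values against the expected (gold) values.

    precision = correct / extracted, recall = correct / expected,
    F_beta    = (1 + beta^2) * P * R / (beta^2 * P + R)
    """
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    denominator = beta ** 2 * precision + recall
    f_measure = (1 + beta ** 2) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


# Hypothetical run: four values extracted, three expected, two in common.
p, r, f1 = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
print(p, r, f1)
```

With β = 1 precision and recall weigh the same; β > 1 favours recall and β < 1 favours precision, which is why the value of β has to be reported for results to be comparable across tools.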
Idea IE framework. Reusable. Comparable results.
Identified parts
Identified parts (2)
Conclusion [Diagram labels: Verifier, Ontologiser, Knowledge Base, Extractor, Information retrieval, Ontology, Dataset]
Conclusions: High degree of variability. No comparative framework exists. Our goal: reduce comparison costs.
Thanks! Hassan A. Sleiman