Visual Web Information Extraction With Lixto Robert Baumgartner Sergio Flesca Georg Gottlob.

Overview  Introduction and Motivation  Wrapper Generation  Extraction Language/Mechanisms  Testing Lixto  Results  Strengths & Weakness  Current/Future Work

HTML vs. XML  HTML & XML represent semi-structured data  HTML mainly presentation oriented  Web content typically formatted in HTML  HTML lacks data querying

XML Advantages  XML separates structure from layout  XML provides suitable data representation  XML sets act as a database  XML sets can be queried via XML-GL, XML-QL, XQuery

eBay Example  Lack of data querying increases the cost and time needed to retrieve information from web pages  Example: watch interesting eBay offers of notebooks  Criteria: –Auction contains the word “notebook” –Current value between GBP 1500 and 3000 –Received at least 3 bids

eBay Problems  eBay does not support complex queries  Similar sites do not offer restricted queries  Large number of results returned with no possibility to further restrict the results  Only one site can be queried at a time  Results from different queries cannot be compiled into a single structured file

eBay Solution  Lixto introduces new ideas and programming language concepts for wrapper generation  Lixto translates HTML to XML  Resulting XML can then be queried and further processed  Wrappers applied automatically to extract information from changing web pages
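The querying step can be sketched as follows. This is a minimal illustration, not Lixto output: the XML sample and the element names `offers`, `record`, `itemdes`, `price` and `bids` are assumptions, and plain Python stands in for a query language such as XQuery. It applies the three criteria from the eBay example.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper output; element names and values are invented for illustration.
XML_OUTPUT = """
<offers>
  <record><itemdes>Dell notebook 15 inch</itemdes><price>1800</price><bids>5</bids></record>
  <record><itemdes>Notebook bag</itemdes><price>20</price><bids>7</bids></record>
  <record><itemdes>IBM Notebook T23</itemdes><price>2500</price><bids>1</bids></record>
  <record><itemdes>Desktop PC</itemdes><price>1600</price><bids>4</bids></record>
</offers>
"""


def interesting_offers(xml_text, low=1500, high=3000, min_bids=3):
    """Return titles of offers that satisfy all three criteria."""
    root = ET.fromstring(xml_text)
    hits = []
    for rec in root.iter("record"):
        title = rec.findtext("itemdes", default="")
        price = float(rec.findtext("price", default="0"))
        bids = int(rec.findtext("bids", default="0"))
        # Criteria: contains "notebook", price within range, enough bids.
        if "notebook" in title.lower() and low <= price <= high and bids >= min_bids:
            hits.append(title)
    return hits


print(interesting_offers(XML_OUTPUT))
```

Because the wrapper output is ordinary XML, results from several sites could be merged into one file before filtering.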

Lixto Advantages  Easy to learn  Full visual and interactive UI provided  No fine tuning required  No knowledge of internal language necessary  No knowledge of HTML necessary  Graphical region marking and selection  Works directly on browser-display pages, no additional view necessary

Lixto Advantages  Extraction of target patterns based on: –Surrounding landmarks –Actual content –HTML attributes –Order of appearance –Semantic and syntactic concepts  Extraction from flat strings possible  Semi-automatic wrapper generation

Advanced Lixto Features  Disjunctive pattern definitions  Crawling page links during extraction  Recursive wrapping  Extracted data can have disjoint structure from HTML source page  Internal data structure language Elog

Implemented Lixto System

Architecture and Implementation  Lixto created with Java using Swing, OroMatcher and JDOM  Lixto toolkit contains three modules: –Interactive Pattern Builder –Extractor –XML Generator

Creating Wrappers  Lixto wrappers created interactively using patterns in a hierarchical order  Pattern names act as default XML elements  Subpatterns express 1:* relationships  Each pattern characterizes one kind of information  Each pattern is defined by one or more filters

Filter Creation  User highlights desired target –Internally Elog rule created describing filter  Add restrictive conditions to filter –Goals added to Elog rule body  Filter conditions: –Before/after –Not before/not after –Internal –Range

Pattern Creation Algorithm  Loading initial document creates a pattern  User highlights instance of the pattern  Lixto displays all matched instances of the pattern

Pattern Creation Algorithm  User can add filters to limit the matched targets  The set of filters is added to the pattern  Test if pattern extracts exactly the desired set of data  If yes, save the pattern, if no select new instance of the pattern
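The relationship between patterns, filters and conditions can be sketched as below. This is an illustrative model only, not Lixto's internal representation: the class names `Filter` and `Pattern`, the string candidates and the lambda conditions are all assumptions. Each condition narrows a filter, and each additional filter broadens the pattern.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Condition = Callable[[str], bool]


@dataclass
class Filter:
    """A filter matches a candidate only if every condition holds (narrowing)."""
    conditions: List[Condition] = field(default_factory=list)

    def matches(self, candidate: str) -> bool:
        return all(cond(candidate) for cond in self.conditions)


@dataclass
class Pattern:
    """A pattern extracts whatever at least one of its filters matches (broadening)."""
    name: str
    filters: List[Filter] = field(default_factory=list)

    def extract(self, candidates: List[str]) -> List[str]:
        return [c for c in candidates if any(f.matches(c) for f in self.filters)]


candidates = ["GBP 1800", "USD 40", "5 bids", "EUR 99"]

price = Pattern("price")
# First filter: two conditions, both must hold.
price.filters.append(Filter([lambda s: s.startswith("GBP"), lambda s: s[4:].isdigit()]))
# The test step shows EUR prices are missed, so a second filter is added.
price.filters.append(Filter([lambda s: s.startswith("EUR")]))

print(price.extract(candidates))
```

The loop in the slides corresponds to repeating the last two steps until `extract` returns exactly the desired targets.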

Generation of a New Pattern

The Lixto Browser

Conditional Generation

Visual Interface  Visual tree pattern construction  Regular expression string patterns  XML visualization tool  Concept generator –Regular expression / database driven –Creates “isCity”, “isDate” –Requires no regular expression knowledge
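A concept such as “isDate” or “isCity” can be thought of as a predicate over extracted strings. The sketch below is an assumption about how such predicates might look, not the concept generator itself: the date regular expression, the `KNOWN_CITIES` set (standing in for a database table) and the function names are hypothetical.

```python
import re

# Regular-expression-driven concept: a simple numeric date such as 12.03.2001.
DATE_RE = re.compile(r"\d{1,2}[./-]\d{1,2}[./-]\d{2,4}")

# Database-driven concept: a small in-memory set stands in for a city table.
KNOWN_CITIES = {"vienna", "london", "rome"}


def is_date(text: str) -> bool:
    """True if the whole string looks like a numeric date."""
    return DATE_RE.fullmatch(text.strip()) is not None


def is_city(text: str) -> bool:
    """True if the string is a known city name (case-insensitive)."""
    return text.strip().lower() in KNOWN_CITIES


print(is_date("12.03.2001"), is_city("Vienna"), is_city("notebook"))
```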

Main Menu / Pattern Generation Menu

Elog  Internal data storage language  Data-log like syntax and semantics  Invisible to the user  Specifically designed for hierarchical and modular data extraction  Flexible, intuitive, easily extensible  Patterns stored as narrowing (logical and) and broadening (logical or) steps  Elog rules are implementations of the visually defined filters

Elog Extraction Program for eBay Example

Document Model  Brackets specify character offsets  Nodes numbered in depth-first left-to-right fashion  HTML tags refer to element sets containing attribute names and values –A body tag contains attributes {(name,body), (bgcolor,FFFFFF),(elementtext,…)}
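The node numbering and attribute sets can be sketched with Python's standard `html.parser`. This is a simplified assumption-laden model, not Lixto's implementation: the `Node` and `DocumentModelBuilder` classes are hypothetical, character offsets are omitted, and the input is assumed to be well-formed with explicit closing tags.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, number, tag, attrs):
        self.number = number          # depth-first, left-to-right position
        self.children = []
        # The tag name and the text content are stored as ordinary attributes.
        # An HTML attribute literally called "name" would be overwritten here.
        self.attrs = {**attrs, "name": tag, "elementtext": ""}


class DocumentModelBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.nodes = []               # nodes in numbering order
        self.stack = []               # currently open elements

    def handle_starttag(self, tag, attrs):
        # Start tags arrive in document order, which is the depth-first order.
        node = Node(len(self.nodes) + 1, tag, {k: v or "" for k, v in attrs})
        if self.stack:
            self.stack[-1].children.append(node)
        self.nodes.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text belongs to every enclosing element.
        for node in self.stack:
            node.attrs["elementtext"] += data


def build(html):
    builder = DocumentModelBuilder()
    builder.feed(html)
    builder.close()
    return builder.nodes


nodes = build('<body bgcolor="FFFFFF"><table><tr><td>item</td></tr></table><p>end</p></body>')
for n in nodes:
    print(n.number, n.attrs)
```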

HTML Example Page

XML Translation

Extraction Mechanisms  Tree extraction –Elements identified by tree path (*.table*.tr) –Attribute constraints reduce matched elements –Element path definition (epd): tree path + attribute constraints  String extraction –Strings stored in ‘context’ nodes –Regular expression matching
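Both mechanisms can be sketched in a few lines. The code below is an illustration under stated assumptions, not Elog semantics: the `Elem` class, the `subelem` helper, the dotted path syntax (where `*` is taken to skip zero or more levels) and the constraint modes `exact`, `substr` and `regex` are all hypothetical choices made for this example.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Elem:
    name: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Elem"] = field(default_factory=list)


def match_path(node, steps):
    """Yield the nodes reached from `node` by following the path steps."""
    if not steps:
        yield node
        return
    head, rest = steps[0], steps[1:]
    if head == "*":
        yield from match_path(node, rest)          # skip zero levels
        for child in node.children:
            yield from match_path(child, steps)    # skip one more level
    else:
        for child in node.children:
            if child.name == head:
                yield from match_path(child, rest)


def check(node, constraint):
    """Test one (attribute, value, mode) constraint against a node."""
    attr, value, mode = constraint
    actual = node.attrs.get(attr)
    if actual is None:
        return False
    if mode == "exact":
        return actual == value
    if mode == "substr":
        return value in actual
    if mode == "regex":
        return re.search(value, actual) is not None
    raise ValueError(f"unknown mode: {mode}")


def subelem(root, path, constraints=()):
    """Element path definition: tree path plus attribute constraints."""
    steps = [s for s in path.split(".") if s]
    seen, result = set(), []
    for node in match_path(root, steps):
        if id(node) not in seen and all(check(node, c) for c in constraints):
            result.append(node)
        seen.add(id(node))
    return result


row1 = Elem("tr", {"elementtext": "Dell notebook GBP 1800"})
row2 = Elem("tr", {"elementtext": "Desktop PC GBP 900"})
body = Elem("body", children=[Elem("table", children=[row1, row2]), Elem("p")])

# Tree extraction: all rows, then only rows whose text mentions "notebook".
rows = subelem(body, "*.table.*.tr")
notebooks = subelem(body, "*.table.*.tr", [("elementtext", "notebook", "substr")])

# String extraction: a regular expression over the text of a matched element.
prices = re.findall(r"GBP \d+", notebooks[0].attrs["elementtext"])
print(len(rows), prices)
```

Attribute constraints only ever shrink the set returned by the tree path, which matches the slide's statement that they reduce the matched elements.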

HTML Tree Extraction

Lixto Test Sites

Results

Strengths & Weaknesses  Intuitive UI (If it needs a manual it’s not a good program)  Highly customizable  Supports crawling across web sites  No tree output after crawling  Slow  Extracts only one target type at a time

Current/Future Work  Extend tree structure to support crawling across multiple sites (crawling is currently supported)  Server based Lixto system  Automated heuristics  Support for multiple example targets at once  Embedding Lixto wrappers into information channel system
