Visual Web Information Extraction With Lixto
Robert Baumgartner, Sergio Flesca, Georg Gottlob


1 Visual Web Information Extraction With Lixto
Robert Baumgartner, Sergio Flesca, Georg Gottlob

2 Overview
• Introduction and Motivation
• Wrapper Generation
• Extraction Language/Mechanisms
• Testing Lixto
• Results
• Strengths & Weaknesses
• Current/Future Work

3 HTML vs. XML
• HTML and XML both represent semi-structured data
• HTML is mainly presentation oriented
• Web content is typically formatted in HTML
• HTML offers no facility for querying data

4 XML Advantages
• XML separates structure from layout
• XML provides a suitable data representation
• Sets of XML documents can act as a database
• XML sets can be queried via XML-GL, XML-QL, or XQuery

5 eBay Example
• The lack of data querying increases the cost and time needed to retrieve information from web pages
• Example: watch interesting eBay offers of notebooks
• Criteria:
  – Auction contains the word “notebook”
  – Current value between GBP 1500 and 3000
  – Received at least 3 bids
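
The three criteria above amount to a simple conjunctive query once the auction data is available in structured form. The following is a minimal sketch of that query, assuming the offers have already been extracted into records; the `Offer` class, its field names, and the sample data are hypothetical and not part of Lixto.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    # Hypothetical record for one extracted auction entry.
    title: str
    price_gbp: float
    bids: int


def is_interesting(offer: Offer) -> bool:
    """Apply the three criteria from the slide as a conjunction."""
    return (
        "notebook" in offer.title.lower()       # title contains "notebook"
        and 1500 <= offer.price_gbp <= 3000     # current value in range
        and offer.bids >= 3                     # at least 3 bids
    )


# Made-up sample data for illustration only.
offers = [
    Offer("Notebook Pentium III", 1800.0, 5),
    Offer("Notebook case", 20.0, 7),
    Offer("Desktop PC", 2000.0, 4),
    Offer("Fast notebook", 2500.0, 1),
]

interesting = [o for o in offers if is_interesting(o)]
```

A query of this kind is trivial on structured records but cannot be posed against the HTML result pages directly, which is the gap the wrapper is meant to close.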

6 eBay Problems
• eBay does not support complex queries
• Similar sites do not offer restricted queries either
• A large number of results is returned, with no possibility to restrict them further
• Only one site can be queried at a time
• Results from different queries cannot be compiled into a single structured file

7 eBay Solution
• Lixto introduces new ideas and programming language concepts for wrapper generation
• Lixto translates HTML to XML
• The resulting XML can then be queried and further processed
• Wrappers are applied automatically to extract information from changing web pages

8 Lixto Advantages
• Easy to learn
• Fully visual and interactive UI provided
• No fine tuning required
• No knowledge of the internal language necessary
• No knowledge of HTML necessary
• Graphical region marking and selection
• Works directly on browser-displayed pages, no additional view necessary

9 Lixto Advantages
• Extraction of target patterns based on:
  – Surrounding landmarks
  – Actual content
  – HTML attributes
  – Order of appearance
  – Semantic and syntactic concepts
• Extraction from flat strings possible
• Semi-automatic wrapper generation

10 Advanced Lixto Features
• Disjunctive pattern definitions
• Crawling page links during extraction
• Recursive wrapping
• Extracted data can have a structure disjoint from that of the HTML source page
• Internal data structure language Elog

11 Implemented Lixto System

12 Architecture and Implementation
• Lixto is implemented in Java using Swing, OroMatcher and JDOM
• The Lixto toolkit contains three modules:
  – Interactive Pattern Builder
  – Extractor
  – XML Generator

13 Creating Wrappers
• Lixto wrappers are created interactively using patterns in a hierarchical order
• Pattern names act as default XML element names
• Sub-patterns express 1:* relationships
• Each pattern characterizes one kind of information
• Each pattern is defined by one or more filters
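
To illustrate how a pattern hierarchy can map onto XML, here is a minimal sketch in which each pattern name becomes an element name and each sub-pattern contributes zero or more child elements (the 1:* relationship). The function `to_xml`, the nested-dictionary input format, and the pattern names `tableseq`, `record`, `itemdes`, `price`, and `bids` are assumptions made for this example; it is not Lixto's XML Generator.

```python
import xml.etree.ElementTree as ET


def to_xml(pattern_name, instance):
    """Turn one pattern instance into an XML element named after its pattern.

    A dict maps sub-pattern names to lists of sub-instances (1:*);
    anything else is treated as the extracted leaf text.
    """
    elem = ET.Element(pattern_name)
    if isinstance(instance, dict):
        for child_name, child_instances in instance.items():
            for child in child_instances:
                elem.append(to_xml(child_name, child))
    else:
        elem.text = str(instance)
    return elem


# Hypothetical extraction result: one table pattern with two record instances.
extracted = {
    "record": [
        {"itemdes": ["Notebook A"], "price": ["1800"], "bids": ["5"]},
        {"itemdes": ["Notebook B"], "price": ["2500"], "bids": ["3"]},
    ]
}

root = to_xml("tableseq", extracted)
xml_text = ET.tostring(root, encoding="unicode")
```

Because the output tree follows the pattern hierarchy rather than the HTML nesting, the XML structure can differ from the structure of the source page.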

14 Filter Creation
• User highlights desired target
  – Internally, an Elog rule describing the filter is created
• Add restrictive conditions to the filter
  – Goals are added to the Elog rule body
• Filter conditions:
  – Before/after
  – Not before/not after
  – Internal
  – Range

15 Pattern Creation Algorithm
• Loading the initial document creates a pattern
• User highlights an instance of the pattern
• Lixto displays all matched instances of the pattern

16 Pattern Creation Algorithm
• User can add filters to limit the matched targets
• The set of filters is added to the pattern
• Test whether the pattern extracts exactly the desired set of data
• If yes, save the pattern; if no, select a new instance of the pattern

17 Generation of a New Pattern

18 The Lixto Browser

19 Conditional Generation

20 Visual Interface
• Visual tree pattern construction
• Regular expression string patterns
• XML visualization tool
• Concept generator
  – Regular expression / database driven
  – Creates concepts such as “isCity” and “isDate”
  – Requires no regular expression knowledge

21 Main Menu / Pattern Generation Menu

22 Elog
• Internal data storage language
• Datalog-like syntax and semantics
• Invisible to the user
• Specifically designed for hierarchical and modular data extraction
• Flexible, intuitive, easily extensible
• Patterns stored as narrowing (logical and) and broadening (logical or) steps
• Elog rules are implementations of the visually defined filters
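
The narrowing and broadening steps can be pictured with a small Python analogy: a filter is a conjunction of conditions (each added condition narrows the match set), and a pattern is a disjunction of filters (each added filter broadens it). This sketch is not Elog and does not reproduce its syntax; `make_filter`, `make_pattern`, and the candidate records are hypothetical.

```python
from typing import Callable, Dict, List

Candidate = Dict[str, str]
Condition = Callable[[Candidate], bool]


def make_filter(conditions: List[Condition]) -> Condition:
    """A filter matches only if every condition holds (narrowing, logical and)."""
    return lambda cand: all(cond(cand) for cond in conditions)


def make_pattern(filters: List[Condition]) -> Condition:
    """A pattern matches if at least one filter matches (broadening, logical or)."""
    return lambda cand: any(f(cand) for f in filters)


def is_td(cand: Candidate) -> bool:
    return cand["tag"] == "td"


# Made-up candidate elements standing in for nodes of a parsed page.
candidates = [
    {"tag": "td", "text": "GBP 1800"},
    {"tag": "td", "text": "$ 2000"},
    {"tag": "td", "text": "5 bids"},
    {"tag": "th", "text": "GBP"},
]

gbp_filter = make_filter([is_td, lambda c: c["text"].startswith("GBP")])
usd_filter = make_filter([is_td, lambda c: c["text"].startswith("$")])
price = make_pattern([gbp_filter, usd_filter])

matches = [c["text"] for c in candidates if price(c)]
```

Here the second condition on each filter excludes the "5 bids" cell, while the second filter lets the pattern also accept prices written in another currency.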

23 Elog Extraction Program for eBay Example

24 Document Model
• Brackets specify character offsets
• Nodes are numbered in depth-first, left-to-right fashion
• HTML tags refer to element sets containing attribute names and values
  – For example, a body tag has the attribute set {(name, body), (bgcolor, FFFFFF), (elementtext, …)}
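
A minimal sketch of the numbering scheme, assuming well-nested markup: since start tags appear in the source in depth-first, left-to-right order, counting them as they are parsed yields the node numbers. The class `NumberingParser` is hypothetical, numbering from 0 is an assumption, and character offsets and the `elementtext` attribute are left out.

```python
from html.parser import HTMLParser


class NumberingParser(HTMLParser):
    """Number elements in document order and record their attribute sets."""

    def __init__(self):
        super().__init__()
        self.nodes = []  # list of (node_number, attribute_dict)

    def handle_starttag(self, tag, attrs):
        # The tag name is stored as a pseudo-attribute "name", mirroring
        # the (name, body) pair on the slide. An HTML attribute that is
        # itself called "name" would overwrite it in this simple sketch.
        attr_set = {"name": tag}
        attr_set.update(dict(attrs))
        self.nodes.append((len(self.nodes), attr_set))


parser = NumberingParser()
parser.feed(
    '<html><body bgcolor="FFFFFF">'
    "<table><tr><td>a</td></tr></table><p>b</p>"
    "</body></html>"
)

order = [attrs["name"] for _, attrs in parser.nodes]
```

In this example the `p` element receives its number only after the whole `table` subtree has been numbered, which is what depth-first, left-to-right ordering means.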

25 HTML Example Page

26 XML Translation

27 Extraction Mechanisms
• Tree extraction
  – Elements identified by tree path (*.table*.tr)
  – Attribute constraints reduce matched elements
  – Element path definition (epd): tree path + attribute constraints
• String extraction
  – Strings stored in ‘context’ nodes
  – Regular expression matching
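
Both mechanisms can be sketched in a few lines of Python. The path syntax used here is a simplification chosen for this example, not Elog's: steps are separated by dots and `*` stands for any run of tags, including none. The helpers `path_matches` and `subelements` are hypothetical, and the sample page is well-formed markup so that the standard XML parser can read it.

```python
import re
import xml.etree.ElementTree as ET


def path_matches(pattern, path):
    """Match a list of tag names against a pattern; '*' matches any run of tags."""
    if not pattern:
        return not path
    head, rest = pattern[0], pattern[1:]
    if head == "*":
        return any(path_matches(rest, path[i:]) for i in range(len(path) + 1))
    return bool(path) and path[0] == head and path_matches(rest, path[1:])


def subelements(root, tree_path, constraints=None):
    """Tree extraction: descendants of root whose path and attributes match."""
    pattern = tree_path.split(".")
    constraints = constraints or {}
    found = []

    def walk(elem, path):
        for child in elem:
            child_path = path + [child.tag]
            attrs_ok = all(child.get(k) == v for k, v in constraints.items())
            if path_matches(pattern, child_path) and attrs_ok:
                found.append(child)
            walk(child, child_path)

    walk(root, [])
    return found


# Made-up sample page for illustration only.
page = ET.fromstring(
    "<body><table>"
    "<tr bgcolor='FFFFFF'><td>Notebook A</td><td>GBP 1800.00</td></tr>"
    "<tr bgcolor='CCCCCC'><td>Notebook B</td><td>GBP 2500.50</td></tr>"
    "</table></body>"
)

rows = subelements(page, "*.table.*.tr")
white_rows = subelements(page, "*.table.*.tr", {"bgcolor": "FFFFFF"})

# String extraction: a regular expression applied to the text of each row.
prices = [
    m.group(1)
    for tr in rows
    for m in re.finditer(r"GBP (\d+\.\d{2})", "".join(tr.itertext()))
]
```

The tree path alone selects both rows, the attribute constraint narrows the result to one, and the regular expression pulls the price substrings out of the row text.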

28 HTML Tree Extraction

29 Lixto Test Sites

30 Results

31 Strengths & Weaknesses
• Intuitive UI (if it needs a manual, it’s not a good program)
• Highly customizable
• Supports crawling across web sites
• No tree output after crawling
• Slow
• Extracts only one target type at a time

32 Current/Future Work
• Extend the tree structure to support crawling across multiple sites (crawling itself is already supported)
• Server-based Lixto system
• Automated heuristics
• Support for multiple example targets at once
• Embedding Lixto wrappers into an information channel system
