1
Visual Web Information Extraction With Lixto
Robert Baumgartner, Sergio Flesca, Georg Gottlob
2
Overview
Introduction and Motivation
Wrapper Generation
Extraction Language/Mechanisms
Testing Lixto
Results
Strengths & Weaknesses
Current/Future Work
3
HTML vs. XML
HTML & XML represent semi-structured data
HTML is mainly presentation oriented
Web content is typically formatted in HTML
HTML lacks data querying
4
XML Advantages
XML separates structure from layout
XML provides a suitable data representation
XML sets act as a database
XML sets can be queried via XML-GL, XML-QL, or XQuery
5
eBay Example
Lack of data querying ability increases the cost and time needed to retrieve information from web pages
Example: watch interesting eBay offers of notebooks
Criteria:
–Auction contains the word “notebook”
–Current value between GBP 1500 and 3000
–Received at least 3 bids
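The three criteria can be written as a single predicate once the offers exist as structured records. The sketch below assumes a hypothetical `Offer` record and made-up sample data; it is not eBay's or Lixto's interface, only an illustration of the query that plain HTML pages do not let the user pose.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    # Hypothetical record layout; the field names are assumptions.
    title: str
    price_gbp: float
    bids: int


def is_interesting(offer: Offer) -> bool:
    """Apply the three criteria from the slide to one offer."""
    return (
        "notebook" in offer.title.lower()
        and 1500 <= offer.price_gbp <= 3000
        and offer.bids >= 3
    )


# Invented sample data, for illustration only.
offers = [
    Offer("Sony Notebook 15in", 1800.0, 5),
    Offer("Notebook bag", 25.0, 7),
    Offer("Desktop PC", 2000.0, 4),
    Offer("Dell notebook", 2500.0, 1),
]

watched = [o.title for o in offers if is_interesting(o)]
```

Only the first sample offer satisfies all three conditions; each of the others fails exactly one of them.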
6
eBay Problems
eBay does not support complex queries
Similar sites do not offer restricted queries
A large number of results is returned, with no possibility to restrict them further
Only one site can be queried at a time
Results from different queries cannot be compiled into a single structured file
7
eBay Solution
Lixto introduces new ideas and programming language concepts for wrapper generation
Lixto translates HTML to XML
The resulting XML can then be queried and further processed
Wrappers are applied automatically to extract information from changing web pages
8
Lixto Advantages
Easy to learn
Fully visual and interactive UI provided
No fine tuning required
No knowledge of the internal language necessary
No knowledge of HTML necessary
Graphical region marking and selection
Works directly on browser-displayed pages; no additional view necessary
9
Lixto Advantages
Extraction of target patterns based on:
–Surrounding landmarks
–Actual content
–HTML attributes
–Order of appearance
–Semantic and syntactic concepts
Extraction from flat strings possible
Semi-automatic wrapper generation
10
Advanced Lixto Features
Disjunctive pattern definitions
Crawling page links during extraction
Recursive wrapping
Extracted data can be structured differently from the HTML source page
Internal data structure language: Elog
11
Implemented Lixto System
12
Architecture and Implementation
Lixto is implemented in Java using Swing, OroMatcher and JDOM
The Lixto toolkit contains three modules:
–Interactive Pattern Builder
–Extractor
–XML Generator
13
Creating Wrappers
Lixto wrappers are created interactively using patterns in a hierarchical order
Pattern names act as default XML element names
Subpatterns express one-to-many (1:*) relationships
Each pattern characterizes one kind of information
Each pattern is defined by one or more filters
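How a pattern hierarchy could map onto XML can be sketched as follows. This is a minimal illustration, not the Lixto XML Generator: the nested-dict layout, the function name `to_xml`, and the sample pattern names (`record`, `description`, `price`, `bid`) are all assumptions. Each pattern name becomes an element name, and a parent instance may hold any number of instances of each subpattern.

```python
import xml.etree.ElementTree as ET


def to_xml(pattern_name, instance):
    """Turn one pattern instance into an XML element named after the pattern.

    A dict maps subpattern names to lists of child instances (the 1:*
    relationship); anything else is treated as extracted leaf text.
    """
    elem = ET.Element(pattern_name)
    if isinstance(instance, dict):
        for child_name, children in instance.items():
            for child in children:
                elem.append(to_xml(child_name, child))
    else:
        elem.text = str(instance)
    return elem


# Invented sample instance: one record with two bid instances.
record = {
    "description": ["Sony notebook"],
    "price": ["1800"],
    "bid": ["1790", "1800"],
}

xml_text = ET.tostring(to_xml("record", record), encoding="unicode")
```

The resulting string contains one `<record>` element with a `<description>`, a `<price>`, and two `<bid>` children, in the order the subpatterns were listed.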
14
Filter Creation
User highlights the desired target
–Internally, an Elog rule describing the filter is created
Add restrictive conditions to the filter
–Goals are added to the Elog rule body
Filter conditions:
–Before/after
–Not before/not after
–Internal
–Range
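The condition types can be illustrated with a deliberately simplified model. The sketch assumes a page flattened into an ordered list of elements, and the helper names (`after`, `not_after`, `internal`, `apply_filter`) are hypothetical; real Elog conditions operate on the document tree and are not reproduced here. What the sketch does show is that every added condition narrows the set of matched targets, and that a range keeps only an interval of the remaining matches.

```python
from typing import Callable, Dict, List, Optional, Tuple

Element = Dict[str, str]  # assumed flat model: {"tag": ..., "text": ...}
Condition = Callable[[List[Element], int], bool]


def after(landmark: str) -> Condition:
    """Target must come after some element whose text contains the landmark."""
    return lambda doc, i: any(landmark in e["text"] for e in doc[:i])


def not_after(landmark: str) -> Condition:
    """Negation of `after`."""
    return lambda doc, i: not after(landmark)(doc, i)


def internal(substring: str) -> Condition:
    """Target itself must contain the given text."""
    return lambda doc, i: substring in doc[i]["text"]


def apply_filter(
    doc: List[Element],
    tag: str,
    conditions: List[Condition],
    rng: Optional[Tuple[int, int]] = None,
) -> List[str]:
    """Select elements with the tag that satisfy ALL conditions (logical and).

    `rng` is a 1-based inclusive interval over the surviving matches.
    """
    hits = [
        i for i, e in enumerate(doc)
        if e["tag"] == tag and all(c(doc, i) for c in conditions)
    ]
    if rng is not None:
        first, last = rng
        hits = hits[first - 1:last]
    return [doc[i]["text"] for i in hits]


# Invented sample page.
doc = [
    {"tag": "td", "text": "Featured items"},
    {"tag": "tr", "text": "Ad banner"},
    {"tag": "td", "text": "Current items"},
    {"tag": "tr", "text": "Notebook A GBP 1800"},
    {"tag": "tr", "text": "Notebook B GBP 2500"},
    {"tag": "tr", "text": "Mouse GBP 10"},
]

rows = apply_filter(doc, "tr", [after("Current items"), internal("Notebook")])
first_only = apply_filter(doc, "tr", [after("Current items")], rng=(1, 1))
```

With the sample page, the landmark condition drops the ad banner, the internal condition drops the mouse offer, and the range `(1, 1)` keeps only the first row after the landmark.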
15
Pattern Creation Algorithm
Loading the initial document creates a pattern
User highlights an instance of the pattern
Lixto displays all matched instances of the pattern
16
Pattern Creation Algorithm
User can add filters to limit the matched targets
The set of filters is added to the pattern
Test whether the pattern extracts exactly the desired set of data
If yes, save the pattern; if no, select a new instance of the pattern
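The test step of the algorithm can be sketched in a few lines. The sketch assumes that the filters of one pattern combine disjunctively (a target belongs to the pattern if at least one filter accepts it), in line with the "disjunctive pattern definitions" listed among the advanced features. The candidate strings and the lambda filters are invented stand-ins for what a user would define visually.

```python
def pattern_matches(filters, candidates):
    """Return the candidates accepted by at least one filter (logical or)."""
    return [c for c in candidates if any(f(c) for f in filters)]


# Invented sample data: the user wants both price notations extracted.
candidates = ["GBP 1800", "$ 2400", "3 bids", "EUR 99"]
desired = ["GBP 1800", "$ 2400"]

# First attempt: one filter, derived from one highlighted instance.
filters = [lambda s: s.startswith("GBP")]
ok_before = pattern_matches(filters, candidates) == desired

# The pattern extracts too little, so a second instance is selected
# and a second filter is added to the same pattern.
filters.append(lambda s: s.startswith("$"))
ok_after = pattern_matches(filters, candidates) == desired
```

The loop ends once the pattern extracts exactly the desired set, which in this sample happens after the second filter is added.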
17
Generation of a New Pattern
18
The Lixto Browser
19
Conditional Generation
20
Visual Interface
Visual tree pattern construction
Regular expression string patterns
XML visualization tool
Concept generator
–Regular expression / database driven
–Creates “isCity”, “isDate”
–Requires no regular expression knowledge
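The two kinds of concept can be approximated as simple predicates. This is only a sketch of the idea: the city table, the date format, and the function names `is_city` and `is_date` are assumptions, not what the Lixto concept generator actually produces.

```python
import re

# Assumed lookup table standing in for a database-driven concept.
CITIES = {"vienna", "london", "rome"}

# Assumed date format (day/month/year) for a regex-driven concept.
DATE_RE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")


def is_city(text: str) -> bool:
    """Database-driven concept: membership in a table of known values."""
    return text.strip().lower() in CITIES


def is_date(text: str) -> bool:
    """Regex-driven concept: the whole string must match the pattern."""
    return DATE_RE.fullmatch(text.strip()) is not None
```

A wrapper designer would only pick such a concept by name; the regular expression or table behind it stays hidden.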
21
Main Menu / Pattern Generation Menu
22
Elog
Internal data storage language
Datalog-like syntax and semantics
Invisible to the user
Specifically designed for hierarchical and modular data extraction
Flexible, intuitive, easily extensible
Patterns are stored as narrowing (logical and) and broadening (logical or) steps
Elog rules are implementations of the visually defined filters
23
Elog Extraction Program for eBay Example
24
Document Model
Brackets specify character offsets
Nodes are numbered in depth-first, left-to-right fashion
HTML tags refer to element sets containing attribute names and values
–The body tag contains attributes {(name, body), (bgcolor, FFFFFF), (elementtext, …)}
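The numbering scheme and the attribute sets can be reproduced on a small well-formed page. The sketch uses Python's `xml.etree.ElementTree` rather than Lixto's own parser, starts numbering at 1 as an assumption, and leaves character offsets out; `number_nodes` and `attribute_set` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def number_nodes(root):
    """Number elements in depth-first, left-to-right (document) order.

    Element.iter() yields the root first and then its descendants in
    document order, which is the traversal the slide describes.
    """
    return [(i, el.tag) for i, el in enumerate(root.iter(), start=1)]


def attribute_set(el):
    """Build the attribute set: tag name, HTML attributes, and element text."""
    attrs = {"name": el.tag, **el.attrib}
    attrs["elementtext"] = "".join(el.itertext())
    return attrs


# Invented sample page.
page = ET.fromstring(
    '<body bgcolor="FFFFFF">'
    "<table><tr><td>A</td><td>B</td></tr></table>"
    "<p>C</p>"
    "</body>"
)

numbered = number_nodes(page)
body_attrs = attribute_set(page)
```

In the sample, the whole table subtree is numbered before the paragraph that follows it, and the body's `elementtext` is the concatenated text of all its descendants.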
25
HTML Example Page
26
XML Translation
27
Extraction Mechanisms
Tree extraction
–Elements identified by tree path (*.table*.tr)
–Attribute constraints reduce matched elements
–Element path definition (epd): tree path + attribute constraints
String extraction
–Strings stored in ‘context’ nodes
–Regular expression matching
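Both mechanisms can be sketched together: a tree path with wildcards plus attribute constraints selects elements, and a regular expression then extracts a string from their text. The path syntax below (a list of steps in which `"*"` matches any run of tags, possibly empty), the helper names, and the sample page are assumptions for illustration; the code is not the Elog evaluation procedure.

```python
import re
import xml.etree.ElementTree as ET


def match_path(path, steps):
    """Check a root-to-element tag path against steps; '*' matches any run of tags."""
    if not steps:
        return not path
    if steps[0] == "*":
        return any(match_path(path[k:], steps[1:]) for k in range(len(path) + 1))
    return bool(path) and path[0] == steps[0] and match_path(path[1:], steps[1:])


def tree_extract(root, steps, constraints=None):
    """Return elements whose tag path matches and whose attributes satisfy the constraints."""
    constraints = constraints or {}
    found = []

    def walk(el, path):
        path = path + [el.tag]
        if match_path(path, steps) and all(
            el.get(k) == v for k, v in constraints.items()
        ):
            found.append(el)
        for child in el:
            walk(child, path)

    walk(root, [])
    return found


# Invented sample page.
page = ET.fromstring(
    "<body><table><tbody>"
    '<tr class="item"><td>Notebook GBP 1800</td></tr>'
    '<tr class="ad"><td>Banner</td></tr>'
    "</tbody></table></body>"
)

# Tree extraction: any tr below any table, restricted by an attribute constraint.
rows = tree_extract(page, ["*", "table", "*", "tr"], {"class": "item"})
texts = ["".join(r.itertext()) for r in rows]

# String extraction: a regular expression over the text of the matched rows.
prices = [int(m.group(1)) for t in texts for m in re.finditer(r"GBP (\d+)", t)]
```

The attribute constraint removes the advertising row, and the regular expression pulls the numeric price out of the remaining row's flat text.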
28
HTML Tree Extraction
29
Lixto Test Sites
30
Results
31
Strengths & Weaknesses
Intuitive UI (if it needs a manual, it’s not a good program)
Highly customizable
Supports crawling across web sites
No tree output after crawling
Slow
Extracts only one target type at a time
32
Current/Future Work
Extend the tree structure to support crawling across multiple sites (crawling is currently supported)
Server-based Lixto system
Automated heuristics
Support for multiple example targets at once
Embedding Lixto wrappers into an information channel system