Published by Trevor Grant. Modified over 9 years ago.
1
A Brief Survey of Web Data Extraction Tools
Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, Juliana S. Teixeira
Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
SIGMOD Record, June 2002
Presented by Young-Seok Lim, January 9th, 2009
2
Contents
- Introduction
- A taxonomy for characterizing Web data extraction tools
- Overview of Web data extraction tools
- Qualitative analysis
- Conclusions
3
Introduction
- The explosion of the World Wide Web has produced a wealth of data on many different subjects
- Users retrieve Web data by
  - Browsing: not suitable for locating particular items of data, because following links is tedious and it is easy to get lost
  - Keyword searching: sometimes more efficient than browsing, but often returns vast amounts of data
4
Introduction
- Ideas taken from the database area
  - Databases require structured data
  - XML is a standard for structuring data
- But what about existing Web data?
  - Unstructured or semistructured
  - Enormous and still increasing
- A possible strategy is to extract data from Web sources to populate databases
  - Specialized programs, called wrappers, extract data from Web sources
5
Introduction
- Given a Web page S containing a set of implicit objects, a wrapper is a program that executes the mapping W that populates a data repository R with the objects in S
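The definition above can be made concrete with a minimal sketch. It assumes a hypothetical page S in which each list item implicitly encodes one object, and it uses a single hand-written regular expression as the mapping W. The page content, the rule, and the names `PAGE_S`, `ITEM_RULE`, and `wrapper` are all illustrative assumptions, not taken from the surveyed tools.

```python
import re

# Hypothetical page S: each <li> implicitly encodes one (title, price) object.
PAGE_S = """
<ul>
  <li><b>Database Systems</b> - $55.00</li>
  <li><b>Web Mining</b> - $42.50</li>
</ul>
"""

# Illustrative hand-written extraction rule; real wrappers use richer rules.
ITEM_RULE = re.compile(r"<li><b>(?P<title>[^<]+)</b> - \$(?P<price>[\d.]+)</li>")


def wrapper(page):
    """Mapping W: return the objects implicit in page S as records for R."""
    return [
        {"title": m.group("title"), "price": float(m.group("price"))}
        for m in ITEM_RULE.finditer(page)
    ]


# Repository R is modelled here as a plain list of dictionaries.
repository_r = wrapper(PAGE_S)
```

A hand-written rule like this breaks as soon as the page layout changes, which is the motivation for the tool families surveyed in the following slides.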
6
A taxonomy for characterizing Web data extraction tools
- Languages for Wrapper Development
  - Languages specially designed to assist users in constructing wrappers
  - Minerva, TSIMMIS, Web-OQL
- HTML-aware Tools
  - Rely on inherent structural features of HTML documents (the parsing tree) to accomplish data extraction
  - W4F, XWRAP, RoadRunner
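To illustrate what relying on the parsing tree means for HTML-aware tools, here is a small sketch that extracts text by its tag path from the root. It uses Python's standard `html.parser` module; the class name `PathExtractor`, the sample page, and the path are assumptions made for this example, and it does not reproduce how W4F, XWRAP, or RoadRunner actually work.

```python
from html.parser import HTMLParser


class PathExtractor(HTMLParser):
    """Collect text found at a given tag path in the HTML parsing tree."""

    def __init__(self, target_path):
        super().__init__()
        self.target_path = target_path
        self.stack = []    # open tags from the root down to the current node
        self.matches = []  # text nodes found exactly at target_path

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Assumes well-formed input: every start tag has a matching end tag.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack == self.target_path and data.strip():
            self.matches.append(data.strip())


# Hypothetical sample page: two table cells and an unrelated paragraph.
page = (
    "<html><body><table><tr><td>alpha</td><td>beta</td></tr></table>"
    "<p>footer</p></body></html>"
)
extractor = PathExtractor(["html", "body", "table", "tr", "td"])
extractor.feed(page)
extractor.close()
cells = extractor.matches
```

Only the text inside the table cells is collected; the paragraph is ignored because its path in the tree differs.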
7
A taxonomy for characterizing Web data extraction tools
- NLP-based Tools
  - Use natural language processing (filtering, part-of-speech tagging, and lexical semantic tagging) to learn extraction rules
  - RAPIER, SRV, WHISK
- Wrapper Induction Tools
  - Generate delimiter-based extraction rules derived from a given set of training examples
  - WIEN, SoftMealy, STALKER
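A delimiter-based extraction rule can be sketched as one (left, right) delimiter pair per attribute. The code below only applies such rules; the induction step, which learns the delimiters from labeled training examples, is not shown. The function name `extract_lr` and the sample page are assumptions made for illustration.

```python
def extract_lr(page, delimiters):
    """Apply delimiter-based rules: one (left, right) pair per attribute."""
    tuples, pos = [], 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples  # no further tuple on the page
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return tuples  # incomplete tuple, stop
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))


# Hypothetical page listing (country, code) pairs.
page = "<b>Congo</b> <i>242</i><br><b>Egypt</b> <i>20</i><br>"
rules = [("<b>", "</b>"), ("<i>", "</i>")]
rows = extract_lr(page, rules)
```

Because the rules depend only on the text surrounding each value, they work on any delimited text, but they cannot describe nested or optional attributes.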
8
A taxonomy for characterizing Web data extraction tools
- Modeling-based Tools
  - Given a target structure, try to locate in Web pages portions of data that implicitly conform to that structure
  - NoDoSE, DEByE
- Ontology-based Tools
  - Given a specific application domain, an ontology can be used to locate constants present in the page and to construct objects with them
  - Brigham Young University Data Extraction Group
9
Overview of Web data extraction tools - Languages for wrapper development
- Minerva
  - Combines a declarative grammar-based approach with features typical of procedural programming languages
  - A set of productions
  - Each production defines the structure of a non-terminal symbol of the grammar in terms of terminal symbols and other non-terminals
  - Exception clause
10
Overview of Web data extraction tools - Languages for wrapper development
- TSIMMIS
  - Specification files composed of a sequence of commands that define extraction steps
  - Each command has the form [variables, source, pattern]
  - variables is a set of variables that hold the extraction results
- Web-OQL
  - Declarative query language capable of locating selected pieces of data in HTML pages
  - Works on an abstract HTML syntax tree, called a hypertree
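The [variables, source, pattern] command form can be sketched as a tiny interpreter in which each step reads one variable and binds new ones. This is not the TSIMMIS specification syntax: the patterns here are ordinary Python regular expressions, and the function name `run_spec`, the variable names, and the sample page are assumptions made for illustration.

```python
import re


def run_spec(commands, initial):
    """Run extraction steps of the form (variables, source, pattern) in order."""
    env = dict(initial)
    for variables, source, pattern in commands:
        match = re.search(pattern, env[source], re.DOTALL)
        if match is None:
            raise ValueError(f"pattern did not match source {source!r}")
        # Bind each capture group to the corresponding result variable.
        env.update(zip(variables, match.groups()))
    return env


# Hypothetical two-step specification: isolate the table, then read one row.
spec = [
    (["table"], "root", r"<table>(.*)</table>"),
    (["city", "temp"], "table", r"<td>(.*?)</td><td>(.*?)</td>"),
]
page = "<h1>Weather</h1><table><tr><td>Oslo</td><td>4</td></tr></table>"
result = run_spec(spec, {"root": page})
```

Each step narrows the text that later steps see, which is why the order of the commands matters.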
11
Overview of Web data extraction tools - HTML-aware Tools
- W4F (World Wide Web Wrapper Factory)
  - Three phases
    - Describe how to access the document
    - Describe what pieces of data to extract
    - Declare what target structure to use for storing the extracted data
  - Extraction rules are defined in HEL (HTML Extraction Language)
- XWRAP
  - Semiautomatic construction of wrappers
  - Cleans up bad HTML tags
  - Outputs a wrapper coded in Java
  - Six heuristics
12
Overview of Web data extraction tools - HTML-aware Tools
- RoadRunner
  - Compares the HTML structure of two (or more) given sample pages belonging to the same “page class”
  - A grammar is inferred from the resulting schema
  - Fully automatic, with no user intervention
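The idea of comparing two pages of the same page class can be approximated with a token-level diff: tokens shared by both pages are treated as template, and spans that differ are treated as data fields. This sketch swaps in `difflib.SequenceMatcher` from the Python standard library as a stand-in and is not RoadRunner's actual matching algorithm; the function names, the `#FIELD` marker, and the sample pages are assumptions.

```python
import re
from difflib import SequenceMatcher


def tokenize(page):
    """Split a page into tag tokens and text tokens."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", page) if t.strip()]


def infer_template(page_a, page_b):
    """Keep tokens shared by both pages; mark differing spans as data fields."""
    a, b = tokenize(page_a), tokenize(page_b)
    template = []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])
        else:
            # Any mismatch is treated as a data field in this simplification.
            template.append("#FIELD")
    return template


# Hypothetical sample pages from the same page class.
page_a = "<html><b>Title:</b> Dune <b>Author:</b> Herbert</html>"
page_b = "<html><b>Title:</b> Emma <b>Author:</b> Austen</html>"
template = infer_template(page_a, page_b)
```

This simplification handles only flat string mismatches; optional and repeated structures, which a real page class usually contains, would need tag-level mismatch handling.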
13
Overview of Web data extraction tools - NLP-based Tools
- RAPIER (Robust Automated Production of Information Extraction Rules)
  - Extracts from free text
  - Takes a template indicating the data to be extracted
  - Learns data extraction patterns to extract data for populating the template's slots
  - Patterns use constraints on the words and part-of-speech tags
  - Single-slot
14
Overview of Web data extraction tools - NLP-based Tools
- SRV
  - Based on a given set of training examples
  - Relies on a set of token-oriented features that can be either simple or relational
  - Single-slot
- WHISK
  - A set of extraction rules is induced from a given set of training example documents
  - On each iteration, the user adds tags
  - Multi-slot
15
Overview of Web data extraction tools – Wrapper Induction Tools
- WIEN
  - A pioneer wrapper induction tool
  - Takes a set of pages where the data of interest is labeled to serve as examples
  - Does not deal with nested structures or with variations typical of semistructured data
- SoftMealy
  - Uses a special kind of automata called finite-state transducers (FST)
16
Overview of Web data extraction tools – Wrapper Induction Tools
- STALKER
  - Can deal with hierarchical data extraction
  - Two inputs
    - A set of training examples in the form of a sequence of tokens representing the surroundings of the data to be extracted
    - A description of the page structure, called an Embedded Catalog Tree (ECT)
  - Disjunctive rules
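A disjunctive rule can be sketched as an ordered list of alternatives, each a sequence of landmarks to skip past before the data begins; the first alternative that matches wins. This is a simplified, string-based illustration: the tool itself works on token sequences and learns the rules from the training examples, and the ECT is not modelled here. The function names and sample pages are assumptions.

```python
def skip_to(text, landmarks, start=0):
    """Consume landmarks in order; return the index just past the last one."""
    pos = start
    for landmark in landmarks:
        found = text.find(landmark, pos)
        if found == -1:
            return None
        pos = found + len(landmark)
    return pos


def apply_disjunctive(text, alternatives):
    """Return the position given by the first alternative that matches."""
    for landmarks in alternatives:
        pos = skip_to(text, landmarks)
        if pos is not None:
            return pos
    return None


def extract(text, start_rule, end_landmark):
    """Extract the text between the start rule and the end landmark."""
    begin = apply_disjunctive(text, start_rule)
    if begin is None:
        return None
    end = text.find(end_landmark, begin)
    return text[begin:end] if end != -1 else None


# Hypothetical pages that format the same field in two different ways.
start_rule = [["Phone:", "<b>"], ["Phone:", "<i>"]]
page_1 = "Name: Joe's <br>Phone: <b>555-0100</b>"
page_2 = "Name: Ann's <br>Phone: <i>555-0199</i>"
phone_1 = extract(page_1, start_rule, "</")
phone_2 = extract(page_2, start_rule, "</")
```

The second alternative is what lets one rule cover both formatting variants of the field.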
17
Overview of Web data extraction tools – Modeling-based Tools
- NoDoSE (Northwestern Document Structure Extractor)
  - Interactive tool for semi-automatically determining the structure of documents
  - Mining component
- DEByE (Data Extraction By Example)
  - Interactive tool that receives as input a set of example objects taken from a sample Web page
  - Object extraction patterns (OEP)
  - Bottom-up extraction algorithm
18
Overview of Web data extraction tools – Ontology-based Tools
- The work of the Data Extraction Group at Brigham Young University (BYU)
  - Ontologies are constructed to describe the data of interest
  - If the ontology is representative enough, extraction is fully automated
  - Inherently resilient and adaptable
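The ontology-based idea of locating constants and building objects from them can be sketched with a dictionary of recognizers, one per attribute. The mini-ontology, the advertisement text, and the names `CAR_AD_ONTOLOGY` and `extract_object` are hypothetical; a real ontology also describes relationships and cardinalities between object sets, which this sketch ignores.

```python
import re

# Hypothetical mini-ontology for car ads: attribute name -> constant recognizer.
CAR_AD_ONTOLOGY = {
    "year": re.compile(r"\b(19|20)\d{2}\b"),
    "price": re.compile(r"\$\d[\d,]*"),
    "make": re.compile(r"\b(Ford|Honda|Toyota)\b"),
}


def extract_object(text, ontology):
    """Locate one constant per attribute and assemble them into an object."""
    obj = {}
    for name, recognizer in ontology.items():
        match = recognizer.search(text)
        if match:
            obj[name] = match.group(0)
    return obj


# Hypothetical free-text advertisement with no helpful markup.
ad = "For sale: 1998 Honda Civic, low mileage, $3,500 or best offer."
car = extract_object(ad, CAR_AD_ONTOLOGY)
```

Since the recognizers look at the data rather than at the page layout, the same ontology keeps working when the markup changes, which is the sense in which this approach is resilient and adaptable.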
19
Qualitative analysis - Degree of automation
- Related to the amount of work left to the user during the process of generating a wrapper
- Language-based approaches
  - Require the writing of code
- HTML-aware tools
  - Higher degree of automation
  - To be really effective, they need very consistent use of HTML tags in the target pages
- NLP-based, induction-based, and modeling-based tools
  - Semi-automated
  - The user has to provide examples
- Ontology-based tools
  - Manual
  - Require the construction of an ontology
20
Qualitative analysis – Support for Complex Objects
- Language-based approaches
  - Handled through coding
- HTML-aware tools
  - W4F relies on HEL coding
- NLP-based, induction-based, and modeling-based tools
  - SoftMealy allows the representation of structural variations
  - SoftMealy does not deal with nested structures
  - STALKER, NoDoSE, and DEByE represent hierarchical structure and structural variations
21
Qualitative analysis – Page Contents
- Two kinds of pages
  - Semi-structured data
  - Semi-structured text
22
Qualitative analysis – Ease of Use
- HTML-aware tools, NLP-based tools, wrapper induction tools, and modeling-based tools usually present a GUI
- In the BYU tool, the ontology creation process must also be done manually by the user
23
Qualitative analysis – XML Output
- In Minerva, the user has to explicitly write code to generate XML output
- In W4F, via a “mapping wizard”
- XWRAP and DEByE provide XML output natively
- NoDoSE supports a variety of formats
24
Qualitative analysis – Support for Non-HTML Sources
- NLP-based tools and the BYU tools are especially suitable for non-HTML sources
- Wrapper induction tools and modeling-based tools do not rely solely on HTML tags
25
Qualitative analysis – Resilience and Adaptiveness
- The structural and presentation features of Web pages are prone to frequent changes
- Resilience: the capacity to continue working properly when the pages change
- Adaptiveness: the property of working properly with pages from another source in the same application domain
26
Conclusion
27
Conclusion