A Brief Survey of Web Data Extraction Tools
Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, Juliana S. Teixeira
Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
SIGMOD Record, June 2002
Presented by Young-Seok Lim, January 9th, 2009

Contents
- Introduction
- A taxonomy for characterizing Web data extraction tools
- Overview of Web data extraction tools
- Qualitative analysis
- Conclusions

Introduction
- The explosion of the World Wide Web has made a wealth of data on many different subjects available
- Users retrieve Web data by
  - Browsing: not suitable for locating particular items of data, because following links is tedious and it is easy to get lost
  - Keyword searching: sometimes more efficient than browsing, but often returns vast amounts of data

Introduction
- Ideas taken from the database area
  - Databases require structured data
  - XML is a standard for structuring data
- But what about existing Web data?
  - Unstructured or semistructured
  - Enormous and still increasing
- A possible strategy is to extract data from Web sources to populate databases
  - Specialized programs, called wrappers, extract data from Web sources

Introduction
- Given a Web page S containing a set of implicit objects, a wrapper is a program that executes the mapping W that populates a data repository R with the objects in S
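The definition above can be made concrete with a toy example. This is a minimal sketch, not any surveyed tool: the sample page, the regular expression, and the names `PAGE_S`, `ITEM`, and `wrapper` are all hypothetical, chosen only to show a mapping W from a page S to a repository R.

```python
import re
from typing import Dict, List

# Hypothetical page S: two book objects are implicit in the HTML markup.
PAGE_S = """
<ul>
  <li><b>Database Systems</b> - <i>45.00</i></li>
  <li><b>Web Data Mining</b> - <i>60.50</i></li>
</ul>
"""

# Hypothetical hand-written extraction pattern (an assumption of this sketch).
ITEM = re.compile(r"<li><b>(?P<title>.*?)</b> - <i>(?P<price>.*?)</i></li>")


def wrapper(page: str) -> List[Dict[str, str]]:
    """The mapping W: turn the implicit objects in S into records for R."""
    return [match.groupdict() for match in ITEM.finditer(page)]


# Repository R, here just an in-memory list of records.
repository_r = wrapper(PAGE_S)
print(repository_r)
```

A hand-written pattern like this breaks as soon as the page layout changes, which is the problem the tools surveyed in the following slides try to reduce.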

A taxonomy for characterizing Web data extraction tools
- Languages for Wrapper Development
  - Languages specially designed to assist users in constructing wrappers
  - Minerva, TSIMMIS, Web-OQL
- HTML-aware Tools
  - Rely on inherent structural features of HTML documents (the parsing tree) to accomplish data extraction
  - W4F, XWRAP, RoadRunner

A taxonomy for characterizing Web data extraction tools
- NLP-based Tools
  - Use natural language processing techniques (filtering, part-of-speech tagging, and lexical semantic tagging) to learn extraction rules
  - RAPIER, SRV, WHISK
- Wrapper Induction Tools
  - Generate delimiter-based extraction rules derived from a given set of training examples
  - WIEN, SoftMealy, STALKER

A taxonomy for characterizing Web data extraction tools
- Modeling-based Tools
  - Given a target structure, try to locate in Web pages portions of data that implicitly conform to that structure
  - NoDoSE, DEByE
- Ontology-based Tools
  - Given a specific application domain, an ontology can be used to locate constants present in the page and to construct objects with them
  - Brigham Young University Data Extraction Group

Overview of Web data extraction tools - Languages for wrapper development
- Minerva
  - Combines a declarative grammar-based approach with features typical of procedural programming languages
  - A wrapper is written as a set of productions
  - Each production defines the structure of a non-terminal symbol of the grammar in terms of terminal symbols and other non-terminals
  - Exception clauses

Overview of Web data extraction tools - Languages for wrapper development
- TSIMMIS
  - Specification files composed of a sequence of commands that define extraction steps
  - Each command has the form [variables, source, pattern]
  - "variables" is the set of variables that hold the extraction results
- Web-OQL
  - Declarative query language capable of locating selected pieces of data in HTML pages
  - Works on an abstract HTML syntax tree, called a hypertree
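A specification made of [variables, source, pattern] commands can be pictured with a tiny interpreter. This is a rough sketch loosely modeled on that command form, not the actual TSIMMIS syntax or implementation: the pattern convention (`*` skips text, `#` captures it), the sample page, and the names `compile_pattern` and `run` are assumptions of the example.

```python
import re
from typing import Dict, List, Tuple

Command = Tuple[List[str], str, str]  # [variables, source, pattern]


def compile_pattern(pattern: str) -> re.Pattern:
    """Translate the sketch's pattern syntax into a regular expression.

    Assumed convention: '*' skips any text, '#' captures text into the
    next variable, everything else must match literally.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*?")
        elif ch == "#":
            parts.append("(.*?)")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def run(commands: List[Command], env: Dict[str, str]) -> Dict[str, str]:
    """Execute the commands in order; each one reads a source variable
    and binds its captured pieces to the listed variables."""
    for variables, source, pattern in commands:
        match = compile_pattern(pattern).fullmatch(env[source])
        if match is None:
            raise ValueError(f"pattern {pattern!r} did not match {source!r}")
        for name, value in zip(variables, match.groups()):
            env[name] = value
    return env


# Hypothetical input page and specification.
page = ("<html><title>Weather</title><body>"
        "<td>Belo Horizonte</td><td>27C</td></body></html>")
commands = [
    (["title", "body"], "page", "*<title>#</title>*<body>#</body>*"),
    (["city", "temp"], "body", "*<td>#</td>*<td>#</td>*"),
]
env = run(commands, {"page": page})
print(env["title"], env["city"], env["temp"])
```

The second command uses the `body` variable bound by the first as its source, which is how a sequence of commands can narrow the extraction step by step.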

Overview of Web data extraction tools - HTML-aware Tools
- W4F (World Wide Web Wrapper Factory)
  - Wrapper development in three phases
    - Describe how to access the document
    - Describe what pieces of data to extract
    - Declare what target structure to use for storing the extracted data
  - Extraction rules are defined in HEL (HTML Extraction Language)
- XWRAP
  - Semiautomatic construction of wrappers
  - Cleans up bad HTML tags
  - Outputs a wrapper coded in Java
  - Six heuristics
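The three W4F phases (access, extract, map to a target structure) can be mimicked in a few lines. This is only an illustrative sketch: it uses Python's standard `xml.etree.ElementTree` path queries instead of HEL, assumes a well-formed page, and the page content and the function names `retrieve`, `extract`, and `to_records` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical, well-formed page standing in for a fetched HTML document.
PAGE = (
    "<html><body><table>"
    "<tr><td>Minerva</td><td>language</td></tr>"
    "<tr><td>W4F</td><td>HTML-aware</td></tr>"
    "</table></body></html>"
)


def retrieve() -> str:
    """Phase 1: how to access the document (stubbed, no network access)."""
    return PAGE


def extract(document: str) -> List[List[str]]:
    """Phase 2: which pieces to extract, addressed by a path in the parse tree."""
    root = ET.fromstring(document)
    return [
        [cell.text or "" for cell in row.findall("td")]
        for row in root.findall("./body/table/tr")
    ]


def to_records(rows: List[List[str]]) -> List[Dict[str, str]]:
    """Phase 3: the target structure used to store the extracted data."""
    return [{"tool": tool, "category": category} for tool, category in rows]


records = to_records(extract(retrieve()))
print(records)
```

Because the rule is a path through the tag tree, it only works while the page keeps using its tags consistently.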

Overview of Web data extraction tools - HTML-aware Tools
- RoadRunner
  - Compares the HTML structure of two (or more) given sample pages belonging to the same "page class"
  - A grammar is inferred from the resulting schema
  - Fully automatic, with no user intervention
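The idea of comparing two sample pages of the same class can be sketched with a plain sequence alignment. This is not RoadRunner's actual algorithm: it swaps in `difflib.SequenceMatcher` from the Python standard library, treats every mismatch as a data field, and does not handle mismatches between tags. The sample pages, the `#FIELD` marker, and the names `tokenize` and `infer_template` are assumptions.

```python
import re
from difflib import SequenceMatcher
from typing import List


def tokenize(html: str) -> List[str]:
    """Split a page into tag tokens and text tokens, dropping blank text."""
    return [tok for tok in re.findall(r"<[^>]+>|[^<]+", html) if tok.strip()]


def infer_template(page_a: str, page_b: str) -> List[str]:
    """Align two pages; tokens that agree form the template, and any
    stretch where the pages differ is replaced by a '#FIELD' slot."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    template: List[str] = []
    for op, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[a_start:a_end])
        else:
            template.append("#FIELD")
    return template


# Two hypothetical pages generated from the same layout.
page_1 = "<html><b>Title:</b>Database Systems<b>Price:</b>45.00</html>"
page_2 = "<html><b>Title:</b>Web Data Mining<b>Price:</b>60.50</html>"
template = infer_template(page_1, page_2)
print(template)
```

No labeled examples are supplied here: the fields are discovered purely from where the two pages disagree.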

Overview of Web data extraction tools - NLP-based Tools
- RAPIER (Robust Automated Production of Information Extraction Rules)
  - Extracts data from free text
  - Takes a template indicating the data to be extracted
  - Learns data extraction patterns to extract data for populating the template's slots
  - Patterns use constraints on the words and part-of-speech tags
  - Single-slot

Overview of Web data extraction tools - NLP-based Tools
- SRV
  - Based on a given set of training examples
  - Relies on a set of token-oriented features that can be either simple or relational
  - Single-slot
- WHISK
  - A set of extraction rules is induced from a given set of training example documents
  - At each iteration, the user adds tags
  - Multi-slot

Overview of Web data extraction tools - Wrapper Induction Tools
- WIEN
  - A pioneer wrapper induction tool
  - Takes as input a set of pages where the data of interest is labeled to serve as examples
  - Does not deal with nested structures or with variations typical of semistructured data
- SoftMealy
  - Uses a special kind of automata called finite-state transducers (FST)
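Delimiter-based extraction rules of the kind wrapper induction tools generate can be sketched as follows. This shows only how such rules are applied, not how they are induced from labeled pages, and it is not WIEN's code: the delimiter pairs, the sample page, and the name `lr_extract` are hypothetical.

```python
from typing import List, Tuple


def lr_extract(page: str, delimiters: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Apply one (left, right) delimiter pair per attribute, repeatedly,
    to pull flat tuples out of a page. Stops at the first delimiter
    that can no longer be found; an incomplete tuple is discarded."""
    rows: List[Tuple[str, ...]] = []
    pos = 0
    while True:
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])
            pos = end + len(right)
        rows.append(tuple(row))


# Hypothetical page listing countries and their calling codes.
page = "<b>Brazil</b><i>55</i><b>Korea</b><i>82</i>"
rules = [("<b>", "</b>"), ("<i>", "</i>")]
rows = lr_extract(page, rules)
print(rows)
```

Each tuple must have the same attributes in the same order, which is why rules of this flat form cannot express nested structures or missing attributes.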

Overview of Web data extraction tools - Wrapper Induction Tools
- STALKER
  - Can deal with hierarchical data extraction
  - Two inputs
    - A set of training examples in the form of a sequence of tokens representing the surroundings of the data to be extracted
    - A description of the page structure, called an Embedded Catalog Tree (ECT)
  - Disjunctive rules
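A disjunctive rule can be read as "try this sequence of landmarks, otherwise try that one". The sketch below illustrates only that idea and leaves out the Embedded Catalog Tree and the learning step; the landmark lists, the sample pages, and the names `skip_to`, `apply_disjunctive`, and `extract` are assumptions, not STALKER's actual rule language.

```python
from typing import List, Optional


def skip_to(text: str, landmarks: List[str]) -> Optional[int]:
    """Consume the landmarks in order; return the position just after
    the last one, or None if any landmark is missing."""
    pos = 0
    for landmark in landmarks:
        index = text.find(landmark, pos)
        if index == -1:
            return None
        pos = index + len(landmark)
    return pos


def apply_disjunctive(text: str, alternatives: List[List[str]]) -> Optional[int]:
    """Try each alternative rule in order; the first one that matches wins."""
    for landmarks in alternatives:
        pos = skip_to(text, landmarks)
        if pos is not None:
            return pos
    return None


def extract(text: str, start_rule: List[List[str]], end_marker: str) -> Optional[str]:
    """Extract the text between the start position found by a disjunctive
    rule and the next occurrence of end_marker."""
    start = apply_disjunctive(text, start_rule)
    if start is None:
        return None
    end = text.find(end_marker, start)
    return text[start:end] if end != -1 else None


# Two hypothetical pages that format the same attribute differently.
page_1 = "Address: <b>12 Pico St.</b> Phone: 555-1234"
page_2 = "Address: <i>4 Main Ave.</i> Phone: 555-9876"
start_rule = [["Address:", "<b>"], ["Address:", "<i>"]]
print(extract(page_1, start_rule, "<"), "|", extract(page_2, start_rule, "<"))
```

The second alternative is what lets one rule cover both formatting variants.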

Overview of Web data extraction tools - Modeling-based Tools
- NoDoSE (Northwestern Document Structure Extractor)
  - Interactive tool for semi-automatically determining the structure of documents
  - Mining component
- DEByE (Data Extraction By Example)
  - Interactive tool that receives as input a set of example objects taken from a sample Web page
  - Object extraction patterns (OEP)
  - Bottom-up extraction algorithm

Overview of Web data extraction tools - Ontology-based Tools
- The work of the Data Extraction Group at Brigham Young University (BYU)
  - Ontologies are constructed to describe the data of interest
  - If the ontology is representative enough, extraction is fully automated
  - Inherently resilient and adaptable
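Using an ontology to locate constants in a page can be pictured with one recognizer per attribute. This is a minimal sketch and not the BYU system: the car-ad domain, the regular expressions, and the names `ONTOLOGY` and `extract_object` are assumptions chosen for illustration.

```python
import re
from typing import Dict

# Hypothetical mini-ontology for car ads: each attribute of the object
# comes with a recognizer for the constants that can fill it.
ONTOLOGY = {
    "year": re.compile(r"\b(?:19|20)\d{2}\b"),
    "price": re.compile(r"\$\d[\d,]*"),
    "phone": re.compile(r"\b\d{3}-\d{4}\b"),
}


def extract_object(text: str) -> Dict[str, str]:
    """Locate the constants described by the ontology and build one object."""
    obj: Dict[str, str] = {}
    for attribute, recognizer in ONTOLOGY.items():
        match = recognizer.search(text)
        if match:
            obj[attribute] = match.group(0)
    return obj


# The same recognizers work on text laid out in different ways.
ad_1 = "1998 Honda Civic, low miles, $4,500 obo. Call 555-1234."
ad_2 = "<p>Call <b>555-9876</b>: Ford Focus (2001) for only $3,900</p>"
car_1 = extract_object(ad_1)
car_2 = extract_object(ad_2)
print(car_1)
print(car_2)
```

Since the recognizers look at the data itself rather than at page layout, the same ontology applies to both ads, which hints at why this approach is described as resilient and adaptable.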

Qualitative analysis - Degree of automation
- Related to the amount of work left to the user during the process of generating a wrapper
- Language-based approaches
  - Require the writing of code
- HTML-aware tools
  - Higher degree of automation
  - To be really effective, they require very consistent use of HTML tags in the target pages
- NLP-based, induction-based, and modeling-based tools
  - Semi-automated
  - The user has to provide examples
- Ontology-based tools
  - Manual
  - Require the construction of an ontology

Qualitative analysis - Support for Complex Objects
- Language-based approaches
  - Handled through coding
- HTML-aware tools
  - W4F uses HEL coding
- NLP-based, induction-based, and modeling-based tools
  - SoftMealy allows the representation of structural variations
  - SoftMealy does not deal with nested structures
  - STALKER, NoDoSE, and DEByE represent hierarchical structures and structural variations

Qualitative analysis - Page Contents
- Two kinds of pages
  - Semi-structured data
  - Semi-structured text

Qualitative analysis - Ease of Use
- HTML-aware tools, NLP-based tools, wrapper induction tools, and modeling-based tools usually present a GUI
- In the BYU tool, the ontology creation process must also be done manually by the user

Qualitative analysis - XML Output
- In Minerva, the user has to explicitly write code to generate XML output
- In W4F, a “mapping wizard” is used
- XWRAP and DEByE natively provide XML output
- NoDoSE supports a variety of formats

Qualitative analysis - Support for Non-HTML Sources
- NLP-based tools and the BYU tools are especially suitable for non-HTML sources
- Wrapper induction tools and modeling-based tools do not rely exclusively on HTML tags

Qualitative analysis - Resilience and Adaptiveness
- The structural and presentation features of Web pages are prone to frequent changes
- Resilience: the capacity to continue working properly when changes occur in the pages
- Adaptiveness: the property of working properly with pages from another source in the same application domain

Conclusion
