A Brief Survey of Web Data Extraction Tools Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, Juliana S. Teixeira Federal University.


1 A Brief Survey of Web Data Extraction Tools
Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran S. da Silva, Juliana S. Teixeira
Federal University of Minas Gerais, Belo Horizonte, MG, Brazil
SIGMOD Record, June 2002
Presented by Young-Seok Lim, January 9th, 2009

2 Contents
- Introduction
- A taxonomy for characterizing Web data extraction tools
- Overview of Web data extraction tools
- Qualitative analysis
- Conclusions

3 Introduction
- The explosion of the World Wide Web has made available a wealth of data on many different subjects
- Users retrieve Web data by
  - Browsing: not suitable for locating particular items of data, because following links is tedious and it is easy to get lost
  - Keyword searching: sometimes more efficient than browsing, but often returns vast amounts of data

4 Introduction
- Ideas taken from the database area
  - Databases require structured data
  - XML is a standard for structuring data
- But what about existing Web data?
  - Unstructured or semistructured
  - Enormous in volume and still increasing
- A possible strategy is to extract data from Web sources to populate databases
  - Specialized programs, called wrappers, extract data from Web sources

5 Introduction
- Given a Web page S containing a set of implicit objects, a wrapper is a program that executes the mapping W that populates a data repository R with the objects in S
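
As a minimal sketch of this definition, the Python below treats S as an HTML string, W as a function, and R as a list of dictionaries. The page layout, the regular expression, and the names `PAGE_S`, `wrapper`, and `repository_r` are illustrative assumptions, not taken from the survey.

```python
import re

# Hypothetical page S: two book objects are implicit in the markup.
PAGE_S = """
<ul>
  <li><b>Database Systems</b> <i>45.00</i></li>
  <li><b>Information Retrieval</b> <i>60.50</i></li>
</ul>
"""

# Assumed layout: the title sits in <b>, the price in <i>.
ITEM = re.compile(r"<li><b>(?P<title>.*?)</b>\s*<i>(?P<price>[\d.]+)</i></li>")


def wrapper(page):
    """Mapping W: turn the implicit objects in page S into records for R."""
    return [
        {"title": m.group("title"), "price": float(m.group("price"))}
        for m in ITEM.finditer(page)
    ]


# Repository R, populated with the objects found in S.
repository_r = wrapper(PAGE_S)
```

A hand-written wrapper like this breaks as soon as the page layout changes, which is the motivation for the tool families surveyed in the following slides.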

6 A taxonomy for characterizing Web data extraction tools
- Languages for Wrapper Development
  - Languages specially designed to assist users in constructing wrappers
  - Examples: Minerva, TSIMMIS, Web-OQL
- HTML-aware Tools
  - Rely on inherent structural features of HTML documents (the parsing tree) to accomplish data extraction
  - Examples: W4F, XWRAP, RoadRunner
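
To illustrate what relying on the parsing tree can mean, the sketch below uses Python's standard `html.parser` to collect the text found at one tag path. It is a generic illustration rather than the mechanism of W4F, XWRAP, or RoadRunner; the class name `PathExtractor`, the sample page, and the assumption of well-formed HTML are all hypothetical.

```python
from html.parser import HTMLParser


class PathExtractor(HTMLParser):
    """Collect the text found at one tag path in the parse tree."""

    def __init__(self, target_path):
        super().__init__()
        self.target_path = target_path  # e.g. ["html", "body", "table", "tr", "td"]
        self.stack = []                 # tags currently open
        self.values = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Assumption: well-formed HTML, so each end tag closes the top of the stack.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack == self.target_path:
            self.values.append(text)


# Hypothetical sample page.
page = (
    "<html><body><h1>Books</h1><table>"
    "<tr><td>Database Systems</td><td>45.00</td></tr>"
    "<tr><td>Information Retrieval</td><td>60.50</td></tr>"
    "</table></body></html>"
)

extractor = PathExtractor(["html", "body", "table", "tr", "td"])
extractor.feed(page)
cells = extractor.values
```

The heading text is skipped because its path ends in `h1`, not `td`; only the table cells are collected.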

7 A taxonomy for characterizing Web data extraction tools
- NLP-based Tools
  - Use natural language processing (filtering, part-of-speech tagging, and lexical semantic tagging) to learn extraction rules
  - Examples: RAPIER, SRV, WHISK
- Wrapper Induction Tools
  - Generate delimiter-based extraction rules derived from a given set of training examples
  - Examples: WIEN, SoftMealy, STALKER
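
The idea of deriving delimiter-based rules from training examples can be sketched as follows. This is a deliberately simplified illustration, not the algorithm of WIEN, SoftMealy, or STALKER: it assumes one labeled value per page, and it takes the left delimiter to be the longest common suffix of the text before the values and the right delimiter to be the longest common prefix of the text after them. All function names and sample pages are hypothetical.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_strings = [s[::-1] for s in strings]
    return os.path.commonprefix(reversed_strings)[::-1]


def induce_rule(examples):
    """Learn (left, right) delimiters from (page, labeled value) pairs."""
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)  # assumption: the value occurs in the page
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    return common_suffix(lefts), os.path.commonprefix(rights)


def apply_rule(rule, page):
    """Extract the text between the learned delimiters."""
    left, right = rule
    start = page.index(left) + len(left)
    end = page.index(right, start)
    return page[start:end]


# Hypothetical training pages with the data of interest labeled.
training = [
    ("<p>Name: <b>Brazil</b> Code: 55</p>", "Brazil"),
    ("<p>Name: <b>Spain</b> Code: 34</p>", "Spain"),
]
rule = induce_rule(training)
country = apply_rule(rule, "<p>Name: <b>Chile</b> Code: 56</p>")
```

The user only labels examples; the delimiters themselves are computed, which is what distinguishes induction from writing a wrapper by hand.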

8 A taxonomy for characterizing Web data extraction tools
- Modeling-based Tools
  - Given a target structure, try to locate in Web pages portions of data that implicitly conform to that structure
  - Examples: NoDoSE, DEByE
- Ontology-based Tools
  - Given a specific application domain, an ontology can be used to locate constants present in the page and to construct objects with them
  - Example: the work of the Brigham Young University Data Extraction Group

9 Overview of Web data extraction tools - Languages for wrapper development
- Minerva
  - Combines a declarative grammar-based approach with features typical of procedural programming languages
  - A grammar is a set of productions
  - Each production defines the structure of a nonterminal symbol of the grammar in terms of terminal symbols and other nonterminals
  - Exception clause

10 Overview of Web data extraction tools - Languages for wrapper development
- TSIMMIS
  - Specification files are composed of a sequence of commands that define extraction steps
  - Each command has the form [variables, source, pattern]
  - "variables" is a set of variables that hold the extraction results
- Web-OQL
  - Declarative query language capable of locating selected pieces of data in HTML pages
  - Works on an abstract HTML syntax tree, called a hypertree
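
The [variables, source, pattern] command form can be mimicked with a small interpreter. The sketch below is not TSIMMIS syntax: it assumes that patterns are Python regular expressions, that each capture group fills one variable, and that a source is either the page itself or a variable filled by an earlier command. The names `COMMANDS` and `run` are hypothetical.

```python
import re

# Each command is (variables, source, pattern).
# Assumption: "page" names the input document; any other source is a
# variable filled by an earlier command.
COMMANDS = [
    (["table"], "page", r"<table>(.*?)</table>"),
    (["title", "price"], "table", r"<td>(.*?)</td><td>(.*?)</td>"),
]


def run(commands, page):
    """Execute the commands in order and return all variable bindings."""
    env = {"page": [page]}
    for variables, source, pattern in commands:
        regex = re.compile(pattern, re.S)
        for name in variables:
            env[name] = []
        for text in env[source]:
            for match in regex.finditer(text):
                # One capture group per variable, in order.
                for name, value in zip(variables, match.groups()):
                    env[name].append(value)
    return env


# Hypothetical sample page.
sample = (
    "<html><table>"
    "<tr><td>Database Systems</td><td>45.00</td></tr>"
    "<tr><td>Information Retrieval</td><td>60.50</td></tr>"
    "</table></html>"
)
result = run(COMMANDS, sample)
```

Chaining commands this way lets a later step narrow its search to the text that an earlier step isolated.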

11 Overview of Web data extraction tools - HTML-aware Tools
- W4F (World Wide Web Wrapper Factory)
  - Three phases
    - Describe how to access the document
    - Describe what pieces of data to extract
    - Declare what target structure to use for storing the extracted data
  - Extraction rules are written in HEL (HTML Extraction Language)
- XWRAP
  - Semiautomatic construction of wrappers
  - Cleans up bad HTML tags
  - Outputs a wrapper coded in Java
  - Six heuristics

12 Overview of Web data extraction tools - HTML-aware Tools
- RoadRunner
  - Compares the HTML structure of two (or more) given sample pages belonging to the same “page class”
  - A grammar is inferred from the schema
  - Fully automatic, with no user intervention
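
The page-comparison idea can be sketched with Python's standard `difflib`: tokens shared by two pages of the same class are treated as template, and tokens that differ are treated as data fields. This is a heavily simplified stand-in, not RoadRunner's matching algorithm (it ignores optional and repeated structures), and `tokenize`, `infer_template`, the `#FIELD` marker, and the sample pages are hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokenize(page):
    """Split a page into tags and non-empty text chunks."""
    tokens = re.findall(r"<[^>]+>|[^<]+", page)
    return [t.strip() for t in tokens if t.strip()]


def infer_template(page_a, page_b):
    """Keep tokens common to both pages; mark every mismatch as a field."""
    a, b = tokenize(page_a), tokenize(page_b)
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    template = []
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            template.extend(a[i1:i2])
        else:
            template.append("#FIELD")
    return template


# Two hypothetical pages from the same page class.
page_a = "<html><b>Title:</b> Database Systems <i>45.00</i></html>"
page_b = "<html><b>Title:</b> Information Retrieval <i>60.50</i></html>"
template = infer_template(page_a, page_b)
```

No labeled examples are needed here; the only input is the pair of pages, which mirrors the fully automatic character described on the slide.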

13 Overview of Web data extraction tools - NLP-based Tools
- RAPIER (Robust Automated Production of Information Extraction Rules)
  - Extracts from free text
  - Takes a template indicating the data to be extracted
  - Learns data extraction patterns to extract data for populating the template's slots
  - Patterns place constraints on words and part-of-speech tags
  - Single-slot

14 Overview of Web data extraction tools - NLP-based Tools
- SRV
  - Based on a given set of training examples
  - Relies on a set of token-oriented features that can be either simple or relational
  - Single-slot
- WHISK
  - A set of extraction rules is induced from a given set of training example documents
  - On each iteration, the user adds tags
  - Multi-slot
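
The difference between single-slot and multi-slot rules can be shown with plain regular expressions. The sketch below does not use the rule language of RAPIER, SRV, or WHISK and involves no learning; the rental listing text and both patterns are hypothetical.

```python
import re

# Hypothetical free-text listing with two rental offers.
AD = "Downtown - 1 br apartment. $675. Riverside - 3 br house. $995."

# Single-slot rule: extracts one field at a time, so values from
# different offers are not linked to each other.
single_slot = re.compile(r"\$(\d+)")

# Multi-slot rule: extracts related fields together, one tuple per offer.
multi_slot = re.compile(r"(\d+) br.*?\$(\d+)")

prices = single_slot.findall(AD)
rentals = [
    {"bedrooms": int(bedrooms), "price": int(price)}
    for bedrooms, price in multi_slot.findall(AD)
]
```

With the single-slot rule, a separate step would be needed to work out which price belongs to which offer; the multi-slot rule keeps that association in the extracted tuples.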

15 Overview of Web data extraction tools – Wrapper Induction Tools
- WIEN
  - A pioneer wrapper induction tool
  - Takes as input a set of pages where the data of interest is labeled to serve as examples
  - Does not deal with nested structures or with variations typical of semistructured data
- SoftMealy
  - Uses a special kind of automata called finite-state transducers (FST)

16 Overview of Web data extraction tools – Wrapper Induction Tools
- STALKER
  - Can deal with hierarchical data extraction
  - Two inputs
    - A set of training examples in the form of a sequence of tokens representing the surroundings of the data to be extracted
    - A description of the page structure, called an Embedded Catalog Tree (ECT)
  - Disjunctive rules
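
A disjunctive rule can be read as a list of alternatives that are tried in order until one applies. The sketch below illustrates that reading with landmark sequences matched against the text surrounding the data. It is an assumption-laden simplification, not STALKER's rule language or learning algorithm: the functions `skip_to`, `apply_disjunctive`, and `extract`, and the sample pages, are hypothetical, and rules are written by hand rather than induced.

```python
def skip_to(text, landmarks, start=0):
    """Consume each landmark in order; return the position after the last one."""
    pos = start
    for landmark in landmarks:
        found = text.find(landmark, pos)
        if found == -1:
            return None  # this alternative does not apply to the page
        pos = found + len(landmark)
    return pos


def apply_disjunctive(text, alternatives):
    """Try each alternative landmark sequence; the first that matches wins."""
    for landmarks in alternatives:
        pos = skip_to(text, landmarks)
        if pos is not None:
            return pos
    return None


def extract(text, start_rule, end_landmark):
    """Return the text between the start rule and the next end landmark."""
    begin = apply_disjunctive(text, start_rule)
    if begin is None:
        return None
    end = text.find(end_landmark, begin)
    return text[begin:end] if end != -1 else None


# Hypothetical rule covering two formatting variants of the same field.
PHONE_START = [["Phone:", "<b>"], ["Tel:", "<i>"]]

phone_1 = extract("<p>Name: Joe Phone: <b>555-1234</b></p>", PHONE_START, "<")
phone_2 = extract("<p>Name: Ann Tel: <i>555-9876</i></p>", PHONE_START, "<")
```

Each alternative handles one structural variation, so a single rule can cover pages that format the same field in different ways.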

17 Overview of Web data extraction tools – Modeling-based Tools
- NoDoSE (Northwestern Document Structure Extractor)
  - Interactive tool for semi-automatically determining the structure of documents
  - Mining component
- DEByE (Data Extraction By Example)
  - Interactive tool that receives as input a set of example objects taken from a sample Web page
  - Object extraction patterns (OEP)
  - Bottom-up extraction algorithm

18 Overview of Web data extraction tools – Ontology-based Tools
- The work of the Data Extraction Group at Brigham Young University (BYU)
  - Ontologies are constructed to describe the data of interest
  - If the ontology is representative enough, extraction is fully automated
  - Inherently resilient and adaptable
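
The idea of using an ontology to locate constants and build objects from them can be sketched as below. This is a minimal illustration, not the BYU system: it assumes the "ontology" is just a flat dictionary that maps each field of a hypothetical car-ad domain to a regular expression recognizing its constants. The names `ONTOLOGY`, `extract_object`, and the sample text are all assumptions.

```python
import re

# Hypothetical, flat "ontology" for a car-ad domain: each field is
# described by a pattern that recognizes its constants in plain text.
ONTOLOGY = {
    "price": re.compile(r"\$\s?(\d+(?:\.\d{2})?)"),
    "year": re.compile(r"\b(19\d{2}|20\d{2})\b"),
    "phone": re.compile(r"\b(\d{3}-\d{4})\b"),
}


def extract_object(text, ontology=ONTOLOGY):
    """Build one object from the first constant found for each field."""
    obj = {}
    for field, pattern in ontology.items():
        match = pattern.search(text)
        if match:
            obj[field] = match.group(1)
    return obj


ad = "For sale: 1998 sedan, low mileage, $4500. Call 555-1234."
car = extract_object(ad)
```

Because the patterns describe the data rather than the page layout, the same dictionary can be applied to text from another source in the same domain, which is the intuition behind the resilience and adaptability noted on the slide.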

19 Qualitative analysis - Degree of automation
- Related to the amount of work left to the user during the process of generating a wrapper
- Approaches based on languages
  - Require the writing of code
- HTML-aware tools
  - Higher degree of automation
  - To be really effective, the target pages must use HTML tags very consistently
- NLP-based, induction-based, and modeling-based tools
  - Semi-automated
  - The user has to provide examples
- Ontology-based tools
  - Manual
  - Require the construction of an ontology

20 Qualitative analysis – Support for Complex Objects
- Approaches based on languages
  - Through coding
- HTML-aware tools
  - W4F uses HEL coding
- NLP-based, induction-based, and modeling-based tools
  - SoftMealy allows the representation of structural variations
  - SoftMealy does not deal with nested structures
  - STALKER, NoDoSE, and DEByE represent hierarchical structure and structural variations

21 Qualitative analysis – Page Contents
- Two kinds of pages
  - Semi-structured data
  - Semi-structured text

22 Qualitative analysis – Ease of Use
- HTML-aware tools, NLP-based tools, wrapper induction tools, and modeling-based tools usually present a GUI
- In the BYU tool, the ontology creation process must also be done manually by the user

23 Qualitative analysis – XML Output
- In Minerva, the user has to explicitly write code to generate output in XML
- In W4F, “mapping wizard”
- XWRAP and DEByE natively provide XML output
- NoDoSE supports a variety of formats
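
Writing the XML output by hand, as the Minerva bullet describes, amounts to serializing the extracted objects. The sketch below shows one way to do that with Python's standard `xml.etree.ElementTree`; it is not code from any of the surveyed tools, and the function `to_xml`, the tag names, and the sample object are hypothetical.

```python
import xml.etree.ElementTree as ET


def to_xml(objects, root_tag="books", item_tag="book"):
    """Serialize extracted objects (flat dictionaries) as an XML string."""
    root = ET.Element(root_tag)
    for obj in objects:
        item = ET.SubElement(root, item_tag)
        for field, value in obj.items():
            # Assumption: field names are valid XML element names.
            ET.SubElement(item, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml([{"title": "Database Systems", "price": 45.0}])
```

Tools that provide XML output natively spare the user this step.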

24 Qualitative analysis – Support for Non-HTML Sources
- NLP-based tools and the BYU tools are especially suitable for non-HTML sources
- Wrapper induction tools and modeling-based tools do not rely uniquely on HTML tags

25 Qualitative analysis – Resilience and Adaptiveness
- The structural and presentation features of Web pages are prone to frequent changes
- Resilience: the capacity to continue working properly when changes occur in the pages
- Adaptiveness: the property of working properly with pages from another source in the same application domain

26 Conclusion

27 Conclusion

