To structure or not to structure, is that the question? Paolo Atzeni Based on work done with (or by!) G. Mecca, P. Merialdo, P. Papotti, and many others. Barcelona, 18 January 2011

Content A personal (and questionable …) perspective on the role structure (and its discovery) has in the Web, especially as a replacement for semantics Two observations –Especially important if we want automatic processing –Something similar happened for databases many years ago and a disclaimer: –Personal means that I will refer to projects in my group (with some contradiction, as I have been personally involved only in some of them…) –Based on a talk given at the USETIM (Using Search Engine Technology for Information Management) workshop in September

Content Questions on the Web Structures in the Web, transformations: Araneus Information extraction, wrappers: Minerva, RoadRunner Crawling, extraction and integration: Flint

Web and Databases (1) The Web is browsed and searched –Usually one page or one item at a time (may be a ranked list) Databases are (mainly) queried –We are often interested in tables as results

What do we do with the Web? Everything! We often search for specific items, and search engines are OK In other cases we need more complex things: –What could be the reason why Frederic Joliot-Curie (a French physicist and Nobel laureate, son-in-law of P. and M. Curie) is popular in Bulgaria (for example, the International House of Scientists in Varna is named after him)? –Find a list of qualified database researchers who have never been in the VLDB PC and deserve to belong to the next one I will not provide solutions to these problems, but give hints –Semantics –Structure

Questions on the Web: classification criteria (MOSES European project, FP5) Format of the response Involved sites: number (one or more) and relationship (of the same organization or not) Structure of the source information Extraction process

Format of the response A simple piece of data A web page (or fragment of it) A tuple of data A "concept" A set or list of one of the above Natural language (i.e., an explanation)

Involved sites One or more Of the same organization or not Sometimes also sites and local databases

Structure of the source information Information directly provided by the site(s) as a whole Information available in the sites and correlated but not aggregated Information available in the site in an unrelated way

Extraction process Explicit service: this is the case, for example, if the site offers an address book or a search facility for finding professors One-way navigation: here the user has to navigate, but without the need for going back and forth (no "trial and error") Navigation with backtracking

Content Questions on the Web Structures in the Web, transformations: Araneus Information extraction, wrappers: Minerva, RoadRunner Crawling, extraction and integration: Flint

Web and Databases (2) The answer to some questions requires structure (we need to "compile" tables) –databases are structured and organized –how much structure and organization is there in the Web?

Workshop on Semistructured Data at Sigmod (screenshots)

Transformations bottom-up: accessing information from Web sources top-down: designing and maintaining Web sites global: integrating existing sites and offering the information through new ones

The Araneus Project At Università Roma Tre & Università della Basilicata Goals: extend database-like techniques to Web data management Several prototype systems

Example Applications The Integrated Web Museum: a site integrating data coming from the Uffizi, Louvre and Capodimonte Web sites The ACM Sigmod Record XML Version: an XML+XSL version of an existing (and currently running) Web site

The Integrated Web Museum A new site with data coming from the Web sites of various museums –Uffizi, Louvre and Capodimonte

The Integrated Web Museum (architecture figure: the museum Web sites feed a DB)

Integration of Web Sites: The Integrated Web Museum Data are re-organized: –Uffizi, paintings organized by rooms –Louvre, Capodimonte, works organized by collections –Integrated Museum, organized by author

Web Site Re-Engineering: ACM Sigmod Record in XML The ACM Sigmod Record Site (acm.org/sigmod/record) –an existing site in HTML The ACM Sigmod Record Site, XML Edition –a new site in XML with XSL Stylesheets

The Araneus Approach (architecture figure): a Wrapper Layer (the Araneus Wrapper Toolkit, Minerva) wraps the sites to extract information; a Mediator Layer (the Araneus Query System, Ulixes, using SQL and Java) handles the identification of sites of interest and the navigation of sites and extraction of data; an Integration Layer performs the integration of data; a Site Generation Layer (the Araneus Web Site Development System, Homer) takes care of the generation of new sites.

Building Applications in Araneus Phase A: Reverse Engineering Existing Sites Phase B: Data Integration Phase C: Developing New Integrated Sites

Building Applications in Araneus Phase A: Reverse Engineering First Step: Deriving the logical structure of data in the site (an ADM scheme) Second Step: Wrapping pages in order to map physical HTML sources to database objects Third Step: Extracting Data from the Site by Queries and Navigation

Content Questions on the Web Structures in the Web, transformations: Araneus Information extraction, wrappers: Minerva, RoadRunner Crawling, extraction and integration: Flint

Information Extraction Task Information extraction task –source format: plain text with HTML markup (no semantics) –target format: database table or XML file (adding structure, i.e., "semantics") –extraction step: parse the HTML and return data items in the target format "Wrapper" –piece of software designed to perform the extraction step

Notion of Wrapper (figure): an HTML page listing books goes through a Wrapper and becomes database table(s) (or XML docs) with attributes BookTitle, Author, Editor, e.g. (Data on the Web; S. Abiteboul, P. Buneman, D. Suciu; …), (Database Systems; R. Elmasri, S. Navathe; …), (Computer Networks; A. Tannenbaum; …), (The HTML Sourcebook; J. Graham; …). Intuition: use extraction rules based on HTML markup

Intuition Behind Extraction (figure): an HTML page titled "Italian Football Teams" listing "Atalanta - Bergamo", "Inter - Milano", "Juventus - Torino", "Milan - Milano" is turned by the wrapper into a table with attributes teamName and town: (Atalanta, Bergamo), (Inter, Milano), (Juventus, Torino), (Milan, Milano), … Wrapper procedure: scan the document for the list header; while there are more occurrences, scan to the next item and extract teamName between its opening and closing tags, then scan ahead and extract town between its tags, and output [teamName, town]
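To make the markup-based rule above concrete, here is a minimal Python sketch; the HTML fragment and the li/b tags are assumptions for illustration, not the markup on the original slide.

```python
import re

# Hypothetical HTML fragment in the shape the slide describes:
# one list item per team, with the team name in bold.
html = """
<h2>Italian Football Teams</h2>
<ul>
  <li><b>Atalanta</b> - Bergamo</li>
  <li><b>Inter</b> - Milano</li>
  <li><b>Juventus</b> - Torino</li>
  <li><b>Milan</b> - Milano</li>
</ul>
"""

# Extraction rule anchored on the markup, following the slide's procedure:
# extract teamName between <b> and </b>, then town up to the end of the item.
row_pattern = re.compile(r"<li><b>(?P<teamName>[^<]+)</b>\s*-\s*(?P<town>[^<]+)</li>")

rows = [(m.group("teamName"), m.group("town").strip()) for m in row_pattern.finditer(html)]
print(rows)  # [('Atalanta', 'Bergamo'), ('Inter', 'Milano'), ('Juventus', 'Torino'), ('Milan', 'Milano')]
```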

The Araneus Wrapper Toolkit Procedural languages: advantages are that they are flexible in presence of irregularities and exceptions and allow document restructurings; drawbacks are that they are tedious even with strongly structured documents and difficult to maintain. Grammars: advantages are that they are concise and declarative, and simple in the case of strong regularity; drawbacks are that they are not flexible in presence of irregularities and allow only limited restructurings.

The Araneus Wrapper Toolkit: Minerva + Editor Minerva is a wrapper generator. It couples a declarative, grammar-based approach with the flexibility of procedural programming: –grammar-based rules –procedural exception handling
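Minerva's own rule language is not reproduced here; the following sketch only illustrates the combination the slide describes, a declarative, grammar-like rule for the regular cases plus a procedural exception handler for the irregular ones, using plain Python regular expressions as a stand-in.

```python
import re

# Declarative part: a grammar-like rule for the regular case "<b>Author</b> - Title".
RULE = re.compile(r"<b>(?P<author>[^<]+)</b>\s*-\s*(?P<title>[^<]+)")

def handle_exception(fragment):
    """Procedural fallback for irregular fragments the rule does not cover.
    Here we only handle the (assumed) case where author and title are
    separated by a comma instead of a dash; anything else is reported."""
    text = re.sub(r"<[^>]+>", "", fragment).strip()
    if "," in text:
        author, title = text.split(",", 1)
        return author.strip(), title.strip()
    raise ValueError(f"unparsable fragment: {fragment!r}")

def extract(fragment):
    m = RULE.search(fragment)
    if m:
        return m.group("author"), m.group("title")
    return handle_exception(fragment)   # the procedural exception handler takes over

print(extract("<b>S. Abiteboul</b> - Data on the Web"))
print(extract("<i>R. Elmasri, Database Systems</i>"))  # irregular: no <b>, comma separator
```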

Development of wrappers Effective tools: drag and drop, example-based, flexible Learning and automation (or semi-automation) Maintenance support: –Web sites change quite frequently –Minor changes may disrupt the extraction rules –maintaining a wrapper is very costly –this has motivated the study of (semi-)automatic wrapper generation techniques

Observation Can XML or the Semantic Web help much? Probably not –the Web is a giant "legacy system" and many sites will never move to the new technology (we said this ten years ago and it still seems correct …) –many organizations will not be willing to expose their data in XML

A related topic: Grammar Inference The problem –given a collection of sample strings, find the language they belong to The computational model: identification in the limit [Gold, 1967] –a learner is presented successively with a larger and larger corpus of examples –it simultaneously makes a sequence of guesses about the language to learn –if the sequence eventually converges, the inference is correct
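A toy rendering of the identification-in-the-limit protocol may help fix ideas; the learner below uses a deliberately trivial guessing strategy (the set of symbols seen so far), so it only shows the shape of the model, not a real grammar-inference algorithm.

```python
def learner_guess(examples):
    """Trivial learner: guess 'all strings over the symbols seen so far'.
    Stands in for a real grammar-inference procedure."""
    return frozenset(ch for s in examples for ch in s)

def identify_in_the_limit(stream):
    """Present a growing corpus to the learner; report when its guesses
    stop changing (convergence on this finite presentation)."""
    examples, previous = [], None
    for i, example in enumerate(stream, start=1):
        examples.append(example)
        guess = learner_guess(examples)
        converged = (guess == previous)
        print(f"after {i} example(s): guess={sorted(guess)} converged={converged}")
        previous = guess

identify_in_the_limit(["ab", "aab", "abb", "ba"])
```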

The RoadRunner Approach RoadRunner –[Crescenzi, Mecca, Merialdo 2001] a research project on information extraction from Web sites The goal –developing fully automatic, unsupervised wrapper induction techniques Contributions –wrapper induction based on grammar inference techniques (suitably adapted)

The Schema Finding Problem (figure): input HTML sources are pages from books.com, one per author (e.g. John Smith with "Database Primer", "First Edition, Paperback, …"; Paul Jones with "XML at Work", "First Edition, Paperback, …"); the inferred wrapper is a union-free regular expression over the markup, with #PCDATA leaves, optionals and repetitions; the target schema is nested, of the form SET(TUPLE(A: #PCDATA; B: SET(TUPLE(C: #PCDATA; D: #PCDATA; E: SET(TUPLE(F: #PCDATA; G: #PCDATA; H: #PCDATA)))))); the wrapper output is the corresponding nested dataset (authors, their books with title, description, edition, year and price)

The Schema Finding Problem Page class –collection of pages generated by the same script from a common dataset –these pages typically share a common structure Schema finding problem –given a set of sample HTML pages belonging to the same class, automatically recover the source dataset –i.e., generate a wrapper capable of extracting the source dataset from the HTML code

The RoadRunner Approach The target –large Web sites with HTML pages generated by programs (scripts) based on the content of an underlying data store –the data store is usually a relational database The process –data are extracted from the relational tables and possibly joined and nested –the resulting dataset is exported in HTML format by attaching tags to values

Contributions of RoadRunner An unsupervised learning algorithm for identifying prefix-markup languages in the limit –given a rich sample of HTML pages, the algorithm correctly identifies the language –the algorithm runs in polynomial time –from the language, we can infer the underlying schema –and then extract the source dataset
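The following is not the RoadRunner algorithm itself, but a heavily simplified sketch of the underlying matching idea under strong assumptions (two structurally identical sample pages, no optionals or iterators): align the pages token by token and generalize every mismatching text token to a #PCDATA field of the inferred template.

```python
import re

def tokenize(page):
    """Split an HTML page into tag tokens and text tokens."""
    return [t for t in re.split(r"(<[^>]+>)", page) if t.strip()]

def infer_template(page_a, page_b):
    """Align two sample pages; where text tokens differ, generalize to #PCDATA.
    (Real RoadRunner also handles optionals and iterators; this sketch does not.)"""
    tokens_a, tokens_b = tokenize(page_a), tokenize(page_b)
    assert len(tokens_a) == len(tokens_b), "sketch assumes structurally identical pages"
    return [a if a == b else "#PCDATA" for a, b in zip(tokens_a, tokens_b)]

page1 = "<html><b>John Smith</b><i>Database Primer</i></html>"
page2 = "<html><b>Paul Jones</b><i>XML at Work</i></html>"
print(infer_template(page1, page2))
# ['<html>', '<b>', '#PCDATA', '</b>', '<i>', '#PCDATA', '</i>', '</html>']
```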

Content Questions on the Web Structures in the Web, transformations: Araneus Information extraction, wrappers: Minerva, RoadRunner Crawling, extraction and integration: Flint

Data Extraction and Integration from Imprecise Web Sources Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Università degli Studi Roma Tre

Flint A novel approach for the automatic extraction and integration of web data Flint aims at exploiting an unexplored publishing pattern that frequently occurs in data-intensive web sites: –large amounts of data are usually offered by pages that encode one flat tuple for each page –for many disparate real world entities (e.g. stock quotes, people, travel packages, movies, books, etc.) there are hundreds of web sites that deliver their data following this publishing strategy –These collections of pages can be thought of as HTML encodings of a relation (e.g. a "stock quote" relation)

Example …

Flint goals (figure): a target StockQuote relation with attributes Last, Min, Max, Volume, 52high, Open

Flint goals (figure, continued): the same StockQuote relation (Last, Min, Max, Volume, 52high, Open), now shown together with a query Q and source-level queries q1, q2, …, qn

Extraction and Integration Flint builds on the observation that although structured information is spread across a myriad of sources, the web scale implies significant redundancy, which is apparent –at the intensional level: many sources share a core set of attributes –at the extensional level: many instances occur in a number of sources The extraction and integration algorithm takes advantage of the coupling of the wrapper inference and the data integration tasks
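A minimal sketch of how extensional redundancy can support integration: two hypothetical sources publish the same stock data under different attribute names, and attributes are matched by counting agreements on the instances the sources share. The sources, tickers, attribute names and the 0.8 threshold are all assumptions for illustration; Flint's actual algorithm couples this with wrapper inference.

```python
def match_attributes(source_a, source_b):
    """source_a, source_b: dicts mapping an instance key (e.g. a ticker)
    to a dict of attribute -> value. Returns pairs of attribute names
    whose values agree on most of the shared instances."""
    shared = source_a.keys() & source_b.keys()
    matches = []
    for attr_a in next(iter(source_a.values())):
        for attr_b in next(iter(source_b.values())):
            agree = sum(1 for k in shared
                        if source_a[k][attr_a] == source_b[k][attr_b])
            if shared and agree / len(shared) >= 0.8:   # assumed threshold
                matches.append((attr_a, attr_b))
    return matches

# Made-up data for two sources describing the same (hypothetical) stocks.
site1 = {"ACME":   {"last": 9.15,  "vol": 120_000},
         "GLOBEX": {"last": 101.3, "vol": 500_000}}
site2 = {"ACME":   {"price": 9.15,  "volume": 120_000},
         "GLOBEX": {"price": 101.3, "volume": 500_000}}
print(match_attributes(site1, site2))  # [('last', 'price'), ('vol', 'volume')]
```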

Flint main components a crawling algorithm that gathers collections of pages containing data of interest a wrapper inference and data integration algorithm that automatically extracts and integrates data from pages collected by the crawler a framework to evaluate (in probabilistic terms) the quality of the extracted data and the reliability/trustworthiness of the sources

Crawling Given as input a small set of sample pages from distinct web sites, the system automatically discovers pages containing data about other instances of the conceptual entity exemplified by the input samples

The crawler: approach Given a bunch of sample pages: crawl the web sites of the sample pages to gather other pages offering the same type of information; extract a set of keywords that describe the underlying entity; then do –launch web searches to find other sources with pages that contain instances of the target entity –analyze the results to filter out irrelevant pages –crawl the new sources to gather new pages while new pages are found
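The do/while loop above can be sketched as follows; extract_keywords, web_search, looks_relevant and crawl_site are hypothetical placeholders for the corresponding Flint components, not real APIs.

```python
def discover_pages(sample_pages, extract_keywords, web_search,
                   looks_relevant, crawl_site, max_rounds=5):
    """Follow the loop described on the slide: describe the entity with
    keywords, search for new sources, filter, crawl them, and repeat while
    new pages keep being found. All callables are assumed to be supplied."""
    pages = set()
    for site_pages in (crawl_site(p) for p in sample_pages):
        pages |= set(site_pages)                  # same-site pages of the samples

    keywords = extract_keywords(pages)            # describe the underlying entity
    for _ in range(max_rounds):
        candidates = web_search(keywords)         # find other candidate sources
        relevant = {p for p in candidates if looks_relevant(p)}
        new_pages = set()
        for page in relevant:
            new_pages |= set(crawl_site(page)) - pages
        if not new_pages:                         # "while new pages are found"
            break
        pages |= new_pages
    return pages

# Tiny smoke test with stub components (replace with real ones):
pages = discover_pages(
    ["site-a/page1"],
    extract_keywords=lambda ps: ["stock", "quote"],
    web_search=lambda kw: ["site-b/page1"],
    looks_relevant=lambda p: True,
    crawl_site=lambda p: [p, p + "/other"],
)
print(sorted(pages))
```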

Data and Sources Reliability Redundancy among autonomous and heterogeneous sources implies inconsistencies and conflicts in the integrated data, because sources can provide different values for the same property of a given object A concrete example: on April 21st 2009, the open trade for the Sun Microsystems Inc. stock quote published by the CNN Money, Google Finance, and Yahoo! Finance web sites was 9.17, 9.15 and 9.15, respectively.
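As a toy illustration of how such conflicts can be resolved, the sketch below alternates between weighted voting on values and re-estimating each source's accuracy from its agreement with the current consensus; Flint's probabilistic framework is more refined, and apart from the slide's 9.17/9.15/9.15 example the data below is made up.

```python
from collections import Counter, defaultdict

# claims[object] = {source: reported value}; the first row mirrors the slide's
# open-price example, the second row is a hypothetical object added for illustration.
claims = {
    "sun_open_2009_04_21": {"cnn": 9.17, "google": 9.15, "yahoo": 9.15},
    "other_open_2009_04_21": {"cnn": 101.3, "google": 101.3, "yahoo": 101.3},
}

accuracy = defaultdict(lambda: 1.0)   # start by trusting every source equally

for _ in range(10):
    # 1) for each object, pick the value with the largest total source weight
    consensus = {}
    for obj, by_source in claims.items():
        votes = Counter()
        for source, value in by_source.items():
            votes[value] += accuracy[source]
        consensus[obj] = votes.most_common(1)[0][0]
    # 2) re-estimate each source's accuracy as its agreement rate with the consensus
    for source in {s for by_source in claims.values() for s in by_source}:
        agreements = [consensus[obj] == by_source[source]
                      for obj, by_source in claims.items() if source in by_source]
        accuracy[source] = sum(agreements) / len(agreements)

print(consensus)        # 9.15 wins for the contested open price
print(dict(accuracy))   # the source that disagreed ends up with lower estimated accuracy
```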

Flint System architecture (figure): Web Search, Data Extraction, and Data Integration components operating over the Web

Flint System architecture (figure, refined): Web Search, Extraction, Integration, and Probability components operating over the Web

Summary The exploitation of structure is essential for extracting information from the Web In the Web world, task automation is needed to get scalability Interaction of different activities (extraction, search, transformation, integration) is often important