Extracting Structured Data from Web Pages. Arvind Arasu, Hector Garcia-Molina. ACM SIGMOD 2003.


Outline
– Introduction
– Model, Problem Formulation
– Equivalence Classes
  – Observations and Properties
– Build Template and Extract Values
– Experiments
– Conclusion

Introduction
– Keyword: schema (data having a structure)
– Problem definition: automatically extract the schema encoded in a given collection of pages, without any human input
– Cue: pages that belong to the same site and encode data of the same schema encode that data in a consistent manner ⇒ the pages are produced from a common template by plugging in values

Figuration

Goal and Challenge
– Previous IE techniques rely on human-supplied heuristics, e.g. wrappers
  – Time-consuming and error-prone
  – Optional attributes are ignored
– Goal: deduce the template without human input
– Challenge:
  – There is no obvious way of telling which text is template and which is data
  – The schema of the data in the pages is not flat, but more complex and semi-structured

Model, Problem Formulation
– Structured Data
– Model of Page Creation
– Optionals and Disjunctions
– Problem Statement
– Miscellaneous Terminology, Definition

Structured Data
– Token: a basic unit of text
– Structured data: any set of data values conforming to a common schema or type
– Definition of “type”:
  – 1. Basic type (β): a string of tokens, e.g. <html>, text
  – 2. Ordered list type: tuple constructor of order n, e.g. ⟨T_1, T_2, …, T_n⟩, where T_1, T_2, …, T_n are types
  – 3. Set type: set constructor, e.g. {T}, where T is a type

Definition of value, and example
– Definition of “instance”:
  – 1. An instance of the basic type β is any string of tokens
  – 2. An instance of type ⟨T_1, T_2, …, T_n⟩ is a tuple of the form ⟨i_1, i_2, …, i_n⟩, where the attributes i_1, i_2, …, i_n are instances of types T_1, T_2, …, T_n respectively
  – 3. An instance of type {T} is any set of elements {e_1, e_2, …, e_m} such that each e_i is an instance of type T
– Terminology: an instance is also called a value; a string means a string of tokens
– Example:
  – Schema S_1 =
  – Value =
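The three type rules and the matching instance rules can be sketched in Python. This is only an illustration: the class names `Basic`, `TupleType`, `SetType` and the function `is_instance` are invented here, a Python list stands in for a set of elements, and the `book` schema and its value are hypothetical, loosely modeled on the book-review running example rather than copied from the slides.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Basic:
    """Basic type β: an instance is any string of tokens."""


@dataclass(frozen=True)
class TupleType:
    """Tuple constructor of order n: <T_1, ..., T_n>."""
    parts: Tuple["Type", ...]


@dataclass(frozen=True)
class SetType:
    """Set constructor {T}."""
    elem: "Type"


Type = Union[Basic, TupleType, SetType]


def is_instance(value, typ) -> bool:
    """Check `value` against `typ` using the three instance rules."""
    if isinstance(typ, Basic):
        return isinstance(value, str)
    if isinstance(typ, TupleType):
        return (isinstance(value, tuple)
                and len(value) == len(typ.parts)
                and all(is_instance(v, t) for v, t in zip(value, typ.parts)))
    if isinstance(typ, SetType):
        # A list stands in for the set so that elements may be tuples or lists.
        return isinstance(value, list) and all(is_instance(e, typ.elem) for e in value)
    return False


# Hypothetical schema: a book name plus a set of (reviewer, rating, text) reviews.
book = TupleType((Basic(), SetType(TupleType((Basic(), Basic(), Basic())))))
value = ("Data Mining", [("Jeff", "2", "Good intro"), ("Jane", "6", "Thorough")])
ok = is_instance(value, book)
bad = is_instance(("Data Mining",), book)
```

Here `ok` is true because the value follows the nesting of the schema, and `bad` is false because the tuple has the wrong order.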

Schemas and Values as Trees

Model of Page Creation
– Definition: a template T for a schema S (written T_S) is a function that maps each type constructor τ of S to an ordered set of strings T(τ), such that:
  – if τ is a tuple constructor of order n, T(τ) is an ordered set of n+1 strings
  – if τ is a set constructor, T(τ) is a single string S_τ

Example
– A template T for schema S_1 is given by the mapping:
  – T(τ_1) =
  – T(τ_2) = H
  – T(τ_3) =

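Encoding λ(T, x) of a value x of schema S
– 1. If x is of basic type β, then λ(T, x) = x
– 2. If x = ⟨x_1, x_2, …, x_n⟩ is a tuple with type constructor τ_t, and T(τ_t) = ⟨C_1, …, C_{n+1}⟩, then λ(T, x) = C_1 λ(T, x_1) C_2 … λ(T, x_n) C_{n+1}
– 3. If x = {e_1, e_2, …, e_m} is a set with type constructor τ_s, and T(τ_s) = S, then λ(T, x) = λ(T, e_1) S λ(T, e_2) S … S λ(T, e_m)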
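The recursive encoding can be sketched as a short Python function. This is a sketch under assumptions: the nested-tuple representation of a template-annotated schema (`("basic",)`, `("tuple", strings, parts)`, `("set", separator, elem)`) is invented for the illustration, strings are split on whitespace into tokens, and the template strings in the example are hypothetical, not the ones from the slides.

```python
def encode(typ, value):
    """Return λ(T, x) as a list of page tokens.

    typ is one of (representation assumed for this sketch):
      ("basic",)
      ("tuple", [C_1, ..., C_{n+1}], [T_1, ..., T_n])
      ("set", S, T)
    """
    kind = typ[0]
    if kind == "basic":
        # Rule 1: a basic value encodes as itself.
        return value.split()
    if kind == "tuple":
        # Rule 2: C_1 λ(x_1) C_2 ... λ(x_n) C_{n+1}
        _, strings, parts = typ
        assert len(strings) == len(parts) + 1 and len(value) == len(parts)
        out = strings[0].split()
        for part, v, following in zip(parts, value, strings[1:]):
            out += encode(part, v) + following.split()
        return out
    if kind == "set":
        # Rule 3: λ(e_1) S λ(e_2) S ... S λ(e_m)
        _, sep, elem = typ
        out = []
        for i, e in enumerate(value):
            if i > 0:
                out += sep.split()
            out += encode(elem, e)
        return out
    raise ValueError(f"unknown type kind: {kind}")


B = ("basic",)
# Hypothetical template strings for a book page with a list of reviews.
review = ("tuple", ["<li> Reviewer", "Rating", "Text", "</li>"], [B, B, B])
book = ("tuple", ["<h1> Book", "</h1> <ol>", "</ol>"], [B, ("set", "", review)])

page = encode(book, ("Data Mining", [("Jeff", "2", "Good"), ("Jane", "6", "Fine")]))
page_text = " ".join(page)
```

Keeping the page as a list of tokens, rather than one string, matches the way the later stages count token occurrences per page.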

Example of Schema S 1

Optionals and Disjunctions
– Optional: if T is a type, the optional type (T)? ≡ {T}_τ, with the constraint |τ| = 0 or 1
– Disjunction: if T_1 and T_2 are types, the disjunction type (T_1 | T_2) ≡ ⟨{T_1}_τ1, {T_2}_τ2⟩_τ, with the constraint |τ_1| + |τ_2| = 1
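A minimal sketch of how these two reductions behave, assuming a Python list stands in for a set and `None` marks an absent value; the helper names `optional` and `disjunction` are invented for the illustration.

```python
def optional(value=None):
    """(T)? as a set {T} that holds 0 or 1 elements."""
    return [] if value is None else [value]


def disjunction(left=None, right=None):
    """(T_1 | T_2) as the tuple <{T_1}, {T_2}> holding exactly one element in total."""
    if (left is None) == (right is None):
        raise ValueError("exactly one alternative must be present")
    return (optional(left), optional(right))


absent = optional()
present = optional("hardcover")
first = disjunction(left="ISBN 123")
second = disjunction(right="out of print")
```

Expressing both constructs through sets and tuples means the encoding rules need no extra cases for them.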

Problem Statement
– EXTRACT problem: given a set of n pages p_i = λ(T, x_i) (1 ≤ i ≤ n), created from some unknown template T and values {x_1, …, x_n}, deduce the template T and the values from the set of pages alone

Example of correct solution of EXTRACT

Example of correct solution of EXTRACT (cont.)
– T(τ_e1) = Reviewer Name, Rating, Text,
– T(τ_e2) =
– T(τ_e3) = Book Name, Reviewers,

Miscellaneous Terminology, Definition
– A token is a word or an HTML tag
– An occurrence of a token in a page (resp. value, template) is called a page-token (resp. value-token, template-token)
– Each page-token is created from either a template-token or a value-token
– Two page-tokens in P_e have the same role iff they were generated by the same template-token

Overview Approach - EXALG (ECGM) Stage 1 Stage 2

Equivalence Classes
– Pages P = {p_1, …, p_n}, with p_i = λ(T_S, x_i); the type constructors of T_S are {τ_1, …, τ_k}
– Definition (Occurrence Vector): the occurrence vector of a token t is the vector ⟨f_1, …, f_n⟩, where f_i is the number of occurrences of t in p_i
– Definition (Equivalence Class): an equivalence class is a maximal set of tokens that all have the same occurrence vector
  – Ex. ε_1: {,, Book, Reviews,,,, }
  – Ex. ε_2: { Data, Mining, Jeff, 2, Jane, 6 }
  – Ex. ε_3: {, Reviewer, Rating, Text, }
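Computing occurrence vectors and grouping tokens by them takes only a few lines of Python. This sketch assumes each page is already a list of tokens; the two pages are hypothetical and much smaller than real input, and the function name `equivalence_classes` is invented here.

```python
from collections import Counter, defaultdict


def equivalence_classes(pages):
    """Group tokens by occurrence vector <f_1, ..., f_n>.

    pages: list of pages, each a list of tokens.
    Returns a dict mapping each occurrence vector to its set of tokens.
    """
    counts = [Counter(page) for page in pages]
    classes = defaultdict(set)
    for token in set().union(*pages):
        vector = tuple(c[token] for c in counts)
        classes[vector].add(token)
    return dict(classes)


# Two hypothetical pages produced from the same template.
p1 = "<h1> Book Data Mining </h1> <li> Reviewer Jeff </li> <li> Reviewer Jane </li>".split()
p2 = "<h1> Book Databases </h1> <li> Reviewer John </li>".split()
classes = equivalence_classes([p1, p2])
```

In this toy input the page-level template tokens share the vector (1, 1), while the per-review template tokens share (2, 1), which is the separation the equivalence classes are meant to expose.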

Equivalence Classes: Observations
– Observation 1: tokens associated with the same type constructor τ_j in T that have unique roles occur in the same equivalence class (used to decide whether an EQ class is valid)
– Observation 2: for real pages, an equivalence class of large size and support is usually valid
– Definitions:
  – Support of a token: the number of pages that contain it
  – Size of an EQ class: the number of tokens in it

Properties of EQ Classes
– Definition (Ordered Equivalence Class): an EQ class is ordered if its tokens can be ordered as t_1, …, t_m such that, for every page p_i and every pair t_j, t_k (1 ≤ j < k ≤ m):
  – if t_j occurs at least l times in p_i, the l-th occurrence of t_j in p_i is before the l-th occurrence of t_k in p_i, and
  – if t_j occurs at least (l+1) times in p_i, the (l+1)-th occurrence of t_j in p_i is after the l-th occurrence of t_k in p_i
– Definition (Nesting of EQ Classes): a pair of EQ classes ε_i and ε_j is nested if
  – the span of any occurrence of ε_i does not overlap with the span of any occurrence of ε_j, or
  – the span of all occurrences of ε_i is within Pos(p) of some occurrence of ε_j for some fixed p, or vice versa
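The ordered property can be checked with a simple projection. Because all tokens of an equivalence class occur the same number of times in a page, the two conditions hold exactly when the page, restricted to the tokens of the class, reads t_1 … t_m repeated some number of times. The sketch below relies on that observation; the function names and sample pages are invented for the illustration.

```python
def is_ordered(order, pages):
    """Check the ordered property for the candidate ordering t_1..t_m.

    order: list of tokens in the candidate order.
    pages: list of pages, each a list of tokens.
    """
    members = set(order)
    for page in pages:
        projected = [t for t in page if t in members]
        reps, rem = divmod(len(projected), len(order))
        if rem or projected != list(order) * reps:
            return False
    return True


def find_order(members, pages):
    """Return a valid ordering of `members`, or None if the class is not ordered."""
    for page in pages:
        if all(t in page for t in members):
            # The only possible ordering is the order of first occurrences.
            order = sorted(members, key=page.index)
            return order if is_ordered(order, pages) else None
    return None


p1 = "<li> Reviewer Jeff </li> <li> Reviewer Jane </li>".split()
p2 = "<li> Reviewer John </li>".split()
order = find_order({"<li>", "Reviewer", "</li>"}, [p1, p2])
unordered = find_order({"a", "b"}, ["a a b b".split()])
```

The second call returns `None` because the second occurrence of `a` comes before the first occurrence of `b`, which violates the second condition.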

EQ Classes: Observations (Cont.)
– Observation 3: a valid equivalence class is ordered, and any pair of valid equivalence classes is nested
– Handling invalid equivalence classes:
  – Detect invalid LFEQs (large and frequently occurring equivalence classes) through violations of the ordered and nesting properties
  – If any are found, discard some LFEQs and break others into smaller LFEQs

Differentiating roles of tokens
– By path: tokens with different roles lie on different paths of the HTML parse tree
– By position: tokens with different roles are located at different (non-empty) positions
– Observation 4: in practice, two page-tokens with different occurrence paths have different roles
– Observation 5: for a valid EQ class ε and a token t, the role of an occurrence of t that is within Pos(l) of some occurrence of ε is different from the role of an occurrence of t that is within Pos(m) (m ≠ l) of some occurrence of ε

DIFFFORM (step 1) and DIFFEQ (step 4)
– These modules are used to add more tokens to LFEQs by “differentiating” roles
  – Ex. Name has multiple roles: one occurrence is in Book Name and the other is in Reviewer Name
– Differentiating the multiple roles:
  – The occurrences lie on different paths from the root in the HTML parse tree (DIFFFORM)
  – The occurrences lie at different positions with respect to LFEQ ε_e1 (DIFFEQ)
– dtoken (differentiated token):
  – Ex. Name_5 and Name_14 are regarded as different tokens Name_A and Name_B
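Differentiation by path can be sketched with Python's standard `html.parser` module. This is not the EXALG implementation: the class name `PathTokenizer` is invented, only words are turned into dtokens (tags are used solely to track the path), void tags such as `<br>` are not handled, and the sample page is hypothetical.

```python
from html.parser import HTMLParser


class PathTokenizer(HTMLParser):
    """Pair every word with the path of open tags from the root to it."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, root first
        self.dtokens = []   # (word, path) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray end tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        path = "/".join(self.stack)
        for word in data.split():
            self.dtokens.append((word, path))


html = ("<html><body><h1>Book Name Databases</h1>"
        "<ul><li>Reviewer Name John</li></ul></body></html>")
tokenizer = PathTokenizer()
tokenizer.feed(html)
tokenizer.close()
names = [d for d in tokenizer.dtokens if d[0] == "Name"]
```

The two occurrences of `Name` end up as distinct dtokens because one sits under `html/body/h1` and the other under `html/body/ul/li`.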

Stage 1: ECGM
– 1. Find dtokens from paths in the HTML parse tree (DIFFFORM)
– 2. Find LFEQs
– 3. Detect and remove invalid LFEQs (using violations of ordering and nesting)
– 4. Find dtokens from positions in valid LFEQs (DIFFEQ)

Running Example
– ECGM output: a set of LFEQs of dtokens, and the pages represented as strings of dtokens
– Two parameters are used to select LFEQs: SIZETHRES = 3, SUPTHRES = 3
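The two thresholds act as a filter over equivalence classes. The sketch below assumes classes are given as a mapping from occurrence vector to token set, takes the support of a class to be the number of pages in which its tokens occur, and treats both thresholds as inclusive lower bounds; the slides do not say whether the comparison is strict, so that choice is an assumption. The function name and the sample classes are invented.

```python
SIZE_THRES = 3
SUP_THRES = 3


def find_lfeqs(classes, size_thres=SIZE_THRES, sup_thres=SUP_THRES):
    """Keep the equivalence classes that are large and frequently occurring.

    classes: dict mapping an occurrence vector (one count per page)
             to the set of tokens sharing that vector.
    """
    lfeqs = {}
    for vector, tokens in classes.items():
        support = sum(1 for f in vector if f > 0)  # pages containing the tokens
        if len(tokens) >= size_thres and support >= sup_thres:  # assumed inclusive
            lfeqs[vector] = tokens
    return lfeqs


# Hypothetical classes computed over three pages.
classes = {
    (1, 1, 1): {"<h1>", "Book", "</h1>"},
    (2, 1, 3): {"<li>", "Reviewer", "</li>"},
    (1, 0, 0): {"Data", "Mining", "Jeff", "Jane"},  # large, but support 1
    (0, 1, 1): {"Query"},                           # too small, support 2
}
lfeqs = find_lfeqs(classes)
```

Classes made of data values tend to fail the support test, since a given value rarely appears on many pages.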

Iteration 1: DiffFORM, FindEQ (use path)
– = {,, Book, Name, Reviews,,,, }
– = {, }
– = {, Reviewer, Name, Rating, Text, }
– Not LFEQ:
  – = {Database}
  – = {Data, Mining, Jeff, Jane}
  – = {Query, Opt.}
  – = {Transactions}
  – = {John}

Iteration 1: DiffEQ (use position)
– = {,, Book, Name, Reviews,,,, }
  – : at pos 2 or pos 4
  – : at pos 4 or pos 5
  – ε_e1: = { Book Name, Reviews, } 8 → 13
– = {, Reviewer, Name, Rating, Text, }
  – : at pos 1 or pos 3 or pos 4
  – : at pos 3 or pos 4 or pos 5
  – ε_e3: = { Reviewer Name, Rating, Text, } 6 → 12

Stage 2: Construct Schema from ECGM
– The input to this module is {ε_1, ε_2, …, ε_m}
– ANALYSIS consists of 2 modules: CONSTTEMP and EXVAL
– CONSTTEMP, with ε_i = {d_1, d_2, …, d_l}:
  – Starts from the basic ε_1 = {,, …,, }
  – Recursively constructs a template T_εi corresponding to ε_i, and a template T_εi,p corresponding to each non-empty position p of ε_i
  – Checks whether the set of strings PosString(ε_i, p) corresponding to position p has some recognizable pattern

Construct Schema S’ from ε_e1
– ε_e1: {,,, Book, Name,,, Reviews,,,,, }
– → T(τ_e1) =

Cont.
– PosString(ε_e1+, 6) is a string of dtokens for every occurrence of ε_e1+, which matches Pattern 5 of the table
  – → T(T_e1,1) = β
– PosString(ε_e1+, 10) is always a string of 0 or more occurrences of ε_e3+, which matches Pattern 1
  – → T(T_e1,2) = {τ_e3}
  – → T(τ_e3) = Reviewer Name Rating Text

(Cont.)
– The three non-empty positions are all of basic type β
  – → T(T_e3,1) = β
  – → T(T_e3,2) = β
  – → T(T_e3,3) = β
– S = ⟨β, {⟨β, β, β⟩_τe3}_τe2⟩_τe1

Example of correct solution of EXTRACT

Evaluation
– Data sets:
– Leaf attribute A_m in schema S_m
  – Correct: the set of values of A_m in the page is equal to the set of extracted values of A_e in the page
  – Partially correct: the set of values of A_m in the page is not equal to the set of extracted values of A_e in the page, but the values of A_m appear as part of the values of A_e
  – Incorrect: neither correct nor partially correct
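The three-way classification can be sketched as a small function. The reading of “partially correct” used here, namely that every reference value appears as a substring of some extracted value, is an interpretation of the slide rather than a definition taken from it; the function name and sample values are invented.

```python
def classify(reference, extracted):
    """Classify the extraction of one attribute on one page.

    reference: values of A_m on the page.
    extracted: values of A_e produced for the same page.
    """
    reference, extracted = set(reference), set(extracted)
    if reference == extracted:
        return "correct"
    # Assumed reading: each reference value is contained in an extracted value.
    if reference and all(any(m in e for e in extracted) for m in reference):
        return "partially correct"
    return "incorrect"


exact = classify({"Jeff", "Jane"}, {"Jane", "Jeff"})
partial = classify({"Jeff"}, {"Reviewer Jeff"})
wrong = classify({"Jeff"}, {"Jane"})
```

A partially correct attribute typically arises when a template word is absorbed into the extracted value, as with `Reviewer Jeff` above.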

Assumptions
– (A1) A large number of tokens occurring in the template have unique roles
– (A2) The EQ class derived from a type constructor is recognized as an LFEQ
– (A3) Irregularity in encoded data that leads to invalid EQ classes
– (A4) There are separators around data values: in this model, the strings associated with type constructors are non-empty

Result
– For 18 input collections (40%), the system correctly extracted all the attributes
– Around 80% of the attributes were extracted correctly
– Normalized average; input size ≤ 10; parameter = 3

Conclusion
– EXALG uses 2 novel concepts to discover the template:
  – equivalence classes
  – differentiating roles
– The impact of a failed assumption is limited to a few attributes
– Future work:
  – Develop techniques for crawling, indexing, and providing querying support for the structured pages on the web
  – Develop techniques for automatically annotating the extracted data, possibly using the words that appear in the template