Aki Hecht Seminar in Databases (236826) January 2009

Web Data Extraction Aki Hecht Seminar in Databases (236826) January 2009

Agenda Introduction Building Tag Trees Mining Data Regions Partial Tree Alignment Extraction Given Multiple Pages

Introduction An enormous amount of data is stored in open databases. Most of these databases return web pages containing structured data objects, usually as “Deep Web” pages, which are non-trivial to crawl. The data is important and useful for many applications, such as price comparison engines and collecting information about individuals.

The goal Given an HTML page containing multiple data records, insert the data into a table. No assumptions are allowed about the number of data records in the page or about their structure or content. The extraction should be done automatically: human intervention can help in getting more accurate results, but the cost is too high.

Example 1

Example 2 More than one data region!

General idea Given a Web page: (1) Build the HTML tag tree. (2) Mine data regions (mining data records directly is hard). (3) Identify data records from each data region. (4) Learn the structure of a general data record (a data record can contain optional fields). (5) Extract the data.

Building a tag tree Most HTML tags work in pairs. Within each corresponding tag-pair, there can be other pairs of tags, resulting in a nested structure. Some tags do not require closing tags (e.g., <li> and <p>) although closing tags exist for them, and some, such as <hr>, have no closing tag at all. Additional closing tags need to be inserted to ensure all tags are balanced. Building a tag tree from a page using its HTML code is thus natural.
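The balancing step described above can be sketched with Python's standard `html.parser`. This is a minimal illustration, not a full HTML parser: the `VOID_TAGS` and `IMPLICIT_CLOSE` sets, the `Node` class and `build_tag_tree` are assumptions made for the example, and text content is ignored.

```python
from html.parser import HTMLParser

# Assumption: a small illustrative subset of HTML rules, not the full specification.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # never have children
IMPLICIT_CLOSE = {"li", "p", "tr", "td", "th", "option"}  # end tag may be omitted


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def to_tuple(self):
        """Return the subtree as nested (tag, [children]) tuples."""
        return (self.tag, [child.to_tuple() for child in self.children])


class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Insert the missing closing tag: <li>a<li>b gives two sibling <li> nodes.
        if tag in IMPLICIT_CLOSE and self.current.tag == tag:
            self.current = self.current.parent
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close every open element up to the nearest matching one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # stray end tags are ignored
            self.current = node.parent


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


tree = build_tag_tree("<ul><li>a<li>b<hr></ul><p>x<p>y")
print(tree.to_tuple())
```

The two `<li>` elements end up as siblings under `<ul>`, and the two `<p>` elements as siblings under the root, even though none of them was closed in the input.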

An example

The tag tree

Building trees using visual cues The HTML code can contain errors. Browsers are sophisticated enough to display pages with HTML errors. We can build the tag tree using the browser’s mechanism. Each HTML element is rendered as a rectangle. Containment of rectangles represents nesting.
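A minimal sketch of the containment idea, assuming the rendered rectangles are already available as `(left, top, right, bottom)` tuples. How they are obtained from a browser is outside this sketch, and the names `contains`, `area` and `containment_tree` are hypothetical.

```python
def contains(outer, inner):
    """True if rectangle `outer` fully covers rectangle `inner`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def area(rect):
    return (rect[2] - rect[0]) * (rect[3] - rect[1])


def containment_tree(boxes):
    """Map each element name to its parent: the smallest rectangle that contains it.

    `boxes` maps an element name to (left, top, right, bottom).
    Top-level elements get None as their parent.
    """
    order = sorted(boxes, key=lambda name: area(boxes[name]), reverse=True)
    parent = {}
    for index, name in enumerate(order):
        # Only larger (earlier) rectangles can be ancestors.
        candidates = [p for p in order[:index] if contains(boxes[p], boxes[name])]
        parent[name] = min(candidates, key=lambda p: area(boxes[p])) if candidates else None
    return parent


boxes = {
    "table": (0, 0, 100, 100),
    "tr1": (0, 0, 100, 50),
    "tr2": (0, 50, 100, 100),
    "td": (0, 0, 50, 50),
}
print(containment_tree(boxes))
```

This sketch recovers only the parent relation; sibling order would additionally need the rectangles' positions.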

An example

Tree Edit Distance Tree edit distance between two trees A and B is the cost associated with the minimum set of operations needed to transform A into B. The set of operations used to define tree edit distance includes three operations: node removal, node insertion, and node replacement. A cost is assigned to each of the operations.

Finding Tree Edit Distance Tree edit distance is very similar to string edit distance and can be found in a similar way, by finding the minimal cost mapping between the two trees.

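Finding Tree Edit Distance cont. The algorithm for finding the minimal cost mapping follows the same idea for both trees and strings: it is based on dynamic programming, although for trees the mapping must also preserve the ancestor relationships and sibling order of the nodes.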
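The dynamic-programming idea can be sketched as follows. `edit_distance` is the classic string (or tag sequence) edit distance with configurable operation costs. `simple_tree_matching` is a restricted tree matching, shown here as an assumption for illustration: it counts the maximum number of matched node pairs between two trees without allowing replacements, rather than computing a full tree edit distance. Trees are assumed to be nested `(tag, [children])` tuples.

```python
def edit_distance(a, b, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum total cost of insertions, deletions and substitutions turning a into b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * del_cost
    for j in range(1, n + 1):
        d[0][j] = j * ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            same = a[i - 1] == b[j - 1]
            d[i][j] = min(
                d[i - 1][j] + del_cost,                       # remove a[i-1]
                d[i][j - 1] + ins_cost,                       # insert b[j-1]
                d[i - 1][j - 1] + (0 if same else sub_cost),  # keep or replace
            )
    return d[m][n]


def simple_tree_matching(a, b):
    """Size of the largest matching between trees a and b (restricted variant)."""
    tag_a, kids_a = a
    tag_b, kids_b = b
    if tag_a != tag_b:
        return 0  # different roots: nothing below them can be matched
    m, n = len(kids_a), len(kids_b)
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1] + simple_tree_matching(kids_a[i - 1], kids_b[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


print(edit_distance(["tr", "td", "td"], ["tr", "td"]))
t1 = ("p", [("a", []), ("b", []), ("e", [])])
t2 = ("p", [("b", []), ("c", []), ("d", []), ("e", [])])
print(simple_tree_matching(t1, t2))
```

Both functions fill a table indexed by prefixes of the two inputs, which is why the string and tree cases look so alike.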

Mining Data Regions Definition: A generalized node of length r consists of r (r ≥ 1) nodes in the tag tree with the following two properties: (1) the nodes all have the same parent; (2) the nodes are adjacent. Definition: A data region is a collection of two or more generalized nodes with the following properties: (1) the generalized nodes all have the same parent; (2) the generalized nodes all have the same length; (3) the generalized nodes are all adjacent; (4) the similarity between adjacent generalized nodes is greater than a fixed threshold.
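A simplified sketch of these definitions for one parent node and one fixed length r. It assumes each child subtree has already been flattened into a list of tags, and it expresses "similarity greater than a threshold" as a normalized edit distance that is at most `max_distance`; both choices, and the names `mine_regions` and `normalized_distance`, are assumptions for illustration. A fuller implementation would also try several values of r and several start offsets.

```python
def edit_distance(a, b):
    """Unit-cost edit distance between two sequences, using one rolling row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def normalized_distance(a, b):
    return edit_distance(a, b) / max(len(a), len(b), 1)


def flatten(nodes):
    """Concatenate the tag sequences of the nodes forming one generalized node."""
    return [tag for node in nodes for tag in node]


def mine_regions(children, r, max_distance=0.3):
    """Return (start_index, number_of_generalized_nodes) for each data region.

    `children` holds one tag sequence per child of a single parent, in order.
    """
    regions = []
    i = 0
    while i + 2 * r <= len(children):
        count, j = 1, i
        # Extend the region while the next generalized node is similar enough.
        while j + 2 * r <= len(children) and normalized_distance(
            flatten(children[j:j + r]), flatten(children[j + r:j + 2 * r])
        ) <= max_distance:
            count += 1
            j += r
        if count >= 2:
            regions.append((i, count))
            i = j + r  # continue after the region
        else:
            i += 1
    return regions


children = [
    ["h1"],
    ["tr", "td", "td"],
    ["tr", "td", "td"],
    ["tr", "td", "td", "img"],
    ["hr"],
]
print(mine_regions(children, r=1))
```

Here the three row-like children form one region starting at index 1, while the heading and the rule are left out.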

An Example The regions were found using tree edit distance. For example, nodes 5 and 6 are similar (low cost mapping) and also satisfy the above definition, so they define a data region. [Figure: tag tree with nodes numbered 1 to 19, with Region 1, Region 2 and Region 3 marked.]

Partial Tree Alignment For each data region we have found, we need to understand the structure of the data records in the region. Not all data records contain the same fields (optional fields are possible). We will use (partial) tree alignment to gather the structure.

The algorithm Choose a seed tree: A seed tree, denoted by Ts, is picked with the maximum number of data items. Tree matching: For each unmatched tree Ti (i ≠ s), match Ts and Ti. Each pair of matched nodes is linked (aligned). For each unmatched node nj in Ti, expand Ts by inserting nj into Ts if a position for insertion can be uniquely determined in Ts. The expanded seed tree Ts is then used in subsequent matching.
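The steps above can be sketched for the simplest case, where every tree is just a root with a flat list of labelled children. Matching is done here with a longest-common-subsequence alignment, which is an assumption of this sketch, as are the names `lcs_pairs`, `partial_align` and `align_all`. A run of unmatched nodes is inserted only when its matched neighbours are adjacent in the seed, so that the position is unique; trees with leftover nodes are retried whenever the seed has grown.

```python
def lcs_pairs(s, t):
    """Index pairs (i, j) of one longest-common-subsequence alignment of s and t."""
    m, n = len(s), len(t)
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if s[i] == t[j]:
                length[i][j] = length[i + 1][j + 1] + 1
            else:
                length[i][j] = max(length[i + 1][j], length[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < m and j < n:
        if s[i] == t[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif length[i + 1][j] >= length[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def partial_align(seed, tree):
    """Insert tree's unmatched labels into seed where the position is unique.

    Returns (new_seed, number_of_labels_that_could_not_be_placed).
    """
    match_of = {j: i for i, j in lcs_pairs(seed, tree)}  # tree index -> seed index
    inserts, unresolved, j = {}, 0, 0
    while j < len(tree):
        if j in match_of:
            j += 1
            continue
        k = j
        while k < len(tree) and k not in match_of:
            k += 1  # tree[j:k] is a maximal run of unmatched labels
        left = match_of[j - 1] if j > 0 else -1
        right = match_of[k] if k < len(tree) else len(seed)
        if right - left == 1:  # neighbours adjacent in the seed: position is unique
            inserts[right] = tree[j:k]
        else:
            unresolved += k - j
        j = k
    new_seed = []
    for pos in range(len(seed) + 1):
        new_seed.extend(inserts.get(pos, []))
        if pos < len(seed):
            new_seed.append(seed[pos])
    return new_seed, unresolved


def align_all(trees):
    """Pick the largest tree as seed and merge the others, retrying leftovers."""
    s = max(range(len(trees)), key=lambda i: len(trees[i]))
    seed = list(trees[s])
    pending = [t for i, t in enumerate(trees) if i != s]
    grew = True
    while pending and grew:
        grew, retry = False, []
        for t in pending:
            new_seed, unresolved = partial_align(seed, t)
            grew = grew or len(new_seed) > len(seed)
            seed = new_seed
            if unresolved:
                retry.append(t)
        pending = retry
    return seed


print(partial_align(["a", "b", "e"], ["b", "c", "d", "e"]))
print(partial_align(["a", "b", "e"], ["a", "x", "e"]))
print(align_all([["a", "b", "d", "e"], ["b", "x", "c"], ["b", "c", "d"]]))
```

In the last call, the tree `b x c` cannot be placed at first, but succeeds on the retry once `c` has been inserted into the seed from the third tree.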

Partial Tree Alignment of two trees [Figure: two cases of aligning a tree Ti against the seed tree Ts, both rooted at p. In the first case insertion is possible and Ts gains a new part; in the second case insertion is not possible.]

Full algorithm

A complete example [Figure: trees T1, T2 and T3, all rooted at p, are aligned with Ts = T1. Matching T2 inserts no node, matching T3 produces a new Ts, and T2 is then matched again.]

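Output data table [Table: one row per data record (T1, T2, T3) and one column per node of the final seed tree (…, x, b, n, c, d, h, k, g); cell values not recoverable.] Different data records contain different fields!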

Extraction given multiple pages The described technique is good for a single list page. It can clearly be used for multiple list pages. Templates from all input pages may be found separately and merged to produce a single refined pattern. Extraction results will get more accurate. In many applications, one needs to extract the data from the detail pages as they contain more information on the object.

Detail pages – an example [Figure: a list page and its detail pages; there is more data in the detail pages.]

Extraction from detail pages For extraction, we can treat each detail page as a data record, then extract using partial tree alignment. For instance, to apply the algorithm, we simply create a rooted tree as follows: create an artificial root node, and make the tag tree of each page a child sub-tree of the artificial root node.

An example [Figure: an artificial root node r with the tag tree of each detail page as a child sub-tree.] We already know how to extract data from a data region.

Difficulty with detail pages Although a detail page focuses on a single object, the page may contain a large amount of “noise” at the top, on the left and right, and at the bottom, mostly in commercial websites. Since we treat each page as a data record, the algorithm will also extract the “noise”.

An example (a lot of noise)

The solution To start, a sample page is taken as the wrapper. The wrapper is then refined by solving mismatches between the wrapper and each sample page, which generalizes the wrapper. A mismatch occurs when some token in the sample does not match the grammar of the wrapper.

Wrapper generalization Different types of mismatches: (1) Text string mismatches indicate data fields (or items). (2) Tag mismatches indicate lists of repeated patterns or optional elements. To resolve a tag mismatch, find the last token of the mismatch position and identify some candidate repeated patterns from the wrapper and sample by searching forward.
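A minimal sketch of the mismatch idea, covering only the text-string case: the wrapper and the sample are tokenized, differing text tokens are generalized into a field placeholder, and the first tag mismatch is merely reported. The `#PCDATA` placeholder, the regular-expression tokenizer and the function names are assumptions of this sketch; handling repeated patterns and optional elements is not shown.

```python
import re

TOKEN = re.compile(r"<[^>]+>|[^<]+")  # either a tag or a run of text
FIELD = "#PCDATA"                     # placeholder marking a data field


def tokenize(html):
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def is_tag(token):
    return token.startswith("<")


def generalize(wrapper, sample):
    """Generalize text mismatches into fields.

    Returns (new_wrapper, position_of_first_tag_mismatch_or_None).
    """
    result = []
    n = min(len(wrapper), len(sample))
    for pos in range(n):
        w, s = wrapper[pos], sample[pos]
        if w == s:
            result.append(w)
        elif not is_tag(w) and not is_tag(s):
            result.append(FIELD)  # text string mismatch: a data field
        else:
            # Tag mismatch: would need the repeated/optional pattern search.
            return result + wrapper[pos:], pos
    # A length difference is also a structural (tag-level) mismatch.
    return result + wrapper[n:], (None if len(wrapper) == len(sample) else n)


page1 = tokenize("<b>Title:</b> Dune <i>1965</i>")
page2 = tokenize("<b>Title:</b> Emma <i>1815</i>")
print(generalize(page1, page2))
```

With these two sample pages, the title and the year become fields while the fixed label `Title:` and the tags stay in the wrapper.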

An example

Summary Automatic extraction of data from a web page requires understanding of the data records’ structure. The first step is finding the data records in the page. The second step is merging the different structures and building a generic template for a data record. Partial tree alignment is one method for building the template.

Summary cont. Automatic extraction Advantages: It is scalable to a huge number of sites due to the automatic process. Disadvantages: It may extract a large amount of unwanted data because the system does not know what is interesting to the user. Domain heuristics or manual filtering may be needed to remove unwanted data. Extracted data from multiple sites needs integration, i.e., the schemas need to be matched.

Thank you! Questions?

Bibliography Y. Zhai, B. Liu, “Web data extraction based on partial tree alignment”, International World Wide Web Conference (2005). Y. Zhai, B. Liu, “Structured data extraction from the web based on partial tree alignment”, IEEE Transactions on Knowledge and Data Engineering (2006). D. C. Reis, P. B. Golgher, A. S. Silva, A. F. Laender, “Automatic web news extraction using tree edit distance”, Proceedings of the 13th International Conference on World Wide Web (2004).