1
Data-rich Section Extraction from HTML Pages: Introducing the DSE Algorithm
Original paper by Jiying Wang and Fred H. Lochovsky, Department of Computer Science, University of Science & Technology, Hong Kong
Presentation by Max Arends
2
Data-rich Section Extraction from HTML pages – DSE Algorithm
The problem:
● Given a web page, find its data-rich section without any additional input
What makes this difficult?
● Decoration and advertisements
● "Human-oriented" HTML pages are difficult for computer programs to parse
3
Topic distillation: tries to distill a small number of high-quality pages that are most representative of a topic. The basic idea is that the number of links pointing to a page offers an assessment of its popularity and quality.
Web information extraction: tries to extract data items from web pages, which are usually semi-structured, and return them in a structured form.
The DSE algorithm improves both!
4
Overview: HITS algorithm
● One of the best-known topic distillation algorithms
● Given a set of web pages about one specific topic, the HITS algorithm calculates an authority score for each page (an indicator of relevance based on links)
● Basically, it looks at how many links point to a page (similar to Google)
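The link-counting idea behind HITS can be sketched as follows. This is a minimal illustration of the commonly described formulation, which maintains a hub score alongside the authority score (the slide only mentions the latter). The function name `hits`, the `links` dictionary format, and the fixed iteration count are assumptions made for this example.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the list of pages it links to (assumed format)."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub score of a page: sum of the authority scores of pages it links to.
        hub = {p: sum(auth[d] for d in links.get(p, ())) for p in pages}
        # Normalise both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub
```

Calling `hits({"a": ["c"], "b": ["c"], "c": []})` gives page `c` the highest authority score, since it is the only page with incoming links.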
5
The DSE (Data-rich Section Extraction) algorithm
● Basic idea:
  – Pages on the same site are similar or identical in layout (same CMS, same style)
● Basic method:
  – Use structural information to identify the basic layout
  – Find "neighboring" pages on the same site and compare them
6
What is the data-rich section of an HTML page?
● Both sites share a similar layout
● The key content is in the lower right section
7
3 phases:
1. Discover a set of sample pages that are similar to the target page
2. Parse these HTML pages and convert them into tag trees
3. Compare the target page tree with the sample page tree to identify their common parts. What differs is the data-rich section.
8
Phase 1: Discovering sample URLs
US(i,j) [URL similarity] estimates the similarity of two pages i and j
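The slide does not give the formula for US(i,j), so the following is only a hypothetical stand-in that shows what a URL-based similarity score could look like. It is not the measure defined in the paper. The function name `url_similarity` and the scoring rule (same host, then the share of leading path segments in common) are assumptions.

```python
from urllib.parse import urlsplit

def url_similarity(url_a, url_b):
    """Hypothetical score in [0, 1]; NOT the US(i,j) formula from the paper."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    # Pages on different hosts are treated as unrelated in this sketch.
    if a.netloc.lower() != b.netloc.lower():
        return 0.0
    seg_a = [s for s in a.path.split("/") if s]
    seg_b = [s for s in b.path.split("/") if s]
    longest = max(len(seg_a), len(seg_b))
    if longest == 0:
        return 1.0
    # Count how many leading path segments the two URLs share.
    shared = 0
    for x, y in zip(seg_a, seg_b):
        if x != y:
            break
        shared += 1
    return shared / longest
```

Under this rule, two articles in the same directory of the same site score higher than two pages from different sections, which matches the intent of looking for "neighboring" pages.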
9
Phase 2: Tree creation
● The target page and the sample page are parsed
● Each HTML page's layout is converted into a tree-like structure (DOM)
● Unimportant tags are ignored: FONT, SMALL, H1, H6
● Unimportant attributes (such as BACKGROUND) are ignored to avoid unnecessary computations and comparisons
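The tree creation step can be sketched with Python's standard `html.parser` module. This is an illustrative sketch, not the authors' implementation. The `Node` and `TreeBuilder` classes, the `build_tree` helper, and the short `VOID_TAGS` list are assumptions, and the ignored tags and attributes are simply the ones named on the slide.

```python
from html.parser import HTMLParser

IGNORED_TAGS = {"font", "small", "h1", "h6"}   # tags named on the slide
IGNORED_ATTRS = {"background"}                 # attribute named on the slide
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # no closing tag

class Node:
    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.children = []
        self.text = ""

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            return  # skip the tag; its children attach to the enclosing node
        kept = [(k, v) for k, v in attrs if k not in IGNORED_ATTRS]
        node = Node(tag, kept)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS or tag in VOID_TAGS:
            return
        # Pop back to the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()

def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Skipping a presentational tag while still keeping its children means two pages that differ only in font markup produce the same tree, which keeps the later comparison focused on layout.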
10
Phase 3: Tree matching
● Given two DOM trees (one representing the target page and one the sample page), their similar structures have to be matched
● The two trees are traversed in depth-first order and compared node by node
● The parts of the tree that don't match are the data-rich sections
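A simplified version of the depth-first comparison can be sketched as below. It assumes each tree node is a `(tag, text, children)` tuple and aligns children purely by position, which is a simplification made for this example and not necessarily how the paper matches nodes. The function name `find_differences` and the path notation are also assumptions.

```python
def find_differences(target, sample, path="", index=0):
    """Return paths of subtrees in `target` that do not match `sample`.

    Each node is a (tag, text, children) tuple; children are matched
    by position (a simplification for this sketch).
    """
    t_tag, t_text, t_children = target
    here = f"{path}/{t_tag}[{index}]"
    # A missing counterpart or a different tag marks the whole subtree.
    if sample is None or sample[0] != t_tag:
        return [here]
    _, s_text, s_children = sample
    # If either node is a leaf, both must be leaves with the same text.
    if not t_children or not s_children:
        same = not t_children and not s_children and t_text == s_text
        return [] if same else [here]
    diffs = []
    # Depth-first: descend into each child before moving to the next sibling.
    for i, child in enumerate(t_children):
        other = s_children[i] if i < len(s_children) else None
        diffs.extend(find_differences(child, other, here, i))
    return diffs

target = ("body", "", [("div", "Site menu", []),
                       ("div", "", [("p", "Article about DSE", [])])])
sample = ("body", "", [("div", "Site menu", []),
                       ("div", "", [("p", "Another article", [])])])
differences = find_differences(target, sample)
```

In this example the shared menu matches in both trees, so only the paragraph holding the article text is reported as a candidate data-rich section.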
12
Applying DSE to HITS
● 28 queries are used
● Each query is sent to the Google search engine, requesting the first 200 results
● The result pages are added to the root set
● Each of the 200 results is sent to Google again to retrieve at most 100 inlinks pointing to that result page; these are also added to the root set
● The root set ranges from 975 to 6,776 nodes
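The root set construction described above can be sketched as follows. The slide does not say how Google was queried, so `search` and `inlinks` are hypothetical callables that the caller must supply. Only the limits of 200 results and 100 inlinks come from the slide.

```python
def build_root_set(query, search, inlinks, max_results=200, max_inlinks=100):
    """Collect result pages and their inlinks for one query.

    `search(query)` and `inlinks(url)` are hypothetical callables that
    return lists of URLs; no real search API is assumed here.
    """
    root_set = set()
    for url in search(query)[:max_results]:
        root_set.add(url)
        # Add at most `max_inlinks` pages that link to this result.
        root_set.update(inlinks(url)[:max_inlinks])
    return root_set
```

Because the pages are collected in a set, a page that appears both as a result and as an inlink is counted once, which is consistent with root sets whose sizes vary from query to query.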