
1 Data-rich Section Extraction from HTML pages
Introducing the DSE-Algorithm
Original paper by Jiying Wang and Fred H. Lochovsky, Department of Computer Science, University of Science & Technology, Hong Kong
Presentation by Max Arends

2 Data-rich Section Extraction from HTML pages – DSE Algorithm
The problem:
● Given a web page, find the data-rich section of the page without any further input
What makes it difficult?
● Decoration and advertisements
● "Human-oriented" HTML pages are difficult for computer programs to parse

3 Topic distillation: tries to distill a small number of high-quality pages that are most representative of a topic. The basic idea is that the number of links pointing to a page offers an assessment of its popularity and quality.
Web information extraction: tries to extract data items from web pages, which are usually semi-structured, and return them in a structured form.
The DSE algorithm improves both!

4 Overview: HITS algorithm
● One of the most well-known topic distillation algorithms
● Given a set of web pages about one specific topic, the HITS algorithm calculates an authority score for each page (an indicator of relevance)
● Basically, it looks at how many links point to a page (compare Google)
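The scoring idea on this slide can be sketched in a few lines. This is a minimal textbook-style HITS iteration, not code from the paper or the presentation; the function name `hits`, the `links` dictionary format, and the fixed iteration count are assumptions made for illustration.

```python
import math


def hits(links, iterations=50):
    """Return (authority, hub) scores.

    `links` maps each page to the pages it links to
    (hypothetical input format chosen for this sketch).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of pages linking to it.
        new_auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub score of a page: sum of the authority scores of pages it links to.
        new_hub = {p: sum(new_auth[d] for d in links.get(p, ())) for p in pages}

        # Normalize both vectors so the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}

    return auth, hub


if __name__ == "__main__":
    # Tiny made-up link graph: "a" and "b" both point to "c".
    example = {"a": ["c"], "b": ["c"], "c": ["d"]}
    authority, hubs = hits(example)
    for page in sorted(authority, key=authority.get, reverse=True):
        print(page, round(authority[page], 3))
```

In this toy graph the page with the most inlinks ends up with the highest authority score, which is the intuition the slide describes.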

5 The DSE algorithm (Data-rich Section Extraction)
● Basic idea:
– Pages on the same site are similar or identical in layout (same CMS, same style)
● Basic method:
– Use structural information to identify the basic layout.
– Find "neighboring" pages on the same site and compare them.

6 What is the data-rich section of an HTML page?
● Both sites share a similar layout
● The key content is in the lower right section

7 3 phases:
1. Discover a set of sample pages that are similar to the target page
2. Parse these HTML pages and convert them into tag trees
3. Compare the target page tree with the sample page tree to identify their common parts. The difference is the data-rich section

8 Phase 1: Discovering sample URLs
US(i, j) (URL similarity) estimates how similar two pages i and j are.
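The transcript does not preserve the formula for US(i, j), so the following is only an illustrative stand-in, not the paper's definition. It assumes that two URLs are more similar the more leading path segments they share on the same host; `url_similarity` and `pick_samples` are hypothetical names.

```python
from urllib.parse import urlparse


def url_similarity(url_a, url_b):
    """Assumed measure: fraction of shared leading path segments (0.0 to 1.0)."""
    a, b = urlparse(url_a), urlparse(url_b)
    # Pages on different hosts are treated as unrelated in this sketch.
    if a.netloc.lower() != b.netloc.lower():
        return 0.0

    seg_a = [s for s in a.path.split("/") if s]
    seg_b = [s for s in b.path.split("/") if s]
    longest = max(len(seg_a), len(seg_b))
    if longest == 0:
        return 1.0  # both URLs point at the site root

    shared = 0
    for x, y in zip(seg_a, seg_b):
        if x != y:
            break
        shared += 1
    return shared / longest


def pick_samples(target, candidates, k=3):
    """Return the k candidate URLs most similar to the target URL."""
    others = [c for c in candidates if c != target]
    others.sort(key=lambda c: url_similarity(target, c), reverse=True)
    return others[:k]


if __name__ == "__main__":
    target = "http://example.com/news/2003/a.html"
    candidates = [
        "http://example.com/about.html",
        "http://example.com/news/2003/b.html",
        "http://example.com/news/archive.html",
    ]
    print(pick_samples(target, candidates, k=2))
```

Any measure that ranks pages from the same directory of the same site above unrelated pages would serve the same purpose of selecting sample pages for the later phases.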

9 Phase 2: Tree creation
● The target page and the sample page are parsed.
● Each HTML page's layout is converted into a tree-like structure (DOM)
● Unimportant tags are ignored: FONT, SMALL, H1, H6
● Unimportant attributes (like BACKGROUND) are ignored, to avoid unnecessary computations and comparisons
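As a rough sketch of this phase, the standard-library `html.parser` module can build a simple tag tree while skipping the tags and attributes named on the slide. The `Node` and `TagTreeBuilder` classes, the `#text` and `#document` pseudo-tags, and the list of void tags are assumptions of this example, not details taken from the paper.

```python
from html.parser import HTMLParser

# Tags and attributes listed on the slide as unimportant.
IGNORED_TAGS = {"font", "small", "h1", "h6"}
IGNORED_ATTRS = {"background"}
# Elements without a closing tag (assumed subset, for this sketch only).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = attrs or {}
        self.children = []


class TagTreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            return  # skip the tag itself; its children attach to the parent
        kept = {k: v for k, v in attrs if k not in IGNORED_ATTRS}
        node = Node(tag, kept)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in IGNORED_TAGS or tag in VOID_TAGS:
            return
        # Close everything up to the nearest matching open tag, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", {"text": text}))


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one tag per line."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    page = '<body background="x.gif"><h1><font>Title</font></h1><table><tr><td>cell</td></tr></table></body>'
    print("\n".join(outline(build_tag_tree(page))))
```

Ignored tags are treated as transparent here: the tag is dropped but its content stays attached to the enclosing element, so purely presentational markup does not affect later comparisons.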

10 Phase 3: Tree matching
● Given two DOM trees (one representing the target page and one the sample page), the similar structures have to be matched
● The two trees are traversed in depth-first order and compared node by node
● The parts of the tree that don't match are the data-rich sections
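The matching step can be illustrated with a simplified depth-first comparison. This sketch assumes that children are compared position by position and that text leaves match only when their text is equal; the paper's actual matching rules may differ. The `node` helper, the dictionary-based tree format, and the path strings are hypothetical.

```python
def node(tag, *children, text=None):
    """Build a small tree node (hypothetical format for this sketch)."""
    return {"tag": tag, "text": text, "children": list(children)}


def find_data_rich(target, sample, path=None):
    """Depth-first walk of `target`; return (path, subtree) for every part
    of the target tree that has no matching counterpart in `sample`."""
    if path is None:
        path = "/" + target["tag"]

    # A missing counterpart, a different tag, or different text means the
    # whole subtree is treated as unmatched.
    if (
        sample is None
        or target["tag"] != sample["tag"]
        or target["text"] != sample["text"]
    ):
        return [(path, target)]

    found = []
    sample_children = sample["children"]
    for i, child in enumerate(target["children"]):
        other = sample_children[i] if i < len(sample_children) else None
        child_path = f"{path}/{child['tag']}[{i}]"
        found.extend(find_data_rich(child, other, child_path))
    return found


if __name__ == "__main__":
    # Two made-up pages with the same layout but different main content.
    target = node("html", node("body",
        node("div", node("#text", text="Menu")),
        node("div", node("#text", text="Story A"))))
    sample = node("html", node("body",
        node("div", node("#text", text="Menu")),
        node("div", node("#text", text="Story B"))))
    for where, subtree in find_data_rich(target, sample):
        print(where, subtree["text"])
```

The shared menu matches in both trees and is skipped, while the differing story text is reported as the unmatched, data-rich part of the target page.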


12 Applying DSE to HITS
● 28 queries are used
● Each query is sent to the Google search engine, requesting the first 200 results
● The result pages are added to the root set
● Each of the 200 results is sent to Google again to retrieve at most 100 inlinks pointing to that result page; these are also added to the root set
● The root set ranges from 975 to 6,776 nodes


