Site-Level Web Template Extraction

1 Site-Level Web Template Extraction Based on Hyperlink Analysis
Josep Silva

2 Information Retrieval
Web Mining, Content Extraction, Template Extraction, Block Detection

14 Template Extraction

15 Why is Template Extraction useful?
Human reading. It has been measured that almost 40-50% of the components of a webpage can be considered irrelevant.
Enhancing indexers and text analyzers to increase their performance by only processing relevant information.
Extraction of the main content of a webpage to be suitably displayed on a small device such as a PDA or a mobile phone.
Extraction of the relevant content to make the webpage more accessible for visually impaired or blind users.

16 What is a webpage?

18 What is a webpage? Three different interpretations:
Rendered View
HTML Code
DOM Tree

21 What is a webpage? Three different interpretations:
Rendered View: Visual features classification…
HTML Code: CETR, Content Code Vector…
DOM Tree: Site Style Tree, CNR…

23 HTML Code approach CETR

24 HTML Code approach
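
As a rough illustration of the HTML Code approach, the sketch below computes a per-line text-to-tag ratio in the spirit of CETR. It covers only the ratio step; the smoothing and clustering that a full technique would apply afterwards are left out. The function name `tag_ratios` and the regular expression are assumptions made for this example and do not come from the slides.

```python
import re
from typing import List

# Naive tag matcher: good enough for a sketch, not a real HTML parser.
TAG = re.compile(r"<[^>]*>")

def tag_ratios(html: str) -> List[float]:
    """Return, for each line, (non-tag characters) / (number of tags).

    Lines without tags use a denominator of 1, so their ratio is simply
    their text length.
    """
    ratios = []
    for line in html.splitlines():
        tags = len(TAG.findall(line))
        text_chars = len(TAG.sub("", line).strip())
        ratios.append(text_chars / max(tags, 1))
    return ratios

sample = (
    "<div><a href='/'>Home</a></div>\n"
    "<p>This is the main article text.</p>\n"
    "plain"
)
print(tag_ratios(sample))
```

Lines dominated by markup (menus, link lists) tend to get low ratios, while lines carrying running text get high ones.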

25 What is a webpage? Three different interpretations:
Rendered View: Visual features classification…
HTML Code: CETR, Content Code Vector…
DOM Tree: Site Style Tree, CNR…

30 Template Extraction Exact Top-Down Mapping

31 Template Extraction Our method for template extraction in a nutshell:
Identify a set of webpages in the website topology. Select those nodes that belong to the menu. Use a complete subdigraph.
The template is the intersection between the initial webpage and all DOM trees in the subdigraph. The intersection is computed with a Top-Down Exact Mapping between the DOM trees.
Both steps can be done with a cost linear in the size of the DOM trees.
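
The intersection step can be sketched as follows. This is a simplified, greedy approximation of a top-down mapping, not the method from the talk: two nodes are treated as equal when their tag names match, children are paired left to right, and no attempt is made to reach the linear cost mentioned above. The `Node` class and the function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """Minimal stand-in for a DOM node (assumption: only the tag matters)."""
    tag: str
    children: List["Node"] = field(default_factory=list)

def top_down_intersection(a: Node, b: Node) -> Optional[Node]:
    """Return the common top part of two trees, or None if the roots differ.

    A child can only be mapped if its parent was mapped (top-down), and
    mapped nodes must carry the same tag (the simplified notion of "exact").
    """
    if a.tag != b.tag:
        return None
    common = Node(a.tag)
    start = 0  # children of b before this index are already used or skipped
    for child_a in a.children:
        for j in range(start, len(b.children)):
            sub = top_down_intersection(child_a, b.children[j])
            if sub is not None:
                common.children.append(sub)
                start = j + 1
                break
    return common

def extract_template(page: Node, others: List[Node]) -> Optional[Node]:
    """Intersect the initial page with every other page in the set."""
    template: Optional[Node] = page
    for other in others:
        if template is None:
            return None
        template = top_down_intersection(template, other)
    return template

def show(n: Node) -> str:
    """Render a tree as nested text, e.g. html(body(nav,div))."""
    if not n.children:
        return n.tag
    return n.tag + "(" + ",".join(show(c) for c in n.children) + ")"

def page(content_tag: str) -> Node:
    """Build a toy page: shared menu and footer, page-specific content."""
    nav = Node("nav", [Node("a"), Node("a")])
    return Node("html", [Node("body", [nav, Node("div", [Node(content_tag)]), Node("footer")])])

template = extract_template(page("p"), [page("table"), page("ul")])
print(show(template))
```

In this toy example the menu and footer survive the intersection, while the page-specific content under `div` is dropped.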

33 Template Extraction Hyperlink Analysis
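
One way to picture the hyperlink analysis is as a search for pages that all link to each other, since pages reachable from a shared menu usually do. The sketch below grows such a set greedily from the initial page. It is a heuristic written for illustration: the `links` dictionary, the `size` bound, and the alphabetical candidate order are assumptions, and the talk's actual selection strategy may differ.

```python
from typing import Dict, List, Set

def complete_subdigraph(links: Dict[str, Set[str]], start: str, size: int) -> List[str]:
    """Greedily collect up to `size` pages that are pairwise linked both ways.

    `links[p]` is the set of URLs that page `p` links to.
    """
    chosen = [start]
    for candidate in sorted(links.get(start, set())):
        if len(chosen) >= size:
            break
        if candidate in chosen:
            continue
        # Keep the candidate only if it links to, and is linked from,
        # every page already chosen.
        mutual = all(
            candidate in links.get(p, set()) and p in links.get(candidate, set())
            for p in chosen
        )
        if mutual:
            chosen.append(candidate)
    return chosen

site = {
    "/": {"/news", "/about", "/post/1"},
    "/news": {"/", "/about", "/post/1"},
    "/about": {"/", "/news"},
    "/post/1": {"/"},
}
print(complete_subdigraph(site, "/", 4))
```

Here `/post/1` is left out because `/about` does not link to it, so only the three menu pages form a complete subdigraph.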

35 DEMO

36 Summary Main Ideas
Use densitometric features (TR) to analyse HTML code.
Use Chars Nodes Ratio (CNR) to analyse DOM trees.
Use Top-Down Exact Mappings (TDEM) to isolate the template of webpages.
Use the menus of a website to extract the template.
Use a complete subdigraph to identify the main menu.
Use folder information inside URLs to direct the search.
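
To make the Chars Nodes Ratio idea concrete, here is a minimal sketch. It assumes that the CNR of a node is the number of text characters in its subtree divided by the number of nodes in that subtree; the slides do not spell out the definition, so treat this as one plausible reading. `DomNode` and both functions are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class DomNode:
    """Toy DOM node holding its own text and its child nodes."""
    tag: str
    text: str = ""
    children: List["DomNode"] = field(default_factory=list)

def chars_and_nodes(node: DomNode) -> Tuple[int, int]:
    """Count text characters and nodes in the subtree rooted at `node`."""
    chars, nodes = len(node.text), 1
    for child in node.children:
        c, n = chars_and_nodes(child)
        chars += c
        nodes += n
    return chars, nodes

def cnr(node: DomNode) -> float:
    """Chars-nodes ratio under the assumed definition (chars / nodes)."""
    chars, nodes = chars_and_nodes(node)
    return chars / nodes

menu = DomNode("ul", children=[
    DomNode("li", children=[DomNode("a", "Home")]),
    DomNode("li", children=[DomNode("a", "News")]),
    DomNode("li", children=[DomNode("a", "About")]),
])
article = DomNode("div", children=[DomNode("p", "x" * 60)])

print(cnr(menu), cnr(article))
```

The menu has many nodes and little text, so its ratio is low; the article block has few nodes and much text, so its ratio is high.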

37 Thank you

