Site-Level Web Template Extraction Based on Hyperlink Analysis
Josep Silva
Information Retrieval
Web Mining: Content Extraction, Template Extraction, Block Detection
Template Extraction
Why is Template Extraction useful?
Human reading: it has been measured that almost 40-50% of the components of a webpage can be considered irrelevant.
Enhancing indexers and text analyzers, which perform better when they process only relevant information.
Extracting the main content of a webpage so that it can be displayed suitably on a small device such as a PDA or a mobile phone.
Extracting the relevant content to make the webpage more accessible to visually impaired or blind users.
What is a webpage?
Three different interpretations, each with its own family of techniques:
Rendered View: Visual features classification…
HTML Code: CETR, Content Code Vector…
DOM Tree: Site Style Tree, CNR…
HTML Code approach CETR
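As a rough illustration of the HTML-code view, CETR (Content Extraction via Tag Ratios) scores each line of HTML by how much text it carries per tag: menu and layout lines tend to score low, content lines high. The sketch below is a simplification under stated assumptions. The name `tag_ratios` and the regular expression are illustrative, and the later stages of CETR (stripping scripts and comments, smoothing the ratios, clustering them) are not shown.

```python
import re

# Assumption: a tag is anything between "<" and ">" on a single line.
TAG = re.compile(r"<[^>]*>")

def tag_ratios(html: str) -> list[float]:
    """Return one text-to-tag ratio per line of HTML source."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG.findall(line))
        text_length = len(TAG.sub("", line).strip())
        # A line without tags is divided by 1, so it keeps its text length.
        ratios.append(text_length / max(tag_count, 1))
    return ratios

page = (
    '<div><a href="/">Home</a></div>\n'
    "<p>This is a long paragraph of actual content.</p>\n"
)
for ratio, line in zip(tag_ratios(page), page.splitlines()):
    print(f"{ratio:6.1f}  {line}")
```

Lines whose ratio stays well above that of their neighbours are the candidates for main content; where to draw the threshold is left open here.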
Template Extraction: Top-Down Exact Mapping
Template Extraction
Our method for template extraction in a nutshell:
Identify a set of webpages in the website topology. Select those nodes that belong to the menu. Use a complete subdigraph.
The template is the intersection between the initial webpage and all DOM trees in the subdigraph. The intersection is computed with a Top-Down Exact Mapping between the DOM trees.
Both steps can be done with a cost linear in the size of the DOM trees.
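The intersection step can be sketched as follows. This is a minimal illustration, not the author's implementation: the `Node` class, the equality test on tag and attributes, and the position-by-position pairing of children are all assumptions made for the example. A node of the initial page survives only if it and all its ancestors are matched by an equal node in every other DOM tree.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical minimal DOM node: tag name, attributes, ordered children.
    tag: str
    attrs: tuple = ()
    children: list = field(default_factory=list)

def same_label(a: Node, b: Node) -> bool:
    # Assumption: "exact" means equal tag and equal attributes.
    return a.tag == b.tag and a.attrs == b.attrs

def mapped_paths(a: Node, b: Node, path: tuple = ()) -> set:
    """Paths (child-index tuples) of nodes in `a` that map top-down onto `b`."""
    if not same_label(a, b):
        return set()  # a mismatch cuts off the whole subtree
    paths = {path}
    # Assumption: children are paired by position (a simplification).
    for i, (child_a, child_b) in enumerate(zip(a.children, b.children)):
        paths |= mapped_paths(child_a, child_b, path + (i,))
    return paths

def prune(node: Node, keep: set, path: tuple = ()):
    """Copy `node`, keeping only the nodes whose path is in `keep`."""
    if path not in keep:
        return None
    kids = [prune(c, keep, path + (i,)) for i, c in enumerate(node.children)]
    return Node(node.tag, node.attrs, [k for k in kids if k is not None])

def extract_template(initial: Node, others: list):
    """Intersect `initial` with every tree in `others`; None if nothing is shared."""
    keep = mapped_paths(initial, initial)  # start with every node of the page
    for tree in others:
        keep &= mapped_paths(initial, tree)
    return prune(initial, keep)
```

Paths are always taken relative to the initial page, so dropping a node for one tree does not shift the positions used when comparing against the next tree.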
Template Extraction Hyperlink Analysis
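Hyperlink analysis here means looking for pages that all link to one another, since pages reachable from the same menu form a complete subdigraph of the site. The sketch below is a greedy approximation written for illustration only: the function names, the `links` dictionary (page URL to the set of URLs it links to) and the `size` parameter are assumptions, and the greedy strategy is a stand-in rather than the search described in the talk.

```python
def is_complete_subdigraph(pages, links) -> bool:
    """True if every page in `pages` links to every other page in `pages`."""
    return all(
        q in links.get(p, set())
        for p in pages
        for q in pages
        if p != q
    )

def greedy_complete_subdigraph(start, links, size):
    """Grow a set of mutually linked pages from `start`, up to `size` pages."""
    chosen = [start]
    # Candidates are the pages linked from the start page, in a fixed order.
    for candidate in sorted(links.get(start, set())):
        if len(chosen) == size:
            break
        mutual = all(
            candidate in links.get(p, set()) and p in links.get(candidate, set())
            for p in chosen
        )
        if candidate != start and mutual:
            chosen.append(candidate)
    return chosen

# Hypothetical site: the menu pages link to each other, the news item does not.
site = {
    "/": {"/a", "/b", "/news/1"},
    "/a": {"/", "/b"},
    "/b": {"/", "/a"},
    "/news/1": {"/"},
}
menu_pages = greedy_complete_subdigraph("/", site, size=3)
print(menu_pages, is_complete_subdigraph(menu_pages, site))
```

A greedy pass can miss a larger set of mutually linked pages, so this should be read as a way to make the idea concrete, not as an exact search.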
DEMO
Summary
Main Ideas
Use densitometric features (TR) to analyse HTML code
Use Chars Nodes Ratio (CNR) to analyse DOM trees
Use Top-Down Exact Mappings (TDEM) to isolate the template of webpages
Use the menus of a website to extract the template
Use a complete subdigraph to identify the main menu
Use folder information inside URLs to direct the search
Thank you