Site-Level Web Template Extraction Based on Hyperlink Analysis Josep Silva
Information Retrieval Web Mining Content Extraction Template Extraction Block Detection
Template Extraction
Why is Template Extraction useful?
Human reading: it has been measured that roughly 40-50% of the components of a webpage can be considered irrelevant.
Enhancing indexers and text analyzers, which perform better when they process only relevant information.
Extracting the main content of a webpage so that it can be displayed suitably on a small device such as a PDA or a mobile phone.
Extracting the relevant content to make the webpage more accessible to visually impaired or blind users.
What is a webpage?
What is a webpage? Three different interpretations: Rendered View, HTML Code, DOM Tree
What is a webpage? Three different interpretations, each with its own family of techniques:
Rendered View: visual features classification…
HTML Code: CETR, Content Code Vector…
DOM Tree: Site Style Tree, CNR…
HTML Code approach CETR
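As a rough illustration of the HTML Code approach, the sketch below computes a text-to-tag ratio per line of HTML, which is the kind of densitometric feature CETR relies on. The function name `text_to_tag_ratios`, the regular expression used to spot tags, and the rule for lines without tags are assumptions made for this example; this is not the CETR implementation.

```python
import re
from typing import List

# Assumption: anything between "<" and ">" counts as one tag.
TAG_RE = re.compile(r"<[^>]*>")

def text_to_tag_ratios(html: str) -> List[float]:
    """Return one text-to-tag ratio per non-empty line of `html`."""
    ratios = []
    for line in html.splitlines():
        if not line.strip():
            continue
        tags = len(TAG_RE.findall(line))
        text_chars = len(TAG_RE.sub("", line).strip())
        # Assumption: a line without tags is scored by its text length alone.
        ratios.append(text_chars / tags if tags else float(text_chars))
    return ratios

sample = (
    '<div><a href="/">Home</a><a href="/news">News</a></div>\n'
    "<p>A long paragraph of article text that carries the main content.</p>"
)
ratios = text_to_tag_ratios(sample)
```

In this sample the menu line has many tags and little text, so its ratio is low, while the paragraph line scores much higher. Lines with high ratios are candidates for main content.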
Template Extraction Top-Down Exact Mapping
Template Extraction
Our method for template extraction in a nutshell:
Identify a set of webpages in the website topology. Select those nodes that belong to the menu. Use a complete subdigraph.
The template is the intersection between the initial webpage and all DOM trees in the subdigraph. The intersection is computed with a Top-Down Exact Mapping between the DOM trees.
Both steps can be done with a cost linear in the size of the DOM trees.
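The intersection step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Node`, `intersect`, and `extract_template` are hypothetical names, two nodes are treated as equal when their `label` strings match, and children are paired by position. A node is kept only if it equals its counterpart and all of its ancestors were kept, which is what makes the mapping top-down.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    label: str  # assumption: tag name plus whatever attributes are compared
    children: List["Node"] = field(default_factory=list)

def intersect(a: Node, b: Node) -> Optional[Node]:
    """Return the common top part of two trees, or None if the roots differ."""
    if a.label != b.label:
        return None
    kept = []
    # Assumption: children are paired by position; zip stops at the shorter list.
    for child_a, child_b in zip(a.children, b.children):
        common = intersect(child_a, child_b)
        if common is not None:
            kept.append(common)
    return Node(a.label, kept)

def extract_template(initial: Node, others: List[Node]) -> Optional[Node]:
    """Intersect the initial page with every other DOM tree in turn."""
    template: Optional[Node] = initial
    for tree in others:
        if template is None:
            break
        template = intersect(template, tree)
    return template

page1 = Node("html", [Node("body", [
    Node("div#menu", [Node("a"), Node("a")]),
    Node("div#content", [Node("p"), Node("p")]),
])])
page2 = Node("html", [Node("body", [
    Node("div#menu", [Node("a"), Node("a")]),
    Node("div#content", [Node("table")]),
])])
template = extract_template(page1, [page2])
```

Here the menu survives intact while the differing content under `div#content` is dropped. Each pair of nodes is visited at most once, so one intersection costs time linear in the size of the smaller tree.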
Template Extraction Hyperlink Analysis
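To make the idea of a complete subdigraph concrete, the sketch below greedily collects pages, starting from the initial webpage, such that every pair of chosen pages links to each other. The link map, the function name `complete_subdigraph`, and the greedy strategy are assumptions for illustration; the sketch does not guarantee the largest such set and is not necessarily how the method selects candidates.

```python
from typing import Dict, List, Set

def complete_subdigraph(links: Dict[str, Set[str]], start: str, size: int) -> List[str]:
    """Greedily pick up to `size` mutually linked pages, beginning with `start`."""
    chosen = [start]
    # Sorted only to make the result deterministic.
    for candidate in sorted(links.get(start, set())):
        if len(chosen) == size:
            break
        if candidate == start:
            continue
        # Keep the candidate only if it links to, and is linked from,
        # every page chosen so far.
        if all(
            candidate in links.get(page, set()) and page in links.get(candidate, set())
            for page in chosen
        ):
            chosen.append(candidate)
    return chosen

# Hypothetical site: menu pages link to each other, the article page does not
# receive a link from "/about".
links = {
    "/index": {"/news", "/about", "/news/item1"},
    "/news": {"/index", "/about", "/news/item1"},
    "/about": {"/index", "/news"},
    "/news/item1": {"/index", "/news", "/about"},
}
menu_pages = complete_subdigraph(links, "/index", 4)
```

Pages that appear in the main menu tend to link to one another, so a mutually linked set like this is a reasonable source of DOM trees that share the template.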
DEMO
Summary
Main Ideas
Use densitometric features (TR) to analyse HTML code
Use Chars Nodes Ratio (CNR) to analyse DOM trees
Use Top-Down Exact Mappings (TDEM) to isolate the template of webpages
Use the menus of a website to extract the template
Use a complete subdigraph to identify the main menu
Use folder information inside URLs to direct the search
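As a small illustration of the Chars Nodes Ratio idea, the sketch below divides the number of text characters in a DOM subtree by the number of nodes in that subtree. This is one plausible reading of the name; the `DomNode` type and the functions `chars_and_nodes` and `cnr` are assumptions for this example, not the published definition.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class DomNode:
    tag: str
    text: str = ""  # text held directly by this node
    children: List["DomNode"] = field(default_factory=list)

def chars_and_nodes(node: DomNode) -> Tuple[int, int]:
    """Return (text characters, node count) for the subtree rooted at `node`."""
    chars, nodes = len(node.text), 1
    for child in node.children:
        child_chars, child_nodes = chars_and_nodes(child)
        chars += child_chars
        nodes += child_nodes
    return chars, nodes

def cnr(node: DomNode) -> float:
    """Characters per node in the subtree; every subtree has at least one node."""
    chars, nodes = chars_and_nodes(node)
    return chars / nodes

menu = DomNode("ul", children=[DomNode("li", "Home"), DomNode("li", "News")])
article = DomNode("div", children=[DomNode("p", "x" * 120)])
```

Under this reading, a menu made of many small nodes gets a low ratio, while a block of article text gets a high one.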
Thank you