Analysis of DOM Structures for Site-Level Template Extraction (PSI 2015). Joint work done in collaboration with Julián Alarte, Josep Silva, Salvador Tamarit.

Contents
- Motivation
- Content Extraction and Block Detection
- Template Extraction
- A Technique for Template Extraction
- State of the art
- The DOM tree
- Template extraction based on DOM
- Experiments
- Firefox plugin online DEMO
- Conclusions and Future Work

Motivation

Related areas: Information Retrieval, Web Mining, Template Detection, Content Extraction, Block Detection

Motivation

What is content extraction? Content extraction is the process of determining which parts of a webpage contain the main textual content, ignoring additional context such as menus, status bars, advertisements, and sponsored information.

What is block detection? Block detection is the discipline that tries to isolate every information block in a webpage.

Motivation

The date is different. The title is different.

Motivation

Why is template extraction useful?
- Component reuse: web developers can automatically extract components from a webpage.
- Enhancing indexers and text analyzers, which perform better when they process only relevant information. It has been measured that about 40-50% of the components of a webpage represent the template.
- Extraction of the main content of a webpage so that it can be suitably displayed on a small device such as a PDA or a mobile phone.
- Extraction of the relevant content to make the webpage more accessible for visually impaired or blind users.

The Technique: What is a webpage?

The Technique: State of the Art

There are three main ways to solve the problem:
- Using the textual information of the webpage (i.e., the HTML code)
  - Densitometric features: counting characters and tags
  - Statistics on terms: some terms are common in templates
- Using the rendered image of the webpage in the browser
- Using the DOM tree of the webpage
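As a rough illustration of the densitometric idea of counting characters and tags, the following minimal sketch computes a text-to-tag ratio for one line of HTML. The function name `text_to_tag_ratio` and the two sample lines are hypothetical and are not taken from any of the surveyed approaches.

```python
import re

# Matches anything that looks like an HTML tag on a single line.
TAG = re.compile(r"<[^>]*>")

def text_to_tag_ratio(line):
    """Characters of visible text per tag on one line of HTML."""
    tags = len(TAG.findall(line))
    chars = len(TAG.sub("", line).strip())
    return chars / max(tags, 1)  # avoid division by zero on tag-free lines

# Hypothetical sample lines: a menu entry and a content paragraph.
menu_line = '<li><a href="/">Home</a></li>'
content_line = "<p>A long paragraph of article text.</p>"

print(text_to_tag_ratio(menu_line), text_to_tag_ratio(content_line))
```

Lines dominated by markup, such as menu entries, tend to score low, while lines of running text score high.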

The Technique: State of the Art (continued)

Using the rendered image of the webpage in the browser:
- Position of elements: lateral menus, main content centered and visible
- Less studied: rendering webpages is computationally expensive

The Technique: State of the Art (continued)

Using the DOM tree of the webpage:
- Analysis of the DOM structure: difficulty in analysing DIV-based structures
- Comparing several webpages: search for common structures

The Technique: Limitations of Current Approaches

- Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags).
- Some assume that the main content text is continuous.
- Some assume that the system knows a priori the format of the webpage.
- Some need to (randomly) load many webpages (several dozens) to compare them.

The Technique: Limitations of Current Approaches

- Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags) [10].
- Some assume that the main content text is continuous [11].
- Some assume that the system knows a priori the format of the webpage [10].
- Some assume that the whole website to which the webpage belongs is based on the use of some template that is repeated.

The Technique: Limitations of Current Approaches

The main problem of these approaches is a big loss of generality. They require knowing or parsing the webpages in advance, or they require the webpage to have a particular structure. This is very inconvenient because modern webpages are mainly based on div tags, which need not be hierarchically organized (as they must be in table-based design). Moreover, many webpages are nowadays generated automatically and dynamically, so it is often impossible to analyze them a priori.

The Technique

Other approaches are able to work:
+ Online (i.e., with any webpage)
+ In real time (i.e., without the need to preprocess the webpages or know their structure)

The Technique: The Document Object Model (DOM)

The Document Object Model (DOM) is an API that provides programmers with a standard set of objects for the representation of HTML and XML documents. Given a webpage, its associated DOM structure can be produced completely automatically, and vice versa. The DOM structure of a given webpage is a tree in which all the elements of the webpage (including scripts and CSS styles) are represented hierarchically.

[Figure: example DOM tree with nodes Body, Div, Table, H1, Table, Image, Text]

The Technique: The Document Object Model (DOM)

Nodes in the DOM tree can be of two types, tag nodes and text nodes:
- Tag nodes represent the HTML tags of an HTML document and contain all the information associated with the tags (e.g., their attributes).
- Text nodes are always leaves in the DOM tree because they cannot contain other nodes.

I want to know more! http://www.w3.org/DOM/
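The distinction between tag nodes and text nodes can be made concrete with a minimal sketch that builds a simplified DOM-like tree using Python's standard `html.parser` module. The `Node` and `DomBuilder` classes, the `#document` root label, and the short list of void tags are assumptions made for illustration; a browser's DOM implementation is considerably richer.

```python
from html.parser import HTMLParser

# Assumed subset of void elements: they never have children or a closing tag.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class Node:
    def __init__(self, kind, value, attrs=None):
        self.kind = kind                # "tag" or "text"
        self.value = value              # tag name or text content
        self.attrs = dict(attrs or [])  # only meaningful for tag nodes
        self.children = []              # stays empty for text nodes

class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("tag", "#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node("tag", tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].value == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("text", data.strip()))

def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def show(node, depth=0):
    """Return the tree as indented lines, one node per line."""
    lines = ["  " * depth + f"{node.kind}:{node.value}"]
    for child in node.children:
        lines.extend(show(child, depth + 1))
    return lines

dom = build_dom("<body><h1>Title</h1><div><img src='a.png'><p>Hi</p></div></body>")
print("\n".join(show(dom)))
```

In the resulting tree every text node is a leaf, while tag nodes carry their attributes and may have children.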

The Technique

Our method for template extraction in a nutshell:
1. Identify a set of webpages in the website topology. Select those nodes that belong to the menu. Use a complete subdigraph.
2. Solve conflicts between those webpages that implement different templates by establishing a voting system between the webpages.
3. The template is the intersection between the initial webpage and the DOM trees in the subdigraph. The intersection is computed with an Equal Top-Down Mapping between the DOM trees.

The three steps can be done with a linear cost with respect to the size of the DOM trees.

The Technique

1. Identify a set of webpages in the website topology. Select those nodes that belong to the menu. Use a complete subdigraph.

[Figure: website topology with Menu, Submenu, Domain A, Domain B, Domain C]

The Technique: Step 1 (continued)

[Figure label: Hyperlink distance]

The Technique: Step 1 (continued)

[Figure labels: Hyperlink distance, DOM distance]
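To make the notion of a complete subdigraph concrete, here is a minimal sketch that greedily collects pages linking to each other in both directions, starting from the initial page. The link graph, the page names and the function `complete_subdigraph` are hypothetical. Candidates are tried in the order in which their links appear, whereas the slides suggest ranking them by hyperlink distance and DOM distance, which this sketch does not attempt.

```python
def complete_subdigraph(links, key_page, size):
    """Greedily pick up to `size` pages, starting with key_page, such that
    every picked page links to every other picked page."""
    chosen = [key_page]
    for candidate in links.get(key_page, []):
        if len(chosen) >= size:
            break
        if candidate in chosen:
            continue
        # The candidate must be linked with every chosen page, both ways.
        mutual = all(
            candidate in links.get(page, []) and page in links.get(candidate, [])
            for page in chosen
        )
        if mutual:
            chosen.append(candidate)
    return chosen

# Hypothetical website topology: page -> pages it links to.
links = {
    "home": ["news", "about", "post1"],
    "news": ["home", "about", "post2"],
    "about": ["home", "news"],
    "post1": ["home"],
}

print(complete_subdigraph(links, "home", 4))
```

Pages reachable from a shared menu tend to link to each other, which is why mutually linked pages are good candidates for sharing the template.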

The Technique

2. Solve conflicts between those webpages that implement different templates by establishing a voting system between the webpages.

The Technique

3. The template is the intersection between the initial webpage and the DOM trees in the subdigraph. The intersection is computed with an Equal Top-Down Mapping between the DOM trees.

[Figure: five example DOM trees, P1 to P5, each with nodes Body, Div, Table, H1, Table, Image, Text]

The Technique: Mapping

[Figure: mapping between two DOM trees built from HTML, Body, Div, Table and P nodes]

The Technique: Top-Down Mapping

[Figure: top-down mapping between two DOM trees built from HTML, Body, Div, Table and P nodes]

The Technique: Equal Top-Down Mapping

[Figure: equal top-down mapping between two DOM trees built from HTML, Body, Div, Table and P nodes]
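The slides show these mappings only as figures. The sketch below assumes the usual reading: in a top-down mapping two nodes can be paired only if their parents are paired, and in an equal top-down mapping paired nodes must additionally be equal (here: have the same label). Trees are encoded as hypothetical `(label, children)` tuples, and children are paired greedily from left to right, so this is a simplification that may miss a larger mapping and is not necessarily the procedure used by the authors.

```python
def etdm(a, b):
    """Return the part of trees a and b shared from the root downwards,
    or None if the roots are not equal."""
    label_a, kids_a = a
    label_b, kids_b = b
    if label_a != label_b:
        return None
    shared = []
    start = 0  # children of b before this index are already used or skipped
    for kid in kids_a:
        # Order-preserving greedy choice: pair this child with the first
        # remaining child of b that it can be mapped to.
        for j in range(start, len(kids_b)):
            common = etdm(kid, kids_b[j])
            if common is not None:
                shared.append(common)
                start = j + 1
                break
    return (label_a, shared)

# Two hypothetical pages that differ in how the body is laid out.
p1 = ("html", [("body", [("div", [("table", []), ("p", [])]), ("p", [])])])
p2 = ("html", [("body", [("table", []), ("div", [("p", []), ("p", [])])])])

print(etdm(p1, p2))
```

The returned tree plays the role of the intersection: it contains only the nodes that both pages share from the root downwards, which is the candidate template.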

Experiments

Benchmarks: online heterogeneous webpages
- Domains with different layouts and page structures
- Company websites, news articles, forums, etc.
- Final evaluation set randomly selected

We determined the actual template of each webpage by downloading it and manually selecting the template. The DOM tree of the selected elements was then produced and used for comparison in the later evaluation.

The F1 metric is computed as (2*P*R)/(P+R), where P is the precision and R is the recall.
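The F1 formula above can be sketched as follows, assuming that the retrieved template and the manually selected template are both given as sets of node identifiers. The identifiers and the function name `precision_recall_f1` are hypothetical.

```python
def precision_recall_f1(retrieved, gold):
    """Precision, recall and F1 of a retrieved node set against a gold set."""
    retrieved, gold = set(retrieved), set(gold)
    hits = len(retrieved & gold)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical example: 4 nodes retrieved, 5 nodes in the gold template, 3 shared.
p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {1, 2, 3, 5, 6})
print(p, r, f1)
```

With 3 shared nodes, precision is 3/4, recall is 3/5, and F1 is 2/3.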

Experiments: Gold Standard

- Downloading the complete website of each benchmark (company websites, news articles, forums, etc.).
- Four different engineers did the following independently:
  - Manually exploring the original page and the webpages accessible from it to decide what part of the webpage is the template.
  - Printing the key page on paper and marking the template.
- The four engineers met and together decided what the template was.
- Each element marked in the printed page was mapped to the DOM tree of the initial page.
- All elements in the DOM tree that did not belong to the template were included in an HTML class non-template (i.e., we enriched the HTML code of the key page with a new class). This class was later used by an algorithm that we programmed to evaluate the results obtained by our tool.
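As an illustration of how the non-template class could be consumed, the following minimal sketch reads an annotated page and collects every element that is not marked with that class. Identifying elements by their position in document order, and the class `GoldTemplate` itself, are assumptions made for this example; the authors' evaluation algorithm is not described in the slides.

```python
from html.parser import HTMLParser

class GoldTemplate(HTMLParser):
    """Collect (position, tag) for every element not marked non-template."""

    def __init__(self):
        super().__init__()
        self.count = 0      # position of the next element in document order
        self.template = []  # elements that belong to the gold template

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "non-template" not in classes:
            self.template.append((self.count, tag))
        self.count += 1

# Hypothetical annotated key page: the menu is template, the article is not.
page = (
    '<body><div id="menu"><a href="/">Home</a></div>'
    '<div class="non-template main"><p class="non-template">Text</p></div></body>'
)

parser = GoldTemplate()
parser.feed(page)
parser.close()
print(parser.template)
```

The resulting set of template elements can then be compared with the elements returned by the tool to compute precision and recall.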

Conclusions and Future Work

Conclusions: a new technique is proposed for template extraction.
1. It does not make assumptions about the particular structure of webpages.
2. It only needs to process a single webpage (no templates, no other webpages of the same website are needed).
3. No preprocessing stages are needed. The technique can work online.
4. It is fully language independent (it can work with pages written in English, German, etc.).
5. The particular text formatting of the webpage does not influence the performance of the technique.

Conclusions and Future Work

Future Work:
1. Consider that a website can implement several templates across its webpages:
   - Extend the benchmark suite by labelling all templates.
   - Develop a new technique to detect all templates of a website.
2. Combine template extraction with content extraction:
   - First, apply template extraction to remove the template.
   - Second, look for the main content in the remaining webpage.

Thank You
