Template Extraction Based on Menu Information Josep Silva, Technical University of Valencia. Joint work done in collaboration with Julián Alarte, David Insa, Salvador Tamarit (WWV'2013)
Contents Motivation Content Extraction and Block Detection Template Extraction A Technique for Template Extraction State of the art The DOM tree Template extraction based on DOM Experiments Firefox plugin online DEMO Conclusions and Future Work
Motivation What is content extraction? Content extraction is the process of determining which parts of a webpage contain the main textual content, thus ignoring additional context such as menus, status bars, advertisements, sponsored information, etc. What is block detection? Block detection is the discipline that tries to isolate every information block in a webpage.
Motivation The date is different The title is different
Motivation Why is template extraction useful? Component reuse: web developers can automatically extract components from a webpage. Enhancing indexers and text analyzers, which perform better when they process only relevant information; it has been measured that almost 40-50% of the components of a webpage represent the template. Extraction of the main content of a webpage so that it can be suitably displayed on a small device such as a PDA or a mobile phone. Extraction of the relevant content to make the webpage more accessible for visually impaired or blind users.
The Technique State of the Art There are three main ways to solve the problem: using the textual information of the webpage (i.e., the HTML code); using the rendered image of the webpage in the browser; using the DOM tree of the webpage. Approaches based on textual information rely on densitometric features (counting characters and tags) and on statistics on terms (some terms are common in templates).
The Technique State of the Art Approaches based on the rendered image of the webpage in the browser use the position of elements (lateral menus, main content centered and visible). They are less studied because rendering webpages is computationally expensive.
The Technique State of the Art Approaches based on the DOM tree of the webpage either analyse the DOM structure (with difficulty in analysing DIV-based structures) or compare several webpages to search for common structures.
The Technique Limitations of Current approaches Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags) [10]. Some assume that the main content text is continuous [11]. Some assume that the system knows a priori the format of the webpage [10]. Some need to (randomly) load many webpages (several dozen) to compare them [15].
The Technique Limitations of Current approaches <h2>Directory</h2> <div class="vcard"> <span class="fn">Vicente Ramos</span> <div class="org">Software Development</div> <div class="adr"> <div class="street-address">Atmosphere 118</div> <span class="locality">La Piedad, México</span> <span class="postal-code">59300</span> </div> <div class="tel">+52 352 52 68499</div> <h4>His Company</h4> <a class="url" href="page2.html">Company Page</a> </div> Some assume that the whole website to which the webpage belongs is based on the use of some template that is repeated [12].
The Technique Limitations of Current approaches The main problem of these approaches is a big loss of generality. They need to know or parse the webpages in advance, or they require the webpage to have a particular structure. This is very inconvenient because modern webpages are mainly based on <div> tags, which need not be hierarchically organized (as in table-based design). Moreover, many webpages are nowadays generated automatically and dynamically, so it is often impossible to analyze them a priori.
The Technique Other approaches are able to work online (i.e., with any webpage) and in real time (i.e., without the need to preprocess the webpages or know their structure).
The Technique Site Style Tree [Figure: Site Style Tree with nodes Body, Div, Table, H1, Image, Text]
The Technique The Document Object Model (DOM) [17] is an API that provides programmers with a standard set of objects for the representation of HTML and XML documents. Given a webpage, its associated DOM structure can be produced completely automatically, and vice versa. The DOM structure of a given webpage is a tree in which all the elements of the webpage (including scripts and CSS styles) are represented hierarchically. [Figure: DOM tree with nodes Body, Div, Table, H1, Image, Text]
The Technique The Document Object Model (DOM) [17] Nodes in the DOM tree can be of two types: tag nodes and text nodes. Tag nodes represent the HTML tags of an HTML document, and they contain all the information associated with the tags (e.g., their attributes). Text nodes are always leaves in the DOM tree because they cannot contain other nodes. I want to know more! http://www.w3.org/DOM/
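The two node types can be illustrated with a minimal sketch in Python. It is not the implementation behind the talk: the names `Node`, `DomBuilder` and `parse_dom` are hypothetical, and the tree is built with the standard library's `html.parser` rather than a browser DOM, so it only approximates what a browser would produce.

```python
from html.parser import HTMLParser

# Elements that never have children, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical DOM-like node: kind is "tag" or "text"."""

    def __init__(self, kind, name, attrs=None):
        self.kind = kind                # "tag" or "text"
        self.name = name                # tag name, or the text itself
        self.attrs = dict(attrs or [])  # attributes (tag nodes only)
        self.children = []              # stays empty for text nodes


class DomBuilder(HTMLParser):
    """Builds a simplified tree; an assumption-laden stand-in for a real DOM."""

    def __init__(self):
        super().__init__()
        self.root = Node("tag", "#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node("tag", tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].name == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text nodes are appended as leaves and never pushed on the stack.
            self.stack[-1].children.append(Node("text", text))


def parse_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Calling `parse_dom` on a small page returns a `#document` root whose descendants are tag nodes carrying their attributes, with text nodes appearing only as leaves.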
The Technique Our method for template extraction in a nutshell: (1) Identify a set of webpages in the website topology: select those nodes that belong to the menu and use a complete subdigraph. (2) The template is the intersection between the initial webpage and all DOM trees in the subdigraph; the intersection is computed with a Top-Down Exact Mapping between the DOM trees. Both steps can be done with a cost linear in the size of the DOM trees.
The Technique Identify a set of webpages in the website topology: select those nodes that belong to the menu and use a complete subdigraph. [Figure: website topology with Domain A, Domain B, Domain C, Menu and Submenu]
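The selection step can be sketched as follows, reading "complete subdigraph" in its graph-theoretic sense: a set of pages in which every page links to every other one. The function name `find_complete_subdigraph`, the `links` dictionary and the backtracking search are illustrative assumptions rather than the authors' algorithm, and this naive search is not the linear-cost procedure the talk refers to.

```python
def find_complete_subdigraph(links, start, size):
    """Return `size` pages (including `start`) that all link to each other.

    `links` maps each page to the set of pages it links to. Assumption:
    it only contains hyperlinks found in menu candidates. Returns None
    when no such set of pages exists.
    """

    def mutually_linked(a, b):
        return b in links.get(a, ()) and a in links.get(b, ())

    def extend(chosen, candidates):
        if len(chosen) == size:
            return chosen
        for i, page in enumerate(candidates):
            # A page may join only if it is linked both ways with all chosen pages.
            if all(mutually_linked(page, other) for other in chosen):
                found = extend(chosen + [page], candidates[i + 1:])
                if found:
                    return found
        return None

    candidates = sorted(p for p in links.get(start, ()) if p != start)
    return extend([start], candidates)
```

For example, if `home`, `news` and `about` all link to one another while `article1` only links back to `home`, a request for three pages starting at `home` returns those three menu pages and leaves the article out.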
The Technique [Figure: website topology with Domain A, Domain B, Domain C, Menu and Submenu]
Size  F1     Loads
1     62.03  -
2     76.16  3.4
3     78.35  5.75
4     78.65  7.45
5     78.85  9.3
6     78.11  14
7     78.13  16.15
8     -      21
(-: value not recoverable from the slide)
The Technique Our method for template extraction in a nutshell: The template is the intersection between the initial webpage and all DOM trees in the subdigraph. The intersection is computed with a Top-Down Exact Mapping between the DOM trees. [Figure: DOM trees of webpages P1 to P5, each with nodes Body, Div, Table, H1, Image, Text]
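The Technique Mapping: [Figure: mapping between two DOM trees with nodes HTML, Body, Div, Table, P]
The Technique Top-Down Mapping: [Figure: top-down mapping between two DOM trees with nodes HTML, Body, Div, Table]
The Technique Top-Down Exact Mapping: [Figure: top-down exact mapping between two DOM trees with nodes HTML, Body, Div, Table]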
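A minimal sketch of how a top-down exact mapping and the resulting intersection could be computed is shown below. The slides do not spell out the node-equality test or how siblings are paired, so this sketch assumes that two nodes are equal when their tag names match and that children are paired by position. The names `Node`, `top_down_exact_mapping` and `extract_template` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)


def top_down_exact_mapping(a, b):
    """Pairs (node_a, node_b), starting at the roots and moving downwards.

    Assumptions: nodes are equal when their tags match; siblings are
    paired by position. A pair is only added when the parents are already
    paired (top-down) and the nodes are equal (exact).
    """
    if a.tag != b.tag:
        return []
    pairs = [(a, b)]
    for child_a, child_b in zip(a.children, b.children):
        pairs.extend(top_down_exact_mapping(child_a, child_b))
    return pairs


def extract_template(initial, others):
    """Keep the nodes of `initial` that are mapped in every other tree."""
    kept = {id(n) for n, _ in top_down_exact_mapping(initial, initial)}
    for other in others:
        kept &= {id(n) for n, _ in top_down_exact_mapping(initial, other)}
    if id(initial) not in kept:
        return None  # even the roots differ, so there is no common template

    def prune(node):
        return Node(node.tag, [prune(c) for c in node.children if id(c) in kept])

    return prune(initial)
```

Each mapping visits every node at most once, which is in line with the linear cost mentioned in the talk. Because a child is only mapped when its parent is, the intersection of the mapped node sets is itself a tree rooted at the root of the initial page.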
The Technique Experiments Benchmarks: online heterogeneous webpages. Domains with different layouts and page structures: company websites, news articles, forums, etc. The final evaluation set was randomly selected. We determined the actual template of each webpage by downloading it and manually selecting the template. The DOM tree of the selected elements was then produced and used later for comparison in the evaluation. The F1 metric is computed as (2*P*R)/(P+R), where P is the precision and R is the recall.
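The metric can be computed as in the following sketch, which assumes that the retrieved template and the manually selected template are both given as sets of node identifiers. The function name `precision_recall_f1` and the identifiers in the example are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall and F1 of a retrieved node set against a gold set.

    Assumption: nodes are represented by hashable identifiers.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 = (2*P*R)/(P+R); defined as 0 when nothing was retrieved correctly.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

For instance, retrieving four nodes of which three form the whole gold template gives P = 0.75 and R = 1.0, so F1 = (2 * 0.75 * 1.0) / 1.75, roughly 0.857.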
The Technique Experiments Using CNR: the average recall is 94.39 and the average precision is 74.08. Using tag ratios: the average recall is 92.72 and the average precision is 71.93. Interesting phenomenon (property): either the recall, the precision, or both are 100%. [Figure: DOM tree (Body, H1, Div, Table, Image, Text) annotated with four cases: recall 0% and precision 0%; recall 100% and precision <100%; recall 100% and precision 100%; recall <100% and precision 100%]
Motivation A demo is better than a hundred words
The Technique Template extraction from Wikipedia Recall 100%, precision 100% (50% of the time).
The Technique Template extraction from FilmAffinity Recall >100% (sometimes forced by the designer).
The Technique Template extraction from FilmAffinity Recall <100% (6% of the time).
Conclusions and future work New technique proposed for template extraction: It does not make assumptions about the particular structure of webpages. It only needs to process a single webpage (no templates, no other webpages of the same website are needed). No preprocessing stages are needed. The technique can work online. It is fully language independent (it can work with pages written in English, German, etc.). The particular text formatting of the webpage does not influence the performance of the technique.
Conclusions and future work Update the implementation to the new Firefox API (2004: Firefox 1; 2006: Firefox 2; 2008: Firefox 3; 2011: Firefox 4; 2012: Firefox 13!). New algorithm that selects the top-rated nodes based on the variance. I want to know more! http://users.dsic.upv.es/~jsilva/CNR