Webpage menu detection based on DOM
Julián Alarte, David Insa, Josep Silva
Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia
SOFSEM 2017
Information Retrieval, Web Mining, Content Extraction, Template Extraction, Block Detection
Demo
Menu Detection
Why is menu detection useful?
- Website structure: The menu usually links to the main pages of a website, so it reveals the main structure of the website.
- Indexers and crawlers: They usually judge the relevance of a webpage according to the frequency and distribution of terms and hyperlinks. Detecting the menu helps them identify the most relevant pages of a website.
- Template detection: A menu is always located inside the template of a webpage, so detecting the menu is a great advantage in the template detection process.
What is a webpage?
Three different interpretations:
- HTML code
- DOM tree
- Rendered view
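To make the step from HTML code to DOM tree concrete, here is a minimal sketch that builds a simplified tree with Python's standard `html.parser` module. The `TreeBuilder` class, the `outline` helper, the `VOID` set and the sample markup are illustrative assumptions, and the sketch expects well-formed input; a browser's DOM construction is far more tolerant.

```python
from html.parser import HTMLParser

# Elements without a closing tag (a partial list, enough for this sketch).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree of nested dicts."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.stack = [self.root]  # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes well-formed HTML: each end tag closes the open element.
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():  # ignore whitespace-only text
            self.stack[-1]["children"].append({"tag": "#text", "children": []})


def outline(node):
    """Render the tree as a compact one-line outline."""
    kids = " ".join(outline(child) for child in node["children"])
    return f"{node['tag']}({kids})" if kids else node["tag"]


html = ("<body><h1>Site</h1><ul><li><a href='/'>Home</a></li>"
        "<li><a href='/about'>About</a></li></ul></body>")
builder = TreeBuilder()
builder.feed(html)
builder.close()
tree_outline = outline(builder.root)
print(tree_outline)
```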
[Figure: example DOM tree rooted at BODY, with DIV, H1, HR, H2, A, P, TABLE, IMG, UL and LI nodes and #text leaves]
Site-level vs page-level technique
[Figure: sets of HTML pages illustrating the two kinds of technique]
What is a website menu?
A website menu is defined as a DOM node such that:
- At least two of its descendants are hyperlinks.
- It is the root of the smallest subtree containing those hyperlinks.
- The same menu appears in at least one other webpage of the website.
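The first two conditions of the definition can be sketched as finding the deepest node whose subtree contains a given set of hyperlinks. The `Node` class and the `links_in` and `smallest_subtree` helpers below are hypothetical names introduced for illustration; the third condition (the menu reappearing on another page) needs several pages and is not covered here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # compare nodes by identity, not by content
class Node:
    tag: str
    children: List["Node"] = field(default_factory=list)


def links_in(node):
    """Return every hyperlink node in the subtree rooted at `node`."""
    own = [node] if node.tag == "a" else []
    return own + [link for child in node.children for link in links_in(child)]


def smallest_subtree(root, links):
    """Return the deepest node whose subtree contains all the given links."""
    if len(links) < 2:
        raise ValueError("a menu needs at least two hyperlinks")
    wanted = {id(link) for link in links}
    node = root
    while True:
        for child in node.children:
            if wanted <= {id(link) for link in links_in(child)}:
                node = child  # this child still holds every link: go deeper
                break
        else:
            return node  # no single child holds all the links


home, about = Node("a"), Node("a")
ul = Node("ul", [Node("li", [home]), Node("li", [about])])
body = Node("body", [Node("h1"), Node("div", [ul])])
menu = smallest_subtree(body, [home, about])
print(menu.tag)
```

In this toy page the surrounding DIV also contains both links, but the UL is selected because it is the smaller subtree.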
The technique in a nutshell
- Phase 1: Assign weights to the DOM nodes.
- Phase 2: Selection of root nodes.
- Phase 3: Selection of the menu node.
Node properties
A weight is assigned to each node considering these properties:
- Node amplitude: computed from its number of children (the more the better).
- Link ratio: the amount of link nodes among its descendants (the more the better).
- Text ratio: the number of characters in its subtree with respect to the total text (the less the better).
- UL ratio: whether the node is a "UL" DOM node (0 or 1).
- Representative tag: whether the class name or id of the node is "nav" or "menu", or its tag name is "nav" (0 or 1).
- Node position: its position in the DOM tree (the higher the better).
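As an illustration of how some of these properties could be measured, the sketch below computes the link ratio, text ratio, UL flag and representative-tag flag on a toy tree. The exact formulas are not given on the slide, so the normalizations used here (links divided by descendants, subtree characters divided by page characters, exact matches on class and id) are assumptions, as are the `Node` class and all function names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(eq=False)
class Node:
    tag: str
    text: str = ""
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def descendants(node):
    """Yield every node below `node`, excluding `node` itself."""
    for child in node.children:
        yield child
        yield from descendants(child)


def text_length(node):
    """Number of text characters in the subtree rooted at `node`."""
    return len(node.text) + sum(text_length(c) for c in node.children)


def link_ratio(node):
    # Assumption: fraction of descendants that are <a> nodes.
    below = list(descendants(node))
    return sum(1 for d in below if d.tag == "a") / len(below) if below else 0.0


def text_ratio(node, root):
    # Assumption: subtree characters divided by all characters in the page.
    total = text_length(root)
    return text_length(node) / total if total else 0.0


def is_ul(node):
    return 1 if node.tag == "ul" else 0


def representative_tag(node):
    # Assumption: class or id must match "nav" or "menu" exactly.
    names = {node.attrs.get("class", ""), node.attrs.get("id", "")}
    return 1 if node.tag == "nav" or names & {"nav", "menu"} else 0


menu = Node("ul", attrs={"class": "menu"}, children=[
    Node("li", children=[Node("a", text="Home")]),
    Node("li", children=[Node("a", text="About")]),
])
article = Node("p", text="x" * 91)
page = Node("body", children=[menu, article])

print(link_ratio(menu), text_ratio(menu, page), is_ul(menu), representative_tag(menu))
```

The menu scores well on every property shown (many links, little text, a UL with a "menu" class), while the paragraph holds most of the text and no links.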
Selection of candidates
- Once the weights of all nodes have been calculated, the nodes with the highest weights are selected.
- Selection threshold = 0.85 × best weight in the DOM tree (value obtained by experimentation).
- Output: a set of candidates.
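The candidate selection can be sketched in a few lines. The function name, the node labels and the sample weights are hypothetical, and treating a weight exactly equal to the cutoff as selected is an assumption.

```python
def select_candidates(weights, threshold=0.85):
    """Return the nodes whose weight reaches threshold * best weight."""
    if not weights:
        return set()
    cutoff = threshold * max(weights.values())
    # Assumption: a weight exactly at the cutoff still counts as a candidate.
    return {node for node, weight in weights.items() if weight >= cutoff}


# Hypothetical weights keyed by a readable node label.
weights = {"UL#main": 0.87, "UL#sub": 0.86, "DIV": 0.62, "P": 0.12}
candidates = select_candidates(weights)
print(sorted(candidates))
```

With a best weight of 0.87 the cutoff is 0.85 × 0.87 = 0.7395, so only the two UL nodes are kept.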
Selection of root nodes
A node representing the menu often combines two or more candidates.
Algorithm, for each candidate in the set:
- Explore the ancestors of the candidate, starting from its parent. For each ancestor, check the weights of its children.
- If more than half of its children have a weight higher than the root threshold multiplied by the weight of the candidate, continue going up.
- Otherwise, stop and select the last node.
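The climb through the ancestors can be sketched as follows. The `Node` class, the `find_root` name and the toy weights are assumptions, and "the last node" is interpreted here as the last ancestor that passed the test (or the candidate itself if its parent already fails), which is one possible reading of the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    name: str
    weight: float
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, *kids):
        """Attach children and set their parent pointer."""
        for kid in kids:
            kid.parent = self
            self.children.append(kid)
        return self


def find_root(candidate, root_threshold=0.7):
    """Climb from a candidate while most siblings at each level are heavy."""
    cutoff = root_threshold * candidate.weight
    selected = candidate
    ancestor = candidate.parent
    while ancestor is not None:
        heavy = sum(1 for child in ancestor.children if child.weight > cutoff)
        if heavy * 2 <= len(ancestor.children):
            break  # not more than half of the children are heavy: stop
        selected = ancestor  # assumption: this ancestor becomes the root
        ancestor = ancestor.parent
    return selected


ul_main, ul_sub = Node("UL#main", 0.87), Node("UL#sub", 0.86)
nav = Node("DIV.nav", 0.62).add(ul_main, ul_sub)
body = Node("BODY", 0.40).add(nav, Node("DIV.content", 0.21), Node("IMG", 0.12))

root = find_root(ul_main)
print(root.name)
```

For the candidate of weight 0.87 the cutoff is 0.87 × 0.7 = 0.609. Both children of `DIV.nav` exceed it, so the climb continues; only one of the three children of `BODY` does, so the climb stops and `DIV.nav` is selected.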
[Figure: example DOM tree (BODY, DIV, H1, A, P, TABLE, IMG, UL, LI and #text nodes) annotated with node weights. With a root threshold of 0.7 and a candidate of weight 0.87, the comparison value is 0.87 × 0.7 = 0.609.]
Selection of the menu node
There are probably several root nodes in the set; one of them should correspond to the menu.
Algorithm:
- For each root node in the set, compute the average weight of its descendants that have a weight over the menu threshold.
- The menu is the root node with the highest average weight.
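A minimal sketch of this final step is shown below. The `Node` class, function names and weights are hypothetical. Treating the menu threshold as an absolute weight, and scoring a root with no heavy descendants as 0, are both assumptions, since the slide does not say how these cases are handled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class Node:
    name: str
    weight: float
    children: List["Node"] = field(default_factory=list)


def descendant_weights(node):
    """Yield the weight of every node below `node`."""
    for child in node.children:
        yield child.weight
        yield from descendant_weights(child)


def menu_score(root, menu_threshold=0.8):
    """Average weight of the descendants heavier than the menu threshold."""
    heavy = [w for w in descendant_weights(root) if w > menu_threshold]
    # Assumption: a root without any heavy descendant scores 0.
    return sum(heavy) / len(heavy) if heavy else 0.0


def select_menu(roots, menu_threshold=0.8):
    """Return the root node with the highest score."""
    return max(roots, key=lambda root: menu_score(root, menu_threshold))


main_nav = Node("DIV.nav", 0.62, [Node("UL", 0.87), Node("UL", 0.86), Node("LI", 0.37)])
footer = Node("DIV.footer", 0.50, [Node("UL", 0.81), Node("A", 0.37)])

menu = select_menu([footer, main_nav])
print(menu.name, round(menu_score(menu), 3))
```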
Training phase
- A suite of benchmarks was developed.
- Experiments were run with a subset of the suite: 1.5M experiments.
- Computation time = 85 days.
- Precision and recall were measured for each combination.
- The best combination of thresholds and properties was selected.
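One way to picture such a search is an exhaustive sweep over weight combinations, keeping the one that scores best. The sketch below is only an illustration: the 0.1 grid step, the `weight_grid` and `best_combination` names, and the toy scoring function are all assumptions, and the actual search space used in the experiments is not described on the slide.

```python
from itertools import product


def weight_grid(n_properties=6, steps=10):
    """Yield every weight vector on a 1/steps grid whose entries sum to 1."""
    for parts in product(range(steps + 1), repeat=n_properties):
        if sum(parts) == steps:  # integer check avoids float rounding issues
            yield tuple(part / steps for part in parts)


def best_combination(combinations, score):
    """Return the combination with the highest score."""
    return max(combinations, key=score)


def toy_score(weights):
    # Placeholder: a real run would measure precision and recall on the
    # benchmark suite for this combination.
    return weights[2]


best = best_combination(weight_grid(n_properties=3), toy_score)
print(best)
```

Even this coarse grid yields 3003 weight vectors for six properties, before any thresholds are varied, which gives a sense of why the number of experiments grows quickly.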
Training phase - Best combination
- Candidate threshold = 0.85
- Root threshold = 0.7
- Menu threshold = 0.8
Node weight:
- Node amplitude = 0.2
- Link ratio = 0.1
- Text ratio = 0.3
- UL ratio = 0.2
- Representative tag = 0.1
- Node position = 0.1
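Since the six coefficients sum to 1, a natural reading is that the node weight is a weighted sum of the property scores. The sketch below assumes exactly that, and also assumes each property has already been normalized to the range 0 to 1 with higher meaning better (so the text ratio is inverted beforehand). The dictionary keys and the sample scores are hypothetical.

```python
# Coefficients reported as the best combination.
COEFFICIENTS = {
    "amplitude": 0.2,
    "link_ratio": 0.1,
    "text_ratio": 0.3,
    "ul": 0.2,
    "representative_tag": 0.1,
    "position": 0.1,
}


def node_weight(scores):
    """Weighted sum of property scores (assumed to lie in [0, 1])."""
    return sum(COEFFICIENTS[name] * scores[name] for name in COEFFICIENTS)


# Hypothetical scores for a UL node with class "menu".
scores = {
    "amplitude": 0.8,
    "link_ratio": 0.5,
    "text_ratio": 0.91,  # assumption: 1 - (share of the page text)
    "ul": 1,
    "representative_tag": 1,
    "position": 0.5,
}
weight = node_weight(scores)
print(round(weight, 3))
```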
Results - Evaluation
- Recall = number of correctly obtained links divided by the number of links in the menu.
- Precision = number of correctly obtained links divided by the number of obtained links.
- F1 = (2 * Precision * Recall) / (Precision + Recall)
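These three measures can be computed directly from two sets of links. The `evaluate` function and the sample URLs are illustrative, and returning 0 when a denominator is empty is an assumption.

```python
def evaluate(obtained, expected):
    """Return (precision, recall, f1) for a set of obtained menu links."""
    obtained, expected = set(obtained), set(expected)
    correct = len(obtained & expected)
    # Assumption: an empty denominator yields a score of 0.
    precision = correct / len(obtained) if obtained else 0.0
    recall = correct / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


expected_links = {"/", "/about", "/blog", "/contact"}
obtained_links = {"/", "/about", "/login"}

precision, recall, f1 = evaluate(obtained_links, expected_links)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here two of the three obtained links are correct (precision 2/3) and two of the four menu links are found (recall 1/2), giving F1 = 4/7. When scores are averaged over many pages, the mean of the per-page F1 values generally differs from the F1 computed from the mean precision and mean recall.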
Results
- Recall = 94.13%
- Precision = 98.21%
- F1 = 94.46%
- Time = 5.38 s
Conclusions
- Page-level technique with good performance.
- Almost 75% of the experiments retrieved the exact menu.
- Useful for template extraction techniques.
- Provides navigational information for site-level techniques.
Implementation Implemented as a Firefox Add-on. Published by Mozilla.
Thank you