Julián Alarte, David Insa, Josep Silva

Webpage menu detection based on DOM Julián Alarte, David Insa, Josep Silva Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia SOFSEM 2017

Context: Information Retrieval → Web Mining → Content Extraction, Template Extraction, Block Detection

Demo

Menu Detection

Why is menu detection useful?
Website structure: the website menu usually links the main pages of a website, so it reveals the site's main structure.
Indexers and crawlers: they usually judge the relevance of a webpage by the frequency and distribution of terms and hyperlinks; detecting the menu helps them identify the most relevant pages of a website.
Template detection: a menu is always located inside the template of a webpage, so detecting the menu is a great advantage in the template detection process.

What is a webpage?

What is a webpage? Three different interpretations:
HTML code
DOM tree
Rendered view

[Figure: example DOM tree — a BODY node with DIV, H1, HR, P, A, TABLE and IMG branches; a UL node whose LI children contain A hyperlink nodes and nested ULs]

Site-level vs. page-level technique
[Figure: a site-level technique analyzes several HTML webpages of the same website; a page-level technique analyzes a single webpage]

What is a website menu?

What is a website menu? A website menu is defined as a DOM node such that:
At least two of its descendants are hyperlinks.
It is the smallest subtree containing those hyperlinks.
The same menu appears in at least one other webpage of the website.
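The first two conditions of the definition above can be checked on a single page. A minimal sketch in Python (the `N` node class and all names are illustrative, not the authors' code; the site-level condition — the same menu appearing on another webpage — is omitted here):

```python
# Hedged reading of the page-level part of the menu definition:
# a node qualifies as the menu subtree root if at least two of its
# descendants are hyperlinks and no single child already contains
# all of them (i.e., it is the smallest containing subtree).

class N:
    def __init__(self, tag, children=()):
        self.tag = tag
        self.children = list(children)

def count_links(node):
    """Count <a> nodes in the subtree rooted at node (inclusive)."""
    own = 1 if node.tag == "a" else 0
    return own + sum(count_links(c) for c in node.children)

def is_menu_subtree_root(node):
    total = count_links(node)
    if total < 2:
        return False
    # Smallest subtree: no single child holds every hyperlink.
    return all(count_links(c) < total for c in node.children)

ul = N("ul", [N("li", [N("a")]), N("li", [N("a")])])
div = N("div", [ul])
print(is_menu_subtree_root(ul), is_menu_subtree_root(div))  # True False
```

The DIV fails the test because its single UL child already contains both hyperlinks, so the UL is the smaller (and therefore preferred) subtree.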

The technique in a nutshell:
Phase 1: assign weights to the DOM nodes.
Phase 2: selection of root nodes.
Phase 3: selection of the menu node.

Node properties. A weight is assigned to each node considering these properties:
Node amplitude: computed from its number of children (the more the better).
Link ratio: proportion of link nodes among its descendants (the more the better).
Text ratio: number of characters in its subtree w.r.t. the total text (the less the better).
UL ratio: whether the node is a "UL" DOM node (0 or 1).
Representative tag: whether the classname or id of the node is "nav" or "menu", or its tagname is "nav" (0 or 1).
Node position: its position in the DOM tree (the higher the better).
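Combining the six properties into one weight can be sketched as a weighted sum. The coefficients below are the ones reported in the training-phase slide of this talk; the property scores are assumed to be normalized to [0, 1] (how each score is extracted from the DOM is not shown here):

```python
# Sketch of Phase 1: combine six normalized property scores into a
# single node weight, using the trained coefficients from the talk.

WEIGHTS = {
    "amplitude": 0.2,           # number of children (more is better)
    "link_ratio": 0.1,          # links among descendants (more is better)
    "text_ratio": 0.3,          # assumed already inverted: less text is better
    "ul_ratio": 0.2,            # 1 if the node is a UL, else 0
    "representative_tag": 0.1,  # 1 if class/id is "nav"/"menu" or tag is "nav"
    "position": 0.1,            # higher in the DOM tree is better
}

def node_weight(props):
    """Weighted sum of the six normalized property scores."""
    return sum(WEIGHTS[name] * props[name] for name in WEIGHTS)

# Example: a UL node with many link children and little text.
props = {"amplitude": 0.9, "link_ratio": 1.0, "text_ratio": 0.95,
         "ul_ratio": 1.0, "representative_tag": 0.0, "position": 0.6}
print(round(node_weight(props), 3))  # 0.825
```

Note that the coefficients sum to 1, so a node that scores 1 on every property gets the maximum weight of 1.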

Selection of candidates. Once all the node weights have been calculated, the nodes with the highest weights are selected.
Selection threshold = 0.85 × best weight in the DOM tree (based on experimentation).
Output: a set of candidates.

Selection of root nodes

Selection of root nodes. A node representing the menu often combines two or more candidates. Algorithm, for each candidate in the set:
Explore the ancestors of the candidate and, for each of them, check the weights of its children.
If more than half of its children have a weight higher than the root threshold multiplied by the weight of the candidate, continue going up.
Otherwise, stop and select the last accepted node.

[Figure: example DOM tree annotated with node weights (BODY 0.4; DIVs 0.38, 0.36; a UL 0.85 with LI children 0.39, 0.62, 0.87; nested ULs 0.86, 0.87; leaf nodes around 0.37). Root threshold = 0.7, so the cutoff for the 0.87 candidate is 0.87 × 0.7 = 0.609.]

Selection of the menu node. The set may contain several root nodes; one of them should correspond to the menu. Algorithm:
For each root node in the set, compute the average weight of its descendants whose weight is over the menu threshold.
The menu is the root node with the highest average weight.
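The final selection step can be sketched as below; for brevity each root node is represented simply by the list of its descendants' weights (the names and the example figures are illustrative):

```python
# Sketch of Phase 3: among the root nodes, pick the one whose
# descendants above the menu threshold have the highest average weight.

MENU_THRESHOLD = 0.8

def avg_heavy(weights):
    """Average of the descendant weights above the menu threshold."""
    heavy = [w for w in weights if w > MENU_THRESHOLD]
    return sum(heavy) / len(heavy) if heavy else 0.0

def menu_node(roots):
    """roots: dict mapping root node name -> list of descendant weights.
    Returns the name of the root selected as the menu."""
    return max(roots, key=lambda name: avg_heavy(roots[name]))

roots = {"nav_ul": [0.87, 0.86, 0.85, 0.39], "footer_div": [0.82, 0.40]}
print(menu_node(roots))  # nav_ul
```

Here the navigation UL wins with an average of 0.86 over its three heavy descendants, against 0.82 for the footer's single heavy descendant.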

Training phase. A suite of benchmarks has been developed.
Experiments were executed with a subset of the suite: 1.5M experiments, with a total computation time of 85 days.
Precision and recall were measured for each combination, and the best combination of thresholds and properties was selected.

Training phase - best combination:
Candidate threshold = 0.85
Root threshold = 0.7
Menu threshold = 0.8
Node weight coefficients:
Node amplitude = 0.2
Link ratio = 0.1
Text ratio = 0.3
UL ratio = 0.2
Representative tag = 0.1
Node position = 0.1

Results - evaluation:
Recall = number of correctly obtained links divided by the number of links in the menu.
Precision = number of correctly obtained links divided by the number of obtained links.
F1 = (2 × Precision × Recall) / (Precision + Recall)
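These three metrics are straightforward to compute; a small worked example with made-up link counts (the figures below are purely illustrative, not the talk's results):

```python
# The evaluation metrics from the slide, as plain functions.

def precision(correct, obtained):
    """Fraction of obtained links that are correct."""
    return correct / obtained

def recall(correct, in_menu):
    """Fraction of the menu's links that were obtained."""
    return correct / in_menu

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Toy example: 9 of the 10 retrieved links are correct and the
# menu actually contains 12 links.
p, r = precision(9, 10), recall(9, 12)
print(round(f1(p, r), 3))  # 0.818
```

Since F1 is a harmonic mean, it sits closer to the lower of the two values, so a technique cannot hide poor recall behind high precision (or vice versa).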

Results: Recall = 94.13%, Precision = 98.21%, F1 = 94.46%, Time = 5.38 s.

Conclusions:
The page-level technique shows good performance: almost 75% of the experiments retrieved the exact menu.
It is useful for template extraction techniques.
It provides navigational information for site-level techniques.

Implementation: implemented as a Firefox add-on, published by Mozilla.

Thank you