Template Extraction Based on Menu Information
Josep Silva, Technical University of Valencia
Joint work done in collaboration with Julián Alarte, David Insa, Salvador Tamarit (WWV'2013)

Contents
- Motivation
- Content Extraction and Block Detection
- Template Extraction
- A Technique for Template Extraction
- State of the art
- The DOM tree
- Template extraction based on DOM
- Experiments
- Firefox plugin online DEMO
- Conclusions and Future Work

Motivation
What is content extraction?
Content extraction is the process of determining which parts of a webpage contain the main textual content, ignoring additional context such as menus, status bars, advertisements, and sponsored information.
What is block detection?
Block detection is the discipline that tries to isolate every information block in a webpage.

Motivation
- The date is different
- The title is different

Motivation
Why is template extraction useful?
- Component reuse: web developers can automatically extract components from a webpage.
- Enhancing indexers and text analyzers, which perform better when they process only relevant information. It has been measured that roughly 40-50% of the components of a webpage belong to the template.
- Extracting the main content of a webpage so that it can be displayed suitably on a small device such as a PDA or a mobile phone.
- Extracting the relevant content to make the webpage more accessible to visually impaired or blind users.

The Technique: State of the Art
Three main ways to solve the problem:
1. Using the textual information of the webpage (i.e., the HTML code)
   - Densitometric features: counting characters and tags
   - Statistics on terms: some terms are common in templates
2. Using the rendered image of the webpage in the browser
3. Using the DOM tree of the webpage
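
As an illustration of a densitometric feature, the sketch below counts the characters of visible text per tag on each line of HTML source. The function name `text_to_tag_ratio` and the regular expression are assumptions made for this example; it is one simple way of "counting characters and tags", not necessarily the feature used by the surveyed approaches.

```python
import re

# Matches anything that looks like an HTML tag, e.g. <div class="x"> or </a>.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratio(line: str) -> float:
    """Characters of visible text per tag on one line of HTML source."""
    tag_count = len(TAG_RE.findall(line))
    text_chars = len(TAG_RE.sub("", line).strip())
    # Treat a line without tags as having one tag to avoid dividing by zero.
    return text_chars / max(tag_count, 1)


menu_line = '<li><a href="/">Home</a></li>'
content_line = "<p>This paragraph holds the main content.</p>"
print(text_to_tag_ratio(menu_line))     # low ratio: mostly markup
print(text_to_tag_ratio(content_line))  # higher ratio: mostly text
```

Lines dominated by markup, such as menu entries, tend to score low, while lines of running text score high.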

The Technique: State of the Art
Using the rendered image of the webpage in the browser:
- Position of elements: lateral menus, main content centered and visible
- Less studied: rendering webpages is computationally expensive

The Technique: State of the Art
Using the DOM tree of the webpage:
- Analysis of the DOM structure: difficulty in analysing DIV-based structures
- Comparing several webpages: search for common structures

The Technique: Limitations of Current Approaches
- Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags) [10].
- Some assume that the main content text is continuous [11].
- Some assume that the system knows the format of the webpage a priori [10].
- Some need to (randomly) load many webpages (several dozen) to compare them [15].

The Technique: Limitations of Current Approaches

    <h2>Directory</h2>
    <div class="vcard">
      <span class="fn">Vicente Ramos</span>
      <div class="org">Software Development</div>
      <div class="adr">
        <div class="street-address">Atmosphere 118</div>
        <span class="locality">La Piedad, México</span>
        <span class="postal-code">59300</span>
      </div>
      <div class="tel">+52 352 52 68499</div>
      <h4>His Company</h4>
      <a class="url" href="page2.html">Company Page</a>
    </div>

- Some assume that the whole website to which the webpage belongs is based on a template that is repeated [12].

The Technique: Limitations of Current Approaches
The main problem with these approaches is a significant loss of generality: they need to know or parse the webpages in advance, or they require the webpage to have a particular structure. This is very inconvenient because modern webpages are mainly based on <div> tags, which do not need to be hierarchically organized (as in table-based design). Moreover, many webpages are nowadays generated automatically and dynamically, so it is often impossible to analyze them a priori.

The Technique
Other approaches are able to work:
+ Online (i.e., with any webpage)
+ In real time (i.e., without the need to preprocess the webpages or know their structure)

The Technique: Site Style Tree
[Figure: site style tree with nodes Body, Div, Table, H1, Image, Text]

The Technique: The Document Object Model (DOM) [17]
- An API that provides programmers with a standard set of objects for the representation of HTML and XML documents.
- Given a webpage, its associated DOM structure can be produced fully automatically, and vice versa.
- The DOM structure of a webpage is a tree in which all the elements of the webpage (including scripts and CSS styles) are represented hierarchically.
[Figure: example DOM tree with nodes Body, Div, Table, H1, Image, Text]

The Technique: The Document Object Model (DOM) [17]
Nodes in the DOM tree can be of two types, tag nodes and text nodes:
- Tag nodes represent the HTML tags of an HTML document and contain all the information associated with the tags (e.g., their attributes).
- Text nodes are always leaves in the DOM tree because they cannot contain other nodes.
I want to know more! http://www.w3.org/DOM/
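
To make the two node types concrete, here is a minimal sketch that builds a simplified DOM-like tree with Python's standard `html.parser` module. The `Node` and `TreeBuilder` classes, the `VOID_TAGS` set, and the `parse` helper are assumptions introduced for this example; a real browser DOM carries far more information and applies more error-recovery rules.

```python
from html.parser import HTMLParser

# Tags that never have children, so they are not pushed on the stack (partial list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    def __init__(self, kind, value, attrs=None):
        self.kind = kind              # "tag" or "text"
        self.value = value            # tag name, or the text itself
        self.attrs = dict(attrs or [])
        self.children = []            # stays empty for text nodes


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("tag", "#document")
        self.stack = [self.root]      # open tag nodes, innermost last

    def handle_starttag(self, tag, attrs):
        node = Node("tag", tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the innermost open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].value == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            # Text nodes are attached as children and never pushed on the stack.
            self.stack[-1].children.append(Node("text", data.strip()))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


root = parse('<body><h1 id="t">Title</h1><div><img src="a.png">'
             "<table><tr><td>Cell</td></tr></table></div></body>")
body = root.children[0]
print([child.value for child in body.children])
```

Because text nodes are never pushed on the stack of open elements, nothing can be attached beneath them, which mirrors the statement that text nodes are always leaves.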

The Technique
Our method for template extraction in a nutshell:
1. Identify a set of webpages in the website topology.
   - Select those nodes that belong to the menu.
   - Use a complete subdigraph.
2. The template is the intersection between the initial webpage and all DOM trees in the subdigraph.
   - The intersection is computed with a Top-Down Exact Mapping between the DOM trees.
Both steps can be done with a cost linear in the size of the DOM trees.

The Technique
1. Identify a set of webpages in the website topology.
   - Select those nodes that belong to the menu.
   - Use a complete subdigraph.
[Figure: website topology showing Domain A, Domain B, Domain C, a menu, and a submenu]
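
The idea of a complete subdigraph can be sketched as follows: starting from the initial webpage, collect pages that all link to one another, which is what pages sharing a menu tend to do. The link graph, the page names, and the greedy strategy in `complete_subdigraph` are assumptions for illustration only; the slides do not specify how the authors search the topology.

```python
def complete_subdigraph(links, start, size):
    """Greedily collect up to `size` pages that all link to each other.

    `links` maps each page to the set of pages it links to.
    """
    chosen = [start]
    for candidate in sorted(links.get(start, ())):
        if len(chosen) >= size:
            break
        if candidate in chosen:
            continue
        # Keep the candidate only if it is linked in both directions
        # with every page chosen so far.
        if all(candidate in links.get(page, ()) and page in links.get(candidate, ())
               for page in chosen):
            chosen.append(candidate)
    return chosen


# Hypothetical site: "index", "news" and "about" share a menu and link to each other.
links = {
    "index": {"news", "about", "article1"},
    "news": {"index", "about", "article2"},
    "about": {"index", "news"},
    "article1": {"index"},
    "article2": {"index", "news", "about"},
}
print(complete_subdigraph(links, "index", 3))
```

In this example `article1` is rejected because `about` does not link to it, so the result contains only the three mutually linked menu pages.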

The Technique
[Figure: website topology showing Domain A, Domain B, Domain C, a menu, and a submenu]

Size   F1      Loads
1      62.03   (missing)
2      76.16   3.4
3      78.35   5.75
4      78.65   7.45
5      78.85   9.3
6      78.11   14
7      78.13   16.15
8      (missing)   21

The Technique
Our method for template extraction in a nutshell:
2. The template is the intersection between the initial webpage and all DOM trees in the subdigraph.
   - The intersection is computed with a Top-Down Exact Mapping between the DOM trees.
[Figure: five DOM trees (P1 to P5), each with nodes Body, Div, Table, H1, Image, Text]

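The Technique: Mapping
[Figure: mapping between two DOM trees with nodes HTML, Body, Div, Table, P]

The Technique: Top-Down Mapping
[Figure: top-down mapping between two DOM trees with nodes HTML, Body, Div, Table]

The Technique: Top-Down Exact Mapping
[Figure: top-down exact mapping between two DOM trees with nodes HTML, Body, Div, Table]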
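
The slides give no formal definition of the mapping, so the following is only a rough sketch under a stated assumption: two nodes are mapped when they have the same tag, the same position among their siblings, and mapped parents. Each node is identified by its path of (position, tag) pairs from the root, and the template is the set of paths present in every page. The tuple-based tree encoding and the names `node_paths` and `extract_template` are hypothetical, and the sketch is not tuned for the linear cost mentioned earlier.

```python
def node_paths(tree, prefix=(), position=0):
    """Yield one path per node: the (position, tag) pairs from the root down.

    A tree is a (tag, children) tuple, where children is a list of trees.
    """
    tag, children = tree
    path = prefix + ((position, tag),)
    yield path
    for index, child in enumerate(children):
        yield from node_paths(child, path, index)


def extract_template(initial, others):
    """Keep the nodes of `initial` whose path also occurs in every other page."""
    common = set(node_paths(initial))
    for other in others:
        common &= set(node_paths(other))
    return common


page_1 = ("html", [("body", [("h1", []), ("div", [("table", [])]), ("p", [])])])
page_2 = ("html", [("body", [("h1", []), ("div", [("img", [])]), ("table", [])])])

common = extract_template(page_1, [page_2])
print(sorted(path[-1][1] for path in common))
```

Since a path contains all of a node's ancestors, a node can only survive the intersection if its parent does too, which gives the top-down behaviour.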

The Technique: Experiments
Benchmarks: online heterogeneous webpages
- Domains with different layouts and page structures
- Company websites, news articles, forums, etc.
- Final evaluation set randomly selected
We determined the actual template of each webpage by downloading it and manually selecting the template. The DOM tree of the selected elements was then produced and used for comparison in the later evaluation.
The F1 metric is computed as (2*P*R)/(P+R), where P is the precision and R is the recall.
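
The F1 formula can be checked with a small worked example. The sketch assumes that precision and recall are computed over sets of node identifiers (extracted template nodes versus manually selected template nodes); the function name and the sample sets are made up for illustration.

```python
def precision_recall_f1(retrieved, relevant):
    """Precision, recall and F1 of a retrieved node set against the true one."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = (2 * precision * recall) / total if total else 0.0
    return precision, recall, f1


# Hypothetical node identifiers: the extraction found the whole template
# plus one node ("ad") that does not belong to it.
extracted = {"html", "body", "menu", "footer", "ad"}
actual = {"html", "body", "menu", "footer"}

p, r, f1 = precision_recall_f1(extracted, actual)
print(p, r, f1)
```

Here the recall is 1.0 because every template node was found, while the precision is 0.8 because one of the five extracted nodes is not part of the template.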

The Technique: Experiments
- Using CNR: the average recall is 94.39 and the average precision is 74.08.
- Using tag ratios: the average recall is 92.72 and the average precision is 71.93.
Interesting phenomenon (property): either the recall, the precision, or both are 100%.
[Figure: DOM tree (Body, H1, Div, Table, Image, Text) annotated with four cases: recall 0% and precision 0%; recall 100% and precision <100%; recall 100% and precision 100%; recall <100% and precision 100%]

Motivation A demo is better than a hundred words

The Technique: Template extraction from Wikipedia
Recall 100%, precision 100% (50% of the time).

The Technique: Template extraction from FilmAffinity
Recall >100% (sometimes forced by the designer).

The Technique: Template extraction from FilmAffinity
Recall <100% (6% of the time).

Conclusions and future work
New technique proposed for template extraction:
- It does not make assumptions about the particular structure of webpages.
- It only needs to process a single webpage (no templates, no other webpages of the same website are needed).
- No preprocessing stages are needed. The technique can work online.
- It is fully language independent (it can work with pages written in English, German, etc.).
- The particular text formatting of the webpage does not influence the performance of the technique.

Conclusions and future work
- Update the implementation to the new Firefox API:
  2004 -> Firefox 1
  2006 -> Firefox 2
  2008 -> Firefox 3
  2011 -> Firefox 4
  2012 -> Firefox 13!!!!!
- New algorithm that selects the top-rated nodes based on the variance.
I want to know more! http://users.dsic.upv.es/~jsilva/CNR