Analysis of DOM Structures for Site-Level Template Extraction (PSI 2015)
Joint work done in collaboration with Julián Alarte, Josep Silva, and Salvador Tamarit.

Contents
- Motivation
  - Content Extraction and Block Detection
  - Template Extraction
- A Technique for Template Extraction
  - State of the art
  - The DOM tree
  - Template extraction based on DOM
- Experiments
  - Firefox plugin
  - Online demo
- Conclusions and Future Work

Motivation
Related areas: Information Retrieval, Web Mining, Template Detection, Content Extraction, Block Detection

Motivation
What is content extraction?
Content extraction is the process of determining which parts of a webpage contain the main textual content, ignoring additional context such as menus, status bars, advertisements, and sponsored information.
What is block detection?
Block detection is the discipline that tries to isolate every information block in a webpage.


[Figure annotations: "The date is different"; "The title is different".]

Motivation
Why is template extraction useful?
- Component reuse: web developers can automatically extract components from a webpage.
- Enhancing indexers and text analyzers, which perform better when they process only relevant information. It has been measured that almost 40-50% of the components of a webpage represent the template.
- Extracting the main content of a webpage so that it can be suitably displayed on a small device such as a PDA or a mobile phone.
- Extracting the relevant content to make the webpage more accessible to visually impaired or blind users.


The Technique
What is a webpage?

The Technique
State of the Art
There are three main ways to solve the problem:
- Using the textual information of the webpage (i.e., the HTML code)
- Using the rendered image of the webpage in the browser
- Using the DOM tree of the webpage
Approaches based on the textual information:
- Densitometric features: counting characters and tags
- Statistics on terms: some terms are common in templates


The Technique
State of the Art: using the rendered image of the webpage in the browser
- Position of elements: lateral menus, main content centered and visible
- Less studied: rendering webpages is computationally expensive

The Technique
State of the Art: using the DOM tree of the webpage
- Analysis of the DOM structure: difficulty in analysing DIV-based structures
- Comparing several webpages: search for common structures

The Technique
Limitations of current approaches
- Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags).
- Some assume that the main content text is continuous.
- Some assume that the system knows a priori the format of the webpage.
- Some need to (randomly) load many webpages (several dozen) to compare them.

The Technique
Limitations of current approaches
- Some are based on the assumption that the webpage has a particular structure (e.g., based on table markup tags) [10].
- Some assume that the main content text is continuous [11].
- Some assume that the system knows a priori the format of the webpage [10].
- Some assume that the whole website to which the webpage belongs is based on a template that is repeated.
[Figure: example webpage showing a directory entry.]

The Technique
Limitations of current approaches
The main problem of these approaches is a big loss of generality: they require knowing or parsing the webpages in advance, or they require the webpage to have a particular structure. This is very inconvenient because modern webpages are mainly based on <div> tags, which do not need to be hierarchically organized (as in table-based design). Moreover, many webpages are nowadays generated automatically and dynamically, so it is often impossible to analyze them a priori.

The Technique
Other approaches are able to work:
+ Online (i.e., with any webpage)
+ In real time (i.e., without the need to preprocess the webpages or know their structure)


The Technique
The Document Object Model (DOM)
- The DOM is an API that provides programmers with a standard set of objects for the representation of HTML and XML documents.
- Given a webpage, producing its associated DOM structure is completely automatic, and vice versa.
- The DOM structure of a given webpage is a tree in which all the elements of the webpage (including scripts and CSS styles) are represented hierarchically.
[Figure: example DOM tree with Body, Div, Table, H1, Image, and Text nodes.]

The Technique
The Document Object Model (DOM)
Nodes in the DOM tree can be of two types, tag nodes and text nodes:
- Tag nodes represent the HTML tags of an HTML document and contain all the information associated with the tags (e.g., their attributes).
- Text nodes are always leaves in the DOM tree because they cannot contain other nodes.
[Figure: example DOM tree with Body, Div, Table, H1, Image, and Text nodes. Figure text: "I want to know more!"]
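The two node types can be illustrated with a small sketch. It builds a simplified DOM-like tree with Python's standard `html.parser` module; the `Node` class, the `DomBuilder` class, and the `parse_dom` helper are names made up for this illustration and are not part of the tool presented in the talk. A real browser DOM also handles malformed markup, comments, and scripts, which this sketch ignores.

```python
from html.parser import HTMLParser


class Node:
    """Simplified DOM node: kind is "tag" or "text" (illustrative only)."""

    def __init__(self, kind, label, attrs=None):
        self.kind = kind
        self.label = label              # tag name, or the text itself
        self.attrs = dict(attrs or [])  # attributes of a tag node
        self.children = []              # always empty for text nodes


class DomBuilder(HTMLParser):
    # Elements that never have a closing tag, so they are not pushed on the stack.
    VOID = {"img", "br", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = Node("tag", "#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node("tag", tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in self.VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            # Text nodes are added as leaves: nothing is ever appended to them.
            self.stack[-1].children.append(Node("text", text))


def parse_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


page = ('<body><div><h1>Title</h1><img src="a.png">'
        '<table><tr><td>I want to know more!</td></tr></table></div></body>')
dom = parse_dom(page)
```

Here `dom.children[0]` is the `body` tag node, and the text "Title" ends up as a childless text node under `h1`.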


The Technique
Our method for template extraction in a nutshell:
1. Identify a set of webpages in the website topology.
   - Select those nodes that belong to the menu.
   - Use a complete subdigraph.
2. Solve conflicts between webpages that implement different templates.
   - Establish a voting system between the webpages.
3. The template is the intersection between the initial webpage and the DOM trees in the subdigraph.
   - The intersection is computed with an Equal Top-Down Mapping between the DOM trees.
The three steps can be done with a linear cost with respect to the size of the DOM trees.

The Technique
1. Identify a set of webpages in the website topology.
   - Select those nodes that belong to the menu.
   - Use a complete subdigraph.
[Figure: website topology with a menu, a submenu, and pages in Domain A, Domain B, and Domain C.]

The Technique
1. Identify a set of webpages in the website topology.
   - Select those nodes that belong to the menu.
   - Use a complete subdigraph.
[Figure: hyperlink distance.]

The Technique
1. Identify a set of webpages in the website topology.
   - Select those nodes that belong to the menu.
   - Use a complete subdigraph.
[Figure: hyperlink distance and DOM distance.]
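As a rough illustration of step 1, the sketch below looks for a complete subdigraph around the initial page, that is, a set of pages in which every page links to every other page. The slides do not say how the subdigraph is actually searched, nor how hyperlink distance and DOM distance enter the selection, so the greedy strategy, the `complete_subdigraph` name, and the `links` dictionary format are all assumptions made for this example.

```python
def complete_subdigraph(links, key_page, max_size=None):
    """Greedily grow a set of mutually linked pages starting from key_page.

    links maps each page to the set of pages it links to (assumed format).
    A candidate is accepted only if it links to, and is linked from,
    every page already chosen, so the result is a complete subdigraph.
    """
    chosen = [key_page]
    # Candidates are the pages linked from the key page (e.g. its menu entries).
    for candidate in sorted(links.get(key_page, ())):
        if candidate == key_page:
            continue
        mutually_linked = all(
            candidate in links.get(page, ()) and page in links.get(candidate, ())
            for page in chosen
        )
        if mutually_linked:
            chosen.append(candidate)
            if max_size is not None and len(chosen) >= max_size:
                break
    return chosen


# Hypothetical site: "home", "about" and "news" share a menu; "ext" is external.
site = {
    "home": {"about", "news", "ext"},
    "about": {"home", "news"},
    "news": {"home", "about"},
    "ext": set(),
}
pages = complete_subdigraph(site, "home")
```

In this example `pages` contains `home`, `about`, and `news`; `ext` is rejected because it does not link back. A greedy pass like this does not guarantee the largest possible subdigraph.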

The Technique
2. Solve conflicts between webpages that implement different templates.
   - Establish a voting system between the webpages.

The Technique
Our method for template extraction in a nutshell:
3. The template is the intersection between the initial webpage and the DOM trees in the subdigraph.
   - The intersection is computed with an Equal Top-Down Mapping between the DOM trees.
[Figure: five DOM trees, P1 to P5, each with Body, Div, Table, H1, Image, and Text nodes.]

The Technique
Mapping:
[Figure: two DOM trees with HTML, Body, Div, Table, and P nodes, and a mapping between them.]

The Technique
Top-Down Mapping:
[Figure: two DOM trees with HTML, Body, Div, Table, and P nodes, and a top-down mapping between them.]

The Technique
Equal Top-Down Mapping:
[Figure: two DOM trees with HTML, Body, Div, Table, and P nodes, and an equal top-down mapping between them.]
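The idea behind an equal top-down mapping can be sketched as follows, under stated assumptions: two nodes may be mapped only if they are equal (here, simply the same label) and their parents are mapped too, so the mapping always starts at the roots and grows downwards. Trees are written as `(label, children)` tuples, and children are paired greedily from left to right; both choices, as well as the names `etdm` and `template`, are simplifications for this illustration and may differ from the authors' definition. The sketch also makes no attempt to reach the linear cost mentioned in the slides.

```python
def etdm(t1, t2):
    """Return the part of t1 covered by an equal top-down mapping onto t2.

    Trees are (label, [children]) tuples. Returns None when the roots differ,
    because a top-down mapping must include the roots.
    """
    label1, kids1 = t1
    label2, kids2 = t2
    if label1 != label2:
        return None
    mapped = []
    next_free = 0  # children of t2 before this index are already used
    for child in kids1:
        for k in range(next_free, len(kids2)):
            sub = etdm(child, kids2[k])
            if sub is not None:
                mapped.append(sub)
                next_free = k + 1
                break
    return (label1, mapped)


def template(key_tree, other_trees):
    """Intersect the key page with every other page, one page at a time."""
    result = key_tree
    for other in other_trees:
        result = etdm(result, other)
        if result is None:
            return None
    return result


p1 = ("html", [("body", [("div", [("p", []), ("p", [])]), ("table", [])])])
p2 = ("html", [("body", [("div", [("p", [])]), ("h1", []), ("table", [("p", [])])])])
common = template(p1, [p2])
```

For these two hypothetical pages, `common` keeps `html`, `body`, `div` with a single `p`, and `table`: the second `p` of `p1` has no counterpart in `p2`, and the `h1` of `p2` does not appear in `p1`.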


Experiments
Benchmarks: online heterogeneous webpages
- Domains with different layouts and page structures
- Company websites, news articles, forums, etc.
- Final evaluation set randomly selected
We determined the actual template of each webpage by downloading it and manually selecting the template. The DOM tree of the selected elements was then produced and used later for comparison and evaluation.
The F1 metric is computed as (2*P*R)/(P+R), where P is the precision and R is the recall.
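The F1 formula above can be worked through on a small example. The sketch assumes that both the extracted template and the manually selected template are given as sets of node identifiers; that representation and the `precision_recall_f1` name are assumptions for illustration, not the evaluation code used in the experiments.

```python
def precision_recall_f1(retrieved, gold):
    """Compare the nodes returned by a tool with the manually selected ones."""
    retrieved, gold = set(retrieved), set(gold)
    hits = len(retrieved & gold)                       # correctly retrieved nodes
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = (2 * precision * recall) / (precision + recall)
    return precision, recall, f1


# Hypothetical node identifiers: the tool returns 4 nodes, 3 of them correct,
# while the manually selected template has 5 nodes.
p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

With these numbers P = 3/4 = 0.75 and R = 3/5 = 0.6, so F1 = (2 * 0.75 * 0.6) / (0.75 + 0.6) = 0.9 / 1.35, which is about 0.67.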


Experiments
Gold standard
- Downloading the complete website of each benchmark (company websites, news articles, forums, etc.).
- Four different engineers did the following independently:
  - Manually exploring the original page and the webpages accessible from it to decide what part of the webpage is the template.
  - Printing the key page on paper and marking the template.
- The four engineers met and decided together what the template was.
- Each element marked on the printed page was mapped to the DOM tree of the initial page.
- All elements in the DOM tree that did not belong to the template were included in an HTML class non-template (i.e., we enriched the HTML code of the key page with a new class).
- This class was later used by an algorithm that we programmed to evaluate the results obtained by our tool.


Conclusions and future work
Conclusions: new technique proposed for template extraction
1. It does not make assumptions about the particular structure of webpages.
2. It only needs to process a single webpage (no templates, no other webpages of the same website are needed).
3. No preprocessing stages are needed. The technique can work online.
4. It is fully language independent (it can work with pages written in English, German, etc.).
5. The particular text formatting of the webpage does not influence the performance of the technique.

Conclusions and future work
Future work:
1. Consider that a website can implement several templates across its webpages:
   - Extend the benchmark suite by labelling all templates.
   - Develop a new technique to detect all templates of a website.
2. Combine template extraction with content extraction:
   - First, apply template extraction to remove the template.
   - Second, look for the main content in the remaining webpage.

Thank You