Assuming Accurate Layout Information is Available: How do we Interpret the Content Flow in HTML Documents? Hassan Alam and Fuad Rahman Human Computer Interaction.

Presentation transcript:

Assuming Accurate Layout Information is Available: How do we Interpret the Content Flow in HTML Documents? Hassan Alam and Fuad Rahman Human Computer Interaction Group BCL Technologies Inc. Santa Clara, CA

Overview of the Talk: Content Flow in Web Pages; Structural Flow vs. Logical Flow; Language Independence; Independence for Semantics; Content Flow from Purely Geometrical Information; Conclusion and Future Work.

Related Work: Handcrafting, Transcoding, Adaptive Re-authoring. Handcrafting typically involves a set of content experts crafting web pages by hand for device-specific output. Transcoding replaces HTML tags with suitable device-specific tags, such as HDML, WML and others. Research on web page re-authoring can either use natural language processing explicitly or rely on non-NLP techniques.

The HTML Table based Structure

How is the Table Structure Exploited? Most HTML sources use tables as the principal organizational method. We assume that a geometric parser will give us the exact position of each table and sub-table. Content is in the columns. Rows are only used to arrange content. We assume that content flow is language independent, or is it?

How is the Table Structure Exploited? Calculate inclusion criterion. Calculate xPreference list. Calculate yPreference list. Perform proximity analysis: know thy neighbors! Quantify each table: calculate area. Calculate table hierarchy based on inclusion criterion and proximity analysis. Continued …
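The hierarchy step can be illustrated with a minimal Python sketch. It assumes that a geometric parser reports each table as an axis-aligned rectangle, and that the inclusion criterion is plain geometric containment; the slides do not define it, so this is an assumption. The `Box` class and `build_hierarchy` function are hypothetical names, and the xPreference and yPreference lists and the proximity analysis are left out because the slides do not specify them.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Box:
    """Hypothetical table rectangle as reported by a geometric parser."""
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def area(self) -> float:
        # "Quantify each table: calculate area"
        return (self.right - self.left) * (self.bottom - self.top)

    def contains(self, other: "Box") -> bool:
        # Assumed inclusion criterion: `other` lies fully inside this box
        # and is strictly smaller, so identical rectangles never nest.
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom
                and self.area() > other.area())


def build_hierarchy(boxes: List[Box]) -> Dict[str, Optional[str]]:
    """Map each box name to the name of its smallest enclosing box (or None)."""
    parents: Dict[str, Optional[str]] = {}
    for box in boxes:
        enclosing = [b for b in boxes if b.contains(box)]
        parent = min(enclosing, key=Box.area, default=None)
        parents[box.name] = parent.name if parent else None
    return parents


# Illustrative page layout; the coordinates are made up.
page = Box("page", 0, 0, 800, 600)
sidebar = Box("sidebar", 0, 100, 200, 600)
main = Box("main", 200, 100, 800, 600)
story = Box("story", 220, 120, 780, 400)
hierarchy = build_hierarchy([page, sidebar, main, story])
```

Choosing the smallest enclosing rectangle as the parent keeps each table attached to its immediate container rather than to the page as a whole.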

How is the Table Structure Exploited? Calculate TOC. Calculate level of TOC. Calculate merging criterion: same inclusion criterion; lowest first; sharing identical sides; not if a border exists.
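One possible reading of the merging rules, as a hedged Python sketch: two tables may be merged only if neither has a border and one full side of one coincides with one full side of the other. The `Rect` type and both function names are hypothetical. The "same inclusion criterion" and "lowest first" rules are assumed to be enforced by the caller, for example by only passing sibling tables, starting from the deepest level of the hierarchy.

```python
from typing import NamedTuple, Optional


class Rect(NamedTuple):
    """Hypothetical table rectangle; `has_border` marks a visible border."""
    left: float
    top: float
    right: float
    bottom: float
    has_border: bool = False


def shares_identical_side(a: Rect, b: Rect) -> bool:
    """True if one full side of `a` coincides with one full side of `b`."""
    side_by_side = (a.top == b.top and a.bottom == b.bottom
                    and (a.right == b.left or b.right == a.left))
    stacked = (a.left == b.left and a.right == b.right
               and (a.bottom == b.top or b.bottom == a.top))
    return side_by_side or stacked


def try_merge(a: Rect, b: Rect) -> Optional[Rect]:
    """Merge two sibling rectangles, or return None if the rules forbid it."""
    if a.has_border or b.has_border:
        return None  # "Not if a border exists"
    if not shares_identical_side(a, b):
        return None  # "Sharing identical sides"
    return Rect(min(a.left, b.left), min(a.top, b.top),
                max(a.right, b.right), max(a.bottom, b.bottom))
```

Exact coordinate equality is used here for simplicity; a real geometric parser would probably need a small tolerance.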

The HTML Table based Structure

Map of Table Layout

What is the Advantage of this Analysis? The relative importance of content can be assessed, resulting in better re-authoring. It becomes possible to capture the contextual relationships among the components of a document, such as which block is a sidebar, which is an advertisement, which is a top bar, and so on. If needed, other natural language techniques can be used to correlate tables by semantics or other criteria.

Current Work: XML is being successfully used in many applications to mark up important information according to application-specific vocabularies. This is an exploratory paper offering a specific pathway to the future of web page re-authoring, provided accurate layout information is available. Two W3C Recommendations, XSLT (Extensible Stylesheet Language Transformations) and XPath (the XML Path Language), meet the need to transform such marked-up documents and to address parts of them. It is probably better to use XSLT, which itself uses XPath, to specify how an XSLT processor is to create a desired output from a given marked-up input.
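To give a feel for path-based selection over layout markup, here is a small Python sketch. It is not XSLT: it uses the limited XPath subset of the standard library's `xml.etree.ElementTree`, and a full XSLT processor would be needed for real transformations. The XML vocabulary (`page`, `block`, `class`, `text`) and the `select_text` helper are assumptions made for illustration, not the authors' format.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout markup; element and attribute names are assumptions.
LAYOUT_XML = """
<page>
  <block class="sidebar"><text>Site links</text></block>
  <block class="main"><text>Lead story</text></block>
  <block class="advertisement"><text>Buy now</text></block>
  <block class="main"><text>Second story</text></block>
</page>
"""


def select_text(xml_source: str, block_class: str) -> list:
    """Return the text of every block of the given class, in document order."""
    root = ET.fromstring(xml_source)
    # ElementTree supports attribute predicates such as [@class='main'].
    path = f".//block[@class='{block_class}']/text"
    return [node.text for node in root.findall(path)]


main_text = select_text(LAYOUT_XML, "main")
```

A re-authoring step for a small screen could, for instance, keep only the `main` blocks and drop sidebars and advertisements.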

Future Work: Exact location of each block, in rectangular coordinates, equivalent to rendition using a standard browser. Size of each block of content. Type of content, e.g. text, graphics etc. Weight of content, in terms of size and placement within a page. Continuity information, derived from physical association in terms of geometrical collocation. Classification of content into a set of pre-defined classes, e.g. main story, sidebars, links and so on. Linkage information from the XML representation, indicating the layers of information that can be hidden at a level of summary. This can represent the content in many levels, but more than two or three levels are unsuitable for easy navigation.
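A minimal Python sketch of how some of the items listed above might be serialized to XML. Every element and attribute name, and the `block_to_xml` function itself, is hypothetical. The weight is approximated as the fraction of the page area a block covers; this simplification uses size only and ignores placement, which the slides also mention.

```python
import xml.etree.ElementTree as ET


def block_to_xml(name, left, top, width, height, content_type, block_class,
                 page_width, page_height):
    """Serialize one content block; all element/attribute names are assumptions."""
    block = ET.Element("block", {"name": name, "class": block_class,
                                 "type": content_type})
    # Location and size in rectangular (pixel) coordinates.
    ET.SubElement(block, "location", {"x": str(left), "y": str(top)})
    ET.SubElement(block, "size", {"width": str(width), "height": str(height)})
    # Assumed weight: fraction of the page area covered by the block.
    weight = (width * height) / (page_width * page_height)
    ET.SubElement(block, "weight").text = f"{weight:.3f}"
    return block


story = block_to_xml("story", 200, 100, 400, 300, "text", "main", 800, 600)
xml_text = ET.tostring(story, encoding="unicode")
```

Continuity and linkage information could be added in the same way, for example as child elements that reference other blocks by name.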

Conclusions: This work outlines a specific pathway to the future of web page re-authoring, provided accurate layout information is available. It in no way represents a state-of-the-art discussion of the possible uses of layout information. Rather, it focuses on one small part within an array of possibilities. It will be interesting to discuss other possibilities in this space during the DLIA workshop.