Web Document Analysis: How can Natural Language Processing Help in Determining Correct Content Flow? Hassan Alam, Fuad Rahman and Yuliya Tarnikova Human.


Web Document Analysis: How can Natural Language Processing Help in Determining Correct Content Flow? Hassan Alam, Fuad Rahman and Yuliya Tarnikova Human Computer Interaction Group BCL Technologies Inc. Santa Clara, CA

Overview of the talk
- Web document re-authoring
- HTML data structure and segmentation
- Merging and the “mess”
- Semantic relatedness of textual segments
- Spoken Language User Interface Toolkit (SLUITK)
- How do we do it?
- Some applications
- Conclusion and future work

Web Page Data Structure

Merging R Us
While merging two segments, the only information available to the merging algorithm is the proximity map and a broad content classification. It is not uncommon for totally unrelated content to pass these tests, causing the merging algorithm to fail.

eMerging Questions?
- How do we determine if two separate web document segments contain related information?
- What is the definition of 'relatedness'?
- If other segments are geometrically embedded within closely related segments, can we determine if these segments are also related to the surrounding segments?
- When a hyperlink is followed and a new page is accessed, how do we know exactly which segment within that new page is directly related to the link we just followed?

Natural Language Processing: Syntax, Semantics, Context, Anaphora, Tokenizing, Theme

Our Answer: Lexical Chains

A lexical chain is a sequence of related words in a narrative. It can be composed of adjacent words or sentences, or it can cover elements from the complete narrative. Cohesion is a way of connecting different parts of a text into a single theme. A lexical chain is a list of semantically related words, constructed by the use of coreference, ellipses and conjunctions. It aims to identify the relationship between words that tend to co-occur in the same lexical context.
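The idea of grouping related words into chains can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the talk: the `RELATED` table is a hypothetical hand-made stand-in for a real lexical resource, and `related` and `build_chains` are names invented for this example. Each word joins the first existing chain that contains a related word; otherwise it starts a new chain.

```python
from typing import Dict, FrozenSet, List

# Hypothetical relatedness table (an assumption for illustration only).
# A real system would consult a thesaurus or similar lexical resource.
RELATED: Dict[str, FrozenSet[str]] = {
    "cholesterol": frozenset({"ldl", "diet"}),
    "browser": frozenset({"hyperlink", "page"}),
}


def related(a: str, b: str) -> bool:
    """Two words are related if identical or linked in either direction."""
    return (
        a == b
        or b in RELATED.get(a, frozenset())
        or a in RELATED.get(b, frozenset())
    )


def build_chains(words: List[str]) -> List[List[str]]:
    """Greedily group words into chains of related words, in reading order."""
    chains: List[List[str]] = []
    for word in words:
        w = word.lower()
        for chain in chains:
            # Join the first chain that already holds a related word.
            if any(related(w, member) for member in chain):
                chain.append(w)
                break
        else:
            # No existing chain fits, so this word starts a new one.
            chains.append([w])
    return chains


if __name__ == "__main__":
    text = ["cholesterol", "browser", "LDL", "page", "diet", "hyperlink"]
    for chain in build_chains(text):
        print(chain)
```

The greedy strategy keeps the sketch short; it does not try to resolve words that could belong to more than one chain.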

Lexical Chains
- Coreference: The grammatical relation between two words that have a common referent. Example: 'You said you would come.' In this sentence, both occurrences of 'you' have the same referent.
- Ellipsis: Omission or suppression of parts of words or sentences. Example: 'the virtues I admire' for 'the virtues which I admire'.
- Conjecture: Reasoning that involves the formation of conclusions from incomplete evidence. Example: 'Scientists supposed that large dinosaurs lived in swamps.'

What is SLUI TK?
SLUI TK is a set of tools that allows programmers to rapidly develop applications with Natural Language Processing functionality.

SLUI TK: Steps for the Programmer to Follow while Setting up the Toolkit

SLUI TK: An innovative way to assist programmers with no linguistic knowledge in developing programs that can understand, process, and act upon spoken Natural Language (NL) input.

Our OurFrame Can you suggest some internet sites or books that give details on lowering the LDL by 50 points without including information on cancer risks?

BCL Database
- Sentences collected from messages received between June 2000 and May 2001.
- Deleted attachments, HTML and other tags, header files, and senders’ information. Salutations and greetings were also deleted.
- Total of 34,640 lines and 170,000 words.
- We constantly update our corpus with new messages from our customers.
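A cleaning pass of the kind described for the corpus might look roughly like the sketch below. The patterns and the function name `clean_message` are assumptions made for illustration; the talk does not say how the deletion was actually done. The sketch strips markup tags, then drops header lines and lines that open with a greeting.

```python
import re
from typing import List

# Illustrative patterns only (assumptions, not the rules used for the corpus).
TAG = re.compile(r"<[^>]+>")
HEADER = re.compile(r"^(from|to|cc|subject|date):", re.IGNORECASE)
GREETING = re.compile(r"^(hi|hello|dear)\b", re.IGNORECASE)


def clean_message(raw: str) -> List[str]:
    """Return the content lines of a message with tags, headers and greetings removed."""
    kept: List[str] = []
    for line in raw.splitlines():
        line = TAG.sub(" ", line)       # replace markup tags with a space
        line = " ".join(line.split())   # collapse runs of whitespace
        if not line or HEADER.match(line) or GREETING.match(line):
            continue
        kept.append(line)
    return kept


if __name__ == "__main__":
    message = (
        "From: someone@example.com\n"
        "Subject: LDL\n"
        "Hi there,\n"
        "<p>Can you suggest <b>books</b> on LDL?</p>\n"
    )
    print(clean_message(message))
```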

Our Lexical Chains

Relatedness Factor
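The definition shown on this slide is not preserved in the transcript. As a purely illustrative assumption, and not the authors' formula, one simple way to score a segment against a document's lexical chains is the fraction of the segment's words that already appear in those chains. The function name `relatedness_factor` and the sample data are hypothetical.

```python
from typing import List


def relatedness_factor(chains: List[List[str]], segment_words: List[str]) -> float:
    """Fraction of segment words found in any chain (0.0 for an empty segment).

    Assumed scoring rule for illustration; not taken from the talk.
    """
    if not segment_words:
        return 0.0
    vocabulary = {word for chain in chains for word in chain}
    hits = sum(1 for word in segment_words if word.lower() in vocabulary)
    return hits / len(segment_words)


if __name__ == "__main__":
    main_chains = [["cholesterol", "ldl", "diet"]]
    print(relatedness_factor(main_chains, ["LDL", "diet", "tips"]))
    print(relatedness_factor(main_chains, ["cheap", "flights"]))
```

Under this assumed rule, a segment that shares most of its vocabulary with the main chains scores close to 1, while an unrelated segment such as an advertisement scores close to 0.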

An Application: Web Page Re-authoring

Segment Scores

Example Output

Future Work
- Only a single main theme can be handled per document. In the future, we plan to address a more generic solution that can handle documents with multiple themes.
- Integration of this NLP method into commercial summarizers, and its use in aiding existing web page summarization techniques that are based on structural analysis alone, is already well underway.
- Determining the flow of web information between different web pages as the browser loads new pages following hyperlinks.
- Aiding geometric web parsers in determining the correct logical layout by complementing geometric information with linguistic coherence.

Conclusions
- A novel approach to determining semantic relationships among segments of web documents using lexical chain computation.
- Two related papers in ICDAR 2003:
  - One will explore the application of lexical chains in building a commercial summarizer capable of summarizing any document.
  - The other will concentrate on a hybrid approach to web page summarization, combining structural and NLP techniques.