Understanding the Flow of Content in Summarizing HTML Documents A. Rahman H. Alam R. Hartono Document Analysis and Recognition Team (DART) BCL Computers.

1 Understanding the Flow of Content in Summarizing HTML Documents A. Rahman H. Alam R. Hartono Document Analysis and Recognition Team (DART) BCL Computers Inc. Santa Clara, Calif, USA

2 Basic Problem Statement How do we summarize web-based documents? Does HTML structure give us any clue to the understanding of the content? Does the flow of content have anything to do with the main message?

3 Why Summarization? Display area of handheld devices, e.g. PDAs and cell phones, is too small for useful web browsing Download times are still too slow for comfortable browsing using wireless devices Cost factor is still too high

4 Current need? Viewing websites on small-screen handheld devices Since websites are written in HTML, we need to translate them into formats that the wireless devices can support.

5 Current Solutions Handcrafting: –Custom web sites are typically crafted by hand by a set of content experts Transcoding: –Transcoding replaces HTML tags with suitable device-specific tags (HDML, WML etc.)
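
To make the transcoding idea concrete, here is a minimal sketch of tag-for-tag replacement. The mapping table, the set of tags passed through, and the function name `transcode` are assumptions for illustration only; they are not taken from the presentation or from any real transcoder, which would also have to handle attributes, nesting, and the deck structure of the target format.

```python
import re

# Hypothetical HTML-to-WML-style mapping (illustration only).
TAG_MAP = {"h1": "big", "h2": "big", "strong": "b", "em": "i"}
# Tags assumed to exist in the target format and passed through unchanged.
KEEP = {"p", "b", "i", "br", "a"}

TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")

def transcode(html: str) -> str:
    """Replace each HTML tag with a mapped tag, keep it, or drop it."""
    def swap(match):
        slash, name, attrs = match.group(1), match.group(2).lower(), match.group(3)
        if name in TAG_MAP:
            return f"<{slash}{TAG_MAP[name]}>"   # mapped tag, attributes discarded
        if name in KEEP:
            return f"<{slash}{name}{attrs}>"     # supported tag, kept as-is
        return ""                                # unsupported tag, text is kept
    return TAG_RE.sub(swap, html)

print(transcode("<h1>News</h1><div><strong>Hello</strong> world</div>"))
```

Note that the output keeps every piece of text from the page: the tags change, but nothing is shortened or ranked.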

6 Handcrafting Automation –Use of XML. There is no standard XML tagset (Document Type Definition – DTD) in use by vendors. XML has been available to web designers for the last 10 years. Examination of websites shows little use of document structural elements. –Web masters see themselves as artists rather than programmers. –XML may meet the same fate as SGML, an earlier attempt to create structured documents.

7 Handcrafting Take an existing website and make it available for wireless access. Aether Systems, Mshift and 2Roam currently offer these types of solutions. Use a proprietary graphical interface to ease the development of wireless applications from scratch. Covigo and iConverse offer these types of solutions. Let the user do all coding in languages such as C++ or Java. ThinAirApps offers this type of solution.

8 Handcrafting Labor intensive Expensive. Typically less than 1% of a web site gets converted to wireless content.

9 Transcoding Transcoding was introduced in Japan during 1999-2000. It was widely rejected by Japanese users. Recently, Google and Pixo introduced this solution for the US market, but have so far failed to attract the attention of end users.

10 The Alternate Solution Separate the content into smaller segments Generate a summary of each of these segments Prioritize the summaries of the individual segments Put them together to form a summary of the overall document
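
The four steps above can be sketched as a small pipeline. Everything in this sketch is an assumption chosen to keep it short: segments are separated by blank lines, a segment's summary is its first sentence, and priority is its word count. The presentation does not say how its system performs any of these steps, and the function names are hypothetical.

```python
import re

def split_segments(text):
    # Assumption: blank lines mark segment boundaries.
    return [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]

def summarize_segment(segment):
    # Stand-in summary: the first sentence of the segment.
    return re.split(r"(?<=[.!?])\s+", segment, maxsplit=1)[0]

def priority(segment):
    # Crude importance proxy: number of words in the segment.
    return len(segment.split())

def summarize_document(text, top_n=2):
    segments = split_segments(text)
    # Keep the top_n highest-priority segments, most important first.
    ranked = sorted(segments, key=priority, reverse=True)[:top_n]
    return [summarize_segment(s) for s in ranked]

page = (
    "Short note.\n\n"
    "The main story is long. It has many more words than the others.\n\n"
    "A medium segment here. More text."
)
for line in summarize_document(page):
    print(line)
```

Any of the three helper functions can be swapped for something smarter without changing `summarize_document`, which is the point of splitting the work into these steps.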

11 Summarization vs. Transcoding Long displays Long download times Finding information difficult No mapping of the importance of content in the original document

12 Steps to Summarization Structural analysis: Understanding the relationship of the various segments with the document Decomposition: Breakdown of these segments into operational units Contextual Analysis: Employment of context to revise the segmentation (Continued=>)

13 Steps to Summarization (Continued) Labeling => Segment Summary: Extraction of a low-level summary of the segment Priority: Estimating the importance of these segments Table of Contents (TOC) => Document Summary: Putting together a summary of the document
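
As a rough illustration of how HTML structure can drive these steps, the sketch below uses Python's standard `html.parser` to start a new segment at every heading, uses the heading text as the segment label, and orders the labels by the amount of text under each one to form a table of contents. Treating headings as segment boundaries and word count as priority are assumptions made for this example, not the method described in the presentation; the class and function names are hypothetical, and the contextual-analysis step is not modeled.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class SegmentParser(HTMLParser):
    """Splits a page into segments, starting a new one at each heading."""

    def __init__(self):
        super().__init__()
        self.segments = []        # each item: {"label": str, "text": str}
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self._in_heading = True
            self.segments.append({"label": "", "text": ""})

    def handle_endtag(self, tag):
        if tag in HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        if not data.strip():
            return
        if not self.segments:
            # Text that appears before any heading gets a placeholder label.
            self.segments.append({"label": "(untitled)", "text": ""})
        key = "label" if self._in_heading else "text"
        seg = self.segments[-1]
        seg[key] = (seg[key] + " " + data.strip()).strip()

def table_of_contents(html):
    """Return segment labels ordered by word count, largest first."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    ranked = sorted(parser.segments,
                    key=lambda s: len(s["text"].split()), reverse=True)
    return [s["label"] for s in ranked]

page = ("<h1>Sports</h1><p>Short.</p>"
        "<h2>Weather</h2><p>Rain is expected across the region today.</p>")
print(table_of_contents(page))
```

Many real pages mark sections with tables, fonts, or images rather than headings, so a heading-only split like this one would often miss the actual segment boundaries.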

14 Supported Devices and Formats PDAs (HTML 3.2) Cell phones –USA/Europe: WAP –Japan: iMode (NTT DoCoMo), J-Sky (J-Phone), EZWeb (KDDI)

15 Conclusion It is a good idea to use the flow of content in understanding web documents Content can be used effectively to summarize web documents HTML structure is a good starting point, but not enough to understand context Summarization offers significant advantages over transcoding Summarization also makes for a faster browsing experience

