Web Document Analysis: How can Natural Language Processing Help in Determining Correct Content Flow? Hassan Alam, Fuad Rahman and Yuliya Tarnikova Human.

1 Web Document Analysis: How can Natural Language Processing Help in Determining Correct Content Flow? Hassan Alam, Fuad Rahman and Yuliya Tarnikova Human Computer Interaction Group BCL Technologies Inc. Santa Clara, CA 95050

2 Overview of the talk Web document re-authoring HTML data structure and segmentation Merging and the “mess” Semantic Relatedness of Textual Segments Spoken Language User Interface Toolkit (SLUITK) How do we do it? Some applications Conclusion and future work

3 Web Page Data Structure

4 Merging R Us While merging two segments, the only information available to the merging algorithm is the proximity map and the broad content classification. Totally unrelated content can easily pass both of these tests, in which case the merging algorithm fails.
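The failure mode described on this slide can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Segment` fields, the `can_merge` function, and the `max_gap` threshold are hypothetical and are not taken from the authors' system. It only shows how a test based on proximity and broad content class can accept two unrelated segments.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical page segment: vertical extent, broad class, and text."""
    top: int
    bottom: int
    content_class: str  # broad class such as "text", "image", "links"
    text: str


def can_merge(a: Segment, b: Segment, max_gap: int = 20) -> bool:
    """Assumed merge test: segments are close together and share a broad class."""
    # Vertical gap between the two segments; 0 if they touch or overlap.
    gap = max(b.top - a.bottom, a.top - b.bottom, 0)
    return gap <= max_gap and a.content_class == b.content_class


news = Segment(0, 100, "text", "The council approved the new transit budget.")
advert = Segment(110, 160, "text", "Lower your mortgage rate today.")
footer = Segment(400, 450, "text", "Contact the transit office for details.")

# The advert is adjacent and has the same broad class, so it passes both tests
# even though its text has nothing to do with the news story.
print(can_merge(news, advert))  # True
print(can_merge(news, footer))  # False
```

Neither test looks at what the text says, which is the gap the linguistic analysis in the rest of the talk is meant to fill.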

5 eMerging Questions? How do we determine if two separate web document segments contain related information? What is the definition of 'relatedness'? If other segments are geometrically embedded within closely related segments, can we determine if those embedded segments are also related to the surrounding segments? When a hyperlink is followed and a new page is accessed, how do we know exactly which segment within that new page is directly related to the link we just followed?

6 Natural Language Processing Syntax Semantics Context Anaphora Tokenizing Theme

7 Our Answer Lexical Chains

8 A lexical chain is a sequence of related words in a narrative. It can be composed of adjacent words or sentences, or it can cover elements from the complete narrative. Cohesion is a way of connecting different parts of a text into a single theme. A lexical chain is a list of semantically related words, constructed by the use of co-reference, ellipsis and conjunctions. It aims to identify the relationship between words that tend to co-occur in the same lexical context.

9 Lexical Chains Coreference: the grammatical relation between two words that have a common referent. Example: 'You said you would come.' In this sentence, both occurrences of 'you' have the same referent. Ellipsis: omission or suppression of parts of words or sentences. Example: 'the virtues I admire' for 'the virtues which I admire'. Conjecture: reasoning that involves the formation of conclusions from incomplete evidence. Example: 'Scientists supposed that large dinosaurs lived in swamps.'
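As a rough illustration of the idea of a lexical chain, the sketch below groups the words of a segment by a shared concept and then compares two segments by the concepts their chains have in common. It is a toy under stated assumptions: the `CONCEPTS` table is a hand-made stand-in for a real lexical resource, and `build_chains` and `chain_overlap` are hypothetical names. It does not handle co-reference or ellipsis, and the overlap score is not the relatedness factor used in the talk.

```python
import re

# Hypothetical toy lexicon mapping words to a concept label. A real system
# would rely on a proper lexical resource; this table is an assumption.
CONCEPTS = {
    "car": "vehicle", "truck": "vehicle", "engine": "vehicle", "wheel": "vehicle",
    "bank": "finance", "loan": "finance", "interest": "finance", "money": "finance",
}


def build_chains(text: str) -> dict:
    """Group known words by concept, keeping each word's position in the text."""
    chains = {}
    for position, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        concept = CONCEPTS.get(word)
        if concept is not None:
            chains.setdefault(concept, []).append((position, word))
    return chains


def chain_overlap(chains_a: dict, chains_b: dict) -> float:
    """Jaccard overlap of the concepts covered by two sets of chains."""
    union = set(chains_a) | set(chains_b)
    if not union:
        return 0.0
    return len(set(chains_a) & set(chains_b)) / len(union)


seg_a = build_chains("The car needs a new engine and a wheel.")
seg_b = build_chains("The truck has a big engine.")
seg_c = build_chains("The bank raised interest on the loan.")

print(seg_a["vehicle"])             # [(1, 'car'), (5, 'engine'), (8, 'wheel')]
print(chain_overlap(seg_a, seg_b))  # 1.0
print(chain_overlap(seg_a, seg_c))  # 0.0
```

A chain here can span the whole segment, matching the point that chain members need not be adjacent words.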

10 What is SLUI TK? SLUI TK is a set of tools that allows programmers to rapidly develop applications with Natural Language Processing functionality.

11 SLUI TK Steps for the Programmer to Follow while Setting up the Toolkit

12 SLUI TK An innovative way to assist programmers with no linguistic knowledge in developing programs that can understand, process, and act upon spoken Natural Language (NL) input.

13 Our Frame Can you suggest some internet sites or books that give details on lowering the LDL by 50 points without including information on cancer risks?

14 BCL Database Sentences collected from email messages received between June 2000 and May 2001. Attachments, HTML and other tags, header files, and senders' information were deleted, as were salutations and greetings. The corpus totals 34,640 lines and 170,000 words. We constantly update our corpus with new emails from our customers.
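The kind of cleanup described for this corpus might look like the sketch below. The patterns and the `clean_email` function are assumptions for illustration only; the slide does not say how the deletion was actually done, and real email needs more careful handling (attachments and signatures are not covered here).

```python
import re

# Assumed patterns: common header fields, greeting/closing lines, markup tags.
HEADER = re.compile(r"^(from|to|cc|bcc|subject|date):", re.IGNORECASE)
SALUTATION = re.compile(
    r"^(dear|hi|hello|regards|best regards|sincerely|thanks|thank you)\b",
    re.IGNORECASE,
)
TAG = re.compile(r"<[^>]+>")


def clean_email(raw: str) -> list:
    """Return the content lines of a message with tags, headers and greetings removed."""
    kept = []
    for line in raw.splitlines():
        line = TAG.sub("", line).strip()  # drop HTML and other tags
        if not line or HEADER.match(line) or SALUTATION.match(line):
            continue
        kept.append(line)
    return kept


raw_message = (
    "From: customer@example.com\n"
    "Subject: LDL question\n"
    "Dear team,\n"
    "<p>Can you suggest some books on lowering LDL?</p>\n"
    "Regards,\n"
)

print(clean_email(raw_message))
# ['Can you suggest some books on lowering LDL?']
```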

15 Our Lexical Chains

16 Relatedness Factor

17 An Application: Web Page Re-authoring

18 Segment Scores

19 Example Output

20 Future Work At present only a single main theme can be handled per document. In future we are going to address a more generic solution that can handle documents with multiple themes. Integration of this NLP method into commercial summarizers, and its use in aiding existing web page summarization techniques that are based on structural analysis alone, is already well underway. Further goals are determining the flow of web information between different web pages as the browser loads new pages by following hyperlinks, and aiding geometric web parsers in determining the correct logical layout by complementing geometric information with linguistic coherence.

21 Conclusions We presented a novel approach to determining the semantic relationship among segments of web documents using lexical chain computation. There are two related papers in ICDAR 2003. One will explore the application of lexical chains in building a commercial summarizer capable of summarizing any document. The other will concentrate on a hybrid approach to web page summarization, combining structural and NLP techniques.
