LIDA 2003 Invited Paper
The Challenge of Finding Information in Long Documents
David J Harper
The Robert Gordon University, Smart Web Technologies Centre, School of Computing, Aberdeen, Scotland

Preamble
- Information retrieval research has focussed largely on document retrieval, and rather less on within-document retrieval
- Within-document retrieval is just one of a range of tools and techniques that address “retrieval-with-reading” activities
- Explore language modelling as a principled basis for “retrieval-with-reading” techniques and tools

Outline of Talk
- Categorisation of retrieval-with-reading activities
- Review of retrieval-with-reading techniques and tools
- Language Modelling 101
- ProfileSkim: Relevance Profiling Tool
- Applying Language Modelling to retrieval-with-reading activities
- Concluding Remarks

Categorization of Reading Activities
Reading to …
- … select a document
  - Buying a book
  - Opening a webpage retrieved by a search engine
  - Deciding to read a document
- … extract/locate specific information
  - Finding a quotation in a book
  - Locating contact details on a webpage
- … reference information (more generally)
  - Finding supporting information for a legal case
  - Finding related work

Categorization of Reading Activities (cont)
Reading to …
- … write a document
  - Usually involves a complex mix of other reading activities
- … explore the information space from a given “pivot” document
  - Follow up bibliographic references in a paper
  - Follow hypertext links in web pages
  - Find similar documents
- … understand a document in depth
  - Reading a book/paper cover-to-cover
  - Skimming a book/paper

Reading to Select a Document
- Enabled by various forms of document summarisation or overview
- Summarisation of documents, e.g. automatic abstracting or extracting
- Snippet summarisation of web pages retrieved by search engines:
  - Generic summarisation
  - Query-biased summarisation
- Overviews of document structure/content

Reading to Select a Document (Example 1)
Query-biased web page summarisation
- Generating summaries for use in ranked retrieval display
- Summaries based on the distribution of words in the document (title, headings, body), biased towards query words
- Top-scoring sentences used in the summary
- User experiments confirm that query-biased summaries are better than general summaries
- Tombros and Sanderson 1998

Reading to Select a Document (Example 2)
- TileBars: compact visualisation of retrieved documents with respect to the query (topic), showing:
  - the relative length of each document,
  - the frequency of the topic words in the document, and
  - the distribution of the topic words with respect to the document and to each other
- Hearst 1995

Reading to Extract Specific Information
- Information extraction techniques that extract factoids (and usually populate a database) based on templates, e.g. extracting contact details from web pages, Ask Jeeves
- Passage (or snippet) retrieval, where the passage contains the desired specific information
- Browsing tools and techniques:
  - Query term highlighting within retrieved documents
  - Find function in web browser/word processing package (woeful)

Reading to Reference Information in a Document
- Reading tools that integrate document overviews (e.g. table of contents) and document view
- Passage retrieval, provided that passages rather than documents are retrieved
- Within-document retrieval tools
  - ProfileSkim: passage retrieval in context

Reading to Write a Document
- Interleaving of writing and reading sub-tasks
- Mix of different kinds of reading activities
- Example: Remembrance Agent
  - Augments the user while writing (unobtrusive)
  - Displays documents (emails, notes, online documents) relevant to the user’s current context
  - Monitors writing/browsing activity and displays one-line summaries in the document editor (Emacs)
  - Rhodes and Starner 1996

Reading to Explore from Pivot Document
- Follow up references, papers by the same author, same group, etc. CiteSeer is the obvious tool on the Web
- Find nearest-neighbour documents by essentially using the pivot document as a query, e.g. “More Like This” function
- Explore the category in which the document is located, e.g. documents in an NLM MeSH category, web pages in a Yahoo! category
- Follow hard-wired hypertext links
  - Within- and between-document cross references
- Follow “soft” hypertext links
  - Use a chunk of document text as a query [Plagiarism Story]

Reading to Understand or Study a Document
- In general, will involve a mix of other kinds of reading activity
- Annotation (including the ability to add dynamic cross references) and “clipping” are arguably as important as reading

“Reading” of Multi-media Documents
- The kinds of reading activity are equally applicable to multimedia documents
  - Reading to select: video or soundtrack
  - Reading to extract: quotation in audio speech
  - Reading to reference: scene/shot retrieval in a video

Language Modelling 101
- (Simple) statistical representation of a “chunk” of text, e.g. of a document, paragraph, etc.
- Simplest model is the “bag of words” model, which essentially:
  - Counts frequencies of words (tokens) in the text
  - Interprets the counts as a probability distribution
- Use the distributions to compare different text chunks!
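The two steps of the “bag of words” model (count tokens, then read the counts as probabilities) can be sketched in a few lines of Python. This is a minimal illustration only: the function names `tokenize` and `unigram_model` and the simple letters-only tokenizer are assumptions made for the example, not part of the talk.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def unigram_model(text):
    """Map each word to its relative frequency in the text (a bag-of-words model)."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Hypothetical example chunk of text.
model = unigram_model("retrieval of information: information retrieval systems")
```

Word order is discarded on purpose: only the relative frequencies survive, which is what makes models of different chunks directly comparable.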

“Bag of Words” Example
- Consider the relevance of this document with respect to the queries { TREC, experiment } and { precision, recall }

Document word    Probability (relative frequency)
evaluation       0.05
retrieval        0.15
information      0.15
system           0.15
TREC             0.25
experiment       0.15
precision        0.05
recall           0.05
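One way to compare the two queries against this document, assuming the usual unigram reading in which a query is generated by independent draws from the document model, is to multiply the probabilities of the query words. The sketch below uses the probabilities from the example; the function name `query_likelihood` is hypothetical.

```python
import math

# Word probabilities copied from the example document model above.
doc_model = {
    "evaluation": 0.05, "retrieval": 0.15, "information": 0.15,
    "system": 0.15, "TREC": 0.25, "experiment": 0.15,
    "precision": 0.05, "recall": 0.05,
}


def query_likelihood(query_terms, model):
    """Probability of generating the query, treating terms as independent draws."""
    return math.prod(model.get(term, 0.0) for term in query_terms)


p_trec = query_likelihood(["TREC", "experiment"], doc_model)   # 0.25 * 0.15
p_pr = query_likelihood(["precision", "recall"], doc_model)    # 0.05 * 0.05
```

Under this scoring the document is a better match for { TREC, experiment } (0.25 × 0.15 = 0.0375) than for { precision, recall } (0.05 × 0.05 = 0.0025).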

Language Modelling 101 (cont)
- Language models can be built over any chunks of text:
  - Collection or (arbitrary) set of documents
  - Entire document
  - Parts of a document
- Given Text1 and Text2, and corresponding language models Model T1 and Model T2, we can use them to:
  - Compare the similarity of texts by comparing their models: Model T1 with Model T2, e.g. document with document
  - Decide if a text could be “generated” from another text: Probability of (Model T1 -> Text2), e.g. document -> query, often expressed as Prob( Query | Model Document )
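The slides do not say which measure is used to compare two models. As one possible choice, the sketch below uses the Jensen-Shannon divergence, which is symmetric and stays finite even when the vocabularies differ. The measure, the function names and the example texts are all assumptions made for illustration.

```python
from collections import Counter
import math


def unigram_model(tokens):
    """Relative word frequencies of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two word distributions.

    0 means identical models; 1 means the models share no vocabulary.
    """
    vocab = set(p) | set(q)

    def kl_to_mean(a, b):
        # KL divergence from distribution a to the average of a and b.
        total = 0.0
        for w in vocab:
            pa = a.get(w, 0.0)
            if pa > 0.0:
                mean = 0.5 * (pa + b.get(w, 0.0))
                total += pa * math.log2(pa / mean)
        return total

    return 0.5 * kl_to_mean(p, q) + 0.5 * kl_to_mean(q, p)


# Hypothetical text chunks.
m1 = unigram_model("language models for information retrieval".split())
m2 = unigram_model("language models for speech recognition".split())
m3 = unigram_model("buying a book online".split())
```

Here `js_divergence(m1, m2)` is smaller than `js_divergence(m1, m3)`, reflecting that the first two chunks share most of their words while the third shares none.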

Using Language Models for Retrieval Processes
- Similarity of text chunks, e.g. document with document
- Matching based on the probability of generating one text chunk from another, e.g. query from document
[Figure labels: Document 1, Document 2, Model of 1, Model of 2; Document D, Model of D, Query; Model T1, Model T2, Pr (Model T1 -> Text2)]

ProfileSkim
- Developed to support retrieval within long documents
- Within-document retrieval tool: supports reading to extract and reading to reference
- Main concept: relevance profiling based on language modelling
- Harper et al. 2002, 2003

Overview of ProfileSkim Tool
[Screenshot labels: File to skim; Skim query; Tile being visited; Highlighted query term variants]

Relevance Profile Meter (1)
[Figure labels: Relevance Profile Meter; Retrieval Status Value (vertical axis); Word position (horizontal axis); Document; Tile; Click and visit...]

Relevance Profiling Process
[Figure labels: Sliding window; P(query | window); Tile; max -> tile RSV]

Profile Generation using Language Modelling
- Sliding window of fixed size (N words)
- Compute the “retrieval status value” RSV_window at each word position in the document
- RSV_window = P( generate query | window )
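The profile generation steps can be sketched as follows. This is an illustrative sketch, not the ProfileSkim implementation: the function names, the window size, the mixing weight `lam`, and in particular the smoothing of each window model with the whole-document model are assumptions made so that a window missing one query word does not score exactly zero. The per-tile maximum follows the "max -> tile RSV" label in the Relevance Profiling Process slide.

```python
from collections import Counter


def relevance_profile(doc_tokens, query_terms, window_size=5, lam=0.8):
    """Return one RSV per word position: P(query | window centred there).

    Assumption of this sketch: each window model is mixed with the
    whole-document model (weight 1 - lam) to avoid zero probabilities.
    Windows are clipped at the start and end of the document.
    """
    doc_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    half = window_size // 2
    profile = []
    for pos in range(doc_len):
        window = doc_tokens[max(0, pos - half): pos + half + 1]
        win_counts = Counter(window)
        rsv = 1.0
        for term in query_terms:
            p_win = win_counts[term] / len(window)
            p_doc = doc_counts[term] / doc_len
            rsv *= lam * p_win + (1.0 - lam) * p_doc
        profile.append(rsv)
    return profile


def tile_rsvs(profile, tile_size):
    """Summarise the profile per fixed-size tile by its maximum RSV."""
    return [max(profile[i:i + tile_size])
            for i in range(0, len(profile), tile_size)]


# Hypothetical document: the query words occur together once, mid-document.
tokens = ["x"] * 10 + ["language", "model"] + ["x"] * 10
profile = relevance_profile(tokens, ["language", "model"])
tiles = tile_rsvs(profile, tile_size=11)
```

Plotting `profile` against word position gives the kind of curve the Relevance Profile Meter displays; the peak sits over the positions whose windows contain both query words.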

Query-biased summarisation: Using LM
- Select a representative paragraph for a retrieved document based on the query:
  - Choose the paragraph (para) where:
    - Mpara [?] Mdoc is largest AND
    - Pr (Mpara -> Query) is largest
[Figure labels: Paragraph; Document; Query; Lang. Models; Mdoc; Mpara1; Mpara2; etc.]

Soft hyperlinks: Using LM
- Given selected text within a document, generate soft links to other (relevant) documents
- Assume a text model of the web (say), Mweb
- Compare Mweb and Mselect to choose the set of terms that contribute MOST to the divergence
- Use the chosen terms to query the Web, and generate soft links
- Note: Can mix Mselect and Mdoc to obtain a better model of the selected text!
[Figure labels: Selected Text (Mselect); Document (Mdoc); Soft-linked Documents]
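The term selection step can be sketched as below. The slide does not name the divergence measure, so this sketch assumes Kullback-Leibler divergence and ranks each term by its contribution p_select(w) · log(p_select(w) / p_web(w)). The function names, the toy web model, the mixing weight and the small probability floor for words unseen in the web model are all hypothetical.

```python
from collections import Counter
import math


def unigram_model(tokens):
    """Relative word frequencies of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def mix(model_a, model_b, weight_a):
    """Linear mixture of two word distributions (e.g. Mselect with Mdoc)."""
    vocab = set(model_a) | set(model_b)
    return {w: weight_a * model_a.get(w, 0.0)
               + (1.0 - weight_a) * model_b.get(w, 0.0)
            for w in vocab}


def top_divergent_terms(m_select, m_web, k=3, floor=1e-6):
    """Rank terms by their contribution to KL(Mselect || Mweb).

    Assumption: words missing from the web model get a small floor
    probability so that the logarithm stays finite.
    """
    scores = {}
    for w, p_sel in m_select.items():
        if p_sel > 0.0:
            p_web = max(m_web.get(w, 0.0), floor)
            scores[w] = p_sel * math.log(p_sel / p_web)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy background model standing in for Mweb.
m_web = {"the": 0.5, "of": 0.3, "document": 0.185,
         "retrieval": 0.01, "passage": 0.005}
m_select = unigram_model("the passage retrieval of the document".split())
m_doc = unigram_model(
    "the document covers passage retrieval and passage ranking".split())

query_terms = top_divergent_terms(m_select, m_web, k=2)
m_mixed = mix(m_select, m_doc, weight_a=0.7)
```

Common words such as "the" and "of" score low because the web model already predicts them well, so the terms left over are the ones that characterise the selection and are worth sending to a search engine.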

Reading to write: Using LM (exercise for reader)
- As you are writing a document, a tool suggests parts of other documents that may be relevant, cf. Remembrance Agent
[Figure label: writing this]

Reading in Context
- Reading documents is generally done in the context of a larger task, and the pattern of reading activities will depend on the task.
- Task: writing a research proposal for EU Framework 6 (FP6):
  - Reading the FP6 Programme Call (and many related documents): reading to extract and reference
  - Reading to reference documents supporting the proposal
  - Reading to extract ancillary information, e.g. contact details from web pages
- Can you think of any searching/reading environment that supports such a complex set of interactions?

Concluding Remarks
- Reading of (long) documents to find information is raising interesting challenges in the field of information retrieval
- A variety of reading activities should be supported, preferably within an information seeking (with reading) environment
- Language models enable us to model text chunks at various levels of granularity, and thus provide a principled foundation for “retrieval-with-reading” techniques and tools

Reading List
- Hearst, M. A.: TileBars: Visualization of Term Distribution Information in Full Text Information Access. In: Proceedings of CHI '95 (1995) 56-66.
- Whittaker, S., Hirschberg, J., Choi, J., Hindle, D., Pereira, F. and Singhal, A.: SCAN: Designing and Evaluating User Interfaces to Support Retrieval from Speech Archives. In: Proceedings of ACM SIGIR '99. ACM Press (1999) 26-33.
- Kaszkiel, M. and Zobel, J.: Passage Retrieval Revisited. In: Proceedings of the Twentieth International ACM-SIGIR Conference on Research and Development in Information Retrieval, Philadelphia, July 1997. ACM Press (1997) 178-185.
- Kaszkiel, M.: Indexing and Retrieval of Passages in Full-Text Databases. PhD thesis. RMIT Computer Science Technical Report (RT-17), May 2000 (2000).
- Kaszkiel, M., Zobel, J. and Sacks-Davis, R.: Efficient Passage Ranking for Document Databases. ACM Transactions on Information Systems, Vol. 17, No. 4 (1999) 406-439.
- Landauer, T., Egan, D., Remde, J., Lesk, M., Lochbaum, C. and Ketchum, D.: Enhancing the Usability of Text through Computer Delivery and Formative Evaluation: The SuperBook Project. In: McKnight, C., Dillon, A. and Richardson, J. (eds): Hypertext: A Psychological Perspective. Ellis Horwood (1993) 71-136.
- Marchionini, G.: Information Seeking in Electronic Environments. Cambridge University Press, Cambridge (1995).
- Byrd, D.: A Scrollbar-based Visualization for Document Navigation. In: Proceedings of ACM Digital Libraries 99. ACM Press (1999).
- de Kretser, O. and Moffat, A.: Effective Document Presentation with a Locality-Based Similarity Heuristic. In: Proceedings of the Twenty Second International ACM-SIGIR Conference on Research and Development in Information Retrieval, Berkeley, August 1999. ACM Press (1999) 113-120.
- Tombros, A. and Sanderson, M.: Advantages of Query Biased Summaries in Information Retrieval. In: Proceedings of the 1998 ACM SIGIR Conference on Research and Development in Information Retrieval (1998) 2-10.
- Ponte, J. and Croft, W. B.: A Language Modeling Approach to Information Retrieval. In: Proceedings of the 1998 ACM SIGIR Conference on Research and Development in Information Retrieval (1998) 275-281.
- Song, F. and Croft, W. B.: A General Language Model for Information Retrieval. In: Proceedings of the 1999 ACM SIGIR Conference on Research and Development in Information Retrieval (1999) 279-280.
- Schilit, B. N., Golovchinsky, G. and Price, M. N.: Beyond Paper: Supporting Active Reading with Free-form Digital Ink Annotations. In: Proceedings of CHI 98. ACM Press (1998) 149-156.
- Harper, D. J., Coulthard, S. and Sun, Y.: A Language Modelling Approach to Relevance Profiling for Document Browsing. In: Proceedings of JCDL 2002, Oregon, USA (2002) 76-83.
- Harper, D. J., Koychev, I. and Sun, Y.: Query-Based Document Skimming: A User-Centred Evaluation. In: Proceedings of the 25th European Conference on IR Research, LNCS 2622, Springer (2003) 377-392.
- Rhodes, B. J. and Starner, T.: Remembrance Agent: A Continuously Running Automated Retrieval System. In: Proceedings of the First International Conference on the Practical Application of Intelligent Agents and Multi Agent Technology (PAAM '96) (1996) 487-495.
