1 CS 430: Information Discovery Lecture 26 Automated Information Retrieval.


1 1 CS 430: Information Discovery Lecture 26 Automated Information Retrieval

2 2 Course Administration

3 3 Information Discovery
People have many reasons to look for information:
- Known item: Where will I find the wording of the US Copyright Act?
- Facts: What is the capital of Barbados?
- Introduction or overview: How do diesel engines work?
- Related information: Is there a review of this article?
- Comprehensive search: What is known of the effects of global warming on hurricanes?

4 4 Types of Information Discovery
[Diagram; labels only] media type: text; image, video, audio, etc. Methods: searching, browsing, linking. Approaches: statistical; user-in-loop; catalogs, indexes (metadata), CS 502; natural language processing, CS 474. Axis: No human effort / By user.

5 5 Automated information discovery Creating catalog records manually is labor intensive and hence expensive. The aim of automatic indexing is to build indexes and retrieve information without human intervention. The aim of automated information discovery is for users to discover information without using skilled human effort to build indexes.

6 6 Resources for automated information discovery
Computer power:
- brute force computing
- ranking methods
- automatic generation of metadata
The intelligence of the user:
- browsing
- relevance feedback
- information visualization

7 7 Brute force computing
Few people really understand Moore's Law:
- Computing power doubles every 18 months
- Increases 100 times in 10 years
- Increases 10,000 times in 20 years
Simple algorithms + immense computing power may outperform human intelligence.
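The growth figures on this slide follow from compounding the 18-month doubling. A minimal sketch of the arithmetic, where the function name `growth_factor` is only illustrative and not from the lecture:

```python
def growth_factor(years, doubling_months=18):
    """Return how many times computing power multiplies over `years`,
    assuming it doubles once every `doubling_months` months."""
    doublings = years * 12 / doubling_months
    return 2 ** doublings

# Ten years is about 6.7 doublings; twenty years is about 13.3 doublings.
ten_years = growth_factor(10)
twenty_years = growth_factor(20)
print(round(ten_years), round(twenty_years))
```

The results land close to the rounded figures on the slide (roughly 100 times and roughly 10,000 times).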

8 8 Problems with (old-fashioned) Boolean searching
With Boolean retrieval, a document either matches a query exactly or not at all.
- Encourages short queries
- Requires precise choice of index terms (professional indexing)
- Requires precise formulation of queries (professional training)

9 9 Relevance and Ranking
Classical methods assume that a document is either relevant to a query or not relevant. Often a user will consider a document to be partially relevant.
Ranking methods: measure the degree of similarity between a query and a document.
[Diagram: Requests and Documents, linked by "Similar"]
Similar: How similar is a document to a request?

10 10 Contrast with (old-fashioned) Boolean searching
With Boolean retrieval, a document either matches a query exactly or not at all.
- Encourages short queries
- Requires precise choice of index terms
- Requires precise formulation of queries (professional training)
With retrieval using similarity measures, similarities range from 0 to 1 for all documents.
- Encourages long queries (to have as many dimensions as possible)
- Benefits from large numbers of index terms
- Permits queries with many terms, not all of which need match the document

11 11 SMART System
An experimental system for automatic information retrieval:
- automatic indexing to assign terms to documents and queries
- identify documents to be retrieved by calculating similarities between documents and queries
- collect related documents into common subject classes
- procedures for producing an improved search query based on information obtained from earlier searches
Gerard Salton and colleagues
Harvard 1964-1968
Cornell 1968-1988

12 12 The index term vector space
[Figure: document vectors d1 and d2 drawn against term axes t1, t2, t3]
The space has as many dimensions as there are terms in the word list.

13 13 Vector similarity computation
Documents in a collection are assigned terms from a set of n terms.
The term assignment array T is defined as:
- if term j does not occur in document i, t_ij = 0
- if term j occurs in document i, t_ij is greater than zero (the value of t_ij is called the weight of term j in document i)
Similarity between d_i and d_j is defined as

$$\cos(d_i, d_j) = \frac{\sum_{k=1}^{n} t_{ik}\, t_{jk}}{|d_i|\,|d_j|}$$
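The cosine measure defined on this slide can be sketched in a few lines. This is a minimal illustration, assuming each document is given as a plain list of n term weights. The function name and the sample vectors are made up for the example, and returning 0.0 for an all-zero vector is a convention chosen here, not something the slide specifies.

```python
import math

def cosine_similarity(d_i, d_j):
    """Cosine of the angle between two term-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(d_i, d_j))    # sum over k of t_ik * t_jk
    norm_i = math.sqrt(sum(a * a for a in d_i))   # |d_i|
    norm_j = math.sqrt(sum(b * b for b in d_j))   # |d_j|
    if norm_i == 0 or norm_j == 0:
        return 0.0  # assumed convention for a document with no terms
    return dot / (norm_i * norm_j)

# Hypothetical documents over three index terms.
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]   # same direction as d1, different length
d3 = [0.0, 0.0, 3.0]   # shares no terms with d1

print(cosine_similarity(d1, d2), cosine_similarity(d1, d3))
```

Because the measure divides by the vector lengths, d1 and d2 count as maximally similar even though d2 has twice the weights.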

14 14 Term weighting Zipf's Law: If the words, w, in a collection are ranked, r(w), by their frequency, f(w), they roughly fit the relation: r(w) * f(w) = c This suggests that some terms are more effective than others in retrieval. In particular relative frequency is a useful measure that identifies terms that occur with substantial frequency in some documents, but with relatively low overall collection frequency. Term weights are functions that are used to quantify these concepts.

15 15 Term Frequency Concept A term that appears many times within a document is likely to be more important than a term that appears only once.

16 16 Inverse Document Frequency Concept A term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.
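The term frequency and inverse document frequency concepts are often combined into a single weight. The slides do not give a formula, so the sketch below assumes one common variant: raw term count multiplied by log(N / document frequency). The function name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting (not taken from the slides): tf * log(N / df), where
    tf is the count of the term in the document, N is the number of
    documents, and df is the number of documents containing the term.
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical tokenized documents.
docs = [["cat", "sat", "cat"], ["dog", "sat"], ["cat", "dog", "sat"]]
weights = tf_idf(docs)
print(weights[0])
```

In this toy collection "sat" occurs in every document, so its weight is zero everywhere, while "cat" is weighted up in the first document where it appears twice.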

17 17 Ranking: Practical Experience
1. Basic method is inner (dot) product with no weighting
2. Cosine (dividing by product of lengths) normalizes for vectors of different lengths
3. Term weighting using frequency of terms in document usually improves ranking
4. Term weighting using an inverse function of terms in the entire collection improves ranking (e.g., IDF)
5. Weightings for document structure improve ranking
6. Relevance weightings after initial retrieval improve ranking
Effectiveness of methods depends on characteristics of the collection. In general, there are few improvements beyond simple weighting schemes.

18 18 Page Rank Algorithm (Google) Concept: The rank of a web page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages.

19 19 Google PageRank Model
A user:
1. Starts at a random page on the web
2a. With probability p, selects any random page and jumps to it
2b. With probability 1-p, selects a random hyperlink from the current page and jumps to the corresponding page
3. Repeats Steps 2a and 2b a very large number of times
Pages are ranked according to the relative frequency with which they are visited.
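The visit frequencies of this random-surfer model can be approximated without simulating individual jumps, by repeatedly redistributing probability across the link graph. The sketch below makes several assumptions that are not on the slide: p = 0.15, a fixed number of iterations, a tiny made-up link graph, and treating a page with no outgoing links as if it linked to every page.

```python
def pagerank(links, p=0.15, iterations=100):
    """Estimate long-run visit frequencies for the random-surfer model.

    links maps each page to the list of pages it links to; every linked
    page must also appear as a key. p is the probability of jumping to a
    random page (step 2a); 1-p is the probability of following a link (2b).
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}      # uniform random start
    for _ in range(iterations):
        new_rank = {page: p / n for page in pages}    # random-jump share
        for page in pages:
            # Assumption: a page without links behaves as if it links everywhere.
            targets = links[page] or pages
            share = (1 - p) * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical three-page web.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

In this toy graph C receives links from both A and B, so it ends up ranked highest, and A ranks above B because its only inbound link comes from the highly ranked C.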

20 20 Compare TF.IDF to PageRank With TF.IDF, documents are ranked depending on how well they match a specific query. With PageRank, the pages are ranked in order of importance, with no reference to a specific query.

21 21 Latent Semantic Indexing
Objective: Replace indexes that use sets of index terms by indexes that use concepts.
Approach: Map the index term vector space into a lower dimensional space, using singular value decomposition.

22 22 Use of Concept Space: Term Suggestion

23 23 Non-Textual Materials
Content                  Attribute
maps                     lat. and long., content
photograph               subject, date and place
bird songs and images    field mark, bird song
software                 task, algorithm
data set                 survey characteristics
video                    subject, date, etc.

24 24 Direct Searching of Content
Sometimes it is possible to match a query against the content of a digital object. The effectiveness varies from field to field.
Examples:
- Images: crude characteristics of color, texture, shape, etc.
- Music: optical recognition of score
- Bird song: spectral analysis of sounds
- Fingerprints

25 25 Image Retrieval: Blobworld

26 26 Automated generation of metadata Vector methods are for textual material only. Metadata is needed for non-textual materials. (Vector methods can be applied to textual metadata.) Automated extraction of metadata is still weak because of the semantic knowledge needed.

27 27 Surrogates for non-textual materials
Surrogate: a textual catalog record about a non-textual item (photograph).
Text-based methods of information retrieval can search a surrogate for a photograph.

28 28 Library of Congress catalog record
CREATED/PUBLISHED: [between 1925 and 1930?]
SUMMARY: U. S. President Calvin Coolidge sits at a desk and signs a photograph, probably in Denver, Colorado. A group of unidentified men look on.
NOTES: Title supplied by cataloger. Source: Morey Engle.
SUBJECTS: Coolidge, Calvin,--1872-1933. Presidents--United States--1920-1930. Autographing--Colorado--Denver--1920-1930. Denver (Colo.)--1920-1930. Photographic prints.
MEDIUM: 1 photoprint ; 21 x 26 cm. (8 x 10 in.)

29 29 Photographs: Cataloguing Difficulties
Automatic:
- Image recognition methods are very primitive
Manual:
- Photographic collections can be very large
- Many photographs may show the same subject
- Photographs have little or no internal metadata (no title page)
- The subject of a photograph may not be known (Who are the people in a picture? Where is the location?)

30 30

31 31 DC-dot applied to http://www.georgewbush.com/
Automatic record for George W. Bush home page (continued on next slide)

32 32 DC-dot applied to http://www.georgewbush.com/ Automatic record for George W. Bush home page (continued)

33 33 Informedia: the need for metadata
A video sequence is awkward for information discovery:
- Textual methods of information retrieval cannot be applied
- Browsing requires the user to view the sequence. Fast skimming is difficult.
- Computing requirements are demanding (MPEG-1 requires 1.2 Mbits/sec).
Surrogates are required.

34 34 Multi-Modal Information Discovery
The multi-modal approach to information retrieval:
- Computer programs to analyze video materials for clues, e.g., changes of scene
- Methods from artificial intelligence, e.g., speech recognition, natural language processing, image recognition
- Analysis of video track, sound track, closed captioning if present, any other information
Each mode gives imperfect information. Therefore use many approaches and combine the evidence.

35 35 Informedia Library Creation
[Diagram; labels only] Inputs: Video, Audio, Text. Processing: Speech recognition, Image extraction, Natural language interpretation, Segmentation. Output: Segments with derived metadata.

36 36 Harnessing the intelligence of the user Relevance feedback Support for browsing Information visualization

37 37 The Human in the Loop Search index Return hits Browse repository Return objects

38 38 Informedia: Information Discovery User Segments with derived metadata Browsing via multimedia surrogates Querying via natural language Requested segments and metadata

39 39 MIRA
Evaluation Frameworks for Interactive Multimedia Information Retrieval Applications
- Information Retrieval techniques are beginning to be used in complex goal and task oriented systems whose main objectives are not just the retrieval of information.
- New original research in IR is being blocked or hampered by the lack of a broader framework for evaluation.
European study, 1996-99

40 40 MIRA Aims
- Bring the user back into the evaluation process.
- Understand the changing nature of IR tasks and their evaluation.
- 'Evaluate' traditional evaluation methodologies.
- Consider how evaluation can be prescriptive of IR design.
- Move towards balanced approach (system versus user).
- Understand how interaction affects evaluation.
- Support the move from static to dynamic evaluation.
- Understand how new media affects evaluation.
- Make evaluation methods more practical for smaller groups.
- Spawn new projects to develop new evaluation frameworks.

41 41 Feedback in the Vector Space Model
Document vectors as points on a surface:
- Normalize all document vectors to be of length 1
- Then the ends of the vectors all lie on a surface with unit radius
- For similar documents, we can represent parts of this surface as a flat region
- Similar documents are represented as points that are close together on this surface

42 42 Relevance feedback (concept)
[Figure: hits from original search plotted as points, with the original query and the reformulated query marked]
Legend: x = documents identified as non-relevant; o = documents identified as relevant
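The slide does not name a method for producing the reformulated query. One well-known formulation is Rocchio's, which moves the query toward the centroid of the documents judged relevant and away from the centroid of those judged non-relevant. The sketch below assumes that method. The weights alpha, beta, and gamma, the clipping of negative weights to zero, and the sample vectors are illustrative choices, not values from the lecture.

```python
def reformulate_query(query, relevant, non_relevant,
                      alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio-style query reformulation over term-weight vectors.

    query is a list of term weights; relevant and non_relevant are lists
    of document vectors of the same length as the query.
    """
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel_center = centroid(relevant)
    non_center = centroid(non_relevant)
    # Negative weights are clipped to zero (an assumed, common convention).
    return [max(0.0, alpha * q + beta * r - gamma * s)
            for q, r, s in zip(query, rel_center, non_center)]

# Hypothetical query and user judgments over three index terms.
original = [1.0, 0.0, 0.0]
marked_relevant = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
marked_non_relevant = [[0.0, 0.0, 1.0]]
new_query = reformulate_query(original, marked_relevant, marked_non_relevant)
print(new_query)
```

The reformulated query picks up the second term, which the relevant documents share, and gains no weight on the third term, which appears only in the non-relevant document.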

43 43 Document clustering (concept)
[Figure: documents plotted as points (x), grouped into clusters]
Document clusters are a form of automatic classification. A document may be in several clusters.

44 44 Browsing in Information Space
[Figure: documents plotted as points (x), with a starting point marked]
Effectiveness depends on:
(a) Starting point
(b) Effective feedback
(c) Convenience

45 45 User Interface Concepts
Users need a variety of ways to search and browse, depending on the task being carried out and preferred style of working:
- Visual icons
- one-line headlines
- film strip views
- video skims
- transcript following of audio track
- Collages
- Semantic zooming
- Results set
- Named faces
- Skimming

46 46

47 47

48 48

49 49 Alexandria User Interface

50 50

51 51 Information Visualization: Tilebars The figure represents a set of hits from a text search. Each large rectangle represents a document or section of text. Each row represents a search term or subquery. The density of each small square indicates the frequency with which a term appears in a section of a document. Hearst 1995

52 52 Information Visualization: Dendrogram
[Figure: dendrogram over the items alpha, delta, golf, bravo, echo, charlie, foxtrot, with merge points numbered 1 to 6]

53 53 Information Visualization: Self Organizing Maps (SOM)

54 54

55 55 Google has proved...
For a very wide range of users, entirely automated
- selection
- indexing
- ranking
combined with searching by untrained users and online browsing is a very effective form of information discovery.

56 56 Searching: Changing users, changing user interfaces
From                         To
Trained user or librarian    Untrained user
Controlled vocabulary        Natural language
Fielded searching            Unfielded text
Manually created records     Full text
Boolean algorithms           Ranking methods
Stateful protocols           Stateless protocols

57 57 Information Discovery: 1991 and 2001
                     1991          2001
Content              print         online
Computing            expensive     inexpensive
Choice of content    selective     comprehensive
Index creation       human         automatic
Frequency            one time      monthly
Vocabulary           controlled    not controlled
Query                Boolean       ranked retrieval
Users                trained       untrained

