1 CS 430 / INFO 430 Information Retrieval Lecture 1 Overview of Information Retrieval
2 Course Administration Web site: http://www.cs.cornell.edu/courses/cs430/2004fa Instructor: William Arms Teaching assistant: Gilly Leshed Assistant: Anat Nidar-Levi Sign-up sheet: Include your NetID Contact the course team: email to cs430-l@cs.cornell.edu Notices: See the course web site
3 Information Science Information Science brings together faculty, students and researchers who share an interest in combining computer science with the social-science study of how people and society interact with information. Research programs Undergraduate majors in three colleges Ph.D. program This course is intended for both Computer Science and Information Science students.
4 Course Components: Lectures Slides on the web site The slides are an outline. Take your own notes of material that goes beyond the slides Examinations Mid-term and final examinations test material from lectures and discussion classes.
5 Discussion Classes Format of Wednesday evening classes: Topic announced on web site with chapter or article to read, or other preparation Allow several hours to prepare for class by reading the materials Class has discussion format One third of grade is class participation Class time is 7:30 to 8:30 in Upson B17 (note change of room)
6 Assignments Four individual assignments Intended to be programmed in Java. If you wish to use C++ rather than Java, please send email to cs430-l@cs.cornell.edu. The emphasis is on demonstrating understanding of algorithms and methods, not on testing programming expertise.
7 Code of Conduct Computing is a collaborative activity. You are encouraged to work together, but... Some tasks may require individual work. Always give credit to your sources and collaborators. Making use of the expertise of others and building on previous work, with proper attribution, is good professional practice. Using the efforts of others without attribution is unethical and academic cheating. Read and follow the University's Code of Academic Integrity. http://www.cs.cornell.edu/courses/cs430/2003fa/code.html
8 Course Description This course studies techniques and human factors in discovering information in online information systems. Methods that are covered include techniques for searching, browsing and filtering information, descriptive metadata, the use of classification systems and thesauruses, with examples from Web search systems and digital libraries.
9 Searching and Browsing: The Human in the Loop [Diagram: the user searches an index, which returns hits; the user browses a repository, which returns objects]
10 Definitions Information retrieval: Subfield of computer science that deals with automated retrieval of documents (especially text) based on their content and context. Searching: Seeking specific information within a body of information. The result of a search is a set of hits. Browsing: Unstructured exploration of a body of information. Linking: Moving from one item to another by following links, such as citations, references, etc.
11 The Basics of Information Retrieval Query: A string of text, describing the information that the user is seeking. Each word of the query is called a search term. A query can be a single search term, a string of terms, a phrase in natural language, or a stylized expression using special symbols. Full text searching: Methods that compare the query with every word in the text, without distinguishing the function of the various words. Fielded searching: Methods that search on specific bibliographic or structural fields, such as author or heading.
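The contrast between full text searching and fielded searching can be sketched in a few lines of Python. This is a minimal illustration over a hypothetical toy collection (the documents and function names are invented for the example, not part of any real system):

```python
# Toy collection: each document is a set of named fields (hypothetical data).
documents = [
    {"author": "Arnold", "title": "Dover Beach",
     "text": "The sea is calm to-night"},
    {"author": "Arms", "title": "Digital Libraries",
     "text": "A library is a managed collection of documents"},
]

def full_text_search(term, docs):
    """Full text searching: compare the query term with every word
    in every field, without distinguishing the function of the words."""
    term = term.lower()
    return [d for d in docs
            if any(term in str(value).lower().split() for value in d.values())]

def fielded_search(field, term, docs):
    """Fielded searching: look only at one named bibliographic field,
    such as author or title."""
    return [d for d in docs if term.lower() in d[field].lower()]

full_text_search("sea", documents)          # matches via the poem's text
fielded_search("author", "arms", documents) # matches on the author field only
```

A query of a single search term is shown here; a multi-term query would simply repeat the comparison for each term.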
12 Sorting and Ranking Hits When a user submits a query to a search system, the system returns a set of hits. With a large collection of documents, the set of hits may be very large. The value to the user depends on the order in which the hits are presented. Three main methods: Sorting the hits, e.g., by date Ranking the hits by similarity between query and document Ranking the hits by the importance of the documents
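The second method, ranking hits by similarity between query and document, can be sketched with a simple cosine similarity over term counts. This is a hedged illustration with invented toy hits, not the ranking function of any particular search service:

```python
from collections import Counter
import math

def cosine_similarity(query, document):
    """Cosine of the angle between the term-count vectors of the
    query and the document; 1.0 means identical term distributions."""
    q = Counter(query.lower().split())
    d = Counter(document.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

# Hypothetical hits returned for the query, before ranking.
hits = [
    "information retrieval studies searching",
    "the weather today",
    "retrieval of information from an index",
]
query = "information retrieval"

# Present the hits in decreasing order of similarity to the query.
ranked = sorted(hits, key=lambda doc: cosine_similarity(query, doc),
                reverse=True)
```

Sorting by date, the first method on the slide, would use the same `sorted` call with a date field as the key instead of a similarity score.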
13 Examples of Search Systems Find file on a computer system (Spotlight for Macintosh). Library catalog for searching bibliographic records about books and other objects (Library of Congress catalog). Abstracting and indexing system for finding research information about specific topics (Medline for medical information). Web search service for finding web pages (Google).
18 Evaluation To place information retrieval on a systematic basis, we need repeatable criteria to evaluate how effective a system is in meeting the information needs of its users. This proves to be very difficult with a human in the loop. It is hard to define: the task that the human is attempting, and the criteria for measuring success.
19 Information Discovery: Examples and Measures of Success People have many reasons to look for information: Known item Where will I find the wording of the US Copyright Act? Success: A document from a reliable source that has the current wording of the act. Fact What is the capital of Barbados? Success: The name of the capital from an up-to-date, reliable source.
20 Information Discovery: Examples and Measures of Success (continued) People have many reasons to look for information: Introduction or overview How do diesel engines work? Success: A document that is technically correct, of the appropriate length and technical depth for the audience. Related information (annotation) Is there a review of this item? Success: A review, if one exists, written by a competent author.
21 Information Discovery: Examples and Measures of Success (continued) People have many reasons to look for information: Comprehensive search What is known of the effects of global warming on hurricanes? Success: A list of all research papers on this topic. Historically, comprehensive search was the application that motivated information retrieval. It is important in such areas as medicine, law, and academic research. The standard methods for evaluating search services are appropriate only for comprehensive search.
22 Indexes [Diagram: an index is created from the document collection; the user searches the index] Search systems rarely search document collections directly. Instead, an index of the documents in the collection is built, and the user searches the index. Documents can be digital (e.g., web pages) or physical (e.g., books).
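The index described above is usually an inverted index: a mapping from each term to the documents that contain it, so a search touches the index rather than the collection. A minimal sketch, over a hypothetical toy collection (the data and AND-query semantics are illustrative assumptions):

```python
from collections import defaultdict

# Toy collection: document id -> text (hypothetical data).
documents = {
    1: "the sea is calm tonight",
    2: "the tide is full",
    3: "come to the window",
}

def build_index(docs):
    """Create the index: map each term to the set of ids of the
    documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Search the index, not the documents: return the ids of documents
    containing every query term (AND semantics)."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

index = build_index(documents)
search(index, "the is")  # -> {1, 2}: both terms occur in documents 1 and 2
```

For a physical collection, the ids would identify books on shelves rather than digital files, but the index and the search are the same.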
23 Automatic indexing The aim of automatic indexing is to build indexes and retrieve information without human intervention. When the information that is being searched is text, methods of automatic indexing can be very effective. History Much of the fundamental research in automatic indexing was carried out by Gerard Salton, Professor of Computer Science at Cornell, and his graduate students. The reading for Discussion Class 2 is a paper by Salton and others that describes the SMART system used for their research.
24 Descriptive metadata Some methods of information retrieval search descriptive metadata about the objects. The metadata typically consists of a catalog or indexing record, or an abstract, with one record for each object. The record acts as a surrogate for the object. Usually the metadata is stored separately from the objects that it describes, but sometimes it is embedded in the objects. Usually the metadata is a set of text fields. Textual metadata can be used to describe non-textual objects, e.g., software, images, music.
25 Descriptive metadata Catalog: metadata records that have a consistent structure, organized according to systematic rules. (Example: Library of Congress Catalog) Abstract: a free-text record that summarizes a longer document. Indexing record: less formal than a catalog record, but more structured than a simple abstract. (Examples: Inspec, Medline)
26 Documents and Surrogates

Document (poem):
The sea is calm to-night.
The tide is full, the moon lies fair
Upon the straits;--on the French coast the light
Gleams and is gone; the cliffs of England stand,
Glimmering and vast, out in the tranquil bay.
Come to the window, sweet is the night-air!
Only, from the long line of spray
Where the sea meets the moon-blanch'd land,
Listen! you hear the grating roar
Of pebbles which the waves draw back, and fling,
At their return, up the high strand,
Begin, and cease, and then again begin,
With tremulous cadence slow, and bring
The eternal note of sadness in.

Surrogate (catalog record):
Author: Matthew Arnold
Title: Dover Beach
Genre: Poem
Date: 1851

Notes: 1. The surrogate is also a document. 2. Every word is different!
27 Surrogates for non-textual materials [Diagram: a textual catalog record acts as the surrogate for a non-textual item, a photograph] Text-based methods of information retrieval can search a surrogate for a photograph.
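Because the surrogate is itself text, the full-text methods already described apply directly to non-textual objects. A minimal sketch, assuming hypothetical surrogate records loosely modeled on the catalog record on the next slide (the filenames and fields are invented for the example):

```python
# Hypothetical surrogate records for non-textual objects: each textual
# record describes an object that cannot itself be searched as text.
surrogates = [
    {"object": "photo-001.tiff",
     "summary": "President Calvin Coolidge signs a photograph in Denver",
     "subjects": ["Presidents--United States", "Denver (Colo.)"]},
    {"object": "score-042.tiff",
     "summary": "Manuscript score of a string quartet",
     "subjects": ["Music--Manuscripts"]},
]

def search_surrogates(term, records):
    """Return the objects whose surrogate records mention the term:
    the text search runs over the record, not the photograph."""
    term = term.lower()
    return [r["object"] for r in records
            if term in r["summary"].lower()
            or any(term in s.lower() for s in r["subjects"])]

search_surrogates("denver", surrogates)  # finds the photograph via its record
```

Note that the search can only find what the cataloger chose to record: a detail visible in the photograph but absent from the surrogate is invisible to the system.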
28 Library of Congress catalog record (part)
CREATED/PUBLISHED: [between 1925 and 1930?]
SUMMARY: U.S. President Calvin Coolidge sits at a desk and signs a photograph, probably in Denver, Colorado. A group of unidentified men look on.
NOTES: Title supplied by cataloger. Source: Morey Engle.
SUBJECTS: Coolidge, Calvin,--1872-1933. Presidents--United States--1920-1930. Autographing--Colorado--Denver--1920-1930. Denver (Colo.)--1920-1930. Photographic prints.
MEDIUM: 1 photoprint ; 21 x 26 cm. (8 x 10 in.)