CS 430: Information Discovery


CS 430: Information Discovery Lecture 2 Introduction to Text Based Information Retrieval

Course Administration
• The campus store has run out of textbooks. More are on order. The reading for next week will be changed so that it does not require the textbook.
• New Teaching Assistant: Yukiko Yamashita
• Please send all questions about the course to: wya@cs.cornell.edu, kyotov@cs.cornell.edu, yukiko@cs.cornell.edu

Classical Information Retrieval
[Diagram. Labels: media type (text; image, video, audio, etc.); linking, searching, browsing; natural language processing; catalogs, indexes (metadata); user-in-loop; statistical. Course labels on the diagram: CS 502, CS 474.]

Documents
A textual document is a digital object consisting of a sequence of words and other symbols, e.g., punctuation. The individual words and other symbols are known as tokens.
A textual document can be:
• Free text, also known as unstructured text, which is a continuous sequence of tokens.
• Fielded text, also known as structured text, in which the text is broken into sections that are distinguished by tags or other markup.
[Methods of markup, e.g., XML, are covered in CS 502.]
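As a minimal sketch of the token idea, the following Python splits free text into word and punctuation tokens. The slide does not define token boundaries, so the regular expression and the function name `tokenize` are assumptions made for illustration.

```python
import re

# Assumption: a token is either a run of word characters (optionally with
# internal apostrophes or hyphens) or a single non-space symbol.
TOKEN_PATTERN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")

def tokenize(text):
    """Return the sequence of tokens (words and other symbols) in text."""
    return TOKEN_PATTERN.findall(text)

if __name__ == "__main__":
    print(tokenize("Free text is a continuous sequence of tokens."))
```

Punctuation is kept as separate tokens here because the slide counts "other symbols" as tokens; a real indexer might discard them at a later stage.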

Word Frequency
Observation: Some words are more common than others.
Statistics: Most large collections of text documents have similar statistical characteristics. These statistics:
• influence the effectiveness and efficiency of data structures used to index documents
• underlie many retrieval models
The following example is taken from: Jamie Callan, Characteristics of Text, 1997 http://hobart.cs.umass.edu/~allan/cs646-f97/char_of_text.html

Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
f(w) is the frequency with which w appears
r(w) is the rank of w in order of frequency, e.g., the most commonly occurring word has rank 1
[Plot: frequency f against rank r; a marked point shows a word w that has rank r and frequency f.]
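The definitions of f(w) and r(w) can be sketched in a few lines of Python. This is an illustration only: the function name `rank_frequency` is hypothetical, and breaking ties alphabetically is an assumption, since the slide does not say how equally frequent words are ranked.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending frequency; ties are broken alphabetically so that
    # the ranking is deterministic (an assumption, not part of the slide).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(ordered, start=1)]

if __name__ == "__main__":
    sample = "the cat and the dog and the bird".split()
    for rank, word, freq in rank_frequency(sample):
        print(rank, word, freq)
```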

word   f          word      f        word     f
the    1130021    from      96900    or       54958
of     547311     he        94585    about    53713
to     516635     million   93515    market   52110
a      464736     year      90104    they     51359
in     390819     its       86774    this     50933
and    387703     be        85588    would    50828
that   204351     was       83398    you      49281
for    199340     company   83070    which    48273
is     152483     an        76974    bank     47940
said   148302     has       74405    stock    47401
it     134323     are       74097    trade    47310
on     121173     have      73132    his      47116
by     118863     but       71887    more     46244
as     109135     will      71494    who      42142
at     101779     say       66807    one      41635
mr     101679     new       64456    their    40910
with   101210     share     63925

Zipf's Law
If the words, w, in a collection are ranked, r(w), by their frequency, f(w), they roughly fit the relation:
r(w) * f(w) = c
Different collections have different constants c. In English text, c tends to be about n / 10, where n is the total number of word occurrences in the collection.
For a weird but wonderful discussion of this and many other examples of naturally occurring rank frequency distributions, see: Zipf, G. K., Human Behavior and the Principle of Least Effort. Addison-Wesley, 1949.
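The relation r(w) * f(w) = c can be checked directly. The sketch below computes the product for each rank and, given a collection size n, scales it to 1000 * r * f / n, which should be near 100 when c is about n / 10. The function names are hypothetical, and n is a parameter the caller must supply because the slides do not state it.

```python
def zipf_products(frequencies):
    """Return r * f for each frequency, taking rank r from sorted position."""
    ordered = sorted(frequencies, reverse=True)
    return [rank * freq for rank, freq in enumerate(ordered, start=1)]

def normalized(products, n):
    """Scale each r * f product to 1000 * r * f / n, rounded to an integer."""
    return [round(1000 * p / n) for p in products]

if __name__ == "__main__":
    # Frequencies of the five most common words in the example collection:
    # the, of, to, a, in.
    top_five = [1130021, 547311, 516635, 464736, 390819]
    print(zipf_products(top_five))
```

The products for the most frequent words vary noticeably, which is consistent with the slide's wording that the words only "roughly fit" the relation.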

word   1000*rf/n    word      1000*rf/n    word     1000*rf/n
the    59           from      92           or       101
of     58           he        95           about    102
to     82           million   98           market   101
a      98           year      100          they     103
in     103          its       100          this     105
and    122          be        104          would    107
that   75           was       105          you      106
for    84           company   109          which    107
is     72           an        105          bank     109
said   78           has       106          stock    110
it     78           are       109          trade    112
on     77           have      112          his      114
by     81           but       114          more     114
as     80           will      117          who      106
at     80           say       113          one      107
mr     86           new       112          their    108
with   91           share     114

Methods that Build on Zipf's Law
Term weighting: Give differing weights to terms based on their frequency, with the most frequent words weighted less.
Stop lists: Ignore the most frequent words (upper cut-off).
Significant words: Ignore the most frequent and least frequent words (upper and lower cut-off).
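The two cut-offs can be sketched as follows. This is one possible reading, not a prescribed method: the upper cut-off is expressed as a number of top-ranked words to drop and the lower cut-off as a minimum frequency, and both parameter names are hypothetical. Term weighting is not shown.

```python
from collections import Counter

def significant_words(tokens, n_most_frequent_to_drop, min_frequency):
    """Return the words that survive both the upper and the lower cut-off."""
    counts = Counter(tokens)
    # Rank words by descending frequency (ties broken alphabetically).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # Upper cut-off: discard the top-ranked words, as a stop list would.
    kept = ordered[n_most_frequent_to_drop:]
    # Lower cut-off: discard words that occur too rarely.
    return {word for word, freq in kept if freq >= min_frequency}

if __name__ == "__main__":
    sample = "the the the the cat cat cat dog dog bird".split()
    print(significant_words(sample, n_most_frequent_to_drop=1, min_frequency=2))
```

Calling the function with `min_frequency=1` applies only the upper cut-off, which corresponds to using a stop list alone.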

Luhn's Proposal "It is here proposed that the frequency of word occurrence in an article furnishes a useful measurement of word significance. It is further proposed that the relative position within a sentence of words having given values of significance furnish a useful measurement for determining the significance of sentences. The significance factor of a sentence will therefore be based on a combination of these two measurements." Luhn, H.P., The automatic creation of literature abstracts, IBM Journal of Research and Development, 2, 159-165 (1958)
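A rough sketch of how the two measurements might be combined is given below. The quotation does not specify a formula, so the details are assumptions based on a commonly described reading of Luhn's paper: significant words that lie within a small gap of each other form a cluster, and a cluster scores the square of its significant-word count divided by its length. The function name `luhn_sentence_score` and the default `max_gap=4` are illustrative.

```python
def luhn_sentence_score(words, significant, max_gap=4):
    """Return the best cluster score: (significant count)^2 / cluster length."""
    positions = [i for i, word in enumerate(words) if word in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            # Close enough to the previous significant word: same cluster.
            count += 1
        else:
            # Gap too large: score the finished cluster and start a new one.
            best = max(best, count * count / (prev - start + 1))
            start = pos
            count = 1
        prev = pos
    return max(best, count * count / (prev - start + 1))

if __name__ == "__main__":
    sentence = "the frequency of word occurrence measures word significance".split()
    print(luhn_sentence_score(sentence, {"frequency", "word", "significance"}))
```

The set of significant words would typically come from frequency cut-offs applied to the whole article, which ties the sentence score back to the first measurement in the quotation.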

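Cut-off Levels for Significant Words
[Plot: word frequency against rank r, with an upper cut-off and a lower cut-off marked. The significant words lie between the two cut-offs, and a second curve shows the resolving power of significant words.]
from: Van Rijsbergen, Ch. 2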

Information Retrieval Overview
[Diagram labels: Requests, Documents, Similar.]
Similar: mechanism for determining which information items meet the requirements of a given request.

Functional View of Information Retrieval
[Diagram labels: Documents, Requests, Index database, Similar.]
Similar: mechanism for determining the similarity of the request representation to the information item representation.

Major Subsystems
Indexing subsystem: Receives incoming documents, converts them to the form required for the index and adds them to the index database.
Search subsystem: Receives incoming requests, converts them to the form required for searching the index and searches the database for matching documents.
The index database is the central hub of the system.

Example: Indexing Subsystem for Boolean Searching
documents
  -> assign document IDs
text, document numbers and *field numbers
  -> break into words
words
  -> stoplist
non-stoplist words
  -> stemming*
stemmed words
  -> term weighting*
terms with weights
  -> Index database
*Indicates optional operation.
from Frakes, page 7
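The indexing steps can be sketched as a small Python pipeline that builds an in-memory index mapping each term to the IDs of the documents that contain it. Everything concrete here is an assumption made for illustration: the tiny `STOPLIST`, the `crude_stem` placeholder (which is not a real stemming algorithm), and the choice of a dictionary of sets as the index database. Field numbers and term weighting are omitted.

```python
import re
from collections import defaultdict

# Illustrative stoplist only; a real one would be much longer.
STOPLIST = {"the", "of", "to", "a", "in", "and"}

def crude_stem(word):
    """Placeholder for a real stemmer: strips one trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def build_index(documents, stem=True):
    """Map each term to the set of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents, start=1):   # assign document IDs
        for word in re.findall(r"\w+", text.lower()):    # break into words
            if word in STOPLIST:                         # apply the stoplist
                continue
            term = crude_stem(word) if stem else word    # optional stemming
            index[term].add(doc_id)
    return dict(index)

if __name__ == "__main__":
    docs = ["The banks and the market", "Stock market trade"]
    for term, doc_ids in sorted(build_index(docs).items()):
        print(term, sorted(doc_ids))
```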

Example: Search Subsystem for Boolean Searching
query
  -> parse query
query terms
  -> stoplist
non-stoplist words
  -> stemming*
stemmed words
  -> Boolean operations (against the Index database)
retrieved document set
  -> ranking*
ranked document set
  -> relevance judgments*
relevant document set
*Indicates optional operation.
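The Boolean operations step can be sketched as set operations over an index that maps terms to document IDs. This is a simplified illustration under stated assumptions: the query syntax (terms joined by AND, OR and NOT, evaluated left to right with no precedence or parentheses) and the function name `boolean_search` are hypothetical, and the stoplist, stemming, ranking and relevance steps are left out. Query terms are assumed to be normalized the same way as the index terms.

```python
def boolean_search(index, all_doc_ids, query):
    """Evaluate a flat Boolean query left to right and return matching IDs."""
    result = None
    operator = "AND"      # adjacent terms without an operator are ANDed
    negate = False
    for token in query.split():
        upper = token.upper()
        if upper in ("AND", "OR"):
            operator = upper
        elif upper == "NOT":
            negate = True
        else:
            postings = index.get(token.lower(), set())
            if negate:
                # NOT term: every document that does not contain the term.
                postings = all_doc_ids - postings
                negate = False
            if result is None:
                result = set(postings)
            elif operator == "AND":
                result = result & postings
            else:
                result = result | postings
            operator = "AND"
    return result if result is not None else set()

if __name__ == "__main__":
    index = {"market": {1, 2}, "bank": {1}, "stock": {2}}
    print(boolean_search(index, {1, 2, 3}, "market AND NOT bank"))
```

The full set of document IDs is passed in explicitly because NOT needs a universe to complement against, and the index alone does not list documents that contain no indexed terms.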