What is a document?
Information need: Where did the metaphor "doing X is like “herding cats”" come from?
- a quotation? “Managing senior programmers is like herding cats.” (Dave Platt) or…
- a paper/article?
- a video?
Basic IR: Documents
Assume:
- free text, from a quotation through a book (unstructured or semi-structured data)
- English
- available electronically (on-line repositories)
- generally, too many documents to store locally in an index
- generally, infer semantics through low-level units (e.g., terms) and metadata
Logical View of Documents
[Figure: documents (Docs) pass through a sequence of text operations: structure; accents, spacing; stopwords; noun groups; stemming; automatic or manual indexing. The resulting views are labeled structure, full text, and index terms.]
(Figure taken from on-line course resources for Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto)
Structure
Metadata is information on the organization of the data.
- external to meaning: length, author, date…
- subject matter: subject codes, keywords, taxonomic indicators
Organizational conventions:
- articles have a title, author list, abstract, sections, etc.
- web pages have headings, title, keywords, etc.
Markup Languages
Markup is extra syntax that describes formatting, attributes, semantics, etc.
Tags give directions and delineate the beginning and end of each marked span.
Examples: TeX, Standard Generalized Markup Language (SGML), eXtensible Markup Language (XML), HyperText Markup Language (HTML).
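To make the role of tags concrete, here is a minimal sketch that uses Python's standard `html.parser` module to separate an HTML page's title (structural metadata) from its body text. The names `TextExtractor` and `extract_title_and_text` are hypothetical, chosen for illustration; real pages need more care (scripts, comments, malformed markup).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the <title> text separately from all other text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text_parts = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # An opening tag marks the beginning of a span.
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        # A closing tag marks the end of a span.
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def extract_title_and_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.text_parts)


page = ("<html><head><title>Herding Cats</title></head>"
        "<body><h1>IR</h1><p>Basic <b>terms</b></p></body></html>")
title, text = extract_title_and_text(page)
print(title)
print(text)
```

The tags themselves are discarded here; an indexer could instead keep them to weight terms that occur in titles or headings.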
Term Separators: Accents, Spacing, etc…
Lexical analysis divides text into distinct terms.
- usually disregards punctuation, numbers, and spaces
Decisions:
- how to treat case and hyphens?
- disregard comments?
- whether and how to use formatting directives?
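The decisions above can be made explicit as parameters of a tokenizer. The following is a minimal sketch, assuming ASCII letters are the only term characters once accents are stripped; the function names and the `fold_case` and `keep_hyphens` flags are hypothetical.

```python
import re
import unicodedata


def strip_accents(text):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text, fold_case=True, keep_hyphens=False):
    """Split text into terms, discarding punctuation, digits, and whitespace."""
    text = strip_accents(text)
    if fold_case:
        text = text.lower()
    # Either treat a hyphenated compound as one term or split it at the hyphens.
    pattern = r"[A-Za-z]+(?:-[A-Za-z]+)*" if keep_hyphens else r"[A-Za-z]+"
    return re.findall(pattern, text)


print(tokenize("State-of-the-art IR, since 1999!"))
print(tokenize("State-of-the-art IR, since 1999!", keep_hyphens=True))
```

Each flag corresponds to one of the open questions: folding case merges "Web" and "web", and keeping hyphens preserves compounds as single terms.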
Information in Terms
Information entropy quantifies information content:
$$E = -\sum_{i=1}^{n} p_i \log_2 p_i$$
where there is a set of $n$ terms and $p_i$ is the relative frequency of term $i$ (its share of all term occurrences).
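As a small illustration, the entropy of a term sequence can be computed directly from term counts. This sketch assumes base-2 logarithms (so the result is in bits); `term_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def term_entropy(terms):
    """Entropy in bits of the empirical term distribution."""
    counts = Counter(terms)
    total = sum(counts.values())
    # Each distinct term contributes -p * log2(p), where p is its relative frequency.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = ["a", "b", "c", "d"]   # four equally likely terms
skewed = ["a", "a", "a", "b"]    # one term dominates
print(term_entropy(uniform))
print(term_entropy(skewed))
```

A uniform distribution over four terms gives 2 bits, and the skewed sequence gives less, since a dominant term carries little information.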
Term Distribution
Zipf’s Law approximates the distribution of term frequencies in a text.
The frequency of the $i$th most frequent term is $1/i^{\theta}$ times that of the most frequent term, where $1.5 < \theta < 2.0$.
[Plot: Freq against Terms]
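A short sketch of the relationship: given the frequency of the most frequent term and an exponent (called `theta` here, a name assumed for illustration), the predicted frequency at any rank follows by division. The default exponent of 1.8 is simply a value inside the range stated on the slide, not a fitted constant.

```python
def zipf_frequency(rank, top_frequency, theta=1.8):
    """Predicted frequency of the term at the given rank (rank 1 is the most frequent)."""
    if rank < 1:
        raise ValueError("rank must be at least 1")
    return top_frequency / rank ** theta


# Predicted frequencies for the first five ranks when the top term occurs 1000 times.
for rank in range(1, 6):
    print(rank, round(zipf_frequency(rank, 1000), 1))
```

Comparing these predictions against counts from a real collection is one way to estimate the exponent for that collection.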
Stop Words
Stop words are words that either appear so frequently that they do not distinguish documents (e.g., “www”) or have a more syntactic than semantic role (e.g., “the”).
Advantage: Filtering out stop words shortens the document description and focuses attention on terms that convey more information.
Disadvantage: May reduce recall.
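Stop word filtering amounts to a set-membership test per term. The list below is a tiny illustrative sample, not a standard stop list, and `remove_stop_words` is a hypothetical helper name.

```python
# A deliberately small, illustrative stop list; real lists hold hundreds of words.
STOP_WORDS = {"a", "and", "as", "be", "is", "of", "the", "to", "will", "www"}


def remove_stop_words(terms, stop_words=STOP_WORDS):
    """Keep only the terms that are not in the stop list (case-insensitive)."""
    return [t for t in terms if t.lower() not in stop_words]


terms = "The purpose of the course is to teach theory and practice".split()
print(remove_stop_words(terms))
```

A query made up entirely of stop words would match nothing after filtering, which is one way recall can suffer.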
Vocabulary Size
Heaps’ Law models the size of the vocabulary $V$ as a function of the size of the text ($n$), a baseline ($10 < K < 100$), and a growth factor ($0 < \beta < 1$):
$$V = K n^{\beta}$$
[Plot: Voc against Text Size]
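The sketch below compares the law's prediction with vocabulary growth measured on a term sequence. The parameter names `k` and `beta` and their default values are assumptions picked from within the stated ranges; they are not fitted to any collection.

```python
def heaps_vocabulary(n, k=50, beta=0.5):
    """Predicted vocabulary size for a text of n terms."""
    return k * n ** beta


def vocabulary_growth(terms):
    """Observed number of distinct terms after each position in the text."""
    seen = set()
    sizes = []
    for term in terms:
        seen.add(term)
        sizes.append(len(seen))
    return sizes


print(heaps_vocabulary(10_000))
print(vocabulary_growth(["a", "b", "a", "c"]))
```

Because the exponent is below 1, the vocabulary grows more slowly than the text, which is why indexes stay manageable as collections grow.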
Noun Groups
Further focus the term set by filtering for particular subsets selected manually (e.g., classifications or index terms).
- Discard terms that are not nouns*.
- Fix spelling errors.
- Use a thesaurus to combine similar words.
*From the Google web site: the Top 20 gaining queries of 2002 contain only nouns.
Stemming
Grammars permit minor modifications of terms that change their type rather than their meaning, e.g., plurals, gerunds, some prefixes and suffixes…
Stemming reduces a term to just its core (stem).
Advantages: reduces the set of terms, combines terms with the same meaning
Disadvantage: may reduce precision by incorrectly combining meanings (e.g., “skies” and “ski”)
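The trade-off can be seen with a toy suffix stripper. This is a deliberately naive sketch, not Porter's algorithm or any other published stemmer; the suffix list, the `min_stem` threshold, and the name `naive_stem` are all assumptions made for illustration.

```python
# Suffixes are tried in order; the first one that matches is removed.
SUFFIXES = ("ing", "es", "s")


def naive_stem(term, min_stem=3):
    """Strip one common suffix, provided at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= min_stem:
            return term[: -len(suffix)]
    return term


for word in ["topics", "courses", "skies", "ski"]:
    print(word, "->", naive_stem(word))
```

Here "topics" becomes "topic" as intended, while "skies" and "ski" collapse to the same stem even though their meanings differ.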
Putting it together: Document
The purpose of the course is to teach theory and practice underlying the construction of Web based information systems. As such, the course will devote equal time to information retrieval and software engineering topics. The theory will be put into practice through a semester long team programming project.
(48 words, 307 characters)
Putting it together: Stop Word Removal
purpose course teach theory practice underlying construction Web based information course devote equal time information retrieval software engineering topics theory practice semester long team programming project
(26 words, 213 chars)
Putting it together: Only Nouns
purpose course theory practice construction Web information course equal time information retrieval software engineering topics theory practice semester team programming project
(21 words, 179 chars)
Putting it together: Stemming & Alphabetizing
construct course course engineer equal informat informat practice practice program project purpose retrieve semester software team theory theory time topic web
(21 words, 161 chars)
Indexing
Terms remaining after document processing must be stored to facilitate retrieval.
Typically, they are stored in an inverted index. More on that later…
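As a preview, an inverted index maps each term to the documents that contain it. The following is a minimal in-memory sketch; the document identifiers and the name `build_inverted_index` are hypothetical, and a real index would also store positions or frequencies.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {
    1: ["course", "theory", "practice"],
    2: ["theory", "web", "retrieve"],
}
index = build_inverted_index(docs)
print(index["theory"])
print(index["web"])
```

Looking up a query term is then a single dictionary access, rather than a scan over every document.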