Properties of Text CS336 Lecture 3:
2 Generating Document Representations Want to generate representations automatically, with little human intervention Use significant terms to build representations –referred to as indexing
3 Indexes Indexing choices (there is no “right” answer) What is a word? –Embedded punctuation (e.g., DC-10, long-term) –Case folding (e.g., New vs new, Apple vs apple) –Stopwords (e.g., the, a, its) –Morphology (e.g., computer, computers, computing, computed)
4 Conclusions Text is the main form of communicating knowledge Documents have syntax, structure, and semantics Metadata: information about data Formats of text Modeling Natural Language –Statistical properties Entropy Distribution of symbols –Structural properties (e.g., morphology)
5 Generating Document Representations Use significant terms to build representations of documents –referred to as indexing Manual indexing: professional indexers –Usually from a controlled vocabulary –Typically phrases Automatic indexing –Machine selects the non-objective terms –Terms can be single words or phrases
6 Indexes Indexing choices (there is no “right” answer) –What is a word? Embedded punctuation (e.g., DC-10, long-term) Case folding (e.g., New vs new, Apple vs apple) Stopwords (e.g., the, a, its) Morphology (e.g., computer, computers, computing, computed) Three basic steps: 1. Lexical analysis 2. Stopword removal 3. Morphological analysis/stemming
7 Lexical analysis Turn a stream of characters into a stream of words What is a word? –Strings separated by white space / punctuation? Languages like Chinese need segmentation; record positional information for proximity operators –Embedded punctuation? –Case sensitive? –Numbers, dates? –Other special cases?
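A minimal sketch of lexical analysis under one set of choices, none of which the slides prescribe: tokens are runs of ASCII letters and digits, case is folded, and each token keeps its word position so that a proximity operator can be evaluated. The names `TOKEN_RE`, `tokenize`, and `within` are hypothetical.

```python
import re

# Assumption: a token is a run of ASCII letters or digits; all punctuation splits.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")

def tokenize(text):
    """Return (position, token) pairs, where position is the word offset."""
    return [(i, m.group().lower()) for i, m in enumerate(TOKEN_RE.finditer(text))]

def within(tokens, a, b, k):
    """True if terms a and b occur within k word positions of each other."""
    pos_a = [p for p, t in tokens if t == a]
    pos_b = [p for p, t in tokens if t == b]
    return any(abs(x - y) <= k for x in pos_a for y in pos_b)

tokens = tokenize("The long-term plan for the DC-10 fleet")
print(tokens)
print(within(tokens, "long", "plan", 2))
```

With this pattern, long-term and DC-10 are each split into two tokens; that is one of the choices the slide asks about, not a recommendation.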
8 Include hyphens? e.g., long-term, DC-10 Break into distinct terms –long and term Single term with hyphen –Chemical Abstracts Service: treats as a single term –LEXIS/NEXIS: breaks apart into two terms if they occur in a title or abstract
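The two hyphen policies can be compared side by side. This is an illustrative sketch only: the function name `tokenize_hyphens` and both patterns are assumptions, and it does not reproduce how any of the systems named above actually tokenize.

```python
import re

def tokenize_hyphens(text, keep_hyphens):
    """Tokenize with hyphenated forms either kept whole or broken apart."""
    if keep_hyphens:
        # Alphanumeric runs joined by single hyphens stay one term.
        pattern = r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*"
    else:
        # The hyphen acts as a separator.
        pattern = r"[A-Za-z0-9]+"
    return re.findall(pattern, text)

print(tokenize_hyphens("long-term DC-10", keep_hyphens=True))
print(tokenize_hyphens("long-term DC-10", keep_hyphens=False))
```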
9 Punctuation and Case Punctuation is sometimes important –“command.com” –“OS/2” Case folding: convert to lower case or not –Smith vs smith –Apple vs apple –New vs new
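A small sketch of how punctuation retention and case folding interact. The pattern is an assumption chosen for this example: it keeps a single '.' or '/' only when it sits between alphanumeric runs, so "command.com" and "OS/2" survive while sentence-final periods are dropped. `TOKEN_RE` and `tokens_with_punct` are hypothetical names.

```python
import re

# Assumption: '.' and '/' are kept only when embedded between alphanumerics.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[./][A-Za-z0-9]+)*")

def tokens_with_punct(text, fold_case=True):
    """Extract tokens, optionally converting them to lower case."""
    found = TOKEN_RE.findall(text)
    return [t.lower() for t in found] if fold_case else found

text = "Run command.com on OS/2. Apple sells an apple."
print(tokens_with_punct(text, fold_case=False))
print(tokens_with_punct(text, fold_case=True))
```

With folding on, "Apple" and "apple" become the same index term, which is the trade-off the slide points at: better matching at the cost of losing the company/fruit distinction.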
10 Include numbers? Numbers - not good discriminators But … important in some contexts Usually systems allow tokens to include digits but not to begin with one –So B6 (vitamin) but not 6
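The rule "tokens may include digits but may not begin with one" can be sketched as a regular expression. The pattern and the name `index_tokens` are assumptions for illustration, not the rule of any particular system.

```python
import re

# Assumption: a token starts with an ASCII letter, then letters or digits.
TOKEN_RE = re.compile(r"\b[A-Za-z][A-Za-z0-9]*\b")

def index_tokens(text):
    """Return tokens that start with a letter; digit-initial strings are skipped."""
    return TOKEN_RE.findall(text)

print(index_tokens("Take 6 mg of vitamin B6 daily"))
```

Note that under this pattern a string such as "10mg" is dropped entirely, since no word boundary falls between the digits and the letters.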
11 Stop lists List of terms which are not included in an index Why use stop words? –Luhn (1957) observed that many of the most frequently occurring words are worthless as index terms –The 10 most frequently occurring terms account for 20-30% of the word occurrences (Zipf) –Eliminating them saves index space and computation time
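The share of word occurrences covered by the most frequent terms can be measured directly on any token list. A minimal sketch, assuming the text has already been tokenized; `top_k_share` is a hypothetical helper and the toy sentence is made up for the example.

```python
from collections import Counter

def top_k_share(tokens, k=10):
    """Fraction of all token occurrences accounted for by the k most frequent terms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    top = sum(c for _, c in counts.most_common(k))
    return top / total if total else 0.0

toy = "the cat and the dog and the bird".split()
print(top_k_share(toy, k=2))  # 'the' (3) + 'and' (2) out of 8 tokens
```

On a real collection, calling the function with k=10 is one way to check the 20-30% figure for that collection.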
12 Stop list Typically most frequently occurring words –a, about, at, and, etc, it, is, the, or, … Among the top 200 are words such as “time” “war” “home” etc. –May be collection specific “computer, machine, program, source, language” in a computer science collection Removal can be problematic (e.g. “Mr. The”, “and-or gates”)
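Stopword removal is a simple filter, and the same filter shows why removal can be problematic. A sketch assuming tokens are already split; `STOP_WORDS` holds only the words spelled out on the slide (the elided remainder is left out, and "etc" is treated as an abbreviation rather than a list entry, which is an assumption).

```python
# Partial stop list taken from the slide; a real list would be longer.
STOP_WORDS = frozenset({"a", "about", "at", "and", "it", "is", "the", "or"})

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lower-cased form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

print(remove_stopwords(["the", "war", "is", "about", "time"]))
print(remove_stopwords(["and", "or", "gates"]))  # the distinguishing words vanish
```

A collection-specific list can be passed as `stop_words`, for example one that adds "computer" and "program" for a computer science collection.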
13 Stop lists Commercial systems use only a few stop words ORBIT uses only 8, “and, an, by, from, of, the, with” –patents, scientific and technical (sci-tech) information, trademarks, and Internet domain names
14 Special Cases? Name Recognition –People’s names - “Bill Clinton” –Company names - IBM & Big Blue –Places - New York City, NYC, the Big Apple
15 Stemming Commonly used to conflate morphological variants –combine non-identical words referring to the same concept: compute, computation, computer, … Stemming is used to: –Enhance query formulation (and improve recall) by providing term variants –Reduce size of index files by combining term variants into a single index term
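To make conflation concrete, here is a deliberately naive suffix stripper. It is not the Porter stemmer or any other published algorithm; the suffix list, the `min_stem` threshold, and the names `stem` and `conflate` are all assumptions chosen so that the slide's example words collapse to one index term.

```python
from collections import defaultdict

# Hypothetical suffix list, checked in order (longer suffixes before shorter ones).
SUFFIXES = ("ation", "ing", "ers", "er", "ed", "es", "e", "s")

def stem(word, min_stem=4):
    """Strip the first matching suffix, if at least min_stem characters remain."""
    w = word.lower()
    for suf in SUFFIXES:
        if w.endswith(suf) and len(w) - len(suf) >= min_stem:
            return w[: -len(suf)]
    return w

def conflate(words):
    """Group term variants under their shared stem (one index entry per stem)."""
    groups = defaultdict(list)
    for w in words:
        groups[stem(w)].append(w)
    return dict(groups)

variants = ["compute", "computation", "computer", "computers", "computing", "computed"]
print(conflate(variants))
```

The grouping serves both uses listed above: the index stores one entry per stem, and a query on any variant can be expanded to the others in its group.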