
1 Content Analysis and Statistical Properties of Text. Ray Larson & Warren Sack, University of California, Berkeley, School of Information Management and Systems. SIMS 202: Information Organization and Retrieval, 9/11/2001. Lecture authors: Marti Hearst & Ray Larson & Warren Sack

2 Last Time Boolean Model of Information Retrieval

3 Boolean Queries: Cat; Cat OR Dog; Cat AND Dog; (Cat AND Dog); (Cat AND Dog) OR Collar; (Cat AND Dog) OR (Collar AND Leash); (Cat OR Dog) AND (Collar OR Leash)
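Boolean queries like those above can be evaluated directly with set operations; the following is a minimal sketch over a made-up three-document collection (the document IDs and contents are assumptions for illustration).

```python
# Hypothetical mini-collection: each document is represented as the set of its terms.
docs = {
    "d1": {"cat", "dog", "collar"},
    "d2": {"cat", "leash"},
    "d3": {"dog", "collar", "leash"},
}

def matching(predicate):
    """Return the IDs of documents whose term sets satisfy a Boolean predicate."""
    return {doc_id for doc_id, terms in docs.items() if predicate(terms)}

# (Cat AND Dog) OR (Collar AND Leash)
hits = matching(lambda t: ({"cat", "dog"} <= t) or ({"collar", "leash"} <= t))
print(sorted(hits))  # ['d1', 'd3']: d1 has cat+dog, d3 has collar+leash
```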

4 Boolean Advantages: –simple queries are easy to understand –relatively easy to implement. Disadvantages: –difficult to specify what is wanted –too much returned, or too little –ordering not well determined. Dominant language in commercial systems until the WWW

5 How are the texts handled? What happens if you take the words exactly as they appear in the original text? What about punctuation, capitalization, etc.? What about spelling errors? What about plural vs. singular forms of words? What about cases and declension in non-English languages? What about non-Roman alphabets?

6 Today Overview of Content Analysis; Text Representation; Statistical Characteristics of Text Collections; Zipf distribution; Statistical dependence

7 Content Analysis Automated transformation of raw text into a form that represents some aspect(s) of its meaning. Including, but not limited to: –Automated Thesaurus Generation –Phrase Detection –Categorization –Clustering –Summarization

8 Techniques for Content Analysis Statistical –Single Document –Full Collection Linguistic –Syntactic –Semantic –Pragmatic Knowledge-Based (Artificial Intelligence) Hybrid (Combinations)

9 Text Processing Standard Steps: –Recognize document structure titles, sections, paragraphs, etc. –Break into tokens usually space and punctuation delineated –Stemming/morphological analysis Special issues with morphologically complex languages (e.g., Finnish) –Store in inverted index (to be discussed later)
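The tokenization step above can be sketched in a few lines; this is a deliberately simple space-and-punctuation tokenizer, not a full parser (the regular expression and lowercasing choice are assumptions for illustration).

```python
import re

def tokenize(text):
    """Break text into lowercase tokens, treating anything that is not
    a letter or digit (spaces, punctuation) as a delimiter."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Now is the time, for all good men!"))
# ['now', 'is', 'the', 'time', 'for', 'all', 'good', 'men']
```

A real pipeline would follow this with stemming or morphological analysis, as the slide notes.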

10 [Diagram: IR pipeline — an information need becomes a query (text input), which is parsed; the collection is pre-processed and indexed; documents are ranked against the query.] How is the query constructed? How is the text processed?

11 Document Processing Steps

12 Stemming and Morphological Analysis Goal: “normalize” similar words. Morphology (“form” of words) –Inflectional Morphology E.g., inflect verb endings and noun number; never changes grammatical class –dog, dogs –tengo, tienes, tiene, tenemos, tienen –Derivational Morphology Derive one word from another, often changing grammatical class –build, building; health, healthy

13 Automated Methods Powerful multilingual tools exist for morphological analysis –PCKimmo, Xerox Lexical Technology –Require rules (a “grammar”) and dictionary E.g., “spy”+“s” → “spi”+“es” == “spies” –Use finite-state automata Stemmers: –Very dumb rules work well (for English) –Porter Stemmer: iteratively remove suffixes –Improvement: pass results through a lexicon

14 Porter Stemmer Rules sses → ss; ies → i; ed → NULL; ing → NULL; ational → ate; ization → ize
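A toy sketch of suffix-rule stemming in the spirit of the rules above. The real Porter stemmer applies its rules in ordered steps with “measure” conditions on the remaining stem; this first-match version is a simplification for illustration only.

```python
# First-match suffix rules, ordered so longer/more specific suffixes win
# where they overlap (a simplification of Porter's stepped algorithm).
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"),
         ("ization", "ize"), ("ing", ""), ("ed", "")]

def stem(word):
    """Apply the first matching suffix rule; leave the word alone otherwise."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

print(stem("caresses"))    # caress
print(stem("ponies"))      # poni
print(stem("relational"))  # relate
```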

15 Errors Generated by Porter Stemmer (Krovetz 93)
Errors of commission: organization → organ; generalization → generic; arm → army
Errors of omission: urgency → urgent; triangle → triangular; explain → explanation

16 Table-Lookup Stemming E.g., Karp, Schabes, Zaidel and Egedi, “A Freely Available Wide Coverage Morphological Analyzer for English,” COLING-92. Example table entries: matrices → matrix N 3pl; fish → fish N 3sg, fish V INF, fish N 3pl

17 Statistical Properties of Text Token occurrences in text are not uniformly distributed They are also not normally distributed They do exhibit a Zipf distribution

18 Common words in Tom Sawyer
word  frequency
the   3332
and   2972
a     1775
…
I     783
you   686
Tom   679

19 Plotting Word Frequency by Rank Frequency: how many times each token (word) occurs in the text (or collection). Rank: order the tokens according to how often they occur.
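Counting frequencies and assigning ranks is a one-liner with `collections.Counter`; the sample text here is a made-up toy, not the Tom Sawyer corpus.

```python
from collections import Counter

# Toy sketch: count token frequencies, then rank tokens by frequency.
text = "the cat and the dog and the bird"
counts = Counter(text.split())
ranked = counts.most_common()  # [(word, freq), ...] in descending frequency
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq)
# 1 the 3
# 2 and 2
# ... (singleton words follow)
```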

20 Empirical evaluation of Zipf’s Law for Tom Sawyer
word     frequency (f)  rank (r)  f × r
the      3332           1         3332
and      2972           2         5944
a        1775           3         5325
he       877            10        8770
be       294            30        8820
there    222            40        8880
one      172            50        8600
friends  10             800       8000

21 The Corresponding Zipf Curve
Rank  Freq  Term
1     37    system
2     32    knowledg
3     24    base
4     20    problem
5     18    abstract
6     15    model
7     15    languag
8     15    implem
9     13    reason
10    13    inform
11    11    expert
12    11    analysi
13    10    rule
14    10    program
15    10    oper
16    10    evalu
17    10    comput
18    10    case
19    9     gener
20    9     form

22 Zoom in on the Knee of the Curve
Rank  Freq  Term
43    6     approach
44    5     work
45    5     variabl
46    5     theori
47    5     specif
48    5     softwar
49    5     requir
50    5     potenti
51    5     method
52    5     mean
53    5     inher
54    5     data
55    5     commit
56    5     applic
57    4     tool
58    4     technolog
59    4     techniqu

23 Zipf Distribution The Important Points: –a few elements occur very frequently –a medium number of elements have medium frequency –many elements occur very infrequently

24 Zipf Distribution The product of the frequency of words (f) and their rank (r) is approximately constant –Rank = order of words’ frequency of occurrence Another way to state this is with an approximately correct rule of thumb: –Say the most common term occurs C times –The second most common occurs C/2 times –The third most common occurs C/3 times –…
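The rule of thumb above can be checked numerically: if the r-th ranked term occurs about C/r times, then f·r stays near C for every rank. A minimal sketch, using the Tom Sawyer count of "the" from the earlier slide as C:

```python
# With the top term occurring C times, the r-th ranked term occurs
# about C/r times under Zipf's rule of thumb, so f*r is constant (= C).
C = 3332  # frequency of "the" in Tom Sawyer, per the slide above
for r in (1, 2, 3, 10, 100):
    f = C / r
    print(r, round(f), round(f * r))  # f*r comes back to C every time
```

Real corpora only approximate this, as the f × r column in the empirical table shows, but the product stays within a fairly narrow band across three orders of magnitude of rank.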

25 Empirical evaluation of Zipf’s Law for Tom Sawyer
word     frequency (f)  rank (r)  f × r
the      3332           1         3332
and      2972           2         5944
a        1775           3         5325
he       877            10        8770
be       294            30        8820
there    222            40        8880
one      172            50        8600
friends  10             800       8000

26 Zipf Distribution (linear and log scale)

27 Consequences of Zipf There are always a few very frequent tokens that are not good discriminators –called “stop words” in IR –usually correspond to the linguistic notion of “closed-class” words English examples: to, from, on, and, the, ... Grammatical classes that don’t take on new members. There are always a large number of tokens that occur once and can mess up algorithms. Medium-frequency words are the most descriptive.

28 Frequency vs. Resolving Power Luhn's contribution to automatic text analysis: frequency data can be used to extract words and sentences to represent a document. Luhn, H.P., “The automatic creation of literature abstracts,” IBM Journal of Research and Development, 2, 159-165 (1958)

29 What Kinds of Data Exhibit a Zipf Distribution? Words in a text collection –Virtually any language usage Library book checkout patterns Incoming Web Page Requests (Nielsen) Outgoing Web Page Requests (Cunha & Crovella) Document Size on Web (Cunha & Crovella)

30 Related Distributions/Laws Bradford’s Law of Literary Yield (1934): if periodicals are ranked into three groups, each yielding the same number of articles on a specified topic, the numbers of periodicals in the groups increase geometrically. Thus, Leith (1969) shows that by reading only the “core” periodicals of your speciality you will miss about 40% of the articles relevant to it.

31 Related Distributions/Laws Lotka’s distribution of literary productivity (1926): the number of author-scientists who had published N papers in a given field was roughly 1/N² times the number of authors who had published one paper only.

32 Related Distributions/Laws For more examples, see Robert A. Fairthorne, “Empirical Distributions (Bradford-Zipf-Mandelbrot) for Bibliometric Description and Prediction,” Journal of Documentation, 25(4), 319-341. Pareto distribution of wealth; Willis taxonomic distribution in biology; Mandelbrot on self-similarity and market prices and communication errors; etc.

33 Statistical Independence vs. Statistical Dependence How likely is a red car to drive by, given we’ve seen a black one? How likely is the word “ambulance” to appear, given that we’ve seen “car accident”? Colors of cars driving by are independent (although more frequent colors are more likely). Words in text are not independent (although again more frequent words are more likely).

34 Statistical Independence Two events x and y are statistically independent if the product of their probabilities of happening individually equals the probability of their happening together: P(x) × P(y) = P(x, y).
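A minimal numeric check of the definition, using two fair coin flips as an assumed textbook example of independent events:

```python
from fractions import Fraction

# Two fair coin flips: P(heads on flip 1) * P(heads on flip 2)
# equals P(both heads), so the events are statistically independent.
p_x = Fraction(1, 2)   # P(x): heads on flip 1
p_y = Fraction(1, 2)   # P(y): heads on flip 2
p_xy = Fraction(1, 4)  # P(x, y): both flips come up heads
print(p_x * p_y == p_xy)  # True
```

For dependent events (like co-occurring words in text), P(x, y) differs from P(x)·P(y), which is exactly what the association measures on the next slides exploit.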

35 Statistical Independence and Dependence What are examples of things that are statistically independent? What are examples of things that are statistically dependent?

36 Lexical Associations Subjects write the first word that comes to mind –doctor/nurse; black/white (Palermo & Jenkins 64). Text corpora yield similar associations. One measure: Mutual Information (Church and Hanks 89). If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection).
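The measure referenced above (the slide's formula did not survive transcription) is Church & Hanks' association ratio, log₂(P(x, y) / (P(x)·P(y))): under independence the ratio is 1 and the score is 0. A sketch with made-up counts (N and all counts here are assumptions for illustration, not the AP corpus figures):

```python
import math

# Church & Hanks' association ratio (pointwise mutual information):
#   I(x, y) = log2( P(x, y) / (P(x) * P(y)) )
# Toy counts over an assumed corpus of N word positions.
N = 1_000_000
count_x, count_y, count_xy = 200, 300, 50  # e.g. "doctor", "nurse", co-occurrences

p_x, p_y, p_xy = count_x / N, count_y / N, count_xy / N
pmi = math.log2(p_xy / (p_x * p_y))
print(round(pmi, 2))  # strongly positive: the pair co-occurs far more than chance
```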

37 Statistical Independence [Figure: co-occurrence counts are computed over a sliding window of words, w1 … w21, moving across positions a–p in the text]

38 Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

39 Uninteresting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89) These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun.

40 Document Vectors Documents are represented as “bags of words” Represented as vectors when used computationally –A vector is an array of floating-point numbers –Has direction and magnitude –Each vector holds a place for every term in the collection –Therefore, most vectors are sparse
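A minimal sketch of the bag-of-words vector idea. Because most vectors are sparse, it is natural to store only the nonzero counts and expand on demand; the vocabulary and document A's counts are taken from the deck's nova/galaxy example.

```python
# Each document is conceptually a vector with one slot per vocabulary term;
# storing only the nonzero counts keeps the mostly-empty vectors small.
vocab = ["nova", "galaxy", "heat", "hollywood", "film", "role", "diet", "fur"]

doc_a_sparse = {"nova": 10, "galaxy": 5, "heat": 3}  # document A from the deck

def to_dense(sparse):
    """Expand a sparse {term: count} map into a full vector over vocab."""
    return [sparse.get(term, 0) for term in vocab]

print(to_dense(doc_a_sparse))  # [10, 5, 3, 0, 0, 0, 0, 0]
```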

41 Document Vectors One location for each word.
     nova  galaxy  heat  h'wood  film  role  diet  fur
A     10     5      3
B      5    10
C                          10     8     7
D                           9    10     5
E                                             10   10
F                                              9   10
G      5            7                   9
H            6     10       2     8
I                           7     5            1    3
“Nova” occurs 10 times in text A; “Galaxy” occurs 5 times in text A; “Heat” occurs 3 times in text A. (Blank means 0 occurrences.)

42 Document Vectors One location for each word. [Same term-by-document matrix as the previous slide.] “Hollywood” occurs 7 times in text I; “Film” occurs 5 times in text I; “Diet” occurs 1 time in text I; “Fur” occurs 3 times in text I.

43 Document Vectors [Same term-by-document matrix, with the document IDs A–I highlighted.]

44 We Can Plot the Vectors [Figure: documents plotted in a 2-D space with axes “Star” and “Diet” — a doc about astronomy, a doc about movie stars, and a doc about mammal behavior]

45 Documents in 3D Space

46 Content Analysis Summary Content Analysis: transforming raw text into more computationally useful forms Words in text collections exhibit interesting statistical properties –Word frequencies have a Zipf distribution –Word co-occurrences exhibit dependencies Text documents are transformed to vectors –Pre-processing includes tokenization, stemming, collocations/phrases –Documents occupy multi-dimensional space.

47 [Diagram: the same IR pipeline — information need → query (text input) → parse; collection → pre-process → index; rank.] How is the index constructed?

48 Inverted Index This is the primary data structure for text indexes Main Idea: –Invert documents into a big index Basic steps: –Make a “dictionary” of all the tokens in the collection –For each token, list all the docs it occurs in. –Do a few things to reduce redundancy in the data structure

49 Inverted Indexes We have seen “vector files” conceptually. An inverted file is a vector file “inverted” so that rows become columns and columns become rows.

50 How Are Inverted Files Created? Documents are parsed to extract tokens. These are saved with the document ID. Doc 1: “Now is the time for all good men to come to the aid of their country.” Doc 2: “It was a dark and stormy night in the country manor. The time was past midnight.”

51 How Inverted Files are Created After all documents have been parsed, the inverted file is sorted alphabetically.

52 How Inverted Files are Created Multiple term entries for a single document are merged. Within-document term frequency information is compiled.

53 How Inverted Files are Created Then the file can be split into –A Dictionary file and –A Postings file

54 How Inverted Files are Created [Figure: the dictionary file and its corresponding postings file]
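The construction steps above (parse into tokens, merge per-document entries, compile within-document frequencies) can be sketched directly, using the two example documents from the earlier slide; sorting the dictionary keys stands in for the alphabetical sort step.

```python
from collections import defaultdict

# Parse each document into tokens, then build the inverted file:
# token -> {doc_id: within-document term frequency}.
docs = {
    1: "now is the time for all good men to come to the aid of their country",
    2: "it was a dark and stormy night in the country manor the time was past midnight",
}

postings = defaultdict(dict)
for doc_id, text in docs.items():
    for token in text.split():
        postings[token][doc_id] = postings[token].get(doc_id, 0) + 1

print(postings["time"])     # {1: 1, 2: 1}
print(postings["dark"])     # {2: 1}
print(sorted(postings)[:4]) # dictionary entries, sorted alphabetically
```

Splitting this structure into a separate dictionary file and postings file, as the slide describes, is then just a matter of storage layout.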

55 Inverted indexes Permit fast search for individual terms For each term, you get a list consisting of: –document ID –frequency of term in doc (optional) –position of term in doc (optional) These lists can be used to solve Boolean queries: country -> d1, d2 manor -> d2 country AND manor -> d2 Also used for statistical ranking algorithms

56 How Inverted Files are Used Query on “time” AND “dark”: 2 docs with “time” in the dictionary → IDs 1 and 2 from the postings file; 1 doc with “dark” in the dictionary → ID 2 from the postings file. Therefore, only doc 2 satisfies the query.
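Answering the AND query above amounts to intersecting the two postings lists; a minimal sketch over a toy index holding just the document-ID sets (assumed to have been built from the two example documents):

```python
# Toy inverted index: term -> set of document IDs from its postings list.
index = {"time": {1, 2}, "dark": {2}, "country": {1, 2}, "manor": {2}}

def boolean_and(term1, term2):
    """Resolve 'term1 AND term2' by intersecting the postings sets."""
    return index.get(term1, set()) & index.get(term2, set())

print(boolean_and("time", "dark"))      # {2}: only doc 2 satisfies the query
print(boolean_and("country", "manor"))  # {2}
```

OR queries would use set union instead, and statistical ranking would additionally consult the per-document frequencies the slide mentions.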

57 Next Time Term weighting Statistical ranking

