SLIDE 1IS 202 – FALL 2004 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2004 SIMS 202: Information Organization and Retrieval Lecture 7: Statistical Properties of Text and Vector Representation
SLIDE 2IS 202 – FALL 2004 Lecture Overview Statistical Properties of Text –Zipf Distribution –Statistical Dependence Indexing and Inverted Files Vector Representation Term Weights Vector Matching Credit for some of the slides in this lecture goes to Marti Hearst
SLIDE 3IS 202 – FALL 2004 Lecture Overview Statistical Properties of Text –Zipf Distribution –Statistical Dependence Indexing and Inverted Files Vector Representation Term Weights Vector Matching Credit for some of the slides in this lecture goes to Marti Hearst
SLIDE 4IS 202 – FALL 2004 A Small Collection (Stems)
Rank   Freq   Term
1      37     system
2      32     knowledg
3      24     base
4      20     problem
5      18     abstract
6      15     model
7      15     languag
8      15     implem
9      13     reason
10–18  9–13   inform, expert, analysi, rule, program, oper, evalu, comput, case
19     9      gener
20     9      form
Low-frequency tail: enhanc, energi, emphasi, detect, desir, date, critic, content, consider, concern, compon, compar, commerci, clause, aspect, area, aim, affect
SLIDE 5IS 202 – FALL 2004 The Corresponding Zipf Curve [Plot of term frequency against rank for the collection on the previous slide, falling steeply from rank 1 (system, 37) into a long tail of low-frequency terms]
SLIDE 6IS 202 – FALL 2004 Zipf Distribution The Important Points: –A few elements occur very frequently –A medium number of elements have medium frequency –Many elements occur very infrequently
SLIDE 7 Zipf Distribution [Plots of the Zipf distribution on a linear scale and on a logarithmic scale]
SLIDE 8IS 202 – FALL 2004 Related Distributions/“Laws” Bradford’s Law of Scattering; Lotka’s Law of Productivity; De Solla Price’s Urn Model for “Cumulative Advantage Processes” [Urn-model diagram: pick a ball, then replace it plus one more of the same kind (+1), so the relevant probabilities grow: ½ = 50%, 2/3 ≈ 67%, ¾ = 75%]
SLIDE 9IS 202 – FALL 2004 Frequent Words on the WWW: the, a, to, of, and, in, s, for, on, this, is, by, with, or, at, all, are, from, e, you, be, that, not, an, as, home, it, i, have, if, new, t, your, page, about, com, information, will, can, more, has, no, other, one, c, d, m, was, copyright, us
SLIDE 10IS 202 – FALL 2004 Word Frequency vs. Resolving Power The most frequent words are not the most descriptive (from van Rijsbergen 79)
SLIDE 11IS 202 – FALL 2004 Statistical Independence Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together
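The slide's formula itself was not preserved; in standard notation, statistical independence of two events x and y means P(x, y) = P(x) P(y), i.e. the joint probability equals the product of the individual probabilities.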
SLIDE 12IS 202 – FALL 2004 Lexical Associations Subjects write first word that comes to mind –doctor/nurse; black/white (Palermo & Jenkins 64) Text Corpora can yield similar associations One measure: Mutual Information (Church and Hanks 89) If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
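The Church & Hanks (1989) measure referred to here is pointwise mutual information; the slide's formula is missing, but it is conventionally written as

I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}

If x and y occurred independently, P(x, y) would equal P(x) P(y), the ratio inside the log would be 1, and the mutual information would be 0, which is what the remark about the numerator and denominator being equal refers to.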
SLIDE 13IS 202 – FALL 2004 Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89) [Table of word pairs involving “doctor” with high mutual information]
SLIDE 14IS 202 – FALL 2004 Un-Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89) [Table of frequent but uninformative co-occurring words] These associations are likely to happen simply because the non-doctor words shown here are very common, and therefore likely to co-occur with any noun
SLIDE 15IS 202 – FALL 2004 Content Analysis Summary Content Analysis: transforming raw text into more computationally useful forms Words in text collections exhibit interesting statistical properties –Word frequencies have a Zipf distribution –Word co-occurrences exhibit dependencies
SLIDE 16IS 202 – FALL 2004 Lecture Overview Statistical Properties of Text –Zipf Distribution –Statistical Dependence Indexing and Inverted Files Vector Representation Term Weights Vector Matching Credit for some of the slides in this lecture goes to Marti Hearst
SLIDE 17IS 202 – FALL 2004 Inverted Indexes We have seen “Vector files” conceptually –An Inverted File is a vector file “inverted” so that rows become columns and columns become rows
SLIDE 18IS 202 – FALL 2004 How Inverted Files Are Created Documents are parsed to extract tokens. These are saved with the Document ID. Doc 1: “Now is the time for all good men to come to the aid of their country” Doc 2: “It was a dark and stormy night in the country manor. The time was past midnight”
SLIDE 19IS 202 – FALL 2004 How Inverted Files are Created After all documents have been parsed, the inverted file is sorted alphabetically
SLIDE 20IS 202 – FALL 2004 How Inverted Files are Created Multiple term entries for a single document are merged Within-document term frequency information is compiled
SLIDE 21IS 202 – FALL 2004 How Inverted Files are Created Then the file can be split into –A Dictionary file – and –A Postings file
SLIDE 22IS 202 – FALL 2004 How Inverted Files are Created [Example dictionary file (term, number of documents) and postings file (document ID, term frequency) for Doc 1 and Doc 2]
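As a minimal Python sketch of the construction steps above (parse with document IDs, sort, merge with within-document frequencies, then split into a dictionary and a postings file), assuming a simple whitespace tokenizer and in-memory dictionaries rather than the lecture's actual file formats:

# Minimal sketch of inverted-file construction (simplified; not the lecture's exact format).
from collections import Counter, defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

# 1. Parse documents into (token, doc_id) pairs.
pairs = [(tok.strip(".").lower(), doc_id)
         for doc_id, text in docs.items()
         for tok in text.split()]

# 2. Sort alphabetically (then by document ID).
pairs.sort()

# 3. Merge duplicate entries, compiling within-document term frequencies.
freqs = Counter(pairs)                      # (term, doc_id) -> term frequency

# 4. Split into a dictionary file and a postings file.
postings = defaultdict(list)                # term -> [(doc_id, tf), ...]
for (term, doc_id), tf in sorted(freqs.items()):
    postings[term].append((doc_id, tf))
dictionary = {term: len(plist) for term, plist in postings.items()}  # term -> number of docs

print(dictionary["time"])   # 2 documents contain "time"
print(postings["time"])     # [(1, 1), (2, 1)]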
SLIDE 23IS 202 – FALL 2004 Inverted Indexes Permit fast search for individual terms For each term, you get a list consisting of: –Document ID –Frequency of term in doc (optional) –Position of term in doc (optional) These lists can be used to solve Boolean queries: country -> d1, d2 manor -> d2 country AND manor -> d2 Also used for statistical ranking algorithms
SLIDE 24IS 202 – FALL 2004 How Inverted Files are Used [Dictionary and postings files from the previous slides] Query on “time” AND “dark”: 2 docs with “time” in the dictionary -> IDs 1 and 2 from the postings file; 1 doc with “dark” in the dictionary -> ID 2 from the postings file. Therefore, only Doc 2 satisfies the query
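Continuing the same simplified sketch, a Boolean AND query is answered by intersecting the document-ID lists taken from the postings file (the postings literal below is the one produced for the two example documents):

# Simplified AND query over the example postings: intersect document-ID lists.
postings = {"time": [(1, 1), (2, 1)], "dark": [(2, 1)]}   # term -> [(doc_id, tf), ...]

def docs_with(term):
    return {doc_id for doc_id, _tf in postings.get(term, [])}

hits = docs_with("time") & docs_with("dark")
print(hits)   # {2} -- only Doc 2 satisfies "time" AND "dark"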
SLIDE 25IS 202 – FALL 2004 Lecture Overview Review –Boolean Searching –Content Analysis Statistical Properties of Text –Zipf Distribution –Statistical Dependence Indexing and Inverted Files Vector Representation Term Weights Vector Matching Credit for some of the slides in this lecture goes to Marti Hearst
SLIDE 26IS 202 – FALL 2004 Document Vectors Documents are represented as “bags of words” Represented as vectors when used computationally –A vector is like an array of floating-point numbers –Has direction and magnitude –Each vector holds a place for every term in the collection –Therefore, most vectors are sparse
SLIDE 27IS 202 – FALL 2004 Vector Space Model Documents are represented as vectors in term space –Terms are usually stems –Documents represented by binary or weighted vectors of terms Queries represented the same as documents Query and Document weights are based on length and direction of their vector A vector distance measure between the query and documents is used to rank retrieved documents
SLIDE 28IS 202 – FALL 2004 Vector Representation Documents and Queries are represented as vectors Position 1 corresponds to term 1, position 2 to term 2, position t to term t The weight of the term is stored in each position
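A common way to write this (a reconstruction, since the slide's notation was not preserved) is D_i = (w_{i1}, w_{i2}, …, w_{it}) for document i and Q = (w_{q1}, w_{q2}, …, w_{qt}) for the query, where w_{ik} is the weight of term k in document i and t is the number of terms in the collection.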
SLIDE 29IS 202 – FALL 2004 Document Vectors “Nova” occurs 10 times in text A “Galaxy” occurs 5 times in text A “Heat” occurs 3 times in text A (Blank means 0 occurrences.)
SLIDE 30IS 202 – FALL 2004 Document Vectors “Hollywood” occurs 7 times in text I “Film” occurs 5 times in text I “Diet” occurs 1 time in text I “Fur” occurs 3 times in text I
SLIDE 31IS 202 – FALL 2004 Document Vectors
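As an illustration of this representation, a small Python sketch building the vectors for texts A and I from the counts on the two preceding slides (the vocabulary ordering is an assumption made for the example):

# Sparse document vectors for texts A and I, using the counts from the previous slides.
vocabulary = ["nova", "galaxy", "heat", "hollywood", "film", "diet", "fur"]  # assumed ordering

doc_a = {"nova": 10, "galaxy": 5, "heat": 3}
doc_i = {"hollywood": 7, "film": 5, "diet": 1, "fur": 3}

def to_vector(counts):
    # One position per vocabulary term; absent terms get 0, so most entries are zero (sparse).
    return [counts.get(term, 0) for term in vocabulary]

print(to_vector(doc_a))   # [10, 5, 3, 0, 0, 0, 0]
print(to_vector(doc_i))   # [0, 0, 0, 7, 5, 1, 3]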
SLIDE 32IS 202 – FALL 2004 We Can Plot the Vectors [2-D plot with axes “star” and “diet” showing a doc about astronomy, a doc about movie stars, and a doc about mammal behavior placed according to their term weights]
SLIDE 33IS 202 – FALL 2004 Documents in 3D Space Primary assumption of the Vector Space Model: Documents that are “close together” in space are similar in meaning
SLIDE 34IS 202 – FALL 2004 Vector Space Documents and Queries [Figure: documents D1–D11 and query Q plotted in the space of terms t1, t2, t3, with regions labeled by Boolean term combinations] Q is a query – also represented as a vector
SLIDE 35IS 202 – FALL 2004 Documents in Vector Space [Figure: documents D1–D11 plotted in the space of terms t1, t2, t3]
SLIDE 36IS 202 – FALL 2004 Lecture Overview Statistical Properties of Text –Zipf Distribution –Statistical Dependence Indexing and Inverted Files Vector Representation Term Weights Vector Matching Credit for some of the slides in this lecture goes to Marti Hearst
SLIDE 37IS 202 – FALL 2004 Assigning Weights to Terms Binary Weights Raw term frequency tf*idf –Recall the Zipf distribution –Want to weight terms highly if they are Frequent in relevant documents … BUT Infrequent in the collection as a whole Automatically derived thesaurus terms
SLIDE 38IS 202 – FALL 2004 Binary Weights Only the presence (1) or absence (0) of a term is included in the vector
SLIDE 39IS 202 – FALL 2004 Raw Term Weights The frequency of occurrence for the term in each document is included in the vector
SLIDE 40IS 202 – FALL 2004 Assigning Weights tf*idf measure: –Term frequency (tf) –Inverse document frequency (idf) A way to deal with some of the problems of the Zipf distribution Goal: Assign a tf*idf weight to each term in each document
SLIDE 41IS 202 – FALL 2004 Simple tf*idf
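The formula image on this slide is missing; a standard form of the simple tf*idf weight, consistent with the quantities defined on the surrounding slides, is

w_{ik} = tf_{ik} \cdot \log(N / n_k)

where tf_{ik} is the frequency of term k in document i, N is the number of documents in the collection, and n_k is the number of documents that contain term k.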
SLIDE 42IS 202 – FALL 2004 Inverse Document Frequency IDF provides high values for rare words and low values for common words. For a collection of documents (N = 10,000):
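The slide's example table is missing; as a worked illustration with N = 10,000 and base-10 logarithms (the base is an assumption):

n_k = 1      ->  idf = log(10000/1)      = 4
n_k = 100    ->  idf = log(10000/100)    = 2
n_k = 10000  ->  idf = log(10000/10000)  = 0

A term appearing in a single document gets the largest idf; a term appearing in every document gets idf 0.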
SLIDE 43IS 202 – FALL 2004 Lecture Overview Review –Boolean Searching –Content Analysis Statistical Properties of Text –Zipf Distribution –Statistical Dependence Indexing and Inverted Files Vector Representation Term Weights Vector Matching Credit for some of the slides in this lecture goes to Marti Hearst
SLIDE 44IS 202 – FALL 2004 Similarity Measures Simple matching (coordination level match) Dice’s Coefficient Jaccard’s Coefficient Cosine Coefficient Overlap Coefficient
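The coefficient formulas were not preserved on the slide; for binary vectors, writing |Q| and |D| for the number of terms in the query and the document and |Q ∩ D| for the number of terms they share, the standard definitions are:
Simple matching: |Q ∩ D|
Dice: 2|Q ∩ D| / (|Q| + |D|)
Jaccard: |Q ∩ D| / |Q ∪ D|
Cosine: |Q ∩ D| / sqrt(|Q| · |D|)
Overlap: |Q ∩ D| / min(|Q|, |D|)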
SLIDE 45IS 202 – FALL 2004 tf*idf Normalization Normalize the term weights (so longer vectors are not unfairly given more weight) –To normalize usually means to force all values to fall within a certain range, typically between 0 and 1 inclusive
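One common form of this normalization (a reconstruction consistent with the cosine measure used on the following slides) divides each tf*idf weight by the length of the document's weight vector:

w_{ik} = \frac{tf_{ik} \cdot \log(N / n_k)}{\sqrt{\sum_{j=1}^{t} \left( tf_{ij} \cdot \log(N / n_j) \right)^2}}

so that every document vector has length 1.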
SLIDE 46IS 202 – FALL 2004 Vector Space Similarity Now, the similarity of two documents is: This is also called the cosine normalized inner product –The normalization was done when weighting the terms
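With the weights normalized as above, the missing formula is simply the inner product of the two weight vectors,

sim(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \, w_{jk}

which equals the cosine of the angle between the vectors because both already have length 1.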
SLIDE 47IS 202 – FALL 2004 Vector Space Similarity Measure Combine tf and idf into a similarity measure
SLIDE 48IS 202 – FALL 2004 Computing Similarity Scores
SLIDE 49IS 202 – FALL 2004 What’s Cosine Anyway? “One of the basic trigonometric functions encountered in trigonometry. Let theta be an angle measured counterclockwise from the x-axis along the arc of the unit circle. Then cos(theta) is the horizontal coordinate of the arc endpoint. As a result of this definition, the cosine function is periodic with period 2pi.” From
SLIDE 50IS 202 – FALL 2004 Cosine vs. Degrees [Plot of the cosine value against the angle in degrees]
SLIDE 51IS 202 – FALL 2004 Computing a Similarity Score
SLIDE 52IS 202 – FALL 2004 Vector Space Matching [Figure: query Q and documents D1, D2 plotted against axes Term A and Term B] D_i = (d_{i1}, w_{di1}; d_{i2}, w_{di2}; …; d_{it}, w_{dit}) Q = (q_{i1}, w_{qi1}; q_{i2}, w_{qi2}; …; q_{it}, w_{qit}) Example: Q = (0.4, 0.8), D1 = (0.8, 0.3), D2 = (0.2, 0.7)
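As a quick Python check of the match between the query and the two documents on this slide (the rounded similarity values are computed here, not taken from the slide):

# Cosine similarity for the example vectors on this slide.
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

q  = (0.4, 0.8)
d1 = (0.8, 0.3)
d2 = (0.2, 0.7)

print(round(cosine(q, d1), 2))   # ~0.73
print(round(cosine(q, d2), 2))   # ~0.98 -- D2 is ranked above D1 for this query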
SLIDE 53IS 202 – FALL 2004 Weighting Schemes We have seen something of –Binary –Raw term weights –TF*IDF There are many other possibilities –IDF alone –Normalized term frequency
SLIDE 54IS 202 – FALL 2004 Document Space Has High Dimensionality What happens beyond 2 or 3 dimensions? Similarity still has to do with how many tokens are shared in common More terms means that it is harder to understand which subsets of words are shared among similar documents One approach to handling high dimensionality: Clustering
SLIDE 55IS 202 – FALL 2004 Vector Space Visualization
SLIDE 56IS 202 – FALL 2004 Text Clustering Finds overall similarities among groups of documents Finds overall similarities among groups of tokens Picks out some themes, ignores others
SLIDE 57IS 202 – FALL 2004 Text Clustering Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw [Scatter plot of points in a two-term space (Term 1 vs. Term 2) grouped into clusters]
SLIDE 58IS 202 – FALL 2004 Scatter/Gather Cutting, Pedersen, Tukey & Karger 92, 93, Hearst & Pedersen 95 Cluster sets of documents into general “themes”, like a table of contents Display the contents of the clusters by showing topical terms and typical titles User chooses subsets of the clusters and re-clusters the documents within Resulting new groups have different “themes”
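A minimal sketch of the scatter/gather loop using scikit-learn's KMeans on a toy term-document matrix; the data, the cluster counts, and the use of k-means are assumptions for illustration, not the clustering algorithms of the original system:

# Toy Scatter/Gather loop: cluster, let the "user" pick a theme, then re-cluster it.
import numpy as np
from sklearn.cluster import KMeans

# Rows are documents, columns are (weighted) term counts.
docs = np.array([
    [5, 0, 1],   # astronomy-flavored
    [4, 1, 0],
    [0, 5, 2],   # film/tv-flavored
    [1, 4, 1],
    [0, 1, 5],   # flora/fauna-flavored
    [1, 0, 4],
])

# Scatter: cluster the whole collection into two broad themes.
themes = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(docs)

# Gather: pick the larger theme and re-cluster only its documents into sub-themes.
picked = np.argmax(np.bincount(themes))
subset = docs[themes == picked]
subthemes = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(subset)

print(themes, subthemes)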
SLIDE 59IS 202 – FALL 2004 S/G Example: Query on “star” Encyclopedia text, clustered into themes with document counts: 14 sports; 8 symbols; 47 film, tv; 68 film, tv (p); 7 music; 97 astrophysics; 67 astronomy (p); 12 stellar phenomena; 10 flora/fauna; 49 galaxies, stars; 29 constellations; 7 miscellaneous. Clustering and re-clustering is entirely automated
SLIDE 63IS 202 – FALL 2004 Clustering Result Sets Advantages: –See some main themes Disadvantage: –Many ways documents could group together are hidden Thinking point: What is the relationship to classification systems and facets?
SLIDE 64IS 202 – FALL 2004 Salton
SLIDE 65IS 202 – FALL 2004 Cooper
SLIDE 66IS 202 – FALL 2004 Dumais
SLIDE 67IS 202 – FALL 2004 Next Time Probabilistic Ranking and Relevance Feedback Readings –Cheshire II: Designing a Next-Generation Online Catalog (Larson)