2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002

1 2002.10.31 - SLIDE 1IS 202 – FALL 2002 Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002 http://www.sims.berkeley.edu/academics/courses/is202/f02/ SIMS 202: Information Organization and Retrieval Lecture 18: Vector Representation

2 2002.10.31 - SLIDE 2IS 202 – FALL 2002 Lecture Overview Review –Content Analysis –Statistical Properties of Text Zipf Distribution Statistical Dependence –Indexing and Inverted Files Vector Representation Term Weights Vector Matching Clustering Credit for some of the slides in this lecture goes to Marti Hearst

4 2002.10.31 - SLIDE 4IS 202 – FALL 2002 Techniques for Content Analysis Statistical –Single Document –Full Collection Linguistic –Syntactic –Semantic –Pragmatic Knowledge-Based (Artificial Intelligence) Hybrid (Combinations)

5 2002.10.31 - SLIDE 5 IS 202 – FALL 2002 Content Analysis Areas [Diagram labels: Information Need, Text Input, Query, Parse, Rank, Collections, Pre-Process, Index. Annotations: “How is the query constructed?” and “How is the text processed?”]

6 2002.10.31 - SLIDE 6 Document Processing Steps From “Modern IR” Textbook

7 2002.10.31 - SLIDE 7IS 202 – FALL 2002 Errors Generated by Porter Stemmer From Krovetz ‘93

8 2002.10.31 - SLIDE 8 IS 202 – FALL 2002 A Small Collection (Stems)
Rank  Freq  Term
1     37    system
2     32    knowledg
3     24    base
4     20    problem
5     18    abstract
6     15    model
7     15    languag
8     15    implem
9     13    reason
10    13    inform
11    11    expert
12    11    analysi
13    10    rule
14    10    program
15    10    oper
16    10    evalu
17    10    comput
18    10    case
19    9     gener
20    9     form
...
150   2     enhanc
151   2     energi
152   2     emphasi
153   2     detect
154   2     desir
155   2     date
156   2     critic
157   2     content
158   2     consider
159   2     concern
160   2     compon
161   2     compar
162   2     commerci
163   2     clause
164   2     aspect
165   2     area
166   2     aim
167   2     affect

9 2002.10.31 - SLIDE 9 IS 202 – FALL 2002 The Corresponding Zipf Curve [Figure: frequency plotted against rank for the top 20 stems of the small collection, from system (rank 1, freq 37) to form (rank 20, freq 9)]

10 2002.10.31 - SLIDE 10IS 202 – FALL 2002 Zipf Distribution The Important Points: –A few elements occur very frequently –A medium number of elements have medium frequency –Many elements occur very infrequently
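These points can be checked on any token stream by counting terms and sorting by frequency. The following is a minimal sketch, not part of the lecture: `rank_frequency` is a hypothetical helper name and the token list is made-up toy data.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency, term) triples, most frequent term first."""
    counts = Counter(tokens)
    # Sort by descending frequency, breaking ties alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, freq, term) for rank, (term, freq) in enumerate(ordered, start=1)]

# Toy token stream (an assumption for illustration, not lecture data).
tokens = ["system"] * 6 + ["knowledg"] * 3 + ["base"] * 2 + ["rule"]
table = rank_frequency(tokens)
for rank, freq, term in table:
    # Under Zipf's law, rank * freq stays roughly constant.
    print(rank, freq, term, rank * freq)
```

On a real collection the same table, plotted on log-log axes, should come out close to a straight line.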

11 2002.10.31 - SLIDE 11 Zipf Distribution [Figure: two plots of the distribution, one on a linear scale and one on a logarithmic scale]

12 2002.10.31 - SLIDE 12 IS 202 – FALL 2002 Related Distributions/“Laws” Bradford’s Law of Scattering; Lotka’s Law of Productivity; De Solla Price’s Urn Model for “Cumulative Advantage Processes” [Figure: urn model with steps labeled Pick and Replace +1, and proportions ½ = 50%, 2/3 = 66%, ¾ = 75%]

13 2002.10.31 - SLIDE 13 IS 202 – FALL 2002 Frequent Words on the WWW (see http://elib.cs.berkeley.edu/docfreq/docfreq.html)
Frequency  Word
65002930   the
62789720   a
60857930   to
57248022   of
54078359   and
52928506   in
50686940   s
49986064   for
45999001   on
42205245   this
41203451   is
39779377   by
35439894   with
35284151   or
34446866   at
33528897   all
31583607   are
30998255   from
30755410   e
30080013   you
29669506   be
29417504   that
28542378   not
28162417   an
28110383   as
28076530   home
27650474   it
27572533   i
24548796   have
24420453   if
24376758   new
24171603   t
23951805   your
23875218   page
22292805   about
22265579   com
22107392   information
21647927   will
21368265   can
21367950   more
21102223   has
20621335   no
19898015   other
19689603   one
19613061   c
19394862   d
19279458   m
19199145   was
19075253   copyright
18636563   us

14 2002.10.31 - SLIDE 14IS 202 – FALL 2002 Word Frequency vs. Resolving Power The most frequent words are not the most descriptive (from van Rijsbergen 79)

15 2002.10.31 - SLIDE 15IS 202 – FALL 2002 Statistical Independence Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together

16 2002.10.31 - SLIDE 16IS 202 – FALL 2002 Lexical Associations Subjects write first word that comes to mind –doctor/nurse; black/white (Palermo & Jenkins 64) Text Corpora can yield similar associations One measure: Mutual Information (Church and Hanks 89) If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
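The formula on this slide did not survive in the transcript. The measure from Church and Hanks is the association ratio I(x, y) = log2( P(x, y) / (P(x) P(y)) ), which is 0 when the two words occur independently. The sketch below estimates it from raw counts; the function name and all counts are hypothetical, chosen only to show the behavior.

```python
import math

def mutual_information(count_xy, count_x, count_y, n):
    """Association ratio: log2( P(x,y) / (P(x) * P(y)) ), estimated from counts."""
    p_xy = count_xy / n
    p_x = count_x / n
    p_y = count_y / n
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts over a corpus of n word positions (not from the AP corpus).
n = 1_000_000
# Co-occurrence exactly as often as independence predicts: ratio is 1, log is 0.
independent = mutual_information(count_xy=1, count_x=1000, count_y=1000, n=n)
# Co-occurrence 64 times more often than independence predicts: log2(64) = 6.
associated = mutual_information(count_xy=64, count_x=1000, count_y=1000, n=n)
print(independent, associated)
```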

17 2002.10.31 - SLIDE 17IS 202 – FALL 2002 Interesting Associations with “Doctor” AP Corpus, N=15 million, Church & Hanks 89

18 2002.10.31 - SLIDE 18 IS 202 – FALL 2002 Un-Interesting Associations with “Doctor” AP Corpus, N=15 million, Church & Hanks 89 These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun

19 2002.10.31 - SLIDE 19IS 202 – FALL 2002 Content Analysis Summary Content Analysis: transforming raw text into more computationally useful forms Words in text collections exhibit interesting statistical properties –Word frequencies have a Zipf distribution –Word co-occurrences exhibit dependencies Text documents are transformed to vectors –Pre-processing includes tokenization, stemming, collocations/phrases –Documents occupy multi-dimensional space

20 2002.10.31 - SLIDE 20IS 202 – FALL 2002 Inverted Indexes We have seen “Vector files” conceptually –An Inverted File is a vector file “inverted” so that rows become columns and columns become rows

21 2002.10.31 - SLIDE 21IS 202 – FALL 2002 How Inverted Files are Created Dictionary Postings

22 2002.10.31 - SLIDE 22IS 202 – FALL 2002 Inverted Indexes Permit fast search for individual terms For each term, you get a list consisting of: –Document ID –Frequency of term in doc (optional) –Position of term in doc (optional) These lists can be used to solve Boolean queries: country -> d1, d2 manor -> d2 country AND manor -> d2 Also used for statistical ranking algorithms
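A minimal sketch of the structure described above, assuming whitespace tokenization and a made-up two-document collection that reproduces the country/manor example. The function names are hypothetical, and each posting keeps the optional positions, so the term frequency is the length of the position list.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions of the term in that doc]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

def boolean_and(index, term_a, term_b):
    """Doc IDs containing both terms: the intersection of two posting lists."""
    return sorted(set(index.get(term_a, {})) & set(index.get(term_b, {})))

# Hypothetical documents (assumed text; only the IDs d1 and d2 come from the slide).
docs = {"d1": "the country road", "d2": "a country manor"}
index = build_inverted_index(docs)
print(sorted(index["country"]))                 # posting list for "country"
print(boolean_and(index, "country", "manor"))   # country AND manor
```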

23 2002.10.31 - SLIDE 23IS 202 – FALL 2002 How Inverted Files are Used Dictionary Postings Query on “time” AND “dark” 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file 1 doc with “dark” in dictionary -> ID 2 from posting file Therefore, only doc 2 satisfied the query

25 2002.10.31 - SLIDE 25 IS 202 – FALL 2002 Document Vectors Documents are represented as “bags of words” Represented as vectors when used computationally –A vector is like an array of floating-point numbers –Has direction and magnitude –Each vector holds a place for every term in the collection –Therefore, most vectors are sparse

26 2002.10.31 - SLIDE 26IS 202 – FALL 2002 Vector Space Model Documents are represented as vectors in term space –Terms are usually stems –Documents represented by binary or weighted vectors of terms Queries represented the same as documents Query and Document weights are based on length and direction of their vector A vector distance measure between the query and documents is used to rank retrieved documents

27 2002.10.31 - SLIDE 27IS 202 – FALL 2002 Vector Representation Documents and Queries are represented as vectors Position 1 corresponds to term 1, position 2 to term 2, position t to term t The weight of the term is stored in each position

28 2002.10.31 - SLIDE 28IS 202 – FALL 2002 Document Vectors “Nova” occurs 10 times in text A “Galaxy” occurs 5 times in text A “Heat” occurs 3 times in text A (Blank means 0 occurrences.)

29 2002.10.31 - SLIDE 29IS 202 – FALL 2002 Document Vectors “Hollywood” occurs 7 times in text I “Film” occurs 5 times in text I “Diet” occurs 1 time in text I “Fur” occurs 3 times in text I
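The counts above can be laid out as vectors with one position per term. This is a small sketch: the counts for texts A and I are the ones on the slides, while the vocabulary order and the helper name `to_vector` are assumptions made for illustration.

```python
def to_vector(term_counts, vocabulary):
    """Dense vector with one slot per vocabulary term; absent terms get 0."""
    return [term_counts.get(term, 0) for term in vocabulary]

# Vocabulary order is an assumption; any fixed order works as long as
# every document and query uses the same one.
vocabulary = ["nova", "galaxy", "heat", "hollywood", "film", "diet", "fur"]
text_a = {"nova": 10, "galaxy": 5, "heat": 3}
text_i = {"hollywood": 7, "film": 5, "diet": 1, "fur": 3}

vec_a = to_vector(text_a, vocabulary)
vec_i = to_vector(text_i, vocabulary)
print(vec_a)
print(vec_i)
```

Most slots are zero even in this tiny vocabulary, which is why real systems store only the non-zero entries.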

30 2002.10.31 - SLIDE 30IS 202 – FALL 2002 Document Vectors

31 2002.10.31 - SLIDE 31IS 202 – FALL 2002 We Can Plot the Vectors Star Diet Doc about astronomy Doc about movie stars Doc about mammal behavior

32 2002.10.31 - SLIDE 32IS 202 – FALL 2002 Documents in 3D Space Primary assumption of the Vector Space Model: Documents that are “close together” in space are similar in meaning

33 2002.10.31 - SLIDE 33 IS 202 – FALL 2002 Vector Space Documents and Queries [Figure: documents D1 through D11 plotted in a space with axes t1, t2, t3; regions labeled as Boolean term combinations] Q is a query – also represented as a vector

34 2002.10.31 - SLIDE 34 IS 202 – FALL 2002 Documents in Vector Space [Figure: documents D1 through D11 plotted in a space with axes t1, t2, t3]

36 2002.10.31 - SLIDE 36IS 202 – FALL 2002 Assigning Weights to Terms Binary Weights Raw term frequency tf*idf –Recall the Zipf distribution –Want to weight terms highly if they are Frequent in relevant documents … BUT Infrequent in the collection as a whole Automatically derived thesaurus terms

37 2002.10.31 - SLIDE 37IS 202 – FALL 2002 Binary Weights Only the presence (1) or absence (0) of a term is included in the vector

38 2002.10.31 - SLIDE 38IS 202 – FALL 2002 Raw Term Weights The frequency of occurrence for the term in each document is included in the vector

39 2002.10.31 - SLIDE 39IS 202 – FALL 2002 Assigning Weights tf*idf measure: –Term frequency (tf) –Inverse document frequency (idf) A way to deal with some of the problems of the Zipf distribution Goal: Assign a tf*idf weight to each term in each document

40 2002.10.31 - SLIDE 40IS 202 – FALL 2002 tf*idf

41 2002.10.31 - SLIDE 41IS 202 – FALL 2002 Inverse Document Frequency IDF provides high values for rare words and low values for common words For a collection of 10000 documents (N = 10000)
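The formulas and worked values on the tf*idf slides did not survive in the transcript. A common definition, assumed here, is idf_k = log(N / n_k), where n_k is the number of documents containing term k, with the weight for term k in a document being tf * idf_k. The base-10 logarithm and the sample document frequencies are also assumptions; only N = 10000 comes from the slide.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency, assumed form: log10(N / n_k)."""
    return math.log10(n_docs / doc_freq)

def tf_idf(term_freq, n_docs, doc_freq):
    """Unnormalized tf*idf weight for one term in one document."""
    return term_freq * idf(n_docs, doc_freq)

N = 10000  # collection size from the slide
for n_k in (1, 20, 5000, 10000):  # hypothetical document frequencies
    # Rare terms get high idf; a term in every document gets 0.
    print(n_k, round(idf(N, n_k), 3))
```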

42 2002.10.31 - SLIDE 42IS 202 – FALL 2002 Similarity Measures Simple matching (coordination level match) Dice’s Coefficient Jaccard’s Coefficient Cosine Coefficient Overlap Coefficient
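The slide lists the measures without their definitions. For binary (set-based) representations the usual textbook forms are sketched below; treat the exact forms as an assumption about what the slide showed, and the term sets as made-up data.

```python
import math

def simple_matching(a, b):
    """Coordination level match: number of shared terms."""
    return len(a & b)

def dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    return len(a & b) / len(a | b)

def cosine(a, b):
    return len(a & b) / math.sqrt(len(a) * len(b))

def overlap(a, b):
    return len(a & b) / min(len(a), len(b))

# Hypothetical term sets for a query and a document.
query = {"star", "galaxy"}
doc = {"star", "galaxy", "nova", "heat"}
for measure in (simple_matching, dice, jaccard, cosine, overlap):
    print(measure.__name__, round(measure(query, doc), 3))
```

All but simple matching divide the shared-term count by some function of the set sizes, which keeps long documents from winning on length alone.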

43 2002.10.31 - SLIDE 43IS 202 – FALL 2002 tf*idf Normalization Normalize the term weights (so longer vectors are not unfairly given more weight) –Normalize usually means force all values to fall within a certain range, usually between 0 and 1, inclusive

44 2002.10.31 - SLIDE 44IS 202 – FALL 2002 Vector Space Similarity Now, the similarity of two documents is: This is also called the cosine, or normalized inner product –The normalization was done when weighting the terms
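The formula itself is missing from the transcript. Assuming each document D_i has term weights w_{ik} over t terms (the symbols are assumptions, chosen to match the vector notation used elsewhere in the lecture), the standard form is:

```latex
% With weights already normalized to unit length, similarity is the inner product:
\mathrm{sim}(D_i, D_j) = \sum_{k=1}^{t} w_{ik}\, w_{jk}

% Without prior normalization, the same quantity is the cosine of the angle
% between the two vectors:
\mathrm{sim}(D_i, D_j) =
  \frac{\sum_{k=1}^{t} w_{ik}\, w_{jk}}
       {\sqrt{\sum_{k=1}^{t} w_{ik}^{2}}\;\sqrt{\sum_{k=1}^{t} w_{jk}^{2}}}
```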

45 2002.10.31 - SLIDE 45IS 202 – FALL 2002 Vector Space Similarity Measure Combine tf and idf into a similarity measure

46 2002.10.31 - SLIDE 46 IS 202 – FALL 2002 Computing Similarity Scores [Figure: two-dimensional plot with both axes ticked from 0.2 to 1.0]

47 2002.10.31 - SLIDE 47IS 202 – FALL 2002 What’s Cosine Anyway? “One of the basic trigonometric functions encountered in trigonometry. Let theta be an angle measured counterclockwise from the x-axis along the arc of the unit circle. Then cos(theta) is the horizontal coordinate of the arc endpoint. As a result of this definition, the cosine function is periodic with period 2pi.” From http://mathworld.wolfram.com/Cosine.html

48 2002.10.31 - SLIDE 48 IS 202 – FALL 2002 Cosine vs. Degrees [Figure: plot of cosine against angle in degrees]

49 2002.10.31 - SLIDE 49IS 202 – FALL 2002 Computing a Similarity Score

50 2002.10.31 - SLIDE 50 IS 202 – FALL 2002 Vector Space Matching [Figure: Q, D1, and D2 plotted against Term A and Term B, both axes running from 0 to 1.0]
$D_i = (d_{i1}, w_{d_{i1}}; d_{i2}, w_{d_{i2}}; \ldots; d_{it}, w_{d_{it}})$
$Q = (q_{i1}, w_{q_{i1}}; q_{i2}, w_{q_{i2}}; \ldots; q_{it}, w_{q_{it}})$
Q = (0.4, 0.8) D1 = (0.8, 0.3) D2 = (0.2, 0.7)
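The three vectors on this slide can be scored directly. The sketch below computes the cosine between Q and each document; the function name is an assumption, and the vectors are the ones given above.

```python
import math

def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

q = (0.4, 0.8)
d1 = (0.8, 0.3)
d2 = (0.2, 0.7)

sim_d1 = cosine_similarity(q, d1)
sim_d2 = cosine_similarity(q, d2)
# Rank documents by decreasing similarity to the query.
ranking = sorted([("D1", sim_d1), ("D2", sim_d2)], key=lambda p: p[1], reverse=True)
print(round(sim_d1, 2), round(sim_d2, 2), ranking[0][0])  # 0.73 0.98 D2
```

D2 ranks first because it points in nearly the same direction as Q, even though D1 has the larger weight on one of the two terms.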

51 2002.10.31 - SLIDE 51IS 202 – FALL 2002 Weighting Schemes We have seen something of –Binary –Raw term weights –TF*IDF There are many other possibilities –IDF alone –Normalized term frequency

52 2002.10.31 - SLIDE 52IS 202 – FALL 2002 Term Weights in SMART SMART is an experimental IR system developed by Gerard Salton (and continued by Chris Buckley) at Cornell Designed for laboratory experiments in IR –Easy to mix and match different weighting methods –Really terrible user interface –Intended for use by code hackers (and even they have trouble using it)

53 2002.10.31 - SLIDE 53IS 202 – FALL 2002 Term Weights in SMART In SMART weights are decomposed into three factors:

54 2002.10.31 - SLIDE 54IS 202 – FALL 2002 SMART Freq Components Binary maxnorm augmented log
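The formulas for these four components are missing from the transcript. The sketch below uses forms that are commonly cited for SMART-style weighting; they are an assumption and may differ in detail from what the slide showed. Here `tf` is the raw frequency of a term in a document and `max_tf` is the largest raw frequency of any term in that document. The function names are hypothetical, not SMART identifiers.

```python
import math

def binary(tf, max_tf):
    """1 if the term occurs at all, else 0."""
    return 1.0 if tf > 0 else 0.0

def maxnorm(tf, max_tf):
    """Frequency scaled by the most frequent term in the document."""
    return tf / max_tf

def augmented(tf, max_tf):
    """Scaled frequency compressed into the range 0.5 to 1.0."""
    return 0.5 + 0.5 * tf / max_tf

def log_freq(tf, max_tf):
    """Natural log damping: ln(tf) + 1 for terms that occur, else 0."""
    return math.log(tf) + 1.0 if tf > 0 else 0.0

# Hypothetical example: a term occurring 5 times where the top term occurs 10 times.
for component in (binary, maxnorm, augmented, log_freq):
    print(component.__name__, round(component(5, 10), 3))
```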

55 2002.10.31 - SLIDE 55IS 202 – FALL 2002 Collection Weighting in SMART Inverse squared probabilistic frequency

56 2002.10.31 - SLIDE 56IS 202 – FALL 2002 Term Normalization in SMART sum cosine fourth max

57 2002.10.31 - SLIDE 57IS 202 – FALL 2002 To Think About How does the tf*idf ranking algorithm behave? –Make a set of hypothetical documents consisting of terms and their weights –Create some hypothetical queries –How are the documents ranked, depending on the weights of their terms and the queries’ terms?

58 2002.10.31 - SLIDE 58IS 202 – FALL 2002 Document Space Has High Dimensionality What happens beyond 2 or 3 dimensions? Similarity still has to do with how many tokens are shared in common More terms -> harder to understand which subsets of words are shared among similar documents One approach to handling high dimensionality: Clustering

59 2002.10.31 - SLIDE 59IS 202 – FALL 2002 Vector Space Visualization

60 2002.10.31 - SLIDE 60IS 202 – FALL 2002 Text Clustering Finds overall similarities among groups of documents Finds overall similarities among groups of tokens Picks out some themes, ignores others

61 2002.10.31 - SLIDE 61 IS 202 – FALL 2002 Text Clustering Clustering is “The art of finding groups in data.” -- Kaufman and Rousseeuw [Figure: documents plotted against Term 1 and Term 2]

63 2002.10.31 - SLIDE 63IS 202 – FALL 2002 Pair-Wise Document Similarity How to compute document similarity?

64 2002.10.31 - SLIDE 64IS 202 – FALL 2002 Pair-Wise Document Similarity (no normalization for simplicity)

65 2002.10.31 - SLIDE 65IS 202 – FALL 2002 Document/Document Matrix
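A document/document matrix holds the similarity of every pair of documents. The sketch below uses the unnormalized inner product, in keeping with the "no normalization for simplicity" note on the previous slide. The vectors and function names are made up for illustration.

```python
def inner_product(u, v):
    """Unnormalized similarity: sum of products of matching term weights."""
    return sum(a * b for a, b in zip(u, v))

def doc_doc_matrix(vectors):
    """Nested dict: matrix[a][b] is the similarity of documents a and b."""
    names = sorted(vectors)
    return {a: {b: inner_product(vectors[a], vectors[b]) for b in names} for a in names}

# Hypothetical raw term-frequency vectors over a three-term vocabulary.
vectors = {"A": [1, 0, 2], "B": [0, 3, 1], "C": [2, 0, 4]}
matrix = doc_doc_matrix(vectors)
for name in sorted(matrix):
    print(name, matrix[name])
```

The matrix is symmetric, so an implementation over a large collection only needs to compute and store one triangle of it.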

66 2002.10.31 - SLIDE 66 IS 202 – FALL 2002 Agglomerative Clustering [Figure over items A, B, C, D, E, F, G, H, I; slides 67 and 68 repeat the same title and labels]
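Agglomerative clustering starts with every item in its own cluster and repeatedly merges the two most similar clusters. The sketch below is one possible version, assuming single-link merging (cluster similarity is that of the closest pair) and a precomputed similarity matrix. The slides do not say which linkage was used, and the matrix values are hypothetical.

```python
def agglomerate(similarity, k):
    """Merge the two most similar clusters until only k clusters remain."""
    clusters = [frozenset([item]) for item in sorted(similarity)]

    def link(c1, c2):
        # Single link (an assumption): similarity of the closest cross-cluster pair.
        return max(similarity[a][b] for a in c1 for b in c2)

    while len(clusters) > k:
        pairs = [(link(c1, c2), i, j)
                 for i, c1 in enumerate(clusters)
                 for j, c2 in enumerate(clusters) if i < j]
        _, i, j = max(pairs)  # most similar pair of clusters
        merged = clusters[i] | clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters

# Hypothetical document/document similarity matrix.
similarity = {
    "A": {"A": 5, "B": 2, "C": 10},
    "B": {"A": 2, "B": 10, "C": 4},
    "C": {"A": 10, "B": 4, "C": 20},
}
clusters = agglomerate(similarity, 2)
print([sorted(c) for c in clusters])
```

Recording the order of merges instead of stopping at k would give the full bottom-up hierarchy.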

69 2002.10.31 - SLIDE 69 IS 202 – FALL 2002 Clustering Agglomerative methods: Polythetic, Exclusive or Overlapping, Unordered clusters are order-dependent
1. Select initial centers (i.e., seed the space)
2. Assign docs to highest matching centers and compute centroids
3. Reassign all documents to centroid(s)
Rocchio’s method [Figure label: Doc]

70 2002.10.31 - SLIDE 70 IS 202 – FALL 2002 Automatic Class Assignment Automatic Class Assignment: Polythetic, Exclusive or Overlapping, usually ordered clusters are order-independent, usually based on an intellectually derived scheme
1. Create pseudo-documents representing intellectually derived classes.
2. Search using document contents
3. Obtain ranked list
4. Assign document to N categories ranked over threshold, OR assign to top-ranked category
[Figure labels: Doc, Search Engine]

71 2002.10.31 - SLIDE 71 IS 202 – FALL 2002 K-Means Clustering
1) Create a pair-wise similarity measure
2) Find K centers using agglomerative clustering
–Take a small sample
–Group bottom up until K groups found
3) Assign each document to nearest center, forming new clusters
4) Repeat 3 as necessary
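A minimal sketch of the assign-and-repeat part of this procedure. It makes two simplifying assumptions that differ from the slide: the initial centers are supplied by hand rather than found by agglomerative clustering on a sample, and "nearest" means smallest squared Euclidean distance rather than highest similarity. The points and function names are made up.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, centers, max_iterations=10):
    """Assign each point to its nearest center, recompute centers, repeat."""
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(len(centers)), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        for c in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = centroid(members)
    return assignment, centers

# Hypothetical two-dimensional document vectors and hand-picked seeds.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
assignment, centers = k_means(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(assignment, centers)
```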

72 2002.10.31 - SLIDE 72IS 202 – FALL 2002 Scatter/Gather Cutting, Pedersen, Tukey & Karger 92, 93, Hearst & Pedersen 95 Cluster sets of documents into general “themes”, like a table of contents Display the contents of the clusters by showing topical terms and typical titles User chooses subsets of the clusters and re-clusters the documents within Resulting new groups have different “themes”

73 2002.10.31 - SLIDE 73 IS 202 – FALL 2002 S/G Example: Query on “star” Encyclopedia text
14 sports
8 symbols
47 film, tv
68 film, tv (p)
7 music
97 astrophysics
67 astronomy (p)
12 stellar phenomena
10 flora/fauna
49 galaxies, stars
29 constellations
7 miscellaneous
Clustering and re-clustering is entirely automated

77 2002.10.31 - SLIDE 77IS 202 – FALL 2002 Clustering Result Sets Advantages: –See some main themes Disadvantage: –Many ways documents could group together are hidden Thinking point: What is the relationship to classification systems and facets?

78 2002.10.31 - SLIDE 78IS 202 – FALL 2002 Next Time Probabilistic Models Relevance Feedback

