
Slide 1: SIMS 202: Information Organization and Retrieval
Lecture 17: Statistical Properties of Text
Prof. Ray Larson & Prof. Marc Davis, UC Berkeley SIMS
Tuesday and Thursday, 10:30 am - 12:00 pm, Fall 2002
http://www.sims.berkeley.edu/academics/courses/is202/f02/

Slide 2: Lecture Overview
- Review: central concepts in IR; Boolean logic
- Content analysis
- Statistical properties of text: Zipf distribution; statistical dependence
- Indexing and inverted files
Credit for some of the slides in this lecture goes to Marti Hearst.

Slide 3: Central Concepts in IR
- Documents
- Queries
- Collections
- Evaluation
- Relevance

Slide 4: Relevance (Introduction)
In what ways can a document be relevant to a query?
- Answer a precise question precisely: "Who is buried in Grant's tomb?" Grant.
- Partially answer a question: "Where is Danville?" Near Walnut Creek.
- Suggest a source for more information: "What is lymphedema?" Look in this medical dictionary...
- Give background information
- Remind the user of other knowledge
- Others...

Slide 5: Relevance
"Intuitively, we understand quite well what relevance means. It is a primitive 'y'know' concept, as is information, for which we hardly need a definition. ... if and when any productive contact [in communication] is desired, consciously or not, we involve and use this intuitive notion of relevance." (Saracevic, 1975, p. 324)

Slide 6: Janes' View: Topicality, Pertinence, Relevance, Utility, Satisfaction

Slide 7: Boolean Queries
- Cat
- Cat OR Dog
- Cat AND Dog
- (Cat AND Dog)
- (Cat AND Dog) OR Collar
- (Cat AND Dog) OR (Collar AND Leash)
- (Cat OR Dog) AND (Collar OR Leash)
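Read as set operations over the documents containing each term, these queries can be evaluated directly. A minimal sketch in Python; the postings sets below are made-up illustration data, not from the lecture:

    # Hypothetical postings: which documents contain each term.
    postings = {
        "cat":    {1, 2, 5},
        "dog":    {2, 3},
        "collar": {3, 5},
        "leash":  {4, 5},
    }

    cat, dog = postings["cat"], postings["dog"]
    collar, leash = postings["collar"], postings["leash"]

    print(cat | dog)                       # Cat OR Dog  -> union: {1, 2, 3, 5}
    print(cat & dog)                       # Cat AND Dog -> intersection: {2}
    print((cat & dog) | (collar & leash))  # (Cat AND Dog) OR (Collar AND Leash): {2, 5}
    print((cat | dog) & (collar | leash))  # (Cat OR Dog) AND (Collar OR Leash): {3, 5}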

Slide 8: Boolean Logic (Venn diagram of sets A and B)

Slide 9: Boolean Logic (diagram: three terms t1, t2, t3 define eight minterms m1-m8, the possible combinations of the terms; documents D1-D11 are placed in the corresponding regions)

Slide 10: Boolean Systems
- Most of the commercial database search systems that pre-date the WWW are based on Boolean search (Dialog, Lexis-Nexis, etc.)
- Most online library catalogs are Boolean systems (e.g., MELVYL)
- Database systems use Boolean logic for searching
- Many of the search engines sold for intranet search of web sites are Boolean

Slide 11: Content Analysis
Automated transformation of raw text into a form that represents some aspect(s) of its meaning, including, but not limited to:
- Automated thesaurus generation
- Phrase detection
- Categorization
- Clustering
- Summarization

Slide 12: Techniques for Content Analysis
- Statistical: single document; full collection
- Linguistic: syntactic; semantic; pragmatic
- Knowledge-based (artificial intelligence)
- Hybrid (combinations)

Slide 13: Text Processing
Standard steps:
- Recognize document structure (titles, sections, paragraphs, etc.)
- Break into tokens (usually space- and punctuation-delimited; special issues with Asian languages)
- Stemming/morphological analysis
- Store in an inverted index (to be discussed later)
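As a rough illustration of the "break into tokens" step, a minimal sketch that lowercases and splits on non-alphanumeric characters (real tokenizers handle document structure, punctuation, and non-space-delimited languages with more care):

    import re

    def tokenize(text):
        # Lowercase, then split on any run of characters that is not a letter or digit.
        return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

    print(tokenize("Now is the time for all good men..."))
    # ['now', 'is', 'the', 'time', 'for', 'all', 'good', 'men']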

Slide 14: Content Analysis Areas (pipeline diagram with components: information need, query, text input, parse, pre-process, index, collections, rank. How is the text processed? How is the query constructed?)

Slide 15: Document Processing Steps (figure from the "Modern IR" textbook)

Slide 16: Stemming and Morphological Analysis
Goal: "normalize" similar words. Morphology (the "form" of words):
- Inflectional morphology: e.g., inflected verb endings and noun number; never changes grammatical class (dog, dogs; tengo, tienes, tiene, tenemos, tienen)
- Derivational morphology: derives one word from another, often changing grammatical class (build, building; health, healthy)

Slide 17: Automated Methods
Powerful multilingual tools exist for morphological analysis (PCKimmo, Xerox lexical technology):
- Require a grammar and dictionary
- Use "two-level" automata
Stemmers:
- Very dumb rules work well (for English)
- Porter stemmer: iteratively remove suffixes
- Improvement: pass results through a lexicon
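As an illustration of rule-based stemming, a minimal sketch using the Porter stemmer implementation in NLTK (assuming that library is installed; the lecture does not prescribe a particular toolkit):

    from nltk.stem.porter import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["building", "builds", "built", "healthy", "knowledge"]:
        print(word, "->", stemmer.stem(word))
    # The stemmer applies suffix-stripping rules only, so some outputs
    # (e.g. "knowledg") are truncated stems rather than dictionary words.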

Slide 18: Errors Generated by the Porter Stemmer (table from Krovetz '93)

Slide 19: Statistical Properties of Text
- Token occurrences in text are not uniformly distributed
- They are also not normally distributed
- They do exhibit a Zipf distribution

Slide 20: Plotting Word Frequency by Rank
Main idea:
- Count how many times each token occurs in the text, summed over all of the texts in the collection
- Order these tokens according to how often they occur (highest to lowest); this ordering is called the rank
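A minimal sketch of this counting-and-ranking step in Python, over a toy two-document collection:

    from collections import Counter

    collection = [
        "the cat sat on the mat",
        "the dog chased the cat",
    ]

    counts = Counter()
    for text in collection:
        counts.update(text.split())    # sum token counts over all texts

    # Order tokens from most to least frequent; the position is the rank.
    for rank, (token, freq) in enumerate(counts.most_common(), start=1):
        print(rank, freq, token)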

Slide 21: A Typical Collection (government documents, 157,734 tokens, 32,259 unique)
Most frequent tokens: 8164 the; 4771 of; 4005 to; 2834 a; 2827 and; 2802 in; 1592 The; 1370 for; 1326 is; 1324 s; 1194 that; 973 by; 969 on; 915 FT; 883 Mr; 860 was; 855 be; 849 Pounds; 798 TEXT; 798 PUB; 798 PROFILE; 798 PAGE; 798 HEADLINE; 798 DOCNO
Tokens occurring once: ABC, ABFT, ABOUT, ACFT, ACI, ACQUI, ACQUISITIONS, ACSIS, ADFT, ADVISERS, AE

Slide 22: A Small Collection (Stems)
Top of the ranking (rank, frequency, term): 1 37 system; 2 32 knowledg; 3 24 base; 4 20 problem; 5 18 abstract; 6 15 model; 7 15 languag; 8 15 implem; 9 13 reason; 10 13 inform; 11 11 expert; 12 11 analysi; 13 10 rule; 14 10 program; 15 10 oper; 16 10 evalu; 17 10 comput; 18 10 case; 19 9 gener; 20 9 form
Tail of the ranking (rank, frequency, term): 150 2 enhanc; 151 2 energi; 152 2 emphasi; 153 2 detect; 154 2 desir; 155 2 date; 156 2 critic; 157 2 content; 158 2 consider; 159 2 concern; 160 2 compon; 161 2 compar; 162 2 commerci; 163 2 clause; 164 2 aspect; 165 2 area; 166 2 aim; 167 2 affect

Slide 23: The Corresponding Zipf Curve (frequency plotted against rank for the top 20 stems listed on the previous slide)

Slide 24: Zoom in on the Knee of the Curve (rank, frequency, term): 43 6 approach; 44 5 work; 45 5 variabl; 46 5 theori; 47 5 specif; 48 5 softwar; 49 5 requir; 50 5 potenti; 51 5 method; 52 5 mean; 53 5 inher; 54 5 data; 55 5 commit; 56 5 applic; 57 4 tool; 58 4 technolog; 59 4 techniqu

Slide 25: Zipf Distribution
The important points:
- A few elements occur very frequently
- A medium number of elements have medium frequency
- Many elements occur very infrequently

Slide 26: Zipf Distribution
The product of a word's frequency (f) and its rank (r) is approximately constant, where rank is the word's position when words are ordered by frequency of occurrence. Another way to state this is with an approximately correct rule of thumb:
- Say the most common term occurs C times
- The second most common occurs C/2 times
- The third most common occurs C/3 times
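A small check of the rule of thumb f * r ≈ C, using the top stem frequencies from slide 22 as data:

    # Frequencies of the ten most common stems from the small collection (slide 22).
    observed = [37, 32, 24, 20, 18, 15, 15, 15, 13, 13]

    C = observed[0]                      # frequency of the most common term
    for rank, freq in enumerate(observed, start=1):
        predicted = C / rank             # Zipf rule of thumb: f is roughly C / r
        print(rank, freq, round(predicted, 1), freq * rank)   # f * r stays roughly constant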

Slide 27: Zipf Distribution (plots on a linear scale and on a logarithmic scale)

Slide 28: What Has a Zipf Distribution?
- Words in a text collection (virtually any use of natural language)
- Library book checkout patterns
- Incoming web page requests (Nielsen)
- Outgoing web page requests (Cunha & Crovella)
- Document size on the web (Cunha & Crovella)

Slide 29: Related Distributions/"Laws"
- Bradford's Law of Scattering
- Lotka's Law of Productivity
- De Solla Price's urn model for "cumulative advantage processes" (diagram: pick, replace +1; 1/2 = 50%, 2/3 = 66%, 3/4 = 75%)

Slide 30: Very Frequent Word Stems (from the Cha-Cha web index for the Berkeley.EDU domain)

Slide 31: Frequent Words on the WWW (frequency, word): 65002930 the; 62789720 a; 60857930 to; 57248022 of; 54078359 and; 52928506 in; 50686940 s; 49986064 for; 45999001 on; 42205245 this; 41203451 is; 39779377 by; 35439894 with; 35284151 or; 34446866 at; 33528897 all; 31583607 are; 30998255 from; 30755410 e; 30080013 you; 29669506 be; 29417504 that; 28542378 not; 28162417 an; 28110383 as; 28076530 home; 27650474 it; 27572533 i; 24548796 have; 24420453 if; 24376758 new; 24171603 t; 23951805 your; 23875218 page; 22292805 about; 22265579 com; 22107392 information; 21647927 will; 21368265 can; 21367950 more; 21102223 has; 20621335 no; 19898015 other; 19689603 one; 19613061 c; 19394862 d; 19279458 m; 19199145 was; 19075253 copyright; 18636563 us (see http://elib.cs.berkeley.edu/docfreq/docfreq.html)

Slide 32: Words That Occur Few Times (from the Cha-Cha web index for the Berkeley.EDU domain)

Slide 33: Consequences of Zipf
- There are always a few very frequent tokens that are not good discriminators. These are called "stop words" in IR and usually correspond to the linguistic notion of "closed-class" words (English examples: to, from, on, and, the, ...), grammatical classes that don't take on new members.
- There are always a large number of tokens that occur only once (and can have unexpected consequences for some IR algorithms).
- Medium-frequency words are the most descriptive.
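A minimal sketch of stop-word removal; the stop list below is a tiny illustrative subset, not a standard list:

    STOP_WORDS = {"to", "from", "on", "and", "the", "a", "of", "in", "is"}

    def remove_stop_words(tokens):
        # Drop very frequent closed-class words that carry little content.
        return [t for t in tokens if t not in STOP_WORDS]

    print(remove_stop_words("now is the time for all good men".split()))
    # ['now', 'time', 'for', 'all', 'good', 'men']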

Slide 34: Word Frequency vs. Resolving Power (figure from van Rijsbergen '79): the most frequent words are not the most descriptive.

Slide 35: Statistical Independence vs. Dependence
- How likely is a red car to drive by, given we've seen a black one?
- How likely is the word "ambulance" to appear, given that we've seen "car accident"?
- The colors of cars driving by are independent (although more frequent colors are more likely)
- Words in text are not independent (although again more frequent words are more likely)

Slide 36: Statistical Independence
Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together: P(x) P(y) = P(x, y).
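A toy numerical illustration of this definition (made-up events, not from the slides):

    # Independent: a fair coin flip and a fair die roll.
    p_heads, p_six = 1 / 2, 1 / 6
    p_heads_and_six = p_heads * p_six                 # joint probability for independent events
    print(p_heads * p_six == p_heads_and_six)         # True: the product equals the joint

    # Dependent: drawing two aces from a deck without replacement.
    p_ace_first = 4 / 52
    p_ace_second = 4 / 52                             # marginal probability, same as the first draw
    p_both_aces = (4 / 52) * (3 / 51)                 # joint probability
    print(p_ace_first * p_ace_second == p_both_aces)  # False: the product differs from the joint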

Slide 37: Statistical Independence and Dependence
- What are examples of things that are statistically independent?
- What are examples of things that are statistically dependent?

Slide 38: Lexical Associations
- Subjects write the first word that comes to mind: doctor/nurse; black/white (Palermo & Jenkins 64)
- Text corpora can yield similar associations
- One measure: mutual information (Church and Hanks 89); if word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)
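Church and Hanks' measure compares the probability of seeing the two words together (the numerator) with the product of their individual probabilities (the denominator): I(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]. A minimal sketch with made-up counts:

    import math

    def mutual_information(count_xy, count_x, count_y, n):
        """Pointwise mutual information from raw counts over n observation windows."""
        p_xy = count_xy / n
        p_x, p_y = count_x / n, count_y / n
        return math.log2(p_xy / (p_x * p_y))

    # Illustrative counts only, not the AP-corpus figures from the slides.
    print(mutual_information(count_xy=30, count_x=100, count_y=200, n=1_000_000))
    # A positive score: the pair co-occurs far more often than independence would predict.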

Slide 39: Statistical Independence (diagram: co-occurrence is computed over a sliding window of words, illustrated with windows w1, w11, w21 over the token sequence a b c d e f g h i j k l m n o p)

Slide 40: Interesting Associations with "Doctor" (AP corpus, N = 15 million; Church & Hanks 89)

Slide 41: Uninteresting Associations with "Doctor" (AP corpus, N = 15 million; Church & Hanks 89)
These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun.

Slide 42: Document Vectors
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
- A vector is like an array of floating-point numbers; it has direction and magnitude
- Each vector holds a place for every term in the collection
- Therefore, most vectors are sparse
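A minimal sketch of a bag-of-words vector stored sparsely as a term-to-count mapping (toy text):

    from collections import Counter

    def to_vector(text):
        # Bag of words: word order is discarded, only term counts remain.
        return Counter(text.lower().split())

    doc_a = to_vector("nova nova galaxy heat nova galaxy")
    print(doc_a)           # Counter({'nova': 3, 'galaxy': 2, 'heat': 1})
    print(doc_a["film"])   # 0 -- terms absent from the document are implicitly zero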

Slide 43: Document Vectors
- "Nova" occurs 10 times in text A
- "Galaxy" occurs 5 times in text A
- "Heat" occurs 3 times in text A
(A blank entry means 0 occurrences.)

Slide 44: Document Vectors
- "Hollywood" occurs 7 times in text I
- "Film" occurs 5 times in text I
- "Diet" occurs 1 time in text I
- "Fur" occurs 3 times in text I

Slide 45: Document Vectors (term-by-document frequency table for the example texts)

Slide 46: We Can Plot the Vectors (plot with axes "Star" and "Diet": a document about astronomy, a document about movie stars, and a document about mammal behavior)
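This geometric picture is what later similarity computations build on; a small sketch of cosine similarity between toy vectors in the (star, diet) plane, with made-up counts:

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norms

    astronomy = (10, 0)   # (count of "star", count of "diet"), made-up counts
    movies    = (7, 1)
    mammals   = (2, 8)

    print(cosine(astronomy, movies))   # near 1: the vectors point in similar directions
    print(cosine(astronomy, mammals))  # much smaller: the documents are about different things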

Slide 47: Documents in 3D Space (figure)

Slide 48: Content Analysis Summary
- Content analysis: transforming raw text into more computationally useful forms
- Words in text collections exhibit interesting statistical properties: word frequencies have a Zipf distribution; word co-occurrences exhibit dependencies
- Text documents are transformed into vectors: pre-processing includes tokenization, stemming, and collocations/phrases; documents occupy a multi-dimensional space

Slide 49: Inverted Index
This is the primary data structure for text indexes. Main idea: invert documents into a big index. Basic steps:
- Make a "dictionary" of all the tokens in the collection
- For each token, list all the docs it occurs in
- Do a few things to reduce redundancy in the data structure
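A minimal sketch of these steps in Python, building a token -> postings mapping with within-document frequencies, using the two example documents introduced on the following slides:

    from collections import defaultdict

    docs = {
        1: "Now is the time for all good men to come to the aid of their country",
        2: "It was a dark and stormy night in the country manor. The time was past midnight",
    }

    # Dictionary of tokens, each mapped to {doc_id: within-document frequency}.
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for token in text.lower().replace(".", " ").split():
            index[token][doc_id] += 1

    print(dict(index["time"]))     # {1: 1, 2: 1}
    print(dict(index["country"]))  # {1: 1, 2: 1}
    print(dict(index["dark"]))     # {2: 1}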

Slide 50: How is the index constructed? (the retrieval pipeline diagram from slide 14: information need, query, text input, parse, pre-process, index, collections, rank)

Slide 51: Inverted Indexes
We have seen "vector files" conceptually. An inverted file is a vector file "inverted" so that rows become columns and columns become rows.

Slide 52: How Inverted Files Are Created
Documents are parsed to extract tokens; these are saved with the document ID.
- Doc 1: "Now is the time for all good men to come to the aid of their country"
- Doc 2: "It was a dark and stormy night in the country manor. The time was past midnight"

Slide 53: How Inverted Files Are Created
After all documents have been parsed, the inverted file is sorted alphabetically.

Slide 54: How Inverted Files Are Created
Multiple term entries for a single document are merged, and within-document term frequency information is compiled.

Slide 55: How Inverted Files Are Created
Then the file can be split into:
- A dictionary file, and
- A postings file

Slide 56: How Inverted Files Are Created (figure: the resulting dictionary and postings files)

Slide 57: Inverted Indexes
Permit fast search for individual terms. For each term, you get a list consisting of:
- document ID
- frequency of the term in the doc (optional)
- position of the term in the doc (optional)
These lists can be used to solve Boolean queries:
- country -> d1, d2
- manor -> d2
- country AND manor -> d2
Also used for statistical ranking algorithms.
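A small sketch of answering the Boolean AND query by intersecting postings lists (mirroring the country/manor example above):

    # Postings lists: term -> sorted list of IDs of the documents containing it.
    postings = {
        "country": [1, 2],
        "manor":   [2],
    }

    def boolean_and(term1, term2, postings):
        # AND is the intersection of the two postings lists.
        return sorted(set(postings.get(term1, [])) & set(postings.get(term2, [])))

    print(boolean_and("country", "manor", postings))   # [2] -> only doc 2 matches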

Slide 58: How Inverted Files Are Used
Query on "time" AND "dark":
- 2 docs with "time" in the dictionary -> IDs 1 and 2 from the postings file
- 1 doc with "dark" in the dictionary -> ID 2 from the postings file
- Therefore, only doc 2 satisfies the query.

Slide 59: Next Time
- More on vector representation
- The vector model of IR
- Term weighting
- Statistical ranking methods

