Content Analysis and Statistical Properties of Text
Ray Larson & Marti Hearst
University of California, Berkeley, School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
9/11/2000

Today
- Overview of content analysis
- Text representation
- Statistical characteristics of text collections
  - Zipf distribution
  - Statistical dependence

Content Analysis
Automated transformation of raw text into a form that represents some aspect(s) of its meaning.
Including, but not limited to:
- Automated thesaurus generation
- Phrase detection
- Categorization
- Clustering
- Summarization

Techniques for Content Analysis
- Statistical
  - Single document
  - Full collection
- Linguistic
  - Syntactic
  - Semantic
  - Pragmatic
- Knowledge-based (artificial intelligence)
- Hybrid (combinations)

Text Processing
Standard steps:
- Recognize document structure (titles, sections, paragraphs, etc.)
- Break into tokens
  - usually space- and punctuation-delineated
  - special issues with Asian languages
- Stemming / morphological analysis
- Store in an inverted index (to be discussed later)
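As a concrete illustration of the tokenization step, here is a minimal sketch assuming simple space- and punctuation-delimited English text; the `tokenize` helper and the regular expression it uses are illustrative, not part of the original lecture.

```python
import re

def tokenize(text):
    """Break text into lowercase tokens, splitting on whitespace and punctuation.

    This crude rule works reasonably well for English; Asian languages need
    real word segmentation instead.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Now is the time for all good men, isn't it?"))
# ['now', 'is', 'the', 'time', 'for', 'all', 'good', 'men', 'isn', 't', 'it']
```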

(Diagram: the standard retrieval pipeline, in which an information need is expressed as a query, text input from the collections is parsed and pre-processed into an index, and documents are ranked against the query.)
How is the query constructed?
How is the text processed?

Document Processing Steps (figure)

Stemming and Morphological Analysis
Goal: "normalize" similar words.
Morphology ("form" of words):
- Inflectional morphology
  - e.g., inflected verb endings and noun number
  - never changes grammatical class
  - dog, dogs
  - tengo, tienes, tiene, tenemos, tienen
- Derivational morphology
  - derives one word from another, often changing grammatical class
  - build, building; health, healthy

Automated Methods
Powerful multilingual tools exist for morphological analysis:
- PCKimmo, Xerox lexical technology
- require a grammar and a dictionary
- use "two-level" automata
Stemmers:
- very dumb rules work well (for English)
- Porter stemmer: iteratively remove suffixes
- improvement: pass results through a lexicon
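A minimal sketch of the "dumb rules" approach, using the Porter stemmer implementation shipped with NLTK (assuming the nltk package is installed); the tiny lexicon at the end is a hypothetical illustration of the "pass results through a lexicon" improvement, not the lecture's method.

```python
from nltk.stem.porter import PorterStemmer  # assumes the nltk package is installed

stemmer = PorterStemmer()
for word in ["build", "building", "health", "healthy", "dogs"]:
    print(word, "->", stemmer.stem(word))
# building -> build, dogs -> dog, but healthy -> healthi (a typical rule-based error)

# Hypothetical improvement: keep a stemmed form only if it appears in a lexicon.
lexicon = {"build", "health", "dog"}  # toy lexicon for illustration

def stem_with_lexicon(word):
    stem = stemmer.stem(word)
    return stem if stem in lexicon else word
```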

Errors Generated by the Porter Stemmer (Krovetz 93)

Statistical Properties of Text
- Token occurrences in text are not uniformly distributed
- They are also not normally distributed
- They do exhibit a Zipf distribution

A More Standard Collection
Most frequent tokens (frequency, token): 8164 the, 4771 of, 4005 to, 2834 a, 2827 and, 2802 in, 1592 The, 1370 for, 1326 is, 1324 s, 1194 that, 973 by, 969 on, 915 FT, 883 Mr, 860 was, 855 be, 849 Pounds, 798 TEXT, 798 PUB, 798 PROFILE, 798 PAGE, 798 HEADLINE, 798 DOCNO
Tokens occurring once: ABC, ABFT, ABOUT, ACFT, ACI, ACQUI, ACQUISITIONS, ACSIS, ADFT, ADVISERS, AE, ...
Government documents, … tokens, … unique

Plotting Word Frequency by Rank
Main idea: count how many times each token occurs, over all texts in the collection.
Then order the tokens by how often they occur; a token's position in this ordering is called its rank.
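A minimal sketch of this counting-and-ranking step using Python's standard library; the two-document toy collection is an assumption made purely for illustration.

```python
from collections import Counter

docs = [
    ["the", "system", "of", "the", "knowledge", "base"],
    ["the", "system", "uses", "a", "knowledge", "base"],
]  # toy collection of already-tokenized documents

counts = Counter(token for doc in docs for token in doc)
for rank, (token, freq) in enumerate(counts.most_common(), start=1):
    print(rank, freq, token)  # rank, frequency, token in decreasing order of frequency
```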

Most and Least Frequent Terms
Most frequent stems (rank, frequency, stem): 1 37 system, 2 32 knowledg, 3 24 base, 4 20 problem, 5 18 abstract, 6 15 model, 7 15 languag, 8 15 implem, 9 13 reason; ranks 10-18: inform, expert, analysi, rule, program, oper, evalu, comput, case; 19 9 gener, 20 9 form.
Least frequent stems: enhanc, energi, emphasi, detect, desir, date, critic, content, consider, concern, compon, compar, commerci, clause, aspect, area, aim, affect.

The Corresponding Zipf Curve
(Plot of the rank/frequency data above: rank on the x-axis, frequency on the y-axis.)

Zoom in on the Knee of the Curve
(rank, frequency, stem): 43 6 approach, 44 5 work, 45 5 variabl, 46 5 theori, 47 5 specif, 48 5 softwar, 49 5 requir, 50 5 potenti, 51 5 method, 52 5 mean, 53 5 inher, 54 5 data, 55 5 commit, 56 5 applic, 57 4 tool, 58 4 technolog, 59 4 techniqu

Zipf Distribution
The important points:
- a few elements occur very frequently
- a medium number of elements have medium frequency
- many elements occur very infrequently

Zipf Distribution
The product of the frequency of a word (f) and its rank (r) is approximately constant, where rank is the word's position in the ordering by frequency of occurrence.
Another way to state this is with an approximately correct rule of thumb:
- say the most common term occurs C times
- the second most common occurs C/2 times
- the third most common occurs C/3 times
- ...
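A small sketch of the rule of thumb, with C chosen arbitrarily; the point is just that frequency times rank stays roughly constant.

```python
C = 37_000  # hypothetical frequency of the most common term
for rank in range(1, 6):
    freq = C / rank                               # Zipf's rule of thumb: f ≈ C / r
    print(rank, round(freq), round(freq * rank))  # the last column stays ≈ C
```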

Zipf Distribution (linear and log scale)
(Figure: the same distribution plotted on linear and on log scales.)

What Kinds of Data Exhibit a Zipf Distribution?
- Words in a text collection (virtually any language usage)
- Library book checkout patterns
- Incoming web page requests (Nielsen)
- Outgoing web page requests (Cunha & Crovella)
- Document size on the web (Cunha & Crovella)

Related Distributions / "Laws"
- Bradford's Law of Scattering
- Lotka's Law of Productivity
- De Solla Price's urn model for "cumulative advantage processes"
(Diagram: urn model with "pick", "replace", "+1" steps and probabilities 1/2 = 50%, 2/3 = 66%, 3/4 = 75%.)

Very Frequent Word Stems (Cha-Cha Web Index)

Words That Occur Few Times (Cha-Cha Web Index)

Consequences of Zipf
- There are always a few very frequent tokens that are not good discriminators.
  - These are called "stop words" in IR.
  - They usually correspond to the linguistic notion of "closed-class" words: grammatical classes that don't take on new members (English examples: to, from, on, and, the, ...).
- There are always a large number of tokens that occur only once and can mess up algorithms.
- Medium-frequency words are the most descriptive.
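A minimal sketch of how these consequences are typically acted on in pre-processing: treat the most frequent tokens as stop words and drop the singletons, keeping the medium-frequency terms. The counts and the cutoff below are arbitrary illustrations, not values from the lecture.

```python
from collections import Counter

counts = Counter({"the": 8164, "of": 4771, "system": 37, "knowledg": 32, "acsis": 1, "adft": 1})

STOP_K = 2  # hypothetical cutoff: the K most frequent tokens become stop words
stop_words = {tok for tok, _ in counts.most_common(STOP_K)}
singletons = {tok for tok, freq in counts.items() if freq == 1}

index_terms = [tok for tok in counts if tok not in stop_words and tok not in singletons]
print(index_terms)  # ['system', 'knowledg'] -- the medium-frequency, descriptive terms remain
```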

Word Frequency vs. Resolving Power
(Figure from van Rijsbergen 79.) The most frequent words are not the most descriptive.

Statistical Independence vs. Statistical Dependence
- How likely is a red car to drive by, given that we've seen a black one?
- How likely is the word "ambulance" to appear, given that we've seen "car accident"?
The colors of cars driving by are independent (although more frequent colors are more likely).
Words in text are not independent (although, again, more frequent words are more likely).

Statistical Independence
Two events x and y are statistically independent if the product of the probabilities of their occurring individually equals the probability of their occurring together: P(x)P(y) = P(x, y).

Statistical Independence and Dependence
- What are examples of things that are statistically independent?
- What are examples of things that are statistically dependent?

Lexical Associations
Subjects write the first word that comes to mind: doctor/nurse, black/white (Palermo & Jenkins 64).
Text corpora yield similar associations.
One measure is mutual information (Church and Hanks 89), which compares how often two words occur together with how often they would co-occur by chance: I(x, y) = log2 [ P(x, y) / ( P(x) P(y) ) ].
If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection).

Statistical Independence
Compute the measure over a window of words.
(Diagram: co-occurrences are counted between a word and its neighbors inside a fixed-size window that slides across the text, e.g. starting at w1, w11, w21.)
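A minimal sketch of the window-based computation: count single words and word pairs inside a sliding window, then score each pair by the mutual-information ratio above. The window size, the toy sentence, and the simple normalization by the number of tokens are all assumptions made for illustration.

```python
import math
from collections import Counter

def association_scores(tokens, window=5):
    """Score word pairs by log2( P(x, y) / (P(x) * P(y)) ), with co-occurrence
    counted inside a sliding window of the given size."""
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + window]:
            pair_counts[(x, y)] += 1
    n = len(tokens)
    return {
        (x, y): math.log2((c / n) / ((word_counts[x] / n) * (word_counts[y] / n)))
        for (x, y), c in pair_counts.items()
    }

tokens = "the doctor called the nurse and the nurse called the doctor back".split()
scores = association_scores(tokens)
print(sorted(scores.items(), key=lambda kv: -kv[1])[:3])  # strongest associations first
```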

Interesting Associations with "Doctor" (AP Corpus, N=15 million; Church & Hanks 89)

Un-Interesting Associations with "Doctor" (AP Corpus, N=15 million; Church & Hanks 89)
These associations arise because the non-doctor words shown here are very common, and are therefore likely to co-occur with any noun.

Document Vectors
Documents are represented as "bags of words" and, when used computationally, as vectors.
- A vector is like an array of floating-point numbers.
- It has direction and magnitude.
- Each vector holds a place for every term in the collection.
- Therefore, most vectors are sparse.

Document Vectors
One location for each word.
(Table: a term-document matrix with the terms nova, galaxy, heat, h'wood, film, role, diet, fur as columns and documents A through I as rows; a blank cell means 0 occurrences.)
"Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in text A, "Heat" occurs 3 times in text A.

Document Vectors
One location for each word (same term-document matrix as above).
"Hollywood" occurs 7 times in text I, "Film" occurs 5 times in text I, "Diet" occurs 1 time in text I, "Fur" occurs 3 times in text I.

Document Vectors
(Same term-document matrix, with the rows A through I labeled as document IDs.)
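A minimal sketch of building such vectors: the term counts for texts A and I restate the examples from the slides, while the vocabulary ordering and everything else are assumptions made for illustration.

```python
from collections import Counter

vocabulary = ["nova", "galaxy", "heat", "hollywood", "film", "role", "diet", "fur"]

# Counts taken from the example above; every term not mentioned is zero (sparse).
doc_A = Counter({"nova": 10, "galaxy": 5, "heat": 3})
doc_I = Counter({"hollywood": 7, "film": 5, "diet": 1, "fur": 3})

def to_vector(counts):
    """One location for each word in the collection vocabulary."""
    return [counts.get(term, 0) for term in vocabulary]

print(to_vector(doc_A))  # [10, 5, 3, 0, 0, 0, 0, 0]
print(to_vector(doc_I))  # [0, 0, 0, 7, 5, 0, 1, 3]
```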

We Can Plot the Vectors
(Figure: documents plotted in a two-dimensional space with axes "Star" and "Diet"; a document about astronomy, a document about movie stars, and a document about mammal behavior fall in different regions.)

Documents in 3D Space (figure)

Content Analysis Summary
- Content analysis: transforming raw text into more computationally useful forms.
- Words in text collections exhibit interesting statistical properties:
  - word frequencies have a Zipf distribution
  - word co-occurrences exhibit dependencies
- Text documents are transformed to vectors:
  - pre-processing includes tokenization, stemming, and collocations/phrases
  - documents occupy a multi-dimensional space

(Diagram: the same retrieval pipeline as before, from information need and text input through parsing, pre-processing, indexing, and ranking against the collection.)
How is the index constructed?

Inverted Index
This is the primary data structure for text indexes.
Main idea: invert documents into a big index.
Basic steps:
- Make a "dictionary" of all the tokens in the collection.
- For each token, list all the docs it occurs in.
- Do a few things to reduce redundancy in the data structure.

Inverted Indexes
We have seen "vector files" conceptually. An inverted file is a vector file "inverted" so that rows become columns and columns become rows.

How Are Inverted Files Created?
Documents are parsed to extract tokens, and each token is saved with its document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight
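A minimal sketch of this parsing step on the two example documents, reusing the simple tokenization rule from earlier; the `pairs` list of (token, document ID) entries is an illustrative intermediate form, not the lecture's exact data structure.

```python
import re

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

# Save every token together with the ID of the document it came from.
pairs = [
    (token, doc_id)
    for doc_id, text in docs.items()
    for token in re.findall(r"[a-z]+", text.lower())
]
print(pairs[:4])  # [('now', 1), ('is', 1), ('the', 1), ('time', 1)]
```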

How Inverted Files Are Created
After all documents have been parsed, the inverted file is sorted alphabetically.

How Inverted Files Are Created
Multiple term entries for a single document are merged, and within-document term frequency information is compiled.

How Inverted Files Are Created
Then the file can be split into a dictionary file and a postings file.

How Inverted Files Are Created
(Figure: the resulting dictionary and postings files for the two example documents.)
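A minimal sketch of these steps on the two example documents: rebuild the (token, document ID) pairs, sort them, merge duplicates into within-document frequencies, and split the result into a dictionary (term and document frequency) and postings (document ID and term frequency). The exact record layout is an assumption made for illustration.

```python
import re
from collections import Counter, defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}
pairs = [(tok, d) for d, text in docs.items() for tok in re.findall(r"[a-z]+", text.lower())]

merged = Counter(pairs)                               # (term, doc_id) -> within-document frequency
postings = defaultdict(list)
for (term, doc_id), freq in sorted(merged.items()):   # sorted alphabetically by term
    postings[term].append((doc_id, freq))

dictionary = {term: len(plist) for term, plist in postings.items()}  # term -> document frequency
print(dictionary["country"], postings["country"])  # 2 [(1, 1), (2, 1)]
print(dictionary["time"], postings["time"])        # 2 [(1, 1), (2, 1)]
```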

Inverted Indexes
Permit fast search for individual terms. For each term, you get a list consisting of:
- document ID
- frequency of the term in the doc (optional)
- position of the term in the doc (optional)
These lists can be used to solve Boolean queries:
- country -> d1, d2
- manor -> d2
- country AND manor -> d2
They are also used by statistical ranking algorithms.

How Inverted Files Are Used
Query: "time" AND "dark"
- "time" appears in 2 docs according to the dictionary -> IDs 1 and 2 from the postings file
- "dark" appears in 1 doc according to the dictionary -> ID 2 from the postings file
Therefore, only doc 2 satisfies the query.
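A minimal sketch of answering that query with the postings built above: look up each term, collect the document IDs from its postings, and intersect the two sets. Only the postings entries needed for this query are shown.

```python
# Postings as built in the sketch above (only the entries needed here are shown).
postings = {"time": [(1, 1), (2, 1)], "dark": [(2, 1)]}

def boolean_and(term1, term2, postings):
    """Intersect the document-ID sets of two postings lists."""
    docs1 = {doc_id for doc_id, _ in postings.get(term1, [])}
    docs2 = {doc_id for doc_id, _ in postings.get(term2, [])}
    return docs1 & docs2

print(boolean_and("time", "dark", postings))  # {2}: only doc 2 satisfies the query
```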

Next Time
- Term weighting
- Statistical ranking