Text Mining Dr Eamonn Keogh Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu

Text Mining/Information Retrieval Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries. This assumption underlies the field of Information Retrieval.

[Figure: overview of an IR system, with components Information need, text input, Parse, Query, Collections, Pre-process, Index, Rank, Evaluate.] How is the query constructed? How is the text processed?

Terminology
Token: a natural language word, e.g. “Swim”, “Simpson”, “92513”.
Document: usually a web page, but more generally any file.

Some IR History
–Roots in the scientific “Information Explosion” following WWII
–Interest in computer-based IR from mid 1950’s
  H.P. Luhn at IBM (1958)
  Probabilistic models at Rand (Maron & Kuhns) (1960)
  Boolean system development at Lockheed (‘60s)
  Vector Space Model (Salton at Cornell 1965)
  Statistical Weighting methods and theoretical advances (‘70s)
  Refinements and Advances in application (‘80s)
  User Interfaces, Large-scale testing and application (‘90s)

Relevance
In what ways can a document be relevant to a query?
–Answer precise question precisely. Who is Homer’s Boss? Montgomery Burns.
–Partially answer question. Where does Homer work? Power Plant.
–Suggest a source for more information. What is Bart’s middle name? Look in Issue 234 of Fanzine.
–Give background information.
–Remind the user of other knowledge.
–Others...

[Figure: overview of an IR system, with components Information need, text input, Parse, Query, Collections, Pre-process, Index, Rank, Evaluate.] How is the query constructed? How is the text processed? The section that follows is about Content Analysis (transforming raw text into a computationally more manageable form).

Document Processing Steps (figure from Baeza-Yates & Ribeiro-Neto)

Stemming and Morphological Analysis
Goal: “normalize” similar words
Morphology (“form” of words)
–Inflectional Morphology: e.g., inflect verb endings and noun number. Never changes grammatical class.
–dog, dogs
–Bike, Biking
–Swim, Swimmer, Swimming
What about… build, building;

Examples of Stemming (using Porter's algorithm)
Original Words: … consign, consigned, consigning, consignment, consist, consisted, consistency, consistent, consistently, consisting, consists …
Stemmed Words: … consign, consign, consign, consign, consist, consist, consist, consist, consist, consist, consist
Porter's algorithm is available in Java, C, Lisp, Perl, Python etc. from http://www.tartarus.org/~martin/PorterStemmer/
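To make the idea of conflating word forms concrete, here is a deliberately naive suffix stripper. It is not Porter's algorithm: the suffix list and the `min_stem` guard are assumptions chosen only so that the slide's example words collapse to the same stems, and it would mis-stem many ordinary English words.

```python
# Toy suffix stripper for illustration only. This is NOT Porter's algorithm.
# The suffix list is hypothetical and ordered longest first.
SUFFIXES = ("ently", "ency", "ment", "ing", "ent", "ed", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = ["consign", "consigned", "consigning", "consignment",
         "consist", "consisted", "consistency", "consistent",
         "consistently", "consisting", "consists"]
for w in words:
    print(w, "->", naive_stem(w))
```

A real system would use a tested implementation of Porter's algorithm, which applies ordered rewrite rules with conditions on the remaining stem rather than a flat suffix list.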

Errors Generated by Porter Stemmer (Krovetz 93)

Statistical Properties of Text Token occurrences in text are not uniformly distributed They are also not normally distributed They do exhibit a Zipf distribution

Token counts in a collection of government documents (157734 tokens, 32259 unique)
Most frequent: 8164 the, 4771 of, 4005 to, 2834 a, 2827 and, 2802 in, 1592 The, 1370 for, 1326 is, 1324 s, 1194 that, 973 by, 969 on, 915 FT, 883 Mr, 860 was, 855 be, 849 Pounds, 798 TEXT, 798 PUB, 798 PROFILE, 798 PAGE, 798 HEADLINE, 798 DOCNO
Occurring once (examples): ABC, ABFT, ABOUT, ACFT, ACI, ACQUI, ACQUISITIONS, ACSIS, ADFT, ADVISERS, AE

Plotting Word Frequency by Rank Main idea: count –How many times tokens occur in the text Over all texts in the collection Now rank these according to how often they occur. This is called the rank.

The Corresponding Zipf Curve
Rank  Freq  Term
1     37    system
2     32    knowledg
3     24    base
4     20    problem
5     18    abstract
6     15    model
7     15    languag
8     15    implem
9     13    reason
10    13    inform
11    11    expert
12    11    analysi
13    10    rule
14    10    program
15    10    oper
16    10    evalu
17    10    comput
18    10    case
19    9     gener
20    9     form

Zipf Distribution The Important Points: –a few elements occur very frequently –a medium number of elements have medium frequency –many elements occur very infrequently

Zipf Distribution The product of the frequency of words (f) and their rank (r) is approximately constant –Rank = order of words’ frequency of occurrence Another way to state this is with an approximately correct rule of thumb: –Say the most common term occurs C times –The second most common occurs C/2 times –The third most common occurs C/3 times –…
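The rank/frequency relationship can be checked on any text with a few lines of code. The sketch below counts tokens, ranks them by frequency, and reports the product rank × frequency, which Zipf's rule of thumb says should stay roughly constant on a large collection. The whitespace tokenizer and the tiny sample sentence are assumptions for illustration; a sample this small will not show the effect convincingly.

```python
from collections import Counter


def rank_frequency(text):
    """Return (rank, token, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(text.lower().split())
    rows = []
    for rank, (token, freq) in enumerate(counts.most_common(), start=1):
        rows.append((rank, token, freq, rank * freq))
    return rows


# Hypothetical sample text; substitute a real collection to see the Zipf shape.
sample = "the cat and the dog and the bird saw the cat"
rows = rank_frequency(sample)
for row in rows:
    print(row)
```

Plotting frequency against rank on log-log axes for a real corpus should give a roughly straight line if the collection follows a Zipf distribution.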

Zipf Distribution (linear and log scale). Illustration by Jakob Nielsen.

What Kinds of Data Exhibit a Zipf Distribution? Words in a text collection –Virtually any language usage Library book checkout patterns Incoming Web Page Requests Outgoing Web Page Requests Document Size on Web City Sizes …

Consequences of Zipf There are always a few very frequent tokens that are not good discriminators. –Called “stop words” in IR English examples: to, from, on, and, the,... There are always a large number of tokens that occur once and can mess up algorithms. Medium frequency words most descriptive

Word Frequency vs. Resolving Power (from van Rijsbergen 79) The most frequent words are not the most descriptive.

Statistical Independence Two events x and y are statistically independent if the product of the probabilities of each happening individually equals the probability of both happening together: $P(x)\,P(y) = P(x, y)$.

Statistical Independence and Dependence What are examples of things that are statistically independent? What are examples of things that are statistically dependent?

Lexical Associations
Subjects write the first word that comes to mind
–doctor/nurse; black/white (Palermo & Jenkins 64)
Text corpora yield similar associations
One measure: Mutual Information (Church and Hanks 89)
$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$
If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)

Statistical Independence Compute for a window of words. [Figure: windows w1, w11, w21 sliding over the token sequence a b c d e f g h i j k l m n o p]
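A minimal sketch of a windowed association score in the spirit of Church and Hanks: it estimates P(x), P(y) and P(x, y) from counts and returns log2 of their ratio. The window size of 5, the estimation of P(x, y) as "y appears within the window after x", and the toy token list are all assumptions made for illustration.

```python
import math
from collections import Counter


def association_ratio(tokens, x, y, window=5):
    """log2( P(x, y) / (P(x) * P(y)) ), counting y within `window - 1` tokens after x.

    Returns None when x and y never co-occur in a window.
    """
    n = len(tokens)
    unigram = Counter(tokens)
    joint = 0
    for i, tok in enumerate(tokens):
        # Look at the next (window - 1) tokens following an occurrence of x.
        if tok == x and y in tokens[i + 1 : i + window]:
            joint += 1
    if joint == 0:
        return None
    p_x = unigram[x] / n
    p_y = unigram[y] / n
    p_xy = joint / n
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical toy corpus.
tokens = "the doctor saw the nurse and the doctor called the nurse".split()
score = association_ratio(tokens, "doctor", "nurse")
print(score)
```

A score near zero suggests the two words co-occur about as often as independence would predict; a large positive score suggests an association. Estimates from small counts are unreliable, so on a real corpus low-frequency pairs are usually filtered out.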

Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

Un-Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89) These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun.

Associations Are Important Because… We may be able to discover phrases that should be treated as a single word, e.g. “data mining”. We may be able to automatically discover synonyms, e.g. “Bike” and “Bicycle”.

Content Analysis Summary Content Analysis: transforming raw text into more computationally useful forms Words in text collections exhibit interesting statistical properties –Word frequencies have a Zipf distribution –Word co-occurrences exhibit dependencies Text documents are transformed to vectors –Pre-processing includes tokenization, stemming, collocations/phrases

[Figure: overview of an IR system, with components Information need, text input, Parse, Query, Collections, Pre-process, Index, Rank, Evaluate.] How is the index constructed? The section that follows is about Index Construction.

Inverted Index This is the primary data structure for text indexes Main Idea: –Invert documents into a big index Basic steps: –Make a “dictionary” of all the tokens in the collection –For each token, list all the docs it occurs in. –Do a few things to reduce redundancy in the data structure

How Are Inverted Files Created
Documents are parsed to extract tokens. These are saved with the Document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

How Inverted Files are Created After all documents have been parsed the inverted file is sorted alphabetically.

How Inverted Files are Created Multiple term entries for a single document are merged. Within-document term frequency information is compiled.

How Inverted Files are Created Then the file can be split into –A Dictionary file and –A Postings file

How Inverted Files are Created Dictionary Postings

Inverted Indexes Permit fast search for individual terms For each term, you get a list consisting of: –document ID –frequency of term in doc (optional) –position of term in doc (optional) These lists can be used to solve Boolean queries: country -> d1, d2 manor -> d2 country AND manor -> d2 Also used for statistical ranking algorithms

How Inverted Files are Used Query on “time” AND “dark” 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file 1 doc with “dark” in dictionary -> ID 2 from posting file Therefore, only doc 2 satisfied the query. Dictionary Postings
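The construction and lookup steps can be sketched in a few lines, using the two example documents from the earlier slide. The regular-expression tokenizer, the lower-casing, and the dictionary-of-dictionaries layout (term → {doc ID: term frequency}) are assumptions for illustration; a production index would store the dictionary and postings in separate, compressed files.

```python
import re
from collections import defaultdict

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}


def build_inverted_index(docs):
    """Map each term to {doc_id: within-document frequency}, terms sorted alphabetically."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            # Merging repeated terms per document yields the term frequency.
            postings[token][doc_id] = postings[token].get(doc_id, 0) + 1
    return {term: postings[term] for term in sorted(postings)}


def boolean_and(index, *terms):
    """Return the sorted doc IDs that contain every one of the given terms."""
    result = None
    for term in terms:
        ids = set(index.get(term, {}))
        result = ids if result is None else result & ids
    return sorted(result or [])


index = build_inverted_index(docs)
print(index["country"])
print(boolean_and(index, "time", "dark"))
```

The AND query intersects posting lists, so only the documents listed for every term survive, as in the “time” AND “dark” example.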

[Figure: overview of an IR system, with components Information need, text input, Parse, Query, Collections, Pre-process, Index, Rank, Evaluate.] How is the index constructed? The section that follows is about Querying (and ranking).

Simple query language: Boolean
–Terms + Connectors (or operators)
–terms: words, normalized (stemmed) words, phrases
–connectors: AND, OR, NOT, NEAR (Pseudo Boolean)
Word    Doc
Cat     x
Dog
Collar  x
Leash

Boolean Queries
Cat
Cat OR Dog
Cat AND Dog
(Cat AND Dog)
(Cat AND Dog) OR Collar
(Cat AND Dog) OR (Collar AND Leash)
(Cat OR Dog) AND (Collar OR Leash)

Boolean Queries (Cat OR Dog) AND (Collar OR Leash) –Each of the following combinations works: [Table: rows Cat, Dog, Collar, Leash, with an x for each term present in a combination.]

Boolean Queries (Cat OR Dog) AND (Collar OR Leash) –None of the following combinations work: [Table: rows Cat, Dog, Collar, Leash, with an x for each term present in a combination.]
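Which combinations satisfy (Cat OR Dog) AND (Collar OR Leash) can be checked mechanically. The sketch below treats a document as a set of terms; the function name and the lower-cased term spellings are assumptions for illustration.

```python
def matches(doc_terms):
    """True if the document satisfies (cat OR dog) AND (collar OR leash)."""
    terms = set(doc_terms)
    has_animal = bool(terms & {"cat", "dog"})
    has_accessory = bool(terms & {"collar", "leash"})
    return has_animal and has_accessory


print(matches({"cat", "collar"}))    # one term from each group
print(matches({"cat", "dog"}))       # both terms from the same group
print(matches({"collar", "leash"}))  # no animal term
```

A combination works exactly when it contains at least one term from each parenthesized group, so having both Cat and Dog but neither Collar nor Leash fails.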

Boolean Searching
“Measurement of the width of cracks in prestressed concrete beams”
Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete
[Figure labels: Cracks, Beams, Width measurement, Prestressed concrete]
Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
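The relaxed query above accepts any document containing at least three of the four concepts. One way to generate such a query is to enumerate all k-term subsets; the helper names below are hypothetical and the single-letter terms follow the slide's abbreviations.

```python
from itertools import combinations


def relaxed_query(terms, k):
    """Return the AND-clauses of an OR-of-ANDs query requiring any k of the terms."""
    return [set(clause) for clause in combinations(terms, k)]


def satisfies(doc_terms, clauses):
    """True if the document contains every term of at least one clause."""
    doc = set(doc_terms)
    return any(clause <= doc for clause in clauses)


# C = cracks, B = beams, W = width measurement, P = prestressed concrete
clauses = relaxed_query(["C", "B", "W", "P"], 3)
print(clauses)
print(satisfies({"C", "B", "P"}, clauses))
print(satisfies({"C", "B"}, clauses))
```

Lowering k relaxes the query further, at the cost of returning more, and less relevant, documents.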

Ordering of Retrieved Documents Pure Boolean has no ordering In practice: –order chronologically –order by total number of “hits” on query terms What if one term has more hits than others? Is it better to have one of each term or many of one term?

Boolean Model
Advantages
–simple queries are easy to understand
–relatively easy to implement
Disadvantages
–difficult to specify what is wanted
–too much returned, or too little
–ordering not well determined
Dominant language in commercial Information Retrieval systems until the WWW
Since the Boolean model is limited, let's consider a generalization…

Vector Model Documents are represented as “bags of words” Represented as vectors when used computationally –A vector is like an array of floating point –Has direction and magnitude –Each vector holds a place for every term in the collection –Therefore, most vectors are sparse Smithers secretly loves Monty Burns Monty Burns secretly loves Smithers Both map to… [ Burns, loves, Monty, secretly, Smithers]
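The bag-of-words mapping discards word order, which is why the two Smithers/Burns sentences become the same vector. A minimal sketch, assuming a whitespace tokenizer and the five-term vocabulary shown on the slide:

```python
def bag_of_words(text, vocabulary):
    """Return term counts for `text`, one position per vocabulary term."""
    tokens = text.lower().split()
    return [tokens.count(term.lower()) for term in vocabulary]


vocabulary = ["Burns", "loves", "Monty", "secretly", "Smithers"]
a = bag_of_words("Smithers secretly loves Monty Burns", vocabulary)
b = bag_of_words("Monty Burns secretly loves Smithers", vocabulary)
print(a, b, a == b)
```

With a real collection the vocabulary holds every term in the collection, so most positions in any one document's vector are zero and a sparse representation is normally used instead of a dense list.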

Document Vectors One location for each word. [Table: term weights for documents A to I (document ids) over the terms nova, galaxy, heat, h’wood, film, role, diet, fur.]

We Can Plot the Vectors [Figure: axes Star and Diet, with points for a doc about astronomy, a doc about movie stars, and a doc about mammal behavior]

Documents in 3D Vector Space (illustration from Jurafsky & Martin) [Figure: documents D1 to D11 plotted against term axes t1, t2, t3]

Vector Space Model Note that the query is projected into the same vector space as the documents. The query here is for “Marge”. We can use a vector similarity model to determine the best match to our query (details in a few slides). But what weights should we use for the terms?

Assigning Weights to Terms Binary Weights Raw term frequency tf x idf –Recall the Zipf distribution –Want to weight terms highly if they are frequent in relevant documents … BUT infrequent in the collection as a whole

Binary Weights Only the presence (1) or absence (0) of a term is included in the vector We have already seen and discussed this model.

Raw Term Weights The frequency of occurrence for the term in each document is included in the vector This model is open to exploitation by websites… sex sex sex sex sex sex sex sex sex sex Counts can be normalized by document lengths.

tf * idf Weights tf * idf measure: –term frequency (tf) –inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution Goal: assign a tf * idf weight to each term in each document

tf * idf

Inverse Document Frequency IDF provides high values for rare words and low values for common words For a collection of 10000 documents
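The weighting formula itself did not survive on the slide above, so the sketch below uses one common textbook form as an assumption: the weight of term k in document i is tf_ik × log10(N / n_k), where N is the number of documents and n_k the number of documents containing the term. The base-10 logarithm is also an assumption; other bases only rescale the weights.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency: log10(N / n_k). Assumes doc_freq >= 1."""
    return math.log10(n_docs / doc_freq)


def tf_idf(tf, n_docs, doc_freq):
    """Weight of a term occurring `tf` times in a document."""
    return tf * idf(n_docs, doc_freq)


# IDF for a collection of 10000 documents at several document frequencies.
for doc_freq in (1, 20, 5000, 10000):
    print(doc_freq, round(idf(10000, doc_freq), 3))
```

A term that appears in every document gets an IDF of zero, so it contributes nothing to the weight however often it occurs, which is the stop-word behaviour the Zipf discussion motivates.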

Similarity Measures Simple matching (coordination level match) Dice’s Coefficient Jaccard’s Coefficient Cosine Coefficient Overlap Coefficient

Cosine [Figure: vectors plotted on two axes, each with tick marks from 0.2 to 1.0]
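The cosine coefficient measures the angle between two term vectors: the dot product divided by the product of the vector lengths. A minimal sketch on plain Python lists; returning 0.0 for an all-zero vector is an assumption, since the cosine is undefined in that case.

```python
import math


def cosine(u, v):
    """Cosine of the angle between equal-length vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat an empty vector as matching nothing
    return dot / (norm_u * norm_v)


print(cosine([1, 2], [2, 4]))  # same direction, different length
print(cosine([1, 0], [0, 1]))  # no terms in common
print(cosine([3, 4], [4, 3]))
```

Because the result depends only on direction, a long document and a short document with the same term proportions score as identical, which is one reason cosine is preferred over a raw dot product.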

Problems with Vector Space There is no real theoretical basis for the assumption of a term space –it is more for visualization than having any real basis –most similarity measures work about the same regardless of model Terms are not really orthogonal dimensions –Terms are not independent of all other terms

Probabilistic Models Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle) Rely on accurate estimates of probabilities

Relevance Feedback
Main Idea:
–Modify existing query based on relevance judgements
  Query Expansion: extract terms from relevant documents and add them to the query
  Term Re-weighting: and/or re-weight the terms already in the query
–Two main approaches:
  Automatic (pseudo-relevance feedback)
  Users select relevant documents
–Users/system select terms from an automatically-generated list

Definition: Relevance Feedback is the reformulation of a search query in response to feedback provided by the user for the results of previous versions of the query.
Suppose you are interested in bovine agriculture on the banks of the river Jordan…
Before feedback: Term Vector [Jordan, Bank, Bull, River], Term Weights [ 1, 1, 1, 1 ]
After feedback: Term Vector [Jordan, Bank, Bull, River], Term Weights [ 1.1, 0.1, 1.3, 1.2 ]
Feedback loop: Search, Display Results, Gather Feedback, Update Weights

Rocchio Method
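The formula for this slide is missing, so the sketch below uses the standard textbook form of Rocchio's update as an assumption: the new query is alpha × the old query, plus beta × the centroid of the relevant documents, minus gamma × the centroid of the non-relevant documents. The default values of alpha, beta and gamma, and the clipping of negative weights to zero, are illustrative choices rather than values taken from the slides.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the re-weighted query vector after one round of feedback."""

    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel = centroid(relevant)
    non = centroid(non_relevant)
    updated = [alpha * q + beta * r - gamma * n for q, r, n in zip(query, rel, non)]
    # Assumption: clip negative weights to zero rather than keep negative terms.
    return [max(0.0, w) for w in updated]


# Term order: [Jordan, Bank, Bull, River]; the document vectors are hypothetical.
new_query = rocchio(
    query=[1, 1, 1, 1],
    relevant=[[1, 0, 1, 1]],
    non_relevant=[[0, 1, 0, 0]],
    alpha=1.0, beta=0.5, gamma=0.5,
)
print(new_query)
```

Terms that appear in relevant documents but not in the original query receive a positive weight (query expansion), while terms already in the query are moved up or down (term re-weighting).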

Rocchio Illustration Although we usually work in vector space for text, it is easier to visualize Euclidean space. [Figure panels: Original Query; Term Re-weighting; Query Expansion] Note that both the location of the center and the shape of the query have changed.

Rocchio Method Rocchio automatically –re-weights terms –adds in new terms (from relevant docs) have to be careful when using negative terms Rocchio is not a machine learning algorithm Most methods perform similarly –results heavily dependent on test collection Machine learning methods are proving to work better than standard IR approaches like Rocchio

Using Relevance Feedback Known to improve results People don’t seem to like giving feedback!

Relevance Feedback for Time Series [Figure: the original query and the weight vector. Initially, all weights are the same.] Note: In this example we are using a piecewise linear approximation of the data. We will learn more about this representation later.

The initial query is executed, and the five best matches are shown (in the dendrogram). One by one the 5 best matching sequences will appear, and the user will rate each of them from very bad (-3) to very good (+3).

Based on the user feedback, both the shape and the weight vector of the query are changed. The new query can be executed. The hope is that the query shape and weights will converge to the optimal query. Two papers consider relevance feedback for time series. Query Expansion: L. Wu, C. Faloutsos, K. Sycara, T. Payne: FALCON: Feedback Adaptive Loop for Content-Based Retrieval. VLDB 2000: 297-306. Term Re-weighting: Keogh, E. & Pazzani, M. Relevance feedback retrieval of time series data. In Proceedings of SIGIR 99.

Document Space has High Dimensionality What happens beyond 2 or 3 dimensions? Similarity still has to do with how many tokens are shared in common. More terms -> harder to understand which subsets of words are shared among similar documents. One approach to handling high dimensionality: Clustering

Text Clustering Finds overall similarities among groups of documents. Finds overall similarities among groups of tokens. Picks out some themes, ignores others.

Scatter/Gather Hearst & Pedersen 95 Cluster sets of documents into general “themes”, like a table of contents (using K-means) Display the contents of the clusters by showing topical terms and typical titles User chooses subsets of the clusters and re-clusters the documents within Resulting new groups have different “themes”

S/G Example: query on “star”
Encyclopedia text
14 sports
8 symbols
47 film, tv
68 film, tv (p)
7 music
97 astrophysics
67 astronomy (p)
12 stellar phenomena
10 flora/fauna
49 galaxies, stars
29 constellations
7 miscellaneous
Clustering and re-clustering is entirely automated

Ego Surfing! http://vivisimo.com/

[Figure: overview of an IR system, with components Information need, text input, Parse, Query, Collections, Pre-process, Index, Rank, Evaluate.] How is the index constructed? The section that follows is about Evaluation.

Evaluation Why Evaluate? What to Evaluate? How to Evaluate?

Why Evaluate? Determine if the system is desirable Make comparative assessments Others?

What to Evaluate? How much of the information need is satisfied. How much was learned about a topic. Incidental learning: –How much was learned about the collection. –How much was learned about other topics. How inviting the system is.

What to Evaluate?
What can be measured that reflects users’ ability to use system? (Cleverdon 66)
–Coverage of Information
–Form of Presentation
–Effort required/Ease of Use
–Time and Space Efficiency
–Recall: proportion of relevant material actually retrieved
–Precision: proportion of retrieved material actually relevant
(Recall and Precision together measure effectiveness)

Relevant vs. Retrieved Relevant Retrieved All docs

Precision vs. Recall Relevant Retrieved All docs

Why Precision and Recall? Intuition: Get as much good stuff while at the same time getting as little junk as possible.

Retrieved vs. Relevant Documents Relevant Very high precision, very low recall

Retrieved vs. Relevant Documents Relevant Very low precision, very low recall (0 in fact)

Retrieved vs. Relevant Documents Relevant High recall, but low precision

Retrieved vs. Relevant Documents Relevant High precision, high recall (at last!)

Precision/Recall Curves There is a tradeoff between Precision and Recall, so measure Precision at different levels of Recall. Note: this is an AVERAGE over MANY queries. [Figure: precision (y-axis) plotted against recall (x-axis)]

Precision/Recall Curves Difficult to determine which of these two hypothetical results is better. [Figure: two sets of points, precision (y-axis) plotted against recall (x-axis)]

Precision/Recall Curves

Recall under various retrieval assumptions (1000 Documents, 100 Relevant) [Figure: recall (y-axis, 0.0 to 1.0) against proportion of documents retrieved (x-axis, 0.0 to 1.0); curves: Random, Perfect, Perverse, Tangent Parabolic Recall, Parabolic Recall]

Precision under various assumptions (1000 Documents, 100 Relevant) [Figure: precision (y-axis, 0.0 to 1.0) against proportion of documents retrieved (x-axis, 0.0 to 1.0); curves: Random, Perfect, Perverse, Tangent Parabolic Recall, Parabolic Recall]

Document Cutoff Levels Another way to evaluate: –Fix the number of documents retrieved at several levels: top 5 top 10 top 20 top 50 top 100 top 500 –Measure precision at each of these levels –Take (weighted) average over results This is a way to focus on how well the system ranks the first k documents.
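Precision at a fixed cutoff is simple to compute from a ranked result list and a set of judged-relevant documents. In the sketch below, dividing by k even when fewer than k documents were retrieved is an assumed convention, and the ranking and relevance judgments are hypothetical.

```python
def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / k


# Hypothetical ranking (best first) and relevance judgments.
ranked = [3, 7, 1, 9, 4, 8, 2, 6, 5, 0]
relevant = {3, 1, 4, 5}
for k in (1, 5, 10):
    print(k, precision_at_k(ranked, relevant, k))
```

Averaging these values over the chosen cutoff levels, and then over many queries, gives a single figure that emphasizes how well the system ranks the first few documents.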

Problems with Precision/Recall Can’t know true recall value –except in small collections Precision/Recall are related –A combined measure sometimes more appropriate Assumes batch mode –Interactive IR is important and has different criteria for successful searches –Assumes a strict rank ordering matters.

Relation to Contingency Table
                      Doc is Relevant   Doc is NOT relevant
Doc is retrieved      a                 b
Doc is NOT retrieved  c                 d
Accuracy: (a+d) / (a+b+c+d)
Precision: a/(a+b)
Recall: a/(a+c)
Why don’t we use Accuracy for IR?
–(Assuming a large collection)
–Most docs aren’t relevant
–Most docs aren’t retrieved
–Inflates the accuracy value
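A small worked example shows why accuracy is misleading for IR. The sketch below computes the four contingency cells from sets of document IDs and then the three measures; the collection size and the retrieved and relevant sets are hypothetical.

```python
def contingency(retrieved, relevant, collection):
    """Return (a, b, c, d) as defined in the contingency table."""
    a = len(retrieved & relevant)               # retrieved and relevant
    b = len(retrieved - relevant)               # retrieved, not relevant
    c = len(relevant - retrieved)               # relevant, not retrieved
    d = len(collection - retrieved - relevant)  # neither
    return a, b, c, d


def precision(a, b):
    return a / (a + b) if a + b else 0.0


def recall(a, c):
    return a / (a + c) if a + c else 0.0


def accuracy(a, b, c, d):
    return (a + d) / (a + b + c + d)


# Hypothetical collection of 1000 docs, 10 relevant, 20 retrieved.
collection = set(range(1000))
relevant = set(range(10))
retrieved = set(range(5, 25))
a, b, c, d = contingency(retrieved, relevant, collection)
print(a, b, c, d)
print(precision(a, b), recall(a, c), accuracy(a, b, c, d))
```

Here the system finds only half of the relevant documents and three quarters of what it returns is junk, yet accuracy is 0.98 because the 975 documents that are neither relevant nor retrieved dominate the count.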

The E-Measure Combine Precision and Recall into one number (van Rijsbergen 79) P = precision R = recall b = measure of relative importance of P or R For example, b = 0.5 means user is twice as interested in precision as recall
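The formula is missing from this slide, so the sketch below assumes the usual statement of van Rijsbergen's measure: E = 1 − (1 + b²)PR / (b²P + R). Lower E is better. Returning 1.0 when both precision and recall are zero is an assumption to avoid division by zero.

```python
def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E: 1 - (1 + b^2) * P * R / (b^2 * P + R). Lower is better."""
    denominator = b * b * precision + recall
    if denominator == 0:
        return 1.0  # assumption: worst score when nothing relevant is retrieved
    return 1.0 - (1.0 + b * b) * precision * recall / denominator


# With b = 0.5 precision counts for more than recall.
print(e_measure(0.8, 0.4, b=0.5))  # high precision, low recall
print(e_measure(0.4, 0.8, b=0.5))  # low precision, high recall
print(e_measure(0.5, 0.5, b=1.0))  # balanced case
```

With b = 0.5 the run with high precision and low recall scores better (lower E) than the mirror-image run, which matches the slide's reading of b as the relative importance of precision and recall.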

How to Evaluate? Test Collections

Test Collections
Cranfield 2
–1400 Documents, 221 Queries
–200 Documents, 42 Queries
INSPEC: 542 Documents, 97 Queries
UKCIS: > 10000 Documents, multiple sets, 193 Queries
ADI: 82 Documents, 35 Queries
CACM: 3204 Documents, 50 Queries
CISI: 1460 Documents, 35 Queries
MEDLARS (Salton): 273 Documents, 18 Queries

TREC
Text REtrieval Conference/Competition
–Run by NIST (National Institute of Standards & Technology)
–2002 (November) will be 11th year
Collection: >6 Gigabytes (5 CD-ROMs), >1.5 Million Docs
–Newswire & full text news (AP, WSJ, Ziff, FT)
–Government documents (Federal Register, Congressional Record)
–Radio Transcripts (FBIS)
–Web “subsets”

TREC (cont.) Queries + Relevance Judgments –Queries devised and judged by “Information Specialists” –Relevance judgments done only for those documents retrieved -- not entire collection! Competition –Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66) –Results judged on precision and recall, going up to a recall level of 1000 documents

TREC Benefits: –made research systems scale to large collections (pre- WWW) –allows for somewhat controlled comparisons Drawbacks: –emphasis on high recall, which may be unrealistic for what most users want –very long queries, also unrealistic –comparisons still difficult to make, because systems are quite different on many dimensions –focus on batch ranking rather than interaction –no focus on the WWW

TREC is changing Emphasis on specialized “tracks” –Interactive track –Natural Language Processing (NLP) track –Multilingual tracks (Chinese, Spanish) –Filtering track –High-Precision –High-Performance http://trec.nist.gov/

What to Evaluate? Effectiveness –Difficult to measure –Recall and Precision are one way –What might be others?
