©2003 Paula Matuszek CSC 9010: Text Mining Applications Document-Level Techniques Dr. Paula Matuszek (610)

Dealing with Documents
• Sometimes our information need is not for something specific that we can capture in a clear-cut knowledge model
  – What is the current research in secure networks?
  – What are our competitors working on?
  – Who should review this paper?
• These kinds of questions are more typically answered by techniques that look at the entire document or set of documents.
  – Categorizing
  – Clustering
  – Visualizing

Document Categorization
• Document categorization
  – Assign documents to pre-defined categories
• Examples
  – Process email into work, personal, junk
  – Process documents from a newsgroup into “interesting”, “not interesting”, “spam and flames”
  – Process transcripts of bugged phone calls into “relevant” and “irrelevant”
• Issues
  – Real-time?
  – How many categories per document? Flat or hierarchical?
  – Categories defined automatically or by hand?

Document Categorization
• Usually
  – relatively few categories
  – well defined; a person could do the task easily
  – categories don't change quickly
• Flat vs. hierarchy
  – Simple categorization is into mutually exclusive document collections
  – Richer categorization is into a hierarchy with multiple inheritance
  – broader and narrower categories
  – documents can go in more than one place
  – merges into search engines with category browsers

Categorization -- Automatic
• Statistical approaches similar to a search engine
• A set of “training” documents defines the categories
  – Underlying representation of a document is bag of words/TF*IDF variant
  – Category description is created using neural nets, regression trees, other machine learning techniques
  – Individual documents are categorized by net, inferred rules, etc.
• Requires relatively little effort to create categories
• Accuracy is heavily dependent on "good" training examples
• Typically limited to flat, mutually exclusive categories
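The bag of words/TF*IDF representation mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the weighting is raw term frequency times log(N/df), which is one common TF*IDF variant among several, and the three training messages and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors

# Hypothetical training messages, one per category.
training = [
    "meeting agenda budget meeting",  # "work"
    "dinner family weekend",          # "personal"
    "free prize winner free",         # "junk"
]
vectors = tf_idf(training)
print(vectors[0])
```

A term that appears in every document gets weight 0 under this variant, which is why very common words contribute nothing to the category description.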

Categorization: Manual
• Natural language/linguistic techniques
• Categories are defined by people
  – underlying representation of a document is a stream of tokens
  – category description contains
    – ontology of terms and relations
    – pattern-matching rules
  – individual documents are categorized by pattern matching
• Defining categories can be very time-consuming
• Typically takes some experimentation to "get it right"
• Can handle much more complex structures

Document Clustering
• Document clustering
  – Cluster documents based on similarity
• Examples
  – Group samples of writing in an attempt to determine author(s)
  – Look for “hot spots” in customer feedback
  – Find new trends in a document collection (outliers, hard to classify)
• Getting into areas where we don’t know ahead of time what we will have; true “mining”

Document Clustering -- How
• Typical process is:
  – Describe each document
  – Assess similarities among documents
  – Establish a classification scheme that creates optimal "separation"
• One typical approach:
  – document is represented as a term vector
  – cosine similarity for measuring association
  – bottom-up pairwise combining of documents to get clusters
• Assumes you have the corpus in hand
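The "one typical approach" above (term vectors, cosine similarity, bottom-up pairwise merging) can be sketched as follows. The slide does not say how similarity between merged groups is measured, so average-link similarity is an assumption here; the four sample documents and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cluster(docs, n_clusters):
    """Bottom-up clustering: repeatedly merge the two most similar clusters."""
    vectors = [Counter(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]  # start with one cluster per document

    def avg_sim(a, b):
        # Average-link: mean pairwise cosine between members (an assumed choice).
        return sum(cosine(vectors[i], vectors[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        pairs = ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters)))
        x, y = max(pairs, key=lambda p: avg_sim(clusters[p[0]], clusters[p[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

docs = [
    "golf tournament golf player",
    "golf player wins tournament",
    "network security firewall",
    "secure network firewall rules",
]
print(cluster(docs, 2))  # [[0, 1], [2, 3]]
```

The whole corpus is needed before the first merge, which is the sense in which this approach "assumes you have the corpus in hand".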

Document Clustering
• Approaches vary a great deal in
  – document characteristics used to describe the document (linguistic or semantic? bag of words?)
  – methods used to define "similar"
  – methods used to create clusters
• Other relevant factors
  – Number of clusters to extract is variable
  – Often combined with visualization tools based on similarity and/or clusters
  – Sometimes important that the approach be incremental
• Useful approach when you don't have a handle on the domain or it's changing

Document Visualization
• Visualization
  – Visually display relationships among documents
• Examples
  – hyperbolic viewer based on document similarity; browse a field of scientific documents
  – “map”-based techniques showing peaks, valleys, outliers
  – graphs showing relationships between companies and research areas
• Highly interactive, intended to aid a human in finding interrelationships and new knowledge in the document set.

Latent Semantic Analysis
• The bag-of-words methods we have looked at ignore syntax: a document is "about" the words in it
• People interpret documents in a richer context:
  – a document is about some domain
  – reflected in the vocabulary
  – but not limited to it

Match Topic and Phrase
• Astronomy
• Automobiles
• Biology
• I saw Pathfinder on Mars with a telescope.
• The Pathfinder photograph mars our perception of a lifeless planet.
• The Pathfinder photograph from Ford has arrived.
• When a Pathfinder fords a river it sometimes mars its paint job.

Domain-Based Processing
• This task is relatively easy because we know a lot about all of the domains, and can disambiguate using that knowledge.
• It's not completely trivial: the biology choice could also have been astronomy.
• Information Extraction systems like GATE and AeroText model the domain knowledge explicitly, but this takes a lot of effort.
• Is there an easier way?

Word Co-Occurrences
• Bag-of-words (BOW) approaches assume meaning is carried by vocabulary and ignore syntax
• Domain modeling approaches capture detailed knowledge about the meaning
• An intermediate position is to look at vocabulary groups: what words tend to occur together?
• Still a statistical approach, but a richer representation than single terms

Examples of What We Would Like:
• Looking for articles about Tiger Woods in an AP newswire database brings up stories about the golfer, followed by articles about golf tournaments that don't mention his name.
• Constraining the search to days when no articles were written about Tiger Woods still brings up stories about golf tournaments and well-known players.
• So we are recognizing that Tiger Woods is about golf.
javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm

Example
• Tiger Woods takes some drama out of cut streak with opening round at Funai.
• Every player on the money list is at Disney trying to make it to the Tour Championship. Tiger Woods has no such worries.
• Going into this week's event, Tiger Woods has made the cut in 113 successive events. He tied the PGA Tour's consecutive cut record two weeks ago at the Funai Classic in Orlando, Florida, while Cink finished second.
• Stewart Cink finished second at the Funai Classic at Walt Disney World.

Example, Cont.
• Woods tended to occur in the same articles as Funai.
• Cink also tended to occur in articles about Funai.
• So there is a relationship between Woods and Cink which is stronger than is indicated just by the one article in which they are both mentioned.
• It has to do with cuts, Funai, and the Tour Championship.
• So by creating a term-document matrix and examining it we can find potential relationships which are latent, or hidden. They are tied together by the meaning, or semantics, of the terms.
• This is the basic concept of Latent Semantic Analysis and Latent Semantic Indexing.
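To make the term-document matrix concrete, here is a small sketch built from the four example passages. It uses a binary (present/absent) matrix for a hand-picked handful of terms, which is a simplifying assumption; the helper names `tokenize` and `cooccurrence` are hypothetical.

```python
import re

docs = [
    "Tiger Woods takes some drama out of cut streak with opening round at Funai.",
    "Every player on the money list is at Disney trying to make it to the "
    "Tour Championship. Tiger Woods has no such worries.",
    "Going into this week's event, Tiger Woods has made the cut in 113 successive "
    "events. He tied the PGA Tour's consecutive cut record two weeks ago at the "
    "Funai Classic in Orlando, Florida, while Cink finished second.",
    "Stewart Cink finished second at the Funai Classic at Walt Disney World.",
]
terms = ["woods", "cink", "funai", "disney"]  # hand-picked for illustration

def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

doc_tokens = [tokenize(d) for d in docs]

# Binary term-document matrix: one row per term, one column per document.
matrix = {t: [1 if t in tokens else 0 for tokens in doc_tokens] for t in terms}

def cooccurrence(a, b):
    """Number of documents in which both terms appear."""
    return sum(x * y for x, y in zip(matrix[a], matrix[b]))

for t in terms:
    print(f"{t:8s}", matrix[t])
print(cooccurrence("woods", "cink"))   # 1: only one passage mentions both
print(cooccurrence("woods", "funai"))  # 2
print(cooccurrence("cink", "funai"))   # 2
```

Woods and Cink share only one passage directly, but each shares two with Funai; that indirect link is the kind of latent relationship the matrix exposes.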

Problem: Very High Dimensionality
• A vector of TF*IDF representing a document is high dimensional.
• If we start looking at a matrix of terms by documents, it gets even worse.
• Need some way to trim the words looked at
  – First, throw away anything "not useful"
  – Second, identify clusters and pick representative terms

Throw Away
• Most domain semantics are carried by nouns, adjectives, verbs, adverbs
  – throw away prepositions, articles, conjunctions, pronouns
• Very frequent words don't add to domain semantics.
  – throw away common verbs (go, be, see), adjectives (big, good, bad), adverbs (very)
  – throw away words which appear in most documents
• Very infrequent words don't either
  – throw away terms which only appear in one document
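The three pruning rules above can be sketched as a single vocabulary filter. The stop-word list is a tiny illustrative stand-in, the 80% cutoff for "most documents" is an assumed threshold, and the sample documents and the name `prune_vocabulary` are hypothetical.

```python
from collections import Counter

# Tiny illustrative stop list: function words plus a few very common content words.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "at", "and", "or", "it",
             "is", "be", "go", "see", "big", "good", "bad", "very"}

def prune_vocabulary(docs, max_doc_fraction=0.8):
    """Return the sorted terms that survive all three pruning rules."""
    tokenized = [set(d.lower().split()) for d in docs]
    df = Counter(t for tokens in tokenized for t in tokens)  # document frequency
    n_docs = len(docs)
    return sorted(
        t for t, count in df.items()
        if t not in STOPWORDS                    # rule 1: drop function/common words
        and count / n_docs <= max_doc_fraction   # rule 2: drop terms in most documents
        and count > 1                            # rule 3: drop terms in only one document
    )

docs = [
    "news the golf tournament in orlando",
    "news the golf player at the tournament",
    "news the player is very good at golf",
    "news the network firewall",
    "news the network is secure",
]
print(prune_vocabulary(docs))  # ['golf', 'network', 'player', 'tournament']
```

Here "news" is dropped because it appears in every document, and "orlando", "firewall", and "secure" are dropped because each appears in only one.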

What's Left
• A condensed matrix where we can assume that most terms are meaningful.
  – It's still very large, and very sparse.
  – Basic index table for a keyword search tool.
• Where can we go now?
  – We have fewer concepts than terms
  – So move from terms to concepts
• So: identify clusters and pick representative terms

Singular Value Decomposition
• One approach to this is called Singular Value Decomposition.
  – We have a term space of thousands of dimensions, with each document a vector in that space.
  – We want to project or map those dimensions onto a smaller number of dimensions in such a way that relative distance among vectors is preserved as much as possible.
• We end up with a much smaller number of dimensions, and a vector for each document of its value for those dimensions
• For a detailed explanation:

Dimension Reduction
• For an n (words) × m (documents) matrix M
• Finds the least-squares best U (n × k)
• Rows of U map input features (words) to encoded features (concept clusters)
• Closely related to
  – symmetric eigenvalue decomposition
  – factor analysis
  – principal component analysis
• Available as a subroutine in many math packages.
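For reference, the standard form of the rank-k truncated SVD that this slide summarizes can be written out as below. The symbols Σ_k, V_k, and σ_i do not appear on the slide and are introduced here only for illustration; U_k is the slide's U.

```latex
\[
M \;\approx\; M_k \;=\; U_k \,\Sigma_k\, V_k^{\top},
\qquad
U_k \in \mathbb{R}^{n \times k},\quad
\Sigma_k = \operatorname{diag}(\sigma_1,\dots,\sigma_k),\quad
V_k \in \mathbb{R}^{m \times k},
\qquad
\sigma_1 \ge \dots \ge \sigma_k \ge 0
\]
\[
M_k \;=\; \operatorname*{arg\,min}_{\operatorname{rank}(X)\,\le\, k} \;\lVert M - X \rVert_F
\]
```

The second line is the "least-squares best" property: among all matrices of rank at most k, M_k is closest to M in the sum-of-squared-differences sense. Row i of U_k gives word i's coordinates in the k concept dimensions, and row j of V_k Σ_k gives document j's coordinates in the same reduced space.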

LSI/LSA
• Latent semantic indexing is the application of SVD to IR.
• Latent semantic analysis is the more general term.
• Features are words, examples are text passages.
• Latent: not visible on the surface
• Semantic: word meanings

Geometric View
Words embedded in high-dimensional space.
[Figure; labels: exam, test, fish]

Comparison to VSM
A: The feline climbed upon the roof
B: A cat leapt onto a house
C: The final will be on a Thursday
How similar?
• Vector space model: sim(A,B) = 0
• LSI: sim(A,B) = .49 > sim(A,C) = .45
Non-zero similarity with no words in common, through overlap in the reduced representation.
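The effect can be reproduced on a toy scale. The sketch below is not the computation behind the .49 and .45 figures, which depend on a training corpus that is not given here. It uses a hypothetical six-document corpus in which documents 0 and 1 share no terms but are linked through documents 2 and 3. In place of a library SVD routine it runs power iteration with deflation on the document-by-document matrix MᵀM, whose eigenvectors are the right singular vectors of M; that substitution is a choice made here to keep the example dependency-free.

```python
import math

# Hypothetical toy corpus; each document is reduced to a set of terms.
docs = [
    {"feline", "roof"},     # 0: stands in for sentence A
    {"cat", "house"},       # 1: stands in for sentence B (no term shared with 0)
    {"feline", "cat"},      # 2: links "feline" and "cat"
    {"roof", "house"},      # 3: links "roof" and "house"
    {"final", "thursday"},  # 4: stands in for sentence C
    {"final", "exam"},      # 5
]
n = len(docs)

# gram[i][j] = number of shared terms = entry (i, j) of M^T M for a binary matrix M.
gram = [[float(len(a & b)) for b in docs] for a in docs]

def raw_sim(i, j):
    """Plain vector-space cosine between documents i and j."""
    return gram[i][j] / math.sqrt(gram[i][i] * gram[j][j])

def top_eigenpairs(matrix, k, iterations=500):
    """Top-k eigenpairs of a symmetric PSD matrix: power iteration plus deflation."""
    size = len(matrix)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [float(i + 1) for i in range(size)]  # fixed, non-uniform start vector
        for _step in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(size)) for i in range(size)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        value = sum(v[i] * work[i][j] * v[j] for i in range(size) for j in range(size))
        pairs.append((value, v))
        for i in range(size):  # deflate: remove the component just found
            for j in range(size):
                work[i][j] -= value * v[i] * v[j]
    return pairs

pairs = top_eigenpairs(gram, k=2)
# Document j in the reduced space: (sigma_1 * v_1[j], sigma_2 * v_2[j]).
coords = [[math.sqrt(max(value, 0.0)) * vec[j] for value, vec in pairs] for j in range(n)]

def lsa_sim(i, j):
    """Cosine between documents i and j in the k-dimensional reduced space."""
    u, v = coords[i], coords[j]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(raw_sim(0, 1))  # 0.0: no words in common
print(lsa_sim(0, 1))  # close to 1 in the reduced space
print(lsa_sim(0, 4))  # close to 0: different "concept"
```

With only two dimensions and such a clean corpus the separation is much sharper than on real text, where similarities like .49 versus .45 are more typical.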

What Does LSI Do?
Let’s send it to school…

Plato’s Problem
7th grader learns new words today, fewer than 1 by direct instruction. Perhaps 3 were even encountered.
How can this be?
Plato: You already knew them.
LSA: Many weak relationships combined (data to back it up!)
Rate comparable to students.

Vocabulary
TOEFL synonym test (best solo measure of intelligence)
Choose the alternative with the highest similarity score.
LSA correct on 64% of 80 items.
Matches the average applicant to a US college.
Mistakes correlate with people's (r = .44).

Multiple Choice Exam
Trained on a psychology textbook.
Given the same test as students.
LSA scores 60%: lower than average, but passes.
Has trouble with “hard” ones.

Essay Test
LSA can’t write. If you can’t do, judge.
Students write essays; LSA is trained on related text.
Compare similarity and length with graded (labeled) essays.
Cosine-weighted average of the top 10.
Regression to combine similarity and length.
Correlation: better than human.
Bag of words!?
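One plausible reading of "cosine-weighted average of the top 10" is sketched below: take the graded essays most similar to the new essay and average their grades, weighting each by its cosine similarity. This is an interpretation, not a confirmed description of the published method; the function name and sample numbers are hypothetical, and the regression step that combines similarity with length is not shown.

```python
def predict_grade(scored_essays, k=10):
    """Estimate a grade from (cosine_similarity, grade) pairs for graded essays.

    Uses the k most similar graded essays and weights each grade by its
    similarity. Assumes the top-k similarities are positive.
    """
    top = sorted(scored_essays, key=lambda pair: pair[0], reverse=True)[:k]
    total_weight = sum(sim for sim, _ in top)
    if total_weight <= 0:
        raise ValueError("need at least one graded essay with positive similarity")
    return sum(sim * grade for sim, grade in top) / total_weight

# Hypothetical similarities of a new essay to three graded essays.
graded = [(0.9, 4.0), (0.1, 2.0), (0.5, 3.0)]
print(predict_grade(graded, k=2))  # uses only the two most similar essays
```

The similarities themselves would come from comparing essays in the reduced LSA space rather than in raw term space.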

Digit Representations
Look at similarities of all pairs from one to nine.
Look at the best fit of these similarities in one dimension: they come out in order!
Similar experiments with cities in Europe in two dimensions.

Word Sense
The chemistry student knew this was not a good time to forget how to calculate volume and mass.
heavy? .21
church? .14
LSI picks best, p < .001

LSApplications
• Improve IR.
• Cross-language IR. Train on parallel collection.
• Measure text coherency.
• Use essays to pick educational text.
• Grade essays.
• Visualize word clusters
Demos at

LSI Background Reading
Landauer, Laham, Foltz (1998). Learning human-like knowledge by Singular Value Decomposition: A Progress Report. Advances in Neural Information Processing Systems 10, (pp )