©2003 Paula Matuszek CSC 9010: Text Mining Applications Document-Level Techniques Dr. Paula Matuszek (610) 270-6851.

Dealing with Documents
• Sometimes our information need is not for something specific which we can capture in a clearcut knowledge model:
  – What is the current research in secure networks?
  – What are our competitors working on?
  – Who should review this paper?
• These kinds of questions are more typically answered by techniques which look at the entire document, or set of documents:
  – Categorizing
  – Clustering
  – Visualizing

Document Categorization
• Document categorization: assign documents to pre-defined categories
• Examples:
  – Process email into work, personal, junk
  – Process documents from a newsgroup into "interesting", "not interesting", "spam and flames"
  – Process transcripts of bugged phone calls into "relevant" and "irrelevant"
• Issues:
  – Real-time?
  – How many categories per document? Flat or hierarchical?
  – Categories defined automatically or by hand?

Document Categorization
• Usually:
  – relatively few categories
  – well defined; a person could do the task easily
  – categories don't change quickly
• Flat vs. hierarchy:
  – simple categorization is into mutually exclusive document collections
  – richer categorization is into a hierarchy with multiple inheritance:
    – broader and narrower categories
    – documents can go in more than one place
    – merges into a search engine with category browsers

Categorization -- Automatic
• Statistical approaches similar to a search engine
• A set of "training" documents defines the categories (see the sketch below):
  – underlying representation of a document is a bag of words / TF*IDF variant
  – category description is created using neural nets, regression trees, or other machine learning techniques
  – individual documents are categorized by the net, inferred rules, etc.
• Requires relatively little effort to create categories
• Accuracy is heavily dependent on "good" training examples
• Typically limited to flat, mutually exclusive categories
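A minimal sketch of this recipe in Python, assuming scikit-learn as the learning library (the slides do not name a tool); the training documents, category labels, and the choice of logistic regression are illustrative stand-ins:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hand-labeled "training" documents define the categories.
train_docs = [
    "meeting agenda for the quarterly budget review",
    "family reunion photos and weekend vacation plans",
    "win a free prize now, click this link today",
]
train_labels = ["work", "personal", "junk"]

# Bag-of-words TF*IDF representation feeding a simple learned classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_docs, train_labels)

# New documents are then categorized by the trained model.
print(model.predict(["please review the attached budget agenda"]))  # most likely ['work']
```

Any of the learners the slide mentions (neural nets, regression trees) could be swapped in for the regression step.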

Categorization: Manual
• Natural language / linguistic techniques
• Categories are defined by people:
  – underlying representation of a document is a stream of tokens
  – category description contains an ontology of terms and relations plus pattern-matching rules
  – individual documents are categorized by pattern matching
• Defining categories can be very time-consuming
• Typically takes some experimentation to "get it right"
• Can handle much more complex structures

Document Classification
• Document classification: cluster documents based on similarity
• Examples:
  – Group samples of writing in an attempt to determine author(s)
  – Look for "hot spots" in customer feedback
  – Find new trends in a document collection (outliers, hard to classify)
• Getting into areas where we don't know ahead of time what we will have; true "mining"

Document Classification -- How
• Typical process:
  – Describe each document
  – Assess similarities among documents
  – Establish a classification scheme which creates optimal "separation"
• One typical approach (sketched below):
  – document is represented as a term vector
  – cosine similarity for measuring association
  – bottom-up pairwise combining of documents to get clusters
• Assumes you have the corpus in hand
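A minimal sketch of that typical approach, assuming scikit-learn and SciPy; the documents and the average-linkage choice are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

docs = [
    "golf tournament leaderboard after the final round",
    "golf championship winner takes the final round",
    "stock market closes higher on earnings reports",
    "quarterly earnings push the stock market higher",
]

# Each document is represented as a term vector.
vectors = TfidfVectorizer().fit_transform(docs).toarray()

# Cosine distance (1 - cosine similarity) between every pair of documents.
distances = pdist(vectors, metric="cosine")

# Bottom-up pairwise combining of documents into clusters.
tree = linkage(distances, method="average")
print(fcluster(tree, t=2, criterion="maxclust"))  # e.g. [1 1 2 2]
```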

Document Clustering
• Approaches vary a great deal in:
  – document characteristics used to describe the document (linguistic or semantic? bag of words?)
  – methods used to define "similar"
  – methods used to create clusters
• Other relevant factors:
  – the number of clusters to extract is variable
  – often combined with visualization tools based on similarity and/or clusters
  – sometimes important that the approach be incremental
• A useful approach when you don't have a handle on the domain or it's changing

Document Visualization
• Visualization: visually display relationships among documents
• Examples:
  – hyperbolic viewer based on document similarity; browse a field of scientific documents
  – "map"-based techniques showing peaks, valleys, outliers
  – graphs showing relationships between companies and research areas
• Highly interactive, intended to aid a human in finding interrelationships and new knowledge in the document set

Latent Semantic Analysis
• The bag-of-words methods we have looked at ignore syntax: a document is "about" the words in it
• People interpret documents in a richer context:
  – a document is about some domain
  – reflected in the vocabulary
  – but not limited to it

Match Topic and Phrase
Topics:
• Astronomy
• Automobiles
• Biology
Phrases:
• I saw Pathfinder on Mars with a telescope.
• The Pathfinder photograph mars our perception of a lifeless planet.
• The Pathfinder photograph from Ford has arrived.
• When a Pathfinder fords a river it sometimes mars its paint job.

Domain-Based Processing
• This task is relatively easy because we know a lot about all of the domains, and can disambiguate using that knowledge.
• It's not completely trivial: the biology choice could also have been astronomy.
• Information extraction systems like GATE and AeroText model the domain knowledge explicitly, but this takes a lot of effort.
• Is there an easier way?

Word Co-Occurrences
• Bag-of-words approaches assume meaning is carried by vocabulary and ignore syntax.
• Domain modeling approaches capture detailed knowledge about the meaning.
• An intermediate position is to look at vocabulary groups: what words tend to occur together?
• Still a statistical approach, but a richer representation than single terms.

Examples of What We Would Like
• Looking for articles about Tiger Woods in an AP newswire database brings up stories about the golfer, followed by articles about golf tournaments that don't mention his name.
• Constraining the search to days when no articles were written about Tiger Woods still brings up stories about golf tournaments and well-known players.
• So we are recognizing that "Tiger Woods" is about golf.
(javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm)

Example
• Tiger Woods takes some drama out of cut streak with opening round at Funai.
• Every player on the money list is at Disney trying to make it to the Tour Championship. Tiger Woods has no such worries.
• Going into this week's event, Tiger Woods has made the cut in 113 successive events. He tied the PGA Tour's consecutive cut record two weeks ago at the Funai Classic in Orlando, Florida, while Cink finished second.
• Stewart Cink finished second at the Funai Classic at Walt Disney World.

Example, Cont.
• "Woods" tended to occur in the same articles as "Funai".
• "Cink" also tended to occur in articles about Funai.
• So there is a relationship between Woods and Cink which is stronger than is indicated just by the one article in which they are both mentioned.
• It has to do with cuts, Funai, and the championship tour.
• So by creating a term-document matrix and examining it (see the sketch below), we can find potential relationships which are latent, or hidden. They are tied together by the meaning, or semantics, of the terms.
• This is the basic concept of Latent Semantic Analysis and Latent Semantic Indexing.
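A minimal sketch of building and inspecting such a term-document matrix, assuming scikit-learn; the three "articles" below paraphrase the example snippets:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Tiger Woods takes some drama out of his cut streak at the Funai",
    "Tiger Woods has made the cut in successive events at the Funai Classic",
    "Stewart Cink finished second at the Funai Classic at Walt Disney World",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)              # term-document matrix (docs x terms)
terms = list(vectorizer.get_feature_names_out())

# Term-term co-occurrence: how often two terms appear in the same documents.
cooc = (X.T @ X).toarray()
print(cooc[terms.index("woods"), terms.index("funai")])  # woods and funai co-occur
print(cooc[terms.index("cink"), terms.index("funai")])   # cink and funai co-occur
print(cooc[terms.index("woods"), terms.index("cink")])   # 0: the link is only latent
```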

Problem: Very High Dimensionality
• A vector of TF*IDF values representing a document is high dimensional.
• If we start looking at a matrix of terms by documents, it gets even worse.
• We need some way to trim the words we look at:
  – First, throw away anything "not useful"
  – Second, identify clusters and pick representative terms

Throw Away
• Most domain semantics are carried by nouns, adjectives, verbs, and adverbs:
  – throw away prepositions, articles, conjunctions, pronouns
• Very frequent words don't add to domain semantics:
  – throw away common verbs (go, be, see), adjectives (big, good, bad), adverbs (very)
  – throw away words which appear in most documents
• Very infrequent words don't either:
  – throw away terms which appear in only one document (see the sketch below)
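These pruning rules map directly onto common vectorizer options. A minimal sketch, assuming scikit-learn; the thresholds and the tiny corpus are illustrative choices, not values from the slides:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the telescope photographed mars from the observatory",
    "the observatory tracked mars with a new telescope",
    "a very good telescope for viewing the planet",
]

vectorizer = TfidfVectorizer(
    stop_words="english",  # drop articles, prepositions, pronouns, common words
    max_df=0.9,            # drop terms that appear in (nearly) all documents
    min_df=2,              # drop terms that appear in only one document
)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # only the surviving terms remain
```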

What's Left
• A condensed matrix where we can assume that most terms are meaningful:
  – It's still very large, and very sparse.
  – It is the basic index table for a keyword search tool.
• Where can we go now?
  – We have fewer concepts than terms
  – So move from terms to concepts
• So: identify clusters and pick representative terms

Singular Value Decomposition
• One approach to this is called Singular Value Decomposition (SVD):
  – We have a term space of thousands of dimensions, with each document a vector in that space.
  – We want to project or map those dimensions onto a smaller number of dimensions in such a way that relative distance among vectors is preserved as much as possible.
• We end up with a much smaller number of dimensions, and a vector for each document of its value on those dimensions.
• For a detailed explanation:

Dimension Reduction
• For an n (words) x m (documents) matrix M, SVD finds the least-squares best U (n x k).
• Rows of U map input features (words) to encoded features (concept clusters).
• Closely related to:
  – symmetric eigenvalue decomposition
  – factor analysis
  – principal component analysis
• Available as a subroutine in many math packages (sketched below with NumPy).
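A minimal sketch of the reduction with NumPy's SVD routine; the random matrix M and the choice of k are placeholders for a real term-document matrix and a tuned number of concept dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((50, 20))   # n=50 "words" x m=20 "documents"
k = 5                      # number of concept dimensions to keep

U, s, Vt = np.linalg.svd(M, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]   # least-squares best rank-k factors

# Rows of U_k map each word onto the k concept dimensions;
# the documents' coordinates in concept space come from s_k and Vt_k.
doc_vectors = (np.diag(s_k) @ Vt_k).T         # shape (20, k): one row per document
print(doc_vectors.shape)
```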

LSI / LSA
• Latent semantic indexing is the application of SVD to IR.
• Latent semantic analysis is the more general term.
• Features are words; examples are text passages.
• Latent: not visible on the surface.
• Semantic: word meanings.

Geometric View
• Words are embedded in a high-dimensional space.
• (Figure: the words "exam", "test", and "fish" plotted as points in that space.)

Comparison to VSM
A: The feline climbed upon the roof
B: A cat leapt onto a house
C: The final will be on a Thursday
How similar?
• Vector space model: sim(A,B) = 0
• LSI: sim(A,B) = .49 > sim(A,C) = .45
• Non-zero similarity with no words in common, through overlap in the reduced representation (see the sketch below).
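A minimal sketch of the comparison, assuming scikit-learn. The .49 and .45 figures above come from an LSI space trained on a large corpus; with the tiny invented background corpus below the exact numbers will differ, but the mechanics are the same:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

A = "The feline climbed upon the roof"
B = "A cat leapt onto a house"
C = "The final will be on a Thursday"

# Small background corpus tying "cat"/"feline" and "roof"/"house" together.
background = [
    "the cat is a small feline",
    "a cat or other feline will often climb",
    "the roof of the house needs repair",
    "the final exam will be on a Thursday",
]

vectorizer = TfidfVectorizer().fit(background + [A, B, C])
X_all = vectorizer.transform(background + [A, B, C])
X = vectorizer.transform([A, B, C])

print("VSM sim(A,B):", cosine_similarity(X[0], X[1])[0, 0])     # 0.0: no shared terms

# Project everything onto a few latent dimensions (LSI) and compare again.
lsi = TruncatedSVD(n_components=3).fit(X_all)
Z = lsi.transform(X)
print("LSI sim(A,B):", cosine_similarity(Z[0:1], Z[1:2])[0, 0])  # non-zero
print("LSI sim(A,C):", cosine_similarity(Z[0:1], Z[2:3])[0, 0])
```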

©2003 Paula Matuszek What Does LSI Do? Let’s send it to school…

Plato's Problem
• A 7th grader learns new words today, fewer than 1 of them by direct instruction. Perhaps 3 were even encountered.
• How can this be?
• Plato: you already knew them.
• LSA: many weak relationships combined (with data to back it up!). The learning rate is comparable to students'.

Vocabulary
• TOEFL synonym test: choose the alternative with the highest similarity score (sketched below).
• LSA was correct on 64% of 80 items, matching the average applicant to a US college.
• Its mistakes correlate with people's (r = .44).
• Vocabulary: the best solo measure of intelligence.
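A minimal sketch of the synonym-test procedure: given a vector per word from a trained LSA space, pick the alternative closest to the stem word. The test item and the word vectors below are invented stand-ins, not real LSA output:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical word vectors, standing in for rows of the reduced LSA matrix.
vectors = {
    "enormous": np.array([0.9, 0.1, 0.2]),
    "huge":     np.array([0.8, 0.2, 0.1]),
    "tiny":     np.array([0.1, 0.9, 0.3]),
    "rapid":    np.array([0.2, 0.3, 0.9]),
    "unique":   np.array([0.4, 0.5, 0.5]),
}

def answer(stem, alternatives):
    # Choose the alternative with the highest similarity to the stem word.
    return max(alternatives, key=lambda w: cosine(vectors[stem], vectors[w]))

print(answer("enormous", ["huge", "tiny", "rapid", "unique"]))  # -> huge
```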

Multiple-Choice Exam
• Trained on a psych textbook; given the same test as the students.
• LSA scored about 60%: lower than the class average, but a passing grade.
• It has trouble with the "hard" questions.

Essay Test
• LSA can't write. If you can't do, judge.
• Students write essays; LSA is trained on related text.
• Compare similarity and length with graded (labeled) essays:
  – cosine-weighted average of the top 10 most similar graded essays
  – regression to combine similarity and length (sketched below)
• Correlation: better than human. Bag of words!?
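A minimal sketch of that scoring recipe; the vectors and grades are random stand-ins for essays already projected into a trained LSA space:

```python
import numpy as np

rng = np.random.default_rng(0)
graded_vecs = rng.random((50, 20))    # 50 graded essays in a 20-d LSA space
grades = rng.integers(1, 7, size=50)  # their human-assigned grades (1-6)
new_vec = rng.random(20)              # the essay to be scored

# Cosine similarity of the new essay to every graded essay.
sims = graded_vecs @ new_vec / (
    np.linalg.norm(graded_vecs, axis=1) * np.linalg.norm(new_vec))

# Cosine-weighted average grade of the 10 most similar graded essays.
top = np.argsort(sims)[-10:]
sim_score = np.average(grades[top], weights=sims[top])
print(round(float(sim_score), 2))

# The slides then combine this similarity score with essay length in a simple
# regression, e.g. LinearRegression().fit(np.column_stack([sim_scores, lengths]), grades).
```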

Digit Representations
• Look at the similarities of all pairs from "one" to "nine".
• Find the best fit of these similarities in one dimension: they come out in order!
• Similar experiments with cities in Europe, in two dimensions.

Word Sense
• "The chemistry student knew this was not a good time to forget how to calculate volume and mass."
• heavy? .21    church? .14
• LSI picks the best sense (p < .001).

LSApplications
• Improve IR.
• Cross-language IR: train on a parallel collection.
• Measure text coherency.
• Use essays to pick educational text.
• Grade essays.
• Visualize word clusters.
• Demos at:

LSI Background Reading
Landauer, T. K., Laham, D., & Foltz, P. W. (1998). Learning human-like knowledge by Singular Value Decomposition: A progress report. Advances in Neural Information Processing Systems 10.