Principles of Data Mining (Springer-Verlag, 2007)


Text Mining

Representing Text Documents for Data Mining

The bag-of-words representation. Example:

(1) Tony likes to watch movies. John likes movies too.
(2) Tony also likes to watch football games.

Based on these two text documents, a list of terms is constructed as follows:

{"Tony", "likes", "to", "watch", "movies", "also", "football", "games", "John", "too"}

A local dictionary contains only the terms that occur in one particular document; a global dictionary, such as the list above, contains all the terms that occur anywhere in the corpus.
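To make the construction concrete, here is a minimal Python sketch of building a global dictionary from the two example documents; the tokenize helper and the variable names are our own, not from the book.

```python
import re

# The two example documents from the slide above.
docs = [
    "Tony likes to watch movies. John likes movies too.",
    "Tony also likes to watch football games.",
]

def tokenize(text):
    # Lowercase the text and split on runs of non-letter characters.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

# Global dictionary: every distinct term occurring anywhere in the
# corpus, in order of first appearance.
global_dict = []
for doc in docs:
    for term in tokenize(doc):
        if term not in global_dict:
            global_dict.append(term)

print(global_dict)
# ['tony', 'likes', 'to', 'watch', 'movies', 'john', 'too',
#  'also', 'football', 'games']
```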

Stop Words and Stemming

Stop words: common words that are likely to be useless for classification, e.g. 'a', 'an', 'the', 'is', 'I', 'you' and 'of'.

Stemming: based on the observation that words in documents often have many morphological variants, e.g. computing, computer, computation, computational etc. might all be stemmed to 'comput'.
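A short sketch of both steps, assuming the NLTK package is available for Porter stemming; the stop-word list here is a small hand-picked example, not the book's.

```python
from nltk.stem.porter import PorterStemmer

# Illustrative stop-word list (a real system would use a larger one).
STOP_WORDS = {"a", "an", "the", "is", "i", "you", "of", "to", "too"}
stemmer = PorterStemmer()

def preprocess(tokens):
    # Drop stop words, then reduce each remaining token to its stem.
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess(["computing", "computer", "computation", "computational"]))
# All four morphological variants map to the same stem:
# ['comput', 'comput', 'comput', 'comput']
```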

Representing Text Documents: Constructing a Vector Space Model

Suppose the total number of features is N, and write the terms in some arbitrary order as t_1, t_2, ..., t_N. We can then represent the ith document as an ordered set of N values, which we will call an N-dimensional vector and write as (X_i1, X_i2, ..., X_iN). The complete set of vectors for all documents is called a vector space model (VSM). In general, the value X_ij is a weight measuring the importance of the jth term t_j in the ith document.
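A minimal sketch of turning the tokenised documents into an N-dimensional vector space model of raw term counts, continuing the example above; the to_vector helper is our own name.

```python
# Tokenised versions of the two example documents.
docs_tokens = [
    ["tony", "likes", "to", "watch", "movies", "john", "likes", "movies", "too"],
    ["tony", "also", "likes", "to", "watch", "football", "games"],
]
global_dict = ["tony", "likes", "to", "watch", "movies",
               "john", "too", "also", "football", "games"]

def to_vector(tokens, dictionary):
    # X_ij = the number of occurrences of term t_j in document i.
    return [tokens.count(term) for term in dictionary]

vsm = [to_vector(tokens, global_dict) for tokens in docs_tokens]
for row in vsm:
    print(row)
# [1, 2, 1, 1, 2, 1, 1, 0, 0, 0]
# [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```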

Representing Text Documents: Constructing a Vector Space Model

Term Frequency Inverse Document Frequency (TFIDF): the TFIDF value of a weight X_ij is calculated as the product of two values, corresponding to the term frequency and the inverse document frequency respectively. The term frequency is simply the number of times term t_j occurs in the ith document. The inverse document frequency is log2(n/n_j), where n_j is the number of documents containing term t_j and n is the total number of documents.

Example: if a term occurs in every document, its inverse document frequency value is log2(n/n) = 0. If it occurs in only one document out of every 16, its inverse document frequency value is log2(16/1) = 4.
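A sketch of the TFIDF weighting just defined, applied to the two count vectors from the previous slide; the function and variable names are our own.

```python
from math import log2

def tfidf_matrix(counts):
    # counts: one raw term-count vector per document.
    n = len(counts)                       # total number of documents
    n_terms = len(counts[0])
    # df[j] is n_j: the number of documents containing term t_j.
    df = [sum(1 for doc in counts if doc[j] > 0) for j in range(n_terms)]
    # Weight X_ij = term frequency * log2(n / n_j).
    return [[tf * log2(n / df[j]) if df[j] else 0.0
             for j, tf in enumerate(doc)]
            for doc in counts]

counts = [[1, 2, 1, 1, 2, 1, 1, 0, 0, 0],
          [1, 1, 1, 1, 0, 0, 0, 1, 1, 1]]
for row in tfidf_matrix(counts):
    print(row)
# Terms occurring in both documents get weight tf * log2(2/2) = 0;
# terms unique to one document get weight tf * log2(2/1) = tf.
```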

Normalising the Weights

Suppose the weights used are just the term frequency values.

The length of (0, 3, 0, 4, 0, 0) is √(3^2 + 4^2) = 5, so the normalised vector is (0, 3/5, 0, 4/5, 0, 0), which has length 1.
The length of (0, 6, 0, 8, 0, 0) is √(6^2 + 8^2) = 10, so the normalised vector is (0, 6/10, 0, 8/10, 0, 0) = (0, 3/5, 0, 4/5, 0, 0).
The length of (0, 30, 0, 40, 0, 0) is √(30^2 + 40^2) = 50, so the normalised vector is (0, 30/50, 0, 40/50, 0, 0) = (0, 3/5, 0, 4/5, 0, 0).

In normalised form all three vectors are the same, as they should be.
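A one-function sketch of the normalisation step; normalise is our own name.

```python
from math import sqrt

def normalise(vec):
    # Divide each component by the vector's Euclidean length,
    # so the resulting vector has length 1.
    length = sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

for v in ([0, 3, 0, 4, 0, 0], [0, 6, 0, 8, 0, 0], [0, 30, 0, 40, 0, 0]):
    print(normalise(v))
# All three print [0.0, 0.6, 0.0, 0.8, 0.0, 0.0]
```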

Measuring the Distance Between Two Vectors

We define the dot product of two unit vectors of the same dimension to be the sum of the products of the corresponding pairs of values.

Example: if we take the two unnormalised vectors (6, 4, 0, 2, 1) and (5, 7, 6, 0, 2), normalising them to unit length converts the values to (0.79, 0.53, 0, 0.26, 0.13) and (0.47, 0.66, 0.56, 0, 0.19). The dot product is then 0.79 × 0.47 + 0.53 × 0.66 + 0 × 0.56 + 0.26 × 0 + 0.13 × 0.19 = 0.74 approximately, so a measure of the distance between the two vectors is 1 − 0.74 = 0.26.
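A sketch reproducing the worked example: normalise both vectors to unit length, take the dot product, and subtract from 1; the helper names are our own.

```python
from math import sqrt

def normalise(vec):
    length = sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def distance(a, b):
    # 1 minus the dot product of the two unit vectors.
    ua, ub = normalise(a), normalise(b)
    return 1 - sum(x * y for x, y in zip(ua, ub))

print(round(distance([6, 4, 0, 2, 1], [5, 7, 6, 0, 2]), 2))
# 0.26
```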

Measuring the Performance of a Text Classifier

The measures below are defined from the confusion matrix for class C_k:

                        Predicted C_k    Predicted not C_k
Actual class C_k              a                  c
Actual class not C_k          b                  d

Predictive accuracy: (a + d)/(a + b + c + d)
Recall: a/(a + c)
Precision: a/(a + b)
F1 Score: the product of Precision and Recall divided by their average, i.e. F1 = 2 × Precision × Recall/(Precision + Recall).
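A sketch computing the four measures from the confusion-matrix counts a, b, c and d; the example counts are invented for illustration.

```python
def classifier_metrics(a, b, c, d):
    accuracy = (a + d) / (a + b + c + d)
    recall = a / (a + c)
    precision = a / (a + b)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

# Hypothetical counts: a=80 true positives, b=10 false positives,
# c=20 false negatives, d=90 true negatives.
acc, rec, prec, f1 = classifier_metrics(a=80, b=10, c=20, d=90)
print(f"accuracy={acc:.2f} recall={rec:.2f} precision={prec:.2f} f1={f1:.2f}")
# accuracy=0.85 recall=0.80 precision=0.89 f1=0.84
```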

Hypertext Categorisation The automatic classification of web pages, i.e. HTML files, is usually known as Hypertext Categorisation (or Hypertext Classification).

Hypertext Classification versus Text Classification

Text Classification: in text classification the words that the human reader sees are very similar to the data provided to the classification program.

Hypertext Classification versus Text Classification

Hypertext Classification: in hypertext classification, by contrast, much of what is provided to the classification program, such as HTML markup, metadata and links, is not the text the human reader sees on the rendered page, so the two can differ considerably.