Text Mining (Principles of Data Mining, published by Springer-Verlag, 2007)
Representing Text Documents for Data Mining
The bag-of-words representation. Example:
(1) Tony likes to watch movies. John likes movies too.
(2) Tony also likes to watch football games.
Based on these two text documents, a list is constructed as follows:
{"Tony", "likes", "to", "watch", "movies", "also", "football", "games", "John", "too"}
The dictionary of terms can be a local dictionary (constructed separately for each category of document) or a global dictionary (constructed from all the documents together). A minimal sketch follows below.
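As a concrete illustration, here is a minimal bag-of-words sketch in Python; the tokeniser and variable names are our own choices, not prescribed by the book:

```python
from collections import Counter

docs = [
    "Tony likes to watch movies. John likes movies too.",
    "Tony also likes to watch football games.",
]

def tokenize(text):
    # Lower-case and strip full stops before splitting on whitespace.
    return text.lower().replace(".", "").split()

# Global dictionary: the union of terms over all documents.
vocabulary = sorted({term for doc in docs for term in tokenize(doc)})

# Each document becomes a vector of term counts over the global dictionary.
for doc in docs:
    counts = Counter(tokenize(doc))
    print([counts[term] for term in vocabulary])
```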
Stop Words and Stemming
Stop words: common words that are likely to be useless for classification, e.g. 'a', 'an', 'the', 'is', 'I', 'you' and 'of'.
Stemming: based on the observation that words in documents often have many morphological variants, e.g. computing, computer, computation, computational etc. might all be stemmed to 'comput'.
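A small sketch of both steps, assuming NLTK's PorterStemmer is available (NLTK is one common choice, not mandated by the book; the stop word list here is just the sample above):

```python
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "is", "i", "you", "of"}
stemmer = PorterStemmer()

tokens = ["computing", "the", "computation", "of", "computational", "models"]

# Drop stop words, then reduce each remaining word to its stem.
processed = [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]
print(processed)  # ['comput', 'comput', 'comput', 'model']
```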
Representing Text Documents: Constructing a Vector Space Model
If the total number of features is N, write the terms in some arbitrary order as t_1, t_2, ..., t_N. The ith document is then an ordered set of N values, which we call an N-dimensional vector and write as (X_i1, X_i2, ..., X_iN). The complete set of vectors for all documents is called a vector space model (VSM). In general, the value X_ij is a weight measuring the importance of the jth term t_j in the ith document.
Representing Text Documents: Constructing a Vector Space Model
Term Frequency Inverse Document Frequency (TFIDF): the TFIDF value of a weight X_ij is calculated as the product of two values, corresponding to the term frequency and the inverse document frequency respectively. The inverse document frequency of term t_j is log2(n/n_j), where n_j is the number of documents containing term t_j and n is the total number of documents.
Example: if a term occurs in every document, its inverse document frequency value is log2(1) = 0. If it occurs in only one document out of every 16, its inverse document frequency value is log2(16/1) = 4.
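A minimal sketch (our own example documents and names) that builds the term-frequency matrix X_ij over a global dictionary and then applies the TFIDF weighting:

```python
import math

# tf[i][j] is the frequency of term t_j in document i; n is the number
# of documents; n_j is the number of documents containing term t_j.
docs = [
    ["tony", "likes", "to", "watch", "movies", "john", "likes", "movies", "too"],
    ["tony", "also", "likes", "to", "watch", "football", "games"],
]
terms = sorted({t for d in docs for t in d})        # t_1 ... t_N
tf = [[d.count(t) for t in terms] for d in docs]    # term frequencies

n = len(docs)
idf = [math.log2(n / sum(1 for row in tf if row[j] > 0))
       for j in range(len(terms))]                  # 0 for terms in every document

tfidf = [[tf[i][j] * idf[j] for j in range(len(terms))] for i in range(n)]
print(tfidf)
```

Note that terms appearing in both documents ("tony", "likes", "to", "watch") receive weight 0, which is exactly the intended effect: terms that occur everywhere carry no discriminating power.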
Normalising the Weights
Suppose the weights used are just the term frequency values.
The length of (0, 3, 0, 4, 0, 0) is √(3² + 4²) = 5, so the normalised vector is (0, 3/5, 0, 4/5, 0, 0), which has length 1.
The length of (0, 6, 0, 8, 0, 0) is √(6² + 8²) = 10, so the normalised vector is (0, 6/10, 0, 8/10, 0, 0) = (0, 3/5, 0, 4/5, 0, 0).
The length of (0, 30, 0, 40, 0, 0) is √(30² + 40²) = 50, so the normalised vector is (0, 30/50, 0, 40/50, 0, 0) = (0, 3/5, 0, 4/5, 0, 0).
In normalised form all three vectors are the same, as they should be.
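A sketch of the normalisation step (the helper name is our own), reproducing the worked example:

```python
import math

def normalise(vec):
    # Divide each component by the Euclidean length of the vector,
    # yielding a vector of length 1.
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

print(normalise([0, 3, 0, 4, 0, 0]))    # [0.0, 0.6, 0.0, 0.8, 0.0, 0.0]
print(normalise([0, 6, 0, 8, 0, 0]))    # the same unit vector
print(normalise([0, 30, 0, 40, 0, 0]))  # the same unit vector again
```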
Measuring the Distance Between Two Vectors
We define the dot product of two unit vectors of the same dimension to be the sum of the products of the corresponding pairs of values.
Example: taking the two unnormalised vectors (6, 4, 0, 2, 1) and (5, 7, 6, 0, 2), normalising them to unit length converts the values to (0.79, 0.53, 0, 0.26, 0.13) and (0.47, 0.66, 0.56, 0, 0.19). The dot product is then 0.79 × 0.47 + 0.53 × 0.66 + 0 × 0.56 + 0.26 × 0 + 0.13 × 0.19 = 0.74 approximately, so a measure of the distance between the two vectors is 1 − 0.74 = 0.26.
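A sketch reproducing this calculation (helper names are our own):

```python
import math

def unit(vec):
    # Normalise a vector to unit length.
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def dot(u, v):
    # Sum of products of corresponding pairs of values.
    return sum(a * b for a, b in zip(u, v))

u = unit([6, 4, 0, 2, 1])
v = unit([5, 7, 6, 0, 2])
similarity = dot(u, v)    # approximately 0.74
print(1 - similarity)     # distance, approximately 0.26
```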
Measuring the Performance of a Text Classifier
For a given category C_k the predictions form a confusion matrix:

                 Predicted C_k   Predicted not C_k
Actual C_k            a                 c
Actual not C_k        b                 d

Predictive accuracy: (a + d)/(a + b + c + d)
Recall: a/(a + c)
Precision: a/(a + b)
The F1 score is the product of Precision and Recall divided by their average, which gives the formula F1 = 2 × Precision × Recall/(Precision + Recall).
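A sketch computing the four measures from the confusion matrix counts (the counts passed in at the bottom are made up purely for illustration):

```python
def classifier_metrics(a, b, c, d):
    # a, b, c, d are the four cells of the confusion matrix above.
    accuracy = (a + d) / (a + b + c + d)
    recall = a / (a + c)
    precision = a / (a + b)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

print(classifier_metrics(a=40, b=10, c=5, d=45))
# (0.85, 0.888..., 0.8, 0.842...)
```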
Hypertext Categorisation The automatic classification of web pages, i.e. HTML files, is usually known as Hypertext Categorisation (or Hypertext Classification).
Hypertext Classification versus Text Classification
Text classification: the words that the human reader sees are very similar to the data provided to the classification program.