Text mining

The Standard Data Mining process

Text Mining
– Machine learning on text data
– Also called: text data mining, text analysis
– Part of Web mining
Typical tasks include:
– Text categorization (document classification)
– Text clustering
– Text summarization
– Opinion mining
– Entity/concept extraction
– Information retrieval: search engines
– Information extraction: question answering

Supervised learning algorithms
– Decision tree learning
– Naïve Bayes
– K-nearest neighbour
– Support Vector Machines
– Neural Networks
– Genetic algorithms

Supervised Machine Learning
1. Build or obtain a representative corpus
2. Label it
3. Define features
4. Represent the documents
5. Learn and analyse
6. Go back to step 3 until accuracy is acceptable
A good first set of features to test: stemmed words
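The loop above can be sketched with a minimal Naïve Bayes classifier over stemmed-word features. This is a toy illustration: the corpus, the labels, and the crude suffix-stripping stemmer are all invented for the example.

```python
import math
from collections import Counter, defaultdict

def stem(word):
    # Crude suffix stripping, e.g. {walker, walking} -> walk (illustration only)
    for suffix in ("ing", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def features(doc):
    # Step 3: features are stemmed words
    return [stem(w) for w in doc.lower().split()]

def train(labeled_docs):
    # Steps 4-5: represent documents as word counts per class, learn
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in labeled_docs:
        class_counts[label] += 1
        for w in features(doc):
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify(doc, model):
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in features(doc):
            # Laplace smoothing so unseen words do not zero out the score
            lp += math.log((word_counts[label][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

# Steps 1-2: a (tiny, invented) labelled corpus
corpus = [
    ("the walker was walking fast", "sport"),
    ("running and walking daily", "sport"),
    ("stock prices fell sharply", "finance"),
    ("the market price is rising", "finance"),
]
model = train(corpus)
print(classify("she was walking", model))    # -> sport
print(classify("prices are rising", model))  # -> finance
```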

Unsupervised Learning – Document clustering
– HAC
– K-means
– BIRCH
– …

Applying machine learning to text
Text representation (feature extraction):
– Preprocessing
– Indexing
– Weighting model
– Dimensionality reduction
Similarity measure: how to compare texts

Feature Extraction: Task (1)
Task: extract a good subset of words/phrases to represent documents.
Document collection → all unique words/phrases → feature extraction → all good words/phrases
(Some slides by Huaizhong Kou)

Feature Extraction: Task (2)
While more and more textual information is available online, effective retrieval is difficult without good indexing of text content.
While-more-and-textual-information-is-available-online-effective-retrieval-difficult-without-good-indexing-text-content
→ Feature extraction →
text-information-online-retrieval-index

Feature Extraction: Preprocessing and Indexing (1)
Pipeline over the training documents:
1. Identify all unique words (naive terms)
2. Remove stop words
   – non-informative words, e.g. {the, and, when, more}
3. Word stemming
   – removal of suffixes to generate word stems
   – groups related words, increasing relevance, e.g. {walker, walking} → walk
4. Term weighting: estimate the importance of each term in the document
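The preprocessing steps can be sketched as follows. The stop list is a fragment built around the slide's examples, and the suffix-stripping stemmer is deliberately crude; a real system would use a full stop list and a stemmer such as Porter's.

```python
# Stop list fragment; the slide's examples plus a few common function words
STOP_WORDS = {"the", "and", "when", "more", "is", "a", "of"}

def preprocess(text):
    # 1. Identify all unique words
    words = {w.lower().strip(".,") for w in text.split()}
    # 2. Remove stop words
    words -= STOP_WORDS
    # 3. Word stemming via crude suffix stripping (illustration only)
    stems = set()
    for w in words:
        for suffix in ("ing", "er", "ed", "s"):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                w = w[: -len(suffix)]
                break
        stems.add(w)
    return stems

print(sorted(preprocess("The walker is walking when more people walk")))
# -> ['people', 'walk']   ({walker, walking, walk} all collapse to "walk")
```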

Feature Extraction: Indexing (2)
The Vector Space Model (VSM) is one of the most commonly used text data models:
– Every text document is represented by a vector of terms.
– Terms are typically words and/or phrases.
– Every term in the vocabulary becomes an independent dimension.
– Each term that occurs in the document contributes a non-zero value in the corresponding dimension.
A document collection is then represented as a matrix X, where x_ji is the weight of the i-th term in the j-th document.
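A minimal sketch of building such a term-document matrix, using raw term counts as the weights x_ji (the two toy documents are invented for the example):

```python
def term_document_matrix(docs):
    # Every vocabulary term becomes one dimension (one column)
    vocab = sorted({w for d in docs for w in d.split()})
    # matrix[j][i] = weight (here: raw count) of the i-th term in the j-th document
    matrix = [[d.split().count(t) for t in vocab] for d in docs]
    return vocab, matrix

docs = ["apple banana apple", "banana cherry"]
vocab, X = term_document_matrix(docs)
print(vocab)  # ['apple', 'banana', 'cherry']
print(X)      # [[2, 1, 0], [0, 1, 1]]
```

Raw counts are only the simplest choice of weight; the weighting models on the following slides (tf, tf-idf, entropy) refine them.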

Feature Extraction: Weighting Model (1)
tf – Term Frequency weighting: w_ij = Freq_ij, where Freq_ij is the number of times the j-th term occurs in document D_i.
Drawback: does not reflect how well a term discriminates between documents.
Example: D1 = "ABRTSAQWA XAO", D2 = "RTABBAXA QSAK" (each letter is one term):

     A  B  K  O  Q  R  S  T  W  X
D1   4  1  0  1  1  1  1  1  1  1
D2   4  2  1  0  1  1  1  1  0  1

Feature Extraction: Weighting Model (2)
tf-idf, simple version: w_ij = Freq_ij × log(N / DocFreq_j), where
– N is the number of documents in the training collection;
– DocFreq_j is the number of documents in which the j-th term occurs.
Advantage: reflects how well a term discriminates between documents.
Assumption: terms with low DocFreq are better discriminators than terms with high DocFreq.
Document frequencies for the example terms: A:2, B:2, K:1, O:1, Q:2, R:2, S:2, T:2, W:1, X:2.

Feature Extraction: Weighting Model (3)
tf-idf weighting = TF × IDF. For the example documents (N = 2, twelve terms per document, using length-normalised tf):
– w(A, D1) = 4/12 × lg(2/2) = 0 (A occurs in both documents, so it does not discriminate)
– w(B, D1) = 1/12 × lg(2/1) ≈ 0.083

Feature Extraction: Weighting Model (4)
Entropy weighting: w_ij = log(Freq_ij + 1) × (1 + ē_i), where ē_i = (1 / log N) Σ_j p_ij log p_ij is the average entropy of the i-th term, and p_ij is the fraction of the i-th term's occurrences that fall in the j-th document.
– ē_i = −1 if the word occurs exactly once in every document
– ē_i = 0 if the word occurs in only one document
Ref: [11], [13], [22]
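The two boundary properties of the average entropy term can be checked directly. A sketch, assuming the standard log-entropy scheme where p_ij is the term's distribution over documents:

```python
import math

def avg_entropy(freqs):
    # e_i = (1 / lg N) * sum_j p_ij * lg(p_ij), with p_ij = Freq_ij / total occurrences
    n = len(freqs)          # N = number of documents
    total = sum(freqs)
    s = sum((f / total) * math.log2(f / total) for f in freqs if f > 0)
    return s / math.log2(n)

# A term occurring exactly once in every one of 4 documents -> -1
print(avg_entropy([1, 1, 1, 1]))  # -1.0
# A term occurring in only one of 4 documents -> 0 (a lone p = 1 contributes lg 1 = 0)
print(avg_entropy([5, 0, 0, 0]))  # 0.0
```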

Feature Extraction: Dimension Reduction
– Document frequency thresholding
– χ²-statistic
– Latent Semantic Indexing
– Information gain
– Mutual information

Dimension Reduction: DocFreq Thresholding
For the naive terms of the training documents D:
1. Calculate DocFreq(w) for every word w
2. Set a threshold θ
3. Remove all words with DocFreq < θ
The remaining words are the feature terms.
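A sketch of the thresholding step (the three toy documents are invented for the example):

```python
def docfreq_threshold(docs, theta):
    # Keep only words whose document frequency is at least theta
    vocab = {w for d in docs for w in d.split()}
    doc_freq = {w: sum(1 for d in docs if w in d.split()) for w in vocab}
    return {w for w in vocab if doc_freq[w] >= theta}

docs = ["apple banana", "apple cherry", "apple banana date"]
print(sorted(docfreq_threshold(docs, 2)))  # ['apple', 'banana']
```

Words appearing in very few documents carry little statistical evidence, so dropping them shrinks the vector space at small cost, though very rare terms can still matter for some retrieval tasks.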

Similarity measure
There are many ways to measure how similar two documents are, or how similar a document is to a query. The result depends heavily on the choice of terms used to represent the text documents.
– Euclidean distance (L2 norm)
– L1 norm
– Cosine similarity
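Minimal sketches of the three measures over term-count vectors (the two example vectors are invented for illustration):

```python
import math

def euclidean(u, v):
    # L2 norm of the difference vector
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def l1(u, v):
    # L1 norm of the difference vector
    return sum(abs(a - b) for a, b in zip(u, v))

def cosine(u, v):
    # Cosine of the angle between the two term vectors (1 = same direction)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d1, d2 = [2, 1, 0], [0, 1, 1]
print(euclidean(d1, d2))  # sqrt(4 + 0 + 1) ≈ 2.236
print(l1(d1, d2))         # 2 + 0 + 1 = 3
print(cosine(d1, d2))     # 1 / (sqrt(5) * sqrt(2)) ≈ 0.316
```

Cosine similarity is the usual choice for text, since it ignores document length and compares only the direction of the term vectors.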

Document Similarity Measures

Document Clustering: Algorithms
– k-means
– Hierarchical Agglomerative Clustering (HAC)
– BIRCH
– Association Rule Hypergraph Partitioning (ARHP)
– Categorical clustering (CACTUS, STIRR)
– Suffix Tree Clustering (STC)
– Query Directed Clustering (QDC)
– …
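A plain k-means sketch over document vectors. The four 2-dimensional toy vectors are invented for illustration; real document vectors would be high-dimensional tf-idf vectors, but the loop is the same: assign each vector to its nearest centroid, then recompute each centroid as the mean of its cluster.

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        # Assignment step: each vector joins the nearest centroid's cluster
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            clusters[i].append(v)
        # Update step: each centroid becomes the mean of its cluster
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

docs = [[5, 0], [4, 1], [0, 5], [1, 4]]  # two obvious groups of document vectors
clusters = kmeans(docs, k=2)
print(sorted(len(c) for c in clusters))  # [2, 2]
```

HAC would instead start from one cluster per document and repeatedly merge the closest pair, trading k-means' need for a fixed k against higher cost.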