Fast Supervised Feature Extraction from Structured Representation of Text Data
Ondřej Háva, Pavel Kordík, Miroslav Skrbek
Computational Intelligence Group, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague

Agenda
- Structured representation of text documents
- Dimensionality reduction
- Two-stage supervised feature extraction
- Neural network implementation
- Experiments
- Notes and future plans

Data mining, text mining, dimensionality reduction
- Text documents are a popular source of data for data mining tasks: articles, web pages, notes, blogs, free survey questions, …
- Written language is rich: synonyms, inflection, …
- The goal of text mining algorithms is to extract the important topics from documents: a transformation from the bag-of-words model to a reliable and easy-to-use representation of the documents.

Representation of text data
- Focus on collections of unstructured text documents: categorization, classification, merging with structured data, information retrieval, …
- Transformation of free text to a structured data matrix: each row corresponds to a document, each column to a dictionary item (term).
- [Pipeline figure: documents in text format → dictionaries and linguistic tagging → document-term matrix of term frequencies → possible feature reduction]
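As a concrete illustration of this transformation, a minimal sketch using scikit-learn's TfidfVectorizer; the toy corpus and all parameters are my own assumptions, not the authors' setup.

```python
# Minimal sketch: building a document-term matrix (rows = documents,
# columns = dictionary items) with tf-idf weights. Illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "new cars announced at the motor show",
    "housing prices rise in Prague",
    "travel tips for a weekend in Prague",
]

vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs)   # sparse N x M document-term matrix
print(D.shape, vectorizer.get_feature_names_out()[:5])
```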

Structured representation of documents
- Text records: web pages, paragraphs, documents, patents, medical records, operator's notes, press releases, blogs, etc.
- Linguistic items extracted from the documents carry frequency-based weights; topics assigned externally to the documents describe the topic coverage in each document.
- The term side of the representation supports unsupervised learning; the externally assigned topics support supervised learning.
- [Figure: document-term (N x M) and document-topic (N x K) matrices, with M > N > K]

Dimensionality reduction
- Typical dimensionality of a document matrix: 10^3 terms (M), 10^2 documents (N), 10^1 categories (K).
- Classes of dimensionality reduction techniques: feature selection vs. feature extraction; standalone algorithm vs. wrapper; supervised vs. unsupervised.
- Probably the most popular dimensionality reduction technique in text mining is Singular Value Decomposition (SVD): extraction, standalone, unsupervised. The extracted features represent latent semantic concepts.
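The SVD baseline can be sketched in a few lines; using scikit-learn's TruncatedSVD is my assumption, since the slides do not name an implementation.

```python
# Minimal sketch: unsupervised feature extraction with truncated SVD (LSA).
# D is a (sparse) N x M document-term matrix such as the one built above.
from sklearn.decomposition import TruncatedSVD

n_concepts = 8                            # number of latent concepts (illustrative)
svd = TruncatedSVD(n_components=n_concepts)
D_reduced = svd.fit_transform(D)          # N x n_concepts latent semantic features
```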

Objective
- Develop a supervised, standalone feature extraction method suitable for the document-term matrix, i.e. applicable to the structured representation of text.
- The method should be fast enough to process new unlabeled documents efficiently.
- It should exploit the useful information contained in the labeled training documents.

Solution
- A two-stage transformation: document-term matrix (N x M) → document-document matrix (N x N) → document-category matrix (N x K).
- Stage 1 uses the training dictionary to compute similarities to the training documents.
- Stage 2 uses the categories of the training documents to turn those similarities into similarities to the training category assignment.

Notation
- D … document-term training matrix without target columns (N x M)
- C … document-category indicator training matrix (N x K)
- S … matrix of similarities among the training documents (N x N)
- R … document-extracted-feature matrix for the training documents (N x K)
- d … row vector of a new unclassified document (1 x M)
- s … row vector of similarities between the new document and the training documents (1 x N)
- r … row vector of extracted features for the new document (1 x K)

First stage
- Position the new document in the space of training documents.
- Its coordinates are the cosine similarities with the training documents.
- This is the unsupervised phase: labels are not needed.
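In the notation above, the first stage is a single matrix product over length-normalized rows. A minimal NumPy sketch, assuming D and d are dense arrays; the helper name is mine.

```python
# Minimal sketch of the first stage: cosine similarities to training documents.
import numpy as np

def l2_normalize(X, eps=1e-12):
    """Scale each row to unit Euclidean length, so dot products become cosines."""
    norms = np.linalg.norm(X, axis=-1, keepdims=True)
    return X / np.maximum(norms, eps)

Dn = l2_normalize(D)                         # N x M, unit-length training documents
S = Dn @ Dn.T                                # N x N similarities among training docs
s = l2_normalize(d.reshape(1, -1)) @ Dn.T    # 1 x N similarities of the new document
```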

Second stage
- Each training document is expressed in category space by its supervised labels.
- The new document can also be placed in category space, because its similarities to the training documents are known from the first stage.
- Its coordinates are the (weighted) mean of the training documents' coordinates in category space, weighted by the similarities to the new document; after normalization this is again a cosine similarity, now to the category columns of the training document-category matrix.
- This is the supervised phase: it utilizes the assignment of the training documents to categories.
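Continuing the sketch above, the second stage is another normalized matrix product:

```python
# Minimal sketch of the second stage: projecting similarities into category space.
# C is the N x K binary document-category indicator matrix of the training set.
R = l2_normalize(S @ C)    # N x K extracted features for the training documents
r = l2_normalize(s @ C)    # 1 x K extracted features for the new document
```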

Modified second stage
- Assignments of training documents to categories are usually binary (0/1): a document is or isn't a member of a particular category.
- Due to the richness of written language, real-valued coordinates in the document-category matrix are more pragmatic than binary ones: some documents represent a particular category better than others, and some documents represent more than one category.
- Training documents that are similar to other training documents from the same category are better representatives of that category.
- Therefore, substitute the binary document-category matrix with a real-valued matrix whose entries are sums of the similarities to documents of the same category.
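Under the notation above, that real-valued matrix can be obtained directly from S and C; a sketch of my reading of the modification:

```python
# Minimal sketch of the modified second stage. (S @ C)[i, k] sums the
# similarities of training document i to all documents labeled with category k.
C_real = S @ C                       # real-valued replacement for the binary C
r_mod = l2_normalize(s @ C_real)     # modified features for the new document
```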

Neural implementation
- Cosine similarity, cos(w, x) = (Σ_i w_i x_i) / (‖w‖ ‖x‖), is easily simulated by an artificial neuron: the weighted sum of inputs (the action potential) realizes the matrix multiplication in the numerator, and the activation function changes the action potential by the normalization in the denominator.
- Two types of neurons: similarity to a training document, and expression of similarity in category space.
- First stage, similarity to a training document: w_i … term frequencies of the training document; x_i … term frequencies of the new unlabeled document.
- Second stage, similarity to a particular category: w_i … category indicators of the training documents; x_i … similarities of the new unlabeled document to the training documents.

Neural network
- The input layer only transfers the term frequencies to the hidden layer.
- The hidden layer computes the similarities to the training documents.
- The output layer expresses the similarities in category space (using the original or the modified second stage).
- [Figure: three-layer network with M input, N hidden and K output neurons, M > N > K]
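Put together, the whole network is a forward pass of two matrix products with normalization as the activation; a compact sketch reusing the helpers above (the function name is mine):

```python
# Minimal sketch of the full SFX network. There is no iterative training:
# the synaptic weights are just the (normalized) training matrices.
def sfx_forward(d_new, Dn, C):
    """Map a new document's 1 x M term vector to 1 x K category-space features."""
    s_new = l2_normalize(d_new.reshape(1, -1)) @ Dn.T   # hidden layer: N similarities
    return l2_normalize(s_new @ C)                      # output layer: K features
```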

Experimental design
- 645 press releases by the Czech News Agency (ČTK) or Grand Prince (GP); each document is approximately 5 KB long.
- The press releases are manually divided into eight roughly equally occupied categories: cars, housing, travel, culture, Prague, domestic news, health, foreign news.
- Random split into training (65%) and test (35%) sets.
- Comparison of the proposed feature extraction method (SFX) with standard SVD.
- One binary logistic regression classifier per category.

Experimental setup
- [Pipeline figure: construction of the document-term matrix → SFX (with SVD as the alternative) → logistic regression classifier]
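The classification step can be sketched as follows; the ČTK/GP corpus is not public, so X_train/X_test stand for SFX (or SVD) feature arrays and Y_train/Y_test for 0/1 category indicator arrays, all of them assumptions for illustration.

```python
# Minimal sketch: one binary logistic regression classifier per category.
from sklearn.linear_model import LogisticRegression

classifiers = []
for k in range(Y_train.shape[1]):              # loop over the K categories
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, Y_train[:, k])            # binary target: in category k or not
    classifiers.append(clf)
    print(f"category {k}: test accuracy {clf.score(X_test, Y_test[:, k]):.3f}")
```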

Document weighting
- Terms are constructed from words; no lemmatization or stemming is applied.
- Dictionary items are selected by a frequency filter:
  - gf > 2 … the word appears more than twice in the training collection
  - n/df > 2 … the word does not appear in more than half of the documents
  - gf/df > 1.2 … the average term frequency per containing document exceeds 1.2
- The dictionary includes the 5320 most frequent words from the training set.
- Term frequencies are expressed by the popular tf-idf weights, where:
  - tf … term frequency in a particular document (a local feature of the term, derived from that document)
  - gf … term frequency in the whole collection (a global feature of the term, derived from the training collection)
  - df … number of documents in which the term is present
  - n … number of documents in the collection
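The frequency filter is straightforward to compute from a raw count matrix; a sketch assuming counts is a dense N x M NumPy array of raw term counts (the variable name is mine):

```python
# Minimal sketch of the dictionary frequency filter with the slide's thresholds.
import numpy as np

gf = counts.sum(axis=0)            # global frequency of each term
df = (counts > 0).sum(axis=0)      # document frequency (>= 1 for every real column)
n = counts.shape[0]                # number of documents in the collection

keep = (gf > 2) & (n / df > 2) & (gf / df > 1.2)
dictionary = np.flatnonzero(keep)  # column indices of retained dictionary items
```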

Experimental results
- [Results charts. Annotations: "better quality but sometimes overfits" and "even better quality without overfitting", apparently contrasting the original and the modified SFX.]

Summary
- SFX is easy to implement: no matrix decomposition, inversion or eigenvector computation is needed.
- SFX is fast: about 100 times faster than SVD.
- SFX can deal with multi-topic documents through a simple modification of the training matrix.
- The extracted features are interpretable: they correspond to the training categories.
- SFX performs better than SVD thanks to the utilization of the training labels.
- SFX has a simple neural simulation: the topology is derived from the training documents and their labels, there is no iterative learning algorithm, and the synaptic weights are just values from the training matrices.

Note: Similarity to RBF networks
- Role: an RBF network is a classifier; SFX is a feature extractor.
- Hidden layer: in an RBF network, the hidden neurons and the input-to-hidden connections represent cluster centers; in SFX they represent the training documents.
- Similarity measure: an RBF network uses Euclidean distance transformed by a radial basis function; SFX uses cosine similarity in both the hidden and the output layer.
- Output weights: in an RBF network, the hidden-to-output weights are assessed by iterative learning; in SFX they represent the labeling.

Future plans
- Unification of all measures (term weighting, similarities, labeling) to a common scale, while preserving the simplicity of feature extraction by matrix multiplication.
- Selection of the best documents for the training set: in the neural-network topology, the hidden-layer neurons can influence the labeling.

Thank you