Bayesian Networks in Document Clustering. Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak. Institute of Computer Science, Polish Academy of Sciences, Warsaw.

Presentation transcript:

Bayesian Networks in Document Clustering
Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw
Research partially supported by the KBN research project 4 T11C "Maps and intelligent navigation in the WWW using Bayesian networks and artificial immune systems".

A search engine with SOM-based document set representation

Map visualizations in 3D (BEATCA)

- The preparation of documents is done by an indexer, which turns each document into a vector-space representation.
- The indexer also identifies frequent phrases in the document set, for clustering and labelling purposes.
- Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded.
- The map creator is then applied, turning the vector-space representation into a form suitable for on-the-fly map generation.
- The best map (with respect to some similarity measure) is used by the query processor in response to the user's query.
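A minimal sketch of the dictionary-optimization step described above, assuming a plain bag-of-words indexer; the function names and thresholds are illustrative, not taken from BEATCA:

    import math
    from collections import Counter

    def term_entropy(term, docs_tf):
        # Entropy of the term's count distribution across documents;
        # terms spread evenly over the whole collection get extreme (high) entropy.
        counts = [tf[term] for tf in docs_tf if term in tf]
        total = sum(counts)
        return -sum((c / total) * math.log(c / total) for c in counts)

    def prune_dictionary(docs, min_df=2, max_df_ratio=0.5, max_entropy=None):
        # docs: list of token lists; returns the retained vocabulary.
        docs_tf = [Counter(d) for d in docs]
        df = Counter(t for tf in docs_tf for t in tf)
        n = len(docs)
        vocab = {t for t, f in df.items()
                 if f >= min_df and f / n <= max_df_ratio}   # drop extremely rare / frequent terms
        if max_entropy is not None:                          # drop extreme-entropy terms
            vocab = {t for t in vocab if term_entropy(t, docs_tf) <= max_entropy}
        return vocab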

Document model in search engines
- In the so-called vector space model, a document is represented as a vector in the space spanned by the terms it contains.
- Example (dictionary: dog, food, walk): "My dog likes this food", "When walking, I take some food".
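As a worked example, the two documents above become count vectors over the (stemmed) dictionary {dog, food, walk}; a tiny illustrative sketch:

    from collections import Counter

    terms = ["dog", "food", "walk"]            # dictionary after stemming / stop-word removal
    docs = {
        "d1": ["dog", "food"],                 # "My dog likes this food"
        "d2": ["walk", "food"],                # "When walking, I take some food" (walking -> walk)
    }
    vectors = {d: [Counter(tokens)[t] for t in terms] for d, tokens in docs.items()}
    print(vectors)                             # {'d1': [1, 1, 0], 'd2': [0, 1, 1]}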

Clustering document vectors
- Documents from the document space are assigned to the cells of a 2D map (an m x r grid).
- Important difference from general clustering: not only should each cluster contain similar documents, but neighboring clusters should also be similar.
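For reference (not stated on this slide): when the map is trained with a Kohonen-style SOM, as in the search engine described earlier, the similarity of neighboring cells is enforced by the standard update rule, which moves the winning cell and its map neighbours towards each presented document vector x:

    w_j(t+1) = w_j(t) + \alpha(t) \, h_{c(x),j}(t) \, (x - w_j(t))

where c(x) is the best-matching cell for x, \alpha(t) is a decreasing learning rate, and h_{c(x),j}(t) is a neighbourhood function that decays with the map distance between cells c(x) and j.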

Our problem
- Instability
- Pre-defined major themes are needed
- Our approach: find a coarse clustering into a few themes

Bayesian Networks in Document Clustering
- SOM document-map based search engines require an initial document clustering in order to present results in a meaningful way.
- Latent Semantic Indexing based methods appear to be promising for this purpose.
- One of them, PLSA, has been investigated empirically.
- A modification of the original algorithm is proposed, and an extension via TAN-like Bayesian networks is suggested.

A Bayesian network
- Example network over the nodes: owner, dog, walk, food, meat, Chappi.
- A Bayesian network represents a joint probability distribution as a product of the conditional probabilities of children given their parents in a directed acyclic graph.
- This gives high compression and simplification of reasoning.
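Written out, the factorization referred to above is the standard Bayesian-network decomposition of the joint distribution over variables X_1, ..., X_n, with parent sets Pa(X_i) given by the DAG:

    P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))

Storing only the conditional tables P(X_i | Pa(X_i)) instead of the full joint table is what gives the compression mentioned above.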

BN applications in text processing
- Document classification
- Document clustering
- Query expansion

Hidden variable approaches
- PLSA (Probabilistic Latent Semantic Analysis)
- PHITS (Probabilistic Hyperlink Analysis)
- Combined PLSA/PHITS
- Assumption: a hidden variable expresses the topic of the document.
- The topic probabilistically influences the appearance of the document (links in PHITS, terms in PLSA).

PLSA - concept
- Let N be the term-document matrix of word counts, i.e., N_ij denotes how often a term (single word or phrase) t_i occurs in document d_j.
- Probabilistic decomposition into factors z_k (1 <= k <= K):
  P(t_i | d_j) = Σ_k P(t_i | z_k) P(z_k | d_j),
  with non-negative probabilities and two sets of normalization constraints:
  Σ_i P(t_i | z_k) = 1 for all k, and Σ_k P(z_k | d_j) = 1 for all j.
- (Figure: Bayesian network with hidden variable Z, document node D, and term nodes T1, ..., Tn.)
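Viewed as matrices, this decomposition is a non-negative rank-K factorization of the conditional term-document matrix; a shape-only sketch with random, purely illustrative factors:

    import numpy as np

    n_terms, n_docs, K = 2000, 500, 16                                # illustrative sizes
    P_t_given_z = np.random.dirichlet(np.ones(n_terms), size=K).T     # (n_terms, K), columns sum to 1
    P_z_given_d = np.random.dirichlet(np.ones(K), size=n_docs).T      # (K, n_docs), columns sum to 1
    P_t_given_d = P_t_given_z @ P_z_given_d                           # (n_terms, n_docs), entries P(t_i | d_j)
    assert np.allclose(P_t_given_d.sum(axis=0), 1.0)                  # each column is a distribution over terms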

PLSA - concept (continued)
- PLSA aims at maximizing the log-likelihood
  L := Σ_{i,j} N_ij log Σ_k P(t_i | z_k) P(z_k | d_j).
- The factors z_k can be interpreted as states of a latent mixing variable associated with each observation (i.e., each word occurrence).
- The Expectation-Maximization (EM) algorithm can be applied to find a local maximum of L.
- Different factors usually capture distinct "topics" of a document collection; by clustering documents according to their dominant factors, useful topic-specific document clusters often emerge.
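A compact, dense sketch of the EM iterations that maximize L (a real implementation would exploit the sparsity of N; names and defaults are illustrative):

    import numpy as np

    def plsa(N, K, iters=50, seed=0):
        # N: (n_terms, n_docs) count matrix; returns P(t|z) and P(z|d).
        rng = np.random.default_rng(seed)
        n_terms, n_docs = N.shape
        P_tz = rng.random((n_terms, K)); P_tz /= P_tz.sum(axis=0, keepdims=True)
        P_zd = rng.random((K, n_docs));  P_zd /= P_zd.sum(axis=0, keepdims=True)
        for _ in range(iters):
            # E-step: posterior P(z | t, d), shape (K, n_terms, n_docs)
            joint = P_tz.T[:, :, None] * P_zd[:, None, :]
            post = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)
            # M-step: re-estimate P(t|z) and P(z|d) from expected counts
            Nw = N[None, :, :] * post
            P_tz = Nw.sum(axis=2).T
            P_tz /= P_tz.sum(axis=0, keepdims=True) + 1e-12
            P_zd = Nw.sum(axis=1)
            P_zd /= P_zd.sum(axis=0, keepdims=True) + 1e-12
        return P_tz, P_zd

Clustering documents by their dominant factor is then simply P_zd.argmax(axis=0).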

EM algorithm - step 0
- The data form a table with columns D, Z, T1, T2, ..., Tn; the value of the hidden variable Z is unknown ("?") in every record.
- Z is randomly initialized.

EM algorithm - step 1
- A Bayesian network with hidden variable Z, document node D and term nodes T1, ..., Tn is trained on the data with the current Z values.

EM algorithm - step 2
- Z is re-sampled from the trained Bayesian network: for each record, Z is drawn according to the distribution P(Z = k | D = d, T1 = t1, ..., Tn = tn), k = 1, 2, ....
- Go to step 1 until convergence (the Z assignment is "stable").
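The loop on these slides, written out as a sketch; train_bayes_net and sample_posterior are hypothetical placeholders for whatever BN learner and inference routine are used (naive Bayes with a hidden class Z in the simplest case):

    import numpy as np

    def em_with_sampling(records, K, train_bayes_net, sample_posterior, max_iters=100, seed=0):
        # records: one row per data record (D, T1, ..., Tn), as in the slides' table.
        # train_bayes_net(records, z) -> model
        # sample_posterior(model, record) -> vector of P(Z = k | record), length K
        rng = np.random.default_rng(seed)
        z = rng.integers(0, K, size=len(records))                     # step 0: random initialization of Z
        for _ in range(max_iters):
            model = train_bayes_net(records, z)                       # step 1: fit the BN with current Z
            new_z = np.array([rng.choice(K, p=sample_posterior(model, r))
                              for r in records])                      # step 2: resample Z from the BN
            if np.array_equal(new_z, z):                              # stop when the Z assignment is stable
                break
            z = new_z
        return z, model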

The problem
- Too many adjustable parameters
- The pre-defined clusters were not identified
- Long computation times
- Instability

Solution
- Our suggestion: use a "sharp version" of naive Bayes, in which each document is assigned to its most probable class.
- We were successful: up to five classes were clustered well, and at high speed (with 20,000 documents).
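A sketch of this "sharp" (hard-assignment) variant, i.e. classification EM over a multinomial naive Bayes model; smoothing and parameter names are illustrative:

    import numpy as np

    def sharp_naive_bayes(N, K, iters=30, alpha=1.0, seed=0):
        # N: (n_docs, n_terms) count matrix; hard-assigns each document to its most probable class.
        rng = np.random.default_rng(seed)
        n_docs, n_terms = N.shape
        z = rng.integers(0, K, size=n_docs)                          # random initial classes
        for _ in range(iters):
            # M-step: class priors and per-class term distributions (Laplace-smoothed)
            prior = np.array([(z == k).sum() + alpha for k in range(K)], dtype=float)
            prior /= prior.sum()
            theta = np.vstack([N[z == k].sum(axis=0) + alpha for k in range(K)])
            theta /= theta.sum(axis=1, keepdims=True)
            # "Sharp" E-step: assign every document to its most probable class
            log_post = np.log(prior)[None, :] + N @ np.log(theta).T
            new_z = log_post.argmax(axis=1)
            if np.array_equal(new_z, z):
                break
            z = new_z
        return z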

Next step
- Naive Bayes assumes independence of the terms and documents (given the hidden class).
- What if they are in fact dependent?
- Our solution: the TAN (tree-augmented naive Bayes) approach.
  - First, we create a Bayesian network over the terms (or documents).
  - Then we assume there is a hidden variable on top of it.
- Promising results; a deeper study is needed.
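One common way to build the required tree over the term variables (assumed here; the slide does not specify the construction) is a Chow-Liu tree: compute pairwise mutual information between binary term-occurrence variables and keep a maximum spanning tree. networkx is assumed to be available, and only a small, pre-pruned vocabulary is practical because the loop is quadratic in the number of terms:

    import numpy as np
    import networkx as nx

    def chow_liu_term_tree(X):
        # X: (n_docs, n_terms) binary term-occurrence matrix; returns the tree edges (i, j).
        n_docs, n_terms = X.shape
        G = nx.Graph()
        for i in range(n_terms):
            for j in range(i + 1, n_terms):
                mi = 0.0
                for a in (0, 1):
                    for b in (0, 1):
                        p_ab = np.mean((X[:, i] == a) & (X[:, j] == b))
                        p_a, p_b = np.mean(X[:, i] == a), np.mean(X[:, j] == b)
                        if p_ab > 0:
                            mi += p_ab * np.log(p_ab / (p_a * p_b))
                G.add_edge(i, j, weight=mi)                           # edge weight = mutual information
        return list(nx.maximum_spanning_tree(G).edges())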

PLSA - a model with a term TAN
(Figure: hidden variable Z connected to document nodes D1, D2, ..., Dk and to term nodes T1, ..., T6, with additional tree edges among the term nodes.)

PLSA - a model with a document TAN
(Figure: hidden variable Z connected to term nodes T1, T2, ..., Ti and to document nodes D1, ..., D6, with additional tree edges among the document nodes.)