So Hirai (The University of Tokyo; currently NTT DATA Corp.) and Kenji Yamanishi (The University of Tokyo). WITMSE 2012, Amsterdam, Netherlands. Presented at KDD 2012 on Aug. 13.

Contents
- Problem Setting
- Significance
- Proposed Algorithm: Sequential Dynamic Model Selection with NML (normalized maximum likelihood) coding
- How to compute the NML code-length for Gaussian mixtures
- Experimental Results
- Marketing Applications
- Conclusion

Problem Setting (1/2)
Clustering change detection: tracking changes of clustering structures in a sequential setting to detect novelty in data.
Example: market analysis.
- The structure of customer groups changes over time.
- Detect changes of the number of clusters as well as their assignment.
(Figure: clustering structures changing along the time axis.)

Problem Setting (2/2)
(Figure: examples of clustering structure changes over successive time steps, with clusters labeled A-F, α, β.)
- Existing customers change their patterns.
- New customers emerge to form a new group.
- There exist various types of clustering structure changes.

Related work on the clustering change detection issue
- Evolutionary clustering [Chakrabarti et al., 2006]
- Hypothesis testing approach [Song and Wang, 2005]
- Kalman filter approach [Krempl et al., 2011]
- GraphScope [Sun et al., 2007]
- Variational Bayes approach [Sato, 2001]

Significance
- A novel clustering change detection algorithm. Key ideas:
  - Sequential dynamic model selection (sequential DMS).
  - NML (normalized maximum likelihood) code-length as the criterion, giving the first formulae for NML for Gaussian mixture models.
- Empirical demonstration of its superiority over existing methods, shown using artificial data sets.
- Demonstration of its validity in market analysis, shown using real beer consumption data sets.

Sequential Dynamic Model Selection Algorithm

Proposed Alg. – background of DMS –
Dynamic Model Selection (DMS) [Yamanishi and Maruyama, 2007]: an extension of the MDL (Minimum Description Length) principle [Rissanen, 1978] to model "sequence" selection.
Batch DMS criterion: minimize the total code-length, i.e., the code-length of the data sequence plus the code-length of the model sequence, with respect to the model sequence (formula not preserved in the transcript; a hedged sketch follows).
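A minimal sketch of the batch DMS criterion as I read this slide; the notation (x_t for the data and K_t for the model at time t) is assumed, not taken from the slide.

```latex
% Hedged sketch of the batch DMS criterion (notation assumed):
\min_{K_1,\dots,K_T}\ \Bigl\{
  \underbrace{\textstyle\sum_{t=1}^{T} L(x_t \mid K_t)}_{\text{code-length of data seq.}}
  \;+\;
  \underbrace{L(K_1,\dots,K_T)}_{\text{code-length of model seq.}}
\Bigr\}
```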

Proposed Alg. – Sequential DMS –
Sequential dynamic model selection (SDMS) algorithm: a sequential variant of the DMS criterion [Yamanishi and Maruyama, 2007].
At each time t, given the new data, sequentially select the number of clusters K_t and the cluster assignment Z_t by minimizing, with respect to K_t and Z_t, the sum of
- the code-length for data clustering, computed with NML (normalized maximum likelihood) coding, and
- the code-length for the transition of the clustering structure,
subject to the transition constraint (a hedged sketch of the criterion follows).
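A hedged sketch of the sequential criterion in the notation assumed above; the constraint reflects the neighbor-only transition described on a later slide.

```latex
% Hedged sketch of the SDMS criterion (my notation, not verbatim from the slide):
(\hat{K}_t, \hat{Z}_t) \;=\; \operatorname*{arg\,min}_{K_t,\, Z_t}\
\Bigl\{
  \underbrace{L_{\mathrm{NML}}(x_t ; K_t, Z_t)}_{\text{code-length for data clustering}}
  \;+\;
  \underbrace{L(K_t \mid K_{t-1})}_{\text{code-length for structure transition}}
\Bigr\}
\quad \text{s.t.}\quad K_t \in \{K_{t-1}-1,\ K_{t-1},\ K_{t-1}+1\}.
```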

Proposed Alg. – model transition –
Consider three patterns of clustering changes, and run the EM algorithm with the initial values below (a sketch appears after this list):
- Case 1: the number of clusters does not change. Initial parameter values remain the same.
- Case 2: the number of clusters decreases (e.g., merging). Assign the data in a certain cluster to the other clusters randomly.
- Case 3: the number of clusters increases (e.g., splitting). Assign some data to a new cluster randomly.
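A minimal Python sketch of how the three initialization cases could be realized. All names (init_assignments, prev_labels, etc.) and the specific random choices are hypothetical illustrations of the slide's description, not the authors' exact procedure.

```python
import numpy as np

def init_assignments(prev_labels, k_prev, k_new, rng=None):
    """Hypothetical sketch of the three EM-initialization cases from the slide."""
    rng = rng or np.random.default_rng(0)
    labels = np.asarray(prev_labels).copy()
    if k_new == k_prev:
        # Case 1: number of clusters unchanged -> keep previous assignments/parameters.
        return labels
    if k_new < k_prev:
        # Case 2: clusters decrease (merging): pick one cluster and reassign its
        # points to the remaining clusters at random.
        removed = int(rng.integers(k_prev))
        remaining = [c for c in range(k_prev) if c != removed]
        mask = labels == removed
        labels[mask] = rng.choice(remaining, size=mask.sum())
        return np.unique(labels, return_inverse=True)[1]  # relabel to 0..k_new-1
    # Case 3: clusters increase (splitting): move some randomly chosen points
    # into a brand-new cluster.
    picked = rng.choice(labels.size, size=max(1, labels.size // k_new), replace=False)
    labels[picked] = k_new - 1
    return labels
```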

Proposed Alg. – code-length for transition –
Code-length of the model transition: model the transition probability distribution of K, supposing K transits to its neighbors only, and employ the Krichevsky-Trofimov (KT) estimate [Krichevsky and Trofimov, 1981] (a hedged sketch follows).
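A hedged sketch of how the KT estimate could yield this code-length, assuming transitions restricted to δ ∈ {-1, 0, +1} and writing n_δ(t) for how often transition δ has occurred before time t (my notation, not the authors' exact formula).

```latex
P_{\mathrm{KT}}\bigl(K_t = K_{t-1} + \delta \mid \text{past}\bigr)
  \;=\; \frac{n_{\delta}(t) + 1/2}{\sum_{\delta' \in \{-1,0,+1\}} n_{\delta'}(t) + 3/2},
\qquad
L(K_t \mid K_{t-1}) \;=\; -\log P_{\mathrm{KT}}\bigl(K_t \mid \text{past}\bigr).
```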

How to compute the NML code-length for Gaussian mixtures

Criteria – NML code-length –
Model: Gaussian mixture model (GMM).
NML (normalized maximum likelihood) code-length: the shortest code-length in the sense of the minimax criterion [Shtarkov, 1987]; it consists of the maximum-likelihood code-length plus a normalization term (a sketch of the standard form follows).
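For reference, the standard form of the NML code-length, with the first term the maximum-likelihood code-length and the second the normalization term (the slide's own formula is not preserved in the transcript).

```latex
L_{\mathrm{NML}}(x^n ; K)
  \;=\; -\log p\bigl(x^n ; \hat{\theta}(x^n), K\bigr)
        \;+\; \log \int p\bigl(y^n ; \hat{\theta}(y^n), K\bigr)\, dy^n .
```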

For Continuous Data
Problem: when the data ranges over the whole domain,
- NML for a Gaussian distribution: the normalization term diverges;
- NML for a mixture distribution: the normalization term is computationally intractable, which comes from combinatorial difficulties.

For Continuous Data (Example)
For the one-dimensional Gaussian distribution with σ² given, the normalization term diverges (a sketch follows).
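A hedged sketch of the divergence in this example: with σ² known, the MLE of the mean is the sample mean, so the plugged-in likelihood depends only on deviations from the sample mean; the integrand is invariant under shifting all data points by the same constant, and the integral over that shift direction is already infinite.

```latex
\mathcal{C}(n) \;=\; \int_{\mathbb{R}^n} \max_{\mu}\, p(x^n \mid \mu, \sigma^2)\, dx^n
  \;=\; \int_{\mathbb{R}^n} \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
        \exp\!\Bigl(-\frac{(x_i - \bar{x})^2}{2\sigma^2}\Bigr)\, dx^n \;=\; \infty .
```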

Approximate computation (1/2)
Use sufficient statistics: the normalization term is rewritten through the distribution of the maximum likelihood estimators, where g_1 is a Gaussian distribution (for the mean) and g_2 is a Wishart distribution (for the covariance). A hedged sketch follows.
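A hedged sketch of the sufficient-statistic device as I read this slide, not the authors' exact derivation: integrating the maximized likelihood over the data reduces to integrating the sampling density of the MLE evaluated at its own value, which factorizes into a Gaussian part g_1 for the mean and a Wishart part g_2 for the covariance.

```latex
\int p\bigl(x^n ; \hat{\mu}(x^n), \hat{\Sigma}(x^n)\bigr)\, dx^n
  \;=\; \int\!\!\int g_1\bigl(\hat{\mu} ; \hat{\mu}, \hat{\Sigma}\bigr)\,
        g_2\bigl(\hat{\Sigma} ; \hat{\Sigma}\bigr)\, d\hat{\mu}\, d\hat{\Sigma}.
```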

Criteria – NML for GMM –
Efficiently computing an approximate variant of the NML code-length for a GMM [Hirai and Yamanishi, 2011]: restrict the range of data so that the MLE lies in a bounded range specified by hyper-parameters.
- The normalization term then does not diverge.
- But it still depends heavily on those hyper-parameters (a sketch of the restriction follows).
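A hedged sketch of the restriction in my notation: the normalization integral runs only over data whose MLE falls in a bounded parameter region Θ(R) determined by hyper-parameters R.

```latex
\mathcal{C}(K, R) \;=\; \int_{\{x^n \,:\; \hat{\theta}(x^n) \in \Theta(R)\}}
  p\bigl(x^n ; \hat{\theta}(x^n), K\bigr)\, dx^n \;<\; \infty .
```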

NML
The normalization term is calculated in closed form (formula shown on the original slide), in terms of the number of data points and the dimension of the data.

Criteria – RNML code-length –
Modify NML to develop the re-normalized maximum likelihood (RNML) coding [Rissanen, Roos, Myllymäki, 2010] [Hirai and Yamanishi, 2012]: re-normalize around the MLE of the parameters by restricting the range of data. The result is less dependent on the hyper-parameters.

Criteria – RNML code-length – (continued; formulas shown on the original slide)

RNML code-length
Theorem [Hirai and Yamanishi, 2012]: the RNML code-length for a GMM is calculated in closed form (formula shown on the original slide).
Problem: computing its normalization terms directly is costly.

Criteria – efficient computing of RNML –
Straightforward computation of RNML requires prohibitive time, but it can be computed efficiently using Theorem [Kontkanen and Myllymäki, 2007] (a hedged sketch of the relevant recurrence follows).
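As a hedged sketch, the result I believe this refers to is the Kontkanen-Myllymäki linear-time recurrence for the multinomial NML normalization term C(K, n), used here as a building block (the slide's own statement is not preserved in the transcript).

```latex
\mathcal{C}(K, n) \;=\; \mathcal{C}(K-1, n) \;+\; \frac{n}{K-2}\,\mathcal{C}(K-2, n)
\qquad (K \ge 3),
```

with C(1, n) = 1 and C(2, n) computed by direct summation, so that all values up to K are obtained in roughly O(n + K) time.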

Criteria – efficient computing of RNML – (continued)
Theorem [Hirai and Yamanishi, 2012]: the normalization term for "mixture" models satisfies a recursive formula, so it too can be computed efficiently (formula shown on the original slide).

Experimental Results – Artificial Data / Market Analysis –

Experimental Results – data generation –
Generate artificial data sets according to a GMM (the specific parameter settings are given on the original slide).

Experimental Results – comparison criteria –
Employ three comparison metrics (a hedged sketch of their computation follows):
- AR (accuracy rate): average rate of correctly estimating the true number of clusters over all time points.
- IR (identification rate): probability of correctly identifying change points and the changes themselves.
- FAR (false alarm rate): ratio of the number of false alarms to all detected change points.
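A minimal Python sketch of how these three metrics could be computed from the true and estimated cluster-number sequences. The exact matching rules (e.g., any tolerance window around change points) are not given on the slide, so this is only my reading.

```python
def evaluate(true_k, est_k):
    """Hypothetical AR / IR / FAR computation over per-time cluster counts."""
    T = len(true_k)
    ar = sum(t == e for t, e in zip(true_k, est_k)) / T          # accuracy rate
    true_cp = {t for t in range(1, T) if true_k[t] != true_k[t - 1]}
    est_cp = {t for t in range(1, T) if est_k[t] != est_k[t - 1]}
    # identification rate: change point detected and the new K also correct
    hits = {t for t in true_cp & est_cp if est_k[t] == true_k[t]}
    ir = len(hits) / len(true_cp) if true_cp else 1.0
    # false alarm rate: detected change points that are not true change points
    far = len(est_cp - true_cp) / len(est_cp) if est_cp else 0.0
    return ar, ir, far

# Example: true K jumps from 3 to 4 at t = 5 over 10 time steps, perfectly estimated.
print(evaluate([3] * 5 + [4] * 5, [3] * 5 + [4] * 5))  # (1.0, 1.0, 0.0)
```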

Experimental Results – artificial data –
Our algorithm with NML was able to detect the true change points and identify the true number of clusters with higher probability than AIC and BIC.
(Figure: average number of clusters over time. Table: AR, IR, and FAR for RNML, AIC, and BIC; values shown on the original slide.)
AIC: Akaike's information criterion [Akaike, 1974]. BIC: Bayesian information criterion [Schwarz, 1978].

Comparison w.r.t. KL divergence
Evaluated change detection accuracy by varying the Kullback-Leibler divergence (KLD) between the distributions before and after the change points.
The larger the KLD between the GMMs before and after a change point, the more accurately the change was detected in terms of IR (identification rate).

Experimental Results – vs. SW Alg. –
SW algorithm: hypothesis testing of whether clusters are identical or not, followed by splitting, merging, etc. [Song and Wang, 2005].
The sequential DMS with RNML significantly outperformed the SW algorithm.
(Table: AR, IR, and FAR for the proposed method, SW-RNML, and SW-BIC; values shown on the original slide. Data size per time step = 512.)

Experimental Results – market analysis –
Data set provided by MACROMILL, Inc.: beer purchase records (several kinds of beer, 3185 users, 78 days).
Clustering customers to detect changes in their group structure.
Our algorithm detected clustering changes that corresponded to the year-end demand.
(Figure: user-by-beer purchase matrices at successive time steps.)

The cluster change at change point 1/1-1/2:
Many customers changed their patterns to purchase Beer-A and Third-Beer at the year's end.

Conclusion
- Proposed the sequential DMS algorithm to address the clustering change detection issue. Key ideas:
  - Sequential dynamic model selection based on the MDL principle.
  - The use of the NML code-length as the criterion, together with its efficient computation.
- On artificial data, it detects cluster changes significantly more accurately than AIC/BIC-based methods and the existing statistical-test-based method.
- Tracking changes of group structures leads to understanding changes of market structures.

Why NML?
For a given model class, NML gives the shortest code-length in the sense of Shtarkov's minimax criterion [Shtarkov, 1987]; the minimum is attained by taking Q to be the NML distribution, built from the maximum likelihood estimator (a sketch follows).
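For reference, the standard statement of the criterion (the slide's own formulas are not preserved in the transcript): with θ̂(x^n) the maximum likelihood estimator within the given class,

```latex
\min_{Q} \max_{x^n} \log \frac{p\bigl(x^n ; \hat{\theta}(x^n)\bigr)}{Q(x^n)} ,
\qquad
Q^{*}(x^n) \;=\; \frac{p\bigl(x^n ; \hat{\theta}(x^n)\bigr)}
                      {\int p\bigl(y^n ; \hat{\theta}(y^n)\bigr)\, dy^n}
\;\;\text{(the NML distribution attains the minimum).}
```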

Restrict the range of data
Restrict the range of data in Shtarkov's minimax criterion [Shtarkov, 1987] itself: for a given model class, the maximum over data sequences is taken only over the restricted range (a sketch follows).
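A hedged sketch of the modified criterion in my notation, with Y(R) the restricted data range:

```latex
\min_{Q} \max_{x^n \in \mathcal{Y}(R)} \log
  \frac{p\bigl(x^n ; \hat{\theta}(x^n)\bigr)}{Q(x^n)} ,
\qquad
\mathcal{Y}(R) = \{\, x^n : \hat{\theta}(x^n) \in \Theta(R) \,\}.
```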

Comparison with non-parametric Bayes
Sequential dynamic model selection works better than non-parametric Bayes methods (infinite HMM, etc.) [Sakurai and Yamanishi, "Comparison of Dynamic Model Selection with Infinite HMM for Statistical Model Change Detection", to appear in ITW 2012].