Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation Advisor ： Dr. Hsu Graduate ： You-Cheng Chen Author ： Jeremy.

Presentation transcript:

Hierarchical Model-Based Clustering of Large Datasets Through Fractionation and Refractionation Advisor ： Dr. Hsu Graduate ： You-Cheng Chen Author ： Jeremy Tantrum Alejandro Murua Werner Stuetzle

Outline: Motivation, Objective, Introduction, Model-based Fractionation, Model-based ReFractionation, Example, Conclusions, Personal Opinion

Motivation: Propose an extended method that improves the performance of model-based clustering and makes it applicable to large datasets.

Objective: Apply Fractionation and Refractionation to model-based clustering.

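Introduction: Model-based clustering in a nutshell. The sample $x_1, \dots, x_n$ is assumed to come from a mixture density $p(x) = \sum_{g=1}^{G} \pi_g \, p_g(x)$, where $p_g$ is the density modeling group $g$ and $\pi_g$ is the prior probability that a randomly chosen observation belongs to group $g$.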
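The mixture density described on this slide can be illustrated with a minimal sketch. It assumes univariate Gaussian group densities, which the slide does not specify; the function names and the example parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a univariate normal with the given mean and variance."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def mixture_density(x, priors, means, variances):
    """Mixture density: sum over groups g of prior_g * density_g(x)."""
    return sum(p * gaussian_pdf(x, m, v)
               for p, m, v in zip(priors, means, variances))

def posterior_membership(x, priors, means, variances):
    """Probability that x belongs to each group, obtained by Bayes' rule."""
    weighted = np.array([p * gaussian_pdf(x, m, v)
                         for p, m, v in zip(priors, means, variances)])
    return weighted / weighted.sum()

# Hypothetical two-group mixture (assumed values, not from the slides).
priors = [0.3, 0.7]
means = [0.0, 4.0]
variances = [1.0, 1.0]

density = mixture_density(0.5, priors, means, variances)
post = posterior_membership(0.5, priors, means, variances)
```

An observation is then assigned to the group with the largest posterior probability, which is how a fitted mixture model yields a clustering.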

Introduction: Model-based clustering in a nutshell. The Approximate Weight of Evidence (AWE) can be used to estimate the number of groups. [AWE formula not preserved in the transcript.]

Introduction: Previous work on model-based clustering for large datasets. The Scalable EM (SEM) algorithm can be used to fit mixture models to large datasets, but it cannot estimate the number of groups. The simplest and potentially fastest approach is to draw a sample of the data.

2. Fractionation: Original Fractionation algorithm
1. Split the data into fractions of size M.
2. Cluster each fraction into a fixed number αM of clusters, where α < 1. Summarize each cluster by its mean. We refer to these cluster means as meta-observations.
3. If the total number of meta-observations is greater than M, return to step 1, treating the meta-observations as the data.
4. Cluster the meta-observations into G clusters.
5. Assign each individual observation to the cluster with the closest mean.
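The five steps above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: a plain k-means routine stands in for the clustering method used within each fraction, the data are split in their given order, and the names `kmeans` and `fractionation` as well as the synthetic data are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means (a stand-in clusterer); returns the cluster means."""
    rng = np.random.default_rng(seed)
    k = min(k, len(points))
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers

def fractionation(data, M, alpha, G, seed=0):
    """Sketch of steps 1-5; assumes M >= 2 and 0 < alpha < 1."""
    current = data
    while True:
        # Step 1: split the current observations into fractions of size M.
        fractions = [current[i:i + M] for i in range(0, len(current), M)]
        # Step 2: cluster each fraction into alpha*M clusters and keep the
        # cluster means as meta-observations.
        k = max(1, int(alpha * M))
        meta = np.vstack([kmeans(f, k, seed=seed) for f in fractions])
        # Step 3: repeat on the meta-observations while there are more than M.
        if len(meta) <= M:
            break
        current = meta
    # Step 4: cluster the meta-observations into G clusters.
    centers = kmeans(meta, G, seed=seed)
    # Step 5: assign each observation to the cluster with the closest mean.
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers

# Synthetic data: three Gaussian blobs in 2-D (assumed for illustration).
rng = np.random.default_rng(1)
blobs = [rng.normal(loc=c, scale=0.3, size=(200, 2))
         for c in ([0, 0], [5, 5], [0, 5])]
data = rng.permutation(np.vstack(blobs))
labels, centers = fractionation(data, M=100, alpha=0.1, G=3)
```

With n = 600, M = 100 and α = 0.1, the first pass produces 6 fractions and 60 meta-observations, so step 3 does not trigger a second pass.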

Model-based Fractionation: Main differences from the original algorithm
- Each cluster is represented by its sufficient statistics (the mean, the covariance, and the number of observations) rather than by its mean alone.
- AWE is used to determine the number of clusters G when the meta-observations are clustered.
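One reason a summary consisting of mean, covariance, and count is convenient is that two such summaries can be merged exactly without revisiting the raw observations. The slides do not show how the statistics are combined, so the sketch below uses the standard pooling identity for maximum-likelihood covariances; the function names and the data are hypothetical.

```python
import numpy as np

def summarize(points):
    """Return (count, mean, ML covariance) for a cluster of observations."""
    n = len(points)
    mean = points.mean(axis=0)
    cov = np.cov(points, rowvar=False, bias=True)  # bias=True divides by n
    return n, mean, cov

def merge(a, b):
    """Merge two cluster summaries without access to the raw observations."""
    n1, m1, c1 = a
    n2, m2, c2 = b
    n = n1 + n2
    mean = (n1 * m1 + n2 * m2) / n
    diff = (m1 - m2).reshape(-1, 1)
    # Pooled scatter = within-cluster scatter + between-cluster scatter.
    scatter = n1 * c1 + n2 * c2 + (n1 * n2 / n) * (diff @ diff.T)
    return n, mean, scatter / n

# Hypothetical clusters used only to check the identity.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(30, 2))
x2 = rng.normal(loc=3.0, size=(50, 2))

merged = merge(summarize(x1), summarize(x2))
direct = summarize(np.vstack([x1, x2]))
```

Here `merged` and `direct` agree up to floating-point error, which is what allows meta-observations to be clustered as if the underlying observations were still available.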

3. Model-based ReFractionation: Step 4 of the Fractionation algorithm is replaced by steps 4a and 4b.
4a. Cluster the meta-observations into G clusters, where G is determined by the AWE criterion.
4b. Define the fractions for the i-th pass.

3.1 Illustration: M = 100, 4 fractions, 40 meta-observations

3.1 Illustration: Step 4a: use AWE to find G = 25. Step 4b.

3.1 Illustration: Second pass

3.1 Illustration: 2nd pass, 3rd pass

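3.2 Scope of (Re)Fractionation: Let n_g be the number of groups in the data, n_f the number of fractions, and n_c the number of clusters generated from each fraction. In Step 2, n_g > n_c will lead to impure clusters, since a fraction containing observations from more than n_c groups cannot give every group its own cluster.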

4. Example: 4.1 Measuring the agreement between groups and clusters with the Fowlkes-Mallows index.
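The formula on this slide did not survive in the transcript. As a hedged illustration, the sketch below uses the standard pair-counting definition of the Fowlkes-Mallows index, TP / sqrt((TP + FP)(TP + FN)), where TP counts pairs of observations that share both a group and a cluster. The function names and the toy labels are hypothetical.

```python
from collections import Counter

def pairs(count):
    """Number of unordered pairs that can be formed from `count` items."""
    return count * (count - 1) // 2

def fowlkes_mallows(groups, clusters):
    """Fowlkes-Mallows index between true groups and found clusters."""
    joint = Counter(zip(groups, clusters))
    # Pairs placed together in both the grouping and the clustering (TP).
    together_in_both = sum(pairs(c) for c in joint.values())
    # Pairs in the same true group (TP + FN).
    same_group = sum(pairs(c) for c in Counter(groups).values())
    # Pairs in the same cluster (TP + FP).
    same_cluster = sum(pairs(c) for c in Counter(clusters).values())
    if same_group == 0 or same_cluster == 0:
        return 0.0
    return together_in_both / (same_group * same_cluster) ** 0.5

groups = [0, 0, 0, 1, 1, 1]
perfect = fowlkes_mallows(groups, [1, 1, 1, 0, 0, 0])  # same partition, relabeled
partial = fowlkes_mallows(groups, [0, 0, 1, 1, 1, 1])  # one observation misplaced
```

The index equals 1 when the clustering reproduces the groups exactly, regardless of how the clusters are labeled, and decreases as pairs are split or wrongly merged.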

4.3 Example 1: groups = 19, n = 22000, M = 1000, clusters = 100

4.3 Example 3: groups = 361, n = 20900, M = 1045, clusters = 100

Conclusions We can study the performance of the AWE criterion for estimating the number of groups in a mixture of factor analyzers model.

Personal Opinion: We can draw on the advantages of other clustering methods to remedy the shortcomings of our own.