Large scale clustering of simulated gravitational wave triggers using S-means and Constrained Validation methods. L.R. Tang, H. Lei, S. Mukherjee, S. D. Mohanty.

Large scale clustering of simulated gravitational wave triggers using S-means and Constrained Validation methods. L.R. Tang, H. Lei, S. Mukherjee, S. D. Mohanty. University of Texas at Brownsville. GWDAW12, Boston, December 13-16, 2007. LIGO DCC # LIGO-G

Introduction. LIGO data is seen to pick up a variety of glitches in all channels, and glitch analysis is a critical task in LIGO detector characterization. Several analysis methods (kleine Welle, Q-scan, BN event display, Multidimensional classification, NoiseFloorMon, etc.) have been in use to explain the sources of the glitches. Trigger clustering is an important step towards identification of distinct sources of triggers and discovery of unknown patterns. This study presents two new techniques for large scale analysis of gravitational wave burst triggers. Traditional approaches to clustering treat the problem as an optimization problem in an open search space of clustering models; however, this can lead to over-fitting or, even worse, non-convergence of the algorithm. The new algorithms address these problems. S-means looks at similarity statistics of burst triggers and builds up clusters, with the advantage of avoiding local minima. Constrained Validation clustering tackles the problem by constraining the search to the space of clustering models that are "non-splittable", i.e. models in which the centroids of the left and right children of a cluster (after splitting) are nearest to each other. Both methods are demonstrated on simulated data; the results look promising and the methods are expected to be useful for LIGO S5 data analysis. This work is supported by an NSF CREST grant to UTB (2007) and NSF PHY (2006).

Clustering: unsupervised classification of data points. It has been a well researched problem for decades, with applications in many disciplines. New challenges: massive amounts of data make many traditional algorithms unsuitable. K-means: the most efficient clustering algorithm, owing to its low complexity O(nkl), high scalability and simple implementation. Weaknesses: i) sensitive to the initial partition (solution: repeat K-means); ii) converges to local minima (a local minimum is acceptable in practice); iii) centroids are sensitive to outliers (solution: fuzzy K-means); iv) the number of clusters, K, must be specified in advance. Solution: S-means! S-means Clustering Algorithm

Step 1. Randomly initialize K centroids (the user can specify any starting K; K = 1 by default). Step 2. Calculate the similarities from every point to every centroid. Then, for any point, if the highest similarity, say to centroid c_i, is > T, assign the point to cluster i; otherwise, add it to a new cluster (the (K+1)-th cluster) and increase K by 1. Step 3. Update each centroid with the mean of all its member points. If a cluster becomes empty, remove its centroid and reduce K by 1. Repeat Steps 2 and 3 until no new cluster is formed and none of the centroids moves.
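Below is a minimal sketch of the three steps above in Python/NumPy. It is an illustration only, not the authors' code: correlation is used as the similarity measure (as in the convergence experiment shown later), and all function and variable names are invented for this sketch.

```python
import numpy as np

def corr_sim(x, y):
    """Pearson correlation used as the similarity between two series (illustrative choice)."""
    return np.corrcoef(x, y)[0, 1]

def s_means(X, T, K=1, max_iter=100, seed=None):
    """Sketch of S-means. X: (n_points, n_dims) array; T: similarity threshold."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly initialize K centroids (K = 1 by default).
    centroids = [X[i].astype(float) for i in rng.choice(len(X), size=K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        old = [c.copy() for c in centroids]
        # Step 2: assign each point to its most similar centroid if that similarity
        # exceeds T; otherwise open a new (K+1)-th cluster seeded by the point.
        for i, x in enumerate(X):
            sims = [corr_sim(x, c) for c in centroids]
            best = int(np.argmax(sims))
            if sims[best] > T:
                labels[i] = best
            else:
                centroids.append(x.astype(float))
                labels[i] = len(centroids) - 1
        # Step 3: replace each centroid by the mean of its members; drop empty clusters.
        keep = [k for k in range(len(centroids)) if np.any(labels == k)]
        centroids = [X[labels == k].mean(axis=0) for k in keep]
        labels = np.array([keep.index(l) for l in labels])
        # Stop when no new cluster was formed and no centroid moved.
        if len(centroids) == len(old) and all(np.allclose(a, b) for a, b in zip(old, centroids)):
            break
    return np.array(centroids), labels
```

Note that, as discussed next, a point whose best similarity falls below T opens a new cluster rather than being forced into an existing one.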

Similar to K-means, but with important differences. Major difference: in the second step, any point whose highest similarity to the existing centroids is below the given threshold is placed in a new cluster, whereas K-means forces every point into an existing group even when it should not be. This makes a big difference in the output clusters! Minor difference: K increases by 1 whenever a new cluster forms in an iteration, and decreases by 1 whenever a cluster becomes empty.

Convergence. Fig.: S-means converges on dataset SCT in fewer than 30 iterations. a) The maximum number of clusters is restricted (upper bound set to 6). b) Without restriction (upper bound set to 600). Dataset used: UCI time series dataset, 600 x 60, 6 classes. Similarity used: correlation. Fig.: Accuracy comparison between S-means (dots), fast K-means (stars) and G-means (circles). The confidence level is varied in G-means; T varies from 0 to 0.95 in steps of 0.05 and the upper bound on the number of clusters is set to 20 in S-means. Dataset used: PenDigit, x 32, 10 classes.

Fig.: (a) Samples of simulated gravitational-wave time series. (b) Centroids mined by S-means when T = 0.1. Simulated GW trigger dataset: x 1024.

S-means. S-means eliminates the need to specify K (the number of clusters), as required in K-means clustering; an intuitive parameter, the similarity threshold T, is used instead of K. S-means mines the number of compact clusters in a given dataset without prior knowledge of its statistical distribution. Extensive comparative experiments are needed to further validate the algorithm: for instance, the clustering result is very sensitive to the threshold T, and the number of returned clusters can be unexpectedly large when T is high (e.g., T > 0.4). It will be necessary to evaluate S-means with different similarity measures, such as Dynamic Time Warping and kernel functions, in future work (see the sketch below). Estimated application to LIGO data – March 2008.
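As a pointer to the future-work item on alternative similarity measures, here is the textbook dynamic-time-warping distance (not the authors' code), together with one illustrative way of converting it into a bounded similarity that could replace the correlation similarity in the assignment step of the earlier sketch; the 1/(1+d) conversion is an assumption made for this illustration.

```python
import numpy as np

def dtw_distance(x, y):
    """Classic O(len(x)*len(y)) dynamic time warping distance between two 1-D series."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_sim(x, y):
    """Turn the DTW distance into a similarity in (0, 1]; one illustrative choice."""
    return 1.0 / (1.0 + dtw_distance(x, y))
```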

Why Constrain the Search Space of Cluster Models? [Diagram: a sequence of cluster models M_1, M_2, M_3, M_4, ..., M_i, M_{i+1}, ..., M_k, where M_i denotes a cluster model with i clusters; the models are divided into non-splittable and splittable models, and correct models can only be found among the non-splittable ones (note: this does not mean that any model in that region is correct).] Constrained Validation Clustering Algorithm

More Formal Outline of Key Ideas. Suppose D is a set of data points in R^n. The source of a subset of data points is a function source: P(D) → P(S) (a mapping from the power set of D to the power set of S), where S is a set of stochastic processes (i.e. S = {F | F = {F_t : t ∈ T}}, where F_t is a random variable in a state space and T is "time"). A cluster model M = {C_i | C_i is a subset of D} is a partition of D. Suppose a cluster C_i = X ∪ Y such that X ∩ Y = Ø; then X and Y are the left child and right child of C_i respectively. Given a cluster model M, two clusters C_i and C_j are nearest to each other if the Euclidean distance between the centroids of C_i and C_j is the shortest in M. A cluster C_i is non-splittable if its children X and Y are nearest to each other; otherwise, C_i is splittable. A cluster model M is non-splittable if every cluster C_i in M is non-splittable. A cluster C is homogeneous if |source(C)| = 1; otherwise, C is non-homogeneous. A set of data D has a tolerable amount of noise if, whenever source(Y) = source(Z) and source(Y) ≠ source(X), then d(mean(Y), mean(Z)) < d(mean(Y), mean(X)), where X, Y, Z are subsets of D and d(·) is the Euclidean distance. Theorem: Suppose D is a set of data with a tolerable amount of noise, and M_k is a cluster model in D with exactly k clusters. If every cluster C_i in M_k is homogeneous and source(C_i) ≠ source(C_j) whenever i ≠ j, then M_k is non-splittable. Corollary: If a cluster model M_k in D is splittable and D is a set of data with a tolerable amount of noise, then either there exists a C_i in M_k that is non-homogeneous (i.e. M_k "underfits" the data) or source(C_i) = source(C_j) for some C_i and C_j with i ≠ j (i.e. M_k "overfits" the data).
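A sketch of how the "non-splittable" property above could be checked in practice, interpreting "nearest to each other" as the two children being mutually closer to one another than to any other centroid of the model, and assuming the left/right children are obtained with a 2-means split (the slide defines the property but not the splitting procedure); all names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def is_non_splittable(cluster, other_centroids, seed=0):
    """cluster: (n, d) points of C_i; other_centroids: (m, d) centroids of the
    remaining clusters in model M. Returns True if the children X, Y of C_i
    (obtained here by an assumed 2-means split) are nearest to each other in M."""
    if len(cluster) < 2:
        return True  # nothing to split
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(cluster)
    left, right = km.cluster_centers_          # mean(X), mean(Y)
    d_children = np.linalg.norm(left - right)
    # C_i is splittable if some other centroid is closer to either child
    # than the two children are to each other.
    for c in other_centroids:
        if np.linalg.norm(left - c) < d_children or np.linalg.norm(right - c) < d_children:
            return False
    return True
```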

The Algorithm. [Flowchart: start from a parent model P with clusters C_1, C_2, ..., C_i, ..., C_n. Splitting each cluster yields n candidate children models M_1, ..., M_i, ..., M_n (M_1 built from the children of C_1, and so on). Is there a C_i in P such that C_i is non-homogeneous? No: return P. Yes: choose the M_i that optimizes a cluster model validation index, where C_i is non-homogeneous, set P := M_i, and repeat. (Note on the diagram: P should not overfit the data already.)]
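A schematic Python rendering of the flowchart above. The homogeneity test and the cluster model validation index are not specified on the slide, so is_homogeneous is left as a user-supplied placeholder and the silhouette score is used purely as an example of a validation index; split_cluster reuses the assumed 2-means split from the previous sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def split_cluster(points, seed=0):
    """Split one cluster's points into left/right children with 2-means (assumed)."""
    lab = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(points)
    return points[lab == 0], points[lab == 1]

def cv_clust(X, is_homogeneous, max_clusters=50):
    """Constrained-validation loop: grow the model P one split at a time until
    every cluster passes the user-supplied homogeneity test."""
    P = [X]                                        # parent model: list of point arrays
    while len(P) < max_clusters:
        bad = [i for i, C in enumerate(P) if len(C) > 1 and not is_homogeneous(C)]
        if not bad:
            return P                               # "No" branch: return P
        # "Yes" branch: among candidate children models, pick the one that
        # optimizes a validation index (silhouette used here for illustration).
        best_model, best_score = None, -np.inf
        for i in bad:
            left, right = split_cluster(P[i])
            candidate = P[:i] + [left, right] + P[i + 1:]
            labels = np.concatenate([np.full(len(C), k) for k, C in enumerate(candidate)])
            score = silhouette_score(np.vstack(candidate), labels)
            if score > best_score:
                best_model, best_score = candidate, score
        P = best_model                             # P := M_i, then repeat
    return P
```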

Experimental Results. Cluster data size: x 1024 (20 clusters, each of size 1001). G-means found 2882 clusters (average cluster size = 6.9).

CV Clust conclusion and future direction. G-means detected the existence of Gaussian clusters at the 95% confidence level, but unfortunately the resulting model obviously over-fits the data. The model discovered by CV Clust was much more compact than that found by G-means. Future work includes providing statistical meaning to the results discovered by CV Clust. Tentative schedule of implementation completion – March 2008. Tentative schedule of application to LIGO data – Fall 2008.