Clustering by Compression Rudi Cilibrasi (CWI), Paul Vitanyi (CWI/UvA)


Overview: Input to the software is a set of files. Output is a hierarchical clustering shown as an unrooted binary tree. This is a case of unsupervised learning (example follows).

Process Overview: 1. File translation, if necessary, for example from MIDI to a “player-piano” type format. 2. Calculation of the Normalized Compression Distance, or NCD. 3. Representation as an unrooted binary tree.

What’s Unique? This clustering system is unique in that it can be described as feature-free. There are no parameters to tune, and no domain-specific knowledge went into it. Using general-purpose data compressors gives us a parameterized family of features automatically for each domain.

Featureless Clustering: Having no parameters and no customized features makes the system convenient to develop as well as to use. Since it is based on information-theoretic foundations, it tends to be less brittle than other methods that make considerably more domain-specific assumptions. So how does it work?

MIDI Translation: In order to restrict the information entering the algorithm, we removed undesirable MIDI fields such as artist or composer name, headers, and other non-musical data. We keep only the basic MIDI-track decomposition as well as note timing and duration events. We throw away individual note volume.

Gene Sequence Translation: Genetic sequences are represented in ASCII in a four-letter alphabet: A, T, G, C. Almost no translation is needed.

Image Translation: Black and white images are converted to ASCII using spaces for black and # for white. Newlines are used to separate rows.
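The image translation step can be sketched in a few lines of Python. This is only an illustration: the function name `image_to_ascii` and the input format (a list of rows of booleans, with True meaning a white pixel) are assumptions, not part of the original software.

```python
def image_to_ascii(pixels):
    """Render a black-and-white image as text, one line per pixel row.

    `pixels` is a hypothetical input format: a list of rows, each a list
    of booleans where True means a white pixel and False a black pixel.
    """
    # '#' for white, space for black, as described on the slide.
    rows = ("".join("#" if white else " " for white in row) for row in pixels)
    # A newline terminates every row.
    return "\n".join(rows) + "\n"


# A 3x3 white plus sign on a black background.
image = [
    [False, True, False],
    [True, True, True],
    [False, True, False],
]
text = image_to_ascii(image)
print(text, end="")
```

The resulting text can be handed to a general-purpose compressor exactly like any other file.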

NCD: Once a group of songs has been acquired and translated, a quantity is computed for each pair in the group. This quantity, the Normalized Compression Distance (NCD), measures how different two files are from one another.

NCD: NCD is based on an earlier idea called Normalized Information Distance (NID). In place of a real compressor, NID uses a mathematical abstraction called Kolmogorov complexity, often abbreviated K. K represents a perfect data compressor, and is therefore uncomputable.

NCD: Since we cannot compute K, we approximate it using real general-purpose file compressors such as gzip, bzip2, WinZip, PPMZ, and others. NCD depends on the particular compressor used, and NCD with different compressors may give different results for the same pair of objects.
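To see why the choice of compressor matters, the sketch below compresses the same data with three compressors from the Python standard library. These modules are stand-ins chosen for convenience, not the tools used in the original experiments: `zlib` implements DEFLATE (the algorithm behind gzip), `bz2` wraps bzip2, and `lzma` is an extra that the slides do not mention. The helper name `compressed_sizes` is hypothetical.

```python
import bz2
import lzma
import zlib

# Stand-in compressors (an assumption of this sketch).
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lzma.compress,
}


def compressed_sizes(data):
    """Return the compressed size in bytes of `data` under each compressor."""
    return {name: len(compress(data)) for name, compress in COMPRESSORS.items()}


sample = b"the quick brown fox jumps over the lazy dog " * 200
sizes = compressed_sizes(sample)
for name, size in sizes.items():
    print(f"{name}: {len(sample)} bytes -> {size} bytes")
```

Each compressor gives its own approximation of K, so distances computed from these sizes will generally differ from one compressor to the next.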

NCD: C(x) means “the compressed size of x”. C(xy) means “the compressed size of x and y concatenated”. 0 <= NCD(x,y) <= 1 (roughly).
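The formula itself is not reproduced in this transcript. The definition given in the Clustering by Compression paper is NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)). A minimal Python sketch follows, assuming bzip2 (via the standard `bz2` module) as the compressor; the function names `C` and `ncd` and the test data are illustrative only.

```python
import bz2
import random


def C(data, compress=bz2.compress):
    """Compressed size of `data` in bytes."""
    return len(compress(data))


def ncd(x, y, compress=bz2.compress):
    """Normalized Compression Distance between two byte strings."""
    cx, cy = C(x, compress), C(y, compress)
    cxy = C(x + y, compress)  # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Structured data: the squares 0, 1, 4, 9, ... written as text.
x = b" ".join(str(i * i).encode() for i in range(500))
# Unrelated data: pseudo-random bytes from a fixed seed.
rng = random.Random(0)
y = bytes(rng.getrandbits(8) for _ in range(2000))

d_same = ncd(x, x)
d_diff = ncd(x, y)
print(f"NCD(x, x) = {d_same:.3f}")
print(f"NCD(x, y) = {d_diff:.3f}")
```

With a real compressor NCD(x,x) comes out small rather than exactly zero, and values slightly above 1 are possible, which is why the bounds hold only roughly.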

NCD: NCD measures how similar or different two strings (or, equivalently, files) are. NCD(x,x) = 0, because nothing is different from itself. NCD(x,y) = 1 means that x and y are completely unrelated. Real cases often give less extreme values.

NCD: Computing the NCD of every song with every other song yields a two-dimensional symmetric distance matrix. The next step is transforming this array of distances into something easier to grasp. We use the Quartet Method to construct an unrooted binary tree from the NCD matrix.
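Building the distance matrix can be sketched as below. The sketch makes several assumptions that are not from the slides: it uses `zlib` as the compressor, the standard NCD definition (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), and it enforces symmetry by compressing each pair in one order only and setting the diagonal to 0 by convention (a real compressor gives slightly different sizes for xy and yx). The name `ncd_matrix` and the toy "songs" are hypothetical.

```python
import random
import zlib


def C(data):
    """Compressed size in bytes, using zlib as a stand-in compressor."""
    return len(zlib.compress(data, 9))


def ncd_matrix(items):
    """Pairwise NCD matrix for a list of byte strings."""
    n = len(items)
    sizes = [C(item) for item in items]  # each C(x) is computed once
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            cxy = C(items[i] + items[j])
            d = (cxy - min(sizes[i], sizes[j])) / max(sizes[i], sizes[j])
            matrix[i][j] = matrix[j][i] = d  # mirror to keep it symmetric
    return matrix


rng = random.Random(1)
songs = [
    b"do re mi fa sol la ti do " * 40,
    b"do re mi fa sol la ti do " * 39 + b"do re mi fa sol la si do ",
    bytes(rng.getrandbits(8) for _ in range(1000)),  # unrelated noise
]
matrix = ncd_matrix(songs)
for row in matrix:
    print(" ".join(f"{d:.2f}" for d in row))
```

The two near-identical strings end up much closer to each other than either is to the noise, which is the structure the tree construction then has to reveal.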

Quartet Method: Our algorithm is a slight enhancement of the standard quartet method of tree reconstruction, popular for the last 30 years. The input is a matrix of distances (NCD). The output is an unrooted binary tree topology where each song is at a leaf and each non-leaf node has exactly three connections. The tree is just one visualization of the NCD matrix.
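The building block of quartet methods is easy to illustrate. Any four leaves u, v, w, x can be paired in three ways (uv|wx, uw|vx, ux|vw), and in the Clustering by Compression paper the cost of the pairing uv|wx is d(u,v) + d(w,x). The sketch below only scores quartets and picks the cheapest pairing for each; it does not perform the search for a full tree, and the function names and toy distance matrix are hypothetical.

```python
from itertools import combinations


def quartet_costs(dist, u, v, w, x):
    """Cost of each of the three possible pairings of four leaves."""
    return {
        ((u, v), (w, x)): dist[u][v] + dist[w][x],
        ((u, w), (v, x)): dist[u][w] + dist[v][x],
        ((u, x), (v, w)): dist[u][x] + dist[v][w],
    }


def best_quartet_topologies(dist):
    """For every set of four leaves, pick the pairing with the lowest cost."""
    best = {}
    for quartet in combinations(range(len(dist)), 4):
        costs = quartet_costs(dist, *quartet)
        best[quartet] = min(costs, key=costs.get)
    return best


# Toy distances: items 0 and 1 are close, items 2 and 3 are close.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
best = best_quartet_topologies(dist)
print(best[(0, 1, 2, 3)])
```

A full tree is then chosen so that the pairings it induces over all quartets have low total cost; the number of quartets grows as n choose 4, which is why the search over trees is the expensive part.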

Newer Developments: Since the original Algorithmic Clustering of Music paper, we have further developed the underlying mathematical formalisms upon which the method is based in a new paper, Clustering by Compression. We’ve included experiments from many other areas: biology, astronomy, images…

Current and Future Work: This year, we’ve begun experimenting with automatic conversion from .mp3 (and most other audio formats) to MIDI. This enables us to participate in new emerging spaces. We’re investigating alternatives for all stages of this process, to try to understand more about this apparently general machine learning algorithm.

New Directions: Combination of NCD and Support Vector Machine (SVM) learning to provide scalable generalization in a wide class of domains, both musical and otherwise. Application of our techniques to real outstanding questions within the musical community.

Contact and more info Related papers and information: Software: