Algorithmic Information Theory, Similarity Metrics and Google - Varun Rao

Similar presentations
The Equivalence of Sampling and Searching Scott Aaronson MIT.

Indexing DNA Sequences Using q-Grams
15-583:Algorithms in the Real World
Introduction to Information Retrieval
Applied Algorithmics - week7
Nearest Neighbor Search in High Dimensions Seminar in Algorithms and Geometry Mica Arie-Nachimson and Daniel Glasner April 2009.
Greedy Algorithms Amihood Amir Bar-Ilan University.
Google Similarity Distance Presented by: Akshay Kumar Pankaj Prateek.
The Google Similarity Distance  We’ve been talking about Natural Language parsing  Understanding the meaning in a sentence requires knowing relationships.
SIMS-201 Compressing Information. 2  Overview Chapter 7: Compression Introduction Entropy Huffman coding Universal coding.
Data Compressor---Huffman Encoding and Decoding. Huffman Encoding Compression Typically, in files and messages, Each character requires 1 byte or 8 bits.
Wan Accelerators: Optimizing Network Traffic with Compression Introduction & Motivation Results for Trained Files Another Compression Method Approach &
Effect of Linearization on Normalized Compression Distance Jonathan Mortensen Julia Wu DePaul University July 2009.
Compression-based Unsupervised Clustering of Spectral Signatures D. Cerra, J. Bieniarz, J. Avbelj, P. Reinartz, and R. Mueller WHISPERS, Lisbon,
Information Distance from a Question to an Answer.
©TheMcGraw-Hill Companies, Inc. Permission required for reproduction or display. COMPSCI 125 Introduction to Computer Science I.
Document and Query Forms Chapter 2. 2 Document & Query Forms Q 1. What is a document? A document is a stored data record in any form A document is a stored.
Biology How Does Information/Entropy/ Complexity fit in?
Information retrieval Finding relevant data using irrelevant keys Example: database of photographic images sorted by number, date. DBMS: Well structured.
Genetic Algorithms Sushil J. Louis Evolutionary Computing Systems LAB Dept. of Computer Science University of Nevada, Reno
Clustering by Compression Rudi Cilibrasi (CWI), Paul Vitanyi (CWI/UvA)
Optimal Partitions of Strings: A new class of Burrows-Wheeler Compression Algorithms Raffaele Giancarlo Marinella Sciortino
1 Efficient Discovery of Conserved Patterns Using a Pattern Graph Inge Jonassen Pattern Discovery Arwa Zabian 13/07/2015.
CS146 Overview. Problem Solving by Computing Human Level  Virtual Machine   Actual Computer Virtual Machine Level L0.
Lecture 3. Relation with Information Theory and Symmetry of Information Shannon entropy of random variable X over sample space S: H(X) = ∑ P(X=x) log 1/P(X=x)‏,
CSE Lectures 22 – Huffman codes
Information theory, fitness and sampling semantics colin johnson / university of kent john woodward / university of stirling.
Huffman Coding Vida Movahedi October Contents A simple example Definitions Huffman Coding Algorithm Image Compression.
Experiments in Dialectology with Normalized Compression Metrics Kiril Simov and Petya Osenova Linguistic Modelling Laboratory Bulgarian Academy of Sciences.
Similarity and Denoising Paul Vitanyi CWI, University of Amsterdam.
Anomaly Detection Using Symmetric Compression Benjamin Arai & Chris Baron Computer Science and Engineering Department University of California - Riverside.
Use of Kolmogorov distance identification of web page authorship, topic and domain David Parry Auckland University of Technology New Zealand
Positive and Negative Randomness Paul Vitanyi CWI, University of Amsterdam Joint work with Kolya Vereshchagin.
Similarity Analysis by Data Compression Research carried out by Rudi Cilibrasi, Teemu Roos, Hannes Wettig. CWI is the National Centre of Mathematics and.
Relationships Between Structures “→” ≝ “Can be defined in terms of” Programs Groups Proofs Trees Complex numbers Operators Propositions Graphs Real.
Information theory concepts in software engineering Richard Torkar
DATA REPRESENTATION, DATA STRUCTURES AND DATA MANIPULATION TOPIC 4 CONTENT: 4.1. Number systems 4.2. Floating point binary 4.3. Normalization of floating.
Homework #5 New York University Computer Science Department Data Structures Fall 2008 Eugene Weinstein.
Music Information Retrieval Information Universe Seongmin Lim Dept. of Industrial Engineering Seoul National University.
Efficient Metric Index For Similarity Search Lu Chen, Yunjun Gao, Xinhan Li, Christian S. Jensen, Gang Chen.
Database Management COP4540, SCS, FIU Physical Database Design (ch. 16 & ch. 3)
Kolmogorov Complexity and Universal Distribution Presented by Min Zhou Nov. 18, 2002.
Coding Theory Efficient and Reliable Transfer of Information
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.
1 A Web Search Engine-Based Approach to Measure Semantic Similarity between Words Presenter: Guan-Yu Chen IEEE Trans. on Knowledge & Data Engineering,
1 Algorithms & Data Structures for Games Lecture 2A Minor Games Programming.
Lecture 4: Lossless Compression(1) Hongli Luo Fall 2011.
INTRO TO COMPUTING. Looking Inside Computer 2Computing 2 | Lecture-1 Capabilities Can Read Can Write Can Store A/L Operations Automation.
Physical Database Design Purpose- translate the logical description of data into the technical specifications for storing and retrieving data Goal - create.
Chapter 8 Physical Database Design. Outline Overview of Physical Database Design Inputs of Physical Database Design File Structures Query Optimization.
Main Index Contents 11 Main Index Contents Complete Binary Tree Example Complete Binary Tree Example Maximum and Minimum Heaps Example Maximum and Minimum.
Normalized Web Distance and Word Similarity 施文娴. Introduction Some objects can be given literally ◦The literal four-letter genome of a mouse ◦The literal.
1Computer Sciences Department. 2 Advanced Design and Analysis Techniques TUTORIAL 7.
Measuring the Structural Similarity of Semistructured Documents Using Entropy Sven Helmer University of London, Birkbeck VLDB’07, September 23-28, 2007,
Page 1KUT Graduate Course Data Compression Jun-Ki Min.
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Basics
Huffman Coding (2 nd Method). Huffman coding (2 nd Method)  The Huffman code is a source code. Here word length of the code word approaches the fundamental.
Introductory Lecture. What is Discrete Mathematics? Discrete mathematics is the part of mathematics devoted to the study of discrete (as opposed to continuous)
Estimation of Distribution Algorithm and Genetic Programming Structure Complexity Lab,Seoul National University KIM KANGIL.
Everything is a number Everything in a computer memory and on storages is a number. Number  Number Characters  Number by ASCII code Sounds  Number.
Data Coding Run Length Coding
Information Complexity Lower Bounds
Lecture 6. Prefix Complexity K
School of Computer Science & Engineering
Advanced Algorithms Analysis and Design
Chapter 11 Data Compression
Digital Encodings.
Huffman Coding Greedy Algorithm
Clustering Algorithms for Perceptual Image Hashing
Presentation transcript:

Algorithmic Information Theory, Similarity Metrics and Google - Varun Rao

Algorithmic Information Theory – Kolmogorov Complexity – Information Distance – Normalized Information Distance – Normalized Compression Distance – Normalized Google Distance

Kolmogorov Complexity. The Kolmogorov complexity of a string x is the length, in bits, of the shortest computer program for the fixed reference computing system that produces x as output. [1] Example: the first million bits of Pi vs. the first million bits of your favourite song recording.
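In standard notation (not on the slide): fixing a universal reference machine U, K(x) = min{ |p| : U(p) = x }, the length in bits of the shortest program p that makes U output x. A few lines of code generate the bits of Pi, so their Kolmogorov complexity is tiny; a typical song recording admits no such shortcut, so its complexity is close to its raw length.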

Information Distance. Given two strings x and y, the information distance is the length of the shortest binary program that computes output y from input x, and also output x from input y. [1] ID minorizes all other computable distance metrics, i.e. it is at most any of them up to an additive term.
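In the notation of [3], the information distance is, up to a logarithmic additive term, E(x, y) = max{ K(x|y), K(y|x) }, where K(x|y) is the Kolmogorov complexity of x given y as auxiliary input.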

Normalized Information Distance. Roughly speaking, two objects are deemed close if we can significantly “compress” one given the information in the other, the idea being that if two pieces are more similar, then we can more succinctly describe one given the other. [2] NID is characterized as the most informative metric. Sadly, it is completely and utterly uncomputable.
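The definition, as given in [4]: NID(x, y) = max{ K(x|y), K(y|x) } / max{ K(x), K(y) }. Every term involves the Kolmogorov complexity K, which is exactly why the metric inherits K's uncomputability.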

Normalized Compression Distance. But we do have real, lossless compressors. If C is a compressor and C(x) is the compressed length of a string x, then NCD(x, y) = (C(xy) − min{C(x), C(y)}) / max{C(x), C(y)}. NCD gets closer to NID as the compressor approximates the ultimate compression, Kolmogorov complexity.

Normalized Compression Distance II. Basic process to compute NCD for x and y: –Use the compressor to compute C(x) and C(y). –Append x to y and compute C(xy). –Plug the three lengths into the NCD formula. Then use relatively simple clustering methods with NCD as the similarity metric to group strings; a minimal code sketch follows.
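A minimal sketch of this process in Python, with bzip2 standing in for the compressor C; the choice of compressor and the test strings are illustrative, not from the slides:

```python
import bz2

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, with bzip2 as the compressor C."""
    cx = len(bz2.compress(x))
    cy = len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))  # C(xy): compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)

english = b"the quick brown fox jumps over the lazy dog " * 50
variant = b"the quick brown fox leaps over the lazy dog " * 50
noise = bytes(range(256)) * 9  # unrelated, hard-to-compress data

print(ncd(english, variant))  # small: the two strings share most of their structure
print(ncd(english, noise))    # close to 1: nothing to share
```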

Normalized Compression Distance III. NCD using Bzip2 on various types of files.

Normalized Compression Distance IV. The evolutionary tree built from complete mammalian mtDNA sequences of 24 species. [2]

Normalized Compression Distance V. Clustering of Native-American, Native-African, and Native-European languages (translations of The Universal Declaration of Human Rights). [2]

Normalized Compression Distance VI. Optical Character Recognition using NCD: more complex clustering techniques achieved an 85% success rate, as opposed to the industry standard of 90%-95%. [2]

What about semantic meaning? How different is a horse from a car, or a hawk from a handsaw, for that matter? Compressors are semantically indifferent to their data. To bring in semantic relationships, turn to Google.

Google. A massive database containing lots of information about semantic relationships (“The Quick Brown ___?”). Use simple page counts as indicators of closeness: the relative number of hits serves as a measure of probability, giving a Google distribution, i.e. p(x) = (hits in a search for x) / (total number of pages indexed).

Google II. Given that we can construct a distribution, we can construct a Google Shannon-Fano code (conceptually), because we can apply the Kraft inequality (after some normalization)... ???
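Spelling out the step slightly (my paraphrase of [1]): the Kraft inequality guarantees a prefix code in which the code word for x has length G(x) = log(1/g(x)) bits, where g is the normalized Google distribution above. Treating G(x) as a “compressed length” and substituting it for C(x) in the NCD formula is what produces the distance on the next slide.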

Normalized Google Distance. After all that hand waving, we can create a distance-like metric, NGD, that has all kinds of nice properties.
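The formula behind those properties, as given in [1], with f(x) the number of pages containing x, f(x, y) the number of pages containing both terms, and M the total number of pages indexed: NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log M − min{log f(x), log f(y)}). A small sketch; the page counts below are purely illustrative, since real values would come from live search queries:

```python
import math

def ngd(fx: float, fy: float, fxy: float, m: float) -> float:
    """Normalized Google Distance computed from raw page counts."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(m) - min(lx, ly))

# Hypothetical counts: terms that co-occur often score low, rare co-occurrence scores high.
print(ngd(fx=5e7, fy=2e7, fxy=1.5e7, m=1e10))  # ~0.2: closely related terms
print(ngd(fx=5e7, fy=1e6, fxy=2e3, m=1e10))    # ~1.1: weakly related terms
```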

Applying NGD. NGD as applied to 15 painting names by 3 Dutch artists.

Applying NGD II. Using SVM to learn the concept of primes. [2]

Applying NGD III. Using SVM to learn “electrical” terms. [2]

Applying NGD IV. Using SVM to learn “religious” terms. [2]

Sources
1. R. Cilibrasi and P. Vitányi, “Automatic Meaning Discovery Using Google”.
2. R. Cilibrasi and P. Vitányi, “Clustering by Compression”, submitted to IEEE Trans. Information Theory.
3. C.H. Bennett, P. Gács, M. Li, P.M.B. Vitányi, W. Zurek, “Information Distance”, IEEE Trans. Information Theory, 44:4 (1998), 1407–1423.
4. M. Li, X. Chen, X. Li, B. Ma, P. Vitányi, “The Similarity Metric”, IEEE Trans. Information Theory, 50:12 (2004), 3250–3264.
5. “Algorithmic Information Theory”, Wikipedia, accessed 25th January.
6. Greg Harfst, “Kolmogorov Complexity”, accessed 25th January.