Experiments in Dialectology with Normalized Compression Metrics. Kiril Simov and Petya Osenova, Linguistic Modelling Laboratory, Bulgarian Academy of Sciences. General Seminar in Informatics, IMI, BAS, 15 February 2006

2 Plan of the Talk
– Similarity Metrics Based on Compression (based on: Rudi Cilibrasi and Paul Vitanyi, Clustering by Compression, IEEE Trans. Information Theory, 51(4), 2005; an earlier version appeared in 2003)
– Experiments
– Conclusion
– Future Work

3 Feature-Based Similarity
Task: establishing similarity between different data sets
– Each data set is characterized by a set of features and their values
– Different classifiers can be used to define similarity
Problem: defining the features and deciding which features are important

4 Non-Feature Similarity
The same task: establishing similarity between different data sets
– No individual features are compared
– A single similarity metric covers all features
Problem: the features that are important and play a major role remain hidden in the data

5 Similarity Metric
Metric: a distance function d(.,.) such that:
– d(a,b) ≥ 0, and d(a,b) = 0 iff a = b
– d(a,b) = d(b,a) (symmetry)
– d(a,b) ≤ d(a,c) + d(c,b) (triangle inequality)
Density: for each object there are objects at different distances from it
Normalization: the distance between two objects depends on the size of the objects; distances lie in the interval [0,1]

6 Kolmogorov Complexity
For each file x, k(x) (the Kolmogorov complexity of x) is the length in bits of the ultimately compressed version of the file x (uncomputable)
A metric based on Kolmogorov complexity:
– k(x|y) is the Kolmogorov complexity of x when y is known
– The joint complexity k(x,y) ≈ k(xy), where xy is the concatenation of x and y
k(x,y) is almost a metric:
– k(x,x) = k(xx) ≈ k(x)
– k(x,y) = k(y,x)
– k(x,y) ≤ k(x,z) + k(z,y)

7 Normalized Kolmogorov Metric
A normalized Kolmogorov metric also has to take the Kolmogorov complexities of x and y into account. We can see that:
min(k(x),k(y)) ≤ k(x,y) ≤ k(x) + k(y)
0 ≤ k(x,y) − min(k(x),k(y)) ≤ k(x) + k(y) − min(k(x),k(y))
0 ≤ k(x,y) − min(k(x),k(y)) ≤ max(k(x),k(y))
0 ≤ ( k(x,y) − min(k(x),k(y)) ) / max(k(x),k(y)) ≤ 1

8 Normalized Compression Distance
Kolmogorov complexity is uncomputable, so it can only be approximated by a real-life compressor c
The normalized compression distance ncd(.,.) is defined by:
ncd(x,y) = ( c(x,y) − min(c(x),c(y)) ) / max(c(x),c(y))
where c(x) is the size of the compressed file x and c(x,y) is the size of the compressed concatenation xy
The properties of ncd(.,.) depend on the compressor c
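A minimal sketch of ncd(.,.) in Python, using the standard lzma module as the real-life compressor c (7-zip, mentioned later in the talk as the best compressor so far, is also LZMA-based); all names and sample strings here are illustrative, not from the original experiments.

```python
import lzma

def c(data: bytes) -> int:
    """Size in bytes of the compressed version of the data."""
    return len(lzma.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, approximating c(x,y) by c(xy)."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Similar inputs should come out closer than dissimilar ones.
a = b"mliako sirene hliab " * 300
b2 = b"mliaku sirene hliap " * 300   # slightly altered 'dialect' forms
z = b"napalno nesvarzano sadarzhanie " * 300
print(ncd(a, b2), "<", ncd(a, z))
```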

9 Normal Compressor
The compressor c is normal if it satisfies (asymptotically in the length of the files):
– Stream-basedness: first x, then y
– Idempotency: c(xx) = c(x)
– Symmetry: c(xy) = c(yx)
– Distributivity: c(xy) + c(z) ≤ c(xz) + c(yz)
If c is normal, then ncd(.,.) is a similarity metric
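A quick empirical check (not a proof) of two of these properties for a concrete compressor; the sample strings are arbitrary. The deviations show how far a real compressor is from the ideal normal compressor.

```python
import lzma

def c(d: bytes) -> int:
    """Compressed size in bytes."""
    return len(lzma.compress(d))

x = b"selo Alfatar " * 2000
y = b"selo Babek " * 2000

print("idempotency: c(xx) =", c(x + x), " c(x) =", c(x))   # should be close
print("symmetry:    c(xy) =", c(x + y), " c(yx) =", c(y + x))  # should be close
```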

10 Problems with ncd(.,.)
– Real compressors are imperfect, thus ncd(.,.) is imperfect
– Good results can be obtained only for large data sets
– Each feature in the data set is a basis for comparison
– Most compressors are byte-based, thus some intra-byte features cannot be captured well

11 Real Compressors Are Imperfect
– For a small data set the compressed size depends on additional information, like version numbers, etc.
  – The compressed file can be bigger than the original file
– Some small reorderings of the data play no role in the size of the compression
  – A series 'a b a b' is treated the same as 'a a b b'
– Substituting one letter for another can have no impact
– Cycles in the data (most of them) are captured by the compressors
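The reordering effect above can be observed directly: two strings with the same symbol counts but different local order compress to nearly the same size. A minimal demonstration with Python's lzma (the strings are arbitrary):

```python
import lzma

s1 = b"ab" * 10000    # the pattern 'a b a b ...'
s2 = b"aabb" * 5000   # the reordering 'a a b b ...', same symbol counts
# Both are highly periodic, so their compressed sizes are nearly equal,
# even though the local order differs.
print(len(lzma.compress(s1)), len(lzma.compress(s2)))
```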

12 Large Dialectological Data Sets
Ideally, large, naturally created dialectological data sets are necessary
In practice, we can try to create such data by:
– Simulating 'naturalness'
– Hiding features that are unimportant to the comparison of dialects
– Encoding that allows direct comparison of the important features: p vs. b (no shared symbol), p vs. p* (shared symbol); see the sketch below
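A hedged illustration of this encoding idea: represent a dialect variant as the standard symbol plus a modifier ("p*") rather than as an unrelated symbol ("b"), so a byte-based compressor can still match the shared base symbol across sites. The mappings below are invented for illustration and are not the transcription scheme used in the talk.

```python
hide_feature   = {"p": "b"}    # variant as a new symbol: the shared 'p' is lost
expose_feature = {"p": "p*"}   # variant as base + modifier: the shared 'p' is kept

def encode(word: str, mapping: dict) -> str:
    return "".join(mapping.get(ch, ch) for ch in word)

print(encode("pesen", hide_feature))    # 'besen'  (no common 'p' to match)
print(encode("pesen", expose_feature))  # 'p*esen' (common 'p' preserved)
```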

13 Generation of Dialectological Data Sets
We decided to generate dialectological 'texts'
First we did some experiments with non-dialectological data in order to study the characteristics of the compressor. The results show:
– The repetition of the lexical items has to be non-cyclic
– The explication of the features needs to be systematic
– The linear order has to be the same for each site

14 Experiment Setup
We used the 36 words from Petya's experiments in Groningen, transcribed in X-SAMPA
We selected ten villages, which are grouped into three clusters by the methods developed in Groningen:
– [Alfatar, Kulina-voda]
– [Babek, Malomir, Srem]
– [Butovo, Bylgarsko-Slivovo, Hadjidimitrovo, Kozlovets, Tsarevets]

15 Corpus-Based Text Generation
The idea is for the result to be as close as possible to a natural text. We performed the following steps (sketched below):
– From a corpus of about 55 million words we deleted all word forms except for the 36 from the list
– Then we concatenated all remaining word forms into one document
– For each site we substituted the standard word forms with the corresponding dialect word forms
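A minimal sketch of these three steps, assuming a tokenized plain-text corpus; "corpus.txt", words_36, and the per-site substitution table are hypothetical stand-ins for the actual 55-million-word corpus, the 36-word list, and the dialect transcriptions.

```python
words_36 = {"mliako", "niama", "mlekar"}   # stand-ins for the 36 word forms

def site_text(tokens, dialect_forms):
    # step 1: keep only the 36 word forms, in corpus order (non-cyclic repetition)
    kept = [w for w in tokens if w in words_36]
    # steps 2-3: concatenate, substituting each standard form with the site's form
    return " ".join(dialect_forms.get(w, w) for w in kept)

tokens = open("corpus.txt", encoding="utf-8").read().split()
alfatar_text = site_text(tokens, {"mliako": "mliaku"})  # hypothetical mapping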

16 Distances for Corpus-Based Text
(Pairwise distance matrix over the ten villages: Alfatar, Babek, Butovo, Bylgarsko-Slivovo, Hadjidimitrovo, Kozlovets, Kulina-voda, Malomir, Srem, Tsarevets; the numeric values are not preserved in the transcript.)

17 Clusters According to Corpus-Based Text
– [Kulina-voda]
– [Alfatar]
– [Babek]
– [Hadjidimitrovo, Malomir, Srem]
– [Butovo, Bylgarsko-Slivovo, Kozlovets, Tsarevets]
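The slides do not state which algorithm turns the NCD matrix into these clusters; one common choice is agglomerative hierarchical clustering, sketched here with SciPy on a placeholder 10x10 distance matrix D (a real D would come from ncd() over the generated site texts).

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

villages = ["Alfatar", "Babek", "Butovo", "Bylgarsko-Slivovo",
            "Hadjidimitrovo", "Kozlovets", "Kulina-voda",
            "Malomir", "Srem", "Tsarevets"]

# Placeholder symmetric distance matrix with zero diagonal.
rng = np.random.default_rng(0)
D = rng.random((10, 10))
D = (D + D.T) / 2
np.fill_diagonal(D, 0)

Z = linkage(squareform(D), method="average")      # agglomerative clustering
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
print(dict(zip(villages, labels)))
```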

18 Some Preliminary Analyses
– More frequent word forms play a bigger role: няма ('there is not') occurs many times vs. млекар ('milkman') – 5 times among the word forms
– The repetition of the word forms is not easily predictable, and thus close to a natural text

19 Permutation-Based Text Generation
The idea is for the result to have as unpredictable a linear order as possible. We performed the following steps:
– All 36 words were manually segmented into meaningful segments: ['t_S','i','"r','E','S','a']
– Then for each site we generated all permutations of each word's segments and concatenated them: ["b,E,l,i]["b,E,i,l]["b,l,E,i]["b,l,i,E]["b,i,E,l]["b,i,l,E][E,"b,l,i]...
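A sketch of this generation step: itertools.permutations enumerates the orderings in exactly the sequence shown on the slide. The segment list and bracketed output format are taken from the slide; the function name is illustrative.

```python
from itertools import permutations

def permuted_word(segments):
    # concatenate every ordering of the word's segments,
    # in the bracketed format shown on the slide
    return "".join("[" + ",".join(p) + "]" for p in permutations(segments))

print(permuted_word(['"b', 'E', 'l', 'i'])[:60])
# ["b,E,l,i]["b,E,i,l]["b,l,E,i]["b,l,i,E]["b,i,E,l]["b,i,l,E]...
```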

20 Distances for Permutation-Based Text
(Pairwise distance matrix over the same ten villages; the numeric values are not preserved in the transcript.)

21 Clusters According to Permutation-Based Text
– [Kulina-voda, Alfatar, Malomir]
– [Babek, Srem]
– [Hadjidimitrovo, Butovo, Bylgarsko-Slivovo, Kozlovets, Tsarevets]

22 Conclusions
– Compression methods are feasible with generated data sets
– Two different measurements of the distance between dialects:
  – Presence of given features
  – Additionally, the distribution of the features

23 Future Work
– Evaluation with different compressors (7-zip is the best for the moment)
– Better explication of the features
– Better text generation: more words and application of (reliable) rules
– Implementation of the whole pipeline for applying the method
– Comparison with other methods
– Expert validation (human intuition)