Use of Kolmogorov distance identification of web page authorship, topic and domain David Parry Auckland University of Technology New Zealand
Overview Problem Statement Kolmogorov distance Experimental methods Results Clustering Conclusions
Problem statement It is often desirable for information retrieval systems to calculate a measure of similarity between documents. Similarity measures generally rely on some sort of parsing, or understanding of documents, but effective parsing often depends on detailed knowledge of document structure.
General-purpose similarity Acts on any string of data points. Useful for: –Clustering –Verification –Filtering –Motif analysis –Exception detection.
Use of the “zip” technique In 2002, Benedetto, Caglioti, & Loreto used the “zip” compression algorithm to identify the language of documents. The technique involved concatenating a known-language file with an unknown one and comparing the lengths of the resulting zipped files. The shortest concatenated zip file occurred when the known file was written in the same language as the unknown file.
Extensions to this technique This approach was also used for author confirmation. A hierarchical clustering algorithm was used to construct language trees.
Kolmogorov Distance Li, Chen, Li, Ma, & Vitányi: assume C(A|B) is the compressed size of A using the compression dictionary used in compressing B, and vice versa for C(B|A), and that C(A) and C(B) are the compressed lengths of A and B using their own compression dictionaries. The Kolmogorov distance between A and B, D(A,B), is given by:
Modified approach Obtain the two files, file 1 and file 2. Concatenate them in two ways: file 1 + file 2 = file 12, and file 2 + file 1 = file 21. Calculate the compressed length of file 1 as zip 1, file 2 as zip 2, file 12 as zip 12, and file 21 as zip 21. The Kolmogorov distance (D) is then given by:
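The four compressed lengths can be computed in a few lines. The sketch below is illustrative only: it assumes Python's zlib (DEFLATE) as a substitute for the zip tool, and because the formula for D does not survive in this text, `stand_in_distance` uses the normalized compression distance (NCD) of Cilibrasi and Vitányi, averaged over both concatenation orders, as a clearly labelled stand-in. All function names are hypothetical.

```python
import zlib


def zip_len(data: bytes) -> int:
    """Compressed length in bytes (zlib/DEFLATE standing in for zip)."""
    return len(zlib.compress(data, 9))


def compressed_lengths(file1: bytes, file2: bytes) -> dict:
    """Return the four lengths named in the modified approach."""
    return {
        "zip1": zip_len(file1),
        "zip2": zip_len(file2),
        "zip12": zip_len(file1 + file2),  # file 1 followed by file 2
        "zip21": zip_len(file2 + file1),  # file 2 followed by file 1
    }


def stand_in_distance(file1: bytes, file2: bytes) -> float:
    """ASSUMPTION: NCD averaged over both orders, not the presentation's D."""
    z = compressed_lengths(file1, file2)
    smaller, larger = sorted((z["zip1"], z["zip2"]))
    ncd12 = (z["zip12"] - smaller) / larger
    ncd21 = (z["zip21"] - smaller) / larger
    return (ncd12 + ncd21) / 2


text = b"the quick brown fox jumps over the lazy dog " * 20
unrelated = bytes(range(256)) * 4
print(stand_in_distance(text, text), stand_in_distance(text, unrelated))
```

A file should come out closer to itself than to unrelated bytes. DEFLATE only looks back 32 KB, so with this compressor two long files share little dictionary context once the first exceeds that window.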
Experiments Author identification from an online discussion board. Domain detection from sets of WWW pages. Topic detection from a collection of related WWW pages.
Methods Load files from the WWW. Compare the test file with 10 others, one of which is by the same author, from the same domain, or on the same topic, depending on the experiment. Use the modified Kolmogorov distance algorithm. Select the combination with the shortest distance.
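The selection step can be sketched as a nearest-neighbour choice over the candidate files. This is a minimal illustration, not the code used in the experiments: `distance` is plain NCD computed with zlib, an assumption standing in for the modified Kolmogorov distance, and `closest`, `is_hit`, and the sample data are hypothetical.

```python
import zlib


def clen(data: bytes) -> int:
    """Compressed length in bytes using zlib at maximum compression."""
    return len(zlib.compress(data, 9))


def distance(a: bytes, b: bytes) -> float:
    """ASSUMPTION: plain NCD as a stand-in for the modified distance."""
    ca, cb = clen(a), clen(b)
    return (clen(a + b) - min(ca, cb)) / max(ca, cb)


def closest(test_file: bytes, candidates: dict) -> str:
    """Name of the candidate with the shortest distance to test_file."""
    return min(candidates, key=lambda name: distance(test_file, candidates[name]))


def is_hit(test_file: bytes, candidates: dict, expected: str) -> bool:
    """True when the closest candidate is the one that should match."""
    return closest(test_file, candidates) == expected


test_file = b"compression based similarity needs no parsing of the document " * 10
candidates = {
    "related": b"compression based similarity needs no parsing of the page " * 10,
    "digits": b"0123456789" * 100,
    "bytes": bytes(range(256)) * 4,
}
print(closest(test_file, candidates))
```

In the experiments each test file has 10 candidates, so counting how often `is_hit` is true over all test files gives the percentage that is compared against the 10% expected by chance.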
Analysis Chi-squared used to analyse the results. Not really an IR system, as the number of documents “retrieved” always =1, from 10. Precision can be related to the percentage of times when the lowest Kolmogorov distance is found for the desired outcome.
Results – Authorship
Status | Percent shortest KD | Percent in sample
Author1<>Author2 | 51.88% | 90%
Author1=Author2 | 48.13% | 10%
Using chi-squared, this result is significant at the p<0.001 level (SPSS 11): χ²(1, N=160) = 258, p<0.001 (initial documents, 1600 total).
Web domains sampled Table columns: Domain Name, Number of Pages, Average File Length. Rows: AUT, OBGYN, Microsoft, Hon, Apple, Guardian, Total.
Results – Web domain
Status | Percent lowest KD | Percent in sample
Different Domain | 18.75% | 90%
Same Domain | 81.25% | 10%
Using chi-squared, this result is significant at the p<0.001 level: χ²(1, N=80) = 451, p<0.001. Seed files were drawn from 6 domains.
Results - Topics
Source | Occurrences with shortest distance | Percent in sample
Different topic domain | 17.89% | 90%
Same topic domain | 82.11% | 10%
χ²(1, N=665) = 3839, p<0.001
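The reported statistics can be checked approximately by hand. The sketch below assumes a one-degree-of-freedom Pearson goodness-of-fit test against the 10%/90% chance split, and assumes the hit counts are the reported percentages multiplied by N (77 of 160, 65 of 80, 546 of 665). Neither assumption is stated outright in the slides, and `chi_squared_gof` is a hypothetical helper name.

```python
def chi_squared_gof(hits: int, n: int, p_chance: float = 0.1) -> float:
    """Pearson chi-squared statistic (1 d.f.) for hits versus chance."""
    misses = n - hits
    expected_hits = n * p_chance
    expected_misses = n - expected_hits
    return ((hits - expected_hits) ** 2 / expected_hits
            + (misses - expected_misses) ** 2 / expected_misses)


# Hit counts reconstructed from the reported percentages (an assumption).
authorship = chi_squared_gof(77, 160)   # 48.13% of 160
domain = chi_squared_gof(65, 80)        # 81.25% of 80
topic = chi_squared_gof(546, 665)       # 82.11% of 665

print(round(authorship), round(domain), round(topic))
```

Under these assumptions the authorship and domain values round to the reported 258 and 451. The topic value comes out near 3842 rather than the reported 3839, a small gap that may reflect rounding in the published percentages or a slightly different procedure.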
Conclusions The modified Kolmogorov distance algorithm is capable of identifying related documents more often than chance. This distance measure does not rely on parsing or semantic analysis. This method may have application as part of an IR system.