Use of Kolmogorov distance identification of web page authorship, topic and domain
David Parry, Auckland University of Technology, New Zealand


Overview
– Problem statement
– Kolmogorov distance
– Experimental methods
– Results
– Clustering
– Conclusions

Problem statement
It is often desirable for information retrieval systems to calculate a measure of similarity between documents. Similarity measures generally rely on some sort of parsing, or understanding of documents, but effective parsing often depends on detailed knowledge of document structure.

General-purpose similarity
Acts on any string of data points. Useful for:
– Clustering
– Verification
– Filtering
– Motif analysis
– Exception detection

Use of the “zip” technique
In 2002 Benedetto, Caglioti, & Loreto used the “Zip” compression algorithm to identify the language of documents. The technique involved concatenating a file in a known language with a file in an unknown language and comparing the lengths of the resulting zipped files. The shortest concatenated zip file occurred when the known file was written in the same language as the unknown file.
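The concatenate-and-compress idea can be sketched in a few lines. This is a minimal illustration, not the original authors' code: `zlib` (DEFLATE) stands in for the Zip tool, the names `zipped_len` and `guess_language` are hypothetical, and subtracting each reference file's own zipped length (so that references of different sizes remain comparable) is an assumption that the slide does not state.

```python
import zlib


def zipped_len(data: bytes) -> int:
    """Length in bytes of the DEFLATE-compressed data (stand-in for a zip file)."""
    return len(zlib.compress(data, 9))


def guess_language(unknown: bytes, references: dict) -> str:
    """Return the key of the reference text that 'absorbs' the unknown text best.

    references maps a language label to a sample text in that language.
    """
    def growth(reference: bytes) -> int:
        # Extra compressed bytes needed once the unknown text is appended.
        # Normalising by the reference's own size is an assumption of this sketch.
        return zipped_len(reference + unknown) - zipped_len(reference)

    return min(references, key=lambda label: growth(references[label]))
```

The smaller the growth, the more of the unknown text the compressor could encode by pointing back into the reference, which is the effect the technique relies on.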

Extensions to this technique
This approach was also used for author confirmation. A hierarchical clustering algorithm was used to construct language trees.

Kolmogorov Distance
Li, Chen, Li, Ma, & Vitányi: assume C(A|B) is the compressed size of A using the compression dictionary used in compressing B, and vice versa for C(B|A); C(A) and C(B) represent the compressed lengths of A and B using their own compression dictionaries. The Kolmogorov distance between A and B, D(A,B), is given by:

Modified approach
1. Obtain the two files, file1 and file2.
2. Concatenate them in two ways: file1 + file2 = file12 and file2 + file1 = file21.
3. Calculate the compressed length of file1 as zip1, file2 as zip2, file12 as zip12, and file21 as zip21.
The Kolmogorov distance (D) is then given by:
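The formula from this slide is not present in the transcript, so the sketch below does not reproduce it. It follows the listed steps (compute zip1, zip2, zip12 and zip21) and then substitutes a different, well-known measure, the normalized compression distance of Cilibrasi and Vitányi, averaged over both concatenation orders. The averaging, the use of `zlib`, and the names `compressed_len` and `compression_distance` are all assumptions made for illustration.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Compressed length in bytes, using zlib as a stand-in for zip."""
    return len(zlib.compress(data, 9))


def compression_distance(file1: bytes, file2: bytes) -> float:
    """Normalized compression distance, symmetrized over concatenation order.

    NOT the slide's formula (which did not survive); a substituted measure.
    """
    zip1 = compressed_len(file1)
    zip2 = compressed_len(file2)
    zip12 = compressed_len(file1 + file2)  # file1 followed by file2
    zip21 = compressed_len(file2 + file1)  # file2 followed by file1
    joint = (zip12 + zip21) / 2            # assumption: average both orders
    return (joint - min(zip1, zip2)) / max(zip1, zip2)
```

Values near 0 indicate that one file adds little new information to the other; values near 1 indicate that the files share almost nothing the compressor can exploit.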

Experiments
– Author identification from an online discussion board
– Domain detection from sets of WWW pages
– Topic detection from a collection of related WWW pages

Methods
– Load files from the WWW.
– Compare the test file with 10 others, one of which is {by the same author, from the same domain, on the same topic}.
– Use the modified Kolmogorov distance algorithm.
– Select the combination with the shortest distance.
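The selection step amounts to a nearest-neighbour lookup over the candidate files. The sketch below is one way to write it, assuming the documents are already loaded as bytes. The distance used here is a stand-in (normalized compression distance via `zlib`), since the modified formula itself is not in this transcript; `closest_candidate` and the helper names are hypothetical.

```python
import zlib
from typing import Sequence


def _clen(data: bytes) -> int:
    """Compressed length in bytes (zlib as a stand-in for zip)."""
    return len(zlib.compress(data, 9))


def _distance(a: bytes, b: bytes) -> float:
    """Stand-in distance: normalized compression distance, both orders averaged."""
    joint = (_clen(a + b) + _clen(b + a)) / 2
    return (joint - min(_clen(a), _clen(b))) / max(_clen(a), _clen(b))


def closest_candidate(test_doc: bytes, candidates: Sequence[bytes]) -> int:
    """Index of the candidate with the shortest distance to test_doc."""
    return min(range(len(candidates)),
               key=lambda i: _distance(test_doc, candidates[i]))
```

In the experimental setup described above, `candidates` would hold the 10 comparison files, and a trial counts as a success when the returned index is that of the file sharing the author, domain, or topic.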

Analysis
Chi-squared was used to analyse the results. This is not really an IR system, as the number of documents “retrieved” is always 1, out of 10. Precision can be related to the percentage of trials in which the lowest Kolmogorov distance is found for the desired outcome.

Results – Authorship
Status               Percent shortest KD   Percent in sample
Author1 <> Author2   51.88%                90%
Author1 = Author2    48.13%                10%
Using chi-squared, this result is significant at the p < 0.001 level (SPSS 11): χ²(1, N = 160) = 258, p < 0.001. [...] initial documents, 1600 total.
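The reported statistic can be checked by hand if the analysis is read as a chi-squared goodness-of-fit test against the 10% chance rate. That reading is an assumption, since the slides do not name the test variant. The observed count is recovered from the stated percentage (48.13% of N = 160 is 77 trials), and `chi_squared_gof` is a hypothetical helper name.

```python
def chi_squared_gof(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


n = 160                      # number of trials, from the slide
same_author = 77             # 48.13% of 160, recovered from the percentage
observed = [same_author, n - same_author]
expected = [n * 1 / 10, n * 9 / 10]   # chance: 1 matching file among 10

statistic = chi_squared_gof(observed, expected)
print(round(statistic, 1))   # 258.4
```

The result, about 258.4 with one degree of freedom, agrees with the value of 258 given on the slide.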

Web domains sampled
Domain Name   Number of Pages   Average File Length
AUT
OBGYN
Microsoft
Hon
Apple
Guardian
Total

Results – Web domain
Status             Percent lowest KD   Percent in sample
Different domain   18.75%              90%
Same domain        81.25%              10%
Using chi-squared, this result is significant at the p < 0.001 level: χ²(1, N = 80) = 451, p < 0.001. [...] seed files, from 6 domains.

Results – Topics
Source                   Occurrences with shortest distance   Percent in sample
Different topic domain   17.89%                               90%
Same topic domain        82.11%                               10%
χ²(1, N = 665) = 3839, p < 0.001

Conclusions
– The modified Kolmogorov distance algorithm is capable of identifying related documents more often than chance.
– This distance measure does not rely on parsing or semantic analysis.
– This method may have application as part of an IR system.