1 Similarity of Documents and Document Collections using attributes with low noise Chris Biemann, Uwe Quasthoff Ifi, NLP Department University of Leipzig,

Presentation transcript:

1 Similarity of Documents and Document Collections using attributes with low noise Chris Biemann, Uwe Quasthoff Ifi, NLP Department University of Leipzig, Germany Monday 5, 2007 WEBIST07 Barcelona

2 Outline
Motivation
Attributes with Low Noise
- Low frequency terms
- Link similarity
Chinese Whispers Graph Clustering
Experimental Results
- Low frequency terms
- Link similarity
Conclusion

3 Motivation
Document clustering groups documents into meaningful clusters that can be used for
- document collection overview
- associative browsing
- basis for multi-document summarisation
- ...
In the WWW, documents can be characterized by at least
- terms contained in the document
- (external) links from and to the document
In a WWW setting, the clustering algorithm must be efficient, as datasets are huge.
We use a graph representation and graph clustering.

4 Low Frequency Terms
The more low-frequency terms two documents share, the more similar they are.
For IR this is not a good idea, but for clustering it is.
Restricting to low-frequency terms reduces noise (no stop words) and allows efficient computation of the similarity graph:
For each word do {
  list all pairs of documents containing this word;
  sort the resulting list of pairs;
}
For each pair (i,j) in this list, count the number of occurrences as s_ij;
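The pair-counting step above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name, the `max_doc_freq` cut-off and the toy documents are assumptions, and a hash-based counter is used in place of the sort-and-count step named on the slide.

```python
from collections import Counter, defaultdict
from itertools import combinations


def low_frequency_similarity(docs, max_doc_freq=3):
    """Count shared low-frequency terms for every pair of documents.

    docs maps a document id to its list of tokens. max_doc_freq is an
    assumed cut-off: terms occurring in more documents are ignored.
    """
    # Inverted index: term -> set of documents containing the term.
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for term in set(tokens):
            postings[term].add(doc_id)

    # For each low-frequency term, list all document pairs sharing it
    # and count how often each pair occurs; that count is s_ij.
    s = Counter()
    for term, doc_ids in postings.items():
        if 2 <= len(doc_ids) <= max_doc_freq:
            for i, j in combinations(sorted(doc_ids), 2):
                s[(i, j)] += 1
    return s


# Hypothetical toy collection; "the" occurs everywhere and is filtered out.
docs = {
    "d1": ["the", "whispers", "graph", "leipzig"],
    "d2": ["the", "graph", "leipzig", "barcelona"],
    "d3": ["the", "barcelona", "purity"],
}
sim = low_frequency_similarity(docs, max_doc_freq=2)
```

Because only rare terms generate pairs, the number of pairs per term stays small, which is what keeps the construction cheap on large collections.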

5 Co-occurrence of links
Web pages are regarded as more similar the more often other pages contain links to both.
External links are a good source of information, as they are normally set deliberately by human authors.
Co-occurrence computation is a standard method in NLP and can be performed efficiently.
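Link co-occurrence can be counted with the same pair-listing idea. The sketch below assumes the link data is available as a mapping from a source page to the pages it links to; the function name and the example hosts are hypothetical, and no significance weighting is applied.

```python
from collections import Counter
from itertools import combinations


def link_cooccurrence(outlinks):
    """For each pair of target pages, count the source pages linking to both."""
    counts = Counter()
    for source, targets in outlinks.items():
        # Drop duplicate links and links from a page to itself.
        unique = sorted(set(targets) - {source})
        for a, b in combinations(unique, 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical link data: source page -> list of linked pages.
outlinks = {
    "p1": ["a.example", "b.example", "c.example"],
    "p2": ["a.example", "b.example"],
    "p3": ["b.example", "c.example", "b.example"],
}
cooc = link_cooccurrence(outlinks)
```

The resulting counts can serve directly as edge weights of the similarity graph between the linked pages.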

6 Graph Representation
Many datasets are naturally represented as graphs, with nodes encoding entities and edges encoding their relations.
Many naturally occurring graphs possess the small-world property and, in particular, exhibit skewed distributions that are not captured well by vector space models.
Here, documents form nodes, and edges indicate statistically extracted relations between them.

7 Dataset: Terms
Part of the German press newswire of the year 2000
202,086 documents, classified into 309 classes
The classification is used to measure quality
[Figure: class size distribution]

8 Dataset: Links
Part of the German Web
No classification available -> manual evaluation
Two datasets: servers and URLs

type      # nodes     # of edges   # nodes with edges
servers   2,201,421   18,892,068   876,577
URLs      680,239     19,465,650   624,332

9 Chinese Whispers Algorithm
Nodes have a class and communicate it to their adjacent nodes.
A node adopts the majority class in its neighbourhood (one of them in case of a tie).
Nodes are processed in random order for some iterations.
Algorithm:
initialize:
  forall v_i in V: class(v_i) = i;
while changes:
  forall v in V, randomized order:
    class(v) = highest ranked class in neighborhood of v;
[Figure: example graph with nodes A, B, C, D, E, their class labels, and degrees deg=1 to deg=5]
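The pseudocode above translates almost line by line into Python. The following is a minimal sketch under stated assumptions rather than the released implementation: "highest ranked class" is taken to mean the class with the largest summed edge weight among the neighbours, ties are broken at random, and an iteration cap (`max_iterations`, a hypothetical parameter) guards against the non-convergent case.

```python
import random
from collections import defaultdict


def chinese_whispers(edges, max_iterations=20, seed=0):
    """Cluster an undirected weighted graph given as {(u, v): weight}."""
    rng = random.Random(seed)
    neighbours = defaultdict(dict)
    for (u, v), w in edges.items():
        neighbours[u][v] = w
        neighbours[v][u] = w

    # Initialisation: every node starts in its own class.
    label = {node: node for node in neighbours}
    nodes = list(neighbours)

    for _ in range(max_iterations):
        changed = False
        rng.shuffle(nodes)  # process nodes in random order
        for v in nodes:
            # Sum the edge weights per class in the neighbourhood of v.
            score = defaultdict(float)
            for u, w in neighbours[v].items():
                score[label[u]] += w
            best = max(score.values())
            # Assumed tie handling: pick one of the top classes at random.
            candidates = sorted(c for c, sc in score.items() if sc == best)
            new_label = rng.choice(candidates)
            if new_label != label[v]:
                label[v] = new_label
                changed = True
        if not changed:
            break
    return label


# Two disconnected triangles: labels can never cross between them.
edges = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("x", "y"): 1.0, ("y", "z"): 1.0, ("x", "z"): 1.0,
}
labels = chinese_whispers(edges)
```

Each pass touches every edge a constant number of times, which matches the time-linear behaviour claimed for the algorithm; the result may differ between seeds because of the random order and the random tie-breaking.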

10 Example: CW-Partitioning in two steps

11 Properties of CW
PRO:
- Efficiency: CW is time-linear in the number of edges. This is bounded by n², with n = number of nodes, but in real-world data graphs are much sparser.
- Parameter-free: this includes the number of clusters.
CON:
- Non-deterministic: due to random-order processing and possible ties w.r.t. the majority.
- Does not converge: see the tie example. [Figure: tie example]
However, the CONs are not severe for real-world data...

12 Experiments with Terms
Let D = {d_1, ..., d_q} be the set of documents, G = {G_1, ..., G_m} the gold standard classification and C = {C_1, ..., C_p} the clustering result. Then the cluster purity CP is calculated as given:
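The formula itself is not preserved in this transcript. As an illustration only, the sketch below uses a common definition of purity, assumed here and not confirmed by the slide: each cluster C_i contributes the size of its largest overlap with any gold class G_j, and the sum is divided by the number of clustered documents. The function name and the toy data are hypothetical.

```python
from collections import Counter


def cluster_purity(clusters, gold):
    """Size-weighted purity of a clustering against gold classes.

    clusters is a list of non-empty collections of document ids;
    gold maps each document id to its gold standard class.
    """
    clustered = 0
    majority_sum = 0
    for members in clusters:
        class_counts = Counter(gold[d] for d in members)
        # Size of the dominant gold class inside this cluster.
        majority_sum += class_counts.most_common(1)[0][1]
        clustered += len(members)
    return majority_sum / clustered if clustered else 0.0


# Hypothetical gold classes and clustering result.
gold = {"d1": "sports", "d2": "sports", "d3": "politics",
        "d4": "politics", "d5": "sports"}
clusters = [["d1", "d2", "d3"], ["d4", "d5"]]
cp = cluster_purity(clusters, gold)
```

In this toy case the dominant classes cover 2 of 3 and 1 of 2 documents, so the purity is 3/5.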

13 Results on Terms
In almost every case, CW clustering improves the cluster purity compared to components.
The lower the threshold t, the worse the results are in general, and the larger the improvement, especially when very large components are broken into smaller clusters.
It is possible to obtain very high cluster purity values by simply increasing t, but at the cost of reducing coverage significantly.
A typical precision/recall trade-off arises.
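The trade-off can be made concrete with a small sketch. It assumes, since the transcript does not define it, that t is a minimum similarity s_ij for keeping an edge, that "components" are the connected components of the thresholded graph, and that coverage is the share of documents that end up in some component. All names and the toy similarities are hypothetical.

```python
def components_above_threshold(sim, t):
    """Connected components of the graph that keeps edges with s_ij >= t."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), s in sim.items():
        if s >= t:
            parent.setdefault(i, i)
            parent.setdefault(j, j)
            parent[find(i)] = find(j)

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


def coverage(components, n_docs):
    """Fraction of all documents that belong to some component."""
    return sum(len(c) for c in components) / n_docs


# Hypothetical similarities among six documents (d6 has no edges).
sim = {("d1", "d2"): 3, ("d2", "d3"): 1, ("d4", "d5"): 2}
comps = components_above_threshold(sim, 2)
cov = coverage(comps, 6)
```

Raising t from 1 to 2 in this toy example splits off d3 and lowers coverage from 5/6 to 4/6, which mirrors the effect described above: stricter thresholds give smaller, cleaner groups but leave more documents unclustered.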

14 Results on URLs
Examining 20 randomly chosen clusters with a size around 100, the results can be divided into
(6) aggressive interlinking on the same server: pharmacy, concert tickets, celebrity pictures (4)
(5) link farms: servers with different names, but of same origin: a bookstore, gambling, two different pornography farms and a Turkish link farm
(3) serious portals that contain many intra-server links: a web directory, a news portal, a city portal
(3) thematic clusters of different origins: Polish hotels, USA golf, Asian hotels
(2) mixed clusters with several types of sites
(1) partially same server, partially thematic cluster: hotels and insurances in India

15 Results on Servers
We randomly chose 20 clusters with a size around 100, which can be described as follows:
(9) thematically related clusters: software, veg(etari)an, Munich technical institutes, porn, city of Ulm, LAN parties, satellite TV, Uni Osnabrück, astronomy
(6) mixed but dominated by one topic: bloggers, Swiss web design, link farm, motor racing, Uni Mainz, media in Austria
(2) link farms using different domains
(3) more or less unrelated clusters

16 Summary
Efficient methods for constructing similarity graphs of (web) documents
Experiments show that the similarity measure is useful
Efficient graph clustering for large datasets
Methodology to discover link farms
Examining differences between the similarity sources could give rise to a combined measure
Download a GUI implementation in Java of Chinese Whispers (Open Source) at

17 Questions? Thank you