1
Similarity of Documents and Document Collections Using Attributes with Low Noise
Chris Biemann, Uwe Quasthoff
Ifi, NLP Department, University of Leipzig, Germany
Monday 5, 2007, WEBIST07, Barcelona
2
Outline
- Motivation
- Attributes with Low Noise
  - Low frequency terms
  - Link similarity
- Chinese Whispers Graph Clustering
- Experimental Results
  - Low frequency terms
  - Link similarity
- Conclusion
3
Motivation
Document clustering groups documents into meaningful clusters that can be used for
- document collection overview
- associative browsing
- basis for multi-document summarisation
- ...
In the WWW, documents can be characterized by at least
- terms contained in the document
- (external) links from and to the document
In a WWW setting, the clustering algorithm must be efficient, as datasets are huge.
We use a graph representation and graph clustering.
4
Low Frequency Terms
Documents are more similar the more low frequency terms they share.
For IR this is not a good idea, but for clustering it is.
Restricting the comparison to low frequency terms reduces noise (no stop words) and allows efficient computation of the similarity graph:
  For each word do {
    list all pairs of documents containing this word;
    sort the resulting list of pairs;
  }
  For each pair (i,j) in this list, count the number of occurrences as s_ij;
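The pair-counting idea above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the cut-off `max_doc_freq` for what counts as "low frequency" is an assumption (the slides give no value), the toy documents are invented, and a dictionary counter stands in for the sort-and-count step.

```python
from collections import Counter, defaultdict
from itertools import combinations


def low_frequency_similarity(docs, max_doc_freq=3):
    """Count shared low-frequency terms for every document pair.

    docs: dict mapping a document id to an iterable of terms.
    max_doc_freq: hypothetical cut-off (not from the slides); a term counts
    as "low frequency" if it occurs in 2..max_doc_freq documents.
    """
    # Inverted index: term -> set of documents containing it.
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in set(terms):
            postings[term].add(doc_id)

    # s[(i, j)] = number of low-frequency terms shared by documents i and j.
    s = Counter()
    for term, doc_ids in postings.items():
        if 2 <= len(doc_ids) <= max_doc_freq:
            for i, j in combinations(sorted(doc_ids), 2):
                s[(i, j)] += 1
    return s


# Invented toy collection; "the" occurs everywhere and is ignored.
docs = {
    "d1": ["the", "graph", "clustering", "whispers"],
    "d2": ["the", "graph", "whispers", "leipzig"],
    "d3": ["the", "newswire", "leipzig"],
    "d4": ["the", "newswire"],
}
similarity = low_frequency_similarity(docs, max_doc_freq=2)
```

Because every term contributes at most `max_doc_freq * (max_doc_freq - 1) / 2` pairs, the work per term stays small, which is where the efficiency of restricting to low frequency terms comes from.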
5
Co-occurrence of Links
Web pages are regarded as more similar the more often other pages contain a link to both.
External links are a good source of information, as they are normally set deliberately by humans.
Co-occurrence computation is a standard method in NLP and can be performed efficiently.
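A minimal Python sketch of link co-occurrence counting, under the assumption that the link data is available as a mapping from each source page to the pages it links to. The function name, the input layout, and the example hosts are hypothetical.

```python
from collections import Counter
from itertools import combinations


def link_cooccurrence(outlinks):
    """Count how many pages link to both members of each target pair.

    outlinks: dict mapping a source page to the pages it links to
    (assumed input layout, not taken from the slides).
    """
    cooc = Counter()
    for source, targets in outlinks.items():
        # Ignore self-links and repeated links on the same page.
        unique_targets = sorted(set(targets) - {source})
        for a, b in combinations(unique_targets, 2):
            cooc[(a, b)] += 1
    return cooc


# Invented example: three pages linking to three targets.
outlinks = {
    "p1": ["a.example", "b.example", "c.example"],
    "p2": ["a.example", "b.example"],
    "p3": ["b.example", "c.example", "b.example"],
}
cooc = link_cooccurrence(outlinks)
```

The cost is quadratic in the number of links per page, so in practice pages with very many outgoing links may need a cap; that choice is not discussed on the slide.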
6
Graph Representation
Many datasets are naturally represented as graphs, with nodes encoding entities and edges encoding their relations.
In nature, many graphs possess the small world property, which in particular comes with skewed distributions that are not captured well in vector space models.
Here, documents form nodes, and edges indicate statistically extracted relations between them.
7
Dataset: Terms
Part of the German press newswire of the year 2000
202,086 documents, classified into 309 classes
The classification is used to measure quality
[Figure: class size distribution]
8
Dataset: Links
Part of the German Web
No classification available -> manual evaluation
Two datasets: servers and URLs

type      # nodes      # of edges    # nodes with edges
servers   2,201,421    18,892,068    876,577
URLs      680,239      19,465,650    624,332
9
Chinese Whispers Algorithm
Nodes have a class and communicate it to their adjacent nodes.
A node adopts one of the majority classes in its neighbourhood.
Nodes are processed in random order for some iterations.
Algorithm:
  initialize:
    forall v_i in V: class(v_i) = i;
  while changes:
    forall v in V, randomized order:
      class(v) = highest ranked class in neighborhood of v;
[Figure: example graph with nodes A (L1), B (L4), C (L3), D (L2), E (L3), edge weights 5, 8, 6, 3, and degree annotations deg=1 to deg=5]
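The pseudocode above might be realised in Python roughly as follows. This is a sketch under stated assumptions, not the authors' Java implementation: "highest ranked class" is taken to mean the class with the largest summed edge weight among the neighbours, ties are broken at random, and `max_iterations` and `seed` are illustrative parameters added here to bound the run time and make runs repeatable.

```python
import random
from collections import defaultdict


def chinese_whispers(edges, max_iterations=20, seed=0):
    """Cluster an undirected weighted graph; returns a node -> class dict.

    edges: iterable of (u, v, weight) triples.
    max_iterations, seed: illustrative parameters, not from the slides.
    """
    rng = random.Random(seed)
    neighbours = defaultdict(dict)
    for u, v, weight in edges:
        neighbours[u][v] = weight
        neighbours[v][u] = weight

    # Initialise: every node starts in its own class.
    nodes = sorted(neighbours)
    label = {v: i for i, v in enumerate(nodes)}

    for _ in range(max_iterations):
        changed = False
        rng.shuffle(nodes)  # randomized processing order
        for v in nodes:
            # Sum edge weights per class in the neighbourhood of v.
            scores = defaultdict(float)
            for u, weight in neighbours[v].items():
                scores[label[u]] += weight
            best = max(scores.values())
            winners = sorted(c for c, s in scores.items() if s == best)
            new_label = rng.choice(winners)  # random tie-breaking
            if new_label != label[v]:
                label[v] = new_label
                changed = True
        if not changed:
            break
    return label


# Invented example: two triangles joined by one weak edge.
edges = [
    ("A", "B", 5), ("A", "C", 5), ("B", "C", 5),
    ("D", "E", 5), ("D", "F", 5), ("E", "F", 5),
    ("C", "D", 1),
]
labels = chinese_whispers(edges)
```

Since the outcome depends on the processing order and on tie-breaking, different seeds can give different partitions of the same graph.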
10
Example: CW partitioning in two steps
11
Properties of CW
PRO:
- Efficiency: CW is time-linear in the number of edges. This is bounded by n², with n = number of nodes, but in real world data, graphs are much sparser.
- Parameter-free: this includes the number of clusters.
CON:
- Non-deterministic: due to random order processing and possible ties w.r.t. the majority.
- Does not always converge: see tie example. [Figure: tie example]
However, the CONs are not severe for real world data...
12
Experiments with Terms
Let D = {d_1, ..., d_q} be the set of documents, G = {G_1, ..., G_m} the gold standard classification and C = {C_1, ..., C_p} the clustering result. Then the cluster purity CP is calculated as given:
[Formula: cluster purity CP]
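The CP formula itself is not preserved in this transcript. As a point of reference only, the sketch below computes the commonly used purity measure, the fraction of documents that belong to the majority gold class of their cluster. This is an assumption and may differ in detail from the authors' CP; the function name and toy data are hypothetical.

```python
from collections import Counter


def cluster_purity(clusters, gold):
    """Standard purity (assumed definition, may differ from the slides' CP).

    clusters: list of lists of document ids (the clustering result C).
    gold: dict mapping each document id to its gold-standard class (G).
    """
    total = sum(len(cluster) for cluster in clusters)
    if total == 0:
        return 0.0
    majority_hits = 0
    for cluster in clusters:
        if cluster:
            # Size of the largest gold class inside this cluster.
            class_counts = Counter(gold[d] for d in cluster)
            majority_hits += max(class_counts.values())
    return majority_hits / total


# Invented toy data: five documents, two gold classes.
gold = {"d1": "sports", "d2": "sports", "d3": "politics",
        "d4": "politics", "d5": "politics"}
clusters = [["d1", "d2", "d3"], ["d4", "d5"]]
cp = cluster_purity(clusters, gold)  # (2 + 2) / 5 = 0.8
```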
13
Results on Terms
In almost every case, CW clustering improves the cluster purity compared to using the components of the graph as clusters.
In general, the lower the threshold t, the worse the results and the larger the improvement, especially when very large components are broken into smaller clusters.
Very high cluster purity values can be obtained by simply increasing t, but at the cost of significantly reduced coverage.
A typical precision/recall trade-off arises.
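The coverage effect of the threshold can be illustrated with a small Python sketch. It assumes that t is a minimum value of the pair similarity s_ij for an edge to be kept, which the slide does not spell out, and uses a union-find structure to obtain the components; the function name and similarity values are invented.

```python
def components_above_threshold(similarity, t):
    """Components of the graph that keeps only edges with s_ij >= t.

    similarity: dict mapping (i, j) pairs to s_ij.
    t: assumed to be a minimum edge weight (not defined on the slide).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), s_ij in similarity.items():
        if s_ij >= t:
            parent[find(i)] = find(j)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Largest components first; ties ordered by member names.
    return sorted(groups.values(), key=lambda g: (-len(g), sorted(g)))


# Invented similarity values for five documents.
similarity = {("d1", "d2"): 3, ("d2", "d3"): 1, ("d4", "d5"): 2}
low_t = components_above_threshold(similarity, 1)
high_t = components_above_threshold(similarity, 3)
```

With t = 1 all five documents are covered; with t = 3 only d1 and d2 remain in the graph, so the remaining cluster is tighter but three documents are no longer clustered at all.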
14
Results on URLs
Examining 20 randomly chosen clusters with a size of around 100, the results can be divided into
- (6) aggressive interlinking on the same server: pharmacy, concert tickets, celebrity pictures
- (5) link farms, i.e. servers with different names but of the same origin: a bookstore, gambling, two different pornography farms and a Turkish link farm
- (3) serious portals that contain many intra-server links: a web directory, a news portal, a city portal
- (3) thematic clusters of different origins: Polish hotels, USA golf, Asian hotels
- (2) mixed clusters with several types of sites
- (1) partially same server, partially thematic cluster: hotels and insurances in India
15
Results on Servers
We randomly chose 20 clusters with a size of around 100, which can be described as follows:
- (9) thematically related clusters: software, veg(etari)an, Munich technical institutes, porn, city of Ulm, LAN parties, satellite TV, Uni Osnabrück, astronomy
- (6) mixed but dominated by one topic: bloggers, Swiss web design, link farm, motor racing, Uni Mainz, media in Austria
- (2) link farms using different domains
- (3) more or less unrelated clusters
16
Summary
- Efficient methods for constructing similarity graphs of (web) documents
- Experiments show that the similarity measures are useful
- Efficient graph clustering for large datasets
- Methodology to discover link farms
- Examining the differences between the similarity sources could give rise to a combined measure
Download a GUI implementation of Chinese Whispers in Java (Open Source) at http://wortschatz.informatik.uni-leipzig.de/~cbiemann/software/CW.html
17
Questions? Thank you.