Web Data Mining. Lecture 6: Clustering of Web Documents. Content-Based Algorithms. A.Y. 2005/2006.


Document Clustering  Classical clustering algorithms are not suitable for high-dimensional data.  Dimensionality Reduction is a viable but expensive solution.  Different kinds of clustering exist:  Partitional (or Top-Down)  Hierarchical (or Bottom-Up)

Partitional Clustering  Directly decomposes the data set into a set of disjoint clusters.  The most famous example is the K-Means algorithm.  These algorithms are usually linear in the number of elements to cluster.
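As a reference point, here is a minimal K-Means sketch in plain Python (points as coordinate tuples; the function name, seed handling, and fixed iteration count are illustrative choices, not from the lecture):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """A minimal K-Means sketch on lists of coordinate tuples; not production code."""
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = tuple(sum(dim) / len(c) for dim in zip(*c))
    return centroids, clusters
```

Each iteration touches every point once, which is where the linear behavior in the number of elements comes from.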

Hierarchical Partitioning  Proceeds successively by either merging smaller clusters into larger ones, or by splitting larger clusters.  The clustering methods differ in the rule by which it is decided which two small clusters are merged or which large cluster is split.  The end result of the algorithm is a tree of clusters called a dendrogram, which shows how the clusters are related.  By cutting the dendrogram at a desired level a clustering of the data items into disjoint groups is obtained.

Dendrogram Example

Clustering in Web Content Mining  Possible uses of clustering in Web Content Mining:  Automatic Document Classification.  Search Engine Results Presentation.  Search Engine Optimization:  Collection Reorganization.  Index Reorganization.  Dimensionality Reduction!

Advanced Document Clustering Techniques  Co-Clustering  Dhillon, I. S., Mallela, S., and Modha, D. S. 2003. Information-theoretic co-clustering. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Washington, D.C., August 2003), KDD '03. ACM Press, New York, NY.  Syntactic Clustering  Broder, A. Z., Glassman, S. C., Manasse, M. S., and Zweig, G. 1997. Syntactic clustering of the Web. Comput. Netw. ISDN Syst. 29, 8-13 (Sep. 1997).

Co-Clustering  Idea: represent a collection with its term- document matrix and then cluster both rows and columns.  It has a strong theoretical foundation.  It is based on the assumption that the best clustering is the one that leads to the largest mutual information between the clustered random variables.

Information Theory  Entropy of a random variable X with probability distribution p(x): H(X) = -Σ_x p(x) log p(x)  The Kullback-Leibler (KL) divergence, or "relative entropy", between two probability distributions p and q: D(p||q) = Σ_x p(x) log(p(x)/q(x))  Mutual information between random variables X and Y: I(X;Y) = Σ_x Σ_y p(x,y) log(p(x,y)/(p(x)p(y)))
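The three quantities translate directly into code (base-2 logarithms, so results are in bits; the function names are mine):

```python
from math import log2

def entropy(p):
    # H(X) = -sum_x p(x) log2 p(x); terms with p(x) = 0 contribute nothing
    return -sum(px * log2(px) for px in p if px > 0)

def kl(p, q):
    # D(p || q) = sum_x p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

def mutual_information(joint):
    # I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(joint[i][j] * log2(joint[i][j] / (px[i] * py[j]))
               for i in range(len(joint)) for j in range(len(joint[0]))
               if joint[i][j] > 0)
```

Note that mutual information is itself a KL divergence: I(X;Y) = D(p(X,Y) || p(X)p(Y)).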

Contingency Table  Let X and Y be discrete random variables that take values in the sets {x_1, x_2, …, x_m} and {y_1, y_2, …, y_n}.  Let p(X,Y) denote the joint probability distribution between X and Y.

Problem Formulation  Co-clustering is concerned with simultaneously clustering X into (at most) k disjoint clusters and Y into (at most) l disjoint clusters.  Let the k clusters of X be written as {x'_1, x'_2, …, x'_k}, and let the l clusters of Y be written as {y'_1, y'_2, …, y'_l}.  (C_X, C_Y) is called a co-clustering, where:  C_X: {x_1, x_2, …, x_m} → {x'_1, x'_2, …, x'_k}  C_Y: {y_1, y_2, …, y_n} → {y'_1, y'_2, …, y'_l}  An optimal co-clustering minimizes I(X;Y) - I(X'=C_X(X); Y'=C_Y(Y)) = I(X;Y) - I(X';Y')

Lemma 2.1  For a fixed co-clustering (C_X, C_Y), we can write the loss in mutual information as I(X;Y) - I(X';Y') = D(p(X,Y) || q(X,Y)), where D(·||·) denotes the Kullback-Leibler divergence and q(X,Y) is a distribution of the form q(x,y) = p(x',y') p(x|x') p(y|y'), where x ∈ x', y ∈ y'.

The Approximation Matrix q(X,Y)  q(x,y) = p(x',y') p(x|x') p(y|y')  p(x') = Σ_{x ∈ x'} p(x)  p(y') = Σ_{y ∈ y'} p(y)  p(x|x') = p(x)/p(x')  p(y|y') = p(y)/p(y')
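These formulas translate directly into code. A pure-Python sketch over nested lists (`row_cluster[x]` gives the cluster index x' of row x; the function and argument names are assumptions):

```python
def approx_q(p, row_cluster, col_cluster):
    """Build q(x,y) = p(x',y') p(x|x') p(y|y') from a joint matrix p and
    cluster maps row_cluster[x] -> x', col_cluster[y] -> y'."""
    m, n = len(p), len(p[0])
    px = [sum(p[x]) for x in range(m)]                      # p(x)
    py = [sum(p[x][y] for x in range(m)) for y in range(n)] # p(y)
    k = max(row_cluster) + 1
    l = max(col_cluster) + 1
    # Aggregated cluster marginals p(x'), p(y') and joint p(x', y').
    pxc = [sum(px[x] for x in range(m) if row_cluster[x] == c) for c in range(k)]
    pyc = [sum(py[y] for y in range(n) if col_cluster[y] == c) for c in range(l)]
    pjoint = [[sum(p[x][y] for x in range(m) if row_cluster[x] == a
                           for y in range(n) if col_cluster[y] == b)
               for b in range(l)] for a in range(k)]
    return [[pjoint[row_cluster[x]][col_cluster[y]]
             * (px[x] / pxc[row_cluster[x]]) * (py[y] / pyc[col_cluster[y]])
             for y in range(n)] for x in range(m)]
```

By construction q preserves the marginals of p, which is exactly the property invoked in the soundness slide below.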

Proof of Lemma 2.1

Some Useful Equalities

Co-Clustering Algorithm
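The algorithm (shown as a figure in the original slides) alternates between reassigning row clusters and column clusters so that the loss of Lemma 2.1 never increases. A compact NumPy sketch in the spirit of Dhillon et al., assuming a strictly positive joint distribution; all names are my own:

```python
import numpy as np

def coclustering_loss(P, R, C, k, l):
    """Objective of Lemma 2.1: I(X;Y) - I(X';Y')."""
    def mi(J):
        marg = np.outer(J.sum(1), J.sum(0))
        nz = J > 0
        return (J[nz] * np.log2(J[nz] / marg[nz])).sum()
    Pc = np.zeros((k, l))
    for a in range(k):
        for b in range(l):
            Pc[a, b] = P[np.ix_(R == a, C == b)].sum()
    return mi(P) - mi(Pc)

def _reassign(P, R, C, k, l, py):
    """Reassign each row of P to the cluster x' minimizing D(p(Y|x) || q(Y|x')).
    Columns are handled by calling this on P.T with the roles swapped."""
    Pc = np.zeros((k, l))
    for a in range(k):
        for b in range(l):
            Pc[a, b] = P[np.ix_(R == a, C == b)].sum()
    pxc = np.maximum(Pc.sum(1), 1e-300)   # p(x'), clamped in case a cluster empties
    pyc = np.maximum(Pc.sum(0), 1e-300)   # p(y')
    # q(y|x') = p(y'|x') p(y|y'), one row per candidate cluster x'.
    q = (Pc[:, C] / pxc[:, None]) * (py / pyc[C])[None, :]
    pygx = P / P.sum(1, keepdims=True)    # p(y|x)
    with np.errstate(divide="ignore"):
        D = (pygx[:, None, :] * np.log2(pygx[:, None, :] / q[None, :, :])).sum(-1)
    return D.argmin(1)

def cocluster(P, k, l, iters=8, seed=0):
    """Alternate row and column steps; the loss above is non-increasing."""
    rng = np.random.default_rng(seed)
    R = rng.integers(0, k, P.shape[0])
    C = rng.integers(0, l, P.shape[1])
    px, py = P.sum(1), P.sum(0)
    for _ in range(iters):
        R = _reassign(P, R, C, k, l, py)     # row step
        C = _reassign(P.T, C, R, l, k, px)   # column step
    return R, C
```

Each step recomputes the aggregated p(x',y') and then takes an argmin per row (or column), which matches the complexity bound quoted two slides below.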

Co-Clustering Soundness  Theorem: the co-clustering algorithm monotonically decreases the loss in mutual information (the objective function value).  Marginals p(x) and p(y) are preserved at every step (q(x) = p(x) and q(y) = p(y)).

Co-Clustering Complexity  The algorithm is computationally efficient, even for sparse data.  If nz is the number of nonzeros in the input joint distribution p(X,Y) and t is the number of iterations, the cost is O(nz · t · (k + l)).  Experimentally, t ≈ 20.

A Toy Example

A Real Example Before

A Real Example After

Application: Dimensionality Reduction  Feature Selection: select the "best" words (frequency-based pruning, information-criterion-based pruning) and throw away the rest; each document's bag-of-words becomes a vector over the k selected words.  Feature Clustering: do not throw away words; cluster the words instead and use the k clusters as features.

Syntactic Clustering  Finding syntactically similar documents.  The approach is based on two different similarity measures:  Resemblance  Containment  A sketch of a few hundred bytes is kept for each document.

Document Model  We view each document as a sequence of words.  Start by lexically analyzing the doc into a canonical sequence of tokens.  This canonical form ignores minor details such as formatting, html commands, and capitalization.  We then associate with every document D a set of subsequences of tokens S(D,w).

Shingling  A contiguous subsequence contained in D is called a shingle.  Given a document D we define its w-shingling S(D,w) as the set of all unique shingles of size w contained in D.  For instance, the 4-shingling of (a,rose,is,a,rose,is,a,rose) is the set:  {(a,rose,is,a); (rose,is,a,rose); (is,a,rose,is)}
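Extracting the w-shingling is a one-liner over the token sequence (the function name is mine):

```python
def shingles(tokens, w):
    """The w-shingling S(D, w): all unique contiguous w-token runs of D."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}
```

Running it on the rose example reproduces the three shingles listed above: duplicates such as the second (a,rose,is,a) collapse because S(D,w) is a set.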

Resemblance  For a given shingle size w, the resemblance r of two documents A and B is defined as r(A,B) = |S(A,w) ∩ S(B,w)| / |S(A,w) ∪ S(B,w)|, where |S| is the size of set S.

Containment  For a given shingle size w, the containment c of A in B is defined as c(A,B) = |S(A,w) ∩ S(B,w)| / |S(A,w)|, where |S| is the size of set S.

Properties of r and c  The resemblance is a number between 0 and 1.  r(A,A) = 1.  The containment is a number between 0 and 1.  If A ⊆ B then c(A,B) = 1.  Experiments show that these definitions capture the informal notions of "roughly the same" and "roughly contained".
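Both measures reduce to plain set operations on the shingle sets; a direct Python transcription (function names are mine):

```python
def resemblance(SA, SB):
    # r(A, B) = |S(A,w) ∩ S(B,w)| / |S(A,w) ∪ S(B,w)|
    return len(SA & SB) / len(SA | SB)

def containment(SA, SB):
    # c(A, B) = |S(A,w) ∩ S(B,w)| / |S(A,w)|
    return len(SA & SB) / len(SA)
```

The properties above follow immediately: r(A,A) = 1, and if S(A,w) ⊆ S(B,w) the intersection equals S(A,w), so c(A,B) = 1.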

Resemblance Distance  Resemblance is not transitive.  Version 100 of a document is probably quite different from version 1.  The resemblance distance d(A,B) = 1 - r(A,B) is a metric: in particular, it obeys the triangle inequality.

Resemblance and Containment Estimates  Fix a shingle size w.  Let U be the set of all shingles of size w.  U is countable, thus we can view its elements as numbers.  Fix a parameter s.  For a set W ⊆ U define MIN_s(W) as the set of the s smallest elements of W (or W itself, if |W| < s), where "smallest" refers to numerical order on U, and define MOD_m(W) as the set of elements of W that are divisible by m.

Resemblance and Containment Estimates  Theorem. Let π: U → U be a permutation of U chosen uniformly at random. Let F(A) = MIN_s(π(S(A))) and V(A) = MOD_m(π(S(A))). Define F(B) and V(B) analogously. Then  |MIN_s(F(A) ∪ F(B)) ∩ F(A) ∩ F(B)| / |MIN_s(F(A) ∪ F(B))| is an unbiased estimate of the resemblance of A and B, and  |V(A) ∩ V(B)| / |V(A)| is an unbiased estimate of the containment of A in B.

The Sketch  Choose a random permutation of U.  The sketch of a document D consists of the set F(D) and/or V(D).  F(D) has fixed size and allows only the estimation of resemblance.  V(D) has variable size: it grows as D grows.
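A toy version of the fixed-size sketch F(D) and of the resemblance estimator from the theorem. A salted built-in hash stands in for the random permutation π, so this is only a sketch of the sketch (all names are mine):

```python
def min_sketch(shingle_set, s, perm):
    """F(D) = MIN_s(pi(S(D))): the s numerically smallest permuted shingles."""
    return set(sorted(perm(x) for x in shingle_set)[:s])

def estimate_resemblance(FA, FB, s):
    # |MIN_s(F(A) ∪ F(B)) ∩ F(A) ∩ F(B)| / |MIN_s(F(A) ∪ F(B))|
    union_min = set(sorted(FA | FB)[:s])
    return len(union_min & FA & FB) / len(union_min)

# Stand-in for a random permutation of U; NOT a true bijection.
perm = lambda x: hash(("salt", x))
```

Because only the s smallest permuted shingles are kept, two documents can be compared from their fixed-size sketches alone, without ever revisiting the full shingle sets.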

Practical Sketch Representation  Canonicalize documents by removing HTML formatting and converting all words to lowercase.  The shingle size w is 10.  Use a 40-bit fingerprint function, based on Rabin fingerprints, enhanced to behave as a random permutation; a shingle is now represented by this fingerprint value.  The modulus m for MOD_m is set to 25.

Rabin Fingerprints  Based on the use of irreducible polynomials with coefficients in GF(2).  Let A = (a_1, …, a_m) be a binary string with a_1 = 1.  A(t) = a_1 t^(m-1) + a_2 t^(m-2) + … + a_m  Let P(t) be an irreducible polynomial of degree k over Z_2.  f(A) = A(t) mod P(t) is the Rabin fingerprint of A.
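The mod-P(t) arithmetic over GF(2) is just shift-and-XOR on the bits. A bit-level sketch (the degree-3 polynomial in the comment is only for illustration; the slides use degree 40):

```python
def rabin_fingerprint(data, poly, k):
    """f(A) = A(t) mod P(t) over GF(2). `poly` is the bitmask of P(t)'s
    k+1 coefficients, e.g. 0b1011 encodes t^3 + t + 1 (irreducible, k = 3)."""
    f = 0
    for byte in data:
        for bit in range(7, -1, -1):
            f = (f << 1) | ((byte >> bit) & 1)  # shift in the next message bit
            if f >> k:                          # degree reached k: subtract P(t), i.e. XOR
                f ^= poly
    return f
```

The result always fits in k bits, which is what makes a fixed-width (here 40-bit) shingle fingerprint possible.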

Shingle Clustering  Retrieve every document on the Web.  Calculate the sketch for each document.  Compare the sketches for each pair of documents to see if they exceed a threshold of resemblance.  Combine the pairs of similar documents to make the clusters of similar documents.

Efficiency (or rather, INefficiency)  30,000,000 HTML docs.  A pairwise comparison would involve O(10^15) comparisons!  Just one bit per document in a data structure requires 4 MBytes; a sketch size of 800 bytes per document requires 24 GBytes!  One millisecond of computation per document translates into 8 hours of computation!  Any algorithm involving random disk accesses, or that causes paging activity, is completely infeasible.

Divide, Compute, Merge  Take the data and divide it into pieces of size m (so that each piece fits entirely in memory).  Compute on each piece separately.  Merge the results.  The merging process is I/O bound:  Each merge pass is linear.  log(n/m) passes are required.  The overall performance is O(n log(n/m)).
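The same divide-sort-merge idea in miniature, with sorting as the per-piece computation (`heapq.merge` plays the role of the merge passes in a single multiway pass; a real implementation streams the sorted runs to and from disk):

```python
import heapq
import itertools

def external_sort(items, m):
    """Divide-sort-merge sketch: sort pieces of size m in memory,
    then stream-merge the sorted runs."""
    it = iter(items)
    runs = []
    while True:
        chunk = list(itertools.islice(it, m))  # one memory-sized piece
        if not chunk:
            break
        runs.append(sorted(chunk))             # compute on each piece separately
    return list(heapq.merge(*runs))            # merge the results
```

This is the pattern reused by phases II and III below for sorting the shingle and ID files.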

The “real” Clustering Algorithm (Phase I)  Calculate a sketch for every document. This step is linear in the total length of the documents.

The “real” Clustering Algorithm (Phase II)  Produce a list of all the shingles and the documents they appear in, sorted by shingle value. To do this, the sketch of each document is expanded into a list of <shingle value, document ID> pairs. Sort the list using the divide, sort, merge approach.  Remember: a shingle value here means the Rabin fingerprint of a shingle.

The “real” Clustering Algorithm (Phase III)  Generate a list of all the pairs of documents that share any shingles, along with the number of shingles they have in common. To do this, take the file of sorted <shingle value, document ID> pairs and expand it into a list of <ID, ID, count> triplets: take each shingle that appears in multiple documents and generate the complete set of triplets for it.  Apply the divide, sort, merge procedure (summing up the counts for matching ID-ID pairs) to produce a single file of all triplets, sorted by the first document ID.  This phase requires the greatest amount of disk space because the initial expansion into triplets is quadratic in the number of documents sharing a shingle, and initially produces many triplets with a count of 1.

The “real” Clustering Algorithm (Phase IV)  Produce the complete clustering. Examine each triplet and decide whether the document pair exceeds our threshold for resemblance. If it does, add a link between the two documents in a union-find algorithm. The connected components output by the union-find algorithm form the final clusters.  This phase has the greatest memory requirements because the entire union-find data structure must be held in memory.
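Phase IV in miniature: a union-find over above-threshold document links. Here each link carries the resemblance derived from the triplet's shared-shingle count (names and the threshold handling are mine):

```python
def cluster_documents(links, threshold):
    """Union-find over (doc_a, doc_b, resemblance) links; the connected
    components of the above-threshold links are the final clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b, r in links:
        if r >= threshold:
            parent[find(a)] = find(b)      # add a link between the two documents

    clusters = {}
    for x in list(parent):                 # group documents by their root
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())
```

Only the parent map must live in memory, which matches the observation that this phase is memory-bound rather than disk-bound.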

Performance Issues  Common shingles:  Some shingles are shared by more than 1,000 documents.  The number of document ID pairs is quadratic in the number of documents sharing a shingle.  Remove shingles that are more frequent than a given threshold.  Identical documents:  Identical documents do not need to be handled: remove them from the collection by removing documents having the same fingerprint.  Super-shingles:  Compute a meta-sketch by shingling the shingles.  Documents sharing shingles in the meta-sketch are very likely to have a high resemblance value.  The super-shingle size must be chosen carefully.

Super-Shingle-Based Clustering  Compute the list of super-shingles for each document.  Expand the list of super-shingles into a sorted list of <super-shingle, document ID> pairs.  Any documents that share a super-shingle resemble each other and are added to the same cluster.

Problems with Super-Shingles  Super-shingles are not as flexible or as accurate as computing resemblance with regular sketches.  They do not work well for short documents: short documents do not contain many shingles, so even regular shingles are not accurate in computing resemblance.  Super-shingles represent sequences of shingles, and so shorter documents, with fewer super-shingles, have a lower probability of producing a common super-shingle.  Super-shingles cannot detect containment.

A Nice Application: Page Changing Characterization  We can use the technique of comparing sketches over time to characterize the behavior of pages on the web.  For instance, we can observe a page at different times and see how similar each version is to the preceding version.  We can thus answer some basic questions like:  How often do pages change?  How much do they change per time interval?  How often do pages move? Within a server? Between servers?  How long do pages live? How many are created? How many die?

Experiments  30,000,000 HTML pages, 150 GBytes (5 KB per document).  The file containing just the URLs of the documents took up 1.8 GBytes (an average of 60 bytes per URL).  10-word shingles, 5-byte fingerprints; 1 in 25 of the shingles found was kept.  600M shingles; the raw sketch files took up 3 GBytes.

Experiments  In the third phase (the creation of triplets) the storage required was 20 GBytes; at the end the file took 6 GBytes.  The final clustering phase is the most memory intensive. The final file took up less than 100 MBytes.

Experiments  Resemblance threshold set to 50%.  3.6 million clusters were found, containing a total of 12.3 million documents.  2.1 million clusters contained only identical documents (5.3 million documents).  The remaining 1.5 million clusters contained 7 million documents (a mixture of exact duplicates and similar documents).

Experiments

Phase                   Time (CPU-days)   Parallelizable
Sketching               4.6               YES
Duplicate elimination   0.3
Shingle merging         1.7               YES
ID-ID pair formation    0.7
ID-ID merging           2.6               YES
Cluster formation       0.5
Total                   ≈ 10.5