The 5th annual UK Workshop on Computational Intelligence, London, 5-7 September 2005


Learning Topic Hierarchies from Text Documents using a Scalable Hierarchical Fuzzy Clustering Method
E. Mendes Rodrigues and L. Sacks {mmendes,
Department of Electronic & Electrical Engineering, University College London, UK

Outline
- Document clustering process
- H-FCM: Hyper-spherical Fuzzy C-Means
- H²-FCM: Hierarchical H-FCM
- Clustering experiments
- Topic hierarchies

Document Clustering Process
The process runs from the document collection through pre-processing, document representation and encoding, document similarity computation and the clustering method, to cluster validity assessment and the target application.

Pre-processing:
- Identify all unique words in the document collection
- Discard common words that are included in the stop list
- Apply a stemming algorithm and combine identical word stems
- Apply a term weighting scheme to the final set of k indexing terms
- Discard terms using pre-processing filters

Vector-Space Model of Information Retrieval: each document becomes a vector of term weights, giving the N-by-k matrix

X = [ x_11 x_12 … x_1k ; x_21 x_22 … x_2k ; … ; x_N1 x_N2 … x_Nk ]

The resulting representation is very high-dimensional and very sparse (>95% of entries are zero).
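The pipeline above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the stop list, the naive suffix-stripping "stemmer", and the tf-idf weighting scheme are all assumptions standing in for the unspecified choices on the slide.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in"}  # illustrative stop list

def stem(word):
    # Naive suffix stripping; a real system would use e.g. a Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(documents):
    """Turn raw documents into sparse tf-idf weighted term vectors (dicts)."""
    token_lists = []
    for doc in documents:
        # Tokenise, discard stop-listed words, then stem the rest.
        tokens = [stem(w) for w in doc.lower().split() if w not in STOP_WORDS]
        token_lists.append(tokens)
    n = len(documents)
    df = Counter()                      # document frequency of each term
    for tokens in token_lists:
        df.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # tf-idf weight; terms occurring in every document get weight 0.
        vectors.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vectors

docs = ["clustering of text documents",
        "fuzzy clustering methods",
        "topic hierarchies from documents"]
vectors = preprocess(docs)
```

Each dict in `vectors` holds only the non-zero dimensions, which is the natural representation for a space that is more than 95% sparse.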

Measures of Document Relationship
- FCM applies the Euclidean distance, which is inappropriate for high-dimensional text clustering: the non-occurrence of the same terms in both documents is handled in the same way as the co-occurrence of terms.
- The cosine (dis)similarity measure is widely applied in Information Retrieval. It represents the cosine of the angle between two document vectors and is insensitive to different document lengths, since it is normalised by the length of the document vectors.
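The cosine measure on sparse term-weight vectors can be sketched as follows (a standard formulation, using the dict representation assumed above):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors.

    Normalising by the vector lengths makes the measure insensitive to
    differing document lengths; terms absent from both vectors contribute
    nothing, unlike with the Euclidean distance.
    """
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

d1 = {"fuzzy": 1.0, "clustering": 2.0}
d2 = {"fuzzy": 2.0, "clustering": 4.0}   # same direction, different length
d3 = {"retrieval": 3.0}                  # no shared terms
```

Here `cosine_similarity(d1, d2)` is 1.0 despite the different lengths, while `cosine_similarity(d1, d3)` is 0.0.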

H-FCM: Hyper-spherical Fuzzy C-Means
- Applies the cosine measure to assess document relationships
- Modified objective function, subject to an additional constraint on the cluster centroids
- Fuzzy memberships (u) and cluster centroids (v) are updated iteratively
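The equations on this slide were images and did not survive the transcript. As a reconstruction to be checked against the original paper: H-FCM replaces the Euclidean distance in the FCM objective with the cosine similarity, and the "additional constraint" restricts the centroids to the unit hypersphere (hence "hyper-spherical"):

```latex
% Reconstructed H-FCM objective (maximised, since S is a similarity):
J_m(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{N} u_{ij}^{m}\, S(\mathbf{x}_j, \mathbf{v}_i),
\qquad
S(\mathbf{x}_j, \mathbf{v}_i) =
  \frac{\mathbf{x}_j \cdot \mathbf{v}_i}{\lVert \mathbf{x}_j \rVert\, \lVert \mathbf{v}_i \rVert}

% Standard membership constraint, plus the additional
% hyper-spherical constraint on the centroids:
\sum_{i=1}^{c} u_{ij} = 1 \;\; \forall j,
\qquad
\lVert \mathbf{v}_i \rVert = 1 \;\; \forall i
```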

How Many Clusters?
Usually the final number of clusters is not known a priori:
- Run the algorithm for a range of c values
- Apply validity measures and determine which c leads to the best partition (cluster compactness, density, separation, etc.)
How compact and dense are clusters in a sparse, high-dimensional problem space?
- Only a very small percentage of the documents within a cluster show high similarity to the respective centroid, so clusters are not compact
- However, there is always a clear separation between the intra- and inter-cluster similarity distributions

H²-FCM: Hierarchical Hyper-spherical Fuzzy C-Means
Key concepts:
- Apply the partitional algorithm (H-FCM) to obtain a sufficiently large number of clusters
- Exploit the granularity of the topics associated with each cluster to link cluster centroids hierarchically
- Form a topic hierarchy
Asymmetric similarity measure:
- Identifies parent-child relationships between cluster centroids
- A child should be less similar to its parent than the parent is to the child

The H²-FCM Algorithm
(Flow chart and diagram: documents and cluster centroids v_1 … v_12 for clusters C1-C12, partitioned into a free set V_F and a hierarchy set V_H.) Apply H-FCM with c clusters and fuzziness m; if not all clusters have size ≥ t_ND, reduce c by K and re-cluster. Then, while V_F is not empty: select the centroid pair with maximum asymmetric similarity, S(v_α, v_β) = max[S(v_α, v_β)] over v_α, v_β ∈ V_F; if V_H is empty, or if no parent with S ≥ t_PCS exists (e.g. S(v_8, v_1) < t_PCS), add the centroid as a root, otherwise (e.g. S(v_1, v_5) ≥ t_PCS) add it as a child of the selected parent.
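The linking step of the flow chart can be approximated in code. This is a simplified reading of the heuristic, not the authors' algorithm: the processing order, the shape of the similarity function, and the handling of the free/hierarchy sets are assumptions.

```python
def build_hierarchy(centroids, similarity, t_pcs):
    """Link cluster centroids into a forest of topic hierarchies.

    `centroids` is a list of centroid ids, `similarity(parent, child)` a
    possibly asymmetric similarity between centroids, and `t_pcs` the
    parent-child similarity threshold.  Returns a dict mapping each
    centroid to its parent (None for roots).
    """
    parent = {}
    placed = []                      # centroids already in the hierarchy
    for v in centroids:              # processing order is an assumption
        if not placed:
            parent[v] = None         # hierarchy empty: first centroid is a root
        else:
            # Candidate parent: the most similar already-placed centroid.
            best = max(placed, key=lambda p: similarity(p, v))
            if similarity(best, v) >= t_pcs:
                parent[v] = best     # similar enough: link as a child
            else:
                parent[v] = None     # otherwise: start a new root
        placed.append(v)
    return parent

# Hypothetical similarities between three centroids a, b, c:
sims = {("a", "b"): 0.9, ("a", "c"): 0.2, ("b", "c"): 0.3}
sim = lambda p, v: sims.get((p, v), 0.0)
tree = build_hierarchy(["a", "b", "c"], sim, t_pcs=0.5)
```

In this toy run, `b` becomes a child of `a`, while `c` is not similar enough to either placed centroid and becomes a second root.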

Scalability of the Algorithm
- H²-FCM time complexity depends on H-FCM and on the centroid linking heuristic
- H-FCM computation time is O(Nc²k)
- The linking heuristic is at most O(c²k): computing the asymmetric similarity between every pair of cluster centroids is O(c²k), and generating the cluster hierarchy is O(c²)
- Overall, H²-FCM time complexity is O(Nc²k), so it scales well to large document sets

Description of Experiments
Goal: evaluate H²-FCM performance. Are sub-clusters of the same topic assigned to the same branch?
Evaluation measures: clustering precision (P) and recall (R).
The H²-FCM algorithm was run for a range of c values, with the number of hierarchy roots equal to the number of reference classes (t_PCS set dynamically).

Contingency counts for a cluster/class pair:

                            In reference class     Not in reference class
  Assigned to cluster       true positives (tp)    false positives (fp)
  Not assigned to cluster   false negatives (fn)   true negatives (tn)
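Given the contingency counts above, precision and recall follow the standard definitions (this is the textbook formulation, not code from the authors):

```python
def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); Recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical counts for one cluster/class pair:
p, r = precision_recall(tp=80, fp=20, fn=40)  # p = 0.8, r ≈ 0.667
```

Precision penalises documents wrongly pulled into the cluster; recall penalises documents of the reference class that the cluster misses.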

Test Document Collections
- reuters1 (Reuters test collection): classes acq, earn, trade
- reuters2 (Reuters test collection): classes crude, interest, money-fx, ship, trade
- odp (Open Directory Project): classes game, lego, math, safety, sport
- inspec (INSPEC database): classes back-propagation, fuzzy control, pattern clustering
(The original table also reported collection size N, number of terms k, and document length and sparsity statistics, which did not survive the transcript.)

Clustering Results: H²-FCM Precision and Recall
(Plots of precision and recall for the odp, inspec, reuters1 and reuters2 collections.)

Topic Hierarchy
- Each centroid vector consists of a set of weighted terms
- The terms describe the topics associated with the document cluster
- The centroid hierarchy therefore yields a topic hierarchy, which is useful for efficient access to individual documents and provides context to users in exploratory information access

Topic Hierarchy Example

Concluding Remarks
- The H²-FCM clustering algorithm combines partitional clustering (H-FCM) with a linking heuristic that organizes centroids hierarchically based on asymmetric similarity
- It scales linearly with the number of documents
- It exhibits good clustering performance
- A topic hierarchy can be extracted from the centroid hierarchy

Clustering in Sparse High-dimensional Spaces
(Plots: intra- and inter-cluster similarity CDFs for a range of c values, reuters1 and reuters2 collections.)

Clustering in Sparse High-dimensional Spaces (contd.)
(Plots: intra- and inter-cluster similarity CDFs for a range of c values, odp and inspec collections.)

FCM: Fuzzy C-Means
Iterative optimization of an objective function, subject to constraints on the fuzzy memberships (u), with alternating updates of the memberships and the cluster centroids (v).
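The formulas on this backup slide were images and were lost in the transcript; the standard FCM formulation (Bezdek's) that the slide refers to is:

```latex
% FCM objective, minimised over memberships U and centroids V:
J_m(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{N} u_{ij}^{m}\,
            \lVert \mathbf{x}_j - \mathbf{v}_i \rVert^2, \qquad m > 1

% Constraints on the fuzzy memberships:
u_{ij} \in [0, 1], \qquad \sum_{i=1}^{c} u_{ij} = 1 \;\; \forall j

% Alternating update equations:
u_{ij} = \left[ \sum_{l=1}^{c}
  \left( \frac{\lVert \mathbf{x}_j - \mathbf{v}_i \rVert}
              {\lVert \mathbf{x}_j - \mathbf{v}_l \rVert} \right)^{\!\frac{2}{m-1}}
\right]^{-1},
\qquad
\mathbf{v}_i = \frac{\sum_{j=1}^{N} u_{ij}^{m}\, \mathbf{x}_j}
                    {\sum_{j=1}^{N} u_{ij}^{m}}
```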