Published by Dominic Shaw. Modified over 9 years ago.
1
Matjaž Juršič, Vid Podpečan, Nada Lavrač
2
Overview
Basic Concepts
- Clustering
- Fuzzy Clustering
- Clustering of Documents
Problem Domain
- Conference Papers Clustering (Phase 1)
- Combining Constraint-Based & Fuzzy Clustering
- Conference Papers Clustering (Phase 2)
Fuzzy Clustering of Documents
- C-Means Algorithm
- Distance Measure
- Comparison of Crisp & Fuzzy Clustering
- Time Complexity
Further Work
3
Clustering
An important unsupervised learning problem that deals with finding structure in a collection of unlabeled data.
Dividing data into groups (clusters) such that:
- “similar” objects are in the same cluster,
- “dissimilar” objects are in different clusters.
Problems:
- correct similarity/distance function between objects,
- evaluating clustering results.
4
Fuzzy Clustering
No sharp boundaries between clusters.
Each data object can belong to more than one cluster (with a certain probability).
E.g. membership of the “red square” data object:
- 70% in the “red” cluster
- 30% in the “green” cluster
5
Clustering of Documents
Bag of Words & Vector Space Model
- text represented as an unordered collection of words
- using tf-idf (term frequency–inverse document frequency)
- document = one vector in a high-dimensional space
- similarity = cosine similarity between vectors
Text-Garden Software Library (www.textmining.net)
- collection of text-mining software tools (text analysis, model generation, document classification/clustering, web crawling, ...)
- C++ library
- developed at JSI
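The bag-of-words representation with tf-idf weights and cosine similarity can be illustrated with a minimal sketch. It uses only the Python standard library and is not the Text-Garden API; the function names, the toy documents and the particular tf-idf variant (relative term frequency times log(N/df)) are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenised documents into sparse tf-idf vectors (dicts).

    Assumed weighting: (term count / document length) * log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus: word order is ignored, only counts matter.
docs = [
    "fuzzy clustering of documents".split(),
    "fuzzy clustering of conference papers".split(),
    "web crawling tools".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine_similarity(vecs[0], vecs[1])    # share three terms
sim_unrelated = cosine_similarity(vecs[0], vecs[2])  # share no terms
```

Sparse dictionaries are used because a document vector in this space has one dimension per vocabulary term, and almost all of them are zero.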
6
Conference Papers Clustering (Phase 1)
Problem
Grouping conference papers with regard to their contents into a predefined sessions schedule.
Example
[Figure: constraint-based clustering assigns the papers to a sessions schedule with Session A (3 papers), Session B (4 papers), Session C (4 papers) and Session D (3 papers), separated by coffee and lunch breaks; each session also gets a title.]
7
Combining Constraint-Based & Fuzzy Clustering
Phase 1 Solution
- constraint-based clustering (CBC)
Difficulties
- CBC can get stuck in a local minimum
- often a low-quality result (created schedule)
- user interaction needed to repair the schedule
Phase 2 Needed
- run fuzzy clustering (FC) with initial clusters from CBC
- if the output clusters of FC differ from those of CBC, repeat everything
- if the clusters of FC equal those of CBC, show the new information to the user
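The Phase 2 control flow can be sketched as a small loop. This is only one possible reading of the three steps: `run_cbc` and `run_fc` are hypothetical callables standing in for the real algorithms, the "hardening" of fuzzy memberships by taking the strongest cluster is an assumption, and so is the `max_rounds` cap.

```python
def schedule_with_feedback(papers, run_cbc, run_fc, max_rounds=10):
    """Alternate CBC and FC until FC confirms the CBC clusters.

    run_cbc(papers)          -> list of cluster indices, one per paper
    run_fc(papers, initial)  -> list of membership rows, one per paper
    Both callables are hypothetical stand-ins, not a real library API.
    Returns (crisp clusters, fuzzy memberships, agreed flag).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    for _ in range(max_rounds):
        crisp = run_cbc(papers)
        memberships = run_fc(papers, crisp)
        # Harden the fuzzy result: each paper goes to its strongest cluster.
        hardened = [row.index(max(row)) for row in memberships]
        if hardened == crisp:
            return crisp, memberships, True
    # No agreement within the cap: return the last attempt, flagged as such.
    return crisp, memberships, False


# Toy stand-ins: the first CBC attempt disagrees with FC, the second agrees.
attempts = iter([[0, 1, 1], [0, 0, 1]])


def fake_cbc(papers):
    return next(attempts)


def fake_fc(papers, initial):
    return [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]


crisp, memberships, agreed = schedule_with_feedback(
    ["p1", "p2", "p3"], fake_cbc, fake_fc)
```

Because FC starts from the CBC clusters, cluster indices stay aligned between the two results, which is what makes the direct list comparison meaningful in this sketch.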
8
Conference Papers Clustering (Phase 2)
Run Fuzzy Clustering on Phase 1 Results
- insight into result quality
- identify problematic papers
Example
[Figure: the sessions schedule from Phase 1 (Sessions A to D with titles, coffee and lunch breaks), annotated with percentages: 25%, 13%, 42%, 10%, 37%.]
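One way to "identify problematic papers" from fuzzy memberships is to flag papers whose membership in their assigned session is low. The sketch below assumes that interpretation; the function name, the 0.5 threshold and all membership values are made up for illustration and are not taken from the slides.

```python
def flag_problematic(assignments, memberships, threshold=0.5):
    """Return (paper, session, membership) for weakly attached papers.

    assignments: paper -> session it was scheduled into (Phase 1 result)
    memberships: paper -> {session: fuzzy membership degree}
    threshold:   arbitrary cut-off chosen for this example
    """
    flagged = []
    for paper, session in assignments.items():
        degree = memberships[paper][session]
        if degree < threshold:
            flagged.append((paper, session, degree))
    # Weakest fit first, so the worst placements appear at the top.
    return sorted(flagged, key=lambda item: item[2])


# Hypothetical data: three papers scheduled into two sessions.
assignments = {"paper1": "A", "paper2": "A", "paper3": "B"}
memberships = {
    "paper1": {"A": 0.80, "B": 0.20},
    "paper2": {"A": 0.35, "B": 0.65},
    "paper3": {"A": 0.55, "B": 0.45},
}
problematic = flag_problematic(assignments, memberships)
```

A paper such as the hypothetical `paper2`, which fits another session better than its own, is a natural candidate to show to the user for manual repair.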
9
C-Means Algorithm
generate initial (random) cluster centres
repeat
    for each example
        calculate membership weights
    for each cluster
        recompute new centre
until the difference of the clusters between two iterations drops under some threshold
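The loop above can be sketched in plain Python. This follows the textbook fuzzy c-means update rules and is not necessarily the authors' implementation: the fuzzifier `m = 2`, the Euclidean default distance, the stopping test on centre movement and all function and parameter names are assumptions.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def memberships_for(point, centres, m, dist):
    """Membership of one point in every cluster; the weights sum to 1."""
    d = [dist(point, c) for c in centres]
    zeros = [j for j, dj in enumerate(d) if dj == 0.0]
    if zeros:  # point coincides with a centre: share membership among those
        return [1.0 / len(zeros) if j in zeros else 0.0 for j in range(len(d))]
    power = 2.0 / (m - 1.0)
    inv = [dj ** -power for dj in d]
    total = sum(inv)
    return [v / total for v in inv]


def fuzzy_c_means(points, n_clusters, m=2.0, tol=1e-4, max_iter=100,
                  dist=euclidean, seed=0):
    """Return (centres, membership matrix) for a list of numeric vectors."""
    rng = random.Random(seed)
    # Initial centres: randomly chosen data points.
    centres = [list(p) for p in rng.sample(points, n_clusters)]
    dims = len(points[0])
    for _ in range(max_iter):
        # For each example: calculate membership weights.
        u = [memberships_for(p, centres, m, dist) for p in points]
        # For each cluster: recompute the centre as a weighted mean.
        new_centres = []
        for j in range(n_clusters):
            w = [row[j] ** m for row in u]
            total = sum(w)
            new_centres.append([
                sum(wi * p[k] for wi, p in zip(w, points)) / total
                for k in range(dims)
            ])
        # Stop when no centre moved more than the threshold.
        shift = max(dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:
            break
    u = [memberships_for(p, centres, m, dist) for p in points]
    return centres, u


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centres, u = fuzzy_c_means(points, n_clusters=2)
labels = [row.index(max(row)) for row in u]  # crisp view of the fuzzy result
```

The `dist` parameter is left pluggable because, for documents, a distance derived from cosine similarity would replace the Euclidean default used here.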
10
Distance Measure
Vector Space
- Usual similarity measure: cosine similarity
C-Means explicitly needs distance (dissimilarity), not similarity:
- There are many possibilities: [formulas not preserved in this transcript]
- None has ideal properties.
- Experimental evaluation shows no significant difference.
- We used: [formula not preserved in this transcript]
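Since the candidate formulas on this slide are not available here, the sketch below shows three commonly used ways of turning a cosine similarity into a distance. These are general examples and should not be read as the options the authors compared or the one they chose; the function names are made up.

```python
import math


def cosine_distance(sim):
    """1 - cos: 0 for identical direction, 1 for orthogonal vectors."""
    return 1.0 - sim


def angular_distance(sim):
    """Angle between the vectors in radians (clamped against rounding)."""
    return math.acos(max(-1.0, min(1.0, sim)))


def chord_distance(sim):
    """Euclidean distance between the two unit-normalised vectors."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * sim))


# All three shrink as similarity grows, so they rank pairs identically.
sims = [0.0, 0.25, 0.5, 0.75, 1.0]
table = [(s, cosine_distance(s), angular_distance(s), chord_distance(s))
         for s in sims]
```

All three are monotonically decreasing in the similarity, which may be part of why swapping one for another often changes clustering results little.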
11
Comparison of Crisp & Fuzzy Clustering
12
Time Complexity
If the dimensionality of the vectors is much higher than the number of clusters, the time complexity is comparable to that of k-means (this holds for document clustering).
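A rough per-iteration cost count makes this claim plausible. It assumes n documents, c clusters, d dimensions, and a naive membership update that sums over all clusters for each document-cluster pair; these symbols and the counting are assumptions, not formulas taken from the slide.

```latex
T_{\text{k-means}} = O(n\,c\,d), \qquad
T_{\text{c-means}} = O(n\,c\,d + n\,c^{2}) = O(n\,c\,d)
\quad \text{when } d \gg c .
```

The n·c·d term covers the distance computations and centre updates in both algorithms; the extra n·c² term comes from normalising memberships and is dominated once d is much larger than c.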
13
Further Work
Evaluation
- Test scenarios
- Benchmarks
- Using data from past conferences
User Interface
- Web interface for semi-automatic conference schedule creation
Algorithms Fine-Tuning
…
14
Discussion
Contacts: matjaz.jursic@ijs.si, vid.podpecan@ijs.si, nada.lavrac@ijs.si
Thank you for your attention