Matjaž Juršič, Vid Podpečan, Nada Lavrač. OVERVIEW BASIC CONCEPTS - Clustering - Fuzzy Clustering - Clustering of Documents PROBLEM DOMAIN - Conference.


1 Matjaž Juršič, Vid Podpečan, Nada Lavrač

2 OVERVIEW
BASIC CONCEPTS
- Clustering
- Fuzzy Clustering
- Clustering of Documents
PROBLEM DOMAIN
- Conference Papers Clustering (Phase 1)
- Combining Constraint-Based & Fuzzy Clustering
- Conference Papers Clustering (Phase 2)
FUZZY CLUSTERING OF DOCUMENTS
- C-Means Algorithm
- Distance Measure
- Comparison of Crisp & Fuzzy Clustering
- Time Complexity
FURTHER WORK

3 CLUSTERING
An important unsupervised learning problem: finding structure in a collection of unlabeled data.
The data are divided into groups (clusters) such that:
- “similar” objects are in the same cluster,
- “dissimilar” objects are in different clusters.
Problems:
- choosing a suitable similarity/distance function between objects,
- evaluating clustering results.

4 FUZZY CLUSTERING
No sharp boundaries between clusters. Each data object can belong to more than one cluster, with a certain degree of membership.
E.g. membership of the “red square” data object:
- 70% in the “red” cluster
- 30% in the “green” cluster

5 CLUSTERING OF DOCUMENTS
BAG OF WORDS & VECTOR SPACE MODEL
- text is represented as an unordered collection of words
- words are weighted using tf-idf (term frequency–inverse document frequency)
- document = one vector in a high-dimensional space
- similarity = cosine similarity between vectors
TEXT-GARDEN SOFTWARE LIBRARY (www.textmining.net)
- collection of text-mining software tools (text analysis, model generation, document classification/clustering, web crawling, ...)
- C++ library
- developed at JSI
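The bag-of-words representation above can be sketched in a few lines of Python. This is only an illustration, not the Text-Garden implementation: the whitespace tokenizer, the plain `tf * log(N / df)` weighting, and the names `tfidf_vectors` and `cosine_similarity` are assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    # Assumption: naive lowercase/whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # word order is discarded (bag of words)
        vectors.append({t: n * math.log(n_docs / df[t]) for t, n in tf.items()})
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = [
    "fuzzy clustering of documents",
    "clustering of conference papers",
    "web crawling tools",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # documents share two terms
sim_02 = cosine_similarity(vecs[0], vecs[2])  # documents share no terms
```

Documents that share terms get a positive similarity, while documents with no terms in common get exactly zero.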

6 CONFERENCE PAPERS CLUSTERING (PHASE 1)
PROBLEM
Grouping conference papers, with regard to their contents, into a predefined sessions schedule.
EXAMPLE
[Figure: papers are assigned to the sessions schedule by constraint-based clustering]
- Sessions schedule: Session A (3 papers), Session B (4 papers), Session C (4 papers), Session D (3 papers), separated by coffee and lunch breaks
- Result: Session A – Title, Session B – Title, Session C – Title, Session D – Title

7 COMBINING CONSTRAINT-BASED & FUZZY CLUSTERING
PHASE 1 SOLUTION
- constraint-based clustering (CBC)
DIFFICULTIES
- CBC can get stuck in a local minimum
- the result (the created schedule) is often of low quality
- user interaction is needed to repair the schedule
PHASE 2 NEEDED
- run fuzzy clustering (FC) with the initial clusters taken from CBC
- if the output clusters of FC differ from those of CBC, repeat everything
- if the clusters of FC are equal to those of CBC, show the new information to the user
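The Phase 2 loop can be read as the control flow sketched below. The slide does not say how "repeat everything" is seeded, so this is one possible reading: the callables `run_cbc` and `run_fc`, the `max_rounds` cap, and the choice to feed the FC result back into CBC are all assumptions made for illustration.

```python
def combine_cbc_and_fc(papers, run_cbc, run_fc, max_rounds=10):
    """Alternate CBC and FC until the hard assignment implied by FC equals CBC's.

    Hypothetical interfaces (not from the slides):
      run_cbc(papers, previous) -> dict: paper -> session
      run_fc(papers, initial)   -> dict: paper -> {session: membership}
    """
    previous = None
    schedule, memberships = {}, {}
    for _ in range(max_rounds):
        schedule = run_cbc(papers, previous)
        memberships = run_fc(papers, schedule)
        # Hard assignment implied by FC: the session with the highest membership.
        fc_schedule = {p: max(w, key=w.get) for p, w in memberships.items()}
        if fc_schedule == schedule:
            break  # clusters agree: the memberships can be shown to the user
        previous = fc_schedule  # clusters differ: repeat everything
    return schedule, memberships
```

The cap on the number of rounds guards against the two methods never agreeing; whether the original system needs such a guard is not stated in the slides.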

8 CONFERENCE PAPERS CLUSTERING (PHASE 2)
RUN FUZZY CLUSTERING ON PHASE 1 RESULTS
- insight into result quality
- identify problematic papers
EXAMPLE
[Figure: sessions schedule (Session A – Title, Session B – Title, Session C – Title, Session D – Title, with coffee and lunch breaks), annotated with the values 25%, 13%, 42%, 10%, 37%]
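One way to "identify problematic papers" is to flag every paper whose fuzzy membership in its assigned session is low. The sketch below assumes that reading; the function name, the data layout, and the 0.5 threshold are hypothetical, and the sample values are made up for the example.

```python
def problematic_papers(schedule, memberships, threshold=0.5):
    """Return (paper, session, membership) for papers that fit their session poorly.

    schedule:    dict paper -> assigned session
    memberships: dict paper -> {session: membership in [0, 1]}
    threshold:   assumed cut-off below which a paper is flagged
    """
    flagged = []
    for paper, session in schedule.items():
        weight = memberships[paper].get(session, 0.0)
        if weight < threshold:
            flagged.append((paper, session, weight))
    # Worst-fitting papers first.
    return sorted(flagged, key=lambda item: item[2])

schedule = {"p1": "A", "p2": "A", "p3": "B"}
memberships = {
    "p1": {"A": 0.90, "B": 0.10},
    "p2": {"A": 0.25, "B": 0.75},
    "p3": {"A": 0.58, "B": 0.42},
}
flagged = problematic_papers(schedule, memberships)
```

Here `p2` and `p3` are flagged, with `p2` listed first because its membership in its assigned session is the lowest.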

9 C-MEANS ALGORITHM
- generate initial (random) cluster centres
- repeat
  - for each example, calculate the membership weights
  - for each cluster, recompute the new centre
- until the difference of the clusters between two iterations drops under some threshold
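The steps above can be sketched as a small, dependency-free fuzzy c-means. This is a textbook formulation, not the authors' implementation: the fuzzifier `m = 2`, the Euclidean distance, the stopping rule on centre movement, and all names are assumptions made for the example.

```python
import math
import random

def memberships(points, centres, m):
    """Membership weights u[i][j] of point i in cluster j (each row sums to 1)."""
    weights = []
    for p in points:
        dists = [math.dist(p, c) for c in centres]
        if min(dists) == 0.0:
            # Point coincides with one or more centres: share full membership.
            zeros = [d == 0.0 for d in dists]
            row = [1.0 / sum(zeros) if z else 0.0 for z in zeros]
        else:
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(inv)
            row = [v / total for v in inv]
        weights.append(row)
    return weights

def fuzzy_c_means(points, n_clusters, m=2.0, tol=1e-4, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Initial centres: randomly chosen data points.
    centres = [list(p) for p in rng.sample(points, n_clusters)]
    dim = len(points[0])
    for _ in range(max_iter):
        u = memberships(points, centres, m)
        new_centres = []
        for j in range(n_clusters):
            wm = [row[j] ** m for row in u]
            total = sum(wm)
            # New centre: mean of all points, weighted by membership^m.
            new_centres.append(
                [sum(w * p[k] for w, p in zip(wm, points)) / total for k in range(dim)]
            )
        shift = max(math.dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:  # centres barely moved between two iterations
            break
    return centres, memberships(points, centres, m)

points = [(0.0, 0.0), (0.5, 1.0), (1.5, 0.2), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
centres, u = fuzzy_c_means(points, n_clusters=2)
hard_labels = [row.index(max(row)) for row in u]
```

Each row of `u` holds one point's membership weights across the clusters; taking the largest weight per row gives the corresponding crisp assignment.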

10 DISTANCE MEASURE
VECTOR SPACE
- Usual similarity measure: cosine similarity
C-MEANS EXPLICITLY NEEDS DISTANCE (DISSIMILARITY), NOT SIMILARITY:
- There are many possibilities:
- None has ideal properties.
- Experimental evaluation shows no significant difference.
- We used
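The candidate formulas and the one finally chosen are not preserved in this transcript. Purely as an illustration of turning cosine similarity into a dissimilarity, the sketch below shows two common conversions; neither is claimed to be the measure the authors used, and the function names are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two dense, non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Common conversion: 1 - similarity (0 for parallel vectors)."""
    return 1.0 - cosine_similarity(u, v)

def angular_distance(u, v):
    """Alternative conversion: the angle between the vectors, in radians."""
    s = max(-1.0, min(1.0, cosine_similarity(u, v)))  # guard against rounding
    return math.acos(s)
```

For tf-idf vectors, whose components are non-negative, the first measure ranges over [0, 1] and the second over [0, π/2].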

11 COMPARISON OF CRISP & FUZZY CLUSTERING

12 TIME COMPLEXITY
If the dimensionality of the vectors is much higher than the number of clusters, the time complexity of c-means is comparable to that of k-means (this holds for document clustering).
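The slide's own formula is not preserved in the transcript. A standard per-iteration accounting, given here as an assumption, makes the claim concrete: with n documents, c clusters and d dimensions, the extra cost of fuzzy c-means is the membership update, which is dominated by the distance computations once d is much larger than c.

```latex
\begin{align*}
T_{k\text{-means}} &= O(n\,c\,d) \\
T_{c\text{-means}} &= \underbrace{O(n\,c\,d)}_{\text{distances}}
  + \underbrace{O(n\,c^{2})}_{\text{membership weights}}
  + \underbrace{O(n\,c\,d)}_{\text{new centres}}
  = O\bigl(n\,c\,(d + c)\bigr) \\
d \gg c \;&\Rightarrow\; T_{c\text{-means}} = O(n\,c\,d)
\end{align*}
```

The O(n c²) term assumes the memberships are computed directly from ratios of distances; it is the only term that does not also appear in k-means.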

13 FURTHER WORK
EVALUATION
- Test scenarios
- Benchmarks
- Using data from past conferences
USER INTERFACE
- Web interface for semi-automatic conference schedule creation
ALGORITHMS FINE-TUNING
…

14 DISCUSSION
CONTACTS
matjaz.jursic@ijs.si, vid.podpecan@ijs.si, nada.lavrac@ijs.si
THANK YOU FOR YOUR ATTENTION

