Matjaž Juršič, Vid Podpečan, Nada Lavrač



OVERVIEW
BASIC CONCEPTS
- Clustering
- Fuzzy Clustering
- Clustering of Documents
PROBLEM DOMAIN
- Conference Papers Clustering (Phase 1)
- Combining Constraint-Based & Fuzzy Clustering
- Conference Papers Clustering (Phase 2)
FUZZY CLUSTERING OF DOCUMENTS
- C-Means Algorithm
- Distance Measure
- Comparison of Crisp & Fuzzy Clustering
- Time Complexity
FURTHER WORK

CLUSTERING
An important unsupervised learning problem that deals with finding structure in a collection of unlabeled data.
Dividing data into groups (clusters) such that:
- "similar" objects are in the same cluster,
- "dissimilar" objects are in different clusters.
Problems:
- choosing a correct similarity/distance function between objects,
- evaluating clustering results.

FUZZY CLUSTERING
No sharp boundaries between clusters.
Each data object can belong to more than one cluster (with a certain probability).
E.g. membership of the "red square" data object:
- 70% in the "red" cluster
- 30% in the "green" cluster

CLUSTERING OF DOCUMENTS
BAG OF WORDS & VECTOR SPACE MODEL
- text represented as an unordered collection of words
- using tf-idf (term frequency–inverse document frequency)
- document = one vector in high-dimensional space
- similarity = cosine similarity between vectors
TEXT-GARDEN SOFTWARE LIBRARY
- collection of text-mining software tools (text analysis, model generation, document classification/clustering, web crawling, ...)
- C++ library
- developed at JSI
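The bag-of-words and vector space model above can be illustrated with a minimal sketch. It is plain Python and does not use Text-Garden. The function names, the whitespace tokenisation and the particular tf-idf variant (relative term frequency times log(N/df)) are assumptions made for illustration only.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenised documents into sparse tf-idf vectors (dict: term -> weight)."""
    n = len(docs)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # word order is discarded: bag of words
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus, tokenised by whitespace.
docs = [
    "fuzzy clustering of documents".split(),
    "fuzzy clustering of conference papers".split(),
    "web crawling and text analysis".split(),
]
vecs = tfidf_vectors(docs)
sim_related = cosine_similarity(vecs[0], vecs[1])    # documents sharing terms
sim_unrelated = cosine_similarity(vecs[0], vecs[2])  # documents sharing no terms
```

Documents that share no terms get similarity 0, and a document compared with itself gets similarity 1.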

CONFERENCE PAPERS CLUSTERING (PHASE 1)
PROBLEM
Grouping conference papers by content into a predefined session schedule.
EXAMPLE
[Figure: constraint-based clustering assigns papers to a sessions schedule of titled sessions: Session A (3 papers), Session B (4 papers), Session C (4 papers), Session D (3 papers), separated by coffee and lunch breaks.]

COMBINING CONSTRAINT-BASED & FUZZY CLUSTERING
PHASE 1 SOLUTION
- constraint-based clustering (CBC)
DIFFICULTIES
- CBC can get stuck in a local minimum
- often a low-quality result (created schedule)
- user interaction needed to repair the schedule
PHASE 2 NEEDED
- run fuzzy clustering (FC) with initial clusters from CBC
- if the output clusters of FC differ from those of CBC, repeat everything
- if the clusters of FC equal those of CBC, show the new information to the user
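The control flow of the two phases might be sketched as follows. Everything here is hypothetical: `run_cbc` and `run_fc` are placeholder callables standing in for the two clustering steps, fuzzy output is compared with the crisp CBC result by taking the highest membership per paper (an assumption; the slides do not say how the comparison is done), and `max_rounds` is an added safety limit.

```python
def harden(memberships):
    """Crisp assignment: index of the highest membership weight in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]

def two_phase_clustering(papers, run_cbc, run_fc, max_rounds=10):
    """Repeat CBC followed by FC until both agree on the session of every paper.

    run_cbc(papers) -> list with one session index per paper (hypothetical)
    run_fc(papers, assignment) -> list of membership rows, one per paper (hypothetical)
    """
    for round_no in range(1, max_rounds + 1):
        cbc_assignment = run_cbc(papers)              # phase 1
        memberships = run_fc(papers, cbc_assignment)  # phase 2, seeded by CBC
        if harden(memberships) == cbc_assignment:
            # Clusters agree: the memberships are the new information for the user.
            return cbc_assignment, memberships, round_no
        # Clusters differ: repeat everything.
    raise RuntimeError("no stable schedule found within max_rounds")
```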

CONFERENCE PAPERS CLUSTERING (PHASE 2)
RUN FUZZY CLUSTERING ON PHASE 1 RESULTS
- insight into result quality
- identify problematic papers
EXAMPLE
[Figure: the sessions schedule (Sessions A to D with coffee and lunch breaks), annotated with percentage values for individual papers: 25%, 13%, 42%, 10%, 37%.]
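One plausible way to identify problematic papers from the fuzzy memberships is to flag every paper whose membership in its assigned session is low. The sketch below assumes this reading; the function name, the data layout and the 0.5 threshold are hypothetical and not taken from the slides.

```python
def problematic_papers(assignment, memberships, threshold=0.5):
    """Return (paper index, membership) for papers that fit their session poorly.

    assignment[i]  -- session index given to paper i in phase 1
    memberships[i] -- fuzzy membership weights of paper i, one per session
    """
    flagged = []
    for paper, (session, row) in enumerate(zip(assignment, memberships)):
        if row[session] < threshold:
            flagged.append((paper, row[session]))
    return flagged
```

For example, a paper assigned to session 1 with memberships `[0.58, 0.42]` would be flagged, since it fits session 0 better than its own.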

C-MEANS ALGORITHM
- generate initial (random) cluster centres
- repeat
  - for each example, calculate membership weights
  - for each cluster, recompute the new centre
- until the difference between the clusters of two consecutive iterations drops below some threshold
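The steps above correspond to the standard fuzzy c-means iteration, which can be sketched in plain Python. This is an illustration, not the authors' implementation: Euclidean distance, the fuzzifier `m = 2`, initial centres sampled from the data, and all parameter names are assumptions.

```python
import math
import random

def fuzzy_c_means(points, c, m=2.0, tol=1e-4, max_iter=100, seed=0):
    """Return (centres, weights); weights[i][j] is the membership of point i in cluster j."""
    rng = random.Random(seed)
    # Initial (random) cluster centres: c points drawn from the data.
    centres = [list(p) for p in rng.sample(points, c)]
    dim = len(points[0])
    weights = []
    for _ in range(max_iter):
        # For each example, calculate membership weights from the distances.
        weights = []
        for p in points:
            dists = [math.dist(p, ctr) for ctr in centres]
            if 0.0 in dists:
                # Point coincides with a centre: split full membership among those.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                total = sum(hits)
                weights.append([h / total for h in hits])
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                weights.append([x / total for x in inv])
        # For each cluster, recompute the centre as a weighted mean of all points.
        new_centres = []
        for j in range(c):
            wm = [w[j] ** m for w in weights]
            total = sum(wm)
            new_centres.append([sum(wj * p[k] for wj, p in zip(wm, points)) / total
                                for k in range(dim)])
        # Stop when no centre moved more than the threshold.
        shift = max(math.dist(a, b) for a, b in zip(centres, new_centres))
        centres = new_centres
        if shift < tol:
            break
    return centres, weights
```

The returned weights are those computed just before the final centre update; once the loop has converged the difference is below `tol`.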

DISTANCE MEASURE
VECTOR SPACE
- Usual similarity measure: cosine similarity
C-MEANS EXPLICITLY NEEDS DISTANCE (DISSIMILARITY), NOT SIMILARITY:
- There are many possibilities:
- None has ideal properties.
- Experimental evaluation shows no significant difference.
- We used
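The formulas on this slide did not survive in the transcript, so the sketch below does not reproduce the authors' choice. It only shows three commonly used ways of turning cosine similarity into a dissimilarity; the function names are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """1 - cos: simple, but does not satisfy the triangle inequality."""
    return 1.0 - cosine_similarity(u, v)

def angular_distance(u, v):
    """Angle between the vectors in radians (clamped against rounding errors)."""
    return math.acos(max(-1.0, min(1.0, cosine_similarity(u, v))))

def chord_distance(u, v):
    """Euclidean distance between the two vectors after scaling to unit length."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * cosine_similarity(u, v)))
```

All three are 0 for vectors pointing in the same direction and grow as the angle between the vectors widens.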

COMPARISON OF CRISP & FUZZY CLUSTERING

TIME COMPLEXITY
If the dimensionality of the vectors is much higher than the number of clusters, the time complexity is comparable to that of k-means (this holds for document clustering).
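A rough per-iteration cost count suggests why this holds. The symbols are assumptions introduced here: n documents, c clusters, d-dimensional vectors, and a naive membership update that recomputes the normalising sum for every weight.

```latex
\begin{align*}
T_{k\text{-means}} &= O(n\,c\,d) \\
T_{c\text{-means}} &= \underbrace{O(n\,c\,d)}_{\text{distances}}
  + \underbrace{O(n\,c^{2})}_{\text{memberships}}
  + \underbrace{O(n\,c\,d)}_{\text{centres}}
  = O\bigl(n\,c\,(d + c)\bigr) \\
d \gg c \;&\Rightarrow\; T_{c\text{-means}} = O(n\,c\,d)
\end{align*}
```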

FURTHER WORK
EVALUATION
- Test scenarios
- Benchmarks
- Using data from past conferences
USER INTERFACE
- Web interface for semi-automatic conference schedule creation
ALGORITHMS FINE-TUNING
…

DISCUSSION
CONTACTS
THANK YOU FOR YOUR ATTENTION