Basic Machine Learning: Clustering
CS 315 – Web Search and Data Mining
What is Machine Learning?
- Algorithms for inferring new results from known data
- Not magic, though presented as such
- Not infallible, though sometimes presented as such
- Mainly based on probability and statistics
- Examples:
  - Detecting email spam
  - Recognizing handwritten text
  - Reading license plates and recognizing human faces
  - Recognizing speech
  - Recommending new movies to watch
Supervised vs. Unsupervised Learning
Two fundamental methods in machine learning:
- Supervised learning ("learn from my example")
  - Goal: a program that performs a task as well as humans
  - TASK – well defined (the target function)
  - EXPERIENCE – training data provided by a human
  - PERFORMANCE – error/accuracy on the task
- Unsupervised learning ("see what you can find")
  - Goal: to find some kind of structure in the data
  - TASK – vaguely defined
  - No EXPERIENCE
  - No PERFORMANCE (but there are some evaluation metrics)
What is Clustering?
- The most common form of unsupervised learning
- Clustering is the process of grouping a set of physical or abstract objects into classes ("clusters") of similar objects
- It can be used in IR:
  - To improve recall in search
  - For better navigation of search results
Ex1: Cluster to Improve Recall
- Cluster hypothesis: documents with similar text are related
- Thus, when a query matches a document D, also return the other documents in the cluster containing D
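The cluster-hypothesis idea above can be sketched in a few lines: given precomputed cluster assignments, expand the matched documents with the other members of their clusters. The document IDs and assignments below are hypothetical illustration data.

```python
# Hypothetical cluster assignments: doc id -> cluster id.
cluster_of = {"d1": 0, "d2": 0, "d3": 1, "d4": 1, "d5": 0}

def expand_with_cluster(matched_docs, cluster_of):
    """Return the matched docs plus all other docs sharing a cluster with them."""
    hit_clusters = {cluster_of[d] for d in matched_docs}
    return sorted(d for d, c in cluster_of.items() if c in hit_clusters)

# A query matching only d1 also pulls in its cluster-mates d2 and d5.
print(expand_with_cluster(["d1"], cluster_of))
```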
Ex2: Cluster for Better Navigation
Clustering Characteristics
- Flat clustering vs. hierarchical clustering
  - Flat: just dividing objects into groups (clusters)
  - Hierarchical: organize clusters in a hierarchy
- Evaluating clustering
  - Internal criteria:
    - The intra-cluster similarity is high (tightness)
    - The inter-cluster similarity is low (separateness)
  - External criteria:
    - Did we discover the hidden classes? (we need gold-standard data for this evaluation)
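The internal criteria above can be made concrete with a small sketch: average intra-cluster distance measures tightness (lower is better) and average inter-cluster distance measures separateness (higher is better). The 2-D points below are hypothetical toy data.

```python
import math

# Two hypothetical, well-separated clusters of 2-D points.
clusters = [
    [(0.0, 0.0), (0.1, 0.1), (0.0, 0.2)],
    [(5.0, 5.0), (5.1, 4.9), (4.9, 5.2)],
]

def avg_intra(clusters):
    """Average distance between pairs of points inside the same cluster (tightness)."""
    pairs = [(p, q) for c in clusters for i, p in enumerate(c) for q in c[i + 1:]]
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def avg_inter(clusters):
    """Average distance between points in different clusters (separateness)."""
    pairs = [(p, q) for i, ci in enumerate(clusters)
             for cj in clusters[i + 1:] for p in ci for q in cj]
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

# For a good clustering, intra-cluster distances are much smaller.
print(avg_intra(clusters) < avg_inter(clusters))
```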
Clustering for Web IR
- Representation for clustering
  - Document representation
  - Need a notion of similarity/distance
- How many clusters?
  - Fixed a priori?
  - Completely data driven?
  - Avoid "trivial" clusters – too large or small
Recall: Documents as vectors
- Each doc j is a vector of tf.idf values, one component for each term
- Can normalize vectors to unit length
- Vector-space terms are axes, a.k.a. features
- N docs live in this space
- Even with stemming, the space may have 20,000+ dimensions
- What makes documents related?
(Figure: documents D1–D4 plotted as points in a space with term axes t1 and t2.)
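A minimal sketch of the tf.idf representation described above, using raw term frequency times log inverse document frequency. The three-document toy corpus is hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: doc id -> tokenized text.
docs = {
    "D1": "web search engine".split(),
    "D2": "web data mining".split(),
    "D3": "machine learning data".split(),
}

N = len(docs)
# Document frequency: in how many docs each term appears.
df = Counter(t for words in docs.values() for t in set(words))

def tfidf(words):
    """Map each term to tf * log(N / df) for one document."""
    tf = Counter(words)
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

vectors = {d: tfidf(words) for d, words in docs.items()}

# "search" occurs in only one doc, so it gets a higher idf than "web".
print(vectors["D1"]["search"] > vectors["D1"]["web"])
```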
What makes documents related?
- Ideal: semantic similarity
- Practical: statistical similarity
- We will use cosine similarity
- We will describe algorithms in terms of cosine similarity
- This is known as the "normalized inner product"
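The normalized inner product above can be sketched directly for sparse term-weight vectors stored as dicts; the three toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector lengths."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical tf.idf-style vectors.
d1 = {"web": 0.5, "search": 1.2}
d2 = {"web": 0.4, "mining": 0.9}
d3 = {"jaguar": 1.0}

print(cosine(d1, d2) > cosine(d1, d3))  # d1 and d2 share the term "web"
print(cosine(d1, d1))                   # a vector with itself scores 1.0
```

Normalizing by length means two documents with the same topical mix score alike regardless of document length, which is why cosine is the standard choice for text.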
Clustering Algorithms
- Hierarchical algorithms
  - Bottom-up, agglomerative clustering
- Partitioning ("flat") algorithms
  - Usually start with a random partitioning
  - Refine it iteratively
- The famous k-means partitioning algorithm:
  - Given: a set of n documents and the number k
  - Compute: a partition of k clusters that optimizes the chosen partitioning criterion
K-means
- Assumes documents are real-valued vectors
- Each cluster C_i is based on the centroid of the points x in the cluster (= the center of gravity, or mean):
    mu(C_i) = (1 / |C_i|) * sum of x over all x in C_i
- Reassignment of instances to clusters tries to maximize cohesion (= minimize distance) to the current cluster centroids
K-Means Algorithm
Select K points as initial centroids
Repeat:
  Form K clusters by assigning each point to its closest centroid
  Recompute the centroid of each cluster
Until the centroids do not change
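The loop above can be sketched in plain Python: pick K initial centroids, assign each point to its closest centroid, recompute the centroids as cluster means, and stop when the assignment stabilizes. The 2-D toy points are hypothetical.

```python
import math
import random

def kmeans(points, k, seed=0):
    """Minimal K-means sketch; stops when assignments no longer change."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initial centroids: k random points
    assignment = None
    while True:
        # Assign each point to the index of its closest centroid.
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:       # stable -> centroids unchanged too
            return centroids, assignment
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))

# Two hypothetical tight groups of points.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, labels = kmeans(points, k=2)
print(labels[:3], labels[3:])  # the two tight groups land in different clusters
```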
K-means: Different Issues
- When to stop?
  - When a fixed number of iterations is reached
  - When the centroid positions do not change
- Seed choice
  - Results can vary based on the random seed selection
  - Try out multiple starting points
(Example showing sensitivity to seeds: for the same six points A–F, starting with centroids B and E converges to one clustering, while starting with centroids D and F converges to a different one.)
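The "try out multiple starting points" advice can be sketched as a restart loop: run a compact K-means from several seeds with a fixed iteration budget and keep the run with the lowest sum of squared errors (SSE). Data and helper names are hypothetical.

```python
import math
import random

def kmeans_sse(points, k, seed):
    """Run a fixed-budget K-means from one seed and return the final SSE."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(50):                        # fixed iteration budget
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda c: math.dist(p, centroids[c]))].append(p)
        # Empty groups keep their old centroid.
        centroids = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return sum(math.dist(p, min(centroids, key=lambda c: math.dist(p, c))) ** 2
               for p in points)

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
best_seed = min(range(5), key=lambda s: kmeans_sse(points, 2, s))
print("best of 5 restarts, SSE =", round(kmeans_sse(points, 2, best_seed), 3))
```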
K-Means in Orange
- http://orange.biolab.si/
- Machine learning software with a visual programming component
- Starting tutorial at http://wiki.sdakak.com/ml:getting-started-with-orange
Hierarchical clustering
- Build a tree-based hierarchical taxonomy (dendrogram) from a set of unlabeled examples
(Example dendrogram:
  animal
    vertebrate: fish, reptile, amphib., mammal
    invertebrate: worm, insect, crustacean)
Hierarchical Agglomerative Clustering (HAC)
- We assume there is a similarity function that determines the similarity of two instances
Algorithm:
  Start with all instances in their own cluster
  Until there is only one cluster:
    Among the current clusters, determine the two clusters, c_i and c_j, that are most similar
    Replace c_i and c_j with a single cluster c_i ∪ c_j
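The HAC loop above can be sketched for 1-D points, using single-link similarity (distance of the closest pair) as the cluster-similarity function; the input values are hypothetical.

```python
def hac(points):
    """Single-link HAC sketch: return the clusters in merge order."""
    clusters = [[p] for p in points]           # every instance in its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters whose closest pair of points is closest.
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: min(abs(x - y) for x in clusters[ab[0]]
                                      for y in clusters[ab[1]]))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# The pair 5.0/5.1 is closer than 0.0/0.2, so it merges first.
print(hac([0.0, 0.2, 5.0, 5.1])[0])
```

A naive implementation like this recomputes all pairwise similarities every round; real systems cache a similarity matrix and update only the merged cluster's row.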
What is the most similar cluster?
- Single-link: similarity of the closest pair of points (the most cosine-similar pair)
- Complete-link: similarity of the "furthest" pair of points (the least cosine-similar pair)
- Group-average agglomerative clustering: average cosine similarity between pairs of elements
- Centroid clustering: similarity of the clusters' centroids
Single link clustering
1) Use the maximum similarity of pairs:
     sim(c_i, c_j) = max over x in c_i, y in c_j of sim(x, y)
2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:
     sim(c_i ∪ c_j, c_k) = max( sim(c_i, c_k), sim(c_j, c_k) )
Complete link clustering
1) Use the minimum similarity of pairs:
     sim(c_i, c_j) = min over x in c_i, y in c_j of sim(x, y)
2) After merging c_i and c_j, the similarity of the resulting cluster to another cluster, c_k, is:
     sim(c_i ∪ c_j, c_k) = min( sim(c_i, c_k), sim(c_j, c_k) )
Hierarchical Clustering in Orange
Major issue – labeling
- After the clustering algorithm finds clusters, how can they be made useful to the end user?
- Need a concise label for each cluster
  - In search results, say "Animal" or "Car" in the jaguar example
  - In topic trees (Yahoo), need navigational cues
- Often done by hand, a posteriori
How to Label Clusters
- Show titles of typical documents
  - Titles are easy to scan
  - Authors create them for quick scanning!
  - But you can only show a few titles, which may not fully represent the cluster
- Show words/phrases prominent in the cluster
  - More likely to fully represent the cluster
  - Use distinguishing words/phrases
  - But harder to scan
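One simple way to pick distinguishing words, sketched below: rank each word by how concentrated its occurrences are inside the cluster relative to the whole collection. The toy documents echo the jaguar example and are hypothetical.

```python
from collections import Counter

# Hypothetical cluster (jaguar-the-animal docs) inside a mixed collection.
cluster_docs = ["jaguar speed cat", "jaguar cat habitat"]
all_docs = cluster_docs + ["jaguar car engine", "car engine speed"]

def label(cluster_docs, all_docs, n=2):
    """Top-n words whose occurrences are most concentrated in the cluster."""
    in_cluster = Counter(w for d in cluster_docs for w in d.split())
    overall = Counter(w for d in all_docs for w in d.split())
    score = {w: in_cluster[w] / overall[w] for w in in_cluster}
    ranked = sorted(score, key=lambda w: (-score[w], -in_cluster[w]))
    return ranked[:n]

# "jaguar" is frequent but not distinguishing: it also appears in the car docs.
print(label(cluster_docs, all_docs))
```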
Further issues
- Complexity: clustering is computationally expensive; implementations need careful balancing of needs
- How to decide how many clusters are best?
- Evaluating the "goodness" of clustering
  - There are many techniques; some focus on implementation issues (complexity/time), some on the quality of the resulting clusters