Presentation on theme: "Last lecture summary."— Presentation transcript:

1 Last lecture summary

2 Test-data and Cross Validation

3 Testing error vs. training error as a function of model complexity
The earlier examples compared different algorithms; this plot is about model complexity for a given algorithm: training error keeps decreasing as the model gets more complex, while testing error eventually starts to rise.

4 Test set method
Split the data set into a training set and a test set; a common ratio is 70:30. Train the algorithm on the training set and assess its performance on the test set.
Disadvantages: this is simple, however it wastes data, and the test-set estimate of performance has high variance.
(adopted from the Cross Validation tutorial by Andrew Moore)

5 Stratified division – keep the same class proportions in the training and test sets.
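A minimal sketch of a stratified 70:30 split with scikit-learn; the toy data and the use of train_test_split are assumptions for illustration, not the lecture's own code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))    # toy features
y = np.repeat([0, 1], 50)        # toy labels, two balanced classes

# stratify=y keeps the class proportions the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

print(np.bincount(y_train), np.bincount(y_test))   # [35 35] [15 15]
```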

6 Validation set
Training error cannot be used as an indicator of the model's performance because of overfitting. On the training data set we train a range of models, or a given model with a range of values for its parameters, and compare them on independent data – the validation set. If the model design is iterated many times, some overfitting to the validation data can occur, so it may be necessary to keep aside a third test set on which the performance of the selected model is finally evaluated.
The terms validation and test data set are often used interchangeably; it is important to find out what test/validation means in the concrete application.

7 LOOCV (leave-one-out cross validation)
Choose one data point and remove it from the set. Fit the remaining data points and note your error, using the removed data point as the test. Repeat these steps for all points. When you are done, report the mean squared error (in case of regression).
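A minimal LOOCV sketch for simple linear regression, numpy only; the toy data are an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=x.size)   # toy linear data

errors = []
for i in range(x.size):
    train = np.delete(np.arange(x.size), i)          # remove point i from the set
    coeffs = np.polyfit(x[train], y[train], deg=1)   # fit the remaining points
    pred = np.polyval(coeffs, x[i])                  # test on the removed point
    errors.append((pred - y[i]) ** 2)

print("LOOCV mean squared error:", np.mean(errors))
```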

8 k-fold cross validation
Randomly break the data into k partitions. Remove one partition from the set, fit the remaining data points, and note your error using the removed partition as the test data set. Repeat these steps for all partitions. When you are done, report the mean squared error (in case of regression); see the sketch below.
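A hand-rolled k-fold loop following the steps above, numpy only; the toy data and the choice k = 5 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(size=40)
y = 2 * x + 1 + 0.1 * rng.normal(size=x.size)   # toy linear data

k = 5
indices = rng.permutation(x.size)               # randomly break the data ...
folds = np.array_split(indices, k)              # ... into k partitions

errors = []
for i in range(k):
    test = folds[i]                                        # held-out partition
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    coeffs = np.polyfit(x[train], y[train], deg=1)         # fit the rest
    pred = np.polyval(coeffs, x[test])
    errors.append(np.mean((pred - y[test]) ** 2))          # fold MSE

print("k-fold MSE:", np.mean(errors))
```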

9 Selection and testing
Complete procedure for algorithm selection and estimation of its quality: divide the data into train and test sets, choose the algorithm by cross validation on the train set, use this algorithm to construct a classifier on the whole train set, and estimate its quality on the test set.

10 Model selection via CV
Polynomial regression: for each degree from 1 to 6 the slide tabulates MSEtrain and MSE10-fold, and the degree with the lowest 10-fold MSE is the one chosen (table values omitted). Adopted from the Cross Validation tutorial by Andrew Moore.
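A sketch of choosing the polynomial degree by 10-fold cross validation with scikit-learn; the toy data are an assumption, so the MSE values will not match the slide's table, only the procedure is the same.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(-1, 1, size=40))
y = 1 - 2 * x + 3 * x**2 + 0.2 * rng.normal(size=x.size)   # quadratic toy data
X = x.reshape(-1, 1)

cv = KFold(n_splits=10, shuffle=True, random_state=0)
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse_train = np.mean((model.fit(X, y).predict(X) - y) ** 2)
    mse_cv = -cross_val_score(model, X, y, cv=cv,
                              scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: MSEtrain = {mse_train:.3f}, MSE10-fold = {mse_cv:.3f}")
# choose the degree with the smallest MSE10-fold (typically 2 for this toy data)
```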

11

12 Nearest Neighbors Classification

13 Instances – assign a new point to the class of the most similar stored point.

14 Similarity and distance
Similarity sij is a quantity that reflects the strength of the relationship between two objects or two features. Distance dij measures dissimilarity, i.e. the discrepancy between two objects based on several features. Distance satisfies the following conditions:
distance is always positive or zero (dij ≥ 0)
distance is zero if and only if it is measured from an object to itself
distance is symmetric (dij = dji)
In addition, if distance satisfies the triangle inequality (|x + y| ≤ |x| + |y|, i.e. dik ≤ dij + djk), then it is called a metric.

15 Distances for quantitative variables
Minkowski distance (Lp norm): d(x, y) = (Σk |xk − yk|^p)^(1/p); p = 1 gives the Manhattan distance and p = 2 the Euclidean distance (next two slides).
Distance matrix – a matrix with all pairwise distances.
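A small sketch of the Minkowski distance and a pairwise distance matrix, numpy only; the toy 2-D points are an assumption for illustration.

```python
import numpy as np

def minkowski(x, y, p):
    """Lp distance: p=1 -> Manhattan, p=2 -> Euclidean."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

points = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])

# distance matrix with all pairwise Euclidean distances
n = len(points)
D = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        D[i, j] = minkowski(points[i], points[j], p=2)

print(D)                                     # D[0, 1] == 5.0
print(minkowski(points[0], points[1], p=1))  # Manhattan distance: 7.0
```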

16 Manhattan distance (figure: two points with coordinates x1, x2 and y1, y2; the distance is measured along the coordinate axes)

17 Euclidean distance (figure: the same two points; the distance is the length of the straight line between them)

18 Assign to the class of the most similar point.

19 k-NN
Supervised learning; the target function f may be
discrete-valued (classification)
real-valued (regression)
We assign a point to the class of the instance that is most similar to it.
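A minimal k-NN classification sketch, numpy only; the toy training set and query point are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Euclidean distances from the query to all training points
    dists = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(dists)[:k]       # indices of the k nearest neighbors
    votes = Counter(y_train[nearest])     # majority vote among their labels
    return votes.most_common(1)[0][0]

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # -> "A"
```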

20 k-NN is a lazy learner
Lazy learning: generalization beyond the training data is delayed until a query is made to the system. This is opposed to eager learning, where the system tries to generalize the training data before receiving queries.
(Lazy learners – like lazy students; I don't know, it reminds me of someone.)

21 Which k is best?
k = 1 fits noise and outliers (overfitting); the value of k should not be too small, but a k that is too large (e.g. k = 15 in the figure) smooths out distinctive behavior. The best k is chosen by cross validation.
(Hastie et al., Elements of Statistical Learning)

22 Real-valued target function
The algorithm calculates the mean value of the k nearest training examples. For k = 3 and neighbors with values 12, 14 and 10, the predicted value = (12 + 14 + 10)/3 = 12.

23 Distance-weighted NN
Give greater weight to closer neighbors: votes are weighted by inverse squared distance. Example with k = 4 and neighbor distances 1, 2, 4, 5 (two neighbors from each class):
unweighted – 2 votes for each class (a tie)
weighted – 1/1² + 1/2² = 1.25 votes vs. 1/4² + 1/5² = 0.1025 votes
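A small sketch of distance-weighted voting with inverse squared distances, reproducing the 1/1² + 1/2² vs. 1/4² + 1/5² arithmetic above; the class labels "red" and "blue" are assumptions.

```python
import numpy as np
from collections import defaultdict

def weighted_vote(distances, labels):
    votes = defaultdict(float)
    for d, lab in zip(distances, labels):
        votes[lab] += 1.0 / d**2          # weight = inverse squared distance
    return max(votes, key=votes.get), dict(votes)

distances = np.array([1.0, 2.0, 4.0, 5.0])
labels = ["red", "red", "blue", "blue"]

winner, votes = weighted_vote(distances, labels)
print(votes)   # {'red': 1.25, 'blue': 0.1025}
print(winner)  # 'red'
```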

24 k-NN issues
The curse of dimensionality is a problem. Significant computation may be required to process each new query: to find the nearest neighbors one has to evaluate the full distance matrix. Efficient indexing of the stored training examples helps, e.g. a kd-tree (see the sketch below).
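A small sketch of answering nearest-neighbor queries with a k-d tree via scipy.spatial.KDTree, avoiding the full distance matrix; the random toy data are an assumption for illustration.

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(5)
X_train = rng.uniform(size=(10_000, 3))   # stored training examples
tree = KDTree(X_train)                    # build the index once

query = rng.uniform(size=(1, 3))
dists, idx = tree.query(query, k=5)       # 5 nearest neighbors, no full distance matrix
print(idx, dists)
```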

25

26 Cluster Analysis (Czech: shluková analýza)

27 We have data, we don't know classes.
Assign data objects into groups (called clusters) so that data objects from the same cluster are more similar to each other than to objects from different clusters.

29 Stages of clustering process
On clustering validation techniques, M. Halkidi, Y. Batistakis, M. Vazirgiannis

30 How would you solve the problem?
How to find clusters? Group together the most similar patterns.

31 Single linkage (metoda nejbližšího souseda)
The distance of two clusters is given as the distance of the two closest points in different clusters.
(Based on A Tutorial on Clustering Algorithms.)

32 Single linkage example – distances between six Italian cities (BA = Bari, FL = Florence, MI = Milano, NA = Naples, RM = Rome, TO = Torino):

      BA   FL   MI   NA   RM   TO
BA     0
FL   662    0
MI   877  295    0
NA   255  468  754    0
RM   412  268  564  219    0
TO   996  400  138  869  669    0

The smallest distance is MI–TO (138), so Milano and Torino are merged into the cluster MI/TO.

33 Single-linkage distance from the new cluster MI/TO to BA: min(877, 996) = 877.

34 Single-linkage distance from MI/TO to FL: min(295, 400) = 295.

35 Single-linkage distance from MI/TO to NA: min(754, 869) = 754.

36 Single-linkage distance from MI/TO to RM: min(564, 669) = 564.

37 Reduced distance matrix after merging MI/TO:

        BA   FL  MI/TO   NA
FL     662
MI/TO  877  295
NA     255  468   754
RM     412  268   564  219

The smallest distance is NA–RM (219), so Naples and Rome are merged into NA/RM.

38 Reduced distance matrix after merging NA/RM:

        BA   FL  MI/TO
FL     662
MI/TO  877  295
NA/RM  255  268   564

The smallest distance is BA–NA/RM (255), so Bari joins the cluster NA/RM.

39 After Bari joins: d(BA/NA/RM, FL) = 268, d(BA/NA/RM, MI/TO) = 564, d(FL, MI/TO) = 295. The smallest distance is 268, so Florence joins BA/NA/RM.

40 Only two clusters remain, BA/FL/NA/RM and MI/TO, with distance 295; the final merge happens at 295.

41 Dendrogram – the merging order: Torino–Milano, Rome–Naples, then Bari, then Florence, and finally the join of Torino–Milano with Rome–Naples–Bari–Florence.

42 Dendrogram with merge heights (dissimilarity): Torino–Milano at 138, Rome–Naples at 219, Bari joins at 255, Florence joins at 268, and the final join of Torino–Milano with Rome–Naples–Bari–Florence is at 295. The leaves are ordered BA, NA, RM, FL, MI, TO.

43 The dendrogram (dissimilarity axis, leaves BA, NA, RM, FL, MI, TO) shown alongside the map of the six cities.

44 Complete linkage (metoda nejvzdálenějšího souseda)
The distance of two clusters is given as the distance of the two farthest points in different clusters.

45 Complete linkage example – the same distance matrix of the six Italian cities. The smallest distance is again MI–TO (138), so Milano and Torino are merged into MI/TO first.

46 Complete-linkage distance from MI/TO to BA: max(877, 996) = 996.

47 Complete-linkage distance from MI/TO to FL: max(295, 400) = 400.

48 Complete-linkage distance from MI/TO to NA: max(754, 869) = 869.

49 Complete-linkage distance from MI/TO to RM: max(564, 669) = 669.

50 Reduced distance matrix after merging MI/TO (complete linkage):

        BA   FL  MI/TO   NA
FL     662
MI/TO  996  400
NA     255  468   869
RM     412  268   669  219

The smallest distance is NA–RM (219), so Naples and Rome are merged into NA/RM.

51 Reduced distance matrix after merging NA/RM (complete linkage):

        BA   FL  MI/TO
FL     662
MI/TO  996  400
NA/RM  412  468   869

The smallest distance is FL–MI/TO (400), so Florence joins MI/TO.

52 Three clusters remain: BA, MI/TO/FL and NA/RM, with complete-linkage distances d(BA, MI/TO/FL) = 996, d(BA, NA/RM) = 412 and d(MI/TO/FL, NA/RM) = 869; BA merges with NA/RM next.

53 Comparison of the dendrograms obtained with complete linkage and with single linkage (leaves MI, TO, BA, NA, RM, FL).

54 Average linkage (metoda průměrné vazby)
Also known as UPGMA (Unweighted Pair Group Method with Arithmetic Mean). The distance of two clusters is given as the average value over all pairs of objects from different clusters.

55 Average linkage example: the distance from MI/TO to BA is the average of the two highlighted distances, (877 + 996)/2 = 936.5.

56 Centroid linkage

57 Centroid linkage – each cluster is represented by its centroid, and the distance between clusters is measured from the centroids. In the example the highlighted distance from the centroid of the MI/TO cluster is 895.

58 Summary
How to measure similarity between clusters?
single linkage (MIN)
complete linkage (MAX)
average linkage
centroids

59–62 Summary (continued) – figures illustrating the single linkage (MIN), complete linkage (MAX), average linkage and centroid distances between two clusters.
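A sketch reproducing the city example with SciPy's hierarchical clustering; this is one possible implementation, not the code used in the lecture. Centroid (and Ward) linkage in SciPy expects raw coordinates rather than a precomputed distance matrix, so only single, complete and average linkage are shown here.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

labels = ["BA", "FL", "MI", "NA", "RM", "TO"]
D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
], dtype=float)

condensed = squareform(D)   # condensed form required by linkage()

for method in ("single", "complete", "average"):
    Z = linkage(condensed, method=method)
    # Z[:, 2] holds the merge heights, e.g. 138, 219, 255, 268, 295 for single
    print(method, Z[:, 2])

# scipy.cluster.hierarchy.dendrogram(Z, labels=labels) would draw the tree
# (plotting needs matplotlib)
```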

63 Ward's linkage (method)
This method is distinct from all the other methods because it uses an analysis of variance approach to evaluate the distances between clusters. In short, it attempts to minimize the sum of squares (SS) of any two (hypothetical) clusters that can be formed at each step.

64 In Ward's method no metric has to be chosen.
Instead, sums of squares (i.e. squared Euclidean distances) between centroids of clusters are computed. Using Ward's method we start with all sample units in n clusters of size 1 each; in the first step of the algorithm, n − 1 clusters are formed, one of size two and the rest of size 1. Ward's method has a strong tendency to split data into groups of roughly equal size, so when the "natural" clusters differ much in size, the big ones are split into smaller parts roughly equal in size to the smaller "natural" clusters.

65 Ward's method keeps this growth as small as possible.
Ward's method says that the distance between two clusters, A and B, is how much the sum of squares will increase when we merge them. At the beginning of clustering the sum of squares is zero (every point is its own cluster) and it grows as we merge clusters; Ward's method keeps this growth as small as possible.
The merging cost can also give a hint about the number of clusters: if the cost of merging increases a lot, the clustering is probably going too far and losing a lot of structure. A rule of thumb is therefore to keep reducing k until the cost jumps, and then use the k right before the jump. This leaves you to decide how big a merging cost is acceptable, and there is no theory saying that this will often or even usually lead to good choices, but it makes a kind of sense. The same rule of thumb can be applied to other hierarchical clustering techniques: pick the k just before the merging cost takes off. Keep in mind, however, that this is a heuristic; a sketch of the rule follows.
Note: the sum of squares measures distance equally in all directions, so it wants the clusters to be round, which is not always sensible.
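A sketch of the merging-cost rule of thumb using SciPy's Ward linkage on toy 2-D data; the data, the seed and the way the jump is read off are assumptions for illustration, and, as the slide stresses, the rule itself is only a heuristic.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(3)
# three well-separated blobs of 20 points each
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

Z = linkage(X, method="ward")     # Ward needs raw coordinates (Euclidean)
heights = Z[:, 2]                 # cost of each successive merge

# going from k+1 clusters down to k clusters costs heights[-k]; look for the jump
for k in range(5, 0, -1):
    print(f"merge to {k} clusters costs {heights[-k]:.2f}")
# the cost jumps sharply when the 3 real blobs are forced into 2 -> choose k = 3
```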

66

67 Types of clustering
hierarchical – groups data with a sequence of nested partitions
  agglomerative (bottom-up) – start with each data point as one cluster and join clusters until all points form one cluster
  divisive (top-down) – initially all objects are in one cluster, which is then subdivided into smaller and smaller pieces
partitional – divides the data points into some prespecified number of clusters without the hierarchical structure, i.e. divides the space

68 Hierarchical clustering
Agglomerative methods are used more widely. Divisive methods would need to consider 2^(N−1) − 1 possible subset divisions, which is very computationally intensive (the computational difficulty of finding the optimum partitions). Divisive methods are better at finding large clusters than agglomerative methods.

69 Hierarchical clustering
Disadvantages:
high computational complexity, at least O(N²) – all mutual distances need to be calculated
inability to adjust once the splitting or merging is performed (no undo)

70 k-means
How to avoid computing all mutual distances? Calculate distances from representatives (centroids) of the clusters.
Advantage: the number of centroids is much lower than the number of data points.
Disadvantage: the number of centroids k must be given in advance.

71 k-means – kids' algorithm
Once there was a land with N houses. One day K kings arrived in this land. Each house was taken by the nearest king. But each community wanted its king to be at the center of the village, so the thrones were moved there. Then the kings realized that some houses were now closer to them, so they took those houses, but they also lost some. This went on and on, until one day they couldn't move anymore, so they settled down and lived happily ever after in their villages.

72 k-means – adults' algorithm
decide on the number of clusters k
randomly initialize k centroids
repeat until convergence (centroids do not move):
  assign each point to the cluster represented by the nearest centroid
  move each centroid to the mean of all points in its cluster
A minimal implementation sketch follows.
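A minimal k-means sketch following the steps above, numpy only; the toy two-blob data, the iteration cap and the random seed are assumptions for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(n_iter):
        # assignment step: each point goes to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged: centroids do not move
            break
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(c, 0.4, size=(30, 2)) for c in ([0, 0], [4, 4])])
labels, centroids = kmeans(X, k=2)
print(centroids)   # roughly [0, 0] and [4, 4]
```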

73 k-means applet

74 Disadvantages
k must be determined in advance.
Sensitive to initial conditions.
The algorithm minimizes an "energy" function (for the k clusters, the sum of squared distances of all points in a cluster from its centroid, E = Σj Σx∈Cj ||x − μj||²), but it may get trapped in a local minimum.
Applicable only when the mean is defined – what about categorical data? E.g. replace the mean with the mode (k-modes).
The arithmetic mean is not robust to outliers (use the median – k-medoids; a medoid is the most centrally located object in a cluster).
Clusters are spherical because the algorithm is based on distance.

75

76 Cluster validation
How many clusters are there in the data set? Does the resulting clustering scheme fit our data set? Is there a better partitioning for the data set? The quantitative evaluation of clustering results is known under the general term cluster validity methods.

77 External, internal and relative validation
external – evaluate the clustering results based on knowledge of the correct class labels
internal – no correct class labels are available; the quality estimate is based on information intrinsic to the data alone
relative – several different classifications of one data set are compared, obtained with the same classification algorithm under different parameters

78 External validation measures
Binary measures are based on the contingency table of the pairwise assignment of data items. We have two divisions of the data consisting of N objects (S = {S1, …, SN}):
the original (i.e. known) classes C = {C1, …, Cn}
the generated clusters (i.e. after clustering) P = {P1, …, Pm}
(From Computational cluster validation in post-genomic data analysis, Julia Handl, Joshua Knowles and Douglas B. Kell, and its supplementary material.)

79 Rand index
Four cases are possible for a pair of points: they are in the same cluster in the original set C as well as in the generated set P, in different clusters in both, or in the same cluster in one and different clusters in the other. The following counts can be calculated:
a – the number of pairs of elements of S that are in the same set in C and in the same set in P
b – the number of pairs of elements of S that are in different sets in C and in different sets in P
c – the number of pairs of elements of S that are in the same set in C and in different sets in P
d – the number of pairs of elements of S that are in different sets in C and in the same set in P
Rand index: R = (a + b) / (a + b + c + d), with values in [0, 1]; it should be maximized. A small sketch of the computation follows.
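A small sketch of the Rand index computed directly from the pairwise definitions above; the two toy labelings are assumptions for illustration.

```python
from itertools import combinations

def rand_index(known, clustered):
    a = b = c = d = 0
    for i, j in combinations(range(len(known)), 2):
        same_known = known[i] == known[j]
        same_clustered = clustered[i] == clustered[j]
        if same_known and same_clustered:
            a += 1          # same set in C, same set in P
        elif not same_known and not same_clustered:
            b += 1          # different sets in C, different sets in P
        elif same_known and not same_clustered:
            c += 1          # same set in C, different sets in P
        else:
            d += 1          # different sets in C, same set in P
    return (a + b) / (a + b + c + d)

known     = [0, 0, 0, 1, 1, 1]
clustered = [0, 0, 1, 1, 1, 1]
print(rand_index(known, clustered))   # 10/15 ~ 0.667 for this toy example
```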

80 Rand index (example)
Contingency table of pairs of points (rows: ground truth C, columns: clustering P):

                              same cluster in P   different clusters in P
same class in C                       20                    24
different classes in C                20                    72

Rand index = (20 + 72) / (20 + 24 + 20 + 72) = 92/136 = 0.68

81 Cophenetic correlation coefficient (CPCC)
How good is the hierarchical clustering we just performed? In other words, how faithfully does the dendrogram preserve the pairwise distances between the original data points? To calculate the CPCC we need two matrices: the distance matrix and the cophenetic matrix.

82 Cophenetic matrix
The single-linkage merges were: TO–MI at 138, RM–NA at 219, BA joined at 255, FL joined at 268, and the final join of TO–MI with RM–NA–BA–FL at 295. The cophenetic distance of two points is the height at which they first end up in the same cluster, which gives the cophenetic matrix:

      BA   FL   MI   NA   RM   TO
BA     0
FL   268    0
MI   295  295    0
NA   255  268  295    0
RM   255  268  295  219    0
TO   295  295  138  295  295    0

83 CPCC is the correlation coefficient between the corresponding entries of the two matrices.

Distance matrix (lower triangle):
      BA   FL   MI   NA   RM
FL   662
MI   877  295
NA   255  468  754
RM   412  268  564  219
TO   996  400  138  869  669

Cophenetic matrix (lower triangle):
      BA   FL   MI   NA   RM
FL   268
MI   295  295
NA   255  268  295
RM   255  268  295  219
TO   295  295  138  295  295

Writing the two lower triangles as two columns (Dist, CP) and correlating them gives CPCC = 0.64 (64%).
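A sketch of the CPCC computed with SciPy for the single-linkage city example; one possible way to reproduce the 0.64 figure, not the lecture's own code.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, cophenet

D = np.array([
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0],
], dtype=float)                     # order: BA, FL, MI, NA, RM, TO

condensed = squareform(D)
Z = linkage(condensed, method="single")

# cophenet returns the correlation coefficient and the cophenetic distances
cpcc, coph_dists = cophenet(Z, condensed)
print(round(cpcc, 2))               # ~0.64
```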

84 Interpretation of CPCC
As a rough rule, if CPCC < about 0.8, all the data belong to just one cluster. The higher the CPCC, the less information is lost in the clustering process. CPCC can be calculated at each step of building the dendrogram, taking into account only the entities built into the tree up to that point. In a plot of CPCC vs. the number of steps, a decrease in CPCC indicates that the cluster just formed has made the dendrogram less faithful to the data, i.e. stop clustering one step before. The CPCC criterion for the number of clusters is again just a heuristic.

85 CPCC = 0.64 < 0.8
Looking at the data (the map of the cities), one sees that there are probably no real clusters.

86 Looking at the data, one sees that there are probably no real clusters.

87 Silhouette validation technique
In this approach each cluster can be represented by a so-called silhouette, which is based on the comparison of its cohesion (tightness) and separation.
Cohesion measures how closely related the objects in a cluster are; separation measures how distinct or well separated a cluster is from the other clusters.

88 In this technique, several measures can be calculated:
the silhouette width for each sample (you'll see later why these numbers are called "widths")
the average silhouette width for each cluster
the overall average silhouette width for the whole data set

89 s(i) – silhouette of one data point (sample)
cohesion a(i) – the average distance of sample i to all other objects in the same cluster
separation b(i) – the average distance of sample i to the objects of each other cluster; take the minimum over those clusters
The silhouette width is then s(i) = (b(i) − a(i)) / max(a(i), b(i)).

90 Figure: cohesion a(i) is the average distance within the cluster; separation b(i) is the average distance to each of the other clusters, of which the minimum is taken.

91 −1 ≤ s(i) ≤ 1
close to 1 – the i-th sample has been assigned to an appropriate cluster (good)
close to 0 – the sample could equally well be assigned to the nearest neighbouring cluster (indifferent)
close to −1 – the sample has been "misclassified" (bad)

92 For a given cluster j it is possible to calculate the cluster silhouette Sj as the average of the silhouette widths of all samples in that cluster. It characterises the heterogeneity and isolation properties of the cluster.

93 Global silhouette (silhouette index)
GS is calculated as the average of the cluster silhouettes, GS = (1/c) Σj Sj, where c is the number of clusters. The largest GS indicates the best number of clusters. (Look also at the Iris example on the following slides.)
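A sketch of per-sample silhouette widths, per-cluster silhouettes Sj and the global silhouette GS with scikit-learn on the Iris data used in the following slides; the use of sklearn (and of all four Iris features) is an assumption, so the numbers need not match the slide's GS values exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

X = load_iris().data

for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    widths = silhouette_samples(X, labels)                # s(i) for every sample
    Sj = [widths[labels == j].mean() for j in range(k)]   # cluster silhouettes
    GS = np.mean(Sj)                                      # global silhouette
    print(k, "clusters: GS =", round(GS, 3),
          "cluster silhouettes =", [round(s, 3) for s in Sj])
```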

94 Iris data, k-means – silhouette
Most points in both clusters have a large silhouette value, greater than 0.8, indicating that those points are well separated from neighbouring clusters. However, each cluster also contains a few points with low silhouette values, indicating that they lie near points from other clusters.

95 Petal width is correlated with petal length, so a 3-D graph can be used for visualization.
If you plot the data using different symbols for each cluster created by k-means, you can identify the points with small silhouette values as those that are close to points from other clusters. The centroids of each cluster are plotted using circled X's. Three of the points from the lower cluster, plotted with triangles, are very close to points from the upper cluster, plotted with squares. In fact, because the upper cluster is so spread out, those three points are closer to the centroid of the lower cluster than to that of the upper cluster, even though they are separated from the bulk of the points in their own cluster by a gap. Because k-means clustering only considers distances, not densities, this kind of result can occur.

96 The same data clustered with k-means into 3 clusters.

97 GS(2 clusters) = 0.8504, GS(3 clusters) = 0.7357
k-means has split the upper cluster from the two-cluster solution, and those two clusters are very close to each other. Depending on what you intend to do with these data after clustering them, the three-cluster solution may be more or less useful than the previous two-cluster solution. The average silhouette value was larger for the two-cluster solution, indicating that it is the better answer purely from the point of view of creating distinct clusters.

