Cluster Analysis

Variable selection
Variables are chosen on theoretical, conceptual, and practical grounds. Cluster analysis (CA) cannot itself determine which variables are relevant and which are not, and inappropriate variables can distort the analysis.
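A minimal sketch of how an inappropriate variable can change the result. The observations, the column layout, and the `nearest` helper are hypothetical and chosen only for illustration: including a second, irrelevant variable on a larger scale flips which observation counts as the closest one.

```python
from math import dist

# Hypothetical observations: (relevant_variable, irrelevant_variable).
target = (1.0, 50.0)
candidates = {
    "b": (1.2, 10.0),  # close to the target on the relevant variable
    "c": (5.0, 49.0),  # close to the target only on the irrelevant variable
}

def nearest(target, candidates, columns):
    """Name of the candidate closest to `target`, using only the listed columns."""
    def project(point):
        return [point[i] for i in columns]
    return min(candidates, key=lambda name: dist(project(target), project(candidates[name])))

relevant_only = nearest(target, candidates, columns=[0])
with_irrelevant = nearest(target, candidates, columns=[0, 1])
print(relevant_only, with_irrelevant)
```

The distance computation treats every supplied column as equally meaningful, so the choice of variables has to be made before clustering.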

What is clustering?
A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Inter-cluster distances are maximized; intra-cluster distances are minimized.
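The two quantities can be made concrete with a small sketch. It assumes Euclidean distance and a hypothetical two-group toy data set; neither is specified in the slides. In a good grouping, the average distance inside a cluster should be much smaller than the average distance between clusters.

```python
from itertools import combinations, product
from math import dist

def mean_intra_distance(cluster):
    """Average distance between all pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

def mean_inter_distance(cluster_a, cluster_b):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)

# Hypothetical toy data: two well-separated groups in the plane.
group_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
group_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

intra_a = mean_intra_distance(group_a)
inter_ab = mean_inter_distance(group_a, group_b)
print(intra_a, inter_ab)
```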

Why do we cluster?
Clustering: given a collection of data objects, group them so that they are similar to one another within the same cluster and dissimilar to the objects in other clusters.
Clustering results are used:
As a stand-alone tool to get insight into the data distribution. Visualization of clusters may unveil important information.
As a preprocessing step for other algorithms. Efficient indexing or compression often relies on clustering.

Applications of clustering
Image processing: cluster images based on their visual content.
Web: cluster groups of users based on their access patterns on webpages; cluster webpages based on their content.
Bioinformatics: cluster similar proteins together (similarity with respect to chemical structure and/or functionality, etc.).
Many more…

The clustering task
Group observations so that the observations belonging to the same group are similar, whereas observations in different groups are different.
Basic questions:
What does “similar” mean?
What is a good partition of the objects? That is, how is the quality of a solution measured?
How can a good partition of the observations be found?

Observations to cluster
Real-valued attributes/variables, e.g., salary, height.
Binary attributes, e.g., gender (M/F), has_cancer (T/F).
Nominal (categorical) attributes, e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.).
Ordinal/ranked attributes, e.g., military rank (soldier, sergeant, lieutenant, captain, etc.).
Variables of mixed types: multiple attributes with various types.
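One possible way to compare observations with mixed attribute types is sketched below. The slides do not prescribe a measure; this sketch is loosely modeled on Gower-style averaging of per-attribute dissimilarities, and the field names, the rank order, and `salary_range` are assumptions made for illustration.

```python
# Assumed ordering of the ordinal attribute, lowest rank first.
RANKS = ["soldier", "sergeant", "lieutenant", "captain"]

def mixed_dissimilarity(x, y, salary_range):
    """Average of per-attribute dissimilarities, each scaled to the interval [0, 1]."""
    parts = [
        # Real-valued: absolute difference divided by the observed range.
        abs(x["salary"] - y["salary"]) / salary_range,
        # Binary: 0 if equal, 1 otherwise.
        0.0 if x["has_cancer"] == y["has_cancer"] else 1.0,
        # Nominal: 0 if same category, 1 otherwise.
        0.0 if x["religion"] == y["religion"] else 1.0,
        # Ordinal: difference in rank position divided by the largest possible gap.
        abs(RANKS.index(x["rank"]) - RANKS.index(y["rank"])) / (len(RANKS) - 1),
    ]
    return sum(parts) / len(parts)

# Hypothetical observations.
p = {"salary": 30000, "has_cancer": False, "religion": "Hindu", "rank": "soldier"}
q = {"salary": 50000, "has_cancer": False, "religion": "Muslim", "rank": "captain"}

d_pq = mixed_dissimilarity(p, q, salary_range=40000)
print(d_pq)
```

Scaling every component to the same interval keeps a single attribute, such as salary, from dominating the comparison.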

Hierarchical algorithms

Agglomerative

Nearest neighbour

Example
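The original example slide did not survive as text. As a stand-in, here is a minimal sketch of agglomerative clustering with the nearest-neighbour (single linkage) rule, assuming Euclidean distance and hypothetical points. The function names are illustrative, not taken from the slides.

```python
from math import dist

def single_link(a, b):
    """Nearest-neighbour distance: the closest pair across two clusters."""
    return min(dist(p, q) for p in a for q in b)

def agglomerate(points, target_clusters):
    """Start with one cluster per point and repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]
    merges = []  # record of (cluster_a, cluster_b, distance) in merge order
    while len(clusters) > target_clusters:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j], single_link(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

# Hypothetical toy data.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
clusters, merges = agglomerate(points, target_clusters=2)
for a, b, d in merges:
    print(a, "+", b, "at distance", round(d, 3))
```

The recorded merge order and distances are the information a dendrogram would display.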

Hierarchical joining algorithms
Single (nearest neighbour): distance between two clusters = distance between the two closest members of the two clusters.
Complete (furthest neighbour): distance between two clusters = distance between the two most distant cluster members.
Centroid: distance between two clusters = distance between the multivariate means of each cluster.
[Figure: clusters 1, 2, and 3, with the single, complete, and centroid distances marked]
2001 Bio 8100s Applied Multivariate Biostatistics
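The three definitions above can be sketched directly, assuming Euclidean distance between points. The clusters below are hypothetical points on a line so that the results are easy to verify by hand.

```python
from math import dist

def single_linkage(a, b):
    """Distance between the two closest members of the two clusters."""
    return min(dist(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Distance between the two most distant members of the two clusters."""
    return max(dist(p, q) for p in a for q in b)

def centroid(cluster):
    """Multivariate mean of a cluster, computed coordinate by coordinate."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def centroid_linkage(a, b):
    """Distance between the multivariate means of the two clusters."""
    return dist(centroid(a), centroid(b))

# Hypothetical clusters.
cluster_1 = [(0.0, 0.0), (2.0, 0.0)]
cluster_2 = [(5.0, 0.0), (9.0, 0.0)]

d_single = single_linkage(cluster_1, cluster_2)
d_complete = complete_linkage(cluster_1, cluster_2)
d_centroid = centroid_linkage(cluster_1, cluster_2)
print(d_single, d_complete, d_centroid)
```

For any pair of clusters the single distance is never larger than the complete distance, which is why the two rules can produce different merge orders on the same data.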

Hierarchical joining algorithms (cont’d)
Average: distance between two clusters = average distance between all members of the two clusters.
Median: distance between two clusters = median distance between all members of the two clusters.
Ward: distance between two clusters = average distance between all members of the two clusters with adjustment for covariances.
[Figure: clusters 1, 2, and 3; mean/median/adjusted mean of all pairwise distances]
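A small sketch of the average and median rules exactly as worded above, that is, as the mean and the median of all cross-cluster pairwise distances. Euclidean distance and the toy clusters are assumptions; the Ward adjustment is not sketched because the slides do not give its formula.

```python
from math import dist
from statistics import mean, median

def pairwise_distances(a, b):
    """All distances between a member of cluster a and a member of cluster b."""
    return [dist(p, q) for p in a for q in b]

def average_linkage(a, b):
    """Mean of all cross-cluster pairwise distances."""
    return mean(pairwise_distances(a, b))

def median_linkage(a, b):
    """Median of all cross-cluster pairwise distances, following the slide's wording."""
    return median(pairwise_distances(a, b))

# Hypothetical clusters on a line: pairwise distances are 5, 6, 12, 3, 4, 10.
cluster_1 = [(0.0, 0.0), (2.0, 0.0)]
cluster_2 = [(5.0, 0.0), (6.0, 0.0), (12.0, 0.0)]

d_average = average_linkage(cluster_1, cluster_2)
d_median = median_linkage(cluster_1, cluster_2)
print(d_average, d_median)
```

The outlying member at 12 pulls the average up more than the median, which illustrates why the two rules can rank candidate merges differently.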

Partitioned clustering: the procedure
1. Choose k “seed” cases which are spread apart from the center of all objects as much as possible.
2. Assign all remaining objects to the nearest seed.
3. Reassign objects so that the within-group sum of squares is reduced…
4. …and continue to do so until SSwithin is minimized.
[Figure: objects, seeds 1 to 3, and the object center plotted on axes X1 and X2]
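The procedure can be sketched as a k-means-style iteration. Several choices here are assumptions not stated in the slides: the seeding uses a farthest-first heuristic (the first seed is the point farthest from the overall center, each later seed is the point farthest from the seeds already chosen), reassignment moves each object to the nearest group mean, and the loop stops when no object changes group, which yields a local rather than a guaranteed global minimum of SSwithin.

```python
from math import dist

def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def choose_seeds(points, k):
    """Farthest-first seeding (an assumed heuristic for 'spread apart' seeds)."""
    center = mean_point(points)
    seeds = [max(points, key=lambda p: dist(p, center))]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: dist(p, centers[c])) for p in points]

def within_ss(points, labels, centers):
    """Within-group sum of squared distances to the group centers."""
    return sum(dist(p, centers[label]) ** 2 for p, label in zip(points, labels))

def partition(points, k, max_iter=100):
    centers = choose_seeds(points, k)
    labels = assign(points, centers)
    for _ in range(max_iter):
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Keep the old center if a group happens to be empty.
            new_centers.append(mean_point(members) if members else centers[c])
        new_labels = assign(points, new_centers)
        centers = new_centers
        if new_labels == labels:  # no object changed group: stop
            break
        labels = new_labels
    return labels, centers, within_ss(points, labels, centers)

# Hypothetical toy data: three compact groups.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
    (0.0, 20.0), (1.0, 20.0), (0.0, 21.0),
]
labels, centers, ss = partition(points, k=3)
print(labels, ss)
```

Because the result depends on the starting seeds, running the procedure from several different seed sets and keeping the partition with the smallest SSwithin is a common precaution.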
