OPTIMCLASS: Simultaneous identification of optimal clustering method and optimal number of clusters in vegetation classification studies Tichy Lubomír.

OPTIMCLASS: Simultaneous identification of optimal clustering method and optimal number of clusters in vegetation classification studies
Tichy Lubomír 1, Chytry Milan 1, Botta-Dukát Zoltán 2, Hájek Michal 1, Talbot Stephen S. 3
1 Masaryk University, Brno, Czech Republic
2 Hungarian Academy of Sciences, Vácrátot, Hungary
3 U.S. Fish and Wildlife Service, Anchorage, USA

Why do we need a method for identification of optimal clustering algorithm and optimal number of clusters?
The same dataset
- A huge variety of clustering methods produce “reasonable” results.
- Subjective selection of the clustering method and number of clusters is usually based on empirical experience.

Why do we need a method for identification of optimal clustering algorithm and optimal number of clusters?
Methods published: most algorithms identify the optimal partition mathematically, without considering ecological interpretation.

The Method
A posteriori description of phytosociological tables is based on diagnostic species. Diagnostic species describe a cluster. Therefore, the number of diagnostic species determines whether the classified table can be sufficiently interpreted.

Species1  98788  12112  3.211
Species2  51123  1223.  11132
Species3  23132  .....  .....
Species4  ..2.4  112..  1..5.
Species5  .....  .1.1.  1.213

The Method The same dataset:

The Method
Measure of classification quality: the total sum of diagnostic species.
Fisher's exact test gives the probability of the observed occurrence of a species across clusters, tested as a right-tailed hypothesis (the species occurs in a cluster at least as often as observed).
- The measure reduces the importance of very small clusters.
- Easy interpretation: the more diagnostic species in the dataset, the better the description of the clusters.
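The counting step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fisher_right_tail` and `count_diagnostic` are hypothetical, every (species, cluster) pair below the threshold is counted once, and the example data are the five species rows from the table slide, read here as three clusters of five plots each with a dot meaning absence. The threshold 10^-3 is one of the probability levels shown in the results.

```python
from math import comb


def fisher_right_tail(a, cluster_size, species_total, n_plots):
    """Right-tailed Fisher's exact test: P(X >= a) for a hypergeometric X.

    a             -- occurrences of the species inside the cluster
    cluster_size  -- number of plots in the cluster
    species_total -- occurrences of the species in the whole dataset
    n_plots       -- number of plots in the whole dataset
    """
    upper = min(cluster_size, species_total)
    favourable = sum(
        comb(species_total, k) * comb(n_plots - species_total, cluster_size - k)
        for k in range(a, upper + 1)
    )
    return favourable / comb(n_plots, cluster_size)


def count_diagnostic(table, labels, alpha=1e-3):
    """Count (species, cluster) pairs whose right-tailed p-value is <= alpha.

    Counting pairs is an assumption made for this sketch.
    """
    n_plots = len(labels)
    count = 0
    for row in table.values():
        present = [value != "." for value in row]  # "." marks absence
        species_total = sum(present)
        for cluster in sorted(set(labels)):
            in_cluster = [p for p, lab in zip(present, labels) if lab == cluster]
            p_value = fisher_right_tail(
                sum(in_cluster), len(in_cluster), species_total, n_plots
            )
            if p_value <= alpha:
                count += 1
    return count


# Rows from the table slide; the split into three clusters of five plots
# is an assumption based on the spacing of the original table.
table = {
    "Species1": "98788" + "12112" + "3.211",
    "Species2": "51123" + "1223." + "11132",
    "Species3": "23132" + "....." + ".....",
    "Species4": "..2.4" + "112.." + "1..5.",
    "Species5": "....." + ".1.1." + "1.213",
}
labels = [1] * 5 + [2] * 5 + [3] * 5

if __name__ == "__main__":
    print(count_diagnostic(table, labels, alpha=1e-3))
```

To compare candidate classifications, the same function would be called once per partition (one `labels` list per clustering method and number of clusters), and the partition with the highest count would be preferred.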

The Method
Test on three different datasets:
- Southern Siberia, Sayan Mountains (310 plots; forest, steppe and tundra vegetation)
- Central Europe, Carpathians (241 plots; mire vegetation)
- Alaska, Kenai Peninsula (171 plots; wetlands)

The Method
Classifications tested:
- Flexible beta clustering, Ward's clustering and UPGMA (PC-ORD)
  Cover transformations: percentages, log percentages, Braun-Blanquet, presence/absence
  Distance measures: Bray-Curtis, Manhattan, Euclidean
- Ordinal cluster analysis (SYN-TAX)
  Distance measures: Kruskal-Wallis, Kendall, Gower-Podani coefficient
- Modified TWINSPAN classification (JUICE)
  The sequence of splits in divisive classification is determined by the internal heterogeneity of clusters. Therefore, any number of clusters is possible (three modifications of pseudospecies cut levels).
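For readers unfamiliar with the three distance measures used with the PC-ORD methods, the sketch below shows their standard definitions for two plots described by species cover vectors, plus the presence/absence transformation. The function names and the cover values are hypothetical and chosen only for illustration; PC-ORD's own implementation may differ in details such as scaling.

```python
from math import sqrt


def manhattan(x, y):
    """Sum of absolute cover differences between two plots."""
    return sum(abs(a - b) for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the summed squared cover differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def bray_curtis(x, y):
    """Manhattan distance divided by the total cover of both plots."""
    total = sum(x) + sum(y)
    return manhattan(x, y) / total if total else 0.0


def presence_absence(x):
    """Transform cover values to 1 (present) or 0 (absent)."""
    return [1 if value > 0 else 0 for value in x]


# Hypothetical percentage covers of four species in two plots.
plot_a = [40.0, 10.0, 0.0, 5.0]
plot_b = [20.0, 0.0, 15.0, 5.0]

if __name__ == "__main__":
    print(manhattan(plot_a, plot_b))
    print(euclidean(plot_a, plot_b))
    print(bray_curtis(plot_a, plot_b))
    print(manhattan(presence_absence(plot_a), presence_absence(plot_b)))
```

Euclidean distance squares the differences, so a few dominant species with high cover weigh more heavily than under Manhattan or Bray-Curtis distance; transforming the cover values first reduces that effect.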

Results
Sayan Mountains, Siberia (310 plots, 1036 species). Figures not preserved in this transcript; axes: number of clusters, number of diagnostic species. Panels:
- Probability = 10^-3, 10^-6 and 10^-9
- Untransformed cover data
- Euclidean distance measure
- Manhattan distance measure
- Bray-Curtis distance measure
- UPGMA
- Ward's method
- Flexible beta -0.25
- Ordinal cluster analyses (SYN-TAX)
- Modified TWINSPAN

The Method
Test on three different datasets:
- Southern Siberia, Sayan Mountains (310 plots; forest, steppe and tundra vegetation)
- Central Europe, Carpathians (241 plots; mire vegetation)
- Alaska, Kenai Peninsula (171 plots; wetlands)
Similar results:

Conclusions
- Classifications based on transformed cover values give better results than those based on percentage covers.
- Euclidean distance gives slightly poorer results than Manhattan or Bray-Curtis distances.
- The UPGMA clustering method gives poorer results than Ward's and Flexible beta methods.
- No significant difference between ordinal cluster analysis proposed by Podani (SYN-TAX 2000) and other clustering methods.
- Modified TWINSPAN performs well with small numbers of clusters.

Modified TWINSPAN classification. Figures not preserved in this transcript; each is plotted against the number of clusters:
- Number of diagnostic species occurrences
- Sum of diagnostic species
- Number of clusters with more than 4 diagnostic species
