Weighted Chinese Restaurant Process for clustering barcodes
Javier Cabrera, John Lau, Albert Lo
DIMACS, Bristol U, and HKUST
Cluster Analysis: Group the observations into k distinct natural groups.

Non-Bayesian Cluster Analysis:

Hierarchical clustering: Build a hierarchical tree
- Similarity (inter-point distance): Euclidean, Manhattan…
- Inter-cluster distance: Single Linkage, Complete, Average, Ward

Non-hierarchical clustering:
- K-means
- Divisive
- PAM
- Model Based
- Many other methods
Hierarchical Clustering
[Figure: hierarchical clustering tree over Specimen 1 to Specimen 7]
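As a rough illustration of the hierarchical approach, the sketch below builds clusters bottom-up with single linkage and Manhattan distance, two of the options the slides list. The seven 2-D points, the function names, and the choice to stop at k = 3 clusters are all assumptions made for this example; they are not the specimens from the talk.

```python
from itertools import combinations

def manhattan(u, v):
    """Manhattan (city-block) distance between two points."""
    return sum(abs(a - b) for a, b in zip(u, v))

def single_linkage(points, k, dist=manhattan):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    while len(clusters) > k:
        # Single linkage: cluster distance = smallest pairwise point distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: min(
                dist(points[i], points[j])
                for i in clusters[ab[0]]
                for j in clusters[ab[1]]
            ),
        )
        clusters[a] += clusters[b]  # a < b, so deleting b keeps index a valid
        del clusters[b]
    return clusters

# Hypothetical toy data: seven "specimens" as 2-D points.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (9, 9), (9, 8)]
groups = single_linkage(points, 3)
print(groups)
```

Recording the order of the merges, rather than stopping at k, would give the full tree.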
Weighted Chinese Restaurant Process
1. The restaurant is full of tables.
2. Customers are seated at tables by a seating rule.
3. Customers are allowed to move from one table to another or to a new empty one.
Partition: Each seating arrangement of all the customers in the restaurant.
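The slides do not spell out the seating rule, so the sketch below uses the plain (unweighted) Chinese restaurant process as a stand-in: a customer joins an occupied table with probability proportional to its current size, or opens a new table with probability proportional to a parameter `alpha`. Both `alpha` and the function name are assumptions for illustration; the weighted version would additionally multiply each weight by a data-dependent term.

```python
import random

def seat_customers(n_customers, alpha=1.0, seed=0):
    """Seat customers one at a time using the plain CRP rule (assumed here)."""
    rng = random.Random(seed)
    tables = []  # tables[i] is the list of customers seated at table i
    for customer in range(n_customers):
        # Occupied table i has weight len(tables[i]); a new table has weight alpha.
        weights = [len(t) for t in tables] + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(tables):
            tables.append([customer])  # open a new table
        else:
            tables[choice].append(customer)
    return tables

# Each seating arrangement is one partition of the customers.
partition = seat_customers(7)
print(partition)
```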
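Partitions:
p: Partition of specimens into species.
$p \in P$, where $P$ is the space of all possible partitions (all arrangements of specimens into species).

Bayes basics:
Prior distribution: $\pi(p)$
Likelihood: $f(x \mid p) = \prod_{1 \le i \le n(p)} k(x_j, j \in C_i)$, where $C_1, \dots, C_{n(p)}$ are the clusters of $p$.
Posterior: $\pi(p \mid \text{data}) \propto f(x \mid p)\, \pi(p)$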
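To make the posterior formula concrete, here is a minimal sketch that scores a partition up to the normalizing constant. The slides leave the kernel k and the prior unspecified, so the choices here are assumptions: k is a Dirichlet-multinomial marginal over the bases A, C, G, T with independent sites, and the prior is the standard CRP prior with parameter `alpha`. The five short sequences are made-up toy data, not real barcodes.

```python
from collections import Counter
from math import lgamma, log

BASES = "ACGT"

def log_kernel(seqs, a=1.0):
    """log k(x_j, j in C_i): assumed Dirichlet-multinomial marginal per site."""
    n = len(seqs)
    total = 0.0
    for site in zip(*seqs):  # iterate over alignment columns
        counts = Counter(site)
        total += lgamma(4 * a) - lgamma(4 * a + n)
        total += sum(lgamma(a + counts[b]) - lgamma(a) for b in BASES)
    return total

def log_prior(partition, alpha=1.0):
    """log pi(p) up to a constant: assumed CRP prior, alpha^n(p) * prod (|C_i| - 1)!."""
    return sum(log(alpha) + lgamma(len(c)) for c in partition)

def log_posterior(data, partition, alpha=1.0, a=1.0):
    """log f(x|p) + log pi(p), i.e. the log posterior up to a constant."""
    log_lik = sum(log_kernel([data[j] for j in c], a) for c in partition)
    return log_lik + log_prior(partition, alpha)

# Hypothetical toy sequences: specimens 0-2 look alike, as do specimens 3-4.
data = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "TGCATGCC", "TGCATGCC"]
two_species = [[0, 1, 2], [3, 4]]
one_species = [[0, 1, 2, 3, 4]]
print(log_posterior(data, two_species), log_posterior(data, one_species))
```

Because only the ratio of posterior values matters when comparing partitions, the unknown normalizing constant never needs to be computed.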
Weighted Chinese Restaurant Process
Approximate the posterior distribution with the WCRP:
- Run the process for a while and obtain a frequency table of the partitions visited.
- Estimate the final partition with the posterior mode.
- Compare the posterior probabilities of the most probable partitions.
New specimens:
- Placed at one existing table.
- Open a new table => new species.
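The procedure above can be sketched as a Gibbs-style sampler: each customer in turn leaves its table and is reseated, and the partition reached after every sweep is tallied. This is one plausible reading of the slides, not the authors' implementation. The kernel (a Dirichlet-multinomial marginal with independent sites), the parameter `alpha`, the number of sweeps, and the toy sequences are all assumptions.

```python
import random
from collections import Counter
from math import exp, lgamma, log

BASES = "ACGT"

def log_kernel(seqs, a=1.0):
    """Assumed kernel: Dirichlet-multinomial marginal, independent sites."""
    n = len(seqs)
    total = 0.0
    for site in zip(*seqs):
        counts = Counter(site)
        total += lgamma(4 * a) - lgamma(4 * a + n)
        total += sum(lgamma(a + counts[b]) - lgamma(a) for b in BASES)
    return total

def reseat(j, tables, data, alpha, rng):
    """Remove customer j, then reseat it with data-weighted probabilities."""
    for t in tables:
        if j in t:
            t.remove(j)
    tables[:] = [t for t in tables if t]  # drop a table left empty
    log_w = []
    for t in tables:
        members = [data[i] for i in t]
        # Occupied table: size times the predictive weight of x_j given the table.
        log_w.append(log(len(t)) + log_kernel(members + [data[j]]) - log_kernel(members))
    log_w.append(log(alpha) + log_kernel([data[j]]))  # weight of a new table
    top = max(log_w)
    weights = [exp(w - top) for w in log_w]  # rescale to avoid underflow
    choice = rng.choices(range(len(weights)), weights=weights)[0]
    if choice == len(tables):
        tables.append([j])
    else:
        tables[choice].append(j)

def run_wcrp(data, sweeps=500, alpha=1.0, seed=0):
    """Run the process and return a frequency table of partitions visited."""
    rng = random.Random(seed)
    tables = [list(range(len(data)))]  # start with everyone at one table
    visits = Counter()
    for _ in range(sweeps):
        for j in range(len(data)):
            reseat(j, tables, data, alpha, rng)
        visits[frozenset(frozenset(t) for t in tables)] += 1
    return visits

# Hypothetical toy sequences, not real barcode data.
data = ["ACGTACGT", "ACGTACGT", "ACGTACGA", "TGCATGCC", "TGCATGCC"]
visits = run_wcrp(data)
mode, _ = visits.most_common(1)[0]  # most visited partition = estimated posterior mode
for part, count in visits.most_common(3):
    print(sorted(sorted(c) for c in part), count / sum(visits.values()))
```

The relative frequencies printed at the end approximate the posterior probabilities of the most probable partitions. A new specimen could be handled with one more call to `reseat`, which either places it at an existing table or opens a new one.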
Future Work
WCRP algorithm for barcode data.
Data visualization: Final partition => similarities => Euclidean representation
- Multidimensional Scaling
- Multivariate Data Visualization (used in taxonomy)
- Projection Pursuit
Entropy scanning

Lo (1984), Ishwaran and James (2003b), Cabrera, Lau, Lo (2006)