CLEVER: CLustEring using representatiVEs and Randomized hill climbing. Rachana Parmar and Christoph F. Eick

Presentation transcript:

CLEVER: CLustEring using representatiVEs and Randomized hill climbing. Rachana Parmar and Christoph F. Eick

Rachana Parmar and Christoph F. Eick 2 SPAM ("Supervised PAM")
- Starts with a randomly selected set of k representatives
- Uses q(X) as a fitness function that it tries to maximize
- Otherwise works the same as PAM
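A minimal sketch of the swap step just described, assuming a user-supplied fitness function q to be maximized; the helper names (assign_clusters, q) are illustrative and not from the slides:

    # Illustrative sketch of one SPAM-style pass: try every replacement of a
    # representative by a non-representative and keep the best improving swap.
    def spam_swap_pass(data, reps, assign_clusters, q):
        best_reps = list(reps)
        best_fit = q(assign_clusters(data, best_reps))     # fitness of the current clustering
        for i in range(len(reps)):
            for obj in data:
                if obj in reps:
                    continue
                candidate = reps[:i] + [obj] + reps[i + 1:]   # swap one representative
                fit = q(assign_clusters(data, candidate))     # fitness of the resulting clustering
                if fit > best_fit:
                    best_reps, best_fit = candidate, fit
        return best_reps, best_fit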

Rachana Parmar and Christoph F. Eick 3 Motivation for CLEVER
- The PAM/SPAM algorithm uses a simple hill-climbing procedure.
- PAM terminates very quickly in terms of the number of replacements, mostly because it only considers simple replacements.
- PAM/SPAM does not perform well in terms of clustering quality.
- PAM is inefficient in the sense that it samples all possible replacements of a representative with a non-representative.
- PAM/SPAM searches for a fixed number of clusters k; k has to be known in advance.

Rachana Parmar and Christoph F. Eick 4 Randomized Hill Climbing
- A variant of hill climbing in which p sample points are randomly selected from the neighborhood of the current solution.
- Two main parameters: p and the neighborhood definition.
- Usually fast, because the neighborhood is not sampled exhaustively.
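A hedged Python sketch of this procedure (the fitness and neighbor-sampling functions are assumed to be supplied by the caller):

    # Randomized hill climbing: evaluate p randomly sampled neighbors and
    # move to the best one as long as it improves the fitness.
    def randomized_hill_climbing(initial, fitness, sample_neighbor, p):
        current, current_fit = initial, fitness(initial)
        while True:
            neighbors = [sample_neighbor(current) for _ in range(p)]   # p random neighbors
            best = max(neighbors, key=fitness)
            best_fit = fitness(best)
            if best_fit > current_fit:
                current, current_fit = best, best_fit    # climb to the better neighbor
            else:
                return current, current_fit              # no sampled neighbor improves: stop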

Rachana Parmar and Christoph F. Eick 5 CLEVER - General Loop
- Select an initial solution by randomly selecting k̂ representatives.
- Find p neighbors of the current solution using a given neighborhood definition.
- If the best neighbor improves fitness, it becomes the new solution.
- If no neighbor improves fitness, the neighborhood is resampled using p*2 and then p*(q-2) samples.
- If fitness does not improve even after resampling, the algorithm terminates.
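A short sketch of one search step with this dynamic re-sampling, assuming helper functions sample_neighbors(solution, n) and fitness(solution) (illustrative names, not from the slides); following the pseudocode later in the talk, the sample size is doubled on the first re-sample and multiplied by (q - 2) on the second:

    # One CLEVER search step with dynamic re-sampling (illustrative sketch).
    def clever_step(solution, fitness, sample_neighbors, p, q):
        n = p
        for attempt in range(3):                        # original sample plus two re-samples
            neighbors = sample_neighbors(solution, n)
            best = max(neighbors, key=fitness)
            if fitness(best) > fitness(solution):
                return best                             # improvement found: keep searching from here
            n = n * 2 if attempt == 0 else n * (q - 2)  # enlarge the sample before re-sampling
        return None                                     # no improvement after re-sampling: terminate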

Rachana Parmar and Christoph F. Eick 6 Neighborhood Definition
- CLEVER samples solutions based on a given neighborhood size, which is defined as the number of operators applied to the current solution.
- Operators used: Insert, Replace, Delete.
- Each operator is assigned a certain selection probability.
- The Insert and Delete operators change the number of representatives and, in turn, the number of clusters, enabling the algorithm to search for a variable number of clusters k.

Rachana Parmar and Christoph F. Eick 7 Example: Neighborhood Size = 3
Dataset: {1, 2, 3, 4, 5, 6, 7, 8, 9, 0}
Current solution: {1, 3, 5}
Non-representatives: {2, 4, 6, 7, 8, 9, 0}
{1, 3, 5} --Insert 7--> {1, 3, 5, 7} --Replace 3 with 4--> {1, 4, 5, 7} --Delete 7--> Neighbor: {1, 4, 5}
Remark: operator parameters are chosen by a random generator, giving each parameter the same chance to be selected.
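The same example replayed as a small Python snippet (the operator choices are hard-coded here; in CLEVER they are drawn at random):

    solution = {1, 3, 5}                  # current solution (representatives)
    non_reps = {2, 4, 6, 7, 8, 9, 0}      # non-representatives

    solution = solution | {7}             # Insert 7         -> {1, 3, 5, 7}
    solution = (solution - {3}) | {4}     # Replace 3 with 4 -> {1, 4, 5, 7}
    solution = solution - {7}             # Delete 7         -> {1, 4, 5}
    print(sorted(solution))               # [1, 4, 5]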

Rachana Parmar and Christoph F. Eick 8 Advantages of CLEVER over Other Representative-based Algorithms
- Searches for the optimal number of clusters k.
- Quite generic: can be used with any reward-based fitness function, so it can be applied to a large set of tasks.
- Less likely to terminate prematurely, because it uses neighborhood sizes larger than 1 and changes the number of clusters k during the search.
- Usually achieves significantly better cluster quality than SPAM.
- On average, samples fewer neighbors than PAM and SPAM.
- Uses dynamic sampling; only uses a large number of samples when it gets stuck.

Rachana Parmar and Christoph F. Eick 9 Disadvantages and Problems with CLEVER
- Relatively slow, particularly if the dataset size is above 5000 or q(X) is expensive to compute.
- Bad choices for k̂ might lead to poor results.
- Sensitive to initialization.
- Cluster shapes are limited to convex polygons.
- Cannot detect nested clusters.
- CLEVER has no concept of outliers; therefore, the algorithm has to waste resources and representatives to separate non-reward regions from reward regions.
Remark: The disadvantages shown in blue on the original slide are general disadvantages of representative-based clustering algorithms.

CLEVER - CLustEring using representatiVEs and Randomized hill climbing

CLEVER(k̂, neighborhoodDef, p, q)
1. Randomly select an initial set of k̂ representatives. Set it to currReps.
2. regions = findRegions(currReps) {Find regions for the current set of representatives}
3. currentFitness = findFitness(regions) {Find the total fitness for the current regions}
4. Set nonReps to the data objects that are non-representatives.
5. bestFitness = currentFitness
6. while TRUE do
7.   neighbors = findNeighbors(currReps, nonReps, p, neighborhoodDef) {Pick p neighbors from the neighborhood of the current representatives as defined by the parameter 'neighborhoodDef'}
8.   for each neighbor in neighbors do
9.     newRegions = findRegions(neighbor) {Find regions using the neighboring solution}
10.    newFitness = findFitness(newRegions) {Find the total fitness for the newly found regions}
11.    if newFitness > bestFitness
12.      bestNeighbor = neighbor {replacement}
13.      bestFitness = newFitness
14.      bestRegions = newRegions
15.    end
16.  end for
17.  if fitness improved
18.    currReps = bestNeighbor
19.    currentFitness = bestFitness
20.    regions = bestRegions
21.    if resampling done
22.      revert to the original p value
23.    end
24.    Go back to step 7
25.  else
26.    if resampling done twice and fitness does not improve after resampling
27.      Terminate and return the obtained regions, region representatives, no. of representatives (k), and fitness.
28.    else if first resampling
29.      p = p * 2 {resample with a higher p value}
30.      Go back to step 7
31.    else {resample again}
32.      p = p * (q - 2) {resample with a much higher p value}
33.      Go back to step 7
34.    end
35.  end
36. end

Rachana Parmar and Christoph F. Eick 11 CLEVER Parameters
- k̂: initial no. of clusters.
- p: sample size, i.e. the number of samples to be selected randomly from the solution neighborhood.
- q: resampling factor; if the algorithm fails to improve fitness with p and p*2 samples, the sample size is increased by a factor of (q-2).
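For instance (illustrative values only), with p = 20 and q = 7 the algorithm first evaluates 20 randomly sampled neighbors; if none improves the fitness, it re-samples with 20 * 2 = 40 neighbors, and if that also fails, it re-samples once more with 40 * (7 - 2) = 200 neighbors before terminating.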

findNeighbors(reps, nonReps, p, neighborhoodDef, neighborhoodSize)
1. for 1 to p do
2.   currentReps = reps
3.   currentNonReps = nonReps
4.   for 1 to neighborhoodSize do
5.     op = selectOperator(operators, opProbabilities) {Randomly select an operator according to the user-specified operator selection probabilities}
6.     if op = 'insert'
7.       newRep = selectNewRep(currentNonReps) {Randomly select a non-representative}
8.       delete(currentNonReps, newRep)
9.       neighbor = add(currentReps, newRep) {Add newRep to currentReps to get a new neighbor}
10.    else if op = 'delete' OR op = 'replace'
11.      rep = selectRep(currentReps) {Randomly select a representative}
12.      if op = 'delete'
13.        neighbor = delete(currentReps, rep) {Delete the selected rep from the current set of representatives to get a new neighbor}
14.        add(currentNonReps, rep)
15.      else if op = 'replace'
16.        newRep = selectNewRep(nonReps) {Randomly select a non-representative}
17.        neighbor = replace(currentReps, rep, newRep) {Replace the selected representative with the new representative to get a new neighbor}
           replace(currentNonReps, newRep, rep)
18.      end
19.    end
20.    currentReps = neighbor
21.  end for
22.  Add neighbor to neighbors.
23. end for
24. Return neighbors.

Rachana Parmar and Christoph F. Eick 13 findNeighbors() Parameters
- operators: set of operators that can be applied to a solution. Possible values are 'Insert', 'Replace' and 'Delete'.
- opProbabilities: user-specified probabilities for each operator selection. The probabilities of all operators sum to 1.
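One simple way to realize this operator selection in Python (the probability values below are illustrative only):

    import random

    operators = ['Insert', 'Replace', 'Delete']
    op_probabilities = [0.3, 0.5, 0.2]    # user-specified; must sum to 1

    op = random.choices(operators, weights=op_probabilities, k=1)[0]   # pick one operator at random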