Using Clustering to Learn Distance Functions for Supervised Similarity Assessment (Ch. Eick et al., MLDM 2005)

Presentation transcript:

Using Clustering to Learn Distance Functions for Supervised Similarity Assessment
Christoph F. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta
Department of Computer Science, University of Houston

Organization of the Talk
1. Similarity Assessment
2. A Framework for Distance Function Learning
3. Inside/Outside Weight Updating
4. Distance Function Learning Research at UH-DMML
5. Experimental Evaluation
6. Other Distance Function Learning Research
7. Summary

1. Similarity Assessment
Definition: Similarity assessment is the task of determining which objects are similar to each other and which are dissimilar.
Goal of similarity assessment: Construct a distance function!
Applications of similarity assessment:
– Case-based reasoning
– Classification techniques that rely on distance functions
– Clustering
– …
Complications: Usually, there is no universally "good" distance function for a set of objects; the usefulness of a distance function depends on the task it is used for ("no free lunch" in similarity assessment either). Defining the distance between objects is more an art than a science.

Motivating Example: How to Find Similar Patients?
The following relation is given (with tuples): Patient(ssn, weight, height, cancer-sev, eye-color, age, …)
Attribute domains:
– ssn: 9 digits
– weight: between 30 and 650; μ_weight = 158, σ_weight = 24.20
– height: between 0.30 and 2.20 meters; μ_height = 1.52, σ_height = 19.2
– cancer-sev: 4 = serious, 3 = quite serious, 2 = medium, 1 = minor
– eye-color: {brown, blue, green, grey}
– age: between 3 and 100; μ_age = 45, σ_age = 13.2
Task: Define patient similarity.
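To make the task concrete, here is a minimal sketch (not part of the original slides) of one possible per-attribute distance for these Patient tuples; the handling of each attribute and the equal initial weights are assumptions chosen for illustration only.

```python
# Illustrative sketch: a weighted per-attribute distance for Patient records,
# using the domain statistics listed above. Numeric attributes are scaled by
# their standard deviation; eye-color uses a 0/1 mismatch distance.
STATS = {"weight": (158, 24.20), "height": (1.52, 19.2), "age": (45, 13.2)}

def attr_dist(attr, a, b):
    """Distance between two values of a single attribute."""
    if attr in STATS:                       # numeric: difference in std-dev units
        _, std = STATS[attr]
        return abs(a - b) / std
    if attr == "eye-color":                 # nominal: 0 if equal, 1 otherwise
        return 0.0 if a == b else 1.0
    if attr == "cancer-sev":                # ordinal 1..4, scaled to [0, 1]
        return abs(a - b) / 3.0
    raise ValueError(f"no distance defined for {attr}")

def patient_dist(p, q, weights):
    """Weighted sum of per-attribute distances (ssn is ignored as an identifier)."""
    return sum(w * attr_dist(attr, p[attr], q[attr]) for attr, w in weights.items())

p1 = {"weight": 170, "height": 1.80, "cancer-sev": 2, "eye-color": "blue", "age": 50}
p2 = {"weight": 140, "height": 1.60, "cancer-sev": 4, "eye-color": "brown", "age": 30}
w  = {"weight": 1.0, "height": 1.0, "cancer-sev": 1.0, "eye-color": 1.0, "age": 1.0}
print(patient_dist(p1, p2, w))
```

The open question that motivates the rest of the talk is how the weights w should be chosen; the framework below learns them from class-labeled data.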

CAL-FULL/UH Database Clustering & Similarity Assessment Environments
[Architecture diagram showing the components: User Interface, Data Extraction Tool, DBMS, Object View, a Clustering Tool with a library of clustering algorithms (produces a set of clusters), a Similarity Measure Tool with a library of similarity measures plus default choices, domain, type, and weight information (produces the similarity measure), and a Learning Tool driven by training data, which is today's topic.]
For more details, see [RE05].

2. A Framework for Distance Function Learning
Assumption: The distance between two objects is computed as the weighted sum of the distances with respect to their attributes.
Objective: Learn a "good" distance function for classification tasks.
Our approach: Apply a clustering algorithm, with the object distance function to be evaluated, that returns k clusters. Our goal is to learn the weights of the object distance function such that pure clusters are obtained (or clusters that are as pure as possible); a pure cluster contains examples belonging to a single class.
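As a reading aid (not on the original slide), the weighted-sum assumption can be written out as follows; the per-attribute distances d_i, the weights w_i, and the number of attributes p are my notation:

```latex
d_{\mathbf{w}}(u,v) \;=\; \sum_{i=1}^{p} w_i \, d_i(u_i, v_i), \qquad w_i \ge 0 .
```

A normalization such as keeping the weights summing to p during learning is an additional assumption used in the sketches further below.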

Idea: Coevolving Clusters and Distance Functions
[Feedback-loop diagram: the current distance function is used by the clustering algorithm to produce a clustering X; clustering evaluation measures the goodness q(X) of the distance function; a weight-updating scheme / search strategy then modifies the distance function, and the cycle repeats. An accompanying illustration contrasts a "bad" distance function, whose clusters mix x and o examples, with a "good" distance function, whose clusters are pure.]

3. Inside/Outside Weight Updating
Idea: Move examples of the majority class closer to each other (o := examples belonging to the majority class, x := non-majority-class examples).
[Illustration for Cluster 1: with respect to Att1 the majority-class examples lie close together, so the action is to increase the weight of Att1; with respect to Att2 they do not, so the action is to decrease the weight of Att2.]

Inside/Outside Weight Updating Algorithm
1. Cluster the dataset with k-means using a given weight vector w = (w_1, …, w_p).
2. FOR EACH cluster-attribute pair DO: modify w using inside/outside weight updating.
3. IF NOT DONE, CONTINUE with Step 1; OTHERWISE, RETURN w.
A runnable sketch of this loop is given below.
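The following is a minimal Python sketch of the loop above, under two assumptions that are mine rather than the slides': the weighted distance is a weighted Euclidean distance (so k-means can be run on features scaled by sqrt(w_i)), and the per-cluster weight change simply compares the average attribute spread inside the majority class with the spread over the whole cluster. Function names and the learning rate alpha are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def weighted_kmeans(X, w, k, seed=0):
    """k-means under a weighted Euclidean distance: scale each attribute by sqrt(w_i)."""
    Xs = X * np.sqrt(w)
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Xs)

def iowu_step(X, y, labels, w, alpha=0.3):
    """One pass of inside/outside weight updating over all cluster-attribute pairs.
    Illustrative rule: raise w_i when majority-class examples are tighter on
    attribute i than the cluster as a whole, lower it otherwise."""
    w = w.copy()
    for c in np.unique(labels):
        cluster = X[labels == c]
        y_c = y[labels == c]
        classes, counts = np.unique(y_c, return_counts=True)
        majority = cluster[y_c == classes[np.argmax(counts)]]
        if len(majority) < 2 or len(cluster) < 2:
            continue
        for i in range(X.shape[1]):
            inside = np.mean(np.abs(majority[:, i][:, None] - majority[:, i][None, :]))
            overall = np.mean(np.abs(cluster[:, i][:, None] - cluster[:, i][None, :]))
            if overall > 0:
                w[i] *= 1.0 + alpha * (overall - inside) / overall
    return w * len(w) / w.sum()            # keep the weights normalized

def learn_weights(X, y, k, iterations=200, alpha=0.3):
    """Alternate weighted k-means and inside/outside weight updating."""
    w = np.ones(X.shape[1])
    for _ in range(iterations):
        labels = weighted_kmeans(X, w, k)
        w = iowu_step(X, y, labels, w, alpha)
    return w
```

Used as `w = learn_weights(X, y, k=10)`, the resulting weights can then be plugged into a weighted 1-NN classifier, which is how the approach is evaluated later in the talk.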

Inside/Outside Weight Updating Heuristic
The weight w_i of the i-th attribute is updated as follows for a given cluster:
[Update formula and two worked examples shown on the slide; not reproduced in the transcript.]
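Since the formula itself is not legible in the transcript, here is one plausible instantiation of the heuristic, written in my own notation and matching the multiplicative sketch used in the code above; it is an illustration, not necessarily the exact formula from the paper (alpha is the learning rate):

```latex
w_i \;\leftarrow\; w_i \left( 1 + \alpha \,
      \frac{\bar{d}_i^{\,\mathrm{cluster}} - \bar{d}_i^{\,\mathrm{majority}}}
           {\bar{d}_i^{\,\mathrm{cluster}}} \right),
```

where \bar{d}_i^{majority} is the average distance with respect to attribute i among majority-class ("inside") examples and \bar{d}_i^{cluster} is the average over all examples of the cluster; w_i grows when the majority class is tighter on attribute i than the cluster as a whole, and shrinks otherwise.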

Example: Inside/Outside Weight Updating for Cluster k
[Illustration of the distance distributions with respect to Attribute1, Attribute2, and Attribute3.]
Initial weights: w_1 = w_2 = w_3 = 1; updated weights: w_1 = 1.14, w_2 = 1.32, w_3 = 0.84.

Illustration: Net Effect of Weight Adjustments
[Figure comparing old and new object distances within Cluster k after the weight adjustments.]

A Slightly Enhanced Weight Update Formula
[Formula shown on the slide; not reproduced in the transcript.]

Sample Run of IOWU for the Diabetes Dataset
[Figure not reproduced in the transcript.]

4. Distance Function Learning Research at UH-DMML
[Overview diagram: weight-updating schemes / search strategies (Randomized Hill Climbing, Adaptive Clustering, Inside/Outside Weight Updating, …) are combined with distance-function evaluation methods (K-Means, Supervised Clustering, NN-Classifier, …); the diagram also references work by Karypis and labels the papers [BECV05], [ERBV04], and [EZZ04] as current and other research.]

5. Experimental Evaluation
Used a benchmark consisting of 7/15 UCI datasets.
Inside/outside weight updating was run for 200 iterations; the learning rate was set to 0.3.
Evaluation (10-fold cross-validation repeated 10 times was used to determine accuracy):
– Used a 1-NN classifier as the baseline classifier
– Used the learned distance function for a 1-NN classifier
– Used the learned distance function for an NCC classifier (new!)

NCC Classifier
Idea: the training set is replaced by k (centroid, majority class) pairs that are computed using k-means; the dataset generated this way is then used to classify the examples in the test set. A sketch of the classifier follows below.
[Illustration: (a) the dataset clustered by k-means into clusters A-F; (b) the dataset edited down to the cluster centroids, each carrying the class label of its cluster's majority class.]
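A minimal sketch of such a nearest-centroid-of-cluster classifier, assuming plain k-means with Euclidean distance (in the talk the learned weighted distance function would be used instead); the class name and its interface are mine:

```python
import numpy as np
from sklearn.cluster import KMeans

class NCCClassifier:
    """Replace the training set by k (centroid, majority class) pairs obtained
    with k-means, then classify test examples by their nearest centroid.
    Assumes y contains non-negative integer class labels."""

    def __init__(self, k, seed=0):
        self.k, self.seed = k, seed

    def fit(self, X, y):
        km = KMeans(n_clusters=self.k, n_init=10, random_state=self.seed).fit(X)
        self.centroids_ = km.cluster_centers_
        # label each centroid with the majority class of its cluster
        self.labels_ = np.array([
            np.bincount(y[km.labels_ == c]).argmax() for c in range(self.k)
        ])
        return self

    def predict(self, X):
        # distance from every test example to every centroid, then nearest centroid wins
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.labels_[d.argmin(axis=1)]
```

Usage: `y_pred = NCCClassifier(k=10).fit(X_train, y_train).predict(X_test)`. The appeal of the design is that the k centroids replace the full training set, so prediction cost no longer grows with the number of training examples.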

Experimental Evaluation
[Accuracy table with columns Dataset, n, k, 1-NN, LW1NN, NCC, and C4.5 for the datasets DIABETES, VEHICLE, HEART-STATLOG, GLASS, HEART-C, HEART-H, and IONOSPHERE; the numbers are not reproduced in the transcript. Remark: statistically significant improvements are marked in red on the slide.]

DF-Learning with Randomized Hill Climbing
Generate R solutions in the neighborhood of the current weight vector w and pick the best one to be the new w. Each neighbor is obtained by perturbing the weights with a random rate of change drawn from an interval such as [-0.3, 0.3]. A sketch of this search loop is given below.
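A minimal sketch of this randomized hill-climbing loop, reusing the weighted_kmeans helper from the earlier sketch; the purity-style evaluation function, the neighborhood size R, the rate interval, and the iteration count are assumptions for illustration, not the exact settings of the paper.

```python
import numpy as np

def cluster_purity(X, y, w, k):
    """Fraction of examples that belong to the majority class of their cluster
    (illustrative stand-in for the clustering-based evaluation used in the talk)."""
    labels = weighted_kmeans(X, w, k)          # helper defined in the earlier sketch
    correct = 0
    for c in np.unique(labels):
        _, counts = np.unique(y[labels == c], return_counts=True)
        correct += counts.max()
    return correct / len(y)

def rhc_learn_weights(X, y, k, R=30, rate=0.3, iterations=50, seed=0):
    """Randomized hill climbing over weight vectors: sample R neighbors of w,
    keep the best one if it improves the evaluation."""
    rng = np.random.default_rng(seed)
    w, best = np.ones(X.shape[1]), -np.inf
    for _ in range(iterations):
        # perturb every weight by a random rate of change in [-rate, rate]
        neighbors = [w * (1 + rng.uniform(-rate, rate, size=w.shape)) for _ in range(R)]
        scored = [(cluster_purity(X, y, nb, k), nb) for nb in neighbors]
        score, candidate = max(scored, key=lambda t: t[0])
        if score > best:
            best, w = score, candidate
    return w
```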

Accuracy of IOWA and Randomized Hill Climbing
[Accuracy table with columns Dataset, RHC(1c), RHC(2c), RHC(5c), IOWA(1c), IOWA(2c), and IOWA(5c) for the datasets autos, breast-cancer, wisconsin-breast-cancer, credit-rating, pima_diabetes, german_credit, glass, cleveland-14-heart-disease, hungarian-14-heart-disease, heart-statlog, ionosphere, sonar, vehicle, vote, and zoo; the numbers are not reproduced in the transcript.]

Distance Function Learning with Adaptive Clustering
– Uses reinforcement learning to adapt distance functions for k-means clustering.
– Employs search strategies that explore multiple paths in parallel. The algorithm maintains an open list with maximum size |L|; bad performers are dropped from the open list. Currently, beam search is used: it creates 2p successors (increasing and decreasing the weight of each attribute exactly once), evaluates those 2p*|L| successors, and keeps the best |L| of them. A sketch of this successor generation is given after this list.
– Discretizes the search space, in which states are (…, …) tuples, into a grid, and memorizes and updates the fitness values of the grid; value iteration is limited to "interesting states" by employing prioritized sweeping.
– Weights are updated by increasing/decreasing the weight of an attribute by a randomly chosen percentage that falls within an interval [min-change, max-change]; our current implementation uses [25%, 50%].
– Employs entropy H(X) as the fitness function (low entropy corresponds to pure clusters).
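A minimal sketch of the beam-search step described above (successor generation and pruning to the best |L| weight vectors), again reusing the weighted_kmeans helper; the entropy computation and the exact way the random percentage is applied are my assumptions.

```python
import numpy as np

def cluster_entropy(X, y, w, k):
    """Size-weighted average class entropy of the clusters (low entropy ~ pure clusters);
    an illustrative stand-in for the H(X) fitness used by adaptive clustering.
    Assumes y contains non-negative integer class labels."""
    labels = weighted_kmeans(X, w, k)              # helper from the earlier sketch
    total = 0.0
    for c in np.unique(labels):
        yc = y[labels == c]
        p = np.bincount(yc) / len(yc)
        p = p[p > 0]
        total += len(yc) / len(y) * -(p * np.log2(p)).sum()
    return total

def beam_search_step(X, y, k, beam, rng, lo=0.25, hi=0.50, L=5):
    """One beam-search iteration: for every weight vector in the beam, create 2p
    successors by increasing and decreasing each attribute weight once by a random
    percentage in [lo, hi]; keep the L successors with the lowest entropy."""
    successors = []
    for w in beam:
        for i in range(len(w)):
            for sign in (+1, -1):
                w_new = w.copy()
                w_new[i] *= 1 + sign * rng.uniform(lo, hi)
                successors.append(w_new)
    successors.sort(key=lambda w: cluster_entropy(X, y, w, k))
    return successors[:L]
```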

6. Related Distance Function Learning Research
– Interactive approaches that use user feedback and reinforcement learning to derive a good distance function.
– Other work uses randomized hill climbing and neural networks to learn distance functions for classification tasks; mostly, NN-queries are used to evaluate the quality of a clustering.
– Other work, mostly in the area of semi-supervised clustering, adapts object distances to cope with constraints.

7. Summary
– Described an approach that employs clustering for distance function evaluation.
– Introduced an attribute weight updating heuristic called inside/outside weight updating and evaluated its performance.
– The inside/outside weight updating approach enhanced a 1-NN classifier significantly for some UCI datasets, but not for all datasets that were tested.
– The quality of the approach depends on the number of clusters k, which is an input parameter; our current research centers on determining k automatically with a supervised clustering algorithm [EZZ04].
– The general idea of replacing a dataset by cluster representatives to enhance NN-classifiers shows a lot of promise in this research (as exemplified by the NCC classifier) and in other research we are currently conducting.
– Distance function learning is quite time consuming: one run of 200 iterations of inside/outside weight updating takes between 5 seconds and 5 minutes depending on dataset size and k-value, and other techniques we are currently investigating are significantly slower. Therefore, we are moving to high-performance computing facilities for the empirical evaluation of the distance function learning approaches.

Links to 4 Papers
1. [EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering --- Algorithms and Benefits, short version appeared in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004.
2. [RE05] T. Ryu and C. Eick, A Clustering Methodology and Tool, Information Sciences 171(1-3), 2005.
3. [ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, in Proc. MLDM'05, Leipzig, Germany, July 2005.
4. [BECV05] A. Bagherjeiran, C. Eick, C.-S. Chen, R. Vilalta, Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience, submitted for publication.

Questions?

Randomized Hill Climbing
Fast start: the algorithm starts from a small neighborhood size until it cannot find any better solutions; it then triples its neighborhood size, hoping that a better solution can be found by trying more points.
Shoulder condition: when the algorithm has moved onto a shoulder or flat hill, it will keep getting solutions with the same fitness value. Our algorithm terminates when it has tried 3 times and still gets the same result; this prevents it from being trapped on a shoulder forever.

Randomized Hill Climbing
[Plot of the objective function over the state space, illustrating a shoulder and a flat hill.]

Purity in Clusters Obtained (Internal), Test 2.2 (Beta = 0.4), Inside/Outside Weight Updating (repeated 200 times)
[Table of cluster purity by learning rate (%) for the datasets Diabetes, Vehicle, Heart-Statlog, Glass, Heart-C, Heart-H, and Ionosphere; SCEC parameters PS = 200, n = 30. The numbers are not reproduced in the transcript.]

Purity in Clusters Obtained (Internal), Test 2.2 (Beta = 0.4), Randomized Hill Climbing (p = 30)
[Table of cluster purity by learning rate r (%) for the datasets Diabetes, Vehicle, Heart-Statlog, Glass, Heart-C, Heart-H, and Ionosphere; SCEC parameters PS = 200, n = 30. The numbers are not reproduced in the transcript.]

Different Forms of Clustering
Objective of supervised clustering: minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).

A Fitness Function for Supervised Clustering
q(X) := Impurity(X) + β*Penalty(k)
– k: number of clusters used
– n: number of examples in the dataset
– c: number of classes in the dataset
– β: weight for Penalty(k), 0 < β ≤ 2.0
Penalty(k) increases sub-linearly, because increasing the number of clusters from k to k+1 has a greater effect on the end result when k is small than when it is large; hence the formula above.
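The penalty formula itself did not survive the transcript; in the supervised clustering work referenced as [EZZ04], the penalty that matches the sub-linear description above is, to the best of my recollection, the following (treat it as a reconstruction rather than a verbatim quote of the slide):

```latex
q(X) \;=\; \mathrm{Impurity}(X) + \beta \cdot \mathrm{Penalty}(k),
\qquad
\mathrm{Penalty}(k) \;=\;
\begin{cases}
\sqrt{\dfrac{k - c}{n}} & \text{if } k \ge c \\[6pt]
0 & \text{if } k < c
\end{cases}
```

with Impurity(X) being the fraction of examples that do not belong to the majority class of their cluster.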