Learning Semantics-Preserving Distance Metrics for Clustering Graphical Data
Aparna S. Varde, Elke A. Rundensteiner, Carolina Ruiz, Mohammed Maniruzzaman and Richard D. Sisson Jr.
ACM SIGKDD: MDM-05, Chicago, Illinois, USA, Aug 21, 2005
Introduction
- Experimental results in scientific domains are often plotted graphically
- Graph: 2-D plot of one experimental parameter versus another
- Example: heat transfer curve from the Heat Treating of Materials domain
- Domain semantics: LF = Leidenfrost point, BP = boiling point of the cooling medium, MAX = maximum heat transfer
Motivating Example
- Clustering is used to group graphs so that they can be compared
- Inferences drawn from the comparison help in decision support
- Clustering groups objects based on similarity; the notion of similarity is distance
- It is not known which distance metric best preserves semantics in clustering, so such a metric must be learned
- Problem when clustering with Euclidean distance: graphs fall in the same cluster although their LF points differ
Proposed Approach: LearnMet
- Given: a training set with actual clusters of graphs, correct as per the domain
- These are compared with predicted clusters obtained from a clustering algorithm
- Process:
  - Guess an initial metric for clustering
  - Refine the metric using the error between predicted and actual clusters
  - Output the metric whose error falls below a threshold as the learned metric
Categories of Distance Metrics
- In the literature:
  - Position-based, e.g., Euclidean [HK-01]: based on the absolute position of objects
  - Statistical, e.g., maximum distance [PNC-99]: based on statistical observations
- Introduced in our work [VRRMS-03/05]:
  - Critical Distance: distance between critical regions on graphs, calculated in a domain-specific manner
- [Figure: D_Max(A,B), D_LF(A,B) and D_BP(A,B) between graphs A and B]
LearnMet Strategy
1. Initial Metric Step: guess an initial metric
2. Clustering Step: cluster the graphs with this metric
3. Evaluation Step: evaluate the clustering accuracy
4. Adjustment Step: adjust the metric and re-execute, or halt
5. Final Metric Step: output the final metric
1. Initial Metric Step
- Input from domain experts:
  - Distance types applicable to graphs
  - Relative importance of each type (optional)
- Initial metric: each distance type forms a component
- Initial Weight Heuristic: assign weights to the components based on their relative importance, or assign random weights
- Metric definition: D = w1 D1 + w2 D2 + ... + wm Dm
- Example: D = 5 D_Euclidean + 3 D_Mean + 4 D_Critical
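The weighted-sum metric is straightforward to sketch in code. The Python fragment below is only an illustration of the metric definition above, not the authors' implementation; it assumes each graph is sampled at the same x-values and stored as a NumPy array of y-values, and the component functions shown are simple stand-ins for D_Euclidean and D_Mean.

```python
import numpy as np

def euclidean_distance(ga, gb):
    # Position-based component: pointwise Euclidean distance between two
    # curves sampled at the same x-values (assumed representation).
    return float(np.linalg.norm(ga - gb))

def mean_distance(ga, gb):
    # Statistical component: difference between the mean y-values.
    return float(abs(ga.mean() - gb.mean()))

def weighted_metric(ga, gb, components, weights):
    # D(ga, gb) = w1 D1(ga, gb) + ... + wm Dm(ga, gb)
    return sum(w * comp(ga, gb) for comp, w in zip(components, weights))

# Initial guess from the slide, minus the domain-specific critical component:
components = [euclidean_distance, mean_distance]
weights = [5.0, 3.0]
```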
2. Clustering Step
- Clustering algorithm: e.g., k-means
- k = number of actual clusters
- Notion of distance = D
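A minimal clustering sketch, assuming the curve representation from the previous fragment: assignment uses the current metric D, and centroids are updated as pointwise means of the member curves. This is an illustrative k-means variant, not necessarily the exact procedure used in the paper.

```python
import numpy as np

def cluster_graphs(graphs, k, metric, n_iter=20, seed=0):
    # k-means-style clustering: assign each graph to the nearest centroid
    # under the learned metric, then recompute centroids as pointwise means.
    rng = np.random.default_rng(seed)
    centroids = [graphs[i] for i in rng.choice(len(graphs), k, replace=False)]
    labels = np.zeros(len(graphs), dtype=int)
    for _ in range(n_iter):
        labels = np.array([
            int(np.argmin([metric(g, c) for c in centroids])) for g in graphs
        ])
        for j in range(k):
            members = [g for g, lbl in zip(graphs, labels) if lbl == j]
            if members:
                centroids[j] = np.mean(members, axis=0)
    return labels

# Usage with the weighted metric sketched above:
# labels = cluster_graphs(graphs, k, lambda a, b: weighted_metric(a, b, components, weights))
```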
3. Cluster Evaluation Step
- For each pair of graphs:
  - True Positive (TP): same actual, same predicted cluster, e.g., (g1, g2)
  - True Negative (TN): different actual, different predicted clusters, e.g., (g2, g3)
  - False Positive (FP): different actual, same predicted cluster, e.g., (g3, g4)
  - False Negative (FN): same actual, different predicted clusters, e.g., (g4, g5)
- Error measure: Failure Rate FR = (FP + FN) / (TP + TN + FP + FN)
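The pair counts and the failure rate follow directly from the actual and predicted cluster labels; the helper below is a straightforward sketch of that bookkeeping.

```python
from itertools import combinations

def pair_confusion(actual, predicted, pairs=None):
    # Count TP, TN, FP and FN over graph pairs, given actual and predicted
    # cluster labels indexed by graph id.
    if pairs is None:
        pairs = list(combinations(range(len(actual)), 2))
    tp = tn = fp = fn = 0
    for a, b in pairs:
        same_actual = actual[a] == actual[b]
        same_predicted = predicted[a] == predicted[b]
        if same_actual and same_predicted:
            tp += 1
        elif not same_actual and not same_predicted:
            tn += 1
        elif not same_actual and same_predicted:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def failure_rate(tp, tn, fp, fn):
    # FR = (FP + FN) / (TP + TN + FP + FN)
    return (fp + fn) / (tp + tn + fp + fn)
```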
3. Cluster Evaluation (Contd.)
- Total number of graphs = G; number of pairs = C(G,2) = G! / (2!(G-2)!)
  - E.g., 25 graphs give 300 pairs
- Pairs per epoch (ppe): a distinct combination of pairs is used in each epoch
  - If ppe = 25, the number of distinct combinations is C(300,25) ≈ 1.95 x 10^36
  - This avoids overfitting and reduces time complexity
- Error threshold t: the extent of error (FR) allowed; if FR < t, the clustering is accurate
- Worked example: ppe = 15, t = 0.1; TP = 2, TN = 10, FP = 1 (g3, g4), FN = 2 (g4, g5), (g4, g6); FR = (1 + 2) / 15 = 0.2 > t, so the clustering is not accurate
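One simple way to draw a distinct set of ppe pairs for an epoch is to sample without replacement from the C(G,2) possible pairs; the slide only says that each epoch uses a distinct combination of pairs, so this sampling scheme is an assumption.

```python
import random
from itertools import combinations

def sample_epoch_pairs(num_graphs, ppe, seed=None):
    # Draw ppe distinct graph pairs out of the C(G,2) possible pairs.
    all_pairs = list(combinations(range(num_graphs), 2))
    return random.Random(seed).sample(all_pairs, ppe)

# 25 graphs -> 300 pairs; one epoch uses 25 of them.
epoch_pairs = sample_epoch_pairs(25, 25, seed=0)
```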
3. Cluster Evaluation (Contd.)
- Distance between a pair of graphs: D(ga, gb) = w1 D1(ga, gb) + ... + wm Dm(ga, gb)
- D_FN = (1/FN) * sum over the FN pairs of D(ga, gb)
  - Cause of error: D_FN is too high
- D_FP = (1/FP) * sum over the FP pairs of D(ga, gb)
  - Cause of error: D_FP is too low
- Example: D_FN = [D(g4, g5) + D(g4, g6)] / 2; D_FP = D(g3, g4) / 1
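A sketch of these averages, reusing the component functions and weights introduced earlier; it returns both the average total distance over a set of pairs (D_FN or D_FP) and the average per-component distances used in the weight adjustment that follows.

```python
def average_pair_distances(pairs, graphs, components, weights):
    # Average D(ga, gb) over a set of pairs (e.g., the FN or FP pairs),
    # plus the average of each component distance D_i(ga, gb).
    if not pairs:
        return 0.0, [0.0] * len(components)
    per_component = [0.0] * len(components)
    total = 0.0
    for a, b in pairs:
        for i, (comp, w) in enumerate(zip(components, weights)):
            d_i = comp(graphs[a], graphs[b])
            per_component[i] += d_i
            total += w * d_i
    n = len(pairs)
    return total / n, [d / n for d in per_component]
```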
4. Weight Adjustment Step
- FN pairs: to reduce the error, decrease D_FN
- FN Heuristic: decrease the weights in proportion to the distance contribution of each component
  - For each component: w_i' = w_i - D_FN_i / D_FN, where D_FN_i = (1/FN) * sum over the FN pairs of D_i(ga, gb)
- Example: D = 5 D_Euclidean + 3 D_Mean + 4 D_Critical
  - D_FN = 100, D_FN_Euclidean = 80, D_FN_Mean = 1, D_FN_Critical = 19
  - w_Euclidean' = 5 - 80/100 = 4.2
  - w_Mean' = 3 - 1/100 = 2.99
  - w_Critical' = 4 - 19/100 = 3.81
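The FN heuristic reduces to a one-line update per component; the snippet below reproduces the slide's worked example.

```python
def fn_adjust(weights, d_fn_components, d_fn_total):
    # FN heuristic: w_i' = w_i - D_FN_i / D_FN
    return [w - d_i / d_fn_total for w, d_i in zip(weights, d_fn_components)]

# Worked example from the slide: D_FN = 100, contributions 80, 1 and 19.
print(fn_adjust([5, 3, 4], [80, 1, 19], 100))  # ≈ [4.2, 2.99, 3.81]
```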
4. Weight Adjustment (Contd.)
- FP pairs: to reduce the error, increase D_FP
- FP Heuristic: increase the weights in proportion to the distance contribution of each component
  - For each component: w_i'' = w_i + D_FP_i / D_FP, where D_FP_i = (1/FP) * sum over the FP pairs of D_i(ga, gb)
- Example: D = 5 D_Euclidean + 3 D_Mean + 4 D_Critical
  - D_FP = 200, D_FP_Euclidean = 15, D_FP_Mean = 85, D_FP_Critical = 100
  - w_Euclidean'' = 5 + 15/200 = 5.075
  - w_Mean'' = 3 + 85/200 = 3.425
  - w_Critical'' = 4 + 100/200 = 4.5
4. Weight Adjustment (Contd.)
- Combining the two: Weight Adjustment Heuristic
  - w_i''' = max(0, w_i - FN*(D_FN_i / D_FN) + FP*(D_FP_i / D_FP))
- New metric: D''' = w1''' D1 + w2''' D2 + ... + wm''' Dm
- Clustering is done with the new metric D'''
- If the clustering is accurate, a confirmatory test is run with that metric for 2 more epochs
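A direct transcription of the combined heuristic, using the FN and FP pair counts and the per-component averages computed above; the max(0, ...) clamp keeps the weights non-negative.

```python
def adjust_weights(weights, fn_count, fp_count,
                   d_fn_components, d_fn_total,
                   d_fp_components, d_fp_total):
    # w_i''' = max(0, w_i - FN*(D_FN_i / D_FN) + FP*(D_FP_i / D_FP))
    new_weights = []
    for i, w in enumerate(weights):
        delta = 0.0
        if fn_count and d_fn_total:
            delta -= fn_count * d_fn_components[i] / d_fn_total
        if fp_count and d_fp_total:
            delta += fp_count * d_fp_components[i] / d_fp_total
        new_weights.append(max(0.0, w + delta))
    return new_weights
```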
5. Final Metric Step
- If the error falls below the threshold, output the corresponding D as the learned metric
- If the maximum number of epochs is reached, output the D with the lowest error over all epochs
- Example: D = 4.671 D_Euclidean + 5.564 D_Mean + 3.074 D_Critical
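Putting the pieces together, a minimal end-to-end sketch of the LearnMet loop built from the hypothetical helpers defined above; details such as the confirmatory epochs after an accurate clustering are omitted for brevity.

```python
def learn_met(graphs, actual, weights, components,
              ppe=25, threshold=0.1, max_epochs=100):
    # Iteratively cluster, evaluate and adjust until FR < threshold,
    # tracking the best metric seen in case the epoch cap is reached.
    best_weights, best_fr = list(weights), float("inf")
    k = len(set(actual))
    for epoch in range(max_epochs):
        metric = lambda a, b: weighted_metric(a, b, components, weights)
        predicted = cluster_graphs(graphs, k, metric)
        pairs = sample_epoch_pairs(len(graphs), ppe, seed=epoch)
        tp, tn, fp, fn = pair_confusion(actual, predicted, pairs)
        fr = failure_rate(tp, tn, fp, fn)
        if fr < best_fr:
            best_fr, best_weights = fr, list(weights)
        if fr < threshold:
            break
        fn_pairs = [(a, b) for a, b in pairs
                    if actual[a] == actual[b] and predicted[a] != predicted[b]]
        fp_pairs = [(a, b) for a, b in pairs
                    if actual[a] != actual[b] and predicted[a] == predicted[b]]
        d_fn, d_fn_i = average_pair_distances(fn_pairs, graphs, components, weights)
        d_fp, d_fp_i = average_pair_distances(fp_pairs, graphs, components, weights)
        weights = adjust_weights(weights, len(fn_pairs), len(fp_pairs),
                                 d_fn_i, d_fn, d_fp_i, d_fp)
    return best_weights
```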
Experimental Evaluation of LearnMet
- Evaluated rigorously in the Heat Treating domain
- Summary of the evaluation:
  - Number of graphs in the training set: G = 25
  - Number of pairs in the training set: C(G,2) = 300
  - Number of pairs per epoch: ppe = 25
  - Number of distinct combinations of pairs: C(300,25) ≈ 1.95 x 10^36
  - Error threshold: t = 0.1 = 10%
- Distinct test set with 300 pairs of graphs from 25 different graphs
- Actual clusters over the test set given by experts
- Initial metrics:
  - DE1, DE2: given by domain experts
  - EQU: equal weights for all components
  - Several metrics with random weights, e.g., RND1, RND2
Experimental Evaluation (Contd.)
- [Table: Initial Metrics in LearnMet Experiments]
- [Table: Learned Metrics and Number of Epochs to Learn]
Observations during Training
- [Training plots for experiments DE1, DE2, EQU, RND1 and RND2]
Observations during Testing
- Graphs in the test set are clustered using the learned metrics and using Euclidean distance (ED)
- Predicted clusters are compared with the actual clusters
- Accuracy measure: Success Rate SR = (TP + TN) / (TP + TN + FP + FN)
- Test set observation: accuracy with the LearnMet metrics is higher
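Using the same pair counts as in the evaluation sketch above, the success rate is simply the complement of the failure rate:

```python
def success_rate(tp, tn, fp, fn):
    # SR = (TP + TN) / (TP + TN + FP + FN) = 1 - FR
    return (tp + tn) / (tp + tn + fp + fn)
```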
Related Work
- Learning nearest neighbors in high-dimensional spaces [HAK-00]: focus is dimensionality reduction; does not deal with graphs
- Distance metric learning given a basic formula [XNJR-03]: does not deal with graphs or the relative importance of features
- Similarity search in multimedia databases [KB-04]: uses various metrics in different applications; does not learn a single metric
- Fourier transforms [F-55]: do not preserve critical regions in the domain, due to the nature of the transform
- Genetic algorithms [F-58]: when used for feature selection they give lower accuracy, since they lack domain knowledge
- Linear regression [A-73]: distance values between pairs of graphs are not available as a training set
- Neural networks [B-96]: poor interpretability; hard to incorporate domain knowledge
Conclusions
- LearnMet is proposed to learn semantics-preserving distance metrics for graphs
- It minimizes the error between predicted and actual clusters of graphs
- Ongoing work: maximizing the accuracy of LearnMet
  - Finding a good value for the number of pairs per epoch
  - Selecting components without domain expert input
  - Defining scaling factors for the weight adjustments