(Rare) Category Detection Using Hierarchical Mean Shift Pavan Vatturi Weng-Keen Wong

Pavan Vatturi (vatturi@eecs.oregonstate.edu), Weng-Keen Wong (wong@eecs.oregonstate.edu)

1. Introduction Applications for surveillance, monitoring, scientific discovery, and data cleaning require anomaly detection. Anomalies are often identified as statistically unusual data points, but many anomalies are simply irrelevant or correspond to known sources of noise.

1. Introduction Known objects make up 99.9% of the data and anomalies 0.1%. Of the anomalies, 99% are uninteresting and 1% are interesting. Pictures from: Sloan Digital Sky Survey (http://www.sdss.org/iotw/archive.html); Pelleg, D. (2004). Scalable and Practical Probability Density Estimators for Scientific Anomaly Detection. PhD Thesis, Carnegie Mellon University.

1. Introduction Category Detection [Pelleg and Moore 2004]: human-in-the-loop exploratory data analysis. Workflow: Data Set -> Build Model -> Spot Interesting Data Points -> Ask User to Label Categories of Interesting Data Points -> Update Model with Labels, then repeat.

1. Introduction Workflow: Data Set -> Build Model -> Spot Interesting Data Points -> Ask User to Label Categories of Interesting Data Points -> Update Model with Labels. When queried, the user can either label the data point under an existing category or declare that it belongs to a previously undeclared category.

1. Introduction Goal: present the user with a single instance from each category in as few queries as possible. Rare categories are difficult to detect when class imbalance is severe, and rare categories are the ones of interest for anomaly detection.

Outline 1. Introduction 2. Related Work 3. Background 4. Methodology 5. Results 6. Conclusion / Future Work

2. Related Work Interleave [Pelleg and Moore 2004]; nearest-neighbor-based active learning for rare-category detection for multiple classes [He and Carbonell 2008]; multiple output identification [Fine and Mansour 2006].

3. Background: Mean Shift [Fukunaga and Hostetler 1975] Figure labels: reference data set, query point, center of mass, mean shift vector (follows the density gradient), mean shift vector with kernel k.
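The equation shown on this slide is not part of the transcript. For orientation, a commonly used form of the kernel-weighted mean shift vector is given below. The symbols x (query point), x_i (reference points), n (number of reference points), and h (bandwidth) are assumptions introduced here, and the slide's own formula may differ in notation or in the weighting function used (some derivations weight by the derivative of the kernel profile, which for a Gaussian is proportional to the kernel itself).

```latex
m_h(x) \;=\;
\frac{\sum_{i=1}^{n} x_i \, k\!\left(\left\lVert \frac{x - x_i}{h} \right\rVert^{2}\right)}
     {\sum_{i=1}^{n} k\!\left(\left\lVert \frac{x - x_i}{h} \right\rVert^{2}\right)}
\;-\; x
```

The first term is the kernel-weighted center of mass of the reference data around x, so the vector points from the query point toward that center of mass.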

3. Background: Mean Shift [Fukunaga and Hostetler 1975] Figure labels: reference data set, query point, center of mass. Repeated shifts converge to the cluster center.
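The convergence behavior described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the Gaussian kernel, the function names `gaussian_weight` and `mean_shift_point`, the stopping tolerance, and the toy data are all assumptions made for the example.

```python
import math

def gaussian_weight(x, r, h):
    """Gaussian kernel weight between points x and r for bandwidth h (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, r))
    return math.exp(-sq_dist / (2.0 * h * h))

def mean_shift_point(query, reference, h, tol=1e-8, max_iter=1000):
    """Repeatedly move the query point to the kernel-weighted center of mass."""
    x = list(query)
    for _ in range(max_iter):
        weights = [gaussian_weight(x, r, h) for r in reference]
        total = sum(weights)
        center = [sum(w * r[d] for w, r in zip(weights, reference)) / total
                  for d in range(len(x))]
        shift = math.dist(center, x)  # length of the mean shift vector
        x = center
        if shift < tol:               # stop once the point no longer moves
            break
    return x

# Toy reference set: a tight group near the origin and a distant pair.
reference = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0)]
mode = mean_shift_point((0.5, 0.5), reference, h=0.5)
```

Starting from (0.5, 0.5), the query point ends up inside the group near the origin, because the distant pair receives almost no kernel weight at this bandwidth.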

3. Background: Mean Shift Blurring Figure labels: reference data set, query point, center of mass. Blurring: the query points are the same as the reference data set, so each iteration progressively blurs the original data set.
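A small sketch may help make the blurring variant concrete: every point acts as a query against the current data set, and all points are replaced at once. The Gaussian kernel, the name `blur_step`, the bandwidth, and the one-dimensional toy data are assumptions for illustration only.

```python
import math

def blur_step(points, h):
    """One blurring step: each point moves to its kernel-weighted mean over all points."""
    new_points = []
    for x in points:
        weights = [math.exp(-math.dist(x, p) ** 2 / (2.0 * h * h)) for p in points]
        total = sum(weights)  # includes the point's own weight of 1, so never zero
        new_points.append(tuple(
            sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(x))
        ))
    return new_points

# Two well-separated groups on a line.
points = [(0.0,), (0.1,), (0.2,), (3.0,), (3.1,)]
blurred = points
for _ in range(20):
    blurred = blur_step(blurred, h=0.3)
```

After a few iterations the points within each group collapse onto a common location while the two groups stay apart, which is the progressive blurring the slide refers to.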

3. Background: Mean Shift End result of applying mean shift to a synthetic data set

4. Methodology: Overview 1. Sphere the data 2. Hierarchical Mean Shift 3. Query user
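Sphering (whitening) rescales the data to zero mean and identity covariance so that a single bandwidth is meaningful in every direction. The slides do not say which whitening transform is used; the sketch below assumes a Cholesky-based transform, and the function name `sphere` and the sample data are hypothetical.

```python
import math

def sphere(data):
    """Whiten rows of `data` so the result has zero mean and identity covariance."""
    n, d = len(data), len(data[0])
    mean = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in data]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Cholesky factor L with cov = L L^T (assumes cov is positive definite).
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = cov[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # Solve L y = x by forward substitution for each centered row.
    sphered = []
    for x in centered:
        y = []
        for i in range(d):
            y.append((x[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
        sphered.append(y)
    return sphered

data = [[2.0, 1.0], [4.0, 3.5], [6.0, 4.0], [8.0, 7.5], [10.0, 8.0]]
sphered = sphere(data)
```

Any other whitening transform differs from this one only by a rotation, so Euclidean distances between sphered points, which are what mean shift depends on, are the same.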

4. Methodology: Hierarchical Mean Shift Repeatedly blur the data using Mean Shift with increasing bandwidth: h_new = k * h_old
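The bandwidth schedule h_new = k * h_old can be sketched as a loop that blurs the data at one bandwidth, records how many distinct cluster centers remain, and then widens the bandwidth. Everything beyond the schedule itself is an assumption: the Gaussian kernel, the fixed number of blur iterations per level, the merge tolerance, and the names `blur_step` and `hierarchical_mean_shift`.

```python
import math

def blur_step(points, h):
    """One blurring mean shift step with an (assumed) Gaussian kernel."""
    out = []
    for x in points:
        w = [math.exp(-math.dist(x, p) ** 2 / (2.0 * h * h)) for p in points]
        total = sum(w)
        out.append(tuple(sum(wi * p[d] for wi, p in zip(w, points)) / total
                         for d in range(len(x))))
    return out

def hierarchical_mean_shift(points, h0, k, levels, blur_iters=30, merge_tol=1e-3):
    """Return [(bandwidth, number_of_cluster_centers), ...], one entry per level."""
    history = []
    h = h0
    pts = [tuple(p) for p in points]
    for _ in range(levels):
        for _ in range(blur_iters):
            pts = blur_step(pts, h)
        centers = []  # distinct locations the blurred points have collapsed to
        for p in pts:
            if all(math.dist(p, c) > merge_tol for c in centers):
                centers.append(p)
        history.append((h, len(centers)))
        h *= k        # h_new = k * h_old
    return history

points = [(0.0,), (0.1,), (1.0,), (1.1,), (6.0,)]
history = hierarchical_mean_shift(points, h0=0.1, k=2.0, levels=7)
```

On this toy data the smallest bandwidth leaves three clusters, and by the largest bandwidth everything has merged into one. The bandwidths at which a cluster appears and disappears are the quantities used later to rank clusters.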

4. Methodology: Hierarchical Mean Shift Mean Shift complexity is O(n^2 d m), where n = number of data points, d = dimensionality of the data points, and m = number of mean shift iterations. A single kd-tree optimization is used to speed up Hierarchical Mean Shift.

4. Methodology: Querying the User Rank the cluster centers to decide which to query the user about. 1. Outlierness [Leung et al. 2000] for cluster C_i: Lifetime of C_i = log(bandwidth when cluster C_i is merged with other clusters - bandwidth when cluster C_i is formed)
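As a rough sketch, ranking by lifetime only needs the two bandwidths recorded for each cluster. The code follows the slide's wording literally (the logarithm of the bandwidth difference); if the intended definition is instead a difference of log-bandwidths, only the `lifetime` function would change. The cluster names and bandwidth values are made up for illustration, and the assumption that longer-lived clusters are queried first is not stated on the slide.

```python
import math

def lifetime(h_formed, h_merged):
    """Lifetime as transcribed from the slide: log(h_merged - h_formed)."""
    return math.log(h_merged - h_formed)

def rank_by_lifetime(clusters):
    """Sort cluster names so that the longest-lived cluster comes first (assumed order)."""
    return sorted(clusters, key=lambda name: lifetime(*clusters[name]), reverse=True)

# Hypothetical clusters: name -> (bandwidth when formed, bandwidth when merged).
clusters = {"A": (0.1, 0.4), "B": (0.1, 6.4), "C": (0.2, 1.6)}
ranked = rank_by_lifetime(clusters)
```

Cluster B survives over the widest range of bandwidths, so it is ranked first, followed by C and then A.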

4. Methodology: Querying the User Rank the cluster centers to decide which to query the user about. 2. Compactness + Isolation [Leung et al. 2000] for cluster C_i: [equation not preserved in the transcript]

4. Methodology: Tiebreaker Ties may occur in Outlierness or Compactness/Isolation values. Highest average distance (HAD) heuristic: choose the cluster center with the highest average distance from the user-labeled points.
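The tiebreaker can be sketched directly from its description. The function name `pick_by_highest_average_distance`, the use of Euclidean distance, and the fallback to the first tied center when nothing has been labeled yet are assumptions made for the example.

```python
import math

def pick_by_highest_average_distance(tied_centers, labeled_points):
    """Among tied cluster centers, pick the one farthest on average from labeled points."""
    if not labeled_points:
        return tied_centers[0]  # assumed fallback: no labels yet, keep the first candidate

    def average_distance(center):
        return sum(math.dist(center, p) for p in labeled_points) / len(labeled_points)

    return max(tied_centers, key=average_distance)

tied = [(0.0, 0.0), (10.0, 10.0), (1.0, 1.0)]
labeled = [(0.0, 1.0), (1.0, 0.0)]
choice = pick_by_highest_average_distance(tied, labeled)
```

Here the center at (10, 10) is chosen, since it lies far from everything the user has already labeled and is therefore the most likely of the three to belong to an unseen category.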

5. Results Data sets used in experiments
Name        Dims  Records  Classes  Smallest Class  Largest Class
Abalone     7     4177     20       0.34%           16%
Shuttle     8     4000     7        0.02%           64.2%
OptDigits   64    1040     10       0.77%           50%
OptLetters  16    2128     26       0.37%           24%
Statlog     19    512      7        1.5%            50%
Yeast       8     1484     10       0.33%           31.68%
Shuttle, OptDigits, OptLetters, and Statlog were subsampled to simulate class imbalance.

5. Results (Yeast) Category detection metric: number of queries before the user has been presented with at least one example from every category.
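The metric can be computed from the sequence of labels the user gives in query order. The sketch below is one straightforward reading of the definition; the function name `queries_to_discover_all` and the example labels are hypothetical.

```python
def queries_to_discover_all(query_labels, all_categories):
    """Return the 1-based query count at which every category has been seen, or None."""
    remaining = set(all_categories)
    for count, label in enumerate(query_labels, start=1):
        remaining.discard(label)
        if not remaining:
            return count
    return None  # some category was never presented to the user

labels_in_query_order = ["a", "a", "b", "a", "c", "b"]
n_queries = queries_to_discover_all(labels_in_query_order, {"a", "b", "c"})
```

In this example the last unseen category, "c", first appears at the fifth query, so the metric is 5.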

5. Results (Statlog)

5. Results (OptLetters)

5. Results (OptDigits)

5. Results (Shuttle)

5. Results (Abalone)

5. Results Number of hints to discover all classes
Columns: HMS-CI, HMS-CI+HAD, HMS-Out, HMS-Out+HAD, NNDM, Interleave. The cell boundaries within each row did not survive extraction, so the digits are kept as transcribed.
Abalone     119593603385124193
Shuttle     4432362816235
OptDigits   100 160118576117
OptLetters  133 161182420489
Statlog     18203412422854
Yeast       73911037788111

5. Results Area under the category detection curve
Dataset     HMS-CI  HMS-CI+HAD  HMS-Out  NNDM   Interleave
Abalone     0.835   0.873       0.837    0.846  0.840
Shuttle     0.925   0.929       0.917    0.480  0.905
OptDigits   0.855   0.855       0.840    0.199  0.808
OptLetters  0.936   0.936       0.917    0.573  0.765
Statlog     0.956   0.958       0.944    0.472  0.934
Yeast       0.821   0.805       0.793    0.838  0.778

6. Conclusion / Future Work Conclusions: HMS-based methods consistently discover more categories in fewer queries than existing methods. They do not need a priori knowledge of dataset properties, e.g., the total number of classes.

6. Conclusion / Future Work Future Work: better use of user feedback; presentation of an entire cluster to the user instead of a representative data point; improved computational efficiency; theoretical analysis.
