1 Fast Kernel-Density-Based Classification and Clustering Using P-Trees Anne Denton Major Advisor: William Perrizo

2 Outline
 Introduction
 P-Trees: Concepts, Implementation
 Kernel Methods
 Paper 1: Rule-Based Classification
 Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
 Paper 3: Hierarchical Clustering
 Outlook

3 Introduction
 Data Mining
    Information from data
    Considers storage issues
 P-Tree Approach
    Bit-column-based storage (sketched below)
    Compression
    Hardware optimization
    Simple index construction
    Flexibility in storage organization
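
A minimal Java sketch of the bit-column-based storage idea: each 8-bit attribute column is split into one bit vector per bit position, so counts can be answered with bit operations. The class and method names are illustrative, not the actual P-tree implementation.

```java
import java.util.BitSet;

public class BitColumnDecomposition {
    /** Splits an 8-bit attribute column into one BitSet per bit position. */
    static BitSet[] decompose(int[] column) {
        BitSet[] slices = new BitSet[8];
        for (int b = 0; b < 8; b++) slices[b] = new BitSet(column.length);
        for (int row = 0; row < column.length; row++)
            for (int b = 0; b < 8; b++)
                if (((column[row] >> b) & 1) == 1) slices[b].set(row);
        return slices; // slices[b] holds bit b of every row; each could be compressed into a P-tree
    }

    public static void main(String[] args) {
        BitSet[] slices = decompose(new int[] {200, 201, 203, 12});
        // Counting rows whose top bit is set is a single cardinality call:
        System.out.println(slices[7].cardinality()); // prints 3
    }
}
```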

4 P-Tree Concepts
 Ordering (details)
    New: Generalized Peano order sorting (sketched below)
 Compression
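
A hedged sketch of what Peano (Z-/Morton-) order sorting does: rows are sorted by the key obtained by interleaving the bits of their attributes, most significant bits first, so similar rows become neighbors and the resulting bit columns compress well. This illustrates the ordering idea only, not the generalized algorithm from the talk.

```java
import java.util.Arrays;
import java.util.Comparator;

public class PeanoSort {
    /** Interleaves the bits of several 8-bit attributes into one Morton key. */
    static long mortonKey(int[] attrs) {
        long key = 0;
        for (int bit = 7; bit >= 0; bit--)          // most significant bit first
            for (int a = 0; a < attrs.length; a++)
                key = (key << 1) | ((attrs[a] >> bit) & 1);
        return key;
    }

    public static void main(String[] args) {
        int[][] rows = {{200, 10}, {12, 200}, {201, 11}};
        Arrays.sort(rows, Comparator.comparingLong(PeanoSort::mortonKey));
        // {200,10} and {201,11} end up adjacent because they share high-order bits:
        System.out.println(Arrays.deepToString(rows)); // [[12, 200], [200, 10], [201, 11]]
    }
}
```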

5 Impact of Peano Order Sorting

6 P-Tree Implementation
 Implementation in Java
    Was ported to C / C++ (Amal Perera, Masum Serazi)
    Fastest compressing P-tree implementation so far
 Array indices as pointers (details)
 Grandchild purity

7 Kernel-Density-Based Classification
 Probability of an attribute vector x, conditional on class label value c_i, estimated from the N training points x_t; a kernel density estimate of the form
    P(x \mid c_i) \approx \frac{\sum_{t=1}^{N} [\,c(x_t) = c_i\,]\, K(x, x_t)}{\sum_{t=1}^{N} [\,c(x_t) = c_i\,]}
 [·] is 1 if the enclosed condition is true, 0 otherwise
 Kernel function K(x, x_t) can be, e.g., a Gaussian function or a step function (sketched below)
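
A minimal Java sketch of the estimate above with a Gaussian kernel; the bandwidth parameter sigma is an assumption, and the explicit scan over training points stands in for the P-tree count evaluation.

```java
public class KernelClassifier {
    /** Gaussian product kernel; the bandwidth sigma is an assumed parameter. */
    static double kernel(double[] x, double[] xt, double sigma) {
        double d2 = 0;
        for (int k = 0; k < x.length; k++) {
            double diff = x[k] - xt[k];
            d2 += diff * diff;
        }
        return Math.exp(-d2 / (2 * sigma * sigma));
    }

    /** Kernel estimate of P(x | c): average kernel weight over training points labeled c. */
    static double classDensity(double[] x, double[][] train, int[] labels, int c, double sigma) {
        double sum = 0;
        int count = 0;
        for (int t = 0; t < train.length; t++) {
            if (labels[t] == c) {              // the Iverson bracket [c(x_t) = c_i]
                sum += kernel(x, train[t], sigma);
                count++;
            }
        }
        return count == 0 ? 0 : sum / count;   // classify x by the class with the largest density
    }
}
```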

8 Higher Order Basic Bit Distance (HOBbit)
 P-trees make count evaluation efficient for intervals of values that share their high-order bits with the test value (see the sketch below)
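
A hedged sketch of one common formulation of the HOBbit distance: the distance between two values is 2^m, where m is the position of the most significant bit in which they differ (0 when they are equal). Under this formulation, all values within distance 2^m of x share x's bits above position m, so the interval count is one AND over bit-column P-trees.

```java
public class Hobbit {
    /** HOBbit distance: 0 if equal, else 2^m with m the most significant differing bit (assumed form). */
    static long hobbitDistance(int a, int b) {
        if (a == b) return 0;
        int m = 31 - Integer.numberOfLeadingZeros(a ^ b);
        return 1L << m;
    }

    public static void main(String[] args) {
        System.out.println(hobbitDistance(0b1101, 0b1100)); // 1: values differ only in bit 0
        System.out.println(hobbitDistance(0b1101, 0b0101)); // 8: values differ in bit 3
    }
}
```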

9 Paper 1: Rule-Based Classification
 Goal: High accuracy on large data sets, including standard ones (UCI ML Repository)
 Neighborhood evaluated through
    Equality of categorical attributes
    HOBbit intervals for continuous attributes
 Curse of dimensionality
    Volume around the test sample is empty with high likelihood
 Information gain to select attributes (sketched below)
    Attributes considered binary, based on test sample (“Lazy decision trees”, Friedman ’96 [4])
    Continuous data: interval around test sample
    Exact information gain (details)
    Pursuing multiple paths
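
A minimal Java sketch of the binary information-gain step, under the assumption that each attribute is reduced to the test "inside vs. outside the interval around the test sample" and that the class counts for both sides come from P-tree operations.

```java
public class InfoGain {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    /** Binary entropy of a class proportion p. */
    static double entropy(double p) {
        if (p <= 0 || p >= 1) return 0;
        return -p * log2(p) - (1 - p) * log2(1 - p);
    }

    /** Gain of the binary attribute test, given class counts inside (posIn/negIn)
     *  and outside (posOut/negOut) the interval around the test sample. */
    static double gain(int posIn, int negIn, int posOut, int negOut) {
        double n = posIn + negIn + posOut + negOut;
        double parent = entropy((posIn + posOut) / n);
        double in = (posIn + negIn) == 0 ? 0 : entropy(posIn / (double) (posIn + negIn));
        double out = (posOut + negOut) == 0 ? 0 : entropy(posOut / (double) (posOut + negOut));
        return parent - (posIn + negIn) / n * in - (posOut + negOut) / n * out;
    }

    public static void main(String[] args) {
        System.out.println(gain(40, 10, 10, 40)); // informative split: ~0.278 bits
    }
}
```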

10 Results: Accuracy
 Comparable to C4.5 after much less development time
 5 data sets from UCI Machine Learning Repository (details)
 2 additional data sets: Crop, Gene-function
 Improvement through multiple paths (20):

    Data set    C4.5    20 paths
    adult       15.54   14.93
    kr-vs-kp    0.80    0.84
    mushroom    0       0

11 Results: Speed
 Used on largest UCI data sets
 Scaling of execution time as a function of training set size

12 Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
 Goal: Handling many attributes
 Naïve Bayes: P(x \mid c_i) \approx \prod_k P(x^{(k)} \mid c_i), where x^{(k)} is the value of the k-th attribute
 Semi-naïve Bayes (sketched below)
    Correlated attributes are joined
    Has been done for categorical data (Kononenko ’91 [5], Pazzani ’96 [6])
    Previously: continuous data discretized
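
A minimal Java sketch of the joining step, assuming categorical values and simple frequency counts (the paper's kernel estimates would replace the counts): a correlated pair (a, b) contributes one joint conditional instead of two independent naïve-Bayes factors.

```java
public class SemiNaiveBayes {
    /** P(x_a = xa, x_b = xb | class = c) from frequency counts, used when
     *  attributes a and b have been joined because they are correlated. */
    static double jointConditional(int[][] data, int[] labels, int a, int b,
                                   int xa, int xb, int c) {
        int match = 0, inClass = 0;
        for (int t = 0; t < data.length; t++) {
            if (labels[t] != c) continue;
            inClass++;
            if (data[t][a] == xa && data[t][b] == xb) match++;
        }
        return inClass == 0 ? 0 : (double) match / inClass;
        // The remaining, unjoined attributes contribute independent
        // conditionals as in ordinary naïve Bayes (omitted here).
    }
}
```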

13 Kernel-Based Naïve Bayes
 Alternatives for continuous data
    Discretization
    Distribution function: Gaussian with mean and standard deviation from data
 No alternative for the semi-naïve approach
    Kernel density estimate (Hastie [7]) (see the sketch below)
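
A minimal Java sketch contrasting the two one-dimensional estimates named on the slide: a parametric Gaussian fitted by mean and standard deviation, and a nonparametric Gaussian-kernel density estimate; the bandwidth h is an assumed parameter.

```java
public class OneDimEstimates {
    /** Parametric alternative: Gaussian with mean and variance taken from the data. */
    static double gaussianFit(double x, double[] data) {
        double mean = 0, var = 0;
        for (double v : data) mean += v;
        mean /= data.length;
        for (double v : data) var += (v - mean) * (v - mean);
        var /= data.length;
        return Math.exp(-(x - mean) * (x - mean) / (2 * var)) / Math.sqrt(2 * Math.PI * var);
    }

    /** Nonparametric alternative: Gaussian-kernel density estimate with bandwidth h. */
    static double kernelEstimate(double x, double[] data, double h) {
        double sum = 0;
        for (double v : data)
            sum += Math.exp(-(x - v) * (x - v) / (2 * h * h));
        return sum / (data.length * h * Math.sqrt(2 * Math.PI));
    }
}
```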

14 Correlations
 Correlation between attributes a and b, computed over the N training points t (see the sketch below)
 Kernel function for continuous data uses d_EH, the exponential HOBbit distance
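
A heavily hedged Java sketch of one plausible kernel-based correlation measure, not necessarily the paper's exact definition: attributes count as correlated when their joint kernel density estimate deviates from the product of the marginal estimates, averaged over the training points. The exponential decay of the kernel with d_EH and the width parameter are both assumptions.

```java
public class KernelCorrelation {
    /** Exponential HOBbit distance (assumed form: 2^m for the top differing bit m). */
    static long dEH(int x, int y) {
        if (x == y) return 0;
        return 1L << (31 - Integer.numberOfLeadingZeros(x ^ y));
    }

    static double kernel(int x, int y, double width) {
        return Math.exp(-dEH(x, y) / width);    // assumed exponential decay
    }

    /** Averaged ratio of joint to product-of-marginal estimates, minus 1. */
    static double correlation(int[] a, int[] b, double width) {
        int n = a.length;
        double ratioSum = 0;
        for (int t = 0; t < n; t++) {
            double joint = 0, margA = 0, margB = 0;
            for (int s = 0; s < n; s++) {
                double ka = kernel(a[t], a[s], width);
                double kb = kernel(b[t], b[s], width);
                joint += ka * kb;
                margA += ka;
                margB += kb;
            }
            ratioSum += (n * joint) / (margA * margB);  // ~1 when a, b are independent
        }
        return ratioSum / n - 1;   // ~0 means uncorrelated under this measure
    }
}
```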

15 Results
 P-tree Naïve Bayes
    Difference only for continuous data
 Semi-Naïve Bayes: 3 parameter combinations (t: correlation threshold)
    Blue: t = 1, 3 iterations
    Red: t = 0.3, including anti-correlations
    White: t = 0.05

16 Paper 3: Hierarchical Clustering [10]
 Goal: Understand the relationship between standard algorithms and combine the “best” aspects of three major ones
 Partition-based
    Relationship to k-medoids [8] demonstrated
    Same cluster boundary definition
 Density-based (kernel-based, DENCLUE [9])
    Similar cluster center definition (sketched below)
 Hierarchical
    Follows naturally from the above definitions
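
A minimal Java sketch of the density-based cluster-center idea: a mean-shift step that moves a point uphill on a Gaussian kernel density estimate, so points converging to the same local maximum (attractor) share a cluster. Mean shift is used here as one standard way to realize DENCLUE-style hill climbing; the bandwidth sigma is an assumption.

```java
public class DensityClimb {
    /** One mean-shift step: the kernel-weighted average of the data pulls x
     *  toward the nearest density attractor. Iterate until x stops moving. */
    static double[] meanShiftStep(double[] x, double[][] data, double sigma) {
        double[] num = new double[x.length];
        double wSum = 0;
        for (double[] xt : data) {
            double d2 = 0;
            for (int k = 0; k < x.length; k++) d2 += (x[k] - xt[k]) * (x[k] - xt[k]);
            double w = Math.exp(-d2 / (2 * sigma * sigma));
            wSum += w;
            for (int k = 0; k < x.length; k++) num[k] += w * xt[k];
        }
        for (int k = 0; k < x.length; k++) num[k] /= wSum;
        return num;
    }
}
```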

17 Results: Speed Comparison with K-Means

18 Results: Clustering Effectiveness (figure: K-means vs. our algorithm)

19 Summary
 P-tree representation for non-spatial data
 Fast implementation
 Paper 1: Rule-Based Algorithm
    Test-sample-centered intervals, multiple paths
    Competitive on “standard” (UCI) data
 Paper 2: Kernel-Based Semi-Naïve Bayes
    New algorithm to handle large numbers of attributes
    Attribute joining shown to be beneficial
 Paper 3: Hierarchical Clustering [10]
    Competitive in speed and effectiveness
    Hierarchical structure

20 Outlook
 Software engineering aspects
    Column-oriented design
    Relationship with the P-tree API
 “Non-standard” data
    Data with graph structure
    Hierarchical data, concept slices [11]
 Visualization
    Visualization of data on a graph

21 Software Engineering
 Business problems are row-based
    Match between database tables and objects
 Scientific / engineering problems are column-based (see the sketch below)
    Collective properties of interest
    Standard OO unsuitable; instead:
       Fortran
       Array-based languages (ZPL)
 Solution? Design pattern? Library?
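
A minimal Java sketch of the column-oriented layout the slide argues for: instead of a list of row objects, each attribute is stored as its own array, so a collective operation scans one contiguous column. The attribute names are illustrative.

```java
public class ColumnTable {
    // One array per attribute ("struct of arrays") rather than one object per row.
    double[] yield;
    int[] moisture;

    /** A collective property: touches only the yield column, never the other attributes. */
    double meanYield() {
        double sum = 0;
        for (double y : yield) sum += y;
        return sum / yield.length;
    }
}
```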

22 Ptree API

23 “Non-standard” Data
 Types of data
    Biological (KDD-Cup ’02: our team received an honorable mention!)
    Sensor networks
 Types of problems
    Small probability of minority class label: A-ROC evaluation
    Multi-valued attributes: bit-vector representation ideal for P-trees
    Graphs: rich supply of new problems / techniques (work with Chris Besemann)
    Hierarchical categorical attributes [11]

24 Visualization
 Idea: Use a graph visualization tool, e.g. http://www.touchgraph.com/
 Visualize node data through glyphs
 Visualize edge data

