Fast Kernel-Density-Based Classification and Clustering Using P-Trees Anne Denton Major Advisor: William Perrizo
Outline Introduction P-Trees Concepts Implementation Kernel Methods Paper 1: Rule-Based Classification Paper 2: Kernel-Based Semi-Naïve Bayes Classifier Paper 3: Hierarchical Clustering Outlook
Introduction Data Mining Information from data Considers storage issues P-Tree Approach Bit-column-based storage Compression Hardware optimization Simple index construction Flexibility in storage organization
P-Tree Concepts Ordering (details) New: Generalized Peano order sorting Compression
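As a rough illustration of why a Peano-style ordering helps compression, the sketch below sorts rows by a key built from interleaved attribute bits, so that rows agreeing on their high-order bits become neighbors. It shows plain Z-order (bit interleaving) only, not the generalized Peano order named on the slide. The names `z_order_key` and `peano_sort` and the fixed bit width are assumptions made for this example and are not taken from the P-tree implementation.

```python
def z_order_key(values, bits=8):
    """Interleave the bits of several non-negative integers, highest bit first."""
    key = 0
    for bit in range(bits - 1, -1, -1):      # most significant bit first
        for v in values:                     # take one bit from each attribute in turn
            key = (key << 1) | ((v >> bit) & 1)
    return key


def peano_sort(rows, bits=8):
    """Sort rows (tuples of ints) by their interleaved-bit key."""
    return sorted(rows, key=lambda row: z_order_key(row, bits))


rows = [(3, 3), (0, 1), (2, 0), (1, 1), (0, 0)]
print(peano_sort(rows, bits=2))
```

After sorting, the high-order bit columns consist of long runs of identical bits, which is the property a compressed bit-column tree can exploit.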
Impact of Peano Order Sorting
P-Tree Implementation Implementation in Java Was ported to C / C++ (Amal Perera, Masum Serazi) Fastest compressing P-tree implementation so far Array indices as pointers (details) Grandchild purity
Kernel-Density-Based Classification Probability of an attribute vector x Conditional on class label value c_i [·] is 1 if its argument is true, 0 otherwise Depending on N training points x_t Kernel function K(x, x_t) can be, e.g., Gaussian function or step function
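The idea on this slide can be sketched in a few lines: the density of x for class c_i is estimated by summing K(x, x_t) over the training points whose label equals c_i. The 1/N normalization, the kernel widths, and all function names below are assumptions for illustration. The code scans the training points directly and does not use P-tree counts.

```python
import math


def gaussian_kernel(x, x_t, sigma=1.0):
    """Unnormalized Gaussian kernel on the Euclidean distance between x and x_t."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_t))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def step_kernel(x, x_t, width=1.0):
    """Step kernel: 1 if every attribute lies within `width` of the test value."""
    return 1.0 if all(abs(a - b) <= width for a, b in zip(x, x_t)) else 0.0


def class_density(x, c, training, kernel):
    """(1/N) * sum over training points of K(x, x_t) * [c_t == c]."""
    n = len(training)
    return sum(kernel(x, x_t) for x_t, c_t in training if c_t == c) / n


def classify(x, training, kernel=gaussian_kernel):
    """Return the class label with the largest kernel density estimate at x."""
    labels = sorted({c_t for _, c_t in training})
    return max(labels, key=lambda c: class_density(x, c, training, kernel))


training = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"),
            ((3.0, 3.0), "b"), ((3.1, 2.9), "b")]
print(classify((0.1, 0.1), training))
print(classify((2.8, 3.2), training, kernel=step_kernel))
```

With a step kernel the sum reduces to counting training points inside an interval around x, which is the kind of count query that bit-column indexes are built to answer quickly.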
Higher Order Basic Bit Distance (HOBbit) P-trees make count evaluation efficient for the following intervals
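A minimal sketch of the HOBbit distance for non-negative integers, assuming the usual definition: the number of low-order bits that must be dropped before two values agree. Under that assumption, every value within distance d of a reference value lies in one aligned block of 2^d integers, which is determined entirely by the high-order bits. The function names are hypothetical.

```python
def hobbit_distance(a, b):
    """Number of low-order bits to drop before a and b become equal."""
    return (a ^ b).bit_length()


def hobbit_interval(a, d):
    """Closed interval of all integers within HOBbit distance d of a."""
    low = (a >> d) << d              # clear the d lowest bits
    return low, low + (1 << d) - 1


print(hobbit_distance(4, 5))   # values differ only in the lowest bit
print(hobbit_distance(3, 4))   # numerically close, but all three bits differ
print(hobbit_interval(5, 2))
```

Because each such interval is fixed by a prefix of high-order bits, counting the points inside it amounts to combining the corresponding bit columns rather than scanning the data.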
Paper 1: Rule-Based Classification Goal: High accuracy on large data sets including standard ones (UCI ML Repository) Neighborhood evaluated through Equality of categorical attributes HOBbit interval for continuous attributes Curse of dimensionality Volume empty with high likelihood Information gain to select attributes Attributes considered binary, based on test sample (“Lazy decision trees”, Friedman ’96 [4]) Continuous data: Interval around test sample Exact information gain (details) Pursuing multiple paths
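The attribute-selection step can be illustrated with the textbook information gain of a binary test "attribute equals the test sample's value". This is a sketch of the standard definition only. The slide's "exact information gain" may differ in detail, and the function names and toy data are assumptions.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def info_gain(rows, labels, attribute, test_value):
    """Gain from the binary split `row[attribute] == test_value`."""
    inside = [c for row, c in zip(rows, labels) if row[attribute] == test_value]
    outside = [c for row, c in zip(rows, labels) if row[attribute] != test_value]
    n = len(labels)
    remainder = (len(inside) / n) * entropy(inside) \
        + (len(outside) / n) * entropy(outside)
    return entropy(labels) - remainder


rows = [("red", "round"), ("red", "long"), ("green", "round"), ("green", "long")]
labels = ["y", "y", "n", "n"]
test_sample = ("red", "round")
gains = [info_gain(rows, labels, k, test_sample[k]) for k in range(2)]
print(gains)
```

In this toy data the first attribute separates the classes perfectly and the second carries no information, so a lazy, test-sample-centered rule would be built on the first attribute.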
Results: Accuracy Comparable to C4.5 after much less development time 5 data sets from UCI Machine Learning Repository (details) 2 additional data sets Crop Gene-function Improvement through multiple paths (20) [Table: C4.5 vs. 20 paths; rows: adult, kr-vs-kp, mushroom (0, 0)]
Results: Speed Used on largest UCI data sets Scaling of execution time as a function of training set size
Paper 2: Kernel-Based Semi-Naïve Bayes Classifier Goal: Handling many attributes Naïve Bayes x^(k) is the value of the k-th attribute Semi-naïve Bayes Correlated attributes are joined Has been done for categorical data Kononenko ’91 [5], Pazzani ’96 [6] Previously: Continuous data discretized
Kernel-Based Naïve Bayes Alternatives for continuous data Discretization Distribution function Gaussian with mean and standard deviation from data No alternative for semi-naïve approach Kernel density estimate (Hastie [7])
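To make the kernel density alternative concrete, the sketch below estimates each attribute's class-conditional density with a one-dimensional Gaussian kernel and multiplies the per-attribute estimates with the class prior, as in ordinary naïve Bayes. The kernel width `sigma`, the function names, and the toy data are assumptions. The attribute joining of the semi-naïve variant is not shown.

```python
import math


def kernel_1d(u, v, sigma):
    """Unnormalized one-dimensional Gaussian kernel."""
    return math.exp(-((u - v) ** 2) / (2.0 * sigma ** 2))


def attribute_density(value, k, c, training, sigma):
    """Kernel density estimate of attribute k at `value`, within class c."""
    members = [x_t for x_t, c_t in training if c_t == c]
    return sum(kernel_1d(value, x_t[k], sigma) for x_t in members) / len(members)


def naive_bayes_score(x, c, training, sigma=0.5):
    """Class prior times the product of per-attribute density estimates."""
    score = sum(1 for _, c_t in training if c_t == c) / len(training)
    for k, value in enumerate(x):
        score *= attribute_density(value, k, c, training, sigma)
    return score


def predict(x, training, sigma=0.5):
    """Return the label with the highest naive Bayes score."""
    labels = sorted({c_t for _, c_t in training})
    return max(labels, key=lambda c: naive_bayes_score(x, c, training, sigma))


training = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"),
            ((3.0, 3.0), "b"), ((3.1, 2.9), "b")]
print(predict((0.1, 0.0), training))
print(predict((3.0, 3.1), training))
```

Unlike a single fitted Gaussian per attribute, the kernel estimate makes no assumption about the shape of the distribution, and it extends to products of kernels over several attributes.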
Correlations Correlation between attributes a and b N: Number of training points t Kernel function for continuous data d_EH: Exponential HOBbit distance
Results P-tree Naïve Bayes Difference only for continuous data Semi-Naïve Bayes 3 parameter combinations Blue: t = 1, 3 iterations Red: t = 0.3, incl. anti-correlations White: t = 0.05 (t: threshold)
Paper 3: Hierarchical Clustering [10] Goal: Understand relationship between standard algorithms Combine the “best” aspects of three major ones Partition-based Relationship to k-medoids [8] demonstrated Same cluster boundary definition Density-based (kernel-based, DENCLUE [9]) Similar cluster center definition Hierarchical Follows naturally from above definitions
Results: Speed Comparison with K-Means
Results: Clustering Effectiveness K-means Our Algorithm
Summary P-tree representation for non-spatial data Fast implementation Paper 1: Rule-Based Algorithm Test-sample-centered intervals, multiple paths Competitive on “standard” (UCI) data Paper 2: Kernel-Based Semi-Naïve Bayes New algorithm to handle large attribute numbers Attribute joining shown to be beneficial Paper 3: Hierarchical Clustering [10] Competitive for speed and effectiveness Hierarchical structure
Outlook Software engineering aspects Column-oriented design Relationship with P-tree API “Non-standard” data Data with graph structure Hierarchical data, concept slices [11] Visualization Visualization of data on a graph
Software Engineering Business problems row-based Match between database tables and objects Scientific / engineering problems column-based Collective properties of interest Standard OO unsuitable, instead Fortran Array-based languages (ZPL) Solution? Design pattern? Library?
Ptree API
“Non-standard” Data Types of data Biological (KDD-cup ’02: Our team got honorable mention!) Sensor Networks Types of problems Small probability of minority class label A-ROC Evaluation Multi-valued attributes Bit-vector representation ideal for P-trees Graphs Rich supply of new problems / techniques (work with Chris Besemann) Hierarchical categorical attributes [11]
Visualization Idea: Use graph visualization tool E.g. Visualize node data through glyphs Visualize edge data