Fast Kernel-Density-Based Classification and Clustering Using P-Trees

Anne Denton
Major Advisor: William Perrizo

Outline
  Introduction
  P-Trees
    Concepts
    Implementation
  Kernel Methods
    Paper 1: Rule-Based Classification
    Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
    Paper 3: Hierarchical Clustering
  Outlook

Introduction
  Data Mining
    Information from data
    Considers storage issues
  P-Tree Approach
    Bit-column-based storage
    Compression
    Hardware optimization
    Simple index construction
    Flexibility in storage organization
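As a rough illustration of bit-column-based storage (a minimal sketch, not the P-tree implementation itself): each integer attribute can be split into one bit vector per bit position, and counts can then be obtained with bitwise AND. The names `to_bit_columns` and `count_equal` are hypothetical, and plain Python integers stand in for uncompressed bit vectors.

```python
def to_bit_columns(values, n_bits):
    """Split integer values into n_bits bit columns (most significant first).

    Column j is a Python int used as a bit vector: bit i of the column is
    bit (n_bits - 1 - j) of values[i].
    """
    columns = []
    for j in range(n_bits):
        shift = n_bits - 1 - j
        col = 0
        for i, v in enumerate(values):
            if (v >> shift) & 1:
                col |= 1 << i
        columns.append(col)
    return columns


def count_equal(columns, n_rows, target):
    """Count rows whose value equals target using only bitwise operations."""
    n_bits = len(columns)
    all_rows = (1 << n_rows) - 1
    mask = all_rows
    for j, col in enumerate(columns):
        bit = (target >> (n_bits - 1 - j)) & 1
        # Keep rows whose bit j agrees with the corresponding bit of target.
        mask &= col if bit else (~col & all_rows)
    return bin(mask).count("1")


values = [5, 3, 5, 7, 0, 5]
cols = to_bit_columns(values, n_bits=3)
print(count_equal(cols, len(values), 5))
```

The count never touches the original rows; it only combines columns, which is the property that makes compression of the individual bit columns worthwhile.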

P-Tree Concepts
  Ordering (details)
    New: Generalized Peano order sorting
  Compression
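A minimal sketch of what sorting in a generalized Peano order could look like, assuming that it means ordering rows by a key built from the interleaved bits of all attributes, most significant bits first (often called Z-order or Morton order). The name `peano_key` is hypothetical. Rows that agree in their high-order bits end up adjacent, which tends to create long runs in the bit columns.

```python
def peano_key(row, n_bits):
    """Interleave the bits of all attributes in a row, most significant first."""
    key = 0
    for shift in range(n_bits - 1, -1, -1):
        for value in row:
            key = (key << 1) | ((value >> shift) & 1)
    return key


rows = [(3, 0), (0, 1), (2, 3), (0, 0), (3, 3)]
sorted_rows = sorted(rows, key=lambda r: peano_key(r, n_bits=2))
print(sorted_rows)
```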

Impact of Peano Order Sorting

P-Tree Implementation
  Implementation in Java
    Was ported to C / C++ (Amal Perera, Masum Serazi)
  Fastest compressing P-tree implementation so far
  Array indices as pointers (details)
  Grandchild purity
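The idea of using array indices as pointers can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation described in the talk: it uses a binary tree rather than the P-tree fan-out, it does not model grandchild purity, and the class name `BitTree` and its fields are hypothetical. Nodes live in flat arrays, children are referenced by array index, and pure subtrees are not expanded.

```python
PURE0, PURE1, MIXED = 0, 1, 2


class BitTree:
    """Compressed binary tree over a bit list; nodes live in flat arrays."""

    def __init__(self, bits):
        # Assumes len(bits) is a power of two.
        self.kind = []    # PURE0, PURE1, or MIXED per node
        self.left = []    # array index of left child, -1 if none
        self.right = []   # array index of right child, -1 if none
        self.size = len(bits)
        self.root = self._build(bits, 0, len(bits))

    def _build(self, bits, lo, hi):
        idx = len(self.kind)
        self.kind.append(MIXED)
        self.left.append(-1)
        self.right.append(-1)
        ones = sum(bits[lo:hi])
        if ones == 0:
            self.kind[idx] = PURE0
        elif ones == hi - lo:
            self.kind[idx] = PURE1
        else:
            mid = (lo + hi) // 2
            self.left[idx] = self._build(bits, lo, mid)
            self.right[idx] = self._build(bits, mid, hi)
        return idx

    def count(self, idx=None, span=None):
        """Number of 1 bits, computed without descending into pure subtrees."""
        if idx is None:
            idx, span = self.root, self.size
        if self.kind[idx] == PURE0:
            return 0
        if self.kind[idx] == PURE1:
            return span
        half = span // 2
        return self.count(self.left[idx], half) + self.count(self.right[idx], half)


tree = BitTree([1, 1, 1, 1, 0, 0, 1, 0])
print(tree.count(), len(tree.kind))
```

Storing child positions as integers in parallel arrays avoids per-node objects, which is one plausible reason such a layout compresses and traverses quickly.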

Kernel-Density-Based Classification
  Probability of an attribute vector x
    Conditional on class label value c_i
    [·] is 1 if the enclosed condition is true, 0 otherwise
    Depending on N training points x_t
  Kernel function K(x, x_t) can be, e.g., a Gaussian function or a step function
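The formula on this slide did not survive extraction. A common form of such an estimate scores class c_i by the sum of K(x, x_t) over the training points whose label equals c_i, divided by N; the sketch below assumes that form, and the normalization used in the talk may differ. All function names and the toy data are hypothetical.

```python
import math


def gaussian_kernel(x, xt, sigma=1.0):
    """Unnormalized Gaussian kernel on the Euclidean distance."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, xt))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def step_kernel(x, xt, width=1.0):
    """Step kernel: 1 if every attribute is within width, else 0."""
    return 1.0 if all(abs(a - b) <= width for a, b in zip(x, xt)) else 0.0


def class_scores(x, train_x, train_c, kernel):
    """(1/N) * sum of K(x, x_t), accumulated separately per class label."""
    scores = {}
    for xt, ct in zip(train_x, train_c):
        scores[ct] = scores.get(ct, 0.0) + kernel(x, xt)
    n = len(train_x)
    return {c: s / n for c, s in scores.items()}


def classify(x, train_x, train_c, kernel=gaussian_kernel):
    """Return the class label with the largest kernel score."""
    scores = class_scores(x, train_x, train_c, kernel)
    return max(scores, key=scores.get)


train_x = [(0.0, 0.0), (0.5, 0.2), (3.0, 3.1), (2.8, 3.3)]
train_c = ["a", "a", "b", "b"]
print(classify((0.2, 0.1), train_x, train_c))
print(classify((3.0, 3.0), train_x, train_c, kernel=step_kernel))
```

With the step kernel the score reduces to a count of training points inside a box around x, which is the kind of count a bit-column representation can evaluate without scanning rows.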

Higher Order Basic Bit Distance (HOBbit)
  P-trees make count evaluation efficient for the following intervals
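A small sketch of the HOBbit distance and the intervals it induces, assuming the definition commonly given for it: the number of low-order bits that must be dropped from two non-negative integers before they become equal. The function names are hypothetical.

```python
def hobbit_distance(a, b):
    """Smallest s such that a >> s == b >> s, for non-negative integers."""
    # a >> s == b >> s exactly when (a ^ b) has no set bit at position s or above.
    return (a ^ b).bit_length()


def hobbit_interval(x, s):
    """Closed interval of integers that agree with x above the lowest s bits."""
    lo = (x >> s) << s
    return lo, lo + (1 << s) - 1


print(hobbit_distance(4, 5), hobbit_distance(8, 7))
print(hobbit_interval(13, 2))
```

Each interval is defined by a prefix of high-order bits, so counting the points inside it only requires ANDing the corresponding bit columns.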

Paper 1: Rule-Based Classification
  Goal: High accuracy on large data sets, including standard ones (UCI ML Repository)
  Neighborhood evaluated through
    Equality of categorical attributes
    HOBbit interval for continuous attributes
  Curse of dimensionality
    Volume empty with high likelihood
  Information gain to select attributes
    Attributes considered binary, based on test sample (“Lazy decision trees”, Friedman ’96 [4])
    Continuous data: Interval around test sample
    Exact information gain (details)
  Pursuing multiple paths
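The attribute-selection step can be sketched as follows. Each attribute is turned into a binary one relative to the test sample (equal category, or same HOBbit interval), and the standard entropy-based information gain is computed for that split. This is an illustrative assumption: the "exact information gain" mentioned on the slide may be defined differently, and all names here are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def matches_test_sample(values, test_value, s=None):
    """Categorical: equality. Integer-coded continuous: same interval of width 2**s."""
    if s is None:
        return [v == test_value for v in values]
    return [(v >> s) == (test_value >> s) for v in values]


def info_gain(matches, labels):
    """Information gain of splitting the training set into match / no match."""
    inside = [c for m, c in zip(matches, labels) if m]
    outside = [c for m, c in zip(matches, labels) if not m]
    n = len(labels)
    remainder = (len(inside) / n) * entropy(inside) + (len(outside) / n) * entropy(outside)
    return entropy(labels) - remainder


values = [12, 13, 15, 2, 3, 40]
labels = [1, 1, 1, 0, 0, 0]
gain = info_gain(matches_test_sample(values, test_value=14, s=2), labels)
print(gain)
```

Pursuing multiple paths would then amount to keeping several of the highest-gain attributes at each step instead of only the best one.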

Results: Accuracy
  Comparable to C4.5 after much less development time
  5 data sets from UCI Machine Learning Repository (details)
  2 additional data sets
    Crop
    Gene-function
  Improvement through multiple paths (20)
  [Table, partially recovered; only the column header “20 paths” survives]
    adult      15.54   14.93
    kr-vs-kp   0.8     0.84
    mushroom

Results: Speed
  Used on largest UCI data sets
  Scaling of execution time as a function of training set size

Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
  Goal: Handling many attributes
  Naïve Bayes
    x(k) is value of kth attribute
  Semi-naïve Bayes
    Correlated attributes are joined
    Has been done for categorical data: Kononenko ’91 [5], Pazzani ’96 [6]
    Previously: Continuous data discretized

Kernel-Based Naïve Bayes
  Alternatives for continuous data
    Discretization
    Distribution function
      Gaussian with mean and standard deviation from data
      No alternative for semi-naïve approach
    Kernel density estimate (Hastie [7])
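A minimal sketch of the kernel density alternative, assuming a Gaussian kernel with a fixed bandwidth `sigma`: each class-conditional attribute density is estimated with a one-dimensional kernel density estimate, and the naïve Bayes score is the class prior times the product of these estimates. The function names and the toy data are hypothetical.

```python
import math


def attribute_density(xk, train_col, train_c, label, sigma):
    """One-dimensional Gaussian kernel density of value xk within one class."""
    vals = [v for v, c in zip(train_col, train_c) if c == label]
    norm = 1.0 / (len(vals) * sigma * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-((xk - v) ** 2) / (2.0 * sigma ** 2)) for v in vals)


def naive_bayes_kde(x, train_x, train_c, sigma=1.0):
    """Class prior times the product of per-attribute kernel density estimates."""
    n = len(train_c)
    scores = {}
    for label in set(train_c):
        score = train_c.count(label) / n
        for k, xk in enumerate(x):
            col = [row[k] for row in train_x]
            score *= attribute_density(xk, col, train_c, label, sigma)
        scores[label] = score
    return max(scores, key=scores.get), scores


train_x = [(0.0, 0.0), (0.5, 0.2), (3.0, 3.1), (2.8, 3.3)]
train_c = ["a", "a", "b", "b"]
label, scores = naive_bayes_kde((0.2, 0.1), train_x, train_c)
print(label)
```

Joining two correlated attributes in a semi-naïve variant would replace their two one-dimensional factors by a single two-dimensional kernel estimate, which is something a parametric Gaussian fit per attribute does not offer directly.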

Correlations
  Correlation between attributes a and b
    N: Number of training points t
  Kernel function for continuous data
    d_EH: Exponential HOBbit distance
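The correlation formula and the definition of d_EH were lost in extraction, so the following is only a sketch under two loudly stated assumptions. First, d_EH is taken to be 0 for equal values and 2^(HOBbit distance - 1) otherwise. Second, the correlation is taken to be the ratio of the joint kernel sum to the product of the per-attribute kernel sums, minus 1, so that independent attributes give a value near 0. Neither is confirmed by the slide, and all names are hypothetical.

```python
import math


def hobbit_distance(a, b):
    """Smallest s such that a >> s == b >> s, for non-negative integers."""
    return (a ^ b).bit_length()


def exp_hobbit_distance(a, b):
    """Assumed d_EH: 0 for equal values, else 2 ** (HOBbit distance - 1)."""
    return 0 if a == b else 2 ** (hobbit_distance(a, b) - 1)


def kernel(a, b, sigma=1.0):
    """Gaussian-shaped kernel evaluated on the assumed d_EH."""
    return math.exp(-exp_hobbit_distance(a, b) ** 2 / (2.0 * sigma ** 2))


def kernel_correlation(col_a, col_b, sigma=1.0):
    """Assumed measure: N^2 * joint kernel sum / product of marginal sums - 1."""
    n = len(col_a)
    joint = marg_a = marg_b = 0.0
    for s in range(n):
        for t in range(n):
            ka = kernel(col_a[s], col_a[t], sigma)
            kb = kernel(col_b[s], col_b[t], sigma)
            joint += ka * kb
            marg_a += ka
            marg_b += kb
    return n * n * joint / (marg_a * marg_b) - 1.0


dependent = kernel_correlation([0, 0, 8, 8], [0, 0, 8, 8])
independent = kernel_correlation([0, 0, 8, 8], [0, 8, 0, 8])
print(dependent, independent)
```

Under these assumptions, attribute pairs whose value exceeds a threshold t would be the candidates for joining.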

Results
  P-tree Naïve Bayes
    Difference only for continuous data
  Semi-Naïve Bayes
    3 parameter combinations
      Blue: t = 1, 3 iterations
      Red: t = 0.3, incl. anti-corr.
      White: t = 0.05
      (t: threshold)

Paper 3: Hierarchical Clustering [10]
  Goal: Understand relationship between standard algorithms
  Combine the “best” aspects of three major ones
    Partition-based
      Relationship to k-medoids [8] demonstrated
      Same cluster boundary definition
    Density-based (kernel-based, DENCLUE [9])
      Similar cluster center definition
    Hierarchical
      Follows naturally from above definitions
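How the partition-based and density-based views might combine can be sketched roughly as follows. This is an illustration under assumptions, not the algorithm of [10]: cluster centers are taken to be data points whose Gaussian kernel density is a local maximum within a fixed radius (a density-based, medoid-like center), and every point is assigned to its nearest center (the k-medoids boundary rule). The names, `sigma`, and `radius` are hypothetical.

```python
import math


def density(p, points, sigma):
    """Gaussian kernel density (unnormalized) of point p over all points."""
    return sum(math.exp(-math.dist(p, q) ** 2 / (2.0 * sigma ** 2)) for q in points)


def find_centers(points, sigma, radius):
    """Data points whose density is maximal among all points within radius."""
    dens = [density(p, points, sigma) for p in points]
    centers = []
    for i, p in enumerate(points):
        neighbors = [j for j, q in enumerate(points) if math.dist(p, q) <= radius]
        # Ties are broken in favor of the lowest index.
        best = max(neighbors, key=lambda j: (dens[j], -j))
        if best == i:
            centers.append(p)
    return centers


def assign(points, centers):
    """Index of the nearest center for every point (k-medoids-style boundary)."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


points = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 5.2), (4.9, 5.0)]
centers = find_centers(points, sigma=0.5, radius=1.0)
labels = assign(points, centers)
print(centers, labels)
```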

Results: Speed Comparison with K-Means

Results: Clustering Effectiveness
  [Figure panels: K-means, Our Algorithm]

Summary
  P-tree representation for non-spatial data
  Fast implementation
  Paper 1: Rule-Based Algorithm
    Test-sample-centered intervals, multiple paths
    Competitive on “standard” (UCI) data
  Paper 2: Kernel-Based Semi-Naïve Bayes
    New algorithm to handle large attribute numbers
    Attribute joining shown to be beneficial
  Paper 3: Hierarchical Clustering [10]
    Competitive for speed and effectiveness
    Hierarchical structure

Outlook
  Software engineering aspects
    Column-oriented design
    Relationship with P-tree API
  “Non-standard” data
    Data with graph structure
    Hierarchical data, concept slices [11]
  Visualization
    Visualization of data on a graph

Software Engineering
  Business problems row-based
    Match between database tables and objects
  Scientific / engineering problems column-based
    Collective properties of interest
    Standard OO unsuitable, instead
      Fortran
      Array-based languages (ZPL)
  Solution?
    Design pattern?
    Library?

Ptree API

“Non-standard” Data
  Types of data
    Biological (KDD-cup ’02: Our team got honorable mention!)
    Sensor Networks
  Types of problems
    Small probability of minority class label
      A-ROC Evaluation
    Multi-valued attributes
      Bit-vector representation ideal for P-trees
    Graphs
      Rich supply of new problems / techniques (work with Chris Besemann)
    Hierarchical categorical attributes [11]
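Assuming "A-ROC" refers to the area under the ROC curve, the evaluation can be sketched as the fraction of (minority, majority) pairs that the classifier ranks correctly, with ties counted as one half. This measure is insensitive to how rare the minority class is, unlike plain accuracy. The function name is hypothetical.

```python
def area_under_roc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the minority (positive) class, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


print(area_under_roc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```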

Visualization
  Idea: Use graph visualization tool
    Visualize node data through glyphs
    Visualize edge data
