Published by Darren Dorsey. Modified over 6 years ago.
1
Fast Kernel-Density-Based Classification and Clustering Using P-Trees
Anne Denton
Major Advisor: William Perrizo
2
Outline
  Introduction
  P-Trees
    Concepts
    Implementation
  Kernel Methods
    Paper 1: Rule-Based Classification
    Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
    Paper 3: Hierarchical Clustering
  Outlook
3
Introduction
  Data Mining
    Information from data
    Considers storage issues
  P-Tree Approach
    Bit-column-based storage
    Compression
    Hardware optimization
    Simple index construction
    Flexibility in storage organization
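As a rough illustration of bit-column-based storage, the following Python sketch splits an integer attribute into one bit vector per bit position and counts matching rows with bitwise AND. It is an uncompressed stand-in, not a P-tree: the helper names `to_bit_columns` and `count_equal`, and the use of Python integers as bit vectors, are assumptions made for this example.

```python
def to_bit_columns(values, n_bits):
    """Return one int per bit position; bit t of column j is bit j of values[t]."""
    columns = [0] * n_bits
    for t, v in enumerate(values):
        for j in range(n_bits):
            if (v >> j) & 1:
                columns[j] |= 1 << t
    return columns


def count_equal(columns, n_rows, target):
    """Count rows equal to target by AND-ing bit columns or their complements."""
    all_rows = (1 << n_rows) - 1
    mask = all_rows
    for j, col in enumerate(columns):
        # Use the column itself where the target bit is 1, its complement otherwise.
        mask &= col if (target >> j) & 1 else (~col & all_rows)
    return bin(mask).count("1")


values = [5, 3, 5, 7, 0, 5]
columns = to_bit_columns(values, n_bits=3)
print(count_equal(columns, len(values), 5))
```

The count is obtained without touching individual rows, which is the property that compression and hardware-level bit operations can then exploit.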
4
P-Tree Concepts
  Ordering (details)
    New: Generalized Peano order sorting
  Compression
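One way to picture Peano (Z-order) sorting is to interleave the bits of all attributes, most significant first, and sort rows by the resulting key. Rows that share high-order bits then sit next to each other, so the bit columns contain longer uniform runs. The sketch below assumes equal bit widths for all attributes and a hypothetical helper `peano_key`; the generalized ordering named on the slide may differ in detail.

```python
def peano_key(row, n_bits):
    """Interleave attribute bits, most significant bit position first."""
    key = 0
    for j in reversed(range(n_bits)):
        for v in row:
            key = (key << 1) | ((v >> j) & 1)
    return key


rows = [(3, 0), (0, 1), (2, 3), (1, 1), (0, 0)]
sorted_rows = sorted(rows, key=lambda r: peano_key(r, n_bits=2))
print(sorted_rows)
```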
5
Impact of Peano Order Sorting
6
P-Tree Implementation
  Implementation in Java
    Was ported to C / C++ (Amal Perera, Masum Serazi)
  Fastest compressing P-tree implementation so far
    Array indices as pointers (details)
    Grandchild purity
7
Kernel-Density-Based Classification
  Probability of an attribute vector x
    Conditional on class label value c_i
    [condition] is 1 if the condition is true, 0 otherwise
    Depending on N training points x_t
  Kernel function K(x, x_t) can be, e.g., Gaussian function or step function
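The slide's formula did not survive extraction. A minimal sketch of a kernel density estimate restricted to one class, assuming the common form (1/N) Σ_t K(x, x_t) [c_t = c_i], might look as follows. The function names, the unnormalized Gaussian kernel, and the division by N (which folds the class prior into the score) are assumptions, not the presentation's definition.

```python
import math


def gaussian_kernel(x, xt, sigma=1.0):
    """Unnormalized Gaussian kernel on attribute vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, xt))
    return math.exp(-d2 / (2 * sigma ** 2))


def class_density(x, train_x, train_c, label, kernel=gaussian_kernel):
    """(1/N) * sum over training points t of K(x, x_t) * [c_t == label]."""
    n = len(train_x)
    total = sum(kernel(x, xt) for xt, ct in zip(train_x, train_c) if ct == label)
    return total / n


def classify(x, train_x, train_c):
    """Pick the class label with the largest kernel-based score."""
    labels = sorted(set(train_c))
    return max(labels, key=lambda c: class_density(x, train_x, train_c, c))


train_x = [(0.0,), (0.2,), (5.0,), (5.1,)]
train_c = ["a", "a", "b", "b"]
print(classify((0.1,), train_x, train_c))
```

Replacing `gaussian_kernel` with a step function turns the sum into a plain count of training points in a neighborhood of x.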
8
Higher Order Basic Bit Distance (HOBbit)
P-trees make count evaluation efficient for the following intervals
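The interval figure is not part of this transcript. As a sketch, assuming the usual definition of HOBbit distance for non-negative integers (the number of low-order bits that must be dropped before two values agree), the distance and the aligned interval around a value can be computed as follows. The function names are chosen for this example.

```python
def hobbit_distance(a, b):
    """Smallest s such that a and b agree after dropping their s lowest bits."""
    s = 0
    while (a >> s) != (b >> s):
        s += 1
    return s


def hobbit_interval(x, s):
    """Closed interval of all values within HOBbit distance s of x."""
    low = (x >> s) << s
    return low, low + (1 << s) - 1


print(hobbit_distance(8, 9), hobbit_interval(13, 2))
```

Each such interval corresponds to fixing the high-order bits of the attribute, so its row count is an AND over a few bit columns.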
9
Paper 1: Rule-Based Classification
  Goal: High accuracy on large data sets including standard ones (UCI ML Repository)
  Neighborhood evaluated through
    Equality of categorical attributes
    HOBbit interval for continuous attributes
  Curse of dimensionality
    Volume empty with high likelihood
    Information gain to select attributes
  Attributes considered binary, based on test sample (“Lazy decision trees”, Friedman ’96 [4])
    Continuous data: Interval around test sample
  Exact information gain (details)
  Pursuing multiple paths
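To make the attribute-selection step concrete, the sketch below treats an attribute as binary relative to the test sample (a training row either falls into the test sample's neighborhood or it does not) and computes the standard information gain of that split. The function names and the toy data are assumptions; the neighborhood test uses a HOBbit-style interval as described above.

```python
import math


def entropy(counts):
    """Shannon entropy in bits of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def info_gain(labels, in_neighborhood):
    """Information gain of splitting rows into inside / outside the neighborhood."""
    classes = sorted(set(labels))

    def counts(rows):
        return [sum(1 for label in rows if label == c) for c in classes]

    inside = [l for l, m in zip(labels, in_neighborhood) if m]
    outside = [l for l, m in zip(labels, in_neighborhood) if not m]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(counts(part)) for part in (inside, outside))
    return entropy(counts(labels)) - remainder


# Continuous attribute: rows sharing the test value's high-order bits are "inside".
test_value, s = 13, 2
attribute = [12, 15, 3, 1]
labels = ["y", "y", "n", "n"]
inside = [(v >> s) == (test_value >> s) for v in attribute]
print(info_gain(labels, inside))
```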
10
Results: Accuracy
  Comparable to C4.5 after much less development time
  [Table, partially recovered: column “20 paths”; rows adult (15.54, 14.93), kr-vs-kp (0.8, 0.84), mushroom (values not recovered)]
  5 data sets from UCI Machine Learning Repository (details)
  2 additional data sets
    Crop
    Gene-function
  Improvement through multiple paths (20)
11
Results: Speed
  Used on largest UCI data sets
  Scaling of execution time as a function of training set size
12
Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
  Goal: Handling many attributes
  Naïve Bayes
    x(k) is value of kth attribute
  Semi-naïve Bayes
    Correlated attributes are joined
    Has been done for categorical data (Kononenko ’91 [5], Pazzani ’96 [6])
    Previously: Continuous data discretized
13
Kernel-Based Naïve Bayes
  Alternatives for continuous data
    Discretization
    Distribution function
      Gaussian with mean and standard deviation from data
      No alternative for semi-naïve approach
    Kernel density estimate (Hastie [7])
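The kernel density alternative can be sketched as a naïve Bayes score in which each per-attribute class-conditional probability is replaced by a one-dimensional kernel density estimate. The code below is an illustration under assumptions: the function name, the Gaussian kernel, and the bandwidth `sigma` are not taken from the presentation, and the kernel's constant normalization factor is omitted because it is the same for every class.

```python
import math


def kernel_naive_bayes_score(x, train_x, train_c, label, sigma=1.0):
    """Class prior times the product over attributes k of a 1-D kernel density estimate."""
    rows = [xt for xt, ct in zip(train_x, train_c) if ct == label]
    if not rows:
        return 0.0
    score = len(rows) / len(train_x)  # class prior
    for k, xk in enumerate(x):
        # Average kernel value of attribute k over the training rows of this class.
        density = sum(math.exp(-(xk - xt[k]) ** 2 / (2 * sigma ** 2)) for xt in rows)
        score *= density / len(rows)
    return score


train_x = [(0.0, 1.0), (0.5, 1.2), (4.0, 6.0), (4.5, 5.5)]
train_c = ["a", "a", "b", "b"]
x = (0.2, 1.1)
best = max(["a", "b"], key=lambda c: kernel_naive_bayes_score(x, train_x, train_c, c))
print(best)
```

Because the estimate is built from per-point kernels rather than a fitted distribution, the same construction extends to joined attributes by multiplying the kernels of the joined attributes inside the sum.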
14
Correlations
  Correlation between attributes a and b
  N: Number of training points t
  Kernel function for continuous data
  dEH: Exponential HOBbit distance
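The correlation formula and the definition of dEH are missing from this transcript. One plausible kernel-based measure, given here purely as an assumption, compares the joint kernel density of attributes a and b at a point with the product of their marginal kernel densities; a value of 0 indicates no evidence of correlation at that point. The step kernel and the function names are illustrative stand-ins.

```python
def step_kernel(u, v):
    """1 if the two attribute values are equal, 0 otherwise."""
    return 1.0 if u == v else 0.0


def kernel_correlation(x, train_x, a, b, kernel=step_kernel):
    """Joint kernel density over product of marginals, minus 1 (assumed form)."""
    n = len(train_x)
    ka = [kernel(x[a], xt[a]) for xt in train_x]
    kb = [kernel(x[b], xt[b]) for xt in train_x]
    joint = sum(p * q for p, q in zip(ka, kb)) / n
    marginals = (sum(ka) / n) * (sum(kb) / n)
    return joint / marginals - 1.0 if marginals > 0 else 0.0


correlated = [(0, 0), (0, 0), (1, 1), (1, 1)]
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(kernel_correlation((0, 0), correlated, 0, 1))
print(kernel_correlation((0, 0), independent, 0, 1))
```

In a semi-naïve scheme, attribute pairs whose measure exceeds a threshold would be candidates for joining.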
15
Results
  P-tree Naïve Bayes
    Difference only for continuous data
  Semi-Naïve Bayes
    3 parameter combinations
      Blue: t = 1, 3 iterations
      Red: t = 0.3, incl. anti-correlations
      White: t = 0.05
      (t: threshold)
16
Paper 3: Hierarchical Clustering [10]
  Goal: Understand relationship between standard algorithms
    Combine the “best” aspects of three major ones
  Partition-based
    Relationship to k-medoids [8] demonstrated
    Same cluster boundary definition
  Density-based (kernel-based, DENCLUE [9])
    Similar cluster center definition
  Hierarchical
    Follows naturally from above definitions
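The combination of ideas can be illustrated with a small one-dimensional sketch: cluster centers are data points at local maxima of a kernel density estimate (density-based, with centers restricted to data points as in k-medoids), and every point is assigned to its nearest center (a partition-style boundary). This is not the algorithm from the paper; the function names, the Gaussian kernel, and the `radius` parameter are assumptions.

```python
import math


def density(x, points, sigma=1.0):
    """Unnormalized Gaussian kernel density at x."""
    return sum(math.exp(-(x - p) ** 2 / (2 * sigma ** 2)) for p in points)


def cluster_centers(points, sigma=1.0, radius=1.0):
    """Data points whose density is not exceeded by any data point within radius."""
    dens = {p: density(p, points, sigma) for p in points}
    centers = []
    for p in points:
        neighbors = [q for q in points if abs(q - p) <= radius]
        if all(dens[p] >= dens[q] for q in neighbors):
            centers.append(p)
    return centers


def assign(points, centers):
    """Map each point to its nearest center."""
    return {p: min(centers, key=lambda c: abs(p - c)) for p in points}


points = [0.0, 0.5, 1.0, 8.0, 8.5, 9.0]
centers = cluster_centers(points)
print(centers, assign(points, centers))
```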
17
Results: Speed Comparison with K-Means
18
Results: Clustering Effectiveness
  [Figure panels: “K-means” and “Our Algorithm”]
19
Summary
  P-tree representation for non-spatial data
  Fast implementation
  Paper 1: Rule-Based Algorithm
    Test-sample-centered intervals, multiple paths
    Competitive on “standard” (UCI) data
  Paper 2: Kernel-Based Semi-Naïve Bayes
    New algorithm to handle large attribute numbers
    Attribute joining shown to be beneficial
  Paper 3: Hierarchical Clustering [10]
    Competitive for speed and effectiveness
    Hierarchical structure
20
Outlook
  Software engineering aspects
    Column-oriented design
    Relationship with P-tree API
  “Non-standard” data
    Data with graph structure
    Hierarchical data, concept slices [11]
  Visualization
    Visualization of data on a graph
21
Software Engineering
  Business problems: row-based
    Match between database tables and objects
  Scientific / engineering problems: column-based
    Collective properties of interest
    Standard OO unsuitable; instead
      Fortran
      Array-based languages (ZPL)
  Solution?
    Design pattern?
    Library?
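The row-based versus column-based contrast can be shown in a few lines. In the sketch below, the `Measurement` class and the attribute names are hypothetical; the point is only that a collective property such as a mean reads one attribute across all records, which a column layout provides directly.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """Row-based: one object per record, mirroring a database table row."""
    temperature: float
    pressure: float


rows = [Measurement(20.0, 1.0), Measurement(22.0, 1.1), Measurement(24.0, 0.9)]
# The collective property has to visit every object and pick out one field.
mean_temp_rows = sum(r.temperature for r in rows) / len(rows)

# Column-based: one array per attribute.
columns = {
    "temperature": [20.0, 22.0, 24.0],
    "pressure": [1.0, 1.1, 0.9],
}
# The collective property operates on a single contiguous column.
mean_temp_cols = sum(columns["temperature"]) / len(columns["temperature"])

print(mean_temp_rows, mean_temp_cols)
```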
22
Ptree API
23
“Non-standard” Data
  Types of data
    Biological (KDD-cup ’02: Our team got honorable mention!)
    Sensor Networks
  Types of problems
    Small probability of minority class label
      A-ROC Evaluation
    Multi-valued attributes
      Bit-vector representation ideal for P-trees
    Graphs
      Rich supply of new problems / techniques (work with Chris Besemann)
    Hierarchical categorical attributes [11]
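A bit-vector representation of a multi-valued attribute can be sketched as one bit column per domain value, with bit t set when record t contains that value. The helper name `encode_multivalued` and the example values are hypothetical, and Python integers stand in for (uncompressed) bit columns.

```python
def encode_multivalued(records, domain):
    """One bit column per domain value; bit t is set if record t contains the value."""
    columns = {value: 0 for value in domain}
    for t, values in enumerate(records):
        for value in values:
            columns[value] |= 1 << t
    return columns


records = [{"kinase", "membrane"}, {"membrane"}, {"kinase", "nucleus"}, set()]
cols = encode_multivalued(records, ["kinase", "membrane", "nucleus"])

# Records that carry both values: a single AND of two columns.
both = cols["kinase"] & cols["membrane"]
print(bin(both).count("1"))
```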
24
Visualization
  Idea: Use graph visualization tool
    Visualize node data through glyphs
    Visualize edge data