Published by Geoffrey Horton. Modified over 9 years ago.
1
Fast Kernel-Density-Based Classification and Clustering Using P-Trees. Anne Denton. Major Advisor: William Perrizo
2
Outline: Introduction; P-Trees (Concepts, Implementation); Kernel Methods; Paper 1: Rule-Based Classification; Paper 2: Kernel-Based Semi-Naïve Bayes Classifier; Paper 3: Hierarchical Clustering; Outlook
3
Introduction. Data mining: information from data; considers storage issues. P-tree approach: bit-column-based storage; compression; hardware optimization; simple index construction; flexibility in storage organization.
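Bit-column-based storage can be illustrated with a minimal sketch. It assumes unsigned integer attribute values and packs each bit column into a single Python integer; the function names are hypothetical, and the sketch omits the tree structure and compression that an actual P-tree adds on top of the bit columns.

```python
def to_bit_columns(values, n_bits):
    """Split integers into n_bits vertical bit columns, highest-order bit first.

    Each column is packed into one Python int: bit t of a column holds
    the corresponding bit of values[t].
    """
    columns = []
    for bit in range(n_bits - 1, -1, -1):
        col = 0
        for t, v in enumerate(values):
            if (v >> bit) & 1:
                col |= 1 << t
        columns.append(col)
    return columns


def count_with_prefix(columns, prefix_bits, n_rows):
    """Count rows whose high-order bits equal prefix_bits, using only ANDs."""
    all_rows = (1 << n_rows) - 1
    mask = all_rows
    for col, want in zip(columns, prefix_bits):
        # AND with the column for a 1 bit, with its complement for a 0 bit.
        mask &= col if want else (~col & all_rows)
    return bin(mask).count("1")


values = [5, 7, 4, 2, 6]            # 3-bit values: 101, 111, 100, 010, 110
cols = to_bit_columns(values, 3)
count = count_with_prefix(cols, [1, 0], len(values))  # rows starting with "10"
```

Counting the rows that share a high-order bit prefix needs one AND per prefix bit, regardless of how many attributes the table has.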
4
P-Tree Concepts. Ordering (details). New: generalized Peano order sorting. Compression.
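One way to read "generalized Peano order sorting" is to sort records by a key that interleaves the bits of all attributes, highest-order bits first (a Z-order or Morton key). The sketch below makes that assumption; it is not taken from the slides, and `interleave_bits` is a hypothetical helper.

```python
def interleave_bits(record, n_bits):
    """Build a sort key by interleaving attribute bits, high-order bits first."""
    key = 0
    for bit in range(n_bits - 1, -1, -1):
        for value in record:
            key = (key << 1) | ((value >> bit) & 1)
    return key


records = [(3, 0), (0, 1), (2, 3), (0, 0)]   # two 2-bit attributes per record
sorted_records = sorted(records, key=lambda r: interleave_bits(r, 2))
```

Records that agree in the high-order bits of every attribute end up adjacent after sorting, which tends to produce long uniform runs in the bit columns and so helps compression.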
5
Impact of Peano Order Sorting
6
P-Tree Implementation. Implementation in Java; was ported to C / C++ (Amal Perera, Masum Serazi). Fastest compressing P-tree implementation so far. Array indices as pointers (details). Grandchild purity.
7
Kernel-Density-Based Classification. Probability of an attribute vector x, conditional on class label value c_i. [·] is 1 if its argument is true, 0 otherwise. Depends on N training points x_t. The kernel function K(x, x_t) can be, e.g., a Gaussian function or a step function.
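A minimal sketch of kernel-density-based classification, assuming the class-conditional density is estimated as a sum of kernel values over the training points of that class, divided by N. The normalization, the kernel widths, and all function names are assumptions made for illustration.

```python
import math


def gaussian_kernel(x, xt, sigma=1.0):
    """Gaussian kernel on the squared Euclidean distance between x and xt."""
    sq = sum((a - b) ** 2 for a, b in zip(x, xt))
    return math.exp(-sq / (2 * sigma ** 2))


def step_kernel(x, xt, width=1.0):
    """Step kernel: 1 if xt lies within `width` of x in every attribute."""
    return 1.0 if all(abs(a - b) <= width for a, b in zip(x, xt)) else 0.0


def class_density(x, c, train_x, train_c, kernel):
    """Sum kernel values over training points with class c, divided by N."""
    n = len(train_x)
    return sum(kernel(x, xt) for xt, ct in zip(train_x, train_c) if ct == c) / n


def classify(x, train_x, train_c, kernel):
    """Predict the class with the largest kernel density at x."""
    classes = sorted(set(train_c))
    return max(classes, key=lambda c: class_density(x, c, train_x, train_c, kernel))


train_x = [(0, 0), (0, 1), (5, 5), (6, 5)]
train_c = ["a", "a", "b", "b"]
label = classify((0.5, 0.5), train_x, train_c, gaussian_kernel)
```

With the step kernel, the density reduces to a count of training points in an interval around x, which is the kind of count that bit-column storage can evaluate quickly.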
8
Higher Order Basic Bit Distance (HOBbit): P-trees make count evaluation efficient for the following intervals
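The slide's figure is not reproduced here. As a hedged sketch, assume the HOBbit distance between two non-negative integers is the number of low-order bits that must be dropped before the two values agree, so that a neighborhood of distance d around a value is the aligned interval of width 2^d that contains it. The function names are hypothetical.

```python
def hobbit_distance(a, b):
    """Number of low-order bits to drop until a and b are equal."""
    return (a ^ b).bit_length()


def hobbit_interval(a, d):
    """Inclusive interval of all integers within HOBbit distance d of a."""
    low = (a >> d) << d
    return low, low + (1 << d) - 1


d = hobbit_distance(5, 7)      # 101 vs 111 differ in bit 1
interval = hobbit_interval(5, 2)
```

Under this reading, every such interval corresponds to a fixed prefix of high-order bits, which is why counting the points inside it needs only the matching bit columns.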
9
Paper 1: Rule-Based Classification. Goal: high accuracy on large data sets, including standard ones (UCI ML Repository). Neighborhood evaluated through equality of categorical attributes and HOBbit interval for continuous attributes. Curse of dimensionality: volume empty with high likelihood. Information gain to select attributes. Attributes considered binary, based on test sample (“Lazy decision trees”, Friedman ’96 [4]). Continuous data: interval around test sample. Exact information gain (details). Pursuing multiple paths.
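Attribute selection by information gain can be sketched for the binary case described above, where each attribute splits the training points into those inside the neighborhood of the test sample and those outside it. This is the textbook information gain computed from class counts; it is an illustrative assumption and not necessarily the "exact information gain" the slide refers to.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(inside, outside):
    """Gain of a binary split, given class counts inside and outside the
    neighborhood of the test sample (same class order in both lists)."""
    n_in, n_out = sum(inside), sum(outside)
    n = n_in + n_out
    overall = [i + o for i, o in zip(inside, outside)]
    return (entropy(overall)
            - (n_in / n) * entropy(inside)
            - (n_out / n) * entropy(outside))


gain_perfect = information_gain([4, 0], [0, 4])   # split separates the classes
gain_useless = information_gain([2, 2], [2, 2])   # split tells us nothing
```

The attribute with the largest gain would be chosen next; pursuing multiple paths would mean keeping several high-gain attributes instead of only the best one.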
10
Results: Accuracy. Comparable to C4.5 after much less development time. 5 data sets from UCI Machine Learning Repository (details). 2 additional data sets: Crop, Gene-function. Improvement through multiple paths (20).
Data set | C4.5 | 20 paths
adult | 15.54 | 14.93
kr-vs-kp | 0.8 | 0.84
mushroom | 0 | 0
11
Results: Speed. Used on largest UCI data sets. Scaling of execution time as a function of training set size.
12
Paper 2: Kernel-Based Semi-Naïve Bayes Classifier. Goal: handling many attributes. Naïve Bayes: x^(k) is the value of the k-th attribute. Semi-naïve Bayes: correlated attributes are joined. Has been done for categorical data (Kononenko ’91 [5], Pazzani ’96 [6]). Previously: continuous data discretized.
13
Kernel-Based Naïve Bayes. Alternatives for continuous data: discretization; distribution function (Gaussian with mean and standard deviation from data); no alternative for semi-naïve approach. Kernel density estimate (Hastie [7]).
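The difference between a single Gaussian distribution function and a kernel density estimate for one continuous attribute can be sketched as follows. The Gaussian kernel, the bandwidth value, and all function names are assumptions chosen for illustration; the slides do not specify them.

```python
import math
import statistics


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))


def parametric_estimate(x, samples):
    """One Gaussian with mean and standard deviation taken from the data."""
    return gaussian_pdf(x, statistics.mean(samples), statistics.pstdev(samples))


def kernel_estimate(x, samples, bandwidth):
    """Average of Gaussian kernels centered on each training value."""
    return sum(gaussian_pdf(x, s, bandwidth) for s in samples) / len(samples)


def naive_bayes_score(x, class_rows, prior, bandwidth):
    """Prior times the product of per-attribute kernel density estimates."""
    score = prior
    for k, value in enumerate(x):
        column = [row[k] for row in class_rows]
        score *= kernel_estimate(value, column, bandwidth)
    return score


samples = [0.0, 0.0, 10.0, 10.0]                 # bimodal attribute values
at_mode = kernel_estimate(0.0, samples, 1.0)
at_gap = kernel_estimate(5.0, samples, 1.0)
```

For bimodal data such as `samples`, the single Gaussian puts its peak in the empty gap at 5, while the kernel estimate follows the two modes. A kernel estimate also extends to joined attributes by using a kernel over the joined attribute values, which a one-dimensional distribution function does not.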
14
Correlations. Correlation between attributes a and b. N: number of training points t. Kernel function for continuous data. d_EH: exponential HOBbit distance.
15
Results. P-tree Naïve Bayes: difference only for continuous data. Semi-Naïve Bayes: 3 parameter combinations (t: threshold). Blue: t = 1, 3 iterations. Red: t = 0.3, incl. anti-correlations. White: t = 0.05.
16
Paper 3: Hierarchical Clustering [10]. Goal: understand relationship between standard algorithms; combine the “best” aspects of three major ones. Partition-based: relationship to k-medoids [8] demonstrated; same cluster boundary definition. Density-based (kernel-based, DENCLUE [9]): similar cluster center definition. Hierarchical: follows naturally from above definitions.
17
Results: Speed Comparison with K-Means
18
Results: Clustering Effectiveness K-means Our Algorithm
19
Summary. P-tree representation for non-spatial data; fast implementation. Paper 1: Rule-Based Algorithm. Test-sample-centered intervals, multiple paths; competitive on “standard” (UCI) data. Paper 2: Kernel-Based Semi-Naïve Bayes. New algorithm to handle large attribute numbers; attribute joining shown to be beneficial. Paper 3: Hierarchical Clustering [10]. Competitive for speed and effectiveness; hierarchical structure.
20
Outlook. Software engineering aspects: column-oriented design; relationship with P-tree API. “Non-standard” data: data with graph structure; hierarchical data, concept slices [11]. Visualization: visualization of data on a graph.
21
Software Engineering. Business problems are row-based: match between database tables and objects. Scientific / engineering problems are column-based: collective properties of interest. Standard OO unsuitable; instead Fortran, array-based languages (ZPL). Solution? Design pattern? Library?
22
Ptree API
23
“Non-standard” Data. Types of data: biological (KDD-cup ’02: our team got an honorable mention!); sensor networks. Types of problems: small probability of minority class label (A-ROC evaluation); multi-valued attributes (bit-vector representation ideal for P-trees); graphs (rich supply of new problems / techniques, work with Chris Besemann); hierarchical categorical attributes [11].
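Assuming "A-ROC" refers to the area under the ROC curve, the evaluation can be sketched as the fraction of (positive, negative) pairs that the classifier ranks correctly, with ties counted as half. The function name is hypothetical, and the quadratic pair loop is chosen for clarity, not speed.

```python
def area_under_roc(scores, labels):
    """Probability that a random positive scores above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.4, 0.6, 0.1]
labels = [1, 1, 0, 0]
auc = area_under_roc(scores, labels)   # 3 of the 4 pairs are ranked correctly
```

Unlike plain accuracy, this measure does not reward a classifier for always predicting the majority class, which matters when the minority class label is rare.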
24
Visualization. Idea: use graph visualization tool, e.g. http://www.touchgraph.com/. Visualize node data through glyphs. Visualize edge data.