Fast Kernel-Density-Based Classification and Clustering Using P-Trees
Anne Denton
Major Advisor: William Perrizo

Outline
• Introduction
• P-Trees: concepts, implementation
• Kernel Methods
• Paper 1: Rule-Based Classification
• Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
• Paper 3: Hierarchical Clustering
• Outlook

Introduction
• Data Mining
  - Information from data
  - Considers storage issues
• P-Tree Approach: bit-column-based storage (see the sketch below)
  - Compression
  - Hardware optimization
  - Simple index construction
  - Flexibility in storage organization
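To make the bit-column idea concrete, here is a minimal Python sketch of vertical bit decomposition, the starting point for building P-trees. The function name and the 3-bit example are illustrative, not from the actual implementation, which additionally compresses each bit column into a quadrant tree:

```python
from typing import List

def to_bit_columns(values: List[int], bits: int) -> List[List[int]]:
    """Split an integer attribute into `bits` vertical bit columns,
    most significant bit first (illustrative sketch only)."""
    return [[(v >> (bits - 1 - j)) & 1 for v in values]
            for j in range(bits)]

column = [5, 3, 7, 2]            # attribute values at 3-bit precision
print(to_bit_columns(column, 3))
# [[1, 0, 1, 0], [0, 1, 1, 1], [1, 1, 1, 0]]
```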

P-Tree Concepts
• Ordering (details)
  - New: generalized Peano order sorting
• Compression

Impact of Peano Order Sorting

P-Tree Implementation
• Implementation in Java; ported to C/C++ (Amal Perera, Masum Serazi)
  - Fastest compressing P-tree implementation so far
• Array indices as pointers (details)
• Grandchild purity

Kernel-Density-Based Classification
• Probability of an attribute vector x, conditional on the class label value c_i, estimated from the N training points x_t:

    $P(\mathbf{x} \mid c_i) \approx \dfrac{\sum_{t=1}^{N} [y_t = c_i]\, K(\mathbf{x}, \mathbf{x}_t)}{\sum_{t=1}^{N} [y_t = c_i]}$

• $[\pi]$ is 1 if $\pi$ is true, 0 otherwise (Iverson bracket)
• Kernel function K(x, x_t) can be, e.g., a Gaussian function or a step function
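A brute-force sketch of this estimator with a Gaussian kernel; the function names and the bandwidth sigma are illustrative, and the actual system obtains the sums from P-tree counts rather than by looping over training points:

```python
import math
from typing import List, Sequence

def gaussian_kernel(x: Sequence[float], xt: Sequence[float],
                    sigma: float = 1.0) -> float:
    # Gaussian kernel; sigma is a hypothetical bandwidth parameter.
    d2 = sum((a - b) ** 2 for a, b in zip(x, xt))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def class_density(x: Sequence[float], train_x: List[Sequence[float]],
                  train_y: List[int], c: int) -> float:
    # Kernel estimate of P(x | c): average kernel weight over the
    # training points whose label equals c (the Iverson bracket above).
    weights = [gaussian_kernel(x, xt)
               for xt, yt in zip(train_x, train_y) if yt == c]
    return sum(weights) / len(weights) if weights else 0.0

def classify(x: Sequence[float], train_x: List[Sequence[float]],
             train_y: List[int]) -> int:
    # Predict the class whose density estimate at x is largest.
    return max(set(train_y),
               key=lambda c: class_density(x, train_x, train_y, c))
```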

Higher Order Basic Bit (HOBbit) Distance
• P-trees make count evaluation efficient for HOBbit intervals, which fix all but the k lowest-order bits of the test value:

    $[\,\lfloor x / 2^k \rfloor 2^k,\ \lfloor x / 2^k \rfloor 2^k + 2^k - 1\,]$
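For intuition, the HOBbit distance between two integers can be computed as the bit length of their XOR, i.e., the number of low-order bits that must be stripped before the values agree; a small sketch consistent with the interval definition above:

```python
def hobbit_distance(a: int, b: int) -> int:
    """HOBbit distance: position (counted from 1) of the most
    significant bit in which a and b differ; 0 if they are equal."""
    return (a ^ b).bit_length()

assert hobbit_distance(12, 12) == 0   # 1100 vs 1100
assert hobbit_distance(12, 13) == 1   # 1100 vs 1101
assert hobbit_distance(12, 4) == 4    # 1100 vs 0100
```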

Paper 1: Rule-Based Classification
• Goal: high accuracy on large data sets, including standard ones (UCI ML Repository)
• Neighborhood evaluated through
  - Equality of categorical attributes
  - HOBbit interval for continuous attributes
• Curse of dimensionality: the neighborhood volume is empty with high likelihood
• Information gain to select attributes (see the sketch below)
  - Attributes considered binary, based on the test sample ("lazy decision trees", Friedman '96 [4])
  - Continuous data: interval around the test sample
  - Exact information gain (details)
  - Pursuing multiple paths
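A sketch of exact information gain for one binary split, inside vs. outside the interval around the test sample; the computation is shown directly on label lists, whereas the actual system derives the counts from P-trees:

```python
import math
from typing import List

def entropy(labels: List[int]) -> float:
    # Binary entropy of a list of 0/1 class labels.
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def info_gain(labels: List[int], in_interval: List[bool]) -> float:
    # Gain from splitting on membership in the interval around the
    # test sample; both sides weighted by their relative size.
    inside  = [l for l, m in zip(labels, in_interval) if m]
    outside = [l for l, m in zip(labels, in_interval) if not m]
    n = len(labels)
    return (entropy(labels)
            - (len(inside) / n) * entropy(inside)
            - (len(outside) / n) * entropy(outside))
```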

Results: Accuracy
• Comparable to C4.5 after much less development time
• 5 data sets from the UCI Machine Learning Repository (details)
• 2 additional data sets: Crop, Gene-function
• Improvement through multiple paths (20): error rates of C4.5 vs. the 20-path classifier on adult, kr-vs-kp, and mushroom (0 for both on mushroom)

Results: Speed
• Used on the largest UCI data sets
• Scaling of execution time as a function of training set size

Paper 2: Kernel-Based Semi-Naïve Bayes Classifier
• Goal: handling many attributes
• Naïve Bayes: $P(c \mid \mathbf{x}) \propto P(c) \prod_k P(x^{(k)} \mid c)$, where $x^{(k)}$ is the value of the k-th attribute
• Semi-naïve Bayes: correlated attributes are joined (see the sketch below)
  - Has been done for categorical data (Kononenko '91 [5], Pazzani '96 [6])
  - Previously: continuous data discretized
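To make attribute joining concrete, a frequency-based sketch (no kernel smoothing yet) in which two correlated attributes are merged into one compound attribute before forming the Naïve Bayes product; all names are illustrative:

```python
from typing import List, Tuple

def joint_attribute(col_a: List[int], col_b: List[int]) -> List[Tuple[int, int]]:
    # Join two correlated attributes into one compound attribute,
    # as in the semi-naive approach for categorical data.
    return list(zip(col_a, col_b))

def naive_bayes_score(c: int, x: Tuple, train_x: List[Tuple],
                      train_y: List[int]) -> float:
    # P(c) * prod_k P(x_k | c), with simple frequency estimates.
    rows_c = [r for r, y in zip(train_x, train_y) if y == c]
    score = len(rows_c) / len(train_y)           # prior P(c)
    for k, v in enumerate(x):
        score *= sum(1 for r in rows_c if r[k] == v) / max(len(rows_c), 1)
    return score
```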

Kernel-Based Naïve Bayes
• Alternatives for continuous data
  - Discretization
  - Distribution function: Gaussian with mean and standard deviation from the data
• No alternative existed for the semi-naïve approach; here: kernel density estimate (Hastie [7], sketch below)
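A minimal sketch of the per-attribute kernel density estimate, assuming a Gaussian kernel; the bandwidth h is a hypothetical parameter:

```python
import math
from typing import List

def kde_1d(x: float, samples: List[float], h: float = 1.0) -> float:
    # One-dimensional Gaussian kernel density estimate of P(x),
    # replacing discretization or a single fitted Gaussian.
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-((x - s) / h) ** 2 / 2.0)
                      for s in samples)
```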

Correlations
• Correlation between attributes a and b, estimated from the N training points t
• Kernel function for continuous data based on d_EH, the exponential HOBbit distance (see the sketch below)
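The correlation formula itself is not reproduced here, but a kernel that decays exponentially with the HOBbit distance could be sketched as follows; the scale parameter is an assumption for illustration, not taken from the paper:

```python
import math

def hobbit_distance(a: int, b: int) -> int:
    # As in the earlier sketch: most significant differing bit position.
    return (a ^ b).bit_length()

def exp_hobbit_kernel(a: int, b: int, scale: float = 1.0) -> float:
    # Kernel weight that falls off exponentially with the HOBbit
    # distance d_EH; `scale` is a hypothetical smoothing parameter.
    return math.exp(-hobbit_distance(a, b) / scale)
```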

Results
• P-tree Naïve Bayes: difference only for continuous data
• Semi-Naïve Bayes: 3 parameter combinations (t: threshold)
  - Blue: t = 1, 3 iterations
  - Red: t = 0.3, incl. anti-correlations
  - White: t = 0.05

Paper 3: Hierarchical Clustering [10]
• Goal: understand the relationship between standard algorithms and combine the "best" aspects of three major ones (sketch below)
  - Partition-based: relationship to k-medoids [8] demonstrated; same cluster boundary definition
  - Density-based (kernel-based, DENCLUE [9]): similar cluster center definition
  - Hierarchical: follows naturally from the above definitions
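As a one-dimensional illustration of the density-based cluster center idea (hill climbing to a density attractor, in the spirit of DENCLUE), a sketch with a Gaussian kernel; the grid search, step size, and bandwidth are simplifications for illustration, not the dissertation's algorithm:

```python
import math
from typing import List

def density(x: float, points: List[float], h: float = 1.0) -> float:
    # Gaussian kernel density estimate at x.
    return sum(math.exp(-((x - p) / h) ** 2 / 2.0) for p in points)

def hill_climb(x: float, points: List[float], step: float = 0.1,
               h: float = 1.0) -> float:
    # Move x uphill on the density surface until no neighbor on the
    # step grid is denser; the end point is the cluster attractor.
    while True:
        best = max((x - step, x, x + step),
                   key=lambda y: density(y, points, h))
        if best == x:
            return x
        x = best

# Points that climb to the same attractor form one cluster;
# boundaries between attractors match the k-medoids-style definition.
```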

Results: Speed Comparison with K-Means

Results: Clustering Effectiveness (K-means vs. our algorithm)

Summary
• P-tree representation for non-spatial data; fast implementation
• Paper 1: Rule-Based Algorithm
  - Test-sample-centered intervals, multiple paths
  - Competitive on "standard" (UCI) data
• Paper 2: Kernel-Based Semi-Naïve Bayes
  - New algorithm to handle large numbers of attributes
  - Attribute joining shown to be beneficial
• Paper 3: Hierarchical Clustering [10]
  - Competitive in speed and effectiveness
  - Hierarchical structure

Outlook
• Software engineering aspects
  - Column-oriented design
  - Relationship with the P-tree API
• "Non-standard" data
  - Data with graph structure
  - Hierarchical data, concept slices [11]
• Visualization
  - Visualization of data on a graph

Software Engineering
• Business problems are row-based
  - Match between database tables and objects
• Scientific / engineering problems are column-based
  - Collective properties are of interest
  - Standard OO is unsuitable; alternatives: Fortran, array-based languages (ZPL)
• Solution? Design pattern? Library? (see the sketch below)
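As a minimal illustration of the row-based vs. column-based tension; names and data are made up for this sketch, not taken from the P-tree code base:

```python
# Row-based layout: one object per record, natural for business data.
rows = [{"field_id": 1, "yield": 4.2}, {"field_id": 2, "yield": 3.9}]

# Column-based layout: one array per attribute, natural when
# collective properties of a whole column are of interest.
field_ids = [1, 2]
yields = [4.2, 3.9]

# A collective property touches a single contiguous column ...
mean_yield = sum(yields) / len(yields)

# ... whereas the row-based layout has to walk every object.
mean_yield_rows = sum(r["yield"] for r in rows) / len(rows)
```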

P-tree API

"Non-standard" Data
• Types of data
  - Biological (KDD Cup '02: our team received an honorable mention!)
  - Sensor networks
• Types of problems
  - Small probability of the minority class label: A-ROC evaluation
  - Multi-valued attributes: bit-vector representation is ideal for P-trees
  - Graphs: rich supply of new problems and techniques (work with Chris Besemann)
  - Hierarchical categorical attributes [11]

Visualization
• Idea: use a graph visualization tool
• Visualize node data through glyphs
• Visualize edge data