1 A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs Ph.D Showcase, Dept. of Computer Science Sasi Kumar Pitchaimalai Ph.D Candidate Database Systems Group, Department of Computer Science University of Houston Advisor: Dr. Carlos Ordonez
2 Motivation Naïve Bayes Classifier(NB) – One of the most popular and important classifiers in Machine Learning – Robust, Powerful, Fast to Compute And Easy to Understand Programming Inside A DBMS – SQL can easily handle complex computations – UDFs can use arrays and processed in memory
Data Mining Inside A DBMS Avoids Exporting the data outside the DBMS Major overhead Data Security Scales Linearly with large data sets Exploit parallelism provided by a DBMS Use optimized queries with simple database operations Objective: Push computations involving large data sets inside the DBMS
4 Bayesian Classifier Based On K-Means (BKM) A Generalization Of Naïve Bayes(NB) The Algorithm – Initialization: Randomly initialize k clusters per class from the data set. – E-Step: Compute Euclidean distance, find nearest cluster and then compute sufficient statistics. – M-Step: Re-compute cluster centers and radii. Check Convergence. The E-Step and M-Step are repeated until model converges i.e clusters do not move
BKM: Finding the clusters per class
6 Database Optimizations Five different query optimization techniques for distance computation were introduced. User Defined Functions (UDFs) – Computing distance and nearest cluster in a single UDF. Using CASE statement instead of aggregations. Sufficient Statistics of the clusters were computed in a single table scan.
7 Comparing Accuracy – NB Vs BKM Vs DT Data SetAlgorithmGlobalClass-0Class-1 pimaNB76%80%68% BKM76%87%53% DT68%76%53% SpamNB70%87%45% BKM73%91%43% DT80%85%72% BscaleNB50%51%30% BKM59% 60% DT89%96%0% WbcancerNB93%91%95% BKM93%84%97% DT95%94%96% Global Accuracy: BKM better than NB and worse than DT(Decision Tree) in most cases Class Breakdown Accuracy: BKM better than NB except 2 cases proving class decomposition is a positive step towards increasing NB accuracy. DT performs poorly here and really worse in case of the bscale.
8 BKM Scalability- Varying n,d,k Times per Iteration. Defaults: d=4,k=4,n=100k
Comparing DBMS with MapReduce MapReduce: A distributed non-transactional high performance data intensive processing framework.
Incremental Mining An UDF performing incremental data mining exploiting data parallelism Minimizing the number of scans(1-3) on the data set Provides an approximation of the model before we scan through the complete data set Requires thread safe sharing of the model without affecting performance
Papers Carlos Ordonez, Sasi K. Pitchaimalai: One-pass data mining algorithms in a DBMS with UDFs. SIGMOD Conference 2011: Carlos OrdonezSIGMOD Conference 2011 Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado : Comparing SQL and MapReduce to compute Naïve Bayes in a Single Table Scan, CloudDB, CIKM 2010 Carlos Ordonez, Sasi K. Pitchaimalai: Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling, DKE 2010 Carlos Ordonez, Sasi K. Pitchaimalai - Bayesian Classifiers Programmed in SQL, TKDE 2008 Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado – Efficient Distance Computation Using SQL Queries and UDFs, ICDM 2008