A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs. Ph.D. Showcase, Dept. of Computer Science. Sasi Kumar Pitchaimalai, Ph.D. Candidate.

Presentation transcript:

1 A K-Means Based Bayesian Classifier Inside a DBMS Using SQL & UDFs
Ph.D. Showcase, Dept. of Computer Science
Sasi Kumar Pitchaimalai, Ph.D. Candidate
Database Systems Group, Department of Computer Science, University of Houston
Advisor: Dr. Carlos Ordonez

2 Motivation
Naïve Bayes Classifier (NB)
- One of the most popular and important classifiers in machine learning
- Robust, powerful, fast to compute, and easy to understand
Programming Inside a DBMS
- SQL can easily handle complex computations
- UDFs can use arrays and are processed in memory

Data Mining Inside a DBMS
- Avoids exporting the data outside the DBMS
  - Major overhead
  - Data security
- Scales linearly with large data sets
- Exploits parallelism provided by the DBMS
- Uses optimized queries with simple database operations
Objective: push computations involving large data sets inside the DBMS

4 Bayesian Classifier Based on K-Means (BKM)
A generalization of Naïve Bayes (NB)
The algorithm:
- Initialization: randomly initialize k clusters per class from the data set.
- E-step: compute the Euclidean distance from each point to each cluster, find the nearest cluster, and then compute the sufficient statistics.
- M-step: recompute the cluster centers and radii, then check convergence.
The E-step and M-step are repeated until the model converges, i.e., the clusters no longer move.
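The loop on this slide can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' SQL/UDF implementation. The sufficient statistics are assumed to be the per-cluster count N, linear sum L, and sum of squares Q. The radii are assumed to be per-dimension variances. The slides do not describe the scoring step, so predicting the class of the nearest cluster is also an assumption. All function names are hypothetical.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between a point and a cluster center."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def nearest(x, centers):
    """Index of the center closest to x (ties go to the lowest index)."""
    return min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))


def kmeans(points, k, max_iter=100, seed=0):
    """K-means on one class; returns (centers, radii)."""
    rng = random.Random(seed)
    d = len(points[0])
    # Initialization: pick k points of this class at random.
    centers = [list(p) for p in rng.sample(points, k)]
    radii = [[0.0] * d for _ in range(k)]
    for _ in range(max_iter):
        # E-step: assign each point to its nearest cluster and
        # accumulate the assumed sufficient statistics N, L, Q.
        N = [0] * k
        L = [[0.0] * d for _ in range(k)]
        Q = [[0.0] * d for _ in range(k)]
        for x in points:
            j = nearest(x, centers)
            N[j] += 1
            for h in range(d):
                L[j][h] += x[h]
                Q[j][h] += x[h] * x[h]
        # M-step: recompute centers and radii from N, L, Q only.
        new_centers = []
        for j in range(k):
            if N[j] == 0:
                new_centers.append(centers[j])  # leave an empty cluster in place
                continue
            c = [L[j][h] / N[j] for h in range(d)]
            radii[j] = [max(0.0, Q[j][h] / N[j] - c[h] ** 2) for h in range(d)]
            new_centers.append(c)
        if new_centers == centers:  # converged: clusters no longer move
            break
        centers = new_centers
    return centers, radii


def train_bkm(X, y, k, seed=0):
    """Fit k clusters per class; returns {label: (centers, radii)}."""
    model = {}
    for label in sorted(set(y)):
        pts = [x for x, g in zip(X, y) if g == label]
        model[label] = kmeans(pts, k, seed=seed)
    return model


def predict(model, x):
    """Assumed scoring rule: class of the nearest cluster over all classes."""
    return min(
        (sq_dist(x, c), label)
        for label, (centers, _) in model.items()
        for c in centers
    )[1]
```

With k = 1 each class is summarized by a single center and radius, which is the sense in which a model with k clusters per class can be read as a generalization of NB.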

BKM: Finding the clusters per class

6 Database Optimizations
- Five different query optimization techniques for distance computation were introduced.
- User-Defined Functions (UDFs): the distance and the nearest cluster are computed in a single UDF.
- A CASE statement is used instead of aggregations.
- The sufficient statistics of the clusters are computed in a single table scan.
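Two of these ideas can be illustrated with a small sketch that runs on SQLite through Python's `sqlite3` module. The slides do not name the DBMS or show the authors' queries, so everything here is an assumption made for illustration: the table names `points` and `centroids`, the column names, the fixed k = 2 and d = 2, and the reading of "CASE instead of aggregations" as choosing the nearest cluster with a CASE expression rather than a MIN aggregate.

```python
import sqlite3


def build_db():
    """Create a tiny in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE points (i INTEGER PRIMARY KEY, x1 REAL, x2 REAL)")
    conn.execute("CREATE TABLE centroids (j INTEGER PRIMARY KEY, m1 REAL, m2 REAL)")
    conn.executemany(
        "INSERT INTO points VALUES (?, ?, ?)",
        [(1, 0, 0), (2, 0, 2), (3, 10, 10), (4, 10, 12)],
    )
    conn.executemany(
        "INSERT INTO centroids VALUES (?, ?, ?)",
        [(1, 0, 1), (2, 10, 11)],
    )
    return conn


# Variant 1: distances to both centroids sit side by side in one row,
# a CASE expression picks the nearest cluster, and N, L, Q are
# accumulated per cluster in the same scan of `points`.
CASE_QUERY = """
WITH dist AS (
    SELECT p.x1, p.x2,
           (p.x1 - a.m1) * (p.x1 - a.m1) + (p.x2 - a.m2) * (p.x2 - a.m2) AS d1,
           (p.x1 - b.m1) * (p.x1 - b.m1) + (p.x2 - b.m2) * (p.x2 - b.m2) AS d2
    FROM points p, centroids a, centroids b
    WHERE a.j = 1 AND b.j = 2
)
SELECT CASE WHEN d1 <= d2 THEN 1 ELSE 2 END AS j,
       COUNT(*), SUM(x1), SUM(x2), SUM(x1 * x1), SUM(x2 * x2)
FROM dist
GROUP BY CASE WHEN d1 <= d2 THEN 1 ELSE 2 END
ORDER BY j
"""

# Variant 2: a scalar UDF computes the distances and returns the
# nearest cluster id in a single call per row.
UDF_QUERY = """
SELECT j, COUNT(*), SUM(x1), SUM(x2), SUM(x1 * x1), SUM(x2 * x2)
FROM (SELECT nearest_cluster(x1, x2) AS j, x1, x2 FROM points)
GROUP BY j
ORDER BY j
"""


def register_udf(conn):
    """Register the hypothetical nearest_cluster(x1, x2) UDF."""
    centroids = conn.execute("SELECT j, m1, m2 FROM centroids").fetchall()

    def nearest_cluster(x1, x2):
        best = min(centroids, key=lambda c: (x1 - c[1]) ** 2 + (x2 - c[2]) ** 2)
        return best[0]

    conn.create_function("nearest_cluster", 2, nearest_cluster)


def sufficient_stats(conn, query):
    """Return rows of (j, N, L1, L2, Q1, Q2), one row per cluster."""
    return conn.execute(query).fetchall()
```

Calling `sufficient_stats(conn, CASE_QUERY)` and, after `register_udf(conn)`, `sufficient_stats(conn, UDF_QUERY)` on the same database should return the same rows. Both variants read `points` once per iteration.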

7 Comparing Accuracy: NB vs. BKM vs. DT

Data Set   Algorithm  Global  Class-0  Class-1
pima       NB         76%     80%      68%
pima       BKM        76%     87%      53%
pima       DT         68%     76%      53%
Spam       NB         70%     87%      45%
Spam       BKM        73%     91%      43%
Spam       DT         80%     85%      72%
Bscale     NB         50%     51%      30%
Bscale     BKM        59%     60%
Bscale     DT         89%     96%      0%
Wbcancer   NB         93%     91%      95%
Wbcancer   BKM        93%     84%      97%
Wbcancer   DT         95%     94%      96%

(The Bscale/BKM row has only two values in the source; they are listed in order and their column assignment is uncertain.)

Global accuracy: BKM is better than NB and worse than DT (decision tree) in most cases.
Class breakdown accuracy: BKM is better than NB except in 2 cases, indicating that class decomposition is a positive step towards increasing NB accuracy. DT performs poorly here, and worst on Bscale (0% on Class-1).
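The difference between the "Global" column and the per-class columns can be made concrete with a short sketch. It assumes that class accuracy means the fraction of points of that class that were predicted correctly. The slides do not define the metric, so this reading and the function name are assumptions.

```python
def accuracy_breakdown(y_true, y_pred):
    """Return (global accuracy, {class label: accuracy within that class})."""
    total = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    per_class = {}
    for label in sorted(set(y_true)):
        # Positions of the points whose true class is `label`.
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(y_pred[i] == label for i in idx)
        per_class[label] = hits / len(idx)
    return correct / total, per_class
```

A classifier that does well on the majority class can reach a high global accuracy while doing poorly on the minority class, which is why the table reports both.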

8 BKM Scalability: Varying n, d, k
Times per iteration. Defaults: d = 4, k = 4, n = 100k.

Comparing DBMS with MapReduce
MapReduce: a distributed, non-transactional, high-performance, data-intensive processing framework.

Incremental Mining
- A UDF performs incremental data mining, exploiting data parallelism
- Minimizes the number of scans (1-3) on the data set
- Provides an approximation of the model before the scan of the complete data set finishes
- Requires thread-safe sharing of the model without affecting performance
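The idea of an approximate model that is available before the scan finishes can be sketched as follows. This is a plain Python illustration under assumptions, not the UDF described on the slide. The model is assumed to be a mean and variance per dimension derived from running sums N, L, Q. The sharing between workers is shown with a `threading.Lock`. The class and function names are hypothetical.

```python
import threading


class IncrementalModel:
    """Running sufficient statistics shared by several scanning threads."""

    def __init__(self, d):
        self._lock = threading.Lock()
        self.n = 0
        self.L = [0.0] * d  # linear sum per dimension
        self.Q = [0.0] * d  # sum of squares per dimension

    def update(self, x):
        """Fold one row into the statistics."""
        with self._lock:
            self.n += 1
            for h, v in enumerate(x):
                self.L[h] += v
                self.Q[h] += v * v

    def snapshot(self):
        """Current approximate model (n, mean, variance), or None if empty."""
        with self._lock:
            n, L, Q = self.n, list(self.L), list(self.Q)
        if n == 0:
            return None
        mean = [l / n for l in L]
        var = [max(0.0, q / n - m * m) for q, m in zip(Q, mean)]
        return n, mean, var


def scan(model, partition):
    """One worker scanning its partition of the data set once."""
    for x in partition:
        model.update(x)


def parallel_scan(model, partitions):
    """Scan all partitions concurrently, one thread per partition."""
    threads = [threading.Thread(target=scan, args=(model, p)) for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return model.snapshot()
```

`snapshot()` can be called at any point during the scan to obtain the approximation built from the rows seen so far. Taking the lock once per row is the simplest way to keep the shared model consistent, but it adds contention. Accumulating a small local batch per thread and merging it under the lock is one possible way to reduce that cost.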

Papers
- Carlos Ordonez, Sasi K. Pitchaimalai: One-pass data mining algorithms in a DBMS with UDFs. SIGMOD Conference 2011.
- Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado: Comparing SQL and MapReduce to compute Naïve Bayes in a Single Table Scan. CloudDB, CIKM 2010.
- Carlos Ordonez, Sasi K. Pitchaimalai: Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling. DKE 2010.
- Carlos Ordonez, Sasi K. Pitchaimalai: Bayesian Classifiers Programmed in SQL. TKDE 2008.
- Sasi K. Pitchaimalai, Carlos Ordonez, Carlos Garcia Alvarado: Efficient Distance Computation Using SQL Queries and UDFs. ICDM 2008.