Optimizing Bayesian Methods for High Dimensional Data Sets on Array-Based Parallel Database Systems
Wellington Cabrera
Advisor: Carlos Ordonez
Welcome everybody. Thank you for coming this morning. I am glad to defend my Master's thesis today.
Motivation
Nowadays, large data sets are present everywhere, with many more variables (attributes, features) than before:
- Microarray data
- Sensor data
- Network traffic data
- Medical imaging
Linear regression is a fundamental analytical technique in economics, social science, and biology.
Statistical models in the DBMS benefit from data security, concurrency, and recovery.
Microarray data consist of measures of gene expression levels, yielding data sets with several thousand variables. Sensor data come from large networks of sensors. In medical imaging, each pixel is considered a dimension.
Problem
Linear regression, directly applied to high-dimensional data sets, leads to complex models:
- Hundreds or thousands of variables
- Difficult to interpret
- A solution that cannot be applied in practice
Variable selection methods focus on finding effective, small subsets of predictors.
Current algorithms have several issues:
- Cannot work with data sets larger than RAM
- Cannot deal with many thousands of dimensions
- Inefficiency
Notation: d = dimensions, n = observations. Cases: d < n and d > n.
Contribution summary
Our work:
- Enables the analysis of very high dimensional data sets in the DBMS
- Overcomes the problem of data sets larger than RAM (d < n)
- Introduces an accelerated Gibbs sampler that requires only one pass over the data set, runs thousands of iterations in main memory, and is 10 to 100 times faster than the standard approach
Related Work
Classical approach:
- Forward selection
- Backward elimination
Information-based approach:
- Akaike's Information Criterion (AIC)
- Bayesian Information Criterion (BIC)
Bayesian approach:
- Nowadays, approximation by Markov Chain Monte Carlo (MCMC) methods
- GVS, SSVS
Classical approaches rely on greedy algorithms. MCMC has gained popularity thanks to increased computing power.
Challenges
Huge search space:
- For a data set with p variables, the search space has 2^p models
- Even for a moderate p = 64 the exhaustive search would take practically forever: testing one million models per second covers about 2^45 models in a year (see the worked arithmetic below)
- We reduce the original dimensionality p (many thousands) to a smaller dimensionality d (one or two thousand)
Large number of matrix operations:
- Each iteration of MCMC requires the evaluation of p probabilities (one per variable)
- Complex formulas involving matrix inversion and several matrix multiplications
Large data sets do not fit in RAM; disk-based computation is too slow.
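As a quick check of the figures on this slide, the arithmetic works out as follows:

```latex
10^{6}\ \tfrac{\text{models}}{\text{s}} \times 3.15\times 10^{7}\ \tfrac{\text{s}}{\text{year}}
  \approx 3.15\times 10^{13} \approx 2^{45}\ \tfrac{\text{models}}{\text{year}},
\qquad
\frac{2^{64}}{2^{45}} = 2^{19} \approx 5\times 10^{5}\ \text{years}.
```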
The Linear regression model
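This slide shows the model as a figure; for reference, a sketch of the standard linear regression setup and its variable-selection form, with notation assumed from the surrounding slides:

```latex
y = \beta_0 + X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^{2} I_n),
\qquad y \in \mathbb{R}^{n},\; X \in \mathbb{R}^{n \times d},\; \beta \in \mathbb{R}^{d};
```

```latex
\text{with variable selection: } \quad
y = \beta_0 + X_{\gamma}\beta_{\gamma} + \varepsilon,
\qquad \gamma \in \{0,1\}^{d},
```

where X_γ keeps only the columns of X selected by γ.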
Three Step Algorithm
1. Pre-selection
2. Summarization
3. Accelerated Gibbs sampler
Pre-selection
Reduces the data set from the original dimensionality p to a smaller dimensionality d.
Pre-selection by marginal correlation ranking:
- Calculate the correlation between each variable and the outcome
- Sort in descending order
- Take the best d variables
As a result, the top d variables are considered for further analysis (see the sketch below).
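A minimal in-memory sketch of this ranking in R (the function name is hypothetical; in the thesis this step runs inside the DBMS as queries or a custom operator):

```r
# Pre-selection by marginal correlation ranking (sketch).
# X: n x p matrix of predictors, y: outcome vector, d: target dimensionality.
preselect <- function(X, y, d) {
  r <- abs(cor(X, y))                         # |marginal correlation| of each column with y
  keep <- order(r, decreasing = TRUE)[1:d]    # indices of the top d variables
  list(index = keep, X = X[, keep, drop = FALSE])
}

# Usage: sel <- preselect(X, y, d = 1000); Xd <- sel$X
```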
Summarization
Data set properties can be summarized by the sufficient statistics n, L, Q.
Such sufficient statistics are a compact representation of the data set, calculated in one pass; we do not load the data set into RAM.
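In the notation of the related Gamma-operator work (points x_i as d x 1 column vectors), the three sufficient statistics are the count, the linear sum, and the quadratic sum of the points:

```latex
n = |X|, \qquad
L = \sum_{i=1}^{n} x_i \in \mathbb{R}^{d}, \qquad
Q = \sum_{i=1}^{n} x_i x_i^{T} = X X^{T} \in \mathbb{R}^{d \times d}.
```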
Extended Summarization in one pass
Z = [1, X, Y]
Γ = Z Z^T
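A minimal sketch of this one-pass computation in R, assuming the data set is read in row-blocks so it never has to fit in RAM; the reader function and block layout are hypothetical (in the thesis Γ is computed by the gamma() operator in the Array DBMS or by an aggregate UDF). With observations stored as rows, crossprod(Z) = Z^T Z equals the Z Z^T of the column-oriented notation above:

```r
# One-pass Gamma summarization (sketch).
# Each row of a block is one observation (x_1, ..., x_d, y).
gamma_summarize <- function(read_block) {
  Gamma <- NULL
  repeat {
    block <- read_block()                 # next block of rows, or NULL at end of data
    if (is.null(block)) break
    Z <- cbind(1, as.matrix(block))       # augment each row: z_i = (1, x_i, y_i)
    G <- crossprod(Z)                     # sum of z_i z_i^T for this block
    Gamma <- if (is.null(Gamma)) G else Gamma + G
  }
  Gamma   # (d+2) x (d+2) matrix holding n, L, Q and the cross-products with Y
}
```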
Gibbs Sampler
Markov Chain Monte Carlo (MCMC) method.
Consider a linear model with parameters Θ = {β, σ, γ}, where the binary vector γ[i] describes the variables selected at step i of the sequence.
The Gibbs sampler generates the Markov chain.
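In symbols, one common way to write the chain the sampler produces (a sketch; γ_{-j} denotes all entries of γ except the j-th):

```latex
\gamma^{(0)} \rightarrow \gamma^{(1)} \rightarrow \cdots \rightarrow \gamma^{(N)},
\qquad
\gamma_j^{(i)} \sim p\!\left(\gamma_j \mid \gamma_{-j}^{(i)},\, X,\, y\right),
\quad j = 1, \ldots, d .
```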
Optimizing the Gibbs Sampler
Non-conjugate Gaussian priors require the full Markov chain; conjugate priors simplify the computation: β and σ are integrated out.
Marin-Robert formulation: Zellner g-prior for β and Jeffreys prior for σ.
Reading the data set in each iteration is not necessary: the matrix products already in Γ are reused (see the sketch below).
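A minimal sketch of one Gibbs sweep over γ driven only by sub-blocks of Γ, assuming a flat prior on γ, a zero prior mean for β, and g = n; the function names are hypothetical and the marginal-likelihood expression is the usual Marin-Robert form up to a constant, not necessarily the exact formula used in the thesis:

```r
# One Gibbs sweep over the inclusion vector gamma, reusing only sub-blocks of
# the summarization matrix Gamma (no pass over the data).
# Layout assumed: Gamma = Z^T Z with Z = [1, X, y], so
#   Gamma[1, 1] = n,  Gamma[2:(d+1), 2:(d+1)] = X^T X,
#   Gamma[2:(d+1), d+2] = X^T y,  Gamma[d+2, d+2] = y^T y.

log_marglik <- function(sel, Gamma, d, n, g) {
  # Log marginal likelihood (up to a constant) of the model using variables 'sel',
  # under a Zellner g-prior with zero mean and a Jeffreys prior on sigma.
  idx <- c(1, 1 + sel)                           # intercept column plus selected variables
  XtX <- Gamma[idx, idx, drop = FALSE]
  Xty <- Gamma[idx, d + 2]
  yty <- Gamma[d + 2, d + 2]
  fit <- drop(crossprod(Xty, solve(XtX, Xty)))   # y^T X (X^T X)^{-1} X^T y
  -0.5 * length(idx) * log(1 + g) - 0.5 * n * log(yty - (g / (1 + g)) * fit)
}

gibbs_sweep <- function(gamma, Gamma, d, n, g = n) {
  for (j in 1:d) {
    g1 <- gamma; g1[j] <- 1                      # model with variable j included
    g0 <- gamma; g0[j] <- 0                      # model with variable j excluded
    l1 <- log_marglik(which(g1 == 1), Gamma, d, n, g)
    l0 <- log_marglik(which(g0 == 1), Gamma, d, n, g)
    p1 <- 1 / (1 + exp(l0 - l1))                 # P(gamma_j = 1 | rest), flat prior on gamma
    gamma[j] <- rbinom(1, 1, p1)
  }
  gamma
}
```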
Algorithm in the Array DBMS (most recent contribution)
- Pre-selection: two custom operators, preselect() and correlation()
- Summarization: custom operator gamma()
- Gibbs Sampler: R script, outside the DBMS
First work addressing Bayesian models in an Array DBMS.
System overview in Array DBMS
Summarization
Algorithm in the Row DBMS (previous contribution)
- Pre-selection: SQL queries + user-defined function
- Summarization: aggregate UDF
- Gibbs Sampler: table-valued function
System overview in Row DBMS
Experiments
Data sets used in this work
Software specification
Posterior probabilities. Data set mRNA. Plots for d=1000 and d=2000.
Quality of results: 4-fold validation results. Data set mRNA, d=2000.
Comparison: Three Step Algorithm vs. BayesVarSel
Three Step Algorithm (Array DBMS + R script): Hardware: Intel Quad Core; OS: Linux; Data set: mRNA, n=248, d=500; Prior: Zellner-g; N=100 iterations; time = 16 s; model size ~ 30.
BayesVarSel (R package): Hardware: Intel Quad Core; OS: Linux; Data set: mRNA, n=248, d=500; Prior: Zellner-g; N=100 iterations; time = 2104 s; model size ~ 180.
Comparison: comparing three versions of the Three Step Algorithm with an R package. Data set mRNA. 100 iterations. Time in seconds.
Performance Results: Execution time for the Summarization and Gibbs sampling steps. Data set Songs. Array DBMS.
Performance Results: Execution time in seconds for Pre-selection and Summarization. Data set mRNA. Array DBMS.
Execution time in seconds for the Gibbs sampler step. Data set mRNA. Array DBMS. 30 thousand iterations.
Performance Results: Execution time in seconds for Pre-selection, Summarization, and Gibbs Sampling. Data set mRNA. Row DBMS. 30 thousand iterations.
Summarization Performance
Gibbs sampling performance
Conclusions
- We present a Bayesian variable selection algorithm that enables the analysis of very high dimensional data sets that may not fit in RAM.
- We propose a Pre-selection step based on marginal correlation ranking, in order to reduce the dimensionality of the original data set.
- The Array DBMS is more suitable for computing statistical models, supporting very large arrays; a custom operator is required for optimal performance.
- We demonstrate that the Gibbs sampler works optimally in both Row and Array DBMSs based on the summarization matrix, computed in only one pass.
- Our algorithm shows outstanding performance: compared to a public domain R package, our prototype is two orders of magnitude faster.
- Our algorithm identifies small models, with R-squared generally better than 0.5.
Future work
We will explore different kernels and proposal functions for the Metropolis-Hastings algorithm:
- To find results in fewer iterations
- To choose kernels/proposals suitable for optimization via data summarization
Improve the algorithm for data sets larger than RAM.
Explore non-informative priors.
List of publications
- Carlos Ordonez, Yiqun Zhang, Wellington Cabrera: The Gamma Operator for Big Data Summarization on an Array DBMS. BigMine 2014: 88-103
- Wellington Cabrera, Carlos Ordonez, David Sergio Matusevich, Veerabhadran Baladandayuthapani: Bayesian variable selection for linear regression in high dimensional microarray data. DTMBIO 2013: 17-18
- Wellington Cabrera, Carlos Ordonez, David Sergio Matusevich, Veerabhadran Baladandayuthapani: Fast Bayesian Variable Selection Algorithms for High Dimensional Genomics Data. IJDMB (under revision)
- Carlos Ordonez, Javier García-García, Carlos Garcia-Alvarado, Wellington Cabrera, Veerabhadran Baladandayuthapani, Mohammed S. Quraishi: Data mining algorithms as a service in the cloud exploiting relational database systems. SIGMOD Conference 2013: 1001-1004
- Carlos Ordonez, Sofian Maabout, David Sergio Matusevich, Wellington Cabrera: Extending ER models to capture database transformations to build data sets for data mining. Data Knowl. Eng. 89: 38-54 (2014)
Thank you!