Optimizing Bayesian Methods for High Dimensional Data Sets on Array-Based Parallel Database Systems
Wellington Cabrera
Advisor: Carlos Ordonez
Motivation
Nowadays, large data sets are everywhere, with many more variables (attributes, features) than before:
- Microarray data
- Sensor data
- Network traffic data
- Medical imaging
Linear regression is a very popular analytical technique in:
- Economics
- Social science
- Biology
Problem
Linear regression, directly applied to high-dimensional data sets (hundreds or thousands of variables), leads to overly complex models.
Variable selection methods focus on finding effective, small subsets of predictors.
Current algorithms have several issues:
- They cannot work with data sets larger than RAM
- They cannot deal with many thousands of dimensions
- Poor performance
Contribution summary
Our work:
- Overcomes the problem of data sets larger than RAM
- Facilitates the analysis of data sets with very high dimensionality
- Introduces an accelerated Gibbs sampler that requires only one pass over the data set, 10 to 100 times faster than the standard approach
Related Work
Classical approach:
- Forward selection
- Backward elimination
Information-based approach:
- Akaike's Information Criterion (AIC)
- Bayesian Information Criterion (BIC)
Bayesian approach with Markov Chain Monte Carlo (MCMC) methods:
- GVS, SSVS
Challenges
Huge search space:
- For a data set with p variables, the search space has size 2^p
- Even for a moderate p = 64, exhaustive search would take practically forever: testing 1 million cases per second covers only about 2^45 cases in a year (10^6 cases/sec x ~3.15x10^7 sec/year ≈ 3x10^13 ≈ 2^45)
Large number of matrix operations:
- Each iteration of MCMC requires the evaluation of p probabilities (one per variable)
- Each probability is a complex formula involving matrix inversion and several matrix multiplications
Large data sets do not fit in RAM:
- Disk-based computation is too slow
The Linear regression model
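In standard notation, with outcome vector y and n x p predictor matrix X, this is the usual Gaussian linear model (stated here for reference):

$$y = \beta_0 + X\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^{2} I)$$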
Three Step Algorithm
1. Pre-selection
2. Summarization
3. Accelerated Gibbs sampler
Pre-selection
Reduces the data set from its original dimensionality p to a smaller dimensionality d.
Pre-selection by marginal correlation ranking:
- Calculate the correlation between each variable and the outcome
- Sort in descending order
- Take the best d variables
As a result, only the best d variables are considered for the rest of the process (see the sketch below).
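A minimal R sketch of this step; the function name preselect() mirrors the prototype's custom operator, but this in-memory version is only illustrative (the DBMS operators stream over the data):

```r
# Pre-selection by marginal correlation ranking: keep the d variables
# most correlated (in absolute value) with the outcome y.
preselect <- function(X, y, d) {
  r    <- abs(cor(X, y))                     # marginal correlation per variable
  keep <- order(r, decreasing = TRUE)[1:d]   # rank and take the best d
  X[, keep, drop = FALSE]
}
```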
Summarization
Data set properties can be summarized by the sufficient statistics n (the record count), L (the linear sum of the points), and Q (the quadratic sum of cross-products). These sufficient statistics are a compact representation of the data set, calculated in one pass: the data set is never loaded into RAM.
Extended Summarization in one pass
Z = [1, X, Y]
Γ = Z Zᵀ
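Stacking the data points as columns of Z, Γ packs all the sufficient statistics into one (d+2) x (d+2) matrix:

$$\Gamma = Z Z^{T} = \begin{bmatrix} n & L^{T} & \sum y_i \\ L & Q & X Y^{T} \\ \sum y_i & Y X^{T} & \sum y_i^{2} \end{bmatrix}$$

A minimal R sketch of the one-pass, blocked computation; read_block() is a hypothetical chunked reader (the prototype computes this inside the DBMS via the gamma() operator or an aggregate UDF):

```r
# One-pass computation of Gamma = Z %*% t(Z), processing the data in
# blocks so the full data set never resides in RAM. Each block B holds
# k rows of [X, Y]; Gamma accumulates n, L, Q and the Y cross-products.
gamma_one_pass <- function(read_block, d) {
  G <- matrix(0, d + 2, d + 2)
  repeat {
    B <- read_block()           # k x (d+1) block: d predictors plus Y
    if (is.null(B)) break
    Zb <- t(cbind(1, B))        # (d+2) x k: rows are [1, X, Y]
    G <- G + tcrossprod(Zb)     # add this block's contribution Zb Zb^T
  }
  G
}
```

Since Γ is only (d+2) x (d+2), it fits comfortably in RAM even when the data set does not.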
Gibbs Sampler
A Markov Chain Monte Carlo (MCMC) method. Consider a linear model with parameters Θ = {β, σ, γ}, where the binary vector γ⁽ⁱ⁾ describes the variables selected at step i of the sequence. The Gibbs sampler generates the Markov chain of models.
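Concretely, sweeping over the d indicator coordinates, each step draws γⱼ from its full conditional, producing the chain (standard Gibbs sampling):

$$\gamma^{(1)} \rightarrow \gamma^{(2)} \rightarrow \cdots \rightarrow \gamma^{(N)}, \qquad \gamma_j^{(i)} \sim p\bigl(\gamma_j \mid \gamma^{(i)}_{-j}, y\bigr)$$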
Optimizing Gibbs Sampler
Non-conjugate priors require sampling the full chain over {β, σ, γ}. Conjugate priors simplify the computation: β and σ can be integrated out, so they need not be sampled. Under the Zellner g-prior, the probability of γ has the closed form shown below. We exploit the sufficient statistics calculated in the summarization step to accelerate the Gibbs sampler.
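This is the standard g-prior posterior (Liang et al.'s form):

$$p(\gamma \mid y) \;\propto\; p(\gamma)\,(1+g)^{(n-1-|\gamma|)/2}\,\bigl[1 + g\,(1-R_{\gamma}^{2})\bigr]^{-(n-1)/2}$$

where |γ| is the number of selected variables and R²_γ is the R-squared of the model indexed by γ. A minimal R sketch of the accelerated step follows; it assumes a uniform prior over models and an illustrative layout of the summarization matrix G (index 1 for the intercept, 2..d+1 for the variables, d+2 for Y), which may differ from the prototype's actual operators:

```r
# Log posterior (up to a constant) of a model gamma under the Zellner
# g-prior, computed purely from the summarization matrix G = Gamma.
# Assumes a uniform prior over models, so p(gamma) cancels in ratios.
log_post_gamma <- function(G, gamma, g, n) {
  yix <- nrow(G)                          # last index holds the Y terms
  sy  <- G[1, yix]                        # sum of y
  Syy <- G[yix, yix] - sy^2 / n           # centered total sum of squares
  S   <- which(gamma == 1) + 1            # G indices of selected variables
  k   <- length(S)
  r2  <- if (k == 0) 0 else {
    Sxx <- G[S, S, drop = FALSE] - tcrossprod(G[S, 1]) / n   # centered X'X
    Sxy <- G[S, yix] - G[S, 1] * sy / n                      # centered X'y
    sum(Sxy * solve(Sxx, Sxy)) / Syy                         # R^2 of the model
  }
  0.5 * (n - 1 - k) * log1p(g) - 0.5 * (n - 1) * log1p(g * (1 - r2))
}

# One Gibbs sweep: draw each indicator from its full conditional.
gibbs_sweep <- function(G, gamma, g, n) {
  for (j in seq_along(gamma)) {
    lp1 <- log_post_gamma(G, replace(gamma, j, 1), g, n)
    lp0 <- log_post_gamma(G, replace(gamma, j, 0), g, n)
    gamma[j] <- rbinom(1, 1, 1 / (1 + exp(lp0 - lp1)))
  }
  gamma
}
```

Because every quantity comes from sub-blocks of Γ, a sweep needs only d small matrix solves and no scan of the data, which is what makes the accelerated sampler one to two orders of magnitude faster.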
Prototype in Array DBMS
- Pre-selection: two custom operators, preselect() and correlation()
- Summarization: custom operator gamma()
- Gibbs Sampler: R script, outside the DBMS
Prototype overview in Array DBMS
Summarization
Prototype in Row DBMS
- Pre-selection: SQL queries + user-defined function
- Summarization: aggregate UDF
- Gibbs Sampler: table-valued function
Overview in Row DBMS
Experiments
- Data sets used in this work
- Software specification
Results
Execution time for the Summarization and Gibbs sampling steps. Data set: Songs. Array DBMS.
Results
Execution time in seconds for Pre-selection and Summarization. Data set: mRNA. Array DBMS.
Execution time in seconds for the Gibbs sampler step. Data set: mRNA. Array DBMS. 30,000 iterations.
Results
Execution time in seconds for Pre-selection, Summarization and Gibbs Sampling. Data set: mRNA. Row DBMS. 30,000 iterations.
Summarization Performance
Gibbs sampling performance
Posterior probabilities
Data set: mRNA. Plots for d=1000 and d=2000.
Quality of results
4-fold validation results. Data set: mRNA, d=2000.
Comparison
                     Three Step Algorithm      BayesVarSel
Implementation       Array DBMS + R script     R package
Hardware             Intel Quad Core           Intel Quad Core
OS                   Linux                     Linux
Data set             mRNA, n=248, d=500        mRNA, n=248, d=500
Prior                Zellner-g                 Zellner-g
Iterations (N)       100                       100
Time                 16 sec                    2104 sec
Model size           ~30                       ~180
Comparison
Comparing three versions of the Three Step Algorithm with an R package. Data set: mRNA. 100 iterations. Time in seconds.
Conclusion
We present a Bayesian variable selection algorithm that enables the analysis of very high-dimensional data sets that may not fit in RAM.
We propose a Pre-selection step based on marginal correlation ranking, in order to reduce the dimensionality of the original data set.
We show that the Gibbs sampler works efficiently in both row and array DBMSs, based on the summarization matrix computed in only one pass.
Our algorithm shows outstanding performance: compared to a public-domain R package, our prototype is two orders of magnitude faster.
Our algorithm identifies small models, with R-squared generally better than 0.5.
Thank you!