Optimizing Bayesian Methods for High Dimensional Data Sets on Array-Based Parallel Database Systems
Wellington Cabrera. Advisor: Carlos Ordonez
Motivation
Nowadays, large data sets are present everywhere, with many more variables (attributes, features) than before:
- Microarray data
- Sensor data
- Network traffic data
- Medical imaging
Linear regression is a very popular analytical technique in:
- Economics
- Social science
- Biology
Problem
Linear regression, directly applied to high-dimensional data sets (hundreds or thousands of variables), leads to complex models. Variable selection methods focus on finding effective, small subsets of predictors. Current algorithms have several issues:
- They cannot work with data sets larger than RAM.
- They cannot deal with many thousands of dimensions.
- They have poor performance.
Contribution summary
Our work:
- Overcomes the problem of data sets larger than RAM.
- Facilitates the analysis of data sets with very high dimensionality.
- Introduces an accelerated Gibbs sampler that requires only one pass over the data set: 10s to 100s of times faster than the standard approach.
Related Work
- Classical approach: forward selection, backward elimination
- Information-based approach: Akaike's Information Criterion (AIC), Bayesian Information Criterion (BIC)
- Bayesian approach with Markov Chain Monte Carlo (MCMC) methods: GVS, SSVS
Challenges
- Huge search space: for a data set with p variables, the space size is 2^p. Even for a moderate p = 64 the search would take practically forever: if you could test one million cases per second, you would cover about 2^45 cases in a year.
- Large number of matrix operations: each iteration of MCMC requires the evaluation of p probabilities (one per variable), via a complex formula involving matrix inversion and several matrix multiplications.
- Large data sets do not fit in RAM, and disk-based computation is too slow.
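The search-space arithmetic above can be checked directly; a quick sketch in plain Python (the numbers are the slide's own assumptions):

```python
# Number of candidate variable subsets for p variables is 2^p.
p = 64
subsets = 2 ** p

# At one million subsets tested per second:
per_second = 10 ** 6
seconds_per_year = 365 * 24 * 3600  # ~3.15e7

tested_in_a_year = per_second * seconds_per_year
# ~3.15e13 subsets per year, on the order of 2^45 --
# a vanishing fraction of the 2^64 total.
print(tested_in_a_year)
print(tested_in_a_year / subsets)
```

At that rate, covering all of 2^64 would take roughly half a million years, which is why exhaustive search is hopeless and stochastic search (MCMC) is used instead.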
The Linear regression model
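The model equation on this slide did not survive extraction; as a reference point, the conventional linear model used in Bayesian variable selection can be written as follows (standard notation, assumed rather than copied from the slide):

```latex
% Linear regression with n observations and p predictor variables
y = X\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I_n)
% y: n x 1 outcome vector; X: n x p data matrix;
% beta: p x 1 coefficient vector; sigma^2: noise variance.
% Variable selection augments the model with a binary vector
% \gamma \in \{0,1\}^p, where \gamma_j = 1 means predictor j is included.
```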
Three Step Algorithm
1. Pre-selection
2. Summarization
3. Accelerated Gibbs sampler
Pre-selection
Reduces the data set from the original dimensionality p to a smaller dimensionality d. Pre-selection by marginal correlation ranking:
1. Calculate the correlation between each variable and the outcome.
2. Sort in descending order.
3. Take the best d variables.
As a result, the best d variables are considered for the rest of the process.
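The three ranking steps above can be sketched in plain Python (function names are illustrative, not the prototype's operators; ranking by absolute correlation is an assumption, since the slide only says "descending order"):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def preselect(X_cols, y, d):
    """Rank the p columns of X by |correlation with y|; keep the best d."""
    ranked = sorted(range(len(X_cols)),
                    key=lambda j: abs(pearson(X_cols[j], y)),
                    reverse=True)
    return ranked[:d]
```

In the prototype this ranking runs inside the DBMS, so only the d selected columns (not all p) flow into the summarization step.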
Summarization
Data set properties can be summarized by the sufficient statistics n, L, Q. Such sufficient statistics are a compact representation of the data set, calculated in one pass. We do not load the data set into RAM.
Extended Summarization in one pass
Z = [1, X, Y],  Γ = Z Zᵀ
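The one-pass computation of Γ can be sketched in plain Python: each augmented point z_i = (1, x_i, y_i) is visited once, and only the small (d+2)×(d+2) accumulator stays in memory. This is a sketch, not the prototype's gamma() operator:

```python
def gamma_matrix(points):
    """One-pass Gamma = Z Z^T, where each augmented point is
    z_i = (1, x_i1, ..., x_id, y_i).
    Gamma packs the sufficient statistics n, L (linear sums)
    and Q (cross-products) into a single matrix."""
    it = iter(points)
    first = next(it)
    k = len(first) + 1          # d predictors + y, plus the leading 1
    G = [[0.0] * k for _ in range(k)]

    def accumulate(row):
        z = [1.0] + list(row)   # augmented point z_i
        for a in range(k):
            for b in range(k):
                G[a][b] += z[a] * z[b]

    accumulate(first)
    for row in it:              # single scan: data is never kept in RAM
        accumulate(row)
    return G
```

Reading the result: G[0][0] is n; the rest of row/column 0 holds L; the remaining block holds Q. Every model score the Gibbs sampler needs can then be evaluated from G alone.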
Gibbs Sampler
A Markov Chain Monte Carlo (MCMC) method. Consider a linear model with parameters Θ = {β, σ, γ}, where the binary vector γ[i] describes the variables selected at step i of the sequence. The Gibbs sampler generates the Markov chain of these vectors.
Optimizing Gibbs Sampler
Non-conjugate priors require computing the full Markov chain. Conjugate priors simplify the computation: they do not need the computation of β and σ. Under the Zellner g-prior, the probability of γ is given in closed form. We exploit the sufficient statistics calculated in the summarization step to accelerate the Gibbs sampler.
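The accelerated sampler can be sketched as a coordinate-wise Gibbs loop over the inclusion vector γ. In this plain-Python sketch, `gibbs_select` and `log_marginal` are illustrative names (not the prototype's code): `log_marginal(gamma)` stands for the closed-form log model score, which the three-step algorithm evaluates from the summarization matrix Γ rather than from the raw data:

```python
import math
import random

def gibbs_select(log_marginal, p, iters, seed=0):
    """Gibbs sampler over inclusion vectors gamma in {0,1}^p.
    log_marginal(gamma) returns the log posterior score of a model;
    in the three-step algorithm it is computed from the sufficient
    statistics, so no iteration ever scans the data set."""
    rng = random.Random(seed)
    gamma = [0] * p
    counts = [0] * p                # posterior inclusion counts
    for _ in range(iters):
        for j in range(p):          # resample each coordinate in turn
            gamma[j] = 1
            l1 = log_marginal(gamma)
            gamma[j] = 0
            l0 = log_marginal(gamma)
            p1 = 1.0 / (1.0 + math.exp(l0 - l1))
            gamma[j] = 1 if rng.random() < p1 else 0
        for j in range(p):
            counts[j] += gamma[j]
    return [c / iters for c in counts]   # marginal inclusion probabilities
```

Because each of the p flip decisions is priced from the pre-computed Γ, the per-iteration cost no longer depends on scanning the n data points, which is where the claimed speedup over data-scanning samplers comes from.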
Prototype in Array DBMS
- Pre-selection: two custom operators, preselect() and correlation()
- Summarization: custom operator gamma()
- Gibbs Sampler: R script, outside the DBMS
Prototype overview in Array DBMS
Summarization
Prototype in Row DBMS
- Pre-selection: SQL queries + user-defined function
- Summarization: aggregate UDF
- Gibbs Sampler: table-valued function
Overview in Row DBMS
Experiments
- Data sets used in this work
- Software specification
Results
Execution time for the Summarization and Gibbs sampling steps. Data set: Songs. Array DBMS.
Results
Execution time in seconds for Pre-selection and Summarization. Data set: mRNA. Array DBMS.
Execution time in seconds for the Gibbs sampler step. Data set: mRNA. Array DBMS. 30 thousand iterations.
Results
Execution time in seconds for Pre-selection, Summarization, and Gibbs Sampling. Data set: mRNA. Row DBMS. 30 thousand iterations.
Summarization Performance
Gibbs sampling performance
Posterior probabilities
Data set: mRNA. Plots for d=1000 and d=2000.
Quality of results
4-fold validation results. Data set: mRNA, d=2000.
Comparison
Three Step Algorithm (Array DBMS + R script):
  Hardware: Intel Quad Core. OS: Linux.
  Data set: mRNA, n=248, d=500. Prior: Zellner-g. N=100.
  Time: 16 sec. Model size: ~30.
BayesVarSel (R package):
  Hardware: Intel Quad Core. OS: Linux.
  Data set: mRNA, n=248, d=500. Prior: Zellner-g. N=100.
  Time: 2104 sec. Model size: ~180.
Comparison
Comparing three versions of the Three Step Algorithm with an R package. Data set: mRNA. 100 iterations. Time in seconds.
Conclusion
- We present a Bayesian variable selection algorithm that enables the analysis of very high dimensional data sets that may not fit in RAM.
- We propose a Pre-selection step based on marginal correlation ranking, in order to reduce the dimensionality of the original data set.
- We demonstrate that the Gibbs sampler works efficiently in both row and array DBMSs based on the summarization matrix, computed in only one pass.
- Our algorithm shows outstanding performance: compared to a public-domain R package, our prototype is two orders of magnitude faster.
- Our algorithm identifies small models, with R-squared generally better than 0.5.
Thank you!