Optimizing Bayesian Methods for High Dimensional Data Sets on Array-Based Parallel Database Systems
Wellington Cabrera. Advisor: Carlos Ordonez
Motivation
Nowadays, large data sets are present everywhere, with many more variables (attributes, features) than before:
- Microarray data
- Sensor data
- Network traffic data
- Medical imaging
Linear regression is a very popular analytical technique in:
- Economics
- Social science
- Biology
Problem
Linear regression, directly applied to high-dimensional data sets (hundreds or thousands of variables), leads to complex models. Variable selection methods focus on finding effective, small subsets of predictors. Current algorithms have several issues:
- They cannot work with data sets larger than RAM.
- They cannot deal with many thousands of dimensions.
- They have poor performance.
Contribution summary
Our work:
- Overcomes the problem of data sets larger than RAM.
- Facilitates the analysis of data sets with very high dimensionality.
- Introduces an accelerated Gibbs sampler that requires only one pass over the data set: 10s to 100s of times faster than the standard approach.
Related Work
- Classical approach: forward selection, backward elimination
- Information-based approach: Akaike's Information Criterion (AIC), Bayesian Information Criterion (BIC)
- Bayesian approach with Markov Chain Monte Carlo (MCMC) methods: GVS, SSVS
Challenges
- Huge search space: for a data set with p variables, the space size is 2^p. Even for a moderate p = 64 the search would take practically forever: if you could test one million cases per second, you would cover about 2^45 cases in a year.
- Large number of matrix operations: each iteration of MCMC requires the evaluation of p probabilities (one per variable), via a complex formula involving matrix inversion and several matrix multiplications.
- Large data sets do not fit in RAM, and disk-based computation is too slow.
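The search-space arithmetic above can be checked directly; a quick sketch in plain Python (the numbers are the slide's own assumptions):

```python
# Number of candidate variable subsets for p variables is 2^p.
p = 64
subsets = 2 ** p

# At one million subsets tested per second:
per_second = 10 ** 6
seconds_per_year = 365 * 24 * 3600  # ~3.15e7

tested_in_a_year = per_second * seconds_per_year
# ~3.15e13 subsets per year, on the order of 2^45 --
# a vanishing fraction of the 2^64 total.
print(tested_in_a_year)
print(tested_in_a_year / subsets)
```

At that rate, covering all of 2^64 would take roughly half a million years, which is why exhaustive search is hopeless and stochastic search (MCMC) is used instead.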
The Linear regression model
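The model equation on this slide did not survive extraction; as a reference point, the conventional linear model used in Bayesian variable selection can be written as follows (standard notation, assumed rather than copied from the slide):

```latex
% Linear regression with n observations and p predictor variables
y = X\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I_n)
% y: n x 1 outcome vector; X: n x p data matrix;
% beta: p x 1 coefficient vector; sigma^2: noise variance.
% Variable selection augments the model with a binary vector
% \gamma \in \{0,1\}^p, where \gamma_j = 1 means predictor j is included.
```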
Three Step Algorithm
1. Pre-selection
2. Summarization
3. Accelerated Gibbs sampler
Pre-selection
Reduces the data set from the original dimensionality p to a smaller dimensionality d. Pre-selection by marginal correlation ranking:
1. Calculate the correlation between each variable and the outcome.
2. Sort in descending order.
3. Take the best d variables.
As a result, the best d variables are considered for the rest of the process.
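The three ranking steps above can be sketched in plain Python (function names are illustrative, not the prototype's operators; ranking by absolute correlation is an assumption, since the slide only says "descending order"):

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def preselect(X_cols, y, d):
    """Rank the p columns of X by |correlation with y|; keep the best d."""
    ranked = sorted(range(len(X_cols)),
                    key=lambda j: abs(pearson(X_cols[j], y)),
                    reverse=True)
    return ranked[:d]
```

In the prototype this ranking runs inside the DBMS, so only the d selected columns (not all p) flow into the summarization step.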
Summarization
Data set properties can be summarized by the sufficient statistics n, L, Q. Such sufficient statistics are a compact representation of the data set, calculated in one pass. We do not load the data set into RAM.
Extended Summarization in one pass
Z = [1, X, Y],  Γ = Z Zᵀ
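The one-pass computation of Γ can be sketched in plain Python: each augmented point z_i = (1, x_i, y_i) is visited once, and only the small (d+2)×(d+2) accumulator stays in memory. This is a sketch, not the prototype's gamma() operator:

```python
def gamma_matrix(points):
    """One-pass Gamma = Z Z^T, where each augmented point is
    z_i = (1, x_i1, ..., x_id, y_i).
    Gamma packs the sufficient statistics n, L (linear sums)
    and Q (cross-products) into a single matrix."""
    it = iter(points)
    first = next(it)
    k = len(first) + 1          # d predictors + y, plus the leading 1
    G = [[0.0] * k for _ in range(k)]

    def accumulate(row):
        z = [1.0] + list(row)   # augmented point z_i
        for a in range(k):
            for b in range(k):
                G[a][b] += z[a] * z[b]

    accumulate(first)
    for row in it:              # single scan: data is never kept in RAM
        accumulate(row)
    return G
```

Reading the result: G[0][0] is n; the rest of row/column 0 holds L; the remaining block holds Q. Every model score the Gibbs sampler needs can then be evaluated from G alone.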
Gibbs Sampler
A Markov Chain Monte Carlo (MCMC) method. Consider a linear model with parameters Θ = {β, σ, γ}, where the binary vector γ[i] describes the variables selected at step i of the sequence. The Gibbs sampler generates the Markov chain of these vectors.
Optimizing Gibbs Sampler
Non-conjugate priors require computing the full Markov chain. Conjugate priors simplify the computation: they do not need the computation of β and σ. Under the Zellner g-prior, the probability of γ is given in closed form. We exploit the sufficient statistics calculated in the summarization step to accelerate the Gibbs sampler.
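The accelerated sampler can be sketched as a coordinate-wise Gibbs loop over the inclusion vector γ. In this plain-Python sketch, `gibbs_select` and `log_marginal` are illustrative names (not the prototype's code): `log_marginal(gamma)` stands for the closed-form log model score, which the three-step algorithm evaluates from the summarization matrix Γ rather than from the raw data:

```python
import math
import random

def gibbs_select(log_marginal, p, iters, seed=0):
    """Gibbs sampler over inclusion vectors gamma in {0,1}^p.
    log_marginal(gamma) returns the log posterior score of a model;
    in the three-step algorithm it is computed from the sufficient
    statistics, so no iteration ever scans the data set."""
    rng = random.Random(seed)
    gamma = [0] * p
    counts = [0] * p                # posterior inclusion counts
    for _ in range(iters):
        for j in range(p):          # resample each coordinate in turn
            gamma[j] = 1
            l1 = log_marginal(gamma)
            gamma[j] = 0
            l0 = log_marginal(gamma)
            p1 = 1.0 / (1.0 + math.exp(l0 - l1))
            gamma[j] = 1 if rng.random() < p1 else 0
        for j in range(p):
            counts[j] += gamma[j]
    return [c / iters for c in counts]   # marginal inclusion probabilities
```

Because each of the p flip decisions is priced from the pre-computed Γ, the per-iteration cost no longer depends on scanning the n data points, which is where the claimed speedup over data-scanning samplers comes from.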
Prototype in Array DBMS
- Pre-selection: two custom operators, preselect() and correlation()
- Summarization: custom operator gamma()
- Gibbs Sampler: R script, outside the DBMS
Prototype overview in Array DBMS
Summarization
Prototype in Row DBMS
- Pre-selection: SQL queries + user-defined function
- Summarization: aggregate UDF
- Gibbs Sampler: table-valued function
Overview in Row DBMS
Experiments
- Data sets used in this work
- Software specification
Results
Execution time for the Summarization and Gibbs sampling steps. Data set: Songs. Array DBMS.
Results
Execution time in seconds for Pre-selection and Summarization. Data set: mRNA. Array DBMS.
Execution time in seconds for the Gibbs sampler step. Data set: mRNA. Array DBMS. 30 thousand iterations.
Results
Execution time in seconds for Pre-selection, Summarization, and Gibbs Sampling. Data set: mRNA. Row DBMS. 30 thousand iterations.
Summarization Performance
Gibbs sampling performance
Posterior probabilities
Data set: mRNA. Plots for d=1000 and d=2000.
Quality of results
4-fold validation results. Data set: mRNA, d=2000.
Comparison
Three Step Algorithm (Array DBMS + R script):
  Hardware: Intel Quad Core. OS: Linux.
  Data set: mRNA, n=248, d=500. Prior: Zellner-g. N=100.
  Time: 16 sec. Model size: ~30.
BayesVarSel (R package):
  Hardware: Intel Quad Core. OS: Linux.
  Data set: mRNA, n=248, d=500. Prior: Zellner-g. N=100.
  Time: 2104 sec. Model size: ~180.
Comparison
Comparing three versions of the Three Step Algorithm with an R package. Data set: mRNA. 100 iterations. Time in seconds.
Conclusion
- We present a Bayesian variable selection algorithm that enables the analysis of very high dimensional data sets that may not fit in RAM.
- We propose a Pre-selection step based on marginal correlation ranking, in order to reduce the dimensionality of the original data set.
- We demonstrate that the Gibbs sampler works efficiently in both row and array DBMSs based on the summarization matrix, computed in only one pass.
- Our algorithm shows outstanding performance: compared to a public-domain R package, our prototype is two orders of magnitude faster.
- Our algorithm identifies small models, with R-squared generally better than 0.5.
Thank you!