Bayesian variable selection for linear regression in high dimensional microarray data
Wellington Cabrera, Carlos Ordonez, David S. Matusevich, Veerabhadran Baladandayuthapani (University of Houston; M.D. Anderson Cancer Center)
Introduction
- Linear regression: a linear model relating a dependent variable Y to one or more explanatory variables X1..Xp.
- Nowadays, high-dimensional data sets are common: documents, biomedical data.
- Smaller models are preferred.
- Bayesian statistics: better prediction and uncertainty estimation.
- MCMC variable selection is used for linear regression.
Introduction
Why data analysis in a DBMS?
- Speed
- Improved data security
- Query flexibility
- Avoid exporting and importing data
Introduction
Microarray datasets:
- Small n: a few hundred records (patients).
- Large p: thousands of explanatory variables.
- Our dataset: n = 240, p = 12,000.
- The dataset is reduced by correlation ranking to d ≤ 3,000.
[Figure: data matrix with data points X1..Xn and variables V1..Vp]
Definitions
Linear regression.
Variable selection is the search for the best subsets of variables that are good predictors of Y.
Assumption: the data set contains variables that are redundant.
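For reference, the underlying model and the variable-selection notation in a standard form (the indicator vector γ follows common Bayesian variable-selection conventions and is an assumption, not taken from the slide):

```latex
% Linear model with response Y and explanatory variables X_1..X_p:
\[
Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon,
\qquad \varepsilon \sim N(0,\sigma^2)
\]
% Variable selection searches over indicator vectors gamma, where
% gamma_j = 1 means X_j is included; k is the size of the selected model.
\[
\gamma = (\gamma_1,\dots,\gamma_p) \in \{0,1\}^p,
\qquad k = \sum_{j=1}^{p}\gamma_j
\]
```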
Challenges
Very high-dimensional dataset:
- Combinatorial problem: 2^p variable subsets.
- An exhaustive (brute-force) search is unfeasible.
- Greedy algorithms (e.g., stepwise selection) help, but produce suboptimal solutions.
- The Bayesian method identifies many promising models.
- MCMC requires thousands of iterations; N is the number of iterations.
- Each iteration considers up to p variables, so millions of probabilities (2 × p × N) must be calculated.
- Each probability calculation involves several matrix multiplications and one matrix inversion.
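For a rough sense of scale (an illustrative calculation using dataset sizes mentioned on other slides, not figures stated here): with d = 3,000 preselected variables and N = 30,000 iterations, about 2 × 3,000 × 30,000 = 1.8 × 10^8 probabilities must be evaluated, each involving a matrix inversion.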
Preselection
Dimensionality is reduced from p ≈ 10,000 to d < 3,000.
Two preselection methods:
- Coefficient of regression ranking
- Marginal correlation ranking
Choice: marginal correlation ranking of the features (a sketch follows below).
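A minimal sketch of this kind of marginal-correlation preselection (Python/NumPy; the function and variable names are illustrative assumptions, not the authors' UDF code):

```python
import numpy as np

def preselect_by_marginal_correlation(X, y, d):
    """Rank the p columns of X by |corr(X_j, y)| and keep the top d.

    X : (n, p) data matrix, y : (n,) response, d : number of variables to keep.
    Returns the indices of the d selected columns.
    """
    Xc = X - X.mean(axis=0)            # center each variable
    yc = y - y.mean()                  # center the response
    # Marginal (per-column) Pearson correlations with y.
    num = Xc.T @ yc
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    corr = num / den
    # Keep the d variables with the largest absolute correlation.
    return np.argsort(-np.abs(corr))[:d]

# Example (sizes from the slides): reduce p = 12,000 variables to d = 3,000.
# idx = preselect_by_marginal_correlation(X, y, d=3000); X_reduced = X[:, idx]
```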
Gibbs Sampler
- The Gibbs sampler is a Markov chain Monte Carlo method that obtains a sequence of model parameters drawn approximately from their posterior probability.
- This sequence of models is characterized by a vector γ that describes the variables selected at step i of the sequence.
- The Gibbs sampler uses the posterior probability as a criterion for the selection of promising sets of variables.
- Since we can sample one parameter of the model at a time, the Gibbs sampler can be applied (a sketch follows below).
- After N iterations we obtain the Markov chain sequence γ^(0), ..., γ^(B-1), γ^(B), γ^(B+1), ..., γ^(N), where B is the burn-in period.
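A compact sketch of this coordinate-at-a-time sampling loop (Python; `log_model_posterior` stands in for the posterior defined on the next slide, and all names and details are illustrative assumptions rather than the authors' implementation):

```python
import numpy as np

def gibbs_variable_selection(log_model_posterior, d, N, burn_in, rng=None):
    """Sample inclusion vectors gamma in {0,1}^d, one coordinate at a time.

    log_model_posterior(gamma) must return log p(gamma | y) up to a constant.
    Returns the list of vectors gamma^(burn_in), ..., gamma^(N).
    """
    rng = rng or np.random.default_rng()
    gamma = np.zeros(d, dtype=bool)           # start from the empty model
    chain = []
    for it in range(N):
        for j in range(d):                    # sample each indicator in turn
            gamma[j] = False
            log_p0 = log_model_posterior(gamma)   # posterior with X_j excluded
            gamma[j] = True
            log_p1 = log_model_posterior(gamma)   # posterior with X_j included
            # P(gamma_j = 1 | rest) from the two unnormalized posteriors.
            p1 = 1.0 / (1.0 + np.exp(log_p0 - log_p1))
            gamma[j] = rng.random() < p1
        if it >= burn_in:                     # keep states after the burn-in B
            chain.append(gamma.copy())
    return chain
```

Note the two posterior evaluations per variable per iteration, which is where the 2 × p × N probability calculations from the Challenges slide come from.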
Gibbs Sampler
- We base the computation of the model posterior on the informative Zellner g-prior, which enables sufficient statistics.
- Zellner's g-prior relies on a conditional Gaussian prior for β and an improper (Jeffreys) prior for σ².
- A second prior is placed on the size of the model, favoring small models.
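For reference, a commonly used closed form of this model posterior under Zellner's g-prior with parameter c (a textbook-style expression assuming a zero-mean g-prior and an always-included intercept; the exact parameterization used in the paper is an assumption here):

```latex
% Posterior probability of a variable subset gamma with k_gamma selected
% variables (intercept always included), combining the g-prior marginal
% likelihood with a model-size prior pi(gamma) that favors small models:
\[
p(\gamma \mid y) \;\propto\; \pi(\gamma)\,(c+1)^{-(k_\gamma+1)/2}
\left[\, y^{\top}y \;-\; \frac{c}{c+1}\, y^{\top} X_{\gamma}
\left( X_{\gamma}^{\top} X_{\gamma} \right)^{-1} X_{\gamma}^{\top} y \,\right]^{-n/2}
\]
```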
Optimizations
Sufficient statistics summarize the dataset:
- n, L and an initial projection of Q are calculated in one pass and stored in memory (see the sketch below).
- Selected columns of Q are calculated as the projections require them.
Low-frequency variable pruning:
- If a variable has a very low frequency after the burn-in period, it is likely not useful; such variables are removed.
Integrating LAPACK:
- Speeds up the matrix inversion and improves numerical accuracy.
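A minimal sketch of such a one-pass summarization (Python/NumPy; it assumes the common convention L = Σ x_i and Q = Σ x_i x_iᵀ, which is an interpretation of the slide rather than the paper's UDF code):

```python
import numpy as np

def summarize_one_pass(rows):
    """Compute n, L = sum(x_i) and Q = sum(x_i x_i^T) in a single scan.

    rows : an iterable of 1-D NumPy arrays of length d (one record at a time),
           so the full data matrix never has to be materialized or rescanned.
    """
    n, L, Q = 0, None, None
    for x in rows:
        if L is None:
            L = np.zeros(x.shape[0])
            Q = np.zeros((x.shape[0], x.shape[0]))
        n += 1
        L += x                         # linear sum
        Q += np.outer(x, x)            # quadratic sum (outer product)
    return n, L, Q

# Selected columns / sub-blocks of Q can then be read out per iteration,
# as the projections of the Gibbs sampler require them.
```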
Algorithm
Programming: UDFs in an RDBMS
- UDFs: preselection, data summarization.
- TVF: Gibbs sampler.
- Written in a high-level language, e.g. C++ or C#.
- As fast as SQL queries, and in certain cases faster.
- UDFs and TVFs benefit from the flexibility and speed of C-like languages.
In an array database (SciDB):
- Custom operators: data summarization (Γ), preselection.
- Written in C++.
- Fast, arrays, parallel, in-place data analytics.
High-d Bayesian Variable Selection
Time complexity:
- Pre-selection: O(nd + d log d).
- One iteration: O(nd + ndk^2 + dk^3), where k is the size of the current model.
- Since pre-selection runs only once, its cost is negligible.
Experimental Evaluation
- We use our algorithm to search for parsimonious (small) models of the survival time of patients suffering from brain tumors.
- We run experiments on 3 datasets (same patients): X1 (gene expression), X2 (microRNA expression), and X1 U X2 (joint analysis of gene and microRNA expression).
- Variables for X1, X2 and X1 U X2 were preselected (on the order of 1,000 per dataset), as explained in the previous slides.
Consistency of experiments
Posterior probabilities of the variables across experiments. Dataset X1 U X2, with parameters: d=2000, c=200000, N=30000.
R² and size of the model (k) for several experiments

Dataset  |    d |      c |      I |    T |    R² |  k | R²max | kmax
X1       | 2000 | 120000 |  30000 | 2:18 | 0.516 | 20 | 0.736 |   34
         | 1000 |  50000 | 100000 | 2:44 | 0.461 | 21 | 0.694 |   48
X2       |  534 |   4000 |        | 1:44 | 0.278 | 29 | 0.444 |   63
X1 U X2  |      | 200000 |        | 2:36 | 0.487 | 14 | 0.690 |
         |      |        |        | 3:42 | 0.531 | 19 | 0.793 |   42
Top 5 markers for dataset X1 U X2
Our experiments find some top markers that have been previously implicated in the literature, along with new markers that could merit further functional validation. hsa-mir-222 and hsa-mir-223 have been identified before by several researchers as part of the Glioblastoma multiforme (GBM) prediction signature.

Variable name | Probability
hsa-mir-223   | >0.81
55711_at      | >0.39
79647_at      | >0.32
51164_at      | >0.27
hsa-mir-222   | >0.25
Performance
In comparison with R, our algorithm shows a 30- to 100-fold time improvement, depending on the number of dimensions of the experiment. For instance, in the case of d = 534 (dataset X2), R performs 1,000 iterations in 8,378 seconds, whereas our algorithm completes its iterations in 8,620 seconds; a similar run in R would take almost 3 days to complete.
Conclusions
- Dramatic improvement over a popular implementation in R: 30 to 100 times faster.
- Accurate results: small models are obtained, with R² better than 0.50 and R²max better than 0.69 in most cases.
- Optimizations successfully implemented in a DBMS: sufficient statistics, infrequent-variable pruning, LAPACK.
- Small models are obtained using appropriate tuning: parameter c large enough, and a prior on model size favoring small models.
Data Mining Algorithms as a Service in the Cloud Exploiting Relational Database Systems
C. Ordonez, J. Garcia-Garcia, C. Garcia-Alvarado, W. Cabrera, V. Baladandayuthapani, M. Quraishi
Overview
- Smart local processing: exploit CPU/RAM of the local DBMS.
- Integrated: tightly integrated with two DBMSs.
- Fast: one pass over the input table for most algorithms; parallel.
- Simple: calling an algorithm is simple, a stored procedure with default parameters.
- Relational: relational tables store models and job parameters.
Statistical Models
- SSVS: Bayesian variable selection in linear regression
- K-means clustering
System
Features
- Programmed with UDFs, queries, and an ODBC client.
- Processing modes: hybrid, local, cloud.
- Summarization combined with sampling locally.
- LAPACK: fast, accurate, stable.
- Efficient: non-blocking delivery of summaries, matrix computations in RAM, parallel, local sampling.
- Management: job scheduling, job history.
Features
- All models exploit common matrices L, Q: full Q (PCA, SSVS) or diagonal Q (NB, K-means).
- UDF: arrays, RAM, parallel, multithreaded.
- L, Q computable in one parallel scan in RAM.
- Model computed in RAM, with equations rewritten in terms of n, L, Q instead of X (avoiding multiple scans); see the example below.
- Aggregate UDFs (UDAs) to summarize, in RAM.
- Statistical models computed in the cloud, receiving summaries (n, L, Q) from the client.
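As one example of such a rewrite (a standard identity, assuming the usual definitions L = Σ x_i and Q = Σ x_i x_iᵀ, which the slide does not define explicitly):

```latex
% Mean vector and covariance matrix obtained from the one-pass summaries,
% without rescanning the data matrix X:
\[
\mu = \frac{L}{n},
\qquad
\Sigma = \frac{Q}{n} - \frac{L}{n}\left(\frac{L}{n}\right)^{\top}
\]
```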
SUMMARY
- Sufficient statistics transmitted to the cloud.
- Hybrid processing is best.
- Job policy: FIFO -> SJF -> RR.
- Parallel summarization, parallel scan.
- Model computation in RAM in the cloud.
- Complicated number crunching in the cloud.
- Job and model history kept in the cloud.
- All data is stored as relational tables: it can be queried and stored securely.