Bayesian variable selection for linear regression in high dimensional microarray data
Wellington Cabrera, Carlos Ordonez, David S. Matusevich, Veerabhadran Baladandayuthapani
University of Houston; M.D. Anderson Cancer Center

Introduction
- Linear regression: a linear model with a dependent variable Y and one or more explanatory variables X1..Xp.
- High-dimensional data sets are now common: documents, biomedical data.
- Smaller models are preferred.
- Bayesian statistics: better prediction and uncertainty estimation.
- MCMC variable selection is used for linear regression.

Introduction
Why data analysis in a DBMS?
- Speed
- Improved data security
- Query flexibility
- Avoid exporting and importing data

Introduction
Microarray datasets:
- Small n: a few hundred records (patients).
- Large p: thousands of explanatory variables.
- Our dataset: n=240, p=12000.
- The dataset is reduced by correlation ranking to d ≤ 3000.
[Figure: data matrix with n data points (rows X1..Xn) and p variables (columns V1..Vp)]

Definitions
- Linear regression: models Y as a linear combination of the explanatory variables.
- Variable selection is the search for the best subsets of variables that are good predictors of Y.
- Assumption: the data set contains variables that are redundant.
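The slide gives only the terms; in standard notation (not taken verbatim from the slides), the model and the selection indicators can be written as:

$$ Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2) $$

Variable selection searches over indicator vectors $\gamma = (\gamma_1,\ldots,\gamma_p)$ with $\gamma_j \in \{0,1\}$, where $\gamma_j = 1$ means variable $X_j$ is included in the model.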

Challenges
- Very high dimensional dataset.
- Combinatorial problem: 2^p variable subsets; an exhaustive/brute-force search is infeasible.
- Greedy algorithms (e.g., stepwise selection) help, but produce suboptimal solutions.
- The Bayesian method identifies many promising models.
- MCMC requires thousands of iterations; N denotes the number of iterations.
- Each iteration considers up to p variables, so millions of probabilities (2 × p × N) must be calculated (see the worked count below).
- Each probability calculation involves several matrix multiplications and one matrix inversion.
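To make the count concrete, using the sizes reported later in the talk (d ≤ 3000 preselected variables and N = 30000 iterations; this is an illustration, not a figure stated on the slide):

$$ 2 \times d \times N = 2 \times 3000 \times 30000 = 1.8 \times 10^{8} \text{ probability evaluations.} $$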

Preselection
- Dimensionality is reduced from p ≈ 10,000 to d < 3000.
- Two preselection methods: coefficient of regression ranking, marginal correlation ranking.
- Choice: marginal correlation ranking of the features.

Gibbs Sampler
- The Gibbs sampler is a Markov chain Monte Carlo method that obtains a sequence of model parameters approximately drawn from their posterior probability.
- The sequence of models is characterized by a vector γ_i that describes the variables selected at step i of the sequence.
- The Gibbs sampler uses the posterior probability as the criterion for selecting promising sets of variables.
- Since we can sample one parameter of the model at a time, the Gibbs sampler can be applied.
- After N iterations we obtain the Markov chain sequence $\gamma_0, \ldots, \gamma_{B-1}, \gamma_B, \gamma_{B+1}, \ldots, \gamma_N$, where B marks the end of the burn-in period (a toy sketch of the sampling loop follows).
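As a minimal illustration of the loop described above, the C++ sketch below resamples each inclusion indicator from its conditional probability and accumulates post-burn-in selection frequencies. It is a toy: logPosterior is a hypothetical placeholder (the real computation uses the Zellner G-prior and the sufficient statistics described on the next slides), and the constants d, N and burnIn are arbitrary.

```cpp
// Toy Gibbs sampler over variable-inclusion indicators (sketch only).
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Hypothetical stand-in for the log posterior of a model; in the talk this
// is computed from the Zellner G-prior using n, L, Q and matrix algebra.
double logPosterior(const std::vector<int>& gamma) {
    int k = 0;
    for (int g : gamma) k += g;
    return -0.5 * k;   // placeholder that merely favors small models
}

int main() {
    const int d = 20, N = 1000, burnIn = 200;   // illustrative sizes
    std::vector<int> gamma(d, 0);               // inclusion indicators
    std::vector<int> freq(d, 0);                // post-burn-in selection counts
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    for (int it = 0; it < N; ++it) {
        for (int j = 0; j < d; ++j) {           // one conditional draw per variable
            gamma[j] = 1; double lp1 = logPosterior(gamma);
            gamma[j] = 0; double lp0 = logPosterior(gamma);
            double p1 = 1.0 / (1.0 + std::exp(lp0 - lp1));
            gamma[j] = (unif(rng) < p1) ? 1 : 0;
        }
        if (it >= burnIn)
            for (int j = 0; j < d; ++j) freq[j] += gamma[j];
    }
    for (int j = 0; j < d; ++j)
        std::printf("variable %d: marginal prob %.3f\n", j,
                    double(freq[j]) / (N - burnIn));
    return 0;
}
```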

Gibbs Sampler
- We base the computation of the posterior probability on the informative Zellner G-prior, which enables the use of sufficient statistics.
- Zellner's G-prior relies on a conditional Gaussian prior for β and an improper (Jeffreys) prior for σ².
- A second prior is placed on the size of the model, favoring small models (a standard form is sketched below).
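The equation shown on the slide did not survive the transcript; a standard form of the Zellner g-prior (with the scale written as c to match the parameter c used in the experiments) is, as an assumption about what the slide showed:

$$ \beta_\gamma \mid \sigma^2, \gamma \sim N\!\left(0,\; c\,\sigma^2\,(X_\gamma^{\top} X_\gamma)^{-1}\right), \qquad p(\sigma^2) \propto 1/\sigma^2, $$

together with a prior on the model size k = |γ| that places more mass on small k.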

Optimizations
- Sufficient statistics summarize the dataset: n, L and an initial projection of Q are calculated in one pass and stored in memory; additional columns of Q are calculated as the projections require them (a one-pass sketch follows).
- Low-frequency variable pruning: if a variable is selected with very low frequency after the burn-in period, it is likely not useful, so such variables are removed.
- Integrating LAPACK: speeds up the matrix inversion and improves numerical accuracy.
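A minimal sketch of the one-pass summarization, assuming a plain in-memory matrix rather than a DBMS table; the names n, L, Q follow the slides, everything else is illustrative.

```cpp
// One-pass computation of the sufficient statistics n, L, Q (sketch only;
// column-subset handling and DBMS integration are omitted).
#include <cstdio>
#include <vector>

int main() {
    // Toy dataset: rows are data points; in the talk these come from a table.
    std::vector<std::vector<double>> X = {{1.0, 2.0}, {2.0, 0.5}, {3.0, 1.5}};
    const size_t d = X[0].size();

    long long n = 0;
    std::vector<double> L(d, 0.0);                                       // column sums
    std::vector<std::vector<double>> Q(d, std::vector<double>(d, 0.0));  // X^T X

    for (const auto& x : X) {        // single scan over the data
        ++n;
        for (size_t a = 0; a < d; ++a) {
            L[a] += x[a];
            for (size_t b = 0; b < d; ++b) Q[a][b] += x[a] * x[b];
        }
    }
    std::printf("n=%lld  L[0]=%.2f  Q[0][0]=%.2f\n", n, L[0], Q[0][0]);
    return 0;
}
```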

Algorithm

Programming: UDFs in a RDBMS
- UDFs: preselection, data summarization. TVF: Gibbs sampler.
- Written in a high-level language (e.g., C++ or C#).
- As fast as SQL queries, and in certain cases faster; UDFs and TVFs benefit from the flexibility and speed of C-like languages.
In an array database (SciDB)
- Custom operators: data summarization (Γ), preselection.
- Written in C++.
- Fast, array-based, parallel, in-place data analytics.

High-d Bayesian Variable Selection: Time Complexity
- Pre-selection: O(nd + d log d)
- One iteration: O(nd + ndk² + dk³)
- Since pre-selection runs only once, its cost is negligible.
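Combining the per-iteration cost with N iterations gives the total running time (my own aggregation of the terms on the slide, not a formula stated in the talk):

$$ T_{\text{total}} = O\!\left(nd + d\log d + N\,(nd + ndk^2 + dk^3)\right), $$

where k is the number of variables in the current model.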

Experimental Evaluation
- We use our algorithm to search for parsimonious (small) models of the survival time of patients suffering from brain tumors.
- We run experiments on 3 datasets (same patients): X1 (gene expression), X2 (microRNA expression), X1 ∪ X2 (joint analysis of gene and microRNA expression).
- Between 1000 and 3000 variables were preselected for X1, X2 and X1 ∪ X2, as explained in the previous slides.

Consistency of experiments
[Figure: posterior probabilities of the variables across experiments. Dataset X1 ∪ X2, with parameters d=2000, c=200000, N=30000.]

R² and size of the model (k) for several experiments
(I = iterations, T = running time; blank cells did not survive the transcript, most likely merged with the row above on the original slide)

Dataset  | d    | c      | I      | T    | R²    | k  | R²MAX | kMAX
X1       | 2000 | 120000 | 30000  | 2:18 | 0.516 | 20 | 0.736 | 34
X1       | 1000 | 50000  | 100000 | 2:44 | 0.461 | 21 | 0.694 | 48
X2       | 534  | 4000   |        | 1:44 | 0.278 | 29 | 0.444 | 63
X1 ∪ X2  |      | 200000 |        | 2:36 | 0.487 | 14 | 0.690 |
X1 ∪ X2  |      |        |        | 3:42 | 0.531 | 19 | 0.793 | 42

Top 5 markers for dataset X1 ∪ X2
- Our experiments find some top markers that have been previously implicated in the literature, along with new markers that could merit further functional validation.
- hsa-mir-222 and hsa-mir-223 have been identified before by several researchers as part of the Glioblastoma multiforme (GBM) prediction signature.

Top 5 markers for dataset X1 ∪ X2
Variable Name | Probability
hsa-mir-223   | >0.81
55711_at      | >0.39
79647_at      | >0.32
51164_at      | >0.27
hsa-mir-222   | >0.25

Performance
- In comparison with R, our algorithm shows a 30 to 100 fold time improvement, depending on the number of dimensions of the experiment.
- For instance, in the case of d = 534 (dataset X2), R performs 1000 iterations in 8378 seconds, whereas we perform 30000 iterations in 8620 seconds. A comparable run in R would take almost 3 days to complete.
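The "almost 3 days" figure follows directly from the reported numbers (a back-of-the-envelope check, not an equation shown on the slide):

$$ 30000 \text{ iterations} \times \frac{8378 \text{ s}}{1000 \text{ iterations}} \approx 251{,}340 \text{ s} \approx 2.9 \text{ days}, \qquad \frac{251{,}340}{8620} \approx 29\times \text{ speedup}. $$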

Conclusions
- Dramatic improvement over a popular implementation in R: 30-100 times faster.
- Accurate results: small models are obtained, with R² better than 0.50 and R²MAX better than 0.69 in most cases.
- Optimizations successfully implemented in a DBMS: sufficient statistics, infrequent-variable pruning, LAPACK.
- Small models are obtained with appropriate tuning: parameter c large enough, and a prior on model size favoring small models.

Data Mining Algorithms as a Service in the Cloud Exploiting Relational Database Systems
C. Ordonez, J. Garcia-Garcia, C. Garcia-Alvarado, W. Cabrera, V. Baladandayuthapani, M. Quraishi

Overview
- Smart local processing: exploit CPU/RAM of the local DBMS.
- Integrated: tightly integrated with two DBMSs.
- Fast: one pass over the input table for most algorithms; parallel.
- Simple: calling an algorithm is simple, via a stored procedure with default parameters.
- Relational: relational tables store models and job parameters.

Statistical Models
- SSVS: Bayesian variable selection in linear regression
- K-means clustering

System

Features
- Programmed with UDFs, queries, and an ODBC client.
- Processing modes: hybrid, local, cloud.
- Summarization combined with sampling locally.
- LAPACK: fast, accurate, stable.
- Efficient: non-blocking delivery of summaries, matrix computations in RAM, parallel, local sampling.
- Management: job scheduling, job history.

Features
- All models exploit the common matrices L, Q: full Q (PCA, SSVS) / diagonal Q (NB, K-means).
- UDFs: arrays, RAM, parallel, multithreaded.
- L, Q are computable in one parallel scan, in RAM.
- Models are computed in RAM, with the equations rewritten in terms of n, L, Q instead of X (avoiding multiple scans).
- Aggregate UDFs (UDAs) summarize the data in RAM.
- Statistical models are computed in the cloud, which receives the summaries (n, L, Q) from the client (see the block form sketched below).
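For reference, in the Gamma (Γ) summarization approach these slides build on, the three summaries fit into a single matrix; the block form below is a reconstruction in standard notation, not a formula taken from the slides:

$$ \Gamma = Z^{\top} Z = \begin{bmatrix} n & L^{\top} \\ L & Q \end{bmatrix}, \qquad L = \sum_{i=1}^{n} x_i, \quad Q = X^{\top} X, $$

where each row of Z is a data point $x_i$ prepended with a constant 1.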

SUMMARY
- Sufficient statistics are transmitted to the cloud.
- Hybrid processing is best.
- Job policy: FIFO -> SJF -> RR.
- Parallel summarization, parallel scan.
- Model computation in RAM in the cloud.
- Complicated number crunching happens in the cloud.
- Job and model history are kept in the cloud.
- All data is stored as relational tables: they can be queried and stored securely.