Optimizing Bayesian Methods for High Dimensional Data Sets on Array-Based Parallel Database Systems
Wellington Cabrera
Advisor: Carlos Ordonez
Welcome everybody. Thank you for coming this morning. I am glad to defend my Master's thesis today.
Motivation
Nowadays, large data sets are present everywhere, with many more variables (attributes, features) than before:
- Microarray data
- Sensor data
- Network traffic data
- Medical imaging
Linear regression is a fundamental analytical technique in economics, social science, and biology.
Statistical models in the DBMS benefit from data security, concurrency, and recovery.
Microarray data consist of measures of gene expression levels, yielding data sets with several thousand variables. Sensor data come from large networks of sensors. In medical imaging, each pixel is considered a dimension.
Problem
Linear regression, directly applied to high-dimensional data sets, leads to complex models:
- Hundreds or thousands of variables
- Difficult to interpret
- A solution that cannot be applied in practice
Variable selection methods focus on finding effective, small subsets of predictors.
Current algorithms have several issues:
- Cannot work with data sets larger than RAM
- Cannot deal with many thousands of dimensions
- Inefficiency
Notation: d = dimensions, n = observations. Cases: d < n and d > n.
Contribution summary
Our work:
- Enables the analysis of very high dimensional data sets in the DBMS
- Overcomes the problem of data sets larger than RAM (d < n)
- Introduces an accelerated Gibbs sampler that requires only one pass over the data set, runs thousands of iterations in main memory, and is 10 to 100 times faster than the standard approach
Related Work
Classical approach:
- Forward selection
- Backward elimination
Information-based approach:
- Akaike's Information Criterion (AIC)
- Bayesian Information Criterion (BIC)
Bayesian approach:
- Nowadays, approximation by Markov Chain Monte Carlo (MCMC) methods
- GVS, SSVS
Classical approaches rely on greedy algorithms. MCMC has gained popularity thanks to increased computing power.
Challenges
Huge search space:
- For a data set with p variables, the search space has 2^p models
- Even for a moderate p = 64 the exhaustive search would take practically forever: testing one million models per second covers about 2^45 models in a year (see the worked arithmetic below)
- We reduce the original dimensionality p (many thousands) to a smaller dimensionality d (one or two thousand)
Large number of matrix operations:
- Each iteration of MCMC requires the evaluation of p probabilities (one per variable)
- Complex formulas involving matrix inversion and several matrix multiplications
Large data sets do not fit in RAM; disk-based computation is too slow.
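As a quick check of the figures on this slide, the arithmetic works out as follows:

```latex
10^{6}\ \tfrac{\text{models}}{\text{s}} \times 3.15\times 10^{7}\ \tfrac{\text{s}}{\text{year}}
  \approx 3.15\times 10^{13} \approx 2^{45}\ \tfrac{\text{models}}{\text{year}},
\qquad
\frac{2^{64}}{2^{45}} = 2^{19} \approx 5\times 10^{5}\ \text{years}.
```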
The Linear regression model
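This slide shows the model as a figure; for reference, a sketch of the standard linear regression setup and its variable-selection form, with notation assumed from the surrounding slides:

```latex
y = \beta_0 + X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^{2} I_n),
\qquad y \in \mathbb{R}^{n},\; X \in \mathbb{R}^{n \times d},\; \beta \in \mathbb{R}^{d};
```

```latex
\text{with variable selection: } \quad
y = \beta_0 + X_{\gamma}\beta_{\gamma} + \varepsilon,
\qquad \gamma \in \{0,1\}^{d},
```

where X_γ keeps only the columns of X selected by γ.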
Three Step Algorithm
1. Pre-selection
2. Summarization
3. Accelerated Gibbs sampler
Pre-selection
Reduces the data set from the original dimensionality p to a smaller dimensionality d.
Pre-selection by marginal correlation ranking:
- Calculate the correlation between each variable and the outcome
- Sort in descending order
- Take the best d variables
As a result, the top d variables are considered for further analysis (see the sketch below).
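A minimal in-memory sketch of this ranking in R (the function name is hypothetical; in the thesis this step runs inside the DBMS as queries or a custom operator):

```r
# Pre-selection by marginal correlation ranking (sketch).
# X: n x p matrix of predictors, y: outcome vector, d: target dimensionality.
preselect <- function(X, y, d) {
  r <- abs(cor(X, y))                         # |marginal correlation| of each column with y
  keep <- order(r, decreasing = TRUE)[1:d]    # indices of the top d variables
  list(index = keep, X = X[, keep, drop = FALSE])
}

# Usage: sel <- preselect(X, y, d = 1000); Xd <- sel$X
```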
Summarization
Data set properties can be summarized by the sufficient statistics n, L, Q.
Such sufficient statistics are a compact representation of the data set, calculated in one pass; we do not load the data set into RAM.
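In the notation of the related Gamma-operator work (points x_i as d x 1 column vectors), the three sufficient statistics are the count, the linear sum, and the quadratic sum of the points:

```latex
n = |X|, \qquad
L = \sum_{i=1}^{n} x_i \in \mathbb{R}^{d}, \qquad
Q = \sum_{i=1}^{n} x_i x_i^{T} = X X^{T} \in \mathbb{R}^{d \times d}.
```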
Extended Summarization in one pass
Z = [1, X, Y]
Γ = Z Z^T
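A minimal sketch of this one-pass computation in R, assuming the data set is read in row-blocks so it never has to fit in RAM; the reader function and block layout are hypothetical (in the thesis Γ is computed by the gamma() operator in the Array DBMS or by an aggregate UDF). With observations stored as rows, crossprod(Z) = Z^T Z equals the Z Z^T of the column-oriented notation above:

```r
# One-pass Gamma summarization (sketch).
# Each row of a block is one observation (x_1, ..., x_d, y).
gamma_summarize <- function(read_block) {
  Gamma <- NULL
  repeat {
    block <- read_block()                 # next block of rows, or NULL at end of data
    if (is.null(block)) break
    Z <- cbind(1, as.matrix(block))       # augment each row: z_i = (1, x_i, y_i)
    G <- crossprod(Z)                     # sum of z_i z_i^T for this block
    Gamma <- if (is.null(Gamma)) G else Gamma + G
  }
  Gamma   # (d+2) x (d+2) matrix holding n, L, Q and the cross-products with Y
}
```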
Gibbs Sampler
Markov Chain Monte Carlo (MCMC) method.
Consider a linear model with parameters Θ = {β, σ, γ}, where the binary vector γ[i] describes the variables selected at step i of the sequence.
The Gibbs sampler generates the Markov chain.
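In symbols, one common way to write the chain the sampler produces (a sketch; γ_{-j} denotes all entries of γ except the j-th):

```latex
\gamma^{(0)} \rightarrow \gamma^{(1)} \rightarrow \cdots \rightarrow \gamma^{(N)},
\qquad
\gamma_j^{(i)} \sim p\!\left(\gamma_j \mid \gamma_{-j}^{(i)},\, X,\, y\right),
\quad j = 1, \ldots, d .
```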
Optimizing the Gibbs Sampler
Non-conjugate Gaussian priors require the full Markov chain; conjugate priors simplify the computation: β and σ are integrated out.
Marin-Robert formulation: Zellner g-prior for β and Jeffreys prior for σ.
Reading the data set in each iteration is not necessary: the matrix products already in Γ are reused (see the sketch below).
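A minimal sketch of one Gibbs sweep over γ driven only by sub-blocks of Γ, assuming a flat prior on γ, a zero prior mean for β, and g = n; the function names are hypothetical and the marginal-likelihood expression is the usual Marin-Robert form up to a constant, not necessarily the exact formula used in the thesis:

```r
# One Gibbs sweep over the inclusion vector gamma, reusing only sub-blocks of
# the summarization matrix Gamma (no pass over the data).
# Layout assumed: Gamma = Z^T Z with Z = [1, X, y], so
#   Gamma[1, 1] = n,  Gamma[2:(d+1), 2:(d+1)] = X^T X,
#   Gamma[2:(d+1), d+2] = X^T y,  Gamma[d+2, d+2] = y^T y.

log_marglik <- function(sel, Gamma, d, n, g) {
  # Log marginal likelihood (up to a constant) of the model using variables 'sel',
  # under a Zellner g-prior with zero mean and a Jeffreys prior on sigma.
  idx <- c(1, 1 + sel)                           # intercept column plus selected variables
  XtX <- Gamma[idx, idx, drop = FALSE]
  Xty <- Gamma[idx, d + 2]
  yty <- Gamma[d + 2, d + 2]
  fit <- drop(crossprod(Xty, solve(XtX, Xty)))   # y^T X (X^T X)^{-1} X^T y
  -0.5 * length(idx) * log(1 + g) - 0.5 * n * log(yty - (g / (1 + g)) * fit)
}

gibbs_sweep <- function(gamma, Gamma, d, n, g = n) {
  for (j in 1:d) {
    g1 <- gamma; g1[j] <- 1                      # model with variable j included
    g0 <- gamma; g0[j] <- 0                      # model with variable j excluded
    l1 <- log_marglik(which(g1 == 1), Gamma, d, n, g)
    l0 <- log_marglik(which(g0 == 1), Gamma, d, n, g)
    p1 <- 1 / (1 + exp(l0 - l1))                 # P(gamma_j = 1 | rest), flat prior on gamma
    gamma[j] <- rbinom(1, 1, p1)
  }
  gamma
}
```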
Algorithm in the Array DBMS (most recent contribution)
- Pre-selection: two custom operators, preselect() and correlation()
- Summarization: custom operator gamma()
- Gibbs Sampler: R script, outside the DBMS
First work addressing Bayesian models in an Array DBMS.
System overview in Array DBMS
Summarization
Algorithm in the Row DBMS (previous contribution)
- Pre-selection: SQL queries + user-defined function
- Summarization: aggregate UDF
- Gibbs Sampler: table-valued function
System overview in Row DBMS
Experiments
Data sets used in this work
Software specification
Posterior probabilities. Data set mRNA. Plots for d=1000 and d=2000.
Quality of results: 4-fold validation results. Data set mRNA, d=2000.
Comparison: Three Step Algorithm vs. BayesVarSel
Three Step Algorithm (Array DBMS + R script): Hardware: Intel Quad Core; OS: Linux; Data set: mRNA, n=248, d=500; Prior: Zellner-g; N=100 iterations; time = 16 s; model size ~ 30.
BayesVarSel (R package): Hardware: Intel Quad Core; OS: Linux; Data set: mRNA, n=248, d=500; Prior: Zellner-g; N=100 iterations; time = 2104 s; model size ~ 180.
Comparison: comparing three versions of the Three Step Algorithm with an R package. Data set mRNA. 100 iterations. Time in seconds.
Performance Results: Execution time for the Summarization and Gibbs sampling steps. Data set Songs. Array DBMS.
Performance Results: Execution time in seconds for Pre-selection and Summarization. Data set mRNA. Array DBMS.
Execution time in seconds for the Gibbs sampler step. Data set mRNA. Array DBMS. 30 thousand iterations.
Performance Results: Execution time in seconds for Pre-selection, Summarization, and Gibbs Sampling. Data set mRNA. Row DBMS. 30 thousand iterations.
Summarization Performance
Gibbs sampling performance
Conclusions
- We present a Bayesian variable selection algorithm that enables the analysis of very high dimensional data sets that may not fit in RAM.
- We propose a Pre-selection step based on marginal correlation ranking, in order to reduce the dimensionality of the original data set.
- The Array DBMS is more suitable for computing statistical models, supporting very large arrays; a custom operator is required for optimal performance.
- We demonstrate that the Gibbs sampler works optimally in both Row and Array DBMSs based on the summarization matrix, computed in only one pass.
- Our algorithm shows outstanding performance: compared to a public domain R package, our prototype is two orders of magnitude faster.
- Our algorithm identifies small models, with R-squared generally better than 0.5.
Future work
We will explore different kernels and proposal functions for the Metropolis-Hastings algorithm:
- To find results in fewer iterations
- To choose kernels/proposals suitable for optimization via data summarization
Improve the algorithm for data sets larger than RAM.
Explore non-informative priors.
List of publications
- Carlos Ordonez, Yiqun Zhang, Wellington Cabrera: The Gamma Operator for Big Data Summarization on an Array DBMS. BigMine 2014: 88-103
- Wellington Cabrera, Carlos Ordonez, David Sergio Matusevich, Veerabhadran Baladandayuthapani: Bayesian variable selection for linear regression in high dimensional microarray data. DTMBIO 2013: 17-18
- Wellington Cabrera, Carlos Ordonez, David Sergio Matusevich, Veerabhadran Baladandayuthapani: Fast Bayesian Variable Selection Algorithms for High Dimensional Genomics Data. IJDMB (under revision)
- Carlos Ordonez, Javier García-García, Carlos Garcia-Alvarado, Wellington Cabrera, Veerabhadran Baladandayuthapani, Mohammed S. Quraishi: Data mining algorithms as a service in the cloud exploiting relational database systems. SIGMOD Conference 2013: 1001-1004
- Carlos Ordonez, Sofian Maabout, David Sergio Matusevich, Wellington Cabrera: Extending ER models to capture database transformations to build data sets for data mining. Data Knowl. Eng. 89: 38-54 (2014)
Thank you!