Optimizing Bayesian Methods for High Dimensional Data Sets on Array-Based Parallel Database Systems
Wellington Cabrera
Advisor: Carlos Ordonez

Motivation
Nowadays, large data sets are present everywhere, with many more variables (attributes, features) than before:
- Microarray data
- Sensor data
- Network traffic data
- Medical imaging
Linear regression is a very popular analytical technique in:
- Economics
- Social science
- Biology

Problem
Linear regression, directly applied to high-dimensional data sets (hundreds or thousands of variables), leads to overly complex models.
Variable selection methods address this by finding effective, small subsets of predictors.
However, current algorithms have several issues:
- They cannot work with data sets larger than RAM
- They cannot deal with many thousands of dimensions
- They show poor performance

Contribution summary
Our work:
- Overcomes the problem of data sets larger than RAM
- Enables the analysis of data sets with very high dimensionality
- Introduces an accelerated Gibbs sampler that requires only one pass over the data set, 10 to 100 times faster than the standard approach

Related Work
Classical approach: forward selection, backward elimination
Information-based approach: Akaike's Information Criterion (AIC), Bayesian Information Criterion (BIC)
Bayesian approach with Markov Chain Monte Carlo (MCMC) methods: GVS, SSVS

Challenges
Huge search space:
- For a data set with p variables, the search space has size 2^p
- Even for a moderate p=64, exhaustive search would take practically forever: at 1 million cases per second you cover about 2^45 cases per year (10^6 × 86,400 × 365 ≈ 3.15 × 10^13 ≈ 2^45), so 2^64 cases would take roughly half a million years
Large number of matrix operations:
- Each iteration of MCMC requires the evaluation of p probabilities (one per variable)
- Each probability involves a complex formula with a matrix inversion and several matrix multiplications
Large data sets do not fit in RAM, and plain disk-based computation is too slow

The Linear regression model
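In standard notation, with outcome vector Y, n × p data matrix X, coefficient vector β, and Gaussian noise, the model is:

$$Y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n)$$

With variable selection, only the predictors indexed by a binary vector γ enter the model: Y = X_γ β_γ + ε.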

Three Step Algorithm
1. Pre-selection
2. Summarization
3. Accelerated Gibbs sampler

Pre-selection
Reduces the data set from the original dimensionality p to a smaller dimensionality d.
Pre-selection by marginal correlation ranking:
1. Calculate the correlation between each variable and the outcome
2. Sort in descending order
3. Take the best d variables
As a result, only the best d variables are considered for the rest of the process (see the sketch below).
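A minimal R sketch of this step (in-memory; the actual prototype implements it as the correlation() and preselect() DBMS operators described later; ranking by absolute correlation is an assumption, the slide only says "descending order"):

```r
# Pre-selection by marginal correlation ranking (illustrative sketch).
# X: n x p predictor matrix, y: length-n outcome vector, d: target dimensionality.
preselect <- function(X, y, d) {
  r <- abs(cor(X, y))                       # marginal correlation of each column with y
  keep <- order(r, decreasing = TRUE)[1:d]  # indices of the d best-ranked variables
  X[, keep, drop = FALSE]                   # reduced data set with d columns
}
```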

Summarization
The data set's properties can be summarized by the sufficient statistics n, L, Q, a compact representation of the data set.
They are calculated in one pass; we do not load the data set into RAM.
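With the usual definitions (stated here for concreteness): n is the record count, L the vector of linear sums, and Q the matrix of quadratic sums:

$$n = |X|, \qquad L = \sum_{i=1}^{n} x_i, \qquad Q = \sum_{i=1}^{n} x_i x_i^T$$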

Extended Summarization in one pass
Z = [1, X, Y]
Γ = Z Z^T
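A minimal R sketch of the one-pass Γ computation (the real prototype implements this as the gamma() array operator or an aggregate UDF; the block size and the row-per-record orientation, which makes Γ = t(Z) %*% Z, are assumptions):

```r
# One-pass computation of Gamma with Z = [1, X, y] (one row per record).
# Streams the data block by block, so only one block is in RAM at a time.
gamma_summarize <- function(X, y, block = 10000) {
  n <- nrow(X); d <- ncol(X)
  G <- matrix(0, d + 2, d + 2)
  for (start in seq(1, n, by = block)) {
    rows <- start:min(start + block - 1, n)
    Z <- cbind(1, X[rows, , drop = FALSE], y[rows])
    G <- G + crossprod(Z)            # accumulate t(Z) %*% Z for this block
  }
  G  # G[1,1] = n; G[1, 2:(d+1)] = L (linear sums); G[2:(d+1), 2:(d+1)] = Q
}
```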

Gibbs Sampler
A Markov Chain Monte Carlo (MCMC) method.
Consider a linear model with parameters Θ = {β, σ, γ}, where the binary vector γ^(i) describes the variables selected at step i of the sequence.
The Gibbs sampler generates the Markov chain γ^(0), γ^(1), ..., γ^(N).
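Concretely, each step resamples one coordinate of γ from its full conditional given all the others (standard Gibbs updating, stated here for completeness):

$$\gamma_j^{(i+1)} \sim p\bigl(\gamma_j \mid \gamma_1^{(i+1)},\ldots,\gamma_{j-1}^{(i+1)},\gamma_{j+1}^{(i)},\ldots,\gamma_d^{(i)},\, Y\bigr), \qquad j = 1,\ldots,d$$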

Optimizing Gibbs Sampler
Non-conjugate priors require running the full Markov chain over all parameters.
Conjugate priors simplify the computation: β and σ can be integrated out, so they never need to be computed explicitly.
We exploit the sufficient statistics calculated in the summarization step to accelerate the Gibbs sampler.
Under the Zellner g-prior, the probability of γ is given by the expression below.
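In a standard formulation of the Zellner g-prior (this reconstruction is an assumption, not necessarily the authors' exact expression; |γ| is the number of selected variables and R²_γ the fit of the model using only those variables):

$$p(\gamma \mid Y) \;\propto\; p(\gamma)\,(1+g)^{\frac{n-1-|\gamma|}{2}}\,\bigl[\,1 + g\,(1 - R^2_\gamma)\,\bigr]^{-\frac{n-1}{2}}$$

Both |γ| and R²_γ can be read off sub-blocks of Γ alone, which is what makes the one-pass acceleration possible. A minimal R sketch under these assumptions (uniform model prior, g = n; all names are illustrative, not the prototype's actual code):

```r
# Accelerated Gibbs sampler for variable selection, driven only by
# Gamma = t(Z) %*% Z with Z = [1, X, y]; the raw data is never touched.
# Assumptions: uniform prior over models, Zellner g-prior with g = n,
# columns of Gamma ordered as [intercept, X_1..X_d, y].

log_marginal <- function(gam, Gamma, n) {
  d   <- length(gam)
  g   <- n                                   # unit-information g-prior (assumption)
  idx <- c(1, which(gam == 1) + 1)           # intercept plus selected predictors
  yi  <- d + 2                               # position of y in Gamma
  A   <- Gamma[idx, idx, drop = FALSE]       # t(Xg) %*% Xg
  b   <- Gamma[idx, yi]                      # t(Xg) %*% y
  yy  <- Gamma[yi, yi]                       # t(y) %*% y
  sse <- yy - sum(b * solve(A, b))           # residual sum of squares
  sst <- yy - Gamma[1, yi]^2 / n             # total sum of squares
  r2  <- 1 - sse / sst
  k   <- sum(gam)
  0.5 * (n - 1 - k) * log(1 + g) - 0.5 * (n - 1) * log(1 + g * (1 - r2))
}

gibbs_select <- function(Gamma, n, d, N = 30000) {
  gam    <- rep(0, d)
  counts <- rep(0, d)
  for (it in 1:N) {
    for (j in 1:d) {                         # resample each gamma_j given the rest
      gam[j] <- 1; l1 <- log_marginal(gam, Gamma, n)
      gam[j] <- 0; l0 <- log_marginal(gam, Gamma, n)
      gam[j] <- rbinom(1, 1, 1 / (1 + exp(l0 - l1)))
    }
    counts <- counts + gam
  }
  counts / N                                 # posterior inclusion probability per variable
}
```

Because every conditional probability is evaluated from the (d+2) × (d+2) matrix Γ, the cost per iteration is independent of n, which is the source of the speedup over samplers that rescan the data.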

Prototype in Array DBMS
Pre-selection: two custom operators, preselect() and correlation()
Summarization: custom operator gamma()
Gibbs Sampler: R script, outside the DBMS

Prototype overview in Array DBMS

Summarization

Prototype in Row DBMS
Pre-selection: SQL queries + user-defined function
Summarization: aggregate UDF
Gibbs Sampler: table-valued function

Overview in Row DBMS

Experiments
Data sets used in this work. Software specification.

Results
Execution time for the Summarization and Gibbs sampling steps. Data set: Songs. Array DBMS.

Results
Execution time in seconds for Pre-selection and Summarization. Data set: mRNA. Array DBMS.
Execution time in seconds for the Gibbs sampler step. Data set: mRNA. Array DBMS. 30 thousand iterations.

Results
Execution time in seconds for Pre-selection, Summarization and Gibbs Sampling. Data set: mRNA. Row DBMS. 30 thousand iterations.

Summarization Performance

Gibbs sampling performance

Posterior probabilities
Data set: mRNA. Plots for d=1000 and d=2000.

Quality of results
4-fold validation results. Data set: mRNA, d=2000.

Comparison

                Three Step Algorithm     BayesVarSel
Implementation  Array DBMS + R script    R package
Hardware        Intel Quad Core          Intel Quad Core
OS              Linux                    Linux
Data set        mRNA, n=248, d=500       mRNA, n=248, d=500
Prior           Zellner g                Zellner g
Iterations N    100                      100
Time            16 sec                   2104 sec
Model size      ~30                      ~180

Comparison
Comparing 3 versions of the Three Step Algorithm with an R package. Data set: mRNA. 100 iterations. Time in seconds.

Conclusion
We present a Bayesian variable selection algorithm that enables the analysis of very high dimensional data sets that may not fit in RAM.
We propose a Pre-selection step based on marginal correlation ranking, in order to reduce the dimensionality of the original data set.
We show that the Gibbs sampler works efficiently in both row and array DBMSs based on the summarization matrix, computed in only one pass.
Our algorithm shows outstanding performance: compared to a public-domain R package, our prototype is two orders of magnitude faster.
Our algorithm identifies small models, with R-squared generally better than 0.5.

Thank you!