Event History Models: Why R? Why SabreR? Rob Crouchley.

Contents
- Some science
- Performance of the available tools for multilevel models
- Breaking the technological barrier to adoption (sabreR)
- Demo
- Performance of parallel sabreR
- Conclusions

Some Science: BHPS Data (small dataset)
- Sample of males who were employed and earning a wage at some point over the period (13 years)
- Gives a total of 5130 individuals with a sequence of responses that occurred somewhere in the interval
- At the 1st sample point of the survey (1991) there were 2316 individuals, of whom 945 had had some form of training in the previous 12 months and 106 had been promoted in the previous 12 months
- The mean of the log of their weekly wage was 5.65 (Sterling)

What is the Effect of Training & Promotion on Wages?
Suppose we want to disentangle the dependencies between:
- Promotion (P=1,0) in the last 12 months (latent var P*)
- On the job training (T=1,0) in the last 12 months (latent var T*)
- Current wages (W)

Correlated Random Effects Model
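The equations on this slide did not survive transcription. As a rough guide only, one common way to write a correlated random effects model for promotion (P), training (T) and wages (W) is sketched below. The covariate vector x, the coefficients β and γ, the error terms ε and the trivariate normal random effects with covariance Σ are all illustrative assumptions, not taken from the talk.

```latex
\begin{aligned}
P^{*}_{it} &= x_{it}'\beta_{P} + u_{Pi} + \varepsilon_{Pit}, & P_{it} &= \mathbf{1}[P^{*}_{it} > 0],\\
T^{*}_{it} &= x_{it}'\beta_{T} + u_{Ti} + \varepsilon_{Tit}, & T_{it} &= \mathbf{1}[T^{*}_{it} > 0],\\
W_{it} &= x_{it}'\beta_{W} + \gamma_{P} P_{it} + \gamma_{T} T_{it} + u_{Wi} + \varepsilon_{Wit}, \\
(u_{Pi}, u_{Ti}, u_{Wi})' &\sim N(0, \Sigma).
\end{aligned}
```

In a formulation like this, the off-diagonal elements of Σ carry the dependence between the three processes; setting them to zero would give independent random effects.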

Commercial Software for MGLMMs (multivariate generalised linear mixed models)
- Stata: standard/adaptive quadrature, Newton-Raphson. See also Stata MP
- SAS PROC NLMIXED: standard/adaptive quadrature and Taylor/Laplace expansions, quasi-Newton. See also SAS PROC MPCONNECT and SAS Grid computing
- Limdep: quadrature, quasi-Newton

MGLMMs: Other Systems
- MLwiN: Laplace approximation and IRLS (also MCMC)
- gllamm (Stata program): standard/adaptive quadrature, Newton-Raphson
- aML:

Packages at http://cran.r-project.org/ for GLMMs and MGLMMs
- lmer (http://cran.r-project.org/web/packages/lme4/index.html): Laplace approximation, penalized iteratively reweighted least squares
- npmlreg (http://cran.r-project.org/web/packages/npmlreg/index.html): quadrature and NPML, EM algorithm

Why Quadrature?
- PQL (penalized quasi-likelihood): parameter estimates tend to be biased for binary dependent variables with small cluster sizes and high intraclass correlations (e.g. Rodriguez and Goldman, 1995, 2001)
- PQL does not involve a likelihood, which rules out likelihood-based inference
- Laplace approximation: the 6th-order expansion (Raudenbush et al., 2000) worked as well as 7-point adaptive quadrature (AQ) in simulations of a two-level binary dependent variable model
- The precision of Gaussian quadrature (GQ) and AQ can be increased simply by using more quadrature points
- We cannot increase the degree of the Taylor or Laplace expansion beyond the 2, 4 or 6 terms allowed for
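The point about adding quadrature points can be illustrated with a minimal sketch of standard (non-adaptive) Gauss-Hermite quadrature for one cluster of a random-intercept logit model. This is not Sabre's implementation, and Python is used here purely for illustration; the function names, the responses y, the linear predictors eta and the random-effect standard deviation sigma are all made-up assumptions.

```python
import numpy as np

def gh_expectation(g, sigma, n_points):
    """Approximate E[g(u)] for u ~ N(0, sigma^2) by Gauss-Hermite quadrature."""
    # hermgauss returns nodes and weights for the weight function exp(-x^2)
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    # the change of variable u = sqrt(2) * sigma * x maps the rule onto N(0, sigma^2)
    u = np.sqrt(2.0) * sigma * nodes
    return float(np.sum(weights * g(u)) / np.sqrt(np.pi))

def cluster_likelihood(y, eta, sigma, n_points):
    """Marginal likelihood of one cluster's 0/1 responses (hypothetical model)."""
    y = np.asarray(y, dtype=float)
    eta = np.asarray(eta, dtype=float)

    def conditional(u):
        # likelihood of the observed responses given each candidate value of u
        p = 1.0 / (1.0 + np.exp(-(eta[:, None] + u[None, :])))
        return np.prod(np.where(y[:, None] == 1.0, p, 1.0 - p), axis=0)

    return gh_expectation(conditional, sigma, n_points)

# illustrative data for a single cluster (not from the BHPS example)
y = [1, 0, 1, 1]
eta = [0.2, -0.1, 0.4, 0.0]
for q in (2, 8, 32):
    print(q, cluster_likelihood(y, eta, sigma=1.5, n_points=q))
```

Printing the value for an increasing number of points shows the approximation settling down, which is the sense in which precision can be bought with more quadrature points. Adaptive quadrature would additionally recentre and rescale the nodes for each cluster, which this sketch does not do.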

Simulation Based Methods
- Computer intensive alternatives to GQ and AQ include simulation based approaches such as Markov Chain Monte Carlo (MCMC) (e.g. Gelman et al., 2003) and maximum simulated likelihood (MSL) (Hajivassiliou and Ruud, 1994)
- The hierarchical structure of multilevel models lends itself naturally to MCMC using, for instance, Gibbs sampling. If vague priors are specified, the method essentially yields maximum likelihood estimates
- Unfortunately, a problem with MCMC is how to ensure that a truly stationary distribution has been obtained for MGLMMs, especially when we have a lot of structural and incidental parameters
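To make the simulation idea concrete, the sketch below shows the basic building block of maximum simulated likelihood: the integral over a random intercept is replaced by an average over random draws. It is an illustrative sketch in Python, not code from any package named in the talk; the function name, the data and the value of sigma are assumptions.

```python
import numpy as np

def simulated_cluster_likelihood(y, eta, sigma, n_draws, seed=0):
    """Monte Carlo estimate of one cluster's marginal likelihood in a
    random-intercept logit model, with its simulation standard error."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    eta = np.asarray(eta, dtype=float)
    # draw the random intercept from its assumed N(0, sigma^2) distribution
    u = rng.normal(0.0, sigma, size=n_draws)
    # likelihood of the observed 0/1 responses conditional on each draw
    p = 1.0 / (1.0 + np.exp(-(eta[:, None] + u[None, :])))
    conditional = np.prod(np.where(y[:, None] == 1.0, p, 1.0 - p), axis=0)
    estimate = float(conditional.mean())
    std_error = float(conditional.std(ddof=1) / np.sqrt(n_draws))
    return estimate, std_error

# illustrative data for a single cluster (not from the BHPS example)
y = [1, 0, 1, 1]
eta = [0.2, -0.1, 0.4, 0.0]
for r in (100, 10_000):
    print(r, simulated_cluster_likelihood(y, eta, sigma=1.5, n_draws=r))
```

The simulation error shrinks only at rate 1/sqrt(n_draws), which is one reason these methods are described as computer intensive.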

In tests, serial Sabre outperforms other software
- lmer: GQ and AQ not yet implemented; REML and ML give the Laplace approximation answer
- npmlreg: times are for GQ, as AQ is not available
- Sabre used the Portland Group PGF compiler with -fast (Level 2 optimization)
- Times are system times (very close to real time in all figures), with very little variation between runs
- R and gllamm are interpreted code; SAS?

Other Sabre comparisons: very small to small data sets
- MLwiN (MCMC, IGLS) is 2 to 25 times slower in univariate 2-level models
- For others see the Sabre site

Changes in Substantive Findings Between Models
[Table not recovered. Columns: models Homog, Indep, Dep. Rows: promotion coefficient in the wage equation, training coefficient in the wage equation, likelihood.]

Breaking the technological barrier to adoption
- Previously: 2x harder to use the NGS than to use your local HPC (private computing facility)
- Now: it is easier to use the NGS (public computing facility) than it is to use your local HPC

Enabling Technology for grid computing
All you need is:
1. An internet connection
2. The installation of our multiR or sabreR packages for R
3. A certificate to identify the client to the host, typically a grid certificate

Also
- Users do not need to install or be familiar with Globus, VDT, gsissh, gsiscp, grid-ftp, grid-proxy tools or any other grid-related software.
- There is very little difference between using the Sabre library from within R on the desktop, and using Sabre for statistical modelling on the grid from within R.

Desktop vs Grid on the Windows desktop
Serial sabreR
    # assumes the sabreR package is loaded and the data are attached
    sabre.model.1 <- sabre(proximity ~ factor(time) - 1, case = teacher,
                           first.family = "gaussian", first.mass = 64,
                           first.scale = 0.5)
    # display results
    sabre.model.1
Parallel sabreR
    # load previously saved grid session object
    load(file = "ncess.demo.session.R")
    sabre.model.2 <- sabre(proximity ~ factor(time) - 1, case = teacher,
                           first.family = "gaussian", first.mass = 64,
                           first.scale = 0.5,
                           session = ncess.demo.session,
                           description = "here ya go !!")
    # recover the results and display them
    sabre.results(ncess.demo.session, sabre.model.2)

Demo rob_sabrer_edit2.mov

Master-Slave (Distributed Memory) Model for MPI as used by Sabre on the NW-Grid
- Slave processes: each computes L_i, H_i, d_i for its own block of cases and sums them: Σa over i=1,...,1000; Σb over i=1001,...,2000; Σc over i=2001,...,3000; Σd over i=3001,...,4000
- MASTER process: forms Σa+Σb+Σc+Σd for L, H and d, etc., then takes a Newton-Raphson (NR) step
- There is no commercial software on the NGS or NW-GRID (licensing and cost issues)
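The scheme on this slide can be sketched in a few lines. The sketch reads L, d and H as the log-likelihood, gradient and Hessian contributions, which is an interpretation rather than something the slide states. It also uses an ordinary logistic regression as a stand-in model and a serial loop in place of MPI processes, so it illustrates only the partition-and-sum structure, not Sabre's code. All function names and the simulated data are hypothetical.

```python
import numpy as np

def block_contributions(X, y, beta):
    """Work done by one worker: log-likelihood L, gradient d and Hessian H
    summed over that worker's block of cases (logistic regression stand-in)."""
    eta = X @ beta
    p = 1.0 / (1.0 + np.exp(-eta))
    L = float(np.sum(y * eta - np.log1p(np.exp(eta))))
    d = X.T @ (y - p)
    H = -(X * (p * (1.0 - p))[:, None]).T @ X
    return L, d, H

def newton_raphson(X, y, n_workers=4, n_iter=25, tol=1e-10):
    """Master loop: sum the workers' partial results, then take an NR step."""
    beta = np.zeros(X.shape[1])
    # each worker owns a contiguous block of cases, as on the slide
    blocks = np.array_split(np.arange(len(y)), n_workers)
    L = float("nan")
    for _ in range(n_iter):
        # in a real MPI program each block would be evaluated on its own process
        parts = [block_contributions(X[idx], y[idx], beta) for idx in blocks]
        L = sum(part[0] for part in parts)
        d = sum(part[1] for part in parts)
        H = sum(part[2] for part in parts)
        step = np.linalg.solve(H, d)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    # L is the log-likelihood at the last point evaluated before the final step
    return beta, L

# simulated data: 4000 cases, so four workers get 1000 cases each
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(4000), rng.normal(size=4000)])
prob = 1.0 / (1.0 + np.exp(-(X @ np.array([-0.5, 1.0]))))
y = (rng.random(4000) < prob).astype(float)
print(newton_raphson(X, y, n_workers=4))
```

Because L, d and H are plain sums over cases, the result does not depend on how the cases are split across workers. Only the three summed quantities need to travel back to the master at each iteration.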

Performance of Parallel Sabre
- Relative performance of parallel Sabre compared to serial Sabre (=100) on example datasets
- In the Wage example 5 days becomes 2.75 hours on 48 processors
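The Wage example figures can be turned into a speedup and a parallel efficiency with a short calculation. The sketch assumes that "5 days" means 5 x 24 hours of elapsed time, which the slide does not state explicitly.

```python
# figures quoted on the slide; the 24-hour day is an assumption
serial_hours = 5 * 24
parallel_hours = 2.75
processors = 48

speedup = serial_hours / parallel_hours   # how many times faster the parallel run is
efficiency = speedup / processors         # fraction of ideal linear speedup

print(f"speedup: {speedup:.1f}x, parallel efficiency: {efficiency:.0%}")
# prints "speedup: 43.6x, parallel efficiency: 91%"
```

Under that assumption the run is about 43.6 times faster on 48 processors, roughly 91% of ideal linear scaling.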

Why R?
- Commercial tools (Stata, SAS) are of limited use on a public grid, e.g. Stata MP cannot have multiple data sets in memory and neither system provides access to its source code
- There are no plans to install them on the UK National Grid Service (NGS) because of cost/licensing issues
- R is an effective, efficient and easy to use tool for statistical modelling
- Many existing tried and tested statistical methods already available for R can easily be modified to exploit the benefits of grid computing
- Workflows to support the modelling process are simple to create
- R is easy to install on most popular operating systems (Windows, Unix, OS X) and can be used directly from a USB memory stick
- R includes a programming environment which, when used in conjunction with our multiR and sabreR packages, automatically provides a data-centric scripting tool for grid computing
- There are no licensing issues

Conclusions
- This approach makes all the grid middleware invisible and thus removes the biggest barrier to take-up.
- This approach can provide researchers with more sophisticated statistical modelling tools, increase their understanding of complex processes and thus help them to undertake more effective research.
- Social researchers do not need to let their large-scale science agenda using GLMs be set by the developments of the big statistics software houses, such as SAS and Stata.
