Event History Models: Why R? Why SabreR? Rob Crouchley.


1 Event History Models: Why R? Why SabreR? Rob Crouchley

2 Contents
Some science
Performance of the available tools for multilevel models
Breaking the technological barrier to adoption (sabreR)
Demo
Performance of parallel sabreR
Conclusions

3 Some Science: BHPS Data (small dataset)
Sample of males who were employed and earning a wage at some point over the period 1991-2003 (13 years)
This gives a total of 5130 individuals, each with a sequence of responses that occurred somewhere in the 1991-2003 interval
At the first sample point of the survey (1991) there were 2316 individuals, of whom 945 had received some form of training in the previous 12 months and 106 had been promoted in the previous 12 months
The mean of the log of their weekly wage (in pounds sterling) was 5.65

4 What is the Effect of Training & Promotion on Wages?
Suppose we want to disentangle the dependencies between:
Promotion (P=1,0) in the last 12 months (latent variable P*)
On-the-job training (T=1,0) in the last 12 months (latent variable T*)
Current wages (W)

5 Correlated Random Effects Model
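The formulas from this slide did not survive in the transcript. As a rough sketch only, one common way to write a correlated random effects model for the three responses on the previous slide is shown below. The covariate vector x_it, the coefficients, and the placement of P and T in the wage equation are assumptions made for illustration, not the author's specification.

```latex
\begin{aligned}
P^{*}_{it} &= x_{it}'\beta_{P} + u_{Pi} + \varepsilon_{Pit}, & P_{it} &= \mathbf{1}[P^{*}_{it} > 0],\\
T^{*}_{it} &= x_{it}'\beta_{T} + u_{Ti} + \varepsilon_{Tit}, & T_{it} &= \mathbf{1}[T^{*}_{it} > 0],\\
W_{it} &= x_{it}'\beta_{W} + \gamma_{P} P_{it} + \gamma_{T} T_{it} + u_{Wi} + \varepsilon_{Wit},\\
(u_{Pi}, u_{Ti}, u_{Wi})' &\sim N(0, \Sigma).
\end{aligned}
```

Under this reading, dropping the random effects would give a homogeneous model, a diagonal \Sigma would give independent random effects, and an unrestricted \Sigma would give correlated (dependent) random effects.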

6 Commercial Software for MGLMMs
Stata (http://www.stata.com/): standard/adaptive quadrature, Newton-Raphson. See also Stata MP
SAS PROC NLMIXED (http://www.sas.com/): standard/adaptive quadrature and Taylor/Laplace expansions, quasi-Newton. See also SAS PROC MPCONNECT and SAS Grid computing
Limdep (http://www.limdep.com/): quadrature, quasi-Newton

7 MGLMMs: Other Systems
MLwiN (http://www.cmm.bristol.ac.uk/): Laplace approximation and IRLS (also MCMC)
gllamm (Stata program, http://www.gllamm.org/): standard/adaptive quadrature, Newton-Raphson
aML (http://www.applied-ml.com/)

8 Packages at http://cran.r-project.org/ for GLMMs and MGLMMs
lmer (http://cran.r-project.org/web/packages/lme4/index.html): Laplace approximation, penalized iteratively reweighted least squares
npmlreg (http://cran.r-project.org/web/packages/npmlreg/index.html): quadrature and NPML, EM algorithm

9 Why Quadrature?
PQL (penalized quasi-likelihood): parameter estimates tend to be biased for binary dependent variables with small cluster sizes and high intraclass correlations (e.g. Rodriguez and Goldman, 1995, 2001)
PQL does not involve a likelihood, which rules out likelihood-based inference
Laplace approximation: the 6th-order expansion (Raudenbush et al., 2000) worked as well as 7-point adaptive quadrature (AQ) in simulations of a two-level binary dependent variable model
The precision of Gaussian quadrature (GQ) and AQ can be increased simply by using more quadrature points
We cannot increase the degree of the Taylor or Laplace expansion beyond the 2, 4 or 6 terms allowed for
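The point that quadrature precision improves with the number of points can be illustrated with a small base-R sketch. It is not Sabre's implementation: the function names (gauss_hermite, cond_lik, cluster_lik), the toy cluster of four binary responses, and the parameter values are all hypothetical. It approximates the marginal likelihood of one cluster in a random-intercept logistic model with ordinary (non-adaptive) Gauss-Hermite quadrature and compares the result with numerical integration via integrate().

```r
# Gauss-Hermite nodes and weights (weight function exp(-x^2)),
# computed with the Golub-Welsch eigenvalue method in base R.
gauss_hermite <- function(n) {
  stopifnot(n >= 2)
  i <- seq_len(n - 1)
  J <- matrix(0, n, n)
  J[cbind(i, i + 1)] <- sqrt(i / 2)   # off-diagonals of the Jacobi matrix
  J[cbind(i + 1, i)] <- sqrt(i / 2)
  e <- eigen(J, symmetric = TRUE)
  ord <- order(e$values)
  list(nodes = e$values[ord],
       weights = sqrt(pi) * e$vectors[1, ord]^2)
}

# Conditional likelihood of one cluster's binary responses given random intercept u
cond_lik <- function(u, y, eta) {
  vapply(u, function(uk) prod(dbinom(y, 1, plogis(eta + uk))), numeric(1))
}

# Marginal likelihood with u ~ N(0, sigma^2), using the substitution u = sqrt(2) * sigma * x
cluster_lik <- function(y, eta, sigma, n_points) {
  gh <- gauss_hermite(n_points)
  sum(gh$weights * cond_lik(sqrt(2) * sigma * gh$nodes, y, eta)) / sqrt(pi)
}

# Hypothetical toy cluster: four binary responses with fixed linear predictors
y <- c(1, 1, 0, 1)
eta <- c(-0.5, 0.2, 0.1, 0.8)
sigma <- 1.5

# Reference value from adaptive numerical integration
reference <- integrate(function(u) cond_lik(u, y, eta) * dnorm(u, sd = sigma),
                       -Inf, Inf, rel.tol = 1e-8)$value

points <- c(2, 5, 10, 20)
errors <- vapply(points,
                 function(m) abs(cluster_lik(y, eta, sigma, m) - reference),
                 numeric(1))
names(errors) <- points
print(errors)
```

Printing errors shows the absolute error for each number of quadrature points. A Taylor or Laplace expansion has no comparable way to raise its order without re-deriving the approximation.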

10 Simulation Based Methods
Computer-intensive alternatives to GQ and AQ include simulation-based approaches such as Markov Chain Monte Carlo (MCMC) (e.g. Gelman et al., 2003) and maximum simulated likelihood (MSL) (Hajivassiliou and Ruud, 1994)
The hierarchical structure of multilevel models lends itself naturally to MCMC using, for instance, Gibbs sampling. If vague priors are specified, the method essentially yields maximum likelihood estimates
Unfortunately, a problem with MCMC is how to ensure that a truly stationary distribution has been obtained for MGLMMs, especially when we have a lot of structural and incidental parameters

11 In tests, serial Sabre outperforms other software
lmer: GQ and AQ not yet implemented; REML and ML give the Laplace approximation answer
npmlreg: GQ times are reported, as AQ is not available
Sabre was built with the Portland Group PGF90 7.1-6 compiler with -FAST (level 2 optimization)
Times are system times (very close to real time in all figures), with very little variation between runs
R and gllamm are interpreted code; SAS?

12 Other Sabre comparisons: very small to small data sets
MLwiN (MCMC, IGLS) is 2-25 times slower in univariate 2-level models
For others see the Sabre site http://sabre.lancs.ac.uk/

13 Changes in Substantive Findings Between Models
Covariate coefficient in wage equation, by model:
              Homog        Indep        Dep
Promo         0.09499      0.06103      0.05288
              0.00824      0.00599      0.00611
Train        -0.00683     -0.00865     -0.00864
              0.00526      0.00396      0.00405
Likelihood   -38471.93    -29448.19    -29419.52
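One way to read the last row of this slide is through likelihood ratio statistics. This assumes that the reported values are maximized log-likelihoods and that the three models are nested (Homog within Indep within Dep); the transcript confirms neither. Under those assumptions the arithmetic is:

```latex
\begin{aligned}
\mathrm{LR}_{\text{Indep vs Homog}} &= 2\,[(-29448.19) - (-38471.93)] = 2 \times 9023.74 = 18047.48,\\
\mathrm{LR}_{\text{Dep vs Indep}}   &= 2\,[(-29419.52) - (-29448.19)] = 2 \times 28.67 = 57.34.
\end{aligned}
```

The degrees of freedom depend on how many variance and correlation parameters each step adds, which the slide does not state. Tests of variance parameters on the boundary of the parameter space also need care.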

14 Breaking the technological barrier to adoption
Previously: it was twice as hard to use the NGS as to use your local HPC (private computing facility)
Now: it is easier to use the NGS (public computing facility) than it is to use your local HPC

15 Enabling Technology for grid computing
All you need is:
1. An internet connection
2. The installation of our multiR or sabreR packages for R
3. A certificate to identify the client to the host, typically a grid certificate

16 Also
Users do not need to install or be familiar with Globus, VDT, gsissh, gsiscp, grid-ftp, grid-proxy tools or any other grid-related software.
There is very little difference between using the Sabre library from within R on the desktop and using Sabre for statistical modelling on the grid from within R.

17 Desktop vs Grid on the Windows desktop
Serial sabreR
# load the sabreR package named on the earlier slides
library(sabreR)
sabre.model.1 <- sabre(proximity ~ factor(time) - 1, case = teacher, first.family = "gaussian", first.mass = 64, first.scale = 0.5)
# display results
sabre.model.1
Parallel sabreR
# load previously saved grid session object
load(file = "ncess.demo.session.R")
sabre.model.2 <- sabre(proximity ~ factor(time) - 1, case = teacher, first.family = "gaussian", first.mass = 64, first.scale = 0.5, session = ncess.demo.session, description = "here ya go !!")
# recover the results and display them
sabre.results(ncess.demo.session, sabre.model.2)

18 Demo rob_sabrer_edit2.mov

19 Master-Slave (Distributed Memory) Model for MPI as used by Sabre on the NW-Grid
[Diagram: a MASTER process and slave processes. Each block of cases yields partial sums of L_i, H_i and d_i:
Σa: i = 1,...,1000
Σb: i = 1001,...,2000
Σc: i = 2001,...,3000
Σd: i = 3001,...,4000
The partial sums are combined as Σa + Σb + Σc + Σd for L, H and d, etc., followed by a Newton-Raphson (NR) step.]
There is no commercial software on the NGS or NW-GRID (licensing and cost issues)
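The pattern on this slide can be sketched in base R. This is a serial illustration under stated assumptions, not Sabre's MPI code: L, H and d are read here as the log-likelihood, Hessian and gradient; an ordinary logistic regression stands in for the multilevel model; lapply() stands in for the worker processes; and the function names chunk_contrib and newton_raphson and the simulated data are hypothetical.

```r
# Contribution of one block of cases to the log-likelihood (L),
# gradient (d) and Hessian (H) of a logistic regression.
chunk_contrib <- function(X, y, beta) {
  p <- as.vector(plogis(X %*% beta))
  list(L = sum(dbinom(y, 1, p, log = TRUE)),
       d = as.vector(crossprod(X, y - p)),
       H = -crossprod(X * (p * (1 - p)), X))
}

# "Master" loop: gather the partial sums from every block, add them up,
# then take a Newton-Raphson step until the step size is negligible.
newton_raphson <- function(chunks, beta, tol = 1e-8, max_iter = 50) {
  for (iter in seq_len(max_iter)) {
    # In a real MPI run each element of `parts` would come from a separate process
    parts <- lapply(chunks, function(ch) chunk_contrib(ch$X, ch$y, beta))
    L <- sum(vapply(parts, function(p) p$L, numeric(1)))
    d <- Reduce(`+`, lapply(parts, function(p) p$d))
    H <- Reduce(`+`, lapply(parts, function(p) p$H))
    step <- solve(H, d)
    beta <- beta - step
    if (max(abs(step)) < tol) break
  }
  # loglik is the value at the last beta evaluated before the final step
  list(beta = beta, loglik = L, iterations = iter)
}

# Simulated data: 4000 cases split into four blocks of 1000, as on the slide
set.seed(1)
n <- 4000
X <- cbind(1, rnorm(n))
y <- rbinom(n, 1, as.vector(plogis(X %*% c(-0.5, 1))))
blocks <- split(seq_len(n), rep(1:4, each = 1000))
chunks <- lapply(blocks, function(i) list(X = X[i, , drop = FALSE], y = y[i]))

fit <- newton_raphson(chunks, beta = c(0, 0))
print(fit$beta)
```

Because L, d and H are sums over independent cases, each block's contribution can be computed without reference to the others. Only the current parameter vector and the partial sums need to pass between processes, which is why the scheme suits distributed memory.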

20 Performance of Parallel Sabre
Relative performance of parallel Sabre compared to serial Sabre (=100) on example datasets
In the Wage example, 5 days becomes 2.75 hours on 48 processors
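The Wage example can be turned into a speedup and a parallel efficiency. This assumes that "5 days" means 5 x 24 hours of elapsed time for the serial run; the slide gives only the two run times and the processor count.

```latex
\begin{aligned}
T_{\text{serial}} &= 5 \times 24\ \text{h} = 120\ \text{h},\\
S &= \frac{T_{\text{serial}}}{T_{\text{parallel}}} = \frac{120}{2.75} \approx 43.6,\\
E &= \frac{S}{48} \approx 0.91.
\end{aligned}
```

On the slide's relative scale (serial = 100), the parallel run would score about 100 x 2.75 / 120, or roughly 2.3.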

21 Why R?
Commercial tools (Stata, SAS) are of limited use on a public grid, e.g. Stata MP cannot have multiple data sets in memory, and neither system provides access to its source code
There are no plans to install them on the UK National Grid Service (NGS) because of cost/licensing issues
R is an effective, efficient and easy-to-use tool for statistical modelling
Many existing tried and tested statistical methods already available for R can easily be modified to exploit the benefits of grid computing
Workflows to support the modelling process are simple to create
R is easy to install on most popular operating systems (Windows, Unix, OSX) and can be used directly from a USB memory stick
R includes a programming environment which, when used in conjunction with our multiR and sabreR packages, automatically provides a data-centric scripting tool for grid computing
There are no licensing issues

22 Conclusions
This approach makes all the grid middleware invisible and thus removes the biggest barrier to take-up.
This approach can provide researchers with more sophisticated statistical modelling tools, increase their understanding of complex processes, and so help them to undertake more effective research.
Social researchers do not need to let their large-scale science agenda using GLMs be set by the developments of the big statistics software houses, like SAS, Stata etc.


