Professor William Browne,

Using StatJR’s Statistical Analysis assistant to assist in automating statistical analysis
Professor William Browne, Centre for Multilevel Modelling, University of Bristol

eBooks
An electronic book is a book publication in digital form. In the US more books are published online than distributed in hard copy in book shops.

Statistical (and Mathematical) eBooks
The idea is: can we incorporate statistical content into an eBook? Of course, a statistical textbook is no different on paper to any other document when it comes to creating a pdf file (aside from maybe more equations!). The difference is in what ‘enhancements’ we can add, and so the idea here is to combine the textbook with the statistics package, i.e. interactive examples, allowing the user to include their own dataset etc.

Navigate through pages of eBook
Hierarchical table of contents (can be expanded / collapsed at each node).
Every time we start a reading process, we begin at the first page of the eBook, with no record of any inputs (because we haven’t made any!).
This example eBook introduces the reader to multilevel modelling, using an example dataset taken from education, fitting a Normal model.

…further down the first page we introduce the example dataset…

…and invite the reader to explore it by plotting and cross-tabulating variables…
…for example, in this input box we ask the reader a series of questions ascertaining what variables they’d like to plot, and in what manner…
…here we ask for a density plot (which is a little like a smoothed histogram), pressing Submit after each choice we make (I’ve cut a few out of the screenshots)…

After we press Submit for the final time, having answered all the questions, we see that the progress gauge in the top-left corner changes from ‘Finished’…

…to indicate that it is running the execution behind the scenes; furthermore, the text indicating to the reader where the resulting plot will appear…

…is replaced by the plot as soon as it has been rendered…

…as we can see if we scroll down.

Statistical Analysis Assistants
We adapt our eBook system to allow workflows to be constructed that describe how the steps in a statistical analysis fit together. There may be many SAAs adapted to different researchers’ approaches, e.g. one might want to answer a research question / analyse a dataset as a specific expert might do it. Opinion is divided on how far one can take the idea: anywhere from nowhere to complete automation, i.e. pour in the dataset at the top and let the computer sort it out. The probable end point will be somewhere in between, or in fact a series of SAAs that lie on this continuum. It is easiest to start with automating single operations.

A statistical analysis assistant we are all happy with!

One Step further

Adding contextual text to a single operation
As we have seen with the Chi-squared example, it is easy to enhance a single statistical operation like a statistical test. We can easily expose the steps required for the test, in this case:
1. The tabulation of the observed counts.
2. The calculation of the corresponding expected counts.
3. The calculation of the test statistic and degrees of freedom.
4. The interpretation of the test, the P value and what it means in words.
What is harder is to then put what the result means into context. Statistical tests and tables are fairly easy to enhance with intelligent textual information whilst graphs and figures are harder to enhance. Generally one has to calculate a statistic related to the figure and work with that, e.g. skewness and histograms as shown later.
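The four steps of the chi-squared test can be sketched in a few lines of Python. This is only an illustration and not the code used inside the SAA: the function names, the 5% threshold and the wording of the interpretation are all assumptions made for this example.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df."""
    z = x / 2.0
    # Closed forms for df = 1 (odd case) and df = 2 (even case).
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(z))
    else:
        a, q = 1.0, math.exp(-z)
    # Step df up by 2 with Q(a + 1, z) = Q(a, z) + z^a e^(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

def chi_squared_test(observed, alpha=0.05):
    """Chi-squared test of association for a table of observed counts."""
    # Step 1: the observed counts and their margins.
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Step 2: expected counts under independence.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Step 3: test statistic and degrees of freedom.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    # Step 4: P value and an interpretation in words (alpha is an assumption).
    p_value = chi2_sf(statistic, df)
    if p_value < alpha:
        text = "There is evidence of an association between the two variables."
    else:
        text = "There is no evidence of an association between the two variables."
    return {"expected": expected, "statistic": statistic, "df": df,
            "p_value": p_value, "text": text}

result = chi_squared_test([[10, 20], [30, 40]])
print(result["statistic"], result["df"], result["p_value"])
print(result["text"])
```

The returned dictionary keeps every intermediate object, so each step can be shown to the reader with its own explanatory text rather than only the final P value.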

‘The Warlock of Firetop Mountain’ approach
The first of a genre of interactive books, published in 1982 and lapped up by 10-year-old boys like myself! A combination of book and flowchart. It worked something like: ‘The goblin advances towards you, shouting words that you can’t understand. Do you try to make conversation (turn to page 231), run past the goblin (turn to page 176) or draw your sword and fight (turn to page 134)?’ Basically, underpinning the book was a flowchart disguised by random page movements, with a variety of endings (99% of them involved you dying), possible loops etc.

The use of Flowcharts in Statistics
The equivalent exists in (at least) basic statistical analysis, and a variety of books have flowcharts to guide the uninitiated to the appropriate test. The branching rules are usually things like: how many variables do you have? What type are they? Is a normality assumption appropriate? The example flowcharts usually then say you need a t test / Mann-Whitney test / ANOVA etc. One could expand this idea to include branches where we haven’t written material, i.e. the equivalent of ending up dead would be the default ‘go and ask a statistician’ end point, possibly taking your answers to the flowchart with you.
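A flowchart of this kind is easy to express as code. The sketch below is a deliberately tiny, hypothetical example: the function name, the question names and the particular branches are assumptions for illustration, and a real flowchart would have many more branches. Any path we have not written falls through to the default end point, and the answers are returned so they can be taken along to the statistician.

```python
def choose_test(n_groups, outcome_type, normality_ok):
    """Walk a very small test-selection flowchart.

    Returns the recommended end point together with the answers given.
    """
    answers = {
        "n_groups": n_groups,
        "outcome_type": outcome_type,
        "normality_ok": normality_ok,
    }
    default = "go and ask a statistician"

    # Only continuous outcomes are covered by this illustrative flowchart.
    if outcome_type != "continuous":
        return default, answers
    if n_groups == 2:
        test = "two-sample t test" if normality_ok else "Mann-Whitney test"
        return test, answers
    if n_groups > 2 and normality_ok:
        return "one-way ANOVA", answers
    # Every branch without written material ends at the default.
    return default, answers

recommendation, answers = choose_test(2, "continuous", normality_ok=False)
print(recommendation)
print(answers)
```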

Where might this go?
The flowchart idea is appealing as it may to some degree mimic a statistical consultation. If the system is flexible enough then each statistician can tune the SAA to their own approach to analysis and to how much they feel can be comfortably automated. Where there is uncertainty / options in what one should do, this could be incorporated. eBooks can contain hyperlinks so that further background on proposed statistical methods or examples can be easily found.

Workflows and StatJR LEAF
Workflows allow the sequencing of a series of operations to perform an analysis. StatJR LEAF is based around a new front end written using the Blockly system. It allows users to link up templates themselves in a user-friendly visual way. Workflows can then be included in eBooks. We will use this system in the SAAs.

Skewness / Histogram workflow
Here is a logfile-style workflow. Basically we select a dataset, then plot a histogram of a variable and display several objects.
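To attach text to a histogram, one option is to compute a statistic related to it and generate a sentence from that. The Python sketch below does this with the sample skewness. It is not the StatJR workflow itself: the function names, the cut-off of 0.5 and the wording are assumptions chosen for illustration.

```python
def skewness(values):
    """Moment-based sample skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def describe_skewness(values, cutoff=0.5):
    """Turn the skewness statistic into a sentence (cutoff is a rule of thumb)."""
    g1 = skewness(values)
    if g1 > cutoff:
        shape = "skewed to the right (a long upper tail)"
    elif g1 < -cutoff:
        shape = "skewed to the left (a long lower tail)"
    else:
        shape = "roughly symmetric"
    return f"The skewness is {g1:.2f}, so the distribution looks {shape}."

print(describe_skewness([1, 2, 3, 4, 5]))
print(describe_skewness([1, 1, 1, 1, 10]))
```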

More complex operations – linear regression
When we looked at the chi-squared test earlier we broke the test down into the series of steps that form it. For a regression analysis we might have additional steps to move from simply a test to an analysis. We might do some initial exploratory data analysis and possibly transform variables. We will clearly do the model fit itself, but we will probably then also do some post-processing steps, for example analysis of the residuals and plotting the model predictions. We will demonstrate an SAA for a linear regression but first show an example of a flowchart for a real analysis.

Flowchart for a real analysis
Figure: a flowchart of the stages of a real analysis, linked by numbered arrows. The stages and the items within each are:
Hypotheses / Design: Is there a significant relationship / difference? What shape is the relationship? What value of x predicts a certain value of y? Is the design repeated measures? Possible confounding variables to control for? Are data standardised / transformed? How? Etc.
Data sourcing / collection: Check permissions.
Data prep: Generate new / overwrite variables; Re-sort; Data description / renaming; Exclude data; Re-code data.
Data exploration: Charts; Tables; Summary stats; Filters.
Model fit: Correlation; GLM; etc. (lots of model-fitting possibilities).
Post-process model: Charts; Tables; Significance tests; Generate estimates.
Conclusions / Report.

Moving to general linear models
Here we have to deal differently with categorical predictors, both in how they are included in the model and in how we perform exploratory data analysis on them. We might perform ‘univariable analysis’, where each predictor is considered in isolation and a separate model is fitted. We can then consider ‘multivariable analysis’, possibly via some stepwise-style approach to find a ‘best’ model. Residual analysis is straightforward to extend to general linear models, but what is more of a challenge is automation of prediction plots when, say, one has 3 continuous and 4 categorical predictors! One possible solution is to plot against each predictor in turn, holding the others at their mean; another is to offer a bespoke prediction tool.
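The "hold the others at their mean" idea amounts to building a grid of prediction points for each predictor in turn. The Python sketch below shows this for continuous predictors only. The function name and data layout are assumptions for illustration, and categorical predictors would need a different rule (for example fixing them at a reference category), which is not shown.

```python
def prediction_grid(rows, vary, n_points=5):
    """Grid of predictor values in which only `vary` changes.

    rows: list of dicts mapping predictor name to a numeric value.
    The varied predictor runs from its minimum to its maximum;
    every other predictor is held at its sample mean.
    """
    names = list(rows[0].keys())
    means = {name: sum(r[name] for r in rows) / len(rows) for name in names}
    lo = min(r[vary] for r in rows)
    hi = max(r[vary] for r in rows)
    step = (hi - lo) / (n_points - 1)
    grid = []
    for i in range(n_points):
        point = dict(means)           # start from the mean of every predictor
        point[vary] = lo + i * step   # then overwrite the one being varied
        grid.append(point)
    return grid

data = [{"x1": 0.0, "x2": 10.0}, {"x1": 4.0, "x2": 20.0}]
for point in prediction_grid(data, vary="x1", n_points=3):
    print(point)
```

Feeding each grid point through the fitted model then gives one prediction curve per predictor.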

Moving to Multilevel Models
Multilevel models are a natural extension to linear models. For 2-level models we can consider fitting variance components models and commenting on the VPC diagnostic. We can do the equivalent of the ‘univariable models’, fitting each predictor in isolation to a variance components model. We can also perform a stepwise approach. Residual analysis now takes place at two levels and prediction plots show lines for each cluster. The SAA can be extended to random slopes and further levels, but the choice of order of operations has to be considered. Another extension is to include interactions and non-linear predictors in the model. Overleaf we show a couple of screenshots from a random intercepts SAA.
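For reference, the 2-level variance components model and the VPC (variance partition coefficient) can be written as follows. The notation is a standard choice assumed here, not taken from the slides: i indexes level-1 units within cluster j.

```latex
y_{ij} = \beta_0 + u_j + e_{ij}, \qquad
u_j \sim N(0, \sigma_u^2), \qquad
e_{ij} \sim N(0, \sigma_e^2)

\mathrm{VPC} = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}
```

The VPC is the proportion of the total variance that lies between clusters, so a value near 0 suggests little clustering and a value near 1 suggests that most of the variation is between clusters.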

Logistic Regression Models
Logistic regression models are used when the response variable is binary (e.g. yes/no). For logistic regression models the model fitting is still fairly straightforward, although model comparison is more of a challenge. Any prediction plots are better transformed back to the probability scale, and residual plots are harder to interpret. Often odds ratios are reported. Extending from single-level logistic models to multilevel is analogous to the Normal response case. In multilevel logistic models it is often preferable to use MCMC methods.
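Two small transformations do most of the work when reporting a logistic model: the inverse logit, which takes a linear predictor back to the probability scale, and the exponential of a coefficient, which gives an odds ratio. A minimal Python sketch follows. The function names and the coefficient values in the usage lines are made up for illustration.

```python
import math

def inv_logit(eta):
    """Map a linear predictor (log-odds) to a probability in (0, 1)."""
    # Two branches avoid overflow in exp() for large |eta|.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def odds_ratio(beta):
    """Odds ratio for a one-unit increase in a predictor with coefficient beta."""
    return math.exp(beta)

# Hypothetical fitted coefficients, purely for illustration.
intercept, slope = -1.0, 0.5
for x in (0.0, 2.0, 4.0):
    print(x, round(inv_logit(intercept + slope * x), 3))
print("odds ratio:", round(odds_ratio(slope), 3))
```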

Bayesian / MCMC methods
The SAAs currently considered have generally called the MLwiN IGLS engine from within StatJR, but we might also want to fit models using MCMC. When using MCMC we compare models using the DIC diagnostic rather than likelihood ratio tests. We also need to consider how long to run the models for and which prior distributions to use, which are harder decisions to automate. To help explain the concepts of convergence and posterior distributions it is useful to include contextual text to describe the graphical outputs from the MCMC engine.
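As a reminder of what the DIC involves, the Python sketch below computes it from a set of deviance values. It follows the usual definition (mean deviance plus the effective number of parameters pD) and is not the StatJR or MLwiN implementation. The function name and inputs are assumptions, and the numbers in the usage lines are made up.

```python
def dic(deviance_samples, deviance_at_posterior_mean):
    """DIC from MCMC output.

    deviance_samples: deviance evaluated at each stored iteration.
    deviance_at_posterior_mean: deviance at the posterior mean of the parameters.
    """
    d_bar = sum(deviance_samples) / len(deviance_samples)  # mean deviance
    p_d = d_bar - deviance_at_posterior_mean               # effective no. of parameters
    return {"d_bar": d_bar, "p_d": p_d, "dic": d_bar + p_d}

# Made-up deviance values for two competing models.
model_a = dic([100.0, 102.0, 104.0], 99.0)
model_b = dic([98.0, 101.0, 104.0], 95.0)
print(model_a["dic"], model_b["dic"])  # the smaller DIC is preferred
```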

Output

The challenge of missing data
In the SAAs written so far we have initially asked the user for all the variables (response, predictor, cluster labels) required for the analysis up front. The workflows then perform a listwise deletion to form a complete cases dataset based on all variables used. It would be feasible to select all valid data for each model though model comparison would require the same data for pairs of models. We have done much work on templates that perform multiple imputation and include an imputation model and a model of interest. The challenge is incorporating such imputation in a more general model fitting environment.
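Listwise deletion across all the variables named up front can be sketched as below. This is an illustration in plain Python rather than the workflow's own code. The function name, the use of None to mark a missing value and the variable names are assumptions.

```python
def complete_cases(rows, variables):
    """Keep only the rows with no missing (None) value in any listed variable."""
    return [
        row for row in rows
        if all(row.get(name) is not None for name in variables)
    ]

data = [
    {"y": 1.0, "x": 2.0, "school": 1},
    {"y": None, "x": 3.0, "school": 1},
    {"y": 3.0, "x": None, "school": 2},
    {"y": 4.0, "x": 5.0, "school": 2},
]
# Response, predictor and cluster label are all requested before any model is fitted.
analysis_data = complete_cases(data, ["y", "x", "school"])
print(len(analysis_data), "of", len(data), "rows kept")
```

Because every model in the analysis is then fitted to the same reduced dataset, comparisons between pairs of models remain valid.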

Problems with simple approaches 1 – mean imputation
Mean imputation basically replaces missing values with the mean of the observed values. Here you can see that this results in a pattern in the dataset and affects estimates etc.
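One effect on estimates is easy to demonstrate: the imputed values all sit exactly at the mean, so the variance is understated. The Python sketch below uses a made-up five-value dataset, with None marking a missing value. The function name is an assumption for illustration.

```python
import statistics

def mean_impute(values):
    """Replace each missing (None) value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

raw = [1.0, 2.0, 3.0, None, None]
observed = [v for v in raw if v is not None]
imputed = mean_impute(raw)

print("imputed data:", imputed)
print("variance of observed values:", statistics.variance(observed))
print("variance after mean imputation:", statistics.variance(imputed))
```

The mean is unchanged but the sample variance is smaller after imputation, which in turn makes standard errors look smaller than they should be.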

Problems with simple approaches 2 – regression imputation
Regression imputation basically replaces missing values with the value from a regression fit. Here note the points on 2 lines for missing Y and X values. Also the residual plot (to the right) shows many values at 0, due to the perfect fit of the imputed data, thus underestimating the uncertainty.

Summary
In this talk we have described the Stat-JR software and work on developing SAAs within its workflow and eBook interfaces. We have highlighted some of the challenges in automating the contextual information that goes into the SAA. We have shown how we might extend basic operations to form SAAs to fit particular families of models. See for new SAA manual for Stat-JR and example pdfs of analysis. The system will fit different response types, multiple levels of random effects, random slopes and interactions.
