Data and Statistics: New methods and future challenges Phil O’Neill University of Nottingham.

Presentation transcript:

Professors: How they spend their time

1. High-resolution genetic data
2. Model assessment

Gardy 2011 NEJM

“High-resolution genetic data”: what are they?
- individual-level data on the pathogen
- can be taken at single or multiple time points
- high-dimensional, e.g. whole genome sequences
- proportion of individuals sampled could be high or low
- becoming far more common due to cost reduction

“High-resolution genetic data”: what use are they?
- better inference about transmission paths
- more reliable estimates of epidemiological quantities?
- understand evolution of the pathogen

[Figure: example pathogen genetic sequence, A C C C T T G G G A A A ...]

Modelling and Data Analysis methods
Two kinds of approaches exist:
1. Separate genetic and epidemic components (e.g. Volz, Rasmussen)
2. Combine genetic and epidemic components (e.g. Ypma, Worby, Morelli)

1. Separate genetic and epidemic components, e.g.:
- estimate phylogenetic tree
- given the tree, fit epidemic model
or
- cluster individuals into genetically similar groups
- given the groups, fit multi-type epidemic model

1. Separate genetic and epidemic components
+ “Simple” approach
+ Avoids complex modelling
- Ignores any relationship between transmission and genetic information

2. Combine genetic and epidemic components, e.g.:
- model genetic evolution explicitly
- define model featuring both genetic and epidemic parts

2. Combine genetic and epidemic components
+ “Integrated” approach
- Is modelling too detailed?
- Initial conditions: typical sequence?
+/- Model differences between individuals instead?

1. High-resolution genetic data
2. Model assessment

“Model assessment”: what is it?
- Does our model fit the data?
- Is there a better model?

“Model assessment”: why do it?
- Poor fit casts doubt on conclusions drawn from modelling
- Model choice can be a tool for directly addressing questions of interest

Linear regression: $y_k = a x_k + b + e_k$, $e_k \sim N(0, v)$
Minimise distance of model mean from observed data
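As a point of reference for the regression slide, the following is a minimal sketch of least-squares fitting for the model $y_k = a x_k + b + e_k$. The function name `fit_line`, the simulated data and the true values (slope 2, intercept 1, noise standard deviation 0.5) are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares estimates of slope a and intercept b for y = a*x + b + e."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix with a column for the slope and a column of ones for the intercept.
    design = np.column_stack([x, np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(design, y, rcond=None)
    # Residuals: observed data minus the fitted model mean.
    residuals = y - (a * x + b)
    # Unbiased estimate of the noise variance v (two fitted parameters).
    v_hat = residuals @ residuals / (len(x) - 2)
    return a, b, v_hat, residuals

# Hypothetical data simulated from the model (values chosen for illustration only).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=x.size)
a, b, v_hat, residuals = fit_line(x, y)
```

Here the residuals $y_k - (\hat{a} x_k + \hat{b})$ are the natural objects to inspect for lack of fit, which is the notion that has no obvious counterpart for outbreak data.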

For outbreak data:
- What are the right residuals?
- Should observed or unobserved data be compared to the model? (Streftaris and Gibson)
- Mean model may only be available via simulation
- Is the mean the right quantity to consider?

Simulation-based approaches to model fit:
- Forward simulation – “close” to data?
- Choice of summary statistics?
- Close ties to ABC methods (McKinley, Neal)
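The forward-simulation idea can be sketched as follows. Everything specific here is an assumption made for illustration: the discrete-time chain-binomial SIR model, the parameter values, the use of final size as the summary statistic, the observed value of 60 and the two-sided tail probability as the measure of "closeness". None of these choices come from the talk, and choosing them well is exactly the open question the slide raises.

```python
import numpy as np

def simulate_final_size(beta, gamma, n, i0, rng):
    """Final size of a discrete-time chain-binomial SIR epidemic (illustrative model)."""
    s, i = n - i0, i0
    while i > 0:
        # Per-step probability that a given susceptible becomes infected.
        p_inf = 1.0 - np.exp(-beta * i / n)
        # Per-step probability that a given infective is removed.
        p_rec = 1.0 - np.exp(-gamma)
        new_inf = rng.binomial(s, p_inf)
        new_rec = rng.binomial(i, p_rec)
        s -= new_inf
        i += new_inf - new_rec
    # Everyone who is no longer susceptible was infected at some point.
    return int(n - s)

def tail_probability(observed, simulated):
    """Two-sided tail probability of the observed summary among simulated summaries."""
    sims = np.asarray(simulated)
    upper = np.mean(sims >= observed)
    lower = np.mean(sims <= observed)
    return min(1.0, 2.0 * min(upper, lower))

# Hypothetical fitted parameters and observed final size (assumed values).
rng = np.random.default_rng(0)
sims = [simulate_final_size(beta=1.5, gamma=0.5, n=100, i0=1, rng=rng)
        for _ in range(500)]
p = tail_probability(60, sims)
```

A small value of `p` would suggest that the observed summary is unusual under the model. Replacing the tail probability with an accept/reject rule on the distance between simulated and observed summaries gives the basic ABC rejection scheme, which is the connection to ABC noted above.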

Approaches to model choice
- Hypermodels/saturated models
- Bayesian non-parametric methods
- Bayesian methods e.g. RJMCMC
- Mixture models

Hypermodels/saturated models
e.g. Infection rates $\beta S$ or $\beta S I$ or $\beta S I^{0.5}$ in an SIR model? Instead use $\beta S I^{\alpha}$ and estimate $\alpha$ (O’Neill and Wen)
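To make the nesting explicit, here is a minimal sketch of the hypermodel infection rate, with the exponent written as `alpha`. The function name `infection_rate`, the guard for the case of no infectives and the numerical values are assumptions added for illustration.

```python
def infection_rate(beta, s, i, alpha):
    """Overall infection rate beta * S * I**alpha in the hypermodel.

    alpha = 0, 1 and 0.5 recover the three candidate rates
    beta*S, beta*S*I and beta*S*I**0.5 respectively.
    """
    # Assumption: no new infections without infectives. This guard matters
    # because 0 ** 0 evaluates to 1 in Python, which would otherwise give a
    # positive rate when alpha = 0 and I = 0.
    if i <= 0:
        return 0.0
    return beta * s * i ** alpha

# The three candidate models evaluated at one hypothetical state (S = 10, I = 4).
rates = {alpha: infection_rate(0.2, 10, 4, alpha) for alpha in (0.0, 0.5, 1.0)}
```

Estimating `alpha` from data then replaces a discrete choice between three models with inference for a single continuous parameter.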

Bayesian non-parametric methods
e.g. Infection rate $\beta(t) S I$ or $\beta(t)$ in an SIR model; estimate $\beta(t)$ in a Bayesian non-parametric manner using Gaussian process machinery (Kypraios, O’Neill and Xu; Knock and Kypraios)

Reversible Jump MCMC
e.g. Distinct models (usually a small number); estimate Bayes factors by running MCMC on the union of parameter spaces (O’Neill; Neal and Roberts; Knock and O’Neill)

Mixture models
e.g. Given two models $(g, h)$, create mixture model $f(x) = \alpha\, g(x) + (1-\alpha)\, h(x)$; estimation of $\alpha$ enables estimation of Bayes factors (Kypraios and O’Neill)
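One way to see why the mixing weight carries information about the Bayes factor is the following sketch. It assumes that the mixture is applied to the likelihood of the whole data set $x$, that the mixing weight $\alpha$ has a Uniform(0,1) prior, and that the parameters of the two component models $g$ and $h$ have priors independent of $\alpha$. The marginal likelihoods $m_g(x)$ and $m_h(x)$ and the symbol $B$ are notation introduced here, not taken from the slides.

```latex
% Integrating out the component-model parameters leaves a posterior for alpha
% that is linear in alpha:
\pi(\alpha \mid x) \;\propto\; \alpha\, m_g(x) + (1-\alpha)\, m_h(x), \qquad 0 \le \alpha \le 1 .

% Posterior mean of alpha, with Bayes factor B = m_g(x) / m_h(x):
E[\alpha \mid x]
  = \frac{\tfrac{1}{3} m_g(x) + \tfrac{1}{6} m_h(x)}{\tfrac{1}{2} m_g(x) + \tfrac{1}{2} m_h(x)}
  = \frac{2B + 1}{3(B + 1)} .

% Solving for the Bayes factor:
B = \frac{3\,E[\alpha \mid x] - 1}{2 - 3\,E[\alpha \mid x]} .
```

Under these assumptions the posterior mean of $\alpha$ lies between $1/3$ and $2/3$, and a value of $1/2$ corresponds to $B = 1$, so an MCMC estimate of $E[\alpha \mid x]$ translates directly into an estimate of the Bayes factor.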