Presentation transcript:

Cases and controls A case is an individual with a disease, whose location can be represented by a point on the map (red dot). A control is a similar individual free from disease, whose location can also be plotted on the map (blue dots). Controls might be children born in the same year as cases, taken from a birth register. In this table we examine male death rates for Scotland in 1999 and standardize by age relative to male death rates in the whole of the United Kingdom for the same period. The data are drawn from the World Health Organization Statistical Information System http://www3.who.int/whosis

Which test to use? Some tests are used when data are only available for cases. Examples include Ripley’s K and the variance-mean ratio. Other tests are used when data are available for both cases and controls, such as Cuzick and Edwards’ test.

Nearest neighbours In a set of points, every point has a nearest neighbour. Nearest neighbours are indicated for these points by the black arrows.

Cuzick and Edwards’ test counts the number of cases that have other cases (not controls) as their nearest neighbours. In this example, cases are shown in red and controls in blue. Case a has a control as its nearest neighbour (blue, scores 0). Cases b and c have other cases as their nearest neighbours (red, each scores 1). The Cuzick and Edwards’ test result is 0 + 1 + 1 = 2.
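The count described on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the coordinates are hypothetical (the slide gives none) and were chosen so that case a sits next to a control while cases b and c sit next to each other. The function names are also made up for this sketch.

```python
import math

def nearest_neighbour(index, points):
    """Return the index of the point closest to points[index]."""
    best, best_dist = None, math.inf
    for j, other in enumerate(points):
        if j == index:
            continue
        d = math.dist(points[index], other)
        if d < best_dist:
            best, best_dist = j, d
    return best

def cuzick_edwards(points, is_case):
    """Count the cases whose nearest neighbour is also a case."""
    return sum(
        1
        for i, case in enumerate(is_case)
        if case and is_case[nearest_neighbour(i, points)]
    )

# Hypothetical layout: cases a, b, c followed by two controls.
points = [(0.0, 0.0), (5.0, 5.0), (5.5, 5.5), (0.5, 0.0), (9.0, 0.0)]
is_case = [True, True, True, False, False]

print(cuzick_edwards(points, is_case))  # a scores 0, b and c score 1 each
```

With this layout the total is 0 + 1 + 1 = 2, matching the arithmetic on the slide.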

For clustered cases the Cuzick and Edwards’ statistic is high, for example a cluster around a pollution source causing respiratory problems. For dispersed cases the test statistic is low, for example smokers distributed amongst non-smokers. If the Cuzick and Edwards’ statistic is higher than would be expected for a randomly distributed data set, then a degree of clustering is present.
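One common way to judge whether the statistic is "higher than random" is to shuffle the case and control labels many times and see how often the shuffled statistic reaches the observed one. The sketch below assumes that approach (random relabelling); the slide does not say which reference distribution is meant. The coordinates, the function names, the number of simulations and the seed are all hypothetical.

```python
import math
import random

def case_nn_count(points, is_case):
    """Count cases whose nearest neighbour is also a case."""
    count = 0
    for i, p in enumerate(points):
        if not is_case[i]:
            continue
        others = (k for k in range(len(points)) if k != i)
        nearest = min(others, key=lambda k: math.dist(p, points[k]))
        if is_case[nearest]:
            count += 1
    return count

def random_labelling_test(points, is_case, n_sim=999, seed=0):
    """Return the observed statistic and a one-sided Monte Carlo p-value."""
    rng = random.Random(seed)
    observed = case_nn_count(points, is_case)
    labels = list(is_case)
    as_extreme = 0
    for _ in range(n_sim):
        rng.shuffle(labels)  # keep the locations, reassign case/control labels
        if case_nn_count(points, labels) >= observed:
            as_extreme += 1
    return observed, (as_extreme + 1) / (n_sim + 1)

# Hypothetical data: four tightly grouped cases and six scattered controls.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (6.0, 1.0), (1.0, 6.0), (8.0, 8.0), (3.0, 9.0), (9.0, 3.0)]
is_case = [True] * 4 + [False] * 6

observed, p_value = random_labelling_test(points, is_case)
print(observed, p_value)
```

A small p-value suggests the observed count is unusually high compared with random labelling of the same locations.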

One problem with these clustering tests is scale: clusters may come in different sizes. Cluster A is relatively small, Cluster B is larger.

Cuzick and Edwards’ test (and other nearest neighbour analyses) can be expanded to consider more points than just the nearest neighbour. For example, the nearest two points to each case might be considered. We could continue and consider ever-increasing numbers of neighbouring points (the three nearest, four nearest, etc.), enabling us to detect clusters of different sizes and addressing the scale problem. [Figure: a case with its nearest and 2nd nearest neighbours labelled.]
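The extension to the k nearest neighbours might look like the sketch below. It assumes the statistic is simply the number of cases found among the k nearest neighbours of each case, summed over all cases. The coordinates and function names are hypothetical.

```python
import math

def k_nearest(index, points, k):
    """Return the indices of the k points closest to points[index]."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: math.dist(points[index], points[j]))
    return others[:k]

def cuzick_edwards_k(points, is_case, k):
    """Sum, over all cases, the number of cases among their k nearest neighbours."""
    total = 0
    for i, case in enumerate(is_case):
        if case:
            total += sum(1 for j in k_nearest(i, points, k) if is_case[j])
    return total

# Hypothetical layout: three cases followed by two controls.
points = [(0.0, 0.0), (5.0, 5.0), (5.5, 5.5), (0.5, 0.0), (9.0, 0.0)]
is_case = [True, True, True, False, False]

for k in (1, 2, 3):
    print(k, cuzick_edwards_k(points, is_case, k))
```

Running the statistic for several values of k gives one value per neighbourhood size, which is how the scale problem is addressed.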

Testing for clustering in case data only A quadrat test can be used to test for clustering in point data of cases. This test is often used in ecology, rather than with health data. In a quadrat test, a grid is superimposed on the study area and the number of points in each grid square is counted. [Figure: grid of four squares containing 6, 2, 3 and 1 points.]
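Counting points per grid square can be sketched as follows. The study area size, the 2 × 2 grid and the coordinates are hypothetical, chosen only so that the four squares hold 6, 2, 3 and 1 points as in the slide's figure. Points are assumed to lie inside the study area.

```python
def quadrat_counts(points, width, height, n_cols, n_rows):
    """Count points in each cell of an n_rows x n_cols grid over the study area.

    Row 0 is the strip with the smallest y values. Points are assumed to lie
    within 0 <= x <= width and 0 <= y <= height.
    """
    counts = [[0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        # Points exactly on the far edge are placed in the last cell.
        col = min(int(x / width * n_cols), n_cols - 1)
        row = min(int(y / height * n_rows), n_rows - 1)
        counts[row][col] += 1
    return counts

# Hypothetical points in a 10 x 10 study area.
points = [
    (1, 6), (2, 7), (1, 8), (3, 9), (4, 6), (2, 5.5),  # top-left: 6
    (6, 7), (8, 9),                                     # top-right: 2
    (1, 1), (3, 2), (2, 4),                             # bottom-left: 3
    (7, 3),                                             # bottom-right: 1
]

counts = quadrat_counts(points, width=10, height=10, n_cols=2, n_rows=2)
print(counts)
```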

The mean count per grid square is then calculated: (6 + 2 + 3 + 1)/4 = 3. So is the sample variance, 14/3 = 4.7, whose square root is the standard deviation: 2.2 (1 decimal place). The ratio of standard deviation to mean = 2.2/3 = 0.7.

For the blue population: Mean = 3, Standard deviation = 0.8, Ratio SD:mean = 0.8/3 = 0.3. The blue population is more evenly distributed and this is reflected in its low quadrat ratio. [Figure: two grids of quadrat counts; 6, 2, 3 and 1 for the first population, and the blue counts visible are 4, 3 and 2.]
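The quadrat figures quoted for the two populations can be checked with Python's `statistics` module. Two assumptions are made here. First, the standard deviation is the sample version (dividing by n - 1), which is the one that reproduces 2.2. Second, the blue grid holds a fourth count of 3; only 4, 3 and 2 appear in the transcript, but a mean of 3 over four squares requires the counts to sum to 12. The function name is made up for this sketch.

```python
import statistics

def quadrat_summary(counts):
    """Return the mean, sample standard deviation and SD:mean ratio of the counts."""
    mean = statistics.mean(counts)
    sd = statistics.stdev(counts)  # sample SD, divides by n - 1
    return mean, sd, sd / mean

first = [6, 2, 3, 1]
blue = [4, 3, 2, 3]  # ASSUMPTION: the last count is inferred from the stated mean of 3

for name, counts in (("first", first), ("blue", blue)):
    mean, sd, ratio = quadrat_summary(counts)
    print(name, mean, round(sd, 1), round(ratio, 1))
```

The test named earlier in the talk is the variance-mean ratio; `statistics.variance(counts) / statistics.mean(counts)` would give that version instead of the SD:mean ratio used on these slides.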

Ripley’s K works by counting the number of nearby cases lying within a certain distance of each case. Each point is considered in turn and a running total is kept of the number of nearby cases. The process continues until neighbouring cases have been identified for all points.
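The running total described here can be sketched as a double loop over the points. The coordinates, the radius and the function name are hypothetical, and no correction is made for points near the edge of the map.

```python
import math

def total_neighbours(points, radius):
    """Visit each point in turn and add up the other points within `radius` of it."""
    total = 0
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i != j and math.dist(p, q) <= radius:
                total += 1
    return total

# Hypothetical points: a group of three, a pair, and one isolated point.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (5, 6), (9, 9)]

print(total_neighbours(points, radius=1.5))  # 2 + 2 + 2 + 1 + 1 + 0
```

Each pair of nearby points is counted twice, once from each end, because every point is visited in turn.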

Ripley’s K statistic is calculated by adding up the total number of neighbours found in all circles (let’s assume here this is 1600). This is divided by the density of points squared and by the study area. In this case, we might have 2 points per km² and a study site with an area of 100 km². K = (total no. of neighbours) / (point density² × study area) = (1 + 2 + 0 + … = 1600) / (2² × 100) = 1600 / 400 = 4. Ripley’s K statistic measures the number of other points you would expect to find within a given radius of an arbitrarily chosen point on the map, divided by the point density. In this example there are 2 × 100 = 200 points, so each has on average 1600 / 200 = 8 neighbours within our chosen radius, and K = 8 / 2 = 4.
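The arithmetic on this slide can be reproduced directly. The sketch below uses the simplified formula from the slide, without edge correction. The point count of 200 is derived from the stated density (2 per km²) and area (100 km²), and the function name is made up.

```python
def ripley_k(total_neighbours, n_points, area):
    """Simplified K: total neighbours / (point density squared * study area)."""
    density = n_points / area
    return total_neighbours / (density ** 2 * area)

# Figures from the slide: 1600 neighbours in total, 2 points per km2, 100 km2.
k = ripley_k(total_neighbours=1600, n_points=200, area=100.0)
print(k)  # 1600 / (2**2 * 100)
```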

A scaled version of this statistic, L, is used to test for clustering. L is calculated by dividing K by pi and taking the square root: L = √(K/π) = √(4/3.1) = 1.1. For a random point pattern L equals the radius of the circles (so L is 1 when the radius is 1); it is greater than the radius for a clustered pattern, and less than the radius for a regular pattern. In this example, taking the radius as 1 km, L = 1.1 is above 1, so our data appear clustered. The formula actually used for K is slightly more complicated than described here. The more complex formula accounts for points at the edge of the map.

The calculation can be repeated for larger and larger distances to detect clusters of different sizes.
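Repeating the calculation over a range of distances might be sketched as below. It uses the simplified K (no edge correction) followed by the scaling L = √(K/π). The points, study area, radii and function name are hypothetical.

```python
import math

def l_function(points, area, radii):
    """Return the scaled statistic L = sqrt(K / pi) for each radius (no edge correction)."""
    density = len(points) / area
    values = []
    for r in radii:
        # Total neighbours within r, visiting every point in turn.
        total = sum(
            1
            for i, p in enumerate(points)
            for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= r
        )
        k = total / (density ** 2 * area)
        values.append(math.sqrt(k / math.pi))
    return values

# Hypothetical points in a 10 x 10 study area.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (5, 6), (9, 9)]
radii = [1.5, 3.0, 6.0]

for r, l_value in zip(radii, l_function(points, area=100.0, radii=radii)):
    print(r, round(l_value, 2))
```

Because the neighbour total can only grow as the radius grows, the values never decrease from one radius to the next.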

We can plot out the value of the Ripley’s K statistic for different sizes of distance radius and compare the graph to what might be expected if cases were located at random. [Figure: L (scaled version of Ripley’s K) plotted against inter-point distance (radius of circles in earlier slides); both axes run to 10.]

This line (y = x) is what we would expect if the points were randomly distributed. A line with Y > X would indicate that the points are clustered together. A line with X > Y would indicate that the points are evenly distributed. [Figure: L (scaled version of Ripley’s K) plotted against inter-point distance (radius of circles in earlier slides).]
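Why y = x is the benchmark for randomness can be seen from a short derivation. It assumes points scattered completely at random with density λ and ignores edge effects; the symbols λ and r (the radius) are introduced here and do not appear on the slides.

```latex
\begin{aligned}
\mathbb{E}[\text{other points within } r \text{ of a given point}] &= \lambda \pi r^{2} \\
K(r) &= \frac{\lambda \pi r^{2}}{\lambda} = \pi r^{2} \\
L(r) &= \sqrt{K(r)/\pi} = r
\end{aligned}
```

So under randomness the scaled statistic equals the radius itself, which is the line y = x on the plot.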

We can test for significance by generating many random point patterns, then calculating and plotting K for these random patterns. In this example, K is high, but not outside the range that could be expected by chance. [Figure: observed values for K plotted with the min and max K values from many random simulations. X = inter-point distance (radius of circles in earlier slides); Y = value of scaled version of Ripley’s K.]
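The simulation approach can be sketched for a single radius as below. Everything specific is hypothetical: the study area, the tightly grouped "observed" points, the radius, the number of simulations, the seed and the function names. The simplified K without edge correction is used throughout. Unlike the example on the slide, this made-up data set falls well outside the simulated range.

```python
import math
import random

def l_value(points, area, radius):
    """Scaled Ripley statistic L = sqrt(K / pi) at one radius (no edge correction)."""
    density = len(points) / area
    total = sum(
        1
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and math.dist(p, q) <= radius
    )
    return math.sqrt(total / (density ** 2 * area) / math.pi)

def simulation_envelope(n_points, width, height, radius, n_sim=99, seed=0):
    """Min and max L over n_sim patterns of uniformly random points."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sim):
        pts = [(rng.uniform(0, width), rng.uniform(0, height))
               for _ in range(n_points)]
        sims.append(l_value(pts, width * height, radius))
    return min(sims), max(sims)

# Hypothetical observed pattern: 20 points packed into a small patch of a 10 x 10 area.
observed_points = [(4 + 0.25 * (i % 4), 4 + 0.2 * (i // 4)) for i in range(20)]

observed = l_value(observed_points, area=100.0, radius=1.5)
low, high = simulation_envelope(n_points=20, width=10.0, height=10.0, radius=1.5)
print(round(observed, 2), round(low, 2), round(high, 2))
print("outside the simulated range" if observed > high else "within the simulated range")
```

Repeating this for a series of radii gives the min and max curves to plot alongside the observed curve.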

In summary, we can divide our global clustering tests into those that address just case data, or those that address both case and control data. There are many examples of each, from which we have examined just three.

Global clustering tests
Case data: Ripley’s K, Variance-mean ratio, …many other tests
Case and control data: Cuzick and Edwards’ test, …many other tests