A Tree-Based Scan Statistic for Database Disease Surveillance Martin Kulldorff University of Connecticut Joint work with: Zixing Fang, Stephen Walsh.

Database Disease Surveillance In what occupations is there an excess risk of dying from a particular disease? Are there pharmaceutical drugs that cause certain adverse effects?

Nested Variables
inhalation therapists ⊂ therapists ⊂ health occupations ⊂ professional occupations
Ecotrin ⊂ aspirin ⊂ nonsteroidal anti-inflammatory drugs ⊂ analgesic drugs

Occupational Multiple Cause of Death Database National Center for Health Statistics Based on Death Certificates Occupational Classification System Selected States

Occupational Multiple Cause of Death Database Time period: Age groups:  25 years Total deaths: 2,114,832 Silicosis deaths: 405

Occupational Classification System A hierarchical structure of occupations created by the United States Bureau of the Census. Number of occupational groups at each level: Level:

A Small Three-Level Tree Variable [Figure: a tree with a root node, branches, and leaves; the leaves are Farmers, Cowboys, Hunters, Teachers, and Clerks]

Occupational Classification System
Managerial and Professional Specialty Occupations
  Professional Specialty Occupations
    Mathematical and Computer Scientists
      Computer Systems Analysts and Scientists (064)
      Operations and Systems Researchers and Analysts (065)
      Actuaries (066)
      Statisticians (067)
      Mathematical Scientists, n.e.c. (068)
    Natural Scientists
      Medical Scientists (083), etc.
    Health Diagnosing Occupations
      Physicians (084), etc.
    Health Assessment and Treatment Occupations
      Therapists ( ), etc.

Silicosis A rare disease of the lung Chronic shortness of breath Caused by dust containing crystalline silica (quartz) particles No known cure

Silicosis Described by Agricola in 1556: ‘In the Carpathian mines, women are found who have married seven husbands, all of whom this terrible consumption has carried away.’ Agricola G. (1556). De Re Metallica. Basel: Froben and Episcopius.

Proportional Mortality (PM) N = Total number of deaths (2,114,832) C = Total number of silicosis deaths (405) n = Number of deaths among farmers (266,715) c = Farmers dying from silicosis (12) All: C/N = 405/2,114,832 ≈ 0.00019 Farmers: c/n = 12/266,715 ≈ 0.000045
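The two proportions on this slide follow directly from the four counts. A minimal sketch of the arithmetic, using only the numbers given above (the variable names `pm_all` and `pm_farmers` are illustrative, not from the slides):

```python
# Counts taken from the slide.
N = 2_114_832  # total number of deaths
C = 405        # total number of silicosis deaths
n = 266_715    # number of deaths among farmers
c = 12         # farmers dying from silicosis

# Proportional mortality: the share of all deaths in a group that are due to the disease.
pm_all = C / N
pm_farmers = c / n

print(f"All:     C/N = {pm_all:.6f}")
print(f"Farmers: c/n = {pm_farmers:.6f}")
```

Farmers therefore have a lower share of silicosis deaths than the database as a whole.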

Proportional Mortality Ratio (PMR) N = Total number of deaths (2,114,832) C = Total number of silicosis deaths (405) n = Number of deaths among farmers (266,715) c = Farmers dying from silicosis (12) Farmers: PMR = [c/n] / [(C-c)/(N-n)] = 0.23

Standardized Proportional Mortality Ratio (SPMR) The same thing as proportional mortality ratio but adjusted for covariates. Adjusted for age and gender, for silicosis among farmers we have: SPMR = 0.29

Analysis Options Evaluate each of the 503 occupational groups, using a Bonferroni-type adjustment for multiple testing. Use a higher group level, such as level 3 with 86 occupational groups. Substantive Problem: We do not know whether the disease relationships affect a smaller or larger group.
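To make the Bonferroni-type option concrete, the sketch below computes the per-test significance threshold for the two group levels mentioned above. The overall level of 0.05 is an assumption for illustration; the slides do not state one, and `bonferroni_threshold` is a hypothetical helper name.

```python
ALPHA = 0.05  # assumed overall significance level (not given in the slides)

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the overall error rate at or below alpha."""
    return alpha / n_tests

# 503 detailed occupational groups versus 86 groups at level 3.
for n_groups in (503, 86):
    print(f"{n_groups} groups: reject only if p < {bonferroni_threshold(ALPHA, n_groups):.2e}")
```

The threshold says nothing about which level of the hierarchy the excess risk acts on, which is the substantive problem noted on the slide.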

Analysis Options Take the 503 occupations as a base, and evaluate all 2^503 = 2.6 × 10^151 combinations. Problems: Computational, Statistical, Substantive
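The size of this search space is easy to verify, since every occupation is either in or out of a candidate group. A short check (variable names are illustrative):

```python
n_occupations = 503

# Each occupation is either included in or excluded from a candidate group.
n_subsets = 2 ** n_occupations

print(f"{n_subsets:.1e}")   # scientific notation
print(len(str(n_subsets)))  # number of decimal digits
```

A number with 152 digits rules out exhaustive evaluation, which is the computational problem named on the slide.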

Ideal Analytical Solution Use the Hierarchical Tree Evaluate Cuts on that Tree

A Small Three-Level Tree Variable [Figure: a tree with leaves Farmers, Cowboys, Hunters, Teachers, and Clerks, with a cut marked]

Problem How do we deal with the multiple testing?

Proposed Solution Tree-Based Scan Statistic

One-Dimensional Scan Statistic Studied by Naus (JASA, 1965)

Other Scan Statistics Spatial scan statistics using circles or squares. Space-time scan statistics using cylinders. Variable size window, using maximum likelihood rather than counts. Applied for geographical and temporal disease surveillance, and in many other fields.

Tree-Based Scan Statistic H0: The probability of dying from silicosis is the same for all occupations. HA: There is at least one group of occupations (cut) for which the probability is higher.

Tree-Based Scan Statistic
1. Scan the tree by considering all possible cuts on any branch.
2. For each cut, calculate the likelihood.
3. Denote the cut with the maximum likelihood as the most likely cut (cluster).
4. Generate 9999 Monte Carlo replications under H0.
5. Compare the most likely cut from the real data set with the most likely cuts from the random data sets.
6. If the rank of the most likely cut from the real data set is R, then the p-value for that cut is R/(9999+1).
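The six steps above can be sketched on a toy tree. Everything specific here is an assumption for illustration: the internal nodes `outdoor` and `indoor`, all counts, and the choice of a Poisson log likelihood ratio conditioned on the total number of cases (the slides only say "calculate the likelihood"). The demo uses 999 replications to run quickly; the slides use 9999.

```python
import math
import random

# Hypothetical three-level tree: internal node -> children. Leaves are occupations.
TREE = {
    "root": ["outdoor", "indoor"],
    "outdoor": ["Farmers", "Cowboys", "Hunters"],
    "indoor": ["Teachers", "Clerks"],
}
# Hypothetical data per leaf: total deaths and deaths from the disease of interest.
DEATHS = {"Farmers": 4000, "Cowboys": 1000, "Hunters": 1000, "Teachers": 2000, "Clerks": 2000}
CASES = {"Farmers": 6, "Cowboys": 14, "Hunters": 4, "Teachers": 3, "Clerks": 3}

def leaves_under(node):
    """Return all leaves below a node (a leaf returns itself)."""
    children = TREE.get(node)
    if children is None:
        return [node]
    return [leaf for child in children for leaf in leaves_under(child)]

# Step 1: every non-root node defines one simple cut (the branch leading to it).
CUTS = {node: leaves_under(node) for kids in TREE.values() for node in kids}

def log_likelihood_ratio(c, mu, total):
    """Assumed Poisson LLR for c observed vs mu expected cases; 0 unless there is an excess."""
    if c <= mu:
        return 0.0
    value = c * math.log(c / mu)
    if c < total:
        value += (total - c) * math.log((total - c) / (total - mu))
    return value

def most_likely_cut(cases):
    """Steps 2-3: score every cut and return (best score, best cut)."""
    total_cases = sum(cases.values())
    total_deaths = sum(DEATHS.values())
    best_score, best_cut = 0.0, None
    for node, leaves in CUTS.items():
        c = sum(cases[leaf] for leaf in leaves)
        mu = total_cases * sum(DEATHS[leaf] for leaf in leaves) / total_deaths
        score = log_likelihood_ratio(c, mu, total_cases)
        if score > best_score:
            best_score, best_cut = score, node
    return best_score, best_cut

def monte_carlo_p_value(cases, replications=9999, seed=0):
    """Steps 4-6: rank the observed maximum among maxima simulated under H0."""
    rng = random.Random(seed)
    observed_score, cut = most_likely_cut(cases)
    leaves = list(DEATHS)
    weights = [DEATHS[leaf] for leaf in leaves]
    total_cases = sum(cases.values())
    rank = 1  # the real data set counts as one of the ranked data sets
    for _ in range(replications):
        # Under H0 each case falls in a leaf with probability proportional to its deaths.
        simulated = dict.fromkeys(leaves, 0)
        for leaf in rng.choices(leaves, weights=weights, k=total_cases):
            simulated[leaf] += 1
        if most_likely_cut(simulated)[0] >= observed_score:
            rank += 1
    return cut, observed_score, rank / (replications + 1)

cut, score, p_value = monte_carlo_p_value(CASES, replications=999)
print(f"Most likely cut: {cut}, LLR = {score:.2f}, p = {p_value:.4f}")
```

Because the observed maximum is compared with the maximum over all cuts in each random data set, the resulting p-value is already adjusted for the multiple cuts evaluated.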

Result: Most Likely Cut Occupations: Mining machine operators Observed: 56, Expected: 5.5 SPMR = 11.8, p=0.0001

Result: Second Most Likely Cut Occupations: Molding and casting machine operators, Metal plating machine operators, Heat treating equipment operators, Misc. metal and plastic machine operators Observed: 22, Expected: 1.2 SPMR = 20.5, p=0.0001

Result: Ninth Most Likely Cut Occupation: Heavy equipment mechanics Observed: 5, Expected: 1.0 SPMR = 4.8, p=0.72

Extension to Complex Cuts Consider a node with 4 branches: A, B, C, D.
Simple cuts: [A], [B], [C], [D]
Combinatorial cuts: [A], [B], [C], [D], [AB], [AC], [AD], [BC], [BD], [CD], [ABC], [ABD], [ACD], [BCD]
Ordinal cuts: [A], [B], [C], [D], [AB], [BC], [CD], [ABC], [BCD]
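The three cut types above can be enumerated mechanically for any node. A minimal sketch; the function names are illustrative and not taken from the slides:

```python
from itertools import combinations

def simple_cuts(branches):
    """One cut per branch."""
    return [(b,) for b in branches]

def combinatorial_cuts(branches):
    """Every non-empty proper subset of the branches."""
    return [
        combo
        for size in range(1, len(branches))
        for combo in combinations(branches, size)
    ]

def ordinal_cuts(branches):
    """Every run of consecutive branches, excluding the full set."""
    n = len(branches)
    return [
        tuple(branches[start:start + size])
        for size in range(1, n)
        for start in range(n - size + 1)
    ]

branches = ["A", "B", "C", "D"]
for name, cuts in [
    ("Simple", simple_cuts(branches)),
    ("Combinatorial", combinatorial_cuts(branches)),
    ("Ordinal", ordinal_cuts(branches)),
]:
    print(f"{name} ({len(cuts)}):", " ".join("[" + "".join(cut) + "]" for cut in cuts))
```

For four branches this gives 4 simple, 14 combinatorial, and 9 ordinal cuts, matching the lists on the slide. Ordinal cuts only make sense when the branches have a natural order.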

Result: Most Likely Cut Occupations: Mining machine operators, Mining occupations n.e.c. Observed: 59, Expected: 6.0 SPMR = 11.5, p=0.0001

Extension to Multiple Trees There may not be one unique suitable tree. It is trivial to extend the method to multiple trees, by simply scanning over all trees.

Result: Most Likely Cut Occupations: Mining machine operators, Mining engineers, Mining occupations n.e.c. Observed: 60, Expected: 6.0 SPMR = 11.6, p=0.0001

Evaluated Combinations Simple cuts: ~1,000 Mixed cuts: ~1,000,000 Two trees: ~1,000,000

Comparison with Classification and Regression Trees (CART) Similarity: The letters ‘T’, ‘R’, ‘E’ and ‘E’. Both are Data Mining Methods

Difference CART: There are multiple continuous or categorical variables, and a regression tree is constructed by making a hierarchical set of splits in the multidimensional space of the independent variables. Tree-Based Scan Statistic: There may be only one independent variable (e.g. occupation). Rather than using this as a continuous or categorical variable, it is defined as a tree-structured variable. That is, we are not trying to estimate the tree, but to use the tree as a new and different type of variable.

Conclusions The tree-based scan statistic is a useful data mining tool when we want to know whether a detected ‘cluster’ is due to chance or not, adjusting for the multiple testing of all possible cluster locations considered. It requires a variable that is suitably expressed in a tree structure, although the method may be extended to other structures as well.

Conclusions There are many other potential application areas, such as pharmacovigilance, where one is interested in detecting unsuspected adverse drug effects. Extensions can be made to tree-structured dependent variables, and to multiple tree-structured independent variables.