Introduction to Biostatistics (ZJU 2008) Wenjiang Fu, Ph.D Associate Professor Division of Biostatistics, Department of Epidemiology Michigan State University.


Introduction to Biostatistics (ZJU 2008) Wenjiang Fu, Ph.D Associate Professor Division of Biostatistics, Department of Epidemiology Michigan State University East Lansing, Michigan 48824, USA www:

Introduction
Biostatistics? Why do we need to study biostatistics? A test for myself!
Statistics: a data science that helps to decipher data collected on many kinds of events, using probability theory and statistical principles with the help of computers.
Statistics: Theoretical; Applied (Biostatistics, Economics, Finance, Engineering, Sports, …)
Data:
  Events: party, disease, accident, award, game …
  Subjects: human, animal …
  Characteristics: sex, race, age, weight, height …

Statistics
Most commonly, statistics refers to numerical data or other data. Statistics may also refer to the process of collecting, organizing, presenting, analyzing and interpreting data for the purpose of making inferences, decisions and policy, and assisting scientific discoveries.
[Diagram labels: population, parameter, sampling, sample, statistic, descriptive statistics, inferential statistics (estimation, hypothesis testing, prediction), frequency, probability.]

Grand challenges we are facing …
[Diagram labels: “Data”, Knowledge & Information, Decision, Statistics.]
The 21st century will be the golden age of statistics!

Grand challenges we are facing …
1. Data collection technology has advanced dramatically, but without sufficient statistical sampling design and experimental design.
2. Advancement of technology for discovering and retrieving useful information has been lagging and has become the bottleneck.
3. More sophisticated approaches are needed for decision making and risk management.

Statistical Challenges - Massive Amount of Data

Statistical Challenges – Image Data

Statistics in Science
Cosmic microwave background radiation
High energy physics
Tick-by-tick stock data
Genomic/proteomic data

Statistics in Science Finger Prints Microarray

What do we do?
New ways of thinking and attacking problems
Finding sub-optimal but computationally feasible solutions.
New paradigms for new types of data
Be satisfied with ‘very rough’ approximations
Turn research results into easy-to-use and publicly available software and programs
Join forces with computer scientists.

Some ‘hot’ research directions
Dimension reduction
Visualization
Dynamic systems
Simulation and real time computation
Uncertainty and risk management
Interdisciplinary research

Reasons to Study Biostatistics I
Biostatistics is everywhere around us:
Our life: entertainment, sports games, shopping, parties, communication (cell phone), travel …
Our work: career, business, school …
Our health: food, weather, disease …
Our environment: safety, security, chemicals, animals …
Our well-being: physical examination, hospital, being happy, longevity.

Reasons to Study Biostatistics I
Entertainment (party: music / dance / food)
  Alcohol, cigarettes, drugs, etc.
Sports games
  Car racing, skiing (time to event: survival analysis).
Shopping (different tastes / preferences)
  Allergy to certain foods / smells: peanuts, flowers …
Communication (cell phone use)
  Potential hazard that leads to health problems (CA …)
Travel: infectious diseases, safety, accidents …

Reasons to Study Biostatistics II
We care about our society, our family, our environment, our school, scientific research …
Major impact on society and communities.
  Disease transmission
  Healthcare benefit, health economics
  Quality of life (research, health improvement)
  Safety issues (outbreaks of diseases, etc.)
The job market is very promising.
Applications in a wide range of areas.
  Healthcare, quality of life
  Career (job market): scientific, public or private, industrial …

Reasons to Study Biostatistics III
Biostatistics research and applications
Major employers in the US:
  Research universities, hospitals, institutes (NIH), CDC, DoD, NASA, pharmaceutical industry, biotech industry, banks and other data warehouses …
Major universities with biostatistics departments in the US:
  Harvard U, U. Michigan, U. Washington (Seattle), UC (Berkeley, LA, SF), JHU, Yale U, Stanford U …

Reasons to Study Biostatistics IV
New biostatistics research areas (still growing)
  Medical research.
Recent trends in employment
  Private industry: Google, Microsoft …
  Affymetrix, Illumina, Agilent, Golden Helix, 23andMe …
  Investment: stock market, Capital One, Bank of America, Goldman Sachs, etc.
  Nanotech, green energy (alternative energy) …

Example 1. Medical study data: Ob/Gyn Modeling of PlGF: Placental Growth Factor

Example 2. Genomics study: Single Nucleotide Polymorphism (SNP)
Homologous pairs of chromosomes
Paternal allele: ACGAACAGCT / TGCTTGTCGA
Maternal allele: ACGAGCAGCT / TGCTCGTCGA
SNP: A/G

Computational Genomics: SNP Genotype
Error rate: around 5%
Genome-wide association studies: millions of SNPs

Applications
Genetic counseling:
  gene expression + family medical history → disease
  Breast cancer (BRCA) …
Achieve accurate estimation and prediction
  Early detection / early treatment (cancer, …)
  Accurate diagnosis (HIV+)
Help the development of new drugs for treatment.
Help to protect the environment, live longer and happier, and improve quality of life.

Did I pass my test?
I hope I have convinced you to study biostatistics.

Chapter 2. Descriptive Statistics
The first important thing to do is to visualize the data.
Plot of data
Scatter plot: pair-wise (var 1 vs. var 2)

Scatter plot

Descriptive Statistics
Summarize data using statistics:
  Central location (mean, median)
  Range (min, max)
  Variability (variance, standard deviation)
  Mode
  Quantiles (percentiles)
Rank data, but avoid long listings (use grouping instead)

Measure of Location: Mean
The mean is the sum of all the observations divided by the number of observations.
Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$, where $N$ is the number of observations in the population.
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, where $n$ is the number of observations in the sample.

Properties of the mean
The mean is the most widely used measure of location. It is oversensitive to extreme values in the sample.
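The definition and the sensitivity to extreme values can be illustrated with a minimal sketch. The readings below are hypothetical and are not taken from the slides; `mean` is a helper written only for this illustration.

```python
def mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)

# Hypothetical sample (an assumption for illustration, not course data).
sample = [118, 120, 122, 124, 126]
sample_with_outlier = sample + [210]  # add one extreme value

mean_clean = mean(sample)
mean_outlier = mean(sample_with_outlier)

print(mean_clean)              # 122.0
print(round(mean_outlier, 2))  # 136.67
```

A single extreme observation moves the mean by almost 15 units here, even though the other five values are unchanged.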

Translation of data

Measure of Location: Median and Mode
The median is the value of the “middle” point of the sample when the observations are arranged in ascending order.
Median = the [(n+1)/2]-th largest observation if n is odd;
       = the average of the (n/2)-th and (n/2+1)-th largest observations if n is even.
The mode is the most frequently occurring value among all the observations in a sample. It is the most probable value that would be obtained if one data point is selected at random from a population.

Example: Median and Mode
Calculate the median and mode of the following data: 12, 24, 36, 25, 17, 19, 24, 11
Sorted data: 11, 12, 17, 19, 24, 24, 25, 36
Median = (19 + 24)/2 = 21.5 (n = 8 is even, so average the 4th and 5th observations)
Mode = 24
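The odd/even rule for the median and the definition of the mode can be checked with a short sketch on the example data. The function names `median` and `modes` are chosen for this illustration only; `modes` returns a list because a sample can have more than one most frequent value.

```python
from collections import Counter

def median(values):
    """Median by the odd/even rule, using the ascending sorted order."""
    xs = sorted(values)
    n = len(xs)
    if n % 2 == 1:
        return xs[(n + 1) // 2 - 1]            # the (n+1)/2-th observation (1-based)
    return (xs[n // 2 - 1] + xs[n // 2]) / 2   # average of the n/2-th and (n/2+1)-th

def modes(values):
    """All values that occur most frequently."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

data = [12, 24, 36, 25, 17, 19, 24, 11]
print(median(data))  # 21.5
print(modes(data))   # [24]
```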

[Figure: ordering relations (≤, =) among the mean, median, and mode; distributions with one mode and bimodal distributions.]
The mean is influenced by outliers while the median is not.
The mode is very unstable. Minor fluctuations in the data can change it substantially; for this reason it is seldom calculated.

Symmetry and Skewness in Distribution
When the shapes of a distribution to the left and to the right are mirror images of each other, the distribution is symmetrical.
A skewed distribution is a distribution that is not symmetrical.
[Figures: examples of symmetrical distributions; examples of skewed distributions (positively skewed, negatively skewed).]

Measure of Dispersion: Range and Mean Absolute Deviation (MAD)
The range is the simplest measure of dispersion. It is simply the difference between the largest and smallest observations in a sample.
The mean absolute deviation is the average of the absolute values of the deviations of individual observations from the mean.
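Both measures follow directly from their verbal definitions. The sketch below uses a small hypothetical sample (not from the slides) and helper names invented for this illustration.

```python
def data_range(values):
    """Difference between the largest and smallest observations."""
    return max(values) - min(values)

def mean_absolute_deviation(values):
    """Average absolute deviation of the observations from their mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

# Hypothetical sample chosen so the arithmetic is easy to follow by hand.
sample = [2, 4, 6, 8, 10]

print(data_range(sample))               # 8
print(mean_absolute_deviation(sample))  # 2.4
```

For this sample the mean is 6, the absolute deviations are 4, 2, 0, 2, 4, and their average is 12/5 = 2.4.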

Measure of Dispersion: Quantiles or Percentiles
Quantile (percentile) is the general term for a value at or below which a stated proportion (p/100) of the data in a distribution lies.
Quartiles: proportions 0.25, 0.50, 0.75 (p = 25, 50, 75)
Quantile / percentile: the proportion p/100 can be any probability value

Calculating Quantiles or Percentiles
Let [k] denote the largest integer ≤ k. For example, [3] = 3, [4.7] = 4. The p-th percentile is defined as follows:
Find k = np/100.
If k is an integer, the p-th percentile is the mean of the k-th and (k+1)-th observations (in the ascending sorted order).
If k is NOT an integer, the p-th percentile is the ([k]+1)-th observation.

Example
Calculate the 10th percentile and the 75th percentile of the following data: 7, 12, 16, 2, 8, 4, 20, 14, 19, 17
Sorted data: 2, 4, 7, 8, 12, 14, 16, 17, 19, 20 (n = 10)
10th percentile: k = np/100 = 10×10/100 = 1, so take the average of the 1st and 2nd observations = (2+4)/2 = 3
75th percentile: k = np/100 = 10×75/100 = 7.5, so take the ([7.5]+1) = 7+1 = 8th observation = 17
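The percentile rule used in this example can be written as a short function. This is a sketch of that specific rule, not of the interpolation methods that statistical packages use by default, so library results may differ. The name `percentile` and the restriction to 0 < p < 100 are choices made for this illustration.

```python
import math

def percentile(values, p):
    """p-th percentile by the rule: k = n*p/100; if k is an integer, average
    the k-th and (k+1)-th sorted observations; otherwise take the ([k]+1)-th."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    xs = sorted(values)
    n = len(xs)
    k = n * p / 100
    if k == int(k):
        k = int(k)
        return (xs[k - 1] + xs[k]) / 2   # k-th and (k+1)-th, converted to 0-based
    return xs[math.floor(k)]             # ([k]+1)-th observation is index [k]

data = [7, 12, 16, 2, 8, 4, 20, 14, 19, 17]
print(percentile(data, 10))  # 3.0
print(percentile(data, 75))  # 17
```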

Measure of Dispersion: Variance and Standard Deviation
The variance is a measure of how spread out a distribution is. It is computed as the average squared deviation of each observation from the mean. The standard deviation is the square root of the variance. It is the most commonly used measure of spread.
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the sample mean and $n$ is the number of observations in the sample.
Sample standard deviation: $s = \sqrt{s^2}$

Example
Five people have their body mass index (BMI) calculated as [body weight (kg)] / [height (m)]^2: 18, 20, 22, 25, 24
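The slide gives the five BMI values without the worked result. A minimal sketch of the calculation follows, assuming the usual sample variance with divisor n - 1; the helper names are invented for this illustration.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1 (assumed divisor)."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

bmi = [18, 20, 22, 25, 24]
bmi_mean = sum(bmi) / len(bmi)

print(round(bmi_mean, 2))              # 21.8
print(round(sample_variance(bmi), 2))  # 8.2
print(round(sample_sd(bmi), 2))        # 2.86
```

By hand: the deviations from 21.8 are -3.8, -1.8, 0.2, 3.2, 2.2; their squares sum to 32.8, and 32.8/4 = 8.2.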

Relative Dispersion: Coefficient of Variation
A direct comparison of two or more measures of dispersion may be difficult because of differences in their means. A relative dispersion is the amount of variability in a distribution relative to a reference point or benchmark. A common measure of relative dispersion is the coefficient of variation (CV).
This measure remains the same regardless of the units used when only scaling applies. Very useful!
Good example: weight, kg versus lb.
Bad example: temperature, °C vs. °F.
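The good and bad examples can be verified numerically. The sketch below assumes the common definition CV = 100% × (sample standard deviation) / (sample mean), which the slide does not spell out, and uses hypothetical weights and temperatures.

```python
import math

def coefficient_of_variation(values):
    """CV in percent: 100 * sample SD / sample mean (assumed definition)."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in values) / (n - 1))
    return 100 * sd / m

# Hypothetical data, for illustration only.
weights_kg = [60.0, 70.0, 80.0]
weights_lb = [w * 2.20462 for w in weights_kg]   # pure scaling
temps_c = [36.5, 37.0, 37.5]
temps_f = [c * 9 / 5 + 32 for c in temps_c]      # scaling plus a shift

cv_kg = coefficient_of_variation(weights_kg)
cv_lb = coefficient_of_variation(weights_lb)
cv_c = coefficient_of_variation(temps_c)
cv_f = coefficient_of_variation(temps_f)

print(math.isclose(cv_kg, cv_lb))  # True: scaling leaves the CV unchanged
print(math.isclose(cv_c, cv_f))    # False: the added 32 changes the mean but not the SD ratio
```

Converting kg to lb multiplies both the standard deviation and the mean by the same factor, so the ratio is unchanged. Converting °C to °F also adds a constant, which shifts the mean without a matching change in the standard deviation.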

Frequency Distribution
A long list of collected data can be confusing; the data need to be grouped into intervals of moderate width rather than listed as raw data points.
[Table: Hospital Length of Stay (LOS)]

A summary table works better than raw data.
[Table columns: LOS interval, Frequency, Relative Frequency]
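Since the original table values did not survive, the grouping step can be sketched with hypothetical length-of-stay data. The interval width of 5 days, the data, and the function name `frequency_table` are all assumptions made for this illustration.

```python
from collections import Counter

def frequency_table(values, width):
    """Group positive integers into intervals 1..width, width+1..2*width, ...
    and return (interval label, frequency, relative frequency) rows."""
    counts = Counter((v - 1) // width for v in values)
    n = len(values)
    rows = []
    for b in range(min(counts), max(counts) + 1):
        lo, hi = b * width + 1, (b + 1) * width
        freq = counts.get(b, 0)          # empty intervals are kept with frequency 0
        rows.append((f"{lo}-{hi}", freq, freq / n))
    return rows

# Hypothetical lengths of stay in days (not the data from the slide).
los_days = [3, 5, 7, 9, 11, 14, 5, 8, 4, 30, 11, 9, 4, 3, 8, 17, 6, 12, 7, 9]

table = frequency_table(los_days, width=5)
for label, freq, rel in table:
    print(f"{label:>6}  {freq:>3}  {rel:.2f}")
```

The relative frequencies always sum to 1, which is a quick check that no observation was dropped during grouping.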

Graphic Methods: Bar Graph
A bar graph is simply a bar chart of data that has been classified into a frequency distribution. The attractive feature of a bar graph is that it allows us to quickly see where most of the observations are concentrated.
[Table columns: LOS interval, Frequency]

Graphic Methods: Histogram
A histogram provides a distribution plot in which the bars are not necessarily of the same width. The area of each bar is proportional to the percentage of data points within the bar, so the height of each bar represents the density of the data.
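The relation between bar area and percentage can be made concrete with a small sketch. It assumes that bar height is relative frequency divided by interval width, so that area equals relative frequency; the data and the unequal interval edges are hypothetical.

```python
def histogram_bars(values, edges):
    """For each interval [lo, hi), return (lo, hi, relative frequency, height),
    where height = relative frequency / width so that area = relative frequency."""
    n = len(values)
    bars = []
    for lo, hi in zip(edges, edges[1:]):
        count = sum(1 for v in values if lo <= v < hi)
        rel = count / n
        bars.append((lo, hi, rel, rel / (hi - lo)))
    return bars

# Hypothetical lengths of stay in days and unequal interval edges.
los_days = [3, 5, 7, 9, 11, 14, 5, 8, 4, 30, 11, 9, 4, 3, 8, 17, 6, 12, 7, 9]
edges = [0, 5, 10, 20, 40]

bars = histogram_bars(los_days, edges)
total_area = sum(height * (hi - lo) for lo, hi, _, height in bars)

for lo, hi, rel, height in bars:
    print(f"[{lo}, {hi})  rel={rel:.2f}  height={height:.4f}")
print(round(total_area, 6))  # 1.0
```

With unequal widths, comparing heights alone would be misleading: the widest interval here holds 5% of the data but has a very low bar because that 5% is spread over 20 days.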

Graphic Methods: Box Plot
The box plot is a summary plot based on the median and the interquartile range (IQR), which contains 50% of the values. Whiskers extend from the box to the highest and lowest values, excluding outliers. A line across the box indicates the median.
[Figure labels: MIN, MAX]
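The quantities a box plot displays can be computed as follows. The slide does not say how outliers are identified; this sketch assumes the common 1.5 × IQR fence rule, and it computes quartiles with the percentile rule from the earlier slide. The value 45 is a hypothetical observation appended to the earlier example data to produce one outlier.

```python
import math

def percentile(values, p):
    """Percentile rule from the slides: k = n*p/100, average the k-th and
    (k+1)-th observations if k is an integer, else take the ([k]+1)-th."""
    xs = sorted(values)
    k = len(xs) * p / 100
    if k == int(k):
        k = int(k)
        return (xs[k - 1] + xs[k]) / 2
    return xs[math.floor(k)]

def box_plot_summary(values):
    q1, med, q3 = (percentile(values, p) for p in (25, 50, 75))
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed outlier rule
    inside = [v for v in values if lo_fence <= v <= hi_fence]
    return {
        "q1": q1, "median": med, "q3": q3, "iqr": iqr,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < lo_fence or v > hi_fence),
    }

data = [7, 12, 16, 2, 8, 4, 20, 14, 19, 17, 45]  # 45 is hypothetical
summary = box_plot_summary(data)
print(summary)
```

For these 11 values the box runs from 7 to 19 with the median line at 14, the whiskers reach 2 and 20, and 45 is flagged as an outlier because it lies above 19 + 1.5 × 12 = 37.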