Biostat 200 Introduction to Biostatistics. Lecture 1.

Presentation transcript:

Biostat 200 Introduction to Biostatistics

Lecture 1

Course instructors
Course director: Judy Hahn, M.A., Ph.D. Phone: (415) Office: 50 Beale St., Suite 1300
TAs:
– Jeff Edwards, MD Phone:
– Christine Fox, MD Phone:
– Vicky Keoleian, BA Phone:
– Karen Ordovas, MD Phone:

Lectures: Tuesdays 10:30-12:30
Labs: Thursdays 10:30-12
– Lab 1: Room CB 6702
– Lab 2: Room CB 6704
Office hours: Thursdays 12-1, Room CB 6704
Course credits: 3
Please bring your laptop to class

Readings
– Required readings will be from Principles of Biostatistics by M. Pagano and K. Gauvreau. Duxbury. 2nd edition.
– Please read the assigned chapters before lecture, and review them after lecture
– Lectures will closely follow book chapters

Assignments
There will be 8 assignments
Assignments will be due weekly on Thursdays starting 10/6
Each assignment will be posted at least one week before it is due
Assignments will be due at the start of lab
Assignment schedule in the syllabus: ucsf.org/ticr/syllabus/display.asp?academic_year= &courseid=54
Assignments will consist of:
– Data analysis and interpretation
– Exercises in the book
– Reading and interpretation of scientific publications

Assignments
– Lab 1: Last name A-L. TAs: Jeff Edwards and Vicky Keoleian. Room 6702. Send assignments to:
– Lab 2: Last name M-Z. TAs: Christine Fox and Karen Ordovas. Room 6704. Send assignments to:

Labs
Labs will be every Thursday 10:30-12
No lab 9/29 and 11/24
Labs will include
– A review of lecture material
– A review of the assignment due that day
– Time to ask questions about the next assignment

Blog
Please send your questions here
The TAs and I will check it daily
You can also email me or the TAs if you want to set up an appointment, etc.

Grading
Homework (70%)
– 8 assignments
Late assignments will not be graded
– You will earn 60% credit if complete
Extra credit opportunities may arise
Final exam (30%)

TICR Professional Conduct Statement
Clarifications for this class
– I will maintain the highest standards of academic honesty.
– I am allowed to collaborate with my classmates on assignments; however, I will work through each problem myself and turn in my own work (no cutting and pasting from others).
– I will neither give nor receive help from other students on the final examination.
– I will not use questions or answer keys from prior years.

What I do and why

Course goals
– Knowledge of basic biostatistics terms and notation
– Understanding of concepts underlying all statistical analyses, as a foundation for more advanced analyses
– Ability to summarize data and conduct basic statistical analyses using Stata
– Ability to understand basic statistical analyses in published journals

Have you read a journal article that reports p-values or 95% confidence intervals?
Do you have a data set or are you in the process of collecting your own data?
Have you calculated a p-value or a 95% confidence interval?

Today’s topics
– Variables: numerical versus categorical
– Tables (frequencies)
– Graphs (histograms, box plots, scatter plots, line graphs)

Types of variables
Variables are what you are measuring
Data sets are made up of a set of variables

Types of variables
Categorical variable: any variable that is not numerical (values have no numerical meaning)
Examples: gender, race, drug, disease status

Types of variables
Categorical variables
– Nominal variables: The data are unordered
For example: RACE: 1=Caucasian, 2=Asian American, 3=African American
A subset of these variables are binary or dichotomous variables
– Binary variables have only two categories
– For example: GENDER: 1=male, 2=female
– Most common example: 0=No, 1=Yes

Types of variables
Categorical variables
– Nominal variables: The data are unordered
– Ordinal variables: The data are ordered
For example: AGE: 1=10-19 years, 2=20-29 years, 3= years
For example: Likelihood of participating in a vaccine trial: 1=Not at all likely, 2=Somewhat likely, 3=Very likely
Pagano and Gauvreau, Chapter 2

Types of variables
Numerical (quantitative) variables: naturally measured as numbers for which arithmetic operations are meaningful (e.g. height, weight, age, salary, viral load, CD4 cell counts)
– Discrete variables: can be counted (e.g. number of children in household: 0, 1, 2, 3, etc.) but fractions do not make sense
– Continuous variables: can take any value within a given range (e.g. weight: g, g)
Pagano and Gauvreau, Chapter 2

Grey zone
Dichotomous variables coded 0=No, 1=Yes
Doing arithmetic operations on them actually does make sense
If you take the mean of the 0’s and 1’s you get the proportion of "yes" responses
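The point about 0/1 coding can be checked with a few lines of code. This is a minimal Python sketch (the lecture itself uses Stata); the `responses` list is made-up illustration data, not from the course data set.

```python
# Hypothetical 0/1 responses (0 = No, 1 = Yes); not data from the lecture.
responses = [1, 0, 0, 1, 1, 0, 1, 1]

# Arithmetic mean of the 0/1 codes.
mean_value = sum(responses) / len(responses)

# Proportion of "yes" answers, counted directly.
proportion_yes = responses.count(1) / len(responses)

print(mean_value, proportion_yes)
```

Both quantities are the same number because summing the codes simply counts the 1's.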

Why does it matter?
Knowing what type of variable you are dealing with will help you choose your method of statistical analysis
The most important and most common distinction is between categorical and numerical variables

Manipulation of variables
Continuous variables can be discretized
– E.g., age can be rounded to whole numbers
Continuous or discrete variables can be categorized
– E.g., age categories
Categorical variables can be re-categorized
– E.g., lumping from 5 categories down to 2
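The three manipulations listed above can be sketched in Python (not the Stata used in the course). The ages, the 10-year category labels, and the 5-level `likelihood` codes with their cutoff are all hypothetical choices made for illustration.

```python
# Hypothetical continuous ages in years; not data from the lecture.
ages = [18.7, 27.4, 33.0, 41.9]

# 1. Discretize: round a continuous age to a whole number of years.
whole_years = [round(a) for a in ages]

# 2. Categorize: assign each age to a 10-year category such as "20-29".
def age_category(age):
    lower = int(age // 10) * 10
    return f"{lower}-{lower + 9}"

categories = [age_category(a) for a in ages]

# 3. Re-categorize: lump a hypothetical 5-level code (1..5) down to 2 levels.
def lump_to_binary(level, cutoff=4):
    """Return 1 if level is at or above the (assumed) cutoff, else 0."""
    return 1 if level >= cutoff else 0

likelihood = [1, 2, 3, 4, 5]
binary = [lump_to_binary(x) for x in likelihood]

print(whole_years, categories, binary)
```

Each step discards some detail, which is the loss of information the next slide mentions.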

Manipulation of variables
Why discretize a continuous variable or re-categorize a categorical variable?
– Ease of interpretation
– Ease of statistical methodology
– Some groups are too small to make conclusions about
– But discretizing or lumping can have its statistical cost: loss of information

Tables to summarize data

Frequency tables
Categorical variables are summarized by
– Frequency counts: how many are in each category
– Relative frequency or percent (a number from 0 to 100)
– Proportion (a number from 0 to 1)

Gender of new HIV clinic patients, , Mbarara, Uganda

          n (%)
Male      415 (39)
Female    645 (61)
Total     1060 (100)

Frequency tables
Continuous variables can be summarized in frequency tables but must be categorized in meaningful ways
Choice of cutpoints
– Even intervals (e.g. 10-year age categories)
– Meaningful cutpoints related to a health outcome or decision
– Equal percentage of the data falling into each category (e.g. tertiles: 33% each, quartiles: 25% each, quintiles: 20% each)
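As a rough illustration of the "equal percentage" option, the sketch below uses Python's standard `statistics.quantiles` to find cutpoints. The `values` list is hypothetical, and different software (including Stata) may use slightly different percentile conventions, so exact cutpoints can differ.

```python
import statistics

# Hypothetical measurements; not data from the lecture.
values = [12, 35, 48, 90, 150, 210, 260, 330, 410, 520, 640, 800]

# n=3 returns the 2 cutpoints separating tertiles,
# n=4 the 3 cutpoints separating quartiles,
# n=5 the 4 cutpoints separating quintiles.
tertile_cuts = statistics.quantiles(values, n=3)
quartile_cuts = statistics.quantiles(values, n=4)
quintile_cuts = statistics.quantiles(values, n=5)

print(tertile_cuts, quartile_cuts, quintile_cuts)
```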

Frequency tables
CD4 cell counts (per mm³) of newly diagnosed persons with HIV at Mulago Hospital, Kampala (N=270)

CD4 category     n (%)
≤50              42 (15.6)
(label lost)     70 (25.9)
(label lost)     59 (21.9)
≥350             99 (36.7)

Frequency tables
The cumulative frequency is the percentage of observations up to and including the current category
CD4 cell counts (per mm³) of newly diagnosed persons with HIV at Mulago Hospital, Kampala (N=270)

CD4 category     n (%)        Cumulative frequency (%)
≤50              42 (15.6)    15.6
(label lost)     70 (25.9)    41.5
(label lost)     59 (21.9)    63.3
≥350             99 (36.7)    100.0
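The cumulative column is just a running total of the counts divided by N. A minimal Python sketch, using the four category counts that appear in the lecture (42, 70, 59, 99); the variable names are illustrative.

```python
from itertools import accumulate

# Category counts from the CD4 table, lowest category first.
counts = [42, 70, 59, 99]
total = sum(counts)

# Percent in each category.
percents = [round(100 * c / total, 1) for c in counts]

# Cumulative percent: running total of counts, divided by N.
cumulative = [round(100 * c / total, 1) for c in accumulate(counts)]

print(total, percents, cumulative)
```

The last cumulative value is always 100, which is a quick check that no category was dropped.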

In Stata
. tab cd4_cat
[Stata output: frequency table of cd4_cat with columns Freq., Percent, Cum. and rows from <50 to >=350, plus Total]

Bar charts
General graph for categorical variables
Graphical equivalent of a frequency table
The x-axis does not have to be numerical
When the bars show proportions, their heights should add up to 1 (or to 100 when they show percents)

Bar charts
This one was made in Excel

Histograms
Bar chart for numerical data
The number of bins and the bin width will make a difference in the appearance of this plot
Width and number of bins may affect interpretation

Without specifying any options, your histogram will look like this. The bin width will be chosen automatically (here=500/6=83.33).
** Stata code for this histogram **
histogram cd4count

** Stata code for this histogram **
histogram cd4count, fcolor(blue) lcolor(black) width(50) title(CD4 among new HIV positives at Mulago) xtitle(CD4 cell count) percent

This histogram has less detail but gives us the % of persons with CD4 <350 cells/mm³
histogram cd4count, fcolor(blue) lcolor(black) width(350) title(CD4 among new HIV positives at Mulago) xtitle(CD4 cell count) percent

Box plots
Middle line = median (50th percentile)
Middle box = 25th to 75th percentiles (interquartile range, IQR)
Bottom whisker: smallest data point at or above (25th percentile - 1.5*IQR)
Top whisker: largest data point at or below (75th percentile + 1.5*IQR)
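The whisker rules above can be made concrete with a small Python sketch. The data are hypothetical, and the quartiles use the "inclusive" method of `statistics.quantiles`, which is an assumption; Stata and other packages may compute percentiles slightly differently.

```python
import statistics

def box_plot_summary(values):
    """Return the pieces a box plot draws, following the whisker rules above."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers end at actual data points inside the fences.
        "lower_whisker": min(v for v in values if v >= lower_fence),
        "upper_whisker": max(v for v in values if v <= upper_fence),
        # Points beyond the fences are usually drawn individually.
        "outside": [v for v in values if v < lower_fence or v > upper_fence],
    }

# Hypothetical data with one extreme value.
summary = box_plot_summary([2, 4, 5, 7, 8, 9, 11, 12, 40])
print(summary)
```

Here the value 40 lies above the upper fence, so the top whisker stops at 12 and 40 would be plotted as a separate point.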

Box plots
graph box cd4count, box(1, fcolor(blue) lcolor(black) fintensity(inten100)) title(CD4 count among new HIV positives at Mulago)
Use the drop-down menus in Stata to make your graphics look pretty!

Box plots by another variable
We can divide up our graphs by another variable
A way to describe the relationship between a numerical and a categorical variable
graph box daysdrank, by(, title(Days drank past 30) subtitle(Among current (prior 3 month) drinkers)) by(sex) box(1, fcolor(blue) lcolor(black) fintensity(inten100))

Histograms by another variable
histogram daysdrank, by(, title(Days drank past 30) subtitle(Among current (prior 3 month) drinkers)) by(sex) fcolor(blue) lcolor(black)

Numerical variable summaries
Mode: the value (or range of values) that occurs most frequently
Sometimes there is more than one mode, e.g. a bi-modal distribution (the two modes do not have to be the same height)
The mode makes most sense for categorical data
For continuous data you can find the mode if you group the data

What type of variable is this? What is the mode? Is the distribution of this variable bi-modal?
hist d1 if d1>=0 & d1<50, discrete fcolor(blue) title(Lifetime number of sex partners)

For numerical variables, the mode is dependent on the bin width
. hist a4, width(2) fcolor(blue) title(Age with bin width=2) name(age_2, replace)
. hist a4, width(5) fcolor(blue) title(Age with bin width=5) name(age_5, replace)
. graph combine age_2 age_5
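To see why the mode depends on the bin width, the following Python sketch (a stand-in for the Stata histograms, using hypothetical ages) finds the most populated bin for two widths. The function name and the tie-breaking rule (lowest bin wins) are illustrative choices.

```python
from collections import Counter

def modal_bin(values, width):
    """Return (bin_start, bin_end, count) for the most populated bin.

    Bins are [start, start + width); ties go to the lowest bin.
    """
    bins = Counter((v // width) * width for v in values)
    start, count = max(bins.items(), key=lambda item: (item[1], -item[0]))
    return start, start + width, count

# Hypothetical ages; not the a4 variable from the lecture.
ages = [19, 19, 19, 20, 21, 22, 23, 24]

mode_width_2 = modal_bin(ages, 2)
mode_width_5 = modal_bin(ages, 5)
print(mode_width_2, mode_width_5)
```

With width 2 the modal bin is 18 to 20, but with width 5 it moves to 20 to 25, even though the data are unchanged.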

Scatter plots: 2 numerical variables
twoway (scatter cd4count a4, color(maroon)) (lowess cd4count a4, lcolor(blue))

The importance of good graphs
09/14/good-night-and-tough-luck/

Numerical variable summaries
Measures of central tendency: where is the center of the data?
– Median: the 50th percentile, i.e. the middle value
If n is odd: the median is the ((n+1)/2)th observation in sorted order (e.g. if n=31 then the median is the 16th observation)
If n is even: the median is the average of the two middle observations (e.g. if n=30 then the median is the average of the 15th and 16th observations)
– Median CD4 cell count in previous data set =
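The odd/even rule for the median translates directly into code. A minimal Python sketch, with a hand-written `median` function for illustration (Python's standard `statistics.median` does the same job):

```python
def median(values):
    """Median by the odd/even rule described above."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n+1)/2)th observation, which is index n//2 when counting from 0.
        return ordered[middle]
    # Even n: average of the two middle observations.
    return (ordered[middle - 1] + ordered[middle]) / 2

# With the values 1..31 the median is the 16th observation;
# with 1..30 it is the average of the 15th and 16th.
median_odd = median(range(1, 32))
median_even = median(range(1, 31))
print(median_odd, median_even)
```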

In Stata
. summarize cd4count, detail
[Stata output for cd4count: percentiles, smallest and largest values, Obs, Sum of Wgt., Mean, Std. Dev., Variance, Skewness, Kurtosis]

Numerical variable summaries
Range
– Minimum to maximum or difference (e.g. age range or range=63)
CD4 cell count range: (0-1368)
Interquartile range (IQR)
– 25th and 75th percentiles (e.g. IQR for age: 23-36) or difference (e.g. 13)
– Less sensitive to extreme values
CD4 cell count IQR: (92-422)

Numerical variable summaries
Measures of central tendency: where is the center of the data?
– Mean: arithmetic average
Means are sensitive to very large or small values
Mean CD4 cell count: 296.9
Mean age:

Interpreting the formula
∑ is the symbol for the sum of the elements immediately to the right of the symbol
These elements are indexed (i.e. subscripted) with the letter i
– The index letter could be any letter, though i is commonly used
The elements are lined up in a list: the first one in the list is denoted x_1, the second one x_2, the third one x_3, and the last one x_n
n is the number of elements in the list
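The formula being interpreted is not reproduced in this transcript. It is presumably the sample mean introduced on the preceding slide, which in the notation described here would be written as:

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + \dots + x_n}{n}
```

Here $\bar{x}$ (read "x-bar") is the usual symbol for the sample mean; the symbol is an assumption, since the slide's own notation is not visible.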

Numerical variable summaries
Sample variance
– Amount of spread around the mean, calculated in a sample as the sum of squared deviations from the mean divided by n-1
Sample standard deviation (SD) is the square root of the variance
– The standard deviation has the same units as the mean
SD of CD4 cell count = 255.4 cells/mm³
SD of age = 11.2 years
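A short Python sketch of the sample variance and standard deviation, using the usual n-1 denominator (the same one implied by the 269 = 270-1 divisor in the grouped calculation later in the lecture). The data are hypothetical.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Square root of the sample variance; same units as the data."""
    return math.sqrt(sample_variance(values))

# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
variance = sample_variance(data)
sd = sample_sd(data)
print(variance, sd)
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which use the same n-1 definition.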

Numerical variable summaries
Coefficient of variation (CV)
– CV = (SD / mean) × 100%
– For the same relative spread around a mean, the variance and standard deviation will be larger for a larger mean
– Can use the CV to compare variability across measurements that are on different scales (e.g. IQ and head circumference)
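The CV is the SD expressed as a percentage of the mean. The Python sketch below shows that two hypothetical data sets with the same relative spread but different scales have the same CV, and then applies the formula to the CD4 summary values quoted elsewhere in this lecture (mean 296.9, SD 255.4).

```python
import statistics

def coefficient_of_variation(sd, mean):
    """CV as a percentage: 100 * SD / mean."""
    return 100 * sd / mean

def cv_of_data(values):
    return coefficient_of_variation(statistics.stdev(values), statistics.mean(values))

# Hypothetical measurements: same relative spread, scales differing by a factor of 10.
small_scale = [10, 20, 30]
large_scale = [100, 200, 300]
cv_small = cv_of_data(small_scale)
cv_large = cv_of_data(large_scale)

# CD4 count, using the mean and SD reported in the lecture.
cv_cd4 = coefficient_of_variation(sd=255.4, mean=296.9)

print(cv_small, cv_large, round(cv_cd4, 1))
```

The SDs of the two hypothetical data sets differ tenfold, yet their CVs agree, which is what makes the CV useful for comparing variables on different scales.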

CV for CD4 count
. summarize cd4count, detail
[Stata output for cd4count: percentiles, smallest and largest values, Obs, Sum of Wgt., Mean, Std. Dev., Variance, Skewness, Kurtosis]

CV for age
. summarize a4, detail
[Stata output for a4 (variable label: "how old are you?"): percentiles, smallest and largest values, Obs, Sum of Wgt., Mean, Std. Dev., Variance, Skewness, Kurtosis]

Grouped data
Sometimes you are given data in aggregate form
The data consist of frequencies of each individual value or range of values
For example: CD4 cell counts (per mm³) of newly diagnosed persons with HIV at Mulago Hospital, Kampala (N=270)

CD4 category     n (%)
≤50              42 (15.6)
(label lost)     70 (25.9)
(label lost)     59 (21.9)
≥350             99 (36.7)

Grouped mean
The mean uses the midpoint of each group
For the highest group, use the midpoint between the cutpoint and the maximum
Grouped mean = (sum over groups of m_i * f_i) / (sum over groups of f_i)
– m_i = the midpoint of the ith group
– f_i = the frequency in the ith group
For the CD4 table: grouped mean = (25*42 + m_2*70 + m_3*59 + m_4*99) / 270 cells/mm³ (mean from original data was 296.9)
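The grouped mean is a frequency-weighted average of the group midpoints. Because the middle cutpoints of the CD4 table are not legible in this transcript, the Python sketch below uses small hypothetical groups instead; `midpoints` and `frequencies` are made-up values.

```python
def grouped_mean(midpoints, frequencies):
    """Frequency-weighted mean of group midpoints: sum(m_i * f_i) / sum(f_i)."""
    if len(midpoints) != len(frequencies):
        raise ValueError("need one frequency per midpoint")
    total = sum(frequencies)
    return sum(m * f for m, f in zip(midpoints, frequencies)) / total

# Hypothetical groups 0-10, 10-20, 20-30 with midpoints 5, 15, 25.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

mean_grouped = grouped_mean(midpoints, frequencies)
print(mean_grouped)
```

The result only approximates the mean of the raw data, since every observation in a group is treated as if it sat at the midpoint.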

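Grouped standard deviation
The standard deviation = sqrt( (sum over groups of (m_i - grouped mean)² * f_i) / (n - 1) )
For the CD4 table: sqrt( ((25 - grouped mean)²*42 + (m_2 - grouped mean)²*70 + (m_3 - grouped mean)²*59 + (m_4 - grouped mean)²*99) / 269 ) cells/mm³ (SD from original data was 255.4)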
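The grouped SD follows the same pattern: squared deviations of each midpoint from the grouped mean, weighted by frequency, divided by n-1. A minimal Python sketch with hypothetical groups (the midpoints and frequencies are made up, since the CD4 midpoints are not legible here):

```python
import math

def grouped_sd(midpoints, frequencies):
    """sqrt( sum((m_i - grouped mean)^2 * f_i) / (n - 1) ), with n = sum(f_i)."""
    n = sum(frequencies)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
    squared = sum((m - mean) ** 2 * f for m, f in zip(midpoints, frequencies))
    return math.sqrt(squared / (n - 1))

# Hypothetical groups with midpoints 5, 15, 25 and 10 observations in total.
sd_grouped = grouped_sd([5, 15, 25], [2, 5, 3])
print(sd_grouped)
```

For these values the grouped mean is 16 and the weighted sum of squared deviations is 490, so the SD is sqrt(490 / 9).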

Pocket/wallet change
– Histogram, boxplot
– Mode, median, 25th percentile, 75th percentile
– Mean, SD
– Differ by gender?

For next time
Review today’s material
– Read Pagano and Gauvreau Chapters 1-3
Next week’s material (Probability)
– Read Chapter 6
– No laptop needed for next week’s lecture (but bring it to lab)