Introduction to Biostatistics and Bioinformatics Exploring Data and Descriptive Statistics.


Learning Objectives. Python matplotlib library to visualize data: scatter plot, histogram, kernel density estimate, box plots. Descriptive statistics: mean and median; standard deviation and Inter Quartile Range. Central limit theorem.

An Example Data Set

Scatter Plot [Figure: Measurement vs. Order of Measurement]

Histogram [Figure: Number of Measurements vs. Measurement, with bin size = 0.1, bin size = 0.05, and a third bin size lost in extraction]

Cumulative Distributions [Figure: Cumulative Frequency vs. Measurement]

Kernel Density Estimate [Figure: Number of Measurements vs. Measurement]

Original Distribution [Figure: panels for the scatter plot, Original Distribution, Kernel Density Estimate, and Histogram (bin size = 0.05), each plotted against Measurement]

More Data [Figure: panels for the scatter plot, Original Distribution, Kernel Density Estimate, and Histogram (bin size = 0.05), each plotted against Measurement]

Exercise 1 Download ibb2015_7_exercise1.py
(a) Draw 20 points from a normal distribution with mean=0 and standard deviation=0.1.
    import numpy as np
    points = 20  # number of points; reused in parts (b)-(e)
    y = 0.1 * np.random.normal(size=points)  # scale a standard normal to SD 0.1
    print(y)

Exercise 1
(b) Make a scatter plot of the 20 points.
    import matplotlib.pyplot as plt
    x = range(1, points + 1)  # order of measurement; points = 20 from part (a)
    fig, ax1 = plt.subplots(1, figsize=(6, 6))
    ax1.scatter(x, y, color='red', lw=0, s=40)
    ax1.set_xlim([0, points + 1])
    ax1.set_ylim([-1, 1])
    fig.savefig('ibb2015_7_exercise1_scatter_points' + str(points) + '.png', dpi=300, bbox_inches='tight')
    plt.close(fig)

Exercise 1
(c) Plot histograms.
    for nbins in [20, 40, 80]:  # number of bins across the range [-1, 1]
        fig, ax1 = plt.subplots(1, figsize=(6, 6))
        # density=True replaces the normed=True argument of older matplotlib versions
        ax1.hist(y, bins=nbins, histtype='step', color='black', range=[-1, 1], lw=2, density=True)
        ax1.set_xlim([-1, 1])
        fig.savefig('ibb2015_7_exercise1_bin' + str(nbins) + '_points' + str(points) + '.png', dpi=300, bbox_inches='tight')
        plt.close(fig)

Exercise 1
(d) Plot cumulative distribution.
    y_cumulative = np.linspace(0, 1, points)  # cumulative fraction assigned to each sorted value
    x_cumulative = np.sort(y)                 # measurements in increasing order
    fig, ax1 = plt.subplots(1, figsize=(6, 6))
    ax1.plot(x_cumulative, y_cumulative, color='black', lw=2)
    ax1.set_xlim([-1, 1])
    ax1.set_ylim([0, 1])
    fig.savefig('ibb2015_7_exercise1_cumulative_points' + str(points) + '.png', dpi=300, bbox_inches='tight')
    plt.close(fig)

Exercise 1
(e) Plot kernel density estimate.
    import scipy.stats as stats
    kde_points = 1000
    kde_x = np.linspace(-1, 1, kde_points)  # grid on which the estimate is evaluated
    fig, ax1 = plt.subplots(1, figsize=(6, 6))
    kde_y = stats.gaussian_kde(y)           # callable density estimate built from the data
    ax1.plot(kde_x, kde_y(kde_x), color='black', lw=2)
    ax1.set_xlim([-1, 1])
    fig.savefig('ibb2015_7_exercise1_kde_points' + str(points) + '.png', dpi=300, bbox_inches='tight')
    plt.close(fig)

Comparing Measurements

Comparing Measurements – Cumulative distributions

Systematic Shifts

Exercise 2 Download ibb2015_7_exercise2.py
(a) Generate 5 data sets with 20 data points each from normal distributions with means = 0, 0, 0.1, 0.5 and 0.3 and standard deviation=0.1.
    import numpy as np
    y = []
    for j in range(5):
        y.append(0.1 * np.random.normal(size=20))  # mean 0, SD 0.1
    # shift the last three data sets to means 0.1, 0.5 and 0.3
    y[2] += 0.1
    y[3] += 0.5
    y[4] += 0.3
    print(y)

Exercise 2
(b) Make scatter plots for the 5 data sets.
    import matplotlib.pyplot as plt
    sixcolors = ['#D4C6DF', '#8968AC', '#3D6570', '#91732B', '#963725', '#4D0132']
    fig, ax1 = plt.subplots(1, figsize=(6, 6))
    for j in range(5):
        # spread the 20 points of data set j over a narrow band around x = j+1
        ax1.scatter(np.linspace(j + 1 - 0.2, j + 1 + 0.2, 20), y[j], color=sixcolors[6 - (j + 1)], lw=0, alpha=1)
    ax1.set_xlim([0, 6])
    ax1.set_ylim([-1, 1])
    fig.savefig('ibb2015_7_exercise2_scatter_sample' + str(20), dpi=300, bbox_inches='tight')
    plt.close(fig)

Correlation Between Two Variables
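The slide gives only a title, so the following is a minimal sketch under an assumption: that the correlation meant here is the Pearson correlation coefficient. The function name pearson_r and the data values are made up for illustration, and the code uses only the Python standard library.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Illustrative data: one variable rising with x, one falling with x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_up = [2.0, 4.0, 6.0, 8.0, 10.0]
ys_down = [5.0, 4.0, 3.0, 2.0, 1.0]

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)
print(r_up, r_down)
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates no linear relationship, which does not rule out a nonlinear one.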

Data Visualization -visualization-points-of-view.html

Process of Statistical Analysis [Diagram: Population, Random Sample, Sample Statistics; labels: Describe, Make Inferences]

Distributions [Figure: Complex, Normal, Skewed, and Long-tailed distributions, with samples of n=3, n=10, and n=100]

Mean [Equation for the mean of a sample; formula lost in extraction]
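As a minimal sketch, assuming this slide refers to the usual arithmetic mean (the sum of the values divided by the sample size), the calculation can be written out as follows. The sample values are made up for illustration.

```python
import statistics

# Hypothetical sample of five measurements (illustrative values only).
sample = [0.12, -0.05, 0.08, 0.01, -0.11]

# Arithmetic mean: sum of the values divided by the number of values.
mean_by_hand = sum(sample) / len(sample)

# The standard library computes the same quantity.
mean_stdlib = statistics.mean(sample)

print(mean_by_hand, mean_stdlib)
```

With NumPy, as used in the exercises, np.mean(sample) should give the same value.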

Mean – Sample Size [Figure: Mean vs. Sample Size for the Normal distribution]

Mean – Sample Size [Figure: Mean vs. Sample Size for the Complex, Normal, Skewed, and Long-tailed distributions]

Mode, Maximum and Minimum. For a sample: the maximum, the minimum, and the mode (the most common value).
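A short sketch of these three quantities, using only the Python standard library and a made-up sample of integers:

```python
import statistics

# Hypothetical sample with a repeated value (illustrative values only).
sample = [3, 1, 4, 1, 5, 9, 2, 6, 5, 1]

largest = max(sample)                   # maximum
smallest = min(sample)                  # minimum
most_common = statistics.mode(sample)   # mode: the most common value

print(smallest, largest, most_common)
```

For continuous measurements, exact repeats are rare, so the mode is usually read off a histogram or a kernel density estimate rather than counted directly.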

Median, Quartiles and Percentiles. Quartiles: the values below which 25%, 50% (the median), and 75% of the sample fall. Percentiles: the value below which m% of the sample falls.
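The sketch below shows one way to compute the value for m% of a sample. It assumes linear interpolation between the two nearest sorted values, which is one common convention among several; the function name percentile and the sample are made up for illustration.

```python
def percentile(values, m):
    """Return the value for m% of the sample (0 <= m <= 100).

    Interpolates linearly between the two nearest sorted values.
    """
    ordered = sorted(values)
    position = (len(ordered) - 1) * m / 100.0   # fractional index into the sorted sample
    lower = int(position)                       # index at or below the position
    upper = min(lower + 1, len(ordered) - 1)    # next index, clipped at the end
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical sample (illustrative values only).
sample = [7, 1, 3, 9, 5]

q1 = percentile(sample, 25)       # first quartile
median = percentile(sample, 50)   # second quartile
q3 = percentile(sample, 75)       # third quartile

print(q1, median, q3)
```

Different tools use slightly different interpolation rules, so quartiles of small samples can differ a little from one program to another.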

Median and Mean – Sample Size [Figure: mean and median (gray) vs. Sample Size for the Complex, Normal, Skewed, and Long-tailed distributions]

Variance [Equations for the variance and the mean of a sample; formulas lost in extraction]
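The slide's formula is not recoverable, so this is a sketch of the two standard definitions rather than a reconstruction of the slide. Which denominator the lecture used (n or n - 1) is an assumption left open here; the sample values are made up.

```python
import statistics

# Hypothetical sample (illustrative values only).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = sum(sample) / len(sample)
squared_deviations = [(x - mean) ** 2 for x in sample]

# Dividing by n describes the spread of these particular values.
variance_n = sum(squared_deviations) / len(sample)

# Dividing by n - 1 is the usual estimate of a population variance from a sample.
variance_n_minus_1 = sum(squared_deviations) / (len(sample) - 1)

# The standard deviation is the square root of the variance.
std_dev_n = variance_n ** 0.5

print(variance_n, variance_n_minus_1, std_dev_n)
```

NumPy's np.var and np.std divide by n by default and switch to n - 1 when called with ddof=1.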

Variance – Sample Size [Figure: Variance vs. Sample Size for the Complex, Normal, Skewed, and Long-tailed distributions]

Inter Quartile Range (IQR). Quartiles: the values below which 25%, 50% (the median), and 75% of the sample fall. The Inter Quartile Range is the difference between the 75% and the 25% quartiles.
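As a rough illustration, the sketch below computes the IQR with the standard library's statistics.quantiles (Python 3.8 or later) and compares it with the standard deviation when one value is replaced by an outlier. The data are made up, and the "inclusive" quartile method is one convention among several.

```python
import statistics

# Hypothetical sample, and the same sample with its largest value replaced by an outlier.
sample = [1, 3, 5, 7, 9, 11, 13, 15, 17]
sample_with_outlier = sample[:-1] + [170]

def iqr(values):
    """Inter Quartile Range: third quartile minus first quartile."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

iqr_plain = iqr(sample)
iqr_outlier = iqr(sample_with_outlier)

sd_plain = statistics.stdev(sample)
sd_outlier = statistics.stdev(sample_with_outlier)

print(iqr_plain, iqr_outlier)
print(sd_plain, sd_outlier)
```

In this example the IQR is unchanged by the outlier while the standard deviation grows considerably, which is why the IQR is often preferred for skewed or long-tailed data.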

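Inter Quartile Range and Standard Deviation [Figure: standard deviation and scaled IQR (gray; the label "IQR/" is truncated in the source) vs. Sample Size for the Complex, Normal, Skewed, and Long-tailed distributions]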

Central Limit Theorem. For many distributions, the sum of a large number of values converges to a normal distribution if: the values are drawn independently; the values are all drawn from the same distribution; and the distribution has a finite mean and variance.
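The theorem can be checked numerically. The sketch below is an illustration under stated assumptions, not the lecture's own code: it adds 12 independent draws from a uniform(0, 1) distribution (an arbitrary choice with finite mean 1/2 and variance 1/12) and checks that the sums behave like a normal distribution with mean 6 and standard deviation 1.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

values_per_sum = 12   # number of independent draws added together
repeats = 10000       # number of sums generated

# Each sum adds independent draws from the same uniform(0, 1) distribution.
sums = [sum(random.random() for _ in range(values_per_sum))
        for _ in range(repeats)]

mean_of_sums = statistics.mean(sums)   # expected to be close to 12 * 1/2 = 6
sd_of_sums = statistics.stdev(sums)    # expected to be close to sqrt(12 * 1/12) = 1

# For a normal distribution, about 68% of values lie within one SD of the mean.
within_one_sd = sum(abs(s - mean_of_sums) <= sd_of_sums for s in sums) / repeats

print(mean_of_sums, sd_of_sums, within_one_sd)
```

A histogram of sums, drawn as in Exercise 1(c), would show the bell shape directly.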

Uncertainty in Determining the Mean [Figure: distribution of the sample mean for the Complex, Normal, Skewed, and Long-tailed distributions at n=3, n=10, and n=100; one panel uses n=10, n=100, and n=1000]

Standard Error of the Mean [Equations for the variance, the sample mean, and the standard error of the mean; formulas lost in extraction]
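Since the slide's formulas are lost, the following sketch relies on the standard result that the standard error of the mean is the standard deviation divided by the square root of the sample size. It draws many samples from a normal distribution (mean 0, SD 0.1, as in Exercise 1; the sample size and repeat count are arbitrary choices) and compares the observed spread of the sample means with that prediction.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

sigma = 0.1        # standard deviation of the underlying distribution
sample_size = 25   # number of values in each sample
repeats = 2000     # number of samples drawn

sample_means = []
for _ in range(repeats):
    sample = [random.gauss(0.0, sigma) for _ in range(sample_size)]
    sample_means.append(statistics.mean(sample))

# Spread of the sample means, measured directly.
observed_sem = statistics.stdev(sample_means)

# Prediction: SD divided by the square root of the sample size.
predicted_sem = sigma / math.sqrt(sample_size)

print(observed_sem, predicted_sem)
```

The two numbers should agree closely, and quadrupling the sample size should halve both.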

Exercise 3 Download ibb2015_7_exercise3.py
(a) Generate skewed data sets.
    sample_size = 10
    x_test = np.random.uniform(-1.0, 1.0, size=30 * sample_size)  # candidate x values
    y_test = np.random.uniform(0.0, 1.0, size=30 * sample_size)   # candidate y values
    y_test2 = skew(x_test, -0.1, 0.2, 10)  # skew() is not shown on the slide
    y_test2 /= max(y_test2)                # scale the distribution to a maximum of 1
    x_test2 = x_test[y_test < y_test2]     # keep x where (x, y) lies under the distribution
    x_sample = x_test2[:sample_size]       # first sample_size accepted values
  1. Generate a pair of random numbers within the range.
  2. Assign them to x and y.
  3. Keep x if the point (x, y) is within the distribution.
  4. Repeat 1-3 until the desired sample size is obtained.
  5. The values x obtained in this way will be distributed according to the original distribution.
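The five steps can be sketched without NumPy as a plain loop. Because the skew() function is not shown on the slide, the example substitutes a hypothetical density, target_density(x) = (1 - x)/2 on [-1, 1], chosen only because its mean (-1/3) is easy to verify; it is not the distribution used in the exercise.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

def target_density(x):
    """Hypothetical skewed density on [-1, 1], scaled so its maximum is 1."""
    return (1.0 - x) / 2.0

def draw_sample(sample_size):
    """Draw sample_size values distributed according to target_density."""
    accepted = []
    while len(accepted) < sample_size:    # step 4: repeat until enough values
        x = random.uniform(-1.0, 1.0)     # steps 1-2: random x within the range
        y = random.uniform(0.0, 1.0)      # steps 1-2: random y between 0 and 1
        if y < target_density(x):         # step 3: keep x if (x, y) is under the curve
            accepted.append(x)
    return accepted

x_sample = draw_sample(5000)
sample_mean = sum(x_sample) / len(x_sample)   # should be close to -1/3 for this density

print(len(x_sample), sample_mean)
```

Unlike the vectorized version on the slide, which draws a fixed batch of 30 * sample_size candidates, the loop keeps drawing until the requested number of values has been accepted.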

Exercise 3
(b) Calculate the mean of samples drawn from the skewed data set and the standard error of the mean, and plot the distribution of averages.
    average = []  # means of the repeated samples
    for repeat in range(1000):
        …
        average.append(np.mean(x_sample))
    sem = np.std(average)  # spread of the sample means
    fig, ax1 = plt.subplots(1, figsize=(6, 6))
    ax1.set_title('Sample size = ' + str(sample_size) + ', SEM = ' + str(sem))
    # density=True replaces the normed=True argument of older matplotlib versions
    ax1.hist(average, bins=100, histtype='step', color='red', range=[-0.5, 0.5], density=True, lw=2)
    ax1.set_xlim([-0.5, 0.5])

Box Plot M. Krzywinski & N. Altman, Visualizing samples with box plots, Nature Methods 11 (2014) 119

Box Plots [Figure: box plots for the Complex, Normal, Skewed, and Long-tailed distributions at n=5, n=10, and n=100]

Box Plots with All the Data Points [Figure: box plots overlaid with the individual data points for the Complex, Normal, Skewed, and Long-tailed distributions at n=5, n=10, and n=100]

Box Plots, Scatter Plots and Bar Graphs [Figure: Normal distribution; panels with error bars showing standard deviation, standard deviation, and standard error]

Box Plots, Scatter Plots and Bar Graphs [Figure: Skewed distribution; panels with error bars showing standard deviation, standard deviation, and standard error]

Exercise 4 Download ibb2015_7_exercise4.py and plot box plots for a skewed data set.
    # thiscolor, x_sample and x_samples are not defined on the slide
    fig, ax1 = plt.subplots(1, figsize=(6, 6))
    # individual data points as open circles, spread around x = 1
    ax1.scatter(np.linspace(1 - 0.1, 1 + 0.1, sample_size), x_sample, facecolors='none', edgecolor=thiscolor, lw=1)
    bp = ax1.boxplot(x_samples, notch=False, sym='')  # sym='' hides outlier markers
    plt.setp(bp['boxes'], color=thiscolor, lw=2)
    plt.setp(bp['whiskers'], color=thiscolor, lw=2)
    plt.setp(bp['medians'], color='black', lw=2)
    plt.setp(bp['caps'], color=thiscolor, lw=2)
    plt.setp(bp['fliers'], color=thiscolor, marker='o', lw=0)
    fig.savefig(…)

Descriptive Statistics - Summary. Example distributions: normal distribution; skewed distribution; distribution with long tails; complex distribution with several peaks. Mean, median, quartiles, percentiles. Variance, standard deviation, Inter Quartile Range (IQR), error bars. Box plots, bar graphs, and scatter plots.

Descriptive Statistics – Recommended Reading

Homework. Plot the ratio of the standard error of the mean to the standard deviation as a function of sample size (use sample sizes of 3, 10, 30, 100, 300, 1000) for the skewed distribution in Exercise 3. Modify ibb2015_7_exercise3.py to generate this plot, and submit both the script and the plot.

Next Lecture: Sequence Alignment Concepts