Course in Statistics and Data Analysis, Course B, September 2009. Stephan Frickenhaus.


Outline: theses
My experience is: many young researchers lack knowledge of analysis tools. Producing or sampling data is not the problem, but analysing it becomes a problem right before publication. Once appropriate tools are known (and Excel is not appropriate for analysis), knowledge of methods and concepts may still be missing. This course tries to tackle both.

Schedule
Day 1: 8.9., 10: :00, Room E4005. The probability distribution, the p-value concept, statistical tests in R.
Day 2: 9.9., 10: :00, Room E4005. Multivariate analysis, correlation tests, ANOVA, ordination with factors and environmental data, cluster analysis (maybe as start of Day 3).
Day 3: 10.9., 10: :00, Glaskasten F. User-driven and interactive: bring your project data and we work on it.

Contents / Setup
Tool-based course (program "R")
- Install "R" from …
Exploring data analysis
- Graphically
- Numerically
Exploring what significance really is
- Statistical tests no longer as black boxes

DAY1 - Lecture part I
With each type of data we have different methods of analysis. Give examples!
Type of data: examples
- Numerical (metric) data: linear (length in cm), circular (angle in degrees)
- Nominal (class) data: sex, colour, species
- Ordinal (ranked) data: age group, school class, phase in cell division
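As a rough sketch, the three data types could be represented in R as follows. All variable names and values here are hypothetical and chosen only for illustration; they are not taken from the course material.

```r
# Hypothetical example values, for illustration only.
length_cm <- c(12.1, 15.4, 9.8, 11.0)                      # numerical (metric), linear
angle_deg <- c(10, 350, 185, 90)                           # numerical (metric), circular
species   <- factor(c("cod", "herring", "cod", "sprat"))   # nominal (class)
age_group <- factor(c("juvenile", "adult", "adult", "larva"),
                    levels = c("larva", "juvenile", "adult"),
                    ordered = TRUE)                         # ordinal (ranked)

class(length_cm)                    # a plain numeric vector
levels(species)                     # classes without any ranking
below_adult <- age_group < "adult"  # order comparisons work only for ordered factors
below_adult
```

Storing ordinal data as an ordered factor keeps the ranking available to later steps, while a plain factor treats the classes as unranked.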

First steps from data …
- Metric (linear: length in cm; circular: angle in degrees): plot in a co-ordinate system (scatter plot), histogram, boxplot
- Nominal (sex, colour, species): count in a table, barplot, pie chart
- Ordinal (age group, school class, phase in cell division): count in a table, with an axis, barplot
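These first steps map onto a handful of base R functions. The sketch below uses randomly generated, hypothetical data (the variable names and distributions are assumptions), so only the function calls matter, not the numbers.

```r
set.seed(1)  # hypothetical data, generated only for illustration
length_cm <- rnorm(50, mean = 12, sd = 2)
species   <- factor(sample(c("cod", "herring", "sprat"), 50, replace = TRUE))
age_group <- factor(sample(c("larva", "juvenile", "adult"), 50, replace = TRUE),
                    levels = c("larva", "juvenile", "adult"), ordered = TRUE)

# Metric data: histogram and boxplot
hist(length_cm)
boxplot(length_cm)

# Nominal data: count in a table, then barplot or pie chart
counts <- table(species)
barplot(counts)
pie(counts)

# Ordinal data: counts keep the order of the levels along the axis
ordered_counts <- table(age_group)
barplot(ordered_counts)
```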

… to methods
- Metric (plot in a co-ordinate system, histogram, boxplot): check for groups, trends, correlations
- Nominal (count in a table, barplot, pie chart): check for differences, ratios
- Ordinal (count in a table, with an axis, barplot): check for differences, ratios, relation to order

… to combinations of data
- Metric × metric: X-Y plots
- Metric × metric with nominal or ordinal classes: X-Y plot with colours = class (class = colour in the scatter plot)
- Check for groups/clusters

… towards models: multivariate data
- Organize data in tables
- Keep all data of the same measurement in ONE row
- Distinguish groups in an extra column by nominal data
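In R such a table is usually a data frame. The following is a minimal sketch with made-up measurements; the column names and values are hypothetical. Each row holds one measurement, and the group is kept in an extra nominal column that can then serve as the colour in an X-Y plot.

```r
# Hypothetical measurements: one row per individual, one column per variable.
fish <- data.frame(
  length_cm = c(12.1, 15.4, 9.8, 11.0, 14.2, 10.5),
  weight_g  = c(20.3, 41.0, 11.2, 15.8, 33.5, 13.9),
  species   = factor(c("cod", "cod", "sprat", "sprat", "cod", "sprat"))
)
str(fish)

# X-Y plot with colour = class
plot(fish$length_cm, fish$weight_g,
     col = as.integer(fish$species), pch = 19,
     xlab = "length (cm)", ylab = "weight (g)")
legend("topleft", legend = levels(fish$species),
       col = seq_along(levels(fish$species)), pch = 19)

# Group-wise summary using the nominal column
group_means <- aggregate(length_cm ~ species, data = fish, FUN = mean)
group_means
```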

Before discussing what we can do with such a table, let's take our first steps in the tool R!

Start Practice with R

Lecture part II
What if a summary of the data is not enough? For example, we may want to say whether an observed mean value is probably greater than 0.5. It is not enough to conclude "we clearly find mean(x) < mean(y)", because this may be an outcome of small sample sizes: in reality the means may be equal and there may be no effect at all. We must define some terms to learn how to be more quantitative about such statements, e.g. "with 1% error we can exclude that x and y are from the same population".

Some terms …
Population:
- All individuals of the kind measured
- If we measure them all, we know exactly the mean value etc., i.e. the true mean
- Sometimes we do not have it accessible
- Sometimes we think it has infinitely many individuals
Sample:
- A subset of individuals from a population
- It has, e.g., a sample mean that is in general not equal to the true mean (the mean of the population)
- Sample size: number of individuals picked

… more terms, for real-valued variables X
- Probability density function p(x): its integral over the interval [a, b] is the probability of picking a sample x_i from X within [a, b]
- Cumulative distribution function cdf(a): the probability of picking an x below a

Probability density function p(x)
[Figure: p(x) plotted against x, with the interval [a, b] marked]
- p(x) >= 0
- Need not be symmetric!
- The full range of X makes 100%

Cumulative distribution function cdf(x)
[Figure: cdf(x) plotted against x, rising from 0 at min(X) to 1 at max(X)]
- cdf starts from 0 at the minimal possible value of X and reaches 1 at the maximal possible value of X. There p drops to 0.
- cdf is monotonically increasing, because it integrates a p >= 0.
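The relation between p(x) and cdf(x) can be checked numerically. This is a small sketch that assumes X follows a standard normal distribution and picks a and b arbitrarily; both choices are illustrative assumptions, not part of the slides.

```r
# Assumption: X is standard normal; a and b are arbitrary example bounds.
a <- -1
b <- 0.5

# Probability of picking x in [a, b]: area under p(x) between a and b
p_interval <- integrate(dnorm, lower = a, upper = b)$value

# The same probability from the cumulative distribution function
p_from_cdf <- pnorm(b) - pnorm(a)

# The full range of X makes 100%
total <- integrate(dnorm, lower = -Inf, upper = Inf)$value

c(p_interval = p_interval, p_from_cdf = p_from_cdf, total = total)
```

In R, the prefix `d` gives the density and `p` the cumulative distribution function of a distribution (here `dnorm` and `pnorm`).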

Mean E and standard deviation S
[Figure: p(x) plotted against x, with E(X) and S(X) marked]
- E(X) need not be at the maximum of p(x)
- S(X) measures the width of p(x), i.e. the scattering of x around E(X)

Long-tail distributions
[Figure: p(x) plotted against x, with a long tail towards large x]
Some rare samples will have very large values of x! When we have only a few samples, we may pick none of these rare values.
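The effect of a long tail on small samples can be tried out by simulation. The sketch below assumes a log-normal distribution as a stand-in for a long-tailed p(x); the distribution, its parameters, and the sample size of 5 are illustrative assumptions.

```r
set.seed(42)
# Assumption: a log-normal distribution stands in for a long-tailed p(x).
sdlog <- 1.5
true_mean <- exp(sdlog^2 / 2)   # E(X) of a log-normal with meanlog = 0

# Draw many small samples (size 5) and record each sample mean
small_means <- replicate(1000, mean(rlnorm(5, meanlog = 0, sdlog = sdlog)))

# Fraction of small samples whose mean lies below the true mean:
# most small samples miss the rare large values and underestimate E(X)
frac_below <- mean(small_means < true_mean)
frac_below
```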

What is a statistical test?
Example: we have a sample x of size 6. How probable is it that the mean of the sample x is between 2 and 2.5, although E(X)=0? To answer this, either:
1) we repeat taking samples of size 6 many times and count how often the mean falls into this interval (may be too expensive), or
2) we make an assumption about the probability density of X and then integrate a statistics distribution of mean(x) to measure Pr(2<mean(x)<2.5).
LATER: can I check what the pdf of X is?
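Both routes can be sketched in R once a population is assumed. Here X is taken to be normal with mean 0 and standard deviation 2; this choice, like the number of repetitions, is an assumption made only so that the example produces a non-negligible probability.

```r
set.seed(123)
n     <- 6      # sample size from the example
n_rep <- 1e5    # number of repeated samples (assumption)
sigma <- 2      # assumed standard deviation of X; E(X) = 0

# Route 1: repeat the sampling many times and count
samples      <- matrix(rnorm(n_rep * n, mean = 0, sd = sigma), ncol = n)
sample_means <- rowMeans(samples)
pr_simulated <- mean(sample_means > 2 & sample_means < 2.5)

# Route 2: use the known distribution of mean(x) for normal X,
# which is normal with standard deviation sigma / sqrt(n)
pr_exact <- pnorm(2.5, mean = 0, sd = sigma / sqrt(n)) -
            pnorm(2.0, mean = 0, sd = sigma / sqrt(n))

c(simulated = pr_simulated, exact = pr_exact)
```

In a real experiment route 1 means repeating the measurement campaign, which is why the assumption-based route 2 is the one used in practice.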

… influence of sample size on the mean
- Repeat a sampling from X with sd(X)=1.0 at different sizes N
- Take the sample means
- How do the repeated means vary (standard deviation)?
Result: for high N, sd(mean) goes as 1/sqrt(N) (central limit theorem). And for low N? There the variation is described by the t-statistic t = mean(x)/(sd(x)/sqrt(N)), which depends on the sample size N.
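This experiment is easy to run in R. The sketch assumes a normal X with mean 0 and sd(X)=1.0; the chosen sample sizes and the number of repetitions are arbitrary.

```r
set.seed(7)
sizes <- c(2, 5, 10, 50, 200)  # sample sizes N to compare (arbitrary choice)
n_rep <- 5000                  # repetitions per sample size (arbitrary choice)

# For each N: draw n_rep samples of size N, take their means,
# and measure how much these means scatter
sd_of_means <- sapply(sizes, function(N) {
  samples <- matrix(rnorm(n_rep * N, mean = 0, sd = 1), ncol = N)
  sd(rowMeans(samples))
})

# Compare with the expected 1/sqrt(N) behaviour
result <- data.frame(N = sizes,
                     simulated = sd_of_means,
                     theory = 1 / sqrt(sizes))
result
```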

A first test: test the influence of sample size
How do I know how many samples I need to make a correct statement about the mean, like E(X) ≥ 0.89? "Correct" is to be quantified by the "type-I error": how probable is it that I see the same or a more extreme value by chance alone, i.e. although the population mean is 0?
Concept of the null hypothesis: how sure can I be in excluding that the population mean is zero when I find a sample mean of m=0.89? So we evaluate how probable such an outcome is when X follows a certain pdf(X), e.g. a normal distribution, with E(X)=0. To evaluate this Pr, we need a test statistic t and its distribution pdf(t), which we integrate to obtain Pr.
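To get a feeling for the role of the sample size, one can hold the sample mean fixed at m=0.89 and see how the probability of such an outcome under the null hypothesis changes with N. The sketch assumes a sample standard deviation of 1 and a one-sided question; both are assumptions for illustration, as is the 5% threshold.

```r
m <- 0.89   # observed sample mean from the example
s <- 1      # assumed sample standard deviation
N <- 2:10   # candidate sample sizes

# Test statistic for the null hypothesis "population mean is 0"
t_value <- m / (s / sqrt(N))

# Probability of a t at least this large by chance alone (one-sided)
p_value <- 1 - pt(t_value, df = N - 1)

data.frame(N, t_value, p_value)

# Smallest N at which this probability drops below 5%
n_needed <- min(N[p_value < 0.05])
n_needed
```

The same sample mean becomes more convincing as N grows, because both the t-value increases and the tails of the t-distribution get shorter.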

T-statistics
The distribution of T has a complicated mathematical form; its graph is similar to a bell-shaped curve. For small sample size N it has longer tails (green).
[Figure: pdf of T; the blue area is Pr(T<3), the remaining tail is Pr(T>=3)]

T is known in R
Test for the sample x=c(1,2): what is Pr(t<3) for N=2? Where does the upper boundary 3 come from?
t = mean(x)/sd(x)*sqrt(2) = 1.5/0.707*1.414 = 3.0
The degrees of freedom are sample size - 1, so df=1.
pt(3,df=1) ≈ 0.90: about 90 out of 100 repeated samples will give a t below 3.
1 - pt(3,df=1) ≈ 0.10 is the chance to have mean(x) greater than 1.5 (remember, N=2), under the assumption that x is drawn from a population with mean 0!
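The calculation for the sample x=c(1,2) can be reproduced step by step in R. The variable names below are chosen for this sketch; `pt` is the cumulative distribution function of the t-distribution in base R.

```r
x <- c(1, 2)
N <- length(x)

# t-statistic for the null hypothesis "population mean is 0"
t_value <- mean(x) / (sd(x) / sqrt(N))
t_value

# Pr(T < t_value) with df = sample size - 1
p_below <- pt(t_value, df = N - 1)

# Pr(T >= t_value): chance of a result at least this extreme under the null
p_above <- 1 - p_below

c(p_below = p_below, p_above = p_above)
```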

Now the test itself
We have a sample of size 2.
The null hypothesis: our sample is from a population with mean 0.
The test that checks this is in R …
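The exact command used on the slide is not preserved in the transcript. A plausible candidate is the one-sample t-test in base R, `t.test`, sketched here for the sample x=c(1,2); the choice of a one-sided alternative in the second call is an assumption that matches the Pr(T>=3) reasoning used earlier.

```r
x <- c(1, 2)

# Two-sided test of the null hypothesis "population mean is 0" (R's default)
two_sided <- t.test(x, mu = 0)

# One-sided variant: is the population mean greater than 0?
one_sided <- t.test(x, mu = 0, alternative = "greater")

two_sided$statistic   # the t-value
two_sided$parameter   # degrees of freedom
two_sided$p.value     # two-sided p-value
one_sided$p.value     # one-sided p-value, equal to 1 - pt(t, df)
```

With only two values the p-value stays well above common thresholds such as 0.05, so this sample alone does not allow rejecting the null hypothesis.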