Lecture 1: Descriptive Statistics and Exploratory Data Analysis


1 Lecture 1: Descriptive Statistics and Exploratory Data Analysis
When we make an instrumental measurement, we collect data: experimentally obtained measurement results. Data are a set of values of qualitative or quantitative variables. Exploratory data analysis (EDA) is the first step in analyzing the data from an experiment. The main reasons we use EDA:
• detection of mistakes
• checking the assumptions of statistical methods
• generating hypotheses
• preliminary selection of appropriate models
• determining relationships among the variables

2 Why do we need EDA?
Data in the real world is "dirty":
• incomplete: lacking attribute values, e.g. occupation=" "
• noisy: containing errors or outliers, e.g. Salary="-10"
• inconsistent: containing discrepancies in codes or names, e.g. Age="42" with Birthday="03/07/1997"; a rating that was "1, 2, 3" and is now "A, B, C"; discrepancies between duplicate records
Why is data dirty? "Not applicable" data values at collection time, faulty data collection instruments, human or computer error at data entry, different data sources, ...
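The three kinds of dirty data above can be checked for mechanically. The following is a minimal sketch, not a general cleaning tool: the records, the field names, the reference date and the helper functions are all hypothetical and only mirror the examples on the slide.

```python
from datetime import date

# Hypothetical records, one for each kind of problem named on the slide.
records = [
    {"occupation": "", "salary": 52000, "age": 30, "birthday": date(1990, 5, 1)},
    {"occupation": "nurse", "salary": -10, "age": 41, "birthday": date(1979, 2, 14)},
    {"occupation": "clerk", "salary": 31000, "age": 42, "birthday": date(1997, 7, 3)},
]

def age_on(birthday, today):
    """Age in whole years on the reference date."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)

def find_problems(record, today):
    """Return a list of data-quality problems found in one record."""
    problems = []
    if not record["occupation"].strip():
        problems.append("incomplete: occupation is empty")
    if record["salary"] < 0:
        problems.append("noisy: salary is negative")
    if age_on(record["birthday"], today) != record["age"]:
        problems.append("inconsistent: age does not match birthday")
    return problems

REFERENCE_DATE = date(2020, 6, 1)  # assumed date of data collection
report = [find_problems(r, REFERENCE_DATE) for r in records]
for row, problems in enumerate(report):
    print(row, problems)
```

Each check encodes a rule about what a valid value looks like; in practice those rules come from knowledge of how the data was collected.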

3 Why Is Data Preprocessing Important?
Quality decisions must be based on quality data. Duplicate or missing data may cause incorrect or even misleading statistics.
High-quality data requirements
High-quality data needs to pass a set of quality criteria:
• Validity
• Accuracy
• Precision
• Reliability
The precision of an experiment is related to our ability to minimize random error. The accuracy of an experiment is related to our ability to minimize systematic error.

4 Validity: the extent to which the study measures what it is intended to measure. Are the values describing what was supposed to be measured? Lack of validity is referred to as "bias" or "systematic error".
Accuracy: the degree to which a measurement represents the true value of something. How close is a measurement to the true value?
Precision: the degree of reproducibility of our technique. How close are the measurements to each other?
Reliability: a measure of how dependably an observation is exactly the same when repeated. Will one get the same values if the measurements are repeated?
Accuracy and validity are used synonymously, as are precision and reliability.
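The difference between accuracy (systematic error) and precision (random error) can be seen in a small simulation. This is a sketch under stated assumptions: the true value, the bias, the noise levels and the function names are all made up for illustration, and the random errors are assumed to be normally distributed.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable
TRUE_VALUE = 100.0  # hypothetical true quantity being measured

def measure(n, bias, noise_sd):
    """Simulate n measurements with a systematic bias and random noise."""
    return [TRUE_VALUE + bias + random.gauss(0, noise_sd) for _ in range(n)]

def summarize(values):
    """Return (systematic error, spread) for a set of measurements."""
    systematic_error = statistics.mean(values) - TRUE_VALUE  # relates to accuracy
    spread = statistics.stdev(values)                        # relates to precision
    return systematic_error, spread

precise_but_biased = measure(1000, bias=5.0, noise_sd=0.5)
accurate_but_noisy = measure(1000, bias=0.0, noise_sd=5.0)

print("precise but biased:", summarize(precise_but_biased))
print("accurate but noisy:", summarize(accurate_but_noisy))
```

The first instrument gives tightly clustered readings that are all off in the same direction; the second scatters widely but is centered on the true value.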

5 Bias & Variability
A biased measurement will be wrong in the same direction nearly every time. Variability is the difference in successive measurements of the same thing.

TYPES OF DATA

6 Quantitative Data: Discrete vs. Continuous
Discrete random variables can only take on values from a countable set of numbers, such as the integers or some subset of the integers. (Usually, they can't be fractions.) Continuous random variables can take on any real number in some interval. (They can be fractions.)
Categorical: Nominal vs. Ordinal
Nominal (unordered) random variables have categories where order doesn't matter (e.g. gender, ethnic background, religious affiliation). Ordinal (ordered) random variables have ordered categories (e.g. grade levels, income levels, school levels, ...).
Observational units are entities whose characteristics we measure. Random variables are characteristics of the observational units.

7 A Review of the Main Principles of Statistics
Population: the entire collection of units about which we would like information.
Sample: the collection of units we actually measure.
Parameter: the true value we hope to obtain.
Statistic: an estimate of the parameter based on observed information in the sample.
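The four terms above can be made concrete with a simulated population. In this sketch the population (10,000 hypothetical heights in cm), the sample size and the variable names are assumptions chosen for illustration; in a real study the population is usually not available, which is why the parameter must be estimated.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Population: every unit we would like information about (simulated here).
population = [random.gauss(170, 10) for _ in range(10_000)]

# Parameter: the true value, computable only because we simulated everything.
parameter = statistics.mean(population)

# Sample: the units we actually measure.
sample = random.sample(population, 100)

# Statistic: the estimate of the parameter computed from the sample.
statistic = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.2f}")
print(f"statistic (sample mean):     {statistic:.2f}")
```

Drawing a different sample gives a different statistic, while the parameter stays fixed.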

8 Non-Graphical Exploratory Data Analysis
This preliminary data analysis step focuses on four points:
• measures of central tendency, i.e. the mean, the median and the mode
• measures of spread, i.e. variability: the variance and the standard deviation
• the shape of the distribution
• the existence of outliers
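The measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch on made-up data. The slide does not say how outliers should be detected, so the 1.5 × IQR rule used here is an assumption, one common convention among several.

```python
import statistics

# Made-up measurements; the value 40 is deliberately far from the rest.
data = [2, 3, 3, 4, 5, 5, 5, 6, 7, 40]

# Measures of central tendency
mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# Measures of spread (sample versions, dividing by n - 1)
variance = statistics.variance(data)
std_dev = statistics.stdev(data)

# Outliers by the 1.5 * IQR rule (an assumed convention, not from the slides)
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print("mean:", mean, "median:", median, "mode:", mode)
print("variance:", round(variance, 2), "std dev:", round(std_dev, 2))
print("outliers:", outliers)
```

The mean is pulled toward the extreme value while the median is not, which is itself a quick hint about the shape of the distribution.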

13 Why Squared Deviations?
Squares eliminate the negatives, so positive and negative deviations cannot cancel each other out. Result: the contribution to the variance increases as you go farther from the mean.
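Both points can be checked on a small made-up data set: the raw deviations from the mean always sum to zero, so they cannot measure spread, while the squared deviations are all non-negative and give the most weight to the points farthest from the mean. The data values below are assumptions chosen so that the arithmetic is exact.

```python
import statistics

data = [2, 4, 4, 6, 9]           # made-up values with mean 5
mean = statistics.fmean(data)

deviations = [x - mean for x in data]
squared = [d ** 2 for d in deviations]

# Raw deviations cancel out: negatives offset positives.
print("deviations:", deviations, "sum:", sum(deviations))

# Squared deviations do not cancel, and far points dominate the total.
print("squared:   ", squared, "sum:", sum(squared))
share_of_farthest = max(squared) / sum(squared)
print("share contributed by the farthest point:", round(share_of_farthest, 2))
```

Here the single value 9, which lies 4 units from the mean, accounts for more than half of the total squared deviation.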

14 The standard deviation is simply the square root of the variance.
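The formulas from the original slides are not preserved in this transcript. As a reminder, the standard sample definitions are written out below; the n − 1 divisor is an assumption here, since the lecture may equally have used the population form that divides by n.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
s = \sqrt{s^2}
```

Taking the square root returns the measure of spread to the same units as the original data, which is why the standard deviation is usually easier to interpret than the variance.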

15 Interesting Theoretical Result

18 Graphical Exploratory Data Analysis
Univariate Data: Histograms and Bar Plots
What's the difference between a histogram and a bar plot?
Bar plot
• Used for categorical variables to show the frequency or proportion in each category.
• Translates the data from frequency tables into a pictorial representation.
Histogram
• Used to visualize the distribution (shape, center, range, variation) of continuous variables.
• "Bin size" is important.
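The counting behind each plot type can be sketched without a plotting library. A bar plot counts occurrences of each category, while a histogram first groups continuous values into bins, so the picture depends on the bin size. The data, the function names and the bin widths below are hypothetical.

```python
from collections import Counter

def bar_counts(categories):
    """Frequency table for a categorical variable (what a bar plot draws)."""
    return Counter(categories)

def histogram_counts(values, bin_width, start):
    """Count continuous values per bin; keys are the left edges of the bins."""
    counts = Counter()
    for v in values:
        index = int((v - start) // bin_width)
        counts[index] += 1
    return {start + i * bin_width: counts[i] for i in sorted(counts)}

blood_types = ["A", "O", "B", "O", "A", "O", "AB"]                 # categorical
heights = [151.0, 158.5, 162.0, 163.5, 167.0, 171.0, 172.5, 179.0]  # continuous

bars = bar_counts(blood_types)
narrow = histogram_counts(heights, bin_width=5, start=150)
wide = histogram_counts(heights, bin_width=10, start=150)

print("bar plot counts:    ", dict(bars))
print("histogram, width 5: ", narrow)
print("histogram, width 10:", wide)
```

The same heights look bumpy with narrow bins and smooth with wide bins, which is why the bin size has to be chosen with care.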
