Descriptive statistics Petter Mostad 2005.09.08


1 Descriptive statistics Petter Mostad 2005.09.08

2 Goal: Reduce data amount, keep ”information” Two uses: Data exploration: What you do for yourself when you first get the data. Data presentation: Illustrating for others some conclusion with numbers or graphs based on the data.

3 Data exploration Understand description of variables Find ranges, typical values, distributions of variables –Is the data OK? Meaningful? Outliers? Errors? How do variables relate to each other? –Is it meaningful? As expected? Can you form new hypotheses?

4 Data presentation Remove superfluous information Present essential information fairly Present information efficiently Make it possible to understand information quickly and simply

5 Types of variables Numerical variables –Discrete –Continuous Categorical variables –Nominal values –Ordinal values

6 Histograms Subdivide continuous data into intervals, and display counts in intervals. The decision about the width of the intervals can influence the result a lot. ”Ogives”
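The effect of interval width can be checked by counting the same data with two different widths. This is a minimal sketch using only the Python standard library; the function name `histogram_counts`, the `start` parameter and the age values are made up for illustration.

```python
from collections import Counter

def histogram_counts(values, width, start=0.0):
    """Count values per interval [start + k*width, start + (k+1)*width)."""
    counts = Counter()
    for v in values:
        k = int((v - start) // width)  # index of the interval containing v
        counts[k] += 1
    return {
        (start + k * width, start + (k + 1) * width): counts[k]
        for k in sorted(counts)
    }

# Hypothetical ages, invented for this example.
ages = [21, 22, 24, 25, 29, 31, 33, 38, 40, 41, 47, 52, 58, 63]

wide = histogram_counts(ages, width=20)    # few, coarse intervals
narrow = histogram_counts(ages, width=5)   # many, fine intervals

for (low, high), count in narrow.items():
    print(f"{low:5.0f} - {high:5.0f}: {'*' * count}")
```

With width 20 the data falls into three intervals, while width 5 spreads the same 14 values over nine intervals, so the apparent shape changes even though the data does not.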

7 Bar charts Can show variation between categories Grouped bars can compare variations in different groups Stacked bars can show proportions, or cumulative effects

8 Example Shows changing proportions of 8 types across 24 groups Groups: coexpressed genes Types: Types of organisms

9 Cumulative distributions Cumulates the proportions up to each level Can never decrease; goes from 0 to 1 (or 100%)
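A cumulative distribution can be sketched directly from this description: sort the data and record, for each distinct level, the proportion of observations at or below it. The function name and the sample values are assumptions made for this example.

```python
def cumulative_proportions(values):
    """Return (level, proportion of values <= level) for each distinct level."""
    ordered = sorted(values)
    n = len(ordered)
    points = []
    for i, v in enumerate(ordered):
        # Emit one point per distinct level, after its last occurrence.
        if i == n - 1 or ordered[i + 1] != v:
            points.append((v, (i + 1) / n))
    return points

data = [1, 2, 2, 3, 5]  # made-up observations
cdf = cumulative_proportions(data)
for level, proportion in cdf:
    print(level, proportion)
```

The proportions never decrease from one level to the next, and the last one is always 1.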

10 Stem-and-leaf diagrams A way to show both the distribution of numbers graphically, and the digits involved

Age in years Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     2,00        1 .  &
    18,00        2 .  01223444
    28,00        2 .  5667888889999
    39,00        3 .  0000111222233344444
    48,00        3 .  55555666777778888899999
    38,00        4 .  00001111223334444
    39,00        4 .  555677777888889999
    37,00        5 .  0000011223333444
    22,00        5 .  55667789999
    13,00        6 .  011133
     5,00        6 .  6&
     7,00        7 .  03&
     1,00        7 .  &

 Stem width: 10
 Each leaf: 2 case(s)
 & denotes fractional leaves.
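A simplified stem-and-leaf display can be built in a few lines. This sketch assumes non-negative integer data, uses one row per stem and one leaf per observation (the plot above instead splits each stem into two rows and lets each leaf stand for two cases); the function name and the ages are invented.

```python
from collections import defaultdict

def stem_and_leaf(values, stem_width=10):
    """Group non-negative integers by stem (value // stem_width); leaves are the remainders."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // stem_width].append(v % stem_width)
    return [
        f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]

ages = [18, 21, 22, 25, 30, 31, 31, 47]  # made-up ages
rows = stem_and_leaf(ages)
print("\n".join(rows))
```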

11 Pie charts Illustrates percentages or parts well for comparison between the parts. 3D pies, or ”exploded” pies, distort more than they clarify the information

12 Pareto diagrams Focuses on the most important (frequent) categories. Shows cumulative frequencies when including each category
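The numbers behind a Pareto diagram can be sketched as follows: order the categories from most to least frequent and accumulate their share of the total. The function name `pareto_table` and the error categories are hypothetical.

```python
from collections import Counter

def pareto_table(observations):
    """Return (category, count, cumulative proportion), most frequent category first."""
    counts = Counter(observations)
    total = sum(counts.values())
    running = 0
    rows = []
    for category, count in counts.most_common():
        running += count
        rows.append((category, count, running / total))
    return rows

# Invented categories for illustration.
causes = ["typo"] * 5 + ["missing value"] * 3 + ["wrong unit"] * 2
rows = pareto_table(causes)
for category, count, cumulative in rows:
    print(f"{category:15s} {count:3d} {cumulative:6.0%}")
```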

13 Numerical summary statistics (Arithmetic) mean Median Mode Skewness Outliers Max, min, range
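Most of these summaries are available in Python's `statistics` module. The sketch below computes them for a small made-up data set; the skewness function uses the moment coefficient m3 / m2^1.5, which is one common definition among several and is an assumption here, not something the slide specifies.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (one common definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

data = [1, 2, 2, 3, 12]  # made-up data with one large value

summary = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
    "min": min(data),
    "max": max(data),
    "range": max(data) - min(data),
    "skewness": skewness(data),
}
for name, value in summary.items():
    print(name, value)
```

The single large value pulls the mean above the median and gives a positive skewness, which is a hint to look for outliers.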

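14 Arithmetic versus geometric mean Given observations $x_1, x_2, \ldots, x_n$ Arithmetic mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ Geometric mean: $\left(\prod_{i=1}^{n} x_i\right)^{1/n}$ They correspond to each other when the scale is changed by taking logarithms: the logarithm of the geometric mean is the arithmetic mean of the logarithms.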
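The correspondence under logarithms can be checked numerically. This is a minimal sketch assuming strictly positive observations (the geometric mean and the logarithm both require them); the function names and values are chosen for illustration.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    """n-th root of the product; assumes all xs are positive."""
    return math.prod(xs) ** (1 / len(xs))

xs = [1.0, 10.0, 100.0]  # made-up positive observations

# Arithmetic mean on the log scale, transformed back to the original scale.
mean_of_logs = arithmetic_mean([math.log(x) for x in xs])
back_transformed = math.exp(mean_of_logs)

print(arithmetic_mean(xs), geometric_mean(xs), back_transformed)
```

Up to rounding, `back_transformed` equals `geometric_mean(xs)`, while the arithmetic mean of the untransformed values is much larger because it is dominated by the largest observation.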

15 Measures of variability (Sample) variance (Sample) standard deviation Coefficient of variation
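These three measures can be sketched with the standard library. The code assumes the usual sample definitions (variance with divisor n - 1, coefficient of variation as standard deviation divided by mean), since the formulas themselves are not reproduced in the transcript; the data are made up.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations

mean = statistics.mean(data)
variance = statistics.variance(data)   # sample variance, divisor n - 1
std_dev = statistics.stdev(data)       # square root of the sample variance
coeff_of_variation = std_dev / mean    # unit-free; assumes a non-zero mean

print(mean, variance, std_dev, coeff_of_variation)
```

The coefficient of variation is mainly meaningful for data measured on a positive scale, where dividing by the mean makes sense.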

16 Percentiles and quartiles The x-th percentile is the number p such that x percent of the data is smaller than p. The first and third quartiles are the 25th and 75th percentiles, respectively. The inter-quartile range is the difference between the third and first quartiles.
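Quartiles and the inter-quartile range can be sketched with `statistics.quantiles`. Several interpolation conventions for percentiles exist and give slightly different answers on small samples; this example uses the function's default method, which is an assumption and not necessarily the convention behind the slide.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # made-up observations

# n=4 returns the three cut points that split the data into quarters.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# n=100 returns the 1st to 99th percentiles; index 89 is the 90th.
percentiles = statistics.quantiles(data, n=100)
p90 = percentiles[89]

print(q1, median, q3, iqr, p90)
```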

17 Boxplots ”Box and whisker plots” Sometimes shows min, 1st quartile, median, 3rd quartile, max May instead show some outliers separately
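The numbers a boxplot draws can be computed as below. Which points count as outliers is a convention; this sketch assumes the widely used rule of flagging points more than 1.5 inter-quartile ranges outside the box, which the slide does not specify. The function name and data are invented.

```python
import statistics

def boxplot_summary(values, whisker=1.5):
    """Five-number summary plus outliers under the (assumed) 1.5 * IQR rule."""
    ordered = sorted(values)
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
        "whisker_low": inside[0],    # whiskers end at the most extreme non-outliers
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]  # made-up data with one extreme value
summary = boxplot_summary(data)
print(summary)
```

Here the value 50 is shown separately as an outlier, and the upper whisker stops at 10 instead of stretching to the maximum.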

18 Scatterplots Probably the most useful graphical plot Can show any kind of connection between variables, not only linear Can be done for many pairs at a time (matrix plot), or for triplets (3D plot)

19 Covariance Given paired observations $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ (sample) covariance: $s_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ Positive when variables tend to change in the same direction, negative if opposite direction
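The sign behaviour of the sample covariance can be seen in a small sketch. It assumes the usual definition with divisor n - 1; the function name and the paired values are made up.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4]
y_same = [2, 4, 6, 8]      # increases with x
y_opposite = [8, 6, 4, 2]  # decreases as x increases

cov_same = sample_covariance(x, y_same)
cov_opposite = sample_covariance(x, y_opposite)
print(cov_same, cov_opposite)
```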

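20 Correlation coefficient Correlation coefficient: $r_{xy} = \frac{s_{xy}}{s_x s_y}$, where $s_{xy}$ is the sample covariance and $s_x$, $s_y$ are the sample standard deviations Always between -1 and 1 If exactly equal to 1, then points are on an increasing line Can be a more illustrative measure than covariance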
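One reason the correlation coefficient can be more illustrative is that it does not depend on the units of measurement. The sketch below assumes the usual definition (sample covariance divided by the product of the sample standard deviations); function names and data are invented.

```python
import math

def sample_covariance(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Covariance scaled by both standard deviations; lies between -1 and 1."""
    std_x = math.sqrt(sample_covariance(xs, xs))  # covariance of x with itself is its variance
    std_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (std_x * std_y)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]          # points on an increasing line
y_rescaled = [100 * v for v in y]  # same data in different units

r = correlation(x, y)
r_rescaled = correlation(x, y_rescaled)
print(r, r_rescaled)
print(sample_covariance(x, y), sample_covariance(x, y_rescaled))
```

Changing the units multiplies the covariance by 100 but leaves the correlation at 1 (up to rounding).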

21 Least squares line fitting We can illustrate a trend in the data by fitting a line

22 Fitting the line The line is often fitted by minimizing the sum of the squares of the ”errors” (the vertical distances to the line) We will hear much about regression methods later
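For a single explanatory variable, the line minimizing the sum of squared vertical distances has a closed form: the slope is the sum of products of deviations divided by the sum of squared deviations of x, and the line passes through the point of means. A minimal sketch with made-up data and an invented function name:

```python
def fit_line(xs, ys):
    """Least squares line y = intercept + slope * x for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]  # roughly y = 1 + 2x with some noise

intercept, slope = fit_line(x, y)
errors = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sum_of_squares = sum(e ** 2 for e in errors)
print(intercept, slope, sum_of_squares)
```

The `errors` list holds the vertical distances to the fitted line; any other choice of intercept and slope would give a larger `sum_of_squares` for this data.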

23 Cross tables When items can be classified using two different categorical variables, we can illustrate counts in a cross table. If percentages are computed, they must be either relative to the columns or the rows. In multiway tables, more than two classifying variables are used.
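The difference between row and column percentages can be sketched as follows. The group and answer labels are invented, and the helper names are assumptions made for this example.

```python
from collections import Counter

def cross_table(pairs):
    """Count how often each (row category, column category) pair occurs."""
    return Counter(pairs)

def row_percentages(table):
    totals = Counter()
    for (row, _col), count in table.items():
        totals[row] += count
    return {(row, col): 100 * count / totals[row] for (row, col), count in table.items()}

def column_percentages(table):
    totals = Counter()
    for (_row, col), count in table.items():
        totals[col] += count
    return {(row, col): 100 * count / totals[col] for (row, col), count in table.items()}

# Invented observations: (group, answer).
items = [("A", "yes"), ("A", "yes"), ("A", "no"), ("B", "yes"), ("B", "no"), ("B", "no")]
table = cross_table(items)
by_row = row_percentages(table)
by_column = column_percentages(table)

print(by_row[("A", "yes")])     # share of group A that answered yes
print(by_column[("A", "yes")])  # share of yes answers that came from group A
```

The same cell gives two different percentages depending on the direction, which is why a table should state which one it reports.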

24 Early example: Napoleon's Russian campaign 1812-1813

25 DNA sequence logos Used to show what is conserved, and what varies, at DNA binding sites for some protein Relative height of letters shows which bases are conserved Total height shows degree of conservation
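The heights in a sequence logo can be sketched from aligned binding sites. This example assumes the common convention that the total height at a position is 2 bits minus the Shannon entropy of the base frequencies, with each letter's height proportional to its frequency; the slide does not state the formula, small-sample corrections are ignored, and the sequences are invented.

```python
import math
from collections import Counter

def logo_heights(sites):
    """Per position: (total height in bits, {base: letter height})."""
    heights = []
    for column in zip(*sites):  # iterate over alignment positions
        counts = Counter(column)
        n = len(column)
        freqs = {base: count / n for base, count in counts.items()}
        entropy = -sum(f * math.log2(f) for f in freqs.values())
        total = 2.0 - entropy   # assumed convention: 2 bits = fully conserved
        heights.append((total, {base: f * total for base, f in freqs.items()}))
    return heights

# Invented aligned sites: the first three positions are conserved, the last varies.
sites = ["ACGT", "ACGA", "ACGC", "ACGG"]
heights = logo_heights(sites)
for position, (total, letters) in enumerate(heights, start=1):
    print(position, round(total, 2), letters)
```

A fully conserved position gets the maximum height, and a position where all four bases are equally frequent gets height zero.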

26 Chernoff faces A way to visualize about 20 parameters in one figure Background: We are good at remembering and comparing faces Features in the face correspond to parameters you want to visualize

27 Chernoff faces

28 Use your own creativity! When exploring data, try to make the kinds of plots that will answer your questions! When presenting data, think about –simplicity –fairness –efficiency –inventiveness

