Presentation is loading. Please wait.

Presentation is loading. Please wait.

R INSTALLATION R is an open source software package for statistical data analysis.

Similar presentations


Presentation on theme: "R INSTALLATION R is an open source software package for statistical data analysis."— Presentation transcript:

1 R INSTALLATION R is an open source software package for statistical data analysis

2 R Installation R download: http://www.r-project.org/http://www.r-project.org/ Germany, Stefan Drees Bonn: http://cran.r-mirror.de/http://cran.r-mirror.de/ This is the main program! RStudio (Auxiliary program for editing R files): http://www.rstudio.com/ide/download/ http://www.rstudio.com/ide/download/ Install the programs (R should be there already).

3 Working with the R command prompt (without RStudio) Practical session

4 Working with RStudio Start RStudio „File -> New -> R Script“ Lets you edit a R command or R script (= small programme = several consecutive commands)

5 Working with RStudio New R files The command prompt Select Files Plots Packages (for advanced analyses) Help

6 Working with RStudio Commands and programs can be stored in R files Execute one command line: Ctrl+Enter or Button „Run“ Execute several lines: Mark lines and use „Ctrl+Enter“ or „Run“ button

7 WHAT IS STATISTICS?

8 What is statistics? Statistics is a means to connect empirical knowledge and theory and is constituted as follows: Data representation (Empirics) Methods for description, analysis, and interpretation of data, in order to allow predictions, conclusions and decisions (Statistical Theory)

9 What is statistics? I. Descriptive Statistics II. Probability theory III. Test theory

10 DESCRIPTIVE STATISTICS

11 Descriptive Statistics

12 Descriptive statistics quantitative Observational objects discrete continuous Attributes / traits Values qualitative patients, blood samples, DNA samples, houses, atoms blood pressure, weight, age, blood group, number of siblings, marital status, rent Blood group, marital status Number of siblings blood pressure, weight, age, rent

13 Descriptive statistics Scaling: Nominal scale: attribute values that are not directly comparable (sex, subject of studies, country of origin) (qualitative) Ordinal scale: attribute values that have a „natural“ order (grades, font sizes: tiny-small-medium-large-huge) Interval scale: difference between attribute values is interpretable (temperature in °C) (quantitative) To be distinguished: Discrete attributes: Attribute values can be counted Continuous attributes: All real numbers, or at least all numbers from an interval, are possible

14 Descriptive statistics Frequencies: n Absolute frequency n i : i Number of obersvations with attribute value i (counts) Relative frequency h i : i Portion of elements with attribute value i Nn i / N To be computed as absolute frequency devided by total number of objects N: n i / N Relative frequencies lie between 0 and 1 Relative frequencies have to add up to 1 (<- can be used to check computation)

15 Descriptive statistics AB0 blood group 1.00N = 50 0 1 2 5 6 19 17 1.00 0.00 0.01 0.03 0.09 0.10 0.38 0.39 Köln other A2B A1B B A2 A1 0 value 7 6 5 4 3 2 1 n i absolute frequency n i 200 0 2 6 18 20 76 78 KölnBonn h i relative frequency h i Bonn 0.00 0.02 0.04 0.10 0.12 0.38 0.34 tally sheet

16 Descriptive statistics

17 height classification: complete disjoint (each value belongs to only one class) class limits: (160; 170] contains all values, that are > 160 but  170. 150160170180190200 150200 height [cm] ( ] ( ] (]( ](] (

18 Descriptive statistics height [cm] h i relative h i 1,00 0.00 0.05 0.25 0.35 0.30 0.05 0.00 frequency n i absolute n i N=100 0 5 25 35 30 5 0 Cumulative frequency H i relative H i 1.00 0.95 0.70 0.35 0.05 0.00 N i absolute N i 100 95 70 35 5 0 7 6 5 4 3 2 1 Class number i > 200 (190; 200] (180; 190] (170; 180] (160; 170] (150; 160]  150 Class limits (ai-1; ai] Tally sheet

19 DESCRIPTIVE STATISTICS Graphical representation

20 Descriptive statistics Pie chart (R function: pie() ) Shows absolute frequencies Example: blood groups

21 Descriptive statistics Bar chart (R function: barplot() ) Shows relative frequencies Example: blood groups

22 Descriptive statistics Representation of cumulative frequencies with empirical distribution function F Discrete trait: Number of Children 1.00 0.96 0.90 0.80 0.50 0.10 >4 4 3 2 1 0 Number of children 6 5 4 3 2 1 N = 50 2 3 5 15 20 5 h i relative h i n i absolute n i Frequencies 50 48 45 40 25 5 Cumulative frequencies N i absolute N i relative H i 0.04 0.06 0.10 0.30 0.40 0.10 1.00 Tally sheet

23 Descriptive statistics H 0.0 0.2 0.4 0.6 0.8 1.0 01234>4 i Number of children hihi hihi h i 0.0 0.2 0.4 0.6 0.8 1.0 0 1 2 3 4>4 Bar chart F F:Empirical distribution function Since the attribute is quantitative discrete, we obtain a step function

24 Descriptive statistics Histogramms (R function: hist() ) Construction: Data is subdevided into classes Surface area of columns is proportional to the respective frequencies Columns are neighbouring since classes are neighbouring

25 Descriptive statistics Example: Height [cm] 0 0,2 0,4 0,6 0,8 1 150160170180190200 height [cm ] hihi Histogram

26 Descriptive statistics 0 0,2 0,4 0,6 0,8 150160170180190 200 height [cm] F 0 0,2 0,4 0,6 0,8 150160170180190 200 height [cm] f empirical density function f empirical distribution function F (for continuous trait)

27 Descriptive statistics 0 0,2 0,4 0,6 0,8 150160170180190 200 height [cm] f empirical density function f 0 0,2 0,4 0,6 0,8 150160170180190 200 height [cm] F empirical distribution function F hihi hihi

28 Descriptive Statistics Note: Slides 23 and 26 both show empirical distribution functions. In the first case, we obtain a step function since the trait under investigation is discrete.

29 DESCRIPTIVE STATISTICS Measures of central tendency, dispersion and spread

30 Descriptive statistics Measures of central tendency: A number to characterize the „center“ of the data Most important: Mean Median

31 Descriptive statistics x 8 =7 x 7 =6 x 6 =4 x 5 =19 x 4 =8 x 3 =3 x 2 =9 x 1 =5 sampleranks x (8) =19 x (1) =3 x (2) =4 x (3) =5 x (4) =6 x (5) =7 x (6) =8 x (7) =9 x 7 =6 x 6 =4 x 5 =19 x 4 =8 x 3 =3 x 2 =9 x 1 =5 sampleranks x (1) =3 x (2) =4 x (3) =5 x (4) =6 x (5) =8 x (6) =9 x (7) =19

32 Descriptive statistics

33 i 12000 1500 25000150002000 34000 2500 41500 4000 52500 5000 / 15000

34 Descriptive statistics x outlier How to treat outliers? Yes!2) Check value and correct No!1) Discard

35 Descriptive statistics Measure the amount of variation of the data! x sample Asample B x  The mean (or median) is not sufficent to describe a sample

36 Descriptive statistics Measures of dispersion and spread: Numbers to characterize the amount variation around the center (= mean) Most important: Minimum, maximum, range (dispersion) Empirical variance (spread) Empirical standard deviation (spread)

37 Descriptive statistics ranks n=7 x 7 =6 x 6 =4 x 5 =19 x 4 =8 x 3 =3 x 2 =9 x 1 =5 sample x (1) =3 x (2) =4 x (3) =5 x (4) =6 x (5) =8 x (6) =9 x (7) =19 minimum:min = x (1) maximum:max = x (n) range:R = x (n) – x (1) range:

38 Descriptive statistics

39 = =n =x4x4 =x3x3 =x2x2 =x1x1 Example: 100 4 270 2 75 x 4 =53 is not free, but given by other values when the mean is known. s 2 has (n-1) degrees of freedom (f)

40 Data If you have data you want to analyse, please bring it along!


Download ppt "R INSTALLATION R is an open source software package for statistical data analysis."

Similar presentations


Ads by Google