Introduction to Statistical Modelling

Example: Body and heart weights of cats.

The R data frame cats, and the variables therein, are made available by typing
> library(MASS)
> data(cats)
> attach(cats)
The data frame records the heart weights (Hwt), body weights (Bwt) and gender (Sex) of 47 female and 97 male adult cats. We regard heart weight as the response variable, and seek to model its conditional distribution, given (the values of) the explanatory variables body weight and gender. In other words, how does heart weight depend on body weight or gender?
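As a quick check of the description above, the following sketch inspects the data frame directly, without attaching it. It uses only base R and the MASS package; the name cats_df is an illustrative choice and does not appear in the original notes.

```r
library(MASS)              # provides the cats data frame

cats_df <- MASS::cats      # local copy; cats_df is a hypothetical name

str(cats_df)               # three variables: Sex (factor), Bwt (kg), Hwt (g)
table(cats_df$Sex)         # number of female (F) and male (M) cats
summary(cats_df$Hwt)       # numerical summary of the response variable
```

Working from a named copy rather than attach(cats) is one way to keep it explicit which data each later command uses.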
Exploratory analysis

The R code
> boxplot(Hwt~Sex, xlab='gender', ylab='heart weight')
produces a plot that clearly shows that heart weight depends on gender.
Similarly, the R code
> plot(Hwt~Bwt, xlab='body weight (kg)', ylab='heart weight (g)', type='n')
> points(Hwt[Sex=='M']~Bwt[Sex=='M'])
> points(Hwt[Sex=='F']~Bwt[Sex=='F'], pch=3)
produces a plot of heart weight against body weight for male cats (denoted by o) and female cats (denoted by +).
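Conclusion: It is clear that there is some dependence of heart weight on each of the two explanatory variables considered separately. Since both variables matter, a simple first step is to hold gender fixed and study the dependence on body weight alone. Thus we might initially try further analysis for male cats alone.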
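The graphical conclusion can be backed up with a few numerical summaries. The sketch below is one possible way to do this: it compares mean heart weight between genders, computes the correlation between body weight and heart weight within each gender, and extracts the male cats for later use. The names cats_df, hwt_means, cor_by_sex and male_cats are illustrative assumptions, not part of the original notes.

```r
library(MASS)

cats_df <- MASS::cats      # cats_df is a hypothetical name

# Mean heart weight (g) within each gender
hwt_means <- tapply(cats_df$Hwt, cats_df$Sex, mean)
print(hwt_means)

# Correlation between body weight and heart weight within each gender
cor_by_sex <- sapply(split(cats_df, cats_df$Sex),
                     function(d) cor(d$Bwt, d$Hwt))
print(cor_by_sex)

# Data frame containing the male cats only
male_cats <- subset(cats_df, Sex == "M")
```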
Modelling the dependence of heart weight on body weight for male cats.

Let y denote the response variable “heart weight”, and let x denote the explanatory variable “body weight”.
We now seek to fit a relationship of the form y = f(x) + є, where, for each x, f(x) is the mean of the conditional distribution of y given x. The quantity є is then a residual random variable, with mean zero, describing the deviation of y from this conditional mean.
The scatter plot suggests that the dependence may be linear. Thus the relationship is approximately y = a + bx + є, for some constants a and b. These constants are chosen by the method of least squares, so as to minimise the sum of squares of the residuals є over the observed data. They can be found using R:
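To make the least squares criterion concrete, the sketch below computes the estimates for a straight line from the standard closed-form expressions (slope = sum of cross-products of deviations divided by the sum of squared deviations of x; intercept = mean of y minus slope times mean of x) and compares them with the output of lm(). It is restricted to the male cats by way of an explicit data frame; note that a call to lm() on the attached cats data frame without a subset uses all 144 cats. The names male_cats, a_hat, b_hat, fit and rss are illustrative assumptions.

```r
library(MASS)

male_cats <- subset(MASS::cats, Sex == "M")   # male_cats is a hypothetical name
x <- male_cats$Bwt   # body weight (kg)
y <- male_cats$Hwt   # heart weight (g)

# Closed-form least squares estimates for the line y = a + b x
b_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
a_hat <- mean(y) - b_hat * mean(x)

# The same straight-line fit obtained with lm()
fit <- lm(Hwt ~ Bwt, data = male_cats)

# Residual sum of squares: the quantity that least squares minimises
rss <- function(a, b) sum((y - a - b * x)^2)

print(c(a = a_hat, b = b_hat))
print(coef(fit))
print(rss(a_hat, b_hat))
```

Evaluating rss() at values of a and b slightly away from the estimates gives a larger sum of squares, which is one informal way to see that the estimates do minimise the criterion.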
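> lm(Hwt~Bwt)
gives the estimates of a and b (reported by R as the coefficients (Intercept) and Bwt), and hence the fitted equation y = a + bx + є.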