8 General concepts of correlation and simple linear regression
In medicine, interest often centres on how two measurable variables taken from the same group of individuals relate to each other, that is, how they co-vary. An example is the relationship between birthweight and maternal alcohol consumption. Two common techniques for analysing such data are correlation and regression.
Correlation: Measures the strength of the association between paired measurable data. Causation should not be inferred from a correlation coefficient, as it measures only the degree of association between the two variables. In addition, just because two variables are correlated over a particular range of values, it should not be assumed that the same relationship holds over a different range. Figures 6-9 illustrate ways in which two measurable variables can co-vary.
Figure 6: Correlation not appropriate
Figure 7: No correlation $$r=0$$
Figure 8: Perfect positive linear correlation $$r=1$$
Figure 9: Perfect negative linear correlation $$r=-1$$
Scatter plot: a visual representation of the direction and strength of the relationship between two variables. If it is known or suspected that one variable (known as the independent/explanatory/predictor variable) influences the value of the other variable (known as the dependent/response variable), the independent variable should be plotted on the horizontal axis and the dependent variable on the vertical axis. When undertaking either a correlation or simple linear regression analysis it is important to construct a scatter plot of the data as this will reveal how the two variables co-vary. It may be that the relationship is not monotonic and thus neither correlation nor simple linear regression analysis would be appropriate (e.g. Figure 6).
Pearson's correlation coefficient: $$r$$ is used to quantify the strength and direction of the linear relationship between two variables.
Spearman's rank correlation coefficient: $$r_{s}$$ is used if one or both variables are ordinal, or if the interest is in whether the two variables tend to increase or decrease together (a monotonic relationship) rather than specifically along a straight line.
Positive correlation: $$r>0$$, one variable tends to increase as the other increases.
Negative correlation: $$r<0$$, one variable tends to decrease as the other increases.
No correlation: $$r=0$$, no linear association (Figure 7).
Perfect correlation: $$r=1$$ or $$r=-1$$, all points lie on a straight line (Figures 8 & 9).
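The two coefficients defined above can be compared on a small data set. The following is a minimal sketch in plain Python, not a reference implementation: the function names and the data are made up for illustration, and Spearman's coefficient is computed here as Pearson's coefficient applied to the ranks, with tied values given the average of their ranks.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient r for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman_rs(xs, ys):
    """Spearman's rank correlation: Pearson's r applied to the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# Made-up data: y always rises with x, but along a curve, not a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]
print(f"Pearson r   = {pearson_r(x, y):.3f}")
print(f"Spearman rs = {spearman_rs(x, y):.3f}")
```

Because y always increases with x in this made-up data, the Spearman coefficient is 1, while the Pearson coefficient is below 1 because the points do not lie on a straight line.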
Simple Linear Regression: A technique for describing quantitatively the linear relationship between a dependent variable $$Y$$ and an independent variable $$X$$. It enables prediction of the value of $$Y$$ from a known value of $$X$$.
Regression line: A straight-line equation that is used to model the relationship between the dependent (response) variable and the independent (predictor) variable. Note that the regression line should not be used to make predictions for $$X$$ values outside the range of values in the observed data. For simple linear regression where there is a single response variable and a single predictor variable the equation of the regression line is given by:
$$Y = a + bX$$
Where:
$$Y =$$ dependent/response variable
$$X =$$ independent/predictor variable
$$a =$$ intercept: the value of the $$Y$$ variable when the $$X$$ variable is zero
$$b =$$ regression coefficient or slope. It shows the change in $$Y$$ for a unit change in $$X$$. When $$Y$$ increases as $$X$$ increases, the coefficient is positive. Conversely, when $$Y$$ decreases as $$X$$ increases, the coefficient is negative.
The proportion of the total variability of the dependent variable, $$Y$$, explained by the regression on $$X$$ is called $$r^{2}$$ and is often quoted as a measure of goodness of fit of the regression line to the data. Note that this is equal to the square of the correlation coefficient $$r$$.
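To make the quantities $$a$$, $$b$$ and $$r^{2}$$ concrete, the sketch below fits a line by ordinary least squares, which is the usual way of estimating a simple linear regression line. The text does not name the estimation method, so this is an assumption, and the function name `fit_line` and the data are invented for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for Y = a + bX, plus r squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx               # slope: change in Y per unit change in X
    a = mean_y - b * mean_x     # intercept: fitted Y when X is zero
    # Proportion of the variability in Y explained by the regression on X.
    residual_ss = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - residual_ss / syy
    return a, b, r_squared


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b, r_squared = fit_line(x, y)
print(f"Y = {a:.2f} + {b:.2f}X, r squared = {r_squared:.2f}")
```

For this made-up data the fitted slope is positive, so $$Y$$ rises with $$X$$, and $$r^{2}$$ gives the share of the variability in $$Y$$ accounted for by the line.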
Teenage pregnancy example:
The example below shows the relationship between deprivation and teenage pregnancy rates for 40 local authorities in England for the years 1999-2001. In this example the equation for the regression line is:
$$\mbox{Pregnancy rate } = 13.04 + 6 \times \mbox{ deprivation score}.$$
Thus, when the deprivation score is 0 the predicted teenage pregnancy rate is 13.04 per 1000 women aged 15-17, and for each 1-unit increase in the deprivation score the predicted rate increases by 6 per 1000 women aged 15-17.
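The arithmetic of this regression line can be sketched in a few lines of Python. The intercept and slope are the values quoted in the equation; the function name and the deprivation scores are hypothetical, chosen only to show the calculation. The observed range of scores is not stated in the text, so whether any particular score lies inside it is not checked here.

```python
INTERCEPT = 13.04  # predicted rate when the deprivation score is 0
SLOPE = 6.0        # rise in predicted rate per 1-unit rise in deprivation score


def predicted_pregnancy_rate(deprivation_score):
    """Predicted rate per 1000 women aged 15-17 from the quoted regression line."""
    return INTERCEPT + SLOPE * deprivation_score


# Hypothetical scores chosen only to show the arithmetic.
for score in (0, 1, 5):
    rate = predicted_pregnancy_rate(score)
    print(f"deprivation score {score}: predicted rate {rate:.2f}")
```

Each 1-unit step in the score adds 6 to the predicted rate, which is the interpretation of the slope.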
Figure 10: Scatter plot of teenage pregnancy rate against deprivation score for 40 English local authorities, with fitted regression line. (Ref: www.empho.org.uk/whatsnew/teenage-pregnancy-presentation.ppt)
The method can be extended to adjust for other risk factors, in which case it is called multiple regression. Logistic regression is used when the outcome of interest has only two possible values (e.g. event/no event). In this case the effect of each predictor on the outcome is usually expressed as an odds ratio.
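As a brief sketch of where the odds ratio comes from, a logistic regression model with a single predictor is usually written as shown below. Here $$p$$, the probability of the event, is notation introduced for this illustration, and $$a$$ and $$b$$ are reused by analogy with the regression line above.

$$\log\left(\frac{p}{1-p}\right) = a + bX, \qquad \mbox{odds ratio for a unit increase in } X = e^{b}$$

An odds ratio above 1 therefore corresponds to a positive coefficient $$b$$, and an odds ratio below 1 to a negative one.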