Guide to Statistics: "Sampling for Surveys"

3 Ratio and Regression Estimators

We may be interested in many variables $$X, Y, \ldots$$ in a sample survey. For example, in a household expenditure survey $$Y$$ may be annual household expenditure and $$X$$ may be household size. Ratios of these variables may be relevant and interesting. For example, per capita expenditure is represented by the population ratio $$R = Y_{T}/X_{T} = \bar{Y}/\bar{X}$$. There are various possible estimators of $$R$$ based on a simple random sample $$\left(y_{1},x_{1}\right), \left(y_{2},x_{2}\right), \ldots, \left(y_{n},x_{n}\right)$$; in particular there is the sample average ratio $$r_{1} = \frac{1}{n}\sum_{1}^{n}\left(y_{i}/x_{i}\right)$$ and the ratio of the sample averages $$r_{2} = \bar{y}/\bar{x}= y_{T}/x_{T}$$.

How do ratio estimators compare?

In spite of its intuitive appeal, $$r_{1}$$ is not widely used to estimate $$R$$: it is biased and can have large mean square error (mse) compared with $$r_{2}$$. Consider the population of values $$R_{i} = Y_{i}/X_{i}$$ with mean $$\bar{R}$$ and variance $$S^{2}_{R}$$. Under simple random (sr) sampling, $$r_{1}$$ is the mean of an sr sample of the $$R_{i}$$, so it must have mean $$\bar{R}$$ and variance $$\left(1-f\right)S_{R}^{2}/n$$, where $$f = n/N$$ is the sampling fraction. But in general $$\bar{R} \neq R$$, so $$r_{1}$$ has bias $$\bar{R}-R = -\sum_{1}^{N}R_{i}\left(X_{i}-\bar{X}\right)/X_{T}$$. An unbiased estimator of the bias is $$-\left(N-1\right)n\left(\bar{y}-r_{1}\bar{x}\right)/ \left[\left(n-1\right)X_{T}\right]$$ and if $$X_{T}$$ is known, which is not uncommon, we can correct the bias to obtain an unbiased estimator called the Hartley-Ross estimator, $$r_{1}^{\prime} = r_{1}+\left(N-1\right)n\left(\bar{y}-r_{1}\bar{x}\right)/\left[\left(n-1\right)X_{T}\right]$$. The mse is readily estimated. Another way to eliminate the bias is to sample with probability proportional to the $$X_{i}$$ values rather than using sr sampling.
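The bias of $$r_{1}$$ and the effect of the Hartley-Ross correction can be checked exactly on a very small population by enumerating every possible sr sample. The sketch below does this in plain Python; the population values, the sample size and the function names `r1` and `hartley_ross` are all invented for illustration and are not part of the guide.

```python
from itertools import combinations

# Hypothetical toy population (values invented for illustration only).
Y = [12.0, 30.0, 41.0, 58.0, 70.0]
X = [2.0, 3.0, 5.0, 6.0, 9.0]
N, n = len(Y), 3
X_T = sum(X)
R = sum(Y) / X_T  # population ratio Y_T / X_T


def r1(ys, xs):
    """Sample average of the individual ratios y_i / x_i."""
    return sum(y / x for y, x in zip(ys, xs)) / len(ys)


def hartley_ross(ys, xs, N, X_T):
    """r1 plus the bias correction quoted in the text."""
    m = len(ys)
    y_bar = sum(ys) / m
    x_bar = sum(xs) / m
    r = r1(ys, xs)
    return r + (N - 1) * m * (y_bar - r * x_bar) / ((m - 1) * X_T)


# Every sr sample of size n (without replacement) is equally likely,
# so a plain average over all of them is the exact expectation.
samples = list(combinations(range(N), n))
mean_r1 = sum(
    r1([Y[i] for i in s], [X[i] for i in s]) for s in samples
) / len(samples)
mean_hr = sum(
    hartley_ross([Y[i] for i in s], [X[i] for i in s], N, X_T) for s in samples
) / len(samples)

print(f"R = {R:.6f}, E(r1) = {mean_r1:.6f}, E(r1') = {mean_hr:.6f}")
```

For this population the average of $$r_{1}$$ over all samples equals $$\bar{R}$$ rather than $$R$$, while the average of the corrected estimator agrees with $$R$$ up to rounding error.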

The estimator $$r_{2}$$ tends to be used more widely than $$r_{1}$$, because even though it is still biased, it is likely to be less so than $$r_{1}$$ and with lower mse. The bias becomes negligible in large samples and the sampling distribution tends to normality.

Asymptotically, $$E\left(r_{2}\right) = \bar{Y}/\bar{X} = Y_{T}/X_{T} = R$$ and $$Var\left(r_{2}\right) = \frac{\left(1-f\right)}{n\bar{X}^2}\sum_{1}^{N}\frac{\left(Y_{i}-RX_{i}\right)^2}{N-1}$$,
which can be estimated by $$s^{2}\left(r_{2}\right)=\frac{\left(1-f\right)}{n\bar{x}^2}\sum_{1}^{n}\frac{\left(y_{i}-r_{2}x_{i}\right)^2}{n-1}$$,
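so that an approximate $$100\left(1-\alpha\right)\%$$ symmetric two-sided confidence interval is given by $$r_{2} - z_{\alpha}s\left(r_{2}\right) \leq R \leq r_{2} + z_{\alpha}s\left(r_{2}\right)$$, where $$z_{\alpha}$$ denotes the two-sided $$\alpha$$-point of the standard normal distribution (for example, approximately 1.96 when $$\alpha = 0.05$$).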
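As a rough illustration of these formulas, the following sketch computes $$r_{2}$$, its estimated standard error and the approximate confidence interval from a list of sampled pairs. The function name `ratio_ci`, the data and the population size are hypothetical. The normal multiplier is taken as the upper $$\alpha/2$$ point, which is an assumption about how $$z_{\alpha}$$ is meant to be read. With only six observations the large-sample approximation would not be trustworthy in practice; the small sample is used purely to keep the example short.

```python
import math
from statistics import NormalDist


def ratio_ci(ys, xs, N, alpha=0.05):
    """Return (r2, s(r2), lower, upper) for an sr sample of (y, x) pairs."""
    n = len(ys)
    f = n / N                      # sampling fraction
    y_bar = sum(ys) / n
    x_bar = sum(xs) / n
    r2 = y_bar / x_bar             # ratio of the sample averages
    # Sum of squared residuals y_i - r2 * x_i.
    ss = sum((y - r2 * x) ** 2 for y, x in zip(ys, xs))
    var_r2 = (1 - f) / (n * x_bar ** 2) * ss / (n - 1)
    s_r2 = math.sqrt(var_r2)
    # Assumption: two-sided interval, so use the upper alpha/2 normal point.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return r2, s_r2, r2 - z * s_r2, r2 + z * s_r2


# Hypothetical sample: household expenditure (y) and household size (x).
ys = [210.0, 340.0, 150.0, 480.0, 290.0, 405.0]
xs = [2.0, 3.0, 1.0, 5.0, 3.0, 4.0]
r2, s_r2, lo, hi = ratio_ci(ys, xs, N=600)
print(f"r2 = {r2:.3f}, s(r2) = {s_r2:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```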

Frequently two population variables $$\left(Y, X\right)$$ will be correlated and we can exploit the relationship between them to obtain improved estimates of a population mean $$\bar{Y}$$ or total $$Y_{T}$$ using what is called a ratio estimator or a regression estimator.

Ratio Estimators

Suppose we want to estimate the total expenditure, $$Y_{T}$$, of all local authorities on community services from a simple random sample $$\left(y_{i}, x_{i}\right)$$ for $$i = 1, \ldots, n$$, where $$y_{i}$$ is the authority's spend on community services and $$x_{i}$$ is the population size of the sampled authority. We would expect $$Y$$ to be positively correlated with $$X$$ and might hope to be able to exploit this relationship. Instead of using the sr sample estimator $$y_{T}$$ we might assume that $$Y_{i} \approx RX_{i}$$, so that $$Y_{T} = RX_{T}$$, and if $$X_{T}$$, the total population size, is known (which is not unreasonable) we can estimate $$Y_{T}$$ by $$y_{TR} = rX_{T}$$ where $$r$$ is an estimator of the ratio $$R$$, as discussed above. We use $$r_{2}=\bar{y}/\bar{x} = y_{T}/x_{T}$$ for $$r$$. Then $$y_{TR} = rX_{T} = \left(y_{T}/x_{T}\right)X_{T}$$ is known as the sr sample ratio estimator of the population total. This provides a natural compensation: if $$x_{T}$$ happens to be larger, or smaller, than $$X_{T}$$ then the estimate of $$Y_{T}$$ is reduced, or increased, accordingly relative to $$y_{T}$$. Of course, the corresponding ratio estimator of the population mean is just $$\bar{y}_{R} = r \bar{X} = \left(\bar{X}/\bar{x}\right) \bar{y}$$.

The properties of these ratio estimators are immediately found from what we discussed about $$r = r_{2}$$ above. We see that $$\bar{y}_{R}$$ is asymptotically unbiased, sometimes exactly unbiased, and $$Var\left(\bar{y}_{R}\right) \approx \frac{\left(1-f\right)}{n} \sum_{1}^{N} \frac{\left(Y_{i}- RX_{i}\right)^{2}}{N-1} = \frac{\left(1-f\right)}{n}\left(S_{Y}^{2} - 2R\rho_{YX}S_{Y}S_{X} + R^{2}S_{X}^{2}\right)$$
where $$\rho_{YX} = S_{YX}/\left(S_{Y}S_{X}\right)$$ is the population correlation coefficient. The larger the (positive) correlation, the smaller will be $$Var\left(\bar{y}_{R}\right)$$, which can be estimated using the results discussed above for $$r_{2} = r$$. Approximate confidence intervals are correspondingly obtained. Properties for the population total using $$y_{TR}$$ are similarly obtained.
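A minimal sketch of the ratio estimators of the mean and total is given below. The function name `ratio_estimators` and all figures are hypothetical. The variance estimate is taken to be $$\bar{X}^{2}s^{2}\left(r_{2}\right)$$, which is one natural reading of "using the results discussed above" given that $$\bar{y}_{R} = r\bar{X}$$; it is an assumption rather than a formula stated in the text.

```python
def ratio_estimators(ys, xs, N, X_T):
    """Ratio estimates of the population mean and total of Y.

    ys, xs : paired sample values
    N      : population size
    X_T    : known population total of the auxiliary variable
    Returns (y_bar_R, y_TR, estimated variance of y_bar_R).
    """
    n = len(ys)
    f = n / N
    x_bar = sum(xs) / n
    r = sum(ys) / sum(xs)          # r2 = y-bar / x-bar
    X_bar = X_T / N
    y_bar_R = r * X_bar            # ratio estimator of the mean
    y_TR = r * X_T                 # ratio estimator of the total
    # Assumed variance estimate: X_bar^2 times s^2(r2).
    ss = sum((y - r * x) ** 2 for y, x in zip(ys, xs))
    var_r = (1 - f) / (n * x_bar ** 2) * ss / (n - 1)
    return y_bar_R, y_TR, X_bar ** 2 * var_r


# Hypothetical figures: spend (y) and population (x) for 4 sampled authorities.
ys = [10.0, 20.0, 30.0, 44.0]
xs = [1.0, 2.0, 3.0, 4.0]
mean_hat, total_hat, var_hat = ratio_estimators(ys, xs, N=40, X_T=120.0)
print(mean_hat, total_hat, var_hat)
```

Here the sample mean of $$x$$ (2.5) is below the known population mean (3.0), so the ratio estimate of the mean is adjusted upwards from $$\bar{y} = 26$$.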

Under what circumstances are $$\bar{y}_{R}$$ and $$y_{TR}$$ more efficient (have smaller sampling variance) than the sr sample estimators $$\bar{y}$$ and $$y_{T}$$? It can be shown that this will happen if $$\rho_{YX} \geq C_{X}/\left(2C_{Y}\right)$$, where $$C_{X} = S_{X}/\bar{X}$$ and $$C_{Y} = S_{Y}/\bar{Y}$$ are the coefficients of variation. Any efficiency gain clearly requires $$C_{X} \leq 2C_{Y}$$ (since $$\rho_{YX} \leq 1$$), but efficiency gains can be quite high if $$Y$$ and $$X$$ are strongly positively correlated.
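The condition follows in a few steps from the approximate variance of $$\bar{y}_{R}$$ given earlier and the usual sr sampling result $$Var\left(\bar{y}\right) = \left(1-f\right)S_{Y}^{2}/n$$. The sketch assumes $$R > 0$$ and $$S_{X} > 0$$, so that dividing by $$RS_{X}$$ preserves the inequality:

$$Var\left(\bar{y}_{R}\right) \leq Var\left(\bar{y}\right) \iff R^{2}S_{X}^{2} - 2R\rho_{YX}S_{Y}S_{X} \leq 0 \iff \rho_{YX} \geq \frac{RS_{X}}{2S_{Y}} = \frac{S_{X}/\bar{X}}{2S_{Y}/\bar{Y}} = \frac{C_{X}}{2C_{Y}}$$

The last step uses $$R = \bar{Y}/\bar{X}$$.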

Regression Estimators

Ratio estimators are especially beneficial when there is a degree of proportionality between the two variables $$Y$$ and $$X$$; the more so the higher the correlation. When there is rough linearity between the principal variable $$Y$$ and the auxiliary variable $$X$$, but this is not through the origin (i.e. there is no ‘proportionality’), the link between $$Y$$ and $$X$$ can be exploited to improve sr sample estimators by using so-called regression estimators.

The linear regression estimator of $$\bar{Y}$$ is $$\bar{y}_{L} = \bar{y} + b\left(\bar{X}- \bar{x}\right)$$ for a suitable choice of $$b$$ reflecting any (even a rough) linear regression relationship between $$Y$$ and $$X$$. It is readily confirmed that this produces an appropriate compensation depending on the sign of $$b$$. Of course, $$Y_{T}$$ can be estimated by $$N\bar{y}_{L}$$. We might pre-assign a value of $$b$$ or estimate it. In the former case $$\bar{y}_{L}$$ is clearly unbiased (as is $$N\bar{y}_{L}$$) and its variance is $$Var\left(\bar{y}_{L}\right) = \frac{1-f}{n}\left(S_{Y}^{2} - 2b S_{YX} + b^{2}S_{X}^{2}\right)$$ with corresponding unbiased sample estimate $$s^{2}\left(\bar{y}_{L}\right) = \frac{1-f}{n}\left(s_{Y}^{2}-2bs_{YX} + b^{2}s_{X}^{2}\right)$$.
$$Var\left(\bar{y}_{L}\right)$$ will take a minimum value $$MinVar\left(\bar{y}_{L}\right) = \frac{1-f}{n}S_{Y}^{2}\left(1-\rho_{YX}^{2}\right)$$ if $$b$$ is chosen as $$b_{0}=\rho_{YX}\left(S_{Y}/S_{X}\right) = S_{YX}/S_{X}^{2}$$, so that irrespective of any relationship between $$Y$$ and $$X$$ in the population, $$\bar{y} + \rho_{YX}\frac{S_{Y}}{S_{X}}\left(\bar{X}-\bar{x}\right)$$ is the most efficient estimator of $$\bar{Y}$$ in the form of $$\bar{y}_{L}$$. However, $$b_{0}$$ will not be known. So if there is no basis for an a priori assignment of a value for $$b$$ we will need to estimate $$b_{0}$$; usually we would use the sample analogue $$\tilde{b} = \frac{s_{YX}}{s_{X}^{2}} = \frac{\sum_{1}^{n}\left(y_{i}-\bar{y}\right)\left(x_{i}-\bar{x}\right)}{\sum_{1}^{n}\left(x_{i}-\bar{x}\right)^{2}}$$. So the linear regression estimator of $$\bar{Y}$$ is now $$\bar{y}_{L} = \bar{y} + \tilde{b}\left(\bar{X}-\bar{x}\right)$$. Its distributional properties are difficult to determine but it is found to be asymptotically unbiased with approximate variance $$\frac{1-f}{n}S_{Y}^{2}\left(1-\rho^{2}_{YX}\right)$$ (estimated by $$s^{2}\left(\bar{y}_{L}\right) = \frac{1-f}{n}\left(s_{Y}^{2}-\tilde{b}s_{YX}\right)$$), so that having to estimate $$b$$ is no disadvantage in large samples.
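The regression estimator with an estimated slope can be sketched as follows. The function name `regression_estimator` and the data are hypothetical; the slope is the sample analogue $$s_{YX}/s_{X}^{2}$$ and the variance estimate is the large-sample expression $$\frac{1-f}{n}\left(s_{Y}^{2}-\tilde{b}s_{YX}\right)$$.

```python
def regression_estimator(ys, xs, N, X_bar):
    """Linear regression estimate of the population mean of Y.

    Returns (y_bar_L, b_tilde, estimated variance of y_bar_L).
    X_bar is the known population mean of the auxiliary variable.
    """
    n = len(ys)
    f = n / N
    y_bar = sum(ys) / n
    x_bar = sum(xs) / n
    # Sample variances and covariance with divisor n - 1.
    s_yy = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    s_xx = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    s_yx = sum((y - y_bar) * (x - x_bar) for y, x in zip(ys, xs)) / (n - 1)
    b = s_yx / s_xx                        # estimated slope b-tilde
    y_bar_L = y_bar + b * (X_bar - x_bar)  # adjust y-bar for the x shortfall
    var_hat = (1 - f) / n * (s_yy - b * s_yx)
    return y_bar_L, b, var_hat


# Hypothetical sample of 4 units from a population of N = 40 with X_bar = 3.
ys = [5.0, 8.0, 9.0, 12.0]
xs = [1.0, 2.0, 3.0, 4.0]
y_bar_L, b, var_hat = regression_estimator(ys, xs, N=40, X_bar=3.0)
print(y_bar_L, b, var_hat)
```

With these figures the slope is about 2.2 and, because the sample mean of $$x$$ (2.5) falls short of $$\bar{X} = 3$$, the estimate moves up from $$\bar{y} = 8.5$$ to about 9.6.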

Clearly $$\bar{y}_{L}$$ can be no less efficient than $$\bar{y}$$ and since $$Var\left(\bar{y}_{R}\right) - Var\left(\bar{y}_{L}\right) \approx\frac{1-f}{n}\left(RS_{X}-\rho_{YX}S_{Y}\right)^{2}$$ it must be at least as efficient (asymptotically) as the ratio estimator with equality of variance only if $$R = \rho_{YX}\frac{S_{Y}}{S_{X}}$$.
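To see the ordering of the three large-sample variances numerically, the sketch below evaluates them for a fully known population. The population values and the function name `approx_variances` are invented for illustration; the data are chosen to be roughly linear with a non-zero intercept, the situation in which the regression estimator is expected to do better than the ratio estimator.

```python
import math


def approx_variances(Y, X, n):
    """Large-sample variances of y-bar, y-bar_R and y-bar_L.

    Y, X : complete population values (known here only for illustration)
    n    : sr sample size
    """
    N = len(Y)
    f = n / N
    Y_bar = sum(Y) / N
    X_bar = sum(X) / N
    S_yy = sum((y - Y_bar) ** 2 for y in Y) / (N - 1)
    S_xx = sum((x - X_bar) ** 2 for x in X) / (N - 1)
    S_yx = sum((y - Y_bar) * (x - X_bar) for y, x in zip(Y, X)) / (N - 1)
    R = Y_bar / X_bar
    rho = S_yx / math.sqrt(S_yy * S_xx)
    c = (1 - f) / n
    v_srs = c * S_yy                                        # sr sample mean
    v_ratio = c * (S_yy - 2 * R * S_yx + R ** 2 * S_xx)     # ratio estimator
    v_reg = c * S_yy * (1 - rho ** 2)                       # regression estimator
    return v_srs, v_ratio, v_reg


# Hypothetical population, roughly Y = 20 + 3X (not through the origin).
X = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 12.0, 15.0]
Y = [27.0, 31.0, 36.0, 40.0, 45.0, 49.0, 57.0, 64.0]
v_srs, v_ratio, v_reg = approx_variances(Y, X, n=4)
print(f"sr: {v_srs:.4f}  ratio: {v_ratio:.4f}  regression: {v_reg:.4f}")
```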

Summary of the use of ratio and regression estimators

Ratio and regression estimators are useful in estimating $$\bar{Y}$$ (or $$Y_{T}$$) when there is an auxiliary variable $$X$$ (with known population mean $$\bar{X}$$) also sampled. If $$Y$$ and $$X$$ bear some reasonable degree of linear relationship then we obtain a useful increase in efficiency over $$\bar{y}$$ (or $$y_{T}$$) by using $$\bar{y}_{L}$$ (or $$y_{TL}$$); if the relationship is one of rough proportionality we expect similar benefits, but for somewhat less computational effort, from ratio estimators.
