Guide to Statistics: "Sampling for Surveys"


4 Stratified sampling

Sometimes a finite population divides naturally into non-overlapping, exhaustive sub-populations, or strata; in a school survey, for example, the different local education authorities make up distinct strata. There can be an administrative advantage in taking separate simple random (sr) samples of prescribed size from each stratum (a stratified sr sample) rather than taking an overall sr sample from the whole population.

If we have $$k$$ strata of sizes $$N_{i}$$ $$\left(i = 1, 2, \ldots, k\right)$$, with $$N = \sum_{1}^{k}N_{i}$$, we can estimate the population mean $$\bar{Y}$$ by the stratified sr sample mean $$\bar{y}_{st} = \sum_{1}^{k}W_{i}\bar{y}_{i}$$, where $$W_{i} = N_{i}/N$$ and $$\bar{y}_{i}$$ is the sample mean of the sr sample of size $$n_{i}$$ chosen from the $$i^{th}$$ stratum. The overall sample size is $$n = \sum_{1}^{k}n_{i}$$. We note that $$\bar{Y} = \sum_{1}^{k}W_{i}\bar{Y}_{i}$$ and that $$S^{2} = \frac{1}{N-1}\left[\sum_{1}^{k}\left(N_{i}-1\right)S_{i}^{2} + \sum_{1}^{k}N_{i}\left(\bar{Y}_{i}-\bar{Y}\right)^{2}\right]$$, where $$\bar{Y}_{i}$$ and $$S_{i}^{2}$$ are the mean and variance of the $$i^{th}$$ stratum.

It is easily confirmed that $$\bar{y}_{st}$$ is unbiased for $$\bar{Y}$$ with $$Var\left(\bar{y}_{st}\right) = \sum_{1}^{k}W_{i}^{2}\left(1-f_{i}\right)S_{i}^{2}/n_{i}$$, where $$f_{i} = n_{i}/N_{i}$$ is the sampling fraction in the $$i^{th}$$ stratum. If $$f_{i}=f$$ for all $$i$$ (constant sampling fractions), we have what is called proportional allocation, in which case $$Var\left(\bar{y}_{st}\right) = \frac{1-f}{n}\sum_{1}^{k}W_{i}S_{i}^{2}$$. If the sampling fractions are negligible, we have $$Var\left(\bar{y}_{st}\right) = \sum_{1}^{k}W_{i}^{2}S_{i}^{2}/n_{i}$$. We estimate $$Var\left(\bar{y}_{st}\right)$$ using the sample analogues $$s_{i}^{2} = \frac{1}{n_{i}-1}\sum_{j=1}^{n_{i}}\left(y_{ij}-\bar{y}_{i}\right)^{2}$$ $$\left(i = 1, 2, \ldots, k\right)$$ in place of the typically unknown $$S_{i}^{2}$$. Analogous results hold for estimating the population total $$Y_{T}$$.
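
As a rough illustration of the estimator and its estimated variance, the following minimal sketch computes $$\bar{y}_{st}$$ and the sample-analogue estimate of $$Var\left(\bar{y}_{st}\right)$$ from per-stratum samples. The function name `stratified_mean` and the data are made up for the example; each stratum is assumed to supply at least two observations.

```python
def stratified_mean(strata_sizes, samples):
    """Return (y_st, estimated Var(y_st)) for a stratified sr sample.

    strata_sizes: the stratum sizes N_i.
    samples: one list of observations per stratum (each with n_i >= 2).
    """
    N = sum(strata_sizes)
    y_st = 0.0
    var_hat = 0.0
    for N_i, y in zip(strata_sizes, samples):
        n_i = len(y)
        W_i = N_i / N                  # stratum weight W_i = N_i / N
        f_i = n_i / N_i                # stratum sampling fraction
        ybar_i = sum(y) / n_i          # stratum sample mean
        # Sample variance s_i^2 with divisor n_i - 1.
        s2_i = sum((v - ybar_i) ** 2 for v in y) / (n_i - 1)
        y_st += W_i * ybar_i
        var_hat += W_i ** 2 * (1 - f_i) * s2_i / n_i
    return y_st, var_hat


# Illustrative (made-up) data: two strata of sizes 100 and 300.
y_st, var_hat = stratified_mean(
    [100, 300],
    [[2, 4, 6, 8], [10, 12, 14, 16, 18, 20]],
)
print(y_st, var_hat)
```
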
For proportions we have corresponding results. Define a derived variable $$X$$ which is 1 if the population member possesses the attribute of interest and 0 otherwise; then the population mean $$\bar{X} = P$$ is the proportion of population members with the attribute. So the stratified sr sample mean $$\bar{x}_{st} = \sum_{1}^{k}W_{i}\bar{x}_{i}$$ provides an unbiased estimator of $$P$$ in the form $$p_{st} = \sum_{1}^{k}W_{i}p_{i}$$, where $$p_{i}$$ is the sampled proportion in the $$i^{th}$$ stratum, which has population proportion $$P_{i}$$ $$\left(i = 1,2,\ldots, k\right)$$. Its variance is $$Var\left(p_{st}\right) = \sum_{1}^{k}W_{i}^{2}\left(1-f_{i}\right)P_{i}\left(1-P_{i}\right)/n_{i}$$ (ignoring terms in $$1/N_{i}$$), with unbiased estimate $$s^{2}\left(p_{st}\right) = \sum_{1}^{k}W_{i}^{2}\left(1-f_{i}\right)p_{i}\left(1-p_{i}\right)/\left(n_{i}-1\right)$$.
For proportional allocation $$Var\left(p_{st}\right) = \frac{1-f}{n}\sum_{1}^{k}W_{i}P_{i}\left(1-P_{i}\right)$$.
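
The proportion estimator can be sketched in the same way. The function name `stratified_proportion` and the counts are hypothetical; the inputs are the stratum sizes, the stratum sample sizes, and the number of sampled members in each stratum that possess the attribute.

```python
def stratified_proportion(strata_sizes, sample_sizes, successes):
    """Return (p_st, s^2(p_st)) for a stratified sr sample of a 0/1 attribute."""
    N = sum(strata_sizes)
    p_st = 0.0
    var_hat = 0.0
    for N_i, n_i, a_i in zip(strata_sizes, sample_sizes, successes):
        W_i = N_i / N          # stratum weight
        f_i = n_i / N_i        # stratum sampling fraction
        p_i = a_i / n_i        # sampled proportion in the stratum
        p_st += W_i * p_i
        # Unbiased variance estimate uses the divisor n_i - 1.
        var_hat += W_i ** 2 * (1 - f_i) * p_i * (1 - p_i) / (n_i - 1)
    return p_st, var_hat


# Illustrative (made-up) counts: 5 of 20 and 40 of 80 sampled members
# have the attribute, in strata of sizes 200 and 800.
p_st, var_hat = stratified_proportion([200, 800], [20, 80], [5, 40])
print(p_st, var_hat)
```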

Some key questions

We can compare the efficiencies of $$\bar{y}_{st}$$ and $$\bar{y}$$  by examining, firstly for proportional allocation, $$Var\left(\bar{y}\right) - Var\left(\bar{y}_{st}\right) \approx \frac{1-f}{n}\sum_{1}^{k}W_{i}\left(\bar{Y}_{i}-\bar{Y}\right)^{2}$$ (if the stratum sizes $$N_{i}$$ are large enough). But this is always non-negative, so that $$\bar{y}_{st}$$ must always be at least as efficient as $$\bar{y}$$.

More detailed investigation tempers this simple conclusion. Without the large-stratum approximation, we find that under proportional allocation the stratified sr sample mean $$\bar{y}_{st}$$ is more efficient than the sr sample mean $$\bar{y}$$ provided $$\sum_{1}^{k}N_{i}\left(\bar{Y}_{i}-\bar{Y}\right)^{2} > \frac{1}{N}\sum_{1}^{k}\left(N-N_{i}\right)S_{i}^{2}$$

i.e. if the variation between the stratum means is sufficiently large compared with the within-strata variation. The higher the variability in stratum means and the lower the accumulated within-stratum variability, the greater the advantage in using the stratified sr sample mean (or the corresponding estimators of population total or proportion).
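
A small numerical check can make the comparison concrete. The sketch below, with made-up stratum parameters and a hypothetical function name, builds $$S^{2}$$ from the decomposition given earlier, uses the standard sr result $$Var\left(\bar{y}\right) = \left(1-f\right)S^{2}/n$$, and compares it with $$Var\left(\bar{y}_{st}\right)$$ under proportional allocation, without any large-stratum approximation.

```python
def srs_and_stratified_variance(strata_sizes, strata_means, strata_vars, n):
    """Return (Var(y_bar), Var(y_st)) with proportional allocation, sample size n."""
    N = sum(strata_sizes)
    f = n / N
    W = [N_i / N for N_i in strata_sizes]
    Y_bar = sum(W_i * m for W_i, m in zip(W, strata_means))
    # Decomposition of the population variance S^2 into within- and
    # between-stratum components.
    within = sum((N_i - 1) * s2 for N_i, s2 in zip(strata_sizes, strata_vars))
    between = sum(N_i * (m - Y_bar) ** 2
                  for N_i, m in zip(strata_sizes, strata_means))
    S2 = (within + between) / (N - 1)
    var_srs = (1 - f) * S2 / n
    var_st = (1 - f) / n * sum(W_i * s2 for W_i, s2 in zip(W, strata_vars))
    return var_srs, var_st


# Well-separated stratum means: stratification gives a large gain.
print(srs_and_stratified_variance([50, 50], [10, 20], [4, 4], 10))
# Identical stratum means: the stratified mean is marginally less efficient.
print(srs_and_stratified_variance([50, 50], [15, 15], [4, 4], 10))
```

In the second call there is no between-stratum variation at all, so the small-sample correction works slightly against stratification, which is the sense in which the simple conclusion is tempered.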

How do we allocate the stratum sample sizes $$n_{i}\left(i = 1, 2,\ldots, k\right)$$?

There is a clear practical advantage in stratified sr sampling. With naturally defined strata it will usually be more economical and more convenient to sample separately from each stratum. We have now seen that it can also lead to more efficient estimators than those obtained from overall simple random sampling.

As far as allocation of stratum sample sizes is concerned, proportional allocation has intuitive appeal, is easy to operate and can lead to efficiency gains. But for more effort we might be able to do better by choosing the $$n_{i}\left(i = 1, 2,\ldots, k\right)$$ optimally, that is, to minimise $$Var\left(\bar{y}_{st}\right)$$ for given overall sample size or cost. Specifically, we can assume that the overall cost of taking the stratified sample is $$C = c_{0}+\sum_{1}^{k}c_{i}n_{i}$$, where $$c_{0}$$ is a fixed overhead and $$c_{i}$$ is the cost per observation in the $$i^{th}$$ stratum, and then choose the $$n_{i}$$ to minimise $$Var\left(\bar{y}_{st}\right)$$ for a prescribed fixed overall cost $$C$$. Appropriate constrained minimisation yields expressions for the stratum sample sizes $$n_{i}$$ and overall sample size $$n$$ for given $$c_{i}\left(i = 1, 2, \ldots, k\right)$$. For the special case where each observation costs the same amount, $$c$$, in each stratum we obtain the allocation $$n_{i}=nW_{i}S_{i}/\sum_{1}^{k}W_{i}S_{i}$$ with overall sample size $$n=\left(C-c_{0}\right)/c$$. This is known as Neyman allocation.

Alternatively, we can prescribe the value $$V$$ we need for $$Var\left(\bar{y}_{st}\right)$$ and choose the allocation to minimise the overall cost. For constant fixed sampling cost ($$c$$ per observation) we again obtain the Neyman allocation above, with overall sample size $$n = \left(\sum W_{i}S_{i}\right)^{2}/\left(V+\sum W_{i}S_{i}^{2}/N\right)$$.
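
The equal-cost case can be sketched as follows. The function names and all numbers are illustrative assumptions; the allocation is returned unrounded, and how to round to whole sample sizes is left open.

```python
def sample_size_from_budget(C, c0, c):
    """Overall sample size n = (C - c0) / c when every observation costs c."""
    return (C - c0) / c


def neyman_allocation(n, strata_sizes, strata_sds):
    """Return the (unrounded) Neyman allocation n_i = n W_i S_i / sum(W_i S_i)."""
    N = sum(strata_sizes)
    products = [(N_i / N) * S_i for N_i, S_i in zip(strata_sizes, strata_sds)]
    total = sum(products)
    return [n * p / total for p in products]


# Made-up budget: total cost 1200, overhead 200, cost 10 per observation.
n = sample_size_from_budget(1200, 200, 10)
# Made-up strata: sizes 100, 300, 600 with standard deviations 10, 5, 2.
allocation = neyman_allocation(n, [100, 300, 600], [10, 5, 2])
print(n, allocation)
```

Note that the smallest stratum receives a disproportionately large share of the sample here because it is the most variable.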

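We can also express our need for precision in terms of a sample size needed to yield a specified margin of error, $$d$$, and maximum probability of error, $$\alpha$$, in the form $$Pr\left(\left|\bar{y}_{st}-\bar{Y}\right|\geq d\right)\leq\alpha$$. If we assume that $$\bar{y}_{st}$$ is approximately normally distributed, this reduces to the case just discussed with $$V = \left(d/z_{\alpha}\right)^{2}$$, where $$z_{\alpha}$$ is the double-tailed $$\alpha$$-point of the standard normal distribution, i.e. $$Pr\left(\left|Z\right|\geq z_{\alpha}\right) = \alpha$$. Writing $$w_{i} = n_{i}/n$$ for the proportion of the sample allocated to the $$i^{th}$$ stratum, we obtain $$n = \sum\left(W_{i}^{2}S_{i}^{2}/w_{i}\right)/\left(V+\sum W_{i}S_{i}^{2}/N\right)$$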

giving as a first approximation to the required sample size $$n_{0} = \sum \left(W_{i}^{2}S_{i}^{2}/w_{i}\right)/V$$ or more accurately $$n = n_{0}\left(1+\sum W_{i}S_{i}^{2}/\left(NV\right)\right)^{-1}$$. For the special cases of proportional allocation and Neyman allocation we get, respectively,   $$n_{0}=\sum W_{i}S_{i}^{2}/V$$, $$n = n_{0}\left(1 + n_{0}/N\right)^{-1}$$ and $$n_{0}=\left(\sum W_{i}S_{i}\right)^{2}/V$$, $$n = n_{0}\left(1+\sum W_{i}S_{i}^{2}/\left(NV\right)\right)^{-1}$$.
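
These sample-size formulae can be tried out numerically. In the sketch below, `w` denotes the allocation proportions $$n_{i}/n$$, the function names are hypothetical, and the stratum variances are made up. The helper `target_variance` assumes the two-sided reading of $$z_{\alpha}$$, i.e. the upper $$\alpha/2$$ point of the standard normal distribution.

```python
from statistics import NormalDist


def target_variance(d, alpha):
    """V = (d / z)^2, taking z as the two-sided alpha point (an assumption)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (d / z) ** 2


def required_sample_size(W, S2, N, V, w):
    """Return (n0, n) for stratum weights W, variances S2, allocation proportions w."""
    n0 = sum(W_i ** 2 * s2 / w_i for W_i, s2, w_i in zip(W, S2, w)) / V
    correction = sum(W_i * s2 for W_i, s2 in zip(W, S2)) / (N * V)
    return n0, n0 / (1 + correction)


W = [0.5, 0.5]
S2 = [16.0, 4.0]           # made-up stratum variances
N = 1000
V = target_variance(0.5, 0.05)

# Proportional allocation: w_i = W_i.
print(required_sample_size(W, S2, N, V, W))

# Neyman allocation: w_i proportional to W_i * S_i.
ws = [W_i * s2 ** 0.5 for W_i, s2 in zip(W, S2)]
w_neyman = [x / sum(ws) for x in ws]
print(required_sample_size(W, S2, N, V, w_neyman))
```

With unequal stratum variances, as here, the Neyman allocation meets the same precision target with a smaller overall sample than proportional allocation.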

Is optimal allocation always noticeably more efficient than the convenient proportional allocation?

Proportional allocation, of course, does not require knowledge of the stratum variances or relative sampling costs. The simple answer is that the advantage of optimal allocation (specifically Neyman allocation) is greater the more the stratum variances differ from one another.
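
To see the effect numerically, the following sketch plugs the proportional and Neyman allocations into the general formula $$Var\left(\bar{y}_{st}\right) = \sum W_{i}^{2}\left(1-f_{i}\right)S_{i}^{2}/n_{i}$$ for a fixed overall sample size. Function names and population values are made up for the illustration, and allocations are left unrounded.

```python
from math import sqrt


def stratified_variance(strata_sizes, strata_vars, allocation):
    """Var(y_st) = sum W_i^2 (1 - n_i/N_i) S_i^2 / n_i for a given allocation."""
    N = sum(strata_sizes)
    return sum(
        (N_i / N) ** 2 * (1 - n_i / N_i) * s2 / n_i
        for N_i, s2, n_i in zip(strata_sizes, strata_vars, allocation)
    )


def proportional(n, strata_sizes):
    """Proportional allocation n_i = n W_i."""
    N = sum(strata_sizes)
    return [n * N_i / N for N_i in strata_sizes]


def neyman(n, strata_sizes, strata_vars):
    """Neyman allocation n_i proportional to N_i S_i (equivalently W_i S_i)."""
    products = [N_i * sqrt(s2) for N_i, s2 in zip(strata_sizes, strata_vars)]
    total = sum(products)
    return [n * p / total for p in products]


sizes, n = [500, 500], 100

# Very different stratum variances: Neyman allocation helps noticeably.
variances = [1.0, 81.0]
print(stratified_variance(sizes, variances, proportional(n, sizes)))
print(stratified_variance(sizes, variances, neyman(n, sizes, variances)))

# Equal stratum variances: the two allocations coincide.
variances = [25.0, 25.0]
print(stratified_variance(sizes, variances, proportional(n, sizes)))
print(stratified_variance(sizes, variances, neyman(n, sizes, variances)))
```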

We must recognise that much of the above discussion of stratified sampling implicitly assumes that we know the stratum sizes and the stratum variances. Often this is not so, particularly for the stratum variances. If we have to estimate these from the survey data, or assign ‘reasonable values’ (say, from previous experience), the above results may not be reliable and far more complex methods will need to be employed. These are not pursued in this brief review.

Quota Sampling 

Often we will want to exploit many (crossed) factors of stratification, e.g. age ranges, locations, types of individual etc., and complex methods of sampling for multi-factor stratification must be used. One form of such stratified sampling is called quota sampling, in which proportional allocation is used with respect to the various crossed factors and samplers seek to fill the ‘quotas’ implied for the various allocations. This is the method used predominantly in commercial surveys on topics such as government, politics and commerce, in opinion polls and so forth. The practical difficulties of conducting such sampling can lead to some lack of representativeness or randomness in the resulting samples. Non-response from some selected individuals further complicates the sampling scheme, and conventional results for stratified sr sampling schemes (single- or multi-factor) may at best be approximations to the actual, but often unassessable, statistical properties of the quota sampling method employed.
