Guide to Statistics: "Sampling for Surveys"

2 Random Sampling

In order to assess the statistical properties of inferences drawn from a sample, we need to draw the sample $$y_{1}, y_{2} , \ldots, y_{n}$$  according to a probability sampling scheme.

Simple Random Sampling

The simplest form is simple random (sr) sampling, where observations are drawn successively without replacement from the population $$Y_{1}, Y_{2}, \ldots, Y_{N}$$ in such a way that, at each draw, every remaining population member is equally likely to be drawn.
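A minimal sketch of drawing such a sample, assuming draws are made without replacement (the case to which the factor $$1 - f$$ in the variance formulas of this section applies). The function name `draw_sr_sample` and the population values are hypothetical and serve only as illustration.

```python
import random

def draw_sr_sample(population, n, seed=None):
    """Draw n members without replacement; every subset of size n is equally likely."""
    rng = random.Random(seed)          # a seed makes the draw reproducible
    return rng.sample(population, n)   # standard-library sampling without replacement

# Hypothetical population Y_1, ..., Y_N with N = 10
population = [12, 7, 9, 15, 4, 10, 8, 11, 6, 13]
sample = draw_sr_sample(population, n=4, seed=1)
```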

Sampling fraction and finite population correction

The sampling fraction is the ratio $$f = n/N$$; the finite population correction is $$1 - f$$.

We will want to estimate some population characteristic $$\theta$$ (e.g. the population total $$Y_{T}$$) by some function $$\tilde{\theta}(S)$$ of the sample $$S$$. The properties of the estimator (or statistic) $$\tilde{\theta}$$ will be assessed from its sampling distribution, i.e. the probability distribution of the different values $$\tilde{\theta}$$ may take as a result of employing the probability sampling scheme (e.g. sr sampling).

Thus if $$E\left(\tilde{\theta}\right) = \theta$$ we say $$\tilde{\theta}$$ is unbiased, a property we normally require in sample survey work (we are seldom prepared to ‘trade bias for precision’).

For unbiased estimators we have $$Var\left(\tilde{\theta}\right) = E\left(\tilde{\theta}-\theta\right)^2$$ as the variance of $$\tilde{\theta}$$ which provides an inverse measure of precision: the lower $$Var\left(\tilde{\theta}\right)$$ the more precise the estimator $$\tilde{\theta}$$.

The broad aim is to choose a probability sampling scheme which is easy to use and yields unbiased estimators which effectively minimize the effects of sampling fluctuations (i.e. are as precise as possible).

With sr sampling, all samples $$y_{1}, y_{2} , \ldots, y_{n}$$ are equally likely to arise and we estimate the population mean, $$\bar{Y}$$, by the sample mean $$\bar{y} = \left(\sum_{1}^{n}y_{i}\right)/n$$. This is easily seen to be unbiased $$\left(E\left(\bar{y}\right) = \bar{Y}\right)$$, to have variance $$Var\left(\bar{y}\right) = \left(1 - f\right)S^{2}/n$$, where $$S^{2} = \sum_{1}^{N}\left(Y_{i}-\bar{Y}\right)^{2}/\left(N-1\right)$$ is defined as the population variance, and to be the best (minimum variance) linear unbiased estimator based on a sr sample.

We also need to estimate $$S^{2}$$ and use the unbiased estimator $$s^{2} = \sum_{1}^{n}\left(y_{i}-\bar{y}\right)^{2}/\left(n-1\right)$$, the sample variance, which helps to:
 
1) assess the precision of $$\bar{y}$$;
2) compare $$\bar{y}$$ with other estimators;
3) determine sample size $$n$$ needed to achieve desired precision.
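The estimators above can be sketched in a few lines. This is an illustrative example only: the function name `sr_mean_estimate` and the data are hypothetical, and the estimated variance is obtained by substituting $$s^{2}$$ for $$S^{2}$$ in the formula for $$Var\left(\bar{y}\right)$$.

```python
from statistics import mean, variance

def sr_mean_estimate(sample, N):
    """Return (y_bar, s2, estimated Var(y_bar)) for a sr sample from a population of size N."""
    n = len(sample)
    f = n / N                      # sampling fraction
    y_bar = mean(sample)           # sample mean
    s2 = variance(sample)          # sample variance, divisor n - 1
    var_y_bar = (1 - f) * s2 / n   # plug s2 into (1 - f) S^2 / n
    return y_bar, s2, var_y_bar

# Hypothetical sample of n = 4 from a population of N = 10
y_bar, s2, var_y_bar = sr_mean_estimate([12, 9, 15, 8], N=10)
```

For this sample the sum of squared deviations about $$\bar{y} = 11$$ is 30, so $$s^{2} = 10$$ and the estimated variance of $$\bar{y}$$ is $$0.6 \times 10 / 4 = 1.5$$.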

Thus $$s^{2}\left(\bar{y}\right) = \left(1-f\right)s^{2}/n$$ is unbiased for $$Var\left(\bar{y}\right)$$, and for large enough $$n$$ we can assume that $$\bar{y}$$ is approximately normally distributed, written $$\bar{y} \sim N\left(\bar{Y}, \left(1-f\right)S^{2}/n\right)$$. This yields an approximate $$100\left(1-\alpha\right)\%$$ symmetric two-sided confidence interval for $$\bar{Y}$$ as $$\bar{y} - z_{\alpha}s\sqrt{\left(1-f\right)/n} \leq \bar{Y} \leq \bar{y} + z_{\alpha}s\sqrt{\left(1-f\right)/n}$$, where $$z_{\alpha}$$ is the double-tailed $$\alpha$$-point for $$N\left(0, 1\right)$$, i.e. $$P\left(\left|Z\right| > z_{\alpha}\right) = \alpha$$ for $$Z \sim N\left(0, 1\right)$$.

To choose a sample size $$n$$ to yield required precision, e.g. with $$P\left(\left|\bar{Y} - \bar{y}\right| > d\right) \leq \alpha$$ for prescribed values of $$d$$ and $$\alpha$$, we need $$n \geq N/\left(1+N\left(d/\left(z_{\alpha} S\right)\right)^{2}\right)$$. Equivalently, specifying $$Var\left(\bar{y}\right) \leq \left(d/z_{\alpha}\right)^{2} = V$$, say, this becomes $$n \geq \left(S^{2}/V\right)\left[1+S^{2}/\left(NV\right)\right]^{-1}\approx S^{2}/V$$ if $$S^{2}/\left(NV\right)$$ is small. Typically we do not know $$S^{2}$$ and need to estimate it, sometimes rather informally, from pilot studies, previous surveys or a preliminary sample.
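As a hedged illustration of the interval and the sample-size rule, the sketch below uses the standard library's `statistics.NormalDist` to obtain $$z_{\alpha}$$ as the upper $$\alpha/2$$ point of $$N\left(0, 1\right)$$. The function names and numerical inputs are hypothetical, and the value supplied for $$S^{2}$$ is assumed to come from a pilot study or similar source.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_two_sided(alpha):
    """Double-tailed alpha-point of N(0, 1): P(|Z| > z) = alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def mean_confidence_interval(y_bar, s2, n, N, alpha=0.05):
    """Approximate 100(1 - alpha)% interval: y_bar +/- z * s * sqrt((1 - f) / n)."""
    half_width = z_two_sided(alpha) * sqrt((1 - n / N) * s2 / n)
    return y_bar - half_width, y_bar + half_width

def sample_size_for_mean(S2, N, d, alpha=0.05):
    """Smallest integer n with (1 - n/N) * S2 / n <= V, where V = (d / z)^2."""
    V = (d / z_two_sided(alpha)) ** 2
    n0 = S2 / V                      # first approximation, ignoring the population size
    return ceil(n0 / (1 + n0 / N))   # corrected for the finite population

# Hypothetical inputs: sample summary from n = 4 of N = 10, and a planning guess S2 = 100
lower, upper = mean_confidence_interval(y_bar=11, s2=10, n=4, N=10)
n_required = sample_size_for_mean(S2=100, N=1000, d=2)
```

With so small a sample as $$n = 4$$ the normal approximation would not be trusted in practice; the numbers only show how the formulas are applied.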

Systematic Sampling

With a complete list of population members, a simple sampling method is to choose sample members at regular intervals throughout the list to obtain the required sample size, $$n$$. This is not strictly sr sampling (nor a probability sampling scheme) but can be effective if there is no relationship between population value and order on the list.
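A minimal sketch of selection at regular intervals is given below. It rests on assumptions not stated above: the interval is taken as the integer part of $$N/n$$, and the starting position is chosen at random within the first interval. The function name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(items, n, seed=None):
    """Take every k-th member of the list, k = N // n, from a random start (an assumption)."""
    N = len(items)
    k = N // n                                  # sampling interval
    if k < 1:
        raise ValueError("sample size n must not exceed the list length N")
    start = random.Random(seed).randrange(k)    # start position in 0, ..., k - 1
    return [items[start + j * k] for j in range(n)]

# Hypothetical list of N = 100 members, sample of n = 10
chosen = systematic_sample(list(range(100)), n=10, seed=0)
```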

Estimating $$Y_{T}$$ 

An immediate estimate of $$Y_{T}$$  is given by $$y_{T} = N\bar{y}$$  which is unbiased with $$Var\left(y_{T}\right) = N^{2}\left(1-f\right)S^{2}/n$$ and all properties (unbiasedness, minimum variance, confidence intervals, required sample size etc) transfer immediately from those of $$\bar{y}$$.
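The transfer from mean to total amounts to multiplying the estimate by $$N$$ and its variance by $$N^{2}$$, as the following sketch shows. The function name `sr_total_estimate` and the data are hypothetical, and the variance is estimated by substituting $$s^{2}$$ for $$S^{2}$$.

```python
from statistics import mean, variance

def sr_total_estimate(sample, N):
    """Return (y_T, estimated Var(y_T)) for a sr sample from a population of size N."""
    n = len(sample)
    f = n / N
    y_T = N * mean(sample)                          # estimate of the population total
    var_y_T = N ** 2 * (1 - f) * variance(sample) / n
    return y_T, var_y_T

# Hypothetical sample of n = 4 from a population of N = 10
y_T, var_y_T = sr_total_estimate([12, 9, 15, 8], N=10)
```

Here $$\bar{y} = 11$$ and $$s^{2} = 10$$, so $$y_{T} = 110$$ with estimated variance $$100 \times 0.6 \times 10 / 4 = 150$$.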

With sr sampling we sample with equal probabilities. Non-equal probability (non-epsem) schemes are also important – e.g. sampling with probability proportional to size (pps) with the Hansen-Hurwitz and Horvitz-Thompson estimators.

Estimating a proportion $$P$$

Let $$P$$ be the proportion of population members with some quality $$A$$. For each population member define $$X_{i} = 1$$ if $$Y_{i}$$ has quality $$A$$ and $$X_{i} = 0$$ otherwise. Then clearly $$P = \sum_{1}^{N}X_{i}/N = \bar{X}$$ and we are again concerned with estimating a population mean (now for the derived $$X$$-variable). The only difference now is that the population variance depends on its mean, as $$S^{2}_{X} = NP\left(1-P\right)/\left(N-1\right)$$, and the previous inferences have to be modified to reflect this.

Thus we estimate $$P$$ by the sample proportion $$p = \bar{x} = \sum_{1}^{n} x_{i}/n$$, which is unbiased with minimum variance $$Var\left(p\right) = \frac{\left(N-n\right)}{\left(N-1\right)}P\left(1-P\right)/n$$, for which $$s^{2}\left(p\right) = \left(1-f\right)p\left(1-p\right)/\left(n-1\right)$$ is an unbiased estimator. An approximate $$100\left(1-\alpha\right)\%$$ two-sided confidence interval for $$P$$ is now given as the region between the two roots of a quadratic equation, which for large $$n$$ simplifies to
$$ p \pm z_{\alpha}\sqrt{\left(1-f\right)p\left(1-p\right)/\left(n-1\right)}$$.
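The large-sample interval can be sketched as follows. The function name `proportion_interval` and the counts are hypothetical, and $$z_{\alpha}$$ is taken as the upper $$\alpha/2$$ point of $$N\left(0, 1\right)$$ from the standard library's `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(count, n, N, alpha=0.05):
    """Return (p, lower, upper) using p +/- z * sqrt((1 - f) p (1 - p) / (n - 1))."""
    p = count / n                                # sample proportion
    f = n / N                                    # sampling fraction
    var_p = (1 - f) * p * (1 - p) / (n - 1)      # unbiased estimate of Var(p)
    z = NormalDist().inv_cdf(1 - alpha / 2)      # double-tailed alpha-point
    half_width = z * sqrt(var_p)
    return p, p - half_width, p + half_width

# Hypothetical survey: 60 of n = 200 sampled members have quality A, N = 5000
p, lower, upper = proportion_interval(count=60, n=200, N=5000)
```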

Choice of sample size is now more complex, depending on whether we want absolute or relative accuracy, represented as
$$P\left(\left|p - P\right|>d\right) \leq \alpha$$ or $$P\left(\left|p - P\right|>\xi P\right) \leq \alpha$$, respectively. The first (absolute) form requires $$n \geq N\left[1+\frac{\left(N-1\right)}{P\left(1-P\right)}\left(\frac{d}{z_{\alpha}}\right)^{2}\right]^{-1} = \frac{P\left(1-P\right)}{V}\left[1+\frac{1}{N}\left(\frac{P\left(1-P\right)}{V}-1\right)\right]^{-1}$$ if we put $$V = \left(d/z_{\alpha}\right)^{2}$$. So as a first approximation we have $$n_{0} = P\left(1-P\right)/V$$, or more accurately $$n = n_{0}\left[1 +\left(n_{0}-1\right)/N\right]^{-1}$$. Corresponding results are readily obtained for the relative case.
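For the absolute-accuracy case, the two-step calculation ($$n_{0}$$, then the correction for $$N$$) can be sketched as below. The function names are hypothetical, and the value of $$P$$ passed in is a planning guess that must be supplied from outside the survey; $$P = 0.5$$ is used here only because it maximises $$P\left(1-P\right)$$.

```python
from math import ceil
from statistics import NormalDist

def var_of_p(P, n, N):
    """Var(p) = (N - n) / (N - 1) * P (1 - P) / n under sr sampling."""
    return (N - n) / (N - 1) * P * (1 - P) / n

def sample_size_for_proportion(P, N, d, alpha=0.05):
    """Smallest integer n with Var(p) <= V, where V = (d / z)^2."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # double-tailed alpha-point
    V = (d / z) ** 2
    n0 = P * (1 - P) / V                      # first approximation
    return ceil(n0 / (1 + (n0 - 1) / N))      # corrected for the finite population

# Hypothetical plan: P guessed as 0.5, N = 2000, absolute error d = 0.05 at alpha = 0.05
n_required = sample_size_for_proportion(P=0.5, N=2000, d=0.05)
```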
