2 What are data?
How data are recorded, analysed and displayed will depend upon their type. Thus, in order to apply statistical methods we need to be able to classify the data that we collect into specific types. One simple classification is shown in the table below:
Table 2: a simple classification of data types
Data type | Sub-type | Example
--- | --- | ---
Numerical | Measurable | Blood pressure; height; weight
Numerical | Count | Number of visits to general practitioner in a year; parity
Categorical | Binary (two categories) | Gender; presence/absence of disease
Categorical | Nominal (no natural ordering) | Ethnic group; area of residence
Categorical | Ordinal (ordered) | Dukes' stages in cancer development; frequency of symptoms (never, rarely, sometimes, often, always)
Descriptive statistics for samples of size n
For numerical data:
Sample mean: the sum of the observations divided by the sample size $$(n)$$.
Sample variance: the sum of the squared differences between each observation and the sample mean, divided by the sample size minus one, i.e. $$n-1$$. The divisor $$(n-1)$$ is called the degrees of freedom (df).
Sample standard deviation ($$s$$): the square root of the sample variance. The standard deviation is useful because it measures the spread of the data in the same units as the mean, whereas the variance is expressed in squared units.
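The three definitions above can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended implementation; the function names and the `heights_cm` values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def sample_mean(xs):
    """Sum of the observations divided by the sample size n."""
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Sum of squared differences from the mean, divided by n - 1."""
    n = len(xs)
    if n < 2:
        raise ValueError("the sample variance needs at least two observations")
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (n - 1)


def sample_sd(xs):
    """Square root of the sample variance, in the same units as the data."""
    return math.sqrt(sample_variance(xs))


# Hypothetical heights in centimetres, for illustration only.
heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]

print("mean:", sample_mean(heights_cm))          # in cm
print("variance:", sample_variance(heights_cm))  # in cm squared
print("sd:", sample_sd(heights_cm))              # in cm
```

For these five values the mean is 170 cm, the squared differences sum to 250, and dividing by the 4 degrees of freedom gives a variance of 62.5. Python's standard `statistics.variance` and `statistics.stdev` functions also use the $$n-1$$ divisor.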
If the sample data are ordered from smallest to largest then the:
Minimum (Min) is the smallest value;
Lower quartile (LQ) is the $$\frac{1}{4}(n+1)^{th}$$ value; it is the value below which the lowest 25% of data values lie. N.b. 75% of data values lie above it;
Median (Med) is the middle or the $$\frac{1}{2}(n+1)^{th}$$ value; it is the value that divides the data in half, so that 50% of data values lie below it and 50% lie above it;
Upper quartile (UQ) is the $$\frac{3}{4}(n+1)^{th}$$ value; it is the value below which 75% of data values lie. N.b. 25% of data values lie above it;
Maximum (Max) is the largest value.
These five values constitute a five-number summary of the data. They can be represented diagrammatically by a box-and-whisker plot, commonly called a boxplot. Note that the distance between the lower and upper quartiles is known as the interquartile range (IQR) and represents the region within which the middle 50% of the data lie. The distance between the minimum and maximum is known as the range. N.b. It is good practice to include the sample size.
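As a rough sketch, the five-number summary, IQR and range can be computed from the $$(n+1)$$ positions given above. The text does not say what to do when a position such as $$\frac{1}{4}(n+1)$$ is not a whole number; the code below assumes linear interpolation between the two neighbouring ordered values, which is one common convention (statistical software may use others). The function names are hypothetical.

```python
def ranked_value(sorted_xs, position):
    """Value at a 1-based, possibly fractional, position in ordered data.

    Assumption: fractional positions are resolved by linear interpolation
    between neighbouring values; positions outside 1..n are clamped.
    """
    n = len(sorted_xs)
    position = min(max(position, 1), n)
    lower = int(position)        # whole part of the position (1-based)
    frac = position - lower      # fractional part, 0 <= frac < 1
    if lower >= n:
        return sorted_xs[n - 1]
    below = sorted_xs[lower - 1]
    above = sorted_xs[lower]
    return below + frac * (above - below)


def five_number_summary(xs):
    """Min, LQ, median, UQ and max, plus the sample size, IQR and range."""
    if not xs:
        raise ValueError("need at least one observation")
    s = sorted(xs)
    n = len(s)
    summary = {
        "n": n,
        "min": s[0],
        "lq": ranked_value(s, (n + 1) / 4),
        "median": ranked_value(s, (n + 1) / 2),
        "uq": ranked_value(s, 3 * (n + 1) / 4),
        "max": s[-1],
    }
    summary["iqr"] = summary["uq"] - summary["lq"]
    summary["range"] = summary["max"] - summary["min"]
    return summary


# Hypothetical data with n = 7, so the positions 2, 4 and 6 are whole numbers.
example = five_number_summary([3, 1, 7, 5, 9, 11, 13])
print(example)
```

With these seven values the ordered data are 1, 3, 5, 7, 9, 11, 13, so the lower quartile is the 2nd value (3), the median is the 4th (7) and the upper quartile is the 6th (11), giving an IQR of 8 and a range of 12.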
Figure 2: Boxplots for the heights of 200 randomly chosen men and women
We summarise numerical data that are symmetrically distributed using the mean and standard deviation, and numerical data that are not symmetrically distributed around the mean (skewed) using the median and interquartile range.
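One way to see why the median is preferred for skewed data is to compare the two summaries on a small example. The sketch below uses two made-up samples that differ only in their largest value; the variable names and numbers are hypothetical.

```python
import statistics

# Two hypothetical samples: identical except for the largest observation.
symmetric = [2, 4, 6, 8, 10]
skewed = [2, 4, 6, 8, 100]

symmetric_mean = statistics.mean(symmetric)
symmetric_median = statistics.median(symmetric)
skewed_mean = statistics.mean(skewed)
skewed_median = statistics.median(skewed)

# For the symmetric sample the mean and median agree.
print("symmetric:", symmetric_mean, symmetric_median)

# For the skewed sample one extreme value pulls the mean upwards,
# while the median stays at the middle ordered value.
print("skewed:", skewed_mean, skewed_median)
```

In the symmetric sample both summaries equal 6. In the skewed sample the mean rises to 24 while the median remains 6, so the median gives a better description of a typical value.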
For categorical data:
We summarise categorical data by calculating the proportion of observations in each category. Categorical data can be displayed using a bar chart.
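A minimal sketch of the proportion calculation, using only the Python standard library, is shown below. The function name and the `symptom_frequency` responses are hypothetical and serve only as an illustration.

```python
from collections import Counter


def category_proportions(values):
    """Proportion of observations falling in each category."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one observation")
    return {category: count / total for category, count in counts.items()}


# Hypothetical ordinal responses from eight people, for illustration only.
symptom_frequency = [
    "never", "rarely", "never", "often",
    "sometimes", "never", "rarely", "always",
]

proportions = category_proportions(symptom_frequency)
for category, proportion in proportions.items():
    print(f"{category}: {proportion:.3f}")
```

Here 3 of the 8 responses are "never", giving a proportion of 0.375, and the proportions across all categories sum to one. These proportions are the bar heights that a bar chart would display.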