The aim of this chapter is to give a short introduction to medical statistics. Understanding and interpreting statistics is essential for practising evidence-based medicine. The discipline of statistics extracts numerical information from a high-quality dataset for description, for analysis and for outcome prediction. A high-quality dataset requires appropriate data collection (unbiased, representative samples) and data handling (data cleaning and arranging). The population being observed is typically very large, so we collect observations (sample units, a subset) from the statistical population and inherently assume that the final sample represents the whole population. Statistical research design (e.g. number of subjects, length of the study, data precision) ensures that the study conditions are optimal with respect to the required statistical power, accuracy (deviation between the expected value and the true value) and precision (related to noise and uncertainty).
A clear understanding of the data types in our sample is essential for the correct selection of statistical procedures and the appropriate interpretation of our results. Data can be categorized as qualitative (nominal or ordinal) or quantitative (numerical).
Qualitative data describe observations with words rather than numbers. When the categories have no inherent order, we speak of Nominal data.
If the categories can be ordered, we use the term Ordinal data. Visual classification of a set of medical images into five quality grades from high to low, rating the feeling of pain, or the stage of a cancer are examples of this data type. In this case the order of the values is important, but the difference between them remains meaningless. The central tendency of an ordinal dataset is measured by the median, although in many cases the mean of ordinal data leads to acceptable results as well.
In the case of Quantitative or Numerical data (discrete or continuous, interval or ratio) we know both the order of the values and the exact differences between them. Weight, height, body mass index and blood pressure are representative examples of this data type, as is pixel intensity. We can add, subtract, multiply and divide these values. This data type allows the use of most descriptive methods, such as percentiles, median, mode, mean and standard deviation (the average distance to the mean).
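As a minimal illustration of these descriptive measures, the sketch below computes them with NumPy and SciPy for a small set of hypothetical systolic blood pressure readings (the values are invented for the example):

```python
import numpy as np
from scipy import stats

# Hypothetical systolic blood pressure readings (mmHg) -- invented example data
bp = np.array([118, 122, 130, 125, 140, 118, 135, 128, 121, 133])

print("mean:   ", np.mean(bp))
print("median: ", np.median(bp))
print("mode:   ", stats.mode(bp).mode)
print("std:    ", np.std(bp, ddof=1))            # sample standard deviation
print("25th/75th percentiles:", np.percentile(bp, [25, 75]))
```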
Visual screening of the dataset is extremely useful for data exploration, checking assumptions and reporting findings. Nominal data can only be displayed with pie or bar charts, but ordinal data, and even more so numerical data, are suitable for more informative plots such as histograms, scatter plots and boxplots. A Box-and-Whiskers plot depicts groups of numerical data through their quartiles and also indicates variability outside the upper and lower quartiles.
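A minimal plotting sketch, assuming matplotlib is available, drawing a histogram and a box-and-whiskers plot of the same kind of simulated measurements:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(loc=127, scale=8, size=200)   # simulated measurements

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=20)
ax1.set_title("Histogram")
ax2.boxplot(values)                               # median, quartiles, whiskers, outliers
ax2.set_title("Box-and-Whiskers plot")
plt.tight_layout()
plt.show()
```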
Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed dataset is sampled from a larger population.
The majority of statistical analyses in medicine involve comparison, most obviously between treatments or procedures or between groups of subjects. The numerical value corresponding to the comparison of interest is often called the effect. The null hypothesis (H0) states that the effect is zero; it is usually the negation of the research hypothesis that generated the data. We also have an alternative hypothesis (H1), which is usually that the effect of interest is not zero. The difference is deemed statistically significant if the observed data would be an unlikely realization of the null hypothesis according to a threshold probability, the significance level. The significance level, denoted by α, is the probability of rejecting the null hypothesis given that the null hypothesis is true. The statistical test results in a p-value, the probability of obtaining a result at least as extreme as the one observed, assuming the null hypothesis is true. If the p-value is smaller than the significance level α, the null hypothesis is rejected. The value of α is typically 0.05 (5%): if, under the assumption of zero effect, data as extreme as those observed would occur with a probability of less than 5%, we reject the null hypothesis and accept the alternative hypothesis.
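As an illustration of this decision rule, the sketch below runs a two-sample Student's t-test with SciPy on two hypothetical groups of measurements and compares the resulting p-value with α = 0.05 (the data and group labels are invented for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treatment = rng.normal(loc=5.5, scale=1.0, size=30)  # simulated outcome, treated group
control   = rng.normal(loc=5.0, scale=1.0, size=30)  # simulated outcome, control group

alpha = 0.05
t_stat, p_value = stats.ttest_ind(treatment, control)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < alpha:
    print("p < alpha: reject H0, the difference is statistically significant")
else:
    print("p >= alpha: fail to reject H0")
```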
Thought experiment: a teacher flips a coin repeatedly and reports the result (heads or tails) after each flip, without showing us the coin. We simply record his statements. Our task is to determine whether the teacher is telling us the real results of the coin flips or is misleading us. We get heads first, which is not surprising, since it has a probability of 50%. Getting the same result for the second, third, fourth and fifth time has a probability of 25%, 12.5%, 6.25% and 3.125%, respectively. Judging from the reported results alone, after the fifth identical report we become fairly confident that the teacher is not telling us the true results. The probability p = 3.125% is below the typical 5% significance level, so at the fifth flip the probability of obtaining five identical results in consecutive flips drops below the significance threshold.
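The probabilities in this thought experiment follow directly from 0.5 raised to the number of flips; a minimal sketch:

```python
# Probability of reporting the same side n times in a row with a fair coin
alpha = 0.05
for n in range(1, 7):
    p = 0.5 ** n
    flag = "below" if p < alpha else "above"
    print(f"{n} identical flips: p = {p:.5f} ({flag} the {alpha:.0%} significance level)")
```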
However, even in the previous experiment, 3.125% is a low-probability but possible outcome. If the teacher always reported the real value (the null hypothesis) but we rejected his truthfulness (accepted the alternative hypothesis), we made a type 1 error (rejecting a null hypothesis that was correct) and obtained a false positive result. If we accept the null hypothesis when it is false, we commit a type 2 error and obtain a false negative result.
The probability of a type 1 error is equal to the significance level (α); the probability of a type 2 error (β) is related to the power of the statistical test (which equals 1−β). The two error rates are traded off against each other: for a given sample set, the effort to reduce one type of error generally increases the other. Type 1 errors can be interpreted as positive-class prediction errors (false positives), and type 2 errors as negative-class prediction errors (false negatives).
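A small Monte Carlo sketch, assuming normally distributed data and a two-sample t-test, can illustrate this trade-off: lowering α reduces the type 1 error rate but raises the type 2 error rate for the same sample size (all settings are invented for the example):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_sim, true_effect = 20, 5000, 0.5   # per-group size, simulations, effect under H1

for alpha in (0.05, 0.01):
    type1 = type2 = 0
    for _ in range(n_sim):
        a = rng.normal(0, 1, n)
        b = rng.normal(0, 1, n)                 # H0 true: no effect
        if stats.ttest_ind(a, b).pvalue < alpha:
            type1 += 1                          # false positive
        c = rng.normal(true_effect, 1, n)       # H1 true: effect present
        d = rng.normal(0, 1, n)
        if stats.ttest_ind(c, d).pvalue >= alpha:
            type2 += 1                          # false negative
    print(f"alpha={alpha}: type 1 rate ~ {type1/n_sim:.3f}, type 2 rate ~ {type2/n_sim:.3f}")
```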
The power of a hypothesis test is the probability that it rejects the null hypothesis (H0) when a specific alternative hypothesis (H1) is true. Statistical power ranges from 0 to 1, and its value depends on the magnitude of the effect, the sample size and the statistical significance criterion used in the test.
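As a sketch of how these quantities interact, the snippet below uses statsmodels (assuming it is installed) to solve for the per-group sample size needed to reach 80% power for a medium standardized effect size at α = 0.05; the chosen numbers are illustrative only:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solve for the per-group sample size of an unpaired t-test
n_per_group = analysis.solve_power(effect_size=0.5,   # standardized (Cohen's d) effect
                                    alpha=0.05,
                                    power=0.80,
                                    alternative='two-sided')
print(f"required sample size per group: {n_per_group:.1f}")
```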
It is important to note that even if a statistical test yields a statistically significant result, this does not necessarily mean it has biological relevance. A biologically relevant effect can be defined as an effect considered by expert judgement to be important and meaningful for human, animal, plant or environmental health. It therefore implies a change that may alter how decisions for a specific problem are taken [60,61].
Having clarified the points above, it is much easier to select the appropriate statistical test with the help of available test-selection charts. Some of the most commonly used tests are presented below:
| Study's factors \ Variable to be explained | Nominal (2 groups) | Ordinal (2 groups) | Quantitative |
|---|---|---|---|
| Qualitative, unpaired | 2 proportion Z-test; Chi squared test; Fisher's exact test | Cochran-Armitage test | Student's T-test |
| Qualitative, paired | McNemar's test; Fisher's exact test | Wilcoxon signed rank test | Wilcoxon signed rank test; Student's T-test for paired values |
| Quantitative | Logistic regression | Spearman correlation | Pearson correlation |
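For example, a 2×2 contingency table comparing a nominal outcome between two unpaired groups can be analysed with the chi squared test or Fisher's exact test in SciPy; the counts below are invented purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table: rows = treatment/control, columns = improved/not improved
table = np.array([[30, 10],
                  [18, 22]])

chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
odds_ratio, p_fisher = stats.fisher_exact(table)

print(f"Chi squared test:    chi2 = {chi2:.2f}, p = {p_chi2:.4f}")
print(f"Fisher's exact test: OR = {odds_ratio:.2f}, p = {p_fisher:.4f}")
```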
Sensitivity and Specificity
Specificity and sensitivity are statistical measures of the performance of a binary classification test (a diagnostic test with a positive or negative result). Sensitivity (true positive rate, TPR) is the proportion of positive cases correctly detected among all positive cases. Specificity (true negative rate, TNR) is the proportion of correctly identified negative (healthy) cases among all negative cases. Sensitivity therefore quantifies the avoidance of false negatives, and specificity does the same for false positives. Specificity and sensitivity are prevalence-independent test characteristics.
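In terms of confusion-matrix counts, a minimal sketch (with invented counts) looks like this:

```python
# Hypothetical confusion-matrix counts for a diagnostic test
TP, FN = 45, 5     # diseased patients: detected / missed
TN, FP = 90, 10    # healthy patients: correctly cleared / falsely flagged

sensitivity = TP / (TP + FN)   # true positive rate
specificity = TN / (TN + FP)   # true negative rate
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```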
For any diagnostic test there is a trade-off between these two quantities. This trade-off can be represented graphically using a receiver operating characteristic curve.
The receiver operating characteristic (ROC) curve is a graphical plot that represents the diagnostic ability of a binary classifier system as its discrimination threshold is varied. Plotting the true positive rate (sensitivity) against the false positive rate (1-specificity) for every threshold yields the ROC curve.
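A minimal sketch of this threshold sweep, assuming scikit-learn and matplotlib are available and using simulated classifier scores:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(3)
# Simulated classifier scores: diseased cases score higher on average
y_true = np.concatenate([np.ones(100), np.zeros(100)])
scores = np.concatenate([rng.normal(1.0, 1.0, 100), rng.normal(0.0, 1.0, 100)])

fpr, tpr, thresholds = roc_curve(y_true, scores)
print(f"area under the ROC curve: {auc(fpr, tpr):.2f}")

plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle="--")   # chance line
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.title("ROC curve")
plt.show()
```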