Chapter 9

Principles of Statistics

About Statistics

The aim of this chapter is to give a brief insight into medical statistics. Understanding and interpreting statistics is essential for practising evidence-based medicine. The discipline of statistics aims to extract numerical information from a high-quality dataset for description, for analysis and for predicting outcomes. A high-quality dataset requires appropriate data collection (unbiased, representative samples) and data handling (data cleaning and arranging). The population being observed is typically very large, so we collect observations (sample units, i.e. a subset) from the statistical population and assume that the final sample represents the whole population. Statistical research design (e.g. number of subjects, length of the study, data precision) ensures optimal study conditions with regard to the required statistical power, accuracy (the deviation between the expected value and the true value) and precision (related to noise and uncertainty).

Descriptive Statistics

Understanding the Data

A clear understanding of the data types in our sample is essential for selecting the correct statistical procedures and for interpreting the results appropriately. Data can be categorized as qualitative (nominal or ordinal) or quantitative (numerical).

Qualitative data describe observations in words:

Nominal (or categorical) data usually represent a label or name (e.g. colour, gender, profession), which allows the samples to be sorted into distinct groups without any obvious ordering of the categories. These labels can sometimes be represented with numbers (e.g. 1=positive, 2=negative), but only as symbols without mathematical content. Categorical data can be summarized as counts or as contingency tables (cross tabulations), which display the frequency distribution. For nominal data, frequencies, proportions and percentages are the relevant parameters for describing the dataset.
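As a minimal sketch of how nominal data might be summarized, the following example counts categories and builds a small cross tabulation. The data, the group labels and the function name frequency_table are hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical test results and study groups for ten subjects (illustrative only).
results = ["positive", "negative", "negative", "positive", "negative",
           "negative", "negative", "positive", "negative", "negative"]
groups = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]


def frequency_table(labels):
    """Return {category: (count, proportion, percentage)} for nominal labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cat: (n, n / total, 100.0 * n / total) for cat, n in counts.items()}


table = frequency_table(results)
for category, (count, proportion, percent) in sorted(table.items()):
    print(f"{category}: n={count}, proportion={proportion:.2f}, {percent:.0f}%")

# Contingency table (cross tabulation): counts for each (group, result) pair.
cross = Counter(zip(groups, results))
for key in sorted(cross):
    print(key, cross[key])
```

Note that the numeric codes sometimes used for such labels would not enter any arithmetic here; only the counts do.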

If the categories can be ordered, we speak of ordinal data. Visual classification of a set of medical images into five ranges from high to low quality, rating the feeling of pain, and the stage of cancer are some examples of this data type. In this case the order of the values is important, but the difference between them is not meaningful. The central tendency of an ordinal dataset is measured by the median, although in practice the mean of ordinal data is often reported as well and can lead to appropriate results.

For Quantitative or Numerical data (discrete or continuous, interval or ratio) we know both the order and the exact differences between values. Weight, height, body mass index and blood pressure are some representative examples of this data type, as is pixel intensity. We can add and subtract these values, and for ratio data (which have a true zero) we can also multiply and divide them. This data type allows the widest range of descriptive measures, such as percentiles, median, mode, mean and standard deviation (the root-mean-square distance of the values from the mean).
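The descriptive measures listed above can be sketched with the Python standard library. The blood pressure readings are hypothetical, and the quartiles use the default method of statistics.quantiles, which is only one of several quartile conventions.

```python
import statistics

# Hypothetical systolic blood pressure readings in mmHg (illustrative only).
values = [118, 122, 125, 130, 130, 135, 141, 150]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)

# Quartiles: the three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(values, n=4)

print(f"mean={mean}, median={median}, mode={mode}, sd={sd:.2f}")
print(f"quartiles: Q1={q1}, Q2={q2}, Q3={q3}")
```

The quartiles computed here are the same quantities that a boxplot displays.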

Statistical Graphics

Visual screening of the dataset is extremely useful for exploring data, checking assumptions and reporting findings. Nominal data can be displayed only with pie or bar charts, but ordinal data, and even more so numerical data, are suitable for more informative plots such as histograms, scatter plots and boxplots. A box-and-whiskers plot depicts groups of numerical data through their quartiles and also indicates the variability outside the upper and lower quartiles.

Statistical Inference

Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed dataset is sampled from a larger population.

Statistical hypothesis testing

The majority of statistical analyses in medicine involve comparison, most obviously between treatments or procedures or between groups of subjects. The numerical value corresponding to the comparison of interest is often called the effect. The null hypothesis (H₀) states that the effect is zero, which is usually the negation of the research hypothesis that motivated the study. We also have an alternative hypothesis (H₁), which is usually that the effect of interest is not zero. The difference is deemed statistically significant if the observed data would be an unlikely realization of the null hypothesis according to a threshold probability, the significance level. The significance level, denoted by α, is the probability that the study rejects the null hypothesis given that the null hypothesis is true. The statistical test results in a p-value, which is the probability of obtaining results at least as extreme as those observed, assuming that the null hypothesis is true; it is not the probability that the null hypothesis itself is true. If the p-value is smaller than the significance level α, the null hypothesis is rejected. The value of α is typically 0.05 (5%), which means that if data as extreme as ours would occur less than 5% of the time under a zero effect, we reject the null hypothesis in favour of the alternative hypothesis.

Thought experiment: a teacher flips a coin repeatedly and reports the result (heads or tails) to us after each flip, without showing us the coin. We just record his statements. Our task is to determine whether the teacher tells us the real result of each flip or is misleading us. The first report is heads, which is not surprising, since it has a 50% probability. Getting the same result for the second, third, fourth and fifth time as well has a probability of 25%, 12.5%, 6.25% and 3.125%, respectively. Having only the reported results, after the 5th identical report we become more confident that the teacher is not telling us the true results. The value p = 3.125% is below the typical 5% significance level, so at the 5th flip the probability of getting five identical results in a row crosses the significance threshold.
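The arithmetic of the thought experiment can be reproduced in a few lines. This is only a sketch under the assumptions of the experiment (a fair coin, independent flips, and a run of one pre-specified side); the function names are hypothetical.

```python
ALPHA = 0.05  # the typical significance level used in the text


def prob_identical_run(n_flips, p_side=0.5):
    """Probability that n_flips independent flips all show one pre-specified side."""
    return p_side ** n_flips


def first_significant_run(alpha=ALPHA, p_side=0.5):
    """Smallest run length whose probability under H0 falls below alpha."""
    n = 1
    while prob_identical_run(n, p_side) >= alpha:
        n += 1
    return n


for n in range(1, 6):
    p = prob_identical_run(n)
    print(f"{n} identical reports: p = {p:.5f}, below alpha: {p < ALPHA}")

print("run length needed to reject H0:", first_significant_run())
```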

Type 1 and Type 2 Error

However, even in the previous experiment the 3.125% outcome is a low-probability event, but still a possible one. If the teacher always reported the real value (the null hypothesis is true) but we rejected his truthfulness (accepted the alternative hypothesis), we made a type 1 error (rejecting a null hypothesis that was correct) and therefore obtained a false positive result. If we fail to reject the null hypothesis when it is false, we make a type 2 error and obtain a false negative result.

The probability of a type 1 error is equal to the significance level (α); the probability of a type 2 error (β) is related to the power of the statistical test (which equals 1-β). These two error rates are traded off against each other: for any given sample, the effort to reduce one type of error generally increases the other. A type 1 error can be interpreted as a positive class prediction error (false positive) and a type 2 error as a negative class prediction error (false negative).

The power of a hypothesis test is the probability that it rejects the null hypothesis (H₀), when a specific alternative hypothesis (H₁) is true. The statistical power ranges from 0 to 1, and its value depends on the magnitude of the effect, the sample size and the statistical significance criterion used in the test.
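To make α, β and power concrete, the coin thought experiment can be reused with an assumed decision rule: reject H₀ if all five reports are heads. The alternative reporting probabilities 0.9 and 0.99 are hypothetical values chosen only to illustrate how power depends on the magnitude of the effect.

```python
def rejection_probability(p_heads, n_flips=5):
    """Probability of rejecting H0 under the rule 'all n_flips reports are heads',
    when each report is heads with probability p_heads (independent reports)."""
    return p_heads ** n_flips


# H0 true (honest reports of a fair coin): rejecting is a type 1 error.
alpha = rejection_probability(0.5)

# A specific H1 (assumed): the teacher reports heads 90% of the time.
power = rejection_probability(0.9)
beta = 1.0 - power  # probability of a type 2 error under this H1

# A larger effect (heads reported 99% of the time) gives higher power.
power_large_effect = rejection_probability(0.99)

print(f"alpha = {alpha:.5f}")
print(f"power = {power:.5f}, beta = {beta:.5f}")
print(f"power for the larger effect = {power_large_effect:.5f}")
```

Under this rule, the same test that has a type 1 error rate of about 3% would still miss a teacher who reports heads 90% of the time in roughly 4 out of 10 experiments.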

It is important to note that even if a statistical test yields statistically significant results, this does not necessarily mean that the effect has biological relevance. A biologically relevant effect can be defined as an effect considered by expert judgement as important and meaningful for human, animal, plant or environmental health. It therefore implies a change that may alter how decisions for a specific problem are taken[60,61].

Once the points mentioned above are clear, it is much easier to select the appropriate statistical test procedure with the help of available statistics selection charts. Some of the most commonly used tests are presented below:

Study factors           Variable to be explained
                        Nominal (2 groups)      Ordinal (2 groups)          Quantitative
Qualitative, unpaired   2-proportion Z-test     Cochran-Armitage test       Student’s t-test
                        Chi-squared test
                        Fisher’s exact test
Qualitative, paired     McNemar’s test          Wilcoxon signed-rank test   Wilcoxon signed-rank test
                        Fisher’s exact test                                 Student’s t-test for paired values
Quantitative            Logistic regression     Spearman correlation        Pearson correlation
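As an illustration of one of the tests listed above, Fisher’s exact test for an unpaired 2x2 table can be sketched directly from the hypergeometric distribution. The counts and the function name fisher_exact_two_sided are hypothetical, and the two-sided p-value follows one common convention (summing the probabilities of all tables that are no more probable than the observed one); statistical packages may offer other conventions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over all tables that are at most as probable as the observed one
    # (small relative tolerance guards against floating-point ties).
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))


# Hypothetical counts: rows = treatment / control, columns = improved / not improved.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"p = {p_value:.4f}")
```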

Sensitivity and Specificity

Specificity and sensitivity are statistical measures of the performance of a binary classification test (a diagnostic test giving a positive or negative result). Sensitivity (true positive rate or TPR) is the proportion of correctly detected positive cases among all positive (diseased) cases. Specificity (true negative rate or TNR) is the proportion of correctly identified negative (healthy) cases among all negative cases. Sensitivity therefore quantifies the avoidance of false negatives, and specificity does the same for false positives. Specificity and sensitivity are prevalence-independent test characteristics.
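The two definitions translate directly into ratios of the four cells of a diagnostic 2x2 table. The counts below are hypothetical and the function name is an illustrative choice.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from the counts of a diagnostic test.

    tp: diseased and test positive    fn: diseased and test negative
    tn: healthy and test negative     fp: healthy and test positive
    """
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


# Hypothetical study: 50 diseased and 100 healthy subjects.
sens, spec = sensitivity_specificity(tp=45, fn=5, tn=80, fp=20)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Each measure is computed within one row of true disease status only, which is why neither depends on the ratio of diseased to healthy subjects in the sample.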

For any diagnostic test there is a trade-off between these two quantities. This trade-off can be represented graphically using a receiver operating characteristic curve.

Receiver Operating Characteristics

A receiver operating characteristic (ROC) curve is a graphical plot that represents the diagnostic ability of a binary classifier system as its discrimination threshold is varied. Plotting the true positive rate (sensitivity) against the false positive rate (1-specificity) yields the ROC curve.

Area Under Curve (AUC)

The ROC curve is a performance measure for a classification problem at various threshold settings. The ROC is a probability curve, and the AUC represents the degree or measure of separability: it tells how well the model is able to distinguish between classes. The higher the AUC, the better the model is at distinguishing between patients with and without disease[62,63].
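A minimal sketch of how a ROC curve and its AUC might be computed is given below. It assumes that a higher score means a more suspicious case and that labels are coded 1 (disease) and 0 (no disease); the scores, labels and function names are hypothetical, and the area is obtained with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points obtained by
    lowering the decision threshold step by step over the observed scores."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through the given (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical classifier scores and true disease status for eight patients.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
auc = auc_trapezoid(curve)
print("ROC points:", curve)
print("AUC:", auc)
```

An AUC of 0.5 corresponds to a classifier that cannot separate the two classes, and an AUC of 1 to perfect separation.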

Association

Correlation analysis is used to investigate the relationship between two variables within a group of subjects, with the following purposes:

to assess whether the two variables are associated
to enable the value of one variable to be predicted from any known value of the other variable
to assess the amount of agreement between the values of the two variables

The strength of a relationship is usually given by the “correlation coefficient”, denoted by r. A positive correlation means that as one variable increases, the value of the other also increases. A negative correlation coefficient means that as the value of one variable goes up, the value of the other goes down. A correlation coefficient equal to zero means that there is no linear correlation between the two variables. The following is a good rule of thumb when considering the magnitude of the correlation (for both negative and positive values):

|r| = [0-0.2]: very low or meaningless
|r| = [0.2-0.4]: weak or low correlation
|r| = [0.4-0.6]: moderate or reasonable correlation
|r| = [0.6-0.8]: strong or high correlation
|r| = [0.8-1]: very high correlation (perfect at |r| = 1)

The most frequently used correlation procedures are Pearson correlation and Spearman’s rank correlation; the Bland-Altman method is commonly used alongside them, although it assesses agreement between two measurements rather than correlation[64,65].
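The two correlation coefficients named above can be sketched in plain Python. Spearman’s coefficient is computed here as the Pearson coefficient of the ranks, with tied values receiving the mean of their ranks; the data and function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical data with a monotonic but non-linear relationship.
dose = [1, 2, 3, 4, 5]
response = [1, 4, 9, 16, 25]

r = pearson_r(dose, response)
rho = spearman_rho(dose, response)
print(f"Pearson r = {r:.4f}, Spearman rho = {rho:.4f}")
```

In this example the rank correlation reaches 1 because the relationship is perfectly monotonic, while Pearson’s r stays slightly below 1 because the relationship is not linear.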

Observer Variability

In a high percentage of clinical studies, judgements from medical experts are taken into consideration, which means that subjective differences between the experts may distort the outcome of the research. Two types of variability are assessed to judge the significance of this possible distortion: intra-observer variability and inter-observer variability. Intra-observer variability measures the stability of an individual’s observations at two or more time points. Inter-observer variability, on the other hand, is the degree of agreement among the experts. For the quality of the data provided by the experts, it is essential to gather accurate information consistently. Training, experience and researcher objectivity can therefore improve intra- and inter-observer reliability and efficiency.

Statistical procedures to measure observer variability are Cohen’s Kappa, intra-class correlation, and the Bland-Altman method[66].
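As an example of one of these procedures, Cohen’s Kappa for two observers can be sketched as the observed agreement corrected for the agreement expected by chance. The ratings and the function name are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who classified the same cases."""
    n = len(rater_a)
    # Proportion of cases on which the two raters agree.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Agreement expected by chance, from each rater's category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical image-quality ratings of ten images by two observers.
observer_1 = ["good", "good", "good", "good", "good",
              "good", "poor", "poor", "poor", "poor"]
observer_2 = ["good", "good", "good", "good", "good",
              "poor", "poor", "poor", "poor", "good"]

kappa = cohens_kappa(observer_1, observer_2)
print(f"kappa = {kappa:.3f}")
```

Here the observers agree on 8 of 10 images, but because about half of that agreement would be expected by chance, the kappa value is noticeably lower than 0.8.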

Survival Analysis

Survival analysis is a set of statistical methods for analysing the expected duration of time until one or more events happen, such as death or disease recurrence in clinical studies. One of the most common sources of such data is recording the time from some fixed starting point (e.g. diagnosis) to the event (death of the subject, or release from the hospital). In the dataset we mark some of the data points as “censored” to indicate that the period of observation was cut off before the event of interest occurred, or that information on a case is known only for a limited duration.

Survival curves are most frequently estimated using the Kaplan-Meier method. Comparison between survival curves most frequently relies on the logrank test or the Wald test.
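A minimal sketch of the Kaplan-Meier estimate is shown below. At each distinct event time the survival probability is multiplied by the fraction of at-risk subjects who did not experience the event; censored subjects leave the risk set without changing the estimate. The follow-up data and the function name are hypothetical, and subjects censored at an event time are assumed to be still at risk at that time.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times:  follow-up duration of each subject
    events: 1 if the event was observed, 0 if the observation was censored
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Collect every subject whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_leaving
    return curve


# Hypothetical follow-up times in months; 0 marks a censored observation.
times = [2, 3, 3, 5, 8, 8, 12]
events = [1, 1, 0, 1, 0, 1, 0]

for t, s in kaplan_meier(times, events):
    print(f"t = {t}: S(t) = {s:.3f}")
```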

Choosing the right statistical test

The following conditions determine the appropriate statistical test: the research question (for example exploring a difference between groups, comparing relative frequencies, or assessing association between variables), the data type in our dataset (nominal, ordinal, numerical), the study design (paired or unpaired data), the number of groups (2 or >2) and the mathematical properties of our data (e.g. normality, equality of two variances). It is important to note that the calculated p-value can be distorted or misleading if the data do not adequately conform to the underlying mathematical requirements. Furthermore, "significance" is merely an inference derived from principles of mathematical probability, not an evaluation of the substantive importance of the "big" or "small" magnitude of the observed distinction[67].