What Statistical Tests Should I Perform?


Too often, I see tables in ML papers where authors claim that their method is “significantly” better than the rest of the methods without performing any statistical tests. I feel that such claims should be backed up by statistical evidence rather than by the magnitude of the reported metrics alone. Even when error intervals are reported, the top methods tend to lie within one standard deviation of each other, making it hard to conclude which method is actually better. Therefore, in this post, I intend to cover a few recurring scenarios when presenting results in ML papers and the kinds of tests that could be performed in each. Choosing the right test is tricky, but it is worth the effort, as results backed by statistical evidence are more credible. This post is inspired by my own struggles in deciding what tests to run and by how I wished there was a guide for non-stats folks to refer to when it’s time to put results into the paper.

Here’s a flowchart of the most common tests, organized by the type of the outcome variable followed by the series of tests to consider.

A flowchart for choosing the right statistical test.

Definitions

Outcome (dependent) variable types

  • Continuous variables 📏: Think measuring tape. Continuous variables represent values that can be measured on a continuous scale. Their values are not restricted to separate, distinct numbers. For example:
    • Performance metrics such as the F1-score (whose values lie between 0 and 1) or the MSE.
    • Features such as a person’s height, average temperature, etc.
  • Categorical variables 🗂️: Think groups or labeled buckets. Categorical variables only take on a limited set of values, assigning each data point to a particular group. There are two types of categorical variables: (i) Nominal: categories have no intrinsic order (e.g. car, bus, plane), and (ii) Ordinal: categories have a meaningful order (e.g. rating something as good, okay, or bad, or levels of difficulty). A few examples of categorical variables:
    • Model prediction as Spam/Not spam, or, classification of a bacterium as Resistant/Susceptible against a specific antibiotic.
    • Features such as grading of a tumor on a scale of 1-4 (note: this is ordinal), or, a person’s blood type (A, B, AB, O; note: this is nominal).
  • Survival variables ⏳: Think an hourglass. A survival variable is a unique type of data that has two parts: (i) a measure of time until a specific event occurs, and (ii) an indicator of whether the event has occurred or if the observation period ended before the event could happen (this is called censoring). A few examples:
    • Time until a patient relapses after treatment, with some patients still being disease-free at the end of the study (censored data).
    • Time until a machine fails, with some machines still operational at the end of the observation period (censored data).

Exposure (independent) variable types

  • Groups: They refer to the different methods/models we want to compare. For example, if we have three models A, B, and C, then we have three “groups”.

  • Samples: They refer to the number of observations we have for each group. For example, (i) each set of predictions from models A, B, and C is a sample (so three models, three sets of samples), (ii) if we have five different runs of model A, then we have five samples for model A, and so on.

  • Paired samples: They refer to the samples that are related to each other. For example, (i) if we have three models A, B, and C and we evaluate them on the same test set, then the samples from these models are “paired” because they are related to each other through the same test set, (ii) if we have a model that we run five times with different random seeds and evaluate them on the same test set, then the samples from these five runs are paired samples.

  • Independent samples: They refer to the samples that are not related to each other. For example, say we have a large test dataset and we create non-overlapping sub-test sets from it. Then, evaluating models A, B, and C on these different sub-test sets will give us independent samples.

  • Parametric: This class of tests assumes that the data follow a normal distribution. Normality of the data can be tested using D’Agostino and Pearson’s test, the Shapiro-Wilk test, etc. If the p-value from these tests is greater than the significance level (typically 0.05), meaning that the null hypothesis ($H_0$; stating that the data is normally distributed) is not rejected, then we can assume that the data is normally distributed and we can use parametric tests.

  • Non-parametric: This class of tests does not assume that the data follow a normal distribution. Non-parametric tests are typically used when the data are not normally distributed or when the sample size is small (typically fewer than 30 per group).

Errors in hypothesis testing

When performing statistical tests, we typically start with a null hypothesis ($H_0$) and an alternative hypothesis ($H_a$). The null hypothesis usually states that there is no effect or no difference between the groups, while the alternative hypothesis states that there is an effect or a difference. Based on the results of the statistical test, we either reject the null hypothesis or fail to reject it. However, there are two types of errors that can occur in this process:

  • Type I error (False Positive): It occurs when we reject the null hypothesis ($H_0$) when it is actually true. In simpler terms, it’s like a false alarm where we think there is an effect or difference when there isn’t one (e.g. concluding that a drug is effective when it’s not). The significance level (alpha, typically set at 0.05) represents the probability of making a Type I error. For example, if we set alpha to 0.05, it means that we are willing to accept a 5% chance of incorrectly rejecting the null hypothesis.

  • Type II error (False Negative): It occurs when we fail to reject the null hypothesis ($H_0$) when it is actually false. In simpler terms, it’s like missing a real effect or difference that actually exists (e.g. failing to detect that a drug is working). The probability of making a Type II error is denoted by beta (β). The power of a test, which is calculated as (1 - β), represents the probability of correctly rejecting the null hypothesis when it is false. A higher power means a lower chance of making a Type II error. Typically, researchers aim for a power of 0.8 or higher, meaning there is an 80% chance of correctly detecting an effect if it exists.
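
To make the power target concrete, here is a minimal sketch using statsmodels that estimates how many samples per group a two-sample t-test would need; the effect size (Cohen’s d = 0.5), alpha, and power are assumed, illustrative values:

from statsmodels.stats.power import TTestIndPower
# Assumed, illustrative inputs: medium effect size (Cohen's d = 0.5),
# alpha = 0.05, and a target power of 0.8
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"Samples needed per group: {n_per_group:.1f}")  # roughly 64 per group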

Choosing the right test

When it comes to presenting the results, a recurring scenario is the following:

We have the method/model that we have proposed and we want to compare its performance with the rest of the methods/models from the literature. We also have a fixed test set on which we evaluate our models.

Now, let’s take a bottom-up approach to arrive at the right test(s) to perform. Note that the test set is fixed, so we have already reduced the search space for the class of tests to “2 groups” or “>2 groups” with “paired” samples (since the models are evaluated on the same test set). If we have two models (model A vs. model B), then we use either the paired Student’s t-test (parametric) or the Wilcoxon signed-rank test (non-parametric) if the variable we want to compare is continuous (e.g. accuracy, F1-score, etc.), and McNemar’s test if the variable is categorical (e.g. model predictions as Spam/Not spam). If we’re comparing more than two models (i.e. ours and two or more others), then we use either repeated measures ANOVA (parametric) or the Friedman test (non-parametric) if the variable we want to compare is continuous, and Cochran’s Q test if the variable is categorical. The next natural question is: should we use parametric or non-parametric tests?
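
Before turning to that question, here is a minimal sketch of the two-model case (the per-run F1-scores and the 2x2 table for McNemar’s test are made-up, illustrative numbers):

import numpy as np
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar
# Illustrative per-run F1-scores for two models evaluated on the same test set
model_a_f1 = np.array([0.85, 0.87, 0.86, 0.88, 0.84])
model_b_f1 = np.array([0.80, 0.84, 0.82, 0.87, 0.78])
# Continuous outcome, 2 groups, paired samples
t_stat, p_param = stats.ttest_rel(model_a_f1, model_b_f1)      # parametric
w_stat, p_nonparam = stats.wilcoxon(model_a_f1, model_b_f1)    # non-parametric
print(f"Paired t-test p={p_param:.4f}, Wilcoxon signed-rank p={p_nonparam:.4f}")
# Categorical outcome (correct/incorrect per test example), 2 groups, paired samples.
# 2x2 table of counts over the same test set:
#                   B correct   B wrong
#   A correct           50          8
#   A wrong              3         39
table = np.array([[50, 8], [3, 39]])
print(mcnemar(table, exact=True))  # prints McNemar's statistic and p-value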

Parametric or Non-parametric?

When comparing continuous variables, the choice between parametric and non-parametric tests depends on the distribution of the data. To ground it in ML terms, the “data” here refers to the set of performance metrics on the test set obtained from the models we want to compare. To know how the data are distributed, we typically run normality tests from the scipy.stats package. Here’s a short code example of how to do it:

import numpy as np
from scipy import stats
# Example data: performance metrics from two models (e.g. five runs each)
model_a_metrics = np.array([0.85, 0.87, 0.86, 0.88, 0.84])
model_b_metrics = np.array([0.80, 0.82, 0.81, 0.83, 0.79])
# Test for normality
# NOTE: D'Agostino and Pearson's test (stats.normaltest) requires at least 8
# observations, so with only 5 values per model we use the Shapiro-Wilk test.
stat_a, p_value_a = stats.shapiro(model_a_metrics)
stat_b, p_value_b = stats.shapiro(model_b_metrics)
# Significance level
# NOTE: The null hypothesis (H0) is that the data is normally distributed.
alpha = 0.05
# Check if the data is normally distributed
is_a_normal = p_value_a > alpha
is_b_normal = p_value_b > alpha
print(f"Model A normality: {is_a_normal}, p-value: {p_value_a:.3f}")
print(f"Model B normality: {is_b_normal}, p-value: {p_value_b:.3f}")

# CHECK
# If p > alpha, we fail to reject the null hypothesis --> data is normally distributed --> parametric
# If p <= alpha, we reject the null hypothesis --> data is not normally distributed --> non-parametric

Based on the normality tests, we now know whether to choose parametric or non-parametric tests. Continuing the example from the scenario mentioned above, say there are more than 2 groups (i.e. we’re comparing more than 2 methods) and the test indicates that there is a significant difference between the groups. How do we know which groups are different from each other? This is where post-hoc tests come into play.

Post-hoc tests

Post-hoc is a Latin term that means “after this”. Post-hoc tests are additional tests performed after an initial statistical test (like ANOVA or Friedman test) to determine which specific groups are different from each other. They help to identify where the differences lie when the initial test indicates that there is a significant difference among the groups. More concretely, say we have three models A, B, and C and we perform a Friedman test (non-parametric) to compare their performance on a fixed test set. The test gives a p-value of 0.02 (which is significant!). But, we don’t know which pairs of models are significantly different from each other (i.e. A vs. B, A vs. C, B vs. C). Therefore, we perform post-hoc tests, which involve multiple pairwise comparisons between the groups to identify which specific pairs of groups have significant differences in their means or distributions. Some common post-hoc tests include:

  • Tukey’s HSD (Honestly Significant Difference) test: used after ANOVA to find means that are significantly different from each other.
  • Bonferroni correction: adjusts the significance level when multiple pairwise tests are performed to control the overall Type I error rate.
  • Dunn’s test: non-parametric post-hoc test used after the Kruskal-Wallis test or Friedman test.
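
To obtain the omnibus p-value in the example above, here is a minimal sketch of the Friedman test with scipy (reusing the same illustrative metrics as in the Dunn’s test example below):

import numpy as np
from scipy import stats
# Illustrative performance metrics from three models on the same runs/folds
model_a_metrics = np.array([0.85, 0.87, 0.86, 0.88, 0.84])
model_b_metrics = np.array([0.80, 0.82, 0.81, 0.83, 0.79])
model_c_metrics = np.array([0.78, 0.77, 0.79, 0.76, 0.80])
# Friedman test: continuous outcome, >2 groups, paired samples
stat, p_value = stats.friedmanchisquare(model_a_metrics, model_b_metrics, model_c_metrics)
print(f"Friedman test: statistic={stat:.3f}, p-value={p_value:.4f}")
# Only if p_value < alpha do we move on to a post-hoc test such as Dunn's test below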

Here’s a short code example on performing Dunn’s post-hoc test:

import numpy as np
import pandas as pd
import scikit_posthocs as sp
# Example data: performance metrics from three models
model_a_metrics = np.array([0.85, 0.87, 0.86, 0.88, 0.84])
model_b_metrics = np.array([0.80, 0.82, 0.81, 0.83, 0.79])
model_c_metrics = np.array([0.78, 0.77, 0.79, 0.76, 0.80])
# Combine the data into a single array
data = np.concatenate([model_a_metrics, model_b_metrics, model_c_metrics])
# Create a group labels array
groups = (['A'] * len(model_a_metrics) +
          ['B'] * len(model_b_metrics) +
          ['C'] * len(model_c_metrics))
# Create a DataFrame for the data
df = pd.DataFrame({'Model': groups, 'Accuracy': data})
# Perform Dunn's test with Holm-Bonferroni correction
dunn_results = sp.posthoc_dunn(df, val_col='Accuracy', group_col='Model', p_adjust='holm')
print(dunn_results)

Note that post-hoc tests are only performed when the initial test indicates that there is a significant difference between the groups.

Okay, we now have an understanding of how to go about choosing the right test. Let’s look at a few examples now.

Examples

Example 1: Comparing accuracy between five models on a fixed test set

Say you have five models (A, B, C, D, E) and you want to compare their accuracy on a fixed test set. The workflow would be as follows:

  1. Since we’re comparing accuracy, the outcome variable is continuous.
  2. We have five models, so we have more than two groups.
  3. The samples are paired since the models are evaluated on the same test set.
  4. Check the accuracy values of each of the five models for normality (e.g. the accuracies across runs or cross-validation folds). If all five models’ accuracies are normally distributed, we can use parametric tests. If at least one of them is not normally distributed, we use non-parametric tests.
  5. Choose the appropriate test. So, we have:
    • continuous, >2 groups, paired, non-parametric: Friedman test
    • continuous, >2 groups, paired, parametric: Repeated measures ANOVA
  6. If the test indicates that there is a significant difference between the groups, perform a post-hoc test. For parametric, use Tukey’s HSD test; for non-parametric, use Dunn’s test. A sketch of the parametric route follows this list.
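
Repeated measures ANOVA is not available in scipy, but here is a minimal sketch of the parametric route using statsmodels’ AnovaRM (the per-fold accuracies are made-up, illustrative numbers; each fold acts as the “subject” because every model is scored on the same folds):

import pandas as pd
from statsmodels.stats.anova import AnovaRM
# Illustrative per-fold accuracies for five models evaluated on the same five folds
accuracies = {
    'A': [0.85, 0.87, 0.86, 0.88, 0.84],
    'B': [0.80, 0.82, 0.81, 0.83, 0.79],
    'C': [0.78, 0.77, 0.79, 0.76, 0.80],
    'D': [0.81, 0.83, 0.82, 0.84, 0.80],
    'E': [0.79, 0.78, 0.80, 0.77, 0.81],
}
records = [
    {'Model': model, 'Fold': fold, 'Accuracy': acc}
    for model, accs in accuracies.items()
    for fold, acc in enumerate(accs)
]
df = pd.DataFrame(records)
# Repeated measures ANOVA: 'Fold' is the subject because every model is
# evaluated on the same folds (paired samples)
anova = AnovaRM(df, depvar='Accuracy', subject='Fold', within=['Model']).fit()
print(anova.anova_table)
# If "Pr > F" is below alpha, follow up with a post-hoc test (e.g. Tukey's HSD
# or pairwise paired t-tests with a Holm/Bonferroni correction)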

Example 2: Comparing ages of patients in train and test sets

Say you are working with medical data and you have randomly split the data into train/test but you want to know if the age of the patients in train set is significantly different from that in the test set (to make a case for your model’s generalization across age groups). The workflow would be as follows:

  1. Identify the type of outcome variable: Age is a continuous variable.
  2. Identify the number of groups: We have two groups (train and test).
  3. Identify the type of samples: The samples are independent since the train and test sets are disjoint.
  4. Check for normality: Use D’Agostino and Pearson’s test or Shapiro-Wilk test to check if the age data in both groups is normally distributed.
  5. Choose the appropriate test (a short code sketch follows this list):
    • continuous, 2 groups, independent, non-parametric: Mann-Whitney U test
    • continuous, 2 groups, independent, parametric: Student’s t-test (two-sample t-test)
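
A minimal sketch of this workflow (the ages below are made-up, illustrative values):

import numpy as np
from scipy import stats
# Illustrative patient ages in the train and test splits
train_ages = np.array([34, 45, 52, 61, 29, 48, 55, 40, 63, 37, 50, 44])
test_ages = np.array([38, 47, 58, 33, 51, 42, 60, 36])
# Step 4: normality check on each group (Shapiro-Wilk)
normal_train = stats.shapiro(train_ages).pvalue > 0.05
normal_test = stats.shapiro(test_ages).pvalue > 0.05
# Step 5: choose the test based on normality
if normal_train and normal_test:
    stat, p_value = stats.ttest_ind(train_ages, test_ages)     # parametric
    test_name = "Two-sample t-test"
else:
    stat, p_value = stats.mannwhitneyu(train_ages, test_ages)  # non-parametric
    test_name = "Mann-Whitney U test"
print(f"{test_name}: p-value = {p_value:.3f}")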

Example 3: Comparing model predictions (categorical) between three models on a fixed test set

Say you have three models (A, B, C) and you want to compare their predictions (as Malignant/Benign) on a fixed test set. The workflow would be as follows:

  1. Since we’re comparing models’ binary classification predictions, the outcome variable is categorical.
  2. We have three models, so we have more than two groups.
  3. The samples are paired since the models are evaluated on the same test set.
  4. Choose the appropriate test (a short code sketch follows this list). So, we have:
    • categorical, >2 groups, paired: Cochran’s Q test
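
A minimal sketch using statsmodels (the binary predictions below are made-up; 1 = Malignant, 0 = Benign, with one row per test example and one column per model):

import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q
# Illustrative binary predictions (rows = test examples, columns = models A, B, C)
predictions = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
])
# Cochran's Q test: categorical outcome, >2 groups, paired samples
result = cochrans_q(predictions)
print(f"Cochran's Q: statistic={result.statistic:.3f}, p-value={result.pvalue:.3f}")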

General Notes and Common Pitfalls

Why use post-hoc tests when we can do multiple pairwise tests?

When comparing multiple groups, performing multiple pairwise tests (e.g., multiple t-tests) can lead to an increased risk of Type I errors (false positives). This is because each test has its own significance level (alpha), and when you perform multiple tests, the overall chance of making at least one Type I error increases. Post-hoc tests are designed to control for this increased risk by adjusting the significance levels or using specific procedures that account for the number of comparisons being made. This helps to maintain the overall Type I error rate at a desired level (e.g., 0.05) across all comparisons.
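
For instance, with $k$ independent tests each run at $\alpha = 0.05$, the chance of at least one false positive is $1 - (1 - \alpha)^k$, which is already about 40% for $k = 10$. Here is a minimal sketch of applying such a correction to a set of p-values (the raw p-values are made-up, illustrative numbers):

from statsmodels.stats.multitest import multipletests
# Hypothetical raw p-values from pairwise comparisons (illustrative only)
raw_pvalues = [0.012, 0.030, 0.250, 0.001, 0.048]
# Adjust for multiple comparisons; 'holm' is a slightly less conservative
# alternative to 'bonferroni' that still controls the family-wise error rate
reject, adjusted, _, _ = multipletests(raw_pvalues, alpha=0.05, method='holm')
for raw, adj, rej in zip(raw_pvalues, adjusted, reject):
    print(f"raw p={raw:.3f} -> adjusted p={adj:.3f} -> significant: {rej}")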

Choosing significance level (alpha)

A good significance level balances the risks of false positives and false negatives, and the right choice depends on the context and the consequences of each type of error. A significance level of 5% is the most common choice; the more conservative 1% level is typically reserved for settings where false positives are especially costly, such as medicine and other safety-critical applications.