04 - Individual Pairwise Comparisons (The Paired T-Test)
While the Wilcoxon-Holm pipeline is the standard for comparing multiple models, there are times when you only need to compare two specific models.
In labicompare, you can run individual statistical tests directly. In this notebook, we will explore the Paired T-Test (paired_ttest), which is a parametric test.
Because parametric tests make strict assumptions about your data (specifically, that the differences between the two models are normally distributed), labicompare includes built-in safety checks to warn you if you are violating these assumptions.
In this tutorial, you will learn:
- How to run a standalone Paired T-Test.
- How to read the PairwiseResult object.
- How the library warns you about non-normal distributions (and how to handle it).
import pandas as pd
import numpy as np
import warnings
from labicompare.core.data import EvaluationData
from labicompare.stats.pairwise import paired_ttest
# Load the benchmark results (one row per dataset, one column per model)
df = pd.read_csv("./results.csv", index_col="dataset")
eval_data = EvaluationData(df, higher_is_better=True)
print(eval_data)
WARNING:labicompare.core.data:Null values detected. Rows (or datasets) with NaNs will be removed to ensure the integrity of paired statistical tests and methods.
<EvaluationData: 127 datasets, 8 models>
Step 1: Running a Basic Paired T-Test
Let's compare InceptionTime and FCN. We leave the built-in normality check enabled so the library can tell us whether the differences between them are normally distributed.
The paired_ttest function requires the data object and the exact names of the two models you want to compare. It returns a PairwiseResult object.
# Run the parametric T-Test
result = paired_ttest(
data=eval_data,
model_a='InceptionTime',
model_b='FCN',
alpha=0.05
)
print(f"Comparing: {result.model_a} vs {result.model_b}")
print(f"Mean Difference: {result.mean_diff:.4f}")
print(f"P-value: {result.p_value:.5f}")
print(f"Significant?: {result.is_significant}")
print(f"Winner: {result.winner}")
Comparing: InceptionTime vs FCN
Mean Difference: 0.0609
P-value: 0.00000
Significant?: True
Winner: InceptionTime
/tmp/ipykernel_1631687/2330940442.py:2: UserWarning: [labicompare] WARNING: Differences between 'InceptionTime' and 'FCN' NOT follow a normal distribution (Shapiro-Wilk p-value = 0.0000 < 0.05). The result of this paired T-Test has high risk of false positive. We strongly suggest using the Wilcoxon Signed-Rank instead.
  result = paired_ttest(
The Paired T-Test assumes that the differences between the two models follow a normal distribution. If this assumption is violated, the p-value is unreliable, and you risk making false claims.
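To make the role of the differences concrete, here is a minimal sketch of the paired t-test computed by hand and checked against SciPy. It does not use labicompare and is not a description of its internals; the score arrays are hypothetical values chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical per-dataset scores for two models (illustrative values only).
scores_a = np.array([0.81, 0.77, 0.92, 0.68, 0.85, 0.74, 0.88, 0.79])
scores_b = np.array([0.78, 0.75, 0.90, 0.70, 0.80, 0.71, 0.86, 0.76])

# The test operates on the per-dataset differences, not on the raw scores.
diffs = scores_a - scores_b
n = diffs.size

# t = mean(d) / (std(d) / sqrt(n)), using the sample standard deviation.
t_manual = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n))

# Two-sided p-value from a Student t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy's paired t-test should agree with the manual computation.
t_scipy, p_scipy = stats.ttest_rel(scores_a, scores_b)

print(f"manual: t = {t_manual:.4f}, p = {p_manual:.5f}")
print(f"scipy:  t = {t_scipy:.4f}, p = {p_scipy:.5f}")
```

Because the statistic divides the mean difference by its standard error, a few extreme differences can inflate or deflate it, which is why the normality of the differences matters.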
By default, labicompare runs a Shapiro-Wilk test in the background. If the differences are not normally distributed, it issues a warning suggesting you use a non-parametric test (like the Wilcoxon Signed-Rank) instead.
Step 2: Bypassing the Normality Check
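A comparable check can be sketched directly with SciPy. This is an approximation of the idea, not labicompare's actual implementation; the helper `looks_normal` and the simulated differences are hypothetical and exist only for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical differences: one roughly normal sample, one heavily skewed sample.
normal_diffs = rng.normal(loc=0.02, scale=0.05, size=100)
skewed_diffs = rng.exponential(scale=0.05, size=100)

def looks_normal(diffs, alpha=0.05):
    """Return (is_normal, p_value) from a Shapiro-Wilk test on the differences."""
    _, p_value = stats.shapiro(diffs)
    # A p-value below alpha rejects the null hypothesis of normality.
    return bool(p_value >= alpha), float(p_value)

for name, diffs in [("normal", normal_diffs), ("skewed", skewed_diffs)]:
    ok, p = looks_normal(diffs)
    print(f"{name}: Shapiro-Wilk p = {p:.4f}, looks normal: {ok}")
```

Note that the check is applied to the paired differences, not to each model's scores separately, since that is what the paired t-test assumes to be normal.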
If you are absolutely certain about your data distribution (or if you are running a massive benchmark script where you don't want warnings clogging your console), you can turn off the safety check using check_normality=False.
# Running the same test, but silencing the built-in normality check
result_silenced = paired_ttest(
data=eval_data,
model_a='InceptionTime',
model_b='FCN',
check_normality=False
)
print("Test executed without warnings.")
print(f"P-value: {result_silenced.p_value:.5f}")
Test executed without warnings.
P-value: 0.00000
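If you would rather keep the check active but stop the messages from flooding the console, one option is to record the warnings with Python's standard `warnings` module and review them afterwards. The sketch below assumes the library emits its message as a regular `UserWarning`, as the output in Step 1 suggests; `noisy_test` is a hypothetical stand-in, not a labicompare function.

```python
import warnings

def noisy_test():
    """Hypothetical stand-in for a test function that emits a UserWarning."""
    warnings.warn("differences may not be normally distributed", UserWarning)
    return 0.01  # placeholder p-value, not a real result

# Record warnings instead of printing them, so they can be reviewed later.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    p_value = noisy_test()

messages = [str(w.message) for w in caught]
print(f"p-value: {p_value}, warnings captured: {len(messages)}")
```

This keeps a record of which comparisons violated the assumption, which is usually safer in a large benchmark than disabling the check outright.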
Step 3: A Non-Parametric Alternative (The Sign Test)
As we saw in Step 1, comparing InceptionTime and FCN with a Paired T-Test triggered a normality warning. When the differences contain extreme outliers or are otherwise far from normal, a parametric test can be distorted by the magnitude of those differences.
To fix this, we can use a Non-Parametric Test. The simplest one is the Sign Test.
The Sign Test is incredibly robust against outliers because it completely ignores the scale of the differences. It only looks at the "signs" (+ or -), effectively asking: "Out of all the datasets, how many times did one model beat the other?"
from labicompare.stats.pairwise import sign_test
# Run the non-parametric Sign Test
result_sign = sign_test(
data=eval_data,
model_a='InceptionTime',
model_b='ROCKET',
alpha=0.05
)
print(f"Comparing: {result_sign.model_a} vs {result_sign.model_b} (Sign Test)")
print(f"P-value: {result_sign.p_value:.5f}")
print(f"Significant?: {result_sign.is_significant}")
Comparing: InceptionTime vs ROCKET (Sign Test)
P-value: 0.41142
Significant?: False
Step 4: Wilcoxon Signed-Rank Test
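The counting logic behind the Sign Test can be sketched with SciPy's binomial test. This is an illustration under stated assumptions, not labicompare's implementation: the scores are hypothetical, and dropping ties is one common convention that the library may or may not follow.

```python
import numpy as np
from scipy import stats

# Hypothetical per-dataset scores (illustrative values only).
scores_a = np.array([0.81, 0.77, 0.92, 0.68, 0.85, 0.74, 0.88, 0.79, 0.60, 0.83])
scores_b = np.array([0.78, 0.75, 0.90, 0.70, 0.80, 0.71, 0.86, 0.79, 0.95, 0.80])

diffs = scores_a - scores_b

# Only the sign of each difference matters; ties are dropped.
wins = int(np.sum(diffs > 0))
losses = int(np.sum(diffs < 0))
n = wins + losses

# Under the null hypothesis, each non-tied dataset is a fair coin flip.
result = stats.binomtest(wins, n=n, p=0.5, alternative="two-sided")

print(f"wins = {wins}, losses = {losses}, ties = {diffs.size - n}")
print(f"Sign test p-value: {result.pvalue:.5f}")
```

The large loss on the ninth dataset (0.60 vs 0.95) counts as exactly one loss, the same as the narrow loss on the fourth dataset, which is what makes the test robust to outliers.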
While the Sign Test is robust, it throws away a lot of useful information. By completely ignoring the magnitude of the differences, it can sometimes fail to detect a model that is consistently winning by large margins.
This is where the Wilcoxon Signed-Rank Test shines. It is widely recommended as the pairwise test for machine learning benchmarks.
How it works:
- It calculates the difference in performance for each dataset.
- It ranks these differences by their absolute size (ignoring the sign).
- It then applies the signs back to the ranks to see if the positive ranks or negative ranks dominate.
Because it uses ranks instead of raw values, it doesn't care if the data is perfectly normal. However, because it ranks the magnitudes, it still accounts for how big the wins and losses are.
from labicompare.stats.pairwise import wilcoxon_signed_rank
# Run the Wilcoxon Signed-Rank Test
result_wilcoxon = wilcoxon_signed_rank(
data=eval_data,
model_a='InceptionTime',
model_b='ROCKET',
alpha=0.05
)
print(f"Comparing: {result_wilcoxon.model_a} vs {result_wilcoxon.model_b} (Wilcoxon)")
print(f"P-value: {result_wilcoxon.p_value:.5f}")
print(f"Significant?: {result_wilcoxon.is_significant}")
print(f"Winner: {result_wilcoxon.winner}")
Comparing: InceptionTime vs ROCKET (Wilcoxon)
P-value: 0.80610
Significant?: False
Winner: None
Summary of Pairwise Tests
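The three ranking steps described in "How it works" can be sketched with NumPy and SciPy. The scores are hypothetical, and this is a plain SciPy illustration, not a description of how labicompare computes the test internally.

```python
import numpy as np
from scipy import stats

# Hypothetical per-dataset scores (illustrative values only, no zero differences).
scores_a = np.array([0.81, 0.77, 0.92, 0.68, 0.85, 0.74, 0.88, 0.60])
scores_b = np.array([0.78, 0.76, 0.88, 0.70, 0.80, 0.68, 0.81, 0.95])

# Step 1: difference in performance for each dataset.
diffs = scores_a - scores_b

# Step 2: rank the absolute differences (ties would share an average rank).
ranks = stats.rankdata(np.abs(diffs))

# Step 3: sum the ranks separately for positive and negative differences.
w_plus = ranks[diffs > 0].sum()
w_minus = ranks[diffs < 0].sum()

# SciPy reports min(w_plus, w_minus) as the two-sided test statistic.
result = stats.wilcoxon(scores_a, scores_b)

print(f"W+ = {w_plus}, W- = {w_minus}")
print(f"SciPy statistic = {result.statistic}, p-value = {result.pvalue:.5f}")
```

The largest loss (0.60 vs 0.95) receives the top rank of 8 rather than its raw size of 0.35, so it influences the result without overwhelming the six smaller wins.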
To recap, if you need to compare exactly two models, choose your test based on your data:
- Paired T-Test: Use when the differences are approximately normally distributed (the built-in Shapiro-Wilk check warns you when they are not).
- Sign Test: Use when you only care about the win/loss count. Because it ignores magnitudes, a single large failure counts no more than a narrow loss.
- Wilcoxon Signed-Rank Test: The recommended default. Safe against outliers, doesn't assume normality, and respects the relative magnitude of differences.
Note: If you are comparing more than two models simultaneously, always use the wilcoxon_holm post-hoc function we covered in Notebook 03 instead of running these individual tests manually!