03 - Pairwise Comparisons and Post-hoc Correction (Wilcoxon-Holm)
Once the Friedman test indicates that a significant difference exists among the models, the next step is to find out which specific models differ from each other.
To do this, we perform pairwise comparisons using the Wilcoxon Signed-Rank Test. However, every additional comparison raises the Family-Wise Error Rate (FWER), the probability of reporting at least one false positive across the whole set of tests. To control it, we apply the Holm step-down procedure, which sorts the p-values and compares each one against a progressively less strict significance threshold.
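As a rough illustration of the step-down rule, here is a minimal sketch of Holm's procedure in plain Python. The function name `holm_step_down` and the example p-values are hypothetical and are not part of labicompare; the sketch assumes the textbook form of the procedure, where the k-th smallest of m p-values is compared with alpha / (m - k + 1).

```python
def holm_step_down(p_values, alpha=0.05):
    """Hypothetical helper: return True for each hypothesis Holm rejects."""
    m = len(p_values)
    # Visit the hypotheses from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # rank is 0-based, so the threshold is alpha / (m - rank).
        if p_values[idx] > alpha / (m - rank):
            break  # step-down: the first failure ends the procedure
        rejected[idx] = True
    return rejected


# Made-up p-values for four pairwise comparisons.
p_values = [0.040, 0.001, 0.030, 0.012]
decisions = holm_step_down(p_values)
print(decisions)
```

In this made-up example the two smallest p-values are rejected, while 0.030 and 0.040 are not, even though both are below 0.05 on their own.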
In labicompare, the wilcoxon_holm function handles this entire pipeline automatically:
- It runs the Friedman test.
- It computes raw p-values and mean differences for all pairs.
- It applies the Holm correction to determine final significance.
Let's see how it works!
import pandas as pd
from labicompare.core.data import EvaluationData
from labicompare.stats.posthoc import wilcoxon_holm
# Recreating our sample data from previous notebooks
df = pd.read_csv("./results.csv", index_col="dataset")
eval_data = EvaluationData(df, higher_is_better=True)
print(eval_data)
WARNING:labicompare.core.data:Null values detected. Rows (or datasets) with NaNs will be removed to ensure the integrity of paired statistical tests and methods.
<EvaluationData: 127 datasets, 8 models>
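The warning above says that datasets with missing scores are dropped so that every model is evaluated on the same set of datasets. A rough pandas equivalent of that kind of row-wise filtering is sketched below; the small frame is made up, and this is an assumption about the behaviour described in the warning, not labicompare's actual implementation.

```python
import numpy as np
import pandas as pd

# Made-up scores: rows are datasets, columns are models.
raw = pd.DataFrame(
    {"Model_A": [0.80, 0.81, np.nan], "Model_B": [0.79, np.nan, 0.80]},
    index=pd.Index(["d1", "d2", "d3"], name="dataset"),
)

# Keep only the datasets for which every model has a score.
complete = raw.dropna(axis=0, how="any")
print(f"{len(raw)} datasets before, {len(complete)} after")
```

Paired tests such as Wilcoxon compare scores dataset by dataset, so a row with a single missing value cannot contribute to every pair.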
Step 1: Running the Wilcoxon-Holm Pipeline
Running the full post-hoc procedure is as simple as calling one function. It returns a ComparisonSummary object containing all the details of the experiment.
# The default alpha is 0.05
summary = wilcoxon_holm(data=eval_data, alpha=0.05)
print(f"Global Test (Friedman) P-Value: {summary.friedman_p_value:.5f}")
print(f"Global Null Hypothesis Rejected?: {summary.is_global_sig}")
Global Test (Friedman) P-Value: 0.00000
Global Null Hypothesis Rejected?: True
Step 2: Unpacking the Pairwise Results
The ComparisonSummary holds a list of pairwise results. We can iterate through them to see exactly how each model performed against the others.
The function also automatically identifies the "Winner" of each pair based on the mean difference and your higher_is_better configuration.
print("--- Pairwise Comparisons ---")
for result in summary.pairwise_results:
    print(f"{result.model_a} vs {result.model_b}:")
    print(f" Winner: {result.winner}")
    print(f" P-value: {result.p_value:.5f}")
    print(f" Significant: {result.is_significant}\n")
--- Pairwise Comparisons ---
LITE vs LITETime:
 Winner: LITETime
 P-value: 0.00000
 Significant: True

Inception vs InceptionTime:
 Winner: InceptionTime
 P-value: 0.00000
 Significant: True

FCN vs InceptionTime:
 Winner: InceptionTime
 P-value: 0.00000
 Significant: True

FCN vs MultiROCKET:
 Winner: MultiROCKET
 P-value: 0.00000
 Significant: True

FCN vs LITETime:
 Winner: LITETime
 P-value: 0.00000
 Significant: True

FCN vs ROCKET:
 Winner: ROCKET
 P-value: 0.00000
 Significant: True

LITE vs MultiROCKET:
 Winner: MultiROCKET
 P-value: 0.00000
 Significant: True

ResNet vs MultiROCKET:
 Winner: MultiROCKET
 P-value: 0.00000
 Significant: True

FCN vs Inception:
 Winner: Inception
 P-value: 0.00000
 Significant: True

InceptionTime vs LITE:
 Winner: InceptionTime
 P-value: 0.00000
 Significant: True

ResNet vs InceptionTime:
 Winner: InceptionTime
 P-value: 0.00000
 Significant: True

ResNet vs ROCKET:
 Winner: ROCKET
 P-value: 0.00000
 Significant: True

Inception vs MultiROCKET:
 Winner: MultiROCKET
 P-value: 0.00000
 Significant: True

ResNet vs LITETime:
 Winner: LITETime
 P-value: 0.00000
 Significant: True

FCN vs LITE:
 Winner: LITE
 P-value: 0.00000
 Significant: True

LITE vs ROCKET:
 Winner: ROCKET
 P-value: 0.00001
 Significant: True

FCN vs ResNet:
 Winner: ResNet
 P-value: 0.00003
 Significant: True

ROCKET vs MultiROCKET:
 Winner: MultiROCKET
 P-value: 0.00022
 Significant: True

ResNet vs Inception:
 Winner: Inception
 P-value: 0.00038
 Significant: True

Inception vs LITE:
 Winner: Inception
 P-value: 0.00287
 Significant: True

LITETime vs MultiROCKET:
 Winner: MultiROCKET
 P-value: 0.00463
 Significant: True

Inception vs ROCKET:
 Winner: ROCKET
 P-value: 0.00506
 Significant: True

Inception vs LITETime:
 Winner: LITETime
 P-value: 0.00666
 Significant: True

InceptionTime vs MultiROCKET:
 Winner: MultiROCKET
 P-value: 0.01010
 Significant: False

ResNet vs LITE:
 Winner: LITE
 P-value: 0.14313
 Significant: False

InceptionTime vs LITETime:
 Winner: InceptionTime
 P-value: 0.23091
 Significant: False

LITETime vs ROCKET:
 Winner: ROCKET
 P-value: 0.72894
 Significant: False

InceptionTime vs ROCKET:
 Winner: InceptionTime
 P-value: 0.80610
 Significant: False
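The winner rule described in this step can be sketched as follows. This is a hedged illustration: `pick_winner` is a hypothetical helper, the scores are made up, and the sketch assumes the mean difference is computed as mean(A) - mean(B), which matches the signs shown in the Step 3 table but is not confirmed by the library's documentation here.

```python
from statistics import mean


def pick_winner(name_a, scores_a, name_b, scores_b, higher_is_better=True):
    """Hypothetical helper: return (winner, mean_diff) for one pair of models."""
    mean_diff = mean(scores_a) - mean(scores_b)
    if mean_diff == 0:
        return None, mean_diff  # exact tie: no winner
    # A wins when the sign of the difference agrees with the metric direction.
    a_wins = (mean_diff > 0) == higher_is_better
    return (name_a if a_wins else name_b), mean_diff


# Made-up scores on three datasets, read as accuracy (higher is better).
winner, diff = pick_winner("A", [0.80, 0.82, 0.78], "B", [0.85, 0.86, 0.84])

# The same numbers read as error rates (lower is better).
winner_err, _ = pick_winner(
    "A", [0.80, 0.82, 0.78], "B", [0.85, 0.86, 0.84], higher_is_better=False
)
print(winner, winner_err)
```

The winner only reports the direction of the difference. Whether that difference is statistically meaningful is decided separately by the Significant flag.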
Step 3: Exporting Results for Publication
Manually copying these results into your research paper is tedious and error-prone. The ComparisonSummary provides a .to_dataframe() method that converts them into a standard tabular format.
# Extract the summary as a pandas DataFrame
results_df = summary.to_dataframe()
display(results_df)
# Hint: You can easily export this to LaTeX using results_df.to_latex(index=False)
| | Model A | Model B | P-Value | Significant | Winner | Mean Diff |
|---|---|---|---|---|---|---|
| 0 | LITE | LITETime | 4.158076e-18 | True | LITETime | -0.015869 |
| 1 | Inception | InceptionTime | 4.403208e-18 | True | InceptionTime | -0.009527 |
| 2 | FCN | InceptionTime | 1.858516e-13 | True | InceptionTime | -0.060869 |
| 3 | FCN | MultiROCKET | 4.094701e-13 | True | MultiROCKET | -0.072679 |
| 4 | FCN | LITETime | 2.182776e-12 | True | LITETime | -0.058192 |
| 5 | FCN | ROCKET | 8.334871e-11 | True | ROCKET | -0.060210 |
| 6 | LITE | MultiROCKET | 3.708010e-10 | True | MultiROCKET | -0.030356 |
| 7 | ResNet | MultiROCKET | 1.724134e-09 | True | MultiROCKET | -0.054476 |
| 8 | FCN | Inception | 2.370499e-09 | True | Inception | -0.051342 |
| 9 | InceptionTime | LITE | 5.735349e-09 | True | InceptionTime | 0.018546 |
| 10 | ResNet | InceptionTime | 6.270000e-09 | True | InceptionTime | -0.042666 |
| 11 | ResNet | ROCKET | 1.184978e-06 | True | ROCKET | -0.042006 |
| 12 | Inception | MultiROCKET | 1.334716e-06 | True | MultiROCKET | -0.021337 |
| 13 | ResNet | LITETime | 1.517250e-06 | True | LITETime | -0.039989 |
| 14 | FCN | LITE | 1.675961e-06 | True | LITE | -0.042323 |
| 15 | LITE | ROCKET | 6.467758e-06 | True | ROCKET | -0.017887 |
| 16 | FCN | ResNet | 2.739844e-05 | True | ResNet | -0.018203 |
| 17 | ROCKET | MultiROCKET | 2.200214e-04 | True | MultiROCKET | -0.012469 |
| 18 | ResNet | Inception | 3.845425e-04 | True | Inception | -0.033139 |
| 19 | Inception | LITE | 2.868770e-03 | True | Inception | 0.009019 |
| 20 | LITETime | MultiROCKET | 4.630711e-03 | True | MultiROCKET | -0.014487 |
| 21 | Inception | ROCKET | 5.058008e-03 | True | ROCKET | -0.008868 |
| 22 | Inception | LITETime | 6.656989e-03 | True | LITETime | -0.006850 |
| 23 | InceptionTime | MultiROCKET | 1.010484e-02 | False | MultiROCKET | -0.011810 |
| 24 | ResNet | LITE | 1.431335e-01 | False | LITE | -0.024119 |
| 25 | InceptionTime | LITETime | 2.309091e-01 | False | InceptionTime | 0.002677 |
| 26 | LITETime | ROCKET | 7.289358e-01 | False | ROCKET | -0.002017 |
| 27 | InceptionTime | ROCKET | 8.061037e-01 | False | InceptionTime | 0.000660 |
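Row 23 can look surprising: a raw p-value of about 0.0101 is below 0.05, yet it is flagged as not significant. The arithmetic below is a sketch of why, assuming labicompare applies the standard Holm step-down thresholds to the 28 p-values sorted in ascending order. That assumption is consistent with the table but is not taken from the library's source.

```python
alpha = 0.05
m = 8 * 7 // 2  # 28 pairwise comparisons among 8 models

# (table row, raw p-value) for the two rows around the cut-off
boundary_rows = [(22, 6.656989e-03), (23, 1.010484e-02)]

checks = {}
for row, p in boundary_rows:
    # Rows are sorted by p-value, so row i is compared with alpha / (m - i).
    threshold = alpha / (m - row)
    checks[row] = p <= threshold
    print(f"row {row}: p = {p:.6f}, threshold = {threshold:.6f}, significant = {checks[row]}")
```

Row 22 passes its threshold of 0.05 / 6, while row 23 narrowly misses 0.05 / 5 = 0.01. Because the procedure stops at the first failure, rows 24 to 27 are also reported as not significant.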
Step 4: Built-in Safety Mechanism (The Exception)
What happens if you run wilcoxon_holm on models whose performance is not statistically distinguishable?
As a strict statistical library, labicompare will refuse to compute pairwise differences if the global Friedman test fails to reject the null hypothesis. It will raise a ValueError to prevent false scientific claims. Let's demonstrate this with our tied dataset.
# Tied data from Notebook 02
tied_df = pd.DataFrame({
'Model_A': [0.80, 0.81, 0.79, 0.80, 0.82],
'Model_B': [0.79, 0.80, 0.80, 0.81, 0.81],
'Model_C': [0.81, 0.79, 0.81, 0.79, 0.80]
})
tied_data = EvaluationData(tied_df, higher_is_better=True)
try:
    bad_summary = wilcoxon_holm(tied_data, alpha=0.05)
except ValueError as e:
    print("Exception caught gracefully!")
    print(f"Error Message: {e}")
    print("\nConclusion: The library correctly stopped us from running invalid pairwise tests.")
Exception caught gracefully!
Error Message: The null-hypothesis of Friedman test cannot be rejected (p-value: 0.8187 > 0.05).

Conclusion: The library correctly stopped us from running invalid pairwise tests.
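The p-value in the error message can be reproduced by hand. The sketch below computes the Friedman statistic for the tied data in plain Python, assuming the usual chi-square approximation with k - 1 degrees of freedom. That labicompare uses this approximation is an assumption. With three models there are two degrees of freedom, for which the chi-square survival function is exactly exp(-x / 2).

```python
import math

# Tied data from the example above: one list entry per dataset.
scores = {
    "Model_A": [0.80, 0.81, 0.79, 0.80, 0.82],
    "Model_B": [0.79, 0.80, 0.80, 0.81, 0.81],
    "Model_C": [0.81, 0.79, 0.81, 0.79, 0.80],
}
models = list(scores)
n = len(scores[models[0]])  # number of datasets (5)
k = len(models)             # number of models (3)

# Rank the models within each dataset (1 = lowest score). No dataset has
# tied scores here, so plain ordinal ranks are sufficient.
rank_sums = dict.fromkeys(models, 0)
for i in range(n):
    ordered = sorted(models, key=lambda name: scores[name][i])
    for rank, name in enumerate(ordered, start=1):
        rank_sums[name] += rank

# Friedman statistic: 12 / (n k (k + 1)) * sum(R_j^2) - 3 n (k + 1)
chi2 = 12 / (n * k * (k + 1)) * sum(r ** 2 for r in rank_sums.values()) - 3 * n * (k + 1)

# Survival function of the chi-square distribution with 2 degrees of freedom.
p_value = math.exp(-chi2 / 2)

alpha = 0.05
print(f"rank sums = {rank_sums}, chi2 = {chi2:.4f}, p = {p_value:.4f}")
print("global null rejected" if p_value <= alpha else "global null not rejected")
```

The rank sums are 11, 10 and 9, which gives a statistic of 0.4 and a p-value of exp(-0.2), about 0.8187. This agrees with the value reported in the exception.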