03 - Pairwise Comparisons and Post-hoc Correction (Wilcoxon-Holm)

Once the Friedman test indicates that a significant difference exists among the models, the next step is to find out which specific models differ from each other.

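To do this, we perform pairwise comparisons using the Wilcoxon signed-rank test. However, every additional comparison raises the family-wise error rate (FWER), the probability of at least one false positive across all tests. To control it, we apply the Holm step-down procedure: the p-values are sorted in ascending order and each one is compared against a progressively less strict threshold, starting at alpha divided by the number of comparisons.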
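
To get a feel for the size of the problem, here is a rough back-of-the-envelope sketch for 8 models (the number of models in the dataset used in this notebook). It assumes the pairwise tests are independent, which is not strictly true when they share the same datasets, so treat the result as an illustration rather than an exact error rate.

```python
from math import comb

n_models = 8   # number of models compared in this notebook
alpha = 0.05   # per-test significance level

# Every unordered pair of models gets its own Wilcoxon test.
n_pairs = comb(n_models, 2)

# Chance of at least one false positive if all tests were independent
# and no correction were applied (simplifying assumption).
fwer_uncorrected = 1 - (1 - alpha) ** n_pairs

# Holm's strictest threshold, applied to the smallest p-value.
first_holm_threshold = alpha / n_pairs

print(n_pairs)                         # 28
print(round(fwer_uncorrected, 3))      # 0.762
print(round(first_holm_threshold, 5))  # 0.00179
```

Under that assumption, running 28 uncorrected tests at alpha = 0.05 would produce at least one false positive roughly three times out of four.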

In labicompare, the wilcoxon_holm function handles this entire pipeline automatically:

1. It runs the Friedman test.
2. It computes raw p-values and mean differences for all pairs.
3. It applies the Holm correction to determine final significance.

Let's see how it works!

In [2]:

import pandas as pd
from labicompare.core.data import EvaluationData
from labicompare.stats.posthoc import wilcoxon_holm

# Recreating our sample data from previous notebooks
df = pd.read_csv("./results.csv", index_col="dataset")
eval_data = EvaluationData(df, higher_is_better=True)

print(eval_data)

WARNING:labicompare.core.data:Null values detected. Rows (or datasets) with NaNs will be removed to ensure the integrity of paired statistical tests and methods.

<EvaluationData: 127 datasets, 8 models>

Step 1: Running the Wilcoxon-Holm Pipeline

Running the full post-hoc procedure is as simple as calling one function. It returns a ComparisonSummary object containing all the details of the experiment.

In [3]:

# The default alpha is 0.05
summary = wilcoxon_holm(data=eval_data, alpha=0.05)

print(f"Global Test (Friedman) P-Value: {summary.friedman_p_value:.5f}")
print(f"Global Null Hypothesis Rejected?: {summary.is_global_sig}")

Global Test (Friedman) P-Value: 0.00000
Global Null Hypothesis Rejected?: True

Step 2: Unpacking the Pairwise Results

The ComparisonSummary holds a list of pairwise results. We can iterate through them to see exactly how each model performed against the others.

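The function also automatically identifies the "Winner" of each pair based on the mean difference and your higher_is_better configuration. A winner is reported for every pair, including those that are not significant, so always read it together with the Significant flag. The printed p-values appear to be the raw Wilcoxon values; the Holm correction shows up only in the Significant flag, which is why a pair such as InceptionTime vs MultiROCKET (p = 0.01010) is not significant even though its p-value is below 0.05.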
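
As a mental model, the winner rule can be sketched as follows. This is not labicompare's actual code: pick_winner is a hypothetical helper, and the sign convention (mean of model A minus mean of model B) is an assumption that happens to match the Mean Diff column shown in Step 3. The scores are the Model_A and Model_B columns of the tied dataset used in Step 4.

```python
from statistics import mean


def pick_winner(name_a, scores_a, name_b, scores_b, higher_is_better=True):
    """Hypothetical helper: return (winner, mean_diff) for one pair."""
    # Assumed convention: mean difference is A minus B.
    mean_diff = mean(scores_a) - mean(scores_b)
    # A wins on a positive difference when higher scores are better,
    # and on a negative difference when lower scores are better.
    a_is_better = mean_diff > 0 if higher_is_better else mean_diff < 0
    # Note: an exact tie (mean_diff == 0) falls through to model B here.
    winner = name_a if a_is_better else name_b
    return winner, mean_diff


model_a = [0.80, 0.81, 0.79, 0.80, 0.82]
model_b = [0.79, 0.80, 0.80, 0.81, 0.81]

winner_high, diff = pick_winner("Model_A", model_a, "Model_B", model_b)
winner_low, _ = pick_winner(
    "Model_A", model_a, "Model_B", model_b, higher_is_better=False
)
print(winner_high)  # Model_A
print(winner_low)   # Model_B
```

Note that the winner only reflects the direction of the mean difference. Whether that difference is statistically meaningful is a separate question answered by the Holm-corrected test.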

In [4]:

print("--- Pairwise Comparisons ---")
for result in summary.pairwise_results:
  print(f"{result.model_a} vs {result.model_b}:")
  print(f"  Winner:      {result.winner}")
  print(f"  P-value:     {result.p_value:.5f}")
  print(f"  Significant: {result.is_significant}\n")

--- Pairwise Comparisons ---
LITE vs LITETime:
  Winner:      LITETime
  P-value:     0.00000
  Significant: True

Inception vs InceptionTime:
  Winner:      InceptionTime
  P-value:     0.00000
  Significant: True

FCN vs InceptionTime:
  Winner:      InceptionTime
  P-value:     0.00000
  Significant: True

FCN vs MultiROCKET:
  Winner:      MultiROCKET
  P-value:     0.00000
  Significant: True

FCN vs LITETime:
  Winner:      LITETime
  P-value:     0.00000
  Significant: True

FCN vs ROCKET:
  Winner:      ROCKET
  P-value:     0.00000
  Significant: True

LITE vs MultiROCKET:
  Winner:      MultiROCKET
  P-value:     0.00000
  Significant: True

ResNet vs MultiROCKET:
  Winner:      MultiROCKET
  P-value:     0.00000
  Significant: True

FCN vs Inception:
  Winner:      Inception
  P-value:     0.00000
  Significant: True

InceptionTime vs LITE:
  Winner:      InceptionTime
  P-value:     0.00000
  Significant: True

ResNet vs InceptionTime:
  Winner:      InceptionTime
  P-value:     0.00000
  Significant: True

ResNet vs ROCKET:
  Winner:      ROCKET
  P-value:     0.00000
  Significant: True

Inception vs MultiROCKET:
  Winner:      MultiROCKET
  P-value:     0.00000
  Significant: True

ResNet vs LITETime:
  Winner:      LITETime
  P-value:     0.00000
  Significant: True

FCN vs LITE:
  Winner:      LITE
  P-value:     0.00000
  Significant: True

LITE vs ROCKET:
  Winner:      ROCKET
  P-value:     0.00001
  Significant: True

FCN vs ResNet:
  Winner:      ResNet
  P-value:     0.00003
  Significant: True

ROCKET vs MultiROCKET:
  Winner:      MultiROCKET
  P-value:     0.00022
  Significant: True

ResNet vs Inception:
  Winner:      Inception
  P-value:     0.00038
  Significant: True

Inception vs LITE:
  Winner:      Inception
  P-value:     0.00287
  Significant: True

LITETime vs MultiROCKET:
  Winner:      MultiROCKET
  P-value:     0.00463
  Significant: True

Inception vs ROCKET:
  Winner:      ROCKET
  P-value:     0.00506
  Significant: True

Inception vs LITETime:
  Winner:      LITETime
  P-value:     0.00666
  Significant: True

InceptionTime vs MultiROCKET:
  Winner:      MultiROCKET
  P-value:     0.01010
  Significant: False

ResNet vs LITE:
  Winner:      LITE
  P-value:     0.14313
  Significant: False

InceptionTime vs LITETime:
  Winner:      InceptionTime
  P-value:     0.23091
  Significant: False

LITETime vs ROCKET:
  Winner:      ROCKET
  P-value:     0.72894
  Significant: False

InceptionTime vs ROCKET:
  Winner:      InceptionTime
  P-value:     0.80610
  Significant: False
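
Many of the p-values above print as 0.00000. That is a side effect of the fixed five-decimal format, not a sign that the values are exactly zero. A small sketch, using the LITE vs LITETime p-value as reported in the Step 3 table, shows how scientific notation keeps the magnitude visible:

```python
# LITE vs LITETime raw p-value, as reported in the Step 3 table.
p_value = 4.158076e-18

fixed = f"{p_value:.5f}"       # fixed-point, five decimals
scientific = f"{p_value:.3e}"  # scientific notation, three decimals

print(fixed)       # 0.00000
print(scientific)  # 4.158e-18
```

Swapping `:.5f` for `:.3e` in the loop above would make the smallest p-values readable.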

Step 3: Exporting Results for Publication

Manually copying these results into your research paper is tedious and error-prone. The ComparisonSummary provides a .to_dataframe() method that converts them into a standard tabular format.

In [5]:

# Extract the summary as a pandas DataFrame
results_df = summary.to_dataframe()
display(results_df)

# Hint: You can easily export this to LaTeX using results_df.to_latex(index=False)

	Model A	Model B	P-Value	Significant	Winner	Mean Diff
0	LITE	LITETime	4.158076e-18	True	LITETime	-0.015869
1	Inception	InceptionTime	4.403208e-18	True	InceptionTime	-0.009527
2	FCN	InceptionTime	1.858516e-13	True	InceptionTime	-0.060869
3	FCN	MultiROCKET	4.094701e-13	True	MultiROCKET	-0.072679
4	FCN	LITETime	2.182776e-12	True	LITETime	-0.058192
5	FCN	ROCKET	8.334871e-11	True	ROCKET	-0.060210
6	LITE	MultiROCKET	3.708010e-10	True	MultiROCKET	-0.030356
7	ResNet	MultiROCKET	1.724134e-09	True	MultiROCKET	-0.054476
8	FCN	Inception	2.370499e-09	True	Inception	-0.051342
9	InceptionTime	LITE	5.735349e-09	True	InceptionTime	0.018546
10	ResNet	InceptionTime	6.270000e-09	True	InceptionTime	-0.042666
11	ResNet	ROCKET	1.184978e-06	True	ROCKET	-0.042006
12	Inception	MultiROCKET	1.334716e-06	True	MultiROCKET	-0.021337
13	ResNet	LITETime	1.517250e-06	True	LITETime	-0.039989
14	FCN	LITE	1.675961e-06	True	LITE	-0.042323
15	LITE	ROCKET	6.467758e-06	True	ROCKET	-0.017887
16	FCN	ResNet	2.739844e-05	True	ResNet	-0.018203
17	ROCKET	MultiROCKET	2.200214e-04	True	MultiROCKET	-0.012469
18	ResNet	Inception	3.845425e-04	True	Inception	-0.033139
19	Inception	LITE	2.868770e-03	True	Inception	0.009019
20	LITETime	MultiROCKET	4.630711e-03	True	MultiROCKET	-0.014487
21	Inception	ROCKET	5.058008e-03	True	ROCKET	-0.008868
22	Inception	LITETime	6.656989e-03	True	LITETime	-0.006850
23	InceptionTime	MultiROCKET	1.010484e-02	False	MultiROCKET	-0.011810
24	ResNet	LITE	1.431335e-01	False	LITE	-0.024119
25	InceptionTime	LITETime	2.309091e-01	False	InceptionTime	0.002677
26	LITETime	ROCKET	7.289358e-01	False	ROCKET	-0.002017
27	InceptionTime	ROCKET	8.061037e-01	False	InceptionTime	0.000660
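
The Significant column can be cross-checked by hand. The sketch below is a plain-Python version of the Holm step-down rule applied to the raw p-values copied from the table above. It is an independent illustration, not labicompare's internal code, and holm_reject is a hypothetical helper name.

```python
# Raw p-values copied from the table above (rows 0 to 27).
raw_p_values = [
    4.158076e-18, 4.403208e-18, 1.858516e-13, 4.094701e-13,
    2.182776e-12, 8.334871e-11, 3.708010e-10, 1.724134e-09,
    2.370499e-09, 5.735349e-09, 6.270000e-09, 1.184978e-06,
    1.334716e-06, 1.517250e-06, 1.675961e-06, 6.467758e-06,
    2.739844e-05, 2.200214e-04, 3.845425e-04, 2.868770e-03,
    4.630711e-03, 5.058008e-03, 6.656989e-03, 1.010484e-02,
    1.431335e-01, 2.309091e-01, 7.289358e-01, 8.061037e-01,
]


def holm_reject(p_values, alpha=0.05):
    """Return one boolean per p-value: True where Holm rejects the null."""
    m = len(p_values)
    # Visit the hypotheses from the smallest to the largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Thresholds grow from alpha/m up to alpha/1.
        threshold = alpha / (m - rank)
        if p_values[idx] > threshold:
            break  # step-down: this and all larger p-values are retained
        reject[idx] = True
    return reject


decisions = holm_reject(raw_p_values)
print(sum(decisions))  # 23
print(decisions[23])   # False
```

Row 23 (InceptionTime vs MultiROCKET) is where the procedure stops: it is the 24th smallest of 28 p-values, so its threshold is 0.05 / 5 = 0.01, and 0.0101 just misses it. Every row after it is therefore not significant either, matching the table.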

Step 4: Built-in Safety Mechanism (The Exception)

What happens if you try to run the wilcoxon_holm test on models that are statistically indistinguishable?

As a strict statistical library, labicompare refuses to compute pairwise differences if the global Friedman test fails to reject the null hypothesis. Instead, it raises a ValueError, so that pairwise results are never reported without evidence of a global difference. Let's demonstrate this with the tied dataset from Notebook 02.

In [6]:

# Tied data from Notebook 02
tied_df = pd.DataFrame({
  'Model_A': [0.80, 0.81, 0.79, 0.80, 0.82],
  'Model_B': [0.79, 0.80, 0.80, 0.81, 0.81],
  'Model_C': [0.81, 0.79, 0.81, 0.79, 0.80]
})
tied_data = EvaluationData(tied_df, higher_is_better=True)

try:
  bad_summary = wilcoxon_holm(tied_data, alpha=0.05)
except ValueError as e:
  print("Exception caught gracefully!")
  print(f"Error Message: {e}")
  print("\nConclusion: The library correctly stopped us from running invalid pairwise tests.")

Exception caught gracefully!
Error Message: The null-hypothesis of Friedman test cannot be rejected (p-value: 0.8187 > 0.05).

Conclusion: The library correctly stopped us from running invalid pairwise tests.
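
The gating pattern itself is simple. The sketch below shows one way such a guard could look. It is an assumption about the general shape, not labicompare's actual implementation; require_global_significance is a hypothetical name, and the message simply mirrors the output above.

```python
def require_global_significance(friedman_p_value, alpha=0.05):
    """Hypothetical guard: raise unless the global test rejects the null."""
    if friedman_p_value > alpha:
        raise ValueError(
            "The null-hypothesis of Friedman test cannot be rejected "
            f"(p-value: {friedman_p_value:.4f} > {alpha})."
        )


message = None
try:
    # Friedman p-value reported for the tied dataset above.
    require_global_significance(0.8187)
except ValueError as error:
    message = str(error)

print(message)
```

Placing the guard before any pairwise test means the post-hoc step never runs unless the global test supports it.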