02 - Global Statistical Validation (Friedman Test)

Before comparing models one-by-one, scientific rigor demands that we answer a simpler question: "Is there any significant difference among the models overall?"

In labicompare, this is done using the Friedman Test, a non-parametric alternative to the repeated-measures ANOVA. It ranks the models within each dataset and tests whether their ranks differ more than would be expected if all models performed equally.
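
For reference, a common textbook form of the statistic is sketched below. Here $N$ is the number of datasets, $k$ the number of models, and $R_j$ the sum of the ranks of model $j$ over all datasets; these symbols are introduced for illustration only. This form ignores tie corrections, and whether labicompare computes exactly this variant is an assumption.

$$
\chi^2_F = \frac{12}{N k (k+1)} \sum_{j=1}^{k} R_j^2 - 3N(k+1)
$$

Under $H_0$, $\chi^2_F$ approximately follows a chi-square distribution with $k-1$ degrees of freedom.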

In this tutorial, you will learn:

How to run the Friedman test in isolation.
How to interpret the test statistic and the p-value.
What happens when the models perform too similarly.

In [3]:

import pandas as pd
import numpy as np
from labicompare.core.data import EvaluationData
from labicompare.stats.friedman import friedman_test

# Let's recreate our sample data from Notebook 01
df = pd.read_csv("./results.csv", index_col="dataset")

eval_data = EvaluationData(df, higher_is_better=True)

display(eval_data.ranks_df)

WARNING:labicompare.core.data:Null values detected. Rows (or datasets) with NaNs will be removed to ensure the integrity of paired statistical tests and methods.

	FCN	ResNet	Inception	InceptionTime	LITE	LITETime	ROCKET	MultiROCKET
dataset
Adiac	2.0	5.0	6.0	1.0	7.0	3.5	8.0	3.5
AllGestureWiimoteX	8.0	6.0	3.0	2.0	5.0	4.0	1.0	7.0
AllGestureWiimoteY	5.0	3.0	2.0	1.0	8.0	4.0	7.0	6.0
AllGestureWiimoteZ	8.0	7.0	2.0	1.0	6.0	4.5	3.0	4.5
ArrowHead	4.5	6.0	3.0	2.0	7.0	4.5	8.0	1.0
...	...	...	...	...	...	...	...	...
Wine	8.0	3.0	5.0	5.0	7.0	5.0	2.0	1.0
WordSynonyms	8.0	7.0	4.0	2.0	6.0	5.0	3.0	1.0
Worms	4.0	6.0	5.0	3.0	2.0	1.0	8.0	7.0
WormsTwoClass	8.0	7.0	6.0	4.0	3.0	5.0	1.0	2.0
Yoga	8.0	7.0	6.0	4.0	5.0	2.0	3.0	1.0

127 rows × 8 columns
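
The fractional ranks in the table (for example 3.5 or 4.5) suggest that tied scores share the average of the ranks they span, and the warning indicates that datasets with missing values are dropped first. A minimal sketch of that behaviour with plain pandas is shown below. The scores are made up for illustration, and whether `EvaluationData` computes its ranks exactly this way is an assumption.

```python
import pandas as pd

# Made-up accuracies for illustration only; not taken from results.csv
scores = pd.DataFrame(
    {
        "Model_A": [0.90, 0.70, 0.60],
        "Model_B": [0.80, 0.85, float("nan")],
        "Model_C": [0.80, 0.75, 0.65],
        "Model_D": [0.70, 0.95, 0.55],
    },
    index=["Dataset_1", "Dataset_2", "Dataset_3"],
)

# Drop datasets with any missing score so every model is ranked on the same rows
complete = scores.dropna(axis=0)

# Rank within each dataset (row): rank 1 is the highest score, ties get the average rank
ranks = complete.rank(axis=1, ascending=False, method="average")
print(ranks)
```

With `higher_is_better=False`, the analogous call would presumably use `ascending=True`.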

Step 1: Running the Friedman Test

The friedman_test function takes our EvaluationData object and returns two values:

Test Statistic: The calculated $\chi^2_F$ value.
P-value: The probability of observing rank differences at least as large as these if all models actually performed identically.

In [4]:

# Run the test
f_stat, f_p_value = friedman_test(eval_data)

print(f"Friedman Statistic: {f_stat:.4f}")
print(f"P-value:            {f_p_value:.6f}")

Friedman Statistic: 198.4172
P-value:            0.000000
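
The p-value prints as 0.000000 only because it is rounded to six decimals; it is very small, not zero. A minimal sketch of reporting it in scientific notation follows. It assumes the statistic is referred to a chi-square distribution with $k - 1 = 7$ degrees of freedom (8 models), which is the textbook approximation and not something confirmed here about `friedman_test`. It also uses scipy, which this notebook does not otherwise import.

```python
from scipy.stats import chi2

friedman_stat = 198.4172  # statistic reported in the output above
n_models = 8              # number of model columns in the ranks table

# Assumption: chi-square approximation with k - 1 degrees of freedom
p_value = chi2.sf(friedman_stat, df=n_models - 1)

fixed = f"{p_value:.6f}"       # fixed-point formatting hides tiny values
scientific = f"{p_value:.2e}"  # scientific notation keeps the magnitude visible

print(f"Fixed-point: {fixed}")
print(f"Scientific:  {scientific}")
```

In the notebook itself, the same effect could be had by formatting `f_p_value` with `:.2e` instead of `:.6f`.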

Step 2: Interpreting the Results

To make a decision, we compare the p-value against our chosen significance level ($\alpha$, usually 0.05).

If p-value < $\alpha$: We reject the Null Hypothesis ($H_0$). At least one model differs significantly from the others.
If p-value $\ge$ $\alpha$: We fail to reject $H_0$. The data do not show a significant difference among the models; this is not proof that they are equivalent.

In [5]:

alpha = 0.05

if f_p_value < alpha:
    print(f"REJECT H0 (p < {alpha}).")
    print("Conclusion: At least one model performs significantly differently from the others.")
    print("Next step: Proceed to post-hoc tests (e.g., Wilcoxon-Holm).")
else:
    print(f"FAIL TO REJECT H0 (p >= {alpha}).")
    print("Conclusion: No significant difference was detected among the models on these datasets.")
    print("Next step: Stop here. Post-hoc tests are not warranted.")

REJECT H0 (p < 0.05).
Conclusion: At least one model performs significantly differently from the others.
Next step: Proceed to post-hoc tests (e.g., Wilcoxon-Holm).

Step 3: An Example of a Failed Test

What happens if we test a set of models that are basically identical? The Friedman test should protect us from claiming false victories. Let's create dummy data where all models score around 0.80.

In [6]:

# Data where models are statistically indistinguishable
tied_data_dict = {
  'Model_A': [0.80, 0.81, 0.79, 0.80, 0.82],
  'Model_B': [0.79, 0.80, 0.80, 0.81, 0.81],
  'Model_C': [0.81, 0.79, 0.81, 0.79, 0.80]
}

df_tied = pd.DataFrame(tied_data_dict, index=[f'Dataset_{i+1}' for i in range(5)])
tied_eval_data = EvaluationData(df_tied, higher_is_better=True)

# Run the test on the tied data
stat_tied, p_tied = friedman_test(tied_eval_data)

print("--- Tied Data Test ---")
print(f"P-value: {p_tied:.4f}")

if p_tied < alpha:
    print("Result: Reject H0")
else:
    print("Result: Fail to reject H0. Models are statistically equivalent.")

--- Tied Data Test ---
P-value: 0.8187
Result: Fail to reject H0. Models are statistically equivalent.
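
The reported p-value can be reproduced by hand, which makes the mechanics of the test concrete. The sketch below ranks the same dummy scores with pandas, applies the textbook chi-square form of the Friedman statistic to the rank sums, and cross-checks the result against `scipy.stats.friedmanchisquare`. It is an independent recomputation under the assumption that the chi-square approximation is used; it does not show how `friedman_test` is implemented.

```python
import math

import pandas as pd
from scipy.stats import friedmanchisquare

# Same dummy scores as in the tied-data cell above
scores = pd.DataFrame(
    {
        "Model_A": [0.80, 0.81, 0.79, 0.80, 0.82],
        "Model_B": [0.79, 0.80, 0.80, 0.81, 0.81],
        "Model_C": [0.81, 0.79, 0.81, 0.79, 0.80],
    },
    index=[f"Dataset_{i + 1}" for i in range(5)],
)
n_datasets, n_models = scores.shape

# Rank within each dataset: rank 1 is the best (highest) score
ranks = scores.rank(axis=1, ascending=False, method="average")
rank_sums = ranks.sum(axis=0)  # Model_A: 9, Model_B: 10, Model_C: 11

# Textbook Friedman statistic from rank sums (no tie correction needed here)
stat = (
    12.0 / (n_datasets * n_models * (n_models + 1)) * (rank_sums ** 2).sum()
    - 3.0 * n_datasets * (n_models + 1)
)

# With k - 1 = 2 degrees of freedom, the chi-square survival function is exp(-x / 2)
p_value = math.exp(-stat / 2.0)

# Cross-check with scipy, which takes one sequence of scores per model
scipy_stat, scipy_p = friedmanchisquare(*(scores[col] for col in scores.columns))

print(f"Manual: stat={stat:.4f}, p={p_value:.4f}")
print(f"SciPy:  stat={scipy_stat:.4f}, p={scipy_p:.4f}")
```

The rank sums of 9, 10 and 11 are nearly identical, so the statistic is close to zero and the p-value is large.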