02 - Global Statistical Validation (Friedman Test)
Before comparing models one-by-one, scientific rigor demands that we answer a simpler question: "Is there any significant difference among the models overall?"
In labicompare, this is done using the Friedman Test, a non-parametric alternative to repeated-measures ANOVA. It ranks the models within each dataset and tests whether their average ranks differ more than chance alone would explain.
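To make the ranking idea concrete, here is a minimal pure-Python sketch of the textbook Friedman statistic, $\chi^2_F = \frac{12}{N k (k+1)} \sum_j R_j^2 - 3N(k+1)$, where $R_j$ is the rank sum of model $j$ over $N$ datasets and $k$ models. The helper names (`average_ranks`, `friedman_statistic`) and the scores are hypothetical, the tie correction is omitted, and this is not labicompare's implementation.

```python
def average_ranks(scores, higher_is_better=True):
    """Rank one dataset's scores; rank 1 is best, tied scores share the mean rank."""
    order = sorted(range(len(scores)), key=lambda j: scores[j],
                   reverse=higher_is_better)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Extend the run while the next score equals the current one.
        end = i
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[i]]:
            end += 1
        mean_rank = (i + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for pos in range(i, end + 1):
            ranks[order[pos]] = mean_rank
        i = end + 1
    return ranks


def friedman_statistic(score_rows, higher_is_better=True):
    """Textbook Friedman chi-square statistic (no tie correction)."""
    n = len(score_rows)        # number of datasets
    k = len(score_rows[0])     # number of models
    rank_rows = [average_ranks(row, higher_is_better) for row in score_rows]
    rank_sums = [sum(row[j] for row in rank_rows) for j in range(k)]
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical accuracies: 4 datasets (rows) x 3 models (columns).
scores = [
    [0.90, 0.85, 0.80],
    [0.88, 0.86, 0.79],
    [0.75, 0.80, 0.70],
    [0.92, 0.90, 0.85],
]
stat = friedman_statistic(scores)
print(f"Friedman statistic: {stat:.4f}")  # 6.5000
```

Here the rank sums are 5, 7 and 12, so the third model is ranked last on every dataset and the statistic is correspondingly large for such a small sample.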
In this tutorial, you will learn:
- How to run the Friedman test in isolation.
- How to interpret the test statistic and the p-value.
- What happens when the models perform too similarly.
import pandas as pd
import numpy as np
from labicompare.core.data import EvaluationData
from labicompare.stats.friedman import friedman_test
# Let's recreate our sample data from Notebook 01
df = pd.read_csv("./results.csv", index_col="dataset")
eval_data = EvaluationData(df, higher_is_better=True)
display(eval_data.ranks_df)
WARNING:labicompare.core.data:Null values detected. Rows (or datasets) with NaNs will be removed to ensure the integrity of paired statistical tests and methods.
| dataset | FCN | ResNet | Inception | InceptionTime | LITE | LITETime | ROCKET | MultiROCKET |
|---|---|---|---|---|---|---|---|---|
| Adiac | 2.0 | 5.0 | 6.0 | 1.0 | 7.0 | 3.5 | 8.0 | 3.5 |
| AllGestureWiimoteX | 8.0 | 6.0 | 3.0 | 2.0 | 5.0 | 4.0 | 1.0 | 7.0 |
| AllGestureWiimoteY | 5.0 | 3.0 | 2.0 | 1.0 | 8.0 | 4.0 | 7.0 | 6.0 |
| AllGestureWiimoteZ | 8.0 | 7.0 | 2.0 | 1.0 | 6.0 | 4.5 | 3.0 | 4.5 |
| ArrowHead | 4.5 | 6.0 | 3.0 | 2.0 | 7.0 | 4.5 | 8.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Wine | 8.0 | 3.0 | 5.0 | 5.0 | 7.0 | 5.0 | 2.0 | 1.0 |
| WordSynonyms | 8.0 | 7.0 | 4.0 | 2.0 | 6.0 | 5.0 | 3.0 | 1.0 |
| Worms | 4.0 | 6.0 | 5.0 | 3.0 | 2.0 | 1.0 | 8.0 | 7.0 |
| WormsTwoClass | 8.0 | 7.0 | 6.0 | 4.0 | 3.0 | 5.0 | 1.0 | 2.0 |
| Yoga | 8.0 | 7.0 | 6.0 | 4.0 | 5.0 | 2.0 | 3.0 | 1.0 |
127 rows × 8 columns
Step 1: Running the Friedman Test
The friedman_test function takes our EvaluationData object and returns two values:
- Test Statistic: The calculated $\chi^2_F$ value.
- P-value: The probability of observing rank differences at least this large if all models actually performed the same.
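The link between the two return values can be sketched as follows, assuming `friedman_test` uses the common chi-square approximation with $k - 1$ degrees of freedom for $k$ models (this notebook does not confirm that). The helper `chi2_sf` is a hypothetical stdlib-only stand-in for a library routine such as `scipy.stats.chi2.sf`.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df = 2m: exp(-h) * sum_{i < m} h^i / i!
        m = df // 2
        term, total = 1.0, 1.0  # i = 0 term
        for i in range(1, m):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df = 2m + 1: erfc(sqrt(h)) + exp(-h) * sum_{i < m} h^(i + 1/2) / Gamma(i + 3/2)
    m = (df - 1) // 2
    total = sum(half ** (i + 0.5) / math.gamma(i + 1.5) for i in range(m))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total


# Eight models give 7 degrees of freedom under the stated assumption.
p = chi2_sf(198.4172, 7)
print(f"P-value: {p:.3e}")  # scientific notation keeps tiny values visible
```

A very large statistic yields a p-value that is tiny but still positive, so a fixed-point format such as `:.6f` rounds it to zero while scientific notation (`:.3e`) shows its magnitude.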
# Run the test
f_stat, f_p_value = friedman_test(eval_data)
print(f"Friedman Statistic: {f_stat:.4f}")
print(f"P-value: {f_p_value:.6f}")
Friedman Statistic: 198.4172
P-value: 0.000000
Step 2: Interpreting the Results
To make a decision, we compare the p-value against our chosen significance level ($\alpha$, usually 0.05).
- If p-value < $\alpha$: We reject the Null Hypothesis ($H_0$). There is a significant difference among the models.
- If p-value $\ge$ $\alpha$: We fail to reject $H_0$. The data show no significant difference among the models, which is not proof that they are equivalent.
alpha = 0.05
if f_p_value < alpha:
    print(f"REJECT H0 (p < {alpha}).")
    print("Conclusion: At least one model performs significantly differently from the others.")
    print("Next step: Proceed to post-hoc tests (e.g., Wilcoxon-Holm).")
else:
    print(f"FAIL TO REJECT H0 (p >= {alpha}).")
    print("Conclusion: No significant difference among the models was detected on these datasets.")
    print("Next step: Stop here. Post-hoc tests are not warranted.")
REJECT H0 (p < 0.05).
Conclusion: At least one model performs significantly differently from the others.
Next step: Proceed to post-hoc tests (e.g., Wilcoxon-Holm).
Step 3: An Example of a Failed Test
What happens if we test a set of models that are basically identical? The Friedman test should protect us from claiming false victories. Let's create dummy data where all models score around 0.80.
# Data where models are statistically indistinguishable
tied_data_dict = {
    'Model_A': [0.80, 0.81, 0.79, 0.80, 0.82],
    'Model_B': [0.79, 0.80, 0.80, 0.81, 0.81],
    'Model_C': [0.81, 0.79, 0.81, 0.79, 0.80],
}
df_tied = pd.DataFrame(tied_data_dict, index=[f'Dataset_{i+1}' for i in range(5)])
tied_eval_data = EvaluationData(df_tied, higher_is_better=True)
# Run the test on the tied data
stat_tied, p_tied = friedman_test(tied_eval_data)
print("--- Tied Data Test ---")
print(f"P-value: {p_tied:.4f}")
if p_tied < alpha:
    print("Result: Reject H0")
else:
    print("Result: Fail to reject H0. No significant difference detected among the models.")
--- Tied Data Test ---
P-value: 0.8187
Result: Fail to reject H0. No significant difference detected among the models.
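The reported p-value can be checked by hand, assuming the standard chi-square form of the statistic with $k - 1 = 2$ degrees of freedom. Ranking the three models within each of the $N = 5$ datasets (rank 1 for the highest score, no ties) gives rank sums $R_A = 9$, $R_B = 10$ and $R_C = 11$, so:

$$
\chi^2_F = \frac{12}{N k (k+1)} \sum_{j=1}^{k} R_j^2 - 3N(k+1) = \frac{12}{5 \cdot 3 \cdot 4}\left(9^2 + 10^2 + 11^2\right) - 3 \cdot 5 \cdot 4 = 60.4 - 60 = 0.4
$$

$$
p = P\left(\chi^2_2 \ge 0.4\right) = e^{-0.4/2} = e^{-0.2} \approx 0.8187
$$

This matches the output above. The rank sums are almost identical, so the statistic stays close to zero and the p-value stays far above $\alpha$.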