The illusory promise of the Aligned Rank Transform
Appendix. Additional experimental results
We present complementary results and new experiments that investigate additional scenarios. We also compare INT and RNK with other nonparametric methods. Unless explicitly mentioned in each section, we follow the experimental methodology presented in the main article. At the end of each section, we summarize our conclusions.
1 Results for \(\alpha = .01\)
Although we only presented results for \(\alpha = .05\) in the main article, we observe the same trends for other significance levels. The following figures show the Type I error rates of the methods for the \(4 \times 3\) within-subjects design when \(\alpha = .01\). Note that error rates are not proportional to the \(\alpha\) level.
For results from different experiments, we refer readers to our raw result files.
Conclusion
Results for \(\alpha = .01\) confirm that ART is not a robust method. It only behaves correctly when distributions are normal.
2 Main effects in the presence of interactions
In all experiments assessing Type I error rates reported in our article, we assumed no interaction effects. However, we also need to understand whether weak or strong interaction effects could affect the sensitivity of the methods in detecting main effects. This experiment evaluates the Type I error rate of the methods in the presence of an interaction effect alone, or in combination with a main effect. We focus again on the three two-factor experimental designs we evaluated in our previous experiment, setting the sample size to \(n = 20\).
To simulate populations in which interactions emerge in the absence of main effects, we examine perfectly symmetric cross-interactions. To this end, we slightly change the method we use to encode the levels of each factor, such that levels are uniformly positioned around 0. For a factor with three levels, we numerically encode the levels as \(\{-0.5, 0, 0.5\}\). For a factor with four levels, we encode them as \(\{-0.5, -0.1667, 0.1667, 0.5\}\).
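To make this construction concrete, the following R sketch builds the symmetric encoding and a latent response that contains only a cross-interaction. The coefficient name a12, the noise term, and the omission of a subject term are illustrative assumptions that mirror, but do not reproduce, the simulation model of the main article.

set.seed(1)
x1_levels <- seq(-0.5, 0.5, length.out = 4)   # {-0.5, -0.1667, 0.1667, 0.5}
x2_levels <- seq(-0.5, 0.5, length.out = 3)   # {-0.5, 0, 0.5}
design <- expand.grid(x1 = x1_levels, x2 = x2_levels)
a12 <- 4                                      # magnitude of the cross-interaction
design$latent <- a12 * design$x1 * design$x2 + rnorm(nrow(design))  # no main effects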
Interaction effect only. We first test how the interaction effect alone influences the Type I error rate on \(X_2\). Figure 5 presents our results. We observe that both PAR and ART fail for many configurations. Error rates are especially high for the \(4 \times 3\) within-subjects design and the \(2\times 4\) mixed design under the log-normal and exponential distributions. ART exhibits the worst performance. In contrast, RNK and INT keep error rates close to nominal levels. However, when interaction effects become sufficiently large (\(a_{12} > 4\)), they also start inflating error rates under the binomial distribution and the ordinal scales.
Interaction effect combined with main effect. We also evaluate the Type I error rate on \(X_2\) when the interaction effect is combined with a main effect on \(X_1\). Figure 6 presents our results. The error rates of ART and PAR now explode for all non-normal distributions and all three designs. However, the performance of RNK and INT is also affected. Their error rates become extremely low under continuous distributions, which suggests a lack of power in detecting small main effects when strong effects of other factors are combined with strong interactions. In contrast, the Type I error rates of the two methods explode under the binomial distribution and the ordinal scales. Interestingly, RNK exhibits the best performance in these tests.
Conclusion
When distributions are skewed, the presence of an interaction effect can cause ART to detect a non-existent main effect. ART is more sensitive to such problems than parametric ANOVA. The performance of INT and RNK can also be affected by the presence of interaction effects, especially when the interaction effect is combined with a main effect, and distributions are either binomial or ordinal. More generally, main effects should be interpreted with caution when strong interactions exist.
3 Missing data
We evaluate how missing data can affect the performance of the four methods. Specifically, we study a scenario where a random sample of \(10\%\) of the observations is missing. Missing observations lead to unbalanced designs. However, we emphasize that our scenario does not cover systematic imbalances due to missing data for specific levels of a factor.
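As an illustration, the following R sketch removes a random \(10\%\) of the rows of a long-format data frame of simulated observations; the data frame, its columns, and the lognormal responses are hypothetical.

set.seed(2)
n <- 20
df <- expand.grid(s = factor(1:n), x1 = factor(1:4), x2 = factor(1:3))
df$y <- rlnorm(nrow(df))
drop <- sample(nrow(df), size = round(0.10 * nrow(df)))  # random 10% of observations
df_missing <- df[-drop, ]   # unbalanced, but with no systematic pattern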
Main effects. Figure 7 presents Type I error rates for the main effect of \(X_2\) as the magnitude of the main effect of \(X_1\) increases. We observe that missing data cause the error rate of ART to increase even further, with a larger increase for the mixed design. This is also the case for the normal distribution. In contrast, the accuracy of the three other methods does not seem to be affected.
Interaction effects. Figure 8 and Figure 9 present Type I error rates for the interaction effect in the presence of a single main effect or two parallel main effects. The error levels for all methods, including ART, are now very similar to the ones observed with no missing data.
Conclusion
ART is sensitive to the presence of missing data: with \(10\%\) of the observations randomly missing, its Type I error rate for main effects increases further, even under normal distributions. The other methods do not seem to be affected by missing data. However, we emphasize that we only tested missing data that are randomly drawn from the complete set of observations. Results might be different if there were a systematic bias in the structure of the imbalance.
4 Log-normal distributions
In a different experiment, we evaluate log-normal distributions with a wider range of \(\sigma\) parameters (see Figure 10), in particular distributions with less variance, which exhibit a lower degree of skew.
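For reference, the R sketch below shows one simple, rank-preserving way to derive a log-normal response from a standard normal latent variable, where \(\sigma\) directly controls the degree of skew; this is an illustrative assumption rather than the exact transformation pipeline of our experiments.

set.seed(3)
z <- rnorm(1000)        # latent normal responses
sigma <- 0.5            # smaller sigma yields less variance and lighter skew
y <- exp(sigma * z)     # log-normal response with sdlog = sigma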
Main effects. Figure 11 presents our results on Type I error rates for main effects. As expected, ART’s inflation of error rates is less serious when distributions are closer to normal, and becomes worse as distributions become more skewed.
Interaction effects. We observe similar patterns for the Type I error rate of the interaction effect in the presence of a single main effect (Figure 12) or two parallel main effects (Figure 13). As shown in Figure 13, any advantage of ART over RNK and INT for testing interactions disappears even when distributions exhibit light skew. We also observe again that the performance of RNK and INT remains identical across all skew levels.
Conclusion
ART’s robustness issues become less severe as log-normal distributions become less skewed and thus closer to normal. However, even under distributions with light skew, ART’s Type I error rates for interactions reach higher levels than those of both INT and RNK when parallel main effects are present.
5 Binomial distributions
We also evaluate a wider range of parameters for the binomial distribution. We focus on the lower range of probabilities \(p\). However, we expect results to be identical for their symmetric probabilities \(1-p\). Specifically, we test \(p=.05\), \(.1\), and \(.2\), and for each, we consider \(k=5\) and \(10\) task repetitions (Bernoulli trials).
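The R sketch below illustrates one way to obtain such responses by quantile-mapping a latent standard normal variable to a binomial distribution with \(k\) trials and success probability \(p\); the mapping is shown for illustration and is not necessarily the exact procedure of our experiments.

set.seed(4)
z <- rnorm(1000)                           # latent normal responses
k <- 5; p <- .05                           # repetitions and success probability
y <- qbinom(pnorm(z), size = k, prob = p)  # rank-preserving map to binomial counts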
Main effects. We present our results for the main effect in Figure 14. We observe that ART’s Type I error rates increase as the number of repetitions decreases and the probability of success approaches zero, reaching very high levels when \(k=5\) and \(p=.05\). This trend is consistent across designs. The other methods maintain low error rates. However, their error rates fall below nominal levels when the magnitude of the effect on \(X_1\) grows beyond a certain threshold, indicating a loss of power in these cases.
Interaction effects. Figure 15 shows similar patterns for the Type I error rate of the interaction effect in the presence of a single main effect. When both main effects increase beyond a certain level (see Figure 16), all methods seem to fail to control the error rate. ART again demonstrates the worst behavior, systematically inflating error rates even when main effects are absent. In several cases, RNK performs better than INT.
Conclusion
ART is extremely problematic under binomial distributions, raising Type I error rates to very high levels even in the absence of other effects. Testing interactions in the presence of parallel main effects can be problematic for all other methods.
6 Ordinal data
Given the frequent use of ART with ordinal data, we evaluate our complete set of ordinal scales, based on both equidistant and flexible thresholds, with additional experimental designs.
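As an illustration of the two threshold schemes, the R sketch below discretizes a latent normal variable into a five-point ordinal scale; the specific cutoff values of the flexible scheme are arbitrary examples, not the thresholds used in our experiments.

set.seed(5)
z <- rnorm(500)   # latent normal responses
equidistant <- cut(z, breaks = c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf), labels = FALSE)
flexible    <- cut(z, breaks = c(-Inf, -2.0, -1.2, 1.3, 1.8, Inf), labels = FALSE)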
Main effects. Figure 17 presents Type I error rates for the main effect. ART preserves error rates at nominal levels under the \(2 \times 3\) between-subjects design and the \(2 \times 3\) mixed design, as long as thresholds are equidistant. Under the two within-subjects designs, it inflates error rates, especially when ordinal scales have fewer levels and flexible thresholds.
Interaction effects. Figure 18 and Figure 19 present Type I error rates for the interaction effect in the presence of a single main effect or two parallel main effects. These results lead to similar conclusions. Even in cases where ART keeps error rates close to nominal levels (e.g., under the between-subjects design with equidistant thresholds), the performance of parametric ANOVA is consistently better.
Conclusion
ART’s inflation of Type I error rates with ordinal data is confirmed across a range of designs. For the between-subjects and mixed designs, the problem primarily concerns ordinal scales with flexible thresholds. However, for within-subjects designs, ART also inflates error rates for scales with equidistant thresholds, particularly when the number of levels is as low as five or seven. Again, all methods may fail to correctly infer interactions when parallel main effects exceed a certain magnitude.
7 ART with median alignment
We evaluate a modified implementation of ART (ART-MED), where we use medians instead of means to align ranks. This approach draws inspiration from results by Salter and Fawcett (1993), showing that median alignment corrects ART’s unstable behavior under the Cauchy distribution. We only test the \(4 \times 3\) within-subjects design for sample sizes \(n=10\), \(20\), and \(30\). For this experiment, we omit the RNK method and only present results for non-normal distributions.
We emphasize that Salter and Fawcett (1993) only apply mean and median alignment to interactions. Our implementation for main effects is based on the alignment approach of Wobbrock et al. (2011), where we simply replace means by medians.
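To make the alignment step concrete, the following R sketch applies median alignment to the interaction of a two-factor design, in the spirit of Salter and Fawcett (1993): per-level medians of both factors are stripped from the response before ranking. The function name, the omission of subject terms from the alignment, and the toy data are simplifying assumptions, not our full implementation.

median_align_interaction <- function(y, x1, x2) {
  med1 <- ave(y, x1, FUN = median)        # per-level medians of x1
  med2 <- ave(y, x2, FUN = median)        # per-level medians of x2
  aligned <- y - med1 - med2 + median(y)  # remove main effects, keep interaction
  rank(aligned)                           # rank the aligned responses
}

# Example on toy data; only the x1:x2 term of the ANOVA is interpreted.
set.seed(6)
df <- expand.grid(s = factor(1:20), x1 = factor(1:4), x2 = factor(1:3))
df$y <- rcauchy(nrow(df))
df$r <- median_align_interaction(df$y, df$x1, df$x2)
summary(aov(r ~ x1 * x2 + Error(s / (x1 * x2)), data = df))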
Main effects. Our results presented in Figure 20 demonstrate that median alignment (ART-MED), or at least our implementation of the method, is not appropriate for testing main effects. Although Type I error rates are now lower for the Cauchy distribution compared to the original method, they are still above nominal levels. In addition, they are substantially higher for all other distributions.
Interaction effects. In contrast, median alignment works surprisingly well for interactions, correcting deficiencies of ART, especially when main effects are absent or weak. Figure 21 and Figure 22 present our results. Despite this improved performance, we cannot recommend using the method because it still cannot compete with INT. Additionally, its advantages over parametric ANOVA are only apparent for the Cauchy distribution.
Conclusion
Using median instead of mean alignment with ART substantially improves the method’s performance in testing interactions across all the distributions we tested. However, we cannot recommend it, as the method is still less robust than INT. Furthermore, it is unclear how to apply median alignment for testing main effects, since using medians with the alignment method of Wobbrock et al. (2011) results in extremely high error rates.
8 Nonparametric tests in single-factor designs
We compare PAR, RNK, and INT to nonparametric tests for within- and between-subjects single-factor designs, where the factor has two, three, or four levels. Depending on the design, we use different nonparametric tests. For within-subjects designs, we use the Wilcoxon signed-rank test if the factor has two levels (2 within) and the Friedman test if the factor has three (3 within) or four (4 within) levels. For between-subjects designs (2 between, 3 between, and 4 between), we use the Kruskal–Wallis test.
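For reference, the R sketch below shows the corresponding test calls on toy data, together with the rank (RNK) and inverse normal (INT) transforms analyzed with a standard ANOVA. The int() helper implements one common variant of the rank-based inverse normal transform and is an illustrative assumption rather than the exact formulation used in our experiments.

set.seed(7)
n <- 20
y1 <- rlnorm(n); y2 <- rlnorm(n)      # paired responses (2 within)
wilcox.test(y1, y2, paired = TRUE)    # Wilcoxon signed-rank test

df <- data.frame(s = factor(rep(1:n, each = 3)),
                 x = factor(rep(c("a", "b", "c"), times = n)),
                 y = rlnorm(3 * n))
friedman.test(y ~ x | s, data = df)   # Friedman test (within-subjects)
kruskal.test(y ~ x, data = df)        # Kruskal-Wallis test (between-subjects)

int <- function(v) qnorm((rank(v) - 0.5) / length(v))  # inverse normal transform
summary(aov(rank(y) ~ x, data = df))  # RNK, treating x as between-subjects
summary(aov(int(y) ~ x, data = df))   # INT, treating x as between-subjects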
Power. Figure 23 compares the power of the various methods as the magnitude of the main effect increases, where we use the abbreviation NON to designate a nonparametric test. We observe that INT and, to a lesser extent, RNK generally exhibit better power than the nonparametric tests. Differences are more pronounced for within-subjects designs, corroborating Conover’s (2012) observation that the rank transformation results in a test that is superior to the Friedman test under certain conditions.
We expect that the accuracy of ANOVA on rank-transformed values will decrease with smaller samples. However, our tests with smaller samples of \(n=10\) show that INT remains robust and still outperforms other nonparametric methods. Although it is possible to couple INT with permutation testing for higher accuracy (Beasley, Erickson, and Allison 2009), we have not explored this possibility here.
Type I error rate under equal and unequal variances. Figure 24 presents the rate of positives under conditions of equal (\(r_{sd} = 0\)) and unequal variances (\(r_{sd} > 0\)). While this rate can be considered a Type I error rate when variances are equal, interpreting it under other conditions requires special attention because the hypothesis of interest may differ among methods. Parametric ANOVA is particularly sensitive to unequal variances when distributions are skewed because it tests differences among means. While the normal distributions of the latent space have the same means, this is not the case with the skewed distributions of the transformed variable, which have the same median but different means. All nonparametric methods we tested use ranks, which preserve medians and mitigate this problem. However, their rate of positives can still exceed \(5\%\) under certain conditions.
For between-subjects designs, we observe that the Kruskal–Wallis test and RNK yield very similar results. This is not surprising, as RNK is known to be a good approximation of the Kruskal–Wallis test (Conover 2012). INT’s positive rates are similar, although slightly higher under the binomial distribution. For within-subjects designs, differences among methods are more pronounced. The Wilcoxon signed-rank test (2 within) inflates rates well above \(5\%\), demonstrating that the test is not a pure test of medians. In contrast, the Friedman test (3 within and 4 within) provides the best control among all methods.
Figure 25 presents the same results but for \(\alpha = .01\). Discrepancies among the methods are now more pronounced, and we notice again that the Friedman test keeps rates closer to the nominal level of \(1\%\) compared to INT and RNK. Nevertheless, in addition to their greater power compared to the Friedman test, RNK and INT present other advantages, such as the possibility of using common ANOVA-based procedures to partly correct issues associated with unequal variances. For instance, we ran an experiment in which we applied a Greenhouse–Geisser correction when Mauchly’s sphericity test detected a sphericity violation (\(\alpha = .05\)). We found that this correction brings the error rates of INT close to nominal levels for continuous distributions, such as the normal and log-normal distributions. In the case of the binomial and ordinal distributions, error rates are also substantially reduced, well below those of the Friedman test, although not reaching nominal levels.
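A minimal sketch of this procedure with the ez package is shown below; the choice of package, the toy data, and the column names are assumptions, and any ANOVA routine that reports Mauchly's test together with sphericity corrections would serve equally well. In our experiment, we used the corrected p-value only when Mauchly's test was significant at \(\alpha = .05\).

library(ez)
set.seed(8)
n <- 20
df <- data.frame(s = factor(rep(1:n, each = 4)),
                 x = factor(rep(c("a", "b", "c", "d"), times = n)),
                 y = rlnorm(4 * n))
df$y_int <- qnorm((rank(df$y) - 0.5) / length(df$y))   # INT-transformed response
res <- ezANOVA(data = df, dv = y_int, wid = s, within = x)
res   # reports Mauchly's test and the Greenhouse-Geisser corrected p-value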
Conclusion
We do not see significant benefits in using dedicated nonparametric tests over RNK or INT. INT can replace nonparametric tests even for single-factor designs. If, after transforming the data, the assumptions of homoscedasticity or sphericity are still not met, applying common correction procedures (e.g., a Greenhouse–Geisser correction for sphericity violations) to the transformed data can reduce the risk of Type I errors.
9 ANOVA-type statistic (ATS)
We compare PAR, RNK, and INT to the ANOVA-type statistic (ATS) (Brunner and Puri 2001) for two-factor designs. We use its implementation in the R package nparLD (Noguchi et al. 2012), which does not support between-subjects designs. Thus, we only evaluate it for the \(4 \times 3\) within-subjects and the \(2 \times 4\) mixed designs.
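As an illustration, the R sketch below fits the ATS to toy data for the \(4 \times 3\) within-subjects design through nparLD's formula interface; the toy data and the choice of a lognormal response are assumptions made for demonstration.

library(nparLD)
set.seed(9)
n <- 20
df <- expand.grid(s = factor(1:n), x1 = factor(1:4), x2 = factor(1:3))
df$y <- rlnorm(nrow(df))
fit <- nparLD(y ~ x1 * x2, data = df, subject = "s", description = FALSE)
fit$ANOVA.test   # ANOVA-type statistics for the main effects and the interaction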
Type I error rates: Main effects. Figure 26 presents Type I error rates for the main effect of \(X_2\). Under the mixed design, RNK, INT, and ATS exhibit very similar error rates, which are close to nominal levels. Under the within-subjects design, the error rates of ATS tend to be slightly above \(5\%\). Additionally, unlike the other methods, whose error rates drop well below \(5\%\) when the effect of \(X_1\) becomes stronger under the binomial and ordinal scales, the power of ATS does not seem to be affected in these cases.
Figure 27 presents results for the main effect of \(X_1\). The error rates of ATS are now slightly inflated under the mixed design. The other methods exhibit the same trends as for the other factor.
Type I error rates: Interactions. Figure 28 presents Type I error rates for the interaction in the presence of a single main effect. Results are again very similar for all three nonparametric methods under the mixed design. In contrast, the error rates of ATS tend to be lower than nominal levels under the within-subjects design, often falling below \(4\%\).
When two parallel main effects are present, ATS and RNK lead to very similar trends (see Figure 29). Overall, INT appears to be more robust, with the exception of the binomial distribution, under which its error rates are higher.
Power: Main effects. As shown in Figure 30 and Figure 31, ATS appears to be the most powerful method for detecting effects of \(X_1\) under the mixed design. However, in all other situations, it has less power than both INT and RNK.
Power: Interactions. Figure 32 shows results on power for interactions. INT emerges again as the most powerful method. The power of ATS is particularly low under the within-subjects design.
Conclusion
Although ATS appears to be a valid alternative, it does not offer clear performance advantages over INT, which is also simpler and more versatile.
10 Generalizations of nonparametric tests
Finally, we evaluate the generalizations of nonparametric tests recommended by Lüpsen (2018; 2023) as implemented in his np.anova function (Lüpsen 2021). Specifically, we evaluate the generalization of the van der Waerden test (VDW) and the generalization of the Kruskal–Wallis and Friedman tests (KWF). Their implementation uses R’s aov function and requires specifying random slopes in the error term of the model, that is, using Error(s/(x1*x2)) for the two-factor within-subjects design and Error(s/x2) for the mixed design, where s is the subject identifier variable. We also used the aov function and the same formulation of the error term for all other methods.
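For reference, the corresponding aov formulations are illustrated below on toy data; the data frame and its columns are hypothetical, and y may stand for the raw, rank-transformed, or otherwise transformed response.

set.seed(10)
n <- 20
df <- expand.grid(s = factor(1:n), x1 = factor(1:4), x2 = factor(1:3))
df$y <- rlnorm(nrow(df))
# Two-factor within-subjects design
summary(aov(y ~ x1 * x2 + Error(s / (x1 * x2)), data = df))
# Mixed design with x2 as the within-subjects factor: replace the error term
# with Error(s / x2)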
Type I error rates: Main effects. Figure 33 presents Type I error rates for the main effect of \(X_2\). While all methods perform well under the within-subjects and mixed designs, the error rates of VDW and KWF drop drastically as the effect of \(X_1\) grows under the between-subjects design. We will see below that the power of these methods evaporates in these cases.
Type I error rates: Interactions. Figure 34 presents the Type I error rates for the interaction, in the presence of a single main effect. Once again, the error rates of VDW and KWF decrease rapidly in both the between-subjects and mixed designs. Figure 35 shows the results when both main effects are present. Under the within-subjects design, KWF is the poorest-performing technique. While VDW outperforms RNK, it remains inferior to INT. For the between-subjects and mixed designs, KWF and VDW yield low or near-zero error rates when main effects are large, likely due to their extreme loss of power in these scenarios.
Power: Main effects. Figure 36 presents power results for detecting the main effect of \(X_1\). For the within-subjects and mixed designs, KWF and VDW exhibit lower power than RNK and INT, but differences are generally small. Although VDW appears as powerful as INT under the between-subjects design, this only occurs when the effect of the second factor is zero. As shown in Figure 37, the power of both KWF and VDW drops drastically as the effect of \(X_2\) increases for this design.
Figure 38 also presents results for the main effect of \(X_1\), where we observe once again that the generalized tests cannot compete with the more powerful INT, or even with RNK.
Power: Interactions. We also present results on the power of the methods to detect interactions in Figure 39, confirming the advantage of INT across all designs. Figure 40 provides a clearer picture of how power is affected by the presence of a main effect. We observe that the power of all methods drops as the main effect of \(X_2\) increases, but this trend is more pronounced for KWF and VDW, particularly for the between-subjects and mixed designs.
Conclusion
Our results do not support Lüpsen’s (2018; 2023) conclusions. The behavior of the generalized nonparametric tests presents issues in numerous scenarios. While these tests exhibit lower error rates under specific conditions, this is due to a significant loss of power when other effects are at play. Therefore, we advise against the use of these methods.