Estimating Effect Size for Non-Parametric Data

Non-parametric analysis is used when the data are not normally distributed (e.g., ranked data, skewed data, rates). Because these methods make no assumption about the shape of the underlying distribution, non-parametric analysis is said to be ‘distribution free’.

Effect Size for Non-parametric Analysis

The ks.test function in R (part of the stats package) performs the Kolmogorov-Smirnov test. It compares the empirical cumulative distribution function (eCDF) of a variable either to a reference CDF (the one-sample test) or to the eCDF of another variable (the two-sample test) to test for a significant difference between the distributions. The D statistic is the maximum vertical distance between the two distribution functions.
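To make the definition of D concrete, here is a minimal sketch that computes the two-sample statistic by hand and compares it with the value returned by ks.test. The samples, the seed, and the names x, y, Fx, Fy, and grid are illustrative assumptions, not part of the analysis in this post.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

# Two illustrative skewed samples (shape values chosen arbitrarily)
x <- rgamma(150, shape = 2)
y <- rgamma(150, shape = 3)

Fx <- ecdf(x)  # step function: proportion of x values <= t
Fy <- ecdf(y)

# Both eCDFs only jump at observed values, so it is enough to
# evaluate the gap at the pooled sample points
grid <- sort(c(x, y))
d_manual <- max(abs(Fx(grid) - Fy(grid)))

# D as reported by the two-sided two-sample test
d_ks <- unname(ks.test(x, y)$statistic)

all.equal(d_manual, d_ks)
```

Because D is a difference between two proportions, it always lies between 0 (identical eCDFs) and 1 (no overlap between the samples).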

Since the D statistic is a difference score, it could be used as a measure of effect size (recall that Cohen’s d is a standardized difference score). To see whether one effect is significantly greater than another, we could bootstrap confidence intervals around the respective D statistics and check whether the intervals overlap.
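The bootstrap idea can be sketched in base R before turning to the worked example. This is only an illustration under stated assumptions: the samples x and y are hypothetical, the helper ks_D is made up for this sketch, each sample is resampled independently, and a simple percentile interval is used. It is not the procedure used later in this post, which relies on the boot package.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

x <- rexp(100, rate = 1)    # hypothetical sample 1
y <- rexp(100, rate = 0.5)  # hypothetical sample 2

# Hypothetical helper: two-sample D, the largest vertical gap
# between the two eCDFs, evaluated at the pooled sample points
ks_D <- function(a, b) {
  grid <- c(a, b)
  max(abs(ecdf(a)(grid) - ecdf(b)(grid)))
}

# Resample each sample with replacement and recompute D each time
n_boot <- 1000
boot_D <- replicate(n_boot, {
  ks_D(sample(x, replace = TRUE), sample(y, replace = TRUE))
})

# Percentile interval: the middle 95% of the bootstrapped D values
ci <- quantile(boot_D, probs = c(0.025, 0.975))
ci
```

Repeating this for a second pair of samples gives a second interval, and the two intervals can then be compared for overlap.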

Comparing Effect Size of Two Distributions

To compare the effect size of two distributions, we will first generate a gamma-distributed variable with a shape parameter of ‘1.2’ and a gamma-distributed variable with a shape parameter of ‘1.4’ and compare the two. Then, we will generate a gamma-distributed variable with a shape parameter of ‘2.5’ and compare it with the ‘1.2’ shape variable. A gamma distribution with a shape parameter greater than ‘1’ is unimodal and right-skewed. We would expect the effect size (that is, the difference between the two eCDFs) to be significantly greater for the second comparison.

q2 <- rgamma(150, 1.2)  # shape = 1.2
q3 <- rgamma(150, 1.4)  # shape = 1.4
# Overlay the two eCDFs on a common x-axis
plot(ecdf(q2), col = "red", xlim = range(c(q2, q3)), main = "Compare Q2 with Q3")
plot(ecdf(q3), col = "blue", add = TRUE, lty = "dashed")

ks.test(q2, q3, alternative = "greater")

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  q2 and q3
## D^+ = 0.12, p-value = 0.1153
## alternative hypothesis: the CDF of x lies above that of y
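The one-sided statistic D^+ reported above can also be reproduced by hand, which helps clarify the direction of the test. The sketch below uses freshly simulated samples (the seed and the names x, y, Fx, Fy are illustrative assumptions) and checks that D^+ is the largest amount by which the eCDF of x exceeds the eCDF of y.

```r
set.seed(10)  # arbitrary seed, only so the sketch is repeatable

x <- rgamma(150, shape = 1.2)
y <- rgamma(150, shape = 2.5)

Fx <- ecdf(x)
Fy <- ecdf(y)
grid <- sort(c(x, y))

# Signed gap: positive where the eCDF of x sits above the eCDF of y
d_plus_manual <- max(Fx(grid) - Fy(grid))

# D^+ as reported by ks.test with alternative = "greater"
d_plus_ks <- unname(ks.test(x, y, alternative = "greater")$statistic)

all.equal(d_plus_manual, d_plus_ks)
```

A large D^+ therefore means the eCDF of x rises faster than that of y, which happens when x tends to take smaller values than y.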

# With alternative = "greater", the alternative hypothesis is that the CDF of x
# lies above (and therefore to the left of) the CDF of y, i.e. x tends to take
# smaller values than y.
q4 <- rgamma(150, 2.5)  # shape = 2.5

plot(ecdf(q2), col = "red", xlim = range(c(q2, q4)), main = "Compare Q2 with Q4")
plot(ecdf(q4), col = "blue", add = TRUE, lty = "dashed")

ks.test(q2, q4, alternative = "greater")

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  q2 and q4
## D^+ = 0.45333, p-value = 4.094e-14
## alternative hypothesis: the CDF of x lies above that of y
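The sample D^+ values above vary from run to run, so it can help to look at the population-level gap between the gamma CDFs themselves. The sketch below approximates the maximum distance between two gamma CDFs on a fine grid using pgamma. The helper pop_D, the grid limits, and the step size are assumptions made for illustration, and the default rate of 1 matches the rgamma calls above.

```r
# Grid of x values; the upper limit of 20 is an arbitrary choice that
# covers essentially all of the probability mass for these shapes
x_grid <- seq(0, 20, by = 0.001)

# Hypothetical helper: largest vertical gap between two gamma CDFs
pop_D <- function(shape_a, shape_b, grid = x_grid) {
  max(abs(pgamma(grid, shape = shape_a) - pgamma(grid, shape = shape_b)))
}

D_small <- pop_D(1.2, 1.4)  # first comparison
D_large <- pop_D(1.2, 2.5)  # second comparison

c(D_small = D_small, D_large = D_large)
```

Since a larger shape parameter shifts the gamma CDF to the right at every point, the gap for the ‘1.2’ versus ‘2.5’ comparison is larger than for ‘1.2’ versus ‘1.4’, in line with the sample results.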

Bootstrap 95% Confidence Interval around the D Statistic

To see if the effect sizes are significantly different for these two comparisons, we could bootstrap a 95% confidence interval around both D statistics to see if there is any overlap. If there is no overlap, then we could conclude that there is a significant difference in the effect sizes. For this procedure, we will use the boot package in R.

The bootstrapped D statistics are approximately normally distributed (the histograms produced by the code below can be used to check this), which allows us to construct the confidence interval based on a normal distribution.

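library(boot)
df1 <- rgamma(150, 1.2)  # refer to this as rgamma '1.2'
df2 <- rgamma(150, 1.4)  # refer to this as rgamma '1.4'
df3 <- rgamma(150, 2.5)  # refer to this as rgamma '2.5'
df4 <- data.frame(df1, df2, df3)
View(df4)  # optional: inspect the data (interactive sessions only)
# Statistic for boot(): D^+ for rgamma '1.2' vs rgamma '1.4',
# computed on the rows selected by the resampled indices i
fc <- function(d, i) {
    d2 <- d[i, ]
    eval.test1 <- ks.test(d2[, 1], d2[, 2], alternative = "greater")
    D <- eval.test1$statistic[[1]]
    return(D)
}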
set.seed(374)
boot.nat.log <- boot(df4, fc, R = 5000)
boot.nat.log

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = df4, statistic = fc, R = 5000)
## 
## 
## Bootstrap Statistics :
##     original     bias    std. error
## t1*     0.18 0.02797067  0.04981018
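The three columns in the boot summary can be recomputed from the returned object, which makes their meaning explicit. The sketch below is self-contained and uses its own simulated data, so its numbers will not match the output above; the names dat, ks_D, and b, the seed, and R = 200 are assumptions chosen for illustration.

```r
library(boot)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
dat <- data.frame(a = rgamma(150, shape = 1.2),
                  b = rgamma(150, shape = 1.4))

# Statistic function in the form boot() expects: data plus row indices
ks_D <- function(d, i) {
  d2 <- d[i, ]
  # exact = FALSE only affects the p-value, which is not used here
  unname(ks.test(d2$a, d2$b, alternative = "greater", exact = FALSE)$statistic)
}

b <- boot(dat, ks_D, R = 200)

original    <- b$t0               # D^+ on the original data
bias_manual <- mean(b$t) - b$t0   # mean bootstrap D^+ minus original
se_manual   <- sd(b$t)            # spread of the bootstrap D^+ values

c(original = original, bias = bias_manual, std.error = se_manual)
```

A positive bias, as in the output above, indicates that the resampled D values tend to be larger than the D computed on the original data.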

hist(boot.nat.log$t, xlab = "D Statistic", main = "Distribution of Bootstrapped D Statistics")

boot.ci(boot.out = boot.nat.log, type = c("norm", "basic", "perc", "bca"))

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 5000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot.nat.log, type = c("norm", "basic", "perc", 
##     "bca"))
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   ( 0.0544,  0.2497 )   ( 0.0533,  0.2467 )  
## 
## Level     Percentile            BCa          
## 95%   ( 0.1133,  0.3067 )   ( 0.0553,  0.2467 )  
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable
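The normal interval above is centered on the bias-corrected estimate rather than on the original D of 0.18. As a worked check, the arithmetic below plugs in the values printed in the bootstrap summary (original 0.18, bias 0.02797067, std. error 0.04981018); the variable names are illustrative.

```r
t0   <- 0.18        # "original" from the boot summary
bias <- 0.02797067  # "bias" from the boot summary
se   <- 0.04981018  # "std. error" from the boot summary

z <- qnorm(0.975)   # about 1.96 for a 95% interval

center  <- t0 - bias  # bias-corrected estimate
ci_norm <- c(lower = center - z * se, upper = center + z * se)

round(ci_norm, 4)  # 0.0544 0.2497, consistent with the Normal interval above
```

The percentile interval, by contrast, is taken directly from the quantiles of the bootstrapped values without this correction, which is one plausible reason it sits higher than the other three intervals.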

bootConfInt <- boot.ci(boot.out = boot.nat.log, type = "norm")
# 'D' statistic for the second comparison between rgamma '1.2' and rgamma '2.5'
fc2 <- function(d, i) {
    d2 <- d[i, ]
    eval.test1 <- ks.test(d2[, 1], d2[, 3], alternative = "greater")
    D <- eval.test1$statistic[[1]]
    return(D)
}
set.seed(374)
boot.nat.log2 <- boot(df4, fc2, R = 500)
boot.nat.log2

## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = df4, statistic = fc2, R = 500)
## 
## 
## Bootstrap Statistics :
##      original  bias    std. error
## t1* 0.4866667   0.023  0.04157632

hist(boot.nat.log2$t, xlab = "D Statistic", main = "Distribution of Bootstrapped D Statistics")

boot.ci(boot.out = boot.nat.log2, type = c("norm", "basic", "perc", "bca"))

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 500 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot.nat.log2, type = c("norm", "basic", "perc", 
##     "bca"))
## 
## Intervals : 
## Level      Normal              Basic         
## 95%   ( 0.3822,  0.5452 )   ( 0.3800,  0.5400 )  
## 
## Level     Percentile            BCa          
## 95%   ( 0.4333,  0.5933 )   ( 0.4000,  0.5400 )  
## Calculations and Intervals on Original Scale
## Warning : BCa Intervals used Extreme Quantiles
## Some BCa intervals may be unstable

bootConfInt2 <- boot.ci(boot.out = boot.nat.log2, type = "norm")
sprintf("The 'D' statistic for the rgamma '1.2' and rgamma '1.4' comparison is %f, with a 95 percent confidence interval [ %f, %f]. The 'D' statistic for the rgamma '1.2' and rgamma '2.5' comparison is %f, with a 95 percent confidence interval of [ %f, %f].", boot.nat.log$t0, bootConfInt$normal[2], bootConfInt$normal[3], boot.nat.log2$t0, bootConfInt2$normal[2], bootConfInt2$normal[3])

## [1] "The 'D' statistic for the rgamma '1.2' and rgamma '1.4' comparison is 0.180000, with a 95 percent confidence interval [ 0.054403, 0.249655]. The 'D' statistic for the rgamma '1.2' and rgamma '2.5' comparison is 0.486667, with a 95 percent confidence interval of [ 0.382179, 0.545155]."
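To finish the comparison, the two normal intervals reported above can be checked for overlap. The sketch below hard-codes the interval limits from the output; the helper intervals_overlap is a hypothetical name introduced here.

```r
ci_small <- c(0.054403, 0.249655)  # rgamma '1.2' vs rgamma '1.4'
ci_large <- c(0.382179, 0.545155)  # rgamma '1.2' vs rgamma '2.5'

# Hypothetical helper: two closed intervals overlap when each one
# starts before the other one ends
intervals_overlap <- function(a, b) {
  a[1] <= b[2] && b[1] <= a[2]
}

intervals_overlap(ci_small, ci_large)  # FALSE
```

The upper limit of the first interval (0.250) lies below the lower limit of the second (0.382), so the intervals do not overlap and the effect size for the second comparison can be taken as significantly larger. Note that this check is conservative: intervals that do overlap would not by themselves show that two effect sizes are equal.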