Non-parametric analysis is used when the data are not normally distributed (e.g., ranked data, skewed data, rates). For this reason, non-parametric analysis is said to be ‘distribution free’.
The ks.test() function in R compares the empirical cumulative distribution function (eCDF) of a variable either to a reference CDF (the one-sample test) or to the eCDF of another variable (the two-sample test) to test for a significant difference between the distributions. The D statistic measures the maximum distance between the two distributions.
Since the D statistic is a difference score, it can be used to measure the magnitude of an effect (recall that Cohen’s d is also a standardized difference score). To see whether one effect is significantly greater than another, we can bootstrap confidence intervals around the respective D statistics and check whether the intervals overlap.
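As a quick illustration of what D measures, the one-sided statistic can be computed by hand as the largest vertical gap between two eCDFs. This is a minimal sketch (the variable names and seed are illustrative, not part of the analysis below):

```r
set.seed(42)
x <- rgamma(150, 1.2)
y <- rgamma(150, 1.4)
Fx <- ecdf(x)  # eCDFs are right-continuous step functions
Fy <- ecdf(y)
grid <- sort(c(x, y))             # the gap can only change at observed values
D.plus <- max(Fx(grid) - Fy(grid))
D.plus  # matches ks.test(x, y, alternative = "greater")$statistic
```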
To compare the effect sizes of two comparisons, we will first generate a gamma-distributed variable with a shape parameter of ‘1.2’ and a gamma-distributed variable with a shape parameter of ‘1.4’ and compare the two. Then, we will generate a gamma-distributed variable with a shape parameter of ‘2.5’ and compare it with the ‘1.2’ shape variable. A gamma distribution with a shape parameter greater than ‘1’ produces a unimodal, skewed distribution. We would expect the effect size (that is, the maximum distance between the two eCDFs) to be significantly greater for the second comparison.
q2 <- rgamma(150, 1.2)
q3 <- rgamma(150, 1.4)
plot(ecdf(q2), col = "red", xlim = range(c(q2, q3)), main = "Compare Q2 with Q3")
plot(ecdf(q3), col = "blue", add = TRUE, lty = "dashed")
ks.test(q2, q3, alternative = "greater")
##
## Two-sample Kolmogorov-Smirnov test
##
## data: q2 and q3
## D^+ = 0.12, p-value = 0.1153
## alternative hypothesis: the CDF of x lies above that of y
## Note: a variable x is stochastically greater than y when the CDF of y
## lies to the left of (and therefore above) the CDF of x.
q4 <- rgamma(150, 2.5)
plot(ecdf(q2), col = "red", xlim = range(c(q2, q4)), main = "Compare Q2 with Q4")
plot(ecdf(q4), col = "blue", add = TRUE, lty = "dashed")
ks.test(q2, q4, alternative = "greater")
##
## Two-sample Kolmogorov-Smirnov test
##
## data: q2 and q4
## D^+ = 0.45333, p-value = 4.094e-14
## alternative hypothesis: the CDF of x lies above that of y
To see whether the effect sizes differ significantly between these two comparisons, we can bootstrap a 95% confidence interval around each D statistic and check for overlap. If the intervals do not overlap, we can conclude that there is a significant difference in the effect sizes. For this procedure, we will use the boot package in R.
To obtain a 95% interval around the D statistic, you can perform a bootstrapping procedure. You’ll notice that the bootstrapped D statistic is approximately normally distributed, which allows us to construct the confidence interval from a normal approximation.
library(boot)
df1 <- rgamma(150, 1.2) #refer to this as rgamma '1.2'
df2 <- rgamma(150, 1.4) #refer to this as rgamma '1.4'
df3 <- rgamma(150, 2.5) #refer to this as rgamma '2.5'
df4 <- data.frame(df1, df2, df3)
View(df4)
fc <- function(d, i){
  d2 <- d[i, ] # resample whole rows, so the columns are re-drawn together
  eval.test1 <- ks.test(d2[, 1], d2[, 2], alternative = "greater")
  D <- eval.test1$statistic[[1]] # extract the D statistic
  return(D)
}
set.seed(374)
boot.nat.log <- boot(df4, fc, R = 5000)
boot.nat.log
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = df4, statistic = fc, R = 5000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.18 0.02797067 0.04981018
hist(boot.nat.log$t, xlab = "D Statistic", main = "Distribution of Bootstrapped D Statistics")
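For reference, the ‘norm’ interval reported by boot.ci() below can be approximately reproduced by hand from the replicates stored in boot.nat.log: it is the bias-corrected estimate plus or minus 1.96 bootstrap standard errors. This is a sketch of the calculation, not a replacement for boot.ci():

```r
bias <- mean(boot.nat.log$t) - boot.nat.log$t0  # bootstrap estimate of bias
se   <- sd(boot.nat.log$t)                      # bootstrap standard error
c(lower = boot.nat.log$t0 - bias - qnorm(0.975) * se,
  upper = boot.nat.log$t0 - bias + qnorm(0.975) * se)
```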
boot.ci(boot.out = boot.nat.log, type = c("norm", "basic", "perc", "bca"))
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 5000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.nat.log, type = c("norm", "basic", "perc",
## "bca"))
##
## Intervals :
## Level Normal Basic
## 95% ( 0.0544, 0.2497 ) ( 0.0533, 0.2467 )
##
## Level Percentile BCa
## 95% ( 0.1133, 0.3067 ) ( 0.0553, 0.2467 )
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable
bootConfInt<-boot.ci(boot.out = boot.nat.log, type = "norm")
###'D' statistic for the second comparison between rgamma '1.2' and rgamma '2.5'
fc2 <- function(d, i){
  d2 <- d[i, ] # resample whole rows, as before
  # same statistic as fc, but compare column 1 (shape 1.2) with column 3 (shape 2.5)
  eval.test1 <- ks.test(d2[, 1], d2[, 3], alternative = "greater")
  D <- eval.test1$statistic[[1]]
  return(D)
}
set.seed(374)
boot.nat.log2 <- boot(df4, fc2, R = 500)
boot.nat.log2
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = df4, statistic = fc2, R = 500)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.4866667 0.023 0.04157632
hist(boot.nat.log2$t, xlab = "D Statistic", main = "Distribution of Bootstrapped D Statistics")
boot.ci(boot.out = boot.nat.log2, type = c("norm", "basic", "perc", "bca"))
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 500 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot.nat.log2, type = c("norm", "basic", "perc",
## "bca"))
##
## Intervals :
## Level Normal Basic
## 95% ( 0.3822, 0.5452 ) ( 0.3800, 0.5400 )
##
## Level Percentile BCa
## 95% ( 0.4333, 0.5933 ) ( 0.4000, 0.5400 )
## Calculations and Intervals on Original Scale
## Warning : BCa Intervals used Extreme Quantiles
## Some BCa intervals may be unstable
bootConfInt2<-boot.ci(boot.out = boot.nat.log2, type = "norm")
sprintf("The 'D' statistic for the rgamma '1.2' and rgamma '1.4' comparison is %f, with a 95 percent confidence interval [ %f, %f]. The 'D' statistic for the rgamma '1.2' and rgamma '2.5' comparison is %f, with a 95 percent confidence interval of [ %f, %f].", boot.nat.log$t0, bootConfInt$normal[2], bootConfInt$normal[3], boot.nat.log2$t0, bootConfInt2$normal[2], bootConfInt2$normal[3])
## [1] "The 'D' statistic for the rgamma '1.2' and rgamma '1.4' comparison is 0.180000, with a 95 percent confidence interval [ 0.054403, 0.249655]. The 'D' statistic for the rgamma '1.2' and rgamma '2.5' comparison is 0.486667, with a 95 percent confidence interval of [ 0.382179, 0.545155]."
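The overlap check described earlier can also be made explicit in code. This sketch uses the interval objects already computed above (bootConfInt and bootConfInt2); two intervals overlap exactly when each lower bound lies below the other interval's upper bound:

```r
lower1 <- bootConfInt$normal[2];  upper1 <- bootConfInt$normal[3]
lower2 <- bootConfInt2$normal[2]; upper2 <- bootConfInt2$normal[3]
intervals.overlap <- (lower1 <= upper2) && (lower2 <= upper1)
intervals.overlap
# FALSE here: the first upper bound (~0.25) lies below the second lower
# bound (~0.38), so the two effect sizes differ significantly.
```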