Class KolmogorovSmirnovTest
The one-sample test uses a D statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis.
The two-sample test uses a D statistic based on the maximum deviation between the two empirical distributions of sample data points. The two-sample test evaluates the null hypothesis that the two samples x and y come from the same underlying distribution.
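A minimal usage sketch (the data here is illustrative, and the result accessor getPValue() is an assumed name):
double[] x = {0.12, 0.45, 0.93, 0.67, 0.38, 0.21};
double[] y = {0.40, 0.81, 0.58, 0.23, 0.95, 0.74, 0.66};
KolmogorovSmirnovTest ks = KolmogorovSmirnovTest.withDefaults();
double d = ks.statistic(x, y);           // two-sample D statistic
double p = ks.test(x, y).getPValue();    // accessor name assumed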
References:
- Marsaglia, G., Tsang, W. W., & Wang, J. (2003). Evaluating Kolmogorov's Distribution. Journal of Statistical Software, 8(18), 1–4.
- Simard, R., & L’Ecuyer, P. (2011). Computing the Two-Sided Kolmogorov-Smirnov Distribution. Journal of Statistical Software, 39(11), 1–18.
- Sekhon, J. S. (2011). Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R. Journal of Statistical Software, 42(7), 1–52.
- Viehmann, T. (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv:2102.08037.
- Hodges, J. L. (1958). The significance probability of the Smirnov two-sample test. Arkiv for Matematik, 3(5), 469–486.
Note that [1] contains an error in computing h; refer to MATH-437 for details.
- Since:
- 1.1
Nested Class Summary
- static class - Result for the one-sample Kolmogorov-Smirnov test.
- static final class - Result for the two-sample Kolmogorov-Smirnov test.
Field Summary
- private final AlternativeHypothesis alternative - Alternative hypothesis.
- private static final KolmogorovSmirnovTest DEFAULT - Default instance.
- private static final long[] IGNORED_D - Placeholder to use for the two-sample ties D array when the value can be ignored.
- private static final int[] IGNORED_SIGN - Placeholder to use for the two-sample sign array when the value can be ignored.
- private final int iterations - Number of iterations.
- private static final int LARGE_SAMPLE - When the largest sample size exceeds this value, the 2-sample test AUTO p-value uses an asymptotic distribution to compute the p-value.
- private static final int MAX_ARRAY_SIZE - Maximum length of an array.
- private static final int MAX_FACTORIAL - Maximum finite factorial.
- private static final long MAX_LCM_TWO_SAMPLE_EXACT_P - The maximum least common multiple (lcm) to attempt the exact p-value computation.
- private final PValueMethod pValueMethod - Method to compute the p-value.
- private final org.apache.commons.rng.UniformRandomProvider rng - Source of randomness.
- private static final String SAMPLE_1_NAME - Name for sample 1.
- private static final String SAMPLE_2_NAME - Name for sample 2.
- private final boolean strictInequality - Use a strict inequality for the two-sample exact p-value.
Constructor Summary
- private KolmogorovSmirnovTest(AlternativeHypothesis alternative, PValueMethod method, boolean strict, org.apache.commons.rng.UniformRandomProvider rng, int iterations)
Method Summary
- private static double binom(int n, int k) - Compute the binomial coefficient binom(n, k).
- private static int checkArrayLength(double[] array) - Verifies that array has length at least 2.
- private static double computeD(long dnm, int n, int m, int gcd) - Compute the D statistic from the integral D value.
- private long computeIntegralKolmogorovSmirnovStatistic(double[] x, double[] y, int[] sign, long[] tiesD) - Computes the two-sample Kolmogorov-Smirnov test statistic.
- private double computeStatistic(double[] x, DoubleUnaryOperator cdf, int[] sign) - Computes the magnitude of the one-sample Kolmogorov-Smirnov test statistic.
- private static DoubleSupplier createSampler(double[] x, double[] y, org.apache.commons.rng.UniformRandomProvider rng) - Creates a sampler to sample randomly from the combined distribution of the two samples.
- (package private) static DoubleSupplier createSampler(double[] x, double[] y, org.apache.commons.rng.UniformRandomProvider rng, int maxArraySize) - Creates a sampler to sample randomly from the combined distribution of the two samples.
- private double estimateP(double[] x, double[] y, long dnm) - Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
- private static double[] sort - Sort the input array.
- double statistic(double[] x, double[] y) - Computes the two-sample Kolmogorov-Smirnov test statistic.
- double statistic(double[] x, DoubleUnaryOperator cdf) - Computes the one-sample Kolmogorov-Smirnov test statistic.
- test(double[] x, double[] y) - Performs a two-sample Kolmogorov-Smirnov test on samples x and y.
- test(double[] x, DoubleUnaryOperator cdf) - Performs a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x conforms to the distribution cumulative density function (cdf).
- private static boolean testIntegralKolmogorovSmirnovStatistic(double[] x, double[] y, long plus, long minus) - Test if the two-sample integral Kolmogorov-Smirnov statistic is strictly greater than the specified values for D+ and D-.
- (package private) static double twoSampleApproximateP(double d, int n, int m, boolean twoSided) - Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
- (package private) static double twoSampleExactP(long dnm, int n, int m, int gcd, boolean strict, boolean twoSided) - Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\).
- private static double twoSampleOneSidedPOutside(long d, int n, int m, int gcd) - Computes \(P(D_{n,m}^+ \ge d)\), where \(D_{n,m}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.
- private static double twoSampleOneSidedPOutsideSquare(long d, int n) - Computes \(P(D_{n,n}^+ \ge d)\), where \(D_{n,n}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.
- private double twoSampleP(long dnm, int n, int m, int gcd, double d, boolean exact) - Computes \(P(D_{n,m} > d)\) for the 2-sample Kolmogorov-Smirnov statistic.
- private static double twoSampleTwoSidedPOutsideSquare(long d, int n) - Computes \(P(D_{n,n} \ge d)\), where \(D_{n,n}\) is the 2-sample two-sided Kolmogorov-Smirnov statistic.
- private static double twoSampleTwoSidedPStabilizedInner(long d, int n, int m, int gcd) - Computes \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
- with(org.apache.commons.rng.UniformRandomProvider v) - Return an instance with the configured source of randomness.
- with(AlternativeHypothesis v) - Return an instance with the configured alternative hypothesis.
- with(Inequality v) - Return an instance with the configured inequality.
- with(PValueMethod v) - Return an instance with the configured p-value method.
- static KolmogorovSmirnovTest withDefaults() - Return an instance using the default options.
- withIterations(int v) - Return an instance with the configured number of iterations.
-
Field Details
-
SAMPLE_1_NAME
private static final String SAMPLE_1_NAME
Name for sample 1.
-
SAMPLE_2_NAME
private static final String SAMPLE_2_NAME
Name for sample 2.
-
LARGE_SAMPLE
private static final int LARGE_SAMPLE
When the largest sample size exceeds this value, the 2-sample test AUTO p-value uses an asymptotic distribution to compute the p-value.
-
MAX_FACTORIAL
private static final int MAX_FACTORIAL
Maximum finite factorial.
-
MAX_ARRAY_SIZE
private static final int MAX_ARRAY_SIZE
Maximum length of an array. This is used to determine if two arrays can be concatenated to create a sampler from the joint distribution. The limit is copied from the limit of java.util.ArrayList.
-
MAX_LCM_TWO_SAMPLE_EXACT_P
private static final long MAX_LCM_TWO_SAMPLE_EXACT_P
The maximum least common multiple (lcm) to attempt the exact p-value computation. The integral d value is in [0, n*m] in steps of the greatest common divisor (gcd), thus lcm = n*m/gcd is the number of possible different p-values. Some methods have a lower limit due to computation limits. This should be larger than LARGE_SAMPLE^2 so all AUTO p-values attempt an exact computation, i.e. at least 10000^2 ~ 2^26.56.
-
IGNORED_SIGN
IGNORED_SIGN
private static final int[] IGNORED_SIGN
Placeholder to use for the two-sample sign array when the value can be ignored.
-
IGNORED_D
private static final long[] IGNORED_D
Placeholder to use for the two-sample ties D array when the value can be ignored.
-
DEFAULT
private static final KolmogorovSmirnovTest DEFAULT
Default instance.
-
alternative
private final AlternativeHypothesis alternative
Alternative hypothesis.
-
pValueMethod
private final PValueMethod pValueMethod
Method to compute the p-value.
-
strictInequality
private final boolean strictInequality
Use a strict inequality for the two-sample exact p-value.
-
rng
private final org.apache.commons.rng.UniformRandomProvider rng
Source of randomness.
-
iterations
private final int iterations
Number of iterations.
-
-
Constructor Details
-
KolmogorovSmirnovTest
private KolmogorovSmirnovTest(AlternativeHypothesis alternative, PValueMethod method, boolean strict, org.apache.commons.rng.UniformRandomProvider rng, int iterations)
- Parameters:
- alternative - Alternative hypothesis.
- method - P-value method.
- strict - true to use a strict inequality.
- rng - Source of randomness.
- iterations - Number of iterations.
-
-
Method Details
-
withDefaults
Return an instance using the default options.
- Returns:
- default instance
-
with
Return an instance with the configured alternative hypothesis.
- Parameters:
- v - Value.
- Returns:
- an instance
-
with
Return an instance with the configured p-value method.
For the one-sample two-sided test Kolmogorov's asymptotic approximation can be used; otherwise the p-value uses the distribution of the D statistic.
For the two-sample test the exact p-value can be computed for small sample sizes; otherwise the p-value resorts to the asymptotic approximation. Alternatively a p-value can be estimated from the combined distribution of the samples; this requires a source of randomness.
- Parameters:
- v - Value.
- Returns:
- an instance
-
with
Return an instance with the configured inequality.
Computes the p-value for the two-sample test as \(P(D_{n,m} > d)\) if strict; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\) or \(D_{n,m}^-\).
- Parameters:
- v - Value.
- Returns:
- an instance
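A configuration sketch (the constant name Inequality.STRICT for the strict form is an assumption):
// Request P(D > d) rather than P(D >= d) for the two-sample exact p-value.
// Inequality.STRICT is an assumed constant name for the strict inequality.
KolmogorovSmirnovTest strictTest = KolmogorovSmirnovTest.withDefaults()
    .with(Inequality.STRICT);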
-
with
Return an instance with the configured source of randomness.
Applies to the two-sample test when the p-value method is set to ESTIMATE. The randomness is used for sampling of the combined distribution.
There is no default source of randomness. If the randomness is not set then estimation is not possible and an IllegalStateException will be raised in the two-sample test.
- Parameters:
- v - Value.
- Returns:
- an instance
-
withIterations
Return an instance with the configured number of iterations.
Applies to the two-sample test when the p-value method is set to ESTIMATE. This is the number of sampling iterations used to estimate the p-value. The p-value is a fraction using the iterations as the denominator. The number of significant digits in the p-value is upper bounded by log10(iterations); small p-values have fewer significant digits. A large number of iterations is recommended when using a small critical value to reject the null hypothesis.
- Parameters:
- v - Value.
- Returns:
- an instance
- Throws:
- IllegalArgumentException - if the number of iterations is not strictly positive
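A configuration sketch combining the estimation options described above (RandomSource is the Commons RNG factory used elsewhere in this documentation):
// With 10000 iterations the estimated p-value has a resolution of 1e-4
KolmogorovSmirnovTest test = KolmogorovSmirnovTest.withDefaults()
    .with(PValueMethod.ESTIMATE)
    .with(RandomSource.KISS.create(123))   // a source of randomness is required for ESTIMATE
    .withIterations(10000);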
-
statistic
public double statistic(double[] x, DoubleUnaryOperator cdf)
Computes the one-sample Kolmogorov-Smirnov test statistic.
- two-sided: \(D_n=\sup_x |F_n(x)-F(x)|\)
- greater: \(D_n^+=\sup_x (F_n(x)-F(x))\)
- less: \(D_n^-=\sup_x (F(x)-F_n(x))\)
where \(F\) is the distribution cumulative density function (cdf), \(n\) is the length of x and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x.
The cumulative distribution function should map a real value x to a probability in [0, 1]. To use a reference distribution the CDF can be passed using a method reference:
UniformContinuousDistribution dist = UniformContinuousDistribution.of(0, 1);
UniformRandomProvider rng = RandomSource.KISS.create(123);
double[] x = dist.sampler(rng).samples(100);
double d = KolmogorovSmirnovTest.withDefaults().statistic(x, dist::cumulativeProbability);
- Parameters:
- x - Sample being evaluated.
- cdf - Reference cumulative distribution function.
- Returns:
- Kolmogorov-Smirnov statistic
- Throws:
- IllegalArgumentException - if data does not have length at least 2; or contains NaN values.
-
statistic
public double statistic(double[] x, double[] y)
Computes the two-sample Kolmogorov-Smirnov test statistic.
- two-sided: \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\)
- greater: \(D_{n,m}^+=\sup_x (F_n(x)-F_m(x))\)
- less: \(D_{n,m}^-=\sup_x (F_m(x)-F_n(x))\)
where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) at each of the values in y.
- Parameters:
- x - First sample.
- y - Second sample.
- Returns:
- Kolmogorov-Smirnov statistic
- Throws:
- IllegalArgumentException - if either x or y does not have length at least 2; or contain NaN values.
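A short sketch comparing the two-sided and one-sided statistics (AlternativeHypothesis.GREATER_THAN is an assumed constant name for the 'greater' alternative; data is illustrative):
double[] x = {0.2, 1.4, 1.9, 2.3, 3.1};
double[] y = {0.5, 0.9, 1.1, 1.7, 2.0, 2.8};
// Two-sided D = max(D+, D-) with the default configuration
double d = KolmogorovSmirnovTest.withDefaults().statistic(x, y);
// One-sided D+ (constant name assumed)
double dPlus = KolmogorovSmirnovTest.withDefaults()
    .with(AlternativeHypothesis.GREATER_THAN)
    .statistic(x, y);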
-
test
Performs a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x conforms to the distribution cumulative density function (cdf).
The test is defined by the AlternativeHypothesis:
- Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F(i)\) for all \( i \); the alternative is that they are not identical. The statistic is \( \max(D_n^+, D_n^-) \) and the sign of \( D \) is provided.
- Greater evaluates the null hypothesis that \(F_n(i) \le F(i)\) for all \( i \); the alternative is \(F_n(i) > F(i)\) for at least one \( i \). The statistic is \( D_n^+ \).
- Less evaluates the null hypothesis that \(F_n(i) \ge F(i)\) for all \( i \); the alternative is \(F_n(i) < F(i)\) for at least one \( i \). The statistic is \( D_n^- \).
The p-value method defaults to exact. The one-sided p-value uses Smirnov's stable formula:
\[ P(D_n^+ \ge x) = x \sum_{j=0}^{\lfloor n(1-x) \rfloor} \binom{n}{j} \left(\frac{j}{n} + x\right)^{j-1} \left(1-x-\frac{j}{n} \right)^{n-j} \]
The two-sided p-value is computed using methods described in Simard & L’Ecuyer (2011). The two-sided test supports an asymptotic p-value using Kolmogorov's formula:
\[ \lim_{n\to\infty} P(\sqrt{n}D_n > z) = 2 \sum_{i=1}^\infty (-1)^{i-1} e^{-2 i^2 z^2} \]
- Parameters:
- x - Sample being evaluated.
- cdf - Reference cumulative distribution function.
- Returns:
- test result
- Throws:
- IllegalArgumentException - if data does not have length at least 2; or contains NaN values.
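A usage sketch for the one-sample test against a reference distribution (the AlternativeHypothesis constant name and the result accessors getStatistic()/getPValue() are assumed):
// Illustrative data; in practice these are the observed sample values
double[] x = {0.02, 0.15, 0.23, 0.38, 0.41, 0.57, 0.62, 0.74, 0.88, 0.95};
UniformContinuousDistribution dist = UniformContinuousDistribution.of(0, 1);
var result = KolmogorovSmirnovTest.withDefaults()
    .with(AlternativeHypothesis.TWO_SIDED)    // constant name assumed
    .test(x, dist::cumulativeProbability);
double d = result.getStatistic();             // accessor names assumed
double p = result.getPValue();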
-
test
Performs a two-sample Kolmogorov-Smirnov test on samples x and y. Tests the empirical distributions \(F_n\) and \(F_m\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) at each of the values in y.
The test is defined by the AlternativeHypothesis:
- Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F_m(i)\) for all \( i \); the alternative is that they are not identical. The statistic is \( \max(D_n^+, D_n^-) \) and the sign of \( D \) is provided.
- Greater evaluates the null hypothesis that \(F_n(i) \le F_m(i)\) for all \( i \); the alternative is \(F_n(i) > F_m(i)\) for at least one \( i \). The statistic is \( D_n^+ \).
- Less evaluates the null hypothesis that \(F_n(i) \ge F_m(i)\) for all \( i \); the alternative is \(F_n(i) < F_m(i)\) for at least one \( i \). The statistic is \( D_n^- \).
If the p-value method is AUTO, an exact p-value computation is attempted if both sample sizes are less than 10000 using the methods presented in Viehmann (2021) and Hodges (1958); otherwise an asymptotic p-value is returned. The two-sided p-value is \(\overline{F}(d, \sqrt{mn / (m + n)})\) where \(\overline{F}\) is the complementary cumulative density function of the two-sided one-sample Kolmogorov-Smirnov distribution. The one-sided p-value uses an approximation from Hodges (1958) Eq 5.3.
\(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} \gt d \) differ from \(H_0 : D_{n,m} \ge d \) by the mass of the observed value \(d\). These can be distinguished using an Inequality parameter. This is ignored for large samples.
If the data contains ties there is no defined ordering in the tied region to use for the difference between the two empirical distributions. Each ordering of the tied region may create a different D statistic, so all possible orderings generate a distribution for the D value. In this case the tied region is traversed entirely and the effect on the D value is evaluated at the end of the tied region; this is the path with the least change on the D statistic. The path with the greatest change on the D statistic is also computed as the upper bound on D. If these two values are different then the tied region is known to generate a distribution for the D statistic and the p-value is an overestimate for the cases with a larger D statistic. The presence of any significant tied regions is returned in the result.
If the p-value method is ESTIMATE then the p-value is estimated by repeated sampling of the joint distribution of x and y. The p-value is the frequency that a sample creates a D statistic greater than or equal to (or greater than for strict inequality) the observed value. In this case a source of randomness must be configured or an IllegalStateException will be raised. The p-value for the upper bound on D will not be estimated and is set to NaN. This estimation procedure is not affected by ties in the data and is increasingly robust for larger datasets. The method is modeled after ks.boot in the R Matching package (Sekhon (2011)).
- Parameters:
- x - First sample.
- y - Second sample.
- Returns:
- test result
- Throws:
- IllegalArgumentException - if either x or y does not have length at least 2; or contain NaN values.
- IllegalStateException - if the p-value method is ESTIMATE and there is no source of randomness.
-
estimateP
private double estimateP(double[] x, double[] y, long dnm)
Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
This method will destructively modify the input arrays (via a sort).
This method estimates the p-value by repeatedly sampling sets of size x.length and y.length from the empirical distribution of the combined sample. The memory requirement is an array copy of each of the input arguments.
When using strict inequality, this is equivalent to the algorithm implemented in the R function ks.boot
as described in Sekhon (2011) Journal of Statistical Software, 42(7), 1–52 [3].- Parameters:
- x - First sample.
- y - Second sample.
- dnm - Integral D statistic.
- Returns:
- p-value
- Throws:
- IllegalStateException - if the source of randomness is null.
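A sketch of the resampling procedure described above, under the non-strict inequality (this is not the class's internal code; it uses only the public statistic method and Commons RNG):
// Pool both samples, repeatedly draw new samples of the original sizes
// with replacement, and count how often the resampled D statistic is at
// least the observed D.
static double estimatePValue(double[] x, double[] y, double observedD,
                             UniformRandomProvider rng, int iterations) {
    final double[] pool = new double[x.length + y.length];
    System.arraycopy(x, 0, pool, 0, x.length);
    System.arraycopy(y, 0, pool, x.length, y.length);
    final KolmogorovSmirnovTest test = KolmogorovSmirnovTest.withDefaults();
    int count = 0;
    for (int iter = 0; iter < iterations; iter++) {
        final double[] a = new double[x.length];
        final double[] b = new double[y.length];
        for (int i = 0; i < a.length; i++) {
            a[i] = pool[rng.nextInt(pool.length)];
        }
        for (int i = 0; i < b.length; i++) {
            b[i] = pool[rng.nextInt(pool.length)];
        }
        if (test.statistic(a, b) >= observedD) {
            count++;
        }
    }
    return (double) count / iterations;
}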
-
computeStatistic
Computes the magnitude of the one-sample Kolmogorov-Smirnov test statistic. The sign of the statistic is optionally returned in sign. For the two-sided case the sign is 0 if the magnitude of D+ and D- was equal; otherwise it indicates which D had the larger magnitude.
- Parameters:
- x - Sample being evaluated.
- cdf - Reference cumulative distribution function.
- sign - Sign of the statistic (non-zero length).
- Returns:
- Kolmogorov-Smirnov statistic
- Throws:
- IllegalArgumentException - if data does not have length at least 2; or contains NaN values.
-
computeIntegralKolmogorovSmirnovStatistic
private long computeIntegralKolmogorovSmirnovStatistic(double[] x, double[] y, int[] sign, long[] tiesD)
Computes the two-sample Kolmogorov-Smirnov test statistic. The statistic is integral and has a value in [0, n*m]. Not all values are possible; the smallest increment is the greatest common divisor of n and m.
This method will destructively modify the input arrays (via a sort).
The sign of the statistic is returned in sign. For the two-sided case the sign is 0 if the magnitude of D+ and D- was equal; otherwise it indicates which D had the larger magnitude. If the two-sided statistic is zero the two arrays are identical, or are 'identical' data of a single value (sample sizes may be different), or have a sequence of ties of 'identical' data with a net zero effect on the D statistic, e.g.:
[1,2,3] vs [1,2,3]
[0,0,0,0] vs [0,0,0]
[0,0,0,0,1,1,1,1] vs [0,0,0,1,1,1]
This method detects ties in the input data. Not all ties will invalidate the returned statistic. Ties between the data can be interpreted as if the values were different but within machine epsilon. In this case the path through the tie region is not known. All paths through the tie region end at the same point. This method will compute the most extreme path through each tie region and the D statistic for these paths. If the ties D statistic is a larger magnitude than the returned D statistic then at least one tie region lies at a point on the full path that could result in a different statistic in the absence of ties. This signals that the p-value computed using the returned D statistic is one of many possible p-values given the data and how ties are resolved. Note: the tiesD value may be smaller than the returned D statistic as it is not set to the maximum of D or tiesD. The only result of interest is when tiesD > D due to a tie region that can change the output D. On output tiesD[0] != 0 if there were ties between samples and tiesD[1] = D of the most extreme path through the ties.
- Parameters:
- x - First sample (destructively modified by sort).
- y - Second sample (destructively modified by sort).
- sign - Sign of the statistic (non-zero length).
- tiesD - Integral statistic for the most extreme path through any ties (length at least 2).
- Returns:
- integral Kolmogorov-Smirnov statistic
- Throws:
- IllegalArgumentException - if either x or y contain NaN values.
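A minimal sketch of the integral statistic described above (not the class's internal implementation; tie handling is omitted and the inputs are assumed sorted and NaN-free):
// Walk the merged order of two sorted samples, stepping the scaled ECDF
// difference by m for each x value and by -n for each y value.
// The integral two-sided D is the maximum |difference| and lies in [0, n*m].
static long integralTwoSidedD(double[] x, double[] y) {
    final int n = x.length;
    final int m = y.length;
    long d = 0;      // current scaled difference n*m*(Fn - Fm)
    long max = 0;    // maximum |d| seen so far
    int i = 0;
    int j = 0;
    while (i < n && j < m) {
        if (x[i] <= y[j]) {
            d += m;
            i++;
        } else {
            d -= n;
            j++;
        }
        max = Math.max(max, Math.abs(d));
    }
    // Once one sample is exhausted the scaled difference moves
    // monotonically back to zero, so the supremum is already captured.
    return max;
}
Dividing the returned value by (double) n * m gives the D statistic in [0, 1].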
-
testIntegralKolmogorovSmirnovStatistic
private static boolean testIntegralKolmogorovSmirnovStatistic(double[] x, double[] y, long plus, long minus)
Test if the two-sample integral Kolmogorov-Smirnov statistic is strictly greater than the specified values for D+ and D-. Note that D- should have a negative sign to impose an inclusive lower bound.
This method will destructively modify the input arrays (via a sort).
For a two-sided alternative hypothesis plus and minus should have the same magnitude with opposite signs.
For a one-sided alternative hypothesis the value of plus or minus should have the expected value of the statistic, and the opposite D should have the maximum or minimum long value. The minus should be negatively signed:
- greater: plus = D, minus = Long.MIN_VALUE
- less: minus = -D, plus = Long.MAX_VALUE
Note: This method has not been specialized for the one-sided case. Specialization would eliminate a conditional branch for d > Long.MAX_VALUE or d < Long.MIN_VALUE. Since these branches are never possible in the one-sided case this should be efficiently chosen by branch prediction in a processor pipeline.
- Parameters:
- x - First sample (destructively modified by sort; must not contain NaN).
- y - Second sample (destructively modified by sort; must not contain NaN).
- plus - Limit on D+ (inclusive upper bound).
- minus - Limit on D- (inclusive lower bound).
- Returns:
- true if the D value exceeds the provided limits
-
createSampler
private static DoubleSupplier createSampler(double[] x, double[] y, org.apache.commons.rng.UniformRandomProvider rng)
Creates a sampler to sample randomly from the combined distribution of the two samples.
- Parameters:
- x - First sample.
- y - Second sample.
- rng - Source of randomness.
- Returns:
- the sampler
-
createSampler
static DoubleSupplier createSampler(double[] x, double[] y, org.apache.commons.rng.UniformRandomProvider rng, int maxArraySize)
Creates a sampler to sample randomly from the combined distribution of the two samples. This will copy the input data for the sampler.
- Parameters:
- x - First sample.
- y - Second sample.
- rng - Source of randomness.
- maxArraySize - Maximum size of a single array.
- Returns:
- the sampler
-
computeD
private static double computeD(long dnm, int n, int m, int gcd)
Compute the D statistic from the integral D value.
- Parameters:
- dnm - Integral D-statistic value (in [0, n*m]).
- n - First sample size.
- m - Second sample size.
- gcd - Greatest common divisor of n and m.
- Returns:
- D-statistic value (in [0, 1]).
-
twoSampleP
private double twoSampleP(long dnm, int n, int m, int gcd, double d, boolean exact)
Computes \(P(D_{n,m} > d)\) for the 2-sample Kolmogorov-Smirnov statistic.
- Parameters:
- dnm - Integral D-statistic value (in [0, n*m]).
- n - First sample size.
- m - Second sample size.
- gcd - Greatest common divisor of n and m.
- d - D-statistic value (in [0, 1]).
- exact - whether to compute the exact probability; otherwise approximate.
- Returns:
- probability
-
twoSampleExactP
static double twoSampleExactP(long dnm, int n, int m, int gcd, boolean strict, boolean twoSided)
Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\). See statistic(double[], double[]) for the definition of \(D_{n,m}\).
The returned probability is exact. If the value cannot be computed this returns -1.
Note: This requires the greatest common divisor of n and m. The integral D statistic in the range [0, n*m] is separated by increments of the gcd. The method will only compute p-values for valid values of D by calculating for D/gcd. Strict inequality is performed using the next valid value for D.
- Parameters:
- dnm - Integral D-statistic value (in [0, n*m]).
- n - First sample size.
- m - Second sample size.
- gcd - Greatest common divisor of n and m.
- strict - whether or not the probability to compute is expressed as a strict inequality.
- twoSided - whether D refers to D or D+.
- Returns:
- probability that a randomly selected m-n partition of m + n generates D greater than (resp. greater than or equal to) d (or -1)
-
twoSampleTwoSidedPStabilizedInner
private static double twoSampleTwoSidedPStabilizedInner(long d, int n, int m, int gcd)
Computes \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
The returned probability is exact, implemented using the stabilized inner method presented in Viehmann (2021).
This is optimized for m <= n. If m > n an index-out-of-bounds exception can occur.
- Parameters:
- d - Integral D-statistic value (in [2, lcm]).
- n - Larger sample size.
- m - Smaller sample size.
- gcd - Greatest common divisor of n and m.
- Returns:
- probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than or equal to d
twoSampleOneSidedPOutside
private static double twoSampleOneSidedPOutside(long d, int n, int m, int gcd)
Computes \(P(D_{n,m}^+ \ge d)\), where \(D_{n,m}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.
The returned probability is exact, implemented using the outer method presented in Hodges (1958).
This method will fail-fast and return -1 if the computation of the numbers of paths overflows.
- Parameters:
- d - Integral D-statistic value (in [0, lcm]).
- n - First sample size.
- m - Second sample size.
- gcd - Greatest common divisor of n and m.
- Returns:
- probability that a randomly selected m-n partition of m + n generates \(D_{n,m}^+\) greater than or equal to d
-
twoSampleOneSidedPOutsideSquare
private static double twoSampleOneSidedPOutsideSquare(long d, int n)
Computes \(P(D_{n,n}^+ \ge d)\), where \(D_{n,n}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.
The returned probability is exact, implemented using the outer method presented in Hodges (1958).
- Parameters:
- d - Integral D-statistic value (in [1, lcm]).
- n - Sample size.
- Returns:
- probability that a randomly selected n-n partition of n + n generates \(D_{n,n}^+\) greater than or equal to d
-
twoSampleTwoSidedPOutsideSquare
private static double twoSampleTwoSidedPOutsideSquare(long d, int n)
Computes \(P(D_{n,n} \ge d)\), where \(D_{n,n}\) is the 2-sample two-sided Kolmogorov-Smirnov statistic.
The returned probability is exact, implemented using the outer method presented in Hodges (1958).
- Parameters:
- d - Integral D-statistic value (in [1, n]).
- n - Sample size.
- Returns:
- probability that a randomly selected n-n partition of n + n generates \(D_{n,n}\) greater than or equal to d
-
binom
private static double binom(int n, int k)
Compute the binomial coefficient binom(n, k).
- Parameters:
- n - N.
- k - K.
- Returns:
- binom(n, k)
-
twoSampleApproximateP
static double twoSampleApproximateP(double d, int n, int m, boolean twoSided)
Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See statistic(double[], double[]) for the definition of \(D_{n,m}\).
Specifically, what is returned is \(1 - CDF(d, \sqrt{mn / (m + n)})\) where CDF is the cumulative density function of the two-sided one-sample Kolmogorov-Smirnov distribution.
- Parameters:
- d - D-statistic value.
- n - First sample size.
- m - Second sample size.
- twoSided - True to compute the two-sided p-value; else one-sided.
- Returns:
- approximate probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than d
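A sketch of this approximation using Kolmogorov's limiting series quoted in the one-sample test documentation above (a truncated sum, not the class's internal implementation):
// Approximate P(D_{n,m} > d) for the two-sided statistic using Kolmogorov's
// limiting distribution evaluated at z = d * sqrt(n*m / (n + m)).
static double approximateTwoSidedP(double d, int n, int m) {
    final double z = d * Math.sqrt((double) n * m / (n + m));
    double sum = 0;
    for (int i = 1; i <= 100; i++) {
        final double term = Math.exp(-2.0 * i * i * z * z);
        sum += (i % 2 == 1) ? term : -term;
        if (term < 1e-16) {
            break;  // the series converges rapidly for moderate z
        }
    }
    // Clamp to [0, 1] to guard against truncation and rounding for small z
    return Math.min(1, Math.max(0, 2 * sum));
}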
-
checkArrayLength
private static int checkArrayLength(double[] array)
Verifies that array has length at least 2.
- Parameters:
- array - Array to test.
- Returns:
- the length
- Throws:
- IllegalArgumentException - if array is too short
-
sort
Sort the input array. Throws an exception if NaN values are present. It is assumed the array is non-zero length.
- Parameters:
- x - Input array.
- name - Name of the array.
- Returns:
- a reference to the input (sorted) array
- Throws:
- IllegalArgumentException - if x contains NaN values.
-