Class KolmogorovSmirnovTest

java.lang.Object
org.apache.commons.statistics.inference.KolmogorovSmirnovTest

public final class KolmogorovSmirnovTest extends Object
Implements the Kolmogorov-Smirnov (K-S) test for equality of continuous distributions.

The one-sample test uses a D statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis.

The two-sample test uses a D statistic based on the maximum deviation of the two empirical distributions of sample data points. The two-sample tests evaluate the null hypothesis that the two samples x and y come from the same underlying distribution.

References:

  1. Marsaglia, G., Tsang, W. W., & Wang, J. (2003). Evaluating Kolmogorov's Distribution. Journal of Statistical Software, 8(18), 1–4.
  2. Simard, R., & L’Ecuyer, P. (2011). Computing the Two-Sided Kolmogorov-Smirnov Distribution. Journal of Statistical Software, 39(11), 1–18.
  3. Sekhon, J. S. (2011). Multivariate and Propensity Score Matching Software with Automated Balance Optimization: The Matching package for R. Journal of Statistical Software, 42(7), 1–52.
  4. Viehmann, T (2021). Numerically more stable computation of the p-values for the two-sample Kolmogorov-Smirnov test. arXiv:2102.08037
  5. Hodges, J. L. (1958). The significance probability of the smirnov two-sample test. Arkiv for Matematik, 3(5), 469-486.

Note that [1] contains an error in computing h, refer to MATH-437 for details.

Since:
1.1
See Also:
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    static class 
    Result for the one-sample Kolmogorov-Smirnov test.
    static final class 
    Result for the two-sample Kolmogorov-Smirnov test.
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    private final AlternativeHypothesis
    Alternative hypothesis.
    private static final KolmogorovSmirnovTest
    Default instance.
    private static final long[]
    Placeholder to use for the two-sample ties D array when the value can be ignored.
    private static final int[]
    Placeholder to use for the two-sample sign array when the value can be ignored.
    private final int
    Number of iterations .
    private static final int
    When the largest sample size exceeds this value, 2-sample test AUTO p-value uses an asymptotic distribution to compute the p-value.
    private static final int
    Maximum length of an array.
    private static final int
    Maximum finite factorial.
    private static final long
    The maximum least common multiple (lcm) to attempt the exact p-value computation.
    private final PValueMethod
    Method to compute the p-value.
    private final org.apache.commons.rng.UniformRandomProvider
    Source of randomness.
    private static final String
    Name for sample 1.
    private static final String
    Name for sample 2.
    private final boolean
    Use a strict inequality for the two-sample exact p-value.
  • Constructor Summary

    Constructors
    Modifier
    Constructor
    Description
    private
    KolmogorovSmirnovTest(AlternativeHypothesis alternative, PValueMethod method, boolean strict, org.apache.commons.rng.UniformRandomProvider rng, int iterations)
     
  • Method Summary

    Modifier and Type
    Method
    Description
    private static double
    binom(int n, int k)
    Compute the binomial coefficient binom(n, k).
    private static int
    checkArrayLength(double[] array)
    Verifies that array has length at least 2.
    private static double
    computeD(long dnm, int n, int m, int gcd)
    Compute the D statistic from the integral D value.
    private long
    computeIntegralKolmogorovSmirnovStatistic(double[] x, double[] y, int[] sign, long[] tiesD)
    Computes the two-sample Kolmogorov-Smirnov test statistic.
    private double
    computeStatistic(double[] x, DoubleUnaryOperator cdf, int[] sign)
    Computes the magnitude of the one-sample Kolmogorov-Smirnov test statistic.
    private static DoubleSupplier
    createSampler(double[] x, double[] y, org.apache.commons.rng.UniformRandomProvider rng)
    Creates a sampler to sample randomly from the combined distribution of the two samples.
    (package private) static DoubleSupplier
    createSampler(double[] x, double[] y, org.apache.commons.rng.UniformRandomProvider rng, int maxArraySize)
    Creates a sampler to sample randomly from the combined distribution of the two samples.
    private double
    estimateP(double[] x, double[] y, long dnm)
    Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
    private static double[]
    sort(double[] x, String name)
    Sort the input array.
    double
    statistic(double[] x, double[] y)
    Computes the two-sample Kolmogorov-Smirnov test statistic.
    double
    statistic(double[] x, DoubleUnaryOperator cdf)
    Computes the one-sample Kolmogorov-Smirnov test statistic.
    test(double[] x, double[] y)
    Performs a two-sample Kolmogorov-Smirnov test on samples x and y.
    test(double[] x, DoubleUnaryOperator cdf)
    Performs a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x conforms to the distribution cumulative density function (cdf).
    private static boolean
    testIntegralKolmogorovSmirnovStatistic(double[] x, double[] y, long plus, long minus)
    Test if the two-sample integral Kolmogorov-Smirnov statistic is strictly greater than the specified values for D+ and D-.
    (package private) static double
    twoSampleApproximateP(double d, int n, int m, boolean twoSided)
    Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
    (package private) static double
    twoSampleExactP(long dnm, int n, int m, int gcd, boolean strict, boolean twoSided)
    Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\}.
    private static double
    twoSampleOneSidedPOutside(long d, int n, int m, int gcd)
    Computes \(P(D_{n,m}^+ \ge d)\), where \(D_{n,m}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.
    private static double
    Computes \(P(D_{n,n}^+ \ge d)\), where \(D_{n,n}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.
    private double
    twoSampleP(long dnm, int n, int m, int gcd, double d, boolean exact)
    Computes \(P(D_{n,m} > d)\) for the 2-sample Kolmogorov-Smirnov statistic.
    private static double
    Computes \(P(D_{n,n}^+ \ge d)\), where \(D_{n,n}^+\) is the 2-sample two-sided Kolmogorov-Smirnov statistic.
    private static double
    twoSampleTwoSidedPStabilizedInner(long d, int n, int m, int gcd)
    Computes \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
    with(org.apache.commons.rng.UniformRandomProvider v)
    Return an instance with the configured source of randomness.
    Return an instance with the configured alternative hypothesis.
    Return an instance with the configured inequality.
    Return an instance with the configured p-value method.
    Return an instance using the default options.
    Return an instance with the configured number of iterations.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • SAMPLE_1_NAME

      private static final String SAMPLE_1_NAME
      Name for sample 1.
      See Also:
    • SAMPLE_2_NAME

      private static final String SAMPLE_2_NAME
      Name for sample 2.
      See Also:
    • LARGE_SAMPLE

      private static final int LARGE_SAMPLE
      When the largest sample size exceeds this value, 2-sample test AUTO p-value uses an asymptotic distribution to compute the p-value.
      See Also:
    • MAX_FACTORIAL

      private static final int MAX_FACTORIAL
      Maximum finite factorial.
      See Also:
    • MAX_ARRAY_SIZE

      private static final int MAX_ARRAY_SIZE
      Maximum length of an array. This is used to determine if two arrays can be concatenated to create a sampler from the joint distribution. The limit is copied from the limit of java.util.ArrayList.
      See Also:
    • MAX_LCM_TWO_SAMPLE_EXACT_P

      private static final long MAX_LCM_TWO_SAMPLE_EXACT_P
      The maximum least common multiple (lcm) to attempt the exact p-value computation. The integral d value is in [0, n*m] in steps of the greatest common denominator (gcd), thus lcm = n*m/gcd is the number of possible different p-values. Some methods have a lower limit due to computation limits. This should be larger than LARGE_SAMPLE^2 so all AUTO p-values attempt an exact computation, i.e. at least 10000^2 ~ 2^26.56.
      See Also:
    • IGNORED_SIGN

      private static final int[] IGNORED_SIGN
      Placeholder to use for the two-sample sign array when the value can be ignored.
    • IGNORED_D

      private static final long[] IGNORED_D
      Placeholder to use for the two-sample ties D array when the value can be ignored.
    • DEFAULT

      private static final KolmogorovSmirnovTest DEFAULT
      Default instance.
    • alternative

      private final AlternativeHypothesis alternative
      Alternative hypothesis.
    • pValueMethod

      private final PValueMethod pValueMethod
      Method to compute the p-value.
    • strictInequality

      private final boolean strictInequality
      Use a strict inequality for the two-sample exact p-value.
    • rng

      private final org.apache.commons.rng.UniformRandomProvider rng
      Source of randomness.
    • iterations

      private final int iterations
      Number of iterations .
  • Constructor Details

    • KolmogorovSmirnovTest

      private KolmogorovSmirnovTest(AlternativeHypothesis alternative, PValueMethod method, boolean strict, org.apache.commons.rng.UniformRandomProvider rng, int iterations)
      Parameters:
      alternative - Alternative hypothesis.
      method - P-value method.
      strict - true to use a strict inequality.
      rng - Source of randomness.
      iterations - Number of iterations.
  • Method Details

    • withDefaults

      public static KolmogorovSmirnovTest withDefaults()
      Returns:
      default instance
    • with

      Return an instance with the configured alternative hypothesis.
      Parameters:
      v - Value.
      Returns:
      an instance
    • with

      Return an instance with the configured p-value method.

      For the one-sample two-sided test Kolmogorov's asymptotic approximation can be used; otherwise the p-value uses the distribution of the D statistic.

      For the two-sample test the exact p-value can be computed for small sample sizes; otherwise the p-value resorts to the asymptotic approximation. Alternatively a p-value can be estimated from the combined distribution of the samples. This requires a source of randomness.

      Parameters:
      v - Value.
      Returns:
      an instance
      See Also:
    • with

      Return an instance with the configured inequality.

      Computes the p-value for the two-sample test as \(P(D_{n,m} > d)\) if strict; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\) or \(D_{n,m}^-\).

      Parameters:
      v - Value.
      Returns:
      an instance
    • with

      public KolmogorovSmirnovTest with(org.apache.commons.rng.UniformRandomProvider v)
      Return an instance with the configured source of randomness.

      Applies to the two-sample test when the p-value method is set to ESTIMATE. The randomness is used for sampling of the combined distribution.

      There is no default source of randomness. If the randomness is not set then estimation is not possible and an IllegalStateException will be raised in the two-sample test.

      Parameters:
      v - Value.
      Returns:
      an instance
      See Also:
    • withIterations

      public KolmogorovSmirnovTest withIterations(int v)
      Return an instance with the configured number of iterations.

      Applies to the two-sample test when the p-value method is set to ESTIMATE. This is the number of sampling iterations used to estimate the p-value. The p-value is a fraction using the iterations as the denominator. The number of significant digits in the p-value is upper bounded by log10(iterations); small p-values have fewer significant digits. A large number of iterations is recommended when using a small critical value to reject the null hypothesis.

      Parameters:
      v - Value.
      Returns:
      an instance
      Throws:
      IllegalArgumentException - if the number of iterations is not strictly positive
    • statistic

      public double statistic(double[] x, DoubleUnaryOperator cdf)
      Computes the one-sample Kolmogorov-Smirnov test statistic.
      • two-sided: \(D_n=\sup_x |F_n(x)-F(x)|\)
      • greater: \(D_n^+=\sup_x (F_n(x)-F(x))\)
      • less: \(D_n^-=\sup_x (F(x)-F_n(x))\)

      where \(F\) is the distribution cumulative density function (cdf), \(n\) is the length of x and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x.

      The cumulative distribution function should map a real value x to a probability in [0, 1]. To use a reference distribution the CDF can be passed using a method reference:

       UniformContinuousDistribution dist = UniformContinuousDistribution.of(0, 1);
       UniformRandomProvider rng = RandomSource.KISS.create(123);
       double[] x = dist.sampler(rng).samples(100);
       double d = KolmogorovSmirnovTest.withDefaults().statistic(x, dist::cumulativeProbability);
       
      Parameters:
      x - Sample being evaluated.
      cdf - Reference cumulative distribution function.
      Returns:
      Kolmogorov-Smirnov statistic
      Throws:
      IllegalArgumentException - if data does not have length at least 2; or contains NaN values.
      See Also:
    • statistic

      public double statistic(double[] x, double[] y)
      Computes the two-sample Kolmogorov-Smirnov test statistic.
      • two-sided: \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\)
      • greater: \(D_{n,m}^+=\sup_x (F_n(x)-F_m(x))\)
      • less: \(D_{n,m}^-=\sup_x (F_m(x)-F_n(x))\)

      where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) at each of the values in y.

      Parameters:
      x - First sample.
      y - Second sample.
      Returns:
      Kolmogorov-Smirnov statistic
      Throws:
      IllegalArgumentException - if either x or y does not have length at least 2; or contain NaN values.
      See Also:
    • test

      public KolmogorovSmirnovTest.OneResult test(double[] x, DoubleUnaryOperator cdf)
      Performs a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x conforms to the distribution cumulative density function (cdf).

      The test is defined by the AlternativeHypothesis:

      • Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F(i)\) for all \( i \); the alternative is that the are not identical. The statistic is \( max(D_n^+, D_n^-) \) and the sign of \( D \) is provided.
      • Greater evaluates the null hypothesis that the \(F_n(i) <= F(i)\) for all \( i \); the alternative is \(F_n(i) > F(i)\) for at least one \( i \). The statistic is \( D_n^+ \).
      • Less evaluates the null hypothesis that the \(F_n(i) >= F(i)\) for all \( i \); the alternative is \(F_n(i) < F(i)\) for at least one \( i \). The statistic is \( D_n^- \).

      The p-value method defaults to exact. The one-sided p-value uses Smirnov's stable formula:

      \[ P(D_n^+ \ge x) = x \sum_{j=0}^{\lfloor n(1-x) \rfloor} \binom{n}{j} \left(\frac{j}{n} + x\right)^{j-1} \left(1-x-\frac{j}{n} \right)^{n-j} \]

      The two-sided p-value is computed using methods described in Simard & L’Ecuyer (2011). The two-sided test supports an asymptotic p-value using Kolmogorov's formula:

      \[ \lim_{n\to\infty} P(\sqrt{n}D_n > z) = 2 \sum_{i=1}^\infty (-1)^{i-1} e^{-2 i^2 z^2} \]

      Parameters:
      x - Sample being being evaluated.
      cdf - Reference cumulative distribution function.
      Returns:
      test result
      Throws:
      IllegalArgumentException - if data does not have length at least 2; or contains NaN values.
      See Also:
    • test

      public KolmogorovSmirnovTest.TwoResult test(double[] x, double[] y)
      Performs a two-sample Kolmogorov-Smirnov test on samples x and y. Test the empirical distributions \(F_n\) and \(F_m\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) of the y values.

      The test is defined by the AlternativeHypothesis:

      • Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F_m(i)\) for all \( i \); the alternative is that they are not identical. The statistic is \( max(D_n^+, D_n^-) \) and the sign of \( D \) is provided.
      • Greater evaluates the null hypothesis that the \(F_n(i) <= F_m(i)\) for all \( i \); the alternative is \(F_n(i) > F_m(i)\) for at least one \( i \). The statistic is \( D_n^+ \).
      • Less evaluates the null hypothesis that the \(F_n(i) >= F_m(i)\) for all \( i \); the alternative is \(F_n(i) < F_m(i)\) for at least one \( i \). The statistic is \( D_n^- \).

      If the p-value method is auto, then an exact p computation is attempted if both sample sizes are less than 10000 using the methods presented in Viehmann (2021) and Hodges (1958); otherwise an asymptotic p-value is returned. The two-sided p-value is \(\overline{F}(d, \sqrt{mn / (m + n)})\) where \(\overline{F}\) is the complementary cumulative density function of the two-sided one-sample Kolmogorov-Smirnov distribution. The one-sided p-value uses an approximation from Hodges (1958) Eq 5.3.

      \(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} \gt d \) differ from \(H_0 : D_{n,m} \ge d \) by the mass of the observed value \(d\). These can be distinguished using an Inequality parameter. This is ignored for large samples.

      If the data contains ties there is no defined ordering in the tied region to use for the difference between the two empirical distributions. Each ordering of the tied region may create a different D statistic. All possible orderings generate a distribution for the D value. In this case the tied region is traversed entirely and the effect on the D value evaluated at the end of the tied region. This is the path with the least change on the D statistic. The path with the greatest change on the D statistic is also computed as the upper bound on D. If these two values are different then the tied region is known to generate a distribution for the D statistic and the p-value is an over estimate for the cases with a larger D statistic. The presence of any significant tied regions is returned in the result.

      If the p-value method is ESTIMATE then the p-value is estimated by repeat sampling of the joint distribution of x and y. The p-value is the frequency that a sample creates a D statistic greater than or equal to (or greater than for strict inequality) the observed value. In this case a source of randomness must be configured or an IllegalStateException will be raised. The p-value for the upper bound on D will not be estimated and is set to NaN. This estimation procedure is not affected by ties in the data and is increasingly robust for larger datasets. The method is modeled after ks.boot in the R Matching package (Sekhon (2011)).

      Parameters:
      x - First sample.
      y - Second sample.
      Returns:
      test result
      Throws:
      IllegalArgumentException - if either x or y does not have length at least 2; or contain NaN values.
      IllegalStateException - if the p-value method is ESTIMATE and there is no source of randomness.
      See Also:
    • estimateP

      private double estimateP(double[] x, double[] y, long dnm)
      Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.

      This method will destructively modify the input arrays (via a sort).

      This method estimates the p-value by repeatedly sampling sets of size x.length and y.length from the empirical distribution of the combined sample. The memory requirement is an array copy of each of the input arguments.

      When using strict inequality, this is equivalent to the algorithm implemented in the R function ks.boot as described in Sekhon (2011) Journal of Statistical Software, 42(7), 1–52 [3].

      Parameters:
      x - First sample.
      y - Second sample.
      dnm - Integral D statistic.
      Returns:
      p-value
      Throws:
      IllegalStateException - if the source of randomness is null.
    • computeStatistic

      private double computeStatistic(double[] x, DoubleUnaryOperator cdf, int[] sign)
      Computes the magnitude of the one-sample Kolmogorov-Smirnov test statistic. The sign of the statistic is optionally returned in sign. For the two-sided case the sign is 0 if the magnitude of D+ and D- was equal; otherwise it indicates which D had the larger magnitude.
      Parameters:
      x - Sample being evaluated.
      cdf - Reference cumulative distribution function.
      sign - Sign of the statistic (non-zero length).
      Returns:
      Kolmogorov-Smirnov statistic
      Throws:
      IllegalArgumentException - if data does not have length at least 2; or contains NaN values.
    • computeIntegralKolmogorovSmirnovStatistic

      private long computeIntegralKolmogorovSmirnovStatistic(double[] x, double[] y, int[] sign, long[] tiesD)
      Computes the two-sample Kolmogorov-Smirnov test statistic. The statistic is integral and has a value in [0, n*m]. Not all values are possible; the smallest increment is the greatest common divisor of n and m.

      This method will destructively modify the input arrays (via a sort).

      The sign of the statistic is returned in sign. For the two-sided case the sign is 0 if the magnitude of D+ and D- was equal; otherwise it indicates which D had the larger magnitude. If the two-sided statistic is zero the two arrays are identical, or are 'identical' data of a single value (sample sizes may be different), or have a sequence of ties of 'identical' data with a net zero effect on the D statistic e.g.

        [1,2,3]           vs [1,2,3]
        [0,0,0,0]         vs [0,0,0]
        [0,0,0,0,1,1,1,1] vs [0,0,0,1,1,1]
       

      This method detects ties in the input data. Not all ties will invalidate the returned statistic. Ties between the data can be interpreted as if the values were different but within machine epsilon. In this case the path through the tie region is not known. All paths through the tie region end at the same point. This method will compute the most extreme path through each tie region and the D statistic for these paths. If the ties D statistic is a larger magnitude than the returned D statistic then at least one tie region lies at a point on the full path that could result in a different statistic in the absence of ties. This signals the P-value computed using the returned D statistic is one of many possible p-values given the data and how ties are resolved. Note: The tiesD value may be smaller than the returned D statistic as it is not set to the maximum of D or tiesD. The only result of interest is when tiesD > D due to a tie region that can change the output D. On output tiesD[0] != 0 if there were ties between samples and tiesD[1] = D of the most extreme path through the ties.

      Parameters:
      x - First sample (destructively modified by sort).
      y - Second sample (destructively modified by sort).
      sign - Sign of the statistic (non-zero length).
      tiesD - Integral statistic for the most extreme path through any ties (length at least 2).
      Returns:
      integral Kolmogorov-Smirnov statistic
      Throws:
      IllegalArgumentException - if either x or y contain NaN values.
    • testIntegralKolmogorovSmirnovStatistic

      private static boolean testIntegralKolmogorovSmirnovStatistic(double[] x, double[] y, long plus, long minus)
      Test if the two-sample integral Kolmogorov-Smirnov statistic is strictly greater than the specified values for D+ and D-. Note that D- should have a negative sign to impose an inclusive lower bound.

      This method will destructively modify the input arrays (via a sort).

      For a two-sided alternative hypothesis plus and minus should have the same magnitude with opposite signs.

      For a one-sided alternative hypothesis the value of plus or minus should have the expected value of the statistic, and the opposite D should have the maximum or minimum long value. The minus should be negatively signed:

      Note: This method has not been specialized for the one-sided case. Specialization would eliminate a conditional branch for d > Long.MAX_VALUE or d < Long.MIN_VALUE. Since these branches are never possible in the one-sided case this should be efficiently chosen by branch prediction in a processor pipeline.

      Parameters:
      x - First sample (destructively modified by sort; must not contain NaN).
      y - Second sample (destructively modified by sort; must not contain NaN).
      plus - Limit on D+ (inclusive upper bound).
      minus - Limit on D- (inclusive lower bound).
      Returns:
      true if the D value exceeds the provided limits
    • createSampler

      private static DoubleSupplier createSampler(double[] x, double[] y, org.apache.commons.rng.UniformRandomProvider rng)
      Creates a sampler to sample randomly from the combined distribution of the two samples.
      Parameters:
      x - First sample.
      y - Second sample.
      rng - Source of randomness.
      Returns:
      the sampler
    • createSampler

      static DoubleSupplier createSampler(double[] x, double[] y, org.apache.commons.rng.UniformRandomProvider rng, int maxArraySize)
      Creates a sampler to sample randomly from the combined distribution of the two samples. This will copy the input data for the sampler.
      Parameters:
      x - First sample.
      y - Second sample.
      rng - Source of randomness.
      maxArraySize - Maximum size of a single array.
      Returns:
      the sampler
    • computeD

      private static double computeD(long dnm, int n, int m, int gcd)
      Compute the D statistic from the integral D value.
      Parameters:
      dnm - Integral D-statistic value (in [0, n*m]).
      n - First sample size.
      m - Second sample size.
      gcd - Greatest common divisor of n and m.
      Returns:
      D-statistic value (in [0, 1]).
    • twoSampleP

      private double twoSampleP(long dnm, int n, int m, int gcd, double d, boolean exact)
      Computes \(P(D_{n,m} > d)\) for the 2-sample Kolmogorov-Smirnov statistic.
      Parameters:
      dnm - Integral D-statistic value (in [0, n*m]).
      n - First sample size.
      m - Second sample size.
      gcd - Greatest common divisor of n and m.
      d - D-statistic value (in [0, 1]).
      exact - whether to compute the exact probability; otherwise approximate.
      Returns:
      probability
      See Also:
    • twoSampleExactP

      static double twoSampleExactP(long dnm, int n, int m, int gcd, boolean strict, boolean twoSided)
      Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\}. See statistic(double[], double[]) for the definition of \(D_{n,m}\).

      The returned probability is exact. If the value cannot be computed this returns -1.

      Note: This requires the greatest common divisor of n and m. The integral D statistic in the range [0, n*m] is separated by increments of the gcd. The method will only compute p-values for valid values of D by calculating for D/gcd. Strict inquality is performed using the next valid value for D.

      Parameters:
      dnm - Integral D-statistic value (in [0, n*m]).
      n - First sample size.
      m - Second sample size.
      gcd - Greatest common divisor of n and m.
      strict - whether or not the probability to compute is expressed as a strict inequality.
      twoSided - whether D refers to D or D+.
      Returns:
      probability that a randomly selected m-n partition of m + n generates D greater than (resp. greater than or equal to) d (or -1)
    • twoSampleTwoSidedPStabilizedInner

      private static double twoSampleTwoSidedPStabilizedInner(long d, int n, int m, int gcd)
      Computes \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.

      The returned probability is exact, implemented using the stabilized inner method presented in Viehmann (2021).

      This is optimized for m <= n. If m > n and index-out-of-bounds exception can occur.

      Parameters:
      d - Integral D-statistic value (in [2, lcm])
      n - Larger sample size.
      m - Smaller sample size.
      gcd - Greatest common divisor of n and m.
      Returns:
      probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than or equal to d
    • twoSampleOneSidedPOutside

      private static double twoSampleOneSidedPOutside(long d, int n, int m, int gcd)
      Computes \(P(D_{n,m}^+ \ge d)\), where \(D_{n,m}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.

      The returned probability is exact, implemented using the outer method presented in Hodges (1958).

      This method will fail-fast and return -1 if the computation of the numbers of paths overflows.

      Parameters:
      d - Integral D-statistic value (in [0, lcm])
      n - First sample size.
      m - Second sample size.
      gcd - Greatest common divisor of n and m.
      Returns:
      probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than or equal to d
    • twoSampleOneSidedPOutsideSquare

      private static double twoSampleOneSidedPOutsideSquare(long d, int n)
      Computes \(P(D_{n,n}^+ \ge d)\), where \(D_{n,n}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.

      The returned probability is exact, implemented using the outer method presented in Hodges (1958).

      Parameters:
      d - Integral D-statistic value (in [1, lcm])
      n - Sample size.
      Returns:
      probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than or equal to d
    • twoSampleTwoSidedPOutsideSquare

      private static double twoSampleTwoSidedPOutsideSquare(long d, int n)
      Computes \(P(D_{n,n}^+ \ge d)\), where \(D_{n,n}^+\) is the 2-sample two-sided Kolmogorov-Smirnov statistic.

      The returned probability is exact, implemented using the outer method presented in Hodges (1958).

      Parameters:
      d - Integral D-statistic value (in [1, n])
      n - Sample size.
      Returns:
      probability that a randomly selected m-n partition of n + n generates \(D_{n,n}\) greater than or equal to d
    • binom

      private static double binom(int n, int k)
      Compute the binomial coefficient binom(n, k).
      Parameters:
      n - N.
      k - K.
      Returns:
      binom(n, k)
    • twoSampleApproximateP

      static double twoSampleApproximateP(double d, int n, int m, boolean twoSided)
      Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See statistic(double[], double[]) for the definition of \(D_{n,m}\).

      Specifically, what is returned is \(1 - CDF(d, \sqrt{mn / (m + n)})\) where CDF is the cumulative density function of the two-sided one-sample Kolmogorov-Smirnov distribution.

      Parameters:
      d - D-statistic value.
      n - First sample size.
      m - Second sample size.
      twoSided - True to compute the two-sided p-value; else one-sided.
      Returns:
      approximate probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than d
    • checkArrayLength

      private static int checkArrayLength(double[] array)
      Verifies that array has length at least 2.
      Parameters:
      array - Array to test.
      Returns:
      the length
      Throws:
      IllegalArgumentException - if array is too short
    • sort

      private static double[] sort(double[] x, String name)
      Sort the input array. Throws an exception if NaN values are present. It is assumed the array is non-zero length.
      Parameters:
      x - Input array.
      name - Name of the array.
      Returns:
      a reference to the input (sorted) array
      Throws:
      IllegalArgumentException - if x contains NaN values.