Class KolmogorovSmirnovTest

    • Field Summary

      Fields 
      Modifier and Type Field Description
      private AlternativeHypothesis alternative
      Alternative hypothesis.
      private static KolmogorovSmirnovTest DEFAULT
      Default instance.
      private static long[] IGNORED_D
      Placeholder to use for the two-sample ties D array when the value can be ignored.
      private static int[] IGNORED_SIGN
      Placeholder to use for the two-sample sign array when the value can be ignored.
      private int iterations
      Number of iterations .
      private static int LARGE_SAMPLE
      When the largest sample size exceeds this value, 2-sample test AUTO p-value uses an asymptotic distribution to compute the p-value.
      private static int MAX_ARRAY_SIZE
      Maximum length of an array.
      private static int MAX_FACTORIAL
      Maximum finite factorial.
      private static long MAX_LCM_TWO_SAMPLE_EXACT_P
      The maximum least common multiple (lcm) to attempt the exact p-value computation.
      private PValueMethod pValueMethod
      Method to compute the p-value.
      private org.apache.commons.rng.UniformRandomProvider rng
      Source of randomness.
      private static java.lang.String SAMPLE_1_NAME
      Name for sample 1.
      private static java.lang.String SAMPLE_2_NAME
      Name for sample 2.
      private boolean strictInequality
      Use a strict inequality for the two-sample exact p-value.
    • Method Summary

      All Methods Static Methods Instance Methods Concrete Methods 
      Modifier and Type Method Description
      private static double binom​(int n, int k)
      Compute the binomial coefficient binom(n, k).
      private static int checkArrayLength​(double[] array)
      Verifies that array has length at least 2.
      private static double computeD​(long dnm, int n, int m, int gcd)
      Compute the D statistic from the integral D value.
      private long computeIntegralKolmogorovSmirnovStatistic​(double[] x, double[] y, int[] sign, long[] tiesD)
      Computes the two-sample Kolmogorov-Smirnov test statistic.
      private double computeStatistic​(double[] x, java.util.function.DoubleUnaryOperator cdf, int[] sign)
      Computes the magnitude of the one-sample Kolmogorov-Smirnov test statistic.
      private static java.util.function.DoubleSupplier createSampler​(double[] x, double[] y, org.apache.commons.rng.UniformRandomProvider rng)
      Creates a sampler to sample randomly from the combined distribution of the two samples.
      (package private) static java.util.function.DoubleSupplier createSampler​(double[] x, double[] y, org.apache.commons.rng.UniformRandomProvider rng, int maxArraySize)
      Creates a sampler to sample randomly from the combined distribution of the two samples.
      private double estimateP​(double[] x, double[] y, long dnm)
      Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
      private static double[] sort​(double[] x, java.lang.String name)
      Sort the input array.
      double statistic​(double[] x, double[] y)
      Computes the two-sample Kolmogorov-Smirnov test statistic.
      double statistic​(double[] x, java.util.function.DoubleUnaryOperator cdf)
      Computes the one-sample Kolmogorov-Smirnov test statistic.
      KolmogorovSmirnovTest.TwoResult test​(double[] x, double[] y)
      Performs a two-sample Kolmogorov-Smirnov test on samples x and y.
      KolmogorovSmirnovTest.OneResult test​(double[] x, java.util.function.DoubleUnaryOperator cdf)
      Performs a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x conforms to the distribution cumulative density function (cdf).
      private static boolean testIntegralKolmogorovSmirnovStatistic​(double[] x, double[] y, long plus, long minus)
      Test if the two-sample integral Kolmogorov-Smirnov statistic is strictly greater than the specified values for D+ and D-.
      (package private) static double twoSampleApproximateP​(double d, int n, int m, boolean twoSided)
      Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
      (package private) static double twoSampleExactP​(long dnm, int n, int m, int gcd, boolean strict, boolean twoSided)
      Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\}.
      private static double twoSampleOneSidedPOutside​(long d, int n, int m, int gcd)
      Computes \(P(D_{n,m}^+ \ge d)\), where \(D_{n,m}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.
      private static double twoSampleOneSidedPOutsideSquare​(long d, int n)
      Computes \(P(D_{n,n}^+ \ge d)\), where \(D_{n,n}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.
      private double twoSampleP​(long dnm, int n, int m, int gcd, double d, boolean exact)
      Computes \(P(D_{n,m} > d)\) for the 2-sample Kolmogorov-Smirnov statistic.
      private static double twoSampleTwoSidedPOutsideSquare​(long d, int n)
      Computes \(P(D_{n,n}^+ \ge d)\), where \(D_{n,n}^+\) is the 2-sample two-sided Kolmogorov-Smirnov statistic.
      private static double twoSampleTwoSidedPStabilizedInner​(long d, int n, int m, int gcd)
      Computes \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
      KolmogorovSmirnovTest with​(org.apache.commons.rng.UniformRandomProvider v)
      Return an instance with the configured source of randomness.
      KolmogorovSmirnovTest with​(AlternativeHypothesis v)
      Return an instance with the configured alternative hypothesis.
      KolmogorovSmirnovTest with​(Inequality v)
      Return an instance with the configured inequality.
      KolmogorovSmirnovTest with​(PValueMethod v)
      Return an instance with the configured p-value method.
      static KolmogorovSmirnovTest withDefaults()
      Return an instance using the default options.
      KolmogorovSmirnovTest withIterations​(int v)
      Return an instance with the configured number of iterations.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Field Detail

      • SAMPLE_1_NAME

        private static final java.lang.String SAMPLE_1_NAME
        Name for sample 1.
        See Also:
        Constant Field Values
      • SAMPLE_2_NAME

        private static final java.lang.String SAMPLE_2_NAME
        Name for sample 2.
        See Also:
        Constant Field Values
      • LARGE_SAMPLE

        private static final int LARGE_SAMPLE
        When the largest sample size exceeds this value, 2-sample test AUTO p-value uses an asymptotic distribution to compute the p-value.
        See Also:
        Constant Field Values
      • MAX_FACTORIAL

        private static final int MAX_FACTORIAL
        Maximum finite factorial.
        See Also:
        Constant Field Values
      • MAX_ARRAY_SIZE

        private static final int MAX_ARRAY_SIZE
        Maximum length of an array. This is used to determine if two arrays can be concatenated to create a sampler from the joint distribution. The limit is copied from the limit of java.util.ArrayList.
        See Also:
        Constant Field Values
      • MAX_LCM_TWO_SAMPLE_EXACT_P

        private static final long MAX_LCM_TWO_SAMPLE_EXACT_P
        The maximum least common multiple (lcm) to attempt the exact p-value computation. The integral d value is in [0, n*m] in steps of the greatest common denominator (gcd), thus lcm = n*m/gcd is the number of possible different p-values. Some methods have a lower limit due to computation limits. This should be larger than LARGE_SAMPLE^2 so all AUTO p-values attempt an exact computation, i.e. at least 10000^2 ~ 2^26.56.
        See Also:
        Constant Field Values
      • IGNORED_SIGN

        private static final int[] IGNORED_SIGN
        Placeholder to use for the two-sample sign array when the value can be ignored.
      • IGNORED_D

        private static final long[] IGNORED_D
        Placeholder to use for the two-sample ties D array when the value can be ignored.
      • pValueMethod

        private final PValueMethod pValueMethod
        Method to compute the p-value.
      • strictInequality

        private final boolean strictInequality
        Use a strict inequality for the two-sample exact p-value.
      • rng

        private final org.apache.commons.rng.UniformRandomProvider rng
        Source of randomness.
      • iterations

        private final int iterations
        Number of iterations .
    • Constructor Detail

      • KolmogorovSmirnovTest

        private KolmogorovSmirnovTest​(AlternativeHypothesis alternative,
                                      PValueMethod method,
                                      boolean strict,
                                      org.apache.commons.rng.UniformRandomProvider rng,
                                      int iterations)
        Parameters:
        alternative - Alternative hypothesis.
        method - P-value method.
        strict - true to use a strict inequality.
        rng - Source of randomness.
        iterations - Number of iterations.
    • Method Detail

      • with

        public KolmogorovSmirnovTest with​(PValueMethod v)
        Return an instance with the configured p-value method.

        For the one-sample two-sided test Kolmogorov's asymptotic approximation can be used; otherwise the p-value uses the distribution of the D statistic.

        For the two-sample test the exact p-value can be computed for small sample sizes; otherwise the p-value resorts to the asymptotic approximation. Alternatively a p-value can be estimated from the combined distribution of the samples. This requires a source of randomness.

        Parameters:
        v - Value.
        Returns:
        an instance
        See Also:
        with(UniformRandomProvider)
      • with

        public KolmogorovSmirnovTest with​(Inequality v)
        Return an instance with the configured inequality.

        Computes the p-value for the two-sample test as \(P(D_{n,m} > d)\) if strict; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\) or \(D_{n,m}^-\).

        Parameters:
        v - Value.
        Returns:
        an instance
      • with

        public KolmogorovSmirnovTest with​(org.apache.commons.rng.UniformRandomProvider v)
        Return an instance with the configured source of randomness.

        Applies to the two-sample test when the p-value method is set to ESTIMATE. The randomness is used for sampling of the combined distribution.

        There is no default source of randomness. If the randomness is not set then estimation is not possible and an IllegalStateException will be raised in the two-sample test.

        Parameters:
        v - Value.
        Returns:
        an instance
        See Also:
        with(PValueMethod)
      • withIterations

        public KolmogorovSmirnovTest withIterations​(int v)
        Return an instance with the configured number of iterations.

        Applies to the two-sample test when the p-value method is set to ESTIMATE. This is the number of sampling iterations used to estimate the p-value. The p-value is a fraction using the iterations as the denominator. The number of significant digits in the p-value is upper bounded by log10(iterations); small p-values have fewer significant digits. A large number of iterations is recommended when using a small critical value to reject the null hypothesis.

        Parameters:
        v - Value.
        Returns:
        an instance
        Throws:
        java.lang.IllegalArgumentException - if the number of iterations is not strictly positive
      • statistic

        public double statistic​(double[] x,
                                java.util.function.DoubleUnaryOperator cdf)
        Computes the one-sample Kolmogorov-Smirnov test statistic.
        • two-sided: \(D_n=\sup_x |F_n(x)-F(x)|\)
        • greater: \(D_n^+=\sup_x (F_n(x)-F(x))\)
        • less: \(D_n^-=\sup_x (F(x)-F_n(x))\)

        where \(F\) is the distribution cumulative density function (cdf), \(n\) is the length of x and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x.

        The cumulative distribution function should map a real value x to a probability in [0, 1]. To use a reference distribution the CDF can be passed using a method reference:

         UniformContinuousDistribution dist = UniformContinuousDistribution.of(0, 1);
         UniformRandomProvider rng = RandomSource.KISS.create(123);
         double[] x = dist.sampler(rng).samples(100);
         double d = KolmogorovSmirnovTest.withDefaults().statistic(x, dist::cumulativeProbability);
         
        Parameters:
        cdf - Reference cumulative distribution function.
        x - Sample being evaluated.
        Returns:
        Kolmogorov-Smirnov statistic
        Throws:
        java.lang.IllegalArgumentException - if data does not have length at least 2; or contains NaN values.
        See Also:
        test(double[], DoubleUnaryOperator)
      • statistic

        public double statistic​(double[] x,
                                double[] y)
        Computes the two-sample Kolmogorov-Smirnov test statistic.
        • two-sided: \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\)
        • greater: \(D_{n,m}^+=\sup_x (F_n(x)-F_m(x))\)
        • less: \(D_{n,m}^-=\sup_x (F_m(x)-F_n(x))\)

        where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) at each of the values in y.

        Parameters:
        x - First sample.
        y - Second sample.
        Returns:
        Kolmogorov-Smirnov statistic
        Throws:
        java.lang.IllegalArgumentException - if either x or y does not have length at least 2; or contain NaN values.
        See Also:
        test(double[], double[])
      • test

        public KolmogorovSmirnovTest.OneResult test​(double[] x,
                                                    java.util.function.DoubleUnaryOperator cdf)
        Performs a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x conforms to the distribution cumulative density function (cdf).

        The test is defined by the AlternativeHypothesis:

        • Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F(i)\) for all \( i \); the alternative is that the are not identical. The statistic is \( max(D_n^+, D_n^-) \) and the sign of \( D \) is provided.
        • Greater evaluates the null hypothesis that the \(F_n(i) <= F(i)\) for all \( i \); the alternative is \(F_n(i) > F(i)\) for at least one \( i \). The statistic is \( D_n^+ \).
        • Less evaluates the null hypothesis that the \(F_n(i) >= F(i)\) for all \( i \); the alternative is \(F_n(i) < F(i)\) for at least one \( i \). The statistic is \( D_n^- \).

        The p-value method defaults to exact. The one-sided p-value uses Smirnov's stable formula:

        \[ P(D_n^+ \ge x) = x \sum_{j=0}^{\lfloor n(1-x) \rfloor} \binom{n}{j} \left(\frac{j}{n} + x\right)^{j-1} \left(1-x-\frac{j}{n} \right)^{n-j} \]

        The two-sided p-value is computed using methods described in Simard & L’Ecuyer (2011). The two-sided test supports an asymptotic p-value using Kolmogorov's formula:

        \[ \lim_{n\to\infty} P(\sqrt{n}D_n > z) = 2 \sum_{i=1}^\infty (-1)^{i-1} e^{-2 i^2 z^2} \]

        Parameters:
        x - Sample being being evaluated.
        cdf - Reference cumulative distribution function.
        Returns:
        test result
        Throws:
        java.lang.IllegalArgumentException - if data does not have length at least 2; or contains NaN values.
        See Also:
        statistic(double[], DoubleUnaryOperator)
      • test

        public KolmogorovSmirnovTest.TwoResult test​(double[] x,
                                                    double[] y)
        Performs a two-sample Kolmogorov-Smirnov test on samples x and y. Test the empirical distributions \(F_n\) and \(F_m\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution that puts mass \(1/m\) of the y values.

        The test is defined by the AlternativeHypothesis:

        • Two-sided evaluates the null hypothesis that the two distributions are identical, \(F_n(i) = F_m(i)\) for all \( i \); the alternative is that they are not identical. The statistic is \( max(D_n^+, D_n^-) \) and the sign of \( D \) is provided.
        • Greater evaluates the null hypothesis that the \(F_n(i) <= F_m(i)\) for all \( i \); the alternative is \(F_n(i) > F_m(i)\) for at least one \( i \). The statistic is \( D_n^+ \).
        • Less evaluates the null hypothesis that the \(F_n(i) >= F_m(i)\) for all \( i \); the alternative is \(F_n(i) < F_m(i)\) for at least one \( i \). The statistic is \( D_n^- \).

        If the p-value method is auto, then an exact p computation is attempted if both sample sizes are less than 10000 using the methods presented in Viehmann (2021) and Hodges (1958); otherwise an asymptotic p-value is returned. The two-sided p-value is \(\overline{F}(d, \sqrt{mn / (m + n)})\) where \(\overline{F}\) is the complementary cumulative density function of the two-sided one-sample Kolmogorov-Smirnov distribution. The one-sided p-value uses an approximation from Hodges (1958) Eq 5.3.

        \(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} \gt d \) differ from \(H_0 : D_{n,m} \ge d \) by the mass of the observed value \(d\). These can be distinguished using an Inequality parameter. This is ignored for large samples.

        If the data contains ties there is no defined ordering in the tied region to use for the difference between the two empirical distributions. Each ordering of the tied region may create a different D statistic. All possible orderings generate a distribution for the D value. In this case the tied region is traversed entirely and the effect on the D value evaluated at the end of the tied region. This is the path with the least change on the D statistic. The path with the greatest change on the D statistic is also computed as the upper bound on D. If these two values are different then the tied region is known to generate a distribution for the D statistic and the p-value is an over estimate for the cases with a larger D statistic. The presence of any significant tied regions is returned in the result.

        If the p-value method is ESTIMATE then the p-value is estimated by repeat sampling of the joint distribution of x and y. The p-value is the frequency that a sample creates a D statistic greater than or equal to (or greater than for strict inequality) the observed value. In this case a source of randomness must be configured or an IllegalStateException will be raised. The p-value for the upper bound on D will not be estimated and is set to NaN. This estimation procedure is not affected by ties in the data and is increasingly robust for larger datasets. The method is modeled after ks.boot in the R Matching package (Sekhon (2011)).

        Parameters:
        x - First sample.
        y - Second sample.
        Returns:
        test result
        Throws:
        java.lang.IllegalArgumentException - if either x or y does not have length at least 2; or contain NaN values.
        java.lang.IllegalStateException - if the p-value method is ESTIMATE and there is no source of randomness.
        See Also:
        statistic(double[], double[])
      • estimateP

        private double estimateP​(double[] x,
                                 double[] y,
                                 long dnm)
        Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.

        This method will destructively modify the input arrays (via a sort).

        This method estimates the p-value by repeatedly sampling sets of size x.length and y.length from the empirical distribution of the combined sample. The memory requirement is an array copy of each of the input arguments.

        When using strict inequality, this is equivalent to the algorithm implemented in the R function ks.boot as described in Sekhon (2011) Journal of Statistical Software, 42(7), 1–52 [3].

        Parameters:
        x - First sample.
        y - Second sample.
        dnm - Integral D statistic.
        Returns:
        p-value
        Throws:
        java.lang.IllegalStateException - if the source of randomness is null.
      • computeStatistic

        private double computeStatistic​(double[] x,
                                        java.util.function.DoubleUnaryOperator cdf,
                                        int[] sign)
        Computes the magnitude of the one-sample Kolmogorov-Smirnov test statistic. The sign of the statistic is optionally returned in sign. For the two-sided case the sign is 0 if the magnitude of D+ and D- was equal; otherwise it indicates which D had the larger magnitude.
        Parameters:
        x - Sample being evaluated.
        cdf - Reference cumulative distribution function.
        sign - Sign of the statistic (non-zero length).
        Returns:
        Kolmogorov-Smirnov statistic
        Throws:
        java.lang.IllegalArgumentException - if data does not have length at least 2; or contains NaN values.
      • computeIntegralKolmogorovSmirnovStatistic

        private long computeIntegralKolmogorovSmirnovStatistic​(double[] x,
                                                               double[] y,
                                                               int[] sign,
                                                               long[] tiesD)
        Computes the two-sample Kolmogorov-Smirnov test statistic. The statistic is integral and has a value in [0, n*m]. Not all values are possible; the smallest increment is the greatest common divisor of n and m.

        This method will destructively modify the input arrays (via a sort).

        The sign of the statistic is returned in sign. For the two-sided case the sign is 0 if the magnitude of D+ and D- was equal; otherwise it indicates which D had the larger magnitude. If the two-sided statistic is zero the two arrays are identical, or are 'identical' data of a single value (sample sizes may be different), or have a sequence of ties of 'identical' data with a net zero effect on the D statistic e.g.

          [1,2,3]           vs [1,2,3]
          [0,0,0,0]         vs [0,0,0]
          [0,0,0,0,1,1,1,1] vs [0,0,0,1,1,1]
         

        This method detects ties in the input data. Not all ties will invalidate the returned statistic. Ties between the data can be interpreted as if the values were different but within machine epsilon. In this case the path through the tie region is not known. All paths through the tie region end at the same point. This method will compute the most extreme path through each tie region and the D statistic for these paths. If the ties D statistic is a larger magnitude than the returned D statistic then at least one tie region lies at a point on the full path that could result in a different statistic in the absence of ties. This signals the P-value computed using the returned D statistic is one of many possible p-values given the data and how ties are resolved. Note: The tiesD value may be smaller than the returned D statistic as it is not set to the maximum of D or tiesD. The only result of interest is when tiesD > D due to a tie region that can change the output D. On output tiesD[0] != 0 if there were ties between samples and tiesD[1] = D of the most extreme path through the ties.

        Parameters:
        x - First sample (destructively modified by sort).
        y - Second sample (destructively modified by sort).
        sign - Sign of the statistic (non-zero length).
        tiesD - Integral statistic for the most extreme path through any ties (length at least 2).
        Returns:
        integral Kolmogorov-Smirnov statistic
        Throws:
        java.lang.IllegalArgumentException - if either x or y contain NaN values.
      • testIntegralKolmogorovSmirnovStatistic

        private static boolean testIntegralKolmogorovSmirnovStatistic​(double[] x,
                                                                      double[] y,
                                                                      long plus,
                                                                      long minus)
        Test if the two-sample integral Kolmogorov-Smirnov statistic is strictly greater than the specified values for D+ and D-. Note that D- should have a negative sign to impose an inclusive lower bound.

        This method will destructively modify the input arrays (via a sort).

        For a two-sided alternative hypothesis plus and minus should have the same magnitude with opposite signs.

        For a one-sided alternative hypothesis the value of plus or minus should have the expected value of the statistic, and the opposite D should have the maximum or minimum long value. The minus should be negatively signed:

        • greater: plus = D, minus = Long.MIN_VALUE
        • greater: minus = -D, plus = Long.MAX_VALUE

        Note: This method has not been specialized for the one-sided case. Specialization would eliminate a conditional branch for d > Long.MAX_VALUE or d < Long.MIN_VALUE. Since these branches are never possible in the one-sided case this should be efficiently chosen by branch prediction in a processor pipeline.

        Parameters:
        x - First sample (destructively modified by sort; must not contain NaN).
        y - Second sample (destructively modified by sort; must not contain NaN).
        plus - Limit on D+ (inclusive upper bound).
        minus - Limit on D- (inclusive lower bound).
        Returns:
        true if the D value exceeds the provided limits
      • createSampler

        private static java.util.function.DoubleSupplier createSampler​(double[] x,
                                                                       double[] y,
                                                                       org.apache.commons.rng.UniformRandomProvider rng)
        Creates a sampler to sample randomly from the combined distribution of the two samples.
        Parameters:
        x - First sample.
        y - Second sample.
        rng - Source of randomness.
        Returns:
        the sampler
      • createSampler

        static java.util.function.DoubleSupplier createSampler​(double[] x,
                                                               double[] y,
                                                               org.apache.commons.rng.UniformRandomProvider rng,
                                                               int maxArraySize)
        Creates a sampler to sample randomly from the combined distribution of the two samples. This will copy the input data for the sampler.
        Parameters:
        x - First sample.
        y - Second sample.
        rng - Source of randomness.
        maxArraySize - Maximum size of a single array.
        Returns:
        the sampler
      • computeD

        private static double computeD​(long dnm,
                                       int n,
                                       int m,
                                       int gcd)
        Compute the D statistic from the integral D value.
        Parameters:
        dnm - Integral D-statistic value (in [0, n*m]).
        n - First sample size.
        m - Second sample size.
        gcd - Greatest common divisor of n and m.
        Returns:
        D-statistic value (in [0, 1]).
      • twoSampleP

        private double twoSampleP​(long dnm,
                                  int n,
                                  int m,
                                  int gcd,
                                  double d,
                                  boolean exact)
        Computes \(P(D_{n,m} > d)\) for the 2-sample Kolmogorov-Smirnov statistic.
        Parameters:
        dnm - Integral D-statistic value (in [0, n*m]).
        n - First sample size.
        m - Second sample size.
        gcd - Greatest common divisor of n and m.
        d - D-statistic value (in [0, 1]).
        exact - whether to compute the exact probability; otherwise approximate.
        Returns:
        probability
        See Also:
        twoSampleExactP(long, int, int, int, boolean, boolean), twoSampleApproximateP(double, int, int, boolean)
      • twoSampleExactP

        static double twoSampleExactP​(long dnm,
                                      int n,
                                      int m,
                                      int gcd,
                                      boolean strict,
                                      boolean twoSided)
        Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic, either the two-sided \(D_{n,m}\) or one-sided \(D_{n,m}^+\}. See statistic(double[], double[]) for the definition of \(D_{n,m}\).

        The returned probability is exact. If the value cannot be computed this returns -1.

        Note: This requires the greatest common divisor of n and m. The integral D statistic in the range [0, n*m] is separated by increments of the gcd. The method will only compute p-values for valid values of D by calculating for D/gcd. Strict inquality is performed using the next valid value for D.

        Parameters:
        dnm - Integral D-statistic value (in [0, n*m]).
        n - First sample size.
        m - Second sample size.
        gcd - Greatest common divisor of n and m.
        strict - whether or not the probability to compute is expressed as a strict inequality.
        twoSided - whether D refers to D or D+.
        Returns:
        probability that a randomly selected m-n partition of m + n generates D greater than (resp. greater than or equal to) d (or -1)
      • twoSampleTwoSidedPStabilizedInner

        private static double twoSampleTwoSidedPStabilizedInner​(long d,
                                                                int n,
                                                                int m,
                                                                int gcd)
        Computes \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.

        The returned probability is exact, implemented using the stabilized inner method presented in Viehmann (2021).

        This is optimized for m <= n. If m > n and index-out-of-bounds exception can occur.

        Parameters:
        d - Integral D-statistic value (in [2, lcm])
        n - Larger sample size.
        m - Smaller sample size.
        gcd - Greatest common divisor of n and m.
        Returns:
        probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than or equal to d
      • twoSampleOneSidedPOutside

        private static double twoSampleOneSidedPOutside​(long d,
                                                        int n,
                                                        int m,
                                                        int gcd)
        Computes \(P(D_{n,m}^+ \ge d)\), where \(D_{n,m}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.

        The returned probability is exact, implemented using the outer method presented in Hodges (1958).

        This method will fail-fast and return -1 if the computation of the numbers of paths overflows.

        Parameters:
        d - Integral D-statistic value (in [0, lcm])
        n - First sample size.
        m - Second sample size.
        gcd - Greatest common divisor of n and m.
        Returns:
        probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than or equal to d
      • twoSampleOneSidedPOutsideSquare

        private static double twoSampleOneSidedPOutsideSquare​(long d,
                                                              int n)
        Computes \(P(D_{n,n}^+ \ge d)\), where \(D_{n,n}^+\) is the 2-sample one-sided Kolmogorov-Smirnov statistic.

        The returned probability is exact, implemented using the outer method presented in Hodges (1958).

        Parameters:
        d - Integral D-statistic value (in [1, lcm])
        n - Sample size.
        Returns:
        probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than or equal to d
      • twoSampleTwoSidedPOutsideSquare

        private static double twoSampleTwoSidedPOutsideSquare​(long d,
                                                              int n)
        Computes \(P(D_{n,n}^+ \ge d)\), where \(D_{n,n}^+\) is the 2-sample two-sided Kolmogorov-Smirnov statistic.

        The returned probability is exact, implemented using the outer method presented in Hodges (1958).

        Parameters:
        d - Integral D-statistic value (in [1, n])
        n - Sample size.
        Returns:
        probability that a randomly selected m-n partition of n + n generates \(D_{n,n}\) greater than or equal to d
      • binom

        private static double binom​(int n,
                                    int k)
        Compute the binomial coefficient binom(n, k).
        Parameters:
        n - N.
        k - K.
        Returns:
        binom(n, k)
      • twoSampleApproximateP

        static double twoSampleApproximateP​(double d,
                                            int n,
                                            int m,
                                            boolean twoSided)
        Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See statistic(double[], double[]) for the definition of \(D_{n,m}\).

        Specifically, what is returned is \(1 - CDF(d, \sqrt{mn / (m + n)})\) where CDF is the cumulative density function of the two-sided one-sample Kolmogorov-Smirnov distribution.

        Parameters:
        d - D-statistic value.
        n - First sample size.
        m - Second sample size.
        twoSided - True to compute the two-sided p-value; else one-sided.
        Returns:
        approximate probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than d
      • checkArrayLength

        private static int checkArrayLength​(double[] array)
        Verifies that array has length at least 2.
        Parameters:
        array - Array to test.
        Returns:
        the length
        Throws:
        java.lang.IllegalArgumentException - if array is too short
      • sort

        private static double[] sort​(double[] x,
                                     java.lang.String name)
        Sort the input array. Throws an exception if NaN values are present. It is assumed the array is non-zero length.
        Parameters:
        x - Input array.
        name - Name of the array.
        Returns:
        a reference to the input (sorted) array
        Throws:
        java.lang.IllegalArgumentException - if x contains NaN values.