Class KolmogorovSmirnovTest


  • public class KolmogorovSmirnovTest
    extends java.lang.Object
    Implementation of the Kolmogorov-Smirnov (K-S) test for equality of continuous distributions.

    The K-S test uses a statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis. For one-sample tests evaluating the null hypothesis that a set of sample data points follows a given distribution, the test statistic is \(D_n=\sup_x |F_n(x)-F(x)|\), where \(F\) is the expected distribution and \(F_n\) is the empirical distribution of the \(n\) sample data points. The distribution of \(D_n\) is estimated using a method based on [1] with certain quick decisions for extreme values given in [2].

    Two-sample tests are also supported, evaluating the null hypothesis that the two samples x and y come from the same underlying distribution. In this case, the test statistic is \(D_{n,m}=\sup_t |F_n(t)-F_m(t)|\), where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values. The default 2-sample test method, kolmogorovSmirnovTest(double[], double[]), works as follows:

    • For small samples (where the product of the sample sizes is less than 10000), the method presented in [4] is used to compute the exact p-value for the 2-sample test.
    • When the product of the sample sizes exceeds 10000, the asymptotic distribution of \(D_{n,m}\) is used. See approximateP(double, int, int) for details on the approximation.

    If the product of the sample sizes is less than 10000 and the sample data contains ties, random jitter is added to the sample data to break ties before applying the algorithm above. Alternatively, the bootstrap(double[], double[], int, boolean) method, modeled after ks.boot in the R Matching package [3], can be used if ties are known to be present in the data.

    In the two-sample case, \(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} \ge d \) differ from \(H_0 : D_{n,m} > d \) by the mass of the observed value \(d\). To distinguish these, the two-sample tests use a boolean strict parameter. This parameter is ignored for large samples.
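    The relationship between the two forms can be written out explicitly. As an illustration, take \(n = m = 2\) with no ties: the six equally likely orderings of the pooled sample give \(D_{2,2} = 1\) twice and \(D_{2,2} = 1/2\) four times, so at an observed value of \(d = 1\) the two p-values differ by exactly the mass at \(d\).

```latex
P(D_{n,m} \ge d) = P(D_{n,m} > d) + P(D_{n,m} = d)

% Worked case: n = m = 2, no ties, observed d = 1
P(D_{2,2} \ge 1) = \tfrac{2}{6} = \tfrac{1}{3}, \qquad P(D_{2,2} > 1) = 0
```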

    The methods used by the 2-sample default implementation, exactP(double, int, int, boolean) and approximateP(double, int, int), are also exposed directly.

    References:


    Note that [1] contains an error in computing h; refer to MATH-437 for details.

    Since:
    3.3
    • Method Summary

      Modifier and Type Method Description
      double approximateP​(double d, int n, int m)
      Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
      double bootstrap​(double[] x, double[] y, int iterations)
      Computes bootstrap(x, y, iterations, true).
      double bootstrap​(double[] x, double[] y, int iterations, boolean strict)
      Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
      private static int c​(int i, int j, int m, int n, long cmn, boolean strict)
      The function C(i, j) defined in [4] (class javadoc), formula (5.5).
      private static long calculateIntegralD​(double d, int n, int m, boolean strict)
      Given a d-statistic in the range [0, 1] and the two sample sizes n and m, an integral d-statistic in the range [0, n*m] is calculated, that can be used for comparison with other integral d-statistics.
      double cdf​(double d, int n)
      Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).
      double cdf​(double d, int n, boolean exact)
      Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).
      double cdfExact​(double d, int n)
      Calculates P(D_n < d).
      private void checkArray​(double[] array)
      Verifies that array has length at least 2.
      private FieldMatrix<BigFraction> createExactH​(double d, int n)
      Creates H of size m x m as described in [1] (see above).
      private RealMatrix createRoundedH​(double d, int n)
      Creates H of size m x m as described in [1] (see above) using double-precision.
      private double exactK​(double d, int n)
      Calculates the exact value of P(D_n < d) using the method described in [1] (reference in class javadoc above) and BigFraction (see above).
      double exactP​(double d, int n, int m, boolean strict)
      Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
      (package private) static void fillBooleanArrayRandomlyWithFixedNumberTrueValues​(boolean[] b, int numberOfTrueValues, RandomGenerator rng)
      Fills a boolean array randomly with a fixed number of true values.
      private static void fixTies​(double[] x, double[] y)
      If there are no ties in the combined dataset formed from x and y, this method is a no-op.
      private static boolean hasTies​(double[] x, double[] y)
      Returns true iff there are ties in the combined sample formed from x and y.
      private long integralKolmogorovSmirnovStatistic​(double[] x, double[] y)
      Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values.
      private double integralMonteCarloP​(long d, int n, int m, int iterations)
      Uses Monte Carlo simulation to approximate \(P(D_{n,m} \ge d/(nm))\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
      private static void jitter​(double[] data, RealDistribution dist)
      Adds random jitter to data using deviates sampled from dist.
      double kolmogorovSmirnovStatistic​(double[] x, double[] y)
      Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values.
      double kolmogorovSmirnovStatistic​(RealDistribution distribution, double[] data)
      Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where \(F\) is the distribution (cdf) function associated with distribution, \(n\) is the length of data and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in data.
      double kolmogorovSmirnovTest​(double[] x, double[] y)
      Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
      double kolmogorovSmirnovTest​(double[] x, double[] y, boolean strict)
      Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution.
      double kolmogorovSmirnovTest​(RealDistribution distribution, double[] data)
      Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
      double kolmogorovSmirnovTest​(RealDistribution distribution, double[] data, boolean exact)
      Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
      boolean kolmogorovSmirnovTest​(RealDistribution distribution, double[] data, double alpha)
      Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
      double ksSum​(double t, double tolerance, int maxIterations)
      Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed.
      double monteCarloP​(double d, int n, int m, boolean strict, int iterations)
      Uses Monte Carlo simulation to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.
      private static double n​(int i, int j, int m, int n, long cnm, boolean strict)
      The function N(i, j) defined in [4] (class javadoc).
      double pelzGood​(double d, int n)
      Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.
      private double roundedK​(double d, int n)
      Calculates \(P(D_n < d)\) using the method described in [1] (see above) with double-precision arithmetic.
      • Methods inherited from class java.lang.Object

        clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
    • Constructor Detail

      • KolmogorovSmirnovTest

        public KolmogorovSmirnovTest()
        Construct a KolmogorovSmirnovTest instance with a default random data generator.
      • KolmogorovSmirnovTest

        @Deprecated
        public KolmogorovSmirnovTest​(RandomGenerator rng)
        Deprecated.
        Construct a KolmogorovSmirnovTest with the provided random data generator. The monteCarloP(double, int, int, boolean, int) method, which uses the generator supplied to this constructor, is deprecated as of version 3.6.
        Parameters:
        rng - random data generator used by monteCarloP(double, int, int, boolean, int)
    • Method Detail

      • kolmogorovSmirnovTest

        public double kolmogorovSmirnovTest​(RealDistribution distribution,
                                            double[] data,
                                            boolean exact)
        Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution. If exact is true, the distribution used to compute the p-value is computed using extended precision. See cdfExact(double, int).
        Parameters:
        distribution - reference distribution
        data - sample being evaluated
        exact - whether or not to force exact computation of the p-value
        Returns:
        the p-value associated with the null hypothesis that data is a sample from distribution
        Throws:
        InsufficientDataException - if data does not have length at least 2
        NullArgumentException - if data is null
      • kolmogorovSmirnovStatistic

        public double kolmogorovSmirnovStatistic​(RealDistribution distribution,
                                                 double[] data)
        Computes the one-sample Kolmogorov-Smirnov test statistic, \(D_n=\sup_x |F_n(x)-F(x)|\) where \(F\) is the distribution (cdf) function associated with distribution, \(n\) is the length of data and \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in data.
        Parameters:
        distribution - reference distribution
        data - sample being evaluated
        Returns:
        Kolmogorov-Smirnov statistic \(D_n\)
        Throws:
        InsufficientDataException - if data does not have length at least 2
        NullArgumentException - if data is null
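        The definition of \(D_n\) can be made concrete with a minimal sketch. It is written in plain Java and is not the library's implementation: the class name OneSampleKsSketch is hypothetical, and a DoubleUnaryOperator stands in for the cdf of a RealDistribution so that the example is self-contained. It assumes a continuous reference cdf.

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

// Hypothetical helper class, for illustration only.
final class OneSampleKsSketch {

    /** Returns sup_x |F_n(x) - F(x)| for a continuous reference cdf F. */
    static double statistic(DoubleUnaryOperator cdf, double[] data) {
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double d = 0.0;
        for (int i = 0; i < n; i++) {
            double f = cdf.applyAsDouble(sorted[i]);
            // F_n jumps from i/n to (i + 1)/n at sorted[i]; the supremum is
            // attained on one side of some jump, so check both sides.
            double below = f - (double) i / n;
            double above = (double) (i + 1) / n - f;
            d = Math.max(d, Math.max(below, above));
        }
        return d;
    }
}
```

        For example, against the uniform cdf on [0, 1] (the identity function on that interval), the sample {0.1, 0.2} gives a statistic of 0.8, attained at x = 0.2 where \(F_n\) reaches 1.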
      • kolmogorovSmirnovTest

        public double kolmogorovSmirnovTest​(double[] x,
                                            double[] y,
                                            boolean strict)
        Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Specifically, what is returned is an estimate of the probability that the kolmogorovSmirnovStatistic(double[], double[]) associated with a randomly selected partition of the combined sample into subsamples of sizes x.length and y.length will strictly exceed (if strict is true) or be at least as large as (if strict is false) kolmogorovSmirnovStatistic(x, y).
        • For small samples (where the product of the sample sizes is less than 10000), the exact p-value is computed using the method presented in [4], implemented in exactP(double, int, int, boolean).
        • When the product of the sample sizes exceeds 10000, the asymptotic distribution of \(D_{n,m}\) is used. See approximateP(double, int, int) for details on the approximation.

        If x.length * y.length < 10000 and the combined set of values in x and y contains ties, random jitter is added to x and y to break ties before computing \(D_{n,m}\) and the p-value. The jitter is uniformly distributed on (-minDelta / 2, minDelta / 2) where minDelta is the smallest pairwise difference between values in the combined sample.

        If ties are known to be present in the data, bootstrap(double[], double[], int, boolean) may be used as an alternative method for estimating the p-value.

        Parameters:
        x - first sample dataset
        y - second sample dataset
        strict - whether or not the probability to compute is expressed as a strict inequality (ignored for large samples)
        Returns:
        p-value associated with the null hypothesis that x and y represent samples from the same distribution
        Throws:
        InsufficientDataException - if either x or y does not have length at least 2
        NullArgumentException - if either x or y is null
        See Also:
        bootstrap(double[], double[], int, boolean)
      • kolmogorovSmirnovTest

        public double kolmogorovSmirnovTest​(double[] x,
                                            double[] y)
        Computes the p-value, or observed significance level, of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. Assumes the strict form of the inequality used to compute the p-value. See kolmogorovSmirnovTest(double[], double[], boolean).
        Parameters:
        x - first sample dataset
        y - second sample dataset
        Returns:
        p-value associated with the null hypothesis that x and y represent samples from the same distribution
        Throws:
        InsufficientDataException - if either x or y does not have length at least 2
        NullArgumentException - if either x or y is null
      • kolmogorovSmirnovStatistic

        public double kolmogorovSmirnovStatistic​(double[] x,
                                                 double[] y)
        Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values.
        Parameters:
        x - first sample
        y - second sample
        Returns:
        test statistic \(D_{n,m}\) used to evaluate the null hypothesis that x and y represent samples from the same underlying distribution
        Throws:
        InsufficientDataException - if either x or y does not have length at least 2
        NullArgumentException - if either x or y is null
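        A minimal sketch of how \(D_{n,m}\) can be computed by a single merge pass over the two sorted samples. This is an illustration under stated assumptions, not the library's code: the class and method names are hypothetical, and the difference \(F_n - F_m\) is tracked as the integer \(i m - j n\) so that ties and comparisons are handled without rounding.

```java
import java.util.Arrays;

// Hypothetical helper class, for illustration only.
final class TwoSampleKsSketch {

    /** Returns n * m * D_{n,m} as an exact integer. */
    static long integralStatistic(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length;
        int m = sy.length;
        int i = 0;
        int j = 0;
        long max = 0L;
        while (i < n || j < m) {
            // Next distinct value of the pooled sample, in Double.compare order.
            double z = (j >= m || (i < n && Double.compare(sx[i], sy[j]) <= 0)) ? sx[i] : sy[j];
            // Step past every copy of z in both samples before comparing the cdfs.
            while (i < n && Double.compare(sx[i], z) == 0) i++;
            while (j < m && Double.compare(sy[j], z) == 0) j++;
            // n * m * |F_n(z) - F_m(z)| = |i * m - j * n|
            max = Math.max(max, Math.abs((long) i * m - (long) j * n));
        }
        return max;
    }

    /** Returns D_{n,m} = sup_t |F_n(t) - F_m(t)|. */
    static double statistic(double[] x, double[] y) {
        return integralStatistic(x, y) / ((double) x.length * y.length);
    }
}
```

        With x = {1, 3} and y = {2, 4} the two empirical distributions never differ by more than 1/2, so the sketch returns 0.5; fully separated samples give 1.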
      • integralKolmogorovSmirnovStatistic

        private long integralKolmogorovSmirnovStatistic​(double[] x,
                                                        double[] y)
        Computes the two-sample Kolmogorov-Smirnov test statistic, \(D_{n,m}=\sup_x |F_n(x)-F_m(x)|\) where \(n\) is the length of x, \(m\) is the length of y, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in x and \(F_m\) is the empirical distribution of the y values. The value returned is \(n m D_{n,m}\), as a long.
        Parameters:
        x - first sample
        y - second sample
        Returns:
        test statistic \(n m D_{n,m}\) used to evaluate the null hypothesis that x and y represent samples from the same underlying distribution
        Throws:
        InsufficientDataException - if either x or y does not have length at least 2
        NullArgumentException - if either x or y is null
      • kolmogorovSmirnovTest

        public double kolmogorovSmirnovTest​(RealDistribution distribution,
                                            double[] data)
        Computes the p-value, or observed significance level, of a one-sample Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
        Parameters:
        distribution - reference distribution
        data - sample being evaluated
        Returns:
        the p-value associated with the null hypothesis that data is a sample from distribution
        Throws:
        InsufficientDataException - if data does not have length at least 2
        NullArgumentException - if data is null
      • kolmogorovSmirnovTest

        public boolean kolmogorovSmirnovTest​(RealDistribution distribution,
                                             double[] data,
                                             double alpha)
        Performs a Kolmogorov-Smirnov test evaluating the null hypothesis that data conforms to distribution.
        Parameters:
        distribution - reference distribution
        data - sample being evaluated
        alpha - significance level of the test
        Returns:
        true iff the null hypothesis that data is a sample from distribution can be rejected with confidence 1 - alpha
        Throws:
        InsufficientDataException - if data does not have length at least 2
        NullArgumentException - if data is null
      • bootstrap

        public double bootstrap​(double[] x,
                                double[] y,
                                int iterations,
                                boolean strict)
        Estimates the p-value of a two-sample Kolmogorov-Smirnov test evaluating the null hypothesis that x and y are samples drawn from the same probability distribution. This method estimates the p-value by repeatedly sampling sets of size x.length and y.length from the empirical distribution of the combined sample. When strict is true, this is equivalent to the algorithm implemented in the R function ks.boot, described in
         Jasjeet S. Sekhon. 2011. 'Multivariate and Propensity Score Matching
         Software with Automated Balance Optimization: The Matching package for R.'
         Journal of Statistical Software, 42(7): 1-52.
         
        Parameters:
        x - first sample
        y - second sample
        iterations - number of bootstrap resampling iterations
        strict - whether or not the null hypothesis is expressed as a strict inequality
        Returns:
        estimated p-value
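        The resampling scheme described for bootstrap can be sketched as follows. This is a hedged illustration rather than the library's implementation: the class name, the explicit java.util.Random parameter, and the integer-scaled statistic helper are all assumptions introduced for the example.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper class, for illustration only.
final class KsBootstrapSketch {

    /** Returns n * m * D_{n,m}, handling ties by stepping past equal values together. */
    static long integralD(double[] x, double[] y) {
        double[] sx = x.clone();
        double[] sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length, m = sy.length, i = 0, j = 0;
        long max = 0L;
        while (i < n || j < m) {
            double z = (j >= m || (i < n && Double.compare(sx[i], sy[j]) <= 0)) ? sx[i] : sy[j];
            while (i < n && Double.compare(sx[i], z) == 0) i++;
            while (j < m && Double.compare(sy[j], z) == 0) j++;
            max = Math.max(max, Math.abs((long) i * m - (long) j * n));
        }
        return max;
    }

    /** Estimates the p-value by resampling with replacement from the pooled sample. */
    static double bootstrap(double[] x, double[] y, int iterations, boolean strict, Random rng) {
        int n = x.length, m = y.length;
        double[] pooled = new double[n + m];
        System.arraycopy(x, 0, pooled, 0, n);
        System.arraycopy(y, 0, pooled, n, m);
        long observed = integralD(x, y);
        double[] bx = new double[n];
        double[] by = new double[m];
        int hits = 0;
        for (int it = 0; it < iterations; it++) {
            for (int i = 0; i < n; i++) bx[i] = pooled[rng.nextInt(n + m)];
            for (int j = 0; j < m; j++) by[j] = pooled[rng.nextInt(n + m)];
            long d = integralD(bx, by);
            if (strict ? d > observed : d >= observed) hits++;
        }
        return (double) hits / iterations;
    }
}
```

        Because resampling draws with replacement, ties in the resampled sets are expected, which is why the statistic helper advances past all copies of a value in both samples before comparing.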
      • bootstrap

        public double bootstrap​(double[] x,
                                double[] y,
                                int iterations)
        Computes bootstrap(x, y, iterations, true). This is equivalent to ks.boot(x, y, nboots=iterations) using the R Matching package function. See bootstrap(double[], double[], int, boolean).
        Parameters:
        x - first sample
        y - second sample
        iterations - number of bootstrap resampling iterations
        Returns:
        estimated p-value
      • cdf

        public double cdf​(double d,
                          int n)
                   throws MathArithmeticException
        Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above). Unlike cdfExact(double, int), the result is not exact, because calculations are based on double rather than BigFraction.
        Parameters:
        d - statistic
        n - sample size
        Returns:
        \(P(D_n < d)\)
        Throws:
        MathArithmeticException - if algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)
      • cdfExact

        public double cdfExact​(double d,
                               int n)
                        throws MathArithmeticException
        Calculates P(D_n < d). The result is exact in the sense that BigFraction/BigReal is used everywhere at the expense of very slow execution time. Almost never choose this in real applications unless you are very sure; this is almost solely for verification purposes. Normally, you would choose cdf(double, int). See the class javadoc for definitions and algorithm description.
        Parameters:
        d - statistic
        n - sample size
        Returns:
        \(P(D_n < d)\)
        Throws:
        MathArithmeticException - if the algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\)
      • cdf

        public double cdf​(double d,
                          int n,
                          boolean exact)
                   throws MathArithmeticException
        Calculates \(P(D_n < d)\) using the method described in [1] with quick decisions for extreme values given in [2] (see above).
        Parameters:
        d - statistic
        n - sample size
        exact - whether the probability should be calculated exactly, using BigFraction everywhere at the expense of very slow execution time, or whether double should be used in convenient places to gain speed. Almost never choose true in real applications unless you are very sure; true is almost solely for verification purposes.
        Returns:
        \(P(D_n < d)\)
        Throws:
        MathArithmeticException - if algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\).
      • exactK

        private double exactK​(double d,
                              int n)
                       throws MathArithmeticException
        Calculates the exact value of P(D_n < d) using the method described in [1] (reference in class javadoc above) and BigFraction (see above).
        Parameters:
        d - statistic
        n - sample size
        Returns:
        the two-sided probability \(P(D_n < d)\)
        Throws:
        MathArithmeticException - if algorithm fails to convert h to a BigFraction in expressing d as \((k - h) / m\) for integer k, m and \(0 \le h < 1\).
      • roundedK

        private double roundedK​(double d,
                                int n)
        Calculates \(P(D_n < d)\) using the method described in [1] (see above) with double-precision arithmetic.
        Parameters:
        d - statistic
        n - sample size
        Returns:
        \(P(D_n < d)\)
      • pelzGood

        public double pelzGood​(double d,
                               int n)
        Computes the Pelz-Good approximation for \(P(D_n < d)\) as described in [2] in the class javadoc.
        Parameters:
        d - value of d-statistic (x in [2])
        n - sample size
        Returns:
        \(P(D_n < d)\)
        Since:
        3.4
      • createRoundedH

        private RealMatrix createRoundedH​(double d,
                                          int n)
                                   throws NumberIsTooLargeException
        Creates H of size m x m as described in [1] (see above) using double-precision.
        Parameters:
        d - statistic
        n - sample size
        Returns:
        H matrix
        Throws:
        NumberIsTooLargeException - if fractional part is greater than 1
      • checkArray

        private void checkArray​(double[] array)
        Verifies that array has length at least 2.
        Parameters:
        array - array to test
        Throws:
        NullArgumentException - if array is null
        InsufficientDataException - if array is too short
      • ksSum

        public double ksSum​(double t,
                            double tolerance,
                            int maxIterations)
        Computes \( 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2} \) stopping when successive partial sums are within tolerance of one another, or when maxIterations partial sums have been computed. If the sum does not converge before maxIterations iterations a TooManyIterationsException is thrown.
        Parameters:
        t - argument
        tolerance - Cauchy criterion for partial sums
        maxIterations - maximum number of partial sums to compute
        Returns:
        Kolmogorov sum evaluated at t
        Throws:
        TooManyIterationsException - if the series does not converge
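        The stopping rule can be illustrated with a short sketch of the series. It is not the library's implementation: the class name is hypothetical, IllegalStateException stands in for TooManyIterationsException so that the example has no external dependencies, and the special case at t = 0 (where the series does not converge) is an assumption that returns the limiting value 0.

```java
// Hypothetical helper class, for illustration only.
final class KsSumSketch {

    /** Evaluates 1 + 2 * sum_{i>=1} (-1)^i exp(-2 i^2 t^2) with a Cauchy stopping rule. */
    static double ksSum(double t, double tolerance, int maxIterations) {
        if (t == 0.0) {
            return 0.0; // assumed limiting value; the alternating series diverges at t = 0
        }
        double sum = 1.0;
        for (int i = 1; i <= maxIterations; i++) {
            // Magnitude of the i-th term; successive partial sums differ by exactly this amount.
            double term = 2.0 * Math.exp(-2.0 * i * i * t * t);
            sum += (i % 2 == 0) ? term : -term;
            if (term <= tolerance) {
                return sum;
            }
        }
        // Stand-in for the TooManyIterationsException described above.
        throw new IllegalStateException("no convergence after " + maxIterations + " partial sums");
    }
}
```

        For t = 1 the terms fall below 1.0E-20 after five partial sums, and the result is approximately 0.7300003283.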
      • calculateIntegralD

        private static long calculateIntegralD​(double d,
                                               int n,
                                               int m,
                                               boolean strict)
        Given a d-statistic in the range [0, 1] and the two sample sizes n and m, an integral d-statistic in the range [0, n*m] is calculated, that can be used for comparison with other integral d-statistics. Depending on whether strict is true or not, the returned value divided by (n*m) is greater than (resp. greater than or equal to) the given d value (allowing some tolerance).
        Parameters:
        d - a d-statistic in the range [0, 1]
        n - first sample size
        m - second sample size
        strict - whether the returned value divided by (n*m) must strictly exceed d (if false, it is allowed to be equal to d)
        Returns:
        the integral d-statistic in the range [0, n*m]
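        One way to realise this rounding is sketched below. It is an illustration only: the class name is hypothetical and the tolerance of 1e-12 is an assumed value chosen for the example, not a documented constant.

```java
// Hypothetical helper class, for illustration only.
final class IntegralDSketch {

    // Assumed tolerance: d-values closer than this are treated as equal.
    private static final double TOL = 1e-12;

    /** Smallest integer k with k / (n * m) >= d, or > d when strict is true. */
    static long integralD(double d, int n, int m, boolean strict) {
        long nm = (long) n * m;
        long up = (long) Math.ceil((d - TOL) * nm);
        long down = (long) Math.floor((d + TOL) * nm);
        // up == down means d * n * m is an integer within tolerance,
        // i.e. d itself is an attainable value of the statistic.
        return (strict && up == down) ? up + 1 : up;
    }
}
```

        With n = 4 and m = 5, d = 0.5 maps to 10 in the non-strict case and to 11 in the strict case, while d = 0.52 (not a multiple of 1/20) maps to 11 either way.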
      • exactP

        public double exactP​(double d,
                             int n,
                             int m,
                             boolean strict)
        Computes \(P(D_{n,m} > d)\) if strict is true; otherwise \(P(D_{n,m} \ge d)\), where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).

        The returned probability is exact, implemented by unwinding the recursive function definitions presented in [4] (class javadoc).

        Parameters:
        d - D-statistic value
        n - first sample size
        m - second sample size
        strict - whether or not the probability to compute is expressed as a strict inequality
        Returns:
        probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than (resp. greater than or equal to) d
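        The path-counting idea behind the exact computation can be sketched with a small dynamic program. This is a simplified illustration, not the library's code or a transcription of [4]: the class name is hypothetical, the threshold is passed directly as an integer k (so the method returns \(P(n m D_{n,m} \ge k)\)), and no ties are assumed, so that all \(\binom{n+m}{n}\) interleavings of the two samples are equally likely.

```java
// Hypothetical helper class, for illustration only.
final class ExactTwoSampleSketch {

    /** Returns P(n * m * D_{n,m} >= k) under the no-ties null distribution. */
    static double tailProbability(long k, int n, int m) {
        // paths[j] holds the number of admissible lattice paths from (0,0) to (i, j)
        // for the current row i; a point is admissible if |i*m - j*n| < k.
        double[] paths = new double[m + 1];
        for (int i = 0; i <= n; i++) {
            for (int j = 0; j <= m; j++) {
                boolean inside = Math.abs((long) i * m - (long) j * n) < k;
                if (!inside) {
                    paths[j] = 0.0;
                } else if (i == 0 && j == 0) {
                    paths[j] = 1.0;
                } else {
                    double fromLeft = (i > 0) ? paths[j] : 0.0;      // value from row i - 1
                    double fromBelow = (j > 0) ? paths[j - 1] : 0.0; // already updated for row i
                    paths[j] = fromLeft + fromBelow;
                }
            }
        }
        // Total number of interleavings, C(n + m, n), kept as a double to avoid overflow.
        double total = 1.0;
        for (int i = 1; i <= n; i++) {
            total = total * (m + i) / i;
        }
        return 1.0 - paths[m] / total;
    }
}
```

        For n = m = 2 the scaled statistic takes only the values 2 and 4; four of the six interleavings stay strictly below 4, so tailProbability(4, 2, 2) is 1/3.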
      • approximateP

        public double approximateP​(double d,
                                   int n,
                                   int m)
        Uses the Kolmogorov-Smirnov distribution to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).

        Specifically, what is returned is \(1 - k(d \sqrt{mn / (m + n)})\) where \(k(t) = 1 + 2 \sum_{i=1}^\infty (-1)^i e^{-2 i^2 t^2}\). See ksSum(double, double, int) for details on how convergence of the sum is determined. This implementation passes ksSum 1.0E-20 as tolerance and 100000 as maxIterations.

        Parameters:
        d - D-statistic value
        n - first sample size
        m - second sample size
        Returns:
        approximate probability that a randomly selected m-n partition of m + n generates \(D_{n,m}\) greater than d
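        The formula \(1 - k(d \sqrt{mn / (m + n)})\) can be evaluated directly, as in the sketch below. The class name is hypothetical, the series is inlined rather than delegated to ksSum, and returning 1 when the scaled argument is 0 is an assumption made for the example; the tolerance and iteration cap mirror the values quoted above.

```java
// Hypothetical helper class, for illustration only.
final class ApproximateTwoSampleSketch {

    /** Approximates P(D_{n,m} > d) by 1 - k(d * sqrt(n * m / (n + m))). */
    static double approximateP(double d, int n, int m) {
        double t = d * Math.sqrt((double) n * m / ((double) n + m));
        if (t == 0.0) {
            return 1.0; // assumed: every partition has D >= 0
        }
        double k = 1.0;
        for (int i = 1; i <= 100000; i++) {
            double term = 2.0 * Math.exp(-2.0 * i * i * t * t);
            k += (i % 2 == 0) ? term : -term;
            if (term <= 1.0E-20) {
                break; // successive partial sums agree to within the tolerance
            }
        }
        return 1.0 - k;
    }
}
```

        For two samples of size 100 and d = 0.2, the scaled argument is \(\sqrt{2}\) and the approximate p-value is about 0.0366.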
      • fillBooleanArrayRandomlyWithFixedNumberTrueValues

        static void fillBooleanArrayRandomlyWithFixedNumberTrueValues​(boolean[] b,
                                                                      int numberOfTrueValues,
                                                                      RandomGenerator rng)
        Fills a boolean array randomly with a fixed number of true values. The method uses a simplified version of the Fisher-Yates shuffle algorithm. By processing the true values first, followed by the remaining false values, fewer random numbers need to be generated. The method is optimized for the case that the number of true values is larger than or equal to the number of false values.
        Parameters:
        b - boolean array
        numberOfTrueValues - number of true values the boolean array should finally have
        rng - random data generator
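        For comparison, the same result can be obtained with a plain, unoptimized Fisher-Yates shuffle. The sketch below is that baseline, not the simplified variant described above: the class name is hypothetical and java.util.Random stands in for RandomGenerator.

```java
import java.util.Random;

// Hypothetical helper class, for illustration only.
final class RandomTrueFillSketch {

    /** Overwrites b so that exactly numberOfTrueValues entries are true, at uniformly random positions. */
    static void fill(boolean[] b, int numberOfTrueValues, Random rng) {
        // Start with all true values at the front of the array.
        for (int i = 0; i < b.length; i++) {
            b[i] = i < numberOfTrueValues;
        }
        // Full Fisher-Yates shuffle: every arrangement is equally likely.
        for (int i = b.length - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            boolean tmp = b[i];
            b[i] = b[j];
            b[j] = tmp;
        }
    }
}
```

        The full shuffle draws one random number per array slot, which is the cost that the optimized approach described above aims to reduce.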
      • monteCarloP

        public double monteCarloP​(double d,
                                  int n,
                                  int m,
                                  boolean strict,
                                  int iterations)
        Uses Monte Carlo simulation to approximate \(P(D_{n,m} > d)\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic. See kolmogorovSmirnovStatistic(double[], double[]) for the definition of \(D_{n,m}\).

        The simulation generates iterations random partitions of m + n into an n set and an m set, computing \(D_{n,m}\) for each partition and returning the proportion of values that are greater than d, or greater than or equal to d if strict is false.

        Parameters:
        d - D-statistic value
        n - first sample size
        m - second sample size
        strict - whether or not the probability to compute is expressed as a strict inequality
        iterations - number of random partitions to generate
        Returns:
        proportion of randomly generated m-n partitions of m + n that result in \(D_{n,m}\) greater than (resp. greater than or equal to) d
      • integralMonteCarloP

        private double integralMonteCarloP​(long d,
                                           int n,
                                           int m,
                                           int iterations)
        Uses Monte Carlo simulation to approximate \(P(D_{n,m} \ge d/(nm))\) where \(D_{n,m}\) is the 2-sample Kolmogorov-Smirnov statistic.

        Here d is the D-statistic represented as a long value. The real D-statistic is obtained by dividing d by n*m. See also monteCarloP(double, int, int, boolean, int).

        Parameters:
        d - integral D-statistic
        n - first sample size
        m - second sample size
        iterations - number of random partitions to generate
        Returns:
        proportion of randomly generated m-n partitions of m + n that result in \(D_{n,m}\) greater than or equal to d/(n*m)
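        The simulation over random partitions can be sketched in integer arithmetic as follows. This is an illustration under stated assumptions rather than the library's implementation: the class name is hypothetical, java.util.Random stands in for RandomGenerator, and each partition is produced by a plain Fisher-Yates shuffle of a membership array.

```java
import java.util.Random;

// Hypothetical helper class, for illustration only.
final class KsMonteCarloSketch {

    /** Estimates P(n * m * D_{n,m} >= d) from random partitions of n + m items. */
    static double monteCarloP(long d, int n, int m, int iterations, Random rng) {
        boolean[] fromX = new boolean[n + m];
        int hits = 0;
        for (int it = 0; it < iterations; it++) {
            // Random partition: exactly n of the n + m ordered positions belong to x.
            for (int p = 0; p < n + m; p++) {
                fromX[p] = p < n;
            }
            for (int p = n + m - 1; p > 0; p--) {
                int q = rng.nextInt(p + 1);
                boolean tmp = fromX[p];
                fromX[p] = fromX[q];
                fromX[q] = tmp;
            }
            // Walk the ordered positions, tracking i*m - j*n = n*m*(F_n - F_m).
            long diff = 0L;
            long max = 0L;
            for (boolean isX : fromX) {
                diff += isX ? m : -n;
                max = Math.max(max, Math.abs(diff));
            }
            if (max >= d) {
                hits++;
            }
        }
        return (double) hits / iterations;
    }
}
```

        With n = m = 2 every partition has a scaled statistic of at least 2, so the estimate for d = 2 is exactly 1, while the estimate for d = 4 should be close to the exact value 1/3.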
      • fixTies

        private static void fixTies​(double[] x,
                                    double[] y)
        If there are no ties in the combined dataset formed from x and y, this method is a no-op. If there are ties, a uniform random deviate in (-minDelta / 2, minDelta / 2) - {0} is added to each value in x and y, where minDelta is the minimum difference between unequal values in the combined sample. A fixed seed is used to generate the jitter, so repeated invocations with the same input arrays result in the same values. NOTE: if there are ties in the data, this method overwrites the data in x and y with the jittered values.
        Parameters:
        x - first sample
        y - second sample
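        A minimal sketch of this tie-breaking step is shown below. It is not the library's implementation: the class name is hypothetical, java.util.Random replaces the fixed-seed generator, the jitter is drawn from [-minDelta / 2, minDelta / 2) without excluding 0, and a width of 1 is assumed when all values are equal.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper class, for illustration only.
final class TieJitterSketch {

    /** Jitters x and y in place if the pooled sample contains ties; otherwise does nothing. */
    static void fixTies(double[] x, double[] y, Random rng) {
        double[] pooled = new double[x.length + y.length];
        System.arraycopy(x, 0, pooled, 0, x.length);
        System.arraycopy(y, 0, pooled, x.length, y.length);
        Arrays.sort(pooled);

        boolean ties = false;
        double minDelta = Double.POSITIVE_INFINITY;
        for (int i = 1; i < pooled.length; i++) {
            double gap = pooled[i] - pooled[i - 1];
            if (gap == 0.0) {
                ties = true;
            } else if (gap < minDelta) {
                minDelta = gap; // smallest gap between unequal neighbours
            }
        }
        if (!ties) {
            return; // no-op when the pooled sample has no repeated values
        }
        if (Double.isInfinite(minDelta)) {
            minDelta = 1.0; // assumed fallback when every value is the same
        }
        // Each shift is smaller than half the smallest gap, so the order of
        // originally distinct values is preserved.
        for (int i = 0; i < x.length; i++) {
            x[i] += (rng.nextDouble() - 0.5) * minDelta;
        }
        for (int j = 0; j < y.length; j++) {
            y[j] += (rng.nextDouble() - 0.5) * minDelta;
        }
    }
}
```

        Callers that need the original data afterwards would pass copies of the arrays, since the values are overwritten when ties are found.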
      • hasTies

        private static boolean hasTies​(double[] x,
                                       double[] y)
        Returns true iff there are ties in the combined sample formed from x and y.
        Parameters:
        x - first sample
        y - second sample
        Returns:
        true if x and y together contain ties
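        A tie check of this kind can be sketched with a hash set. The class name is hypothetical and the example is an illustration only; note that boxed Double equality treats 0.0 and -0.0 as different values.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, for illustration only.
final class TieCheckSketch {

    /** Returns true if any value occurs more than once in the pooled sample. */
    static boolean hasTies(double[] x, double[] y) {
        Set<Double> seen = new HashSet<>();
        for (double v : x) {
            if (!seen.add(v)) {
                return true; // add() returns false when v was already present
            }
        }
        for (double v : y) {
            if (!seen.add(v)) {
                return true;
            }
        }
        return false;
    }
}
```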
      • jitter

        private static void jitter​(double[] data,
                                   RealDistribution dist)
        Adds random jitter to data using deviates sampled from dist.

        Note that jitter is applied in-place - i.e., the array values are overwritten with the result of applying jitter.

        Parameters:
        data - input/output data array - entries overwritten by the method
        dist - probability distribution to sample for jitter values
        Throws:
        java.lang.NullPointerException - if either of the parameters is null
      • c

        private static int c​(int i,
                             int j,
                             int m,
                             int n,
                             long cmn,
                             boolean strict)
        The function C(i, j) defined in [4] (class javadoc), formula (5.5). It returns 1 if |i/n - j/m| <= c and 0 otherwise. Here c is scaled up and recoded as a long to avoid rounding errors in comparison tests, so what is actually tested is |im - jn| <= cmn.
        Parameters:
        i - first path parameter
        j - second path parameter
        m - first sample size
        n - second sample size
        cmn - integral D-statistic (see calculateIntegralD(double, int, int, boolean))
        strict - whether or not the null hypothesis uses strict inequality
        Returns:
        C(i,j) for given m, n, c
      • n

        private static double n​(int i,
                                int j,
                                int m,
                                int n,
                                long cnm,
                                boolean strict)
        The function N(i, j) defined in [4] (class javadoc). Returns the number of paths over the lattice {(i,j) : 0 <= i <= n, 0 <= j <= m} from (0,0) to (i,j) satisfying C(h,k, m, n, c) = 1 for each (h,k) on the path. The return value is integral, but subject to overflow, so it is maintained and returned as a double.
        Parameters:
        i - first path parameter
        j - second path parameter
        m - first sample size
        n - second sample size
        cnm - integral D-statistic (see calculateIntegralD(double, int, int, boolean))
        strict - whether or not the null hypothesis uses strict inequality
        Returns:
        number of paths to (i, j) from (0,0) representing D-values as large as c for given m, n