Class UnconditionedExactTest

java.lang.Object
org.apache.commons.statistics.inference.UnconditionedExactTest

public final class UnconditionedExactTest extends Object
Implements an unconditioned exact test for a contingency table.

Performs an exact test for the statistical significance of the association (contingency) between two kinds of categorical classification. A 2x2 contingency table is:

\[ \left[ {\begin{array}{cc} a & b \\ c & d \\ \end{array} } \right] \]

This test applies to the case of a 2x2 contingency table with one margin fixed. Note that if both margins are fixed (the row sums and column sums are not random) then Fisher's exact test can be applied.

This implementation fixes the column sums \( m = a + c \) and \( n = b + d \). All possible tables can be created using \( 0 \le a \le m \) and \( 0 \le b \le n \). The random values \( a \) and \( b \) follow a binomial distribution with probabilities \( p_0 \) and \( p_1 \) such that \( a \sim B(m, p_0) \) and \( b \sim B(n, p_1) \). The p-value of the 2x2 table is the product of two binomials:

\[ \begin{aligned} p &= Pr(a; m, p_0) \times Pr(b; n, p_1) \\ &= \binom{m}{a} p_0^a (1-p_0)^{m-a} \times \binom{n}{b} p_1^b (1-p_1)^{n-b} \end{aligned} \]

For the binomial model, the null hypothesis is the two nuisance parameters are equal \( p_0 = p_1 = \pi\), with \( \pi \) the probability for equal proportions, and the probability of any single table is:

\[ p = \binom{m}{a} \binom{n}{b} \pi^{a+b} (1-\pi)^{m+n-a-b} \]

The p-value of the observed table is calculated by maximising the sum of the as or more extreme tables over the domain of the nuisance parameter \( 0 \lt \pi \lt 1 \):

\[ p(a, b) = \sum_{i,j} \binom{m}{i} \binom{n}{j} \pi^{i+j} (1-\pi)^{m+n-i-j} \]

where table \( (i,j) \) is as or more extreme than the observed table \( (a, b) \). The test can be configured to select more extreme tables using various methods.

Note that the sum of the joint binomial distribution is a univariate function for the nuisance parameter \( \pi \). This function may have many local maxima and the search enumerates the range with a configured number of points. The best candidates are optionally used as the start point for an optimized search for a local maxima.

References:

  1. Barnard, G.A. (1947). Significance tests for 2x2 tables. Biometrika, 34, Issue 1-2, 123–138.
  2. Boschloo, R.D. (1970). Raised conditional level of significance for the 2 × 2-table when testing the equality of two probabilities. Statistica neerlandica, 24(1), 1–9.
  3. Suisaa, A and Shuster, J.J. (1985). Exact Unconditional Sample Sizes for the 2 × 2 Binomial Trial. Journal of the Royal Statistical Society. Series A (General), 148(4), 317-327.
Since:
1.1
See Also:
  • Field Details

    • DEFAULT

      private static final UnconditionedExactTest DEFAULT
      Default instance.

      SciPy's boschloo_exact and barnard_exact tests use 32 points in the interval [0, 1) The R Exact package uses 100 in the interval [1e-5, 1-1e-5]. Barnards 1947 paper describes the nuisance parameter in the open interval 0 < pi < 1. Here we respect the open-interval for the initial candidates and ignore 0 and 1. The initial bounds used are the same as the R Exact package. We closely match the inner 31 points from SciPy by using 33 points by default.

    • LOWER_BOUND

      private static final double LOWER_BOUND
      Lower bound for the enumerated interval. The upper bound is 1 - lower.
      See Also:
    • SOLVER_RELATIVE_EPS

      private static final double SOLVER_RELATIVE_EPS
      Relative epsilon for the Brent solver. This is limited for a univariate function to approximately sqrt(eps) with eps = 2^-52.
      See Also:
    • INC_FRACTION

      private static final double INC_FRACTION
      Fraction of the increment (interval between enumerated points) to initialise the bracket for the minima. Note the minima should lie between x +/- increment. The bracket should search within this range. Set to 1/8 and so the initial point of the bracket is approximately 1.61 * 1/8 = 0.2 of the increment away from initial points a or b.
      See Also:
    • MAX_CANDIDATES

      private static final int MAX_CANDIDATES
      Maximum number of candidate to optimize. This is a safety limit to avoid excess optimization. Only candidates within a relative tolerance of the best candidate are stored. If the number of candidates exceeds this value then many candidates have a very similar p-value and the top candidates will be optimized. Using a value of 3 allows at least one other candidate to be optimized when there is two-fold symmetry in the energy function.
      See Also:
    • MINIMA_EPS

      private static final double MINIMA_EPS
      Relative distance of candidate minima from the lowest candidate. Used to exclude poor candidates from optimization.
      See Also:
    • MAX_TABLES

      private static final int MAX_TABLES
      The maximum number of tables. This is limited by the maximum number of indices that can be maintained in memory. Potentially up to this number of tables must be tracked during computation of the p-value for as or more extreme tables. The limit is set using the same limit for maximum capacity as java.util.ArrayList. In practice any table anywhere near this limit can be computed using an alternative such as a chi-squared or g test.
      See Also:
    • COLUMN_SUM

      private static final String COLUMN_SUM
      Error message text for zero column sums.
      See Also:
    • alternative

      private final AlternativeHypothesis alternative
      Alternative hypothesis.
    • method

      private final UnconditionedExactTest.Method method
      Method to identify more extreme tables.
    • points

      private final int points
      Number of initial points.
    • optimize

      private final boolean optimize
      Option to optimize the best initial point(s).
  • Constructor Details

    • UnconditionedExactTest

      private UnconditionedExactTest(AlternativeHypothesis alternative, UnconditionedExactTest.Method method, int points, boolean optimize)
      Parameters:
      alternative - Alternative hypothesis.
      method - Method to identify more extreme tables.
      points - Number of initial points.
      optimize - Option to optimize the best initial point(s).
  • Method Details

    • withDefaults

      public static UnconditionedExactTest withDefaults()
      Returns:
      default instance
    • with

      Return an instance with the configured alternative hypothesis.
      Parameters:
      v - Value.
      Returns:
      an instance
    • with

      Return an instance with the configured method.
      Parameters:
      v - Value.
      Returns:
      an instance
    • withInitialPoints

      public UnconditionedExactTest withInitialPoints(int v)
      Return an instance with the configured number of initial points.

      The search for the nuisance parameter will use \( v \) points in the open interval \( (0, 1) \). The interval is evaluated by including start and end points approximately equal to 0 and 1. Additional internal points are enumerated using increments of approximately \( \frac{1}{v-1} \). The minimum number of points is 2. Increasing the number of points increases the precision of the search at the cost of performance.

      To approximately double the number of points so that all existing points are included and additional points half-way between them are sampled requires using 2p - 1 where p is the existing number of points.

      Parameters:
      v - Value.
      Returns:
      an instance
      Throws:
      IllegalArgumentException - if the value is < 2.
    • withOptimize

      public UnconditionedExactTest withOptimize(boolean v)
      Return an instance with the configured optimization of initial search points.

      If enabled then the initial point(s) with the highest probability is/are used as the start for an optimization to find a local maxima.

      Parameters:
      v - Value.
      Returns:
      an instance
      See Also:
    • statistic

      public double statistic(int[][] table)
      Compute the statistic for the unconditioned exact test. The statistic returned depends on the configured method.
      Parameters:
      table - 2-by-2 contingency table.
      Returns:
      test statistic
      Throws:
      IllegalArgumentException - if the table is not a 2-by-2 table; any table entry is negative; any column sum is zero; the table sum is zero or not an integer; or the number of possible tables exceeds the maximum array capacity.
      See Also:
    • test

      public UnconditionedExactTest.Result test(int[][] table)
      Performs an unconditioned exact test on the 2-by-2 contingency table. The statistic and p-value returned depends on the configured method and alternative hypothesis.

      The search for the nuisance parameter that maximises the p-value can be configured to: start with a number of initial points; and optimize the best points.

      Parameters:
      table - 2-by-2 contingency table.
      Returns:
      test result
      Throws:
      IllegalArgumentException - if the table is not a 2-by-2 table; any table entry is negative; any column sum is zero; the table sum is zero or not an integer; or the number of possible tables exceeds the maximum array capacity.
      See Also:
    • findExtremeTables

      private double findExtremeTables(int a, int b, UnconditionedExactTest.XYList tableList)
      Find all tables that are as or more extreme than the observed table.

      If the list of tables is full then all tables are more extreme. Some configurations can detect this without performing a search and in this case the list of tables is returned as empty.

      Parameters:
      a - Observed value for a.
      b - Observed value for b.
      tableList - List to track more extreme tables.
      Returns:
      the test statistic
    • statisticZ

      private static double statisticZ(int a, int b, int m, int n, boolean pooled)
      Compute the statistic from a Z-test.
      Parameters:
      a - Observed value for a.
      b - Observed value for b.
      m - Column sum m.
      n - Column sum n.
      pooled - true to use a pooled variance.
      Returns:
      z
    • findExtremeTablesZ

      private double findExtremeTablesZ(int a, int b, int m, int n, boolean pooled, UnconditionedExactTest.XYList tableList)
      Find all tables that are as or more extreme than the observed table using the Z statistic.
      Parameters:
      a - Observed value for a.
      b - Observed value for b.
      m - Column sum m.
      n - Column sum n.
      pooled - true to use a pooled variance.
      tableList - List to track more extreme tables.
      Returns:
      observed z
    • statisticBoschloo

      private double statisticBoschloo(int a, int b, int m, int n)
      Compute the statistic using Fisher's p-value (also known as Boschloo's test).
      Parameters:
      a - Observed value for a.
      b - Observed value for b.
      m - Column sum m.
      n - Column sum n.
      Returns:
      p-value
    • statisticBoschlooTwoSided

      private static double statisticBoschlooTwoSided(Hypergeom distribution, int k)
      Compute the two-sided statistic using Fisher's p-value (also known as Boschloo's test).
      Parameters:
      distribution - Hypergeometric distribution.
      k - Observed value.
      Returns:
      p-value
    • findExtremeTablesBoschloo

      private double findExtremeTablesBoschloo(int a, int b, int m, int n, UnconditionedExactTest.XYList tableList)
      Find all tables that are as or more extreme than the observed table using the Fisher's p-value as the statistic (also known as Boschloo's test).
      Parameters:
      a - Observed value for a.
      b - Observed value for b.
      m - Column sum m.
      n - Column sum n.
      tableList - List to track more extreme tables.
      Returns:
      observed p-value
    • computePValue

      private double[] computePValue(UnconditionedExactTest.XYList tableList)
      Compute the nuisance parameter and p-value for the binomial model given the list of possible tables.

      The current method enumerates an initial set of points and stores local extrema as candidates. Any candidate within 2% of the best is optionally optimized; this is limited to the top 3 candidates. These settings could be exposed as configurable options. Currently only the choice to optimize or not is exposed.

      Parameters:
      tableList - List of tables.
      Returns:
      [nuisance parameter, p-value]
    • createBinomialModel

      private static DoubleUnaryOperator createBinomialModel(UnconditionedExactTest.XYList tableList)
      Creates the binomial model p-value function for the nuisance parameter. Note: This function computes the negative p-value so is suitable for optimization by a search for a minimum.
      Parameters:
      tableList - List of tables.
      Returns:
      the function
    • createLogBinomialCoefficients

      private static IntToDoubleFunction createLogBinomialCoefficients(int n)
      Create the natural logarithm of the binomial coefficient for all k = [0, n].
      Parameters:
      n - Limit N.
      Returns:
      ln binom(n, k)
    • addCandidate

      private void addCandidate(UnconditionedExactTest.Candidates minima, double v1, double v2, double v3, double x2)
      Add point 2 to the list of minima if neither neighbour value is lower.
       !(v1 invalid input: '<' v2 || v3 invalid input: '<' v2)
       
      Parameters:
      minima - Candidate minima.
      v1 - First point function value.
      v2 - Second point function value.
      v3 - Third point function value.
      x2 - Second point.
    • checkTable

      private static void checkTable(int[][] table)
      Check the input is a 2-by-2 contingency table.
      Parameters:
      table - Contingency table.
      Throws:
      IllegalArgumentException - if the table is not a 2-by-2 table; any table entry is negative; any column sum is zero; the table sum is zero or not an integer; or the number of possible tables exceeds the maximum array capacity.