Class ExtendedPrecision
This class is based on Dekker (1971), "A floating-point technique for extending the available precision".
-
Field Summary
Fields
Modifier and Type | Field | Description
private static final int | CMP_UNSIGNED_2046 | The value 2046 converted for use if using Integer.compareUnsigned(int, int).
private static final int | CMP_UNSIGNED_MINUS_1 | The value -1 converted for use if using Integer.compareUnsigned(int, int).
private static final double | DOWN_SCALE | The scale to use when down-scaling during a split into a high part.
private static final int | EXP_MASK | The mask to extract the raw 11-bit exponent.
private static final double | MULTIPLIER | The multiplier used to split the double value into high and low parts.
private static final double | SAFE_UPPER | The upper limit above which a number may overflow during the split into a high part.
private static final double | UP_SCALE | The scale to use when re-scaling during a split into a high part.
-
Constructor Summary
Constructors -
Method Summary
Modifier and Type | Method | Description
(package private) static double | highPartUnscaled(double value) | Implement Dekker's method to split a value into two parts.
(package private) static boolean | isNotNormal(double a) | Checks if the number is not normal.
(package private) static double | productLow(double x, double y, double xy) | Compute the low part of the double length number (z,zz) for the exact product of x and y.
private static double | productLow(double hx, double lx, double hy, double ly, double xy) | Compute the low part of the double length number (z,zz) for the exact product of x and y using Dekker's mult12 algorithm.
private static double | productLowUnscaled(double x, double y, double xy) | Compute the low part of the double length number (z,zz) for the exact product of x and y using Dekker's mult12 algorithm.
(package private) static double | squareLowUnscaled(double x, double xx) | Compute the low part of the double length number (z,zz) for the exact square of x using Dekker's mult12 algorithm.
(package private) static double | twoSumLow(double a, double b, double sum) | Compute the round-off from the sum of two numbers a and b using Knuth's two-sum algorithm.
-
Field Details
-
MULTIPLIER
private static final double MULTIPLIER
The multiplier used to split the double value into high and low parts. From Dekker (1971): "The constant should be chosen equal to 2^(p - p/2) + 1, where p is the number of binary digits in the mantissa". Here p is 53 and the multiplier is 2^27 + 1.
-
SAFE_UPPER
private static final double SAFE_UPPER
The upper limit above which a number may overflow during the split into a high part. Assuming the multiplier is above 2^27 and the maximum exponent is 1023, a safe limit is a value with an exponent of (1023 - 27) = 996, that is 2^996. 996 is the value obtained from Math.getExponent(Double.MAX_VALUE / MULTIPLIER).
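The limit can be checked numerically. The following is a minimal sketch, assuming MULTIPLIER is 2^27 + 1 and SAFE_UPPER is exactly 2^996 as described above; the class and method names are hypothetical and are not part of ExtendedPrecision.

```java
final class SplitLimitsSketch {
    /** 2^27 + 1; exactly representable as a double. */
    static final double MULTIPLIER = 1.0 + 0x1.0p27;
    /** 2^996; assumed value of the safe upper limit. */
    static final double SAFE_UPPER = 0x1.0p996;

    /** Exponent of the largest value whose product with MULTIPLIER stays finite. */
    static int safeExponent() {
        return Math.getExponent(Double.MAX_VALUE / MULTIPLIER);
    }

    /** Returns true if value * MULTIPLIER overflows to infinity. */
    static boolean overflows(double value) {
        return Double.isInfinite(value * MULTIPLIER);
    }
}
```

In this sketch 2^996 itself still splits safely, while larger values with the same exponent can already overflow, which is why the limit is described as a point above which overflow may occur.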
-
DOWN_SCALE
private static final double DOWN_SCALE
The scale to use when down-scaling during a split into a high part. This must be smaller than the inverse of the multiplier and a power of 2 for exact scaling.
-
UP_SCALE
private static final double UP_SCALE
The scale to use when re-scaling during a split into a high part. This is the inverse of DOWN_SCALE.
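How the two scales fit around the split can be sketched as follows. The value 2^-30 for DOWN_SCALE is an assumption chosen only because it is a power of 2 below 1 / MULTIPLIER; the highPart method and the class name are hypothetical, and the sketch ignores the case where re-scaling a value close to Double.MAX_VALUE rounds up to infinity.

```java
final class ScaledSplitSketch {
    static final double MULTIPLIER = 1.0 + 0x1.0p27;
    static final double SAFE_UPPER = 0x1.0p996;
    /** Assumed value: a power of 2 smaller than 1 / MULTIPLIER. */
    static final double DOWN_SCALE = 0x1.0p-30;
    /** Inverse of DOWN_SCALE. */
    static final double UP_SCALE = 0x1.0p30;

    /** Dekker split without scaling; NaN when MULTIPLIER * value overflows. */
    static double highPartUnscaled(double value) {
        final double c = MULTIPLIER * value;
        return c - (c - value);
    }

    /** Split that scales large inputs down first, then restores the magnitude. */
    static double highPart(double value) {
        if (Math.abs(value) >= SAFE_UPPER) {
            // Scaling by a power of 2 is exact, so the split itself is unchanged.
            return highPartUnscaled(value * DOWN_SCALE) * UP_SCALE;
        }
        return highPartUnscaled(value);
    }
}
```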
-
EXP_MASK
private static final int EXP_MASK
The mask to extract the raw 11-bit exponent. The value must be shifted 52-bits to remove the mantissa bits.
-
CMP_UNSIGNED_2046
private static final int CMP_UNSIGNED_2046
The value 2046 converted for use if using Integer.compareUnsigned(int, int). This requires adding Integer.MIN_VALUE to 2046.
-
CMP_UNSIGNED_MINUS_1
private static final int CMP_UNSIGNED_MINUS_1
The value -1 converted for use if using Integer.compareUnsigned(int, int). This requires adding Integer.MIN_VALUE to -1.
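The conversion can be illustrated with a small sketch. Adding Integer.MIN_VALUE flips the sign bit, so a signed comparison of the converted values orders them as unsigned values. The class and method names are hypothetical, and combining the two constants into a single range check on the raw exponent is an assumption based on the descriptions above, not a confirmed detail of the class.

```java
final class UnsignedCompareSketch {
    /** 2046 shifted into the signed range used for unsigned comparison. */
    static final int CMP_UNSIGNED_2046 = Integer.MIN_VALUE + 2046;
    /** -1 shifted the same way; the addition wraps to Integer.MAX_VALUE. */
    static final int CMP_UNSIGNED_MINUS_1 = Integer.MIN_VALUE - 1;

    /**
     * Returns true if the raw 11-bit exponent is 0 (zero or sub-normal)
     * or 2047 (infinite or NaN).
     */
    static boolean exponentNotNormal(int rawExponent) {
        // Same as Integer.compareUnsigned(rawExponent - 1, 2046) >= 0:
        // an exponent of 0 wraps to the largest unsigned value.
        return rawExponent + CMP_UNSIGNED_MINUS_1 >= CMP_UNSIGNED_2046;
    }
}
```

A single comparison then covers both ends of the exponent range, which is one plausible reason for pre-computing the constants.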
-
-
Constructor Details
-
ExtendedPrecision
private ExtendedPrecision()
Private constructor.
-
-
Method Details
-
productLow
static double productLow(double x, double y, double xy)
Compute the low part of the double length number (z,zz) for the exact product of x and y. This is equivalent to computing a double containing the magnitude of the rounding error when converting the exact 106-bit significand of the multiplication result to a 53-bit significand.
The method is written to be functionally similar to using a fused multiply add (FMA) operation to compute the low part, for example JDK 9's Math.fma function (note the sign change in the input argument for the product):
    double x = ...;
    double y = ...;
    double xy = x * y;
    double low1 = Math.fma(x, y, -xy);
    double low2 = productLow(x, y, xy);
Special cases:
- If x * y is sub-normal or zero then the result is 0.0.
- If x * y is infinite or NaN then the result is NaN.
Parameters:
x - First factor.
y - Second factor.
xy - Product of the factors (x * y).
Returns:
the low part of the product double length number
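The special cases listed for productLow can be reproduced with one expression, since xy - xy is 0.0 for any finite xy and NaN for an infinite or NaN xy. The sketch below is an illustration only: it requires Java 9 or later, uses Math.fma as a stand-in for the Dekker computation, and the class and method names are hypothetical.

```java
final class ProductLowSpecialCases {
    /**
     * Hypothetical stand-in for productLow: Math.fma supplies the round-off
     * for normal products and xy - xy supplies the special-case results.
     */
    static double lowOrSpecial(double x, double y, double xy) {
        final double abs = Math.abs(xy);
        if (abs <= Double.MIN_NORMAL || !(abs <= Double.MAX_VALUE)) {
            // 0.0 for a zero or sub-normal product, NaN for infinite or NaN.
            return xy - xy;
        }
        return Math.fma(x, y, -xy);
    }
}
```

Note that a bare Math.fma(x, y, -xy) would return the actual round-off for a sub-normal product, so the guard is what produces the documented result of 0.0 in that case.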
-
isNotNormal
static boolean isNotNormal(double a)
Checks if the number is not normal. This is functionally equivalent to:
    final double abs = Math.abs(a);
    return (abs <= Double.MIN_NORMAL || !(abs <= Double.MAX_VALUE));
Parameters:
a - The value.
Returns:
true if the value is not normal
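For comparison, the check can also be written against the raw exponent bits. This is a sketch under the assumption that "not normal" means a raw exponent of 0 or 2047; the class and method names are hypothetical. The two forms agree everywhere except at a magnitude of exactly Double.MIN_NORMAL, which the comparison form above flags and the bit form does not.

```java
final class NotNormalSketch {
    /** Mask for the raw 11-bit exponent after shifting out 52 mantissa bits. */
    private static final int EXP_MASK = 0x7ff;

    /** Comparison-based form quoted in the documentation above. */
    static boolean byComparison(double a) {
        final double abs = Math.abs(a);
        return abs <= Double.MIN_NORMAL || !(abs <= Double.MAX_VALUE);
    }

    /** Bit-based form: the raw exponent is 0 or 2047. */
    static boolean byExponent(double a) {
        final int exp = (int) (Double.doubleToRawLongBits(a) >>> 52) & EXP_MASK;
        return exp == 0 || exp == EXP_MASK;
    }
}
```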
-
productLowUnscaled
private static double productLowUnscaled(double x, double y, double xy)
Compute the low part of the double length number (z,zz) for the exact product of x and y using Dekker's mult12 algorithm. The standard precision product x*y must be provided. The numbers x and y are split into high and low parts using Dekker's algorithm.
Warning: This method does not perform scaling in Dekker's split and large finite numbers can create NaN results.
Parameters:
x - First factor.
y - Second factor.
xy - Product of the factors (x * y).
Returns:
the low part of the product double length number
-
squareLowUnscaled
static double squareLowUnscaled(double x, double xx)
Compute the low part of the double length number (z,zz) for the exact square of x using Dekker's mult12 algorithm. The standard precision product x*x must be provided. The number x is split into high and low parts using Dekker's algorithm.
Warning: This method does not perform scaling in Dekker's split and large finite numbers can create NaN results.
Parameters:
x - Number to square.
xx - Standard precision product x*x.
Returns:
the low part of the square double length number
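A square needs only one split, and the two cross terms of the product formula fold into a single 2 * lx * hx term. The sketch below shows one way this could look; the folded formula and the names are assumptions for illustration and are not taken from the class's source.

```java
final class SquareLowSketch {
    /** 2^27 + 1, the split constant for a 53-bit significand. */
    static final double MULTIPLIER = 1.0 + 0x1.0p27;

    /** Low part of x * x given the rounded square xx; no scaling is applied. */
    static double squareLow(double x, double xx) {
        // Dekker split of x into hx + lx.
        final double c = MULTIPLIER * x;
        final double hx = c - (c - x);
        final double lx = x - hx;
        // Product formula with y = x; the two cross terms are combined.
        return lx * lx - ((xx - hx * hx) - 2 * lx * hx);
    }
}
```

For example, with x = 1 + 2^-30 the exact square is 1 + 2^-29 + 2^-60; the rounded square drops the last term and the sketch returns it as the low part.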
-
productLow
private static double productLow(double hx, double lx, double hy, double ly, double xy)
Compute the low part of the double length number (z,zz) for the exact product of x and y using Dekker's mult12 algorithm. The standard precision product x*y must be provided. The numbers x and y should already be split into low and high parts.
Note: This uses the high part of the result (z,zz) as x * y and not hx * hy + hx * ty + tx * hy as specified in Dekker's original paper. See Shewchuk (1997) for working examples.
Parameters:
hx - High part of first factor.
lx - Low part of first factor.
hy - High part of second factor.
ly - Low part of second factor.
xy - Product of the factors.
Returns:
lx * ly - (((xy - hx * hy) - lx * hy) - hx * ly)
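The returned expression can be checked against exact arithmetic. The following is a minimal sketch: the split helper, the two-argument convenience overload, and the BigDecimal reference are additions for illustration, and no scaling is applied, so it is only meaningful for inputs well inside the normal range.

```java
import java.math.BigDecimal;

final class Mult12Sketch {
    /** 2^27 + 1, the split constant for a 53-bit significand. */
    static final double MULTIPLIER = 1.0 + 0x1.0p27;

    /** Unscaled Dekker split: returns the high part of v. */
    static double highPart(double v) {
        final double c = MULTIPLIER * v;
        return c - (c - v);
    }

    /** The expression given in the Returns clause above. */
    static double productLow(double hx, double lx, double hy, double ly, double xy) {
        return lx * ly - (((xy - hx * hy) - lx * hy) - hx * ly);
    }

    /** Hypothetical convenience overload: split both factors, then apply the formula. */
    static double productLow(double x, double y) {
        final double hx = highPart(x);
        final double hy = highPart(y);
        return productLow(hx, x - hx, hy, y - hy, x * y);
    }

    /** Reference value: exact product minus rounded product, via BigDecimal. */
    static double exactLow(double x, double y) {
        return new BigDecimal(x).multiply(new BigDecimal(y))
            .subtract(new BigDecimal(x * y)).doubleValue();
    }
}
```

Each partial product of the split parts fits in a double without rounding, which is why the subtractions recover the round-off of x * y.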
-
highPartUnscaled
static double highPartUnscaled(double value)
Implement Dekker's method to split a value into two parts. Multiplying by (2^s + 1) creates a big value from which to derive the two split parts.
    c = (2^s + 1) * a
    a_big = c - a
    a_hi = c - a_big
    a_lo = a - a_hi
    a = a_hi + a_lo
The multiplicand allows a p-bit value to be split into a (p-s)-bit value a_hi and a non-overlapping (s-1)-bit value a_lo. Combined they have (p-1) bits of significand but the sign bit of a_lo contains a bit of information. The constant is chosen so that s is ceil(p/2) where the precision p for a double is 53-bits (1-bit of the mantissa is assumed to be 1 for a non sub-normal number) and s is 27.
This conversion does not use scaling and the result of overflow is NaN. Overflow may occur when the exponent of the input value is above 996.
Splitting a NaN or infinite value will return NaN.
Parameters:
value - Value.
Returns:
the high part of the value.
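The steps listed above translate directly into code. This sketch returns both parts so that the properties described can be inspected; the class name, the array return type, and the bit-counting helper are hypothetical additions for illustration.

```java
final class DekkerSplitSketch {
    /** 2^s + 1 with s = 27. */
    static final double MULTIPLIER = 1.0 + 0x1.0p27;

    /** Returns {a_hi, a_lo} following the five steps listed above. */
    static double[] split(double a) {
        final double c = MULTIPLIER * a;
        final double aBig = c - a;
        final double aHi = c - aBig;
        final double aLo = a - aHi;
        return new double[] {aHi, aLo};
    }

    /** Trailing zero bits in the 52-bit mantissa field (52 if the field is zero). */
    static int trailingZeroMantissaBits(double v) {
        final long mantissa = Double.doubleToRawLongBits(v) & 0xfffffffffffffL;
        return Math.min(52, Long.numberOfTrailingZeros(mantissa));
    }
}
```

For a finite input that does not overflow, the two parts sum back to the input exactly and the high part leaves the lower mantissa bits clear, which is what allows products of split parts to be computed without rounding.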
-
twoSumLow
static double twoSumLow(double a, double b, double sum)
Compute the round-off from the sum of two numbers a and b using Knuth's two-sum algorithm. The values are not required to be ordered by magnitude. The standard precision sum must be provided.
Parameters:
a - First part of sum.
b - Second part of sum.
sum - Sum of the parts (a + b).
Returns:
(b - (sum - (sum - b))) + (a - (sum - b))
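The returned expression can be tried on a small case where the round-off is easy to see. This is a minimal sketch using the expression from the Returns clause; the class name and the BigDecimal reference helper are additions for illustration.

```java
import java.math.BigDecimal;

final class TwoSumSketch {
    /** Round-off of a + b given the rounded sum, using the expression above. */
    static double twoSumLow(double a, double b, double sum) {
        final double bVirtual = sum - a == sum - a ? sum - (sum - b) : Double.NaN;
        // bVirtual is the part of b that made it into sum; the rest is round-off.
        return (b - bVirtual) + (a - (sum - b));
    }

    /** Reference value: exact sum minus rounded sum, via BigDecimal. */
    static double exactLow(double a, double b) {
        return new BigDecimal(a).add(new BigDecimal(b))
            .subtract(new BigDecimal(a + b)).doubleValue();
    }
}
```

With a = 1 and b = 2^-60 the rounded sum is 1.0 and the sketch returns 2^-60; swapping the arguments gives the same result, since no ordering by magnitude is needed.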
-