Class EmpiricalDistribution
java.lang.Object
  org.apache.commons.math3.distribution.AbstractRealDistribution
    org.apache.commons.math3.random.EmpiricalDistribution
All Implemented Interfaces:
java.io.Serializable, RealDistribution
public class EmpiricalDistribution extends AbstractRealDistribution
Represents an empirical probability distribution -- a probability distribution derived from observed data without making any assumptions about the functional form of the population distribution that the data come from.
An EmpiricalDistribution maintains data structures, called distribution digests, that describe empirical distributions and support the following operations:
- loading the distribution from a file of observed data values
- dividing the input data into "bin ranges" and reporting bin frequency counts (data for a histogram)
- reporting univariate statistics describing the full set of data values as well as the observations within each bin
- generating random values from the distribution
Applications can use EmpiricalDistribution to build grouped frequency histograms representing the input data or to generate random values "like" those in the input file, i.e., values that follow the distribution of the values in the file.
The implementation uses what amounts to the Variable Kernel Method with Gaussian smoothing:
Digesting the input file
- Pass the file once to compute min and max.
- Divide the range from min to max into binCount "bins."
- Pass the data file again, computing bin counts and univariate statistics (mean, std dev.) for each of the bins.
- Divide the interval (0,1) into subintervals associated with the bins, with the length of a bin's subinterval proportional to its count.
Generating random values from the distribution
- Generate a uniformly distributed value in (0,1).
- Select the subinterval to which the value belongs.
- Generate a random Gaussian value with mean = mean of the associated bin and std dev = std dev of the associated bin.
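The first three digest steps (range, binning, per-bin statistics) can be sketched with the JDK alone. This is an illustration of the description above, not the Commons Math code: the class name DigestSketch, its fields, and the use of Welford's update for the per-bin variance are assumptions made for the example, and the input array is assumed to be non-empty.

```java
import java.util.Arrays;

/** Illustrative two-pass digest; not the Commons Math implementation. */
final class DigestSketch {
    final double min;
    final double max;
    final long[] counts;    // number of observations per bin
    final double[] means;   // per-bin mean
    final double[] stdDevs; // per-bin sample std dev (0 when a bin has fewer than 2 values)

    DigestSketch(double[] data, int binCount) {
        // Pass 1: range of the data (assumes data is non-empty).
        min = Arrays.stream(data).min().getAsDouble();
        max = Arrays.stream(data).max().getAsDouble();
        double delta = (max - min) / binCount;

        counts = new long[binCount];
        means = new double[binCount];
        stdDevs = new double[binCount];
        double[] m2 = new double[binCount]; // running sums of squared deviations

        // Pass 2: assign each value to a bin and update that bin's statistics.
        for (double v : data) {
            int raw = (int) Math.ceil((v - min) / delta) - 1;
            int b = Math.min(Math.max(raw, 0), binCount - 1);
            long n = ++counts[b];
            double d = v - means[b];
            means[b] += d / n;
            m2[b] += d * (v - means[b]);
        }
        for (int b = 0; b < binCount; b++) {
            stdDevs[b] = counts[b] > 1 ? Math.sqrt(m2[b] / (counts[b] - 1)) : 0.0;
        }
    }

    public static void main(String[] args) {
        DigestSketch d = new DigestSketch(new double[] {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, 3);
        System.out.println(Arrays.toString(d.counts));
        System.out.println(Arrays.toString(d.means));
    }
}
```

The sketch holds the data in memory, so the "two passes" are two loops over an array rather than two reads of a file.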
EmpiricalDistribution implements the RealDistribution interface as follows. Given x within the range of values in the dataset, let B be the bin containing x and let K be the within-bin kernel for B. Let P(B-) be the sum of the probabilities of the bins below B, let K(B) be the mass of B under K (i.e., the integral of the kernel density over B), and let K(B-) be the kernel distribution function evaluated at the lower endpoint of B. Then set P(X <= x) = P(B-) + P(B) * [K(x) - K(B-)] / K(B), where K(x) is the kernel distribution function evaluated at x. This results in a cdf that matches the grouped frequency distribution at the bin endpoints and interpolates within bins using within-bin kernels.
USAGE NOTES:
- The binCount is set by default to 1000. A good rule of thumb is to set the bin count to approximately the length of the input file divided by 10.
- The input file must be a plain text file containing one valid numeric entry per line.
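As a rough illustration of the two usage notes, the sketch below parses text with one numeric entry per line and applies the divide-by-10 rule of thumb. The class InputFormatSketch and both method names are hypothetical, and the strict behaviour (a NumberFormatException on any non-numeric line) is an assumption of the example, not a statement about how the library validates input.

```java
import java.io.BufferedReader;
import java.io.StringReader;

/** Illustrative input parsing and bin-count heuristic; names are hypothetical. */
final class InputFormatSketch {
    /** Parses one numeric entry per line; a malformed line throws NumberFormatException. */
    static double[] readValues(BufferedReader reader) {
        return reader.lines()
                .map(String::trim)
                .mapToDouble(Double::parseDouble)
                .toArray();
    }

    /** Rule of thumb from the usage notes: about one bin per ten values, at least one bin. */
    static int ruleOfThumbBinCount(int valueCount) {
        return Math.max(1, valueCount / 10);
    }

    public static void main(String[] args) {
        double[] values = readValues(new BufferedReader(new StringReader("1.5\n2\n-3e2\n")));
        System.out.println(values.length + " values, bins = " + ruleOfThumbBinCount(values.length));
    }
}
```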
- See Also:
- Serialized Form
-
-
Nested Class Summary
Nested Classes (Modifier and Type, Class, Description):
- private class EmpiricalDistribution.ArrayDataAdapter: DataAdapter for data provided as an array of doubles.
- private class EmpiricalDistribution.DataAdapter: Provides methods for computing sampleStats and binStats, abstracting the source of data.
- private class EmpiricalDistribution.StreamDataAdapter: DataAdapter for data provided through some input stream.
-
Field Summary
Fields (Modifier and Type, Field, Description):
- private int binCount: Number of bins
- private java.util.List<SummaryStatistics> binStats: List of SummaryStatistics objects characterizing the bins
- static int DEFAULT_BIN_COUNT: Default bin count
- private double delta: Grid size
- private static java.lang.String FILE_CHARSET: Character set for file input
- private boolean loaded: Whether the distribution has been loaded
- private double max: Max loaded value
- private double min: Min loaded value
- protected RandomDataGenerator randomData: RandomDataGenerator instance to use in repeated calls to getNextValue()
- private SummaryStatistics sampleStats: Sample statistics
- private static long serialVersionUID: Serializable version identifier
- private double[] upperBounds: Upper bounds of subintervals in (0,1) "belonging" to the bins
Fields inherited from class org.apache.commons.math3.distribution.AbstractRealDistribution
random, SOLVER_DEFAULT_ABSOLUTE_ACCURACY
-
-
Constructor Summary
Constructors (Modifier, Constructor, Description):
- EmpiricalDistribution(): Creates a new EmpiricalDistribution with the default bin count.
- EmpiricalDistribution(int binCount): Creates a new EmpiricalDistribution with the specified bin count.
- private EmpiricalDistribution(int binCount, RandomDataGenerator randomData): Private constructor to allow lazy initialisation of the RNG contained in the randomData instance variable.
- EmpiricalDistribution(int binCount, RandomDataImpl randomData): Deprecated. As of 3.1.
- EmpiricalDistribution(int binCount, RandomGenerator generator): Creates a new EmpiricalDistribution with the specified bin count using the provided RandomGenerator as the source of random data.
- EmpiricalDistribution(RandomDataImpl randomData): Deprecated. As of 3.1.
- EmpiricalDistribution(RandomGenerator generator): Creates a new EmpiricalDistribution with default bin count using the provided RandomGenerator as the source of random data.
-
Method Summary
Methods (Modifier and Type, Method, Description):
- private double cumBinP(int binIndex): The combined probability of the bins up to and including binIndex.
- double cumulativeProbability(double x): For a random variable X whose values are distributed according to this distribution, this method returns P(X <= x).
- double density(double x): Returns the probability density function (PDF) of this distribution evaluated at the specified point x.
- private void fillBinStats(EmpiricalDistribution.DataAdapter da): Fills binStats array (second pass through data file).
- private int findBin(double value): Returns the index of the bin to which the given value belongs.
- int getBinCount(): Returns the number of bins.
- java.util.List<SummaryStatistics> getBinStats(): Returns a List of SummaryStatistics instances containing statistics describing the values in each of the bins.
- double[] getGeneratorUpperBounds(): Returns a fresh copy of the array of upper bounds of the subintervals of [0,1] used in generating data from the empirical distribution.
- protected RealDistribution getKernel(SummaryStatistics bStats): The within-bin smoothing kernel.
- double getNextValue(): Generates a random value from this distribution.
- double getNumericalMean(): Use this method to get the numerical value of the mean of this distribution.
- double getNumericalVariance(): Use this method to get the numerical value of the variance of this distribution.
- StatisticalSummary getSampleStats(): Returns a StatisticalSummary describing this distribution.
- double getSupportLowerBound(): Access the lower bound of the support.
- double getSupportUpperBound(): Access the upper bound of the support.
- double[] getUpperBounds(): Returns a fresh copy of the array of upper bounds for the bins.
- double inverseCumulativeProbability(double p): Computes the quantile function of this distribution.
- boolean isLoaded(): Property indicating whether or not the distribution has been loaded.
- boolean isSupportConnected(): Use this method to get information about whether the support is connected, i.e. whether all values between the lower and upper bound of the support are included in the support.
- boolean isSupportLowerBoundInclusive(): Whether or not the lower bound of support is in the domain of the density function.
- boolean isSupportUpperBoundInclusive(): Whether or not the upper bound of support is in the domain of the density function.
- private RealDistribution k(double x): The within-bin kernel of the bin that x belongs to.
- private double kB(int i): Mass of bin i under the within-bin kernel of the bin.
- void load(double[] in): Computes the empirical distribution from the provided array of numbers.
- void load(java.io.File file): Computes the empirical distribution from the input file.
- void load(java.net.URL url): Computes the empirical distribution using data read from a URL.
- private double pB(int i): The probability of bin i.
- private double pBminus(int i): The combined probability of the bins up to but not including bin i.
- double probability(double x): For a random variable X whose values are distributed according to this distribution, this method returns P(X = x).
- void reSeed(long seed): Reseeds the random number generator used by getNextValue().
- void reseedRandomGenerator(long seed): Reseed the random generator used to generate samples.
Methods inherited from class org.apache.commons.math3.distribution.AbstractRealDistribution
cumulativeProbability, getSolverAbsoluteAccuracy, logDensity, probability, sample, sample
-
-
-
-
Field Detail
-
DEFAULT_BIN_COUNT
public static final int DEFAULT_BIN_COUNT
Default bin count
See Also:
Constant Field Values
-
FILE_CHARSET
private static final java.lang.String FILE_CHARSET
Character set for file input
See Also:
Constant Field Values
-
serialVersionUID
private static final long serialVersionUID
Serializable version identifier
See Also:
Constant Field Values
-
randomData
protected final RandomDataGenerator randomData
RandomDataGenerator instance to use in repeated calls to getNextValue()
-
binStats
private final java.util.List<SummaryStatistics> binStats
List of SummaryStatistics objects characterizing the bins
-
sampleStats
private SummaryStatistics sampleStats
Sample statistics
-
max
private double max
Max loaded value
-
min
private double min
Min loaded value
-
delta
private double delta
Grid size
-
binCount
private final int binCount
number of bins
-
loaded
private boolean loaded
is the distribution loaded?
-
upperBounds
private double[] upperBounds
upper bounds of subintervals in (0,1) "belonging" to the bins
-
-
Constructor Detail
-
EmpiricalDistribution
public EmpiricalDistribution()
Creates a new EmpiricalDistribution with the default bin count.
-
EmpiricalDistribution
public EmpiricalDistribution(int binCount)
Creates a new EmpiricalDistribution with the specified bin count.
Parameters:
binCount - number of bins. Must be strictly positive.
Throws:
NotStrictlyPositiveException - if binCount <= 0.
-
EmpiricalDistribution
public EmpiricalDistribution(int binCount, RandomGenerator generator)
Creates a new EmpiricalDistribution with the specified bin count using the provided RandomGenerator as the source of random data.
Parameters:
binCount - number of bins. Must be strictly positive.
generator - random data generator (may be null, resulting in default JDK generator)
Throws:
NotStrictlyPositiveException - if binCount <= 0.
Since:
3.0
-
EmpiricalDistribution
public EmpiricalDistribution(RandomGenerator generator)
Creates a new EmpiricalDistribution with default bin count using the provided RandomGenerator as the source of random data.
Parameters:
generator - random data generator (may be null, resulting in default JDK generator)
Since:
3.0
-
EmpiricalDistribution
@Deprecated public EmpiricalDistribution(int binCount, RandomDataImpl randomData)
Deprecated. As of 3.1. Please use EmpiricalDistribution(int,RandomGenerator) instead.
Creates a new EmpiricalDistribution with the specified bin count using the provided RandomDataImpl instance as the source of random data.
Parameters:
binCount - number of bins
randomData - random data generator (may be null, resulting in default JDK generator)
Since:
3.0
-
EmpiricalDistribution
@Deprecated public EmpiricalDistribution(RandomDataImpl randomData)
Deprecated. As of 3.1. Please use EmpiricalDistribution(RandomGenerator) instead.
Creates a new EmpiricalDistribution with default bin count using the provided RandomDataImpl as the source of random data.
Parameters:
randomData - random data generator (may be null, resulting in default JDK generator)
Since:
3.0
-
EmpiricalDistribution
private EmpiricalDistribution(int binCount, RandomDataGenerator randomData)
Private constructor to allow lazy initialisation of the RNG contained in the randomData instance variable.
Parameters:
binCount - number of bins. Must be strictly positive.
randomData - Random data generator.
Throws:
NotStrictlyPositiveException - if binCount <= 0.
-
-
Method Detail
-
load
public void load(double[] in) throws NullArgumentException
Computes the empirical distribution from the provided array of numbers.
Parameters:
in - the input data array
Throws:
NullArgumentException - if in is null
-
load
public void load(java.net.URL url) throws java.io.IOException, NullArgumentException, ZeroException
Computes the empirical distribution using data read from a URL.
The input file must be an ASCII text file containing one valid numeric entry per line.
Parameters:
url - URL of the input file
Throws:
java.io.IOException - if an IO error occurs
NullArgumentException - if url is null
ZeroException - if the URL contains no data
-
load
public void load(java.io.File file) throws java.io.IOException, NullArgumentException
Computes the empirical distribution from the input file.
The input file must be an ASCII text file containing one valid numeric entry per line.
Parameters:
file - the input file
Throws:
java.io.IOException - if an IO error occurs
NullArgumentException - if file is null
-
fillBinStats
private void fillBinStats(EmpiricalDistribution.DataAdapter da) throws java.io.IOException
Fills binStats array (second pass through data file).
Parameters:
da - object providing access to the data
Throws:
java.io.IOException - if an IO error occurs
-
findBin
private int findBin(double value)
Returns the index of the bin to which the given value belongs.
Parameters:
value - the value whose bin we are trying to find
Returns:
the index of the bin containing the value
-
getNextValue
public double getNextValue() throws MathIllegalStateException
Generates a random value from this distribution.
Preconditions:
- the distribution must be loaded before invoking this method
Returns:
the random value.
Throws:
MathIllegalStateException - if the distribution has not been loaded
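The three generation steps from the class description (uniform draw, subinterval selection, Gaussian draw with the bin's mean and std dev) can be sketched as follows. SamplingSketch and its method names are hypothetical, java.util.Random stands in for the library's random data source, and the arrays are assumed to come from a previously computed digest.

```java
import java.util.Random;

/** Illustrative sampler for a binned empirical distribution; not the Commons Math code. */
final class SamplingSketch {
    /** Index of the first subinterval of (0,1) whose upper bound exceeds u. */
    static int selectBin(double[] generatorUpperBounds, double u) {
        for (int i = 0; i < generatorUpperBounds.length; i++) {
            if (u < generatorUpperBounds[i]) {
                return i;
            }
        }
        return generatorUpperBounds.length - 1; // guard against round-off in the last bound
    }

    /** Draws a uniform value, selects its bin, then draws a Gaussian with that bin's mean and std dev. */
    static double nextValue(double[] generatorUpperBounds, double[] means,
                            double[] stdDevs, Random rng) {
        int bin = selectBin(generatorUpperBounds, rng.nextDouble());
        return means[bin] + stdDevs[bin] * rng.nextGaussian();
    }

    public static void main(String[] args) {
        Random rng = new Random(42L);
        double[] bounds = {0.4, 0.7, 1.0};
        double[] means = {2.5, 6.0, 9.0};
        double[] stdDevs = {1.29, 1.0, 1.0};
        for (int i = 0; i < 5; i++) {
            System.out.println(nextValue(bounds, means, stdDevs, rng));
        }
    }
}
```

Using a strict comparison means a bin with zero observations, whose subinterval has zero length, is never selected.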
-
getSampleStats
public StatisticalSummary getSampleStats()
Returns a StatisticalSummary describing this distribution.
Preconditions:
- the distribution must be loaded before invoking this method
Returns:
the sample statistics
Throws:
java.lang.IllegalStateException - if the distribution has not been loaded
-
getBinCount
public int getBinCount()
Returns the number of bins.
Returns:
the number of bins.
-
getBinStats
public java.util.List<SummaryStatistics> getBinStats()
Returns a List of SummaryStatistics instances containing statistics describing the values in each of the bins. The list is indexed on the bin number.
Returns:
List of bin statistics.
-
getUpperBounds
public double[] getUpperBounds()
Returns a fresh copy of the array of upper bounds for the bins. Bins are:
[min, upperBounds[0]], (upperBounds[0], upperBounds[1]], ..., (upperBounds[binCount-2], upperBounds[binCount-1] = max].
Note: In versions 1.0-2.0 of commons-math, this method incorrectly returned the array of probability generator upper bounds now returned by getGeneratorUpperBounds().
Returns:
array of bin upper bounds
Since:
2.1
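The bin layout described above (equal-width bins, closed on the right, with the last upper bound pinned to max) can be illustrated with a small sketch. BinBoundsSketch and its methods are hypothetical names; the ceiling-based index formula is one way to realize the stated intervals and is an assumption of the example.

```java
import java.util.Arrays;

/** Illustrative equal-width bin bounds and bin lookup; names are hypothetical. */
final class BinBoundsSketch {
    /** Upper bounds of binCount equal-width bins over [min, max]; the last one is exactly max. */
    static double[] upperBounds(double min, double max, int binCount) {
        double delta = (max - min) / binCount;
        double[] bounds = new double[binCount];
        for (int i = 0; i < binCount - 1; i++) {
            bounds[i] = min + delta * (i + 1);
        }
        bounds[binCount - 1] = max;
        return bounds;
    }

    /** Index of the bin containing value, with bins [min, b0], (b0, b1], ..., clamped to the valid range. */
    static int findBin(double value, double min, double max, int binCount) {
        double delta = (max - min) / binCount;
        int raw = (int) Math.ceil((value - min) / delta) - 1;
        return Math.min(Math.max(raw, 0), binCount - 1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(upperBounds(1.0, 10.0, 3)));
        System.out.println(findBin(4.0, 1.0, 10.0, 3));
        System.out.println(findBin(4.1, 1.0, 10.0, 3));
    }
}
```

A value sitting exactly on a bound falls into the lower of the two adjacent bins, matching the right-closed intervals in the description.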
-
getGeneratorUpperBounds
public double[] getGeneratorUpperBounds()
Returns a fresh copy of the array of upper bounds of the subintervals of [0,1] used in generating data from the empirical distribution. Subintervals correspond to bins with lengths proportional to bin counts.
Preconditions:
- the distribution must be loaded before invoking this method
In versions 1.0-2.0 of commons-math, this array was (incorrectly) returned by getUpperBounds().
Returns:
array of upper bounds of subintervals used in data generation
Throws:
java.lang.NullPointerException - unless a load method has been called beforehand.
Since:
2.1
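Since the subinterval lengths are proportional to the bin counts, the generator upper bounds are the cumulative bin proportions. A minimal sketch, assuming the per-bin counts are already known and sum to a positive total (GeneratorBoundsSketch is a hypothetical name):

```java
import java.util.Arrays;

/** Illustrative cumulative-proportion bounds; not the Commons Math implementation. */
final class GeneratorBoundsSketch {
    /** Cumulative proportions of the bin counts; assumes the counts sum to a positive total. */
    static double[] generatorUpperBounds(long[] counts) {
        long total = 0;
        for (long c : counts) {
            total += c;
        }
        double[] bounds = new double[counts.length];
        long running = 0;
        for (int i = 0; i < counts.length; i++) {
            running += counts[i];
            bounds[i] = (double) running / total; // last entry is total / total = 1.0
        }
        return bounds;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(generatorUpperBounds(new long[] {4, 0, 3, 3})));
    }
}
```

An empty bin repeats the previous bound, so its subinterval has zero length.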
-
isLoaded
public boolean isLoaded()
Property indicating whether or not the distribution has been loaded.
Returns:
true if the distribution has been loaded
-
reSeed
public void reSeed(long seed)
Reseeds the random number generator used by getNextValue().
Parameters:
seed - random generator seed
Since:
3.0
-
probability
public double probability(double x)
For a random variable X whose values are distributed according to this distribution, this method returns P(X = x). In other words, this method represents the probability mass function (PMF) for the distribution.
Specified by:
probability in interface RealDistribution
Overrides:
probability in class AbstractRealDistribution
Parameters:
x - the point at which the PMF is evaluated
Returns:
zero.
Since:
3.1
-
density
public double density(double x)
Returns the probability density function (PDF) of this distribution evaluated at the specified point x. In general, the PDF is the derivative of the CDF. If the derivative does not exist at x, then an appropriate replacement should be returned, e.g. Double.POSITIVE_INFINITY, Double.NaN, or the limit inferior or limit superior of the difference quotient.
Returns the kernel density normalized so that its integral over each bin equals the bin mass.
Algorithm description:
- Find the bin B that x belongs to.
- Compute K(B) = the mass of B with respect to the within-bin kernel (i.e., the integral of the kernel density over B).
- Return k(x) * P(B) / K(B), where k is the within-bin kernel density and P(B) is the mass of B.
Parameters:
x - the point at which the PDF is evaluated
Returns:
the value of the probability density function at point x
Since:
3.1
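The normalization k(x) * P(B) / K(B) can be sketched for a single bin as below. Because the JDK has no Gaussian distribution function, the kernel is passed in as two functions. DensitySketch and its parameter names are hypothetical, and the example call uses a uniform kernel as a stand-in for the Gaussian one. Bins with K(B) = 0 are not handled.

```java
import java.util.function.DoubleUnaryOperator;

/** Illustrative within-bin density normalization; not the Commons Math implementation. */
final class DensitySketch {
    /**
     * @param x         point inside the bin (lower, upper]
     * @param kernelPdf density k of the bin's kernel
     * @param kernelCdf distribution function K of the same kernel
     * @param lower     lower endpoint of the bin
     * @param upper     upper endpoint of the bin
     * @param binMass   P(B), the fraction of observations in the bin
     */
    static double density(double x, DoubleUnaryOperator kernelPdf, DoubleUnaryOperator kernelCdf,
                          double lower, double upper, double binMass) {
        // K(B): mass the kernel assigns to the bin (assumed > 0 here).
        double kB = kernelCdf.applyAsDouble(upper) - kernelCdf.applyAsDouble(lower);
        return kernelPdf.applyAsDouble(x) * binMass / kB;
    }

    public static void main(String[] args) {
        // Stand-in kernel: uniform on [0, 10], so k(x) = 0.1 and K(x) = x / 10.
        double d = density(3.0, x -> 0.1, x -> x / 10.0, 2.0, 4.0, 0.5);
        System.out.println(d);
    }
}
```

With the uniform stand-in the density is constant on the bin, so multiplying it by the bin width recovers P(B), which is the property the normalization is meant to guarantee.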
-
cumulativeProbability
public double cumulativeProbability(double x)
For a random variable X whose values are distributed according to this distribution, this method returns P(X <= x). In other words, this method represents the (cumulative) distribution function (CDF) for this distribution.
Algorithm description:
- Find the bin B that x belongs to.
- Compute P(B) = the mass of B and P(B-) = the combined mass of the bins below B.
- Compute K(B) = the probability mass of B with respect to the within-bin kernel and K(B-) = the kernel distribution evaluated at the lower endpoint of B.
- Return P(B-) + P(B) * [K(x) - K(B-)] / K(B), where K(x) is the within-bin kernel distribution function evaluated at x.
Parameters:
x - the point at which the CDF is evaluated
Returns:
the probability that a random variable with this distribution takes a value less than or equal to x
Since:
3.1
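A single-bin sketch of the formula P(B-) + P(B) * [K(x) - K(B-)] / K(B) follows. CdfSketch and its parameter names are hypothetical. The example uses a logistic kernel as a stand-in, since it has a closed-form distribution function; the documented kernel is Gaussian.

```java
import java.util.function.DoubleUnaryOperator;

/** Illustrative within-bin CDF interpolation; not the Commons Math implementation. */
final class CdfSketch {
    /**
     * @param x         point inside the bin (lower, upper]
     * @param kernelCdf distribution function K of the bin's kernel
     * @param lower     lower endpoint of the bin
     * @param upper     upper endpoint of the bin
     * @param massBelow P(B-), combined mass of the bins below
     * @param binMass   P(B), mass of this bin
     */
    static double cdf(double x, DoubleUnaryOperator kernelCdf,
                      double lower, double upper, double massBelow, double binMass) {
        double kLower = kernelCdf.applyAsDouble(lower);      // K(B-)
        double kB = kernelCdf.applyAsDouble(upper) - kLower; // K(B), assumed > 0
        return massBelow + binMass * (kernelCdf.applyAsDouble(x) - kLower) / kB;
    }

    public static void main(String[] args) {
        // Stand-in kernel: logistic distribution function centred at 3 with scale 0.5.
        DoubleUnaryOperator logistic = x -> 1.0 / (1.0 + Math.exp(-(x - 3.0) / 0.5));
        System.out.println(cdf(2.0, logistic, 2.0, 4.0, 0.3, 0.5)); // value at the lower endpoint
        System.out.println(cdf(4.0, logistic, 2.0, 4.0, 0.3, 0.5)); // value at the upper endpoint
    }
}
```

At the bin endpoints the bracketed term is 0 and K(B) respectively, so the result equals P(B-) and P(B-) + P(B) whatever kernel is used.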
-
inverseCumulativeProbability
public double inverseCumulativeProbability(double p) throws OutOfRangeException
Computes the quantile function of this distribution. For a random variable X distributed according to this distribution, the returned value is
- inf{x in R | P(X<=x) >= p} for 0 < p <= 1,
- inf{x in R | P(X<=x) > 0} for p = 0.
Boundary values:
- RealDistribution.getSupportLowerBound() for p = 0,
- RealDistribution.getSupportUpperBound() for p = 1.
Algorithm description:
- Find the smallest i such that the sum of the masses of the bins through i is at least p.
- Let B denote bin i and let K be the within-bin kernel distribution for B.
  Let K(B) be the mass of B under K.
  Let K(B-) be K evaluated at the lower endpoint of B (the combined mass of the bins below B under K).
  Let P(B) be the probability of bin i.
  Let P(B-) be the sum of the bin masses below bin i.
  Let pCrit = p - P(B-).
- Return the inverse of K evaluated at K(B-) + pCrit * K(B) / P(B).
Specified by:
inverseCumulativeProbability in interface RealDistribution
Overrides:
inverseCumulativeProbability in class AbstractRealDistribution
Parameters:
p - the cumulative probability
Returns:
the smallest p-quantile of this distribution (largest 0-quantile for p = 0)
Throws:
OutOfRangeException - if p < 0 or p > 1
Since:
3.1
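The quantile steps can be sketched in two parts: locate the bin from the cumulative bin masses, then invert the kernel at K(B-) + pCrit * K(B) / P(B). QuantileSketch, binFor, and quantile are hypothetical names, and the kernel and its inverse are supplied by the caller. The example uses a uniform kernel on [0, 10] as a stand-in for the Gaussian kernel.

```java
import java.util.function.DoubleUnaryOperator;

/** Illustrative quantile computation for one bin; not the Commons Math implementation. */
final class QuantileSketch {
    /** Smallest index i whose cumulative bin mass is at least p. */
    static int binFor(double[] cumulativeMass, double p) {
        int i = 0;
        while (i < cumulativeMass.length - 1 && cumulativeMass[i] < p) {
            i++;
        }
        return i;
    }

    /** Inverts the kernel at K(B-) + pCrit * K(B) / P(B) for the bin (lower, upper]. */
    static double quantile(double p, DoubleUnaryOperator kernelCdf,
                           DoubleUnaryOperator kernelInverseCdf,
                           double lower, double upper, double massBelow, double binMass) {
        double kLower = kernelCdf.applyAsDouble(lower);      // K(B-)
        double kB = kernelCdf.applyAsDouble(upper) - kLower; // K(B)
        double pCrit = p - massBelow;                        // probability still to cover inside B
        return kernelInverseCdf.applyAsDouble(kLower + pCrit * kB / binMass);
    }

    public static void main(String[] args) {
        double[] cumulative = {0.3, 0.8, 1.0};
        System.out.println(binFor(cumulative, 0.55));
        // Stand-in kernel: uniform on [0, 10], K(x) = x / 10, inverse K(q) = 10 * q.
        System.out.println(quantile(0.55, x -> x / 10.0, q -> 10.0 * q, 2.0, 4.0, 0.3, 0.5));
    }
}
```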
-
getNumericalMean
public double getNumericalMean()
Use this method to get the numerical value of the mean of this distribution.
Returns:
the mean or Double.NaN if it is not defined
Since:
3.1
-
getNumericalVariance
public double getNumericalVariance()
Use this method to get the numerical value of the variance of this distribution.
Returns:
the variance (possibly Double.POSITIVE_INFINITY as for certain cases in TDistribution) or Double.NaN if it is not defined
Since:
3.1
-
getSupportLowerBound
public double getSupportLowerBound()
Access the lower bound of the support. This method must return the same value as inverseCumulativeProbability(0). In other words, this method must return inf {x in R | P(X <= x) > 0}.
Returns:
lower bound of the support (might be Double.NEGATIVE_INFINITY)
Since:
3.1
-
getSupportUpperBound
public double getSupportUpperBound()
Access the upper bound of the support. This method must return the same value as inverseCumulativeProbability(1). In other words, this method must return inf {x in R | P(X <= x) = 1}.
Returns:
upper bound of the support (might be Double.POSITIVE_INFINITY)
Since:
3.1
-
isSupportLowerBoundInclusive
public boolean isSupportLowerBoundInclusive()
Whether or not the lower bound of support is in the domain of the density function. Returns true iff getSupportLowerBound() is finite and density(getSupportLowerBound()) returns a non-NaN, non-infinite value.
Returns:
true if the lower bound of support is finite and the density function returns a non-NaN, non-infinite value there
Since:
3.1
-
isSupportUpperBoundInclusive
public boolean isSupportUpperBoundInclusive()
Whether or not the upper bound of support is in the domain of the density function. Returns true iff getSupportUpperBound() is finite and density(getSupportUpperBound()) returns a non-NaN, non-infinite value.
Returns:
true if the upper bound of support is finite and the density function returns a non-NaN, non-infinite value there
Since:
3.1
-
isSupportConnected
public boolean isSupportConnected()
Use this method to get information about whether the support is connected, i.e. whether all values between the lower and upper bound of the support are included in the support.
Returns:
whether the support is connected or not
Since:
3.1
-
reseedRandomGenerator
public void reseedRandomGenerator(long seed)
Reseed the random generator used to generate samples.
Specified by:
reseedRandomGenerator in interface RealDistribution
Overrides:
reseedRandomGenerator in class AbstractRealDistribution
Parameters:
seed - the new seed
Since:
3.1
-
pB
private double pB(int i)
The probability of bin i.
Parameters:
i - the index of the bin
Returns:
the probability that selection begins in bin i
-
pBminus
private double pBminus(int i)
The combined probability of the bins up to but not including bin i.
Parameters:
i - the index of the bin
Returns:
the probability that selection begins in a bin below bin i.
-
kB
private double kB(int i)
Mass of bin i under the within-bin kernel of the bin.
Parameters:
i - index of the bin
Returns:
the difference in the within-bin kernel cdf between the upper and lower endpoints of bin i
-
k
private RealDistribution k(double x)
The within-bin kernel of the bin that x belongs to.
Parameters:
x - the value to locate within a bin
Returns:
the within-bin kernel of the bin containing x
-
cumBinP
private double cumBinP(int binIndex)
The combined probability of the bins up to and including binIndex.
Parameters:
binIndex - maximum bin index
Returns:
sum of the probabilities of bins through binIndex
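The three bin-probability quantities documented above (pB, pBminus, cumBinP) are related through the cumulative bin proportions. The sketch below assumes they can all be read off one array of cumulative proportions, such as the generator upper bounds. That is an assumption of this example, and BinProbabilitySketch is a hypothetical class that only mirrors the documented method names.

```java
/** Illustrative bin probabilities from cumulative proportions; names mirror the docs but are hypothetical. */
final class BinProbabilitySketch {
    /** Probability of bin i: the difference between consecutive cumulative proportions. */
    static double pB(double[] cumulative, int i) {
        return i == 0 ? cumulative[0] : cumulative[i] - cumulative[i - 1];
    }

    /** Combined probability of the bins below bin i. */
    static double pBminus(double[] cumulative, int i) {
        return i == 0 ? 0.0 : cumulative[i - 1];
    }

    /** Combined probability of the bins up to and including binIndex. */
    static double cumBinP(double[] cumulative, int binIndex) {
        return cumulative[binIndex];
    }

    public static void main(String[] args) {
        double[] cumulative = {0.4, 0.7, 1.0};
        System.out.println(pB(cumulative, 1));
        System.out.println(pBminus(cumulative, 2));
        System.out.println(cumBinP(cumulative, 1));
    }
}
```

For every bin, pBminus(i) + pB(i) equals cumBinP(i) up to floating-point rounding, which is the identity the CDF and quantile computations rely on.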
-
getKernel
protected RealDistribution getKernel(SummaryStatistics bStats)
The within-bin smoothing kernel. Returns a Gaussian distribution parameterized by bStats, unless the bin contains only one observation, in which case a constant distribution is returned.
Parameters:
bStats - summary statistics for the bin
Returns:
within-bin kernel parameterized by bStats
-
-