Entropy

kxy.api.core.entropy.discrete_entropy(x)

Estimates the (Shannon) entropy of a discrete random variable taking up to q distinct values, given n i.i.d. samples,

\[h(x) = - \sum_{i=1}^q p_i \log p_i,\]

using the plug-in estimator.

Parameters:x ((n,) np.array) – i.i.d. samples from the distribution of interest.
Returns:h – The (Shannon) entropy of the discrete random variable of interest.
Return type:float
Raises:AssertionError – If the input has the wrong shape.
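
The plug-in estimator simply replaces each \(p_i\) with its empirical frequency. A minimal numpy sketch (illustrative, not the kxy implementation):

```python
import numpy as np

def plug_in_entropy(x):
    """Plug-in (empirical-frequency) Shannon entropy of a discrete sample."""
    assert len(x.shape) == 1, 'x should have shape (n,)'
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()       # empirical probabilities p_i
    return -np.sum(p * np.log(p))   # h = -sum_i p_i log p_i
```

For a balanced binary sample this returns \(\log 2\); for a constant sample it returns 0.
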
kxy.api.core.entropy.least_structured_continuous_entropy(x, space='dual', batch_indices=[])

Estimates the entropy of a continuous \(d\)-dimensional random variable under the least structured assumption for its copula.

When \(d>1\), the joint entropy decomposes as

\[h(x) = h(u) + \sum_{i=1}^d h(x_i),\]

where \(u\) is the copula-uniform representation of \(x\).
Parameters:
  • x ((n, d) np.array) – n i.i.d. draws from the data generating distribution.
  • space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
Returns:

h – The (differential) entropy of the data generating distribution, assuming its copula is maximum-entropy in the chosen space. By convention, when \(d=1\), this function is the same as scalar_continuous_entropy, and returns 0 when \(n=1\).

Return type:

float
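
When space='primal', the copula is Gaussian, and the decomposition \(h(x) = h(u) + \sum_i h(x_i)\) can be checked in closed form: for unit-variance Gaussian marginals with correlation matrix \(R\), \(h(x) = \frac{d}{2}\log(2\pi e) + \frac{1}{2}\log |R|\) and \(h(u) = \frac{1}{2}\log |R|\). A numpy sketch (the helper name is illustrative, not part of kxy's API):

```python
import numpy as np

def gaussian_joint_entropy(cov):
    """Differential entropy of a Gaussian with covariance matrix cov."""
    return 0.5 * np.linalg.slogdet(2.0 * np.pi * np.e * cov)[1]

# h(x) = h(u) + sum_i h(x_i) under a Gaussian copula:
R = np.array([[1.0, 0.6], [0.6, 1.0]])    # correlation matrix, unit variances
h_joint = gaussian_joint_entropy(R)
h_marginals = 2 * gaussian_joint_entropy(np.array([[1.0]]))
h_copula = 0.5 * np.linalg.slogdet(R)[1]  # Gaussian copula entropy h(u)
```
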

kxy.api.core.entropy.least_structured_copula_entropy(x, space='dual', batch_indices=[])

Estimates the entropy of the maximum-entropy copula in the chosen space.

Note

This also corresponds to the smallest total correlation that is evidenced by our maximum-entropy constraints, the total correlation being the negative of the copula entropy.

Parameters:
  • x ((n, d) np.array) – n i.i.d. draws from the data generating distribution.
  • space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
Returns:

h – The (differential) entropy of the least structured copula consistent with maximum-entropy constraints.

Return type:

float
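
As an illustration in the primal space, the maximum-entropy (Gaussian) copula with correlation matrix \(R\) has entropy \(\frac{1}{2}\log |R| \leq 0\), whose negative is the total correlation; in the bivariate case this reduces to the familiar mutual information \(-\frac{1}{2}\log(1-\rho^2)\). A numpy sketch (hypothetical helper, not kxy's API):

```python
import numpy as np

def gaussian_copula_entropy(R):
    """Entropy of the Gaussian copula with correlation matrix R."""
    return 0.5 * np.linalg.slogdet(R)[1]

R = np.array([[1.0, 0.9], [0.9, 1.0]])
h_u = gaussian_copula_entropy(R)  # negative unless R is the identity
total_correlation = -h_u          # = -0.5 log(1 - 0.9^2) in this bivariate case
```
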

kxy.api.core.entropy.least_structured_mixed_entropy(x_c, x_d, space='dual', batch_indices=[])

Estimates the joint entropy \(h(x_c, x_d)\), where \(x_c\) is a continuous random vector and \(x_d\) a discrete random vector.

Note

We use the identities

\[ \begin{aligned} h(x, y) &= h(y) + h(x \vert y) \\ &= h(y) + E\left[ h\left(x \vert y=\cdot\right) \right], \end{aligned} \]

that are true when \(x\) and \(y\) are either both continuous or both discrete to extend the definition of the joint entropy to the case where one is continuous and the other discrete.

Specifically,

\[h(x_c, x_d) = h(x_d) + \sum_{j=1}^q \mathbb{P}(x_d=j) h\left(x_c \vert x_d=j \right).\]
Parameters:
  • x_c ((n, d) np.array) – n i.i.d. draws from the continuous data generating distribution.
  • x_d ((n,) np.array) – n i.i.d. draws from the discrete data generating distribution, jointly sampled with x_c.
  • space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
Returns:

h – The entropy of the least structured distribution as evidenced by the maximum-entropy constraints.

Return type:

float
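
kxy estimates each \(h(x_c \vert x_d=j)\) with the least-structured machinery above; the sketch below swaps in one-dimensional Gaussian moment matching and a scalar x_c to keep it short, so it only illustrates the weighting identity, not the estimator itself:

```python
import numpy as np

def mixed_entropy_sketch(x_c, x_d):
    """h(x_c, x_d) = h(x_d) + sum_j P(x_d=j) h(x_c | x_d=j), with the
    conditional entropies estimated by Gaussian moment matching."""
    values, counts = np.unique(x_d, return_counts=True)
    p = counts / counts.sum()
    h_d = -np.sum(p * np.log(p))       # discrete entropy h(x_d)
    h_c_given_d = 0.0
    for pj, j in zip(p, values):
        var_j = np.var(x_c[x_d == j])  # within-class variance
        h_c_given_d += pj * 0.5 * np.log(2.0 * np.pi * np.e * var_j)
    return h_d + h_c_given_d
```

With a single discrete value the result collapses to the Gaussian entropy of x_c, as expected from the identity.
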

kxy.api.core.entropy.scalar_continuous_entropy(x, space='dual', method='gaussian-kde')

Estimates the (differential) entropy of a continuous scalar random variable.

Multiple methods are supported:

  • 'gaussian' for Gaussian moment matching: \(h(x) = \frac{1}{2} \log\left(2 \pi e \sigma^2 \right)\).
  • '1-spacing' for the standard 1-spacing estimator (see [1] and [2]):
\[h(x) \approx - \gamma(1) + \frac{1}{n-1} \sum_{i=1}^{n-1} \log \left[ n \left(x_{(i+1)} - x_{(i)} \right) \right],\]

where \(x_{(i)}\) is the i-th smallest sample, and \(\gamma\) is the digamma function.

  • 'gaussian-kde' (the default) for Gaussian kernel density estimation.
\[h(x) \approx - \frac{1}{n} \sum_{i=1}^n \log\left( \hat{p}\left(x_i\right) \right)\]

where \(\hat{p}\) is the Gaussian kernel density estimator of the true pdf using statsmodels.api.nonparametric.KDEUnivariate.

Parameters:x ((n,) np.array) – i.i.d. samples from the distribution of interest.
Returns:h – The (differential) entropy of the continuous scalar random variable of interest.
Return type:float
Raises:AssertionError – If the input has the wrong shape or the method is not supported.

References

[1]Kozachenko, L. F., and Nikolai N. Leonenko. “Sample estimate of the entropy of a random vector.” Problemy Peredachi Informatsii 23.2 (1987): 9-16.
[2]Beirlant, J., Dudewicz, E.J., Györfi, L., van der Meulen, E.C. “Nonparametric entropy estimation: an overview.” International Journal of Mathematical and Statistical Sciences. 6 (1): 17–40. (1997) ISSN 1055-7490.
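
The 'gaussian' and '1-spacing' methods are simple enough to sketch directly with numpy (here \(-\gamma(1)\) is Euler's constant, available as np.euler_gamma; this is an illustration, not kxy's implementation):

```python
import numpy as np

def gaussian_entropy(sigma):
    """'gaussian' method: h = 0.5 log(2 pi e sigma^2)."""
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

def one_spacing_entropy(x):
    """'1-spacing' method: average log of rescaled order-statistic spacings."""
    xs = np.sort(x)
    n = xs.shape[0]
    spacings = xs[1:] - xs[:-1]
    spacings = spacings[spacings > 0]  # guard against tied samples
    return np.euler_gamma + np.mean(np.log(n * spacings))
```

For a standard normal, gaussian_entropy(1.0) gives \(\frac{1}{2}\log(2\pi e) \approx 1.419\); for a large uniform [0, 1] sample, the 1-spacing estimate is close to the true entropy of 0.
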

Entropy Rate

kxy.api.core.entropy_rate.auto_predictability(sample, p=None, robust=False, p_ic='hqic', space='primal')

Estimates the measure of auto-predictability of a (vector-valued) time series:

\[ \begin{aligned} \mathbb{PR}\left(\{x_t \} \right) :&= h\left(x_* \right) - h\left( \{ x_t \} \right) \\ &= h\left(u_{x_*}\right) - h\left( \{ u_{x_t} \} \right). \end{aligned} \]
Parameters:
  • sample ((T, d) np.array) – Array of T sample observations of a \(d\)-dimensional process.
  • p (int or None) – Number of lags to compute for the autocovariance function. If p=None (the default), it is inferred by fitting a VAR model on the sample, using as information criterion p_ic.
  • robust (bool) – If True, the Pearson autocovariance function is estimated by first estimating a Spearman rank correlation, and then inferring the equivalent Pearson autocovariance function, under the Gaussian assumption.
  • p_ic (str) – The criterion used to learn the optimal value of p (by fitting a VAR(p) model) when p=None. Should be one of ‘hqic’ (Hannan-Quinn Information Criterion), ‘aic’ (Akaike Information Criterion), ‘bic’ (Bayes Information Criterion) and ‘t-stat’ (based on last lag). Same as the ‘ic’ parameter of statsmodels.tsa.api.VAR.
  • space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.

Warning

This function only supports primal inference for now.
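
For intuition, a stationary Gaussian AR(1) process \(x_t = \rho x_{t-1} + \epsilon_t\) admits the closed form \(\mathbb{PR} = -\frac{1}{2}\log(1-\rho^2)\): the marginal variance is \(\sigma_\epsilon^2/(1-\rho^2)\), while the entropy rate only sees the innovation variance. A numpy sketch of this special case (not the sample-based estimator above):

```python
import numpy as np

def ar1_auto_predictability(rho, noise_var=1.0):
    """PR for a Gaussian AR(1): h(x_*) - entropy rate = -0.5 log(1 - rho^2)."""
    marginal_var = noise_var / (1.0 - rho ** 2)  # stationary variance
    h_marginal = 0.5 * np.log(2 * np.pi * np.e * marginal_var)
    h_rate = 0.5 * np.log(2 * np.pi * np.e * noise_var)
    return h_marginal - h_rate
```

PR is 0 for white noise and grows without bound as \(|\rho| \to 1\), matching the intuition that more persistent series are more predictable.
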

kxy.api.core.entropy_rate.estimate_pearson_autocovariance(sample, p, robust=False)

Estimates the sample autocovariance function of a vector-valued process \(\{x_t\}\) up to lag p (starting from 0).

Parameters:
  • sample ((T, d) np.array) – Array of T sample observations of a d-dimensional process.
  • p (int) – Number of lags to compute for the autocovariance function.
  • robust (bool) – If True, the Pearson autocovariance function is estimated by first estimating a Spearman rank correlation, and then inferring the equivalent Pearson autocovariance function, under the Gaussian assumption.
Returns:

ac – Sample autocovariance matrix whose \((i, j)\) block of size \(d \times d\) is the covariance between \(x_{t+i}\) and \(x_{t+j}\).

Return type:

(dp, dp) np.array
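
The returned matrix is block-Toeplitz: stacking \(x_t, \dots, x_{t+p}\) gives a covariance whose \((i, j)\) block is \(C(i-j)\). A sketch of the assembly step, assuming acf[h] holds the lag-h autocovariance \(C(h)\) (illustrative, not kxy's function):

```python
import numpy as np

def block_toeplitz_autocovariance(acf):
    """Assemble the (d(p+1), d(p+1)) matrix whose (i, j) block of size d x d
    is Cov(x_{t+i}, x_{t+j}) = C(i - j), given acf of shape (p+1, d, d)."""
    p1, d, _ = acf.shape
    gamma = np.zeros((p1 * d, p1 * d))
    for i in range(p1):
        for j in range(p1):
            # C(-h) = C(h)^T for a stationary process
            block = acf[i - j] if i >= j else acf[j - i].T
            gamma[i * d:(i + 1) * d, j * d:(j + 1) * d] = block
    return gamma
```
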

kxy.api.core.entropy_rate.gaussian_var_copula_entropy_rate(sample, p=None, robust=False, p_ic='hqic')

Estimates the entropy rate of the copula-uniform dual representation of a stationary Gaussian VAR(p) (or AR(p)) process from a sample path.

We recall that the copula-uniform representation of an \(\mathbb{R}^d\)-valued process \(\{x_t\} := \{(x_{1t}, \dots, x_{dt}) \}\) is, by definition, the process \(\{ u_t \} := \{ \left( F_{1t}\left(x_{1t}\right), \dots, F_{dt}\left(x_{dt}\right) \right) \}\) where \(F_{it}\) is the cumulative distribution function of \(x_{it}\).

It can be shown that

\[h\left( \{ x_t \}\right) = h\left( \{ u_t \}\right) + \sum_{i=1}^d h\left( x_{i*}\right)\]

where \(h\left(x_{i*}\right)\) is the entropy of the i-th coordinate process at any time.

Parameters:
  • sample ((T, d) np.array) – Array of T sample observations of a \(d\)-dimensional process.
  • p (int or None) – Number of lags to compute for the autocovariance function. If p=None (the default), it is inferred by fitting a VAR model on the sample, using as information criterion p_ic.
  • robust (bool) – If True, the Pearson autocovariance function is estimated by first estimating a Spearman rank correlation, and then inferring the equivalent Pearson autocovariance function, under the Gaussian assumption.
  • p_ic (str) – The criterion used to learn the optimal value of p (by fitting a VAR(p) model) when p=None. Should be one of ‘hqic’ (Hannan-Quinn Information Criterion), ‘aic’ (Akaike Information Criterion), ‘bic’ (Bayes Information Criterion) and ‘t-stat’ (based on last lag). Same as the ‘ic’ parameter of statsmodels.tsa.api.VAR.
Returns:

  • h (float) – The entropy rate of the copula-uniform dual representation of the input process.
  • p (int) – Order of the VAR(p).
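
Rearranging the identity above gives \(h\left(\{u_t\}\right) = h\left(\{x_t\}\right) - \sum_i h\left(x_{i*}\right)\). For a stationary Gaussian AR(1) with autocorrelation \(\rho\), this evaluates in closed form to \(\frac{1}{2}\log(1-\rho^2) \leq 0\), which the following sketch computes (an illustration of the identity, not the sample-path estimator):

```python
import numpy as np

def ar1_copula_entropy_rate(rho, noise_var=1.0):
    """h({u_t}) = h({x_t}) - h(x_*) for a Gaussian AR(1): 0.5 log(1 - rho^2)."""
    h_rate = 0.5 * np.log(2 * np.pi * np.e * noise_var)         # h({x_t})
    marginal_var = noise_var / (1.0 - rho ** 2)
    h_marginal = 0.5 * np.log(2 * np.pi * np.e * marginal_var)  # h(x_*)
    return h_rate - h_marginal
```
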

kxy.api.core.entropy_rate.gaussian_var_entropy_rate(sample, p, robust=False)

Estimates the entropy rate of a stationary Gaussian VAR(p) (or AR(p)) process, namely

\[h\left( \{x_t\} \right) = \frac{1}{2} \log \left( \frac{|K_p|}{|K_{p-1}|} \right) + \frac{d}{2} \log \left( 2 \pi e\right)\]

where \(|K_p|\) is the determinant of the lag-p autocovariance matrix of the process, estimated from a sample path of size T.

Parameters:
  • sample ((T, d) np.array) – Array of T sample observations of a d-dimensional process.
  • p (int) – Number of lags to compute for the autocovariance function.
  • robust (bool) – If True, the Pearson autocovariance function is estimated by first estimating a Spearman rank correlation, and then inferring the equivalent Pearson autocovariance function, under the Gaussian assumption.
Returns:h – The entropy rate of the process.
Return type:float
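
The determinant-ratio formula can be checked on an AR(1) toy case, where \(|K_1|/|K_0|\) collapses to the innovation variance and the rate reduces to \(\frac{1}{2}\log(2\pi e \sigma_\epsilon^2)\). A numpy sketch (hypothetical helper; kxy's estimator builds \(K_p\) from the sample instead):

```python
import numpy as np

def var_entropy_rate_from_blocks(K_p, K_pm1, d):
    """h = 0.5 log(|K_p| / |K_{p-1}|) + (d/2) log(2 pi e)."""
    return (0.5 * (np.linalg.slogdet(K_p)[1] - np.linalg.slogdet(K_pm1)[1])
            + 0.5 * d * np.log(2 * np.pi * np.e))

# AR(1) with rho = 0.5 and unit innovation variance:
rho, s2 = 0.5, 1.0
v = s2 / (1 - rho ** 2)                      # stationary variance
K0 = np.array([[v]])                         # lag-0 autocovariance matrix
K1 = np.array([[v, rho * v], [rho * v, v]])  # lag-1 block-Toeplitz matrix
h = var_entropy_rate_from_blocks(K1, K0, d=1)
```
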
kxy.api.core.entropy_rate.pearson_acovf(sample, max_order=10, robust=False)

Estimates the sample Pearson autocovariance function of a scalar-valued or vector-valued discrete-time stationary ergodic stochastic process \(\{z_t\}\) from a single sample of size \(T\), \((\hat{z}_1, \dots, \hat{z}_T)\).

\[C(h) := \frac{1}{T} \sum_{t=1+h}^T (\hat{z}_t - \bar{z})(\hat{z}_{t-h} - \bar{z})^T\]

with \(\bar{z} := \frac{1}{T} \sum_{t=1}^T \hat{z}_t\).

Parameters:
  • sample ((T, d) np.array) – Array of T sample observations of a d-dimensional process.
  • max_order (int) – Maximum number of lags to compute for the autocovariance function.
  • robust (bool) – If True, the Pearson autocovariance function is estimated by first estimating a Spearman rank correlation, and then inferring the equivalent Pearson autocovariance function, under the Gaussian assumption.
Returns:

acf – Sample autocovariance function up to order max_order.

Return type:

(max_order, d, d) np.array
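
The displayed formula translates almost line for line into numpy. A sketch (not the kxy implementation, which also supports the robust Spearman-based variant):

```python
import numpy as np

def sample_acovf(z, max_order=10):
    """C(h) = (1/T) sum_{t=h+1}^{T} (z_t - zbar)(z_{t-h} - zbar)^T."""
    z = np.atleast_2d(z.T).T  # treat (T,) input as (T, 1)
    T, d = z.shape
    zc = z - z.mean(axis=0)   # center by the sample mean zbar
    return np.array([zc[h:].T @ zc[:T - h] / T for h in range(max_order + 1)])
```

Note the 1/T normalization at every lag (rather than 1/(T-h)), matching the formula above.
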