Misc

kxy.api.core.utils.empirical_copula_uniform(x)

Evaluate the empirical copula-uniform dual representation of x as rank(x)/n.

Parameters:x ((n, d) np.array) – n i.i.d. draws from a d-dimensional distribution.
kxy.api.core.utils.get_x_data(x_c, x_d, n_outputs, categorical_encoding='two-split', non_monotonic_extension=True, space='dual')
kxy.api.core.utils.get_y_data(y_c, y_d, categorical_encoding='two-split')
kxy.api.core.utils.hqi(h, q)

Computes \(\bar{h}_q^{-1}(x)\) where

\[\bar{h}_q(a) = -a \log a -(1-a) \log \left(\frac{1-a}{q-1}\right), ~~~~ a \geq \frac{1}{q}.\]
kxy.api.core.utils.one_hot_encoding(x)

Computes the one hot encoding representation of the input array.

The representation used is the one where all distincts inputs are first converted to string, then sorted using Python’s sorted method.

The \(i\)-th element in this sort has its \(i\)-th bit from the right set to 1 and all others to 0.

Parameters:x ((n,) or (n, d) np.array) – Array of n inputs (typically strings) to encode and that take q distinct values.
Returns:Binary array representing the one-hot encoding representation of the inputs.
Return type:(n, q) np.array
kxy.api.core.utils.pearson_corr(x)

Calculate the Pearson correlation matrix, ignoring nans.

Parameters:x ((n, d) np.array) – Input data representing n i.i.d. draws from the d-dimensional random variable, whose Pearson correlation matrix this function calculates.
Returns:corr – The Pearson correlation matrix.
Return type:np.array
kxy.api.core.utils.prepare_data_for_mutual_info_analysis(x_c, x_d, y_c, y_d, non_monotonic_extension=True, categorical_encoding='two-split', space='dual')
kxy.api.core.utils.prepare_test_data_for_prediction(test_x_c, test_x_d, train_x_c, train_x_d, train_y_c, train_y_d, non_monotonic_extension=True, categorical_encoding='two-split', space='dual')
kxy.api.core.utils.robust_log_det(c)

Computes the logarithm of the determinant of a positive definite matrix in a fashion that is more robust to ill-conditioning than taking the logarithm of np.linalg.det.

Note

Specifically, we compute the SVD of c, and return the sum of the log of eigenvalues. np.linalg.det on the other hand computes the Cholesky decomposition of c, which is more likely to fail than its SVD, and takes the product of its diagonal elements, which could be subject to underflow error when diagonal elements are small.

Parameters:c ((d, d) np.array) – Square input matrix for computing log-determinant.
Returns:d – Log-determinant of the input matrix.
Return type:float
kxy.api.core.utils.spearman_corr(x)

Calculate the Spearman rank correlation matrix, ignoring nans.

Parameters:x ((n, d) np.array) – Input data representing n i.i.d. draws from the d-dimensional random variable, whose Spearman rank correlation matrix this function calculates.
Returns:corr – The Spearman rank correlation matrix.
Return type:np.array
kxy.api.core.utils.two_split_encoding(x)

Also known as binary encoding, the two-split encoding method turns categorical data taking \(q\) distinct values into the ordinal data \(1, \dots, q\), and then generates the binary representation of the ordinal data.

This encoding is more economical than one-hot encoding. Unlike the one-hot encoding methods which requires as many columns as the number of distinct inputs, two-split encoding only requires \(\log_2 \left\lceil q \right\rceil\) columns.

Every bit in the encoding splits the set of all \(q\) possible categories in 2 subsets of equal size (when q is a power of 2), and the value of the bit determines which subset the category of interest belongs to. Each bit generates a different partitioning of the set of all distinct categories.

The ordinal value assigned to a categorical value is the order of its string representation among all \(q\) distinct categorical values in the array.

In other words, while a bit in one-hot encoding determines whether the category is equal to a specific values, a bit in the two-split encoding determines whether the category belong in one of two subsets of equal size.

Parameters:x ((n,) or (n, d) np.array) – Array of n inputs (typically strings) to encode and that take q distinct values.
Returns:Binary array representing the two-split encoding representation of the inputs.
Return type:(n, q) np.array