Pre-Learning

kxy.classification.pre_learning.classification_achievable_performance_analysis(x_c, y, x_d=None, space='dual', categorical_encoding='twosplit')
Quantifies the performance that can be achieved when trying to predict \(y\) with \(x\).
Parameters:
 x_c ((n, d) np.array) – Continuous inputs.
 y ((n,) np.array) – Labels.
 x_d ((n, q) np.array or None (default), optional) – Discrete inputs.
 space (str, 'primal' | 'dual' (default)) – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
 categorical_encoding (str, 'onehot' | 'twosplit' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
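The practical difference between the two kinds of constraint named under space can be seen on a toy example. The sketch below is plain Python and is not kxy code: it computes a Pearson correlation and a Spearman rank correlation by hand (the helper names pearson, ranks and spearman are illustrative, and ties are assumed absent) and shows that a monotonic rescaling of an input changes the former but not the latter.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def ranks(values):
    """Ranks 1..n of the values (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out

def spearman(a, b):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Made-up data: x_warped is a strictly increasing transform of x.
x = [0.1, 0.5, 1.0, 2.0, 3.5, 5.0]
y = [0.3, 0.2, 1.1, 1.9, 3.0, 6.0]
x_warped = [math.exp(v) for v in x]

print("Pearson :", pearson(x, y), "vs", pearson(x_warped, y))    # values differ
print("Spearman:", spearman(x, y), "vs", spearman(x_warped, y))  # values agree
```

Because ranks are unchanged by strictly increasing transforms, rank-based constraints do not depend on how each input is scaled.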
Returns: a – Dataframe with columns:
 'Achievable R^2': The highest \(R^2\) that can be achieved by a classification model using the provided inputs to predict the label.
 'Achievable LogLikelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a classification model using the provided inputs to predict the label.
 'Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using the provided inputs to predict the label.
Return type: pandas.DataFrame
Theoretical Foundation
Section 1 - Achievable Performance.
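
Example (a minimal sketch, not taken from the kxy documentation): the snippet below builds a small synthetic binary classification problem and passes it to classification_achievable_performance_analysis using the signature documented above. The helper names make_toy_data and run_analysis are illustrative, and the call assumes the kxy package is installed and configured. The numpy and kxy imports are deferred so that the data helper can be exercised on its own.

```python
import random

def make_toy_data(n=200, seed=0):
    """Return (rows, labels): n rows of 2 continuous inputs and 0/1 labels."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        rows.append([a, b])
        # The label depends on both inputs through a linear decision rule.
        labels.append(1 if a + 0.5 * b > 0 else 0)
    return rows, labels

def run_analysis(rows, labels, space='dual'):
    """Call the documented function; requires numpy and kxy (assumed installed)."""
    import numpy as np
    from kxy.classification.pre_learning import (
        classification_achievable_performance_analysis,
    )
    x_c = np.array(rows)    # shape (n, d), continuous inputs
    y = np.array(labels)    # shape (n,), labels
    return classification_achievable_performance_analysis(x_c, y, space=space)

if __name__ == "__main__":
    rows, labels = make_toy_data()
    print(run_analysis(rows, labels))
```

No output is shown because the values depend on the data and on the installed kxy version; the result should be the pandas.DataFrame described under Returns.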

kxy.classification.pre_learning.classification_variable_selection_analysis(x_c, y, x_d=None, space='dual', categorical_encoding='twosplit')
Runs the variable selection analysis for a classification problem.
Parameters:
 x_c ((n, d) np.array) – Continuous inputs.
 y ((n,) np.array) – Labels.
 x_d ((n, q) np.array or None (default), optional) – Discrete inputs.
 space (str, 'primal' | 'dual' (default)) – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
 categorical_encoding (str, 'onehot' | 'twosplit' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
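For readers unfamiliar with the 'onehot' option, the sketch below shows one-hot encoding in its generic textbook form: each category becomes its own 0/1 indicator column. It is an illustration only, with a hypothetical helper name one_hot; the output layout of kxy.api.core.utils.one_hot_encoding may differ, and the 'twosplit' scheme is not covered here.

```python
def one_hot(column):
    """Map a sequence of category labels to rows of 0/1 indicators.

    Returns the sorted list of categories (one per output column) and
    one indicator row per input value.
    """
    categories = sorted(set(column))
    position = {c: j for j, c in enumerate(categories)}
    rows = []
    for value in column:
        row = [0] * len(categories)
        row[position[value]] = 1   # exactly one indicator is set per row
        rows.append(row)
    return categories, rows

# Made-up discrete input column.
categories, encoded = one_hot(['red', 'blue', 'red', 'green'])
print(categories)
print(encoded)
```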
Returns: a – Dataframe with columns:
 'Variable': The variable index, starting from 0 at the leftmost column of x_c and ending at the rightmost column of x_d.
 'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
 'Univariate Achievable R^2': The highest \(R^2\) that can be achieved by a classification model solely using this variable.
 'Maximum Marginal R^2 Increase': The highest amount by which the \(R^2\) can be increased as a result of adding this variable in the variable selection scheme.
 'Running Achievable R^2': The highest \(R^2\) that can be achieved by a classification model using all variables selected so far, including this one.
 'Univariate Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model solely using this variable.
 'Maximum Marginal Accuracy Increase': The highest amount by which the classification accuracy can be increased as a result of adding this variable in the variable selection scheme.
 'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.
 'Conditional Mutual Information (nats)': The mutual information between this variable and the label, conditional on all variables previously selected.
 'Running Mutual Information (nats)': The mutual information between all variables selected so far, including this one, and the label.
 'Univariate Achievable True LogLikelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a classification model solely using this variable.
 'Maximum Marginal True LogLikelihood Per Sample Increase': The highest amount by which the true log-likelihood per sample can be increased as a result of adding this variable in the variable selection scheme.
 'Running Achievable True LogLikelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a classification model using all variables selected so far, including this one.
 'Maximum Marginal True LogLikelihood Increase Per Sample': The highest amount by which the true log-likelihood per sample can increase as a result of adding this variable.
 'Running Maximum LogLikelihood Increase Per Sample': The highest amount by which the true log-likelihood per sample can increase (over the log-likelihood of the naive strategy consisting of predicting the mode of \(y\)) as a result of using all variables selected so far, including this one.
Return type: pandas.DataFrame
Theoretical Foundation
Section 2 - Variable Selection Analysis.
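
As a reading aid for the returned table: by the chain rule of mutual information, the running mutual information after k selections equals the sum of the first k conditional mutual informations. The sketch below applies this to hypothetical records (made-up numbers, not kxy output) shaped like the result of df.to_dict('records'), and lists the variables needed to reach a chosen fraction of the final running mutual information. The helper names are illustrative, and estimated values returned by the library may satisfy the identity only approximately.

```python
from itertools import accumulate

CMI = 'Conditional Mutual Information (nats)'

# Hypothetical rows shaped like df.to_dict('records'); the numbers are made up.
records = [
    {'Variable': 2, 'Selection Order': 1, CMI: 0.30},
    {'Variable': 0, 'Selection Order': 2, CMI: 0.10},
    {'Variable': 1, 'Selection Order': 3, CMI: 0.02},
]

def running_mutual_information(records):
    """Cumulative sum of conditional MI in selection order (chain rule)."""
    ordered = sorted(records, key=lambda r: r['Selection Order'])
    return list(accumulate(r[CMI] for r in ordered))

def variables_reaching(records, fraction):
    """Variables, in selection order, needed to reach `fraction` of the total MI."""
    ordered = sorted(records, key=lambda r: r['Selection Order'])
    running = running_mutual_information(records)
    total = running[-1]
    kept = []
    for rec, mi in zip(ordered, running):
        kept.append(rec['Variable'])
        if mi >= fraction * total:
            break
    return kept

print(running_mutual_information(records))
print(variables_reaching(records, 0.9))
```

With real output, the same helpers could be fed the records of the returned DataFrame and the cumulative sums compared against the 'Running Mutual Information (nats)' column.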