# Pre-Learning

`kxy.regression.pre_learning.regression_achievable_performance_analysis(x_c, y, x_d=None, space='dual', categorical_encoding='two-split')`

Quantifies the highest $R^2$ that can be achieved when predicting $y$ with $x$.

$\bar{R}^2 = 1 - e^{-2 I(y; x)}$

Parameters
• x_c ((n,) or (n, d) np.array) – Continuous inputs.

• y ((n,) np.array) – Labels.

• x_d ((n,) or (n, d) np.array or None (default), optional) – Discrete inputs.

• space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.

• categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.

Returns

• a (pandas.DataFrame) – Dataframe with columns:

• 'Achievable R^2': The highest $R^2$ that can be achieved by a regression model using provided inputs.

• 'Achievable Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a regression model using provided inputs.

• 'Achievable RMSE': The lowest RMSE that can be achieved by a regression model using provided inputs.

Theoretical Foundation

Section 1 - Achievable Performance.
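As a local illustration of the formula above (not the library call itself, which queries the kxy backend), the mapping from mutual information to achievable performance can be sketched with NumPy. The helper names below are hypothetical, and the jointly Gaussian pair used for the check, for which $I(y; x) = -\frac{1}{2}\ln(1-\rho^2)$, is an assumption made only to verify the mapping:

```python
import numpy as np

def achievable_r2(mi_nats):
    # Highest achievable R^2 given mutual information I(y; x) in nats:
    # R_bar^2 = 1 - exp(-2 I).
    return 1.0 - np.exp(-2.0 * mi_nats)

def achievable_rmse(mi_nats, y_std):
    # By the definition of R^2, the lowest attainable RMSE is
    # sqrt(1 - R_bar^2) * std(y) = std(y) * exp(-I).
    return y_std * np.exp(-mi_nats)

# Worked check (assumption: jointly Gaussian (x, y) with correlation rho,
# for which I(y; x) = -0.5 * ln(1 - rho^2), so R_bar^2 = rho^2 exactly).
rho = 0.8
mi = -0.5 * np.log(1.0 - rho ** 2)
print(achievable_r2(mi))          # ~0.64 (= rho^2)
print(achievable_rmse(mi, 1.0))   # ~0.6  (= sqrt(1 - rho^2))
```

Note that more informative inputs (larger $I(y; x)$) push the achievable $R^2$ toward 1 and the achievable RMSE toward 0, while $I(y; x) = 0$ recovers the naive baseline ($\bar{R}^2 = 0$).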

`kxy.regression.pre_learning.regression_variable_selection_analysis(x_c, y, x_d=None, space='dual', categorical_encoding='two-split')`

Runs the variable selection analysis for a regression problem.

Parameters
• x_c ((n,d) np.array) – Continuous inputs.

• y ((n,) np.array) – Labels.

• x_d ((n, d) np.array or None (default), optional) – Discrete inputs.

• space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.

• categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.

Returns

a – Dataframe with columns:

• 'Variable': The variable index starting from 0 at the leftmost column of x_c and ending at the rightmost column of x_d.

• 'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.

• 'Univariate Achievable R^2': The highest $R^2$ that can be achieved by a regression model solely using this variable.

• 'Maximum Marginal R^2 Increase': The highest amount by which the $R^2$ can be increased as a result of adding this variable in the variable selection scheme.

• 'Running Achievable R^2': The highest $R^2$ that can be achieved by a regression model using all variables selected so far, including this one.

• 'Conditional Mutual Information (nats)': The mutual information between this variable and the label, conditional on all variables previously selected.

• 'Running Mutual Information (nats)': The mutual information between all variables selected so far, including this one, and the label.

• 'Maximum Marginal True Log-Likelihood Increase Per Sample': The highest amount by which the true log-likelihood per sample can increase as a result of adding this variable.

• 'Running Maximum Log-Likelihood Increase Per Sample': The highest amount by which the true log-likelihood per sample can increase (over the log-likelihood of the naive strategy consisting of predicting the mode of $y$) as a result of using all variables selected so far, including this one.

Return type

pandas.DataFrame

Theoretical Foundation

Section 2 - Variable Selection Analysis.
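The running columns above are tied together by the same formula as in the achievable-performance analysis: conditional mutual informations accumulate into the running mutual information, which maps to the running achievable $R^2$. A minimal sketch of that accounting (the function name and the conditional MI values are hypothetical placeholders, not library output):

```python
import numpy as np

def running_selection_metrics(conditional_mis):
    # conditional_mis[k] is I(y; x_k | previously selected variables), in nats,
    # listed in selection order. Conditional MIs are additive, so the running
    # mutual information after step k is their cumulative sum.
    running_mi = np.cumsum(conditional_mis)
    # Running achievable R^2 via R_bar^2 = 1 - exp(-2 I).
    running_r2 = 1.0 - np.exp(-2.0 * running_mi)
    # Marginal R^2 increase contributed by each newly selected variable.
    marginal_r2 = np.diff(running_r2, prepend=0.0)
    return running_mi, running_r2, marginal_r2

# Hypothetical conditional MIs (nats) for three selected variables.
mi_steps = [0.30, 0.15, 0.05]
running_mi, running_r2, marginal_r2 = running_selection_metrics(mi_steps)
```

Because the running mutual information is non-decreasing, the running achievable $R^2$ is too, and the per-variable marginal $R^2$ increases sum to the final running achievable $R^2$.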