Model Improvability

Estimation of the amount by which the performance of a trained supervised learning model can be increased, either in a model-driven fashion (without resorting to additional explanatory variables) or in a data-driven fashion (by adding new explanatory variables).

Model-Driven Improvability

kxy.post_learning.improvability.model_driven_improvability(data_df, target_column, prediction_column, problem_type, snr='auto', file_name=None)

Estimate the extent to which a trained supervised learner may be improved in a model-driven fashion (i.e. without resorting to additional explanatory variables).

Parameters
  • data_df (pandas.DataFrame) – The pandas DataFrame containing the data.

  • target_column (str) – The name of the column containing true labels.

  • prediction_column (str) – The name of the column containing model predictions.

  • problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.

  • file_name (None | str) – A unique identifier characterizing data_df in the form of a file name. Do not set this argument unless you know why you need to.

Returns

result – A pandas.DataFrame with the following columns (where applicable):

  • 'Lost Accuracy': The amount of classification accuracy that was irreversibly lost when training the supervised learner.

  • 'Lost R-Squared': The amount of \(R^2\) that was irreversibly lost when training the supervised learner.

  • 'Lost RMSE': The amount of Root Mean Square Error that was irreversibly lost when training the supervised learner.

  • 'Lost Log-Likelihood Per Sample': The amount of true log-likelihood per sample that was irreversibly lost when training the supervised learner.

  • 'Residual R-Squared': For regression problems, this is the highest \(R^2\) that may be achieved when using explanatory variables to predict regression residuals.

  • 'Residual RMSE': For regression problems, this is the lowest Root Mean Square Error that may be achieved when using explanatory variables to predict regression residuals.

  • 'Residual Log-Likelihood Per Sample': For regression problems, this is the highest log-likelihood per sample that may be achieved when using explanatory variables to predict regression residuals.

Return type

pandas.DataFrame
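
Example

A minimal usage sketch, not taken from the library's official examples: it assumes kxy is installed and configured (server-side computations may require an API key), and uses synthetic data, illustrative column names ('x_1', 'x_2', 'y', 'y_pred'), and scikit-learn's LinearRegression as a deliberately misspecified model whose predictions we then audit.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

from kxy.post_learning.improvability import model_driven_improvability

# Synthetic regression problem with a non-linear relationship.
np.random.seed(0)
x = np.random.randn(1000, 2)
y = x[:, 0] ** 2 + 0.5 * x[:, 1] + 0.1 * np.random.randn(1000)

# Train a (misspecified) linear model and record its predictions.
model = LinearRegression().fit(x, y)
data_df = pd.DataFrame({'x_1': x[:, 0], 'x_2': x[:, 1], 'y': y})
data_df['y_pred'] = model.predict(x)

# Estimate how much performance was irreversibly lost during training,
# and how much structure is left in the residuals.
result = model_driven_improvability(
    data_df, target_column='y', prediction_column='y_pred',
    problem_type='regression')
print(result)

A large 'Residual R-Squared' in the returned DataFrame would suggest that a better model could still extract signal from the same explanatory variables.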

Theoretical Foundation

Section 3 - Model Improvability.

Data-Driven Improvability

kxy.post_learning.improvability.data_driven_improvability(data_df, target_column, new_variables, problem_type, snr='auto', file_name=None)

Estimate the potential performance boost that a set of new explanatory variables can bring about.

Parameters
  • data_df (pandas.DataFrame) – The pandas DataFrame containing the data.

  • target_column (str) – The name of the column containing true labels.

  • new_variables (list) – The names of the columns to use as new explanatory variables.

  • problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.

  • file_name (None | str) – A unique identifier characterizing data_df in the form of a file name. Do not set this argument unless you know why you need to.

Returns

result – A pandas.DataFrame with the following columns (where applicable):

  • 'Accuracy Boost': The classification accuracy boost that the new explanatory variables can bring about.

  • 'R-Squared Boost': The \(R^2\) boost that the new explanatory variables can bring about.

  • 'RMSE Reduction': The reduction in Root Mean Square Error that the new explanatory variables can bring about.

  • 'Log-Likelihood Per Sample Boost': The boost in log-likelihood per sample that the new explanatory variables can bring about.

Return type

pandas.DataFrame
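
Example

A minimal usage sketch, not taken from the library's official examples: it assumes kxy is installed and configured (server-side computations may require an API key), and uses synthetic data with illustrative column names ('x_old', 'x_new', 'y'). Columns of data_df other than the target and the new variables are assumed to play the role of the existing explanatory variables.

import numpy as np
import pandas as pd

from kxy.post_learning.improvability import data_driven_improvability

# Synthetic regression problem where the target depends on both an
# existing variable and a candidate new variable.
np.random.seed(0)
n = 1000
x_old = np.random.randn(n)
x_new = np.random.randn(n)
y = x_old + x_new ** 2 + 0.1 * np.random.randn(n)

data_df = pd.DataFrame({'x_old': x_old, 'x_new': x_new, 'y': y})

# Estimate the performance boost that 'x_new' could bring on top of the
# existing explanatory variables already present in data_df.
result = data_driven_improvability(
    data_df, target_column='y', new_variables=['x_new'],
    problem_type='regression')
print(result)

A material 'R-Squared Boost' (or 'RMSE Reduction') would suggest that collecting 'x_new' is worthwhile before retraining any model.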

Theoretical Foundation

Section 3 - Model Improvability.