Model Improvability¶

Estimation of the amount by which the performance of a trained supervised learning model can be increased, either in a model-driven fashion or in a data-driven fashion.

Model-Driven Improvability¶

kxy.post_learning.improvability.model_driven_improvability(data_df, target_column, prediction_column, problem_type)

Estimate the extent to which a trained supervised learner may be improved in a model-driven fashion (i.e. without resorting to additional explanatory variables).

Parameters
• data_df (pandas.DataFrame) – The pandas DataFrame containing the data.

• target_column (str) – The name of the column containing true labels.

• prediction_column (str) – The name of the column containing model predictions.

• problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.

Returns

result – The result is a pandas.DataFrame with columns (where applicable):

• 'Lost Accuracy': The amount of classification accuracy that was irreversibly lost when training the supervised learner.

• 'Lost R-Squared': The amount of $$R^2$$ that was irreversibly lost when training the supervised learner.

• 'Lost RMSE': The amount of Root Mean Square Error that was irreversibly lost when training the supervised learner.

• 'Lost Log-Likelihood Per Sample': The amount of true log-likelihood per sample that was irreversibly lost when training the supervised learner.

• 'Residual R-Squared': For regression problems, this is the highest $$R^2$$ that may be achieved when using explanatory variables to predict regression residuals.

• 'Residual RMSE': For regression problems, this is the lowest Root Mean Square Error that may be achieved when using explanatory variables to predict regression residuals.

• 'Residual Log-Likelihood Per Sample': For regression problems, this is the highest log-likelihood per sample that may be achieved when using explanatory variables to predict regression residuals.

Return type

pandas.DataFrame

Theoretical Foundation

Section 3 - Model Improvability.
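A minimal usage sketch follows. The synthetic data, column names (`'x'`, `'y'`, `'y_pred'`), and stand-in predictions are illustrative assumptions, not part of the API; the actual call requires the `kxy` package and backend access, so it is shown commented out.

```python
import numpy as np
import pandas as pd

# Hypothetical setup: a synthetic regression problem and a trained
# model's predictions stored alongside the true labels.
rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)
data_df = pd.DataFrame({'x': x, 'y': y, 'y_pred': 2.0 * x})

# Sketch of the documented call (requires the kxy package):
# from kxy.post_learning.improvability import model_driven_improvability
# result = model_driven_improvability(
#     data_df, target_column='y', prediction_column='y_pred',
#     problem_type='regression')
# result would then expose e.g. 'Lost R-Squared' and 'Residual R-Squared'.
```

The key layout requirement is that true labels, model predictions, and explanatory variables all live in the same DataFrame passed as `data_df`.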

Data-Driven Improvability¶

kxy.post_learning.improvability.data_driven_improvability(data_df, target_column, new_variables, problem_type)

Estimate the potential performance boost that a set of new explanatory variables can bring about.

Parameters
• data_df (pandas.DataFrame) – The pandas DataFrame containing the data.

• target_column (str) – The name of the column containing true labels.

• new_variables (list) – The names of the columns to use as new explanatory variables.

• problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.

Returns

result – The result is a pandas.DataFrame with columns (where applicable):

• 'Accuracy Boost': The classification accuracy boost that the new explanatory variables can bring about.

• 'R-Squared Boost': The $$R^2$$ boost that the new explanatory variables can bring about.

• 'RMSE Reduction': The reduction in Root Mean Square Error that the new explanatory variables can bring about.

• 'Log-Likelihood Per Sample Boost': The boost in log-likelihood per sample that the new explanatory variables can bring about.

Return type

pandas.DataFrame

Theoretical Foundation

Section 3 - Model Improvability.
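As with the model-driven case, a brief sketch may help. The column names (`'x1'`, `'x2'`, `'y'`) and the choice of `'x2'` as the candidate new variable are illustrative assumptions; the actual call requires the `kxy` package and backend access, so it is shown commented out.

```python
import numpy as np
import pandas as pd

# Hypothetical setup: 'x1' is an existing explanatory variable and
# 'x2' is a candidate new variable whose potential boost we want to estimate.
rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + 0.8 * x2 + rng.normal(scale=0.3, size=n)
data_df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})

# Sketch of the documented call (requires the kxy package):
# from kxy.post_learning.improvability import data_driven_improvability
# result = data_driven_improvability(
#     data_df, target_column='y', new_variables=['x2'],
#     problem_type='regression')
# result would then expose e.g. 'R-Squared Boost' and 'RMSE Reduction'.
```

Note that `new_variables` is a list of column names, so several candidate variables can be evaluated jointly in one call.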