Model Improvability

Estimation of the amount by which the performance of a trained supervised learning model can be increased, either in a model-driven fashion (without resorting to additional explanatory variables) or in a data-driven fashion (by adding new explanatory variables).

Model-Driven Improvability

kxy.post_learning.improvability.model_driven_improvability(data_df, target_column, prediction_column, problem_type, snr='auto', file_name=None)

Estimate the extent to which a trained supervised learner may be improved in a model-driven fashion (i.e. without resorting to additional explanatory variables).

Parameters
  • data_df (pandas.DataFrame) – The pandas DataFrame containing the data.

  • target_column (str) – The name of the column containing true labels.

  • prediction_column (str) – The name of the column containing model predictions.

  • problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.

  • file_name (None | str) – A unique identifier characterizing data_df in the form of a file name. Do not set this argument unless you know why you need to.

Returns

result – A pandas.DataFrame with the following columns (where applicable):

  • 'Lost Accuracy': The amount of classification accuracy that was irreversibly lost when training the supervised learner.

  • 'Lost R-Squared': The amount of \(R^2\) that was irreversibly lost when training the supervised learner.

  • 'Lost RMSE': The amount of Root Mean Square Error that was irreversibly lost when training the supervised learner.

  • 'Lost Log-Likelihood Per Sample': The amount of true log-likelihood per sample that was irreversibly lost when training the supervised learner.

  • 'Residual R-Squared': For regression problems, this is the highest \(R^2\) that may be achieved when using explanatory variables to predict regression residuals.

  • 'Residual RMSE': For regression problems, this is the lowest Root Mean Square Error that may be achieved when using explanatory variables to predict regression residuals.

  • 'Residual Log-Likelihood Per Sample': For regression problems, this is the highest log-likelihood per sample that may be achieved when using explanatory variables to predict regression residuals.

Return type

pandas.DataFrame
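
Example

A minimal usage sketch, not taken from the library's official examples: it assumes kxy is installed and configured (server-side computations may require an API key), and uses synthetic data, illustrative column names ('x_1', 'x_2', 'y', 'y_pred'), and scikit-learn's LinearRegression as a deliberately misspecified model whose predictions we then audit.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

from kxy.post_learning.improvability import model_driven_improvability

# Synthetic regression problem with a non-linear relationship.
np.random.seed(0)
x = np.random.randn(1000, 2)
y = x[:, 0] ** 2 + 0.5 * x[:, 1] + 0.1 * np.random.randn(1000)

# Train a (misspecified) linear model and record its predictions.
model = LinearRegression().fit(x, y)
data_df = pd.DataFrame({'x_1': x[:, 0], 'x_2': x[:, 1], 'y': y})
data_df['y_pred'] = model.predict(x)

# Estimate how much performance was irreversibly lost during training,
# and how much structure is left in the residuals.
result = model_driven_improvability(
    data_df, target_column='y', prediction_column='y_pred',
    problem_type='regression')
print(result)

A large 'Residual R-Squared' in the returned DataFrame would suggest that a better model could still extract signal from the same explanatory variables.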

Theoretical Foundation

Section 3 - Model Improvability.

Data-Driven Improvability

kxy.post_learning.improvability.data_driven_improvability(data_df, target_column, new_variables, problem_type, snr='auto', file_name=None)

Estimate the potential performance boost that a set of new explanatory variables can bring about.

Parameters
  • data_df (pandas.DataFrame) – The pandas DataFrame containing the data.

  • target_column (str) – The name of the column containing true labels.

  • new_variables (list) – The names of the columns to use as new explanatory variables.

  • problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.

  • file_name (None | str) – A unique identifier characterizing data_df in the form of a file name. Do not set this argument unless you know why you need to.

Returns

result – A pandas.DataFrame with the following columns (where applicable):

  • 'Accuracy Boost': The classification accuracy boost that the new explanatory variables can bring about.

  • 'R-Squared Boost': The \(R^2\) boost that the new explanatory variables can bring about.

  • 'RMSE Reduction': The reduction in Root Mean Square Error that the new explanatory variables can bring about.

  • 'Log-Likelihood Per Sample Boost': The boost in log-likelihood per sample that the new explanatory variables can bring about.

Return type

pandas.DataFrame
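
Example

A minimal usage sketch, not taken from the library's official examples: it assumes kxy is installed and configured (server-side computations may require an API key), and uses synthetic data with illustrative column names ('x_old', 'x_new', 'y'). Columns of data_df other than the target and the new variables are assumed to play the role of the existing explanatory variables.

import numpy as np
import pandas as pd

from kxy.post_learning.improvability import data_driven_improvability

# Synthetic regression problem where the target depends on both an
# existing variable and a candidate new variable.
np.random.seed(0)
n = 1000
x_old = np.random.randn(n)
x_new = np.random.randn(n)
y = x_old + x_new ** 2 + 0.1 * np.random.randn(n)

data_df = pd.DataFrame({'x_old': x_old, 'x_new': x_new, 'y': y})

# Estimate the performance boost that 'x_new' could bring on top of the
# existing explanatory variables already present in data_df.
result = data_driven_improvability(
    data_df, target_column='y', new_variables=['x_new'],
    problem_type='regression')
print(result)

A material 'R-Squared Boost' (or 'RMSE Reduction') would suggest that collecting 'x_new' is worthwhile before retraining any model.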

Theoretical Foundation

Section 3 - Model Improvability.