Model Explanation

Estimation of the top-\(k\) most valuable variables in a supervised learning problem for every possible \(k\), and the corresponding achievable performances.

kxy.post_learning.model_explanation.model_explanation(data_df, prediction_column, problem_type, snr='auto', file_name=None)

Analyzes the variables that a model relies on the most in a brute-force fashion.

The first variable is the one the model relies on the most. The second variable is the one that best complements the first in explaining model decisions, and so on.

Running performances should be understood as the performance achievable when trying to guess model predictions using only the variables whose selection order is less than or equal to that of the row.
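To make selection order and running achievable performance concrete, here is a toy sketch in pure Python. It is not how kxy computes its estimates: the greedy majority-vote procedure, the helper names (achievable_accuracy, greedy_explanation) and the toy rows are all assumptions made for illustration, and the sketch only handles discrete inputs and in-sample accuracy.

```python
from collections import Counter, defaultdict


def achievable_accuracy(rows, variables, target):
    """Best in-sample accuracy of any classifier that only sees `variables`.

    For discrete inputs, this is reached by predicting the majority label
    within each group of rows that share the same input values.
    """
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[v] for v in variables)
        groups[key][row[target]] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(rows)


def greedy_explanation(rows, target):
    """Return (selection order, variable, running accuracy) tuples.

    Hypothetical helper: at each step, add the variable that most improves
    the accuracy achievable with the variables selected so far.
    """
    remaining = sorted(name for name in rows[0] if name != target)
    selected, table = [], []
    while remaining:
        best = max(
            remaining,
            key=lambda v: achievable_accuracy(rows, selected + [v], target),
        )
        selected.append(best)
        remaining.remove(best)
        table.append(
            (len(selected), best, achievable_accuracy(rows, selected, target))
        )
    return table


# Made-up toy data: 'pred' plays the role of the model's predictions.
toy_rows = [
    {"x1": 0, "x2": 0, "x3": 0, "pred": 0},
    {"x1": 0, "x2": 0, "x3": 1, "pred": 0},
    {"x1": 0, "x2": 0, "x3": 0, "pred": 0},
    {"x1": 0, "x2": 1, "x3": 1, "pred": 1},
    {"x1": 1, "x2": 0, "x3": 0, "pred": 1},
    {"x1": 1, "x2": 0, "x3": 1, "pred": 1},
    {"x1": 1, "x2": 1, "x3": 0, "pred": 1},
    {"x1": 1, "x2": 1, "x3": 1, "pred": 1},
]

table = greedy_explanation(toy_rows, "pred")
for order, variable, running_accuracy in table:
    print(order, variable, running_accuracy)
```

In this toy data, x1 is selected first, x2 brings the running accuracy to 1.0, and x3 adds nothing on top, which mirrors how a row's running performance reflects every variable selected up to and including that row.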

When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not prediction_column is categorical.
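The kind of inference described above might be sketched as follows. This is only an illustration under stated assumptions: the helper name infer_problem_type and the max_distinct threshold of 10 are hypothetical, and the rule actually used by the library may differ.

```python
def infer_problem_type(values, max_distinct=10):
    """Guess 'classification' or 'regression' from a column's values.

    Assumption (not taken from the library): a column is treated as
    categorical when it holds non-numeric values, or when it holds at
    most `max_distinct` distinct numeric values.
    """
    distinct = set(values)
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in distinct
    )
    if not numeric or len(distinct) <= max_distinct:
        return "classification"
    return "regression"


print(infer_problem_type(["yes", "no", "yes"]))
print(infer_problem_type([0.5 * i for i in range(100)]))
```

Passing problem_type explicitly avoids relying on any such heuristic, which can matter for numeric predictions that take few distinct values.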

Parameters
  • data_df (pandas.DataFrame) – The pandas DataFrame containing the data.

  • prediction_column (str) – The name of the column containing model predictions.

  • problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.

  • file_name (None | str) – A unique identifier characterizing data_df in the form of a file name. Do not set this unless you know why.

Returns

result – The result is a pandas.DataFrame with columns (where applicable):

  • 'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.

  • 'Variable': The column name corresponding to the input variable.

  • 'Running Achievable R-Squared': The highest \(R^2\) that can be achieved by a regression model using all variables selected so far, including this one.

  • 'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.

  • 'Running Achievable RMSE': The lowest root mean squared error (RMSE) that can be achieved by a regression model using all variables selected so far, including this one.

Return type

pandas.DataFrame

Theoretical Foundation

Section a) Model Explanation.