Estimation of the top-\(k\) most valuable variables in a supervised learning problem for every possible \(k\), and the corresponding achievable performances.
model_explanation(data_df, prediction_column, problem_type)
Analyzes the variables that a model relies on the most in a brute-force fashion.
The first variable is the one the model relies on the most. The second variable is the one that best complements the first in explaining model decisions, and so on.
Running performances should be understood as the performance achievable when trying to guess model predictions using only the variables whose selection order is less than or equal to that of the row.
When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from the type of the prediction column and its number of distinct values.
data_df (pandas.DataFrame) – The pandas DataFrame containing the data.
prediction_column (str) – The name of the column containing the model predictions to explain.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
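As a rough illustration of how a problem type might be inferred from the column type and the number of distinct values, here is a minimal sketch. The function name `infer_problem_type` and the cutoff `max_classes=10` are hypothetical, chosen for illustration only; the library's actual rule and threshold are not stated here.

```python
from numbers import Number


def infer_problem_type(values, max_classes=10):
    """Guess 'classification' or 'regression' from a column's values.

    max_classes is a hypothetical cutoff, not a documented value.
    """
    distinct = set(values)
    # Booleans are treated as categorical rather than numeric.
    all_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in distinct
    )
    # Non-numeric columns, or numeric columns with few distinct values,
    # are treated as classification targets.
    if not all_numeric or len(distinct) <= max_classes:
        return 'classification'
    return 'regression'


print(infer_problem_type(['cat', 'dog', 'cat']))
print(infer_problem_type([i * 0.5 for i in range(100)]))
```

Passing problem_type explicitly avoids relying on any such heuristic when a numeric column with few distinct values should be treated as a regression target.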
result – The result is a pandas.DataFrame with the following columns (where applicable):
'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
'Variable': The column name corresponding to the input variable.
'Running Achievable R-Squared': The highest \(R^2\) that can be achieved by a regression model using all variables selected so far, including this one.
'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.
'Running Achievable RMSE': The lowest root mean squared error (RMSE) that can be achieved by a regression model using all variables selected so far, including this one.
Return type: pandas.DataFrame
Section a) Model Explanation.
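To make the notions of selection order and running achievable performance concrete, the following is a minimal sketch of greedy forward selection on discrete toy data. It is not the library's estimator: the helper names `lookup_accuracy` and `greedy_explanation` are hypothetical, and the "achievable accuracy" is approximated in-sample by the best lookup table from the selected variables to the model predictions.

```python
from collections import Counter, defaultdict


def lookup_accuracy(rows, predictions, variables):
    """In-sample accuracy of the best lookup table from `variables` to predictions."""
    groups = defaultdict(Counter)
    for row, pred in zip(rows, predictions):
        key = tuple(row[v] for v in variables)
        groups[key][pred] += 1
    # Within each group, the best guess is the most frequent prediction.
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(predictions)


def greedy_explanation(rows, predictions):
    """Return (selection_order, variable, running_accuracy) tuples."""
    remaining = list(rows[0].keys())
    selected, table = [], []
    while remaining:
        # Pick the variable that best complements those already selected.
        best = max(
            remaining,
            key=lambda v: lookup_accuracy(rows, predictions, selected + [v]),
        )
        selected.append(best)
        remaining.remove(best)
        table.append(
            (len(selected), best, lookup_accuracy(rows, predictions, selected))
        )
    return table


# Toy data: the (hypothetical) model mostly follows `a`, `b` corrects one case,
# and `c` carries no additional information.
rows = [
    {'a': 0, 'b': 0, 'c': 0},
    {'a': 0, 'b': 0, 'c': 1},
    {'a': 0, 'b': 1, 'c': 0},
    {'a': 1, 'b': 0, 'c': 0},
    {'a': 1, 'b': 0, 'c': 1},
    {'a': 1, 'b': 1, 'c': 1},
]
predictions = [0, 0, 1, 1, 1, 1]

table = greedy_explanation(rows, predictions)
for order, variable, accuracy in table:
    print(order, variable, round(accuracy, 3))
```

On this toy data, `a` is selected first, `b` second, and the running accuracy reaches 1.0 once `b` is included, so `c` adds nothing. Each row of the table plays the role of one row of the result described above, with the running accuracy computed from all variables selected up to and including that row.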