Model-Free Variable Selection
Estimation of the top-\(k\) most valuable variables in a supervised learning problem for every possible \(k\), and the corresponding achievable performances.
- kxy.pre_learning.variable_selection.variable_selection(data_df, target_column, problem_type)
Runs the model-free variable selection analysis.
The first variable is the one that explains the label the most when used in isolation. The second variable is the one that best complements the first for predicting the label, and so on.
Running performances should be understood as the performance achievable when predicting the label using all variables whose selection order is smaller than or equal to that of the row.
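The greedy ordering described above can be illustrated with a small standard-library sketch. It is not the estimator used by kxy, which this page does not describe: it assumes discrete variables and substitutes a plug-in mutual information estimate for the model-free performance bounds. The names mutual_information and greedy_selection, and the toy data, are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats from paired discrete samples."""
    n = len(ys)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (count / n) * log(count * n / (px[x] * py[y]))
        for (x, y), count in joint.items()
    )


def greedy_selection(columns, label):
    """Order variables by how much each adds to those already selected.

    columns: dict mapping variable name -> list of discrete values.
    label:   list of discrete labels of the same length.
    Returns a list of (selection_order, variable, running_information).
    """
    joint = [()] * len(label)  # per-row joint value of the selected variables
    remaining = list(columns)
    rows = []
    while remaining:
        def score(name):
            # Information about the label carried by selected + candidate.
            candidate = [j + (v,) for j, v in zip(joint, columns[name])]
            return mutual_information(candidate, label)

        best = max(remaining, key=score)
        joint = [j + (v,) for j, v in zip(joint, columns[best])]
        remaining.remove(best)
        rows.append((len(rows) + 1, best, mutual_information(joint, label)))
    return rows


# Toy data (assumption): the label depends on "a", then on "b" when a == 1;
# "noise" is independent of the label.
grid = list(product([0, 1], repeat=3))
columns = {
    "noise": [c for a, b, c in grid],
    "b": [b for a, b, c in grid],
    "a": [a for a, b, c in grid],
}
label = [0 if a == 0 else 1 + b for a, b, c in grid]

rows = greedy_selection(columns, label)
for row in rows:
    print(row)
```

In this toy setting "a" is selected first, "b" second because it complements "a", and "noise" last without increasing the running information, which mirrors how the running performance columns should plateau once the useful variables are exhausted.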
When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from the type of the target column and its number of distinct values.
data_df (pandas.DataFrame) – The pandas DataFrame containing the data.
target_column (str) – The name of the column containing true labels.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
result – The result is a pandas.DataFrame with columns (where applicable):
'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
'Variable': The column name corresponding to the input variable.
'Running Achievable R-Squared': The highest \(R^2\) that can be achieved by a regression model using all variables selected so far, including this one.
'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.
'Running Achievable RMSE': The lowest root mean square error that can be achieved by a regression model using all variables selected so far, including this one.
- Return type
pandas.DataFrame
Section 2 - Variable Selection Analysis.
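As a usage sketch, the call documented above can be wrapped and its result scanned for the smallest number of variables that captures most of the achievable performance. The helper names run_selection and smallest_sufficient_k, the 95% threshold, and the numbers in example_rows are assumptions made for illustration; only the function path, its parameters, and the column names come from this page.

```python
def run_selection(data_df, target_column):
    """Thin wrapper around the documented call; requires the kxy package."""
    # Imported lazily so the rest of this sketch runs without kxy installed.
    from kxy.pre_learning.variable_selection import variable_selection
    return variable_selection(data_df, target_column, problem_type=None)


def smallest_sufficient_k(rows, metric, fraction=0.95):
    """Smallest selection order whose running metric reaches `fraction`
    of the value achieved once every variable has been selected.

    Intended for higher-is-better metrics (R-Squared, Accuracy), not RMSE.
    """
    ordered = sorted(rows, key=lambda r: r["Selection Order"])
    target = fraction * ordered[-1][metric]
    for row in ordered:
        if row[metric] >= target:
            return row["Selection Order"]
    return ordered[-1]["Selection Order"]


# Made-up values, for illustration only; not output from a real run.
example_rows = [
    {"Selection Order": 1, "Variable": "x3", "Running Achievable R-Squared": 0.60},
    {"Selection Order": 2, "Variable": "x1", "Running Achievable R-Squared": 0.78},
    {"Selection Order": 3, "Variable": "x7", "Running Achievable R-Squared": 0.80},
    {"Selection Order": 4, "Variable": "x2", "Running Achievable R-Squared": 0.80},
]

k = smallest_sufficient_k(example_rows, "Running Achievable R-Squared")
print(k)  # 2 for the made-up values above
```

With a real result, the rows can be obtained from the returned DataFrame via result.to_dict("records") and passed to smallest_sufficient_k unchanged.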