Model-Free Variable Selection

Estimation of the top-\(k\) most valuable variables in a supervised learning problem for every possible \(k\), and the corresponding achievable performances.

kxy.pre_learning.variable_selection.variable_selection(data_df, target_column, problem_type, snr='auto', file_name=None)

Runs the model-free variable selection analysis.

The first variable is the one that explains the label the most when used in isolation. The second variable is the one that best complements the first for predicting the label, and so on.

Running performances should be understood as the performance achievable when predicting the label using all variables whose selection order is smaller than or equal to that of the row.
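To make the notions of selection order and running performance concrete, here is a minimal sketch of a greedy forward selection. It is not the model-free method used by this function: as a stand-in score it uses the in-sample \(R^2\) of an ordinary least-squares fit, and the names linear_r2 and greedy_selection are hypothetical helpers introduced only for illustration.

```python
import numpy as np

def linear_r2(X, y):
    """In-sample R-squared of an ordinary least-squares fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def greedy_selection(X, y, names):
    """Return rows of (selection order, variable, running score), best first."""
    remaining = list(range(X.shape[1]))
    selected, rows = [], []
    while remaining:
        # Pick the variable that best complements those already selected.
        best = max(remaining, key=lambda j: linear_r2(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
        # Running score: performance using every variable selected so far.
        rows.append((len(selected), names[best], linear_r2(X[:, selected], y)))
    return rows

# Synthetic data (assumption for illustration): x0 matters most, x1 less, x2 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

rows = greedy_selection(X, y, ["x0", "x1", "x2"])
for order, name, score in rows:
    print(order, name, round(score, 3))
```

Each row's score uses the variables from all earlier rows plus its own, which mirrors how the running columns of the returned DataFrame are meant to be read.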

When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not target_column is categorical.
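The inference described above can be pictured with the following sketch. It is only an assumed heuristic, not the library's actual rule: the helper infer_problem_type and its max_classes threshold are hypothetical and chosen purely for illustration.

```python
import pandas as pd

def infer_problem_type(data_df, target_column, max_classes=10):
    """Hypothetical heuristic: guess the problem type from the target column.

    max_classes is an assumed threshold, not a documented library value.
    """
    target = data_df[target_column]
    # Non-numeric targets (strings, categories) are treated as class labels.
    if not pd.api.types.is_numeric_dtype(target):
        return "classification"
    # Numeric targets with few distinct values are also treated as class labels.
    if target.nunique() <= max_classes:
        return "classification"
    return "regression"

df = pd.DataFrame({
    "label": ["spam", "ham"] * 10,
    "price": [float(i) * 1.5 for i in range(20)],
})
print(infer_problem_type(df, "label"))
print(infer_problem_type(df, "price"))
```

When the guess could be ambiguous, for instance an integer-coded target with few levels, passing problem_type explicitly avoids relying on any inference.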

Parameters
  • data_df (pandas.DataFrame) – The pandas DataFrame containing the data.

  • target_column (str) – The name of the column containing true labels.

  • problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.

  • file_name (None | str) – A unique identifier characterizing data_df in the form of a file name. Do not set this unless you know why.

Returns

result – The result is a pandas.DataFrame with columns (where applicable):

  • 'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.

  • 'Variable': The column name corresponding to the input variable.

  • 'Running Achievable R-Squared': The highest \(R^2\) that can be achieved by a regression model using all variables selected so far, including this one.

  • 'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.

  • 'Running Achievable RMSE': The lowest RMSE that can be achieved by a regression model using all variables selected so far, including this one.

Return type

pandas.DataFrame
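One way to use the returned DataFrame is to find the smallest number of variables that gets close to the best achievable performance. The sketch below builds a DataFrame by hand with the documented column names; its values are made up for illustration and are not real output. The helper smallest_sufficient_k and its fraction parameter are hypothetical.

```python
import pandas as pd

# In practice this would come from a call such as
#   variable_selection(data_df, target_column, 'regression')
# The values below are illustrative only, not real output.
result = pd.DataFrame({
    "Selection Order": [1, 2, 3, 4],
    "Variable": ["x3", "x0", "x2", "x1"],
    "Running Achievable R-Squared": [0.61, 0.78, 0.80, 0.80],
})

def smallest_sufficient_k(result, metric, fraction=0.95):
    """Smallest k whose running score reaches `fraction` of the best score.

    Assumes a higher-is-better metric (R-squared or accuracy, not RMSE).
    """
    ordered = result.sort_values("Selection Order")
    target = fraction * ordered[metric].max()
    reached = ordered[ordered[metric] >= target]
    return int(reached["Selection Order"].iloc[0])

k = smallest_sufficient_k(result, "Running Achievable R-Squared")
top_k = result.sort_values("Selection Order")["Variable"].head(k).tolist()
print(k, top_k)
```

For 'Running Achievable RMSE', where lower is better, the comparison would need to be reversed.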

Theoretical Foundation

Section 2 - Variable Selection Analysis.