Model-Free Regression Variable Selection Explained

In this tutorial we show how to run the model-free variable selection analysis provided in the kxy package on a regression problem.

We use the UCI Yacht Hydrodynamics dataset.

In [1]:
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')

# Required imports
import pandas as pd
import kxy
In [2]:
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/'
                 '00243/yacht_hydrodynamics.data', sep='[ ]{1,2}',
                 engine='python',  # regex separators require the python parsing engine
                 names=['Longitudinal Position', 'Prismatic Coefficient',
                        'Length-Displacement', 'Beam-Draught Ratio',
                        'Length-Beam Ratio', 'Froude Number',
                        'Residuary Resistance'])
df.rename(columns={col: col.title() for col in df.columns}, inplace=True)
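
As a quick sanity check, it can help to confirm the download parsed correctly; the UCI repository lists 308 experiments, i.e. 6 explanatory variables plus the label:

# Optional sanity check on the download.
print(df.shape)  # expected: (308, 7)
df.head()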

How To Generate The Variable Selection Analysis

In [3]:
var_selection_analysis = df.kxy.variable_selection_analysis('Residuary Resistance')
In [4]:
var_selection_analysis
Out[4]:
| Variable | Selection Order | Univariate Achievable R^2 | Maximum Marginal R^2 Increase | Running Achievable R^2 | Running Achievable R^2 (%) | Univariate Mutual Information (nats) | Conditional Mutual Information (nats) | Univariate Maximum True Log-Likelihood Increase Per Sample | Maximum Marginal True Log-Likelihood Increase Per Sample | Running Mutual Information (nats) | Running Mutual Information (%) | Running Maximum Log-Likelihood Increase Per Sample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Froude Number | 1 | 0.986015 | 0.986015 | 0.986015 | 98.7412 | 2.13488 | 2.13488 | 2.13488 | 2.13488 | 2.13488 | 65.0786 | 2.13488 |
| Prismatic Coefficient | 2 | 2.59997e-05 | 0.00639454 | 0.992409 | 99.3815 | 1.3e-05 | 0.305538 | 1.3e-05 | 0.305538 | 2.44041 | 74.3925 | 2.44041 |
| Length-Displacement | 3 | 0.000223975 | 0.00347074 | 0.99588 | 99.7291 | 0.000112 | 0.305538 | 0.000112 | 0.305538 | 2.74595 | 83.7064 | 2.74595 |
| Beam-Draught Ratio | 4 | 0.00028396 | 0.0018838 | 0.997764 | 99.9177 | 0.000142 | 0.305538 | 0.000142 | 0.305538 | 3.05149 | 93.0202 | 3.05149 |
| Longitudinal Position | 5 | 0.000259966 | 0.000457629 | 0.998221 | 99.9636 | 0.00013 | 0.114484 | 0.00013 | 0.114484 | 3.16597 | 96.5101 | 3.16597 |
| Length-Beam Ratio | 6 | 1.79998e-05 | 0.000363977 | 0.998585 | 100 | 9e-06 | 0.114484 | 9e-06 | 0.114484 | 3.28046 | 100 | 3.28046 |

Note: the same syntax is used for classification problems; the type of supervised learning problem is automatically inferred from the label column.
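
For instance, the following is a minimal classification sketch. The dataset (UCI banknote authentication, a binary classification problem) and the column names are our own illustrative choices, not part of this tutorial:

# Same one-liner as in the regression case; kxy infers a classification
# problem from the label column. The URL and column names below are
# illustrative assumptions, not part of this tutorial's data.
clf_df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/'
                     '00267/data_banknote_authentication.txt',
                     names=['Variance', 'Skewness', 'Kurtosis', 'Entropy', 'Is Fake'])
clf_analysis = clf_df.kxy.variable_selection_analysis('Is Fake')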

Column Meaning

  • Selection Order: Order in which variables are selected, from the most important (1) to the least important. The first variable selected is the one with the highest mutual information with the label (i.e. that is the most useful when used in isolation). The \((i+1)\)-th variable selected is the variable, among all variables not yet selected, that complements all \(i\) variables previously selected the most.
  • Univariate Achievable R^2: The highest \(R^2\) achievable by a model using the variable in isolation to predict the label.
  • Maximum Marginal R^2 Increase: The highest amount by which the \(R^2\) can be increased by adding this variable to all previously selected variables.
  • Running Achievable R^2: The highest \(R^2\) achievable by a model using all variables selected so far to predict the label (see the consistency check after this list).
  • Univariate Mutual Information (nats): Maximum-entropy model-free estimate of the mutual information in nats between the variable and the label, under the true data generating distribution.
  • Conditional Mutual Information (nats): Maximum-entropy model-free estimate of the mutual information in nats between the variable and the label conditional on all variables previously selected (if any), under the true data generating distribution.
  • Running Mutual Information (nats): Maximum-entropy model-free estimate of the mutual information in nats between all variables selected so far (including this variable) and the label, under the true data generating distribution.
  • Univariate Maximum True Log-Likelihood Increase Per Sample: The highest amount by which the true log-likelihood per sample of a model using the variable in isolation to predict the label can increase over the naive baseline strategy of always predicting the mode of the true unconditional distribution of the label.
  • Maximum Marginal True Log-Likelihood Increase Per Sample: The highest amount by which the true log-likelihood per sample of a model using all previously selected variables to predict the label can increase as a result of adding this variable.
  • Running Maximum Log-Likelihood Increase Per Sample: The highest amount by which the true log-likelihood per sample of a model using all variables selected so far (including this one) to predict the label can increase over the naive baseline strategy of always predicting the mode of the true unconditional distribution of the label.
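
These columns are internally consistent: the achievable \(R^2\) and mutual information columns are linked by \(R^2 = 1 - e^{-2I}\) (with \(I\) in nats), and each (%) column is the corresponding running column expressed as a percentage of its final value. The quick check below re-derives three columns of the table from the 'Running Mutual Information (nats)' column alone; the constants are copied from the output above.

import numpy as np

# 'Running Mutual Information (nats)' column, copied from the table above.
running_mi = np.array([2.13488, 2.44041, 2.74595, 3.05149, 3.16597, 3.28046])

# Achievable R^2 from mutual information: R^2 = 1 - exp(-2 I).
running_r2 = 1.0 - np.exp(-2.0 * running_mi)
print(np.round(running_r2, 6))
# [0.986015 0.992409 0.99588  0.997764 0.998221 0.998585]
# i.e. the 'Running Achievable R^2' column.

# The (%) columns are running values as a percentage of their final value
# (shown to 2 decimals; later digits depend on the rounding of the inputs).
print(np.round(100 * running_mi / running_mi[-1], 2))  # [65.08 74.39 83.71 93.02 96.51 100.]
print(np.round(100 * running_r2 / running_r2[-1], 2))  # [98.74 99.38 99.73 99.92 99.96 100.]

Note also that the log-likelihood increase columns coincide with the corresponding mutual information columns: the largest possible gain in expected log-likelihood per sample over the label-only baseline is exactly the mutual information in nats.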
In [5]:
import matplotlib.pyplot as plt  # pyplot instead of the deprecated 'pylab' interface
fig, ax = plt.subplots(1, 2, figsize=(20, 10))
var_selection_analysis.data[['Univariate Achievable R^2',
                             'Maximum Marginal R^2 Increase']].plot.bar(rot=90, ax=ax[0])
ax[0].set_ylabel(r'$R^2$', fontsize=15)
var_selection_analysis.data[['Univariate Mutual Information (nats)',
                             'Conditional Mutual Information (nats)']].plot.bar(rot=90, ax=ax[1])
ax[1].set_ylabel('Mutual Information (nats)', fontsize=15)
plt.show()
[Figure: per-variable bar charts. Left: 'Univariate Achievable R^2' and 'Maximum Marginal R^2 Increase'. Right: 'Univariate Mutual Information (nats)' and 'Conditional Mutual Information (nats)'.]