# Model-Free Regression Variable Selection Explained

In this tutorial we show how to run the model-free variable selection analysis provided in the kxy package on a regression problem.

We use the UCI Yacht Hydrodynamics dataset.

In :

%load_ext autoreload
import warnings
warnings.filterwarnings('ignore')

# Required imports
import pandas as pd
import kxy

In :

# Columns are separated by one or two spaces; regex separators need the Python parser
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/'
                 '00243/yacht_hydrodynamics.data',
                 sep='[ ]{1,2}', engine='python',
                 names=['Longitudinal Position', 'Prismatic Coeefficient',
                        'Length-Displacement', 'Beam-Draught Ratio',
                        'Length-Beam Ratio', 'Froude Number',
                        'Residuary Resistance'])
df.rename(columns={col: col.title() for col in df.columns}, inplace=True)


## How To Generate The Variable Selection Analysis

In :

var_selection_analysis = df.kxy.variable_selection_analysis('Residuary Resistance')

In :

var_selection_analysis

Out:

| Variable | Selection Order | Univariate Achievable R^2 | Maximum Marginal R^2 Increase | Running Achievable R^2 | Running Achievable R^2 (%) | Univariate Mutual Information (nats) | Conditional Mutual Information (nats) | Univariate Maximum True Log-Likelihood Increase Per Sample | Maximum Marginal True Log-Likelihood Increase Per Sample | Running Mutual Information (nats) | Running Mutual Information (%) | Running Maximum Log-Likelihood Increase Per Sample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Froude Number | 1 | 0.986015 | 0.986015 | 0.986015 | 98.9424 | 2.13488 | 2.13488 | 2.13488 | 2.13488 | 2.13488 | 75.2943 | 2.13488 |
| Beam-Draught Ratio | 2 | 0.00028396 | 0.00238509 | 0.9884 | 99.1817 | 0.000142 | 0.093492 | 0.000142 | 0.093492 | 2.22837 | 78.5917 | 2.22837 |
| Longitudinal Position | 3 | 0.000259966 | 0.00197833 | 0.990378 | 99.3802 | 0.00013 | 0.093492 | 0.00013 | 0.093492 | 2.32186 | 81.889 | 2.32186 |
| Length-Displacement | 4 | 0.000223975 | 0.00164094 | 0.992019 | 99.5449 | 0.000112 | 0.093492 | 0.000112 | 0.093492 | 2.41535 | 85.1864 | 2.41535 |
| Prismatic Coeefficient | 5 | 2.59997e-05 | 0.00364914 | 0.995668 | 99.911 | 1.3e-05 | 0.305538 | 1.3e-05 | 0.305538 | 2.72089 | 95.9623 | 2.72089 |
| Length-Beam Ratio | 6 | 1.79998e-05 | 0.00088648 | 0.996555 | 100 | 9e-06 | 0.114484 | 9e-06 | 0.114484 | 2.83537 | 100 | 2.83537 |

Note: the same syntax is used for classification problems. The type of supervised learning problem is inferred based on the label column.
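The columns of the output above are tightly linked. The values are consistent with two relations that this tutorial does not state explicitly, so treat them as assumptions inferred from the numbers: the running mutual information is the cumulative sum of the conditional mutual informations (chain rule), and a mutual information of $$I$$ nats corresponds to an achievable $$R^2$$ of $$1 - e^{-2I}$$. A minimal sketch that recomputes the running columns from the 'Conditional Mutual Information (nats)' values copied from the table (the helper name `achievable_r2` is hypothetical and not part of kxy):

```python
import numpy as np

# Conditional mutual information (nats), copied from the table in selection order.
conditional_mi = np.array([2.13488, 0.093492, 0.093492,
                           0.093492, 0.305538, 0.114484])

# Assumed chain rule: running MI is the cumulative sum of conditional MIs.
running_mi = np.cumsum(conditional_mi)


def achievable_r2(mi):
    """Assumed mapping from mutual information (nats) to achievable R^2."""
    return 1.0 - np.exp(-2.0 * mi)


running_r2 = achievable_r2(running_mi)

# Marginal increase: step-to-step change in the running R^2.
marginal_r2_increase = np.diff(running_r2, prepend=0.0)

# Percentages are taken relative to the value reached with all variables.
running_r2_pct = 100.0 * running_r2 / running_r2[-1]
running_mi_pct = 100.0 * running_mi / running_mi[-1]

for row in zip(running_mi, running_r2, marginal_r2_increase,
               running_r2_pct, running_mi_pct):
    print(' '.join(f'{value:.6g}' for value in row))
```

Comparing the printed rows with the table is a quick way to check that the relations hold up to rounding of the displayed values.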

## Column Meaning

• Selection Order: Order in which variables are selected, from the most important (1) to the least important. The first variable selected is the one with the highest mutual information with the label (i.e. the one that is most useful when used in isolation). The $$(i+1)$$-th variable selected is the one that, among all variables not yet selected, best complements the $$i$$ variables previously selected.

• Univariate Achievable R^2: The highest $$R^2$$ achievable by a model using the variable in isolation to predict the label.

• Maximum Marginal R^2 Increase: The highest amount by which the $$R^2$$ can be increased as a result of adding this variable to all previously selected variables.

• Running Achievable R^2: The highest $$R^2$$ achievable by a model using all variables selected so far to predict the label.

• Univariate Mutual Information (nats): Maximum-entropy model-free estimate of the mutual information in nats between the variable and the label, under the true data generating distribution.

• Conditional Mutual Information (nats): Maximum-entropy model-free estimate of the mutual information in nats between the variable and the label conditional on all variables previously selected (if any), under the true data generating distribution.

• Running Mutual Information (nats): Maximum-entropy model-free estimate of the mutual information in nats between all variables selected so far (including this variable) and the label, under the true data generating distribution.

• Univariate Maximum True Log-Likelihood Increase Per Sample: The highest amount by which the true log-likelihood per sample of a model using the variable in isolation to predict the label can increase over the naive baseline strategy of always predicting the mode of the true unconditional distribution of the label.

• Maximum Marginal True Log-Likelihood Increase Per Sample: The highest amount by which the true log-likelihood per sample of a model using all variables selected prior to this one to predict the label can increase as a result of adding this variable.

• Running Maximum Log-Likelihood Increase Per Sample: The highest amount by which the true log-likelihood per sample of a model using all variables selected so far, including this one, to predict the label can increase over the naive baseline strategy of always predicting the mode of the true unconditional distribution of the label.
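The greedy procedure described under Selection Order can be illustrated with a small sketch. This is not the kxy implementation: kxy uses a maximum-entropy, model-free estimator, whereas the sketch below swaps in a much simpler Gaussian (linear) estimate of mutual information, $$I = -\frac{1}{2}\log(1 - R^2)$$, purely to show the selection loop. The function names `gaussian_mi` and `greedy_selection` and the synthetic data are hypothetical.

```python
import numpy as np


def gaussian_mi(X, y):
    """Mutual information (nats) between the columns of X and y,
    assuming they are jointly Gaussian (a stand-in for kxy's estimator)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    residual = yc - Xc @ beta
    r2 = 1.0 - residual.var() / yc.var()
    return -0.5 * np.log(1.0 - r2)


def greedy_selection(X, y):
    """Select columns one at a time. At each step, pick the column that
    maximizes the mutual information between the label and all columns
    selected so far plus the candidate. By the chain rule this is the
    candidate with the highest conditional mutual information."""
    remaining = list(range(X.shape[1]))
    selected, running_mi = [], []
    while remaining:
        scores = [gaussian_mi(X[:, selected + [j]], y) for j in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        running_mi.append(max(scores))
        remaining.remove(best)
    return selected, running_mi


# Synthetic data: column 0 matters most, column 1 a little, column 2 not at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=2000)

order, running_mi = greedy_selection(X, y)
# Conditional MI of each variable given those selected before it.
conditional_mi = np.diff(running_mi, prepend=0.0)
print(order)
print(conditional_mi)
```

The differences between successive running values play the role of the 'Conditional Mutual Information (nats)' column, and the running values that of 'Running Mutual Information (nats)'.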

In :

import matplotlib.pyplot as plt

# Two panels: R^2 metrics on the left, mutual information on the right
fig, ax = plt.subplots(1, 2, figsize=(20, 10))
var_selection_analysis.data[['Univariate Achievable R^2',
                             'Maximum Marginal R^2 Increase']]\
    .plot.bar(rot=90, ax=ax[0])
ax[0].set_ylabel(r'$R^2$', fontsize=15)
var_selection_analysis.data[['Univariate Mutual Information (nats)',
                             'Conditional Mutual Information (nats)']]\
    .plot.bar(rot=90, ax=ax[1])
ax[1].set_ylabel('Mutual Information (nats)', fontsize=15)
plt.show()