# Model-Free Regression Variable Selection Explained

In this tutorial we show how to run the model-free variable selection analysis provided in the `kxy` package on a regression problem.

We use the UCI Yacht Hydrodynamics dataset.

```
In [1]:
```

```
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')
# Required imports
import pandas as pd
import kxy
```

```
In [2]:
```

```
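# Column names are supplied explicitly via `names`.
# The separator is a regular expression (one or two spaces), which is handled
# by the Python parsing engine; naming it explicitly avoids a fallback warning.
df = pd.read_csv(
    'http://archive.ics.uci.edu/ml/machine-learning-databases/'
    '00243/yacht_hydrodynamics.data',
    sep='[ ]{1,2}', engine='python',
    names=['Longitudinal Position', 'Prismatic Coeefficient',
           'Length-Displacement', 'Beam-Draught Ratio',
           'Length-Beam Ratio', 'Froude Number',
           'Residuary Resistance'])
# Normalize the column names to title case.
df.rename(columns={col: col.title() for col in df.columns}, inplace=True)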
```

## How To Generate The Variable Selection Analysis

```
In [3]:
```

```
var_selection_analysis = df.kxy.variable_selection_analysis('Residuary Resistance')
```

```
In [4]:
```

```
var_selection_analysis
```

```
Out[4]:
```

| Variable | Selection Order | Univariate Achievable R^2 | Maximum Marginal R^2 Increase | Running Achievable R^2 | Running Achievable R^2 (%) | Univariate Mutual Information (nats) | Conditional Mutual Information (nats) | Univariate Maximum True Log-Likelihood Increase Per Sample | Maximum Marginal True Log-Likelihood Increase Per Sample | Running Mutual Information (nats) | Running Mutual Information (%) | Running Maximum Log-Likelihood Increase Per Sample |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Froude Number | 1 | 0.986015 | 0.986015 | 0.986015 | 98.7412 | 2.13488 | 2.13488 | 2.13488 | 2.13488 | 2.13488 | 65.0786 | 2.13488 |
| Prismatic Coeefficient | 2 | 2.59997e-05 | 0.00639454 | 0.992409 | 99.3815 | 1.3e-05 | 0.305538 | 1.3e-05 | 0.305538 | 2.44041 | 74.3925 | 2.44041 |
| Length-Displacement | 3 | 0.000223975 | 0.00347074 | 0.99588 | 99.7291 | 0.000112 | 0.305538 | 0.000112 | 0.305538 | 2.74595 | 83.7064 | 2.74595 |
| Beam-Draught Ratio | 4 | 0.00028396 | 0.0018838 | 0.997764 | 99.9177 | 0.000142 | 0.305538 | 0.000142 | 0.305538 | 3.05149 | 93.0202 | 3.05149 |
| Longitudinal Position | 5 | 0.000259966 | 0.000457629 | 0.998221 | 99.9636 | 0.00013 | 0.114484 | 0.00013 | 0.114484 | 3.16597 | 96.5101 | 3.16597 |
| Length-Beam Ratio | 6 | 1.79998e-05 | 0.000363977 | 0.998585 | 100 | 9e-06 | 0.114484 | 9e-06 | 0.114484 | 3.28046 | 100 | 3.28046 |

**Note:** the same syntax is used for classification problems. The type of supervised learning problem is inferred based on the label column.
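
The columns of the table are tied together by a few simple identities, and the reported figures can be used to check them. The sketch below copies values from the table and verifies that (i) the running mutual information is the cumulative sum of the conditional mutual informations (the chain rule), (ii) the running achievable \(R^2\) matches \(1 - e^{-2I}\), where \(I\) is the running mutual information in nats, and (iii) the percentage column is the running value divided by the final value. Relation (ii) is an assumption here, inferred from the numbers rather than stated in this tutorial, and the variable and function names are illustrative and not part of `kxy`.

```python
import math

# Values copied from the table above, in selection order.
conditional_mi = [2.13488, 0.305538, 0.305538, 0.305538, 0.114484, 0.114484]
running_mi_reported = [2.13488, 2.44041, 2.74595, 3.05149, 3.16597, 3.28046]
running_r2_reported = [0.986015, 0.992409, 0.99588, 0.997764, 0.998221, 0.998585]

# (i) Chain rule: running mutual information is the cumulative sum of the
# conditional mutual informations.
running_mi = []
total = 0.0
for cmi in conditional_mi:
    total += cmi
    running_mi.append(total)


def achievable_r2(mi_nats):
    """Assumed mapping from mutual information (nats) to achievable R^2."""
    return 1.0 - math.exp(-2.0 * mi_nats)


# (ii) Running achievable R^2 recomputed from the running mutual information.
running_r2 = [achievable_r2(mi) for mi in running_mi]

# The marginal R^2 increase is the step between consecutive running values.
marginal_r2_increase = [running_r2[0]] + [
    after - before for before, after in zip(running_r2, running_r2[1:])
]

# (iii) Percentages are relative to the value reached with all variables.
running_mi_pct = [100.0 * mi / running_mi[-1] for mi in running_mi]

for reported, recomputed in zip(running_r2_reported, running_r2):
    print(f"reported R^2 = {reported:.6f}, recomputed R^2 = {recomputed:.6f}")
```

In this table the three log-likelihood columns coincide with the corresponding mutual information columns, so the same checks apply to them unchanged.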

## Column Meaning

`Selection Order`

: Order in which variables are selected, from the most important (1) to the least important. The first variable selected is the one with the highest mutual information with the label (i.e. the one that is most useful when used in isolation). The \((i+1)\)-th variable selected is the one, among all variables not yet selected, that best complements the \(i\) variables already selected.

`Univariate Achievable R^2`

: The highest \(R^2\) achievable by a model using the variable in isolation to predict the label.

`Maximum Marginal R^2 Increase`

: The highest amount by which the \(R^2\) can be increased as a result of adding this variable to all previously selected variables.

`Running Achievable R^2`

: The highest \(R^2\) achievable by a model using all variables selected so far to predict the label.

`Univariate Mutual Information (nats)`

: Maximum-entropy model-free estimate of the mutual information in nats between the variable and the label, under the true data generating distribution.

`Conditional Mutual Information (nats)`

: Maximum-entropy model-free estimate of the mutual information in nats between the variable and the label conditional on all variables previously selected (if any), under the true data generating distribution.

`Running Mutual Information (nats)`

: Maximum-entropy model-free estimate of the mutual information in nats between all variables selected so far (including this variable) and the label, under the true data generating distribution.

`Univariate Maximum True Log-Likelihood Increase Per Sample`

: The highest amount by which the true log-likelihood per sample of a model using the variable in isolation to predict the label can increase over the naive baseline strategy consisting of always predicting the mode of the true unconditional distribution of the label.

`Maximum Marginal True Log-Likelihood Increase Per Sample`

: The highest amount by which the true log-likelihood per sample of a model using all variables selected prior to this one to predict the label can increase as a result of adding this variable.

`Running Maximum Log-Likelihood Increase Per Sample`

: The highest amount by which the true log-likelihood per sample of a model using all variables selected so far, including this one, to predict the label can increase over the naive baseline strategy consisting of always predicting the mode of the true unconditional distribution of the label.
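
The selection order described above is a greedy procedure, and a minimal sketch may make it concrete. This is not the `kxy` implementation: `greedy_selection_order` and `toy_conditional_mi` are hypothetical names, and the scores are made up for illustration. The scoring function stands in for an estimate of the mutual information between a candidate and the label given the variables already selected.

```python
from typing import Callable, Iterable, List, Sequence


def greedy_selection_order(
    variables: Iterable[str],
    conditional_mi: Callable[[str, Sequence[str]], float],
) -> List[str]:
    """Order variables by repeatedly picking the best complement.

    `conditional_mi(candidate, selected)` is a hypothetical scoring function:
    the mutual information between `candidate` and the label given the
    variables in `selected` (unconditional when `selected` is empty).
    """
    remaining = list(variables)
    selected: List[str] = []
    while remaining:
        # Pick the remaining variable that adds the most given what we have.
        best = max(remaining, key=lambda v: conditional_mi(v, selected))
        selected.append(best)
        remaining.remove(best)
    return selected


def toy_conditional_mi(candidate: str, selected: Sequence[str]) -> float:
    """Made-up scores: "b" is strong alone but redundant once "a" is known."""
    univariate = {"a": 1.0, "b": 0.9, "c": 0.1}
    if not selected:
        return univariate[candidate]
    given_a = {"b": 0.05, "c": 0.3}
    return given_a[candidate]


order = greedy_selection_order(["a", "b", "c"], toy_conditional_mi)
print(order)  # ['a', 'c', 'b']
```

The toy example mirrors what happens in the yacht table: `Prismatic Coeefficient` has a very small univariate mutual information yet is selected second, because its conditional mutual information given `Froude Number` is the largest among the remaining variables.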

```
In [5]:
```

```
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 2, figsize=(20, 10))

# Left panel: R^2 achievable by each variable alone vs. its marginal contribution.
var_selection_analysis.data[['Univariate Achievable R^2',
                             'Maximum Marginal R^2 Increase']].plot.bar(rot=90, ax=ax[0])
ax[0].set_ylabel(r'$R^2$', fontsize=15)

# Right panel: the same comparison expressed as mutual information.
var_selection_analysis.data[['Univariate Mutual Information (nats)',
                             'Conditional Mutual Information (nats)']].plot.bar(rot=90, ax=ax[1])
ax[1].set_ylabel('Mutual Information (nats)', fontsize=15)
plt.show()
```
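
One possible way to turn the analysis into a shortlist is to keep variables in selection order until the running achievable \(R^2\) reaches a chosen share of the total. The sketch below does this with the `Running Achievable R^2 (%)` values copied from the table; the `shortlist` helper and the `target_pct` threshold are hypothetical and not part of `kxy`.

```python
# (variable, Running Achievable R^2 (%)) pairs copied from the table,
# listed in selection order.
running_r2_pct = [
    ("Froude Number", 98.7412),
    ("Prismatic Coeefficient", 99.3815),
    ("Length-Displacement", 99.7291),
    ("Beam-Draught Ratio", 99.9177),
    ("Longitudinal Position", 99.9636),
    ("Length-Beam Ratio", 100.0),
]


def shortlist(rows, target_pct):
    """Return the shortest prefix of `rows` whose running percentage reaches `target_pct`."""
    kept = []
    for name, pct in rows:
        kept.append(name)
        if pct >= target_pct:
            break
    return kept


print(shortlist(running_r2_pct, 99.5))
# ['Froude Number', 'Prismatic Coeefficient', 'Length-Displacement']
```

The threshold is a judgment call: a lower target such as 98% would keep `Froude Number` alone, which reflects how much of the achievable performance that single variable already accounts for.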