# Regression Problem With Some Useless Variables¶

In this tutorial we go through a regression problem with a moderate number of explanatory variables (6) that are not equally insightful.

We run the variable selection analysis of the kxy package and confirm it empirically. We then show that our empirical findings are consistent with the data-driven improvability analysis of the kxy package.

We use the UCI Yacht Hydrodynamics dataset.

In [1]:

%load_ext autoreload
import os
import warnings
warnings.filterwarnings('ignore')

# Required imports
import pandas as pd
import kxy

In [2]:

df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/'\
'00243/yacht_hydrodynamics.data', sep='[ ]{1,2}',\
names=['Longitudinal Position', 'Prismatic Coeefficient',\
'Length-Displacement', 'Beam-Draught Ratio',\
'Length-Beam Ratio', 'Froude Number',\
'Residuary Resistance'])
df.rename(columns={col: col.title() for col in df.columns}, inplace=True)


## Variable Selection¶

### Analysis¶

In [3]:

var_selection_analysis = df.kxy.variable_selection('Residuary Resistance', problem_type='regression')

[====================================================================================================] 100% ETA: 0s

In [4]:

var_selection_analysis

Out[4]:

Variable Running Achievable R-Squared Running Achievable RMSE
Selection Order
1 Froude Number 0.99 1.66
2 Beam-Draught Ratio 0.99 1.62
3 Length-Displacement 0.99 1.60
4 Longitudinal Position 0.99 1.59
5 Length-Beam Ratio 0.99 1.58
6 Prismatic Coeefficient 0.99 1.58

#### Column Meaning¶

• Selection Order: Order in which variables (see Variable) are selected, from the most important to the least important. The first variable selected is the one with the highest mutual information with the label (i.e. that is the most useful when used in isolation). The $$(i+1)$$-th variable selected is the variable, among all variables not yet selected, that complements all $$i$$ variables previously selected the most

• Running Achievable R-Squared: The highest $$R^2$$ achievable by a model using all variables selected so far to predict the label.

• Running Achievable RMSE: The lowest Root Mean Square Error achievable by a model using all variables selected so far to predict the label.

### Validation¶

It seems that Froude Number is by far the most relevant variable and an almost perfect $$R^2$$ may be achieved. Let’s try to confirm this visually.

In [5]:

import pylab as plt
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
df[['Residuary Resistance', 'Froude Number']].plot(ax=ax, x='Froude Number',\
y='Residuary Resistance', kind='scatter')
plt.show()


## Data-Driven Improvability¶

The variable selection analysis suggests that the other 4 variables do not add much value. Let’s try to confirm this with our data-driven improvability analysis.

In [6]:

new_variables = ['Beam-Draught Ratio',\
'Length-Displacement', 'Longitudinal Position', \
'Length-Beam Ratio', 'Prismatic Coeefficient']
df.kxy.data_driven_improvability('Residuary Resistance', new_variables, problem_type='regression')

[====================================================================================================] 100% ETA: 0s

Out[6]:

R-Squared Boost Log-Likelihood Per Sample Boost RMSE Reduction
0 0.00 0.00 0.00

As it turns out, our data-driven improvability analysis does indeed suggest that the other 4 variables add negligible value.

## Model-Driven Improvability¶

Let’s train a couple models using Froude Number as the only explanatory variable and run our model-driven improvability analysis. We begin with a simple linear model.

### Case I: Linear Regression¶

In [7]:

# Learning (Basic Linear Regression)
from sklearn.linear_model import LinearRegression
# Training
label_column = 'Residuary Resistance'
prediction_column = 'Prediction'
train_size = 200
df_ = df.copy()
df_ = df_.sample(frac=1, random_state=0) # Shuffle rows
train_df = df.iloc[:train_size]
x_train = train_df[['Froude Number']].values
y_train = train_df[label_column].values
model = LinearRegression().fit(x_train, y_train)

# Testing
test_df = df.iloc[train_size:]
x_test = test_df[['Froude Number']].values
y_test = test_df[label_column].values

# Out-of-sample predictions
predictions = model.predict(x_test)
lr_test_df = test_df.copy()
lr_test_df[prediction_column] = predictions

# Out-of-sample accuracy (R^2)
'Linear Regression Out-Of-Sample R^2: %.2f' % (model.score(x_test, y_test))

Out[7]:

'Linear Regression Out-Of-Sample R^2: 0.65'

In [8]:

fig, axes = plt.subplots(1, 2, figsize=(20, 10))
lr_test_df[['Prediction', 'Froude Number']].plot(\
ax=axes[0], x='Froude Number', y='Prediction', \
color='r')
lr_test_df[['Residuary Resistance', 'Froude Number']].plot(\
ax=axes[0], x='Froude Number', y='Residuary Resistance', \
kind='scatter')
lr_test_df['Residuals'] = lr_test_df['Residuary Resistance']-lr_test_df['Prediction']
lr_test_df[['Residuals', 'Froude Number']].plot(\
ax=axes[1], x='Froude Number', y='Residuals', \
kind='scatter', label='Residuals (Linear Regression)', color='k')
axes[0].legend(['Predictions (Linear Regression)', 'True'])
plt.show()

In [9]:

lr_test_df[['Froude Number', 'Residuary Resistance', 'Prediction']].kxy\
.model_driven_improvability(label_column, prediction_column, \
problem_type='regression')

[====================================================================================================] 100% ETA: 0s

Out[9]:

Lost R-Squared Lost Log-Likelihood Per Sample Lost RMSE Residual R-Squared Residual Log-Likelihood Per Sample Residual RMSE
0 0.00 0.00 0.00 0.98 -1.73 1.40

### Interpretation¶

Lost performance metrics represent the performance that was irreversibly lost while training the moodel. When the model’s prediction function is in a 1-to-1 relationship with the input, nothing can be lost during training as the effect of training can be undone.

Indeed, if a poor model has 1-to-1 prediction function $$x \to g(x)$$, and the best model has prediction function $$x \to f(x)$$, then we may always achieve optimal performance from $$z=g(x)$$ by simply using the model with prediction function $$z \to f\left(g^{-1}(z)\right)$$. Note however that the prediction function of a trained model can only be 1-to-1 when the input is 1D, otherwise the trained model will irreversibly lose information about the input distribution.

In the case of our trained 1D linear regression model, the prediction function is 1-to-1 and, as such, the training process is lossless. Once more, the analysis from the kxy package matches our expectation.

Residual performance metrics represent the performance that may be achieved when using explanatory variables to explain regression residuals. In other words, they represent how much juice is still left in regression residuals. The more juice is extracted by the trained model, the less juice will be left in residuals. For these residual performance metrics to be 0, the learned model needs to be truly optimal.

The kxy model-driven improvability analysis suggests that residuals of the trained linear regression model can be predicted almost perfectly using Froude Number, which is consistent with the residuals plot above.

### Case 2: Gaussian Process Regression¶

Next we try Gaussian Proceess regression to capture non-linearities. We use the RBF kernel with fixed input scale 1.0 and output scale 10.

In [10]:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

x_train = train_df[['Froude Number']].values
y_train = train_df[label_column].values

x_test = test_df[['Froude Number']].values
y_test = test_df[label_column].values

# Fit
gp_model = GaussianProcessRegressor(10.0 * RBF(1.0))
gp_model.fit(x_train, y_train)

# Out-of-sample predictions
gp_test_df = test_df.copy()
gp_test_df[prediction_column] = gp_model.predict(x_test)

# Out-of-sample accuracy (R^2)
'Gaussian Process Regression Out-Of-Sample R^2: %.2f' % (gp_model.score(x_test, y_test))

Out[10]:

'Gaussian Process Regression Out-Of-Sample R^2: 0.98'

In [11]:

gp_test_df[['Froude Number', 'Residuary Resistance', 'Prediction']].kxy\
.model_driven_improvability(label_column, prediction_column, \
problem_type='regression')

[====================================================================================================] 100% ETA: 0s

Out[11]:

Lost R-Squared Lost Log-Likelihood Per Sample Lost RMSE Residual R-Squared Residual Log-Likelihood Per Sample Residual RMSE
0 0.00 4.36e-02 7.65e-02 0.90 -4.46e-01 7.37e-01

Lost performance metrics still have negligible values as the prediction function is 1-to-1.

Moreover, residual performances are reduced, which is an indication that our GP model is less suboptimal than our linear regression model. A lower Residual RMSE typically implies that residuals have a smaller variance. A lower Residual R-Squared illustrates that residuals are harder to predict using the explanatory variables, which implies the model is more effective — when the model is optimal, residuals are independant from explanatory variables.

We note however that the kxy analysis suggests that there is still a lot of juice left in our GP residuals. This can be visually confirmed on the figure below, where we see that our trained GP model undershoots for large Froude Number values. This is most likely due to the fact that we did not learn kernel hyper-parameters.

In [12]:

fig, axes = plt.subplots(1, 2, figsize=(20, 10))
gp_test_df[['Prediction', 'Froude Number']].sort_values(by=['Froude Number']).\
plot(ax=axes[0], x='Froude Number', y='Prediction', \
color='r')
gp_test_df[['Residuary Resistance', 'Froude Number']].plot(\
ax=axes[0], x='Froude Number', y='Residuary Resistance', \
kind='scatter')
gp_test_df['Residuals'] = gp_test_df['Residuary Resistance']-gp_test_df['Prediction']
gp_test_df[['Residuals', 'Froude Number']].plot(\
ax=axes[1], x='Froude Number', y='Residuals', \
kind='scatter', label='Residuals (GP Regression)', color='k')
axes[0].legend(['Predictions (GP Regression)', 'True'])
plt.show()