Can a Trained Regression Model Be Improved Without Additional Variables?

In this tutorial, we show how to use the kxy package to quantify the extent to which a trained regression model can be improved using the same explanatory variables, without resorting to new datasets.

We use the UCI Yacht Hydrodynamics dataset.

In [1]:
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')

# Required imports
import pandas as pd
import kxy
In [2]:
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/'\
                 '00243/yacht_hydrodynamics.data', sep='[ ]{1,2}',\
                 names=['Longitudinal Position', 'Prismatic Coeefficient',\
                        'Length-Displacement', 'Beam-Draught Ratio',\
                        'Length-Beam Ratio', 'Froude Number',\
                        'Residuary Resistance'])
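The regular-expression separator above is easy to get wrong, so as a quick sanity check (not part of the original tutorial) we can inspect the shape and the first few rows of the loaded dataframe; the UCI yacht hydrodynamics dataset should contain 308 rows and, with the names above, 7 columns.

# Quick sanity check on the load (expected shape: 308 rows, 7 columns)
print(df.shape)
df.head()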

Basic Model 1: Linear Regression

We begin by training a simple linear regression model.

In [3]:
# Learning (Basic Linear Regression)
from sklearn.linear_model import LinearRegression
# Training
label_column = 'Residuary Resistance'
train_size = 200
train_df = df.iloc[:train_size]
x_train = train_df[['Longitudinal Position', 'Prismatic Coeefficient',\
                        'Length-Displacement', 'Beam-Draught Ratio',\
                        'Length-Beam Ratio', 'Froude Number']].values
y_train = train_df[label_column].values
model = LinearRegression().fit(x_train, y_train)

# Testing
test_df = df.iloc[train_size:]
x_test = test_df[['Longitudinal Position', 'Prismatic Coeefficient',\
                        'Length-Displacement', 'Beam-Draught Ratio',\
                        'Length-Beam Ratio', 'Froude Number']].values
y_test = test_df[label_column].values

# Out-of-sample predictions
predictions = model.predict(x_test)
test_df['Prediction'] = predictions

# In-sample predictions
in_sample_predictions = model.predict(x_train)
train_df['Prediction'] = in_sample_predictions

# Out-of-sample accuracy (R^2)
'Linear Regression Out-Of-Sample R^2: %.3f' % (model.score(x_test, y_test))
Out[3]:
'Linear Regression Out-Of-Sample R^2: 0.648'
In [4]:
print('''
Model Parameters:
* Longitudinal Position: %.4f,
* Prismatic Coeefficient: %.4f,
* Length-Displacement: %.4f,
* Beam-Draught Ratio: %.4f,
* Length-Beam Ratio: %.4f,
* Froude Number: %.4f.''' % tuple(model.coef_))

Model Parameters:
* Longitudinal Position: 0.1725,
* Prismatic Coeefficient: -32.0069,
* Length-Displacement: 1.9683,
* Beam-Draught Ratio: -0.9958,
* Length-Beam Ratio: -2.6162,
* Froude Number: 118.0419.
In [5]:
print('''
Model Parameters (Adjusted For Std):
* Longitudinal Position: %.4f,
* Prismatic Coeefficient: %.4f,
* Length-Displacement: %.4f,
* Beam-Draught Ratio: %.4f,
* Length-Beam Ratio: %.4f,
* Froude Number: %.4f.''' % tuple(model.coef_*x_train.std(axis=0)))

Model Parameters (Adjusted For Std):
* Longitudinal Position: 0.2288,
* Prismatic Coeefficient: -0.4150,
* Length-Displacement: 0.4872,
* Beam-Draught Ratio: -0.5807,
* Length-Beam Ratio: -0.6904,
* Froude Number: 11.9654.

Improvability Analysis Model 1

In [6]:
# Improvability analysis
improvability_analysis = train_df.kxy.model_improvability_analysis(\
    label_column, 'Prediction', space='dual')
improvability_analysis
Out[6]:
    Lost R^2  Lost True Log-Likelihood Per Sample  Residual R^2  Residual True Log-Likelihood Per Sample
   0.0163178                                    0      0.733263                                  2.87701

Column Meaning

  • Lost R^2: The \(R^2\) that has been irreversibly lost by the trained model. That is, the difference between the highest \(R^2\) that could be achieved from raw inputs, and the highest \(R^2\) that could be achieved from model predictions.
  • Lost True Log-Likelihood Per Sample: The true log-likelihood per sample that has been irreversibly lost by the trained model. That is, the difference between the highest log-likelihood per sample that could be achieved from raw inputs, and the highest log-likelihood per sample that could be achieved from model predictions.
  • Residual R^2: The highest \(R^2\) that can be achieved when trying to predict model residuals using inputs (see the sketch right after this list for a rough empirical illustration).
  • Residual True Log-Likelihood Per Sample: The highest log-likelihood per sample that can be achieved when trying to predict model residuals using inputs.
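To make the last two columns more concrete, here is a purely empirical sketch (not part of the original analysis): we fit an off-the-shelf nonparametric model on the residuals of the linear regression and measure how well they can be predicted from the raw inputs. kxy's Residual R^2 is a model-free upper bound on this kind of score, so the figure produced below will typically come out somewhat lower.

# Rough empirical probe of the 'Residual R^2' idea, reusing x_train, y_train and
# in_sample_predictions from In [3]. This is only a sketch: kxy's figure is a
# model-free bound, not the score of any particular model.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

residuals = y_train - in_sample_predictions
residual_probe = RandomForestRegressor(n_estimators=200, random_state=0)
print('Empirical residual R^2: %.3f' % cross_val_score(
    residual_probe, x_train, residuals, cv=5, scoring='r2').mean())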

Interpretation Analysis Model 1

Judging by the low values of Lost R^2 and Lost True Log-Likelihood Per Sample, the linear regression model doesn’t seem to have resulted in any meaningful information loss: most of the value of using the inputs to predict the label is captured in the model predictions.

However, given the high Residual R^2, the trained model itself doesn’t seem to be the most accurate model we could build: its residuals can still be largely predicted from the inputs.

Let’s try another simple model. To choose one, we first run a variable selection analysis on the inputs.

In [7]:
train_df.kxy.variable_selection_analysis(label_column, \
    input_columns=['Longitudinal Position', 'Prismatic Coeefficient',\
                        'Length-Displacement', 'Beam-Draught Ratio',\
                        'Length-Beam Ratio', 'Froude Number'])
Out[7]:
                        Selection Order  Univariate Achievable R^2  Maximum Marginal R^2 Increase  Running Achievable R^2
Variable
Froude Number                         1                   0.992262                       0.992262                0.992262
Beam-Draught Ratio                    2                 0.00315701                     0.00259387                0.994856
Length-Displacement                   3                 0.00164065                     0.00146861                0.996325
Longitudinal Position                 4                1.59999e-05                     0.00104931                0.997374
Prismatic Coeefficient                5                  0.0030633                    0.000717127                0.998091
Length-Beam Ratio                     6                0.000477886                     0.00048634                0.998577

                        Running Achievable R^2 (%)  Univariate Mutual Information (nats)  Conditional Mutual Information (nats)
Variable
Froude Number                              99.3676                               2.43081                                2.43081
Beam-Draught Ratio                         99.6273                              0.001581                               0.204147
Length-Displacement                        99.7744                              0.000821                               0.168085
Longitudinal Position                      99.8795                                 8e-06                               0.168082
Prismatic Coeefficient                     99.9513                              0.001534                               0.159467
Length-Beam Ratio                              100                              0.000239                               0.147028

                        Univariate Maximum True Log-Likelihood Increase Per Sample  Maximum Marginal True Log-Likelihood Increase Per Sample
Variable
Froude Number                                                              2.43081                                                  2.43081
Beam-Draught Ratio                                                        0.001581                                                 0.204147
Length-Displacement                                                       0.000821                                                 0.168085
Longitudinal Position                                                        8e-06                                                 0.168082
Prismatic Coeefficient                                                    0.001534                                                 0.159467
Length-Beam Ratio                                                         0.000239                                                 0.147028

                        Running Mutual Information (nats)  Running Mutual Information (%)  Running Maximum Log-Likelihood Increase Per Sample
Variable
Froude Number                                     2.43081                         74.1639                                             2.43081
Beam-Draught Ratio                                2.63496                         80.3924                                             2.63496
Length-Displacement                               2.80305                         85.5207                                             2.80305
Longitudinal Position                             2.97113                         90.6488                                             2.97113
Prismatic Coeefficient                             3.1306                         95.5142                                              3.1306
Length-Beam Ratio                                 3.27762                             100                                             3.27762

Our variable selection analysis suggests that the Froude Number variable is, by far, the most important. Let’s try training a simple univariate nonlinear regression model, namely an RBF Gaussian Process (GP) regression.

Basic Model 2: Univariate RBF-GP Regression

In [8]:
import GPy # pip install GPy
# Univariate model: keep only the Froude Number as input
x_train = train_df[['Froude Number']].values
x_test = test_df[['Froude Number']].values
# Gaussian Process regression with an RBF kernel
kernel = GPy.kern.RBF(input_dim=1)
gp_model = GPy.models.GPRegression(x_train, y_train[:, None], kernel)
gp_model.optimize()
gp_predictions = gp_model.predict(x_test)[0].flatten()

# Out-of-sample accuracy (R^2)
ss_res = ((gp_predictions-y_test)**2).sum()
ss_tot = ((y_test-y_test.mean())**2).sum()
'Univariate RBF-GPR Out-Of-Sample R^2: %.3f' % (1.-(ss_res/ss_tot))
Out[8]:
'Univariate RBF-GPR Out-Of-Sample R^2: 0.979'

Improvability Analysis Model 2

In [9]:
# Improvability analysis
train_df['Prediction'] = gp_model.predict(x_train)[0].flatten()
improvability_analysis = train_df.kxy.model_improvability_analysis(\
    label_column, 'Prediction', space='dual')
improvability_analysis
Out[9]:
     Lost R^2  Lost True Log-Likelihood Per Sample  Residual R^2  Residual True Log-Likelihood Per Sample
   0.00631525                                    0      0.120764                                 0.808562
In [10]:
import pylab as plt
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
gp_model.plot_data(ax=ax, label='Training Data')
gp_model.plot(ax=ax, plot_data=False)
ax.plot(x_test, y_test, '+', c='k', label='Testing Data')
ax.plot(x_test, gp_predictions, '.', c='green', label='GP Predictions')
model = LinearRegression().fit(x_train, y_train)
import numpy as np
lr_x = np.arange(0., 0.7, 0.05)
lr_y = model.predict(lr_x[:, None])
ax.plot(lr_x, lr_y, '.--', c='red', label='Linear Regression Fit')
ax.set_xlabel('Froude Number')
ax.set_ylabel(label_column)
plt.legend()
plt.show()
[Figure: Residuary Resistance vs. Froude Number, showing the training data, the testing data, the RBF-GP predictions, and the linear regression fit.]

Interpretation Analysis Model 2

We can see at a glance that the relationship between the label and the most important variable, namely the Froude Number, is nonlinear, and that this nonlinearity is well captured by the RBF-GP regression.

This is also evidenced in our improvability analysis, as the Residual R^2 dropped from 0.73 to 0.12.

Note that the \(R^2\) of our RBF-GP regression model (0.979) is very close to the theoretical best (i.e. the Univariate Achievable R^2 of the Froude Number, 0.992).

Next Step

The fit achieved so far is great, but there is still some ambiguity that might be alleviated with additional inputs: the same value of the Froude Number can result in a wide range of outputs at large Froude Numbers, and the Residual R^2 remains non-negligible.

We could fit another model on the residuals of our RBF-GP regression, and carry on iterating additively until the model improvability analysis suggests that the residuals are independent of the inputs.
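As an illustration of what one such additive step could look like, here is a minimal sketch (not part of the original tutorial) that fits a random forest on the RBF-GP residuals using the remaining inputs; the choice of a random forest for the second stage is arbitrary.

# One additive boosting step, as a sketch: model the RBF-GP residuals with the
# remaining inputs and add the two predictions. The random forest is an arbitrary
# choice for the second-stage model.
from sklearn.ensemble import RandomForestRegressor

other_columns = ['Longitudinal Position', 'Prismatic Coeefficient', 'Length-Displacement',
                 'Beam-Draught Ratio', 'Length-Beam Ratio']

# Residuals of the RBF-GP model on the training set (train_df['Prediction'] was set in In [9])
gp_residuals = y_train - train_df['Prediction'].values

residual_model = RandomForestRegressor(n_estimators=200, random_state=0)
residual_model.fit(train_df[other_columns].values, gp_residuals)

# Additive out-of-sample prediction: GP prediction + predicted residual
combined_predictions = gp_predictions + residual_model.predict(test_df[other_columns].values)
ss_res = ((combined_predictions - y_test)**2).sum()
ss_tot = ((y_test - y_test.mean())**2).sum()
print('Two-Stage Model Out-Of-Sample R^2: %.3f' % (1. - ss_res/ss_tot))

# To decide when to stop, rerun the improvability analysis on the combined in-sample
# predictions and iterate until the Residual R^2 becomes negligible.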