# Can a Trained Regression Model Be Improved Without Additional Variables?¶

In this tutorial we show how to use the `kxy`

package to quantify the extent to which a trained regression model can be improved without resorting to new datasets.

We use the UCI Yacht Hydrodynamics dataset.

```
In [1]:
```

```
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')
# Required imports
import pandas as pd
import kxy
```

```
In [2]:
```

```
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/'\
'00243/yacht_hydrodynamics.data', sep='[ ]{1,2}',\
names=['Longitudinal Position', 'Prismatic Coeefficient',\
'Length-Displacement', 'Beam-Draught Ratio',\
'Length-Beam Ratio', 'Froude Number',\
'Residuary Resistance'])
```

## Basic Model 1: Linear Regression¶

We begin by training a simple linear regression model.

```
In [3]:
```

```
# Learning (Basic Linear Regression)
from sklearn.linear_model import LinearRegression
# Training
label_column = 'Residuary Resistance'
train_size = 200
train_df = df.iloc[:train_size]
x_train = train_df[['Longitudinal Position', 'Prismatic Coeefficient',\
'Length-Displacement', 'Beam-Draught Ratio',\
'Length-Beam Ratio', 'Froude Number']].values
y_train = train_df[label_column].values
model = LinearRegression().fit(x_train, y_train)
# Testing
test_df = df.iloc[train_size:]
x_test = test_df[['Longitudinal Position', 'Prismatic Coeefficient',\
'Length-Displacement', 'Beam-Draught Ratio',\
'Length-Beam Ratio', 'Froude Number']].values
y_test = test_df[label_column].values
# Out-of-sample predictions
predictions = model.predict(x_test)
test_df['Prediction'] = predictions
# In-sample predictions
in_sample_predictions = model.predict(x_train)
train_df['Prediction'] = in_sample_predictions
# Out-of-sample accuracy (R^2)
'Linear Regression Out-Of-Sample R^2: %.3f' % (model.score(x_test, y_test))
```

```
Out[3]:
```

```
'Linear Regression Out-Of-Sample R^2: 0.648'
```

```
In [4]:
```

```
print('''
Model Parameters:
* Longitudinal Position: %.4f,
* Prismatic Coeefficient: %.4f,
* Length-Displacement: %.4f,
* Length-Displacement: %.4f,
* Length-Beam Ratio: %.4f,
* Froude Number: %.4f.''' % tuple(list(model.coef_)))
```

```
Model Parameters:
* Longitudinal Position: 0.1725,
* Prismatic Coeefficient: -32.0069,
* Length-Displacement: 1.9683,
* Length-Displacement: -0.9958,
* Length-Beam Ratio: -2.6162,
* Froude Number: 118.0419.
```

```
In [5]:
```

```
print('''
Model Parameters (Adjusted For Std):
* Longitudinal Position: %.4f,
* Prismatic Coeefficient: %.4f,
* Length-Displacement: %.4f,
* Length-Displacement: %.4f,
* Length-Beam Ratio: %.4f,
* Froude Number: %.4f.''' % tuple(list(model.coef_*x_train.std(axis=0))))
```

```
Model Parameters (Adjusted For Std):
* Longitudinal Position: 0.2288,
* Prismatic Coeefficient: -0.4150,
* Length-Displacement: 0.4872,
* Length-Displacement: -0.5807,
* Length-Beam Ratio: -0.6904,
* Froude Number: 11.9654.
```

### Improvability Analysis Model 1¶

```
In [6]:
```

```
# Improvability analysis
improvability_analysis = train_df.kxy.model_improvability_analysis(\
label_column, 'Prediction', space='dual')
improvability_analysis
```

```
Out[6]:
```

Lost R^2 | Lost True Log-Likelihood Per Sample | Residual R^2 | Residual True Log-Likelihood Per Sample |
---|---|---|---|

0.0120705 | 0 | 0.613555 | 3.06237 |

#### Column Meaning¶

`Lost R^2`

: The \(R^2\) that has been irreversibly lost by the trained model. That is, the difference between the highest \(R^2\) that could be achieved from raw inputs, and the highest \(R^2\) that could be achieved from model predictions.`Lost True Log-Likelihood Per Sample`

: The true log-likelihood per sample that has been irreversibly lost by the trained model. That is, the difference between the highest log-likelihood per sample that could be achieved from raw inputs, and the highest log-likelihood per sample that could be achieved from model predictions.`Residual R^2`

: The highest \(R^2\) that can be achieved when trying to predict model residuals using inputs.`Residual True Log-Likelihood Per Sample`

: The highest log-likelihood per sample that can be achieved when trying to predict model residuals using inputs.

### Interpretation Analysis Model 1¶

Judging by the low values of `Lost R^2`

and `Lost True Log-Likelihood Per Sample`

, the linear regression model doesn’t seem to have resulted in information loss. Most of the value in using inputs to predict the label are captured in the model predictions.

However, given the high `Residual R^2`

, the trained model itself doesn’t seem to be the most accurate.

Let’s another simple model.

```
In [7]:
```

```
train_df.kxy.variable_selection_analysis(label_column, \
input_columns=['Longitudinal Position', 'Prismatic Coeefficient',\
'Length-Displacement', 'Beam-Draught Ratio',\
'Length-Beam Ratio', 'Froude Number'])
```

```
Out[7]:
```

Selection Order | Univariate Achievable R^2 | Maximum Marginal R^2 Increase | Running Achievable R^2 | Running Achievable R^2 (%) | Univariate Mutual Information (nats) | Conditional Mutual Information (nats) | Univariate Maximum True Log-Likelihood Increase Per Sample | Maximum Marginal True Log-Likelihood Increase Per Sample | Running Mutual Information (nats) | Running Mutual Information (%) | Running Maximum Log-Likelihood Increase Per Sample | |
---|---|---|---|---|---|---|---|---|---|---|---|---|

Variable | ||||||||||||

Froude Number | 1 | 0.992262 | 0.992262 | 0.992262 | 99.792 | 2.43081 | 2.43081 | 2.43081 | 2.43081 | 2.43081 | 93.9885 | 2.43081 |

Beam-Draught Ratio | 2 | 0.00315701 | 0.00107825 | 0.99334 | 99.9005 | 0.001581 | 0.075032 | 0.001581 | 0.075032 | 2.50585 | 96.8896 | 2.50585 |

Prismatic Coeefficient | 3 | 0.0030633 | 0.0004045 | 0.993745 | 99.9411 | 0.001534 | 0.031331 | 0.001534 | 0.031331 | 2.53718 | 98.1011 | 2.53718 |

Longitudinal Position | 4 | 1.59999e-05 | 0.000264671 | 0.99401 | 99.9678 | 8e-06 | 0.021617 | 8e-06 | 0.021617 | 2.55879 | 98.9369 | 2.55879 |

Length-Beam Ratio | 5 | 0.000477886 | 0.000224915 | 0.994234 | 99.9904 | 0.000239 | 0.0191343 | 0.000239 | 0.0191343 | 2.57793 | 99.6767 | 2.57793 |

Length-Displacement | 6 | 0.00164065 | 9.56097e-05 | 0.99433 | 100 | 0.000821 | 0.008361 | 0.000821 | 0.008361 | 2.58629 | 100 | 2.58629 |

Our variable selection analysis suggests that the `Froude Number`

variable is, by far, the most important. Let’s try traning a simple univariate nonlinear regression model, namely an RBF-Gaussian Process regression.

## Basic Model 2: Linear Regression¶

```
In [8]:
```

```
import GPy # pip install GPy
x_train = train_df[['Froude Number']].values
x_test = test_df[['Froude Number']].values
kernel = GPy.kern.RBF(input_dim=1)
gp_model = GPy.models.GPRegression(x_train, y_train[:, None], kernel)
gp_model.optimize()
gp_predictions = gp_model.predict(x_test)[0].flatten()
# Out-of-sample accuracy (R^2)
ss_res = ((gp_predictions-y_test)**2).sum()
ss_tot = ((y_test-y_test.mean())**2).sum()
'Univariate RBF-GPR Out-Of-Sample R^2: %.3f' % (1.-(ss_res/ss_tot))
```

```
Out[8]:
```

```
'Univariate RBF-GPR Out-Of-Sample R^2: 0.979'
```

### Improvability Analysis Model 2¶

```
In [9]:
```

```
# Improvability analysis
train_df['Prediction'] = gp_model.predict(x_train)[0].flatten()
improvability_analysis = train_df.kxy.model_improvability_analysis(\
label_column, 'Prediction', space='dual')
improvability_analysis
```

```
Out[9]:
```

Lost R^2 | Lost True Log-Likelihood Per Sample | Residual R^2 | Residual True Log-Likelihood Per Sample |
---|---|---|---|

0.00206795 | 0 | 0.120764 | 0.808562 |

```
In [10]:
```

```
import pylab as plt
fig, ax = plt.subplots(1, 1, figsize=(10, 10))
gp_model.plot_data(ax=ax, label='Training Data')
gp_model.plot(ax=ax, plot_data=False)
ax.plot(x_test, y_test, '+', c='k', label='Testing Data')
ax.plot(x_test, gp_predictions, '.', c='green', label='GP Predictions')
model = LinearRegression().fit(x_train, y_train)
import numpy as np
lr_x = np.arange(0., 0.7, 0.05)
lr_y = model.predict(lr_x[:, None])
ax.plot(lr_x, lr_y, '.--', c='red', label='Linear Regression Fit')
ax.set_xlabel('Froude Number')
ax.set_ylabel(label_column)
plt.legend()
plt.show()
```

### Interpretation Analysis Model 2¶

We can see at a glance that the relationship between the label and the most important variable, namely `Froude Number`

is nonlinear, and that this nonlinearity is well captured by the RBF-GP regression.

This is also evidenced in our improvability analysis as the `Residual R^2`

dropped from 0.61 to 0.12.

Note that the \(R^2\) of our RBF-GP regression model is very closed to the theoretical best (i.e. `Univariate Achievable R^2`

of `Froude Number`

).

## Next Step¶

The fit achieved so far is great, but as evidenced by the fact that the same value of the `Froude Number`

variable can result in a wide range of outputs for large froude numbers, and by the non-negligible `Residual R^2`

, there is still some ambiguity that might be alleviated with additional inputs.

We could look at fitting another model on the residuals of our RBF-GP regression, and carry on iterating additively until the model improvability analysis suggests that residuals are independent from inputs.