# UCI Yacht Hydrodynamics Feasibility

In this tutorial we show how to use the kxy package to estimate the best performance that can be achieved when using a specific set of variables in a regression problem.

We use the UCI Yacht Hydrodynamics dataset.

In :

%load_ext autoreload
import warnings
warnings.filterwarnings('ignore')

# Required imports
import numpy as np
import pandas as pd
import kxy

In :

# The regex separator matches one or two spaces between fields;
# regex separators require the Python parsing engine.
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/'
                 '00243/yacht_hydrodynamics.data', sep='[ ]{1,2}', engine='python',
                 names=['Longitudinal Position', 'Prismatic Coefficient',
                        'Length-Displacement', 'Beam-Draught Ratio',
                        'Length-Beam Ratio', 'Froude Number',
                        'Residuary Resistance'])
# Normalize column names to title case
df.rename(columns={col: col.title() for col in df.columns}, inplace=True)
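
To see what the `sep='[ ]{1,2}'` argument does without downloading anything, the same parsing call can be tried on a small in-memory sample. This is only a sketch: the two rows below are made-up values laid out like the raw file (fields separated by one or two spaces), not records from the dataset.

```python
import io
import pandas as pd

columns = ['Longitudinal Position', 'Prismatic Coefficient',
           'Length-Displacement', 'Beam-Draught Ratio',
           'Length-Beam Ratio', 'Froude Number',
           'Residuary Resistance']

# Made-up rows (NOT real data): note the mix of single and double spaces.
sample = ("1.0 0.5 4.0 3.0 3.0 0.1 0.2\n"
          "2.0 0.6  5.0 4.0 3.5 0.2  1.5\n")

# A regex separator is only supported by the Python parsing engine.
sample_df = pd.read_csv(io.StringIO(sample), sep='[ ]{1,2}',
                        engine='python', names=columns)

print(sample_df.shape)
print(sample_df.dtypes)
```

Both rows should parse into seven numeric columns, which is a quick way to confirm that the separator handles the double spaces.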


## How To Generate The Achievable Performance Analysis

In :

performance_analysis = df.kxy.data_valuation('Residuary Resistance', problem_type='regression')

[====================================================================================================] 100% ETA: 0s

In :

performance_analysis

Out:

| | Achievable R-Squared | Achievable Log-Likelihood Per Sample | Achievable RMSE |
|---|---|---|---|
| 0 | 0.99 | -1.20 | 1.33 |

Note: the same syntax is used for classification problems. The type of supervised learning problem is inferred based on the label column.

## Column Meaning

• Achievable R-Squared: The highest $$R^2$$ that can be achieved by a model using the inputs to predict the label.

• Achievable Log-Likelihood Per Sample: The highest true log-likelihood per sample that can be achieved by a model using the inputs to predict the label.

• Achievable RMSE: The lowest Root-Mean-Square-Error that can be achieved by a model using the inputs to predict the label.

These performance figures are not mere upper bounds; they are attainable.
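
To judge how much headroom a trained model leaves, its R-squared and RMSE can be computed on held-out data and set against the achievable values reported above. The following is a minimal sketch using the standard definitions of both metrics. The helper names, the labels and the predictions are hypothetical placeholders, not output from kxy or from a model fitted to this dataset.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted labels."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Achievable values copied from the analysis output above.
ACHIEVABLE_R2 = 0.99
ACHIEVABLE_RMSE = 1.33

# Hypothetical held-out labels and model predictions (placeholders).
y_true = [0.0, 10.0, 20.0, 30.0, 40.0]
y_pred = [2.0, 8.0, 23.0, 28.0, 41.0]

model_r2 = r_squared(y_true, y_pred)
model_rmse = rmse(y_true, y_pred)

# Positive gaps indicate room for improvement with the same inputs.
print(f"R-squared: {model_r2:.3f} (gap to achievable: {ACHIEVABLE_R2 - model_r2:.3f})")
print(f"RMSE: {model_rmse:.3f} (gap to achievable: {model_rmse - ACHIEVABLE_RMSE:.3f})")
```

A large gap suggests that better modeling could still pay off with the current inputs, while a small gap suggests that further gains would have to come from new variables.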