UCI Yacht Hydrodynamics Feasibility
In this tutorial we show how to use the kxy package to estimate the best performance that can be achieved when using a specific set of variables in a regression problem.
We use the UCI Yacht Hydrodynamics dataset.
In [1]:
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')
# Required imports
import numpy as np
import pandas as pd
import kxy
In [2]:
# Load the data file. The separator is a regular expression matching one or
# two spaces, which pandas parses with its Python engine.
df = pd.read_csv(
    'http://archive.ics.uci.edu/ml/machine-learning-databases/'
    '00243/yacht_hydrodynamics.data',
    sep='[ ]{1,2}', engine='python',
    names=['Longitudinal Position', 'Prismatic Coefficient',
           'Length-Displacement', 'Beam-Draught Ratio',
           'Length-Beam Ratio', 'Froude Number',
           'Residuary Resistance'])
df.rename(columns={col: col.title() for col in df.columns}, inplace=True)
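The separator passed to read_csv above is a regular expression rather than a single character. As a minimal sketch of how such a separator behaves, the snippet below parses a small in-memory sample instead of the real file. The two rows are made-up values that only mimic the layout, and passing engine='python' explicitly is an assumption made here because pandas handles regular-expression separators with its Python parser.

```python
import io

import pandas as pd

# Hypothetical rows (not taken from the dataset): fields are separated
# by either one or two spaces, as the regular expression expects.
sample = (
    "-1.0 0.55 4.5 4.0 3.0 0.10 0.5\n"
    "-1.0 0.55 4.5 4.0 3.0 0.15  1.25\n"
)

columns = ['Longitudinal Position', 'Prismatic Coefficient',
           'Length-Displacement', 'Beam-Draught Ratio',
           'Length-Beam Ratio', 'Froude Number',
           'Residuary Resistance']

# '[ ]{1,2}' matches one or two consecutive spaces as a single delimiter.
sample_df = pd.read_csv(io.StringIO(sample), sep='[ ]{1,2}',
                        names=columns, engine='python')

print(sample_df.shape)
print(sample_df['Residuary Resistance'].tolist())
```

Checking the shape and the last column after loading is a quick way to confirm that no row was split into too many or too few fields.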
How To Generate The Achievable Performance Analysis
In [3]:
performance_analysis = df.kxy.data_valuation('Residuary Resistance', problem_type='regression')
[====================================================================================================] 100% ETA: 0s
In [4]:
performance_analysis
Out[4]:
| | Achievable R-Squared | Achievable Log-Likelihood Per Sample | Achievable RMSE |
|---|---|---|---|
| 0 | 0.99 | -1.20 | 1.33 |
Note: the same syntax is used for classification problems. When the problem type is not specified, the type of supervised learning problem is inferred from the label column.
Column Meaning
Achievable R-Squared
: The highest \(R^2\) that can be achieved by a model using the inputs to predict the label.

Achievable Log-Likelihood Per Sample
: The highest true log-likelihood per sample that can be achieved by a model using the inputs to predict the label.

Achievable RMSE
: The lowest Root-Mean-Square-Error that can be achieved by a model using the inputs to predict the label.
These performance figures are not merely theoretical bounds; they are attainable.
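To compare a trained model against these figures, the model's own \(R^2\) and RMSE need to be computed in the usual way. The sketch below uses the standard textbook definitions on synthetic labels and predictions. The helper names r_squared and rmse and all the data are assumptions made for illustration, and this is not necessarily how kxy computes its estimates. It also checks the identity that, on the same sample, RMSE equals the label's standard deviation times \(\sqrt{1 - R^2}\).

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rmse(y_true, y_pred):
    """Root-mean-square error between labels and predictions."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Synthetic labels and predictions (illustrative only, not the yacht data).
rng = np.random.default_rng(0)
y_true = rng.normal(loc=10.0, scale=15.0, size=500)
y_pred = y_true + rng.normal(scale=1.5, size=500)

model_r2 = r_squared(y_true, y_pred)
model_rmse = rmse(y_true, y_pred)

# With the population standard deviation (NumPy's default, ddof=0),
# RMSE = std(y) * sqrt(1 - R^2) holds exactly on the same sample.
implied_rmse = float(np.std(y_true) * np.sqrt(1.0 - model_r2))

print(f"R-squared: {model_r2:.4f}")
print(f"RMSE: {model_rmse:.4f} (implied by R-squared: {implied_rmse:.4f})")
```

A model whose \(R^2\) is well below the achievable value, or whose RMSE is well above it, leaves room for improvement with the same inputs.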