# Calculating the Best Performance Achievable in a Regression Problem

In this tutorial we show how to use the kxy package to estimate the best performance that can be achieved when using a specific set of variables in a regression problem.

We use the UCI Yacht Hydrodynamics dataset.

In [1]:

%load_ext autoreload
import warnings
warnings.filterwarnings('ignore')

# Required imports
import pandas as pd
import kxy

In [2]:

# The file separates fields with one or two spaces. A regex separator
# requires pandas' Python parsing engine.
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/'
                 '00243/yacht_hydrodynamics.data', sep='[ ]{1,2}', engine='python',
                 names=['Longitudinal Position', 'Prismatic Coefficient',
                        'Length-Displacement', 'Beam-Draught Ratio',
                        'Length-Beam Ratio', 'Froude Number',
                        'Residuary Resistance'])
df.rename(columns={col: col.title() for col in df.columns}, inplace=True)


## How To Generate The Achievable Performance Analysis

In [3]:

performance_analysis = df.kxy.achievable_performance_analysis(
    'Residuary Resistance',
    input_columns=())  # Use all columns but the label column as inputs

In [4]:

performance_analysis

Out[4]:

| Achievable R^2 | Achievable True Log-Likelihood Per Sample |
|---|---|
| 0.996555 | 0.798462 |

Note: the same syntax is used for classification problems. The type of supervised learning problem is inferred based on the label column.

## Column Meaning

• Achievable R^2: The highest $R^2$ that can be achieved by a model using the inputs to predict the label.
• Achievable True Log-Likelihood Per Sample: The highest true log-likelihood per sample that can be achieved by a model using the inputs to predict the label.

These values are not mere upper bounds; they are attainable.
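
One way to use the achievable $R^2$ is as a yardstick for a trained model: compute the model's $R^2$ on held-out data and look at the gap. The sketch below is only illustrative. It uses the standard definition $R^2 = 1 - SS_{res}/SS_{tot}$, the helper `r_squared` is a hypothetical name that is not part of kxy, and the labels and predictions are made-up numbers rather than values from the yacht dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError('y_true and y_pred must be non-empty and of equal length')
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after using the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean label
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError('R^2 is undefined when all labels are identical')
    return 1.0 - ss_res / ss_tot


# Achievable R^2 reported in Out[4]
ACHIEVABLE_R2 = 0.996555

# Hypothetical held-out labels and model predictions (illustration only)
y_true = [0.5, 1.5, 3.0, 7.0, 20.0]
y_pred = [0.7, 1.2, 3.5, 6.0, 19.0]

model_r2 = r_squared(y_true, y_pred)
gap = ACHIEVABLE_R2 - model_r2
print('Model R^2: %.6f, gap to achievable R^2: %.6f' % (model_r2, gap))
```

A large gap suggests that a better model could extract more from the same inputs, while a small gap suggests that further gains would have to come from new inputs.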