Calculating the Best Performance Achievable in a Classification Problem¶
In this tutorial we show how to use the
kxy package to estimate the best performance that can be achieved when using a specific set of variables in a classification problem.
We use the UCI bank note classification dataset.
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')

# Required imports
import pandas as pd
import kxy
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00267/' \
    'data_banknote_authentication.txt'
df = pd.read_csv(url, names=['Variance', 'Skewness', 'Kurtosis', 'Entropy', 'Is Fake'])
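Before running the analysis, it can be useful to sanity-check the shape and class balance of the loaded data. The sketch below uses a small stand-in frame with the same columns (the values are illustrative, not taken from the real UCI file), so it runs without a network connection:

```python
import pandas as pd

# Stand-in frame with the same columns as the banknote dataset.
# Values are illustrative only, not from the real UCI file.
df = pd.DataFrame({
    'Variance': [3.62, -1.39, 4.55],
    'Skewness': [8.67, -4.88, 8.17],
    'Kurtosis': [-2.81, 6.48, -2.46],
    'Entropy':  [-0.45, 0.34, -1.46],
    'Is Fake':  [0, 1, 0],
})

print(df.shape)                      # (rows, columns)
print(df['Is Fake'].value_counts())  # class balance of the label
```

On the real dataset, the same two lines confirm the expected 1372 rows, five columns, and a roughly balanced binary label.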
How To Generate The Achievable Performance Analysis¶
performance_analysis = df.kxy.achievable_performance_analysis('Is Fake', \
    input_columns=())  # Use all columns but the label column as inputs
| Achievable R^2 | Achievable Log-Likelihood Per Sample | Achievable Accuracy |
Note: the same syntax is used for regression problems. The type of supervised learning problem is inferred based on the label column.
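One plausible way such an inference could work is to look at the label column's dtype and number of distinct values. The heuristic below is a hypothetical sketch for intuition, not kxy's actual rule:

```python
import numpy as np
import pandas as pd

def infer_problem_type(label: pd.Series, max_classes: int = 10) -> str:
    """Hypothetical heuristic (not kxy's actual implementation): treat a label
    with a non-numeric dtype or few distinct values as a classification target,
    and anything else as a regression target."""
    if label.dtype == object or label.nunique() <= max_classes:
        return 'classification'
    return 'regression'

print(infer_problem_type(pd.Series([0, 1, 1, 0])))               # classification
print(infer_problem_type(pd.Series(np.linspace(0.0, 1.0, 50))))  # regression
```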
Achievable R^2: The highest \(R^2\) that can be achieved by a model using the inputs to predict the label.
Achievable True Log-Likelihood Per Sample: The highest true log-likelihood per sample that can be achieved by a model using the inputs to predict the label.
Achievable Accuracy: The highest classification accuracy that can be achieved by a model using the inputs to predict the label.
These performances are not mere upper bounds; they are attainable.
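To build intuition for what an attainable best accuracy looks like, consider a toy discrete joint distribution where \(P(x, y)\) is fully known. The Bayes-optimal classifier predicts \(\arg\max_y P(y \mid x)\), and its accuracy, \(\mathbb{E}_x[\max_y P(y \mid x)]\), is the highest accuracy any model can reach on that distribution. This toy example is for illustration only and is not how kxy estimates the bound:

```python
import numpy as np

# Toy joint distribution P(x, y) over a binary input x and a binary label y.
p_xy = np.array([[0.4, 0.1],   # x = 0: P(x=0, y=0) = 0.4, P(x=0, y=1) = 0.1
                 [0.1, 0.4]])  # x = 1: P(x=1, y=0) = 0.1, P(x=1, y=1) = 0.4

# Bayes-optimal accuracy: for each x, take the most probable label,
# then sum the corresponding joint probabilities.
best_accuracy = p_xy.max(axis=1).sum()
print(best_accuracy)  # 0.8
```

No classifier can exceed 0.8 accuracy on this distribution, and a model that always predicts the majority label given x attains it exactly, which is the sense in which such bounds are achievable.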