Data Valuation
Estimation of the highest performance achievable in a supervised learning problem, e.g. \(R^2\), RMSE, classification accuracy, or true log-likelihood per observation.

kxy.pre_learning.achievable_performance.data_valuation(data_df, target_column, problem_type)
Estimate the highest performance metrics achievable when predicting the target_column using all other columns.
When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not target_column is categorical.
Parameters
data_df (pandas.DataFrame) – The pandas DataFrame containing the data.
target_column (str) – The name of the column containing true labels.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
Returns
achievable_performance – The result is a pandas.DataFrame with columns (where applicable):
'Achievable Accuracy': The highest classification accuracy that can be achieved by a model using the provided inputs to predict the label.
'Achievable RSquared': The highest \(R^2\) that can be achieved by a model using the provided inputs to predict the label.
'Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using the provided inputs to predict the label.
'Achievable LogLikelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a model using the provided inputs to predict the label.
Return type
pandas.DataFrame
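Example
The following is a minimal sketch of how the function documented above might be called, not an official example. The synthetic dataset, the column names x1, x2 and y, and the helpers make_regression_frame and estimate_achievable_performance are hypothetical and introduced only for illustration. The import path and the positional arguments follow the signature given above, and the sketch assumes the kxy package is installed and set up in your environment.

```python
import numpy as np
import pandas as pd


def make_regression_frame(n_rows=500, seed=0):
    """Build a small synthetic regression dataset (illustrative data only)."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n_rows)
    x2 = rng.normal(size=n_rows)
    noise = rng.normal(scale=0.5, size=n_rows)
    # The target depends linearly on both inputs plus Gaussian noise.
    return pd.DataFrame({"x1": x1, "x2": x2, "y": 2.0 * x1 - x2 + noise})


def estimate_achievable_performance(data_df, target_column, problem_type=None):
    """Hypothetical thin wrapper around the documented data_valuation function."""
    # Imported lazily so the rest of this sketch runs even without kxy installed.
    from kxy.pre_learning.achievable_performance import data_valuation

    # Arguments are passed in the documented order:
    # (data_df, target_column, problem_type).
    return data_valuation(data_df, target_column, problem_type)


# Every column other than the target ("y") is treated as an input.
df = make_regression_frame()
```

With kxy available, calling estimate_achievable_performance(df, "y", problem_type="regression") should return the pandas.DataFrame described under Returns. Passing problem_type=None instead leaves the choice between regression and classification to the library's inference.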
Theoretical Foundation
Section 1 - Achievable Performance.