Data Valuation

Estimates the highest performance achievable in a supervised learning problem, e.g. \(R^2\), RMSE, classification accuracy, or true log-likelihood per observation.

kxy.pre_learning.achievable_performance.data_valuation(data_df, target_column, problem_type, snr='auto', include_mutual_information=False, file_name=None)

Estimate the highest performance metrics achievable when predicting the target_column using all other columns.

When problem_type is None, the type of supervised learning problem (regression or classification) is inferred from whether target_column is categorical.
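The exact inference rule is not spelled out here. As a rough illustration only, the following sketch shows one way such a rule could look, based on the value type and the number of distinct values. The helper name `infer_problem_type` and the `max_classes` threshold are hypothetical and are not part of kxy.

```python
def infer_problem_type(values, max_classes=10):
    """Guess the problem type from a target column's values.

    Hypothetical heuristic for illustration; NOT kxy's actual rule.
    """
    distinct = set(values)
    # A column counts as numeric only if every value is an int or float
    # (booleans are excluded because they behave like class labels).
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in distinct
    )
    if not is_numeric:
        return 'classification'
    # Numeric columns with only a few distinct values are treated as labels.
    if len(distinct) <= max_classes:
        return 'classification'
    return 'regression'
```

Passing problem_type explicitly avoids relying on any such heuristic, which matters for numeric targets with few distinct values (e.g. ratings from 1 to 5).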

Parameters

data_df (pandas.DataFrame) – The pandas DataFrame containing the data.
target_column (str) – The name of the column containing true labels.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
include_mutual_information (bool) – Whether to include the mutual information between target and explanatory variables in the result.
file_name (None | str) – A unique identifier characterizing data_df in the form of a file name. Do not set this unless you know why.

Returns

achievable_performance – A pandas.DataFrame with the following columns (where applicable):

'Achievable Accuracy': The highest classification accuracy that can be achieved by a model using provided inputs to predict the label.
'Achievable R-Squared': The highest \(R^2\) that can be achieved by a model using provided inputs to predict the label.
'Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using provided inputs to predict the label.
'Achievable Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a model using provided inputs to predict the label.
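To compare a trained model against these achievable values, the same metrics can be computed on the model's own predictions. The sketch below uses the standard textbook definitions in plain Python. The function names are illustrative and not part of kxy, and the log-likelihood helper assumes you already have the probability the model assigned to each true label.

```python
import math

def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Square root of the mean squared prediction error.
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

def r_squared(y_true, y_pred):
    # 1 - (residual sum of squares) / (total sum of squares).
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def log_likelihood_per_sample(probs_of_true_label):
    # Average log-probability the model assigned to the observed labels.
    return sum(math.log(p) for p in probs_of_true_label) / len(probs_of_true_label)
```

A model whose accuracy or \(R^2\) is already close to the achievable value has little headroom left with the given inputs; a large gap suggests the model, rather than the data, is the limiting factor.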

Return type

pandas.DataFrame
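A minimal usage sketch, assuming the kxy and pandas packages are installed and that kxy is configured with whatever credentials it expects. The helpers `make_toy_rows` and `estimate_achievable_performance` and the column names `x1`, `x2`, `y` are made up for illustration; only the import path and the positional arguments come from the signature above.

```python
import random

def make_toy_rows(n=200, seed=0):
    """Generate toy regression rows with y = 2*x1 - x2 + Gaussian noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append({'x1': x1, 'x2': x2, 'y': 2 * x1 - x2 + rng.gauss(0, 0.5)})
    return rows

def estimate_achievable_performance(rows, target_column='y',
                                    problem_type='regression'):
    # Imported lazily so the toy-data helper works without these packages.
    import pandas as pd
    from kxy.pre_learning.achievable_performance import data_valuation

    data_df = pd.DataFrame(rows)
    # Every column other than target_column is used as an explanatory variable.
    return data_valuation(data_df, target_column, problem_type)

# Example call (requires kxy to be installed and configured):
# result = estimate_achievable_performance(make_toy_rows())
# print(result)
```

For a regression problem the returned frame would be expected to carry the regression columns listed above (such as 'Achievable R-Squared' and 'Achievable RMSE') rather than 'Achievable Accuracy'.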

Theoretical Foundation

Section 1 - Achievable Performance.