Data Valuation

Estimation of the highest performance achievable in a supervised learning problem, e.g. \(R^2\), RMSE, classification accuracy, or true log-likelihood per observation.

kxy.pre_learning.achievable_performance.data_valuation(data_df, target_column, problem_type, snr='auto', include_mutual_information=False, file_name=None)

Estimate the highest performance metrics achievable when predicting the target_column using all other columns.

When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not target_column is categorical.
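To make the inference step concrete, here is a minimal sketch of one way a problem type could be guessed from a target column. It is not the library's actual rule: the helper name `infer_problem_type` and the `max_classes` threshold are assumptions made for illustration only.

```python
def infer_problem_type(values, max_classes=10):
    """Guess 'classification' or 'regression' from a target column's values.

    Hypothetical heuristic (an assumption, not kxy's implementation): a column
    containing any non-numeric value, or at most `max_classes` distinct values,
    is treated as categorical.
    """
    def is_number(v):
        # bool is a subclass of int, but True/False labels are categories.
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if not all(is_number(v) for v in values):
        return 'classification'
    if len(set(values)) <= max_classes:
        return 'classification'
    return 'regression'


print(infer_problem_type(['spam', 'ham', 'spam']))
print(infer_problem_type([0, 1, 1, 0, 1]))
print(infer_problem_type([0.1 * i for i in range(50)]))
```

Passing problem_type explicitly avoids depending on any such heuristic, which is useful when a numeric target with few distinct values should still be treated as a regression target.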

Parameters
  • data_df (pandas.DataFrame) – The pandas DataFrame containing the data.

  • target_column (str) – The name of the column containing true labels.

  • problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.

  • include_mutual_information (bool) – Whether to include the mutual information between target and explanatory variables in the result.

  • file_name (None | str) – A unique identifier characterizing data_df in the form of a file name. Leave this unset unless you have a specific reason to override it.

Returns

achievable_performance – The result is a pandas.DataFrame with the following columns (where applicable):

  • 'Achievable Accuracy': The highest classification accuracy that can be achieved by a model using the provided inputs to predict the label.

  • 'Achievable R-Squared': The highest \(R^2\) that can be achieved by a model using the provided inputs to predict the label.

  • 'Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using the provided inputs to predict the label.

  • 'Achievable Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a model using the provided inputs to predict the label.

Return type

pandas.DataFrame
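The achievable values are bounds, so they are most useful when compared with the same metrics measured on a trained model. The sketch below shows the conventional definitions of the four metrics for a model's predictions. The helper names are illustrative and not part of kxy, and the log-likelihood definition (mean log-probability assigned to the true label) is an assumption about what the corresponding column measures.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def log_likelihood_per_sample(y_true, predicted_probs):
    """Mean log-probability the model assigns to each true label.

    predicted_probs[i] is a dict mapping class labels to probabilities
    (an assumed input format chosen for this illustration).
    """
    total = sum(math.log(p[t]) for t, p in zip(y_true, predicted_probs))
    return total / len(y_true)


print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))
```

A model whose measured accuracy or \(R^2\) is already close to the achievable value has little headroom left with the same inputs; a large gap suggests the model, rather than the data, is the limiting factor.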

Theoretical Foundation

Section 1 - Achievable Performance.