# DataFrame Extension Deep Dive

The Accessor class below defines a custom kxy pandas accessor. It extends the pandas DataFrame class with all our analyses, so that data scientists can use the kxy toolkit directly from their favorite data structure.

All methods defined in the Accessor class are accessible from any DataFrame instance as df.kxy.<method_name>, so long as the kxy python package is imported alongside pandas.
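The `df.kxy.<method_name>` syntax relies on the accessor-registration mechanism that pandas exposes through `pandas.api.extensions.register_dataframe_accessor`. The following is a minimal sketch of that mechanism only; the accessor name `demo` and its `n_missing` method are hypothetical and are not part of kxy.

```python
import pandas as pd


# Hypothetical accessor registered under the name "demo" (not part of kxy).
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor is attached to.
        self._obj = pandas_obj

    def n_missing(self):
        """Return the number of missing values in each column."""
        return self._obj.isna().sum()


df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None]})
missing = df.demo.n_missing()
print(missing)
```

Once the module containing the decorated class has been imported, every DataFrame exposes the accessor, which is why importing kxy alongside pandas is enough to make `df.kxy` available.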

class kxy.pandas_extension.accessor.Accessor(pandas_obj)

Extension of the pandas.DataFrame class with the full capabilities of the kxy platform.

class kxy.pandas_extension.base_accessor.BaseAccessor(pandas_obj)

Base class inherited by our custom accessors.

anonymize(columns_to_exclude=[])

Anonymize the dataframe in a manner that leaves all pre-learning and post-learning analyses (including data valuation, variable selection, model-driven improvability, data-driven improvability and model explanation) invariant.

Any transformation on continuous variables that preserves ranks will not change our pre-learning and post-learning analyses. The same holds for any 1-to-1 transformation on categorical variables.

This implementation replaces ordinal values (i.e. any column that can be cast as a float) with their within-column ranks. For each non-ordinal column, we form the set of all possible values, we assign a unique integer index to each value in the set, and we systematically replace said value appearing in the dataframe by the hexadecimal code of its associated integer index.

For regression problems, accurate estimation of RMSE-related metrics requires that the target column (and the prediction column for post-learning analyses) not be anonymized.
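The two points above can be checked on synthetic data. The sketch below uses a rank correlation (the Pearson correlation of ranks) as a simple example of a rank-based statistic; it is not one of the kxy analyses. It shows that a strictly increasing transformation of an input leaves that statistic unchanged, whereas replacing the target by its ranks changes the target's scale, and RMSE is expressed in the units of the target.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = pd.Series(2.0 * x + rng.normal(scale=0.5, size=200))


def rank_corr(a, b):
    # Spearman correlation, computed as the Pearson correlation of ranks.
    return a.rank().corr(b.rank())


# A strictly increasing transformation of x preserves its ranks.
x_transformed = np.exp(x)
before = rank_corr(x, y)
after = rank_corr(x_transformed, y)

# Replacing y by its ranks changes its standard deviation, hence the scale
# on which any RMSE would be measured.
y_std = y.std()
y_rank_std = y.rank().std()
print(before, after, y_std, y_rank_std)
```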

Parameters

columns_to_exclude (list (optional)) – List of columns not to anonymize (e.g. target and prediction columns for regression problems).

Returns

result – The anonymized dataframe.

Return type

pandas.DataFrame
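The anonymization scheme described above can be sketched in plain pandas. This is an illustrative re-implementation under stated assumptions, not the kxy source: the function name `anonymize_sketch`, the tie handling (pandas' default average ranks), and the order in which integer indices are assigned to categorical values are all assumptions.

```python
import pandas as pd


def anonymize_sketch(df, columns_to_exclude=()):
    """Illustrative sketch of the rank / hexadecimal-code scheme."""
    out = df.copy()
    for col in df.columns:
        if col in columns_to_exclude:
            continue
        try:
            # Ordinal column: anything castable to float is replaced by
            # its within-column ranks.
            out[col] = df[col].astype(float).rank()
        except (ValueError, TypeError):
            # Non-ordinal column: map each distinct value to the hexadecimal
            # code of an integer index (index order is an assumption).
            values = sorted(df[col].dropna().unique(), key=str)
            codes = {v: hex(i) for i, v in enumerate(values)}
            out[col] = df[col].map(codes)
    return out


df = pd.DataFrame({
    "age": [31, 45, 27],
    "city": ["Paris", "Lagos", "Paris"],
    "price": [10.5, 20.0, 15.25],
})
# "price" plays the role of a regression target and is left untouched.
anon = anonymize_sketch(df, columns_to_exclude=["price"])
print(anon)
```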

is_categorical(column)

Determine whether the input column contains categorical (i.e. non-ordinal) observations.

is_discrete(column)

Determine whether the input column contains discrete (as opposed to continuous) observations.
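The anonymize documentation above treats any column that can be cast as a float as ordinal. Assuming is_categorical applies the same rule (this page does not confirm it), a check could look like the following sketch; `is_categorical_sketch` is a hypothetical helper, not the kxy method.

```python
import pandas as pd


def is_categorical_sketch(df, column):
    """Return True when the column cannot be cast to float (non-ordinal)."""
    try:
        df[column].astype(float)
        return False
    except (ValueError, TypeError):
        return True


df = pd.DataFrame({
    "size": ["1.5", "2", "3.25"],     # strings, but castable to float
    "color": ["red", "blue", "red"],  # not castable to float
})
size_is_categorical = is_categorical_sketch(df, "size")
color_is_categorical = is_categorical_sketch(df, "color")
print(size_is_categorical, color_is_categorical)
```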

class kxy.pandas_extension.pre_learning_accessor.PreLearningAccessor(pandas_obj)

Extension of the pandas.DataFrame class with various analytics for pre-learning in supervised learning problems.

This class defines the kxy_pre_learning pandas accessor.

All methods defined in this class are accessible from any DataFrame instance as df.kxy_pre_learning.<method_name>, so long as the kxy python package is imported alongside pandas.

data_valuation(target_column, problem_type=None)

Estimate the highest performance metrics achievable when predicting the target_column using all other columns.

When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not target_column is categorical.

Parameters
• target_column (str) – The name of the column containing true labels.

• problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.

Returns

achievable_performance – The result is a pandas.DataFrame with columns (where applicable):

• 'Achievable Accuracy': The highest classification accuracy that can be achieved by a model using provided inputs to predict the label.

• 'Achievable R^2': The highest $$R^2$$ that can be achieved by a model using provided inputs to predict the label.

• 'Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using provided inputs to predict the label.

• 'Achievable Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a model using provided inputs to predict the label.

Return type

pandas.DataFrame

Theoretical Foundation

Section 1 - Achievable Performance.
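The 'Achievable R^2' and 'Achievable RMSE' columns live on different scales: the former is unitless, the latter is in the units of the target. Under the usual definition $$R^2 = 1 - \text{MSE}/\text{Var}(y)$$, the two are linked by $$\text{RMSE} = \sigma_y \sqrt{1 - R^2}$$. The sketch below assumes that definition; this page does not state the exact formula kxy uses, and `rmse_from_r2` is a hypothetical helper.

```python
import math


def rmse_from_r2(r2, target_std):
    """RMSE implied by an R^2 value, assuming R^2 = 1 - MSE / Var(y)."""
    return target_std * math.sqrt(1.0 - r2)


# A target with standard deviation 2.0 and an R^2 of 0.75.
implied_rmse = rmse_from_r2(0.75, 2.0)
print(implied_rmse)  # 1.0
```

The dependence on the standard deviation of the target is the reason the target column should be excluded from anonymization in regression problems.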

variable_selection(target_column, problem_type=None)

Runs the model-free variable selection analysis.

When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not target_column is categorical.

Parameters
• target_column (str) – The name of the column containing true labels.

• problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.

Returns

result – The result is a pandas.DataFrame with columns (where applicable):

• 'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.

• 'Variable': The column name corresponding to the input variable.

• 'Running Achievable R^2': The highest $$R^2$$ that can be achieved by a regression model using all variables selected so far, including this one.

• 'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.

• 'Running Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a regression model using all variables selected so far, including this one.

Return type

pandas.DataFrame

Theoretical Foundation

Section 2 - Variable Selection Analysis.
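To make the 'Selection Order' and running-metric columns concrete, here is a sketch of a greedy forward selection on synthetic data. It deliberately swaps in the in-sample $$R^2$$ of an ordinary least-squares fit for kxy's model-free estimator, which is not described on this page, and the greedy strategy itself is an assumption. The helper names and the 'Running R^2' column are hypothetical.

```python
import numpy as np
import pandas as pd


def ols_r2(X, y):
    """In-sample R^2 of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    return 1.0 - residual.var() / y.var()


def greedy_selection_sketch(df, target_column):
    y = df[target_column].to_numpy(dtype=float)
    remaining = [c for c in df.columns if c != target_column]
    selected, rows = [], []
    while remaining:
        # Score each candidate jointly with the variables already selected.
        scores = {
            c: ols_r2(df[selected + [c]].to_numpy(dtype=float), y)
            for c in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        rows.append({
            "Selection Order": len(selected),
            "Variable": best,
            "Running R^2": scores[best],
        })
    return pd.DataFrame(rows)


rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "x3": rng.normal(size=n),  # unrelated to the target
})
df["y"] = 3.0 * df["x1"] + 1.0 * df["x2"] + rng.normal(scale=0.5, size=n)
result = greedy_selection_sketch(df, "y")
print(result)
```

Each row reports the metric achieved with all variables selected up to and including that row, which mirrors how the running columns above are meant to be read.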

class kxy.pandas_extension.learning_accessor.LearningAccessor(pandas_obj)

Extension of the pandas.DataFrame class with various analytics for automatically training predictive models.

This class defines the kxy_learning pandas accessor.

All methods defined in this class are accessible from any DataFrame instance as df.kxy_learning.<method_name>, so long as the kxy python package is imported alongside pandas.

class kxy.pandas_extension.post_learning_accessor.PostLearningAccessor(pandas_obj)

Extension of the pandas.DataFrame class with various analytics for post-learning in supervised learning problems.

This class defines the kxy_post_learning pandas accessor.

All methods defined in this class are accessible from any DataFrame instance as df.kxy_post_learning.<method_name>, so long as the kxy python package is imported alongside pandas.

data_driven_improvability(target_column, new_variables, problem_type=None)

Estimate the potential performance boost that a set of new explanatory variables can bring about.

Parameters
• target_column (str) – The name of the column containing true labels.

• new_variables (list) – The names of the columns to use as new explanatory variables.

• problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.

Returns

result – The result is a pandas.DataFrame with columns (where applicable):

• 'Accuracy Boost': The classification accuracy boost that the new explanatory variables can bring about.

• 'R-Squared Boost': The $$R^2$$ boost that the new explanatory variables can bring about.

• 'RMSE Reduction': The reduction in Root Mean Square Error that the new explanatory variables can bring about.

• 'Log-Likelihood Per Sample Boost': The boost in log-likelihood per sample that the new explanatory variables can bring about.

Return type

pandas.DataFrame

Theoretical Foundation

Section 3 - Model Improvability.

See also: kxy.post_learning.improvability.data_driven_improvability
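The idea behind 'R-Squared Boost' and 'RMSE Reduction' can be illustrated by comparing a fit that uses only the existing variables with a fit that also uses the new ones. The sketch below uses ordinary least squares as a stand-in for kxy's estimator, and assumes that every column other than the target and the new variables counts as an existing explanatory variable; both are assumptions, and the helper names are hypothetical.

```python
import numpy as np
import pandas as pd


def ols_fit_metrics(X, y):
    """In-sample (R^2, RMSE) of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    mse = float(np.mean(residual ** 2))
    return 1.0 - mse / float(np.var(y)), mse ** 0.5


def boost_sketch(df, target_column, new_variables):
    y = df[target_column].to_numpy(dtype=float)
    all_inputs = [c for c in df.columns if c != target_column]
    old_inputs = [c for c in all_inputs if c not in new_variables]
    r2_old, rmse_old = ols_fit_metrics(df[old_inputs].to_numpy(dtype=float), y)
    r2_new, rmse_new = ols_fit_metrics(df[all_inputs].to_numpy(dtype=float), y)
    return pd.DataFrame({
        "R-Squared Boost": [r2_new - r2_old],
        "RMSE Reduction": [rmse_old - rmse_new],
    })


rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2.0 * df["x1"] + 2.0 * df["x2"] + rng.normal(scale=0.5, size=n)
boost = boost_sketch(df, "y", new_variables=["x2"])
print(boost)
```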

model_driven_improvability(target_column, prediction_column, problem_type=None)

Estimate the extent to which a trained supervised learner may be improved in a model-driven fashion (i.e. without resorting to additional explanatory variables).

Parameters
• target_column (str) – The name of the column containing true labels.

• prediction_column (str) – The name of the column containing model predictions.

• problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.

Returns

result – The result is a pandas.DataFrame with columns (where applicable):

• 'Lost Accuracy': The amount of classification accuracy that was irreversibly lost when training the supervised learner.

• 'Lost R-Squared': The amount of $$R^2$$ that was irreversibly lost when training the supervised learner.

• 'Lost RMSE': The amount of Root Mean Square Error that was irreversibly lost when training the supervised learner.

• 'Lost Log-Likelihood Per Sample': The amount of true log-likelihood per sample that was irreversibly lost when training the supervised learner.

• 'Residual R-Squared': For regression problems, this is the highest $$R^2$$ that may be achieved when using explanatory variables to predict regression residuals.

• 'Residual RMSE': For regression problems, this is the lowest Root Mean Square Error that may be achieved when using explanatory variables to predict regression residuals.

• 'Residual Log-Likelihood Per Sample': For regression problems, this is the highest log-likelihood per sample that may be achieved when using explanatory variables to predict regression residuals.

Return type

pandas.DataFrame

Theoretical Foundation

Section 3 - Model Improvability.

See also: kxy.post_learning.improvability.model_driven_improvability
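The 'Residual R-Squared' column asks how much of the regression residuals can still be explained by the explanatory variables. The sketch below illustrates the idea on synthetic data with a deliberately weak model, again using ordinary least squares as a stand-in for kxy's estimator (an assumption). The column names `y` and `y_pred` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
df["y"] = 2.0 * df["x1"] + 2.0 * df["x2"] + rng.normal(scale=0.5, size=n)
# A deliberately weak model that ignores x2 entirely.
df["y_pred"] = 2.0 * df["x1"]

# Residuals of the trained model.
residuals = (df["y"] - df["y_pred"]).to_numpy()

# Regress the residuals on the explanatory variables (with an intercept).
X = np.column_stack([np.ones(n), df[["x1", "x2"]].to_numpy()])
coef, *_ = np.linalg.lstsq(X, residuals, rcond=None)
fitted = X @ coef
residual_r2 = 1.0 - np.var(residuals - fitted) / np.var(residuals)
print(residual_r2)
```

A residual $$R^2$$ close to 1 indicates that the model left a large amount of explainable signal in its residuals, so it can be improved without any additional explanatory variables.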

model_explanation(prediction_column, problem_type=None)

Analyzes the variables that a model relies on the most in a brute-force fashion.

The first variable is the variable the model relies on the most. The second variable is the variable that complements the first variable the most in explaining model decisions, and so on.

Running performances should be understood as the performance achievable when trying to guess model predictions using variables with selection order smaller than or equal to that of the row.

When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not prediction_column is categorical.

Parameters
• prediction_column (str) – The name of the column containing model predictions.

• problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.

Returns

result – The result is a pandas.DataFrame with columns (where applicable):

• 'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.

• 'Variable': The column name corresponding to the input variable.

• 'Running Achievable R^2': The highest $$R^2$$ that can be achieved by a regression model using all variables selected so far, including this one.

• 'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.

• 'Running Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a regression model using all variables selected so far, including this one.

Return type

pandas.DataFrame

Theoretical Foundation

Section 2 - Variable Selection Analysis.

class kxy.pandas_extension.finance_accessor.FinanceAccessor(pandas_obj)

Extension of the pandas.DataFrame class with various finance-specific analytics.

This class defines the kxy_finance pandas accessor.

All methods defined in this class are accessible from any DataFrame instance as df.kxy_finance.<method_name>, so long as the kxy python package is imported alongside pandas.