DataFrame Extension Deep Dive
We define a custom kxy pandas accessor below, namely the class Accessor, that extends the pandas DataFrame class with all our analyses, thereby allowing data scientists to tap into the power of the kxy toolkit within the comfort of their favorite data structure.
All methods defined in the Accessor class are accessible from any DataFrame instance as df.kxy.<method_name>, so long as the kxy python package is imported alongside pandas.
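The df.kxy.<method_name> pattern relies on the accessor registration mechanism that pandas exposes. The following is a minimal sketch of that mechanism only, assuming a hypothetical accessor named demo with a hypothetical method n_missing, neither of which is part of kxy. Registering the class when its module is imported is what makes the methods appear on every DataFrame.

```python
import pandas as pd


# "demo" is a hypothetical accessor name used for illustration; it is not part of kxy.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    """Toy accessor that only illustrates the registration mechanism."""

    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor is attached to.
        self._obj = pandas_obj

    def n_missing(self):
        """Return the number of missing values in each column."""
        return self._obj.isna().sum()


df = pd.DataFrame({"x": [1.0, None, 3.0], "y": ["a", "b", None]})

# Once the class above has been registered, any DataFrame exposes df.demo.<method_name>.
missing = df.demo.n_missing()
```

Because registration happens as a side effect of executing the class definition, importing the module that defines the accessor is enough to enable it, which is consistent with the requirement that kxy be imported alongside pandas.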

class kxy.pandas_extension.accessor.Accessor(pandas_obj)
Bases: kxy.pandas_extension.pre_learning_accessor.PreLearningAccessor, kxy.pandas_extension.learning_accessor.LearningAccessor, kxy.pandas_extension.post_learning_accessor.PostLearningAccessor, kxy.pandas_extension.finance_accessor.FinanceAccessor
Extension of the pandas.DataFrame class with the full capabilities of the kxy platform.
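Listing several bases means the combined class exposes the methods of each of them. The sketch below illustrates that composition pattern with hypothetical classes (BaseDemoAccessor, ShapeMixin, ColumnsMixin, CombinedDemoAccessor) that stand in for the kxy classes. It does not reproduce any kxy code.

```python
import pandas as pd


class BaseDemoAccessor:
    """Hypothetical base class that stores the wrapped DataFrame."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj


class ShapeMixin(BaseDemoAccessor):
    """Hypothetical group of methods about the shape of the data."""

    def n_rows(self):
        return len(self._obj)


class ColumnsMixin(BaseDemoAccessor):
    """Hypothetical group of methods about the columns of the data."""

    def column_names(self):
        return list(self._obj.columns)


class CombinedDemoAccessor(ShapeMixin, ColumnsMixin):
    """Inherits the methods of every base, as a class with several bases does."""


combined = CombinedDemoAccessor(pd.DataFrame({"a": [1, 2], "b": [3, 4]}))
```

With this layout, each group of analyses can also be used on its own, while the combined class gathers them behind a single entry point.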

class kxy.pandas_extension.base_accessor.BaseAccessor(pandas_obj)
Base class inherited by our custom accessors.

anonymize(columns_to_exclude=[])
Anonymize the dataframe in a manner that leaves all pre-learning and post-learning analyses (including data valuation, variable selection, model-driven improvability, data-driven improvability and model explanation) invariant.
Any transformation on continuous variables that preserves ranks will not change our pre-learning and post-learning analyses. The same holds for any 1-to-1 transformation on categorical variables.
This implementation replaces ordinal values (i.e. any column that can be cast as a float) with their within-column ranks. For each non-ordinal column, we form the set of all possible values, assign a unique integer index to each value in the set, and systematically replace each occurrence of that value in the dataframe with the hexadecimal code of its associated integer index.
For regression problems, accurate estimation of RMSE-related metrics requires the target column (and the prediction column for post-learning analyses) not to be anonymized.
 Parameters
columns_to_exclude (list (optional)) – List of columns not to anonymize (e.g. target and prediction columns for regression problems).
 Returns
result – The anonymized dataframe.
 Return type
pandas.DataFrame
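The procedure described above can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the kxy implementation: the helper names (can_cast_to_float, anonymize_sketch), the average-rank tie handling, the sorted order used to assign integer indices, and the unprefixed hexadecimal format are all choices made for this example.

```python
import pandas as pd


def can_cast_to_float(column):
    """Treat a column as ordinal when all of its values can be cast to float."""
    try:
        column.astype(float)
        return True
    except (ValueError, TypeError):
        return False


def anonymize_sketch(df, columns_to_exclude=()):
    """Hypothetical re-implementation of the documented anonymization idea."""
    result = df.copy()
    for name in df.columns:
        if name in columns_to_exclude:
            continue  # e.g. target and prediction columns in regression problems
        if can_cast_to_float(df[name]):
            # Ordinal column: keep only within-column ranks (ties get the average rank).
            result[name] = df[name].astype(float).rank()
        else:
            # Non-ordinal column: map each distinct value to the hex code of an integer index.
            # Sorting the values by their string form is an assumption made for determinism.
            values = sorted(df[name].dropna().unique(), key=str)
            codes = {value: format(index, "x") for index, value in enumerate(values)}
            result[name] = df[name].map(codes)
    return result


df = pd.DataFrame({
    "age": [31, 45, 22],
    "city": ["Paris", "Lagos", "Paris"],
    "y": [1.5, 2.5, 0.5],
})
anonymized = anonymize_sketch(df, columns_to_exclude=["y"])
```

In this toy example the age column becomes the ranks 2, 3 and 1, the city column becomes opaque codes, and the excluded column y is left unchanged so that RMSE-related metrics remain expressed in the original units.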

is_categorical(column)
Determine whether the input column contains categorical (i.e. non-ordinal) observations.

is_discrete(column)
Determine whether the input column contains discrete (i.e. as opposed to continuous) observations.


class kxy.pandas_extension.pre_learning_accessor.PreLearningAccessor(pandas_obj)
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various analytics for pre-learning in supervised learning problems.
This class defines the kxy_pre_learning pandas accessor. All its methods are accessible from any DataFrame instance as df.kxy_pre_learning.<method_name>, so long as the kxy python package is imported alongside pandas.
data_valuation(target_column, problem_type=None)
Estimate the highest performance metrics achievable when predicting the target_column using all other columns.
When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not target_column is categorical.
 Parameters
target_column (str) – The name of the column containing true labels.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
 Returns
achievable_performance – The result is a pandas.DataFrame with columns (where applicable):
'Achievable Accuracy': The highest classification accuracy that can be achieved by a model using provided inputs to predict the label.
'Achievable R^2': The highest \(R^2\) that can be achieved by a model using provided inputs to predict the label.
'Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using provided inputs to predict the label.
'Achievable LogLikelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a model using provided inputs to predict the label.
 Return type
pandas.DataFrame
Theoretical Foundation
Section 1 - Achievable Performance.
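The inference applied when problem_type=None can be pictured as follows. This is a sketch under assumptions rather than the rule kxy actually applies: the function name infer_problem_type and the max_classes threshold are hypothetical, and are only meant to show how the column type and the number of distinct values could be combined.

```python
import pandas as pd


def infer_problem_type(df, target_column, max_classes=10):
    """Guess the problem type from the target column (hypothetical helper).

    max_classes is an assumed threshold; the documentation does not state one.
    """
    column = df[target_column]
    try:
        column.astype(float)
    except (ValueError, TypeError):
        # Values that cannot be cast to float are treated as categorical labels.
        return "classification"
    # Numeric targets with few distinct values are treated as class labels.
    if column.nunique() <= max_classes:
        return "classification"
    return "regression"


df = pd.DataFrame({
    "label": ["spam", "ham"] * 10,
    "price": [0.5 * i for i in range(20)],
})
label_type = infer_problem_type(df, "label")
price_type = infer_problem_type(df, "price")
```

Passing problem_type explicitly avoids any reliance on such a heuristic, which can be preferable when a numeric target has only a handful of distinct values.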

variable_selection(target_column, problem_type=None)
Runs the model-free variable selection analysis.
When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not target_column is categorical.
 Parameters
target_column (str) – The name of the column containing true labels.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
 Returns
result – The result is a pandas.DataFrame with columns (where applicable):
'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
'Variable': The column name corresponding to the input variable.
'Running Achievable R^2': The highest \(R^2\) that can be achieved by a model using all variables selected so far, including this one.
'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.
'Running Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using all variables selected so far, including this one.
 Return type
pandas.DataFrame
Theoretical Foundation
Section 2 - Variable Selection Analysis.
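One way to read the running columns is to look for the shortest prefix of the selection order that already captures most of the achievable performance. The sketch below assumes a hand-written table with made-up numbers (not output produced by kxy) and a hypothetical helper smallest_sufficient_subset. Only the column names come from the description above.

```python
import pandas as pd

# Made-up illustrative values laid out like the documented result; not real kxy output.
selection = pd.DataFrame({
    "Selection Order": [1, 2, 3],
    "Variable": ["x2", "x0", "x1"],
    "Running Achievable R^2": [0.60, 0.78, 0.80],
})


def smallest_sufficient_subset(selection, metric="Running Achievable R^2", fraction=0.95):
    """Return the first variables whose running metric reaches a fraction of the final value.

    Hypothetical helper; assumes a metric for which higher is better.
    """
    ordered = selection.sort_values("Selection Order")
    target = fraction * ordered[metric].iloc[-1]
    # First selection order at which the running metric reaches the target.
    cutoff = ordered.loc[ordered[metric] >= target, "Selection Order"].iloc[0]
    return ordered.loc[ordered["Selection Order"] <= cutoff, "Variable"].tolist()


subset = smallest_sufficient_subset(selection)
```

With these toy numbers the first two variables already reach 95% of the final running value, so the third adds little. For 'Running Achievable RMSE', where lower is better, the comparison would need to be reversed.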


class kxy.pandas_extension.learning_accessor.LearningAccessor(pandas_obj)
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various analytics for automatically training predictive models.
This class defines the kxy_learning pandas accessor. All its methods are accessible from any DataFrame instance as df.kxy_learning.<method_name>, so long as the kxy python package is imported alongside pandas.

class kxy.pandas_extension.post_learning_accessor.PostLearningAccessor(pandas_obj)
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various analytics for post-learning in supervised learning problems.
This class defines the kxy_post_learning pandas accessor. All its methods are accessible from any DataFrame instance as df.kxy_post_learning.<method_name>, so long as the kxy python package is imported alongside pandas.
data_driven_improvability(target_column, new_variables, problem_type=None)
Estimate the potential performance boost that a set of new explanatory variables can bring about.
 Parameters
target_column (str) – The name of the column containing true labels.
new_variables (list) – The names of the columns to use as new explanatory variables.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.
 Returns
result – The result is a pandas.DataFrame with columns (where applicable):
'Accuracy Boost': The classification accuracy boost that the new explanatory variables can bring about.
'RSquared Boost': The \(R^2\) boost that the new explanatory variables can bring about.
'RMSE Reduction': The reduction in Root Mean Square Error that the new explanatory variables can bring about.
'LogLikelihood Per Sample Boost': The boost in log-likelihood per sample that the new explanatory variables can bring about.
 Return type
pandas.DataFrame
Theoretical Foundation
Section 3 - Model Improvability.
See also
kxy.post_learning.improvability.data_driven_improvability

model_driven_improvability(target_column, prediction_column, problem_type=None)
Estimate the extent to which a trained supervised learner may be improved in a model-driven fashion (i.e. without resorting to additional explanatory variables).
 Parameters
target_column (str) – The name of the column containing true labels.
prediction_column (str) – The name of the column containing model predictions.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.
 Returns
result – The result is a pandas.DataFrame with columns (where applicable):
'Lost Accuracy': The amount of classification accuracy that was irreversibly lost when training the supervised learner.
'Lost RSquared': The amount of \(R^2\) that was irreversibly lost when training the supervised learner.
'Lost RMSE': The amount of Root Mean Square Error that was irreversibly lost when training the supervised learner.
'Lost LogLikelihood Per Sample': The amount of true log-likelihood per sample that was irreversibly lost when training the supervised learner.
'Residual RSquared': For regression problems, this is the highest \(R^2\) that may be achieved when using explanatory variables to predict regression residuals.
'Residual RMSE': For regression problems, this is the lowest Root Mean Square Error that may be achieved when using explanatory variables to predict regression residuals.
'Residual LogLikelihood Per Sample': For regression problems, this is the highest log-likelihood per sample that may be achieved when using explanatory variables to predict regression residuals.
 Return type
pandas.DataFrame
Theoretical Foundation
Section 3 - Model Improvability.
See also
kxy.post_learning.improvability.model_driven_improvability

model_explanation(prediction_column, problem_type=None)
Analyzes the variables that a model relies on the most in a brute-force fashion.
The first variable is the variable the model relies on the most. The second variable is the variable that complements the first variable the most in explaining model decisions, and so on.
Running performances should be understood as the performance achievable when trying to guess model predictions using variables with selection order smaller than or equal to that of the row.
When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not prediction_column is categorical.
 Parameters
prediction_column (str) – The name of the column containing model predictions.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
 Returns
result – The result is a pandas.DataFrame with columns (where applicable):
'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
'Variable': The column name corresponding to the input variable.
'Running Achievable R^2': The highest \(R^2\) that can be achieved by a model using all variables selected so far, including this one.
'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.
'Running Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using all variables selected so far, including this one.
 Return type
pandas.DataFrame
Theoretical Foundation
Section 2 - Variable Selection Analysis.


class kxy.pandas_extension.finance_accessor.FinanceAccessor(pandas_obj)
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various finance-specific analytics.
This class defines the kxy_finance pandas accessor. All its methods are accessible from any DataFrame instance as df.kxy_finance.<method_name>, so long as the kxy python package is imported alongside pandas.