DataFrame Extension Deep Dive
We define a custom kxy pandas accessor below, namely the class Accessor, that extends the pandas DataFrame class with all our analyses, thereby allowing data scientists to tap into the power of the kxy toolkit within the comfort of their favorite data structure.
All methods defined in the Accessor class are accessible from any DataFrame instance as df.kxy.<method_name>, so long as the kxy Python package is imported alongside pandas.
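To illustrate the mechanism, here is a minimal sketch of how a pandas accessor assembled from several mixin classes can be registered, using the public `pandas.api.extensions.register_dataframe_accessor` decorator. The accessor name `demo` and the classes and methods below are hypothetical stand-ins for illustration; they are not part of the kxy package, whose internals may differ.

```python
import pandas as pd


class BaseDemoAccessor:
    """Hypothetical base class: keeps a reference to the wrapped DataFrame."""

    def __init__(self, pandas_obj):
        self._obj = pandas_obj


class ShapeMixin(BaseDemoAccessor):
    def n_rows(self):
        # Number of rows in the wrapped DataFrame.
        return len(self._obj)


class ColumnMixin(BaseDemoAccessor):
    def numeric_columns(self):
        # Names of the columns holding numeric data.
        return list(self._obj.select_dtypes(include="number").columns)


# Registration runs when this module is imported, which is why an accessor
# only becomes available once its defining package has been imported.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor(ShapeMixin, ColumnMixin):
    """Single accessor exposing the methods of every mixin."""


df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "label": ["a", "b", "a"]})
print(df.demo.n_rows())
print(df.demo.numeric_columns())
```

Because the decorator attaches the accessor to the DataFrame class itself, every DataFrame instance exposes the combined methods under a single namespace.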
- class kxy.pandas_extension.accessor.Accessor(pandas_obj)
Bases: kxy.pandas_extension.pre_learning_accessor.PreLearningAccessor, kxy.pandas_extension.learning_accessor.LearningAccessor, kxy.pandas_extension.post_learning_accessor.PostLearningAccessor, kxy.pandas_extension.finance_accessor.FinanceAccessor
Extension of the pandas.DataFrame class with the full capabilities of the kxy platform.
- class kxy.pandas_extension.base_accessor.BaseAccessor(pandas_obj)
Base class inherited by our custom accessors.
- anonymize(columns_to_exclude=[])
Anonymize the dataframe in a manner that leaves all pre-learning and post-learning analyses (including data valuation, variable selection, model-driven improvability, data-driven improvability and model explanation) invariant.
Any transformation on continuous variables that preserves ranks will not change our pre-learning and post-learning analyses. The same holds for any 1-to-1 transformation on categorical variables.
This implementation replaces ordinal values (i.e. any column that can be cast as a float) with their within-column Gaussian score. For each non-ordinal column, we form the set of all possible values, we assign a unique integer index to each value in the set, and we systematically replace said value appearing in the dataframe by the hexadecimal code of its associated integer index.
For regression problems, accurate estimation of RMSE-related metrics requires the target column (and the prediction column for post-learning analyses) not to be anonymized.
- Parameters
columns_to_exclude (list (optional)) – List of columns not to anonymize (e.g. target and prediction columns for regression problems).
- Returns
result – The anonymized dataframe.
- Return type
pandas.DataFrame
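The two transformations described for anonymize can be sketched in plain Python. This is a minimal illustration, not the kxy implementation: it assumes that "Gaussian score" means the standard normal quantile of each value's rank divided by n + 1, and it indexes categorical values in sorted order. The function names `gaussian_scores` and `hex_codes` are hypothetical.

```python
from statistics import NormalDist


def gaussian_scores(values):
    """Map each value to the normal quantile of its average rank / (n + 1)."""
    n = len(values)
    positions = {}
    # Collect the 1-based sorted positions of every distinct value so that
    # tied values share the same (average) rank.
    for pos, v in enumerate(sorted(values), start=1):
        positions.setdefault(v, []).append(pos)
    normal = NormalDist()
    return [
        normal.inv_cdf((sum(positions[v]) / len(positions[v])) / (n + 1))
        for v in values
    ]


def hex_codes(values):
    """Replace each category with the hexadecimal code of its integer index."""
    index = {v: i for i, v in enumerate(sorted(set(values)))}
    return [format(index[v], "x") for v in values]


values = [3.2, -1.0, 10.5, 0.0]
scores = gaussian_scores(values)
codes = hex_codes(["red", "blue", "red", "green"])
print(scores)
print(codes)
```

The scores preserve the ordering of the original values, and the mapping from categories to codes is one-to-one, which are the two properties the description relies on.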
- is_categorical(column)
Determine whether the input column contains categorical (i.e. non-ordinal) observations.
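The anonymize description treats any column that can be cast as a float as ordinal. Under that reading, a categorical check could look like the sketch below. The function `looks_categorical` is a hypothetical stand-in operating on a plain list; the actual kxy method may apply additional rules.

```python
def looks_categorical(values):
    """Return True when at least one value cannot be cast as a float."""
    try:
        for v in values:
            float(v)
    except (TypeError, ValueError):
        return True
    return False


print(looks_categorical(["1", "2.5", 3]))
print(looks_categorical(["red", "1"]))
```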
- is_discrete(column)
Determine whether the input column contains discrete (as opposed to continuous) observations.
- class kxy.pandas_extension.pre_learning_accessor.PreLearningAccessor(pandas_obj)
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various analytics for pre-learning in supervised learning problems.
This class defines the kxy_pre_learning pandas accessor.
All its methods are accessible from any DataFrame instance as df.kxy_pre_learning.<method_name>, so long as the kxy Python package is imported alongside pandas.
- data_valuation(target_column, problem_type=None, anonymize=False)
Estimate the highest performance metrics achievable when predicting the target_column using all other columns.
When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not target_column is categorical.
- Parameters
target_column (str) – The name of the column containing true labels.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
anonymize (bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost).
- Returns
achievable_performance – The result is a pandas.DataFrame with columns (where applicable):
'Achievable Accuracy': The highest classification accuracy that can be achieved by a model using provided inputs to predict the label.
'Achievable R^2': The highest \(R^2\) that can be achieved by a model using provided inputs to predict the label.
'Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using provided inputs to predict the label.
'Achievable Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a model using provided inputs to predict the label.
- Return type
pandas.DataFrame
Theoretical Foundation
Section 1 - Achievable Performance.
- variable_selection(target_column, problem_type=None, anonymize=False)
Runs the model-free variable selection analysis.
When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not target_column is categorical.
- Parameters
target_column (str) – The name of the column containing true labels.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
anonymize (bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost).
- Returns
result – The result is a pandas.DataFrame with columns (where applicable):
'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
'Variable': The column name corresponding to the input variable.
'Running Achievable R^2': The highest \(R^2\) that can be achieved by a model using all variables selected so far, including this one.
'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.
'Running Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using all variables selected so far, including this one.
- Return type
pandas.DataFrame
Theoretical Foundation
Section 2 - Variable Selection Analysis.
- class kxy.pandas_extension.learning_accessor.LearningAccessor(pandas_obj)
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various analytics for automatically training predictive models.
This class defines the kxy_learning pandas accessor.
All its methods are accessible from any DataFrame instance as df.kxy_learning.<method_name>, so long as the kxy Python package is imported alongside pandas.
- class kxy.pandas_extension.post_learning_accessor.PostLearningAccessor(pandas_obj)
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various analytics for post-learning in supervised learning problems.
This class defines the kxy_post_learning pandas accessor.
All its methods are accessible from any DataFrame instance as df.kxy_post_learning.<method_name>, so long as the kxy Python package is imported alongside pandas.
- data_driven_improvability(target_column, new_variables, problem_type=None, anonymize=False)
Estimate the potential performance boost that a set of new explanatory variables can bring about.
- Parameters
target_column (str) – The name of the column containing true labels.
new_variables (list) – The names of the columns to use as new explanatory variables.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.
anonymize (bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost).
- Returns
result – The result is a pandas.DataFrame with columns (where applicable):
'Accuracy Boost': The classification accuracy boost that the new explanatory variables can bring about.
'R-Squared Boost': The \(R^2\) boost that the new explanatory variables can bring about.
'RMSE Reduction': The reduction in Root Mean Square Error that the new explanatory variables can bring about.
'Log-Likelihood Per Sample Boost': The boost in log-likelihood per sample that the new explanatory variables can bring about.
- Return type
pandas.DataFrame
Theoretical Foundation
Section 3 - Model Improvability.
See also
kxy.post_learning.improvability.data_driven_improvability
- model_driven_improvability(target_column, prediction_column, problem_type=None, anonymize=False)
Estimate the extent to which a trained supervised learner may be improved in a model-driven fashion (i.e. without resorting to additional explanatory variables).
- Parameters
target_column (str) – The name of the column containing true labels.
prediction_column (str) – The name of the column containing model predictions.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from whether or not target_column is categorical.
anonymize (bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost).
- Returns
result – The result is a pandas.DataFrame with columns (where applicable):
'Lost Accuracy': The amount of classification accuracy that was irreversibly lost when training the supervised learner.
'Lost R-Squared': The amount of \(R^2\) that was irreversibly lost when training the supervised learner.
'Lost RMSE': The amount of Root Mean Square Error that was irreversibly lost when training the supervised learner.
'Lost Log-Likelihood Per Sample': The amount of true log-likelihood per sample that was irreversibly lost when training the supervised learner.
'Residual R-Squared': For regression problems, this is the highest \(R^2\) that may be achieved when using explanatory variables to predict regression residuals.
'Residual RMSE': For regression problems, this is the lowest Root Mean Square Error that may be achieved when using explanatory variables to predict regression residuals.
'Residual Log-Likelihood Per Sample': For regression problems, this is the highest log-likelihood per sample that may be achieved when using explanatory variables to predict regression residuals.
- Return type
pandas.DataFrame
Theoretical Foundation
Section 3 - Model Improvability.
See also
kxy.post_learning.improvability.model_driven_improvability
- model_explanation(prediction_column, problem_type=None, anonymize=False)
Analyzes the variables that a model relies on the most in a brute-force fashion.
The first variable is the variable the model relies on the most. The second variable is the variable that complements the first variable the most in explaining model decisions, and so on.
Running performances should be understood as the performance achievable when trying to guess model predictions using variables with selection order smaller than or equal to that of the row.
When problem_type=None, the nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not prediction_column is categorical.
- Parameters
prediction_column (str) – The name of the column containing model predictions.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
anonymize (bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost).
- Returns
result – The result is a pandas.DataFrame with columns (where applicable):
'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
'Variable': The column name corresponding to the input variable.
'Running Achievable R^2': The highest \(R^2\) that can be achieved by a model using all variables selected so far, including this one.
'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.
'Running Achievable RMSE': The lowest Root Mean Square Error that can be achieved by a model using all variables selected so far, including this one.
- Return type
pandas.DataFrame
Theoretical Foundation
Section 2 - Variable Selection Analysis.
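The ordering described for model_explanation (the most relied-upon variable first, then the variable that best complements it, and so on) reads like a greedy forward selection. The sketch below shows that pattern under stated assumptions: the function `greedy_selection`, the caller-supplied `score` function, and the toy coverage data are all hypothetical, and the procedure kxy actually runs may differ.

```python
def greedy_selection(variables, score):
    """Pick, at each step, the variable that most improves score(selected)."""
    selected, rows = [], []
    remaining = list(variables)
    while remaining:
        # Choose the candidate giving the best score when added to the
        # variables already selected.
        best = max(remaining, key=lambda v: score(selected + [v]))
        selected.append(best)
        remaining.remove(best)
        rows.append(
            {
                "Selection Order": len(selected),
                "Variable": best,
                "Running Score": score(selected),
            }
        )
    return rows


# Toy scoring rule (an assumption for illustration): each variable "covers"
# some facts, and a subset scores the number of distinct facts it covers.
coverage = {"x1": {"a", "b", "c"}, "x2": {"a", "b"}, "x3": {"d"}}


def toy_score(subset):
    covered = set()
    for name in subset:
        covered |= coverage[name]
    return len(covered)


rows = greedy_selection(["x1", "x2", "x3"], toy_score)
for row in rows:
    print(row)
```

In this toy run, x3 is picked before x2 even though x2 covers more facts on its own, because x2 adds nothing beyond what x1 already provides. This is the sense in which the second variable is the one that best complements the first.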
- class kxy.pandas_extension.finance_accessor.FinanceAccessor(pandas_obj)
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various finance-specific analytics.
This class defines the kxy_finance pandas accessor.
All its methods are accessible from any DataFrame instance as df.kxy_finance.<method_name>, so long as the kxy Python package is imported alongside pandas.
- information_adjusted_beta(market_column, asset_column, anonymize=False)
Estimate the information-adjusted beta of an asset return \(r\) relative to the market return \(r_m\): \(\text{IA-}\beta := \text{IA-Corr}\left(r, r_m \right) \sqrt{\frac{\text{Var}(r)}{\text{Var}(r_m)}}\), where \(\text{IA-Corr}\left(r, r_m \right) := \text{sgn}\left(\text{Corr}\left(r, r_m \right) \right) \left[1 - e^{-2I(r, r_m)} \right]\) denotes the information-adjusted correlation coefficient, with \(\text{sgn}\left(\text{Corr}\left(r, r_m \right) \right)\) the sign of the Pearson correlation coefficient.
Unlike the traditional beta coefficient, namely \(\beta := \text{Corr}\left(r, r_m \right) \sqrt{\frac{\text{Var}(r)}{\text{Var}(r_m)}}\), that only captures linear relations between market and asset returns, and that is 0 if and only if the two are decorrelated, \(\text{IA-}\beta\) captures any relationship between asset return and market return, linear or nonlinear, and is 0 if and only if the two variables are statistically independent.
- Parameters
market_column (str) – The name of the column containing market returns.
asset_column (str) – The name of the column containing asset returns.
anonymize (bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost).
- Returns
result – The information-adjusted beta coefficient.
- Return type
float
- information_adjusted_correlation(market_column, asset_column, anonymize=False)
Estimate the information-adjusted correlation between an asset return \(r\) and the market return \(r_m\): \(\text{IA-Corr}\left(r, r_m \right) := \text{sgn}\left(\text{Corr}\left(r, r_m \right) \right) \left[1 - e^{-2I(r, r_m)} \right]\), where \(\text{sgn}\left(\text{Corr}\left(r, r_m \right) \right)\) is the sign of the Pearson correlation coefficient.
Unlike Pearson’s correlation coefficient, which is 0 if and only if asset return and market return are decorrelated (i.e. they exhibit no linear relation), information-adjusted correlation is 0 if and only if market and asset returns are statistically independent (i.e. they exhibit no relation, linear or nonlinear).
- Parameters
market_column (str) – The name of the column containing market returns.
asset_column (str) – The name of the column containing asset returns.
anonymize (bool) – When set to true, your explanatory variables will never be shared with KXY (at no performance cost).
- Returns
result – The information-adjusted correlation.
- Return type
float
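As a worked illustration, the two formulas above can be evaluated directly once the mutual information \(I(r, r_m)\) is known. The sketch below transcribes them as written and takes the mutual information as an input, assumed to be expressed in nats; estimating it from data is the part the kxy toolkit handles and is not reproduced here. The function and argument names are hypothetical.

```python
import math


def information_adjusted_correlation(pearson_corr, mutual_information):
    """sgn(Corr(r, r_m)) * [1 - exp(-2 I(r, r_m))], as stated above."""
    sign = (pearson_corr > 0) - (pearson_corr < 0)
    return sign * (1.0 - math.exp(-2.0 * mutual_information))


def information_adjusted_beta(pearson_corr, mutual_information,
                              asset_var, market_var):
    """IA-Corr(r, r_m) * sqrt(Var(r) / Var(r_m)), as stated above."""
    ia_corr = information_adjusted_correlation(pearson_corr, mutual_information)
    return ia_corr * math.sqrt(asset_var / market_var)


# Zero mutual information yields a coefficient of zero.
print(information_adjusted_correlation(0.3, 0.0))
# A negative Pearson correlation only contributes its sign.
print(information_adjusted_correlation(-0.3, 0.5))
print(information_adjusted_beta(-0.3, 0.5, asset_var=4.0, market_var=1.0))
```

Note that the magnitude of the Pearson correlation plays no role here: only its sign is used, and the strength of the relationship comes entirely from the mutual information.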