Pandas Extension

We define a custom kxy pandas accessor below, namely the Accessor class, which extends the pandas DataFrame class with all our analyses, allowing data scientists to tap into the power of the kxy toolkit from within the comfort of their favorite data structure.

All methods defined in the Accessor class are accessible from any DataFrame instance as df.kxy.<method_name>, so long as the kxy python package is imported alongside pandas.
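The accessor mechanism itself is standard pandas machinery: kxy registers its classes through pandas' accessor-registration API. A minimal sketch of the pattern, using a hypothetical demo accessor (not part of kxy) rather than the real one:

```python
import pandas as pd

# kxy hooks into DataFrame through pandas' extension mechanism; the sketch
# below registers a toy "demo" accessor the same way.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj  # the DataFrame on which df.demo was accessed

    def n_inputs(self, label_column):
        """Number of columns usable as inputs, excluding the label column."""
        return len([c for c in self._obj.columns if c != label_column])

df = pd.DataFrame({"x1": [1.0, 2.0], "x2": [0.5, 1.5], "y": [0, 1]})
assert df.demo.n_inputs("y") == 2  # every DataFrame now exposes df.demo.<method>
```

Once the module performing the registration is imported, the accessor is available on every DataFrame instance, which is why importing kxy alongside pandas is all that is required.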

class kxy.pandas_extension.accessor.Accessor(pandas_obj)

Bases: kxy.pandas_extension.pre_learning_accessor.PreLearningAccessor, kxy.pandas_extension.finance_accessor.FinanceAccessor, kxy.pandas_extension.learning_accessor.LearningAccessor, kxy.pandas_extension.post_learning_accessor.PostLearningAccessor

Extension of the pandas.DataFrame class with the full capabilities of the kxy platform.

class kxy.pandas_extension.base_accessor.BaseAccessor(pandas_obj)

Base class inherited by our custom accessors.

corr(columns=(), method='information-adjusted', min_periods=1, p=0, p_ic='hqic')

Calculates the auto-correlation matrix of all columns, or of the specified subset.

Parameters:
  • columns (set, optional) – The set of columns to use. If not provided, all columns are used.
  • method (str, optional) – Which method to use to calculate the auto-correlation matrix. Supported values are ‘information-adjusted’ (the default) and all ‘method’ values of pandas.DataFrame.corr.
  • p (int, optional) – The number of auto-correlation lags to use as empirical evidence in the maximum-entropy problem. The default value is 0, which corresponds to assuming rows are i.i.d. Values other than 0 are only supported in the robust-pearson method. When p is None, it is inferred from the sample.
  • min_periods (int, optional) – Only used when method is not ‘information-adjusted’. See the documentation of pandas.DataFrame.corr.
  • p_ic (str) – The criterion used to learn the optimal value of p (by fitting a VAR(p) model) when p=None. Should be one of ‘hqic’ (Hannan-Quinn Information Criterion), ‘aic’ (Akaike Information Criterion), ‘bic’ (Bayes Information Criterion) and ‘t-stat’ (based on last lag). Same as the ‘ic’ parameter of statsmodels.tsa.api.VAR.
Returns:

c – The auto-correlation matrix.

Return type:

pandas.DataFrame
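Per the method parameter above, values other than 'information-adjusted' are the familiar pandas.DataFrame.corr estimators and can be computed without the kxy backend. A quick contrast showing why rank-based (Spearman) measures differ from Pearson under a nonlinear but monotone link:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=500)
# y is a monotone but nonlinear function of x.
df = pd.DataFrame({"x": x, "y": np.exp(x)})

pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]
# Spearman works on ranks, so it captures the monotone dependence exactly,
# while Pearson understates it.
assert spearman > pearson
```

The 'information-adjusted' method goes further still, using the kxy backend to account for arbitrary nonlinearities rather than only monotone ones.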

is_categorical(column)

Determines whether the input column contains categorical observations.

is_discrete(column)

Determines whether the input column contains discrete observations.
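For intuition, a plausible heuristic for this kind of type inference (illustrative only; kxy's actual logic may differ) treats non-numeric columns, and numeric columns with few distinct values relative to the sample size, as discrete:

```python
import pandas as pd

def looks_discrete(column: pd.Series, max_distinct_ratio: float = 0.05) -> bool:
    """Illustrative heuristic, not kxy's implementation: a column is treated
    as discrete when it is non-numeric, or when it is numeric with few
    distinct values relative to the sample size."""
    if not pd.api.types.is_numeric_dtype(column):
        return True
    n = len(column)
    return column.nunique() <= max(2, int(max_distinct_ratio * n))

s_continuous = pd.Series(range(1000)) / 7.0   # 1000 distinct float values
s_discrete = pd.Series([0, 1] * 500)          # binary column
assert not looks_discrete(s_continuous)
assert looks_discrete(s_discrete)
```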

class kxy.pandas_extension.pre_learning_accessor.PreLearningAccessor(pandas_obj)

Bases: kxy.pandas_extension.base_accessor.BaseAccessor

Extension of the pandas.DataFrame class with various pre-learning analytics for supervised learning problems.

This class defines the kxy_pre_learning pandas accessor.

All the methods it defines are accessible from any DataFrame instance as df.kxy_pre_learning.<method_name>, so long as the kxy python package is imported alongside pandas.

achievable_performance_analysis(label_column, input_columns=(), space='dual', categorical_encoding='two-split', problem=None)

Runs the achievable performance analysis on a supervised learning problem.

The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not label_column is categorical.

Parameters:
  • label_column (str) – The name of the column containing true labels.
  • input_columns (set) – List of columns to use as inputs. When an empty set/list is provided, all columns but label_column are used as inputs.
  • space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
  • categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
  • problem (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
Returns:

res – res.data is a pandas.DataFrame with columns (where applicable):

  • 'Achievable R^2': The highest \(R^2\) that can be achieved by a model using provided inputs to predict the label.
  • 'Achievable Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a model using provided inputs to predict the label.
  • 'Achievable Accuracy': The highest classification accuracy that can be achieved by a model using provided inputs to predict the label.

Return type:

pandas.Styler

Theoretical Foundation

Section 1 - Achievable Performance.
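These achievable bounds are driven by the mutual information \(m\) between the inputs and the label. A natural way to relate nats to the linear scale (the same \(1-e^{-2m}\) map the bias method below uses) is sketched here, treating the mutual-information value as given; estimating \(m\) itself is what the kxy platform does:

```python
import math

def achievable_r2(mutual_information_nats: float) -> float:
    """Map mutual information I(y; x), in nats, to an R^2-style linear
    scale via R^2 = 1 - exp(-2 I). Sketch of the relation; the kxy
    platform estimates I from the data."""
    return 1.0 - math.exp(-2.0 * mutual_information_nats)

assert achievable_r2(0.0) == 0.0            # independent inputs: nothing to learn
assert abs(achievable_r2(0.5) - (1 - math.exp(-1))) < 1e-12
assert achievable_r2(10.0) > 0.999          # near-deterministic relationship
```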

auto_predictability(columns=(), space='primal', robust=True, p=None, p_ic='hqic')

Estimates the measure of auto-predictability of the time series corresponding to the input columns.

\[
\begin{aligned}
\mathbb{PR}\left(\{x_t\}\right) :&= h\left(x_*\right) - h\left(\{x_t\}\right) \\
&= h\left(u_{x_*}\right) - h\left(\{u_{x_t}\}\right).
\end{aligned}
\]
Parameters:
  • columns (list) – The input columns.
  • space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
  • p (int, optional) – The number of auto-correlation lags to use as empirical evidence in the maximum-entropy problem. When p is None (the default), it is inferred from the sample.
  • p_ic (str) – The criterion used to learn the optimal value of p (by fitting a VAR(p) model) when p=None. Should be one of ‘hqic’ (Hannan-Quinn Information Criterion), ‘aic’ (Akaike Information Criterion), ‘bic’ (Bayes Information Criterion) and ‘t-stat’ (based on last lag). Same as the ‘ic’ parameter of statsmodels.tsa.api.VAR.
Returns:

The measure of auto-predictability in nats/period.

Return type:

float

Note

The time series is assumed to have a fixed period, and the index is assumed to be sorted.
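As a closed-form sanity check (an illustration only; the kxy estimator itself is model-free): for a stationary Gaussian AR(1) process \(x_t = \rho x_{t-1} + \epsilon_t\), the difference between the marginal entropy and the entropy rate evaluates to \(-\frac{1}{2}\log\left(1-\rho^2\right)\) nats/period, which vanishes for white noise:

```python
import math

def gaussian_ar1_auto_predictability(rho: float) -> float:
    """h(x_*) - h({x_t}) for a stationary Gaussian AR(1) with coefficient
    rho, in nats per period. Closed-form illustration of the quantity
    auto_predictability estimates without assuming a model."""
    return -0.5 * math.log(1.0 - rho ** 2)

assert gaussian_ar1_auto_predictability(0.0) == 0.0   # i.i.d.: unpredictable
# Stronger serial dependence means more predictability.
assert gaussian_ar1_auto_predictability(0.9) > gaussian_ar1_auto_predictability(0.5)
```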

dataset_valuation(label_column, existing_input_columns, new_input_columns, space='dual', categorical_encoding='two-split', problem=None)

Quantifies the additional performance that can be brought about by adding a set of new variables, as the difference between achievable performance with and without the new dataset.

Parameters:
  • label_column (str) – The name of the column containing the true labels of the supervised learning problem.
  • existing_input_columns (set) – List of columns corresponding to existing inputs/variables.
  • new_input_columns (set) – List of columns corresponding to the new dataset.
  • space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
  • categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
  • problem (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
Returns:

res – res.data is a pandas.DataFrame with columns (where applicable):

  • 'Increased Achievable R^2': The highest \(R^2\) increase that can result from adding the new dataset.
  • 'Increased Achievable Log-Likelihood Per Sample': The highest increase in the true log-likelihood per sample that can result from adding the new dataset.
  • 'Increased Achievable Accuracy': The highest increase in the classification accuracy that can result from adding the new dataset.

Return type:

pandas.Styler

Theoretical Foundation

Section 1 - Achievable Performance.
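In terms of the quantities above, the valuation is the difference between the achievable performance with and without the candidate columns. With achievable \(R^2\) written as \(1-e^{-2m}\), a sketch using hypothetical mutual-information values (in practice kxy estimates these from the data):

```python
import math

def r2_from_mi(mi_nats: float) -> float:
    """R^2 = 1 - exp(-2 I), mapping mutual information in nats to R^2."""
    return 1.0 - math.exp(-2.0 * mi_nats)

# Hypothetical mutual-information estimates (nats), for illustration only.
mi_existing = 0.30            # I(y; existing inputs)
mi_existing_plus_new = 0.45   # I(y; existing + new inputs)

increase = r2_from_mi(mi_existing_plus_new) - r2_from_mi(mi_existing)
# Mutual information cannot decrease when inputs are added, so the
# achievable-R^2 increase is always non-negative.
assert mi_existing_plus_new >= mi_existing
assert increase > 0.0
```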

variable_selection_analysis(label_column, input_columns=(), space='dual', categorical_encoding='two-split', problem=None)

Runs the variable selection analysis on a supervised learning problem.

The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not label_column is categorical.

Parameters:
  • label_column (str) – The name of the column containing true labels.
  • input_columns (set) – List of columns to use as inputs. When an empty set/list is provided, all columns but label_column are used as inputs.
  • space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
  • categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
  • problem (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
Returns:

res

res.data is a pandas.DataFrame with columns (where applicable):

  • 'Variable': The column name corresponding to the input variable.
  • 'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
  • 'Univariate Achievable R^2': The highest \(R^2\) that can be achieved by a model solely using this variable.
  • 'Maximum Marginal R^2 Increase': The highest amount by which the \(R^2\) can be increased as a result of adding this variable in the variable selection scheme.
  • 'Running Achievable R^2': The highest \(R^2\) that can be achieved by a model using all variables selected so far, including this one.
  • 'Univariate Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model solely using this variable.
  • 'Maximum Marginal Accuracy Increase': The highest amount by which the classification accuracy can be increased as a result of adding this variable in the variable selection scheme.
  • 'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.
  • 'Conditional Mutual Information (nats)': The mutual information between this variable and the label, conditional on all variables previously selected.
  • 'Running Mutual Information (nats)': The mutual information between all variables selected so far, including this one, and the label.
  • 'Univariate Achievable True Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a model solely using this variable.
  • 'Maximum Marginal True Log-Likelihood Per Sample Increase': The highest amount by which the true log-likelihood per sample can be increased as a result of adding this variable in the variable selection scheme.
  • 'Running Achievable True Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a model using all variables selected so far, including this one.
  • 'Maximum Marginal True Log-Likelihood Increase Per Sample': The highest amount by which the true log-likelihood per sample can increase as a result of adding this variable.
  • 'Running Maximum Log-Likelihood Increase Per Sample': The highest amount by which the true log-likelihood per sample can increase (over the log-likelihood of the naive strategy consisting of predicting the mode of \(y\)) as a result of using all variables selected so far, including this one.

Return type:

pandas.Styler

Theoretical Foundation

Section 2 - Variable Selection Analysis.
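The 'Selection Order' column comes from a greedy forward scheme: at each step, pick the variable with the largest conditional contribution given what was already selected. A toy sketch of that loop, with a hypothetical running-score table standing in for kxy's running mutual-information estimates:

```python
# Hypothetical table mapping a set of selected variables to I(y; selection),
# in nats; in practice these values are estimated by the kxy backend.
running_mi = {
    frozenset(): 0.0,
    frozenset({"a"}): 0.40, frozenset({"b"}): 0.35, frozenset({"c"}): 0.10,
    frozenset({"a", "b"}): 0.50, frozenset({"a", "c"}): 0.45,
    frozenset({"b", "c"}): 0.42, frozenset({"a", "b", "c"}): 0.52,
}

def greedy_selection(variables, score):
    """Order variables by largest marginal score increase at each step."""
    selected, order = frozenset(), []
    for _ in range(len(variables)):
        best = max((v for v in variables if v not in selected),
                   key=lambda v: score[selected | {v}] - score[selected])
        order.append(best)
        selected = selected | {best}
    return order

# 'a' has the largest univariate score, then 'b' adds the most given 'a'.
assert greedy_selection({"a", "b", "c"}, running_mi) == ["a", "b", "c"]
```

The per-step score differences are exactly the 'Conditional Mutual Information (nats)' column, and the cumulative scores are the 'Running Mutual Information (nats)' column.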

class kxy.pandas_extension.learning_accessor.LearningAccessor(pandas_obj)

Bases: kxy.pandas_extension.base_accessor.BaseAccessor

Extension of the pandas.DataFrame class with various analytics for automatically training predictive models.

This class defines the kxy_learning pandas accessor.

All the methods it defines are accessible from any DataFrame instance as df.kxy_learning.<method_name>, so long as the kxy python package is imported alongside pandas.

class kxy.pandas_extension.post_learning_accessor.PostLearningAccessor(pandas_obj)

Bases: kxy.pandas_extension.base_accessor.BaseAccessor

Extension of the pandas.DataFrame class with various post-learning analytics for supervised learning problems.

This class defines the kxy_post_learning pandas accessor.

All the methods it defines are accessible from any DataFrame instance as df.kxy_post_learning.<method_name>, so long as the kxy python package is imported alongside pandas.

bias(bias_source_column, model_prediction_column, linear_scale=True, categorical_encoding='two-split', problem=None)

Quantifies the bias in a supervised learning model as the mutual information between a possible cause and model predictions.

The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not model_prediction_column is categorical.

Parameters:
  • bias_source_column (str) – The name of the column containing values of the bias factor (e.g. age, gender, etc.)
  • model_prediction_column (str) – The name of the column containing predicted labels associated to the values of bias_source_column.
  • linear_scale (bool) – Whether the bias should be returned in the linear/correlation scale or in the mutual information scale (in nats).
  • categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
  • problem (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
Returns:

\(1-e^{-2m}\) when linear_scale=True, or the mutual information \(m\) (in nats) otherwise.

Return type:

float

Theoretical Foundation

Section b) Quantifying Bias in Models.

model_explanation_analysis(model_prediction_column, input_columns=(), space='dual', categorical_encoding='two-split', problem=None)

Runs the model explanation analysis on a trained supervised learning model.

The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not model_prediction_column is categorical.

Parameters:
  • model_prediction_column (str) – The name of the column containing predicted labels.
  • input_columns (set) – List of columns to use as inputs. When an empty set/list is provided, all columns but model_prediction_column are used as inputs.
  • space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
  • categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
  • problem (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
Returns:

res

res.data is a pandas.DataFrame with columns:

  • 'Variable': The column name corresponding to the input variable.
  • 'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
  • 'Univariate Explained R^2': The \(R^2\) between predicted labels and this variable.
  • 'Running Explained R^2': The \(R^2\) between predicted labels and all variables selected so far, including this one.
  • 'Marginal Explained R^2': The increase in \(R^2\) between predicted labels and all variables selected so far that is due to adding this variable in the selection scheme.

Return type:

pandas.Styler

Theoretical Foundation

Section a) Model Explanation.
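The three \(R^2\) columns are linked: each marginal value is the increase in the running value over the previous step, so the marginal contributions sum to the final running explained \(R^2\). A quick sketch with hypothetical values:

```python
# Hypothetical running explained R^2 after each selection step.
running = [0.42, 0.55, 0.58]

# Marginal explained R^2: the increase in running R^2 due to each variable.
# The first variable's marginal contribution is its univariate value.
marginal = [running[0]] + [b - a for a, b in zip(running, running[1:])]

# The marginal contributions telescope back to the final running value.
assert abs(sum(marginal) - running[-1]) < 1e-12
```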

model_improvability_analysis(label_column, model_prediction_column, input_columns=(), space='dual', categorical_encoding='two-split', problem=None)

Runs the model improvability analysis on a trained supervised learning model.

The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not label_column is categorical.

Parameters:
  • label_column (str) – The name of the column containing true labels.
  • model_prediction_column (str) – The name of the column containing labels predicted by the model.
  • input_columns (set) – List of columns to use as inputs. When an empty set/list is provided, all columns but model_prediction_column and label_column are used as inputs.
  • space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
  • categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
  • problem (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
Returns:

res

res.data is a pandas.Dataframe with columns (where applicable):

  • 'Leftover R^2': The amount by which the trained model’s \(R^2\) can still be increased without resorting to additional inputs, simply through better modeling.
  • 'Leftover Log-Likelihood Per Sample': The amount by which the trained model’s true log-likelihood per sample can still be increased without resorting to additional inputs, simply through better modeling.
  • 'Leftover Accuracy': The amount by which the trained model’s classification accuracy can still be increased without resorting to additional inputs, simply through better modeling.

Return type:

pandas.Styler

Theoretical Foundation

Section 3 - Model Improvability.
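Conceptually, leftover performance is the gap between the achievable performance computed pre-learning and the performance the trained model actually attains. A sketch with hypothetical values:

```python
# Hypothetical values for illustration; in practice both sides are
# estimated by the kxy platform from the data and the model's predictions.
best_achievable_r2 = 0.72   # highest R^2 any model could reach with these inputs
trained_model_r2 = 0.61     # R^2 the trained model actually achieves

leftover_r2 = best_achievable_r2 - trained_model_r2
# A leftover near 0 means the model has essentially exhausted its inputs;
# a large leftover means better modeling (not more data) can still help.
assert abs(leftover_r2 - 0.11) < 1e-9
```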

class kxy.pandas_extension.finance_accessor.FinanceAccessor(pandas_obj)

Bases: kxy.pandas_extension.base_accessor.BaseAccessor

Extension of the pandas.DataFrame class with various analytics for finance and asset management problems.

This class defines the kxy_finance pandas accessor.

All the methods it defines are accessible from any DataFrame instance as df.kxy_finance.<method_name>, so long as the kxy python package is imported alongside pandas.

beta(market_returns_column, asset_returns_columns=(), risk_free_column=None, method='information-adjusted', p=0, p_ic='hqic')

Calculates the beta of a portfolio/asset (whose returns are provided in asset_returns_columns) with respect to the market (whose returns are provided in market_returns_column), using a variety of estimation methods, including the standard OLS/Pearson method and information-theoretic alternatives that aim to account for nonlinearities and memory in asset returns.

Parameters:
  • asset_returns_columns (str or list of str) – The name(s) of the column(s) to use for portfolio/asset returns.
  • market_returns_column (str) – The name of the column to use for market returns.
  • method (str, optional) – One of ‘information-adjusted’, ‘robust-pearson’, ‘spearman’, or ‘pearson’. This is the method to use to estimate the correlation between portfolio/asset returns and market returns.
  • p (int, optional) – The number of auto-correlation lags to use as empirical evidence in the maximum-entropy problem. The default value is 0, which corresponds to assuming rows are i.i.d. Values other than 0 are only supported in the robust-pearson method. When p is None, it is inferred from the sample.
  • p_ic (str) – The criterion used to learn the optimal value of p (by fitting a VAR(p) model) when p=None. Should be one of ‘hqic’ (Hannan-Quinn Information Criterion), ‘aic’ (Akaike Information Criterion), ‘bic’ (Bayes Information Criterion) and ‘t-stat’ (based on last lag). Same as the ‘ic’ parameter of statsmodels.tsa.api.VAR.
Returns:

c – The beta coefficient(s).

Return type:

pandas.DataFrame
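With method='pearson' the estimate reduces to the familiar OLS beta, \(\beta = \mathrm{cov}(r_a, r_m)/\mathrm{var}(r_m)\); per the method parameter above, the other values swap in a different correlation estimate while the scaling by standard deviations is unchanged. A baseline sketch of the Pearson case on simulated returns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
market = pd.Series(rng.normal(0.0, 0.01, size=1000))
# Simulated asset returns with a true beta of 1.5 plus idiosyncratic noise.
asset = 1.5 * market + pd.Series(rng.normal(0.0, 0.005, size=1000))

# Classical Pearson/OLS beta: cov(asset, market) / var(market), which equals
# corr(asset, market) * sd(asset) / sd(market).
beta = asset.cov(market) / market.var()
assert abs(beta - 1.5) < 0.1
```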