Pandas Extension
We define a custom kxy pandas accessor below, namely the class KXYAccessor, that extends the pandas DataFrame class with all our analyses, thereby allowing data scientists to tap into the power of the kxy toolkit within the comfort of their favorite data structure.
All methods defined in the KXYAccessor class are accessible from any DataFrame instance as df.kxy.<method_name>, so long as the kxy Python package is imported alongside pandas.
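The df.kxy.<method_name> pattern relies on the accessor registration mechanism that pandas exposes. The sketch below shows that mechanism only, using a hypothetical accessor named "toy" with a single made-up method; it is not the KXYAccessor implementation and does not call the kxy package.

```python
import pandas as pd


# "toy" is a hypothetical accessor name used only for this illustration;
# it is unrelated to the real kxy accessor.
@pd.api.extensions.register_dataframe_accessor("toy")
class ToyAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor is attached to.
        self._obj = pandas_obj

    def column_ranges(self):
        """Return max - min for every numeric column."""
        numeric = self._obj.select_dtypes(include="number")
        return numeric.max() - numeric.min()


df = pd.DataFrame(
    {"x": [1.0, 4.0, 2.0], "y": [10, 30, 20], "label": ["a", "b", "a"]}
)

# Once the module defining the accessor has been imported,
# every DataFrame exposes it as an attribute.
ranges = df.toy.column_ranges()
print(ranges)
```

Registration happens as a side effect of importing the module that defines the accessor class, which is why the kxy package has to be imported before df.kxy becomes available.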

class kxy.data.dataframe.KXYAccessor(pandas_obj)
Extension of the pandas.DataFrame class with various analytics for pre-learning and post-learning in supervised learning problems.

achievable_performance_analysis(label_column, input_columns=(), space='dual', categorical_encoding='two-split')
Runs the achievable performance analysis for a supervised learning problem, i.e. estimates the best performance that a model could achieve using the provided inputs to predict the label.
The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not label_column is categorical.
Parameters:
 label_column (str) – The name of the column containing true labels.
 input_columns (set) – The columns to use as inputs. When an empty set/list is provided, all columns but label_column are used as inputs.
 space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
 categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
Returns: res – res.data is a pandas.DataFrame with columns (where applicable):
 'Achievable R^2': The highest \(R^2\) that can be achieved by a model using provided inputs to predict the label.
 'Achievable Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a model using provided inputs to predict the label.
 'Achievable Accuracy': The highest classification accuracy that can be achieved by a model using provided inputs to predict the label.
Return type: pandas.Styler
Theoretical Foundation
Section 1 - Achievable Performance.
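The space parameter switches between Pearson covariance constraints (primal) and Spearman rank correlation constraints (dual). The sketch below does not solve any maximum entropy problem; it only illustrates, on synthetic data and with hand-written helper functions, how the two statistics differ on a monotone but strongly nonlinear relationship.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Ranks from 1 to n; assumes there are no ties, for simplicity."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        out[index] = float(rank)
    return out


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Synthetic data: y is an increasing, strongly nonlinear function of x.
xs = [0.1 * i for i in range(1, 21)]
ys = [math.exp(3 * x) for x in xs]

pearson_xy = pearson(xs, ys)
spearman_xy = spearman(xs, ys)
print(f"Pearson: {pearson_xy:.3f}, Spearman: {spearman_xy:.3f}")
```

Because ranks are unchanged by increasing transformations, the Spearman value captures the perfect monotone association here, whereas the Pearson value is pulled below 1 by the nonlinearity.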

auto_predictability(columns=(), space='primal', robust=True, p=None)
Estimates the measure of auto-predictability of the time series corresponding to the input columns.
\[ \begin{aligned}\mathbb{PR}\left(\{x_t\}\right) :&= h\left(x_*\right) - h\left(\{x_t\}\right) \\ &= h\left(u_{x_*}\right) - h\left(\{u_{x_t}\}\right).\end{aligned} \]
Parameters:
 columns (list) – The input columns.
 space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
 p (int, optional) – The number of autocorrelation lags to use as empirical evidence in the maximum-entropy problem. When p is None (the default), it is inferred from the sample.
 p_ic (str) – The criterion used to learn the optimal value of p (by fitting a VAR(p) model) when p=None. Should be one of ‘hqic’ (Hannan-Quinn Information Criterion), ‘aic’ (Akaike Information Criterion), ‘bic’ (Bayes Information Criterion) and ‘t-stat’ (based on last lag). Same as the ‘ic’ parameter of statsmodels.tsa.api.VAR.
Returns: The measure of auto-predictability in nats/period.
Return type: float
Note
The time series is assumed to have a fixed period, and the index is assumed to be sorted.
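As a worked illustration of the formula, consider a stationary Gaussian AR(1) process, for which both entropies have closed forms. The sketch assumes that h(x_*) denotes the differential entropy of the stationary marginal distribution and that h({x_t}) denotes the entropy rate of the process; this reading is not spelled out in this section. The function names are hypothetical and the calculation is analytical, not the estimator used by auto_predictability.

```python
import math


def gaussian_entropy(variance):
    """Differential entropy (in nats) of a Gaussian with the given variance."""
    return 0.5 * math.log(2 * math.pi * math.e * variance)


def ar1_auto_predictability(phi, noise_variance=1.0):
    """Closed-form auto-predictability of x_t = phi * x_{t-1} + eps_t.

    Assumes Gaussian i.i.d. noise and |phi| < 1 (stationarity).
    """
    if not -1.0 < phi < 1.0:
        raise ValueError("phi must lie strictly between -1 and 1")
    # Entropy of the stationary marginal distribution.
    stationary_variance = noise_variance / (1.0 - phi ** 2)
    marginal_entropy = gaussian_entropy(stationary_variance)
    # The entropy rate of a Gaussian AR(1) equals the entropy of its noise.
    entropy_rate = gaussian_entropy(noise_variance)
    return marginal_entropy - entropy_rate


for phi in (0.0, 0.5, 0.9):
    print(phi, ar1_auto_predictability(phi))
```

Under these assumptions the measure reduces to -0.5 * log(1 - phi^2) nats per period: it is zero for white noise and grows as the process becomes more persistent.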

beta(market_returns_column, asset_returns_columns=(), risk_free_column=None, method='information-adjusted', p=0, p_ic='hqic')
Calculates the beta of a portfolio/asset (whose returns are provided in asset_returns_columns) with respect to the market (whose returns are provided in market_returns_column) using a variety of estimation methods, including the standard OLS/Pearson methods and information-theoretical alternatives that aim to account for nonlinearities and memory in asset returns.
Parameters:
 asset_returns_columns (str or list of str) – The name(s) of the column(s) to use for portfolio/asset returns.
 market_returns_column (str) – The name of the column to use for market returns.
 method (str, optional) – One of ‘information-adjusted’, ‘robust-pearson’, ‘spearman’, or ‘pearson’. This is the method to use to estimate the correlation between portfolio/asset returns and market returns.
 p (int, optional) – The number of autocorrelation lags to use as empirical evidence in the maximum-entropy problem. The default value is 0, which corresponds to assuming rows are i.i.d. Values other than 0 are only supported in the robust-pearson method. When p is None, it is inferred from the sample.
 p_ic (str) – The criterion used to learn the optimal value of p (by fitting a VAR(p) model) when p=None. Should be one of ‘hqic’ (Hannan-Quinn Information Criterion), ‘aic’ (Akaike Information Criterion), ‘bic’ (Bayes Information Criterion) and ‘t-stat’ (based on last lag). Same as the ‘ic’ parameter of statsmodels.tsa.api.VAR.
Returns: c – The beta coefficient(s).
Return type: pandas.DataFrame
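For reference, the standard OLS/Pearson beta mentioned above can be written as the Pearson correlation between asset and market returns, scaled by the ratio of their standard deviations. The sketch below implements only this classical estimator on made-up return series; the function name is hypothetical, and how the other methods turn their correlation estimate into a beta is not described here.

```python
import math


def pearson_beta(asset_returns, market_returns):
    """Classical beta: corr(asset, market) * std(asset) / std(market)."""
    n = len(market_returns)
    mean_a = sum(asset_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum(
        (a - mean_a) * (m - mean_m)
        for a, m in zip(asset_returns, market_returns)
    )
    var_a = sum((a - mean_a) ** 2 for a in asset_returns)
    var_m = sum((m - mean_m) ** 2 for m in market_returns)
    correlation = cov / math.sqrt(var_a * var_m)
    # Equivalent to cov / var_m, written to make the correlation explicit.
    return correlation * math.sqrt(var_a / var_m)


# Made-up returns: the asset moves 1.5 times as much as the market.
market = [0.01, -0.02, 0.015, 0.03, -0.005]
asset = [0.002 + 1.5 * m for m in market]

beta = pearson_beta(asset, market)
print(round(beta, 6))
```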

bias(bias_source_column, model_prediction_column, linear_scale=True, categorical_encoding='two-split')
Quantifies the bias in a supervised learning model as the mutual information between a possible cause and model predictions.
The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not model_prediction_column is categorical.
Parameters:
 bias_source_column (str) – The name of the column containing values of the bias factor (e.g. age, gender, etc.).
 model_prediction_column (str) – The name of the column containing predicted labels associated with the values of bias_source_column.
 linear_scale (bool) – Whether the bias should be returned in the linear/correlation scale or in the mutual information scale (in nats).
 categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
Returns: The mutual information \(m\), or \(1-e^{-2m}\) if linear_scale=True.
Return type: float
Theoretical Foundation
Section b) Quantifying Bias in Models.
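The linear scale maps a mutual information m (in nats) to 1 - e^{-2m}, which lies between 0 and 1. A small sketch of that mapping follows; the helper names are hypothetical, and the bivariate Gaussian case is used only as a sanity check because its mutual information is known in closed form.

```python
import math


def to_linear_scale(mutual_information):
    """Map a mutual information in nats to the 1 - exp(-2m) scale."""
    return 1.0 - math.exp(-2.0 * mutual_information)


def gaussian_mutual_information(rho):
    """Mutual information (nats) of a bivariate Gaussian with correlation rho."""
    return -0.5 * math.log(1.0 - rho ** 2)


for rho in (0.0, 0.3, 0.6):
    m = gaussian_mutual_information(rho)
    print(rho, m, to_linear_scale(m))
```

In the Gaussian case the linear-scale value equals the squared correlation, which is why this scale is easier to read than raw nats.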

corr(columns=(), method='information-adjusted', min_periods=1, p=0, p_ic='hqic')
Calculates the autocorrelation matrix of all columns or the input subset.
Parameters:
 columns (set, optional) – The set of columns to use. If not provided, all columns are used.
 method (str, optional) – Which method to use to calculate the autocorrelation matrix. Supported values are ‘information-adjusted’ (the default) and all ‘method’ values of pandas.DataFrame.corr.
 p (int, optional) – The number of autocorrelation lags to use as empirical evidence in the maximum-entropy problem. The default value is 0, which corresponds to assuming rows are i.i.d. Values other than 0 are only supported in the robust-pearson method. When p is None, it is inferred from the sample.
 min_periods (int, optional) – Only used when method is not ‘information-adjusted’. See the documentation of pandas.DataFrame.corr.
 p_ic (str) – The criterion used to learn the optimal value of p (by fitting a VAR(p) model) when p=None. Should be one of ‘hqic’ (Hannan-Quinn Information Criterion), ‘aic’ (Akaike Information Criterion), ‘bic’ (Bayes Information Criterion) and ‘t-stat’ (based on last lag). Same as the ‘ic’ parameter of statsmodels.tsa.api.VAR.
Returns: c – The autocorrelation matrix.
Return type: pandas.DataFrame

dataset_valuation(label_column, existing_input_columns, new_input_columns, space='dual', categorical_encoding='two-split')
Quantifies the additional performance that can be brought about by adding a set of new variables, as the difference between achievable performance with and without the new dataset.
Parameters:
 label_column (str) – The name of the column containing the true labels of the supervised learning problem.
 existing_input_columns (set) – The columns corresponding to existing inputs/variables.
 new_input_columns (set) – The columns corresponding to the new dataset.
 space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
 categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
Returns: res – res.data is a pandas.DataFrame with columns (where applicable):
 'Increased Achievable R^2': The highest \(R^2\) increase that can result from adding the new dataset.
 'Increased Achievable Log-Likelihood Per Sample': The highest increase in the true log-likelihood per sample that can result from adding the new dataset.
 'Increased Achievable Accuracy': The highest increase in the classification accuracy that can result from adding the new dataset.
Return type: pandas.Styler
Theoretical Foundation
Section 1 - Achievable Performance.

is_categorical(column)
Determines whether the input column contains categorical observations.

is_discrete(column)
Determines whether the input column contains discrete observations.

model_explanation_analysis(model_prediction_column, input_columns=(), space='dual', categorical_encoding='two-split')
Runs the model explanation analysis on a trained supervised learning model.
The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not model_prediction_column is categorical.
Parameters:
 model_prediction_column (str) – The name of the column containing predicted labels.
 input_columns (set) – The columns to use as inputs. When an empty set/list is provided, all columns but model_prediction_column are used as inputs.
 space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
 categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
Returns: res – res.data is a pandas.DataFrame with columns:
 'Variable': The column name corresponding to the input variable.
 'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
 'Univariate Explained R^2': The \(R^2\) between predicted labels and this variable.
 'Running Explained R^2': The \(R^2\) between predicted labels and all variables selected so far, including this one.
 'Marginal Explained R^2': The increase in \(R^2\) between predicted labels and all variables selected so far that is due to adding this variable in the selection scheme.
Return type: pandas.Styler
Theoretical Foundation
Section a) Model Explanation.
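Read together, the column descriptions imply that each marginal value is the difference between consecutive running values, taken in selection order. The sketch below spells out that bookkeeping with a hypothetical helper and purely illustrative numbers; these are not outputs of model_explanation_analysis.

```python
def marginal_from_running(running_values):
    """Differences between consecutive running values, in selection order.

    The first variable is compared against 0.0, i.e. no variable selected.
    """
    marginals = []
    previous = 0.0
    for value in running_values:
        marginals.append(value - previous)
        previous = value
    return marginals


# Illustrative numbers only, listed in selection order (1, 2, 3).
running_r2 = [0.40, 0.55, 0.58]
marginal_r2 = marginal_from_running(running_r2)
print([round(m, 2) for m in marginal_r2])
```

The marginal values add up to the last running value, so they can be read as a decomposition of the total explained \(R^2\) across the selected variables.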

model_improvability_analysis(label_column, model_prediction_column, input_columns=(), space='dual', categorical_encoding='two-split')
Runs the model improvability analysis on a trained supervised learning model.
The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not label_column is categorical.
Parameters:
 label_column (str) – The name of the column containing true labels.
 model_prediction_column (str) – The name of the column containing labels predicted by the model.
 input_columns (set) – The columns to use as inputs. When an empty set/list is provided, all columns but model_prediction_column and label_column are used as inputs.
 space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
 categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
Returns: res – res.data is a pandas.DataFrame with columns (where applicable):
 'Leftover R^2': The amount by which the trained model’s \(R^2\) can still be increased without resorting to additional inputs, simply through better modeling.
 'Leftover Log-Likelihood Per Sample': The amount by which the trained model’s true log-likelihood per sample can still be increased without resorting to additional inputs, simply through better modeling.
 'Leftover Accuracy': The amount by which the trained model’s classification accuracy can still be increased without resorting to additional inputs, simply through better modeling.
Return type: pandas.Styler
Theoretical Foundation
Section 3 - Model Improvability.

variable_selection_analysis(label_column, input_columns=(), space='dual', categorical_encoding='two-split')
Runs the variable selection analysis for a supervised learning problem.
The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not label_column is categorical.
Parameters:
 label_column (str) – The name of the column containing true labels.
 input_columns (set) – The columns to use as inputs. When an empty set/list is provided, all columns but label_column are used as inputs.
 space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
 categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
Returns: res – res.data is a pandas.DataFrame with columns (where applicable):
 'Variable': The column name corresponding to the input variable.
 'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
 'Univariate Achievable R^2': The highest \(R^2\) that can be achieved by a model solely using this variable.
 'Maximum Marginal R^2 Increase': The highest amount by which the \(R^2\) can be increased as a result of adding this variable in the variable selection scheme.
 'Running Achievable R^2': The highest \(R^2\) that can be achieved by a model using all variables selected so far, including this one.
 'Univariate Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model solely using this variable.
 'Maximum Marginal Accuracy Increase': The highest amount by which the classification accuracy can be increased as a result of adding this variable in the variable selection scheme.
 'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.
 'Conditional Mutual Information (nats)': The mutual information between this variable and the label, conditional on all variables previously selected.
 'Running Mutual Information (nats)': The mutual information between all variables selected so far, including this one, and the label.
 'Univariate Achievable True Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a classification model solely using this variable.
 'Maximum Marginal True Log-Likelihood Per Sample Increase': The highest amount by which the true log-likelihood per sample can be increased as a result of adding this variable in the variable selection scheme.
 'Running Achievable True Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a classification model using all variables selected so far, including this one.
 'Maximum Marginal True Log-Likelihood Increase Per Sample': The highest amount by which the true log-likelihood per sample can increase as a result of adding this variable.
 'Running Maximum Log-Likelihood Increase Per Sample': The highest amount by which the true log-likelihood per sample can increase (over the log-likelihood of the naive strategy consisting of predicting the mode of \(y\)) as a result of using all variables selected so far, including this one.
Return type: pandas.Styler
Theoretical Foundation
Section 2 - Variable Selection Analysis.
