Pandas Extension
We define a custom kxy pandas accessor below, namely the class Accessor, that extends the pandas DataFrame class with all our analyses, thereby allowing data scientists to tap into the power of the kxy toolkit within the comfort of their favorite data structure.
All methods defined in the Accessor class are accessible from any DataFrame instance as df.kxy.<method_name>, so long as the kxy python package is imported alongside pandas.
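The df.kxy.<method_name> syntax relies on the accessor registration mechanism that pandas exposes to third-party packages. The sketch below illustrates that mechanism only: the accessor name demo and the method n_missing are hypothetical and are not part of the kxy package.

```python
import pandas as pd


# Hypothetical accessor used only to illustrate the mechanism; not part of kxy.
@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj):
        # pandas passes in the DataFrame the accessor was reached from.
        self._obj = pandas_obj

    def n_missing(self):
        """Total number of missing values in the DataFrame."""
        return int(self._obj.isna().sum().sum())


df = pd.DataFrame({"x": [1.0, None, 3.0], "y": ["a", "b", None]})
print(df.demo.n_missing())
```

Once the module defining the accessor has been imported, every DataFrame exposes it, which is why importing kxy alongside pandas is enough to make df.kxy available.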
class kxy.pandas_extension.accessor.Accessor(pandas_obj)
Bases: kxy.pandas_extension.pre_learning_accessor.PreLearningAccessor, kxy.pandas_extension.finance_accessor.FinanceAccessor, kxy.pandas_extension.learning_accessor.LearningAccessor, kxy.pandas_extension.post_learning_accessor.PostLearningAccessor
Extension of the pandas.DataFrame class with the full capabilities of the kxy platform.
class kxy.pandas_extension.base_accessor.BaseAccessor(pandas_obj)
Base class inherited by our custom accessors.
corr(columns=(), method='information-adjusted', min_periods=1, p=0, p_ic='hqic')
Calculates the auto-correlation matrix of all columns or the input subset.
- Parameters
columns (set, optional) – The set of columns to use. If not provided, all columns are used.
method (str, optional) – Which method to use to calculate the auto-correlation matrix. Supported values are ‘information-adjusted’ (the default) and all ‘method’ values of pandas.DataFrame.corr.
p (int, optional) – The number of auto-correlation lags to use as empirical evidence in the maximum-entropy problem. The default value is 0, which corresponds to assuming rows are i.i.d. Values other than 0 are only supported in the robust-pearson method. When p is None, it is inferred from the sample.
min_periods (int, optional) – Only used when method is not ‘information-adjusted’. See the documentation of pandas.DataFrame.corr.
p_ic (str) – The criterion used to learn the optimal value of p (by fitting a VAR(p) model) when p=None. Should be one of ‘hqic’ (Hannan-Quinn Information Criterion), ‘aic’ (Akaike Information Criterion), ‘bic’ (Bayesian Information Criterion) or ‘t-stat’ (based on last lag). Same as the ‘ic’ parameter of statsmodels.tsa.api.VAR.
- Returns
c – The auto-correlation matrix.
- Return type
pandas.DataFrame
is_categorical(column)
Determine whether the input column contains categorical observations.
is_discrete(column)
Determine whether the input column contains discrete observations.
class kxy.pandas_extension.pre_learning_accessor.PreLearningAccessor(pandas_obj)
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various pre-learning analytics for supervised learning problems.
This class defines the kxy_pre_learning pandas accessor. All the methods it defines are accessible from any DataFrame instance as df.kxy_pre_learning.<method_name>, so long as the kxy python package is imported alongside pandas.
achievable_performance_analysis(label_column, input_columns=(), space='dual', categorical_encoding='two-split', problem=None)
Runs the achievable performance analysis on a trained supervised learning model.
The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not label_column is categorical.
- Parameters
label_column (str) – The name of the column containing true labels.
input_columns (set) – List of columns to use as inputs. When an empty set/list is provided, all columns but label_column are used as inputs.
space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
problem (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
- Returns
res – res.data is a pandas.DataFrame with columns (where applicable):
'Achievable R^2': The highest \(R^2\) that can be achieved by a model using provided inputs to predict the label.
'Achievable Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a model using provided inputs to predict the label.
'Achievable Accuracy': The highest classification accuracy that can be achieved by a model using provided inputs to predict the label.
- Return type
pandas.Styler
Theoretical Foundation
Section 1 - Achievable Performance.
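The difference between the Pearson constraints of the primal space and the Spearman rank constraints of the dual space can be illustrated on a toy example. The sketch below is not the kxy estimator; it only shows, with hand-written helper functions (pearson, ranks and spearman are illustrative names) and made-up data, that a rank correlation is unaffected by a monotone transformation while a Pearson correlation is.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """Ranks starting at 1; assumes no ties, which holds for the data below."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [0.5 * i for i in range(1, 11)]
y = [math.exp(v) for v in x]  # monotone but strongly non-linear in x

print(round(pearson(x, y), 3), round(spearman(x, y), 3))
```

Here y is a deterministic function of x, yet only the rank correlation reaches 1, which is the kind of non-linear association that rank-based constraints can pick up.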
auto_predictability(columns=(), space='primal', robust=True, p=None)
Estimates the measure of auto-predictability of the time series corresponding to the input columns.
\[ \begin{aligned} \mathbb{PR}\left(\{x_t\}\right) &:= h\left(x_*\right) - h\left(\{x_t\}\right) \\ &= h\left(u_{x_*}\right) - h\left(\{u_{x_t}\}\right). \end{aligned} \]
- Parameters
columns (list) – The input columns.
space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
p (int, optional) – The number of auto-correlation lags to use as empirical evidence in the maximum-entropy problem. When p is None (the default), it is inferred from the sample.
p_ic (str) – The criterion used to learn the optimal value of p (by fitting a VAR(p) model) when p=None. Should be one of ‘hqic’ (Hannan-Quinn Information Criterion), ‘aic’ (Akaike Information Criterion), ‘bic’ (Bayesian Information Criterion) or ‘t-stat’ (based on last lag). Same as the ‘ic’ parameter of statsmodels.tsa.api.VAR.
- Returns
The measure of auto-predictability in nats/period.
- Return type
float
Note
The time series is assumed to have a fixed period, and the index is assumed to be sorted.
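As a worked example of the definition above, assume that \(h\left(x_*\right)\) is the entropy of the stationary marginal and \(h\left(\{x_t\}\right)\) is the entropy rate of the process. Under that reading, a stationary Gaussian AR(1) process \(x_t = \phi x_{t-1} + \epsilon_t\) has auto-predictability \(-\frac{1}{2}\ln\left(1-\phi^2\right)\) nats/period. The sketch below evaluates this closed form; it does not call kxy, and the function names are illustrative.

```python
import math


def gaussian_entropy(variance):
    """Differential entropy (in nats) of a Gaussian with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)


def ar1_auto_predictability(phi, noise_variance=1.0):
    """Marginal entropy minus entropy rate of a stationary Gaussian AR(1)."""
    if not -1.0 < phi < 1.0:
        raise ValueError("stationarity requires |phi| < 1")
    marginal_variance = noise_variance / (1.0 - phi ** 2)
    h_marginal = gaussian_entropy(marginal_variance)
    # The entropy rate of a Gaussian AR(1) is the entropy of its innovations.
    h_rate = gaussian_entropy(noise_variance)
    return h_marginal - h_rate


for phi in (0.0, 0.5, 0.9):
    print(phi, round(ar1_auto_predictability(phi), 4))
```

The measure is zero for white noise and grows without bound as \(|\phi|\) approaches 1, in line with the intuition that a more persistent series is easier to predict from its own past.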
dataset_valuation(label_column, existing_input_columns, new_input_columns, space='dual', categorical_encoding='two-split', problem=None)
Quantifies the additional performance that can be brought about by adding a set of new variables, as the difference between achievable performance with and without the new dataset.
- Parameters
label_column (str) – The name of the column containing the true labels of the supervised learning problem.
existing_input_columns (set) – List of columns corresponding to existing inputs/variables.
new_input_columns (set) – List of columns corresponding to the new dataset.
space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
problem (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
- Returns
res – res.data is a pandas.DataFrame with columns (where applicable):
'Increased Achievable R^2': The highest \(R^2\) increase that can result from adding the new dataset.
'Increased Achievable Log-Likelihood Per Sample': The highest increase in the true log-likelihood per sample that can result from adding the new dataset.
'Increased Achievable Accuracy': The highest increase in the classification accuracy that can result from adding the new dataset.
- Return type
pandas.Styler
Theoretical Foundation
Section 1 - Achievable Performance.
variable_selection_analysis(label_column, input_columns=(), space='dual', categorical_encoding='two-split', problem=None)
Runs the model variable selection analysis on a trained supervised learning model.
The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not label_column is categorical.
- Parameters
label_column (str) – The name of the column containing true labels.
input_columns (set) – List of columns to use as inputs. When an empty set/list is provided, all columns but label_column are used as inputs.
space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
problem (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
- Returns
res – res.data is a pandas.DataFrame with columns (where applicable):
'Variable': The column name corresponding to the input variable.
'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
'Univariate Achievable R^2': The highest \(R^2\) that can be achieved by a classification model solely using this variable.
'Maximum Marginal R^2 Increase': The highest amount by which the \(R^2\) can be increased as a result of adding this variable in the variable selection scheme.
'Running Achievable R^2': The highest \(R^2\) that can be achieved by a classification model using all variables selected so far, including this one.
'Univariate Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model solely using this variable.
'Maximum Marginal Accuracy Increase': The highest amount by which the classification accuracy can be increased as a result of adding this variable in the variable selection scheme.
'Running Achievable Accuracy': The highest classification accuracy that can be achieved by a classification model using all variables selected so far, including this one.
'Conditional Mutual Information (nats)': The mutual information between this variable and the label, conditional on all variables previously selected.
'Running Mutual Information (nats)': The mutual information between all variables selected so far, including this one, and the label.
'Univariate Achievable True Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a classification model solely using this variable.
'Maximum Marginal True Log-Likelihood Per Sample Increase': The highest amount by which the true log-likelihood per sample can be increased as a result of adding this variable in the variable selection scheme.
'Running Achievable True Log-Likelihood Per Sample': The highest true log-likelihood per sample that can be achieved by a classification model using all variables selected so far, including this one.
'Maximum Marginal True Log-Likelihood Increase Per Sample': The highest amount by which the true log-likelihood per sample can increase as a result of adding this variable.
'Running Maximum Log-Likelihood Increase Per Sample': The highest amount by which the true log-likelihood per sample can increase (over the log-likelihood of the naive strategy consisting of predicting the mode of \(y\)) as a result of using all variables selected so far, including this one.
- Return type
pandas.Styler
Theoretical Foundation
Section 2 - Variable Selection Analysis.
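The 'Conditional Mutual Information (nats)' and 'Running Mutual Information (nats)' columns are linked by the chain rule of mutual information: the running value after k selections is the sum of the first k conditional values. The sketch below illustrates that bookkeeping only; the variable names and the numbers are made up, and no kxy function is called.

```python
from itertools import accumulate

# Made-up conditional mutual informations (in nats), listed in selection order.
# Each value stands for I(y; x_k | x_1, ..., x_{k-1}) of the k-th selected variable.
conditional_mi = {"income": 0.40, "age": 0.15, "zip_code": 0.05}

# Chain rule: I(y; x_1, ..., x_k) is the running sum of the conditional terms.
running_mi = dict(zip(conditional_mi, accumulate(conditional_mi.values())))

for order, (variable, value) in enumerate(running_mi.items(), start=1):
    print(order, variable, round(value, 2))
```

Because conditional mutual information is never negative, the running value can only stay flat or increase as variables are added.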
class kxy.pandas_extension.learning_accessor.LearningAccessor(pandas_obj)
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various analytics for automatically training predictive models.
This class defines the kxy_learning pandas accessor. All the methods it defines are accessible from any DataFrame instance as df.kxy_learning.<method_name>, so long as the kxy python package is imported alongside pandas.
class kxy.pandas_extension.post_learning_accessor.PostLearningAccessor(pandas_obj)
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various post-learning analytics for supervised learning problems.
This class defines the kxy_post_learning pandas accessor. All the methods it defines are accessible from any DataFrame instance as df.kxy_post_learning.<method_name>, so long as the kxy python package is imported alongside pandas.
bias(bias_source_column, model_prediction_column, linear_scale=True, categorical_encoding='two-split', problem=None)
Quantifies the bias in a supervised learning model as the mutual information between a possible cause and model predictions.
The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not model_prediction_column is categorical.
- Parameters
bias_source_column (str) – The name of the column containing values of the bias factor (e.g. age, gender, etc.)
model_prediction_column (str) – The name of the column containing predicted labels associated to the values of bias_source_column.
linear_scale (bool) – Whether the bias should be returned in the linear/correlation scale or in the mutual information scale (in nats).
categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
problem (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
- Returns
The mutual information \(m\) or \(1-e^{-2m}\) if linear_scale=True.
- Return type
float
Theoretical Foundation
Section b) Quantifying Bias in Models.
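One way to read the \(1-e^{-2m}\) transformation is through the bivariate Gaussian case: when two variables are jointly Gaussian with Pearson correlation \(\rho\), their mutual information is \(m = -\frac{1}{2}\ln\left(1-\rho^2\right)\), so \(1-e^{-2m}\) recovers \(\rho^2\). The sketch below checks this numerically; the helper names are illustrative and are not kxy functions.

```python
import math


def gaussian_mutual_information(rho):
    """Mutual information (nats) of a bivariate Gaussian with correlation rho."""
    return -0.5 * math.log(1.0 - rho ** 2)


def to_linear_scale(m):
    """Map a mutual information m (nats) to 1 - exp(-2m), which lies in [0, 1)."""
    return 1.0 - math.exp(-2.0 * m)


rho = 0.6
m = gaussian_mutual_information(rho)
print(round(m, 4), round(to_linear_scale(m), 4))
```

A mutual information of zero maps to zero on the linear scale, and larger values approach 1 without reaching it.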
model_explanation_analysis(model_prediction_column, input_columns=(), space='dual', categorical_encoding='two-split', problem=None)
Runs the model explanation analysis on a trained supervised learning model.
The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not model_prediction_column is categorical.
- Parameters
model_prediction_column (str) – The name of the column containing predicted labels.
input_columns (set) – List of columns to use as inputs. When an empty set/list is provided, all columns but model_prediction_column are used as inputs.
space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
problem (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
- Returns
res – res.data is a pandas.DataFrame with columns:
'Variable': The column name corresponding to the input variable.
'Selection Order': The order in which the associated variable was selected, starting at 1 for the most important variable.
'Univariate Explained R^2': The \(R^2\) between predicted labels and this variable.
'Running Explained R^2': The \(R^2\) between predicted labels and all variables selected so far, including this one.
'Marginal Explained R^2': The increase in \(R^2\) between predicted labels and all variables selected so far that is due to adding this variable in the selection scheme.
- Return type
pandas.Styler
Theoretical Foundation
Section a) Model Explanation.
model_improvability_analysis(label_column, model_prediction_column, input_columns=(), space='dual', categorical_encoding='two-split', problem=None)
Runs the model improvability analysis on a trained supervised learning model.
The nature of the supervised learning problem (i.e. regression or classification) is inferred from whether or not label_column is categorical.
- Parameters
label_column (str) – The name of the column containing true labels.
model_prediction_column (str) – The name of the column containing labels predicted by the model.
input_columns (set) – List of columns to use as inputs. When an empty set/list is provided, all columns but model_prediction_column and label_column are used as inputs.
space (str, 'primal' | 'dual') – The space in which the maximum entropy problem is solved. When space='primal', the maximum entropy problem is solved in the original observation space, under Pearson covariance constraints, leading to the Gaussian copula. When space='dual', the maximum entropy problem is solved in the copula-uniform dual space, under Spearman rank correlation constraints.
categorical_encoding (str, 'one-hot' | 'two-split' (default)) – The encoding method to use to represent categorical variables. See kxy.api.core.utils.one_hot_encoding and kxy.api.core.utils.two_split_encoding.
problem (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
- Returns
res – res.data is a pandas.DataFrame with columns (where applicable):
'Leftover R^2': The amount by which the trained model’s \(R^2\) can still be increased without resorting to additional inputs, simply through better modeling.
'Leftover Log-Likelihood Per Sample': The amount by which the trained model’s true log-likelihood per sample can still be increased without resorting to additional inputs, simply through better modeling.
'Leftover Accuracy': The amount by which the trained model’s classification accuracy can still be increased without resorting to additional inputs, simply through better modeling.
- Return type
pandas.Styler
Theoretical Foundation
Section 3 - Model Improvability.
class kxy.pandas_extension.finance_accessor.FinanceAccessor(pandas_obj)
Bases: kxy.pandas_extension.base_accessor.BaseAccessor
Extension of the pandas.DataFrame class with various analytics for finance and asset management problems.
This class defines the kxy_finance pandas accessor. All the methods it defines are accessible from any DataFrame instance as df.kxy_finance.<method_name>, so long as the kxy python package is imported alongside pandas.
beta(market_returns_column, asset_returns_columns=(), risk_free_column=None, method='information-adjusted', p=0, p_ic='hqic')
Calculates the beta of a portfolio/asset (whose returns are provided in asset_returns_columns) with respect to the market (whose returns are provided in market_returns_column) using a variety of estimation methods, including the standard OLS/Pearson methods and information-theoretic alternatives that aim to account for nonlinearities and memory in asset returns.
- Parameters
asset_returns_columns (str or list of str) – The name(s) of the column(s) to use for portfolio/asset returns.
market_returns_column (str) – The name of the column to use for market returns.
method (str, optional) – One of ‘information-adjusted’, ‘robust-pearson’, ‘spearman’, or ‘pearson’. This is the method to use to estimate the correlation between portfolio/asset returns and market returns.
p (int, optional) – The number of auto-correlation lags to use as empirical evidence in the maximum-entropy problem. The default value is 0, which corresponds to assuming rows are i.i.d. Values other than 0 are only supported in the robust-pearson method. When p is None, it is inferred from the sample.
p_ic (str) – The criterion used to learn the optimal value of p (by fitting a VAR(p) model) when p=None. Should be one of ‘hqic’ (Hannan-Quinn Information Criterion), ‘aic’ (Akaike Information Criterion), ‘bic’ (Bayesian Information Criterion) or ‘t-stat’ (based on last lag). Same as the ‘ic’ parameter of statsmodels.tsa.api.VAR.
- Returns
c – The beta coefficient(s).
- Return type
pandas.DataFrame
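For reference, the textbook OLS/Pearson beta is the correlation between asset and market returns multiplied by the ratio of their standard deviations, which equals the covariance divided by the market variance. The sketch below computes it by hand on made-up returns; pearson_beta is an illustrative name, the example ignores risk-free returns, and it does not reproduce the ‘information-adjusted’ or ‘robust-pearson’ estimators.

```python
import math


def pearson_beta(asset_returns, market_returns):
    """Textbook beta: correlation times the ratio of standard deviations."""
    n = len(market_returns)
    mean_a = sum(asset_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum((a - mean_a) * (m - mean_m)
              for a, m in zip(asset_returns, market_returns)) / (n - 1)
    var_a = sum((a - mean_a) ** 2 for a in asset_returns) / (n - 1)
    var_m = sum((m - mean_m) ** 2 for m in market_returns) / (n - 1)
    rho = cov / math.sqrt(var_a * var_m)
    return rho * math.sqrt(var_a / var_m)  # equals cov / var_m


# Made-up daily returns; the asset moves twice as much as the market.
market = [0.01, -0.02, 0.015, 0.005, -0.01]
asset = [2.0 * r for r in market]
print(round(pearson_beta(asset, market), 6))
```

Swapping the Pearson correlation for another correlation estimate while keeping the volatility ratio is one plausible reading of how the method parameter changes the result, but that reading is an assumption rather than something stated above.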