Model Compression

How to seamlessly add feature selection to any predictive model in Python, so as to achieve the same performance with far fewer features.

LeanML Feature Selection

class kxy.learning.leanml_predictor.BaselineClassifier

Classifier following the scikit-learn API and always predicting the most frequent class in-sample.

class kxy.learning.leanml_predictor.BaselineRegressor(baseline='mean')

Regressor following the scikit-learn API and always predicting the in-sample mean (or the in-sample median when baseline='median').

class kxy.learning.leanml_predictor.LeanMLPredictor

Wrapper to seamlessly add LeanML variable selection to any supervised learner.

fit(obj, target_column, learner_func, problem_type=None, snr='auto', train_frac=0.8, random_state=0, force_redo=False, max_n_features=None, min_n_features=None, start_n_features=None, anonymize=False, benchmark_feature=None, missing_value_imputation=False, score='auto', n_down_perf_before_stop=3, regression_baseline='mean', additive_learning=False, regression_error_type='additive', return_scores=False, start_n_features_perf_frac=0.9, val_performance_buffer=0.0, path=None, file_name=None)

Train a lean boosted supervised learner, bringing in variables one at a time, in decreasing order of importance (as per obj.kxy.variable_selection), until doing so no longer improves validation performance or another stopping criterion is met.

Specifically, training proceeds as follows. First, KXY’s model-free variable selection is run (i.e. obj.kxy.variable_selection).

Then we train a model (an instance returned by learner_func) using the start_n_features most important features/variables to predict the target (defined by target_column).

Next, we consider adding variables one at a time; when additive_learning is True, each new model is trained to fix the mistakes (residuals) made by the previously trained model.

If doing so improves performance on the validation set, we keep going until either performance no longer improves on the validation set n_down_perf_before_stop consecutive times, or we’ve selected max_n_features features.

When start_n_features is None the initial set of variables is the smallest set of variables with which we may achieve start_n_features_perf_frac of the performance we could achieve using all variables (as per obj.kxy.variable_selection).

When additive_learning is set to False, after adding a new variable, we will train the new model on the original problem, rather than trying to improve residuals.
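
Below is a minimal sketch of this selection loop, not the library's actual implementation; ranked_features, train_model and validation_score are hypothetical stand-ins for the output of obj.kxy.variable_selection, learner_func and the score metric, and min_n_features and additive learning are omitted for brevity.

    def lean_boost(ranked_features, train_model, validation_score,
                   start_n_features=1, max_n_features=None,
                   n_down_perf_before_stop=3, val_performance_buffer=0.0):
        # Train a first model on the most important features.
        n_features = start_n_features
        best_model = train_model(ranked_features[:n_features])
        best_score = validation_score(best_model)
        n_down = 0
        while n_features < len(ranked_features):
            if max_n_features is not None and n_features >= max_n_features:
                break
            # Bring in the next most important variable.
            n_features += 1
            model = train_model(ranked_features[:n_features])
            score = validation_score(model)
            if score > best_score + val_performance_buffer:
                # Validation performance improved: keep the feature, reset patience.
                best_model, best_score, n_down = model, score, 0
            else:
                # One more consecutive down performance.
                n_down += 1
                if n_down >= n_down_perf_before_stop:
                    break
        return best_model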

Parameters
  • obj (pd.DataFrame) – The pandas dataframe containing training data.

  • target_column (str) – The name of the column containing true labels.

  • learner_func (function) – A function returning an instance of a base learner. They should define a fit(x, y) method and a predict(x) method.

  • problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.

  • snr ('auto' | 'low' | 'high') – Set to 'low' if the problem is difficult (i.e. has a low signal-to-noise ratio) or the number of rows is small relative to the number of columns. Only used for model-free variable selection.

  • train_frac (float) – The fraction of rows used for training and validation.

  • random_state (int) – The seed to use for random training/validation/testing split.

  • force_redo (bool) – Fitted models are saved. Set this parameter to True to ignore saved models and refit.

  • min_n_features (int | None) – Boosting will not stop until at least this many features/explanatory variables are selected.

  • max_n_features (int | None) – Boosting will stop as soon as this many features/explanatory variables are selected.

  • start_n_features (int) – The number of most important features boosting will start with.

  • anonymize (bool) – When set to True, your explanatory variables will never be shared with KXY (at no performance cost).

  • benchmark_feature (str | None) – When not None, ‘benchmark’ performance metrics using this column as predictor will be reported in the output dictionary.

  • missing_value_imputation (bool) – When set to True, replace missing values with medians.

  • n_down_perf_before_stop (int) – Number of consecutive down performances to observe before boosting stops.

  • regression_baseline (str (mean | median)) – Whether to use the unconditional mean or median as the best predictor in the absence of explanatory variables. Choosing the mean corresponds to minimizing the L2 norm, whereas choosing the median corresponds to minimizing the L1 norm.

  • additive_learning (bool) – When a new variable is added, whether errors/residuals should be fixed or a new model should be learned from scratch.

  • regression_error_type (str ('additive' | 'multiplicative')) – For regression problems with additive learning, this determines whether the final model should be additive (pruning tries to reduce regressor residuals) or multiplicative (i.e. pruning tries to bring the ratio between true and predicted labels as close to 1 as possible).

  • start_n_features_perf_frac (float (between 0 and 1)) – When start_n_features is not specified, it is set to the number of variables required to achieve a fraction start_n_features_perf_frac of the maximum performance achievable (as per obj.kxy.variable_selection).

  • return_scores (bool (Default False)) – Whether to return training, validation and testing performance after lean boosting.

  • val_performance_buffer (float (Default 0.0)) – The threshold by which the new validation performance needs to exceed the previously evaluated validation performance to consider increasing the number of features.

  • score (str | func) – The validation metric to use to determine if a new feature should be added. When set to 'auto' (the default), the \(R^2\) is used for regression problems and the classification accuracy is used for classification problems. Any other string should be the name of a globally accessible callable.

Returns

result – Dictionary containing selected variables, as well as training, validation and testing performance.

Return type

dict
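
A typical usage sketch follows; the CSV path and the 'y' target column are placeholders, and a configured KXY API key is assumed for the model-free variable selection step.

    import pandas as pd
    from kxy.learning.base_learners import get_sklearn_learner
    from kxy.learning.leanml_predictor import LeanMLPredictor

    # Placeholder training data with a 'y' target column.
    df = pd.read_csv('training_data.csv')
    learner_func = get_sklearn_learner(
        'sklearn.ensemble.RandomForestRegressor', n_estimators=100)

    predictor = LeanMLPredictor()
    # Returns a dict with selected variables and train/validation/test performance.
    results = predictor.fit(df, 'y', learner_func, problem_type='regression')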

classmethod load(path, learner_func)

Load the learner from disk.

predict(obj, memory_bound=False)

Make predictions using the fitted model.

Parameters
  • obj (pandas.DataFrame) – A dataframe containing test explanatory variables about which we want to make predictions.

  • memory_bound (bool (Default False)) – Whether we should try to save memory.

Returns

result – A dataframe with the same index as obj, and with one column whose name is the target_column used for training.

Return type

pandas.DataFrame

save(path)

Save the learner to disk.

set_from_disk(path, learner_func)

Load the learner from disk.
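
Continuing the sketch above, a fitted predictor can be saved, reloaded and used to predict; the path and test_df are placeholders.

    predictor.save('./leanml_predictor.sav')
    restored = LeanMLPredictor.load('./leanml_predictor.sav', learner_func)
    # Returns a dataframe with one column named after the training target_column.
    predictions = restored.predict(test_df)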

Boruta and Recursive Feature Elimination

class kxy.misc.boruta.Boruta(learner_func, path=None)

Implementation of the Boruta feature selection algorithm.

Reference: Kursa, M. B., & Rudnicki, W. R. (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11), 1-13.

fit(x_df, y_df, n_evaluations=20, pval=0.95, max_duration=None)

Performs a run of the Boruta feature selection algorithm.

Specifically, we repeat the following n_evaluations times: we randomly shuffle the values of every feature to define so-called ‘shadow features’, which we add to the original features.

The random shuffling destroys any association between shadow features and the target. Thus, none of them should be considered important.

We train the learner with both original and shadow features, and calculate the highest feature importance score among shadow features.

We consider an original feature a ‘hit’ if its feature importance score is higher than the highest feature importance score of shadow features, otherwise it is a miss.

For each original feature, we end up with a number of hits across the n_evaluations random runs, and we use the following statistical test to conclude whether the original feature is relevant.

The null hypothesis \(H_0\) is that we do not know if an original feature is relevant for the problem at hand.

Under \(H_0\), the probability that a feature will be a hit in one of the random trials above is \(p=0.5\), and its number of hits after n_evaluations runs follows a binomial distribution with success probability \(0.5\) and n_evaluations trials.

We reject \(H_0\) and accept the alternative hypothesis \(H_1\) that a feature is relevant when the number of hits observed is higher than the pval quantile under \(H_0\).

We reject \(H_0\) and accept the alternative hypothesis \(H_2\) that a feature is not relevant when the number of hits observed is lower than the 1-pval quantile under \(H_0\).

A warning is raised when we are unable to reject \(H_0\) for some features.

Features that were determined to be relevant are used to train the final learner, which this method returns.

When no feature was determined to be relevant, this method returns None.
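
For illustration, the two quantile thresholds can be computed with scipy; this is a sketch of the decision rule as described above, not necessarily how the library computes it.

    from scipy.stats import binom

    n_evaluations, pval = 20, 0.95
    upper = binom.ppf(pval, n_evaluations, 0.5)        # pval quantile under H0: 14.0 here
    lower = binom.ppf(1.0 - pval, n_evaluations, 0.5)  # (1 - pval) quantile under H0: 6.0 here
    # A feature with more than 14 hits out of 20 (e.g. 16) is deemed relevant,
    # one with fewer than 6 hits (e.g. 4) irrelevant; counts in between are ambiguous.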

Parameters
  • x_df (pd.DataFrame) – A dataframe containing all features.

  • y_df (pd.DataFrame) – A dataframe containing the target.

  • n_evaluations (int (default 20)) – Number of trials in the Boruta algorithm.

  • pval (float (between 0 and 1, default 0.95)) – Quantile level above which features are considered relevant.

  • max_duration (float | None (default)) – If not None, then the algorithm will stop after this many seconds.

selected_variables

The list of features deemed relevant, sorted in decreasing order of number of hits.

Type

list

ambiguous_variables

The list of features that we could not conclude are either relevant or irrelevant.

Type

list

hits

Dictionary containing the number of hits for each feature.

Type

dict

Returns

m – An instance returned by learner_func trained with features that were deemed relevant.

Return type

sklearn-like model (an instance returned by learner_func)
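
A usage sketch follows; x_df and y_df are placeholder dataframes, and the base learner must expose feature_importances_ (random forests do).

    from kxy.learning.base_learners import get_sklearn_learner
    from kxy.misc.boruta import Boruta

    learner_func = get_sklearn_learner('sklearn.ensemble.RandomForestRegressor')
    selector = Boruta(learner_func)
    model = selector.fit(x_df, y_df, n_evaluations=20, pval=0.95)
    if model is not None:
        # Relevant features, sorted by decreasing number of hits.
        print(selector.selected_variables)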

class kxy.misc.rfe.RFE(learner_func, path=None)

Implementation of the Recursive Feature Elimination (RFE) feature selection algorithm.

Reference: Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 46, 389-422.

fit(x_df, y_df, n_vars, max_duration=None)

Performs a run of the Recursive Feature Elimination (RFE) feature selection algorithm.

Starting with all features, we recursively train a learner, calculate all feature importance scores, remove the least important feature, and repeat until we are left with n_vars features.

Parameters
  • x_df (pd.DataFrame) – A dataframe containing all features.

  • y_df (pd.DataFrame) – A dataframe containing the target.

  • n_vars (int) – The number of features to keep.

  • max_duration (float | None (default)) – If not None, then feature elimination will stop after this many seconds.

selected_variables

The list of the n_vars features we kept.

Type

list

Returns

m – An instance returned by learner_func trained with the n_vars features we kept.

Return type

sklearn-like model (an instance returned by learner_func)
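
A usage sketch follows; x_df and y_df are placeholder dataframes.

    from kxy.learning.base_learners import get_sklearn_learner
    from kxy.misc.rfe import RFE

    learner_func = get_sklearn_learner('sklearn.ensemble.RandomForestRegressor')
    selector = RFE(learner_func)
    # Recursively eliminate features until only the 10 most important remain.
    model = selector.fit(x_df, y_df, n_vars=10)
    print(selector.selected_variables)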

Principal Feature Selection

class kxy.pfs.pfs_selector.PCA(energy_loss_frac=0.05)

Principal Component Analysis.

fit(x, _, max_duration=None, p=None)

Fit the PCA projection to the input data x. The second positional argument is unused; it appears to be present only so that the signature mirrors that of PFS.fit.

class kxy.pfs.pfs_selector.PFS

Principal Feature Selection.

fit(x, y, p=None, mi_tolerance=0.0001, max_duration=None, epochs=None, seed=None, expand_y=True)

Perform Principal Feature Selection using \(x\) to predict \(y\).

Specifically, we are looking for a \(p \times d\) matrix \(W\) whose \(p\) rows are learned sequentially such that \(z := Wx\) is a great feature vector for predicting \(y\).

Each row of \(W\) has unit norm (\(||w_i||=1\)), and the corresponding principal feature, namely \(w_i^Tx\), points in the same direction as \(y\) (i.e. \(Cov(y, w_i^Tx) > 0\)).

The first row \(w_1\) is learned so as to maximize the mutual information \(I(y; x^Tw_1)\).

The second row \(w_2\) is learned so as to maximize the conditional mutual information \(I(y; x^Tw_2 | x^Tw_1)\).

More generally, the \((i+1)\)-th row \(w_{i+1}\) is learned so as to maximize the conditional mutual information \(I(y; x^Tw_{i+1} | [x^Tw_1, ..., x^Tw_i])\).

Parameters
  • x (np.array) – 2D array of shape \((n, d)\) containing original features.

  • y (np.array) – Array of shape \((n)\) or \((n, 1)\) containing targets.

  • p (int | None (default)) – The number of features to select. When None (the default), we stop when the estimated mutual information is smaller than mi_tolerance, or when max_duration has been exceeded. A value of p that is not None triggers one-shot PFS.

  • mi_tolerance (float) – The smallest estimated mutual information required to keep looking for new feature directions.

  • max_duration (float | None (default)) – The maximum amount of time (in seconds) to allocate to PFS.

Returns

W – 2D array whose rows are directions to use to compute principal features: \(z = Wx\).

Return type

np.array
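
A minimal synthetic sketch; the data below is made up for illustration.

    import numpy as np
    from kxy.pfs.pfs_selector import PFS

    # Synthetic data: y only depends on two directions of a 10-dimensional x.
    x = np.random.randn(5000, 10)
    y = x[:, 0] - x[:, 1] + 0.1 * np.random.randn(5000)

    selector = PFS()
    W = selector.fit(x, y, mi_tolerance=0.0001)  # rows of W are the learned directions
    z = x @ W.T  # principal features: one column per learned direction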

max_ent_features_x(x)

kxy.pfs.pfs_selector.learn_principal_direction(y, x, ox=None, oy=None, epochs=None, expand_y=True)

Learn the i-th principal feature when using \(x\) to predict \(y\).

Parameters
  • x (np.array) – 2D array of shape \((n, d)\) containing original features.

  • y (np.array) – Array of shape \((n)\) or \((n, 1)\) containing targets.

Returns

  • w (np.array) – The learned principal direction.

  • mi (float) – The mutual information \(I(y; w_i^Tx, \dots, w_1^Tx)\).

kxy.pfs.pfs_selector.learn_principal_directions_one_shot(y, x, p, epochs=None, expand_y=True)

Jointly learn p principal features.

Parameters
  • x (np.array) – 2D array of shape \((n, d)\) containing original features.

  • y (np.array) – Array of shape \((n)\) or \((n, 1)\) containing targets.

  • p (int) – The number of principal features to learn.

Returns

w – The matrix whose rows are the p principal directions.

Return type

np.array

class kxy.pfs.pfs_predictor.PCAPredictor(energy_loss_frac=0.05)

Principal Component Analysis Predictor.

class kxy.pfs.pfs_predictor.PFSPredictor

Principal Feature Selection Predictor.

fit(obj, target_column, learner_func, max_duration=None, path=None, p=None)

Fits a supervised learner enriched with feature selection using the Principal Feature Selection (PFS) algorithm.

Parameters
  • obj (pandas.DataFrame) – A dataframe containing training explanatory variables/features as well as the target.

  • target_column (str) – The name of the column in obj containing targets.

  • learner_func (func | callable) – Function or callable that expects one optional argument n_vars and returns an instance of a supervised learner (regressor or classifier) following the scikit-learn convention, and expecting n_vars features. Specifically, the learner should have a fit(x_train, y_train) method. The learner should also have a feature_importances_ property or attribute, which is an array or a list containing feature importances once the model has been trained. There should be as many importance scores in feature_importances_ as columns in x_train.

  • max_duration (float | None (default)) – If not None, then feature selection will stop after this many seconds.

  • p (int | None (default)) – The number of principal features to learn when using one-shot PFS.

feature_directions

The matrix whose rows are the directions in which to project the original features to get principal features.

Type

np.array

target_column

The name of the column used as target.

Type

str

models

An array whose first entry is the fitted model.

Type

list

x_columns

The list of columns used for PFS, sorted alphabetically.

Type

list

Returns

results – A dictionary containing, among other things, feature directions.

Return type

dict

get_feature_selector()

classmethod load(path, learner_func)

Load the predictor from disk.

predict(obj, memory_bound=False)

Make predictions using the fitted model.

Parameters
  • obj (pandas.DataFrame) – A dataframe containing test explanatory variables/features about which we want to make predictions.

  • memory_bound (bool (Default False)) – Whether we should try to save memory.

Returns

result – A dataframe with the same index as obj, and with one column whose name is the target_column used for training.

Return type

pandas.DataFrame

save(path)

Cache the predictor to disk.
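
A usage sketch follows; train_df, test_df and the 'y' target column are placeholders.

    from kxy.learning.base_learners import get_sklearn_learner
    from kxy.pfs.pfs_predictor import PFSPredictor

    learner_func = get_sklearn_learner('sklearn.ensemble.RandomForestRegressor')
    predictor = PFSPredictor()
    results = predictor.fit(train_df, 'y', learner_func)
    predictions = predictor.predict(test_df)
    predictor.save('./pfs_predictor.sav')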

Utilities for Generating Learner Functions

kxy.learning.base_learners.get_autogluon_learner(problem_type=None, eval_metric=None, verbosity=2, sample_weight=None, weight_evaluation=False, groups=None, fit_kwargs={}, **kwargs)

Return a learner function derived from AutoGluon’s TabularPredictor.

Parameters should be the same as those of the constructor of autogluon.tabular.TabularPredictor.

kxy.learning.base_learners.get_lightgbm_learner_learning_api(params, num_boost_round=100, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', learning_rates=None, keep_training_booster=False, callbacks=None, split_random_seed=None, weight_func=None, verbose=-1, importance_type='split')

Generate a base learner class as a subclass of a lightgbm learner class (using the regular training API), but one whose hyper-parameters are frozen to user-specified values.

All parameters are the same as lightgbm.train parameters with the same name.
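
For instance (the params values below are illustrative; any valid lightgbm.train params dict works):

    from kxy.learning.base_learners import get_lightgbm_learner_learning_api

    params = {'objective': 'regression', 'metric': 'rmse', 'num_leaves': 31}
    learner_func = get_lightgbm_learner_learning_api(params, num_boost_round=200)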

kxy.learning.base_learners.get_lightgbm_learner_sklearn_api(class_name, boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100, subsample_for_bin=200000, objective=None, class_weight=None, min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=0, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, random_state=None, n_jobs=-1, silent='warn', importance_type='split', fit_kwargs={}, predict_kwargs={}, **kwargs)

Generate a base learner class as a subclass of a lightgbm learner class (using the sklearn API), but one whose hyper-parameters are frozen to user-specified values.

The class name should be full (e.g. lightgbm.LGBMRegressor).
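
For instance (hyper-parameter values below are illustrative):

    from kxy.learning.base_learners import get_lightgbm_learner_sklearn_api

    # Every instance returned by learner_func is an identically configured LGBMRegressor.
    learner_func = get_lightgbm_learner_sklearn_api(
        'lightgbm.LGBMRegressor', n_estimators=200, learning_rate=0.05, max_depth=5)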

kxy.learning.base_learners.get_pytorch_dense_learner(class_name, layers, optimizer='Adam', fit_kwargs={}, predict_kwargs={}, **kwargs)

Generate a base learner class as a subclass of a pytorch learner class, but one whose hyper-parameters are frozen to user-specified values.

The class name should be either skorch.NeuralNetRegressor, skorch.NeuralNetClassifier, NeuralNetRegressor or NeuralNetClassifier.

This function requires both PyTorch and Skorch to be installed.

Parameters
  • class_name (str ('skorch.NeuralNetRegressor' | 'skorch.NeuralNetClassifier')) – Determines whether the learner is a regressor or a classifier.

  • layers (list) – List of tuples of the form (n_neurons, activation)

  • optimizer (str) – Name of the class in torch.optim to use as optimizer to train the model.

  • kwargs (dict) – Other named parameters to use in the constructor of the base class.

  • fit_kwargs (dict) – Arguments to be used to fit the model.

  • predict_kwargs (dict) – Arguments to be used to make predictions.
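
A usage sketch follows; the layer tuples and fit_kwargs values below are illustrative assumptions, not prescribed by the library.

    from kxy.learning.base_learners import get_pytorch_dense_learner

    # Two hidden layers of 100 and 10 neurons; tuples are (n_neurons, activation).
    learner_func = get_pytorch_dense_learner(
        'skorch.NeuralNetRegressor', layers=[(100, 'relu'), (10, 'relu')],
        optimizer='Adam', fit_kwargs={'max_epochs': 20})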

kxy.learning.base_learners.get_sklearn_learner(class_name, *args, fit_intercept=True, fit_kwargs={}, predict_kwargs={}, **kwargs)

Generate a base learner class as a subclass of an sklearn learner class, but one whose hyper-parameters are frozen to user-specified values.

The class name should be full (e.g. sklearn.ensemble.RandomForestRegressor).
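
For instance (hyper-parameter values are illustrative); the returned learner_func can then be passed to LeanMLPredictor.fit, Boruta, RFE or PFSPredictor.fit.

    from kxy.learning.base_learners import get_sklearn_learner

    learner_func = get_sklearn_learner(
        'sklearn.ensemble.RandomForestRegressor',
        n_estimators=100, min_samples_split=0.01)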

kxy.learning.base_learners.get_tensorflow_dense_learner(class_name, layers, loss, optimizer='adam', fit_kwargs={}, predict_kwargs={}, **kwargs)

Generate a base learner class as a subclass of a tensorflow learner class, but one whose hyper-parameters are frozen to user-specified values.

This function requires TensorFlow to be installed.

Parameters
  • class_name (str) – The name of the tensorflow learner class, which determines whether the learner is a regressor or a classifier.

  • layers (list) – List of tuples of the form (n_neurons, activation)

  • loss (str) – Name of the loss function to train the model with.

  • optimizer (str) – Name of the optimizer to use to train the model.

  • kwargs (dict) – Other named parameters to use in the constructor of the base class.

  • fit_kwargs (dict) – Arguments to be used to fit the model.

  • predict_kwargs (dict) – Arguments to be used to make predictions.

kxy.learning.base_learners.get_xgboost_learner(class_name, *args, fit_kwargs={}, predict_kwargs={}, **kwargs)

Generate a base learner class as a subclass of an xgboost learner class, but one whose hyper-parameters are frozen to user-specified values.

The class name should be full (e.g. xgboost.XGBRegressor).
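
For instance (hyper-parameter values are illustrative):

    from kxy.learning.base_learners import get_xgboost_learner

    learner_func = get_xgboost_learner(
        'xgboost.XGBRegressor', n_estimators=100, max_depth=5)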