Model Compression¶
How to seamlessly add feature selection to any predictive model in Python, so as to achieve the same performance with far fewer features.
LeanML Feature Selection¶
- class kxy.learning.leanml_predictor.BaselineClassifier¶
Classifier following the scikit-learn API and always predicting the most frequent class in-sample.
- class kxy.learning.leanml_predictor.BaselineRegressor(baseline='mean')¶
Regressor following the scikit-learn API and always predicting the in-sample mean.
- class kxy.learning.leanml_predictor.LeanMLPredictor¶
Wrapper to seamlessly add LeanML variable selection to any supervised learner.
- fit(obj, target_column, learner_func, problem_type=None, snr='auto', train_frac=0.8, random_state=0, force_redo=False, max_n_features=None, min_n_features=None, start_n_features=None, anonymize=False, benchmark_feature=None, missing_value_imputation=False, score='auto', n_down_perf_before_stop=3, regression_baseline='mean', additive_learning=False, regression_error_type='additive', return_scores=False, start_n_features_perf_frac=0.9, val_performance_buffer=0.0, path=None, file_name=None)¶
Train a lean boosted supervised learner, bringing in variables one at a time, in decreasing order of importance (as per obj.kxy.variable_selection), until doing so no longer improves validation performance or another stopping criterion is met.
Specifically, training proceeds as follows. First, KXY's model-free variable selection is run (i.e. obj.kxy.variable_selection). Then we train a model (instance returned by learner_func) using the start_n_features most important features/variables to predict the target (defined by target_column).
Next, we consider adding one variable at a time to fix the mistakes made by the previously trained model when additive_learning is True. If doing so improves performance on the validation set, we keep going until either performance no longer improves on the validation set n_down_perf_before_stop consecutive times, or we have selected max_n_features features.
When start_n_features is None, the initial set of variables is the smallest set of variables with which we may achieve start_n_features_perf_frac of the performance we could achieve using all variables (as per obj.kxy.variable_selection).
When additive_learning is set to False, after adding a new variable we train the new model on the original problem, rather than trying to improve residuals.
- Parameters
obj (pd.DataFrame) – The pandas dataframe containing training data.
target_column (str) – The name of the column containing true labels.
learner_func (function) – A function returning an instance of a base learner. It should define a fit(x, y) method and a predict(x) method.
problem_type (None | 'classification' | 'regression') – The type of supervised learning problem. When None, it is inferred from the column type and the number of distinct values.
snr ('auto' | 'low' | 'high') – Set to low if the problem is difficult (i.e. has a low signal-to-noise ratio) or the number of rows is small relative to the number of columns. Only used for model-free variable selection.
train_frac (float) – The fraction of rows used for training and validation.
random_state (int) – The seed to use for the random training/validation/testing split.
force_redo (bool) – Fitted models are saved. Set this parameter to True to ignore saved models and refit.
min_n_features (int | None) – Boosting will not stop until at least this many features/explanatory variables are selected.
max_n_features (int | None) – Boosting will stop as soon as this many features/explanatory variables are selected.
start_n_features (int) – The number of most important features boosting will start with.
anonymize (bool) – When set to True, your explanatory variables will never be shared with KXY (at no performance cost).
benchmark_feature (str | None) – When not None, ‘benchmark’ performance metrics using this column as predictor will be reported in the output dictionary.
missing_value_imputation (bool) – When set to True, replace missing values with medians.
n_down_perf_before_stop (int) – Number of consecutive down performances to observe before boosting stops.
regression_baseline (str ('mean' | 'median')) – Whether to use the unconditional mean or median as the best predictor in the absence of explanatory variables. Choosing the mean corresponds to minimizing the L2 norm, whereas choosing the median corresponds to minimizing the L1 norm.
additive_learning (bool) – When a new variable is added, whether errors/residuals should be fixed or a new model should be learned from scratch.
regression_error_type (str ('additive' | 'multiplicative')) – For regression problems with additive learning, this determines whether the final model should be additive (pruning tries to reduce regressor residuals) or multiplicative (i.e. pruning tries to bring the ratio between true and predicted labels as close to 1 as possible).
start_n_features_perf_frac (float (between 0 and 1)) – When start_n_features is not specified, it is set to the number of variables required to achieve a fraction start_n_features_perf_frac of the maximum performance achievable (as per obj.kxy.variable_selection).
return_scores (bool (Default False)) – Whether to return training, validation and testing performance after lean boosting.
val_performance_buffer (float (Default 0.0)) – The threshold by which the new validation performance needs to exceed the previously evaluated validation performance to consider increasing the number of features.
score (str | func) – The validation metric to use to determine if a new feature should be added. When set to 'auto' (the default), the \(R^2\) is used for regression problems and the classification accuracy is used for classification problems. Any other string should be the name of a globally accessible callable.
- Returns
result – Dictionary containing selected variables, as well as training, validation and testing performance.
- Return type
dict
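A minimal usage sketch on synthetic data (the dataframe, the target name 'y', and the choice of random forest learner are illustrative; the model-free variable selection step relies on the KXY backend, so suitable API access is assumed):

    import numpy as np
    import pandas as pd
    from kxy.learning.leanml_predictor import LeanMLPredictor
    from kxy.learning.base_learners import get_sklearn_learner

    # Synthetic data for illustration: ten candidate features, one informative.
    x = np.random.randn(1000, 10)
    df = pd.DataFrame(x, columns=['x_%d' % i for i in range(10)])
    df['y'] = x[:, 0] + 0.1 * np.random.randn(1000)

    learner_func = get_sklearn_learner('sklearn.ensemble.RandomForestRegressor')
    predictor = LeanMLPredictor()
    results = predictor.fit(df, 'y', learner_func, problem_type='regression')
    print(results.keys())  # selected variables and performance metrics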
- classmethod load(path, learner_func)¶
Load the learner from disk.
- predict(obj, memory_bound=False)¶
Make predictions using the fitted model.
- Parameters
obj (pandas.DataFrame) – A dataframe containing test explanatory variables about which we want to make predictions.
memory_bound (bool (Default False)) – Whether we should try to save memory.
- Returns
result – A dataframe with the same index as obj, and with one column whose name is the target_column used for training.
- Return type
pandas.DataFrame
- save(path)¶
Save the learner to disk.
- set_from_disk(path, learner_func)¶
Load the learner from disk.
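A hypothetical persistence round trip, assuming predictor and learner_func from the sketch above and a test dataframe test_df:

    # Save the fitted learner, restore it, and predict; the path is illustrative.
    predictor.save('./leanml_model.sav')
    restored = LeanMLPredictor.load('./leanml_model.sav', learner_func)
    predictions = restored.predict(test_df)  # same index as test_df, one target column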
Boruta and Recursive Feature Elimination¶
- class kxy.misc.boruta.Boruta(learner_func, path=None)¶
Implementation of the Boruta feature selection algorithm.
Reference: Kursa, M. B., & Rudnicki, W. R. (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11).
- fit(x_df, y_df, n_evaluations=20, pval=0.95, max_duration=None)¶
Performs a run of the Boruta feature selection algorithm.
Specifically, n_evaluations times we randomly shuffle all feature values to define so-called 'shadow features', which we add to the original features.
The random shuffling destroys any association between shadow features and the target. Thus, none of them should be considered important.
We train the learner with both original and shadow features, and calculate the highest feature importance score among shadow features.
We consider an original feature a ‘hit’ if its feature importance score is higher than the highest feature importance score of shadow features, otherwise it is a miss.
For each original feature, we end up with the number of hits in the n_evaluations random runs, and we use the following statistical test to conclude whether the original feature is relevant.
The null hypothesis \(H_0\) is that we do not know if an original feature is relevant for the problem at hand.
Under \(H_0\), the probability that a feature will be a hit in one of the random trials above is \(p=0.5\), and its number of hits after n_evaluations runs follows a binomial distribution with probability \(0.5\) and total number of trials n_evaluations.
We reject \(H_0\) and accept the alternative hypothesis \(H_1\) that a feature is relevant when the number of hits observed is higher than the pval quantile under \(H_0\).
We reject \(H_0\) and accept the alternative hypothesis \(H_2\) that a feature is not relevant when the number of hits observed is lower than the 1-pval quantile under \(H_0\).
A warning is raised when we were not able to reject \(H_0\) for some features.
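To make the decision rule concrete, the thresholds can be illustrated with scipy (which is not part of this API); the numbers below assume n_evaluations=20 and pval=0.95:

    from scipy.stats import binom

    n_evaluations, pval = 20, 0.95
    upper = binom.ppf(pval, n_evaluations, 0.5)      # 14.0: more hits => relevant
    lower = binom.ppf(1 - pval, n_evaluations, 0.5)  # 6.0: fewer hits => irrelevant
    for hits in (17, 10, 3):
        if hits > upper:
            print(hits, '-> reject H0: relevant')
        elif hits < lower:
            print(hits, '-> reject H0: not relevant')
        else:
            print(hits, '-> cannot reject H0 (ambiguous)')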
Features that were determined to be relevant are used to train the final learner, which this method returns.
When no feature was determined to be relevant, this method returns None.
- Parameters
x_df (pd.DataFrame) – A dataframe containing all features.
y_df (pd.DataFrame) – A dataframe containing the target.
n_evaluations (int (default 20)) – Number of trials in the Boruta algorithm.
pval (float (between 0 and 1, default 0.95)) – Quantile level above which features are considered relevant.
max_duration (float | None (default)) – If not None, then feature elimination will stop after this many seconds.
- selected_variables¶
The list of features deemed relevant, sorted by decreasing number of hits.
- Type
list
- ambiguous_variables¶
The list of features that we could not conclude are either relevant or irrelevant.
- Type
list
- hits¶
Dictionary containing the number of hits for each feature.
- Type
dict
- Returns
m – An instance returned by learner_func trained with features that were deemed relevant.
- Return type
sklearn-like model (an instance returned by learner_func)
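A minimal usage sketch on synthetic data (the dataframes and learner choice are illustrative):

    import numpy as np
    import pandas as pd
    from kxy.misc.boruta import Boruta
    from kxy.learning.base_learners import get_sklearn_learner

    # Synthetic data: only 'x_0' is informative.
    x_df = pd.DataFrame(np.random.randn(500, 5), columns=['x_%d' % i for i in range(5)])
    y_df = pd.DataFrame({'y': x_df['x_0'] + 0.1 * np.random.randn(500)})

    learner_func = get_sklearn_learner('sklearn.ensemble.RandomForestRegressor')
    selector = Boruta(learner_func)
    model = selector.fit(x_df, y_df, n_evaluations=20, pval=0.95)
    if model is not None:  # None when no feature was deemed relevant
        print(selector.selected_variables)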
- class kxy.misc.rfe.RFE(learner_func, path=None)¶
Implementation of the Recursive Feature Elimination (RFE) feature selection algorithm.
Reference: Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 46(1-3), 389-422.
- fit(x_df, y_df, n_vars, max_duration=None)¶
Performs a run of the Recursive Feature Elimination (RFE) feature selection algorithm.
Starting with all features, we recursively train a learner, calculate all feature importance scores, remove the least important feature, and repeat until we are left with n_vars features.
- Parameters
x_df (pd.DataFrame) – A dataframe containing all features.
y_df (pd.DataFrame) – A dataframe containing the target.
n_vars (int) – The number of features to keep.
max_duration (float | None (default)) – If not None, then feature elimination will stop after this many seconds.
- selected_variables¶
The list of the n_vars features we kept.
- Type
list
- Returns
m – An instance returned by learner_func trained with the n_vars features we kept.
- Return type
sklearn-like model (an instance returned by learner_func)
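A minimal usage sketch on synthetic data (here we keep three features; the learner choice is illustrative):

    import numpy as np
    import pandas as pd
    from kxy.misc.rfe import RFE
    from kxy.learning.base_learners import get_sklearn_learner

    x_df = pd.DataFrame(np.random.randn(500, 10), columns=['x_%d' % i for i in range(10)])
    y_df = pd.DataFrame({'y': x_df['x_0'] - x_df['x_1'] + 0.1 * np.random.randn(500)})

    learner_func = get_sklearn_learner('sklearn.ensemble.RandomForestRegressor')
    selector = RFE(learner_func)
    model = selector.fit(x_df, y_df, n_vars=3)
    print(selector.selected_variables)  # the 3 features that were kept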
Principal Feature Selection¶
- class kxy.pfs.pfs_selector.PCA(energy_loss_frac=0.05)¶
Principal Component Analysis.
- fit(x, _, max_duration=None, p=None)¶
- class kxy.pfs.pfs_selector.PFS¶
Principal Feature Selection.
- fit(x, y, p=None, mi_tolerance=0.0001, max_duration=None, epochs=None, seed=None, expand_y=True)¶
Perform Principal Feature Selection using \(x\) to predict \(y\).
Specifically, we are looking for a \(p \times d\) matrix \(W\) whose \(p\) rows are learned sequentially such that \(z := Wx\) is a great feature vector for predicting \(y\).
Each row \(w_i\) of \(W\) has unit norm (\(||w_i||=1\)), and the corresponding principal feature, namely \(w_i^Tx\), points in the same direction as \(y\) (i.e. \(Cov(y, w_i^Tx) > 0\)).
The first row \(w_1\) is learned so as to maximize the mutual information \(I(y; x^Tw_1)\).
The second row \(w_2\) is learned so as to maximize the conditional mutual information \(I(y; x^Tw_2 | x^Tw_1)\).
More generally, the \((i+1)\)-th row \(w_{i+1}\) is learned so as to maximize the conditional mutual information \(I(y; x^Tw_{i+1} | [x^Tw_1, ..., x^Tw_i])\).
- Parameters
x (np.array) – 2D array of shape \((n, d)\) containing original features.
y (np.array) – Array of shape \((n)\) or \((n, 1)\) containing targets.
p (int | None (default)) – The number of features to select. When None (the default), we stop when the estimated mutual information is smaller than the mutual information tolerance parameter, or when we have exceeded the maximum duration. A value of p that is not None triggers one-shot PFS.
mi_tolerance (float) – The smallest estimated mutual information required to keep looking for new feature directions.
max_duration (float | None (default)) – The maximum amount of time (in seconds) to allocate to PFS.
- Returns
W – 2D array whose rows are directions to use to compute principal features: \(z = Wx\).
- Return type
np.array
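A minimal sketch on synthetic data (PFS fits small neural networks internally, so a working deep learning backend is assumed to be installed):

    import numpy as np
    from kxy.pfs.pfs_selector import PFS

    x = np.random.randn(1000, 10)                        # original features
    y = x @ np.ones(10) / 10.0 + 0.1 * np.random.randn(1000)
    W = PFS().fit(x, y, p=2)                             # one-shot PFS, two directions
    z = x @ W.T                                          # principal features z = Wx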
- max_ent_features_x(x)¶
- kxy.pfs.pfs_selector.learn_principal_direction(y, x, ox=None, oy=None, epochs=None, expand_y=True)¶
Learn the i-th principal feature when using \(x\) to predict \(y\).
- Parameters
x (np.array) – 2D array of shape \((n, d)\) containing original features.
y (np.array) – Array of shape \((n)\) or \((n, 1)\) containing targets.
- Returns
w (np.array) – The i-th principal direction.
mi (float) – The mutual information \(I(y; w_i^Tx, \dots, w_1^Tx)\).
- kxy.pfs.pfs_selector.learn_principal_directions_one_shot(y, x, p, epochs=None, expand_y=True)¶
Jointly learn p principal features.
- Parameters
x (np.array) – 2D array of shape \((n, d)\) containing original features.
y (np.array) – Array of shape \((n)\) or \((n, 1)\) containing targets.
p (int) – The number of principal features to learn.
- Returns
w – The matrix whose rows are the p principal directions.
- Return type
np.array
- class kxy.pfs.pfs_predictor.PCAPredictor(energy_loss_frac=0.05)¶
Principal Component Analysis Predictor.
- class kxy.pfs.pfs_predictor.PFSPredictor¶
Principal Feature Selection Predictor.
- fit(obj, target_column, learner_func, max_duration=None, path=None, p=None)¶
Fits a supervised learner enriched with feature selection using the Principal Feature Selection (PFS) algorithm.
- Parameters
obj (pandas.DataFrame) – A dataframe containing training explanatory variables/features as well as the target.
target_column (str) – The name of the column in obj containing targets.
learner_func (func | callable) – Function or callable that expects one optional argument n_vars and returns an instance of a supervised learner (regressor or classifier) following the scikit-learn convention, and expecting n_vars features. Specifically, the learner should have a fit(x_train, y_train) method. The learner should also have a feature_importances_ property or attribute, which is an array or a list containing feature importances once the model has been trained. There should be as many importance scores in feature_importances_ as there are columns in x_train.
max_duration (float | None (default)) – If not None, then feature elimination will stop after this many seconds.
p (int | None (default)) – The number of principal features to learn when using one-shot PFS.
- feature_directions¶
The matrix whose rows are the directions in which to project the original features to get principal features.
- Type
np.array
- target_column¶
The name of the column used as target.
- Type
str
- models¶
An array whose first entry is the fitted model.
- Type
list
- x_columns¶
The list of columns used for PFS sorted alphabetically.
- Type
list
- Returns
results – A dictionary containing, among other things, feature directions.
- Return type
dict
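A minimal usage sketch (the dataframe and target name are illustrative; learner_func is assumed to accept the optional n_vars argument described above, as the factories in the last section are assumed to do):

    import numpy as np
    import pandas as pd
    from kxy.pfs.pfs_predictor import PFSPredictor
    from kxy.learning.base_learners import get_sklearn_learner

    df = pd.DataFrame(np.random.randn(1000, 5), columns=['x_%d' % i for i in range(5)])
    df['y'] = df['x_0'] + 0.1 * np.random.randn(1000)

    learner_func = get_sklearn_learner('sklearn.ensemble.RandomForestRegressor')
    predictor = PFSPredictor()
    results = predictor.fit(df, 'y', learner_func)
    print(predictor.feature_directions.shape)  # one row per principal feature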
- get_feature_selector()¶
- classmethod load(path, learner_func)¶
Load the predictor from disk.
- predict(obj, memory_bound=False)¶
Make predictions using the fitted model.
- Parameters
obj (pandas.DataFrame) – A dataframe containing test explanatory variables/features about which we want to make predictions.
memory_bound (bool (Default False)) – Whether we should try to save memory.
- Returns
result – A dataframe with the same index as obj, and with one column whose name is the target_column used for training.
- Return type
pandas.DataFrame
- save(path)¶
Cache the predictor to disk.
Utilities Generating Learner Functions¶
- kxy.learning.base_learners.get_autogluon_learner(problem_type=None, eval_metric=None, verbosity=2, sample_weight=None, weight_evaluation=False, groups=None, fit_kwargs={}, **kwargs)¶
Return a learner function derived from AutoGluon’s TabularPredictor.
Parameters should be the same as those of the constructor of autogluon.tabular.TabularPredictor.
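For instance (a sketch; the eval_metric value follows AutoGluon's own conventions, and AutoGluon must be installed):

    from kxy.learning.base_learners import get_autogluon_learner

    # Freeze TabularPredictor settings into a learner function.
    learner_func = get_autogluon_learner(problem_type='regression', eval_metric='r2')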
- kxy.learning.base_learners.get_lightgbm_learner_learning_api(params, num_boost_round=100, fobj=None, feval=None, init_model=None, feature_name='auto', categorical_feature='auto', learning_rates=None, keep_training_booster=False, callbacks=None, split_random_seed=None, weight_func=None, verbose=-1, importance_type='split')¶
Generate a base learner class as a subclass of a lightgbm learner class (using the regular training API), but one whose hyper-parameters are frozen to user-specified values.
All parameters are the same as the lightgbm.train parameters with the same name.
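For instance (a sketch; the params dict follows lightgbm.train conventions):

    from kxy.learning.base_learners import get_lightgbm_learner_learning_api

    params = {'objective': 'regression', 'boosting_type': 'gbdt', 'num_leaves': 31}
    learner_func = get_lightgbm_learner_learning_api(params, num_boost_round=100)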
- kxy.learning.base_learners.get_lightgbm_learner_sklearn_api(class_name, boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100, subsample_for_bin=200000, objective=None, class_weight=None, min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=0, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, random_state=None, n_jobs=-1, silent='warn', importance_type='split', fit_kwargs={}, predict_kwargs={}, **kwargs)¶
Generate a base learner class as a subclass of a lightgbm learner class (using the sklearn API), but one whose hyper-parameters are frozen to user-specified values.
The class name should be full (e.g. lightgbm.LGBMRegressor).
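For instance (a sketch; the hyper-parameter values are illustrative):

    from kxy.learning.base_learners import get_lightgbm_learner_sklearn_api

    learner_func = get_lightgbm_learner_sklearn_api(
        'lightgbm.LGBMRegressor', n_estimators=100, max_depth=5)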
- kxy.learning.base_learners.get_pytorch_dense_learner(class_name, layers, optimizer='Adam', fit_kwargs={}, predict_kwargs={}, **kwargs)¶
Generate a base learner class as a subclass of a pytorch learner class, but one whose hyper-parameters are frozen to user-specified values.
The class name should be either skorch.NeuralNetRegressor, skorch.NeuralNetClassifier, NeuralNetRegressor or NeuralNetClassifier.
This function requires both PyTorch and Skorch to be installed.
- Parameters
class_name (str ('skorch.NeuralNetRegressor' | 'skorch.NeuralNetClassifier')) – Determines whether the learner is a regressor or a classifier.
layers (list) – List of tuples of the form (n_neurons, activation)
optimizer (str) – Name of the class in torch.optim to use as optimizer to train the model.
kwargs (dict) – Other named parameters to use in the constructor of the base class.
fit_kwargs (dict) – Arguments to be used to fit the model.
predict_kwargs (dict) – Arguments to be used to make predictions.
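For instance (a sketch; the layer sizes are illustrative and the activation strings are assumptions about this API):

    from kxy.learning.base_learners import get_pytorch_dense_learner

    # Two hidden layers, specified as (n_neurons, activation) tuples.
    learner_func = get_pytorch_dense_learner(
        'skorch.NeuralNetRegressor', [(64, 'relu'), (32, 'relu')], optimizer='Adam')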
- kxy.learning.base_learners.get_sklearn_learner(class_name, *args, fit_intercept=True, fit_kwargs={}, predict_kwargs={}, **kwargs)¶
Generate a base learner class as a subclass of an sklearn learner class, but one whose hyper-parameters are frozen to user-specified values.
The class name should be full (e.g. sklearn.ensemble.RandomForestRegressor).
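For instance, to freeze a random forest's hyper-parameters into a learner function (a sketch; the values are illustrative):

    from kxy.learning.base_learners import get_sklearn_learner

    learner_func = get_sklearn_learner(
        'sklearn.ensemble.RandomForestRegressor', n_estimators=100, max_depth=5)
    model = learner_func()  # a RandomForestRegressor with the frozen settings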
- kxy.learning.base_learners.get_tensorflow_dense_learner(class_name, layers, loss, optimizer='adam', fit_kwargs={}, predict_kwargs={}, **kwargs)¶
Generate a base learner class as a subclass of a tensorflow learner class, but one whose hyper-parameters are frozen to user-specified values.
This function requires TensorFlow to be installed.
- Parameters
class_name (str ('tf.python.keras.wrappers.scikit_learn.KerasRegressor' | 'tf.python.keras.wrappers.scikit_learn.KerasClassifier')) – Determines whether the learner is a regressor or a classifier.
layers (list) – List of tuples of the form (n_neurons, activation)
loss (str) – The tensorflow loss (e.g. ‘mean_absolute_error’)
optimizer (str) – Which tensorflow optimizer to use to fit the model.
kwargs (dict) – Other named parameters to use in the constructor of the base class.
fit_kwargs (dict) – Arguments to be used to fit the model. See https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit for legal arguments.
predict_kwargs (dict) – Arguments to be used to make predictions. See https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict for legal arguments.
kwargs – Other named parameters to use in the constructor of the base class, e.g. epochs and batch_size. See https://www.tensorflow.org/api_docs/python/tf/keras/wrappers/scikit_learn/KerasRegressor and https://www.tensorflow.org/api_docs/python/tf/keras/wrappers/scikit_learn/KerasClassifier for legal arguments.
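For instance (a sketch; layer tuples are (n_neurons, activation), and the loss and activation strings follow Keras conventions):

    from kxy.learning.base_learners import get_tensorflow_dense_learner

    learner_func = get_tensorflow_dense_learner(
        'tf.python.keras.wrappers.scikit_learn.KerasRegressor',
        [(64, 'relu'), (1, 'linear')], 'mean_absolute_error',
        epochs=10, batch_size=32)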
- kxy.learning.base_learners.get_xgboost_learner(class_name, *args, fit_kwargs={}, predict_kwargs={}, **kwargs)¶
Generate a base learner class as a subclass of an xgboost learner class, but one whose hyper-parameters are frozen to user-specified values.
The class name should be full (e.g. xgboost.XGBRegressor).
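For instance (a sketch; the hyper-parameter value is illustrative):

    from kxy.learning.base_learners import get_xgboost_learner

    learner_func = get_xgboost_learner('xgboost.XGBRegressor', n_estimators=100)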