# Complex Classification Example

We showcase our data valuation and variable selection analyses on a dataset with a mix of 14 categorical and continuous explanatory variables, namely the UCI US census income dataset (https://archive.ics.uci.edu/ml/datasets/census+income).

The aim here is to show that our analyses work beyond a few explanatory variables and adequately cope with categorical data.

In [1]:

%load_ext autoreload

import pandas as pd
import numpy as np
import kxy


The goal is to predict whether a US resident’s income exceeds $50k/year based on information provided in the US census questionnaire.

In [2]:

# Load the data using the kxy_datasets package (pip install kxy_datasets)
from kxy_datasets.uci_classifications import Adult
df = Adult().df
df.kxy.describe()

-----------
Column: Age
-----------
Type: Categorical
Frequency: 2.76%, Label: 36
Frequency: 2.74%, Label: 35
Frequency: 2.73%, Label: 33
Frequency: 2.72%, Label: 23
Frequency: 2.71%, Label: 31
Frequency: 2.67%, Label: 34
Frequency: 2.62%, Label: 28
Frequency: 2.62%, Label: 37
Frequency: 2.62%, Label: 30
Frequency: 2.59%, Label: 38
Frequency: 2.57%, Label: 32
Frequency: 2.53%, Label: 41
Frequency: 2.52%, Label: 27
Frequency: 2.50%, Label: 29
Frequency: 2.47%, Label: 24
Frequency: 2.47%, Label: 39
Frequency: 2.45%, Label: 25
Frequency: 2.43%, Label: 40
Frequency: 2.41%, Label: 22
Frequency: 2.39%, Label: 42
Frequency: 2.36%, Label: 26
Frequency: 2.28%, Label: 20
Frequency: 2.26%, Label: 43
Frequency: 2.25%, Label: 46
Frequency: 2.24%, Label: 21
Frequency: 2.24%, Label: 45
Frequency: 2.21%, Label: 47
Frequency: 2.18%, Label: 44
Frequency: 2.16%, Label: 19
Frequency: 1.80%, Label: 51
Frequency: 1.77%, Label: 50
Frequency: 1.76%, Label: 18
Frequency: 1.73%, Label: 49
Frequency: 1.73%, Label: 48
Frequency: 1.51%, Label: 52
Frequency: 1.46%, Label: 53
Frequency: 1.27%, Label: 55
Frequency: 1.26%, Label: 54
Frequency: 1.22%, Label: 17
Frequency: 1.15%, Label: 56
Frequency: 1.14%, Label: 58
Frequency: 1.13%, Label: 57
Other Labels: 9.37%

--------------------
Column: Capital Gain
--------------------
Type: Continuous
Max: 99,999
p75: 0.0
Mean: 1,079
Median: 0.0
p25: 0.0
Min: 0.0

--------------------
Column: Capital Loss
--------------------
Type: Continuous
Max: 4,356
p75: 0.0
Mean: 87
Median: 0.0
p25: 0.0
Min: 0.0

-----------------
Column: Education
-----------------
Type: Categorical
Frequency: 32.32%, Label: HS-grad
Frequency: 22.27%, Label: Some-college
Frequency: 16.43%, Label: Bachelors
Frequency: 5.44%, Label: Masters
Frequency: 4.22%, Label: Assoc-voc
Frequency: 3.71%, Label: 11th
Frequency: 3.28%, Label: Assoc-acdm
Frequency: 2.84%, Label: 10th
Other Labels: 9.49%

---------------------
Column: Education Num
---------------------
Type: Continuous
Max: 16
p75: 12
Mean: 10
Median: 10
p25: 9.0
Min: 1.0

--------------
Column: Fnlwgt
--------------
Type: Continuous
Max: 1,490,400
p75: 237,642
Mean: 189,664
Median: 178,144
p25: 117,550
Min: 12,285

----------------------
Column: Hours Per Week
----------------------
Type: Continuous
Max: 99
p75: 45
Mean: 40
Median: 40
p25: 40
Min: 1.0

--------------
Column: Income
--------------
Type: Categorical
Frequency: 76.07%, Label: <=50K
Frequency: 23.93%, Label: >50K
Other Labels: 0.00%

----------------------
Column: Marital Status
----------------------
Type: Categorical
Frequency: 45.82%, Label: Married-civ-spouse
Frequency: 33.00%, Label: Never-married
Frequency: 13.58%, Label: Divorced
Other Labels: 7.60%

----------------------
Column: Native Country
----------------------
Type: Categorical
Frequency: 89.74%, Label: United-States
Frequency: 1.95%, Label: Mexico
Other Labels: 8.31%

------------------
Column: Occupation
------------------
Type: Categorical
Frequency: 12.64%, Label: Prof-specialty
Frequency: 12.51%, Label: Craft-repair
Frequency: 12.46%, Label: Exec-managerial
Frequency: 11.49%, Label: Adm-clerical
Frequency: 11.27%, Label: Sales
Frequency: 10.08%, Label: Other-service
Frequency: 6.19%, Label: Machine-op-inspct
Frequency: 5.75%, Label: ?
Frequency: 4.82%, Label: Transport-moving
Frequency: 4.24%, Label: Handlers-cleaners
Other Labels: 8.55%

------------
Column: Race
------------
Type: Categorical
Frequency: 85.50%, Label: White
Frequency: 9.59%, Label: Black
Other Labels: 4.91%

--------------------
Column: Relationship
--------------------
Type: Categorical
Frequency: 40.37%, Label: Husband
Frequency: 25.76%, Label: Not-in-family
Frequency: 15.52%, Label: Own-child
Frequency: 10.49%, Label: Unmarried
Other Labels: 7.86%

-----------
Column: Sex
-----------
Type: Categorical
Frequency: 66.85%, Label: Male
Frequency: 33.15%, Label: Female
Other Labels: 0.00%

-----------------
Column: Workclass
-----------------
Type: Categorical
Frequency: 69.42%, Label: Private
Frequency: 7.91%, Label: Self-emp-not-inc
Frequency: 6.42%, Label: Local-gov
Frequency: 5.73%, Label: ?
Frequency: 4.06%, Label: State-gov
Other Labels: 6.47%

## Data Valuation

About 76% of people surveyed earned less than $50,000 annually. Thus, the baseline strategy of always predicting <=50K would give us a 76% accuracy.
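As a minimal sketch of the baseline strategy described above, the snippet below computes the accuracy of always predicting the most frequent label. The helper `majority_baseline_accuracy` and the toy label list are hypothetical illustrations (they are not part of the kxy package, and the toy labels only mimic the roughly 76%/24% class balance rather than reproducing the actual dataset).

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    # most_common(1) returns a list with one (label, count) pair.
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Hypothetical toy labels mimicking the class balance reported above
# (about 76% '<=50K' and 24% '>50K'); this is not the real dataset.
toy_labels = ['<=50K'] * 76 + ['>50K'] * 24
baseline = majority_baseline_accuracy(toy_labels)
print(f"Baseline accuracy: {baseline:.2f}")
```

Any model trained on the explanatory variables is only useful to the extent that it beats this number.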

Thanks to the explanatory variables, we may achieve a higher accuracy than 76%. To determine just how much higher the accuracy can get, we run our data valuation analysis.

In [3]:

df.kxy.data_valuation('Income', problem_type='classification')

[====================================================================================================] 100% ETA: 0s

Out[3]:

| | Achievable R-Squared | Achievable Log-Likelihood Per Sample | Achievable Accuracy |
|---|---|---|---|
| 0 | 0.67 | -1.11e-16 | 1.00 |

As it turns out, demographic data can provide a substantial accuracy boost over the naive strategy: the achievable accuracy is 1.00, against 76% for the baseline.

## Variable Selection

Let us now trace which variables this boost may come from.

In [4]:

df.kxy.variable_selection('Income', problem_type='classification')

[====================================================================================================] 100% ETA: 0s

Out[4]:

| Selection Order | Variable | Running Achievable R-Squared | Running Achievable Accuracy |
|---|---|---|---|
| 0 | No Variable | 0.00 | 0.76 |
| 1 | Relationship | 0.21 | 0.89 |
| 2 | Capital Gain | 0.31 | 0.91 |
| 3 | Education | 0.37 | 0.92 |
| 4 | Age | 0.40 | 0.93 |
| 5 | Hours Per Week | 0.49 | 0.96 |
| 6 | Occupation | 0.57 | 0.98 |
| 7 | Workclass | 0.60 | 0.98 |
| 8 | Race | 0.61 | 0.99 |
| 9 | Capital Loss | 0.62 | 0.99 |
| 10 | Native Country | 0.62 | 0.99 |
| 11 | Marital Status | 0.62 | 0.99 |
| 12 | Sex | 0.62 | 0.99 |
| 13 | Fnlwgt | 0.63 | 0.99 |
| 14 | Education Num | 0.67 | 1.00 |

The second row (Relationship) corresponds to the most important factor. When used in isolation in a classification model, it can achieve an accuracy of up to 89%. This is 13 percentage points above the accuracy of the naive strategy of always predicting the most frequent outcome, namely <=50K.

The third row (Capital Gain) corresponds to the factor that complements the Relationship factor the most. It can bring about an additional 2 percentage points of classification accuracy.

More generally, the $$(i+1)$$-th row corresponds to the factor that complements the first $$i$$ factors selected the most.
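To make the row-by-row reading concrete, the sketch below derives the marginal accuracy gain of each variable from the running achievable accuracies reported in the variable selection output. The helper `marginal_gains` is a hypothetical illustration, not a kxy function, and because the reported accuracies are rounded to two decimals, the gains are only accurate to about one percentage point.

```python
# Running achievable accuracies copied from the variable selection output.
running_accuracy = [
    ("No Variable", 0.76),
    ("Relationship", 0.89),
    ("Capital Gain", 0.91),
    ("Education", 0.92),
    ("Age", 0.93),
    ("Hours Per Week", 0.96),
    ("Occupation", 0.98),
    ("Workclass", 0.98),
    ("Race", 0.99),
    ("Capital Loss", 0.99),
    ("Native Country", 0.99),
    ("Marital Status", 0.99),
    ("Sex", 0.99),
    ("Fnlwgt", 0.99),
    ("Education Num", 1.00),
]


def marginal_gains(rows):
    """Gain (in percentage points) of each variable over the previous row."""
    gains = []
    for (_, previous), (name, current) in zip(rows, rows[1:]):
        # round() absorbs floating-point noise in the two-decimal inputs.
        gains.append((name, round((current - previous) * 100)))
    return gains


gains = marginal_gains(running_accuracy)
for name, gain in gains:
    print(f"{name}: +{gain} pt")
```

The gains sum to the 24 percentage points that separate the baseline (0.76) from the overall achievable accuracy (1.00).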

### Comparison with alternative (model-based) factor importance analyses

To our knowledge, the kxy package is the only Python package capable of ex-ante and model-free factor importance analysis.

Other tools adopt a model-based approach consisting of quantifying the extent to which a trained model relies on each input, as opposed to the extent to which inputs are intrinsically useful for solving the problem at hand. Thus, their analyses may vary depending on the model used and as such they are subjective.

This dataset was previously studied using the tree-based SHAP method in this notebook.

Shapley-value-based studies are inherently model-based, as they explain the predictions made by a specific model. Additionally, they are intrinsically unrelated to accuracy or any other performance metric. Indeed, the approach merely explains model predictions (of any model, although it scales better on tree-based models), with no regard for how well said model performs.
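The model dependence of Shapley attributions can be illustrated with a small, self-contained sketch. It computes exact Shapley values by enumerating feature coalitions, with features outside a coalition replaced by a baseline value. Everything here is a hypothetical toy (the two models, the input, the baseline, and the helper names); it is not the tree-based SHAP method used in the notebook mentioned above, only an illustration of the general principle.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, value):
    """Exact Shapley values for a coalition value function value(frozenset)."""
    n = len(features)
    result = {}
    for feature in features:
        others = [g for g in features if g != feature]
        total = 0.0
        for k in range(n):
            # Standard Shapley weight for coalitions of size k.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {feature}) - value(s))
        result[feature] = total
    return result


def coalition_value(model, x, baseline):
    """Model output when features outside the coalition are set to baseline."""
    def value(coalition):
        masked = {k: (x[k] if k in coalition else baseline[k]) for k in x}
        return model(masked)
    return value


# Two hypothetical models that rely on different inputs.
model_a = lambda row: 2.0 * row["x1"]
model_b = lambda row: 2.0 * row["x2"]

x = {"x1": 1.0, "x2": 1.0}
baseline = {"x1": 0.0, "x2": 0.0}

shap_a = shapley_values(list(x), coalition_value(model_a, x, baseline))
shap_b = shapley_values(list(x), coalition_value(model_b, x, baseline))
print(shap_a)
print(shap_b)
```

For the same input, the first model attributes everything to `x1` and the second attributes everything to `x2`: the attributions describe what each model relies on, and say nothing about which input is actually more useful for predicting the target.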

Our approach, on the other hand, quantifies the importance of each variable for effectively solving the problem at hand, in a model-free fashion, and before training any predictive model. When a trained model performs optimally, Shapley values will coincide with our model-free intrinsic variable importance. However, variable selection is typically needed to help find the best model, not after it has been found.