Complex Classification Example

We showcase our data valuation and variable selection analyses on a dataset with a mix of 14 categorical and continuous explanatory variables, namely the UCI US census income dataset. The aim here is to show that our analyses work beyond a few explanatory variables and adequately cope with categorical data.

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import kxy

# Load the UCI census income dataset (the file path is elided in the original).
df = pd.read_csv('',
                 names=['Age', 'Workclass', 'Fnlwgt', 'Education',
                        'Education Num', 'Marital Status',
                        'Occupation', 'Relationship', 'Race', 'Sex',
                        'Capital Gain', 'Capital Loss', 'Hours Per Week',
                        'Native Country', 'Income'])

The Data

The goal is to predict whether a US resident’s income exceeds $50k/year based on information provided in the US census questionnaire.

In [2]:

Column: Age
Type:   Continuous
Max:    90
p75:    48
Mean:   38
Median: 37
p25:    28
Min:    17

Column: Capital Gain
Type:   Continuous
Max:    99,999
p75:    0.0
Mean:   1,077
Median: 0.0
p25:    0.0
Min:    0.0

Column: Capital Loss
Type:   Continuous
Max:    4,356
p75:    0.0
Mean:   87
Median: 0.0
p25:    0.0
Min:    0.0

Column: Education
Type:      Categorical
Frequency: 32.25%, Label:  HS-grad
Frequency: 22.39%, Label:  Some-college
Frequency: 16.45%, Label:  Bachelors
Frequency:  5.29%, Label:  Masters
Frequency:  4.24%, Label:  Assoc-voc
Frequency:  3.61%, Label:  11th
Frequency:  3.28%, Label:  Assoc-acdm
Frequency:  2.87%, Label:  10th
Other Labels: 9.63%

Column: Education Num
Type:   Continuous
Max:    16
p75:    12
Mean:   10
Median: 10
p25:    9.0
Min:    1.0

Column: Fnlwgt
Type:   Continuous
Max:    1,484,705
p75:    237,051
Mean:   189,778
Median: 178,356
p25:    117,827
Min:    12,285

Column: Hours Per Week
Type:   Continuous
Max:    99
p75:    45
Mean:   40
Median: 40
p25:    40
Min:    1.0

Column: Income
Type:      Categorical
Frequency: 75.92%, Label:  <=50K
Frequency: 24.08%, Label:  >50K

Column: Marital Status
Type:      Categorical
Frequency: 45.99%, Label:  Married-civ-spouse
Frequency: 32.81%, Label:  Never-married
Frequency: 13.65%, Label:  Divorced
Other Labels: 7.55%

Column: Native Country
Type:      Categorical
Frequency: 89.59%, Label:  United-States
Frequency:  1.97%, Label:  Mexico
Other Labels: 8.44%

Column: Occupation
Type:      Categorical
Frequency: 12.71%, Label:  Prof-specialty
Frequency: 12.59%, Label:  Craft-repair
Frequency: 12.49%, Label:  Exec-managerial
Frequency: 11.58%, Label:  Adm-clerical
Frequency: 11.21%, Label:  Sales
Frequency: 10.12%, Label:  Other-service
Frequency:  6.15%, Label:  Machine-op-inspct
Frequency:  5.66%, Label:  ?
Frequency:  4.90%, Label:  Transport-moving
Frequency:  4.21%, Label:  Handlers-cleaners
Other Labels: 8.38%

Column: Race
Type:      Categorical
Frequency: 85.43%, Label:  White
Frequency:  9.59%, Label:  Black
Other Labels: 4.98%

Column: Relationship
Type:      Categorical
Frequency: 40.52%, Label:  Husband
Frequency: 25.51%, Label:  Not-in-family
Frequency: 15.56%, Label:  Own-child
Frequency: 10.58%, Label:  Unmarried
Other Labels: 7.83%

Column: Sex
Type:      Categorical
Frequency: 66.92%, Label:  Male
Frequency: 33.08%, Label:  Female

Column: Workclass
Type:      Categorical
Frequency: 69.70%, Label:  Private
Frequency:  7.80%, Label:  Self-emp-not-inc
Frequency:  6.43%, Label:  Local-gov
Frequency:  5.64%, Label:  ?
Frequency:  3.99%, Label:  State-gov
Other Labels: 6.44%
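The per-column summaries above can be approximated with plain pandas. The sketch below is an illustrative stand-in, not kxy's own summary routine; the `summarize` helper is a name introduced here for the example:

```python
import pandas as pd

def summarize(df):
    """Return a {column: summary} dict mirroring the per-column report above."""
    out = {}
    for col in sorted(df.columns):
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            # Continuous column: report the same order statistics as above.
            out[col] = {'Type': 'Continuous', 'Min': s.min(), 'p25': s.quantile(.25),
                        'Median': s.median(), 'Mean': s.mean(),
                        'p75': s.quantile(.75), 'Max': s.max()}
        else:
            # Categorical column: report label frequencies.
            out[col] = {'Type': 'Categorical',
                        'Frequencies': s.value_counts(normalize=True).round(4).to_dict()}
    return out
```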

Data Valuation

About 76% of the people surveyed earned less than $50,000 annually. Thus, the baseline strategy consisting of always predicting <=50K would give us a 76% accuracy.

Thanks to the explanatory variables, we may achieve a higher accuracy than 76%. To determine just how much higher the accuracy can get, we run our data valuation analysis.

In [3]:
df.kxy.data_valuation('Income', problem_type='classification')
[====================================================================================================] 100% ETA: 0s
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable Accuracy
0                  0.58                             -1.23e-01                 0.97

As it turns out, the demographic data can provide a substantial accuracy boost over the naive strategy: up to 97% accuracy, compared with the 76% baseline.

Variable Selection

Let us now trace which variables drive this boost.

In [4]:
df.kxy.variable_selection('Income', problem_type='classification')
[====================================================================================================] 100% ETA: 0s
Selection Order  Variable        Running Achievable R-Squared  Running Achievable Accuracy
1                Capital Gain    0.32                          0.88
2                Marital Status  0.45                          0.93
3                Education Num   0.50                          0.95
4                Workclass       0.52                          0.96
5                Capital Loss    0.54                          0.96
6                Hours Per Week  0.55                          0.96
7                Relationship    0.56                          0.97
8                Age             0.56                          0.97
9                Education       0.58                          0.97
10               Race            0.58                          0.97
11               Occupation      0.58                          0.97
12               Sex             0.58                          0.97
13               Native Country  0.58                          0.97
14               Fnlwgt          0.58                          0.97

The first row (Capital Gain) corresponds to the most important factor. When used in isolation in a classification model, an accuracy of up to 88% can be achieved. This is 12 percentage points above the accuracy of the naive strategy consisting of always predicting the most frequent outcome, namely <=50K.

The second row (Marital Status) corresponds to the factor that complements the Capital Gain factor the most. It can bring about an additional 5 percentage points of classification accuracy.

More generally, the \((i+1)\)-th row corresponds to the factor that best complements the first \(i\) factors selected.
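This is a greedy forward-selection scheme. The sketch below illustrates the idea using empirical in-sample accuracy as the score; note that this scorer is an assumption for illustration only, as kxy's actual criterion is an information-theoretic bound on achievable performance, not this empirical estimate:

```python
import pandas as pd

def majority_accuracy(df, features, target):
    """In-sample accuracy of predicting the most frequent target label
    within each group defined by `features` (marginal if no features)."""
    if not features:
        return df[target].value_counts(normalize=True).max()
    # Count of the most common target label within each feature group.
    top_counts = df.groupby(features)[target].apply(lambda s: s.value_counts().iloc[0])
    return top_counts.sum() / len(df)

def greedy_select(df, target):
    """At each step, add the variable that best complements those already selected."""
    remaining = [c for c in df.columns if c != target]
    selected, order = [], []
    while remaining:
        best = max(remaining,
                   key=lambda f: majority_accuracy(df, selected + [f], target))
        selected.append(best)
        remaining.remove(best)
        order.append((best, majority_accuracy(df, selected, target)))
    return order
```

On real data this in-sample scorer would overfit fine-grained (e.g. continuous) features; it only serves to make the selection-order logic concrete.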

Interestingly, a resident’s marital status and the length of her education play a more important role in determining whether she earns more than USD 50,000 per year than the amount of capital loss she’s had, while the amount of capital gain is the single most informative variable.

Comparison With Alternative (Model-Based) Factor Importance Analyses

The kxy package is the only Python package capable of ex-ante and model-free factor importance analysis.

Other tools adopt a model-based approach, quantifying the extent to which a trained model relies on each input rather than the extent to which inputs are intrinsically useful for solving the problem at hand. Their analyses may therefore vary depending on the model used and are, as such, subjective.
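The model-dependence can be made concrete with a toy permutation-importance experiment (a hypothetical setup, not from the source): two features are equally informative about the target by construction, yet two different models each attribute all importance to the single feature they happen to use.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # both features matter equally

# Two deliberately different models for the same problem,
# each relying on a single feature.
model_a = lambda Z: (Z[:, 0] > 0).astype(int)
model_b = lambda Z: (Z[:, 1] > 0).astype(int)

def permutation_importance(model, X, y, j, rng):
    """Accuracy drop when column j is shuffled -- a model-based importance score."""
    base = (model(X) == y).mean()
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    return base - (model(Xp) == y).mean()

# model_a assigns all importance to feature 0, model_b to feature 1,
# even though the two features are intrinsically interchangeable.
```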

This dataset was previously studied using the tree-based SHAP method in this notebook.

Shapley value based studies are inherently model-based: they explain the predictions made by a specific model. Additionally, they are intrinsically unrelated to accuracy or any other performance metric. Indeed, the approach merely explains model predictions (of any model, although it scales better on tree-based models), with no regard for how well said model performs.

Our approach, on the other hand, quantifies the importance of each variable for effectively solving the problem at hand, in a model-free fashion.