Understanding Factors Influencing US Household Income

We quantify the importance of various factors in determining whether a US resident’s income exceeds $50k/year, using the 1994 census data from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/census+income.

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import kxy

# The raw file has no header row, so column names are supplied explicitly.
df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
    names=['Age', 'Workclass', 'Fnlwgt', 'Education',
           'Education Num', 'Marital Status',
           'Occupation', 'Relationship', 'Race', 'Sex',
           'Capital Gain', 'Capital Loss', 'Hours Per Week',
           'Native Country', 'Income'])

The Data

In [2]:
df.kxy.describe()

-----------
Column: Age
-----------
Type:   Continuous
Max:    90.0000
Mean:   38.5816
Median: 37.0000
Min:    17.0000

--------------------
Column: Capital Gain
--------------------
Type:   Continuous
Max:    99999.0000
Mean:   1077.6488
Median: 0.0000
Min:    0.0000

--------------------
Column: Capital Loss
--------------------
Type:   Continuous
Max:    4356.0000
Mean:   87.3038
Median: 0.0000
Min:    0.0000

-----------------
Column: Education
-----------------
Type:      Categorical
Frequency: 32.25%, Label:  HS-grad
Frequency: 22.39%, Label:  Some-college
Frequency: 16.45%, Label:  Bachelors
Frequency:  5.29%, Label:  Masters
Frequency:  4.24%, Label:  Assoc-voc
Frequency:  3.61%, Label:  11th
Frequency:  3.28%, Label:  Assoc-acdm
Frequency:  2.87%, Label:  10th
Other Labels: 9.63%

---------------------
Column: Education Num
---------------------
Type:   Continuous
Max:    16.0000
Mean:   10.0807
Median: 10.0000
Min:    1.0000

--------------
Column: Fnlwgt
--------------
Type:   Continuous
Max:    1484705.0000
Mean:   189778.3665
Median: 178356.0000
Min:    12285.0000

----------------------
Column: Hours Per Week
----------------------
Type:   Continuous
Max:    99.0000
Mean:   40.4375
Median: 40.0000
Min:    1.0000

--------------
Column: Income
--------------
Type:      Categorical
Frequency: 75.92%, Label:  <=50K
Frequency: 24.08%, Label:  >50K

----------------------
Column: Marital Status
----------------------
Type:      Categorical
Frequency: 45.99%, Label:  Married-civ-spouse
Frequency: 32.81%, Label:  Never-married
Frequency: 13.65%, Label:  Divorced
Other Labels: 7.55%

----------------------
Column: Native Country
----------------------
Type:      Categorical
Frequency: 89.59%, Label:  United-States
Frequency:  1.97%, Label:  Mexico
Other Labels: 8.44%

------------------
Column: Occupation
------------------
Type:      Categorical
Frequency: 12.71%, Label:  Prof-specialty
Frequency: 12.59%, Label:  Craft-repair
Frequency: 12.49%, Label:  Exec-managerial
Frequency: 11.58%, Label:  Adm-clerical
Frequency: 11.21%, Label:  Sales
Frequency: 10.12%, Label:  Other-service
Frequency:  6.15%, Label:  Machine-op-inspct
Frequency:  5.66%, Label:  ?
Frequency:  4.90%, Label:  Transport-moving
Frequency:  4.21%, Label:  Handlers-cleaners
Other Labels: 8.38%

------------
Column: Race
------------
Type:      Categorical
Frequency: 85.43%, Label:  White
Frequency:  9.59%, Label:  Black
Other Labels: 4.98%

--------------------
Column: Relationship
--------------------
Type:      Categorical
Frequency: 40.52%, Label:  Husband
Frequency: 25.51%, Label:  Not-in-family
Frequency: 15.56%, Label:  Own-child
Frequency: 10.58%, Label:  Unmarried
Other Labels: 7.83%

-----------
Column: Sex
-----------
Type:      Categorical
Frequency: 66.92%, Label:  Male
Frequency: 33.08%, Label:  Female

-----------------
Column: Workclass
-----------------
Type:      Categorical
Frequency: 69.70%, Label:  Private
Frequency:  7.80%, Label:  Self-emp-not-inc
Frequency:  6.43%, Label:  Local-gov
Frequency:  5.64%, Label:  ?
Frequency:  3.99%, Label:  State-gov
Other Labels: 6.44%

Quantifying Factor Importance

To quantify the importance of each factor for determining whether a US resident’s income exceeds $50k/year, we may simply run our variable selection analysis.

In [3]:
df.kxy.variable_selection_analysis('Income')
Out[3]:
| Variable | Selection Order | Univariate Achievable R^2 | Maximum Marginal R^2 Increase | Running Achievable R^2 | Running Achievable R^2 (%) | Univariate Achievable Accuracy | Maximum Marginal Accuracy Increase | Running Achievable Accuracy | Univariate Mutual Information (nats) | Conditional Mutual Information (nats) | Univariate Achievable True Log-Likelihood Per Sample | Maximum Marginal True Log-Likelihood Per Sample Increase | Running Achievable True Log-Likelihood Per Sample | Running Mutual Information (nats) | Running Mutual Information (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Marital Status | 1 | 0.168705 | 0.168705 | 0.168705 | 36.3009 | 0.828 | 0.069 | 0.828 | 0.092385 | 0.092385 | -0.459626 | 0.092385 | -0.459626 | 0.092385 | 29.5631 |
| Education Num | 2 | 0.107763 | 0.094349 | 0.263054 | 56.6023 | 0.804 | 0.035 | 0.863 | 0.057012 | 0.060235 | -0.494999 | 0.060235 | -0.399391 | 0.15262 | 48.8382 |
| Capital Gain | 3 | 0.086237 | 0.0562822 | 0.319336 | 68.7128 | 0.795 | 0.021 | 0.884 | 0.045092 | 0.039723 | -0.506919 | 0.039723 | -0.359668 | 0.192343 | 61.5496 |
| Education | 4 | 0.0220121 | 0.0445914 | 0.363927 | 78.3077 | 0.769 | 0.016 | 0.9 | 0.011129 | 0.033878 | -0.540882 | 0.033878 | -0.32579 | 0.226221 | 72.3905 |
| Hours Per Week | 5 | 0.0715289 | 0.0253782 | 0.389305 | 83.7685 | 0.789 | 0.009 | 0.909 | 0.037108 | 0.020358 | -0.514903 | 0.020358 | -0.305432 | 0.246579 | 78.905 |
| Sex | 6 | 0.0465745 | 0.0148851 | 0.40419 | 86.9713 | 0.779 | 0.005 | 0.914 | 0.023847 | 0.012338 | -0.528164 | 0.012338 | -0.293094 | 0.258917 | 82.8532 |
| Capital Loss | 7 | 0.020352 | 0.0122952 | 0.416486 | 89.617 | 0.768 | 0.004 | 0.918 | 0.010281 | 0.010426 | -0.54173 | 0.010426 | -0.282668 | 0.269343 | 86.1895 |
| Relationship | 8 | 0.0767599 | 0.0118906 | 0.428376 | 92.1755 | 0.791 | 0.005 | 0.923 | 0.039933 | 0.010294 | -0.512078 | 0.010294 | -0.272374 | 0.279637 | 89.4836 |
| Workclass | 9 | 0.0156089 | 0.00986247 | 0.438239 | 94.2976 | 0.766 | 0.003 | 0.926 | 0.007866 | 0.008702 | -0.544145 | 0.008702 | -0.263672 | 0.288339 | 92.2682 |
| Occupation | 10 | 0.0119322 | 0.00787079 | 0.446109 | 95.9912 | 0.764 | 0.003 | 0.929 | 0.006002 | 0.007055 | -0.546009 | 0.007055 | -0.256617 | 0.295394 | 94.5258 |
| Age | 11 | 0.0739583 | 0.00571965 | 0.451829 | 97.222 | 0.79 | 0.002 | 0.931 | 0.038418 | 0.00519 | -0.513593 | 0.00519 | -0.251427 | 0.300584 | 96.1866 |
| Race | 12 | 0.00907259 | 0.00521229 | 0.457041 | 98.3435 | 0.763 | 0.002 | 0.933 | 0.004557 | 0.004777 | -0.547454 | 0.004777 | -0.24665 | 0.305361 | 97.7152 |
| Native Country | 13 | 0.00179239 | 0.00428002 | 0.461321 | 99.2645 | 0.76 | 0.001 | 0.934 | 0.000897 | 0.003957 | -0.551114 | 0.003957 | -0.242693 | 0.309318 | 98.9814 |
| Fnlwgt | 14 | 0.000263965 | 0.00341834 | 0.46474 | 100 | 0.759 | 0.001 | 0.935 | 0.000132 | 0.003183 | -0.551879 | 0.003183 | -0.23951 | 0.312501 | 100 |

The first row (Marital Status) corresponds to the most important factor. Used in isolation in a classification model, it can achieve an accuracy of up to 82.8% (see the Univariate Achievable Accuracy column). This is 6.9 percentage points above the accuracy of the naive strategy of always predicting the most frequent outcome, namely <=50K (see the Maximum Marginal Accuracy Increase column).

The second row (Education Num) corresponds to the factor that complements Marital Status the most. It can add up to 3.5 percentage points of classification accuracy (see the Maximum Marginal Accuracy Increase column).
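The accuracy figures quoted for the first two rows can be cross-checked with a little arithmetic. The sketch below only re-uses numbers copied from the table; the variable names are illustrative and not part of the kxy API.

```python
# Values copied from the first two rows of the table above.
univariate_accuracy = 0.828  # Marital Status, Univariate Achievable Accuracy
first_increase = 0.069       # Marital Status, Maximum Marginal Accuracy Increase
second_increase = 0.035      # Education Num, Maximum Marginal Accuracy Increase

# Accuracy of the naive strategy implied by the first row.
baseline = univariate_accuracy - first_increase

# Achievable accuracy once Education Num is added to Marital Status.
running = univariate_accuracy + second_increase

print(f"implied naive baseline: {baseline:.3f}")
print(f"running accuracy after two factors: {running:.3f}")
```

The implied baseline agrees with the 75.92% frequency of the <=50K label reported in the data description, and the second figure agrees with the Running Achievable Accuracy entry of the Education Num row.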

More generally, the \((i+1)\)-th row corresponds to the factor that complements the first \(i\) factors selected the most.
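The greedy ordering described above can be illustrated on toy data. The following is a minimal sketch, assuming discrete columns and a plug-in (empirical) mutual-information estimate. It is not how kxy estimates these quantities, and every name in it (`entropy`, `mutual_information`, `greedy_selection`, the columns `a`, `b`, `c`, `y`) is hypothetical.

```python
from collections import Counter
from math import log


def entropy(columns):
    """Plug-in joint entropy (nats) of one or more discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum(c / n * log(c / n) for c in Counter(rows).values())


def mutual_information(features, target):
    """I(features; target) = H(features) + H(target) - H(features, target)."""
    if not features:
        return 0.0
    return entropy(features) + entropy([target]) - entropy(features + [target])


def greedy_selection(data, target):
    """Repeatedly pick the column that adds the most information about the target."""
    remaining = [name for name in data if name != target]
    selected, running_mi, order = [], 0.0, []
    while remaining:
        gains = {
            name: mutual_information(
                [data[s] for s in selected] + [data[name]], data[target]
            ) - running_mi
            for name in remaining
        }
        best = max(gains, key=gains.get)
        order.append((best, gains[best]))
        running_mi += gains[best]
        selected.append(best)
        remaining.remove(best)
    return order


# Toy data: y is determined by (a, b); c is a noisy copy of a.
data = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [0, 0, 0, 1, 0, 0, 0, 1],
    "c": [0, 0, 1, 0, 1, 1, 0, 1],
}
data["y"] = [2 * a + b for a, b in zip(data["a"], data["b"])]

order = greedy_selection(data, "y")
for rank, (name, gain) in enumerate(order, start=1):
    print(rank, name, round(gain, 4))
```

In this toy data, `c` is informative on its own but adds essentially nothing once `a` and `b` are selected. The table shows the same pattern for factors such as Age and Relationship, which have a high univariate score but a small marginal contribution.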

Interestingly, both the length of a resident’s education and her marital status play a more important role in determining whether she earns more than USD 50,000 per year than the amount of capital gain or capital loss she’s had.
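One rough way to check this reading is to compare shares of the total mutual information, using the Conditional Mutual Information (nats) column of the table. The sketch below only re-uses those published numbers. Bear in mind that each conditional value depends on the factors selected before it, so the shares are indicative rather than an exact attribution.

```python
# Conditional Mutual Information (nats), copied from the table in selection order.
conditional_mi = {
    "Marital Status": 0.092385,
    "Education Num": 0.060235,
    "Capital Gain": 0.039723,
    "Education": 0.033878,
    "Hours Per Week": 0.020358,
    "Sex": 0.012338,
    "Capital Loss": 0.010426,
    "Relationship": 0.010294,
    "Workclass": 0.008702,
    "Occupation": 0.007055,
    "Age": 0.00519,
    "Race": 0.004777,
    "Native Country": 0.003957,
    "Fnlwgt": 0.003183,
}

# The sum reproduces the last entry of the Running Mutual Information (nats) column.
total = sum(conditional_mi.values())


def share(names):
    """Percentage of the total mutual information attributed to the given factors."""
    return 100 * sum(conditional_mi[n] for n in names) / total


marital_and_education = share(["Marital Status", "Education Num"])
capital = share(["Capital Gain", "Capital Loss"])

print(f"total: {total:.6f} nats")
print(f"Marital Status + Education Num: {marital_and_education:.1f}%")
print(f"Capital Gain + Capital Loss: {capital:.1f}%")
```

The first share matches the Running Mutual Information (%) entry of the Education Num row, since the running columns are cumulative sums of the marginal ones.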

Comparison With Model-Based Factor Importance

The kxy package is the only Python package capable of ex-ante and model-free factor importance analysis.

Other tools adopt a model-based approach consisting of quantifying the extent to which a trained model relies on each input, as opposed to the extent to which inputs are intrinsically useful for solving the problem at hand. Thus, their analyses may vary depending on the model used and as such they are subjective.
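The model dependence mentioned above can be seen in a toy setting. The sketch below uses permutation importance as a simple stand-in for model-based methods (it is not SHAP). The data, the two "models", and all function names are hypothetical. Both features are exact copies of the label, so they are equally useful intrinsically, yet each model attributes all importance to the one feature it happens to read.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Accuracy drop observed when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    base = accuracy(model, rows, labels)
    drops = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(base - accuracy(model, shuffled, labels))
    return drops


# Two perfectly redundant features, each equal to the label.
rows = [(i % 2, i % 2) for i in range(1000)]
labels = [r[0] for r in rows]


def model_a(row):
    """Predicts from the first feature only."""
    return row[0]


def model_b(row):
    """Predicts from the second feature only."""
    return row[1]


imp_a = permutation_importance(model_a, rows, labels, n_features=2)
imp_b = permutation_importance(model_b, rows, labels, n_features=2)
print("model_a importances:", imp_a)
print("model_b importances:", imp_b)
```

Both models are perfectly accurate, but their importance rankings are opposite, which is the sense in which model-based analyses depend on the model chosen.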

This dataset was studied using the tree-based SHAP method in this notebook.