# Understanding Factors Influencing US Household Income

We consider quantifying the importance of various factors in determining whether a US resident's income exceeds $50k/year, based on the 1994 census data from the UCI machine learning repository: https://archive.ics.uci.edu/ml/datasets/census+income.

In [1]:

```python
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import kxy

df = pd.read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
    names=['Age', 'Workclass', 'Fnlwgt', 'Education', 'Education Num',
           'Marital Status', 'Occupation', 'Relationship', 'Race', 'Sex',
           'Capital Gain', 'Capital Loss', 'Hours Per Week',
           'Native Country', 'Income'])
```

## The Data

In [2]:

```python
df.kxy.describe()
```

```
-----------
Column: Age
-----------
Type:   Continuous
Max:    90.0000
Mean:   38.5816
Median: 37.0000
Min:    17.0000

--------------------
Column: Capital Gain
--------------------
Type:   Continuous
Max:    99999.0000
Mean:   1077.6488
Median: 0.0000
Min:    0.0000

--------------------
Column: Capital Loss
--------------------
Type:   Continuous
Max:    4356.0000
Mean:   87.3038
Median: 0.0000
Min:    0.0000

-----------------
Column: Education
-----------------
Type: Categorical
Frequency: 32.25%, Label: HS-grad
Frequency: 22.39%, Label: Some-college
Frequency: 16.45%, Label: Bachelors
Frequency:  5.29%, Label: Masters
Frequency:  4.24%, Label: Assoc-voc
Frequency:  3.61%, Label: 11th
Frequency:  3.28%, Label: Assoc-acdm
Frequency:  2.87%, Label: 10th
Other Labels: 9.63%

---------------------
Column: Education Num
---------------------
Type:   Continuous
Max:    16.0000
Mean:   10.0807
Median: 10.0000
Min:    1.0000

--------------
Column: Fnlwgt
--------------
Type:   Continuous
Max:    1484705.0000
Mean:   189778.3665
Median: 178356.0000
Min:    12285.0000

----------------------
Column: Hours Per Week
----------------------
Type:   Continuous
Max:    99.0000
Mean:   40.4375
Median: 40.0000
Min:    1.0000

--------------
Column: Income
--------------
Type: Categorical
Frequency: 75.92%, Label: <=50K
Frequency: 24.08%, Label: >50K

----------------------
Column: Marital Status
----------------------
Type: Categorical
Frequency: 45.99%, Label: Married-civ-spouse
Frequency: 32.81%, Label: Never-married
Frequency: 13.65%, Label: Divorced
Other Labels: 7.55%

----------------------
Column: Native Country
----------------------
Type: Categorical
Frequency: 89.59%, Label: United-States
Frequency:  1.97%, Label: Mexico
Other Labels: 8.44%

------------------
Column: Occupation
------------------
Type: Categorical
Frequency: 12.71%, Label: Prof-specialty
Frequency: 12.59%, Label: Craft-repair
Frequency: 12.49%, Label: Exec-managerial
Frequency: 11.58%, Label: Adm-clerical
Frequency: 11.21%, Label: Sales
Frequency: 10.12%, Label: Other-service
Frequency:  6.15%, Label: Machine-op-inspct
Frequency:  5.66%, Label: ?
Frequency:  4.90%, Label: Transport-moving
Frequency:  4.21%, Label: Handlers-cleaners
Other Labels: 8.38%

------------
Column: Race
------------
Type: Categorical
Frequency: 85.43%, Label: White
Frequency:  9.59%, Label: Black
Other Labels: 4.98%

--------------------
Column: Relationship
--------------------
Type: Categorical
Frequency: 40.52%, Label: Husband
Frequency: 25.51%, Label: Not-in-family
Frequency: 15.56%, Label: Own-child
Frequency: 10.58%, Label: Unmarried
Other Labels: 7.83%

-----------
Column: Sex
-----------
Type: Categorical
Frequency: 66.92%, Label: Male
Frequency: 33.08%, Label: Female

-----------------
Column: Workclass
-----------------
Type: Categorical
Frequency: 69.70%, Label: Private
Frequency:  7.80%, Label: Self-emp-not-inc
Frequency:  6.43%, Label: Local-gov
Frequency:  5.64%, Label: ?
Frequency:  3.99%, Label: State-gov
Other Labels: 6.44%
```

## Quantifying Factor Importance

To quantify the importance of each factor in determining whether a US resident's income exceeds $50k/year, we may simply run our variable selection analysis.

In [3]:

```python
df.kxy.variable_selection_analysis('Income')
```

Out[3]:

| Variable | Selection Order | Univariate Achievable R^2 | Maximum Marginal R^2 Increase | Running Achievable R^2 | Running Achievable R^2 (%) | Univariate Achievable Accuracy | Maximum Marginal Accuracy Increase | Running Achievable Accuracy | Univariate Mutual Information (nats) | Conditional Mutual Information (nats) | Univariate Achievable True Log-Likelihood Per Sample | Maximum Marginal True Log-Likelihood Per Sample Increase | Running Achievable True Log-Likelihood Per Sample | Running Mutual Information (nats) | Running Mutual Information (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Marital Status | 1 | 0.166102 | 0.166102 | 0.166102 | 37.8598 | 0.827 | 0.068 | 0.827 | 0.090822 | 0.090822 | -0.461189 | 0.090822 | -0.461189 | 0.090822 | 31.4507 |
| Education Num | 2 | 0.107763 | 0.0946444 | 0.260746 | 59.4321 | 0.804 | 0.035 | 0.862 | 0.057012 | 0.060235 | -0.494999 | 0.060235 | -0.400954 | 0.151057 | 52.3094 |
| Capital Gain | 3 | 0.086237 | 0.0564584 | 0.317205 | 72.3007 | 0.795 | 0.021 | 0.883 | 0.045092 | 0.039723 | -0.506919 | 0.039723 | -0.361231 | 0.19078 | 66.065 |
| Education | 4 | 0.0220121 | 0.044731 | 0.361936 | 82.4963 | 0.769 | 0.016 | 0.899 | 0.011129 | 0.033878 | -0.540882 | 0.033878 | -0.327353 | 0.224658 | 77.7966 |
| Hours Per Week | 5 | 0.0715289 | 0.0254576 | 0.387393 | 88.2989 | 0.789 | 0.009 | 0.908 | 0.037108 | 0.020358 | -0.514903 | 0.020358 | -0.306995 | 0.245016 | 84.8464 |
| Sex | 6 | 0.0461683 | 0.00946442 | 0.396858 | 90.4561 | 0.779 | 0.003 | 0.911 | 0.023634 | 0.007785 | -0.528377 | 0.007785 | -0.29921 | 0.252801 | 87.5422 |
| Occupation | 7 | 0.0119322 | 0.0043414 | 0.401199 | 91.4457 | 0.764 | 0.002 | 0.913 | 0.006002 | 0.003612 | -0.546009 | 0.003612 | -0.295598 | 0.256413 | 88.793 |
| Age | 8 | 0.0739583 | 0.000276582 | 0.401476 | 91.5087 | 0.79 | 0 | 0.913 | 0.038418 | 0.000231 | -0.513593 | 0.000231 | -0.295367 | 0.256644 | 88.873 |
| Capital Loss | 9 | 0.020352 | 0.0123512 | 0.413827 | 94.3239 | 0.768 | 0.004 | 0.917 | 0.010281 | 0.010426 | -0.54173 | 0.010426 | -0.284941 | 0.26707 | 92.4834 |
| Workclass | 10 | 0.0156089 | 0.0101135 | 0.42394 | 96.6291 | 0.766 | 0.004 | 0.921 | 0.007866 | 0.008702 | -0.544145 | 0.008702 | -0.276239 | 0.275772 | 95.4969 |
| Relationship | 11 | 0.0767599 | 0.00708538 | 0.431026 | 98.2441 | 0.791 | 0.003 | 0.924 | 0.039933 | 0.006188 | -0.512078 | 0.006188 | -0.270051 | 0.28196 | 97.6397 |
| Race | 12 | 0.00907259 | 0.0054101 | 0.436436 | 99.4772 | 0.763 | 0.001 | 0.925 | 0.004557 | 0.004777 | -0.547454 | 0.004777 | -0.265274 | 0.286737 | 99.2939 |
| Native Country | 13 | 0.00179239 | 0.0013678 | 0.437804 | 99.789 | 0.76 | 0.001 | 0.926 | 0.000897 | 0.001215 | -0.551114 | 0.001215 | -0.264059 | 0.287952 | 99.7147 |
| Fnlwgt | 14 | 0.000263965 | 0.000925737 | 0.438729 | 100 | 0.759 | 0 | 0.926 | 0.000132 | 0.000824 | -0.551879 | 0.000824 | -0.263235 | 0.288776 | 100 |
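As a sanity check on the output above, the last column can be recovered from the running mutual information: it is the running value expressed as a percentage of the total mutual information achieved once all 14 variables are selected (0.288776 nats, last row). A minimal sketch, using values copied from the table rather than recomputed from the data:

```python
# Values copied from the variable selection table above.
total_mi = 0.288776            # Running Mutual Information after all 14 variables
running_mi_marital = 0.090822  # Running Mutual Information after 'Marital Status'

# 'Running Mutual Information (%)' is the running value as a share of the total.
pct = round(100 * running_mi_marital / total_mi, 4)
print(pct)  # 31.4507, matching the first row
```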

The first row (Marital Status) corresponds to the most important factor. Used in isolation in a classification model, it can achieve an accuracy of up to 82.7% (see the Univariate Achievable Accuracy column). This is 6.8% above the accuracy achievable by the naive strategy of always predicting the most frequent outcome, namely `<=50K` (see the Maximum Marginal Accuracy Increase column).
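This arithmetic can be checked directly against the label frequencies reported by `df.kxy.describe()` above. A minimal sketch with values copied from that output:

```python
# Label frequencies of the 'Income' column, copied from df.kxy.describe().
freq = {'<=50K': 0.7592, '>50K': 0.2408}

# The naive strategy always predicts the most frequent label.
naive_accuracy = max(freq.values())  # 0.7592

# 'Marital Status' alone adds up to 6.8% of accuracy on top of this baseline.
univariate_accuracy = round(naive_accuracy + 0.068, 3)
print(naive_accuracy, univariate_accuracy)  # 0.7592 0.827
```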

The second row (Education Num) corresponds to the factor that best complements Marital Status. It can bring about an additional 3.5% of classification accuracy (see the Maximum Marginal Accuracy Increase column).

More generally, the $(i+1)$-th row corresponds to the factor that best complements the first $i$ factors selected.
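One consequence of this greedy reading is that the Running Achievable Accuracy column is, up to rounding, the naive baseline plus the cumulative sum of the marginal accuracy increases of the variables selected so far. A sketch with the first few values copied from the table:

```python
# Marginal accuracy increases of the first seven selected variables,
# copied from the 'Maximum Marginal Accuracy Increase' column above.
selection = [
    ('Marital Status', 0.068), ('Education Num', 0.035),
    ('Capital Gain', 0.021), ('Education', 0.016),
    ('Hours Per Week', 0.009), ('Sex', 0.003), ('Occupation', 0.002),
]

running = 0.759  # naive baseline: always predict '<=50K'
for name, increase in selection:
    running = round(running + increase, 3)
    print(f'{name}: running achievable accuracy = {running}')
# Final value: 0.913, matching the 'Occupation' row of the table.
```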

Interestingly, both the length of a resident's education and her marital status play a more important role in determining whether she earns more than USD 50,000 per year than the capital gains or losses she has incurred.
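This reading can be confirmed from the Univariate Mutual Information column. The sketch below ranks a few factors by how informative each is about income on its own, using values copied from the table:

```python
# Univariate mutual information with 'Income' (in nats), copied from the table.
univariate_mi = {
    'Marital Status': 0.090822,
    'Education Num': 0.057012,
    'Capital Gain': 0.045092,
    'Capital Loss': 0.010281,
}

# Rank factors from most to least informative in isolation.
ranking = sorted(univariate_mi, key=univariate_mi.get, reverse=True)
print(ranking)
# ['Marital Status', 'Education Num', 'Capital Gain', 'Capital Loss']
```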

## Comparison With Model-Based Factor Importance

The kxy package is the only Python package capable of ex-ante and model-free factor importance analysis.

Other tools adopt a model-based approach: they quantify the extent to which a trained model relies on each input, rather than the extent to which inputs are intrinsically useful for solving the problem at hand. Their analyses may therefore vary with the model used and are, in that sense, subjective.

This dataset was previously studied using the tree-based SHAP method in this notebook.

Shapley-value-based studies are inherently model-based, as they explain the predictions made by a specific model. Moreover, they are intrinsically unrelated to accuracy or any other performance metric: the approach merely explains a model's predictions (of any model, although it scales better on tree-based models), with no regard to how well that model performs.

Our approach on the other hand quantifies the importance of each variable for effectively solving the problem at hand, in a model-free fashion.