Adult (UCI, Classification, n=48843, d=14, 2 classes)

Loading The Data

In [1]:
from kxy_datasets.uci_classifications import Adult # pip install kxy_datasets
In [2]:
dataset = Adult()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

-----------
Column: Age
-----------
Type:      Categorical
Frequency:  2.76%, Label: 36
Frequency:  2.74%, Label: 35
Frequency:  2.73%, Label: 33
Frequency:  2.72%, Label: 23
Frequency:  2.71%, Label: 31
Frequency:  2.67%, Label: 34
Frequency:  2.62%, Label: 28
Frequency:  2.62%, Label: 37
Frequency:  2.62%, Label: 30
Frequency:  2.59%, Label: 38
Frequency:  2.57%, Label: 32
Frequency:  2.53%, Label: 41
Frequency:  2.52%, Label: 27
Frequency:  2.50%, Label: 29
Frequency:  2.47%, Label: 24
Frequency:  2.47%, Label: 39
Frequency:  2.45%, Label: 25
Frequency:  2.43%, Label: 40
Frequency:  2.41%, Label: 22
Frequency:  2.39%, Label: 42
Frequency:  2.36%, Label: 26
Frequency:  2.28%, Label: 20
Frequency:  2.26%, Label: 43
Frequency:  2.25%, Label: 46
Frequency:  2.24%, Label: 21
Frequency:  2.24%, Label: 45
Frequency:  2.21%, Label: 47
Frequency:  2.18%, Label: 44
Frequency:  2.16%, Label: 19
Frequency:  1.80%, Label: 51
Frequency:  1.77%, Label: 50
Frequency:  1.76%, Label: 18
Frequency:  1.73%, Label: 49
Frequency:  1.73%, Label: 48
Frequency:  1.51%, Label: 52
Frequency:  1.46%, Label: 53
Frequency:  1.27%, Label: 55
Frequency:  1.26%, Label: 54
Frequency:  1.22%, Label: 17
Frequency:  1.15%, Label: 56
Frequency:  1.14%, Label: 58
Frequency:  1.13%, Label: 57
Other Labels: 9.37%

--------------------
Column: Capital Gain
--------------------
Type:   Continuous
Max:    99,999
p75:    0.0
Mean:   1,079
Median: 0.0
p25:    0.0
Min:    0.0

--------------------
Column: Capital Loss
--------------------
Type:   Continuous
Max:    4,356
p75:    0.0
Mean:   87
Median: 0.0
p25:    0.0
Min:    0.0

-----------------
Column: Education
-----------------
Type:      Categorical
Frequency: 32.32%, Label:  HS-grad
Frequency: 22.27%, Label:  Some-college
Frequency: 16.43%, Label:  Bachelors
Frequency:  5.44%, Label:  Masters
Frequency:  4.22%, Label:  Assoc-voc
Frequency:  3.71%, Label:  11th
Frequency:  3.28%, Label:  Assoc-acdm
Frequency:  2.84%, Label:  10th
Other Labels: 9.49%

---------------------
Column: Education Num
---------------------
Type:   Continuous
Max:    16
p75:    12
Mean:   10
Median: 10
p25:    9.0
Min:    1.0

--------------
Column: Fnlwgt
--------------
Type:   Continuous
Max:    1,490,400
p75:    237,642
Mean:   189,664
Median: 178,144
p25:    117,550
Min:    12,285

----------------------
Column: Hours Per Week
----------------------
Type:   Continuous
Max:    99
p75:    45
Mean:   40
Median: 40
p25:    40
Min:    1.0

--------------
Column: Income
--------------
Type:      Categorical
Frequency: 76.07%, Label:  <=50K
Frequency: 23.93%, Label:  >50K
Other Labels: 0.00%

----------------------
Column: Marital Status
----------------------
Type:      Categorical
Frequency: 45.82%, Label:  Married-civ-spouse
Frequency: 33.00%, Label:  Never-married
Frequency: 13.58%, Label:  Divorced
Other Labels: 7.60%

----------------------
Column: Native Country
----------------------
Type:      Categorical
Frequency: 89.74%, Label:  United-States
Frequency:  1.95%, Label:  Mexico
Other Labels: 8.31%

------------------
Column: Occupation
------------------
Type:      Categorical
Frequency: 12.64%, Label:  Prof-specialty
Frequency: 12.51%, Label:  Craft-repair
Frequency: 12.46%, Label:  Exec-managerial
Frequency: 11.49%, Label:  Adm-clerical
Frequency: 11.27%, Label:  Sales
Frequency: 10.08%, Label:  Other-service
Frequency:  6.19%, Label:  Machine-op-inspct
Frequency:  5.75%, Label:  ?
Frequency:  4.82%, Label:  Transport-moving
Frequency:  4.24%, Label:  Handlers-cleaners
Other Labels: 8.55%

------------
Column: Race
------------
Type:      Categorical
Frequency: 85.50%, Label:  White
Frequency:  9.59%, Label:  Black
Other Labels: 4.91%

--------------------
Column: Relationship
--------------------
Type:      Categorical
Frequency: 40.37%, Label:  Husband
Frequency: 25.76%, Label:  Not-in-family
Frequency: 15.52%, Label:  Own-child
Frequency: 10.49%, Label:  Unmarried
Other Labels: 7.86%

-----------
Column: Sex
-----------
Type:      Categorical
Frequency: 66.85%, Label:  Male
Frequency: 33.15%, Label:  Female
Other Labels: 0.00%

-----------------
Column: Workclass
-----------------
Type:      Categorical
Frequency: 69.42%, Label:  Private
Frequency:  7.91%, Label:  Self-emp-not-inc
Frequency:  6.42%, Label:  Local-gov
Frequency:  5.73%, Label:  ?
Frequency:  4.06%, Label:  State-gov
Other Labels: 6.47%
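
The categorical summaries above list each label with its share of rows. As a rough illustration of how such frequencies can be obtained, here is a minimal sketch in plain pandas. It is not how kxy computes its summary; the toy frame and the helper name `label_frequencies` are hypothetical and only stand in for `dataset.df`.

```python
import pandas as pd

# Hypothetical toy column; the real data would come from dataset.df.
toy = pd.DataFrame({"Sex": ["Male", "Male", "Female", "Male"]})

def label_frequencies(frame, column):
    """Return label frequencies in percent, most frequent first."""
    return frame[column].value_counts(normalize=True).mul(100).round(2)

freqs = label_frequencies(toy, "Sex")
for label, pct in freqs.items():
    # Mimic the layout of the summary lines shown above.
    print(f"Frequency: {pct:5.2f}%, Label: {label}")
```

Applied to a real column, the same call would give the per-label percentages; grouping rare labels under "Other Labels" is left out of this sketch.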

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable Accuracy
0                  0.67                                  0.00                 1.00

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
Selection Order  Variable        Running Achievable R-Squared  Running Achievable Accuracy
0                No Variable                             0.00                         0.76
1                Relationship                            0.21                         0.89
2                Capital Gain                            0.31                         0.91
3                Education                               0.37                         0.92
4                Age                                     0.40                         0.93
5                Hours Per Week                          0.49                         0.96
6                Occupation                              0.57                         0.98
7                Workclass                               0.60                         0.98
8                Race                                    0.61                         0.99
9                Capital Loss                            0.62                         0.99
10               Native Country                          0.62                         0.99
11               Marital Status                          0.62                         0.99
12               Sex                                     0.62                         0.99
13               Fnlwgt                                  0.63                         0.99
14               Education Num                           0.67                         1.00
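
The running values in the table above are cumulative, so the contribution of each variable is the difference between its row and the previous one. The following is a minimal sketch in plain Python that derives those marginal gains; the numbers are copied from the Out[5] table, and the names `running_r2` and `marginal_gains` are illustrative, not part of the kxy API.

```python
# Running achievable R-squared copied from the Out[5] table, in selection order.
running_r2 = [
    ("No Variable", 0.00),
    ("Relationship", 0.21),
    ("Capital Gain", 0.31),
    ("Education", 0.37),
    ("Age", 0.40),
    ("Hours Per Week", 0.49),
    ("Occupation", 0.57),
    ("Workclass", 0.60),
    ("Race", 0.61),
    ("Capital Loss", 0.62),
    ("Native Country", 0.62),
    ("Marital Status", 0.62),
    ("Sex", 0.62),
    ("Fnlwgt", 0.63),
    ("Education Num", 0.67),
]

def marginal_gains(rows):
    """Return (variable, gain) pairs: each row's increase over the previous row."""
    gains = []
    for (_, prev), (name, cur) in zip(rows, rows[1:]):
        # Round to two decimals to match the precision of the table.
        gains.append((name, round(cur - prev, 2)))
    return gains

gains = marginal_gains(running_r2)
for name, gain in gains:
    print(f"{name:<16} +{gain:.2f}")
```

Because the table is rounded to two decimals, these gains are approximate. On these figures the largest increments come from Relationship (0.21), Capital Gain (0.10), and Hours Per Week (0.09).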