Card Default (UCI, Classification, n=30000, d=23, 2 classes)

Loading The Data

In [1]:
from kxy_datasets.uci_classifications import CardDefault # pip install kxy_datasets
In [2]:
dataset = CardDefault()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

-----------
Column: AGE
-----------
Type:   Continuous
Max:    79
p75:    41
Mean:   35
Median: 34
p25:    28
Min:    21

-----------------
Column: BILL_AMT1
-----------------
Type:   Continuous
Max:    964,511
p75:    67,091
Mean:   51,223
Median: 22,381
p25:    3,558
Min:    -165580.0

-----------------
Column: BILL_AMT2
-----------------
Type:   Continuous
Max:    983,931
p75:    64,006
Mean:   49,179
Median: 21,200
p25:    2,984
Min:    -69777.0

-----------------
Column: BILL_AMT3
-----------------
Type:   Continuous
Max:    1,664,089
p75:    60,164
Mean:   47,013
Median: 20,088
p25:    2,666
Min:    -157264.0

-----------------
Column: BILL_AMT4
-----------------
Type:   Continuous
Max:    891,586
p75:    54,506
Mean:   43,262
Median: 19,052
p25:    2,326
Min:    -170000.0

-----------------
Column: BILL_AMT5
-----------------
Type:   Continuous
Max:    927,171
p75:    50,190
Mean:   40,311
Median: 18,104
p25:    1,763
Min:    -81334.0

-----------------
Column: BILL_AMT6
-----------------
Type:   Continuous
Max:    961,664
p75:    49,198
Mean:   38,871
Median: 17,071
p25:    1,256
Min:    -339603.0

-----------------
Column: EDUCATION
-----------------
Type:   Continuous
Max:    6.0
p75:    2.0
Mean:   1.9
Median: 2.0
p25:    1.0
Min:    0.0

-----------------
Column: LIMIT_BAL
-----------------
Type:   Continuous
Max:    1,000,000
p75:    240,000
Mean:   167,484
Median: 140,000
p25:    50,000
Min:    10,000

----------------
Column: MARRIAGE
----------------
Type:   Continuous
Max:    3.0
p75:    2.0
Mean:   1.6
Median: 2.0
p25:    1.0
Min:    0.0

-------------
Column: PAY_0
-------------
Type:   Continuous
Max:    8.0
p75:    0.0
Mean:   -0.0
Median: 0.0
p25:    -1.0
Min:    -2.0

-------------
Column: PAY_2
-------------
Type:   Continuous
Max:    8.0
p75:    0.0
Mean:   -0.1
Median: 0.0
p25:    -1.0
Min:    -2.0

-------------
Column: PAY_3
-------------
Type:   Continuous
Max:    8.0
p75:    0.0
Mean:   -0.2
Median: 0.0
p25:    -1.0
Min:    -2.0

-------------
Column: PAY_4
-------------
Type:   Continuous
Max:    8.0
p75:    0.0
Mean:   -0.2
Median: 0.0
p25:    -1.0
Min:    -2.0

-------------
Column: PAY_5
-------------
Type:   Continuous
Max:    8.0
p75:    0.0
Mean:   -0.3
Median: 0.0
p25:    -1.0
Min:    -2.0

-------------
Column: PAY_6
-------------
Type:   Continuous
Max:    8.0
p75:    0.0
Mean:   -0.3
Median: 0.0
p25:    -1.0
Min:    -2.0

----------------
Column: PAY_AMT1
----------------
Type:   Continuous
Max:    873,552
p75:    5,006
Mean:   5,663
Median: 2,100
p25:    1,000
Min:    0.0

----------------
Column: PAY_AMT2
----------------
Type:   Continuous
Max:    1,684,259
p75:    5,000
Mean:   5,921
Median: 2,009
p25:    833
Min:    0.0

----------------
Column: PAY_AMT3
----------------
Type:   Continuous
Max:    896,040
p75:    4,505
Mean:   5,225
Median: 1,800
p25:    390
Min:    0.0

----------------
Column: PAY_AMT4
----------------
Type:   Continuous
Max:    621,000
p75:    4,013
Mean:   4,826
Median: 1,500
p25:    296
Min:    0.0

----------------
Column: PAY_AMT5
----------------
Type:   Continuous
Max:    426,529
p75:    4,031
Mean:   4,799
Median: 1,500
p25:    252
Min:    0.0

----------------
Column: PAY_AMT6
----------------
Type:   Continuous
Max:    528,666
p75:    4,000
Mean:   5,215
Median: 1,500
p25:    117
Min:    0.0

-----------
Column: SEX
-----------
Type:   Continuous
Max:    2.0
p75:    2.0
Mean:   1.6
Median: 2.0
p25:    1.0
Min:    1.0

----------------------------------
Column: default payment next month
----------------------------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.2
Median: 0.0
p25:    0.0
Min:    0.0
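
The six statistics printed for each column above (max, 75th percentile, mean, median, 25th percentile, min) can be cross-checked with plain pandas. The sketch below is only an illustration: `summarize` is a hypothetical helper, the `toy` frame holds made-up values, and none of this is claimed to be how `df.kxy.describe()` is implemented.

```python
import pandas as pd


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Return Max, p75, Mean, Median, p25 and Min for every numeric column."""
    numeric = frame.select_dtypes(include="number")
    return pd.DataFrame({
        "Max": numeric.max(),
        "p75": numeric.quantile(0.75),
        "Mean": numeric.mean(),
        "Median": numeric.median(),
        "p25": numeric.quantile(0.25),
        "Min": numeric.min(),
    })


# Hypothetical toy data, used only to exercise the helper.
toy = pd.DataFrame({"AGE": [21, 28, 34, 41, 79], "SEX": [1, 2, 2, 1, 2]})
result = summarize(toy)
print(result)
```

On the real data, `summarize(df)` would give one row per column, which makes it easy to spot coded categorical columns such as SEX, EDUCATION and MARRIAGE whose values are small integers.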

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable Accuracy
0  0.75                  1.65e-01                              1.00
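
A useful reference point when reading an achievable accuracy is the accuracy of always predicting the most frequent class. The sketch below computes that baseline; `majority_baseline_accuracy` is a hypothetical helper that is not part of kxy, and the 78/22 toy labels are made up for illustration.

```python
import pandas as pd


def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    return float(labels.value_counts(normalize=True).max())


# Hypothetical toy labels: 78 non-defaults (0) and 22 defaults (1).
toy_labels = pd.Series([0] * 78 + [1] * 22)
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

On the real data the same call would be `majority_baseline_accuracy(df[y_column])`.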

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
Selection Order  Variable     Running Achievable R-Squared  Running Achievable Accuracy
0                No Variable  0.00                          0.78
1                PAY_0        0.13                          0.83
2                LIMIT_BAL    0.16                          0.84
3                AGE          0.28                          0.88
4                PAY_5        0.39                          0.92
5                EDUCATION    0.47                          0.95
6                SEX          0.52                          0.96
7                MARRIAGE     0.55                          0.97
8                PAY_3        0.58                          0.98
9                PAY_6        0.59                          0.98
10               PAY_AMT3     0.59                          0.98
11               PAY_4        0.59                          0.98
12               BILL_AMT4    0.60                          0.99
13               PAY_AMT2     0.60                          0.99
14               PAY_2        0.60                          0.99
15               PAY_AMT1     0.60                          0.99
16               BILL_AMT3    0.60                          0.99
17               BILL_AMT1    0.63                          1.00
18               PAY_AMT4     0.66                          1.00
19               BILL_AMT5    0.69                          1.00
20               BILL_AMT2    0.69                          1.00
21               PAY_AMT6     0.69                          1.00
22               BILL_AMT6    0.69                          1.00
23               PAY_AMT5     0.75                          1.00
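
Because the R-squared column in the table above is cumulative, the contribution of each variable is the difference between its row and the previous one. The sketch below does that subtraction on the values copied from the table; `marginal_gains` is a hypothetical helper, and rounding to two decimals simply matches the precision of the printed output.

```python
# Running achievable R-squared, copied from the variable-selection table above.
running_r2 = [
    ("PAY_0", 0.13), ("LIMIT_BAL", 0.16), ("AGE", 0.28), ("PAY_5", 0.39),
    ("EDUCATION", 0.47), ("SEX", 0.52), ("MARRIAGE", 0.55), ("PAY_3", 0.58),
    ("PAY_6", 0.59), ("PAY_AMT3", 0.59), ("PAY_4", 0.59), ("BILL_AMT4", 0.60),
    ("PAY_AMT2", 0.60), ("PAY_2", 0.60), ("PAY_AMT1", 0.60), ("BILL_AMT3", 0.60),
    ("BILL_AMT1", 0.63), ("PAY_AMT4", 0.66), ("BILL_AMT5", 0.69),
    ("BILL_AMT2", 0.69), ("PAY_AMT6", 0.69), ("BILL_AMT6", 0.69),
    ("PAY_AMT5", 0.75),
]


def marginal_gains(rows, baseline=0.0):
    """Turn cumulative scores into per-variable increments."""
    gains = []
    previous = baseline
    for name, value in rows:
        # Round to the two decimals shown in the table to hide float noise.
        gains.append((name, round(value - previous, 2)))
        previous = value
    return gains


gains = marginal_gains(running_r2)
for name, gain in gains:
    print(f"{name:<10} +{gain:.2f}")
```

Read this way, the first eight variables account for 0.58 of the 0.75 achievable R-squared, and several later variables add nothing at the printed precision.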