Card Default (UCI, Classification, n=30000, d=23, 2 classes)

Loading The Data

In [1]:

import kxy # Registers the df.kxy accessor on pandas dataframes; pip install kxy
from kxy_datasets.uci_classifications import CardDefault # pip install kxy_datasets

In [2]:

dataset = CardDefault()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
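
For downstream modeling, the target column is usually separated from the explanatory variables. The sketch below shows one way to do this with plain pandas. The toy dataframe and its values are hypothetical stand-ins so the snippet runs without downloading the dataset, and `split_features_target` is an illustrative helper, not part of `kxy_datasets`.

```python
import pandas as pd

def split_features_target(df: pd.DataFrame, y_column: str):
    """Split a dataframe into explanatory variables and the target column."""
    X = df.drop(columns=[y_column])
    y = df[y_column]
    return X, y

# Hypothetical toy frame that reuses three of the dataset's column names
toy_df = pd.DataFrame({
    "LIMIT_BAL": [20000, 120000, 90000],
    "AGE": [24, 26, 34],
    "default payment next month": [1, 1, 0],
})
X, y = split_features_target(toy_df, "default payment next month")
```

With the real data, the same call would presumably be `split_features_target(df, y_column)`.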

In [3]:

df.kxy.describe() # Visualize a summary of the data


-----------
Column: AGE
-----------
Type:   Continuous
Max:    79
p75:    41
Mean:   35
Median: 34
p25:    28
Min:    21

-----------------
Column: BILL_AMT1
-----------------
Type:   Continuous
Max:    964,511
p75:    67,091
Mean:   51,223
Median: 22,381
p25:    3,558
Min:    -165580.0

-----------------
Column: BILL_AMT2
-----------------
Type:   Continuous
Max:    983,931
p75:    64,006
Mean:   49,179
Median: 21,200
p25:    2,984
Min:    -69777.0

-----------------
Column: BILL_AMT3
-----------------
Type:   Continuous
Max:    1,664,089
p75:    60,164
Mean:   47,013
Median: 20,088
p25:    2,666
Min:    -157264.0

-----------------
Column: BILL_AMT4
-----------------
Type:   Continuous
Max:    891,586
p75:    54,506
Mean:   43,262
Median: 19,052
p25:    2,326
Min:    -170000.0

-----------------
Column: BILL_AMT5
-----------------
Type:   Continuous
Max:    927,171
p75:    50,190
Mean:   40,311
Median: 18,104
p25:    1,763
Min:    -81334.0

-----------------
Column: BILL_AMT6
-----------------
Type:   Continuous
Max:    961,664
p75:    49,198
Mean:   38,871
Median: 17,071
p25:    1,256
Min:    -339603.0

-----------------
Column: EDUCATION
-----------------
Type:   Continuous
Max:    6.0
p75:    2.0
Mean:   1.9
Median: 2.0
p25:    1.0
Min:    0.0

-----------------
Column: LIMIT_BAL
-----------------
Type:   Continuous
Max:    1,000,000
p75:    240,000
Mean:   167,484
Median: 140,000
p25:    50,000
Min:    10,000

----------------
Column: MARRIAGE
----------------
Type:   Continuous
Max:    3.0
p75:    2.0
Mean:   1.6
Median: 2.0
p25:    1.0
Min:    0.0

-------------
Column: PAY_0
-------------
Type:   Continuous
Max:    8.0
p75:    0.0
Mean:   -0.0
Median: 0.0
p25:    -1.0
Min:    -2.0

-------------
Column: PAY_2
-------------
Type:   Continuous
Max:    8.0
p75:    0.0
Mean:   -0.1
Median: 0.0
p25:    -1.0
Min:    -2.0

-------------
Column: PAY_3
-------------
Type:   Continuous
Max:    8.0
p75:    0.0
Mean:   -0.2
Median: 0.0
p25:    -1.0
Min:    -2.0

-------------
Column: PAY_4
-------------
Type:   Continuous
Max:    8.0
p75:    0.0
Mean:   -0.2
Median: 0.0
p25:    -1.0
Min:    -2.0

-------------
Column: PAY_5
-------------
Type:   Continuous
Max:    8.0
p75:    0.0
Mean:   -0.3
Median: 0.0
p25:    -1.0
Min:    -2.0

-------------
Column: PAY_6
-------------
Type:   Continuous
Max:    8.0
p75:    0.0
Mean:   -0.3
Median: 0.0
p25:    -1.0
Min:    -2.0

----------------
Column: PAY_AMT1
----------------
Type:   Continuous
Max:    873,552
p75:    5,006
Mean:   5,663
Median: 2,100
p25:    1,000
Min:    0.0

----------------
Column: PAY_AMT2
----------------
Type:   Continuous
Max:    1,684,259
p75:    5,000
Mean:   5,921
Median: 2,009
p25:    833
Min:    0.0

----------------
Column: PAY_AMT3
----------------
Type:   Continuous
Max:    896,040
p75:    4,505
Mean:   5,225
Median: 1,800
p25:    390
Min:    0.0

----------------
Column: PAY_AMT4
----------------
Type:   Continuous
Max:    621,000
p75:    4,013
Mean:   4,826
Median: 1,500
p25:    296
Min:    0.0

----------------
Column: PAY_AMT5
----------------
Type:   Continuous
Max:    426,529
p75:    4,031
Mean:   4,799
Median: 1,500
p25:    252
Min:    0.0

----------------
Column: PAY_AMT6
----------------
Type:   Continuous
Max:    528,666
p75:    4,000
Mean:   5,215
Median: 1,500
p25:    117
Min:    0.0

-----------
Column: SEX
-----------
Type:   Continuous
Max:    2.0
p75:    2.0
Mean:   1.6
Median: 2.0
p25:    1.0
Min:    1.0

----------------------------------
Column: default payment next month
----------------------------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.2
Median: 0.0
p25:    0.0
Min:    0.0
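
The six statistics listed for each column can also be computed with plain pandas, which is handy for spot-checking a single column. This is a minimal sketch, not how `df.kxy.describe()` is implemented; `summarize` is a hypothetical helper and the toy values are made up for illustration.

```python
import pandas as pd

def summarize(column: pd.Series) -> dict:
    """Return the six statistics shown in the summary above."""
    return {
        "Max": column.max(),
        "p75": column.quantile(0.75),
        "Mean": column.mean(),
        "Median": column.median(),
        "p25": column.quantile(0.25),
        "Min": column.min(),
    }

# Hypothetical toy column, not taken from the dataset
toy = pd.Series([21, 28, 34, 41, 79], name="AGE")
stats = summarize(toy)
```

Note that pandas interpolates linearly between observations when computing quantiles by default, so results on small samples may differ slightly from other percentile conventions.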

Data Valuation

In [4]:

df.kxy.data_valuation(y_column, problem_type=problem_type)

Out[4]:

	Achievable R-Squared	Achievable Log-Likelihood Per Sample	Achievable Accuracy
0	0.75	1.65e-01	1.00
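
A useful point of comparison for an achievable accuracy is the accuracy obtained by always predicting the most frequent class. The summary above reports a (rounded) mean of 0.2 for the target, so roughly four in five rows belong to class 0. The sketch below computes this baseline on hypothetical toy labels; `majority_class_accuracy` is an illustrative helper, not part of the `kxy` package.

```python
import pandas as pd

def majority_class_accuracy(y: pd.Series) -> float:
    """Accuracy of always predicting the most frequent class."""
    return float(y.value_counts(normalize=True).max())

# Hypothetical toy labels with 22% positives (assumption, for illustration only)
toy_y = pd.Series([0] * 78 + [1] * 22)
baseline = majority_class_accuracy(toy_y)
```

On the real data, the equivalent call would presumably be `majority_class_accuracy(df[y_column])`.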

Automatic (Model-Free) Variable Selection

In [5]:

df.kxy.variable_selection(y_column, problem_type=problem_type)

Out[5]:

Selection Order	Variable	Running Achievable R-Squared	Running Achievable Accuracy
0	No Variable	0.00	0.78
1	PAY_0	0.13	0.83
2	LIMIT_BAL	0.16	0.84
3	AGE	0.28	0.88
4	PAY_5	0.39	0.92
5	EDUCATION	0.47	0.95
6	SEX	0.52	0.96
7	MARRIAGE	0.55	0.97
8	PAY_3	0.58	0.98
9	PAY_6	0.59	0.98
10	PAY_AMT3	0.59	0.98
11	PAY_4	0.59	0.98
12	BILL_AMT4	0.60	0.99
13	PAY_AMT2	0.60	0.99
14	PAY_2	0.60	0.99
15	PAY_AMT1	0.60	0.99
16	BILL_AMT3	0.60	0.99
17	BILL_AMT1	0.63	1.00
18	PAY_AMT4	0.66	1.00
19	BILL_AMT5	0.69	1.00
20	BILL_AMT2	0.69	1.00
21	PAY_AMT6	0.69	1.00
22	BILL_AMT6	0.69	1.00
23	PAY_AMT5	0.75	1.00
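
One way to read this table is to look at how much each variable adds to the running achievable accuracy. The sketch below copies the first six rows of the table into pandas and takes successive differences. The `Gain` column name is introduced here for illustration and is not produced by `df.kxy.variable_selection`.

```python
import pandas as pd

# Running achievable accuracy copied from the first six rows of the table above
running = pd.DataFrame({
    "Variable": ["No Variable", "PAY_0", "LIMIT_BAL", "AGE", "PAY_5", "EDUCATION"],
    "Running Achievable Accuracy": [0.78, 0.83, 0.84, 0.88, 0.92, 0.95],
})

# Increase in achievable accuracy attributed to each newly selected variable
running["Gain"] = running["Running Achievable Accuracy"].diff().round(2)

# Drop the 'No Variable' row (no previous row to compare with) and rank by gain
top = running.dropna().sort_values("Gain", ascending=False)
```

On these rows, PAY_0 accounts for the largest single increase (0.05), while LIMIT_BAL adds only 0.01 once PAY_0 is already selected.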