Bank Marketing (UCI, Classification, n=41188, d=20, 2 classes)

Loading The Data

In [1]:
from kxy_datasets.uci_classifications import BankMarketing # pip install kxy_datasets
In [2]:
dataset = BankMarketing()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

-----------
Column: age
-----------
Type:   Continuous
Max:    98
p75:    47
Mean:   40
Median: 38
p25:    32
Min:    17

----------------
Column: campaign
----------------
Type:   Continuous
Max:    56
p75:    3.0
Mean:   2.6
Median: 2.0
p25:    1.0
Min:    1.0

---------------------
Column: cons.conf.idx
---------------------
Type:   Continuous
Max:    -26.9
p75:    -36.4
Mean:   -40.5
Median: -41.8
p25:    -42.7
Min:    -50.8

----------------------
Column: cons.price.idx
----------------------
Type:   Continuous
Max:    94
p75:    93
Mean:   93
Median: 93
p25:    93
Min:    92

---------------
Column: contact
---------------
Type:      Categorical
Frequency: 63.47%, Label: cellular
Frequency: 36.53%, Label: telephone

-------------------
Column: day_of_week
-------------------
Type:      Categorical
Frequency: 20.94%, Label: thu
Frequency: 20.67%, Label: mon
Frequency: 19.75%, Label: wed
Frequency: 19.64%, Label: tue
Frequency: 19.00%, Label: fri

---------------
Column: default
---------------
Type:      Categorical
Frequency: 79.12%, Label: no
Frequency: 20.87%, Label: unknown
Other Labels: 0.01%

----------------
Column: duration
----------------
Type:   Continuous
Max:    4,918
p75:    319
Mean:   258
Median: 180
p25:    102
Min:    0.0

-----------------
Column: education
-----------------
Type:      Categorical
Frequency: 29.54%, Label: university.degree
Frequency: 23.10%, Label: high.school
Frequency: 14.68%, Label: basic.9y
Frequency: 12.73%, Label: professional.course
Frequency: 10.14%, Label: basic.4y
Other Labels: 9.81%

--------------------
Column: emp.var.rate
--------------------
Type:   Continuous
Max:    1.4
p75:    1.4
Mean:   0.1
Median: 1.1
p25:    -1.8
Min:    -3.4

-----------------
Column: euribor3m
-----------------
Type:   Continuous
Max:    5.0
p75:    5.0
Mean:   3.6
Median: 4.9
p25:    1.3
Min:    0.6

---------------
Column: housing
---------------
Type:      Categorical
Frequency: 52.38%, Label: yes
Frequency: 45.21%, Label: no
Other Labels: 2.40%

-----------
Column: job
-----------
Type:      Categorical
Frequency: 25.30%, Label: admin.
Frequency: 22.47%, Label: blue-collar
Frequency: 16.37%, Label: technician
Frequency:  9.64%, Label: services
Frequency:  7.10%, Label: management
Frequency:  4.18%, Label: retired
Frequency:  3.54%, Label: entrepreneur
Frequency:  3.45%, Label: self-employed
Other Labels: 7.96%

------------
Column: loan
------------
Type:      Categorical
Frequency: 82.43%, Label: no
Frequency: 15.17%, Label: yes
Other Labels: 2.40%

---------------
Column: marital
---------------
Type:      Categorical
Frequency: 60.52%, Label: married
Frequency: 28.09%, Label: single
Frequency: 11.20%, Label: divorced
Other Labels: 0.19%

-------------
Column: month
-------------
Type:      Categorical
Frequency: 33.43%, Label: may
Frequency: 17.42%, Label: jul
Frequency: 15.00%, Label: aug
Frequency: 12.91%, Label: jun
Frequency:  9.96%, Label: nov
Frequency:  6.39%, Label: apr
Other Labels: 4.89%

-------------------
Column: nr.employed
-------------------
Type:   Continuous
Max:    5,228
p75:    5,228
Mean:   5,167
Median: 5,191
p25:    5,099
Min:    4,963

-------------
Column: pdays
-------------
Type:   Continuous
Max:    999
p75:    999
Mean:   962
Median: 999
p25:    999
Min:    0.0

----------------
Column: poutcome
----------------
Type:      Categorical
Frequency: 86.34%, Label: nonexistent
Frequency: 10.32%, Label: failure
Other Labels: 3.33%

----------------
Column: previous
----------------
Type:   Continuous
Max:    7.0
p75:    0.0
Mean:   0.2
Median: 0.0
p25:    0.0
Min:    0.0

---------
Column: y
---------
Type:      Categorical
Frequency: 88.73%, Label: no
Frequency: 11.27%, Label: yes
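
For readers without the kxy package, a summary in the same spirit can be approximated with plain pandas. This is a minimal sketch, not how `df.kxy.describe()` is implemented: the helper name `summarize`, the toy frame, and the rule "numeric dtype means continuous, anything else means categorical" are all assumptions made for illustration.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Hypothetical helper: quantiles for numeric columns,
    label frequencies (in percent) for all other columns."""
    out = {}
    for col in sorted(df.columns):
        s = df[col]
        if pd.api.types.is_numeric_dtype(s):
            out[col] = {
                "Type": "Continuous",
                "Max": s.max(),
                "p75": s.quantile(0.75),
                "Mean": s.mean(),
                "Median": s.median(),
                "p25": s.quantile(0.25),
                "Min": s.min(),
            }
        else:
            # Relative frequency of each label, most frequent first.
            freq = s.value_counts(normalize=True) * 100.0
            out[col] = {
                "Type": "Categorical",
                "Frequency": freq.round(2).to_dict(),
            }
    return out


# Toy data standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "age": [30, 40, 50, 60],
    "y": ["no", "no", "no", "yes"],
})
summary = summarize(toy)
```

On the real data, `summarize(df)` would be the corresponding call, with the caveat that integer-coded categorical columns would be treated as continuous under this dtype rule.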

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable Accuracy
0                  0.72                             2.85e-01                 1.00

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
                       Variable  Running Achievable R-Squared  Running Achievable Accuracy
Selection Order
0                   No Variable                          0.00                         0.89
1                      duration                          0.16                         0.93
2                     euribor3m                          0.48                         1.00
3                           age                          0.48                         1.00
4                           job                          0.48                         1.00
5                       housing                          0.48                         1.00
6                       marital                          0.48                         1.00
7                     education                          0.48                         1.00
8                       default                          0.48                         1.00
9                          loan                          0.48                         1.00
10                      contact                          0.48                         1.00
11                        month                          0.48                         1.00
12                  day_of_week                          0.48                         1.00
13                     campaign                          0.48                         1.00
14                        pdays                          0.48                         1.00
15                     previous                          0.48                         1.00
16                     poutcome                          0.48                         1.00
17                 emp.var.rate                          0.49                         1.00
18               cons.price.idx                          0.50                         1.00
19                cons.conf.idx                          0.51                         1.00
20                  nr.employed                          0.72                         1.00
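
The "No Variable" row can be sanity-checked against the class frequencies of `y` in the data summary: always predicting the majority label `no` is correct 88.73% of the time, which rounds to the 0.89 shown. The sketch below does that check and also derives the increment in running achievable R-squared contributed at each selection step. It uses only numbers copied from the outputs above; the names `class_freq`, `running_r2`, `gains` and `top` are illustrative and not part of the kxy API.

```python
# Class frequencies of y, copied from the data summary (in percent).
class_freq = {"no": 88.73, "yes": 11.27}

# Accuracy of always predicting the most frequent label.
baseline_accuracy = max(class_freq.values()) / 100.0

# (variable, running achievable R-squared), copied from the selection table.
running_r2 = [
    ("No Variable", 0.00), ("duration", 0.16), ("euribor3m", 0.48),
    ("age", 0.48), ("job", 0.48), ("housing", 0.48), ("marital", 0.48),
    ("education", 0.48), ("default", 0.48), ("loan", 0.48),
    ("contact", 0.48), ("month", 0.48), ("day_of_week", 0.48),
    ("campaign", 0.48), ("pdays", 0.48), ("previous", 0.48),
    ("poutcome", 0.48), ("emp.var.rate", 0.49), ("cons.price.idx", 0.50),
    ("cons.conf.idx", 0.51), ("nr.employed", 0.72),
]

# Increment over the previous row, rounded to the table's two decimals.
gains = {
    name: round(r2 - prev_r2, 2)
    for (name, r2), (_, prev_r2) in zip(running_r2[1:], running_r2[:-1])
}

# The three selection steps with the largest increments.
top = sorted(gains, key=gains.get, reverse=True)[:3]
```

These increments depend on the selection order: each one reflects what a variable adds on top of those selected before it, so the late jump at `nr.employed` should not be read as a standalone importance score.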