Bank Marketing (UCI, Classification, n=41188, d=20, 2 classes)

Loading the Data

In [1]:

import kxy # Registers the `kxy` accessor on pandas dataframes (pip install kxy)
from kxy_datasets.uci_classifications import BankMarketing # pip install kxy_datasets

In [2]:

dataset = BankMarketing()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'

In [3]:

df.kxy.describe() # Visualize a summary of the data


-----------
Column: age
-----------
Type:   Continuous
Max:    98
p75:    47
Mean:   40
Median: 38
p25:    32
Min:    17

----------------
Column: campaign
----------------
Type:   Continuous
Max:    56
p75:    3.0
Mean:   2.6
Median: 2.0
p25:    1.0
Min:    1.0

---------------------
Column: cons.conf.idx
---------------------
Type:   Continuous
Max:    -26.9
p75:    -36.4
Mean:   -40.5
Median: -41.8
p25:    -42.7
Min:    -50.8

----------------------
Column: cons.price.idx
----------------------
Type:   Continuous
Max:    94
p75:    93
Mean:   93
Median: 93
p25:    93
Min:    92

---------------
Column: contact
---------------
Type:      Categorical
Frequency: 63.47%, Label: cellular
Frequency: 36.53%, Label: telephone

-------------------
Column: day_of_week
-------------------
Type:      Categorical
Frequency: 20.94%, Label: thu
Frequency: 20.67%, Label: mon
Frequency: 19.75%, Label: wed
Frequency: 19.64%, Label: tue
Frequency: 19.00%, Label: fri

---------------
Column: default
---------------
Type:      Categorical
Frequency: 79.12%, Label: no
Frequency: 20.87%, Label: unknown
Other Labels: 0.01%

----------------
Column: duration
----------------
Type:   Continuous
Max:    4,918
p75:    319
Mean:   258
Median: 180
p25:    102
Min:    0.0

-----------------
Column: education
-----------------
Type:      Categorical
Frequency: 29.54%, Label: university.degree
Frequency: 23.10%, Label: high.school
Frequency: 14.68%, Label: basic.9y
Frequency: 12.73%, Label: professional.course
Frequency: 10.14%, Label: basic.4y
Other Labels: 9.81%

--------------------
Column: emp.var.rate
--------------------
Type:   Continuous
Max:    1.4
p75:    1.4
Mean:   0.1
Median: 1.1
p25:    -1.8
Min:    -3.4

-----------------
Column: euribor3m
-----------------
Type:   Continuous
Max:    5.0
p75:    5.0
Mean:   3.6
Median: 4.9
p25:    1.3
Min:    0.6

---------------
Column: housing
---------------
Type:      Categorical
Frequency: 52.38%, Label: yes
Frequency: 45.21%, Label: no
Other Labels: 2.40%

-----------
Column: job
-----------
Type:      Categorical
Frequency: 25.30%, Label: admin.
Frequency: 22.47%, Label: blue-collar
Frequency: 16.37%, Label: technician
Frequency:  9.64%, Label: services
Frequency:  7.10%, Label: management
Frequency:  4.18%, Label: retired
Frequency:  3.54%, Label: entrepreneur
Frequency:  3.45%, Label: self-employed
Other Labels: 7.96%

------------
Column: loan
------------
Type:      Categorical
Frequency: 82.43%, Label: no
Frequency: 15.17%, Label: yes
Other Labels: 2.40%

---------------
Column: marital
---------------
Type:      Categorical
Frequency: 60.52%, Label: married
Frequency: 28.09%, Label: single
Frequency: 11.20%, Label: divorced
Other Labels: 0.19%

-------------
Column: month
-------------
Type:      Categorical
Frequency: 33.43%, Label: may
Frequency: 17.42%, Label: jul
Frequency: 15.00%, Label: aug
Frequency: 12.91%, Label: jun
Frequency:  9.96%, Label: nov
Frequency:  6.39%, Label: apr
Other Labels: 4.89%

-------------------
Column: nr.employed
-------------------
Type:   Continuous
Max:    5,228
p75:    5,228
Mean:   5,167
Median: 5,191
p25:    5,099
Min:    4,963

-------------
Column: pdays
-------------
Type:   Continuous
Max:    999
p75:    999
Mean:   962
Median: 999
p25:    999
Min:    0.0

----------------
Column: poutcome
----------------
Type:      Categorical
Frequency: 86.34%, Label: nonexistent
Frequency: 10.32%, Label: failure
Other Labels: 3.33%

----------------
Column: previous
----------------
Type:   Continuous
Max:    7.0
p75:    0.0
Mean:   0.2
Median: 0.0
p25:    0.0
Min:    0.0

---------
Column: y
---------
Type:      Categorical
Frequency: 88.73%, Label: no
Frequency: 11.27%, Label: yes
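
The summary above reports quantiles, mean, and range for continuous columns, and label frequencies for categorical ones. As a rough illustration of what such a report contains, here is a minimal sketch in plain pandas. It is not the `kxy` implementation: the helper `describe_column`, its `max_labels` parameter, and the toy dataframe are all hypothetical and chosen only for this example.

```python
import pandas as pd


def describe_column(series: pd.Series, max_labels: int = 5) -> dict:
    """Summarize one column in the spirit of the report above (hypothetical helper)."""
    if pd.api.types.is_numeric_dtype(series):
        # Continuous column: range, quartiles and mean.
        q = series.quantile([0.25, 0.5, 0.75])
        return {
            "type": "Continuous",
            "max": series.max(),
            "p75": q.loc[0.75],
            "mean": series.mean(),
            "median": q.loc[0.5],
            "p25": q.loc[0.25],
            "min": series.min(),
        }
    # Categorical column: relative frequency of the most common labels,
    # with the remaining mass lumped together as "other".
    freqs = series.value_counts(normalize=True)
    top = freqs.head(max_labels)
    return {
        "type": "Categorical",
        "frequencies": top.to_dict(),
        "other": 1.0 - top.sum(),
    }


# Toy data, made up for illustration (not the Bank Marketing dataset).
toy = pd.DataFrame(
    {
        "age": [17, 32, 38, 47, 98],
        "contact": ["cellular", "cellular", "cellular", "telephone", "telephone"],
    }
)
summary = {col: describe_column(toy[col]) for col in toy.columns}
for col, stats in summary.items():
    print(col, stats)
```

The numeric/categorical split here relies on the column dtype alone, which is a simplifying assumption; a real tool may use other criteria.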

Data Valuation

In [4]:

df.kxy.data_valuation(y_column, problem_type=problem_type)

Out[4]:

	Achievable R-Squared	Achievable Log-Likelihood Per Sample	Achievable Accuracy
0	0.72	2.85e-01	1.00

Automatic (Model-Free) Variable Selection

In [5]:

df.kxy.variable_selection(y_column, problem_type=problem_type)

Out[5]:

Selection Order	Variable	Running Achievable R-Squared	Running Achievable Accuracy
0	No Variable	0.00	0.89
1	duration	0.16	0.93
2	euribor3m	0.48	1.00
3	age	0.48	1.00
4	job	0.48	1.00
5	housing	0.48	1.00
6	marital	0.48	1.00
7	education	0.48	1.00
8	default	0.48	1.00
9	loan	0.48	1.00
10	contact	0.48	1.00
11	month	0.48	1.00
12	day_of_week	0.48	1.00
13	campaign	0.48	1.00
14	pdays	0.48	1.00
15	previous	0.48	1.00
16	poutcome	0.48	1.00
17	emp.var.rate	0.49	1.00
18	cons.price.idx	0.50	1.00
19	cons.conf.idx	0.51	1.00
20	nr.employed	0.72	1.00
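
The 0.89 accuracy in the "No Variable" row appears to match the share of the majority class of `y` in the data summary (88.73% `no`): a classifier that ignores every input and always predicts the most frequent label would be right that often. The sketch below reproduces this arithmetic from the printed frequencies. It is an interpretation of the table, not a description of how `kxy` computes the figure.

```python
import pandas as pd

# Class frequencies of `y`, copied from the data summary in this notebook.
class_freq = pd.Series({"no": 0.8873, "yes": 0.1127})

# A model with no inputs can do no better than always predicting the
# most frequent class, so its accuracy equals that class's frequency.
majority_label = class_freq.idxmax()
baseline_accuracy = float(class_freq.max())

print(f"Always predict '{majority_label}': accuracy = {baseline_accuracy:.2f}")
```

Read this way, the later rows of the table show how much each added variable can raise achievable accuracy above this baseline.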