Heart Attack (Kaggle, Classification, n=303, d=13, 2 classes)

Loading The Data

In [1]:
from kxy_datasets.kaggle_classifications import HeartAttack # pip install kxy_datasets
In [2]:
dataset = HeartAttack()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

-----------
Column: age
-----------
Type:   Continuous
Max:    77
p75:    61
Mean:   54
Median: 55
p25:    47
Min:    29

-----------
Column: caa
-----------
Type:   Continuous
Max:    4.0
p75:    1.0
Mean:   0.7
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: chol
------------
Type:   Continuous
Max:    564
p75:    274
Mean:   246
Median: 240
p25:    211
Min:    126

----------
Column: cp
----------
Type:   Continuous
Max:    3.0
p75:    2.0
Mean:   1.0
Median: 1.0
p25:    0.0
Min:    0.0

------------
Column: exng
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.3
Median: 0.0
p25:    0.0
Min:    0.0

-----------
Column: fbs
-----------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

---------------
Column: oldpeak
---------------
Type:   Continuous
Max:    6.2
p75:    1.6
Mean:   1.0
Median: 0.8
p25:    0.0
Min:    0.0

--------------
Column: output
--------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.5
Median: 1.0
p25:    0.0
Min:    0.0

---------------
Column: restecg
---------------
Type:   Continuous
Max:    2.0
p75:    1.0
Mean:   0.5
Median: 1.0
p25:    0.0
Min:    0.0

-----------
Column: sex
-----------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.7
Median: 1.0
p25:    0.0
Min:    0.0

-----------
Column: slp
-----------
Type:   Continuous
Max:    2.0
p75:    2.0
Mean:   1.4
Median: 1.0
p25:    1.0
Min:    0.0

----------------
Column: thalachh
----------------
Type:   Continuous
Max:    202
p75:    166
Mean:   149
Median: 153
p25:    133
Min:    71

-------------
Column: thall
-------------
Type:   Continuous
Max:    3.0
p75:    3.0
Mean:   2.3
Median: 2.0
p25:    2.0
Min:    0.0

--------------
Column: trtbps
--------------
Type:   Continuous
Max:    200
p75:    140
Mean:   131
Median: 130
p25:    120
Min:    94
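
The summary above labels every column as Continuous, yet ranges such as 0 to 1 for sex, fbs and exng suggest that some columns may be coded categories. One way to check is to count distinct values per column. The sketch below uses plain pandas on a small made-up frame; the sample values, the helper name low_cardinality_columns and the max_unique threshold are assumptions for illustration and are not part of kxy.

```python
import pandas as pd

# Hypothetical stand-in for a few columns of the dataset; values are illustrative only.
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 44],
    "sex": [1, 1, 0, 1, 0, 1],
    "cp": [3, 2, 1, 1, 0, 1],
    "chol": [233, 250, 204, 236, 354, 192],
})


def low_cardinality_columns(frame, max_unique=5):
    """Return, sorted by name, the columns with at most max_unique distinct values."""
    counts = frame.nunique()
    return sorted(counts[counts <= max_unique].index)


# Columns that take only a handful of values are candidates for categorical treatment.
candidates = low_cardinality_columns(sample, max_unique=4)
print(candidates)
```

On the real data the same helper could be called as low_cardinality_columns(df), with a threshold chosen to suit the dataset.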

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable Accuracy
0                  0.64                             -1.85e-01                 0.95
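
A useful reference point for an achievable accuracy is the accuracy of always predicting the most frequent class, which needs no explanatory variable at all. The sketch below computes that baseline with pandas on a hypothetical label vector; the labels and the helper name majority_class_accuracy are assumptions for illustration, not kxy output.

```python
import pandas as pd

# Hypothetical binary labels; on the real data this would be df[y_column].
labels = pd.Series([1, 0, 1, 1, 0, 1, 0, 1])


def majority_class_accuracy(y):
    """Accuracy of a constant predictor that always outputs the most frequent class."""
    return y.value_counts(normalize=True).max()


baseline = majority_class_accuracy(labels)
print(baseline)
```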

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
                    Variable  Running Achievable R-Squared  Running Achievable Accuracy
Selection Order
0                No Variable                          0.00                         0.54
1                      thall                          0.25                         0.76
2                        caa                          0.39                         0.84
3                    oldpeak                          0.39                         0.84
4                         cp                          0.48                         0.88
5                        slp                          0.55                         0.92
6                    restecg                          0.60                         0.94
7                        sex                          0.64                         0.95
8                       exng                          0.64                         0.95
9                        fbs                          0.64                         0.95
10                  thalachh                          0.64                         0.95
11                    trtbps                          0.64                         0.95
12                      chol                          0.64                         0.95
13                       age                          0.64                         0.95
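
The running accuracy column can be turned into per-variable increments to see where the gains level off. The sketch below copies the values from the table above into a pandas frame and takes successive differences. The column names variable, running_accuracy and gain are chosen here for illustration. Because the table is rounded to two decimals, a gain of zero only means "no gain at the displayed precision".

```python
import pandas as pd

# Running achievable accuracy copied from the variable selection table.
selection = pd.DataFrame({
    "variable": [
        "No Variable", "thall", "caa", "oldpeak", "cp", "slp", "restecg",
        "sex", "exng", "fbs", "thalachh", "trtbps", "chol", "age",
    ],
    "running_accuracy": [
        0.54, 0.76, 0.84, 0.84, 0.88, 0.92, 0.94,
        0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.95,
    ],
})

# Incremental accuracy contributed by each variable, given those selected before it.
selection["gain"] = selection["running_accuracy"].diff().round(2)

# Variables whose increment is zero at two-decimal precision.
no_gain = selection.loc[selection["gain"] == 0, "variable"].tolist()
print(selection)
print(no_gain)
```

Read this way, the first seven variables account for the whole rise from 0.54 to 0.95, and the six variables selected after sex add nothing visible at this precision.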