Water Quality (Kaggle, Classification, n=3276, d=9, 2 classes)

Loading The Data

In [1]:
import kxy # pip install kxy; importing it registers the df.kxy accessor used below
from kxy_datasets.kaggle_classifications import WaterQuality # pip install kxy_datasets
In [2]:
dataset = WaterQuality()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

-------------------
Column: Chloramines
-------------------
Type:   Continuous
Max:    13
p75:    8.1
Mean:   7.1
Median: 7.1
p25:    6.1
Min:    0.4

--------------------
Column: Conductivity
--------------------
Type:   Continuous
Max:    753
p75:    481
Mean:   426
Median: 421
p25:    365
Min:    181

----------------
Column: Hardness
----------------
Type:   Continuous
Max:    323
p75:    216
Mean:   196
Median: 196
p25:    176
Min:    47

----------------------
Column: Organic_carbon
----------------------
Type:   Continuous
Max:    28
p75:    16
Mean:   14
Median: 14
p25:    12
Min:    2.2

------------------
Column: Potability
------------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.4
Median: 0.0
p25:    0.0
Min:    0.0

--------------
Column: Solids
--------------
Type:   Continuous
Max:    61,227
p75:    27,332
Mean:   22,014
Median: 20,927
p25:    15,666
Min:    320

---------------
Column: Sulfate
---------------
Type:   Continuous
Max:    481
p75:    359
Mean:   333
Median: 333
p25:    307
Min:    129

-----------------------
Column: Trihalomethanes
-----------------------
Type:   Continuous
Max:    124
p75:    77
Mean:   66
Median: 66
p25:    55
Min:    0.7

-----------------
Column: Turbidity
-----------------
Type:   Continuous
Max:    6.7
p75:    4.5
Mean:   4.0
Median: 4.0
p25:    3.4
Min:    1.4

----------
Column: ph
----------
Type:   Continuous
Max:    13
p75:    8.1
Mean:   7.1
Median: 7.0
p25:    6.1
Min:    0.0

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable Accuracy
0                  0.04                             -6.50e-01                 0.65
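
One way to put the achievable accuracy in context is to compare it with the accuracy of always predicting the most frequent class. The sketch below computes that baseline in plain Python. The helper name `majority_class_accuracy` and the toy labels are illustrative assumptions, not part of the kxy API; the toy labels only mirror the rounded Potability mean of 0.4 reported in the summary above.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    # most_common(1) returns [(label, count)] for the most frequent label.
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)


# Toy labels (hypothetical): 6 negatives and 4 positives, mirroring the
# rounded Potability mean of 0.4 shown by df.kxy.describe().
toy_labels = [0] * 6 + [1] * 4
baseline = majority_class_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

On the real data, the same helper could be applied to the target column, for example `majority_class_accuracy(df[y_column])`.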

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
                        Variable  Running Achievable R-Squared  Running Achievable Accuracy
Selection Order
0                    No Variable                          0.00                         0.61
1                             ph                          0.01                         0.62
2                        Sulfate                          0.04                         0.65
3                    Chloramines                          0.04                         0.65
4                       Hardness                          0.04                         0.65
5                         Solids                          0.04                         0.65
6                 Organic_carbon                          0.04                         0.65
7                   Conductivity                          0.04                         0.65
8                      Turbidity                          0.04                         0.65
9                Trihalomethanes                          0.04                         0.65
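
At the displayed precision, the running achievable accuracy in Out[5] stops improving after the first two variables. A minimal sketch of how one might pick that cutoff programmatically is shown below, assuming the rows of Out[5] have been copied by hand into a plain list. The helper `variables_until_plateau` and its `tol` threshold are hypothetical and not part of kxy.

```python
# Rows copied from Out[5]: (variable, running achievable accuracy).
selection = [
    ("No Variable", 0.61),
    ("ph", 0.62),
    ("Sulfate", 0.65),
    ("Chloramines", 0.65),
    ("Hardness", 0.65),
    ("Solids", 0.65),
    ("Organic_carbon", 0.65),
    ("Conductivity", 0.65),
    ("Turbidity", 0.65),
    ("Trihalomethanes", 0.65),
]


def variables_until_plateau(rows, tol=0.005):
    """Keep variables, in selection order, until the accuracy gain drops below tol.

    rows[0] is expected to be the 'No Variable' baseline row.
    """
    kept = []
    previous = rows[0][1]  # achievable accuracy with no variable
    for name, accuracy in rows[1:]:
        if accuracy - previous < tol:
            break  # first variable that adds less than tol: stop here
        kept.append(name)
        previous = accuracy
    return kept


selected = variables_until_plateau(selection)
print(selected)
```

Because the table values are rounded to two decimals, the cutoff found this way is only as reliable as that rounding; a smaller `tol` would need the unrounded values.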