White Wine Quality (UCI, Regression, n=4898, d=11)

Loading The Data

In [1]:
import kxy # Registers the df.kxy pandas accessor used below (pip install kxy)
from kxy_datasets.uci_regressions import WhiteWineQuality # pip install kxy_datasets
In [2]:
dataset = WhiteWineQuality()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

---------------
Column: alcohol
---------------
Type:   Continuous
Max:    14
p75:    11
Mean:   10
Median: 10
p25:    9.5
Min:    8.0

-----------------
Column: chlorides
-----------------
Type:   Continuous
Max:    0.3
p75:    0.1
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

-------------------
Column: citric acid
-------------------
Type:   Continuous
Max:    1.7
p75:    0.4
Mean:   0.3
Median: 0.3
p25:    0.3
Min:    0.0

---------------
Column: density
---------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   1.0
Median: 1.0
p25:    1.0
Min:    1.0

---------------------
Column: fixed acidity
---------------------
Type:   Continuous
Max:    14
p75:    7.3
Mean:   6.9
Median: 6.8
p25:    6.3
Min:    3.8

---------------------------
Column: free sulfur dioxide
---------------------------
Type:   Continuous
Max:    289
p75:    46
Mean:   35
Median: 34
p25:    23
Min:    2.0

----------
Column: pH
----------
Type:   Continuous
Max:    3.8
p75:    3.3
Mean:   3.2
Median: 3.2
p25:    3.1
Min:    2.7

---------------
Column: quality
---------------
Type:   Continuous
Max:    9.0
p75:    6.0
Mean:   5.9
Median: 6.0
p25:    5.0
Min:    3.0

----------------------
Column: residual sugar
----------------------
Type:   Continuous
Max:    65
p75:    9.9
Mean:   6.4
Median: 5.2
p25:    1.7
Min:    0.6

-----------------
Column: sulphates
-----------------
Type:   Continuous
Max:    1.1
p75:    0.6
Mean:   0.5
Median: 0.5
p25:    0.4
Min:    0.2

----------------------------
Column: total sulfur dioxide
----------------------------
Type:   Continuous
Max:    440
p75:    167
Mean:   138
Median: 134
p25:    108
Min:    9.0

------------------------
Column: volatile acidity
------------------------
Type:   Continuous
Max:    1.1
p75:    0.3
Mean:   0.3
Median: 0.3
p25:    0.2
Min:    0.1

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable RMSE
0                  0.33                                 -1.41         7.27e-01
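
The achievable RMSE in Out[4] can be cross-checked against the achievable R-squared through the standard identity R^2 = 1 - MSE / Var(y), which gives RMSE = std(y) * sqrt(1 - R^2). The sketch below does this with the rounded figures printed on this page. It assumes that the 0.89 in the `No Variable` row of Out[5] is the standard deviation of `quality`, and the helper name `rmse_from_r2` is made up for illustration; it is not part of the kxy API.

```python
import math


def rmse_from_r2(r_squared, target_std):
    """RMSE implied by an R-squared value, using R^2 = 1 - MSE / Var(y)."""
    return target_std * math.sqrt(1.0 - r_squared)


# Rounded values copied from the outputs on this page.
target_std = 0.89     # Assumed: RMSE with no variables, i.e. the std of quality
achievable_r2 = 0.33  # Achievable R-Squared from Out[4]

implied_rmse = rmse_from_r2(achievable_r2, target_std)
print(f"{implied_rmse:.2f}")  # 0.73
```

Because both inputs are rounded to two decimals, the result only agrees with the reported 7.27e-01 to about two significant figures.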

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
Selection Order  Variable              Running Achievable R-Squared  Running Achievable RMSE
0                No Variable           0.00                          0.89
1                alcohol               0.26                          0.76
2                volatile acidity      0.29                          0.75
3                free sulfur dioxide   0.29                          0.75
4                pH                    0.32                          0.73
5                residual sugar        0.32                          0.73
6                citric acid           0.32                          0.73
7                density               0.33                          0.73
8                fixed acidity         0.33                          0.73
9                sulphates             0.33                          0.73
10               chlorides             0.33                          0.73
11               total sulfur dioxide  0.33                          0.73
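
One way to read the selection table is to look at how much each variable adds to the running achievable R-squared and to stop once most of the total is reached. The following is a minimal sketch of that reading, not something the kxy package provides: the values are copied by hand from Out[5], and the helper names `marginal_gains` and `smallest_prefix_reaching`, as well as the 95% cutoff, are assumptions chosen for illustration.

```python
# (variable, running achievable R-squared), copied from Out[5] in selection order.
running_r2 = [
    ("No Variable", 0.00),
    ("alcohol", 0.26),
    ("volatile acidity", 0.29),
    ("free sulfur dioxide", 0.29),
    ("pH", 0.32),
    ("residual sugar", 0.32),
    ("citric acid", 0.32),
    ("density", 0.33),
    ("fixed acidity", 0.33),
    ("sulphates", 0.33),
    ("chlorides", 0.33),
    ("total sulfur dioxide", 0.33),
]


def marginal_gains(rows):
    """Increase in running R-squared contributed by each selected variable."""
    return [
        (name, round(current - previous, 2))
        for (_, previous), (name, current) in zip(rows, rows[1:])
    ]


def smallest_prefix_reaching(rows, fraction=0.95):
    """Shortest leading run of variables whose running R-squared reaches
    `fraction` of the final value in the table."""
    target = fraction * rows[-1][1]
    for position, (_, r_squared) in enumerate(rows):
        if position > 0 and r_squared >= target:
            return [name for name, _ in rows[1:position + 1]]
    return [name for name, _ in rows[1:]]


gains = marginal_gains(running_r2)
selected = smallest_prefix_reaching(running_r2, fraction=0.95)

for name, gain in gains:
    print(f"{name:<22}{gain:+.2f}")
print(selected)
```

With the rounded figures above, the first four variables (alcohol, volatile acidity, free sulfur dioxide, pH) already reach 0.32 of the 0.33 total. Since the table is rounded to two decimals, gains shown as 0.00 may hide small positive contributions.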