White Wine Quality (UCI, Regression, n=4898, d=11)

Loading The Data

In [1]:

import kxy # Registers the df.kxy accessor on pandas dataframes (pip install kxy)
from kxy_datasets.uci_regressions import WhiteWineQuality # pip install kxy_datasets

In [2]:

dataset = WhiteWineQuality()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'

In [3]:

df.kxy.describe() # Visualize a summary of the data


---------------
Column: alcohol
---------------
Type:   Continuous
Max:    14
p75:    11
Mean:   10
Median: 10
p25:    9.5
Min:    8.0

-----------------
Column: chlorides
-----------------
Type:   Continuous
Max:    0.3
p75:    0.1
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

-------------------
Column: citric acid
-------------------
Type:   Continuous
Max:    1.7
p75:    0.4
Mean:   0.3
Median: 0.3
p25:    0.3
Min:    0.0

---------------
Column: density
---------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   1.0
Median: 1.0
p25:    1.0
Min:    1.0

---------------------
Column: fixed acidity
---------------------
Type:   Continuous
Max:    14
p75:    7.3
Mean:   6.9
Median: 6.8
p25:    6.3
Min:    3.8

---------------------------
Column: free sulfur dioxide
---------------------------
Type:   Continuous
Max:    289
p75:    46
Mean:   35
Median: 34
p25:    23
Min:    2.0

----------
Column: pH
----------
Type:   Continuous
Max:    3.8
p75:    3.3
Mean:   3.2
Median: 3.2
p25:    3.1
Min:    2.7

---------------
Column: quality
---------------
Type:   Continuous
Max:    9.0
p75:    6.0
Mean:   5.9
Median: 6.0
p25:    5.0
Min:    3.0

----------------------
Column: residual sugar
----------------------
Type:   Continuous
Max:    65
p75:    9.9
Mean:   6.4
Median: 5.2
p25:    1.7
Min:    0.6

-----------------
Column: sulphates
-----------------
Type:   Continuous
Max:    1.1
p75:    0.6
Mean:   0.5
Median: 0.5
p25:    0.4
Min:    0.2

----------------------------
Column: total sulfur dioxide
----------------------------
Type:   Continuous
Max:    440
p75:    167
Mean:   138
Median: 134
p25:    108
Min:    9.0

------------------------
Column: volatile acidity
------------------------
Type:   Continuous
Max:    1.1
p75:    0.3
Mean:   0.3
Median: 0.3
p25:    0.2
Min:    0.1
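
The summary above lists the same six statistics for every column. As a rough illustration of what those statistics are (not of how `df.kxy.describe()` computes them internally, which this notebook does not show), here is a minimal standard-library sketch. The helper `summarize` and the values in `toy_column` are hypothetical and are not taken from the wine dataset.

```python
import statistics

def summarize(values):
    """Return max, quartiles, mean, and min for one numeric column."""
    # method="inclusive" interpolates linearly between observed points.
    p25, median, p75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "Max": max(values),
        "p75": p75,
        "Mean": statistics.fmean(values),
        "Median": median,
        "p25": p25,
        "Min": min(values),
    }

# Made-up toy values for illustration only, NOT wine measurements.
toy_column = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = summarize(toy_column)

for name, value in summary.items():
    print(f"{name + ':':<8}{value:.1f}")
```

The dictionary is built in the same order as the notebook output (Max first, Min last), so the printed block can be compared against a column's summary line by line.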

Data Valuation

In [4]:

df.kxy.data_valuation(y_column, problem_type=problem_type)

Out[4]:

   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable RMSE
0                  0.33                                 -1.41         7.27e-01
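
The achievable R-Squared and achievable RMSE reported here appear to be tied together by the usual identity RMSE = std(y) * sqrt(1 - R-Squared). The sketch below checks this using the rounded values displayed in this notebook. It assumes that the RMSE of 0.89 listed for "No Variable" in the variable selection table approximates the standard deviation of `quality`; the helper `rmse_from_r2` is hypothetical and not part of the kxy API.

```python
import math

def rmse_from_r2(r_squared, target_std):
    """RMSE implied by an R-squared value, given the target's standard deviation."""
    return target_std * math.sqrt(1.0 - r_squared)

# Values read off the notebook outputs, rounded as displayed.
baseline_rmse = 0.89  # "No Variable" RMSE, assumed to approximate std(quality)
achievable_r2 = 0.33  # Achievable R-Squared

implied_rmse = rmse_from_r2(achievable_r2, baseline_rmse)
print(f"Implied achievable RMSE: {implied_rmse:.2f}")
```

Because the inputs are rounded to two decimals, the implied value only agrees with the reported 7.27e-01 to within about 0.01.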

Automatic (Model-Free) Variable Selection

In [5]:

df.kxy.variable_selection(y_column, problem_type=problem_type)

Out[5]:

Selection Order  Variable              Running Achievable R-Squared  Running Achievable RMSE
0                No Variable           0.00                          0.89
1                alcohol               0.26                          0.76
2                volatile acidity      0.29                          0.75
3                free sulfur dioxide   0.29                          0.75
4                pH                    0.32                          0.73
5                residual sugar        0.32                          0.73
6                citric acid           0.32                          0.73
7                density               0.33                          0.73
8                fixed acidity         0.33                          0.73
9                sulphates             0.33                          0.73
10               chlorides             0.33                          0.73
11               total sulfur dioxide  0.33                          0.73
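
One way to read the table above is to look at how much each variable adds to the running achievable R-Squared, and at how few variables are needed to get close to the final value. The sketch below does this with the rounded figures from the table. The helpers `incremental_gains` and `smallest_subset` are hypothetical, and the 95% threshold is an arbitrary choice made for illustration, not a kxy default.

```python
# (variable, running achievable R-squared), copied from the table above.
selection = [
    ("alcohol", 0.26),
    ("volatile acidity", 0.29),
    ("free sulfur dioxide", 0.29),
    ("pH", 0.32),
    ("residual sugar", 0.32),
    ("citric acid", 0.32),
    ("density", 0.33),
    ("fixed acidity", 0.33),
    ("sulphates", 0.33),
    ("chlorides", 0.33),
    ("total sulfur dioxide", 0.33),
]

def incremental_gains(rows):
    """Gain in running R-squared contributed by each variable, in selection order."""
    gains = []
    previous = 0.0  # running R-squared with no variable
    for name, running in rows:
        gains.append((name, round(running - previous, 2)))
        previous = running
    return gains

def smallest_subset(rows, fraction):
    """Shortest prefix of the selection order reaching `fraction` of the final R-squared."""
    target = fraction * rows[-1][1]
    chosen = []
    for name, running in rows:
        chosen.append(name)
        if running >= target:
            break
    return chosen

gains = incremental_gains(selection)
top = smallest_subset(selection, 0.95)  # 0.95 is an illustrative threshold

for name, gain in gains:
    print(f"{name:<22}{gain:+.2f}")
print("Variables reaching 95% of the final R-squared:", top)
```

With the displayed precision, alcohol accounts for most of the achievable R-Squared, and the first four variables in the selection order already reach the 95% threshold. Gains shown as 0.00 may simply be smaller than the two-decimal rounding of the table.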