Air Quality (UCI, Regression, n=8991, d=14)

Loading The Data

In [1]:
import kxy # pip install kxy; importing it registers the df.kxy accessor used below
from kxy_datasets.uci_regressions import AirQuality # pip install kxy_datasets
In [2]:
dataset = AirQuality()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

----------
Column: AH
----------
Type:   Continuous
Max:    2.2
p75:    1.3
Mean:   1.0
Median: 1.0
p25:    0.7
Min:    0.2

----------------
Column: C6H6(GT)
----------------
Type:   Continuous
Max:    63
p75:    14
Mean:   10
Median: 8.2
p25:    4.4
Min:    0.1

--------------
Column: CO(GT)
--------------
Type:   Continuous
Max:    11
p75:    2.8
Mean:   2.1
Median: 1.8
p25:    1.1
Min:    0.1

------------
Column: Date
------------
Type:   Continuous
Max:    699
p75:    357
Mean:   258
Median: 253
p25:    141
Min:    0.0

----------------
Column: NMHC(GT)
----------------
Type:   Continuous
Max:    1,189
p75:    -200.0
Mean:   -158.7
Median: -200.0
p25:    -200.0
Min:    -200.0

---------------
Column: NO2(GT)
---------------
Type:   Continuous
Max:    333
p75:    132
Mean:   56
Median: 96
p25:    52
Min:    -200.0

---------------
Column: NOx(GT)
---------------
Type:   Continuous
Max:    1,479
p75:    280
Mean:   163
Median: 140
p25:    49
Min:    -200.0

-------------------
Column: PT08.S1(CO)
-------------------
Type:   Continuous
Max:    2,040
p75:    1,231
Mean:   1,099
Median: 1,063
p25:    937
Min:    647

---------------------
Column: PT08.S2(NMHC)
---------------------
Type:   Continuous
Max:    2,214
p75:    1,116
Mean:   939
Median: 909
p25:    734
Min:    383

--------------------
Column: PT08.S3(NOx)
--------------------
Type:   Continuous
Max:    2,683
p75:    969
Mean:   835
Median: 806
p25:    658
Min:    322

--------------------
Column: PT08.S4(NO2)
--------------------
Type:   Continuous
Max:    2,775
p75:    1,674
Mean:   1,456
Median: 1,463
p25:    1,227
Min:    551

-------------------
Column: PT08.S5(O3)
-------------------
Type:   Continuous
Max:    2,523
p75:    1,273
Mean:   1,022
Median: 963
p25:    731
Min:    221

----------
Column: RH
----------
Type:   Continuous
Max:    88
p75:    62
Mean:   49
Median: 49
p25:    35
Min:    9.2

---------
Column: T
---------
Type:   Continuous
Max:    44
p75:    24
Mean:   18
Median: 17
p25:    11
Min:    -1.9

------------
Column: Time
------------
Type:   Continuous
Max:    23
p75:    17
Mean:   11
Median: 11
p25:    5.0
Min:    0.0
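
Several columns in the summary above (NMHC(GT), NO2(GT), NOx(GT)) have a minimum of -200.0, and NMHC(GT) has a median of -200.0. In the UCI Air Quality data, -200 is the marker for a missing reading, not a measured concentration. The sketch below shows one way to count these markers and replace them with NaN in pandas. The toy frame and its values are illustrative only, and nothing here is part of the kxy API. Whether to clean the data before running the analyses below is left as a judgment call.

```python
import numpy as np
import pandas as pd

MISSING_SENTINEL = -200  # missing-value marker in the UCI Air Quality data

# Toy frame standing in for a few rows of the real dataset (illustrative values).
toy = pd.DataFrame({
    "NMHC(GT)": [-200.0, 150.0, -200.0, -200.0],
    "NO2(GT)": [113.0, -200.0, 92.0, 114.0],
    "T": [13.6, 13.3, 11.9, 11.0],
})

# Fraction of entries in each column that are the sentinel.
missing_share = (toy == MISSING_SENTINEL).mean()

# Replace the sentinel with NaN so that summary statistics skip it.
cleaned = toy.replace(MISSING_SENTINEL, np.nan)

print(missing_share)
print(cleaned.mean())
```

On the real dataframe, the same two lines would apply to df instead of toy. A column that is mostly sentinel values, as NMHC(GT) appears to be, may be a candidate for dropping instead.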

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable RMSE
0  1.00                  5.51                                  1.14e-03

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
Selection Order  Variable       Running Achievable R-Squared  Running Achievable RMSE
0                No Variable    0.00                          7.45
1                PT08.S2(NMHC)  0.96                          1.45
2                AH             1.00                          0.0719
3                PT08.S4(NO2)   1.00                          0.0719
4                RH             1.00                          0.0719
5                T              1.00                          0.0719
6                PT08.S3(NOx)   1.00                          0.0719
7                Time           1.00                          0.0719
8                CO(GT)         1.00                          0.0719
9                Date           1.00                          0.0719
10               NO2(GT)        1.00                          0.0668
11               NOx(GT)        1.00                          0.0668
12               PT08.S5(O3)    1.00                          0.0621
13               NMHC(GT)       1.00                          0.0577
14               PT08.S1(CO)    1.00                          0.0011
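
The two running columns above can be checked against each other. The sketch below assumes the usual relation R-Squared = 1 - RMSE^2 / Var(y) and treats the "No Variable" RMSE (7.45) as the standard deviation of the target. That relation is an assumption made for illustration, not something the output states. The helper name implied_r_squared is hypothetical and not part of kxy.

```python
def implied_r_squared(rmse: float, baseline_rmse: float) -> float:
    """R-Squared implied by an RMSE, given the RMSE achievable with no variable."""
    return 1.0 - (rmse / baseline_rmse) ** 2

baseline = 7.45  # "No Variable" row: RMSE with no explanatory variable

# (variable, running achievable RMSE) pairs copied from the table above.
rows = [("PT08.S2(NMHC)", 1.45), ("AH", 0.0719), ("PT08.S1(CO)", 0.0011)]

for name, rmse in rows:
    print(f"{name}: implied R-Squared = {implied_r_squared(rmse, baseline):.4f}")
```

Rounded to two decimals, the first value agrees with the 0.96 reported for PT08.S2(NMHC). The later rows all round to 1.00, which is why the R-Squared column stops changing after the second variable while the RMSE column keeps shrinking.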