Power Plant (UCI, Regression, n=9568, d=4)

Loading The Data

In [1]:
import kxy # Registers the df.kxy accessor on pandas dataframes; pip install kxy
from kxy_datasets.uci_regressions import PowerPlant # pip install kxy_datasets
In [2]:
dataset = PowerPlant()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

----------
Column: AP
----------
Type:   Continuous
Max:    1,033
p75:    1,017
Mean:   1,013
Median: 1,012
p25:    1,009
Min:    992

----------
Column: AT
----------
Type:   Continuous
Max:    37
p75:    25
Mean:   19
Median: 20
p25:    13
Min:    1.8

----------
Column: PE
----------
Type:   Continuous
Max:    495
p75:    468
Mean:   454
Median: 451
p25:    439
Min:    420

----------
Column: RH
----------
Type:   Continuous
Max:    100
p75:    84
Mean:   73
Median: 74
p25:    63
Min:    25

---------
Column: V
---------
Type:   Continuous
Max:    81
p75:    66
Mean:   54
Median: 52
p25:    41
Min:    25

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable RMSE
0                  0.94                                  1.38             4.31
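
To put the achievable figures in Out[4] in context, one option is to score a simple fitted model with the same metrics and see how far it sits from them. The following is a minimal sketch, assuming an ordinary least-squares linear model as a stand-in baseline; it is not part of the kxy API, the helper names `rmse`, `r_squared` and `linear_baseline_scores` are hypothetical, and the scores are computed in-sample for brevity.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def linear_baseline_scores(X, y):
    """Fit y ~ intercept + X by least squares; return (R-squared, RMSE) in-sample.

    Hypothetical helper for illustration only (assumed baseline, not kxy's method).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_pred = A @ coef
    return r_squared(y, y_pred), rmse(y, y_pred)
```

With the dataframe loaded above, `linear_baseline_scores(df.drop(columns=[y_column]), df[y_column])` would return the pair to set against the achievable R-Squared and RMSE. A held-out split would give a fairer comparison than in-sample scores.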

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
                    Variable  Running Achievable R-Squared  Running Achievable RMSE
Selection Order
0                No Variable                          0.00                 1.71e+01
1                         AT                          0.90                     5.47
2                          V                          0.93                     4.60
3                         RH                          0.93                     4.60
4                         AP                          0.94                     4.31
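
The two running columns in Out[5] appear to be tied together: the "No Variable" RMSE (17.1) acts as a baseline, and each later RMSE matches baseline × sqrt(1 − R-Squared) up to rounding. This is an observation from the printed numbers, not a statement about how kxy computes them. A minimal sketch that inverts the relation to recover R-Squared from the reported RMSE values (the function name `implied_r_squared` is hypothetical):

```python
# Values copied from Out[5]; the relation checked here is an assumption
# inferred from the table, not documented kxy behaviour.
BASELINE_RMSE = 17.1  # "No Variable" row

running = [
    ("AT", 0.90, 5.47),
    ("V", 0.93, 4.60),
    ("RH", 0.93, 4.60),
    ("AP", 0.94, 4.31),
]

def implied_r_squared(rmse, baseline_rmse=BASELINE_RMSE):
    """Invert RMSE = baseline * sqrt(1 - R^2) to get R^2."""
    return 1.0 - (rmse / baseline_rmse) ** 2

for name, reported_r2, reported_rmse in running:
    implied = implied_r_squared(reported_rmse)
    print(f"{name}: reported R2 = {reported_r2:.2f}, implied R2 = {implied:.3f}")
```

Read this way, RH leaves both columns unchanged at the displayed precision, while AP moves the RMSE from 4.60 to 4.31.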