Power Plant (UCI, Regression, n=9568, d=4)

Loading The Data

In [1]:

from kxy_datasets.uci_regressions import PowerPlant # pip install kxy_datasets

In [2]:

dataset = PowerPlant()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'

In [3]:

df.kxy.describe() # Visualize a summary of the data


----------
Column: AP
----------
Type:   Continuous
Max:    1,033
p75:    1,017
Mean:   1,013
Median: 1,012
p25:    1,009
Min:    992

----------
Column: AT
----------
Type:   Continuous
Max:    37
p75:    25
Mean:   19
Median: 20
p25:    13
Min:    1.8

----------
Column: PE
----------
Type:   Continuous
Max:    495
p75:    468
Mean:   454
Median: 451
p25:    439
Min:    420

----------
Column: RH
----------
Type:   Continuous
Max:    100
p75:    84
Mean:   73
Median: 74
p25:    63
Min:    25

---------
Column: V
---------
Type:   Continuous
Max:    81
p75:    66
Mean:   54
Median: 52
p25:    41
Min:    25

Data Valuation

In [4]:

df.kxy.data_valuation(y_column, problem_type=problem_type)

Out[4]:

   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable RMSE
0                  0.94                                  1.38             4.31
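
The achievable R-squared above can also be read as a statement about mutual information. The following is a minimal sketch, assuming the relation R^2 = 1 - exp(-2 * I) between the achievable R-squared and the mutual information I (in nats) between the inputs and the target. The function name `implied_mutual_information` is hypothetical and is not part of the kxy package, and the input 0.94 is the rounded value from the table, so the result is only approximate.

```python
import math


def implied_mutual_information(r_squared: float) -> float:
    """Invert the assumed relation R^2 = 1 - exp(-2 * I); returns I in nats."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must be in [0, 1)")
    return -0.5 * math.log(1.0 - r_squared)


# 0.94 is the (rounded) achievable R-squared reported by data_valuation above.
mi_nats = implied_mutual_information(0.94)
print(f"Implied mutual information: {mi_nats:.2f} nats")
```

Under this assumption, an R-squared closer to 1 corresponds to a rapidly growing mutual information, which is why small gains in R-squared near the top of the range are hard to come by.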

Automatic (Model-Free) Variable Selection

In [5]:

df.kxy.variable_selection(y_column, problem_type=problem_type)

Out[5]:

Selection Order  Variable     Running Achievable R-Squared  Running Achievable RMSE
0                No Variable                          0.00                 1.71e+01
1                AT                                   0.90                     5.47
2                V                                    0.93                     4.60
3                RH                                   0.93                     4.60
4                AP                                   0.94                     4.31
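
The two running columns in the table above can be cross-checked against each other. The sketch below assumes that the "No Variable" RMSE (17.1) is the standard deviation of the target, so that R^2 = 1 - (RMSE / baseline RMSE)^2. The helper `r_squared_from_rmse` is a hypothetical name used for illustration, not a kxy function, and all numbers are copied from the table.

```python
# Values copied from the variable selection table above.
baseline_rmse = 17.1  # "No Variable" row (1.71e+01)
running = [
    # (variable, reported running R-squared, reported running RMSE)
    ("AT", 0.90, 5.47),
    ("V", 0.93, 4.60),
    ("RH", 0.93, 4.60),
    ("AP", 0.94, 4.31),
]


def r_squared_from_rmse(rmse: float, baseline: float) -> float:
    """R^2 implied by an RMSE, assuming `baseline` is the target's standard deviation."""
    return 1.0 - (rmse / baseline) ** 2


implied = {name: r_squared_from_rmse(rmse, baseline_rmse) for name, _, rmse in running}

for name, reported, rmse in running:
    print(f"{name}: reported R^2 = {reported:.2f}, implied R^2 = {implied[name]:.3f}")
```

Under that assumption, the implied values agree with the reported R-squared column to within rounding. At the displayed precision, RH adds no gain once AT and V have been selected.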