Shuttle (UCI, Classification, n=58000, d=9, 7 classes)

Loading The Data

In [1]:
import kxy # pip install kxy; importing it registers the df.kxy accessor used below
from kxy_datasets.uci_classifications import Shuttle # pip install kxy_datasets
In [2]:
dataset = Shuttle()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

-----------
Column: x_0
-----------
Type:   Continuous
Max:    126
p75:    55
Mean:   48
Median: 45
p25:    38
Min:    27

-----------
Column: x_1
-----------
Type:   Continuous
Max:    5,075
p75:    0.0
Mean:   -0.0
Median: 0.0
p25:    0.0
Min:    -4821.0

-----------
Column: x_2
-----------
Type:   Continuous
Max:    149
p75:    89
Mean:   85
Median: 83
p25:    79
Min:    21

-----------
Column: x_3
-----------
Type:   Continuous
Max:    3,830
p75:    0.0
Mean:   0.3
Median: 0.0
p25:    0.0
Min:    -3939.0

-----------
Column: x_4
-----------
Type:   Continuous
Max:    436
p75:    46
Mean:   34
Median: 42
p25:    26
Min:    -188.0

-----------
Column: x_5
-----------
Type:   Continuous
Max:    15,164
p75:    5.0
Mean:   1.6
Median: 0.0
p25:    -5.0
Min:    -26739.0

-----------
Column: x_6
-----------
Type:   Continuous
Max:    105
p75:    42
Mean:   37
Median: 39
p25:    32
Min:    -48.0

-----------
Column: x_7
-----------
Type:   Continuous
Max:    270
p75:    60
Mean:   50
Median: 44
p25:    37
Min:    -353.0

-----------
Column: x_8
-----------
Type:   Continuous
Max:    266
p75:    14
Mean:   13
Median: 2.0
p25:    0.0
Min:    -356.0

---------
Column: y
---------
Type:   Continuous
Max:    7.0
p75:    1.0
Mean:   1.7
Median: 1.0
p25:    1.0
Min:    1.0
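
Each block above reports the same six statistics (Max, p75, Mean, Median, p25, Min). For readers who want to see what those numbers are without installing kxy, here is a minimal standard-library sketch. It is not kxy's implementation: the `summarize` helper and the `toy_column` values are hypothetical, and the quartile interpolation rule (`method="inclusive"`) is an assumption that may differ from the one kxy uses.

```python
import statistics

def summarize(values):
    """Return the six summary statistics for one numeric column."""
    data = sorted(values)
    # n=4 yields the three quartile cut points: p25, median, p75.
    # method="inclusive" interpolates linearly between data points (an assumption).
    p25, median, p75 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "Max": max(data),
        "p75": p75,
        "Mean": statistics.fmean(data),
        "Median": median,
        "p25": p25,
        "Min": min(data),
    }

# Hypothetical toy values, not taken from the Shuttle dataset.
toy_column = [2, 4, 6, 8, 10]
stats = summarize(toy_column)

for name, value in stats.items():
    print(f"{name + ':':<8}{value}")
```

To apply the same helper to the real data, one could call `summarize(df[column])` for each column of the dataframe loaded above.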

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable Accuracy
0                  0.74                                  0.00                 1.00

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
                    Variable  Running Achievable R-Squared  Running Achievable Accuracy
Selection Order
0                No Variable                          0.00                         0.79
1                        x_0                          0.65                         1.00
2                        x_8                          0.73                         1.00
3                        x_1                          0.74                         1.00
4                        x_3                          0.74                         1.00
5                        x_2                          0.74                         1.00
6                        x_5                          0.74                         1.00
7                        x_4                          0.74                         1.00
8                        x_6                          0.74                         1.00
9                        x_7                          0.74                         1.00
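
One way to read the table above is to ask how much each variable adds to the running achievable R-squared. The sketch below copies the R-squared column from the table and computes the increment for each variable in selection order. The helper names and the 0.005 cut-off are illustrative assumptions, not part of kxy, and the increments are only as precise as the two-decimal values printed above.

```python
# Running achievable R-squared, copied from the variable-selection table above.
running_r2 = [
    ("No Variable", 0.00),
    ("x_0", 0.65),
    ("x_8", 0.73),
    ("x_1", 0.74),
    ("x_3", 0.74),
    ("x_2", 0.74),
    ("x_5", 0.74),
    ("x_4", 0.74),
    ("x_6", 0.74),
    ("x_7", 0.74),
]

def marginal_gains(rows):
    """Pair each variable with the increase it brings over the previous row."""
    gains = []
    for (_, previous), (name, current) in zip(rows, rows[1:]):
        # Round to the table's precision to avoid floating-point noise.
        gains.append((name, round(current - previous, 2)))
    return gains

def variables_with_gain(rows, min_gain=0.005):
    """Keep variables whose increment is at least min_gain (hypothetical cut-off)."""
    return [name for name, gain in marginal_gains(rows) if gain >= min_gain]

gains = marginal_gains(running_r2)
selected = variables_with_gain(running_r2)

for name, gain in gains:
    print(f"{name}: +{gain:.2f}")
print("Variables with a visible gain:", selected)
```

On these rounded figures, only x_0, x_8 and x_1 move the running achievable R-squared; the remaining six variables add nothing at two-decimal precision.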