Shuttle (UCI, Classification, n=58000, d=9, 7 classes)

Loading The Data

In [1]:

import kxy # pip install kxy; importing it registers the df.kxy accessor used below
from kxy_datasets.uci_classifications import Shuttle # pip install kxy_datasets

In [2]:

dataset = Shuttle()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'

In [3]:

df.kxy.describe() # Print summary statistics for each column


-----------
Column: x_0
-----------
Type:   Continuous
Max:    126
p75:    55
Mean:   48
Median: 45
p25:    38
Min:    27

-----------
Column: x_1
-----------
Type:   Continuous
Max:    5,075
p75:    0.0
Mean:   -0.0
Median: 0.0
p25:    0.0
Min:    -4821.0

-----------
Column: x_2
-----------
Type:   Continuous
Max:    149
p75:    89
Mean:   85
Median: 83
p25:    79
Min:    21

-----------
Column: x_3
-----------
Type:   Continuous
Max:    3,830
p75:    0.0
Mean:   0.3
Median: 0.0
p25:    0.0
Min:    -3939.0

-----------
Column: x_4
-----------
Type:   Continuous
Max:    436
p75:    46
Mean:   34
Median: 42
p25:    26
Min:    -188.0

-----------
Column: x_5
-----------
Type:   Continuous
Max:    15,164
p75:    5.0
Mean:   1.6
Median: 0.0
p25:    -5.0
Min:    -26739.0

-----------
Column: x_6
-----------
Type:   Continuous
Max:    105
p75:    42
Mean:   37
Median: 39
p25:    32
Min:    -48.0

-----------
Column: x_7
-----------
Type:   Continuous
Max:    270
p75:    60
Mean:   50
Median: 44
p25:    37
Min:    -353.0

-----------
Column: x_8
-----------
Type:   Continuous
Max:    266
p75:    14
Mean:   13
Median: 2.0
p25:    0.0
Min:    -356.0

---------
Column: y
---------
Type:   Continuous
Max:    7.0
p75:    1.0
Mean:   1.7
Median: 1.0
p25:    1.0
Min:    1.0
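
The summary of y above shows p25, median and p75 all equal to 1, which suggests that class 1 dominates the target. A minimal sketch for checking this directly follows. `majority_class_share` is a hypothetical helper (not part of kxy), and the toy labels are illustrative only, not taken from the dataset.

```python
from collections import Counter

def majority_class_share(labels):
    """Return (most frequent label, fraction of samples carrying it)."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must be non-empty")
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Toy labels for illustration only (NOT the Shuttle data).
toy_labels = [1, 1, 1, 4, 1, 5, 1, 1, 4, 1]
label, share = majority_class_share(toy_labels)
print(f"majority class: {label}, share: {share:.2f}")
```

With the dataframe loaded earlier, the call would be `majority_class_share(df[y_column])`. The resulting share is the accuracy of always predicting the most frequent class, which is a useful baseline when reading the accuracy figures reported later in this notebook.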

Data Valuation

In [4]:

df.kxy.data_valuation(y_column, problem_type=problem_type)

Out[4]:

   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable Accuracy
0                  0.74                                  0.00                 1.00
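
As an aid to reading the table above: the achievable R-squared is, as far as the KXY documentation describes it, defined as 1 - exp(-2 I), where I is the mutual information (in nats) between the target and the explanatory variables. Assuming that relation applies here, the reported value can be converted back into a mutual information estimate. The function names below are hypothetical and not part of kxy.

```python
import math

def r2_from_mutual_information(mi_nats):
    """Achievable R-squared implied by a mutual information value (in nats),
    under the ASSUMED relation R^2 = 1 - exp(-2 * I)."""
    if mi_nats < 0.0:
        raise ValueError("mutual information cannot be negative")
    return 1.0 - math.exp(-2.0 * mi_nats)

def mutual_information_from_r2(r2):
    """Inverse of the relation above: I = -0.5 * ln(1 - R^2)."""
    if not 0.0 <= r2 < 1.0:
        raise ValueError("r2 must lie in [0, 1)")
    return -0.5 * math.log(1.0 - r2)

# 0.74 is the achievable R-squared reported in the table above.
mi = mutual_information_from_r2(0.74)
print(f"implied mutual information: {mi:.2f} nats")
```

Under this assumption, an achievable R-squared of 0.74 corresponds to roughly 0.67 nats of mutual information between the nine inputs and the target.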

Automatic (Model-Free) Variable Selection

In [5]:

df.kxy.variable_selection(y_column, problem_type=problem_type)

Out[5]:

Selection Order  Variable     Running Achievable R-Squared  Running Achievable Accuracy
0                No Variable                          0.00                         0.79
1                x_0                                  0.65                         1.00
2                x_8                                  0.73                         1.00
3                x_1                                  0.74                         1.00
4                x_3                                  0.74                         1.00
5                x_2                                  0.74                         1.00
6                x_5                                  0.74                         1.00
7                x_4                                  0.74                         1.00
8                x_6                                  0.74                         1.00
9                x_7                                  0.74                         1.00
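
One way to act on the table above is to keep variables only until the running achievable R-squared stops improving. The sketch below hard-codes the values from the table. `variables_until_plateau` and its `tol` parameter are hypothetical (not part of kxy), and the stopping rule is one plausible heuristic rather than a prescribed method.

```python
def variables_until_plateau(rows, tol=0.0):
    """Return the shortest prefix of variables, in selection order, whose
    running score is within `tol` of the final running score.

    rows: list of (variable_name, running_score) tuples in selection order.
    """
    if not rows:
        return []
    final_score = rows[-1][1]
    selected = []
    for name, running_score in rows:
        selected.append(name)
        if final_score - running_score <= tol:
            break
    return selected

# Running achievable R-squared copied from the table above
# (the "No Variable" row is omitted).
rows = [
    ("x_0", 0.65), ("x_8", 0.73), ("x_1", 0.74),
    ("x_3", 0.74), ("x_2", 0.74), ("x_5", 0.74),
    ("x_4", 0.74), ("x_6", 0.74), ("x_7", 0.74),
]

strict = variables_until_plateau(rows)             # no loss tolerated
loose = variables_until_plateau(rows, tol=0.015)   # tolerate a small loss
print(strict, loose)
```

By R-squared, the plateau is reached after x_0, x_8 and x_1. By the accuracy column, x_0 alone already reaches the reported 1.00 (at two-decimal rounding), so the choice of metric and tolerance determines how many variables are kept.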