Year Prediction MSD (UCI, Regression, n=515345, d=90)

Loading The Data

In [1]:
import kxy # pip install kxy; importing it registers the df.kxy accessor used below
from kxy_datasets.uci_regressions import YearPredictionMSD # pip install kxy_datasets
In [2]:
dataset = YearPredictionMSD()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

-----------
Column: x_0
-----------
Type:   Continuous
Max:    61
p75:    47
Mean:   43
Median: 44
p25:    39
Min:    1.7

-----------
Column: x_1
-----------
Type:   Continuous
Max:    384
p75:    36
Mean:   1.3
Median: 8.4
p25:    -26.1
Min:    -337.1

------------
Column: x_10
------------
Type:   Continuous
Max:    88
p75:    2.4
Mean:   -0.1
Median: -0.1
p25:    -2.7
Min:    -69.7

------------
Column: x_11
------------
Type:   Continuous
Max:    87
p75:    7.4
Mean:   2.5
Median: 2.3
p25:    -2.6
Min:    -94.0

------------
Column: x_12
------------
Type:   Continuous
Max:    549
p75:    43
Mean:   33
Median: 29
p25:    18
Min:    0.1

------------
Column: x_13
------------
Type:   Continuous
Max:    65,735
p75:    3,041
Mean:   2,439
Median: 2,009
p25:    1,315
Min:    8.5

------------
Column: x_14
------------
Type:   Continuous
Max:    36,816
p75:    2,489
Mean:   1,967
Median: 1,687
p25:    1,113
Min:    21

------------
Column: x_15
------------
Type:   Continuous
Max:    31,849
p75:    1,892
Mean:   1,514
Median: 1,240
p25:    814
Min:    17

------------
Column: x_16
------------
Type:   Continuous
Max:    19,865
p75:    1,100
Mean:   910
Median: 817
p25:    606
Min:    12

------------
Column: x_17
------------
Type:   Continuous
Max:    16,831
p75:    1,114
Mean:   879
Median: 739
p25:    487
Min:    5.5

------------
Column: x_18
------------
Type:   Continuous
Max:    11,901
p75:    740
Mean:   603
Median: 541
p25:    392
Min:    19

------------
Column: x_19
------------
Type:   Continuous
Max:    9,569
p75:    626
Mean:   517
Median: 444
p25:    318
Min:    6.3

-----------
Column: x_2
-----------
Type:   Continuous
Max:    322
p75:    29
Mean:   8.7
Median: 10
p25:    -11.5
Min:    -301.0

------------
Column: x_20
------------
Type:   Continuous
Max:    9,616
p75:    476
Mean:   393
Median: 349
p25:    257
Min:    6.2

------------
Column: x_21
------------
Type:   Continuous
Max:    3,721
p75:    397
Mean:   325
Median: 292
p25:    214
Min:    15

------------
Column: x_22
------------
Type:   Continuous
Max:    6,737
p75:    356
Mean:   288
Median: 242
p25:    167
Min:    6.1

------------
Column: x_23
------------
Type:   Continuous
Max:    9,813
p75:    349
Mean:   291
Median: 264
p25:    197
Min:    5.2

------------
Column: x_24
------------
Type:   Continuous
Max:    2,049
p75:    88
Mean:   43
Median: 27
p25:    -11.6
Min:    -2821.4

------------
Column: x_25
------------
Type:   Continuous
Max:    24,479
p75:    284
Mean:   43
Median: -43.2
p25:    -288.6
Min:    -13390.4

------------
Column: x_26
------------
Type:   Continuous
Max:    14,505
p75:    167
Mean:   -46.4
Median: -38.9
p25:    -250.8
Min:    -12017.1

------------
Column: x_27
------------
Type:   Continuous
Max:    3,410
p75:    65
Mean:   -27.7
Median: -23.4
p25:    -118.6
Min:    -4324.9

------------
Column: x_28
------------
Type:   Continuous
Max:    3,277
p75:    84
Mean:   14
Median: 7.8
p25:    -60.2
Min:    -3357.3

------------
Column: x_29
------------
Type:   Continuous
Max:    3,553
p75:    97
Mean:   44
Median: 30
p25:    -24.4
Min:    -3115.4

-----------
Column: x_3
-----------
Type:   Continuous
Max:    335
p75:    8.8
Mean:   1.2
Median: -0.7
p25:    -8.5
Min:    -154.2

------------
Column: x_30
------------
Type:   Continuous
Max:    2,347
p75:    48
Mean:   5.1
Median: 4.1
p25:    -38.6
Min:    -3805.7

------------
Column: x_31
------------
Type:   Continuous
Max:    1,954
p75:    55
Mean:   24
Median: 23
p25:    -6.5
Min:    -1516.4

------------
Column: x_32
------------
Type:   Continuous
Max:    2,887
p75:    40
Mean:   9.5
Median: 7.8
p25:    -24.8
Min:    -1679.1

------------
Column: x_33
------------
Type:   Continuous
Max:    2,330
p75:    12
Mean:   -4.2
Median: -9.7
p25:    -28.7
Min:    -1590.6

------------
Column: x_34
------------
Type:   Continuous
Max:    1,813
p75:    20
Mean:   0.5
Median: 3.2
p25:    -17.0
Min:    -989.6

------------
Column: x_35
------------
Type:   Continuous
Max:    2,496
p75:    116
Mean:   72
Median: 56
p25:    14
Min:    -1711.5

------------
Column: x_36
------------
Type:   Continuous
Max:    14,148
p75:    85
Mean:   -51.4
Median: -73.2
p25:    -223.8
Min:    -8448.2

------------
Column: x_37
------------
Type:   Continuous
Max:    8,059
p75:    307
Mean:   117
Median: 79
p25:    -99.6
Min:    -10095.7

------------
Column: x_38
------------
Type:   Continuous
Max:    6,065
p75:    -60.3
Mean:   -189.9
Median: -160.5
p25:    -289.7
Min:    -9803.8

------------
Column: x_39
------------
Type:   Continuous
Max:    8,360
p75:    115
Mean:   23
Median: 21
p25:    -69.7
Min:    -7882.8

-----------
Column: x_4
-----------
Type:   Continuous
Max:    262
p75:    7.7
Mean:   -6.6
Median: -6.0
p25:    -20.7
Min:    -182.0

------------
Column: x_40
------------
Type:   Continuous
Max:    3,537
p75:    46
Mean:   -1.3
Median: -4.2
p25:    -53.5
Min:    -4673.4

------------
Column: x_41
------------
Type:   Continuous
Max:    3,892
p75:    75
Mean:   18
Median: 25
p25:    -27.2
Min:    -4175.4

------------
Column: x_42
------------
Type:   Continuous
Max:    1,202
p75:    -18.0
Mean:   -52.0
Median: -45.4
p25:    -79.2
Min:    -4975.4

------------
Column: x_43
------------
Type:   Continuous
Max:    1,830
p75:    21
Mean:   3.2
Median: 3.4
p25:    -14.6
Min:    -1073.0

------------
Column: x_44
------------
Type:   Continuous
Max:    746
p75:    19
Mean:   -1.5
Median: -0.9
p25:    -20.7
Min:    -1021.3

------------
Column: x_45
------------
Type:   Continuous
Max:    1,198
p75:    28
Mean:   6.3
Median: 4.0
p25:    -17.3
Min:    -1330.0

------------
Column: x_46
------------
Type:   Continuous
Max:    9,059
p75:    282
Mean:   78
Median: 75
p25:    -113.7
Min:    -14861.7

------------
Column: x_47
------------
Type:   Continuous
Max:    6,967
p75:    241
Mean:   142
Median: 105
p25:    10
Min:    -3992.7

------------
Column: x_48
------------
Type:   Continuous
Max:    6,172
p75:    12
Mean:   -86.5
Median: -62.8
p25:    -157.6
Min:    -6642.4

------------
Column: x_49
------------
Type:   Continuous
Max:    2,067
p75:    85
Mean:   25
Median: 31
p25:    -24.5
Min:    -2344.5

-----------
Column: x_5
-----------
Type:   Continuous
Max:    166
p75:    -2.4
Mean:   -9.5
Median: -11.2
p25:    -18.4
Min:    -81.8

------------
Column: x_50
------------
Type:   Continuous
Max:    1,426
p75:    50
Mean:   6.4
Median: 5.1
p25:    -38.7
Min:    -2270.8

------------
Column: x_51
------------
Type:   Continuous
Max:    2,460
p75:    62
Mean:   28
Median: 25
p25:    -9.1
Min:    -1746.5

------------
Column: x_52
------------
Type:   Continuous
Max:    2,394
p75:    39
Mean:   12
Median: 9.0
p25:    -18.1
Min:    -3188.2

------------
Column: x_53
------------
Type:   Continuous
Max:    2,900
p75:    37
Mean:   1.7
Median: 0.6
p25:    -34.0
Min:    -2199.8

------------
Column: x_54
------------
Type:   Continuous
Max:    569
p75:    18
Mean:   -10.2
Median: -1.9
p25:    -28.9
Min:    -1694.3

------------
Column: x_55
------------
Type:   Continuous
Max:    6,955
p75:    163
Mean:   64
Median: 29
p25:    -68.5
Min:    -5154.0

------------
Column: x_56
------------
Type:   Continuous
Max:    12,700
p75:    225
Mean:   104
Median: 85
p25:    -38.8
Min:    -5111.6

------------
Column: x_57
------------
Type:   Continuous
Max:    13,001
p75:    86
Mean:   -0.0
Median: -16.1
p25:    -117.3
Min:    -4730.6

------------
Column: x_58
------------
Type:   Continuous
Max:    5,419
p75:    111
Mean:   38
Median: 27
p25:    -44.4
Min:    -3756.5

------------
Column: x_59
------------
Type:   Continuous
Max:    5,690
p75:    30
Mean:   -28.0
Median: -31.4
p25:    -94.4
Min:    -2500.0

-----------
Column: x_6
-----------
Type:   Continuous
Max:    172
p75:    6.5
Mean:   -2.4
Median: -2.0
p25:    -10.8
Min:    -188.2

------------
Column: x_60
------------
Type:   Continuous
Max:    1,811
p75:    27
Mean:   3.3
Median: -0.3
p25:    -25.2
Min:    -1900.1

------------
Column: x_61
------------
Type:   Continuous
Max:    973
p75:    23
Mean:   0.3
Median: 1.7
p25:    -21.6
Min:    -1396.7

------------
Column: x_62
------------
Type:   Continuous
Max:    812
p75:    14
Mean:   -0.5
Median: -3.3
p25:    -18.7
Min:    -600.1

------------
Column: x_63
------------
Type:   Continuous
Max:    11,048
p75:    1.2
Mean:   -138.2
Median: -116.0
p25:    -263.0
Min:    -10345.8

------------
Column: x_64
------------
Type:   Continuous
Max:    2,877
p75:    105
Mean:   -0.7
Median: 8.4
p25:    -87.2
Min:    -7376.0

------------
Column: x_65
------------
Type:   Continuous
Max:    3,447
p75:    54
Mean:   0.2
Median: 4.8
p25:    -52.1
Min:    -3896.3

------------
Column: x_66
------------
Type:   Continuous
Max:    2,055
p75:    51
Mean:   3.2
Median: 2.3
p25:    -46.4
Min:    -1199.0

------------
Column: x_67
------------
Type:   Continuous
Max:    4,779
p75:    61
Mean:   27
Median: 18
p25:    -19.4
Min:    -2564.8

------------
Column: x_68
------------
Type:   Continuous
Max:    5,286
p75:    72
Mean:   31
Median: 22
p25:    -19.6
Min:    -1905.0

------------
Column: x_69
------------
Type:   Continuous
Max:    745
p75:    14
Mean:   -0.8
Median: -0.8
p25:    -16.1
Min:    -974.7

-----------
Column: x_7
-----------
Type:   Continuous
Max:    126
p75:    2.9
Mean:   -1.8
Median: -1.7
p25:    -6.5
Min:    -72.5

------------
Column: x_70
------------
Type:   Continuous
Max:    3,958
p75:    109
Mean:   -8.9
Median: 14
p25:    -96.3
Min:    -7057.7

------------
Column: x_71
------------
Type:   Continuous
Max:    4,741
p75:    103
Mean:   4.8
Median: 12
p25:    -90.7
Min:    -6953.4

------------
Column: x_72
------------
Type:   Continuous
Max:    2,124
p75:    42
Mean:   -27.3
Median: -19.8
p25:    -86.4
Min:    -8400.6

------------
Column: x_73
------------
Type:   Continuous
Max:    1,639
p75:    17
Mean:   -11.9
Median: -12.4
p25:    -42.9
Min:    -1812.9

------------
Column: x_74
------------
Type:   Continuous
Max:    1,278
p75:    11
Mean:   -21.6
Median: -16.8
p25:    -50.8
Min:    -1387.5

------------
Column: x_75
------------
Type:   Continuous
Max:    741
p75:    7.6
Mean:   -5.6
Median: -2.4
p25:    -15.0
Min:    -718.4

------------
Column: x_76
------------
Type:   Continuous
Max:    10,020
p75:    62
Mean:   -23.3
Median: -56.3
p25:    -149.6
Min:    -9831.5

------------
Column: x_77
------------
Type:   Continuous
Max:    3,423
p75:    105
Mean:   31
Median: 38
p25:    -31.7
Min:    -2025.8

------------
Column: x_78
------------
Type:   Continuous
Max:    5,188
p75:    -7.3
Mean:   -105.0
Median: -71.4
p25:    -163.4
Min:    -8390.0

------------
Column: x_79
------------
Type:   Continuous
Max:    3,735
p75:    85
Mean:   26
Median: 26
p25:    -27.9
Min:    -4754.9

-----------
Column: x_8
-----------
Type:   Continuous
Max:    146
p75:    10.0
Mean:   3.7
Median: 3.8
p25:    -2.3
Min:    -126.5

------------
Column: x_80
------------
Type:   Continuous
Max:    840
p75:    26
Mean:   15
Median: 9.2
p25:    -1.8
Min:    -437.7

------------
Column: x_81
------------
Type:   Continuous
Max:    4,469
p75:    13
Mean:   -73.5
Median: -53.1
p25:    -139.6
Min:    -4402.4

------------
Column: x_82
------------
Type:   Continuous
Max:    3,210
p75:    89
Mean:   41
Median: 28
p25:    -21.0
Min:    -1810.7

------------
Column: x_83
------------
Type:   Continuous
Max:    1,734
p75:    77
Mean:   37
Median: 33
p25:    -4.7
Min:    -3098.4

------------
Column: x_84
------------
Type:   Continuous
Max:    260
p75:    8.5
Mean:   0.3
Median: 0.8
p25:    -6.8
Min:    -341.8

------------
Column: x_85
------------
Type:   Continuous
Max:    3,662
p75:    67
Mean:   17
Median: 15
p25:    -31.6
Min:    -3168.9

------------
Column: x_86
------------
Type:   Continuous
Max:    2,833
p75:    52
Mean:   -26.3
Median: -21.2
p25:    -101.5
Min:    -4320.0

------------
Column: x_87
------------
Type:   Continuous
Max:    463
p75:    10.0
Mean:   4.5
Median: 3.1
p25:    -2.6
Min:    -236.0

------------
Column: x_88
------------
Type:   Continuous
Max:    7,393
p75:    86
Mean:   20
Median: 7.8
p25:    -59.5
Min:    -7458.4

------------
Column: x_89
------------
Type:   Continuous
Max:    677
p75:    9.7
Mean:   1.3
Median: 0.1
p25:    -8.8
Min:    -381.4

-----------
Column: x_9
-----------
Type:   Continuous
Max:    60
p75:    6.1
Mean:   1.9
Median: 1.8
p25:    -2.4
Min:    -41.6

---------
Column: y
---------
Type:   Continuous
Max:    2,011
p75:    2,006
Mean:   1,998
Median: 2,002
p25:    1,994
Min:    1,922
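
For readers without the kxy package, the six statistics printed for each column can be approximated with plain pandas. The snippet below is a minimal sketch, not the implementation behind df.kxy.describe(); the helper name summarize_column and the toy series are hypothetical and stand in for a real column of df.

```python
import pandas as pd


def summarize_column(series: pd.Series) -> dict:
    """Return the six statistics listed per column in the summary above."""
    return {
        "Max": series.max(),
        "p75": series.quantile(0.75),
        "Mean": series.mean(),
        "Median": series.median(),
        "p25": series.quantile(0.25),
        "Min": series.min(),
    }


# Synthetic stand-in values, not taken from the dataset.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="toy")
summary = summarize_column(toy)

for name, value in summary.items():
    print(f"{name + ':':<8}{value}")
```

On the real data, the same helper could be applied to each column in turn, for example summarize_column(df[y_column]). Rounding and formatting may differ from the output shown above.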

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable RMSE
0  0.97                  -1.88                                 1.94
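
The achievable R-squared and achievable RMSE reported here can be cross-checked against each other. The sketch below assumes the usual identity R² = 1 - RMSE² / Var(y), and it takes the "No Variable" RMSE of 1.09e+01 from the variable selection output in this notebook as the standard deviation of y. Both are assumptions made for illustration; the function names are hypothetical and not part of kxy.

```python
import math


def implied_r2(rmse: float, target_std: float) -> float:
    """R-squared implied by an RMSE, assuming R^2 = 1 - RMSE^2 / Var(y)."""
    return 1.0 - (rmse / target_std) ** 2


def implied_rmse(r2: float, target_std: float) -> float:
    """RMSE implied by an R-squared under the same assumption."""
    return target_std * math.sqrt(1.0 - r2)


# Assumption: with no explanatory variable, the best RMSE is the standard
# deviation of y, reported as 1.09e+01 in the variable selection output.
TARGET_STD = 10.9

# Achievable RMSE of 1.94 from the data valuation output.
r2 = implied_r2(1.94, TARGET_STD)
print(f"Implied R-squared: {r2:.2f}")
```

Under these assumptions, the implied value rounds to the reported 0.97. Because the displayed figures are themselves rounded, small discrepancies in the second decimal are expected when the same check is applied to other rows.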

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
Selection Order  Variable     Running Achievable R-Squared  Running Achievable RMSE
0                No Variable  0.00                          1.09e+01
1                x_0          0.05                          1.06e+01
2                x_13         0.13                          1.02e+01
3                x_1          0.18                          9.87
4                x_2          0.23                          9.61
...              ...          ...                           ...
86               x_25         0.97                          1.94
87               x_83         0.97                          1.94
88               x_44         0.97                          1.94
89               x_27         0.97                          1.94
90               x_76         0.97                          1.94

91 rows × 3 columns
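
One way to use this ranking is to keep only the leading variables needed to reach a target running R-squared. The following is a minimal sketch under stated assumptions: the helper variables_up_to is hypothetical (not a kxy function), and the toy frame contains only the first five rows visible above, since the elided rows are not reproduced here.

```python
import pandas as pd

# First rows of the selection output shown above; elided rows are left out.
selection = pd.DataFrame(
    {
        "Variable": ["No Variable", "x_0", "x_13", "x_1", "x_2"],
        "Running Achievable R-Squared": [0.00, 0.05, 0.13, 0.18, 0.23],
    }
)
selection.index.name = "Selection Order"


def variables_up_to(selection: pd.DataFrame, min_r2: float) -> list:
    """Return the variables, in selection order, needed to reach min_r2."""
    reached = (selection["Running Achievable R-Squared"] >= min_r2).to_numpy()
    if not reached.any():
        raise ValueError(f"No row reaches a running R-squared of {min_r2}")
    cutoff = int(reached.argmax())  # position of the first row meeting the target
    # Skip row 0 ("No Variable"), which is a baseline rather than a feature.
    return selection["Variable"].iloc[1 : cutoff + 1].tolist()


chosen = variables_up_to(selection, 0.15)
print(chosen)
```

With the full 91-row output in place of the toy frame, the same helper could be used to trade a small loss in achievable R-squared for a much shorter feature list, since the last rows shown above no longer change the running figures at the displayed precision.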