Facebook Comments (UCI, Regression, n=209074, d=53)

Loading The Data

In [1]:
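import kxy # pip install kxy; importing it explicitly makes the source of the df.kxy accessor clear
from kxy_datasets.uci_regressions import FacebookComments # pip install kxy_datasets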
In [2]:
dataset = FacebookComments()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

-----------
Column: x_0
-----------
Type:   Continuous
Max:    486,972,297
p75:    1,341,299
Mean:   1,451,183
Median: 313,452
p25:    43,331
Min:    36

-----------
Column: x_1
-----------
Type:   Continuous
Max:    1,100,558
p75:    99
Mean:   4,748
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_10
------------
Type:   Continuous
Max:    2,783
p75:    599
Mean:   392
Median: 193
p25:    41
Min:    0.0

------------
Column: x_11
------------
Type:   Continuous
Max:    1,672
p75:    29
Mean:   24
Median: 9.4
p25:    2.0
Min:    0.0

------------
Column: x_12
------------
Type:   Continuous
Max:    1,672
p75:    8.0
Mean:   8.8
Median: 3.0
p25:    0.0
Min:    0.0

------------
Column: x_13
------------
Type:   Continuous
Max:    1,101
p75:    66
Mean:   43
Median: 20
p25:    5.0
Min:    0.0

------------
Column: x_14
------------
Type:   Continuous
Max:    2,218
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_15
------------
Type:   Continuous
Max:    2,455
p75:    565
Mean:   376
Median: 197
p25:    38
Min:    0.0

------------
Column: x_16
------------
Type:   Continuous
Max:    2,218
p75:    26
Mean:   20
Median: 7.9
p25:    2.0
Min:    0.0

------------
Column: x_17
------------
Type:   Continuous
Max:    2,218
p75:    5.0
Mean:   4.8
Median: 1.0
p25:    0.0
Min:    0.0

------------
Column: x_18
------------
Type:   Continuous
Max:    912
p75:    59
Mean:   41
Median: 19
p25:    4.7
Min:    0.0

------------
Column: x_19
------------
Type:   Continuous
Max:    2,405
p75:    0.0
Mean:   0.9
Median: 0.0
p25:    0.0
Min:    0.0

-----------
Column: x_2
-----------
Type:   Continuous
Max:    6,784,263
p75:    54,849
Mean:   56,123
Median: 9,484
p25:    770
Min:    0.0

------------
Column: x_20
------------
Type:   Continuous
Max:    2,783
p75:    705
Mean:   447
Median: 233
p25:    44
Min:    0.0

------------
Column: x_21
------------
Type:   Continuous
Max:    2,405
p75:    74
Mean:   54
Median: 23
p25:    5.3
Min:    0.0

------------
Column: x_22
------------
Type:   Continuous
Max:    2,405
p75:    40
Mean:   34
Median: 12
p25:    2.0
Min:    0.0

------------
Column: x_23
------------
Type:   Continuous
Max:    1,101
p75:    100
Mean:   66
Median: 31
p25:    7.7
Min:    0.0

------------
Column: x_24
------------
Type:   Continuous
Max:    1,660
p75:    -33.0
Mean:   -320.6
Median: -163.0
p25:    -455.0
Min:    -2119.0

------------
Column: x_25
------------
Type:   Continuous
Max:    2,783
p75:    594
Mean:   387
Median: 182
p25:    41
Min:    -1861.0

------------
Column: x_26
------------
Type:   Continuous
Max:    1,660
p75:    1.7
Mean:   4.1
Median: 0.3
p25:    -0.0
Min:    -1861.0

------------
Column: x_27
------------
Type:   Continuous
Max:    1,660
p75:    0.0
Mean:   -0.6
Median: 0.0
p25:    -2.0
Min:    -1861.0

------------
Column: x_28
------------
Type:   Continuous
Max:    1,386
p75:    87
Mean:   59
Median: 27
p25:    6.9
Min:    0.0

------------
Column: x_29
------------
Type:   Continuous
Max:    2,858
p75:    47
Mean:   58
Median: 11
p25:    2.0
Min:    0.0

-----------
Column: x_3
-----------
Type:   Continuous
Max:    107
p75:    32
Mean:   24
Median: 18
p25:    9.0
Min:    1.0

------------
Column: x_30
------------
Type:   Continuous
Max:    2,783
p75:    12
Mean:   24
Median: 2.0
p25:    0.0
Min:    0.0

------------
Column: x_31
------------
Type:   Continuous
Max:    2,455
p75:    8.0
Mean:   20
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_32
------------
Type:   Continuous
Max:    2,783
p75:    45
Mean:   54
Median: 11
p25:    2.0
Min:    0.0

------------
Column: x_33
------------
Type:   Continuous
Max:    2,783
p75:    4.0
Mean:   4.1
Median: 0.0
p25:    -6.0
Min:    -2119.0

------------
Column: x_34
------------
Type:   Continuous
Max:    72
p75:    53
Mean:   34
Median: 34
p25:    16
Min:    0.0

------------
Column: x_35
------------
Type:   Continuous
Max:    21,480
p75:    172
Mean:   163
Median: 97
p25:    38
Min:    0.0

------------
Column: x_36
------------
Type:   Continuous
Max:    144,860
p75:    62
Mean:   119
Median: 14
p25:    2.0
Min:    1.0

------------
Column: x_37
------------
Type:   Continuous
Max:    0.0
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_38
------------
Type:   Continuous
Max:    24
p75:    24
Mean:   23
Median: 24
p25:    24
Min:    0.0

------------
Column: x_39
------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

-----------
Column: x_4
-----------
Type:   Continuous
Max:    2,575
p75:    0.0
Mean:   0.9
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_40
------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_41
------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_42
------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_43
------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_44
------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.2
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_45
------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_46
------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_47
------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_48
------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_49
------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

-----------
Column: x_5
-----------
Type:   Continuous
Max:    2,858
p75:    803
Mean:   497
Median: 258
p25:    50
Min:    0.0

------------
Column: x_50
------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.2
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_51
------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: x_52
------------
Type:   Continuous
Max:    1.0
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

-----------
Column: x_6
-----------
Type:   Continuous
Max:    2,575
p75:    78
Mean:   58
Median: 24
p25:    5.6
Min:    0.0

-----------
Column: x_7
-----------
Type:   Continuous
Max:    2,575
p75:    42
Mean:   36
Median: 12
p25:    2.0
Min:    0.0

-----------
Column: x_8
-----------
Type:   Continuous
Max:    1,101
p75:    107
Mean:   71
Median: 36
p25:    8.1
Min:    0.0

-----------
Column: x_9
-----------
Type:   Continuous
Max:    1,672
p75:    0.0
Mean:   0.3
Median: 0.0
p25:    0.0
Min:    0.0

---------
Column: y
---------
Type:   Continuous
Max:    2,412
p75:    3.0
Mean:   8.2
Median: 0.0
p25:    0.0
Min:    0.0
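
The summary above shows that x_37 has identical Min and Max (0.0), so it carries no information about y, and that x_39 to x_52 only range between 0.0 and 1.0. The sketch below reproduces the same six statistics with plain pandas and flags constant columns. It is a minimal illustration on a toy dataframe, not the implementation behind df.kxy.describe(); the helper names summarize and constant_columns and the toy values are hypothetical.

```python
import pandas as pd


def summarize(frame: pd.DataFrame) -> pd.DataFrame:
    """Per-column statistics, in the same order as the summary above."""
    return pd.DataFrame({
        "Max": frame.max(),
        "p75": frame.quantile(0.75),
        "Mean": frame.mean(),
        "Median": frame.median(),
        "p25": frame.quantile(0.25),
        "Min": frame.min(),
    })


def constant_columns(frame: pd.DataFrame) -> list:
    """Names of columns whose minimum equals their maximum."""
    stats = summarize(frame)
    return stats.index[stats["Max"] == stats["Min"]].tolist()


# Toy stand-in for df (hypothetical values, not taken from the dataset).
toy = pd.DataFrame({
    "x_37": [0.0, 0.0, 0.0, 0.0],
    "x_39": [0.0, 1.0, 0.0, 0.0],
    "y": [0.0, 3.0, 0.0, 12.0],
})

summary = summarize(toy)
print(summary)
print(constant_columns(toy))
```

On the real data, summarize(df) and constant_columns(df) could be called in the same way, assuming all columns are numeric as the summary suggests.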

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable RMSE
0                  0.98                                 -7.48             6.46
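
The achievable R-squared and achievable RMSE can be cross-checked against each other. The sketch below assumes the usual identity R² = 1 - (RMSE / baseline RMSE)², where the baseline is the RMSE obtained with no explanatory variable (4.22e+01 in the "No Variable" row of the variable selection table). This is an assumption about how the two figures relate, not a description of how kxy computes them; the function names are hypothetical.

```python
import math


def r_squared_from_rmse(rmse: float, baseline_rmse: float) -> float:
    """R-squared implied by an RMSE, relative to the no-variable baseline RMSE."""
    return 1.0 - (rmse / baseline_rmse) ** 2


def rmse_from_r_squared(r_squared: float, baseline_rmse: float) -> float:
    """RMSE implied by an R-squared, relative to the no-variable baseline RMSE."""
    return baseline_rmse * math.sqrt(1.0 - r_squared)


baseline_rmse = 42.2    # "No Variable" row of the variable selection table
achievable_rmse = 6.46  # data valuation output

implied_r2 = r_squared_from_rmse(achievable_rmse, baseline_rmse)
print(f"Implied R-squared: {implied_r2:.2f}")
```

Under this assumption, the implied value rounds to the reported 0.98. Going the other way from the rounded 0.98 gives a slightly different RMSE, because the displayed R-squared is rounded to two decimals.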

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
Selection Order  Variable     Running Achievable R-Squared  Running Achievable RMSE
0                No Variable  0.00                          4.22e+01
1                x_30         0.57                          2.78e+01
2                x_34         0.67                          2.41e+01
3                x_31         0.67                          2.41e+01
4                x_17         0.67                          2.41e+01
5                x_38         0.67                          2.41e+01
6                x_33         0.67                          2.41e+01
7                x_29         0.78                          2.00e+01
8                x_36         0.78                          2.00e+01
9                x_24         0.78                          2.00e+01
10               x_12         0.78                          2.00e+01
11               x_46         0.89                          1.37e+01
12               x_1          0.95                          9.41
13               x_15         0.98                          6.46
14               x_10         0.98                          6.46
15               x_2          0.98                          6.46
16               x_26         0.98                          6.46
17               x_52         0.98                          6.46
18               x_22         0.98                          6.46
19               x_7          0.98                          6.46
20               x_23         0.98                          6.46
21               x_8          0.98                          6.46
22               x_5          0.98                          6.46
23               x_21         0.98                          6.46
24               x_28         0.98                          6.46
25               x_25         0.98                          6.46
26               x_3          0.98                          6.46
27               x_16         0.98                          6.46
28               x_18         0.98                          6.46
29               x_11         0.98                          6.46
30               x_0          0.98                          6.46
31               x_32         0.98                          6.46
32               x_14         0.98                          6.46
33               x_9          0.98                          6.46
34               x_19         0.98                          6.46
35               x_27         0.98                          6.46
36               x_51         0.98                          6.46
37               x_48         0.98                          6.46
38               x_47         0.98                          6.46
39               x_50         0.98                          6.46
40               x_42         0.98                          6.46
41               x_49         0.98                          6.46
42               x_41         0.98                          6.46
43               x_43         0.98                          6.46
44               x_44         0.98                          6.46
45               x_40         0.98                          6.46
46               x_39         0.98                          6.46
47               x_45         0.98                          6.46
48               x_35         0.98                          6.46
49               x_4          0.98                          6.46
50               x_13         0.98                          6.46
51               x_6          0.98                          6.46
52               x_20         0.98                          6.46
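
The running achievable R-squared stops changing at 0.98 once x_15 is selected, so the remaining variables add nothing at the displayed precision. The sketch below reads the first rows of the table (copied by hand) and extracts two things: the variables at which the displayed R-squared increases, and the shortest prefix of the selection order that reaches the final value. The helper names are hypothetical. Because the table is rounded to two decimals, gains smaller than 0.01 are invisible to this check.

```python
# (variable, running achievable R-squared), copied from the table above.
# Rows after x_10 all show 0.98 and are omitted for brevity.
rows = [
    ("No Variable", 0.00), ("x_30", 0.57), ("x_34", 0.67), ("x_31", 0.67),
    ("x_17", 0.67), ("x_38", 0.67), ("x_33", 0.67), ("x_29", 0.78),
    ("x_36", 0.78), ("x_24", 0.78), ("x_12", 0.78), ("x_46", 0.89),
    ("x_1", 0.95), ("x_15", 0.98), ("x_10", 0.98),
]


def variables_with_gain(table):
    """Variables whose displayed running R-squared exceeds the previous row's."""
    return [
        name
        for (_, previous), (name, current) in zip(table, table[1:])
        if current > previous
    ]


def shortest_prefix(table, target):
    """Variables, in selection order, up to the first row reaching `target`."""
    selected = []
    for name, r_squared in table[1:]:  # skip the "No Variable" baseline row
        selected.append(name)
        if r_squared >= target:
            break
    return selected


gainers = variables_with_gain(rows)
prefix = shortest_prefix(rows, target=rows[-1][1])
print(gainers)
print(len(prefix), prefix)
```

A reasonable next step, under these assumptions, would be to train a model on df[prefix] only and compare its RMSE with the achievable RMSE of 6.46.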