Social Media Buzz (UCI, Regression, n=583250, d=77)

Loading The Data

In [1]:
import kxy # pip install kxy; makes the registration of the df.kxy pandas accessor explicit
from kxy_datasets.uci_regressions import SocialMediaBuzz # pip install kxy_datasets
In [2]:
dataset = SocialMediaBuzz()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

-------------
Column: ADL_0
-------------
Type:   Continuous
Max:    294
p75:    1.1
Mean:   1.1
Median: 1.0
p25:    1.0
Min:    0.0

-------------
Column: ADL_1
-------------
Type:   Continuous
Max:    295
p75:    1.1
Mean:   1.1
Median: 1.0
p25:    1.0
Min:    0.0

-------------
Column: ADL_2
-------------
Type:   Continuous
Max:    295
p75:    1.1
Mean:   1.1
Median: 1.0
p25:    1.0
Min:    0.0

-------------
Column: ADL_3
-------------
Type:   Continuous
Max:    273
p75:    1.1
Mean:   1.1
Median: 1.0
p25:    1.0
Min:    0.0

-------------
Column: ADL_4
-------------
Type:   Continuous
Max:    294
p75:    1.1
Mean:   1.1
Median: 1.0
p25:    1.0
Min:    0.0

-------------
Column: ADL_5
-------------
Type:   Continuous
Max:    262
p75:    1.1
Mean:   1.1
Median: 1.0
p25:    1.0
Min:    0.0

-------------
Column: ADL_6
-------------
Type:   Continuous
Max:    295
p75:    1.1
Mean:   1.1
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: AI_0
------------
Type:   Continuous
Max:    18,654
p75:    59
Mean:   71
Median: 11
p25:    2.0
Min:    0.0

------------
Column: AI_1
------------
Type:   Continuous
Max:    22,035
p75:    57
Mean:   69
Median: 11
p25:    2.0
Min:    0.0

------------
Column: AI_2
------------
Type:   Continuous
Max:    29,402
p75:    65
Mean:   82
Median: 13
p25:    2.0
Min:    0.0

------------
Column: AI_3
------------
Type:   Continuous
Max:    57,617
p75:    74
Mean:   93
Median: 14
p25:    3.0
Min:    0.0

------------
Column: AI_4
------------
Type:   Continuous
Max:    63,147
p75:    82
Mean:   103
Median: 16
p25:    3.0
Min:    0.0

------------
Column: AI_5
------------
Type:   Continuous
Max:    63,147
p75:    92
Mean:   113
Median: 19
p25:    4.0
Min:    0.0

------------
Column: AI_6
------------
Type:   Continuous
Max:    63,147
p75:    91
Mean:   112
Median: 18
p25:    4.0
Min:    0.0

----------------
Column: AS(NA)_0
----------------
Type:   Continuous
Max:    0.0
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

----------------
Column: AS(NA)_1
----------------
Type:   Continuous
Max:    0.0
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

----------------
Column: AS(NA)_2
----------------
Type:   Continuous
Max:    0.0
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

----------------
Column: AS(NA)_3
----------------
Type:   Continuous
Max:    0.1
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

----------------
Column: AS(NA)_4
----------------
Type:   Continuous
Max:    0.1
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

----------------
Column: AS(NA)_5
----------------
Type:   Continuous
Max:    0.1
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

----------------
Column: AS(NA)_6
----------------
Type:   Continuous
Max:    0.1
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

-----------------
Column: AS(NAC)_0
-----------------
Type:   Continuous
Max:    0.0
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

-----------------
Column: AS(NAC)_1
-----------------
Type:   Continuous
Max:    0.0
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

-----------------
Column: AS(NAC)_2
-----------------
Type:   Continuous
Max:    0.0
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

-----------------
Column: AS(NAC)_3
-----------------
Type:   Continuous
Max:    0.0
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

-----------------
Column: AS(NAC)_4
-----------------
Type:   Continuous
Max:    0.0
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

-----------------
Column: AS(NAC)_5
-----------------
Type:   Continuous
Max:    0.0
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

-----------------
Column: AS(NAC)_6
-----------------
Type:   Continuous
Max:    0.0
p75:    0.0
Mean:   0.0
Median: 0.0
p25:    0.0
Min:    0.0

------------
Column: AT_0
------------
Type:   Continuous
Max:    282
p75:    1.1
Mean:   1.0
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: AT_1
------------
Type:   Continuous
Max:    283
p75:    1.1
Mean:   1.0
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: AT_2
------------
Type:   Continuous
Max:    283
p75:    1.1
Mean:   1.0
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: AT_3
------------
Type:   Continuous
Max:    263
p75:    1.1
Mean:   1.0
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: AT_4
------------
Type:   Continuous
Max:    282
p75:    1.1
Mean:   1.0
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: AT_5
------------
Type:   Continuous
Max:    249
p75:    1.1
Mean:   1.1
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: AT_6
------------
Type:   Continuous
Max:    283
p75:    1.1
Mean:   1.1
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: BL_0
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.9
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: BL_1
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.9
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: BL_2
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.9
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: BL_3
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.9
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: BL_4
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.9
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: BL_5
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   1.0
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: BL_6
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   1.0
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: CS_0
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.9
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: CS_1
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.9
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: CS_2
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.9
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: CS_3
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.9
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: CS_4
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.9
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: CS_5
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   1.0
Median: 1.0
p25:    1.0
Min:    0.0

------------
Column: CS_6
------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   1.0
Median: 1.0
p25:    1.0
Min:    0.0

-------------
Column: NAC_0
-------------
Type:   Continuous
Max:    26,644
p75:    113
Mean:   150
Median: 20
p25:    3.0
Min:    0.0

-------------
Column: NAC_1
-------------
Type:   Continuous
Max:    29,574
p75:    108
Mean:   146
Median: 19
p25:    3.0
Min:    0.0

-------------
Column: NAC_2
-------------
Type:   Continuous
Max:    37,505
p75:    125
Mean:   170
Median: 23
p25:    4.0
Min:    0.0

-------------
Column: NAC_3
-------------
Type:   Continuous
Max:    72,397
p75:    143
Mean:   193
Median: 26
p25:    5.0
Min:    0.0

-------------
Column: NAC_4
-------------
Type:   Continuous
Max:    79,136
p75:    161
Mean:   214
Median: 30
p25:    5.0
Min:    0.0

-------------
Column: NAC_5
-------------
Type:   Continuous
Max:    79,136
p75:    181
Mean:   234
Median: 34
p25:    7.0
Min:    0.0

-------------
Column: NAC_6
-------------
Type:   Continuous
Max:    79,136
p75:    179
Mean:   233
Median: 34
p25:    7.0
Min:    0.0

-------------
Column: NAD_0
-------------
Type:   Continuous
Max:    24,301
p75:    104
Mean:   140
Median: 18
p25:    3.0
Min:    0.0

-------------
Column: NAD_1
-------------
Type:   Continuous
Max:    29,574
p75:    101
Mean:   137
Median: 17
p25:    3.0
Min:    0.0

-------------
Column: NAD_2
-------------
Type:   Continuous
Max:    37,505
p75:    115
Mean:   160
Median: 21
p25:    4.0
Min:    0.0

-------------
Column: NAD_3
-------------
Type:   Continuous
Max:    72,366
p75:    131
Mean:   182
Median: 24
p25:    4.0
Min:    0.0

-------------
Column: NAD_4
-------------
Type:   Continuous
Max:    79,083
p75:    148
Mean:   201
Median: 27
p25:    5.0
Min:    0.0

-------------
Column: NAD_5
-------------
Type:   Continuous
Max:    79,083
p75:    167
Mean:   220
Median: 31
p25:    6.0
Min:    0.0

-------------
Column: NAD_6
-------------
Type:   Continuous
Max:    79,083
p75:    165
Mean:   219
Median: 31
p25:    6.0
Min:    0.0

------------
Column: NA_0
------------
Type:   Continuous
Max:    21,723
p75:    96
Mean:   123
Median: 17
p25:    3.0
Min:    0.0

------------
Column: NA_1
------------
Type:   Continuous
Max:    26,134
p75:    92
Mean:   119
Median: 17
p25:    3.0
Min:    0.0

------------
Column: NA_2
------------
Type:   Continuous
Max:    34,085
p75:    105
Mean:   138
Median: 20
p25:    4.0
Min:    0.0

------------
Column: NA_3
------------
Type:   Continuous
Max:    64,984
p75:    118
Mean:   157
Median: 22
p25:    4.0
Min:    0.0

------------
Column: NA_4
------------
Type:   Continuous
Max:    72,159
p75:    132
Mean:   173
Median: 25
p25:    5.0
Min:    0.0

------------
Column: NA_5
------------
Type:   Continuous
Max:    72,159
p75:    148
Mean:   189
Median: 29
p25:    6.0
Min:    0.0

------------
Column: NA_6
------------
Type:   Continuous
Max:    72,159
p75:    146
Mean:   188
Median: 28
p25:    6.0
Min:    0.0

-------------
Column: NCD_0
-------------
Type:   Continuous
Max:    24,210
p75:    104
Mean:   140
Median: 18
p25:    3.0
Min:    0.0

-------------
Column: NCD_1
-------------
Type:   Continuous
Max:    29,574
p75:    100
Mean:   136
Median: 17
p25:    3.0
Min:    0.0

-------------
Column: NCD_2
-------------
Type:   Continuous
Max:    37,505
p75:    115
Mean:   159
Median: 21
p25:    4.0
Min:    0.0

-------------
Column: NCD_3
-------------
Type:   Continuous
Max:    72,366
p75:    131
Mean:   181
Median: 24
p25:    4.0
Min:    0.0

-------------
Column: NCD_4
-------------
Type:   Continuous
Max:    79,079
p75:    147
Mean:   201
Median: 27
p25:    5.0
Min:    0.0

-------------
Column: NCD_5
-------------
Type:   Continuous
Max:    79,079
p75:    166
Mean:   220
Median: 31
p25:    6.0
Min:    0.0

-------------
Column: NCD_6
-------------
Type:   Continuous
Max:    79,079
p75:    164
Mean:   219
Median: 30
p25:    6.0
Min:    0.0

------------------
Column: annotation
------------------
Type:   Continuous
Max:    75,724
p75:    139
Mean:   191
Median: 25
p25:    4.5
Min:    0.0

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable RMSE
0  1.00                  -7.37                                 5.14

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
Selection Order  Variable     Running Achievable R-Squared  Running Achievable RMSE
0                No Variable  0.00                          6.12e+02
1                NAD_6        0.91                          1.80e+02
2                NAD_5        0.93                          1.61e+02
3                NAD_4        0.94                          1.54e+02
4                NAD_0        0.94                          1.54e+02
...              ...          ...                           ...
73               AS(NA)_3     1.00                          5.14
74               AI_5         1.00                          5.14
75               ADL_5        1.00                          5.14
76               AT_0         1.00                          5.14
77               CS_3         1.00                          5.14

78 rows × 3 columns
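
The two running columns above appear to be tied together by the usual identity R² = 1 - (RMSE / baseline RMSE)², with the "No Variable" RMSE as the baseline. The sketch below checks this against a few rows copied from the table. It assumes that identity (the notebook does not state how the package computes these columns), and the helper name `implied_r_squared` is hypothetical, not part of kxy.

```python
# Values copied from the variable-selection output above.
BASELINE_RMSE = 6.12e+02  # "No Variable" row: RMSE with no explanatory variable

# (variable, reported running R-squared, reported running RMSE)
ROWS = [
    ("NAD_6", 0.91, 1.80e+02),
    ("NAD_5", 0.93, 1.61e+02),
    ("NAD_4", 0.94, 1.54e+02),
    ("CS_3", 1.00, 5.14),
]


def implied_r_squared(rmse, baseline_rmse=BASELINE_RMSE):
    """R-squared implied by an RMSE, assuming R^2 = 1 - (rmse / baseline)^2.

    Hypothetical helper for this consistency check; not a kxy function.
    """
    return 1.0 - (rmse / baseline_rmse) ** 2


for name, reported, rmse in ROWS:
    implied = implied_r_squared(rmse)
    print(f"{name}: reported R^2 = {reported:.2f}, implied R^2 = {implied:.2f}")
```

Under that assumption, the implied values agree with the reported ones to two decimals. Read this way, NAD_6 alone cuts the achievable RMSE from about 612 to about 180, and the full variable set brings it down to 5.14, the same figure reported in the data valuation step.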