Avila (UCI, Classification, n=20867, d=10, 12 classes)

Loading The Data

In [1]:
from kxy_datasets.uci_classifications import Avila # pip install kxy_datasets
In [2]:
dataset = Avila()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

-------------
Column: Class
-------------
Type:      Categorical
Frequency: 41.08%, Label: A
Frequency: 18.80%, Label: F
Frequency: 10.50%, Label: E
Frequency:  7.97%, Label: I
Frequency:  5.00%, Label: X
Frequency:  4.98%, Label: H
Frequency:  4.28%, Label: G
Other Labels: 7.39%

----------
Column: F1
----------
Type:   Continuous
Max:    11
p75:    0.2
Mean:   -0.0
Median: 0.1
p25:    -0.1
Min:    -3.5

-----------
Column: F10
-----------
Type:   Continuous
Max:    11
p75:    0.5
Mean:   0.0
Median: -0.0
p25:    -0.5
Min:    -6.7

----------
Column: F2
----------
Type:   Continuous
Max:    386
p75:    0.2
Mean:   0.0
Median: -0.1
p25:    -0.3
Min:    -2.4

----------
Column: F3
----------
Type:   Continuous
Max:    50
p75:    0.4
Mean:   0.0
Median: 0.2
p25:    0.1
Min:    -3.2

----------
Column: F4
----------
Type:   Continuous
Max:    4.0
p75:    0.6
Mean:   0.0
Median: 0.1
p25:    -0.5
Min:    -5.4

----------
Column: F5
----------
Type:   Continuous
Max:    1.1
p75:    0.3
Mean:   0.0
Median: 0.3
p25:    0.2
Min:    -4.9

----------
Column: F6
----------
Type:   Continuous
Max:    53
p75:    0.6
Mean:   0.0
Median: -0.1
p25:    -0.6
Min:    -7.5

----------
Column: F7
----------
Type:   Continuous
Max:    83
p75:    0.4
Mean:   0.0
Median: 0.2
p25:    -0.0
Min:    -11.9

----------
Column: F8
----------
Type:   Continuous
Max:    13
p75:    0.6
Mean:   0.0
Median: 0.1
p25:    -0.5
Min:    -4.2

----------
Column: F9
----------
Type:   Continuous
Max:    44
p75:    0.5
Mean:   0.0
Median: 0.1
p25:    -0.4
Min:    -5.5

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable Accuracy
0                  0.97                              4.44e-16                 1.00

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
                    Variable  Running Achievable R-Squared  Running Achievable Accuracy
Selection Order
0                No Variable                          0.00                         0.41
1                         F4                          0.97                         1.00
2                        F10                          0.97                         1.00
3                         F8                          0.97                         1.00
4                         F6                          0.97                         1.00
5                         F7                          0.97                         1.00
6                         F1                          0.97                         1.00
7                         F2                          0.97                         1.00
8                         F3                          0.97                         1.00
9                         F5                          0.97                         1.00
10                        F9                          0.97                         1.00
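
The 0.41 accuracy in the "No Variable" row matches the 41.08% frequency of label A in the summary above. This suggests, as an assumption rather than documented kxy behavior, that the baseline is the accuracy of always predicting the most frequent class. A minimal sketch of that baseline follows; the helper name `majority_class_accuracy` is hypothetical and the toy labels are illustrative, not the Avila data.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label.

    `labels` can be any iterable of hashable values, e.g. a list or a
    pandas Series such as df[y_column].
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must be non-empty")
    # most_common(1) returns [(label, count)] for the top label.
    top_count = counts.most_common(1)[0][1]
    return top_count / sum(counts.values())


# Toy labels for illustration only (not the Avila data).
toy_labels = ["A", "A", "F", "E", "A", "F", "A", "I", "A", "E"]
print(round(majority_class_accuracy(toy_labels), 2))
```

With the dataframe loaded earlier, `majority_class_accuracy(df[y_column])` should come out close to 0.41 if this reading of the baseline is right.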