Letter Recognition (UCI, Classification, n=20000, d=16, 26 classes)

Loading The Data

In [1]:
from kxy_datasets.uci_classifications import LetterRecognition # pip install kxy_datasets
In [2]:
dataset = LetterRecognition()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

-----------
Column: x_0
-----------
Type:      Categorical
Frequency: 22.39%, Label: 4
Frequency: 20.79%, Label: 3
Frequency: 15.85%, Label: 5
Frequency: 14.54%, Label: 2
Frequency:  9.47%, Label: 6
Frequency:  6.30%, Label: 1
Frequency:  5.03%, Label: 7
Other Labels: 5.63%

-----------
Column: x_1
-----------
Type:      Categorical
Frequency: 13.51%, Label: 9
Frequency: 11.51%, Label: 7
Frequency: 11.05%, Label: 10
Frequency: 10.90%, Label: 8
Frequency:  8.53%, Label: 6
Frequency:  8.12%, Label: 11
Frequency:  7.83%, Label: 5
Frequency:  6.71%, Label: 4
Frequency:  6.65%, Label: 3
Frequency:  3.89%, Label: 1
Frequency:  3.54%, Label: 0
Other Labels: 7.75%

------------
Column: x_10
------------
Type:      Categorical
Frequency: 28.33%, Label: 6
Frequency: 12.99%, Label: 7
Frequency: 12.29%, Label: 5
Frequency:  7.47%, Label: 4
Frequency:  7.47%, Label: 8
Frequency:  7.04%, Label: 9
Frequency:  4.67%, Label: 11
Frequency:  4.44%, Label: 10
Frequency:  4.43%, Label: 2
Frequency:  3.63%, Label: 3
Other Labels: 7.22%

------------
Column: x_11
------------
Type:      Categorical
Frequency: 32.74%, Label: 8
Frequency: 15.00%, Label: 9
Frequency: 14.04%, Label: 7
Frequency:  9.37%, Label: 6
Frequency:  7.18%, Label: 10
Frequency:  6.67%, Label: 5
Frequency:  4.63%, Label: 11
Frequency:  3.50%, Label: 4
Other Labels: 6.86%

------------
Column: x_12
------------
Type:      Categorical
Frequency: 23.89%, Label: 3
Frequency: 21.07%, Label: 2
Frequency: 12.84%, Label: 1
Frequency: 12.30%, Label: 0
Frequency:  7.50%, Label: 4
Frequency:  6.82%, Label: 5
Frequency:  6.32%, Label: 6
Other Labels: 9.26%

------------
Column: x_13
------------
Type:      Categorical
Frequency: 38.12%, Label: 8
Frequency: 17.18%, Label: 9
Frequency: 12.58%, Label: 7
Frequency: 11.97%, Label: 10
Frequency:  7.96%, Label: 6
Frequency:  7.18%, Label: 11
Other Labels: 5.00%

------------
Column: x_14
------------
Type:      Categorical
Frequency: 15.46%, Label: 4
Frequency: 15.39%, Label: 3
Frequency: 12.38%, Label: 2
Frequency: 12.36%, Label: 0
Frequency: 10.24%, Label: 5
Frequency: 10.20%, Label: 1
Frequency:  8.62%, Label: 6
Frequency:  6.13%, Label: 7
Other Labels: 9.23%

------------
Column: x_15
------------
Type:      Categorical
Frequency: 40.23%, Label: 8
Frequency: 17.36%, Label: 7
Frequency: 11.79%, Label: 9
Frequency:  9.13%, Label: 6
Frequency:  7.89%, Label: 10
Frequency:  4.96%, Label: 5
Other Labels: 8.63%

-----------
Column: x_2
-----------
Type:      Categorical
Frequency: 21.31%, Label: 5
Frequency: 19.08%, Label: 4
Frequency: 18.20%, Label: 6
Frequency:  9.97%, Label: 3
Frequency:  9.73%, Label: 7
Frequency:  7.09%, Label: 8
Frequency:  6.42%, Label: 2
Other Labels: 8.19%

-----------
Column: x_3
-----------
Type:      Categorical
Frequency: 18.28%, Label: 6
Frequency: 17.80%, Label: 8
Frequency: 13.59%, Label: 4
Frequency: 13.47%, Label: 7
Frequency: 13.38%, Label: 5
Frequency:  7.79%, Label: 3
Frequency:  6.52%, Label: 2
Other Labels: 9.17%

-----------
Column: x_4
-----------
Type:      Categorical
Frequency: 20.77%, Label: 2
Frequency: 19.70%, Label: 3
Frequency: 15.79%, Label: 4
Frequency: 12.19%, Label: 1
Frequency: 10.77%, Label: 5
Frequency:  6.89%, Label: 6
Frequency:  4.29%, Label: 7
Other Labels: 9.62%

-----------
Column: x_5
-----------
Type:      Categorical
Frequency: 30.12%, Label: 7
Frequency: 20.09%, Label: 8
Frequency: 13.59%, Label: 6
Frequency:  8.90%, Label: 5
Frequency:  8.76%, Label: 9
Frequency:  5.34%, Label: 4
Frequency:  4.01%, Label: 10
Other Labels: 9.18%

-----------
Column: x_6
-----------
Type:      Categorical
Frequency: 30.05%, Label: 7
Frequency: 18.77%, Label: 8
Frequency: 12.53%, Label: 6
Frequency:  8.00%, Label: 9
Frequency:  6.26%, Label: 10
Frequency:  5.66%, Label: 11
Frequency:  4.38%, Label: 5
Frequency:  3.57%, Label: 4
Frequency:  3.31%, Label: 12
Other Labels: 7.47%

-----------
Column: x_7
-----------
Type:      Categorical
Frequency: 17.12%, Label: 3
Frequency: 15.99%, Label: 4
Frequency: 14.91%, Label: 5
Frequency: 13.46%, Label: 2
Frequency: 11.70%, Label: 6
Frequency:  7.11%, Label: 7
Frequency:  5.42%, Label: 1
Frequency:  5.07%, Label: 8
Other Labels: 9.21%

-----------
Column: x_8
-----------
Type:      Categorical
Frequency: 16.21%, Label: 5
Frequency: 15.49%, Label: 6
Frequency: 15.05%, Label: 4
Frequency: 13.12%, Label: 7
Frequency:  9.57%, Label: 3
Frequency:  9.26%, Label: 2
Frequency:  8.44%, Label: 8
Frequency:  4.37%, Label: 9
Other Labels: 8.48%

-----------
Column: x_9
-----------
Type:      Categorical
Frequency: 27.89%, Label: 7
Frequency: 13.76%, Label: 6
Frequency: 12.81%, Label: 10
Frequency:  9.86%, Label: 9
Frequency:  8.99%, Label: 11
Frequency:  8.34%, Label: 8
Frequency:  5.21%, Label: 12
Frequency:  4.09%, Label: 5
Other Labels: 9.04%

---------
Column: y
---------
Type:      Categorical
Frequency:  4.07%, Label: U
Frequency:  4.03%, Label: D
Frequency:  4.01%, Label: P
Frequency:  3.98%, Label: T
Frequency:  3.96%, Label: M
Frequency:  3.94%, Label: A
Frequency:  3.94%, Label: X
Frequency:  3.93%, Label: Y
Frequency:  3.92%, Label: N
Frequency:  3.92%, Label: Q
Frequency:  3.88%, Label: F
Frequency:  3.87%, Label: G
Frequency:  3.84%, Label: E
Frequency:  3.83%, Label: B
Frequency:  3.82%, Label: V
Frequency:  3.81%, Label: L
Frequency:  3.79%, Label: R
Frequency:  3.77%, Label: I
Frequency:  3.77%, Label: O
Frequency:  3.76%, Label: W
Frequency:  3.74%, Label: S
Frequency:  3.73%, Label: J
Frequency:  3.69%, Label: K
Frequency:  3.68%, Label: C
Other Labels: 7.34%
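
The per-column summaries above can be approximated with plain pandas. The following is a minimal sketch on a toy series, not the dataset itself; `frequency_summary` is a hypothetical helper (not part of the kxy API), and the 90% cumulative-coverage cut-off for "Other Labels" is an assumption inferred from the printed output, not documented behaviour.

```python
import pandas as pd


def frequency_summary(series, coverage=0.90):
    """Return (frequencies of the labels shown, remaining 'other' mass).

    Hypothetical helper: labels are listed by decreasing frequency until
    their cumulative frequency reaches `coverage` (an assumed cut-off).
    """
    freqs = series.value_counts(normalize=True)  # sorted in descending order
    cumulative = freqs.cumsum()
    # Keep labels up to and including the first one that reaches `coverage`.
    n_keep = min(int((cumulative < coverage).sum()) + 1, len(freqs))
    shown = freqs.iloc[:n_keep]
    other = max(0.0, 1.0 - float(shown.sum()))
    return shown, other


# Toy data: frequencies are a=50%, b=30%, c=10%, d=10%.
toy = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] + ["d"])
shown, other = frequency_summary(toy, coverage=0.75)

for label, freq in shown.items():
    print(f"Frequency: {freq:6.2%}, Label: {label}")
print(f"Other Labels: {other:.2%}")
```

Applied to a column of `df`, for example `frequency_summary(df[y_column])`, this should produce a summary of the same shape as the one printed by `df.kxy.describe()`, provided the assumed cut-off matches.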

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable Accuracy
0                  1.00                             -1.89e-04                 1.00
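
To put the achievable log-likelihood per sample in context, it can be converted to a geometric-mean probability assigned to the true label and compared with a uniform guess over the 26 classes. This is a rough sketch that assumes the log-likelihood is reported in nats (natural logarithm), which the output above does not state.

```python
import math

achievable_ll = -1.89e-04  # copied from the data valuation output
n_classes = 26             # from the dataset title

# Geometric-mean probability of the true label implied by the log-likelihood
# (assumes natural logarithm).
geo_mean_prob = math.exp(achievable_ll)

# Log-likelihood per sample of a classifier that guesses uniformly at random.
uniform_ll = math.log(1.0 / n_classes)

print(f"Implied geometric-mean probability: {geo_mean_prob:.5f}")
print(f"Uniform-guess log-likelihood per sample: {uniform_ll:.4f}")
```

Under that assumption, the achievable figure sits very close to the upper bound of 0, far above what a uniform guess would reach.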

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
                    Variable  Running Achievable R-Squared  Running Achievable Accuracy
Selection Order
0                No Variable                          0.00                         0.04
1                       x_12                          0.71                         0.39
2                       x_14                          0.94                         0.64
3                        x_8                          0.99                         0.80
4                        x_7                          1.00                         0.92
5                       x_10                          1.00                         0.98
6                        x_1                          1.00                         0.99
7                       x_11                          1.00                         1.00
8                        x_9                          1.00                         1.00
9                       x_15                          1.00                         1.00
10                       x_2                          1.00                         1.00
11                       x_5                          1.00                         1.00
12                       x_3                          1.00                         1.00
13                       x_0                          1.00                         1.00
14                       x_4                          1.00                         1.00
15                       x_6                          1.00                         1.00
16                      x_13                          1.00                         1.00
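
The running accuracies above can be turned into marginal gains per variable, and into the smallest number of variables needed to reach a target accuracy. The sketch below simply copies the figures from the table; the 0.98 target and the names `selection`, `gains`, and `target` are illustrative choices, not part of the kxy output.

```python
# (variable, running achievable accuracy), copied from the table above.
selection = [
    ("No Variable", 0.04), ("x_12", 0.39), ("x_14", 0.64), ("x_8", 0.80),
    ("x_7", 0.92), ("x_10", 0.98), ("x_1", 0.99), ("x_11", 1.00),
    ("x_9", 1.00), ("x_15", 1.00), ("x_2", 1.00), ("x_5", 1.00),
    ("x_3", 1.00), ("x_0", 1.00), ("x_4", 1.00), ("x_6", 1.00),
    ("x_13", 1.00),
]

# Marginal accuracy gain contributed by each newly added variable.
gains = [
    (name, round(acc - prev_acc, 2))
    for (_, prev_acc), (name, acc) in zip(selection, selection[1:])
]

# Smallest number of variables whose running accuracy reaches the target.
target = 0.98  # illustrative threshold
n_needed = next(i for i, (_, acc) in enumerate(selection) if acc >= target)
top_variables = [name for name, _ in selection[1:n_needed + 1]]

print(gains[:5])
print(n_needed, top_variables)
```

With these figures, the first five variables already account for an achievable accuracy of 0.98, and the remaining eleven add at most 0.02 combined.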