The KXY Data Valuation Function Works (Classification)

We consider two synthetic classification problems for which the highest achievable accuracy is known. We restrict ourselves to bivariate problems so that we can visualize the data.

We show that the data valuation analysis of the kxy package recovers the highest achievable classification accuracy almost perfectly.

In [1]:
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)

# Required imports
import pandas as pd
import seaborn as sns
import kxy

I - Generate the Data

In each experiment, we partition a square into two parts: one containing blue dots, and the other containing red dots.

Tasked with determining the color of a dot from its coordinates, the best classifier can, by design, achieve perfect accuracy in the absence of noise.

To introduce a known level of label noise, we go through the points on the square and flip their color independently with probability \(p\).
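Since the best possible classifier errs exactly on the flipped points, the flip probability directly caps the accuracy any model can reach. As a minimal illustration of this ceiling (not part of the original notebook):

# With colors flipped independently with probability p, the Bayes-optimal
# classifier predicts each region's true color and errs exactly on the
# flipped points: no model can exceed an accuracy of 1 - p.
p = 0.1
print('Highest achievable accuracy: %.2f' % (1. - p))  # prints 0.90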

In [2]:
p = 0.1 # Error probability
n = 2000 # Sample size

Flag Data

In [3]:
# Generate the data
u = -1. + 2.*np.random.rand(n, 2)   # n points drawn uniformly on the square [-1, 1]^2
perm = np.random.rand(n) < p        # points whose color gets flipped
b = np.zeros(n)                     # blue indicator

# Non-flipped points are blue iff they fall in the middle vertical band
# |x| <= 0.33; flipped points get the opposite color.
b[perm] = np.logical_or(u[perm, 0] < -0.33, u[perm, 0] > 0.33)
b[np.logical_not(perm)] = np.logical_and(u[np.logical_not(perm), 0] >= -0.33,
                                         u[np.logical_not(perm), 0] <= 0.33)

b = b.astype(bool)
r = np.logical_not(b)
zb = u[b, :]                        # blue points
zr = u[r, :]                        # red points
yb = np.ones(zb.shape[0])           # blue -> label 1
yr = np.zeros(zr.shape[0])          # red  -> label 0
Xq = np.concatenate([zb, zr], axis=0)
Yq = np.concatenate([yb, yr], axis=0)
Zq = np.concatenate([Yq[:, None], Xq], axis=1)

# Numpy -> DataFrame
dfq = pd.DataFrame(Zq, columns=['label', 'var1', 'var2'])

# Plot
y_border = np.arange(-1., 1., 0.05)
fig, ax = plt.subplots(1, 1, figsize=(15, 15))
plt.plot(zb[:, 0], zb[:, 1], '.b')
plt.plot(-.33*np.ones_like(y_border), y_border, '--k')   # left band boundary
plt.plot(.33*np.ones_like(y_border), y_border, '--k')    # right band boundary
plt.plot(zr[:, 0], zr[:, 1], '.r')
ax.set_xlim([-1., 1.])
ax.set_ylim([-1., 1.])
ax.axis('off')  # hides both axes, so separate set_visible calls are not needed
ax.set_title('Error Probability: %.2f' % p, fontsize=25)
plt.show()
[Figure: flag data — blue points inside the vertical band |x| <= 0.33, red points outside, with dashed lines marking the band boundaries; the title shows the error probability.]

Circular Data

In [4]:
# Generate the data
dr = 0.05                           # half-width of the buffer zone around the inner circle
u = -1. + 2.*np.random.rand(n, 2)   # n points drawn uniformly on the square [-1, 1]^2
perm = np.random.rand(n) < p        # points whose color gets flipped
b = np.zeros(n)                     # blue indicator

# Non-flipped points are blue iff they fall inside the inner disk (radius 0.5-dr);
# flipped points are blue iff they fall in the outer ring instead.
b[perm] = np.logical_and(np.linalg.norm(u[perm], axis=1)>0.5+dr, np.linalg.norm(u[perm], axis=1)<1.)
b[np.logical_not(perm)] = np.linalg.norm(u[np.logical_not(perm)], axis=1)<0.5-dr
b = b.astype(bool)

# Red points: everything else inside the unit circle, excluding the buffer zone
r = np.logical_and(np.logical_not(b), np.linalg.norm(u, axis=1)<1.)
r = np.logical_and(r, np.logical_or(np.linalg.norm(u, axis=1)<0.5-dr, np.linalg.norm(u, axis=1)>0.5+dr))

zb = u[b, :]                        # blue points
zr = u[r, :]                        # red points
yb = np.ones(zb.shape[0])           # blue -> label 1
yr = np.zeros(zr.shape[0])          # red  -> label 0
Xc = np.concatenate([zb, zr], axis=0)
Yc = np.concatenate([yb, yr], axis=0)
Zc = np.concatenate([Yc[:, None], Xc], axis=1)

# Numpy -> DataFrame
dfc = pd.DataFrame(Zc, columns=['label', 'var1', 'var2'])

# Plot
fig, ax = plt.subplots(1, 1, figsize=(15, 15))
theta = np.linspace(0., 2.*np.pi, 100)
r1, r2 = 0.5, 1.0
xc1, yc1 = r1*np.cos(theta), r1*np.sin(theta)   # inner circle
xc2, yc2 = r2*np.cos(theta), r2*np.sin(theta)   # outer circle

plt.plot(zb[:, 0], zb[:, 1], '.b')
plt.plot(zr[:, 0], zr[:, 1], '.r')
plt.plot(xc1, yc1, '--k')
plt.plot(xc2, yc2, '--k')
ax.set_xlim([-1., 1.])
ax.set_ylim([-1., 1.])
ax.axis('off')  # hides both axes, so separate set_visible calls are not needed
ax.set_title('Error Probability: %.2f' % p, fontsize=25)
plt.show()
[Figure: circular data — blue points inside the inner disk, red points in the surrounding ring, with dashed circles of radii 0.5 and 1.0 marking the region boundaries; the title shows the error probability.]

II - Data Valuation

Flag Data

In [5]:
dfq.kxy.data_valuation('label', problem_type='classification')
[====================================================================================================] 100% ETA: 0s    Duration: 0s
Out[5]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable Accuracy
0                  0.49                             -3.15e-01                 0.91
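The estimated ceiling of 0.91 is within a percentage point of the theoretical value \(1 - p = 0.90\) derived above.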

Circular Data

In [6]:
dfc.kxy.data_valuation('label', problem_type='classification')
[====================================================================================================] 100% ETA: 0s    Duration: 0s
Out[6]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable Accuracy
0                  0.42                             -3.16e-01                 0.90
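Here the estimate of 0.90 matches the theoretical ceiling \(1 - p = 0.90\) exactly.

As a final sanity check (a hypothetical addition, not part of the original notebook, assuming scikit-learn is installed), one can verify that an off-the-shelf classifier's out-of-sample accuracy lands near, and not meaningfully above, the ceilings estimated above:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

for name, df in [('Flag', dfq), ('Circular', dfc)]:
    X, y = df[['var1', 'var2']].values, df['label'].values
    # 5-fold cross-validated accuracy of a generic classifier
    scores = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                             X, y, cv=5, scoring='accuracy')
    print('%s data: cross-validated accuracy of %.2f' % (name, scores.mean()))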