How To Anonymize Your Data

We show how to anonymize your data and illustrate, using the UCI banknote authentication dataset, that anonymization leaves the results of variable selection unchanged.

In [1]:
%load_ext autoreload
%autoreload 2

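import pandas as pd
import numpy as np
import kxy

# UCI banknote authentication dataset: four features and a binary label.
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00267/' \
    'data_banknote_authentication.txt'
# The file has no header row, so column names are supplied explicitly.
df = pd.read_csv(url, names=['Variance', 'Skewness', 'Kurtosis', 'Entropy', 'Is Fake'])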

The Data

In [2]:
df.kxy.describe()

---------------
Column: Entropy
---------------
Type:   Continuous
Max:    2.4
p75:    0.4
Mean:   -1.2
Median: -0.6
p25:    -2.4
Min:    -8.5

---------------
Column: Is Fake
---------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.4
Median: 0.0
p25:    0.0
Min:    0.0

----------------
Column: Kurtosis
----------------
Type:   Continuous
Max:    17
p75:    3.2
Mean:   1.4
Median: 0.6
p25:    -1.6
Min:    -5.3

----------------
Column: Skewness
----------------
Type:   Continuous
Max:    12
p75:    6.8
Mean:   1.9
Median: 2.3
p25:    -1.7
Min:    -13.8

----------------
Column: Variance
----------------
Type:   Continuous
Max:    6.8
p75:    2.8
Mean:   0.4
Median: 0.5
p25:    -1.8
Min:    -7.0

The Anonymized Data

In [3]:
anon_df = df.kxy.anonymize(columns_to_exclude=['Is Fake'])
anon_df.kxy.describe()

---------------
Column: Entropy
---------------
Type:   Continuous
Max:    1,372
p75:    1,029
Mean:   686
Median: 686
p25:    343
Min:    1.0

---------------
Column: Is Fake
---------------
Type:   Continuous
Max:    1.0
p75:    1.0
Mean:   0.4
Median: 0.0
p25:    0.0
Min:    0.0

----------------
Column: Kurtosis
----------------
Type:   Continuous
Max:    1,372
p75:    1,029
Mean:   686
Median: 686
p25:    343
Min:    1.0

----------------
Column: Skewness
----------------
Type:   Continuous
Max:    1,372
p75:    1,029
Mean:   686
Median: 686
p25:    343
Min:    1.0

----------------
Column: Variance
----------------
Type:   Continuous
Max:    1,372
p75:    1,029
Mean:   686
Median: 686
p25:    343
Min:    1.0
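
In the summary above, every anonymized column runs from 1 to 1,372 (the number of rows) with a median of 686, which is what one would see if each value had been replaced by its rank within its column. The sketch below illustrates that idea on toy data. It is an assumption drawn from the output, not a description of what `df.kxy.anonymize` does internally, and `rank_anonymize` is a hypothetical helper, not part of the kxy package.

```python
import numpy as np
import pandas as pd


def rank_anonymize(frame, columns_to_exclude=()):
    """Hypothetical helper: replace each non-excluded column by its 1-based rank."""
    out = frame.copy()
    for col in out.columns:
        if col in columns_to_exclude:
            continue
        # Tied values share the average of the ranks they span.
        out[col] = out[col].rank(method='average')
    return out


# Toy data standing in for the real dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'x': rng.normal(size=8), 'label': [0, 1, 0, 1, 0, 1, 0, 1]})
toy_anon = rank_anonymize(toy, columns_to_exclude=['label'])
print(toy_anon)
```

The original magnitudes are no longer visible in the transformed column, only the ordering of the rows, while the excluded label column is passed through untouched.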

Comparing Variable Selection With and Without Anonymization

Model-Free Variable Selection With Clear Data

In [4]:
df.kxy.data_valuation('Is Fake', problem_type='classification')
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable Accuracy
0                  0.75                              7.29e-04                 1.00
In [5]:
df.kxy.variable_selection('Is Fake', problem_type='classification')
Out[5]:
Selection Order  Variable  Running Achievable R-Squared  Running Achievable Accuracy
1                Variance                          0.39                         0.84
2                Skewness                          0.44                         0.86
3                Kurtosis                          0.65                         0.96
4                 Entropy                          0.65                         0.96

Model-Free Variable Selection With Anonymized Data

In [6]:
anon_df.kxy.variable_selection('Is Fake', problem_type='classification')
Out[6]:
Selection Order  Variable  Running Achievable R-Squared  Running Achievable Accuracy
1                Variance                          0.39                         0.84
2                Skewness                          0.44                         0.86
3                Kurtosis                          0.65                         0.96
4                 Entropy                          0.65                         0.96
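
The selection order and running metrics in Out[6] match those in Out[5]. One way to build intuition for this is that any statistic depending only on the ordering of the values cannot change when values are replaced by their ranks. The sketch below checks this for Spearman's rank correlation on synthetic data. Whether kxy relies on rank-based quantities internally is an assumption here, not something this page states, and the data and variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic feature and binary label (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(float)

clear = pd.DataFrame({'x': x, 'y': y})
# Stand-in for anonymization: replace the feature by its rank.
ranked = clear.assign(x=clear['x'].rank())

# Spearman's correlation is the Pearson correlation of the ranks.
rho_clear = clear['x'].rank().corr(clear['y'].rank())
rho_ranked = ranked['x'].rank().corr(ranked['y'].rank())

print(rho_clear, rho_ranked)
```

Ranking a column that already holds ranks returns the same column, so both correlations are computed from the same numbers and agree.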