How To Anonymize Your Data
We show how to anonymize your data and illustrate, using the UCI Bank Note Authentication dataset, that anonymization does not change the results of model-free variable selection.
In [1]:
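%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import kxy
# UCI Bank Note Authentication dataset: four features and a binary label ('Is Fake').
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00267/' \
    'data_banknote_authentication.txt'
# The file has no header row, so the column names are supplied explicitly.
df = pd.read_csv(url, names=['Variance', 'Skewness', 'Kurtosis', 'Entropy', 'Is Fake'])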
The Data
In [2]:
df.kxy.describe()
---------------
Column: Entropy
---------------
Type: Continuous
Max: 2.4
p75: 0.4
Mean: -1.2
Median: -0.6
p25: -2.4
Min: -8.5
---------------
Column: Is Fake
---------------
Type: Continuous
Max: 1.0
p75: 1.0
Mean: 0.4
Median: 0.0
p25: 0.0
Min: 0.0
----------------
Column: Kurtosis
----------------
Type: Continuous
Max: 17
p75: 3.2
Mean: 1.4
Median: 0.6
p25: -1.6
Min: -5.3
----------------
Column: Skewness
----------------
Type: Continuous
Max: 12
p75: 6.8
Mean: 1.9
Median: 2.3
p25: -1.7
Min: -13.8
----------------
Column: Variance
----------------
Type: Continuous
Max: 6.8
p75: 2.8
Mean: 0.4
Median: 0.5
p25: -1.8
Min: -7.0
The Anonymized Data
In [3]:
anon_df = df.kxy.anonymize(columns_to_exclude=['Is Fake'])
anon_df.kxy.describe()
---------------
Column: Entropy
---------------
Type: Continuous
Max: 1,372
p75: 1,029
Mean: 686
Median: 686
p25: 343
Min: 1.0
---------------
Column: Is Fake
---------------
Type: Continuous
Max: 1.0
p75: 1.0
Mean: 0.4
Median: 0.0
p25: 0.0
Min: 0.0
----------------
Column: Kurtosis
----------------
Type: Continuous
Max: 1,372
p75: 1,029
Mean: 686
Median: 686
p25: 343
Min: 1.0
----------------
Column: Skewness
----------------
Type: Continuous
Max: 1,372
p75: 1,029
Mean: 686
Median: 686
p25: 343
Min: 1.0
----------------
Column: Variance
----------------
Type: Continuous
Max: 1,372
p75: 1,029
Mean: 686
Median: 686
p25: 343
Min: 1.0
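The summary above is consistent with each non-excluded column having been replaced by the ranks of its values: every such column now runs from 1 to 1,372 with a mean and median of about 686. That reading is inferred from the output only; it is not a statement of how `anonymize` is implemented. A minimal sketch of such a rank transform, using a hypothetical helper `rank_anonymize` and made-up toy values:

```python
import pandas as pd

def rank_anonymize(frame, columns_to_exclude=()):
    """Hypothetical helper: replace each non-excluded column by its 1-based ranks."""
    out = frame.copy()
    for col in out.columns:
        if col not in columns_to_exclude:
            # Tied values share their average rank; the ordering of values is kept.
            out[col] = out[col].rank(method='average')
    return out

# Toy values for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    'Variance': [3.6, -1.2, 0.4, 4.5, -2.8],
    'Entropy': [-0.4, 0.1, -2.6, -1.0, 0.6],
    'Is Fake': [0, 1, 1, 0, 1],
})
anon_toy = rank_anonymize(toy, columns_to_exclude=['Is Fake'])
print(anon_toy)
```

The original magnitudes are gone after the transform, but the relative order of the rows within each column is unchanged, and the excluded label column is left as-is.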
Comparing Variable Selection With and Without Anonymization
Model-Free Variable Selection With Clear Data
In [4]:
df.kxy.data_valuation('Is Fake', problem_type='classification')
[====================================================================================================] 100% ETA: 0s
Out[4]:
| | Achievable R-Squared | Achievable Log-Likelihood Per Sample | Achievable Accuracy |
|---|---|---|---|
| 0 | 0.75 | 7.29e-04 | 1.00 |
In [5]:
df.kxy.variable_selection('Is Fake', problem_type='classification')
[====================================================================================================] 100% ETA: 0s
Out[5]:
| Selection Order | Variable | Running Achievable R-Squared | Running Achievable Accuracy |
|---|---|---|---|
| 1 | Variance | 0.39 | 0.84 |
| 2 | Skewness | 0.44 | 0.86 |
| 3 | Kurtosis | 0.65 | 0.96 |
| 4 | Entropy | 0.65 | 0.96 |
Model-Free Variable Selection With Anonymized Data
In [6]:
anon_df.kxy.variable_selection('Is Fake', problem_type='classification')
[====================================================================================================] 100% ETA: 0s
Out[6]:
| Selection Order | Variable | Running Achievable R-Squared | Running Achievable Accuracy |
|---|---|---|---|
| 1 | Variance | 0.39 | 0.84 |
| 2 | Skewness | 0.44 | 0.86 |
| 3 | Kurtosis | 0.65 | 0.96 |
| 4 | Entropy | 0.65 | 0.96 |
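The selection order and running scores are identical with clear and anonymized data. One plausible explanation is that the analysis depends only on the ordering of each input's values, which an order-preserving anonymization leaves intact. The sketch below illustrates that kind of invariance with Spearman's rank correlation on synthetic data. It does not use `kxy`, and Spearman's correlation is only a stand-in for illustration, not the quantity `kxy` computes.

```python
import numpy as np
import pandas as pd

def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return x.rank().corr(y.rank())

# Synthetic feature and binary label (illustrative only).
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(float)

x_anon = x.rank()    # stand-in for an anonymized column (assumption)
x_other = np.exp(x)  # any other strictly increasing transform of x

# All three scores coincide because the ranks of x are unchanged.
print(spearman(x, y), spearman(x_anon, y), spearman(x_other, y))
```

Any measure that sees the inputs only through their ranks behaves the same way, so hiding the raw values in this manner costs it nothing.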