How To Anonymize Your Regression Data

We show how to anonymize your data and illustrate, using the UCI Yacht Hydrodynamics dataset, that anonymization has no impact on performance: model-free variable selection returns the same results on the clear data and on the anonymized data.

In [1]:
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')

# Required imports
import pandas as pd
import kxy

In [2]:
# The file is whitespace-delimited with one or two spaces between fields.
# A regex separator requires pandas' Python parsing engine.
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/'
                 '00243/yacht_hydrodynamics.data', sep='[ ]{1,2}', engine='python',
                 names=['Longitudinal Position', 'Prismatic Coeefficient',
                        'Length-Displacement', 'Beam-Draught Ratio',
                        'Length-Beam Ratio', 'Froude Number',
                        'Residuary Resistance'])
df.rename(columns={col: col.title() for col in df.columns}, inplace=True)

The Data

In [3]:
df.kxy.describe()

--------------------------
Column: Beam-Draught Ratio
--------------------------
Type:   Continuous
Max:    5.3
p75:    4.2
Mean:   3.9
Median: 4.0
p25:    3.8
Min:    2.8

---------------------
Column: Froude Number
---------------------
Type:   Continuous
Max:    0.5
p75:    0.4
Mean:   0.3
Median: 0.3
p25:    0.2
Min:    0.1

-------------------------
Column: Length-Beam Ratio
-------------------------
Type:   Continuous
Max:    3.6
p75:    3.5
Mean:   3.2
Median: 3.1
p25:    3.1
Min:    2.7

---------------------------
Column: Length-Displacement
---------------------------
Type:   Continuous
Max:    5.1
p75:    5.1
Mean:   4.8
Median: 4.8
p25:    4.8
Min:    4.3

-----------------------------
Column: Longitudinal Position
-----------------------------
Type:   Continuous
Max:    0.0
p75:    -2.3
Mean:   -2.4
Median: -2.3
p25:    -2.4
Min:    -5.0

------------------------------
Column: Prismatic Coeefficient
------------------------------
Type:   Continuous
Max:    0.6
p75:    0.6
Mean:   0.6
Median: 0.6
p25:    0.5
Min:    0.5

----------------------------
Column: Residuary Resistance
----------------------------
Type:   Continuous
Max:    62
p75:    12
Mean:   10
Median: 3.1
p25:    0.8
Min:    0.0

The Anonymized Data

In [4]:
anon_df = df.kxy.anonymize(columns_to_exclude=['Residuary Resistance'])
anon_df.kxy.describe()

--------------------------
Column: Beam-Draught Ratio
--------------------------
Type:   Continuous
Max:    308
p75:    231
Mean:   154
Median: 154
p25:    77
Min:    1.0

---------------------
Column: Froude Number
---------------------
Type:   Continuous
Max:    308
p75:    231
Mean:   154
Median: 154
p25:    77
Min:    1.0

-------------------------
Column: Length-Beam Ratio
-------------------------
Type:   Continuous
Max:    308
p75:    231
Mean:   154
Median: 154
p25:    77
Min:    1.0

---------------------------
Column: Length-Displacement
---------------------------
Type:   Continuous
Max:    308
p75:    231
Mean:   154
Median: 154
p25:    77
Min:    1.0

-----------------------------
Column: Longitudinal Position
-----------------------------
Type:   Continuous
Max:    308
p75:    231
Mean:   154
Median: 154
p25:    77
Min:    1.0

------------------------------
Column: Prismatic Coeefficient
------------------------------
Type:   Continuous
Max:    308
p75:    231
Mean:   154
Median: 154
p25:    77
Min:    1.0

----------------------------
Column: Residuary Resistance
----------------------------
Type:   Continuous
Max:    62
p75:    12
Mean:   10
Median: 3.1
p25:    0.8
Min:    0.0
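
Every anonymized input column above runs from 1 to 308, the number of rows, which suggests that values were replaced by something like their ranks. The sketch below is only an illustration under that assumption. It is not the kxy implementation, and `rank_transform`, `spearman` and the synthetic `x` and `y` are hypothetical names and data. It shows that a rank transform hides the original values while leaving rank-based dependence measures unchanged. Real columns contain ties, which this sketch breaks by position; kxy may handle them differently.

```python
import numpy as np

def rank_transform(x):
    """Replace each value by its 1-based rank (ties broken by position)."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

def spearman(a, b):
    """Spearman correlation, i.e. the Pearson correlation of the ranks."""
    return np.corrcoef(rank_transform(a), rank_transform(b))[0, 1]

rng = np.random.default_rng(0)
x = rng.normal(size=308)                     # hypothetical input column
y = np.exp(x) + 0.1 * rng.normal(size=308)   # hypothetical label

x_anon = rank_transform(x)                   # original values are no longer visible

rho_clear = spearman(x, y)
rho_anon = spearman(x_anon, y)
print(x_anon.min(), x_anon.max())            # 1.0 308.0
print(np.isclose(rho_clear, rho_anon))       # True
```

More generally, mutual information is invariant under invertible transformations of each variable, which is why a monotone recoding of the inputs would not be expected to change a mutual-information-based analysis.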

Comparing Variable Selection With and Without Anonymization

Model-Free Variable Selection With Clear Data

In [5]:
df.kxy.variable_selection('Residuary Resistance', problem_type='regression')
[====================================================================================================] 100% ETA: 0s
Out[5]:
Selection Order  Variable                Running Achievable R-Squared  Running Achievable RMSE
1                Froude Number           0.99                          1.66
2                Beam-Draught Ratio      0.99                          1.62
3                Length-Displacement     0.99                          1.60
4                Longitudinal Position   0.99                          1.59
5                Length-Beam Ratio       0.99                          1.58
6                Prismatic Coeefficient  0.99                          1.58

Column Meaning

  • Selection Order: The order in which variables (listed in the Variable column) are selected, from most to least important. The first variable selected is the one with the highest mutual information with the label, i.e. the one that is most useful when used in isolation. The \((i+1)\)-th variable selected is the one, among all variables not yet selected, that best complements the \(i\) variables previously selected.

  • Running Achievable R-Squared: The highest \(R^2\) achievable by a model using all variables selected so far to predict the label.

  • Running Achievable RMSE: The lowest Root Mean Square Error achievable by a model using all variables selected so far to predict the label.
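
The greedy ordering described above can be sketched as follows. This is a minimal illustration, not how kxy computes its results: it swaps in the in-sample \(R^2\) of an ordinary least-squares fit as a stand-in score, whereas kxy's achievable \(R^2\) is model-free. The names `linear_r2` and `greedy_selection` and the synthetic data are hypothetical.

```python
import numpy as np

def linear_r2(X, y):
    """In-sample R-squared of a least-squares fit with intercept (stand-in score)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

def greedy_selection(X, y, names):
    """Return (variable, running R2, running RMSE) tuples in selection order."""
    selected, remaining, rows = [], list(range(X.shape[1])), []
    while remaining:
        # Pick the variable that best complements those already selected.
        best = max(remaining, key=lambda j: linear_r2(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
        r2 = linear_r2(X[:, selected], y)
        rmse = np.sqrt((1.0 - r2) * y.var())  # RMSE implied by R2 and Var(y)
        rows.append((names[best], r2, rmse))
    return rows

rng = np.random.default_rng(0)
X = rng.normal(size=(308, 3))
# Hypothetical label: column "a" matters most, "b" less, "c" is pure noise.
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.5 * rng.normal(size=308)

rows = greedy_selection(X, y, ["a", "b", "c"])
for order, (name, r2, rmse) in enumerate(rows, start=1):
    print(order, name, round(r2, 2), round(rmse, 2))
```

As in the tables in this section, the running \(R^2\) can only increase and the running RMSE can only decrease as variables are added, and the gains shrink once the most informative variables have been selected.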

Model-Free Variable Selection With Anonymized Data

In [6]:
anon_df.kxy.variable_selection('Residuary Resistance', problem_type='regression')
[====================================================================================================] 100% ETA: 0s
Out[6]:
Selection Order  Variable                Running Achievable R-Squared  Running Achievable RMSE
1                Froude Number           0.99                          1.66
2                Beam-Draught Ratio      0.99                          1.62
3                Length-Displacement     0.99                          1.60
4                Longitudinal Position   0.99                          1.59
5                Length-Beam Ratio       0.99                          1.58
6                Prismatic Coeefficient  0.99                          1.58