Complex Regression Example

We showcase our data valuation and variable selection analyses on the Kaggle house prices advanced regression dataset (https://www.kaggle.com/c/house-prices-advanced-regression-techniques), which has 79 explanatory variables, a mix of categorical and continuous. The aim is to show that these analyses scale beyond a handful of explanatory variables and cope adequately with categorical data.

In [1]:
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')

# Required imports
import pandas as pd
import kxy
In [2]:
# The training data can be downloaded here:
# https://www.kaggle.com/c/house-prices-advanced-regression-techniques
df = pd.read_csv('kaggle_house_prices_advanced_regression.csv')
df.set_index('Id', inplace=True)
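
Before running the analyses, it can help to check how the explanatory variables split between numeric and categorical columns. The snippet below is a minimal sketch using only standard pandas calls; the toy frame and its values are hypothetical stand-ins (only the column names come from the dataset), and the same two `select_dtypes` calls could be applied to `df` after dropping `SalePrice`.

```python
import pandas as pd

# Hypothetical toy rows; only the column names are taken from the dataset.
toy = pd.DataFrame({
    'OverallQual': [7, 6, 8],
    'GrLivArea': [1710, 1262, 1786],
    'Heating': ['GasA', 'GasA', 'GasW'],
    'PoolQC': [None, 'Gd', None],
})

# Columns with a numeric dtype versus everything else (treated as categorical).
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
categorical_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(len(numeric_cols), 'numeric:', numeric_cols)
print(len(categorical_cols), 'categorical:', categorical_cols)
```

Note that a dtype-based split is only a first approximation: a numerically coded category would be counted as numeric here.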

Data Valuation

In [3]:
# Calculating achievable performance
df.kxy.data_valuation('SalePrice', problem_type='regression')
[====================================================================================================] 100% ETA: 0s
Out[3]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable RMSE
0                  0.98                                   -10            9,917
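
To read the R-squared and RMSE columns together, recall the standard identity R² = 1 - MSE / Var(y), which gives RMSE = std(y) * sqrt(1 - R²). The sketch below applies that identity; it assumes the achievable figures above are linked in the same way, which the output does not state, and `rmse_from_r2` is a hypothetical helper, not part of `kxy`.

```python
import math

def rmse_from_r2(r2: float, target_std: float) -> float:
    """RMSE implied by an R-squared via R^2 = 1 - MSE / Var(y).

    Hypothetical helper for illustration; not a kxy function.
    """
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie in [0, 1]")
    return target_std * math.sqrt(1.0 - r2)

# With a unit standard deviation, the result is the RMSE expressed
# as a fraction of the target's standard deviation.
relative_rmse = rmse_from_r2(0.98, 1.0)
print(f"{relative_rmse:.3f}")  # 0.141
```

Under this assumption, an R-squared of 0.98 corresponds to an RMSE of roughly 14% of the standard deviation of `SalePrice`. Because the displayed values are rounded, plugging in 0.98 will not reproduce the reported RMSE exactly.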

Variable Selection Analysis

In [4]:
# Model-free variable selection analysis (i.e. where is the juice in your data)
df.kxy.variable_selection('SalePrice', problem_type='regression')
[====================================================================================================] 100% ETA: 0s
Out[4]:
Selection Order  Variable      Running Achievable R-Squared  Running Achievable RMSE
1                OverallQual                           0.65                   46,800
2                GrLivArea                             0.78                   37,155
3                YearBuilt                             0.85                   31,235
4                TotalBsmtSF                           0.88                   27,806
5                OverallCond                           0.91                   23,889
...              ...                                    ...                      ...
75               Heating                               0.98                    9,917
76               BsmtExposure                          0.98                    9,917
77               PoolQC                                0.98                    9,917
78               MiscVal                               0.98                    9,917
79               BsmtQual                              0.98                    9,917

79 rows × 3 columns
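
One way to read the running column is to difference it, which shows how much achievable R-squared each newly selected variable adds on top of those selected before it. The sketch below does this for the first five rows of the table above; `marginal_gains` is a hypothetical helper written for illustration, and the inputs are the rounded values as displayed.

```python
# (selection order, variable, running achievable R-squared),
# copied from the first five rows of the table above.
running = [
    (1, "OverallQual", 0.65),
    (2, "GrLivArea", 0.78),
    (3, "YearBuilt", 0.85),
    (4, "TotalBsmtSF", 0.88),
    (5, "OverallCond", 0.91),
]

def marginal_gains(rows):
    """Return (variable, increase in running R-squared) for each row."""
    gains = []
    previous = 0.0  # no variables selected yet
    for _, name, r2 in rows:
        gains.append((name, round(r2 - previous, 2)))
        previous = r2
    return gains

gains = marginal_gains(running)
for name, gain in gains:
    print(f"{name:<12} +{gain:.2f}")
```

On these rounded figures, the first variable accounts for 0.65 and the next four add 0.13, 0.07, 0.03 and 0.03, so the gains shrink quickly; the last rows of the table show no further change at two decimal places.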