Complex Regression Example

We showcase our data valuation and variable selection analyses on the Kaggle house prices advanced regression dataset (https://www.kaggle.com/c/house-prices-advanced-regression-techniques), which has 79 explanatory variables, a mix of categorical and continuous.

The aim here is to show that our analyses work beyond a few explanatory variables and adequately cope with categorical data.

In [1]:
%load_ext autoreload
%autoreload 2
import logging
logging.basicConfig(level=logging.WARNING)
import warnings
warnings.filterwarnings('ignore')

# Required imports
import pandas as pd
import kxy
In [2]:
# Load the data using the kxy_datasets package (pip install kxy_datasets)
from kxy_datasets.kaggle_regressions import HousePricesAdvanced
df = HousePricesAdvanced().df

Data Valuation

In [3]:
# Calculating achievable performance
df.kxy.data_valuation('SalePrice', problem_type='regression')
Out[3]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable RMSE
0                  1.00                                 -8.67            1,747
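
An achievable R-Squared of 1.00 next to a non-zero achievable RMSE can look contradictory. The sketch below shows how the two fit together, assuming the usual identity R-Squared = 1 - (RMSE / baseline RMSE)^2, with the baseline taken from the 'No Variable' row of the variable selection output further down (7.94e+04). This identity is an assumption about how the two columns relate, not something confirmed by the library; the helper names are hypothetical.

```python
import math

# Values copied by hand from the outputs in this notebook.
baseline_rmse = 7.94e4    # 'No Variable' row of the variable selection output
achievable_rmse = 1747.0  # 'Achievable RMSE' from the data valuation output


def r_squared_from_rmse(rmse, baseline):
    """Assumed identity: R^2 = 1 - (RMSE / baseline RMSE)^2."""
    return 1.0 - (rmse / baseline) ** 2


def rmse_from_r_squared(r2, baseline):
    """Inverse of the assumed identity: RMSE = baseline * sqrt(1 - R^2)."""
    return baseline * math.sqrt(1.0 - r2)


r2 = r_squared_from_rmse(achievable_rmse, baseline_rmse)
print(f"R-Squared to 4 decimals: {r2:.4f}")
print(f"R-Squared to 2 decimals: {r2:.2f}")
```

Under this assumption the achievable R-Squared is slightly below 1 and only displays as 1.00 because of rounding to two decimals.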

Variable Selection Analysis

In [4]:
# Model Free Variable Selection Analysis (i.e. where is the juice in your data)
df.kxy.variable_selection('SalePrice', problem_type='regression')
Out[4]:
Selection Order  Variable      Running Achievable R-Squared  Running Achievable RMSE
0                No Variable   0.00                          7.94e+04
1                OverallQual   0.65                          4.70e+04
2                GrLivArea     0.78                          3.70e+04
3                YearBuilt     0.84                          3.17e+04
4                TotalBsmtSF   0.85                          3.12e+04
...              ...           ...                           ...
75               BsmtFullBath  1.00                          1.75e+03
76               Street        1.00                          1.75e+03
77               Fence         1.00                          1.75e+03
78               TotRmsAbvGrd  1.00                          1.75e+03
79               3SsnPorch     1.00                          1.75e+03

80 rows × 3 columns
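
One way to read this table is through the marginal gain each variable adds to the running achievable R-Squared. The sketch below is a minimal illustration, not part of the kxy API: it copies the first five displayed rows by hand, and the helper names and the 0.05 cutoff are hypothetical choices made for the example.

```python
# First rows of the variable selection output above, copied by hand
# as (variable, running achievable R-Squared).
rows = [
    ("No Variable", 0.00),
    ("OverallQual", 0.65),
    ("GrLivArea", 0.78),
    ("YearBuilt", 0.84),
    ("TotalBsmtSF", 0.85),
]


def marginal_gains(rows):
    """Return (variable, gain in running R-Squared over the previous row)."""
    gains = []
    for (_, prev_r2), (name, r2) in zip(rows, rows[1:]):
        # Round to 2 decimals, the precision of the displayed table.
        gains.append((name, round(r2 - prev_r2, 2)))
    return gains


def select_until(rows, min_gain):
    """Keep variables in selection order until a gain drops below min_gain."""
    selected = []
    for name, gain in marginal_gains(rows):
        if gain < min_gain:
            break
        selected.append(name)
    return selected


gains = marginal_gains(rows)
selected = select_until(rows, min_gain=0.05)  # hypothetical cutoff

for name, gain in gains:
    print(f"{name:12s} +{gain:.2f}")
print("Selected:", selected)
```

With this cutoff, the first three variables are kept and TotalBsmtSF, which adds only 0.01, is where the selection stops. Because the inputs are rounded to two decimals, gains this small should be read as approximate.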