# Complex Regression Example

We showcase our data valuation and variable selection analyses on the Kaggle house prices advanced regression dataset (https://www.kaggle.com/c/house-prices-advanced-regression-techniques), which has 79 explanatory variables, a mix of categorical and continuous. The aim is to show that the analyses scale beyond a handful of explanatory variables and cope adequately with categorical data.

```
In [1]:
```

```
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings('ignore')
# Required imports
import pandas as pd
import kxy
```

```
In [2]:
```

```
# The training data can be downloaded here:
# https://www.kaggle.com/c/house-prices-advanced-regression-techniques
df = pd.read_csv('kaggle_house_prices_advanced_regression.csv')
df.set_index('Id', inplace=True)
```
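To make the mix of categorical and continuous variables concrete, the sketch below splits a frame's columns by dtype with pandas. It is only an illustration: the column names are taken from the dataset, but the three rows of values are made up, and `split_by_kind` is a hypothetical helper, not part of `kxy`. Dtype is also only a rough proxy, since an ordinal rating such as `OverallQual` is stored as an integer.

```python
import pandas as pd

# Tiny illustrative frame: real column names, made-up values (assumption).
toy = pd.DataFrame({
    "OverallQual": [7, 6, 8],
    "GrLivArea": [1710, 1262, 1786],
    "Heating": ["GasA", "GasA", "GasW"],
})


def split_by_kind(frame):
    """Return (numeric column names, non-numeric column names)."""
    numeric = frame.select_dtypes(include="number").columns.tolist()
    categorical = frame.select_dtypes(exclude="number").columns.tolist()
    return numeric, categorical


numeric_cols, categorical_cols = split_by_kind(toy)
print(len(numeric_cols), "numeric;", len(categorical_cols), "categorical")
```

Applied to the `df` loaded above, the same helper would give a quick count of each kind of explanatory variable.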

## Data Valuation

```
In [3]:
```

```
# Calculating achievable performance
df.kxy.data_valuation('SalePrice', problem_type='regression')
```

```
[====================================================================================================] 100% ETA: 0s
```

```
Out[3]:
```

| | Achievable R-Squared | Achievable Log-Likelihood Per Sample | Achievable RMSE |
|---|---|---|---|
| 0 | 0.98 | -10 | 9,917 |
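The R-Squared and RMSE columns are two views of the same quantity if R-Squared is taken in its usual sense, 1 - MSE / Var(y). Under that assumption, RMSE = std(y) * sqrt(1 - R-Squared). The sketch below encodes this identity; `rmse_from_r2` is a hypothetical helper, and whether `kxy` derives its figures exactly this way is not stated here.

```python
import math


def rmse_from_r2(r2, target_std):
    """RMSE implied by an R-squared, assuming R^2 = 1 - MSE / Var(y)."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("r2 must lie in [0, 1]")
    return target_std * math.sqrt(1.0 - r2)


# Illustrative numbers only: a target with standard deviation 100
# and an R-squared of 0.75 leaves a quarter of the variance unexplained.
example = rmse_from_r2(0.75, 100.0)
print(example)
```

To apply it to this dataset, the standard deviation of the target (`df['SalePrice'].std()`) would be passed as `target_std`.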

## Variable Selection Analysis

```
In [4]:
```

```
# Model-free variable selection analysis (i.e. where is the juice in your data?)
df.kxy.variable_selection('SalePrice', problem_type='regression')
```

```
[====================================================================================================] 100% ETA: 0s
```

```
Out[4]:
```

| Selection Order | Variable | Running Achievable R-Squared | Running Achievable RMSE |
|---|---|---|---|
| 1 | OverallQual | 0.65 | 46,800 |
| 2 | GrLivArea | 0.78 | 37,155 |
| 3 | YearBuilt | 0.85 | 31,235 |
| 4 | TotalBsmtSF | 0.88 | 27,806 |
| 5 | OverallCond | 0.91 | 23,889 |
| ... | ... | ... | ... |
| 75 | Heating | 0.98 | 9,917 |
| 76 | BsmtExposure | 0.98 | 9,917 |
| 77 | PoolQC | 0.98 | 9,917 |
| 78 | MiscVal | 0.98 | 9,917 |
| 79 | BsmtQual | 0.98 | 9,917 |

79 rows × 3 columns
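One way to read the running figures is to difference them, which shows how much each newly selected variable adds on top of those chosen before it. The sketch below does this for the first five rows of the table, typed in by hand. The `Marginal R-Squared Gain` column is our own illustrative addition, not something `kxy` returns.

```python
import pandas as pd

# First five rows of the variable selection output, copied from the table above.
running = pd.DataFrame(
    {
        "Variable": ["OverallQual", "GrLivArea", "YearBuilt", "TotalBsmtSF", "OverallCond"],
        "Running Achievable R-Squared": [0.65, 0.78, 0.85, 0.88, 0.91],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="Selection Order"),
)

r2 = running["Running Achievable R-Squared"]
# The first variable's gain is its own R-squared; later gains are successive differences.
running["Marginal R-Squared Gain"] = r2.diff().fillna(r2.iloc[0]).round(2)
print(running)
```

On these rows the gains shrink quickly, consistent with the running R-Squared flattening at 0.98 by the end of the table.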