Real Estate (UCI, Regression, n=414, d=6)

Loading The Data

In [1]:

from kxy_datasets.uci_regressions import RealEstate # pip install kxy_datasets

In [2]:

dataset = RealEstate()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'

In [3]:

df.kxy.describe() # Visualize a summary of the data


---------------------------
Column: X1 transaction date
---------------------------
Type:   Continuous
Max:    2,013
p75:    2,013
Mean:   2,013
Median: 2,013
p25:    2,012
Min:    2,012

--------------------
Column: X2 house age
--------------------
Type:   Continuous
Max:    43
p75:    28
Mean:   17
Median: 16
p25:    9.0
Min:    0.0

----------------------------------------------
Column: X3 distance to the nearest MRT station
----------------------------------------------
Type:   Continuous
Max:    6,488
p75:    1,454
Mean:   1,083
Median: 492
p25:    289
Min:    23

---------------------------------------
Column: X4 number of convenience stores
---------------------------------------
Type:   Continuous
Max:    10
p75:    6.0
Mean:   4.1
Median: 4.0
p25:    1.0
Min:    0.0

-------------------
Column: X5 latitude
-------------------
Type:   Continuous
Max:    25
p75:    24
Mean:   24
Median: 24
p25:    24
Min:    24

--------------------
Column: X6 longitude
--------------------
Type:   Continuous
Max:    121
p75:    121
Mean:   121
Median: 121
p25:    121
Min:    121

----------------------------------
Column: Y house price of unit area
----------------------------------
Type:   Continuous
Max:    117
p75:    46
Mean:   37
Median: 38
p25:    27
Min:    7.6

Data Valuation

In [4]:

df.kxy.data_valuation(y_column, problem_type=problem_type)

Out[4]:

   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable RMSE
0  0.80                  -3.23                                 6.05
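
The achievable R-squared and achievable RMSE above can be cross-checked against each other. The sketch below assumes the textbook identity R² = 1 − MSE / Var(y), so that RMSE = baseline RMSE × sqrt(1 − R²); whether the library derives its figures exactly this way is an assumption, not something this notebook states. The baseline of 13.6 is the "No Variable" RMSE reported in the variable selection table, and `rmse_from_r2` is a hypothetical helper name.

```python
import math


def rmse_from_r2(r_squared: float, baseline_rmse: float) -> float:
    """RMSE implied by an R-squared, given the RMSE of a no-variable baseline.

    Assumes R^2 = 1 - MSE / Var(y), i.e. RMSE = baseline_rmse * sqrt(1 - R^2).
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("r_squared must lie in [0, 1]")
    return baseline_rmse * math.sqrt(1.0 - r_squared)


baseline_rmse = 13.6  # "No Variable" running achievable RMSE (from the notebook output)
achievable_r2 = 0.80  # achievable R-squared, as displayed (rounded to two decimals)

implied_rmse = rmse_from_r2(achievable_r2, baseline_rmse)
print(f"Implied RMSE: {implied_rmse:.2f}")  # prints "Implied RMSE: 6.08"
```

The implied value is close to the reported 6.05. The small gap is what one would expect from feeding in inputs that were already rounded for display.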

Automatic (Model-Free) Variable Selection

In [5]:

df.kxy.variable_selection(y_column, problem_type=problem_type)

Out[5]:

Selection Order  Variable                                Running Achievable R-Squared  Running Achievable RMSE
0                No Variable                             0.00                          13.6
1                X3 distance to the nearest MRT station  0.70                          7.49
2                X5 latitude                             0.70                          7.49
3                X2 house age                            0.70                          7.49
4                X1 transaction date                     0.80                          6.05
5                X4 number of convenience stores         0.80                          6.05
6                X6 longitude                            0.80                          6.05
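
One way to read the table above is to look at how much each variable adds to the running achievable R-squared when it is selected. The sketch below copies the displayed values by hand and takes successive differences; `marginal_gains` is a hypothetical helper, not part of the library. Because the inputs are rounded to two decimals, a gain of 0.00 only means "no change at the displayed precision".

```python
# Running achievable R-squared, copied from the variable selection output
# (values are rounded to two decimals in the notebook).
running_r2 = [
    ("No Variable", 0.00),
    ("X3 distance to the nearest MRT station", 0.70),
    ("X5 latitude", 0.70),
    ("X2 house age", 0.70),
    ("X1 transaction date", 0.80),
    ("X4 number of convenience stores", 0.80),
    ("X6 longitude", 0.80),
]


def marginal_gains(rows):
    """Return (variable, gain) pairs: the increase in running R-squared per step."""
    gains = []
    for (_, previous), (name, current) in zip(rows, rows[1:]):
        # Round to the table's display precision to avoid float noise.
        gains.append((name, round(current - previous, 2)))
    return gains


gains = marginal_gains(running_r2)
for name, gain in gains:
    print(f"{name}: +{gain:.2f}")
```

At this precision, only "X3 distance to the nearest MRT station" (+0.70) and "X1 transaction date" (+0.10) move the running R-squared.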