House Prices Advanced (Kaggle, Regression, n=1460, d=79)

Loading The Data

In [1]:
import kxy # pip install kxy; importing it registers the df.kxy accessor used below
from kxy_datasets.kaggle_regressions import HousePricesAdvanced # pip install kxy_datasets
In [2]:
dataset = HousePricesAdvanced()
df = dataset.df # Retrieve the dataset as a pandas dataframe
y_column = dataset.y_column # The name of the column corresponding to the target
problem_type = dataset.problem_type # 'regression' or 'classification'
In [3]:
df.kxy.describe() # Visualize a summary of the data

----------------
Column: 1stFlrSF
----------------
Type:   Continuous
Max:    4,692
p75:    1,391
Mean:   1,162
Median: 1,087
p25:    882
Min:    334

----------------
Column: 2ndFlrSF
----------------
Type:   Continuous
Max:    2,065
p75:    728
Mean:   346
Median: 0.0
p25:    0.0
Min:    0.0

-----------------
Column: 3SsnPorch
-----------------
Type:   Continuous
Max:    508
p75:    0.0
Mean:   3.4
Median: 0.0
p25:    0.0
Min:    0.0

-------------
Column: Alley
-------------
Type:      Categorical
Frequency: 93.77%, Label: nan
Other Labels: 6.23%

--------------------
Column: BedroomAbvGr
--------------------
Type:   Continuous
Max:    8.0
p75:    3.0
Mean:   2.9
Median: 3.0
p25:    2.0
Min:    0.0

----------------
Column: BldgType
----------------
Type:      Categorical
Frequency: 83.56%, Label: 1Fam
Frequency:  7.81%, Label: TwnhsE
Other Labels: 8.63%

----------------
Column: BsmtCond
----------------
Type:      Categorical
Frequency: 89.79%, Label: TA
Frequency:  4.45%, Label: Gd
Other Labels: 5.75%

--------------------
Column: BsmtExposure
--------------------
Type:      Categorical
Frequency: 65.27%, Label: No
Frequency: 15.14%, Label: Av
Frequency:  9.18%, Label: Gd
Frequency:  7.81%, Label: Mn
Other Labels: 2.60%

------------------
Column: BsmtFinSF1
------------------
Type:   Continuous
Max:    5,644
p75:    712
Mean:   443
Median: 383
p25:    0.0
Min:    0.0

------------------
Column: BsmtFinSF2
------------------
Type:   Continuous
Max:    1,474
p75:    0.0
Mean:   46
Median: 0.0
p25:    0.0
Min:    0.0

--------------------
Column: BsmtFinType1
--------------------
Type:      Categorical
Frequency: 29.45%, Label: Unf
Frequency: 28.63%, Label: GLQ
Frequency: 15.07%, Label: ALQ
Frequency: 10.14%, Label: BLQ
Frequency:  9.11%, Label: Rec
Other Labels: 7.60%

--------------------
Column: BsmtFinType2
--------------------
Type:      Categorical
Frequency: 86.03%, Label: Unf
Frequency:  3.70%, Label: Rec
Frequency:  3.15%, Label: LwQ
Other Labels: 7.12%

--------------------
Column: BsmtFullBath
--------------------
Type:   Continuous
Max:    3.0
p75:    1.0
Mean:   0.4
Median: 0.0
p25:    0.0
Min:    0.0

--------------------
Column: BsmtHalfBath
--------------------
Type:   Continuous
Max:    2.0
p75:    0.0
Mean:   0.1
Median: 0.0
p25:    0.0
Min:    0.0

----------------
Column: BsmtQual
----------------
Type:      Categorical
Frequency: 44.45%, Label: TA
Frequency: 42.33%, Label: Gd
Frequency:  8.29%, Label: Ex
Other Labels: 4.93%

-----------------
Column: BsmtUnfSF
-----------------
Type:   Continuous
Max:    2,336
p75:    808
Mean:   567
Median: 477
p25:    223
Min:    0.0

------------------
Column: CentralAir
------------------
Type:      Categorical
Frequency: 93.49%, Label: Y
Other Labels: 6.51%

------------------
Column: Condition1
------------------
Type:      Categorical
Frequency: 86.30%, Label: Norm
Frequency:  5.55%, Label: Feedr
Other Labels: 8.15%

------------------
Column: Condition2
------------------
Type:      Categorical
Frequency: 98.97%, Label: Norm
Other Labels: 1.03%

------------------
Column: Electrical
------------------
Type:      Categorical
Frequency: 91.37%, Label: SBrkr
Other Labels: 8.63%

---------------------
Column: EnclosedPorch
---------------------
Type:   Continuous
Max:    552
p75:    0.0
Mean:   21
Median: 0.0
p25:    0.0
Min:    0.0

-----------------
Column: ExterCond
-----------------
Type:      Categorical
Frequency: 87.81%, Label: TA
Frequency: 10.00%, Label: Gd
Other Labels: 2.19%

-----------------
Column: ExterQual
-----------------
Type:      Categorical
Frequency: 62.05%, Label: TA
Frequency: 33.42%, Label: Gd
Other Labels: 4.52%

-------------------
Column: Exterior1st
-------------------
Type:      Categorical
Frequency: 35.27%, Label: VinylSd
Frequency: 15.21%, Label: HdBoard
Frequency: 15.07%, Label: MetalSd
Frequency: 14.11%, Label: Wd Sdng
Frequency:  7.40%, Label: Plywood
Frequency:  4.18%, Label: CemntBd
Other Labels: 8.77%

-------------------
Column: Exterior2nd
-------------------
Type:      Categorical
Frequency: 34.52%, Label: VinylSd
Frequency: 14.66%, Label: MetalSd
Frequency: 14.18%, Label: HdBoard
Frequency: 13.49%, Label: Wd Sdng
Frequency:  9.73%, Label: Plywood
Frequency:  4.11%, Label: CmentBd
Other Labels: 9.32%

-------------
Column: Fence
-------------
Type:      Categorical
Frequency: 80.75%, Label: nan
Frequency: 10.75%, Label: MnPrv
Other Labels: 8.49%

-------------------
Column: FireplaceQu
-------------------
Type:      Categorical
Frequency: 47.26%, Label: nan
Frequency: 26.03%, Label: Gd
Frequency: 21.44%, Label: TA
Other Labels: 5.27%

------------------
Column: Fireplaces
------------------
Type:   Continuous
Max:    3.0
p75:    1.0
Mean:   0.6
Median: 1.0
p25:    0.0
Min:    0.0

------------------
Column: Foundation
------------------
Type:      Categorical
Frequency: 44.32%, Label: PConc
Frequency: 43.42%, Label: CBlock
Frequency: 10.00%, Label: BrkTil
Other Labels: 2.26%

----------------
Column: FullBath
----------------
Type:   Continuous
Max:    3.0
p75:    2.0
Mean:   1.6
Median: 2.0
p25:    1.0
Min:    0.0

------------------
Column: Functional
------------------
Type:      Categorical
Frequency: 93.15%, Label: Typ
Other Labels: 6.85%

------------------
Column: GarageArea
------------------
Type:   Continuous
Max:    1,418
p75:    576
Mean:   472
Median: 480
p25:    334
Min:    0.0

------------------
Column: GarageCars
------------------
Type:   Continuous
Max:    4.0
p75:    2.0
Mean:   1.8
Median: 2.0
p25:    1.0
Min:    0.0

------------------
Column: GarageCond
------------------
Type:      Categorical
Frequency: 90.82%, Label: TA
Other Labels: 9.18%

--------------------
Column: GarageFinish
--------------------
Type:      Categorical
Frequency: 41.44%, Label: Unf
Frequency: 28.90%, Label: RFn
Frequency: 24.11%, Label: Fin
Other Labels: 5.55%

------------------
Column: GarageQual
------------------
Type:      Categorical
Frequency: 89.79%, Label: TA
Frequency:  5.55%, Label: nan
Other Labels: 4.66%

------------------
Column: GarageType
------------------
Type:      Categorical
Frequency: 59.59%, Label: Attchd
Frequency: 26.51%, Label: Detchd
Frequency:  6.03%, Label: BuiltIn
Other Labels: 7.88%

-------------------
Column: GarageYrBlt
-------------------
Type:   Continuous
Max:    2,010
p75:    2,002
Mean:   1,978
Median: 1,980
p25:    1,961
Min:    1,900

-----------------
Column: GrLivArea
-----------------
Type:   Continuous
Max:    5,642
p75:    1,776
Mean:   1,515
Median: 1,464
p25:    1,129
Min:    334

----------------
Column: HalfBath
----------------
Type:   Continuous
Max:    2.0
p75:    1.0
Mean:   0.4
Median: 0.0
p25:    0.0
Min:    0.0

---------------
Column: Heating
---------------
Type:      Categorical
Frequency: 97.81%, Label: GasA
Other Labels: 2.19%

-----------------
Column: HeatingQC
-----------------
Type:      Categorical
Frequency: 50.75%, Label: Ex
Frequency: 29.32%, Label: TA
Frequency: 16.51%, Label: Gd
Other Labels: 3.42%

------------------
Column: HouseStyle
------------------
Type:      Categorical
Frequency: 49.73%, Label: 1Story
Frequency: 30.48%, Label: 2Story
Frequency: 10.55%, Label: 1.5Fin
Other Labels: 9.25%

--------------------
Column: KitchenAbvGr
--------------------
Type:   Continuous
Max:    3.0
p75:    1.0
Mean:   1.0
Median: 1.0
p25:    1.0
Min:    0.0

-------------------
Column: KitchenQual
-------------------
Type:      Categorical
Frequency: 50.34%, Label: TA
Frequency: 40.14%, Label: Gd
Other Labels: 9.52%

-------------------
Column: LandContour
-------------------
Type:      Categorical
Frequency: 89.79%, Label: Lvl
Frequency:  4.32%, Label: Bnk
Other Labels: 5.89%

-----------------
Column: LandSlope
-----------------
Type:      Categorical
Frequency: 94.66%, Label: Gtl
Other Labels: 5.34%

---------------
Column: LotArea
---------------
Type:   Continuous
Max:    215,245
p75:    11,601
Mean:   10,516
Median: 9,478
p25:    7,553
Min:    1,300

-----------------
Column: LotConfig
-----------------
Type:      Categorical
Frequency: 72.05%, Label: Inside
Frequency: 18.01%, Label: Corner
Other Labels: 9.93%

-------------------
Column: LotFrontage
-------------------
Type:   Continuous
Max:    313
p75:    80
Mean:   70
Median: 69
p25:    59
Min:    21

----------------
Column: LotShape
----------------
Type:      Categorical
Frequency: 63.36%, Label: Reg
Frequency: 33.15%, Label: IR1
Other Labels: 3.49%

--------------------
Column: LowQualFinSF
--------------------
Type:   Continuous
Max:    572
p75:    0.0
Mean:   5.8
Median: 0.0
p25:    0.0
Min:    0.0

------------------
Column: MSSubClass
------------------
Type:   Continuous
Max:    190
p75:    70
Mean:   56
Median: 50
p25:    20
Min:    20

----------------
Column: MSZoning
----------------
Type:      Categorical
Frequency: 78.84%, Label: RL
Frequency: 14.93%, Label: RM
Other Labels: 6.23%

------------------
Column: MasVnrArea
------------------
Type:   Continuous
Max:    1,600
p75:    166
Mean:   103
Median: 0.0
p25:    0.0
Min:    0.0

------------------
Column: MasVnrType
------------------
Type:      Categorical
Frequency: 59.18%, Label: None
Frequency: 30.48%, Label: BrkFace
Frequency:  8.77%, Label: Stone
Other Labels: 1.58%

-------------------
Column: MiscFeature
-------------------
Type:      Categorical
Frequency: 96.30%, Label: nan
Other Labels: 3.70%

---------------
Column: MiscVal
---------------
Type:   Continuous
Max:    15,500
p75:    0.0
Mean:   43
Median: 0.0
p25:    0.0
Min:    0.0

--------------
Column: MoSold
--------------
Type:   Continuous
Max:    12
p75:    8.0
Mean:   6.3
Median: 6.0
p25:    5.0
Min:    1.0

--------------------
Column: Neighborhood
--------------------
Type:      Categorical
Frequency: 15.41%, Label: NAmes
Frequency: 10.27%, Label: CollgCr
Frequency:  7.74%, Label: OldTown
Frequency:  6.85%, Label: Edwards
Frequency:  5.89%, Label: Somerst
Frequency:  5.41%, Label: Gilbert
Frequency:  5.27%, Label: NridgHt
Frequency:  5.07%, Label: Sawyer
Frequency:  5.00%, Label: NWAmes
Frequency:  4.04%, Label: SawyerW
Frequency:  3.97%, Label: BrkSide
Frequency:  3.49%, Label: Crawfor
Frequency:  3.36%, Label: Mitchel
Frequency:  2.81%, Label: NoRidge
Frequency:  2.60%, Label: Timber
Frequency:  2.53%, Label: IDOTRR
Frequency:  1.92%, Label: ClearCr
Other Labels: 8.36%

-------------------
Column: OpenPorchSF
-------------------
Type:   Continuous
Max:    547
p75:    68
Mean:   46
Median: 25
p25:    0.0
Min:    0.0

-------------------
Column: OverallCond
-------------------
Type:   Continuous
Max:    9.0
p75:    6.0
Mean:   5.6
Median: 5.0
p25:    5.0
Min:    1.0

-------------------
Column: OverallQual
-------------------
Type:   Continuous
Max:    10
p75:    7.0
Mean:   6.1
Median: 6.0
p25:    5.0
Min:    1.0

------------------
Column: PavedDrive
------------------
Type:      Categorical
Frequency: 91.78%, Label: Y
Other Labels: 8.22%

----------------
Column: PoolArea
----------------
Type:   Continuous
Max:    738
p75:    0.0
Mean:   2.8
Median: 0.0
p25:    0.0
Min:    0.0

--------------
Column: PoolQC
--------------
Type:      Categorical
Frequency: 99.52%, Label: nan
Other Labels: 0.48%

----------------
Column: RoofMatl
----------------
Type:      Categorical
Frequency: 98.22%, Label: CompShg
Other Labels: 1.78%

-----------------
Column: RoofStyle
-----------------
Type:      Categorical
Frequency: 78.15%, Label: Gable
Frequency: 19.59%, Label: Hip
Other Labels: 2.26%

---------------------
Column: SaleCondition
---------------------
Type:      Categorical
Frequency: 82.05%, Label: Normal
Frequency:  8.56%, Label: Partial
Other Labels: 9.38%

-----------------
Column: SalePrice
-----------------
Type:   Continuous
Max:    755,000
p75:    214,000
Mean:   180,921
Median: 163,000
p25:    129,975
Min:    34,900

----------------
Column: SaleType
----------------
Type:      Categorical
Frequency: 86.78%, Label: WD
Frequency:  8.36%, Label: New
Other Labels: 4.86%

-------------------
Column: ScreenPorch
-------------------
Type:   Continuous
Max:    480
p75:    0.0
Mean:   15
Median: 0.0
p25:    0.0
Min:    0.0

--------------
Column: Street
--------------
Type:      Categorical
Frequency: 99.59%, Label: Pave
Other Labels: 0.41%

--------------------
Column: TotRmsAbvGrd
--------------------
Type:   Continuous
Max:    14
p75:    7.0
Mean:   6.5
Median: 6.0
p25:    5.0
Min:    2.0

-------------------
Column: TotalBsmtSF
-------------------
Type:   Continuous
Max:    6,110
p75:    1,298
Mean:   1,057
Median: 991
p25:    795
Min:    0.0

-----------------
Column: Utilities
-----------------
Type:      Categorical
Frequency: 99.93%, Label: AllPub
Other Labels: 0.07%

------------------
Column: WoodDeckSF
------------------
Type:   Continuous
Max:    857
p75:    168
Mean:   94
Median: 0.0
p25:    0.0
Min:    0.0

-----------------
Column: YearBuilt
-----------------
Type:   Continuous
Max:    2,010
p75:    2,000
Mean:   1,971
Median: 1,973
p25:    1,954
Min:    1,872

--------------------
Column: YearRemodAdd
--------------------
Type:   Continuous
Max:    2,010
p75:    2,004
Mean:   1,984
Median: 1,994
p25:    1,967
Min:    1,950

--------------
Column: YrSold
--------------
Type:   Continuous
Max:    2,010
p75:    2,009
Mean:   2,007
Median: 2,008
p25:    2,007
Min:    2,006
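
The summary above can be approximated with plain pandas. The sketch below is not kxy's implementation: the function name `summarize` and the toy columns are hypothetical, and the 90% cut-off for listing labels is only inferred from the printed output (labels appear to be listed until they cover about 90% of rows, with the rest grouped under "Other Labels").

```python
import pandas as pd


def summarize(column: pd.Series, coverage: float = 0.9) -> dict:
    """Summarize one column in the spirit of the output above (hypothetical helper)."""
    if pd.api.types.is_numeric_dtype(column):
        return {
            "Type": "Continuous",
            "Max": column.max(),
            "p75": column.quantile(0.75),
            "Mean": column.mean(),
            "Median": column.median(),
            "p25": column.quantile(0.25),
            "Min": column.min(),
        }
    # Relative label frequencies, most frequent first; missing values count as a label.
    freq = column.value_counts(normalize=True, dropna=False)
    # Keep labels until their cumulative share reaches `coverage` (assumed rule).
    n_keep = int((freq.cumsum() < coverage).sum()) + 1
    shown = freq.iloc[:n_keep]
    return {
        "Type": "Categorical",
        "Frequencies": shown.to_dict(),
        "Other Labels": 1.0 - float(shown.sum()),
    }


# Toy columns for illustration only; these are not the real dataset.
numeric_summary = summarize(pd.Series([334, 882, 1087, 1391, 4692]))
categorical_summary = summarize(pd.Series(["1Fam"] * 16 + ["TwnhsE"] * 3 + ["Duplex"]))
print(numeric_summary)
print(categorical_summary)
```

Applying `summarize` to each column of `df` would give a rough cross-check of the printed summary; small differences are possible if kxy uses a different percentile or label-selection rule.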

Data Valuation

In [4]:
df.kxy.data_valuation(y_column, problem_type=problem_type)
Out[4]:
   Achievable R-Squared  Achievable Log-Likelihood Per Sample  Achievable RMSE
0  1.00                  -8.67                                 1,747

Automatic (Model-Free) Variable Selection

In [5]:
df.kxy.variable_selection(y_column, problem_type=problem_type)
Out[5]:
Selection Order  Variable      Running Achievable R-Squared  Running Achievable RMSE
0                No Variable   0.00                          7.94e+04
1                OverallQual   0.65                          4.70e+04
2                GrLivArea     0.78                          3.70e+04
3                YearBuilt     0.84                          3.17e+04
4                TotalBsmtSF   0.85                          3.12e+04
...              ...           ...                           ...
75               BsmtFullBath  1.00                          1.75e+03
76               Street        1.00                          1.75e+03
77               Fence         1.00                          1.75e+03
78               TotRmsAbvGrd  1.00                          1.75e+03
79               3SsnPorch     1.00                          1.75e+03

80 rows × 3 columns
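
The two running columns appear to be tied together by RMSE = baseline RMSE × sqrt(1 − R²), where the baseline is the "No Variable" row. That relation is an assumption made here, not something stated in the output, but the reported figures agree with it to within rounding. The sketch below checks it on the rows shown and recovers the R² implied by the final RMSE, which is slightly below the displayed 1.00 (about 0.9995). The helper names are hypothetical.

```python
import math

# RMSE with no explanatory variable, taken from the "No Variable" row above.
BASELINE_RMSE = 7.94e4


def rmse_from_r2(r2: float, baseline: float = BASELINE_RMSE) -> float:
    """RMSE implied by an R-squared, assuming RMSE = baseline * sqrt(1 - R^2)."""
    return baseline * math.sqrt(1.0 - r2)


def r2_from_rmse(rmse: float, baseline: float = BASELINE_RMSE) -> float:
    """R-squared implied by an RMSE, under the same assumed relation."""
    return 1.0 - (rmse / baseline) ** 2


# (variable, reported running R-squared, reported running RMSE) from the table above.
reported_rows = [
    ("OverallQual", 0.65, 4.70e4),
    ("GrLivArea", 0.78, 3.70e4),
    ("YearBuilt", 0.84, 3.17e4),
    ("TotalBsmtSF", 0.85, 3.12e4),
]

relative_gaps = []
for name, r2, reported_rmse in reported_rows:
    implied_rmse = rmse_from_r2(r2)
    gap = abs(implied_rmse - reported_rmse) / reported_rmse
    relative_gaps.append(gap)
    print(f"{name:12s} implied RMSE {implied_rmse:10,.0f}  reported {reported_rmse:10,.0f}")

# R-squared implied by the final running RMSE of 1.75e+03.
final_r2 = r2_from_rmse(1.75e3)
print(f"Implied final R-squared: {final_r2:.4f}")
```

Because R² is printed to two decimals, the implied RMSE can differ from the reported one by a percent or so; the RMSE column is the more informative of the two once R² saturates at 1.00.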