1. Business Understanding¶
Zuvor musste das Maklerbüro die Vermieter erst als Kunden gewinnen und deren Mietobjekte von einem Gutachter bewerten lassen. Anschließend wurde mit der persönlichen Expertise der Makler ein Exposé über die Stuttgarter Mietsituation erstellt. Schließlich mussten geeignete Kunden für die Mietobjekte gefunden werden. Nun will die Dehner Immobilien GmbH expandieren, aber nicht in jeder Stadt ein neues Büro mit zusätzlichen Maklern eröffnen. Deshalb soll der Prozessschritt der Bewertung und Kundenakquise digitalisiert werden. Deshalb soll die Bewertung von Immobilien nun über künstliche Intelligenz erfolgen. Hier kann ein Interessent bequem eine Immobilie bewerten lassen, ohne dass ein Gutachter oder ein Makler das Objekt vor Ort besichtigen muss. Die anschließende Erstellung der Exposés für die deutschlandweiten Objekte erfolgt ebenfalls zentral von der Zentrale in Stuttgart aus auf Basis der Bewertungsergebnisse. Die zentrale Frage lautet: Wie lassen sich mit ML im Rahmen der Immobilienbewertung optimale Mietpreise vorhersagen?
2. Data Understanding¶
Die Daten stammen von Immoscout24, der größten Immobilienplattform in Deutschland. Immoscout24 listet sowohl Miet- als auch Kaufobjekte. Die Daten, die dieser Arbeit zugrunde liegen, enthalten jedoch nur Angebote für Mietobjekte. Der Datensatz enthält die wichtigsten Objekte. Der Datensatz enthält insgesamt 268.850 Einzeldaten mit jeweils 49 Variablen. Die 49 hier abgebildeten Objekte spiegeln die Beschreibungsmöglichkeiten von Immobilienscout24 wider. Allerdings eignen sich nicht alle Variablen als Prädiktoren für eine multiple lineare Regression. Nur Daten, die einen direkten Zusammenhang oder Einfluss auf den Mietpreis haben und somit ein Muster abbilden können, sind dafür geeignet. Dies schließt zum Beispiel die Art der Heizung oder die Stärke der Internetverbindung aus. Daher muss die Menge der Variablen in der Datenaufbereitung auf die für die Preisvorhersage relevanten Eigenschaften reduziert werden.
2.1 Bibliotheken Importieren¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
%matplotlib inline
2.2 Auslesen der Daten¶
raw_data = pd.read_csv("https://storage.googleapis.com/ml-service-repository-datastorage/Digital_Valuation_of_Real_Estate_data.csv")
raw_data.head()
regio1 | serviceCharge | heatingType | telekomTvOffer | telekomHybridUploadSpeed | newlyConst | balcony | picturecount | pricetrend | telekomUploadSpeed | ... | regio2 | regio3 | description | facilities | heatingCosts | energyEfficiencyClass | lastRefurbish | electricityBasePrice | electricityKwhPrice | date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Nordrhein_Westfalen | 245.00 | central_heating | ONE_YEAR_FREE | NaN | False | False | 6 | 4.62 | 10.0 | ... | Dortmund | Schüren | Die ebenerdig zu erreichende Erdgeschosswohnun... | Die Wohnung ist mit Laminat ausgelegt. Das Bad... | NaN | NaN | NaN | NaN | NaN | May19 |
1 | Rheinland_Pfalz | 134.00 | self_contained_central_heating | ONE_YEAR_FREE | NaN | False | True | 8 | 3.47 | 10.0 | ... | Rhein_Pfalz_Kreis | Böhl_Iggelheim | Alles neu macht der Mai – so kann es auch für ... | NaN | NaN | NaN | 2019.0 | NaN | NaN | May19 |
2 | Sachsen | 255.00 | floor_heating | ONE_YEAR_FREE | 10.0 | True | True | 8 | 2.72 | 2.4 | ... | Dresden | Äußere_Neustadt_Antonstadt | Der Neubau entsteht im Herzen der Dresdner Neu... | * 9 m² Balkon\n* Bad mit bodengleicher Dusche,... | NaN | NaN | NaN | NaN | NaN | Oct19 |
3 | Sachsen | 58.15 | district_heating | ONE_YEAR_FREE | NaN | False | True | 9 | 1.53 | 40.0 | ... | Mittelsachsen_Kreis | Freiberg | Abseits von Lärm und Abgasen in Ihre neue Wohn... | NaN | 87.23 | NaN | NaN | NaN | NaN | May19 |
4 | Bremen | 138.00 | self_contained_central_heating | NaN | NaN | False | True | 19 | 2.46 | NaN | ... | Bremen | Neu_Schwachhausen | Es handelt sich hier um ein saniertes Mehrfami... | Diese Wohnung wurde neu saniert und ist wie fo... | NaN | NaN | NaN | NaN | NaN | Feb20 |
5 rows × 49 columns
def attribute_description(data):
longestColumnName = len(max(np.array(data.columns), key=len))
print("| Feature | Data Type|")
print("|-----|------|")
for col in data.columns:
description = ''
col_dropna = data[col].dropna()
example = col_dropna.sample(1).values[0]
if type(example) == str:
description = 'str '
if len(col_dropna.unique()) < 10:
description += '{'
description += '; '.join([ f'"{name}"' for name in col_dropna.unique()])
description += '}'
else:
description += '[ example: "'+ example + '" ]'
elif (type(example) == np.int32) and (len(col_dropna.unique()) < 10) :
description += 'dummy int32 {'
description += '; '.join([ f'{name}' for name in sorted(col_dropna.unique())])
description += '}'
else:
try:
description = example.dtype
except:
description = type(example)
print("|" + col.ljust(longestColumnName)+ f'| {description} |')
attribute_description(raw_data)
2.3 Datenerkundung¶
raw_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 268850 entries, 0 to 268849 Data columns (total 49 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 regio1 268850 non-null object 1 serviceCharge 261941 non-null float64 2 heatingType 223994 non-null object 3 telekomTvOffer 236231 non-null object 4 telekomHybridUploadSpeed 45020 non-null float64 5 newlyConst 268850 non-null bool 6 balcony 268850 non-null bool 7 picturecount 268850 non-null int64 8 pricetrend 267018 non-null float64 9 telekomUploadSpeed 235492 non-null float64 10 totalRent 228333 non-null float64 11 yearConstructed 211805 non-null float64 12 scoutId 268850 non-null int64 13 noParkSpaces 93052 non-null float64 14 firingTypes 211886 non-null object 15 hasKitchen 268850 non-null bool 16 geo_bln 268850 non-null object 17 cellar 268850 non-null bool 18 yearConstructedRange 211805 non-null float64 19 baseRent 268850 non-null float64 20 houseNumber 197832 non-null object 21 livingSpace 268850 non-null float64 22 geo_krs 268850 non-null object 23 condition 200361 non-null object 24 interiorQual 156185 non-null object 25 petsAllowed 154277 non-null object 26 street 268850 non-null object 27 streetPlain 197837 non-null object 28 lift 268850 non-null bool 29 baseRentRange 268850 non-null int64 30 typeOfFlat 232236 non-null object 31 geo_plz 268850 non-null int64 32 noRooms 268850 non-null float64 33 thermalChar 162344 non-null float64 34 floor 217541 non-null float64 35 numberOfFloors 171118 non-null float64 36 noRoomsRange 268850 non-null int64 37 garden 268850 non-null bool 38 livingSpaceRange 268850 non-null int64 39 regio2 268850 non-null object 40 regio3 268850 non-null object 41 description 249103 non-null object 42 facilities 215926 non-null object 43 heatingCosts 85518 non-null float64 44 energyEfficiencyClass 77787 non-null object 45 lastRefurbish 80711 non-null float64 46 electricityBasePrice 46846 non-null float64 47 electricityKwhPrice 46846 non-null float64 48 date 268850 non-null object dtypes: bool(6), float64(18), int64(6), object(19) memory usage: 89.7+ MB
2.4 Deskriptive Statistik erkunden¶
raw_data.describe(include="all")
regio1 | serviceCharge | heatingType | telekomTvOffer | telekomHybridUploadSpeed | newlyConst | balcony | picturecount | pricetrend | telekomUploadSpeed | ... | regio2 | regio3 | description | facilities | heatingCosts | energyEfficiencyClass | lastRefurbish | electricityBasePrice | electricityKwhPrice | date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 268850 | 261941.000000 | 223994 | 236231 | 45020.0 | 268850 | 268850 | 268850.000000 | 267018.000000 | 235492.000000 | ... | 268850 | 268850 | 249103 | 215926 | 85518.000000 | 77787 | 80711.000000 | 46846.000000 | 46846.000000 | 268850 |
unique | 16 | NaN | 13 | 3 | NaN | 2 | 2 | NaN | NaN | NaN | ... | 419 | 8684 | 212621 | 189526 | NaN | 10 | NaN | NaN | NaN | 4 |
top | Nordrhein_Westfalen | NaN | central_heating | ONE_YEAR_FREE | NaN | False | True | NaN | NaN | NaN | ... | Leipzig | Innenstadt | + MEIN GUENTZ - Quartier der Generationen\r\n+... | Laminat | NaN | C | NaN | NaN | NaN | Feb20 |
freq | 62863 | NaN | 128977 | 227632 | NaN | 247679 | 165734 | NaN | NaN | NaN | ... | 13723 | 4751 | 216 | 207 | NaN | 14613 | NaN | NaN | NaN | 79276 |
mean | NaN | 151.206113 | NaN | NaN | 10.0 | NaN | NaN | 9.791958 | 3.389001 | 28.804928 | ... | NaN | NaN | NaN | NaN | 76.990866 | NaN | 2013.904536 | 89.113612 | 0.199769 | NaN |
std | NaN | 308.295790 | NaN | NaN | 0.0 | NaN | NaN | 6.408399 | 1.964874 | 16.337151 | ... | NaN | NaN | NaN | NaN | 147.716278 | NaN | 10.963125 | 5.395805 | 0.009667 | NaN |
min | NaN | 0.000000 | NaN | NaN | 10.0 | NaN | NaN | 0.000000 | -12.330000 | 1.000000 | ... | NaN | NaN | NaN | NaN | 0.000000 | NaN | 1015.000000 | 71.430000 | 0.170500 | NaN |
25% | NaN | 95.000000 | NaN | NaN | 10.0 | NaN | NaN | 6.000000 | 2.000000 | 10.000000 | ... | NaN | NaN | NaN | NaN | 54.000000 | NaN | 2012.000000 | 90.760000 | 0.191500 | NaN |
50% | NaN | 135.000000 | NaN | NaN | 10.0 | NaN | NaN | 9.000000 | 3.390000 | 40.000000 | ... | NaN | NaN | NaN | NaN | 70.000000 | NaN | 2017.000000 | 90.760000 | 0.198500 | NaN |
75% | NaN | 190.000000 | NaN | NaN | 10.0 | NaN | NaN | 13.000000 | 4.570000 | 40.000000 | ... | NaN | NaN | NaN | NaN | 90.000000 | NaN | 2019.000000 | 90.760000 | 0.205500 | NaN |
max | NaN | 146118.000000 | NaN | NaN | 10.0 | NaN | NaN | 121.000000 | 14.920000 | 100.000000 | ... | NaN | NaN | NaN | NaN | 12613.000000 | NaN | 2919.000000 | 90.760000 | 0.227600 | NaN |
11 rows × 49 columns
raw_data.isnull().sum()
regio1 0 serviceCharge 6909 heatingType 44856 telekomTvOffer 32619 telekomHybridUploadSpeed 223830 newlyConst 0 balcony 0 picturecount 0 pricetrend 1832 telekomUploadSpeed 33358 totalRent 40517 yearConstructed 57045 scoutId 0 noParkSpaces 175798 firingTypes 56964 hasKitchen 0 geo_bln 0 cellar 0 yearConstructedRange 57045 baseRent 0 houseNumber 71018 livingSpace 0 geo_krs 0 condition 68489 interiorQual 112665 petsAllowed 114573 street 0 streetPlain 71013 lift 0 baseRentRange 0 typeOfFlat 36614 geo_plz 0 noRooms 0 thermalChar 106506 floor 51309 numberOfFloors 97732 noRoomsRange 0 garden 0 livingSpaceRange 0 regio2 0 regio3 0 description 19747 facilities 52924 heatingCosts 183332 energyEfficiencyClass 191063 lastRefurbish 188139 electricityBasePrice 222004 electricityKwhPrice 222004 date 0 dtype: int64
2.5 Datenbereinigung¶
Eliminierung von Daten, die sich nicht für ein Preisvorhersagemodell eignen
data1 = raw_data.drop(["heatingType",
"telekomTvOffer",
"telekomHybridUploadSpeed",
"picturecount",
"pricetrend",
"telekomUploadSpeed",
"scoutId",
"noParkSpaces",
"firingTypes",
"geo_bln",
"yearConstructedRange",
"houseNumber",
"condition",
"interiorQual",
"petsAllowed",
"street",
"streetPlain",
"baseRentRange",
"thermalChar",
"numberOfFloors",
"noRoomsRange",
"livingSpaceRange",
"description",
"facilities",
"heatingCosts",
"energyEfficiencyClass",
"lastRefurbish",
"electricityKwhPrice",
"electricityBasePrice",
"date"], axis=1)
data1.describe(include="all")
regio1 | serviceCharge | newlyConst | balcony | totalRent | yearConstructed | hasKitchen | cellar | baseRent | livingSpace | geo_krs | lift | typeOfFlat | geo_plz | noRooms | floor | garden | regio2 | regio3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 268850 | 261941.000000 | 268850 | 268850 | 2.283330e+05 | 211805.000000 | 268850 | 268850 | 2.688500e+05 | 268850.000000 | 268850 | 268850 | 232236 | 268850.000000 | 268850.000000 | 217541.000000 | 268850 | 268850 | 268850 |
unique | 16 | NaN | 2 | 2 | NaN | NaN | 2 | 2 | NaN | NaN | 419 | 2 | 10 | NaN | NaN | NaN | 2 | 419 | 8684 |
top | Nordrhein_Westfalen | NaN | False | True | NaN | NaN | False | True | NaN | NaN | Leipzig | False | apartment | NaN | NaN | NaN | False | Leipzig | Innenstadt |
freq | 62863 | NaN | 247679 | 165734 | NaN | NaN | 176794 | 172235 | NaN | NaN | 13723 | 205528 | 131522 | NaN | NaN | NaN | 216093 | 13723 | 4751 |
mean | NaN | 151.206113 | NaN | NaN | 9.013315e+02 | 1966.400590 | NaN | NaN | 6.941294e+02 | 74.355548 | NaN | NaN | NaN | 37283.022235 | 2.641261 | 2.122405 | NaN | NaN | NaN |
std | NaN | 308.295790 | NaN | NaN | 3.323833e+04 | 46.992207 | NaN | NaN | 1.953602e+04 | 254.759208 | NaN | NaN | NaN | 27798.037296 | 2.633440 | 3.634934 | NaN | NaN | NaN |
min | NaN | 0.000000 | NaN | NaN | 0.000000e+00 | 1000.000000 | NaN | NaN | 0.000000e+00 | 0.000000 | NaN | NaN | NaN | 852.000000 | 1.000000 | -1.000000 | NaN | NaN | NaN |
25% | NaN | 95.000000 | NaN | NaN | 4.698000e+02 | 1950.000000 | NaN | NaN | 3.380000e+02 | 54.000000 | NaN | NaN | NaN | 9128.000000 | 2.000000 | 1.000000 | NaN | NaN | NaN |
50% | NaN | 135.000000 | NaN | NaN | 6.500000e+02 | 1973.000000 | NaN | NaN | 4.900000e+02 | 67.320000 | NaN | NaN | NaN | 38667.000000 | 3.000000 | 2.000000 | NaN | NaN | NaN |
75% | NaN | 190.000000 | NaN | NaN | 9.850000e+02 | 1996.000000 | NaN | NaN | 7.990000e+02 | 87.000000 | NaN | NaN | NaN | 57072.000000 | 3.000000 | 3.000000 | NaN | NaN | NaN |
max | NaN | 146118.000000 | NaN | NaN | 1.575154e+07 | 2090.000000 | NaN | NaN | 9.999999e+06 | 111111.000000 | NaN | NaN | NaN | 99998.000000 | 999.990000 | 999.000000 | NaN | NaN | NaN |
data1.isnull().sum()
regio1 0 serviceCharge 6909 newlyConst 0 balcony 0 totalRent 40517 yearConstructed 57045 hasKitchen 0 cellar 0 baseRent 0 livingSpace 0 geo_krs 0 lift 0 typeOfFlat 36614 geo_plz 0 noRooms 0 floor 51309 garden 0 regio2 0 regio3 0 dtype: int64
Eliminierung einiger weiterer Daten, die nicht in dieses Modell passen
data2 = data1.drop(["typeOfFlat",
"totalRent",
"floor"], axis=1)
data2.describe(include="all")
regio1 | serviceCharge | newlyConst | balcony | yearConstructed | hasKitchen | cellar | baseRent | livingSpace | geo_krs | lift | geo_plz | noRooms | garden | regio2 | regio3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 268850 | 261941.000000 | 268850 | 268850 | 211805.000000 | 268850 | 268850 | 2.688500e+05 | 268850.000000 | 268850 | 268850 | 268850.000000 | 268850.000000 | 268850 | 268850 | 268850 |
unique | 16 | NaN | 2 | 2 | NaN | 2 | 2 | NaN | NaN | 419 | 2 | NaN | NaN | 2 | 419 | 8684 |
top | Nordrhein_Westfalen | NaN | False | True | NaN | False | True | NaN | NaN | Leipzig | False | NaN | NaN | False | Leipzig | Innenstadt |
freq | 62863 | NaN | 247679 | 165734 | NaN | 176794 | 172235 | NaN | NaN | 13723 | 205528 | NaN | NaN | 216093 | 13723 | 4751 |
mean | NaN | 151.206113 | NaN | NaN | 1966.400590 | NaN | NaN | 6.941294e+02 | 74.355548 | NaN | NaN | 37283.022235 | 2.641261 | NaN | NaN | NaN |
std | NaN | 308.295790 | NaN | NaN | 46.992207 | NaN | NaN | 1.953602e+04 | 254.759208 | NaN | NaN | 27798.037296 | 2.633440 | NaN | NaN | NaN |
min | NaN | 0.000000 | NaN | NaN | 1000.000000 | NaN | NaN | 0.000000e+00 | 0.000000 | NaN | NaN | 852.000000 | 1.000000 | NaN | NaN | NaN |
25% | NaN | 95.000000 | NaN | NaN | 1950.000000 | NaN | NaN | 3.380000e+02 | 54.000000 | NaN | NaN | 9128.000000 | 2.000000 | NaN | NaN | NaN |
50% | NaN | 135.000000 | NaN | NaN | 1973.000000 | NaN | NaN | 4.900000e+02 | 67.320000 | NaN | NaN | 38667.000000 | 3.000000 | NaN | NaN | NaN |
75% | NaN | 190.000000 | NaN | NaN | 1996.000000 | NaN | NaN | 7.990000e+02 | 87.000000 | NaN | NaN | 57072.000000 | 3.000000 | NaN | NaN | NaN |
max | NaN | 146118.000000 | NaN | NaN | 2090.000000 | NaN | NaN | 9.999999e+06 | 111111.000000 | NaN | NaN | 99998.000000 | 999.990000 | NaN | NaN | NaN |
data2.isnull().sum()
regio1 0 serviceCharge 6909 newlyConst 0 balcony 0 yearConstructed 57045 hasKitchen 0 cellar 0 baseRent 0 livingSpace 0 geo_krs 0 lift 0 geo_plz 0 noRooms 0 garden 0 regio2 0 regio3 0 dtype: int64
Eliminierung der Standortdaten; zip code left!
data3 = data2.drop(["regio1",
"geo_krs",
"regio2",
"regio3"], axis=1)
data3.describe(include="all")
serviceCharge | newlyConst | balcony | yearConstructed | hasKitchen | cellar | baseRent | livingSpace | lift | geo_plz | noRooms | garden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 261941.000000 | 268850 | 268850 | 211805.000000 | 268850 | 268850 | 2.688500e+05 | 268850.000000 | 268850 | 268850.000000 | 268850.000000 | 268850 |
unique | NaN | 2 | 2 | NaN | 2 | 2 | NaN | NaN | 2 | NaN | NaN | 2 |
top | NaN | False | True | NaN | False | True | NaN | NaN | False | NaN | NaN | False |
freq | NaN | 247679 | 165734 | NaN | 176794 | 172235 | NaN | NaN | 205528 | NaN | NaN | 216093 |
mean | 151.206113 | NaN | NaN | 1966.400590 | NaN | NaN | 6.941294e+02 | 74.355548 | NaN | 37283.022235 | 2.641261 | NaN |
std | 308.295790 | NaN | NaN | 46.992207 | NaN | NaN | 1.953602e+04 | 254.759208 | NaN | 27798.037296 | 2.633440 | NaN |
min | 0.000000 | NaN | NaN | 1000.000000 | NaN | NaN | 0.000000e+00 | 0.000000 | NaN | 852.000000 | 1.000000 | NaN |
25% | 95.000000 | NaN | NaN | 1950.000000 | NaN | NaN | 3.380000e+02 | 54.000000 | NaN | 9128.000000 | 2.000000 | NaN |
50% | 135.000000 | NaN | NaN | 1973.000000 | NaN | NaN | 4.900000e+02 | 67.320000 | NaN | 38667.000000 | 3.000000 | NaN |
75% | 190.000000 | NaN | NaN | 1996.000000 | NaN | NaN | 7.990000e+02 | 87.000000 | NaN | 57072.000000 | 3.000000 | NaN |
max | 146118.000000 | NaN | NaN | 2090.000000 | NaN | NaN | 9.999999e+06 | 111111.000000 | NaN | 99998.000000 | 999.990000 | NaN |
data3.isnull().sum()
serviceCharge 6909 newlyConst 0 balcony 0 yearConstructed 57045 hasKitchen 0 cellar 0 baseRent 0 livingSpace 0 lift 0 geo_plz 0 noRooms 0 garden 0 dtype: int64
Fehlende Werte¶
Eliminierung von Null-Werten
data_no_mv = data3.dropna(axis=0)
data_no_mv.describe(include="all")
serviceCharge | newlyConst | balcony | yearConstructed | hasKitchen | cellar | baseRent | livingSpace | lift | geo_plz | noRooms | garden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 208195.000000 | 208195 | 208195 | 208195.000000 | 208195 | 208195 | 208195.00000 | 208195.000000 | 208195 | 208195.000000 | 208195.000000 | 208195 |
unique | NaN | 2 | 2 | NaN | 2 | 2 | NaN | NaN | 2 | NaN | NaN | 2 |
top | NaN | False | True | NaN | False | True | NaN | NaN | False | NaN | NaN | False |
freq | NaN | 187880 | 134169 | NaN | 139914 | 137208 | NaN | NaN | 154725 | NaN | NaN | 169100 |
mean | 152.703730 | NaN | NaN | 1966.208526 | NaN | NaN | 653.92228 | 74.172720 | NaN | 36991.365917 | 2.650684 | NaN |
std | 343.201422 | NaN | NaN | 46.979715 | NaN | NaN | 582.88770 | 155.398388 | NaN | 27655.475222 | 2.516210 | NaN |
min | 0.000000 | NaN | NaN | 1000.000000 | NaN | NaN | 0.00000 | 0.000000 | NaN | 1057.000000 | 1.000000 | NaN |
25% | 95.000000 | NaN | NaN | 1950.000000 | NaN | NaN | 335.00000 | 54.850000 | NaN | 9130.000000 | 2.000000 | NaN |
50% | 136.000000 | NaN | NaN | 1972.000000 | NaN | NaN | 490.00000 | 67.350000 | NaN | 38368.000000 | 3.000000 | NaN |
75% | 190.000000 | NaN | NaN | 1996.000000 | NaN | NaN | 800.00000 | 86.510000 | NaN | 56068.000000 | 3.000000 | NaN |
max | 146118.000000 | NaN | NaN | 2090.000000 | NaN | NaN | 120000.00000 | 66100.000000 | NaN | 99994.000000 | 999.990000 | NaN |
data_no_mv.isnull().sum()
serviceCharge 0 newlyConst 0 balcony 0 yearConstructed 0 hasKitchen 0 cellar 0 baseRent 0 livingSpace 0 lift 0 geo_plz 0 noRooms 0 garden 0 dtype: int64
Doppelte Werte¶
Eliminierung von Werten, die zweimal im Datensatz vorkommen
data_no_mv[data_no_mv.duplicated(keep=False)]
serviceCharge | newlyConst | balcony | yearConstructed | hasKitchen | cellar | baseRent | livingSpace | lift | geo_plz | noRooms | garden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
17 | 50.00 | False | False | 1897.0 | False | False | 200.00 | 50.00 | False | 4720 | 2.0 | True |
19 | 290.00 | False | True | 2013.0 | False | True | 1343.48 | 127.95 | True | 37073 | 5.0 | False |
50 | 50.00 | False | True | 1980.0 | False | True | 285.00 | 36.29 | True | 15517 | 1.0 | False |
52 | 170.00 | False | False | 1905.0 | False | True | 365.00 | 79.00 | False | 9337 | 3.0 | False |
57 | 113.80 | False | False | 1969.0 | False | True | 340.00 | 42.52 | False | 19057 | 2.0 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
268801 | 68.00 | False | True | 1999.0 | False | True | 190.00 | 28.54 | False | 17098 | 1.0 | False |
268815 | 79.20 | False | True | 1984.0 | False | True | 329.87 | 65.45 | False | 6124 | 4.0 | False |
268821 | 80.00 | True | True | 2019.0 | False | True | 680.00 | 79.50 | False | 56281 | 2.0 | False |
268826 | 59.54 | False | True | 1978.0 | False | False | 309.61 | 59.54 | False | 4746 | 3.0 | False |
268841 | 98.00 | False | True | 1994.0 | False | True | 302.52 | 57.08 | False | 39114 | 2.5 | False |
16755 rows × 12 columns
data_no_dv = data_no_mv.drop_duplicates(keep="first")
data_no_dv[data_no_dv.duplicated(keep=False)]
serviceCharge | newlyConst | balcony | yearConstructed | hasKitchen | cellar | baseRent | livingSpace | lift | geo_plz | noRooms | garden |
---|
Distribution¶
data_no_dv.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 198749 entries, 0 to 268849 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 serviceCharge 198749 non-null float64 1 newlyConst 198749 non-null bool 2 balcony 198749 non-null bool 3 yearConstructed 198749 non-null float64 4 hasKitchen 198749 non-null bool 5 cellar 198749 non-null bool 6 baseRent 198749 non-null float64 7 livingSpace 198749 non-null float64 8 lift 198749 non-null bool 9 geo_plz 198749 non-null int64 10 noRooms 198749 non-null float64 11 garden 198749 non-null bool dtypes: bool(6), float64(5), int64(1) memory usage: 11.8 MB
data_no_dv.describe(include="all")
serviceCharge | newlyConst | balcony | yearConstructed | hasKitchen | cellar | baseRent | livingSpace | lift | geo_plz | noRooms | garden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 198749.000000 | 198749 | 198749 | 198749.000000 | 198749 | 198749 | 198749.000000 | 198749.000000 | 198749 | 198749.000000 | 198749.000000 | 198749 |
unique | NaN | 2 | 2 | NaN | 2 | 2 | NaN | NaN | 2 | NaN | NaN | 2 |
top | NaN | False | True | NaN | False | True | NaN | NaN | False | NaN | NaN | False |
freq | NaN | 179561 | 127983 | NaN | 132463 | 131153 | NaN | NaN | 147937 | NaN | NaN | 161044 |
mean | 153.694116 | NaN | NaN | 1966.017343 | NaN | NaN | 658.968301 | 74.517406 | NaN | 37554.190391 | 2.654986 | NaN |
std | 350.688667 | NaN | NaN | 47.346554 | NaN | NaN | 586.128842 | 158.910643 | NaN | 27623.931360 | 2.567069 | NaN |
min | 0.000000 | NaN | NaN | 1000.000000 | NaN | NaN | 0.000000 | 0.000000 | NaN | 1057.000000 | 1.000000 | NaN |
25% | 97.000000 | NaN | NaN | 1950.000000 | NaN | NaN | 340.000000 | 55.000000 | NaN | 9328.000000 | 2.000000 | NaN |
50% | 138.110000 | NaN | NaN | 1972.000000 | NaN | NaN | 495.000000 | 68.000000 | NaN | 39104.000000 | 3.000000 | NaN |
75% | 190.000000 | NaN | NaN | 1996.000000 | NaN | NaN | 810.000000 | 87.000000 | NaN | 56566.000000 | 3.000000 | NaN |
max | 146118.000000 | NaN | NaN | 2090.000000 | NaN | NaN | 120000.000000 | 66100.000000 | NaN | 99994.000000 | 999.990000 | NaN |
Untersuchung der Verteilung der übrigen Variablen
sns.distplot(data_no_dv["serviceCharge"])
/Users/Jumana/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='serviceCharge', ylabel='Density'>
Ausreißer eliminieren¶
q = data_no_dv["serviceCharge"].quantile(0.99)
q
450.0
data_1 = data_no_dv[data_no_dv["serviceCharge"]<q]
sns.distplot(data_1["serviceCharge"])
/Users/Jumana/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='serviceCharge', ylabel='Density'>
sns.distplot(data_1["baseRent"])
/Users/Jumana/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='baseRent', ylabel='Density'>
q_1 = data_1["baseRent"].quantile(0.99)
q_1
2220.0
data_2 = data_1[data_1["baseRent"]<q_1]
q_2 = data_no_mv["baseRent"].quantile(0.01)
q_2
170.0
data_3 = data_2[data_2["baseRent"]>q_2]
sns.distplot(data_3["baseRent"])
/Users/Jumana/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='baseRent', ylabel='Density'>
Falsche Werte¶
Bereinigung der Daten von falschen Werten
sns.distplot(data_3["livingSpace"])
/Users/Jumana/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='livingSpace', ylabel='Density'>
q_3 = data_3["livingSpace"].quantile(0.995)
q_3
175.0
data_4 = data_3[data_3["livingSpace"]<q_3]
q_4 = data_4["livingSpace"].quantile(0.005)
q_4
20.0
data_5 = data_4[data_4["livingSpace"]>q_4]
sns.distplot(data_5["livingSpace"])
/Users/Jumana/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='livingSpace', ylabel='Density'>
data_5["geo_plz"].min()
1057
data_5["geo_plz"].max()
99994
data_5 = data_5[data_5["geo_plz"]>1066] #Dresden hat die niedrigste PLZ mit 01067
sns.distplot(data_5["noRooms"])
/Users/Jumana/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='noRooms', ylabel='Density'>
q_5 = data_5["noRooms"].quantile(0.995)
q_5
5.0
data_6 = data_5[data_5["noRooms"]<q_5]
q_6 = data_6["noRooms"].quantile(0.005)
q_6
1.0
data_7 = data_6[data_6["noRooms"]>q_6]
sns.distplot(data_7["noRooms"])
/Users/Jumana/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='noRooms', ylabel='Density'>
Letzte Vorbereitungen¶
Überprüfen Sie die übrigen Variablen auf die Verteilung der Daten
sns.distplot(data_7["yearConstructed"])
/Users/Jumana/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='yearConstructed', ylabel='Density'>
data_8 = data_7[data_7["yearConstructed"]<2021.0]
q_7 = data_8["yearConstructed"].quantile(0.005)
q_7
1823.0
data_9 = data_8[data_8["yearConstructed"]>q_7]
sns.distplot(data_9["yearConstructed"])
/Users/Jumana/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
<AxesSubplot:xlabel='yearConstructed', ylabel='Density'>
data_9.describe(include="all")
serviceCharge | newlyConst | balcony | yearConstructed | hasKitchen | cellar | baseRent | livingSpace | lift | geo_plz | noRooms | garden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 169772.000000 | 169772 | 169772 | 169772.000000 | 169772 | 169772 | 169772.000000 | 169772.000000 | 169772 | 169772.000000 | 169772.000000 | 169772 |
unique | NaN | 2 | 2 | NaN | 2 | 2 | NaN | NaN | 2 | NaN | NaN | 2 |
top | NaN | False | True | NaN | False | True | NaN | NaN | False | NaN | NaN | False |
freq | NaN | 152899 | 111973 | NaN | 117171 | 115113 | NaN | NaN | 128697 | NaN | NaN | 137553 |
mean | 150.577057 | NaN | NaN | 1967.520816 | NaN | NaN | 627.400543 | 74.150277 | NaN | 37258.531772 | 2.722473 | NaN |
std | 69.131265 | NaN | NaN | 38.220866 | NaN | NaN | 369.127139 | 23.007341 | NaN | 27225.560201 | 0.697763 | NaN |
min | 0.000000 | NaN | NaN | 1824.000000 | NaN | NaN | 170.460000 | 20.320000 | NaN | 1067.000000 | 1.100000 | NaN |
25% | 100.000000 | NaN | NaN | 1950.000000 | NaN | NaN | 350.000000 | 58.000000 | NaN | 9235.000000 | 2.000000 | NaN |
50% | 140.000000 | NaN | NaN | 1972.000000 | NaN | NaN | 500.000000 | 69.265000 | NaN | 39104.000000 | 3.000000 | NaN |
75% | 190.000000 | NaN | NaN | 1996.000000 | NaN | NaN | 800.000000 | 86.000000 | NaN | 55152.750000 | 3.000000 | NaN |
max | 449.000000 | NaN | NaN | 2020.000000 | NaN | NaN | 2218.000000 | 174.800000 | NaN | 99994.000000 | 4.500000 | NaN |
die booleschen Daten in ganzzahlige Werte umwandeln¶
data_9['newlyConst'] = data_9['newlyConst']*1
data_9['balcony'] = data_9['balcony']*1
data_9['hasKitchen'] = data_9['hasKitchen']*1
data_9['cellar'] = data_9['cellar']*1
data_9['lift'] = data_9['lift']*1
data_9['garden'] = data_9['garden']*1
<ipython-input-58-8fd0b50daa89>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy data_9['newlyConst'] = data_9['newlyConst']*1 <ipython-input-58-8fd0b50daa89>:2: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy data_9['balcony'] = data_9['balcony']*1 <ipython-input-58-8fd0b50daa89>:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy data_9['hasKitchen'] = data_9['hasKitchen']*1 <ipython-input-58-8fd0b50daa89>:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy data_9['cellar'] = data_9['cellar']*1 <ipython-input-58-8fd0b50daa89>:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy data_9['lift'] = data_9['lift']*1 <ipython-input-58-8fd0b50daa89>:6: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy data_9['garden'] = data_9['garden']*1
data_9.head()
serviceCharge | newlyConst | balcony | yearConstructed | hasKitchen | cellar | baseRent | livingSpace | lift | geo_plz | noRooms | garden | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 245.00 | 0 | 0 | 1965.0 | 0 | 1 | 595.0 | 86.00 | 0 | 44269 | 4.0 | 1 |
1 | 134.00 | 0 | 1 | 1871.0 | 0 | 0 | 800.0 | 89.00 | 0 | 67459 | 3.0 | 0 |
2 | 255.00 | 1 | 1 | 2019.0 | 0 | 1 | 965.0 | 83.80 | 1 | 1097 | 3.0 | 0 |
3 | 58.15 | 0 | 1 | 1964.0 | 0 | 0 | 343.0 | 58.15 | 0 | 9599 | 3.0 | 0 |
4 | 138.00 | 0 | 1 | 1950.0 | 0 | 0 | 765.0 | 84.97 | 0 | 28213 | 3.0 | 0 |
data_9.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 169772 entries, 0 to 268849 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 serviceCharge 169772 non-null float64 1 newlyConst 169772 non-null int64 2 balcony 169772 non-null int64 3 yearConstructed 169772 non-null float64 4 hasKitchen 169772 non-null int64 5 cellar 169772 non-null int64 6 baseRent 169772 non-null float64 7 livingSpace 169772 non-null float64 8 lift 169772 non-null int64 9 geo_plz 169772 non-null int64 10 noRooms 169772 non-null float64 11 garden 169772 non-null int64 dtypes: float64(5), int64(7) memory usage: 16.8 MB
3. Data Preparation¶
sns.jointplot(x='serviceCharge',y='baseRent', kind='reg', data=data_9)
# correlation detected --> keep service Charge
<seaborn.axisgrid.JointGrid at 0x11ca779a0>
sns.jointplot(x='yearConstructed',y='baseRent', kind='reg', data=data_9)
#correlation detected --> keep year Constructed
<seaborn.axisgrid.JointGrid at 0x1228f7f70>
sns.jointplot(x='newlyConst',y='baseRent', kind='reg', data=data_9)
# correlation detected --> keep newly Const
<seaborn.axisgrid.JointGrid at 0x12272b4c0>
sns.jointplot(x='balcony',y='baseRent', kind='reg', data=data_9)
# no significant correlation --> drop balcony
<seaborn.axisgrid.JointGrid at 0x12649fc10>
sns.jointplot(x='hasKitchen',y='baseRent', kind='reg', data=data_9)
# correlation detected --> keep hasKitchen
<seaborn.axisgrid.JointGrid at 0x127050d00>
sns.jointplot(x='cellar',y='baseRent', kind='reg', data=data_9)
# no correlation --> drop cellar
<seaborn.axisgrid.JointGrid at 0x136baa5b0>
sns.jointplot(x='livingSpace',y='baseRent', kind='reg', data=data_9)
# correlation detected --> keep livingSpace
<seaborn.axisgrid.JointGrid at 0x138e9e430>
sns.jointplot(x='lift',y='baseRent', kind='reg', data=data_9)
# correlation detected --> keep lift
<seaborn.axisgrid.JointGrid at 0x13b364e50>
sns.jointplot(x='geo_plz',y='baseRent', kind='reg', data=data_9)
# correlation detected --> keep geo_plz
<seaborn.axisgrid.JointGrid at 0x13b415a00>
sns.jointplot(x='noRooms',y='baseRent', kind='reg', data=data_9)
# correlation detected --> keep noRooms
<seaborn.axisgrid.JointGrid at 0x13b030d30>
sns.jointplot(x='garden',y='baseRent', kind='reg', data=data_9)
# no correlation --> drop garden!
<seaborn.axisgrid.JointGrid at 0x1354de700>
Eliminieren Sie diejenigen Variablen, die keinen signifikanten Einfluss auf die Zielvariable haben.
data_cleaned = data_9.drop(["balcony",
"cellar",
"garden"], axis=1)
data_cleaned.describe(include='all')
serviceCharge | newlyConst | yearConstructed | hasKitchen | baseRent | livingSpace | lift | geo_plz | noRooms | |
---|---|---|---|---|---|---|---|---|---|
count | 169772.000000 | 169772.000000 | 169772.000000 | 169772.000000 | 169772.000000 | 169772.000000 | 169772.000000 | 169772.000000 | 169772.000000 |
mean | 150.577057 | 0.099386 | 1967.520816 | 0.309833 | 627.400543 | 74.150277 | 0.241942 | 37258.531772 | 2.722473 |
std | 69.131265 | 0.299181 | 38.220866 | 0.462426 | 369.127139 | 23.007341 | 0.428261 | 27225.560201 | 0.697763 |
min | 0.000000 | 0.000000 | 1824.000000 | 0.000000 | 170.460000 | 20.320000 | 0.000000 | 1067.000000 | 1.100000 |
25% | 100.000000 | 0.000000 | 1950.000000 | 0.000000 | 350.000000 | 58.000000 | 0.000000 | 9235.000000 | 2.000000 |
50% | 140.000000 | 0.000000 | 1972.000000 | 0.000000 | 500.000000 | 69.265000 | 0.000000 | 39104.000000 | 3.000000 |
75% | 190.000000 | 0.000000 | 1996.000000 | 1.000000 | 800.000000 | 86.000000 | 0.000000 | 55152.750000 | 3.000000 |
max | 449.000000 | 1.000000 | 2020.000000 | 1.000000 | 2218.000000 | 174.800000 | 1.000000 | 99994.000000 | 4.500000 |
3.1 OLS-Annahmen¶
Aus den oberen Jointplots lässt sich bereits ableiten, dass die Linearität nicht für alle Daten gewährleistet ist.
Daher wird die Zielvariable über den Logarithmus transformiert, um die Linearität der Daten herzustellen.
log_price = np.log(data_cleaned["baseRent"])
data_cleaned["log_price"] = log_price
data_cleaned
serviceCharge | newlyConst | yearConstructed | hasKitchen | baseRent | livingSpace | lift | geo_plz | noRooms | log_price | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 245.00 | 0 | 1965.0 | 0 | 595.0 | 86.00 | 0 | 44269 | 4.0 | 6.388561 |
1 | 134.00 | 0 | 1871.0 | 0 | 800.0 | 89.00 | 0 | 67459 | 3.0 | 6.684612 |
2 | 255.00 | 1 | 2019.0 | 0 | 965.0 | 83.80 | 1 | 1097 | 3.0 | 6.872128 |
3 | 58.15 | 0 | 1964.0 | 0 | 343.0 | 58.15 | 0 | 9599 | 3.0 | 5.837730 |
4 | 138.00 | 0 | 1950.0 | 0 | 765.0 | 84.97 | 0 | 28213 | 3.0 | 6.639876 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
268845 | 90.00 | 0 | 2016.0 | 0 | 820.0 | 90.00 | 0 | 82390 | 3.0 | 6.709304 |
268846 | 220.00 | 0 | 1983.0 | 1 | 930.0 | 115.00 | 0 | 68519 | 3.5 | 6.835185 |
268847 | 220.00 | 0 | 1965.0 | 0 | 650.0 | 95.00 | 0 | 65552 | 4.0 | 6.476972 |
268848 | 175.00 | 1 | 2019.0 | 0 | 840.0 | 70.00 | 1 | 51069 | 2.0 | 6.733402 |
268849 | 315.00 | 0 | 1972.0 | 1 | 935.0 | 77.00 | 1 | 65936 | 3.0 | 6.840547 |
169772 rows × 10 columns
Überprüfung der Wirksamkeit der Log-Transformation¶
plt.figure(figsize=(10,10))
plt.subplot(221)
plt.scatter(data_cleaned["serviceCharge"], data_cleaned["baseRent"])
plt.title("baseRent")
plt.subplot(222)
plt.scatter(data_cleaned["serviceCharge"], data_cleaned["log_price"])
plt.title("log_price")
Text(0.5, 1.0, 'log_price')
plt.figure(figsize=(10,10))
plt.subplot(221)
plt.scatter(data_cleaned["newlyConst"], data_cleaned["baseRent"])
plt.title("baseRent")
plt.subplot(222)
plt.scatter(data_cleaned["newlyConst"], data_cleaned["log_price"])
plt.title("log_price")
Text(0.5, 1.0, 'log_price')
plt.figure(figsize=(10,10))
plt.subplot(221)
plt.scatter(data_cleaned["yearConstructed"], data_cleaned["baseRent"])
plt.title("baseRent")
plt.subplot(222)
plt.scatter(data_cleaned["yearConstructed"], data_cleaned["log_price"])
plt.title("log_price")
Text(0.5, 1.0, 'log_price')
plt.figure(figsize=(10,10))
plt.subplot(221)
plt.scatter(data_cleaned["hasKitchen"], data_cleaned["baseRent"])
plt.title("baseRent")
plt.subplot(222)
plt.scatter(data_cleaned["hasKitchen"], data_cleaned["log_price"])
plt.title("log_price")
Text(0.5, 1.0, 'log_price')
plt.figure(figsize=(10,10))
plt.subplot(221)
plt.scatter(data_cleaned["livingSpace"], data_cleaned["baseRent"])
plt.title("baseRent")
plt.subplot(222)
plt.scatter(data_cleaned["livingSpace"], data_cleaned["log_price"])
plt.title("log_price")
Text(0.5, 1.0, 'log_price')
plt.figure(figsize=(10,10))
plt.subplot(221)
plt.scatter(data_cleaned["lift"], data_cleaned["baseRent"])
plt.title("baseRent")
plt.subplot(222)
plt.scatter(data_cleaned["lift"], data_cleaned["log_price"])
plt.title("log_price")
Text(0.5, 1.0, 'log_price')
plt.figure(figsize=(10,10))
plt.subplot(221)
plt.scatter(data_cleaned["geo_plz"], data_cleaned["baseRent"])
plt.title("baseRent")
plt.subplot(222)
plt.scatter(data_cleaned["geo_plz"], data_cleaned["log_price"])
plt.title("log_price")
Text(0.5, 1.0, 'log_price')
plt.figure(figsize=(10,10))
plt.subplot(221)
plt.scatter(data_cleaned["noRooms"], data_cleaned["baseRent"])
plt.title("baseRent")
plt.subplot(222)
plt.scatter(data_cleaned["noRooms"], data_cleaned["log_price"])
plt.title("log_price")
Text(0.5, 1.0, 'log_price')
data_lin = data_cleaned.drop(["baseRent"], axis=1)
3.2 Annahme Keine Multikollinearität¶
data_lin.columns.values
array(['serviceCharge', 'newlyConst', 'yearConstructed', 'hasKitchen', 'livingSpace', 'lift', 'geo_plz', 'noRooms', 'log_price'], dtype=object)
from statsmodels.stats.outliers_influence import variance_inflation_factor
variables = data_lin.drop(["log_price"], axis = 1)
vif = pd.DataFrame()
vif["VIF"]= [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif["Features"] = variables.columns
vif
VIF | Features | |
---|---|---|
0 | 10.349237 | serviceCharge |
1 | 1.310333 | newlyConst |
2 | 18.502446 | yearConstructed |
3 | 1.551697 | hasKitchen |
4 | 30.402962 | livingSpace |
5 | 1.676574 | lift |
6 | 3.080676 | geo_plz |
7 | 30.696412 | noRooms |
data_nomulti = data_lin.drop(["serviceCharge",
"yearConstructed",
"noRooms"], axis=1)
variables_2 = data_nomulti.drop(["log_price"], axis = 1)
vif = pd.DataFrame()
vif["VIF"]= [variance_inflation_factor(variables_2.values, i) for i in range(variables_2.shape[1])]
vif["Features"] = variables_2.columns
vif
VIF | Features | |
---|---|---|
0 | 1.295990 | newlyConst |
1 | 1.513116 | hasKitchen |
2 | 3.418712 | livingSpace |
3 | 1.560634 | lift |
4 | 2.920054 | geo_plz |
data_nomulti.head()
newlyConst | hasKitchen | livingSpace | lift | geo_plz | log_price | |
---|---|---|---|---|---|---|
0 | 0 | 0 | 86.00 | 0 | 44269 | 6.388561 |
1 | 0 | 0 | 89.00 | 0 | 67459 | 6.684612 |
2 | 1 | 0 | 83.80 | 1 | 1097 | 6.872128 |
3 | 0 | 0 | 58.15 | 0 | 9599 | 5.837730 |
4 | 0 | 0 | 84.97 | 0 | 28213 | 6.639876 |
3.3 Neue Indexierung¶
data_final = data_nomulti.reset_index(drop=True)
data_final.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 169772 entries, 0 to 169771 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 newlyConst 169772 non-null int64 1 hasKitchen 169772 non-null int64 2 livingSpace 169772 non-null float64 3 lift 169772 non-null int64 4 geo_plz 169772 non-null int64 5 log_price 169772 non-null float64 dtypes: float64(2), int64(4) memory usage: 7.8 MB
data_final.describe()
--------------------------------------------------------------------------- NameError Traceback (most recent call last) <ipython-input-11-00ab303c1af6> in <module> ----> 1 data_final.describe() NameError: name 'data_final' is not defined
3.4 Zusammenhänge der endgültigen Variablen für die Modellierung¶
sns.set_palette("GnBu_d")
sns.set_style('whitegrid')
sns.pairplot(data_final)
<seaborn.axisgrid.PairGrid at 0x102eb8a00>
targets = data_final["log_price"]
inputs = data_final.drop(["log_price"],axis=1)
Skalieren der Daten¶
Skalieren Sie den Datensatz mit scikit learn StandardScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(inputs)
inputs_scaled = scaler.transform(inputs)
Zug Test Split¶
Unterteilen Sie die Stichprobe in eine Trainings- und eine Testmenge. Definieren Sie eine Zufallsstichprobe von 20% als Testdaten mit dem Status 365
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(inputs_scaled, targets, test_size=0.2, random_state=365)
4.2 Erstellen des Regressionsmodells¶
Definieren Sie die Regression mit Testdaten und passen Sie das Modell an.
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train,y_train)
LinearRegression()
Anwendung des Regressionsmodells auf die Trainingsdaten¶
y_hat = reg.predict(x_train)
Auswertung des Trainingsdatenmodells¶
Vergleich der Vorhersagen mit den tatsächlichen Werten (Zielen)
plt.scatter(y_train, y_hat)
plt.xlabel('Targets',size=14)
plt.ylabel('Predictions',size=14)
plt.show()
Wahrscheinlichkeitsverteilungsfunktion der Abweichung der vorhergesagten Werte
sns.distplot(y_train - y_hat)
plt.title("Residuals PDF", size=18)
/Users/Jumana/opt/anaconda3/lib/python3.8/site-packages/seaborn/distributions.py:2551: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms). warnings.warn(msg, FutureWarning)
Text(0.5, 1.0, 'Residuals PDF')
R-Quadrat als Qualitätsmaß für das Modell
print('training performance:')
print(reg.score(x_train,y_train))
training performance: 0.674935582171114
print('reg intercept:')
print(reg.intercept_)
reg intercept: 6.29323730269735
Gewichtung der Prädiktoren
reg_summary = pd.DataFrame(inputs.columns.values, columns=['Features'])
reg_summary['Weights'] = reg.coef_
reg_summary
Features | Weights | |
---|---|---|
0 | newlyConst | 0.084699 |
1 | hasKitchen | 0.114759 |
2 | livingSpace | 0.307220 |
3 | lift | 0.087225 |
4 | geo_plz | 0.104857 |
Anwendung des Regressionsmodells auf die Testdaten¶
y_hat_test = reg.predict(x_test)
4.3 Auswertung¶
R-Quadrat als Qualitätsmaß für das Modell
Eine neue Indizierung zuweisen
y_test = y_test.reset_index(drop=True)
y_test.head()
0 6.577861 1 5.934894 2 6.696120 3 7.265430 4 5.646153 Name: log_price, dtype: float64
print('test performance')
print(reg.score(x_test,y_test))
test performance 0.6769948772700947
Zuordnung der Vorhersagen der ersten 50 Testdaten zu den tatsächlichen Werten (Zielvorgaben).
diag = plt.figure()
ax = diag.add_axes([0,0,2,2])
ax.plot(y_hat_test[:50], "b--")
ax.plot(y_test[:50], "green")
ax.legend(["Predictions","Targets"])
<matplotlib.legend.Legend at 0x1470414c0>
Vergleich der Vorhersagen mit den tatsächlichen Werten (Zielvorgaben)
plt.scatter(y_test, y_hat_test, alpha=0.1)
plt.xlabel("Targets",size=18)
plt.ylabel("Predictions",size=18)
plt.show()
Die voraussichtlichen Mietpreise anzeigen
df_pf = pd.DataFrame(np.exp(y_hat_test), columns=['Prediction'])
df_pf.head()
Prediction | |
---|---|
0 | 621.075400 |
1 | 340.960659 |
2 | 617.850377 |
3 | 1133.747688 |
4 | 325.762741 |
Vergleichen Sie die vorhergesagten Werte mit den tatsächlichen Werten
df_pf['Target'] = np.exp(y_test)
df_pf.head()
Prediction | Target | |
---|---|---|
0 | 621.075400 | 719.00 |
1 | 340.960659 | 378.00 |
2 | 617.850377 | 809.26 |
3 | 1133.747688 | 1430.00 |
4 | 325.762741 | 283.20 |
Berechnung der absoluten Restwerte
df_pf['Residual'] = df_pf['Target'] - df_pf['Prediction']
df_pf.head()
Prediction | Target | Residual | |
---|---|---|---|
0 | 621.075400 | 719.00 | 97.924600 |
1 | 340.960659 | 378.00 | 37.039341 |
2 | 617.850377 | 809.26 | 191.409623 |
3 | 1133.747688 | 1430.00 | 296.252312 |
4 | 325.762741 | 283.20 | -42.562741 |
Berechnung der relativen Residuen
df_pf['Difference%'] = np.absolute(df_pf['Residual']/df_pf['Target']*100)
df_pf.head()
Prediction | Target | Residual | Difference% | |
---|---|---|---|---|
0 | 621.075400 | 719.00 | 97.924600 | 13.619555 |
1 | 340.960659 | 378.00 | 37.039341 | 9.798767 |
2 | 617.850377 | 809.26 | 191.409623 | 23.652426 |
3 | 1133.747688 | 1430.00 | 296.252312 | 20.716945 |
4 | 325.762741 | 283.20 | -42.562741 | 15.029216 |
Deskriptive Statistik der vorhergesagten Werte, der tatsächlichen Werte, ihrer absoluten Residuen und der relativen Differenzen
df_pf.describe()
Prediction | Target | Residual | Difference% | |
---|---|---|---|---|
count | 33955.000000 | 33955.000000 | 33955.000000 | 33955.000000 |
mean | 601.365238 | 627.011027 | 25.645789 | 23.629761 |
std | 321.794265 | 370.254917 | 236.334701 | 19.533865 |
min | 218.137127 | 171.000000 | -1836.208033 | 0.000113 |
25% | 386.495694 | 350.000000 | -87.654706 | 9.147192 |
50% | 497.164793 | 500.000000 | -14.770987 | 19.292079 |
75% | 708.334011 | 799.000000 | 105.920747 | 33.114130 |
max | 3816.208033 | 2205.000000 | 1600.039097 | 428.115674 |
MAE, MSE and RMSE
from sklearn import metrics
print('MAE:', metrics.mean_absolute_error(y_test, y_hat_test))
print('MSE:', metrics.mean_squared_error(y_test, y_hat_test))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_hat_test)))
MAE: 0.2350186650540862 MSE: 0.09181514789246176 RMSE: 0.3030101448672334