1. Business Understanding
Hosts can set a price for their accommodation. However, hosts often do not know what the accommodation they offer is actually worth. It would therefore be helpful if Airbnb Inc. could calculate and suggest a market-appropriate price for listings. This case study examines the feasibility of such an automatic price calculation, i.e. whether Airbnb Inc. could provide intelligent pricing. The analysis examines the characteristics of a listing and determines which of them influence its price. The goal is to predict prices for accommodations booked in the future as accurately as possible.
2. Data Understanding
The data understanding section consists of a comprehensive overview of the dataset, followed by an exploratory data analysis focused on relevant features. The features are assessed for their suitability, and a feature selection is carried out. The Airbnb Berlin dataset consists of 16 columns and 22552 rows; the 16 columns hold the features. The feature data types are seven integers, six objects (five strings and one date) and three floats.
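The column-count/dtype breakdown stated above can be reproduced directly from the DataFrame. A minimal sketch on a hypothetical miniature frame (column names borrowed from the dataset, values invented):

```python
import pandas as pd

# Hypothetical miniature frame mirroring some column types of the Airbnb dataset:
# one integer, one float, and two object columns (a string and a date string).
df = pd.DataFrame({
    "id": [1, 2],
    "name": ["a", "b"],
    "latitude": [52.5, 52.6],
    "last_review": ["2018-10-28", "2018-11-04"],
})

# dtypes.value_counts() summarises how many columns share each dtype,
# the same breakdown that info() reports at the bottom of its output.
print(df.dtypes.value_counts())
```

On the full dataset, `raw_data.dtypes.value_counts()` yields the counts quoted in the text.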
2.1 Import libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import folium as fm
from folium import plugins # needs to be imported even if editor says this is unused
from scipy.stats import normaltest
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import keras
sns.set()
2.2 Read in the data
raw_data = pd.read_csv('https://storage.googleapis.com/ml-service-repository-datastorage/Accommondation_rating_data.csv')
raw_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 22552 entries, 0 to 22551 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 22552 non-null int64 1 name 22493 non-null object 2 host_id 22552 non-null int64 3 host_name 22526 non-null object 4 neighbourhood_group 22552 non-null object 5 neighbourhood 22552 non-null object 6 latitude 22552 non-null float64 7 longitude 22552 non-null float64 8 room_type 22552 non-null object 9 price 22552 non-null int64 10 minimum_nights 22552 non-null int64 11 number_of_reviews 22552 non-null int64 12 last_review 18644 non-null object 13 reviews_per_month 18638 non-null float64 14 calculated_host_listings_count 22552 non-null int64 15 availability_365 22552 non-null int64 dtypes: float64(3), int64(7), object(6) memory usage: 2.8+ MB
raw_data.shape
(22552, 16)
raw_data.count()
id 22552 name 22493 host_id 22552 host_name 22526 neighbourhood_group 22552 neighbourhood 22552 latitude 22552 longitude 22552 room_type 22552 price 22552 minimum_nights 22552 number_of_reviews 22552 last_review 18644 reviews_per_month 18638 calculated_host_listings_count 22552 availability_365 22552 dtype: int64
raw_data.head()
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015 | Berlin-Mitte Value! Quiet courtyard/very central | 2217 | Ian | Mitte | Brunnenstr. Süd | 52.534537 | 13.402557 | Entire home/apt | 60 | 4 | 118 | 2018-10-28 | 3.76 | 4 | 141 |
1 | 2695 | Prenzlauer Berg close to Mauerpark | 2986 | Michael | Pankow | Prenzlauer Berg Nordwest | 52.548513 | 13.404553 | Private room | 17 | 2 | 6 | 2018-10-01 | 1.42 | 1 | 0 |
2 | 3176 | Fabulous Flat in great Location | 3718 | Britta | Pankow | Prenzlauer Berg Südwest | 52.534996 | 13.417579 | Entire home/apt | 90 | 62 | 143 | 2017-03-20 | 1.25 | 1 | 220 |
3 | 3309 | BerlinSpot Schöneberg near KaDeWe | 4108 | Jana | Tempelhof - Schöneberg | Schöneberg-Nord | 52.498855 | 13.349065 | Private room | 26 | 5 | 25 | 2018-08-16 | 0.39 | 1 | 297 |
4 | 7071 | BrightRoom with sunny greenview! | 17391 | Bright | Pankow | Helmholtzplatz | 52.543157 | 13.415091 | Private room | 42 | 2 | 197 | 2018-11-04 | 1.75 | 1 | 26 |
raw_data.dtypes
id int64 name object host_id int64 host_name object neighbourhood_group object neighbourhood object latitude float64 longitude float64 room_type object price int64 minimum_nights int64 number_of_reviews int64 last_review object reviews_per_month float64 calculated_host_listings_count int64 availability_365 int64 dtype: object
2.3 Descriptive analysis
raw_data.describe(include='all')
id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2.255200e+04 | 22493 | 2.255200e+04 | 22526 | 22552 | 22552 | 22552.000000 | 22552.000000 | 22552 | 22552.000000 | 22552.000000 | 22552.000000 | 18644 | 18638.000000 | 22552.000000 | 22552.000000 |
unique | NaN | 21873 | NaN | 5997 | 12 | 136 | NaN | NaN | 3 | NaN | NaN | NaN | 1312 | NaN | NaN | NaN |
top | NaN | Berlin Wohnung | NaN | Anna | Friedrichshain-Kreuzberg | Tempelhofer Vorstadt | NaN | NaN | Private room | NaN | NaN | NaN | 2018-11-04 | NaN | NaN | NaN |
freq | NaN | 14 | NaN | 216 | 5497 | 1325 | NaN | NaN | 11534 | NaN | NaN | NaN | 618 | NaN | NaN | NaN |
mean | 1.571560e+07 | NaN | 5.403355e+07 | NaN | NaN | NaN | 52.509824 | 13.406107 | NaN | 67.143668 | 7.157059 | 17.840679 | NaN | 1.135525 | 1.918233 | 79.852829 |
std | 8.552069e+06 | NaN | 5.816290e+07 | NaN | NaN | NaN | 0.030825 | 0.057964 | NaN | 220.266210 | 40.665073 | 36.769624 | NaN | 1.507082 | 3.667257 | 119.368162 |
min | 2.015000e+03 | NaN | 2.217000e+03 | NaN | NaN | NaN | 52.345803 | 13.103557 | NaN | 0.000000 | 1.000000 | 0.000000 | NaN | 0.010000 | 1.000000 | 0.000000 |
25% | 8.065954e+06 | NaN | 9.240002e+06 | NaN | NaN | NaN | 52.489065 | 13.375411 | NaN | 30.000000 | 2.000000 | 1.000000 | NaN | 0.180000 | 1.000000 | 0.000000 |
50% | 1.686638e+07 | NaN | 3.126711e+07 | NaN | NaN | NaN | 52.509079 | 13.416779 | NaN | 45.000000 | 2.000000 | 5.000000 | NaN | 0.540000 | 1.000000 | 4.000000 |
75% | 2.258393e+07 | NaN | 8.067518e+07 | NaN | NaN | NaN | 52.532669 | 13.439259 | NaN | 70.000000 | 4.000000 | 16.000000 | NaN | 1.500000 | 1.000000 | 129.000000 |
max | 2.986735e+07 | NaN | 2.245081e+08 | NaN | NaN | NaN | 52.651670 | 13.757642 | NaN | 9000.000000 | 5000.000000 | 498.000000 | NaN | 36.670000 | 45.000000 | 365.000000 |
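The summary statistics already hint at a heavily right-skewed price column (mean about 67, std about 220, max 9000). The skew can be quantified directly; a minimal sketch on a hypothetical sample with one extreme listing:

```python
import pandas as pd

# Hypothetical price sample with a single extreme value,
# mimicking the long right tail visible in describe().
prices = pd.Series([30, 45, 45, 60, 70, 90, 9000])

# A skewness clearly above 0 indicates a right-skewed distribution.
print(prices.skew())

# Upper quantiles show how far the tail stretches beyond the median.
print(prices.quantile([0.5, 0.95, 0.99]))
```

On the real data, `raw_data['price'].skew()` would quantify the tail that motivates the outlier treatment later in this notebook.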
Exploratory data analysis
This chapter analyses and visualises the available data.
features = ["price", "neighbourhood_group","number_of_reviews", "reviews_per_month","last_review","minimum_nights","calculated_host_listings_count","availability_365"]
ax = sns.pairplot(raw_data[features], hue="neighbourhood_group")
ax.fig.suptitle("Comparison of the pairwise relationships (hue = neighbourhood_group)", y=1.04)
Text(0.5, 1.04, 'Comparison of the pairwise relationships (hue = neighbourhood_group)')
features = ["price","number_of_reviews", "reviews_per_month","last_review","minimum_nights","calculated_host_listings_count","availability_365","room_type"]
ay = sns.pairplot(raw_data[features], hue="room_type")
ay.fig.suptitle("Comparison of the pairwise relationships (hue = room_type)", y=1.03)
Text(0.5, 1.03, 'Comparison of the pairwise relationships (hue = room_type)')
Viewing: numeric features and date
# First the date is converted into a numerical value,
# then the correlation of the numerical features is examined.
# The focus is on the correlation with the output feature, the price.
# Create a temporary DataFrame (a copy, so raw_data is not modified in place)
# with the date as a numeric value.
temp_raw_data = raw_data.copy()
temp_raw_data.last_review = pd.to_datetime(temp_raw_data.last_review)
temp_raw_data.last_review = pd.to_numeric(temp_raw_data.last_review)
# Correlation: heatmap
plt.figure(figsize=(16, 6))
heatmap = sns.heatmap(temp_raw_data.corr(), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Heatmap', fontdict={'fontsize':18}, pad=12)
Text(0.5, 1.0, 'Heatmap')
#Heatmap with focus on the price
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(temp_raw_data.corr()[['price']].sort_values(by='price', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Feature correlation with the price', fontdict={'fontsize':18}, pad=16)
Text(0.5, 1.0, 'Feature correlation with the price')
# Looking at the numerical features, it is noticeable that there is hardly any correlation with the price.
# In the next step, the categorical variables are examined with regard to their correlation with the price.
raw_data.columns
Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365'], dtype='object')
temp_test_date = raw_data.copy()
temp_test_date.last_review = pd.to_datetime(temp_test_date.last_review)
temp_test_date.last_review = pd.to_numeric(temp_test_date.last_review)
print("Correlation value between last_review and price: {:.6f}".format(temp_test_date.corr().loc['last_review', 'price']))
Correlation value between last_review and price: -0.046731
raw_data.hist(figsize=(25,25), bins=100)
array([[<AxesSubplot:title={'center':'id'}>, <AxesSubplot:title={'center':'host_id'}>, <AxesSubplot:title={'center':'latitude'}>], [<AxesSubplot:title={'center':'longitude'}>, <AxesSubplot:title={'center':'price'}>, <AxesSubplot:title={'center':'minimum_nights'}>], [<AxesSubplot:title={'center':'number_of_reviews'}>, <AxesSubplot:title={'center':'last_review'}>, <AxesSubplot:title={'center':'reviews_per_month'}>], [<AxesSubplot:title={'center':'calculated_host_listings_count'}>, <AxesSubplot:title={'center':'availability_365'}>, <AxesSubplot:>]], dtype=object)
Consideration: categorical variables
# The consideration of the categorical variables (including longitude and latitude) follows.
Viewing: neighbourhood_group and price
raw_data['neighbourhood_group'].describe()
count 22552 unique 12 top Friedrichshain-Kreuzberg freq 5497 Name: neighbourhood_group, dtype: object
raw_data['neighbourhood_group'].unique()
array(['Mitte', 'Pankow', 'Tempelhof - Schöneberg', 'Friedrichshain-Kreuzberg', 'Neukölln', 'Charlottenburg-Wilm.', 'Treptow - Köpenick', 'Steglitz - Zehlendorf', 'Reinickendorf', 'Lichtenberg', 'Marzahn - Hellersdorf', 'Spandau'], dtype=object)
raw_data['neighbourhood_group'].value_counts()
Friedrichshain-Kreuzberg 5497 Mitte 4631 Pankow 3541 Neukölln 3499 Charlottenburg-Wilm. 1592 Tempelhof - Schöneberg 1560 Lichtenberg 688 Treptow - Köpenick 595 Steglitz - Zehlendorf 437 Reinickendorf 247 Marzahn - Hellersdorf 141 Spandau 124 Name: neighbourhood_group, dtype: int64
raw_data.neighbourhood_group = raw_data.neighbourhood_group.str.replace(" ", "")
raw_data.neighbourhood_group = raw_data.neighbourhood_group.str.replace("Charlottenburg-Wilm.", "Charlottenburg-Wilmersdorf", regex=False)
raw_data['neighbourhood_group'].unique()
array(['Mitte', 'Pankow', 'Tempelhof-Schöneberg', 'Friedrichshain-Kreuzberg', 'Neukölln', 'Charlottenburg-Wilmersdorf', 'Treptow-Köpenick', 'Steglitz-Zehlendorf', 'Reinickendorf', 'Lichtenberg', 'Marzahn-Hellersdorf', 'Spandau'], dtype=object)
# Consideration of the price median per city district.
# The geojson file used here was uploaded to my GitHub:
# https://github.com/LeaCorinna/AI
berlin_lat = raw_data.latitude.mean()
berlin_long = raw_data.longitude.mean()
colors = ["#3333DD", "#B00000"]
berlin_map = fm.Map(location=[berlin_lat, berlin_long], zoom_start=11)
berlin_boroughs = "https://raw.githubusercontent.com/LeaCorinna/AI/main/berlin.geojson"
berlin_price = raw_data.groupby(by="neighbourhood_group").median().reset_index()
fm.Choropleth(
    geo_data=berlin_boroughs,
    name='choropleth',
    data=berlin_price,
    columns=['neighbourhood_group', 'price'],
    key_on='feature.properties.name',
    fill_color='RdBu_r',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Median Price (Euro)"
).add_to(berlin_map)
fm.LayerControl().add_to(berlin_map)
berlin_map
# Consideration of the price average per city district.
# The geojson file used here was uploaded to my GitHub:
# https://github.com/LeaCorinna/AI
berlin_map = fm.Map(location=[berlin_lat, berlin_long], zoom_start=11)
berlin_boroughs = "https://raw.githubusercontent.com/LeaCorinna/AI/main/berlin.geojson"
berlin_price = raw_data.groupby(by="neighbourhood_group").mean().reset_index()
fm.Choropleth(
    geo_data=berlin_boroughs,
    name='choropleth',
    data=berlin_price,
    columns=['neighbourhood_group', 'price'],
    key_on='feature.properties.name',
    fill_color='RdBu_r',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="Mean Price (Euro)"
).add_to(berlin_map)
fm.LayerControl().add_to(berlin_map)
berlin_map
# Looking at absolute price differences across the city districts (neighbourhood_group).
plt.figure(figsize=(40,10))
sns.barplot(x=raw_data['neighbourhood_group'], y=raw_data['price'], palette=sns.color_palette('magma', n_colors=12))
<AxesSubplot:xlabel='neighbourhood_group', ylabel='price'>
Viewing: neighbourhood and price
raw_data.neighbourhood.describe()
count 22552 unique 136 top Tempelhofer Vorstadt freq 1325 Name: neighbourhood, dtype: object
# Determine the average price per neighbourhood
raw_data.groupby('neighbourhood').price.mean()
neighbourhood Adlershof 48.481481 Albrechtstr. 45.205882 Alexanderplatz 93.199817 Allende-Viertel 41.666667 Alt Treptow 53.852071 ... Wilhelmstadt 44.454545 Zehlendorf Nord 72.363636 Zehlendorf Südwest 88.450000 nördliche Luisenstadt 61.613636 südliche Luisenstadt 59.598756 Name: price, Length: 136, dtype: float64
plt.figure(figsize=(14,100))
sns.barplot(x=raw_data['price'], y=raw_data['neighbourhood'], palette=sns.color_palette('magma', n_colors=6))
<AxesSubplot:xlabel='price', ylabel='neighbourhood'>
Viewing: room_type and neighbourhood_group
colors = ["purple", "green", "lightblue"]
raw_data.groupby(by="room_type").count().id.plot(kind="bar", color=colors)
<AxesSubplot:xlabel='room_type'>
raw_data.groupby('room_type').price.mean()
room_type Entire home/apt 83.348909 Private room 52.479105 Shared room 51.564189 Name: price, dtype: float64
plt.figure(figsize=(27,9))
plt.title("Accommodation offer in the districts", fontsize=20)
sns.countplot(x=raw_data.neighbourhood_group, hue=raw_data["room_type"], palette=sns.color_palette('magma', n_colors=3))
plt.show()
sns.jointplot(x="room_type", y="price", data=raw_data)
plt.show()
# Heatmap: Entire home/apt
heatmap = fm.Map(location=[berlin_lat, berlin_long], zoom_start=11)
heatmap.add_child(fm.plugins.HeatMap(raw_data[raw_data.room_type == "Entire home/apt"][['latitude', 'longitude']].values, radius=15))
heatmap
# Heatmap: Private room
heatmap = fm.Map(location=[berlin_lat, berlin_long], zoom_start=11)
heatmap.add_child(fm.plugins.HeatMap(raw_data[raw_data.room_type == "Private room"][['latitude', 'longitude']].values, radius=15))
heatmap
# Heatmap: Shared room
heatmap = fm.Map(location=[berlin_lat, berlin_long], zoom_start=11)
heatmap.add_child(fm.plugins.HeatMap(raw_data[raw_data.room_type == "Shared room"][['latitude', 'longitude']].values, radius=15))
heatmap
2.4 Data cleaning
Initial read-in errors should be corrected here before the actual data preparation takes place.
Missing values
raw_data.isnull().sum()
id 0 name 59 host_id 0 host_name 26 neighbourhood_group 0 neighbourhood 0 latitude 0 longitude 0 room_type 0 price 0 minimum_nights 0 number_of_reviews 0 last_review 0 reviews_per_month 3914 calculated_host_listings_count 0 availability_365 0 dtype: int64
# The features that contain null values turned out not to be needed for the modelling.
# Therefore, the affected columns rather than the corresponding rows were removed.
data_no_mv = raw_data.drop(['host_name','name','last_review','reviews_per_month'], axis=1)
data_no_mv.isnull().sum()
id 0 host_id 0 neighbourhood_group 0 neighbourhood 0 latitude 0 longitude 0 room_type 0 price 0 minimum_nights 0 number_of_reviews 0 calculated_host_listings_count 0 availability_365 0 dtype: int64
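Dropping the affected columns is one option. If `reviews_per_month` were needed later, a hedged alternative would be to interpret a missing value as "never reviewed" and fill it with 0. A minimal sketch on a hypothetical excerpt (invented values):

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt: NaN in reviews_per_month corresponds to a listing
# that has never been reviewed, so 0 is a defensible fill value
# instead of dropping the whole column.
df = pd.DataFrame({
    "price": [60, 17, 90],
    "reviews_per_month": [3.76, np.nan, 1.25],
})

df["reviews_per_month"] = df["reviews_per_month"].fillna(0)
print(df["reviews_per_month"].isnull().sum())  # no missing values remain
```

Which choice is better depends on whether the review features carry signal for the price model; here the columns were judged unnecessary and dropped.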
# format of the DataFrame
data_no_mv.shape
(22552, 12)
data_no_mv.describe(include ='all')
id | host_id | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2.255200e+04 | 2.255200e+04 | 22552 | 22552 | 22552.000000 | 22552.000000 | 22552 | 22552.000000 | 22552.000000 | 22552.000000 | 22552.000000 | 22552.000000 |
unique | NaN | NaN | 12 | 136 | NaN | NaN | 3 | NaN | NaN | NaN | NaN | NaN |
top | NaN | NaN | Friedrichshain-Kreuzberg | Tempelhofer Vorstadt | NaN | NaN | Private room | NaN | NaN | NaN | NaN | NaN |
freq | NaN | NaN | 5497 | 1325 | NaN | NaN | 11534 | NaN | NaN | NaN | NaN | NaN |
mean | 1.571560e+07 | 5.403355e+07 | NaN | NaN | 52.509824 | 13.406107 | NaN | 67.143668 | 7.157059 | 17.840679 | 1.918233 | 79.852829 |
std | 8.552069e+06 | 5.816290e+07 | NaN | NaN | 0.030825 | 0.057964 | NaN | 220.266210 | 40.665073 | 36.769624 | 3.667257 | 119.368162 |
min | 2.015000e+03 | 2.217000e+03 | NaN | NaN | 52.345803 | 13.103557 | NaN | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
25% | 8.065954e+06 | 9.240002e+06 | NaN | NaN | 52.489065 | 13.375411 | NaN | 30.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 |
50% | 1.686638e+07 | 3.126711e+07 | NaN | NaN | 52.509079 | 13.416779 | NaN | 45.000000 | 2.000000 | 5.000000 | 1.000000 | 4.000000 |
75% | 2.258393e+07 | 8.067518e+07 | NaN | NaN | 52.532669 | 13.439259 | NaN | 70.000000 | 4.000000 | 16.000000 | 1.000000 | 129.000000 |
max | 2.986735e+07 | 2.245081e+08 | NaN | NaN | 52.651670 | 13.757642 | NaN | 9000.000000 | 5000.000000 | 498.000000 | 45.000000 | 365.000000 |
Check for duplicates
# The dataset contains no duplicate rows.
data_no_mv.duplicated().sum()
0
Incorrect values
## A minimum stay of 5000 nights (more than 13 years) is considered erroneous.
## The corresponding rows are deleted.
data_no_mv.drop(data_no_mv.loc[data_no_mv['minimum_nights']>=5000].index, inplace=True)
data_no_mv.describe()
id | host_id | latitude | longitude | price | minimum_nights | number_of_reviews | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|
count | 2.255100e+04 | 2.255100e+04 | 22551.000000 | 22551.000000 | 22551.000000 | 22551.000000 | 22551.000000 | 22551.000000 | 22551.000000 |
mean | 1.571595e+07 | 5.403410e+07 | 52.509823 | 13.406110 | 67.135338 | 6.935657 | 17.841293 | 1.918230 | 79.840495 |
std | 8.552100e+06 | 5.816413e+07 | 0.030826 | 0.057963 | 220.267542 | 23.413599 | 36.770324 | 3.667338 | 119.356437 |
min | 2.015000e+03 | 2.217000e+03 | 52.345803 | 13.103557 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
25% | 8.066394e+06 | 9.239060e+06 | 52.489065 | 13.375413 | 30.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 |
50% | 1.686665e+07 | 3.126622e+07 | 52.509078 | 13.416780 | 45.000000 | 2.000000 | 5.000000 | 1.000000 | 4.000000 |
75% | 2.258403e+07 | 8.067518e+07 | 52.532669 | 13.439259 | 70.000000 | 4.000000 | 16.000000 | 1.000000 | 129.000000 |
max | 2.986735e+07 | 2.245081e+08 | 52.651670 | 13.757642 | 9000.000000 | 1000.000000 | 498.000000 | 45.000000 | 365.000000 |
## An overnight rate of 0 or 1 Euro is considered erroneous.
## The corresponding rows are deleted.
data_no_mv = data_no_mv[data_no_mv.price > 1]
data_no_mv.describe(include ='all')
id | host_id | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2.254200e+04 | 2.254200e+04 | 22542 | 22542 | 22542.000000 | 22542.000000 | 22542 | 22542.000000 | 22542.000000 | 22542.000000 | 22542.000000 | 22542.000000 |
unique | NaN | NaN | 12 | 136 | NaN | NaN | 3 | NaN | NaN | NaN | NaN | NaN |
top | NaN | NaN | Friedrichshain-Kreuzberg | Tempelhofer Vorstadt | NaN | NaN | Private room | NaN | NaN | NaN | NaN | NaN |
freq | NaN | NaN | 5493 | 1324 | NaN | NaN | 11528 | NaN | NaN | NaN | NaN | NaN |
mean | 1.571401e+07 | 5.403000e+07 | NaN | NaN | 52.509825 | 13.406102 | NaN | 67.162097 | 6.937583 | 17.839899 | 1.917443 | 79.840786 |
std | 8.553230e+06 | 5.816590e+07 | NaN | NaN | 0.030825 | 0.057964 | NaN | 220.307438 | 23.418064 | 36.772123 | 3.666631 | 119.356156 |
min | 2.015000e+03 | 2.217000e+03 | NaN | NaN | 52.345803 | 13.103557 | NaN | 8.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 |
25% | 8.065068e+06 | 9.236164e+06 | NaN | NaN | 52.489064 | 13.375412 | NaN | 30.000000 | 2.000000 | 1.000000 | 1.000000 | 0.000000 |
50% | 1.686508e+07 | 3.126015e+07 | NaN | NaN | 52.509071 | 13.416779 | NaN | 45.000000 | 2.000000 | 5.000000 | 1.000000 | 4.000000 |
75% | 2.258574e+07 | 8.064741e+07 | NaN | NaN | 52.532670 | 13.439256 | NaN | 70.000000 | 4.000000 | 16.000000 | 1.000000 | 129.000000 |
max | 2.986735e+07 | 2.245081e+08 | NaN | NaN | 52.651670 | 13.757642 | NaN | 9000.000000 | 1000.000000 | 498.000000 | 45.000000 | 365.000000 |
Implementation: iterative feature selection
data_no_mv.head()
id | host_id | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2015 | 2217 | Mitte | Brunnenstr. Süd | 52.534537 | 13.402557 | Entire home/apt | 60 | 4 | 118 | 4 | 141 |
1 | 2695 | 2986 | Pankow | Prenzlauer Berg Nordwest | 52.548513 | 13.404553 | Private room | 17 | 2 | 6 | 1 | 0 |
2 | 3176 | 3718 | Pankow | Prenzlauer Berg Südwest | 52.534996 | 13.417579 | Entire home/apt | 90 | 62 | 143 | 1 | 220 |
3 | 3309 | 4108 | Tempelhof-Schöneberg | Schöneberg-Nord | 52.498855 | 13.349065 | Private room | 26 | 5 | 25 | 1 | 297 |
4 | 7071 | 17391 | Pankow | Helmholtzplatz | 52.543157 | 13.415091 | Private room | 42 | 2 | 197 | 1 | 26 |
# Important: intermediate result.
# The dataset data_feature_auswahl will be used later for the Random Forest regression.
data_feature_auswahl = data_no_mv.drop(['id','host_id','latitude', 'longitude','number_of_reviews','neighbourhood','minimum_nights'],axis=1)
data_feature_auswahl.head()
neighbourhood_group | room_type | price | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|
0 | Mitte | Entire home/apt | 60 | 4 | 141 |
1 | Pankow | Private room | 17 | 1 | 0 |
2 | Pankow | Entire home/apt | 90 | 1 | 220 |
3 | Tempelhof-Schöneberg | Private room | 26 | 1 | 297 |
4 | Pankow | Private room | 42 | 1 | 26 |
Outliers
# Truncating at the 99% quantile is unproblematic; this can be tightened down to the 95% quantile.
# Procedure: calculate the quantile boundary, then filter the dataset.
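The truncation pattern applied repeatedly in the following subsections (compute a quantile boundary, then filter the DataFrame) can be captured in a small helper. A sketch on synthetic data; `truncate_at_quantile` is a hypothetical name, not part of the original notebook:

```python
import pandas as pd

def truncate_at_quantile(df, column, q=0.99):
    """Keep only the rows whose `column` value lies below the q-quantile."""
    bound = df[column].quantile(q)
    return df[df[column] < bound]

# Usage on a hypothetical frame with prices 1..100:
df = pd.DataFrame({"price": list(range(1, 101))})
trimmed = truncate_at_quantile(df, "price", q=0.95)
print(len(trimmed), trimmed["price"].max())
```

The same helper would express the `price`, `calculated_host_listings_count` and `availability_365` truncations below with one call each.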
Price
# It is particularly important that the target variable is approximately normally distributed,
# which is why the target variable is adjusted as well as possible.
sns.histplot(data_feature_auswahl['price'], kde=True)
plt.title("Price distribution")
Text(0.5, 1.0, 'Price distribution')
# First: truncation at the 99% quantile.
q = data_feature_auswahl['price'].quantile(0.99)
data_1 = data_feature_auswahl[data_feature_auswahl['price']<q]
sns.histplot(data_1['price'], kde=True)
plt.title("Price distribution (99%)")
Text(0.5, 1.0, 'Price distribution (99%)')
qqplot(data_1['price'], line='s')
plt.show()
## Truncation at the 99% quantile does not yet approximate a normal distribution,
## so truncate at the 95% quantile to get closer to one.
q = data_feature_auswahl['price'].quantile(0.95)
data_2 = data_feature_auswahl[data_feature_auswahl['price']<q]
sns.histplot(data_2['price'], kde=True)
plt.title("Price distribution (95%)")
Text(0.5, 1.0, 'Price distribution (95%)')
qqplot(data_2['price'], line='s')
plt.show()
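The Q-Q plot can be complemented with the D'Agostino-Pearson test (`normaltest`, imported at the top of the notebook but not used so far). A sketch on seeded synthetic samples rather than the actual `data_2['price']`:

```python
import numpy as np
from scipy.stats import normaltest

# D'Agostino-Pearson test: the null hypothesis is that the sample was drawn
# from a normal distribution. A small p-value (e.g. < 0.05) rejects normality.
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=50, scale=10, size=2000)   # roughly normal
skewed_sample = rng.exponential(scale=50, size=2000)      # heavily right-skewed

stat_n, p_n = normaltest(normal_sample)
stat_s, p_s = normaltest(skewed_sample)
print(f"normal sample:  p-value = {p_n:.4f}")
print(f"skewed sample:  p-value = {p_s:.4g}")  # tiny p-value, normality rejected
```

Applied to `data_2['price']`, this would give a numeric complement to the visual Q-Q plot assessment above.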
ax = sns.boxplot(x=data_2["price"])
calculated_host_listings_count
sns.histplot(data_2['calculated_host_listings_count'], kde=True)
plt.title("Distribution of host listings")
Text(0.5, 1.0, 'Distribution of host listings')
# Truncating at the 99% quantile is unproblematic; here the 97% quantile is used,
# since truncation at the 99% quantile does not yet approximate a normal distribution.
q = data_2['calculated_host_listings_count'].quantile(0.97)
data_3 = data_2[data_2['calculated_host_listings_count']<q]
sns.histplot(data_3['calculated_host_listings_count'], kde=True)
plt.title("Distribution of host listings (97%)")
Text(0.5, 1.0, 'Distribution of host listings (97%)')
qqplot(data_3['calculated_host_listings_count'], line='s')
plt.show()
availability_365
sns.histplot(data_3['availability_365'], kde=True)
plt.title("Distribution: accommodation availability")
Text(0.5, 1.0, 'Distribution: accommodation availability')
# Truncating at the 99% quantile is unproblematic; here the 96% quantile is used,
# since truncation at the 99% quantile does not yet approximate a normal distribution.
q = data_3['availability_365'].quantile(0.96)
data_4 = data_3[data_3['availability_365']<q]
sns.histplot(data_4['availability_365'], kde=True)
plt.title("Distribution: accommodation availability (96%)")
Text(0.5, 1.0, 'Distribution: accommodation availability (96%)')
qqplot(data_4['availability_365'], line='s')
plt.show()
data_4.describe(include='all')
neighbourhood_group | room_type | price | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|
count | 19820 | 19820 | 19820.000000 | 19820.00000 | 19820.000000 |
unique | 12 | 3 | NaN | NaN | NaN |
top | Friedrichshain-Kreuzberg | Private room | NaN | NaN | NaN |
freq | 4964 | 10686 | NaN | NaN | NaN |
mean | NaN | NaN | 49.799950 | 1.29773 | 58.003380 |
std | NaN | NaN | 25.553927 | 0.75429 | 98.235961 |
min | NaN | NaN | 8.000000 | 1.00000 | 0.000000 |
25% | NaN | NaN | 30.000000 | 1.00000 | 0.000000 |
50% | NaN | NaN | 45.000000 | 1.00000 | 0.000000 |
75% | NaN | NaN | 65.000000 | 1.00000 | 72.000000 |
max | NaN | NaN | 139.000000 | 6.00000 | 340.000000 |
3. Data Preparation
3.1 Reset the index
berlin_airbnb = data_4.reset_index(drop=True)
berlin_airbnb.describe(include = 'all')
neighbourhood_group | room_type | price | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|
count | 19820 | 19820 | 19820.000000 | 19820.00000 | 19820.000000 |
unique | 12 | 3 | NaN | NaN | NaN |
top | Friedrichshain-Kreuzberg | Private room | NaN | NaN | NaN |
freq | 4964 | 10686 | NaN | NaN | NaN |
mean | NaN | NaN | 49.799950 | 1.29773 | 58.003380 |
std | NaN | NaN | 25.553927 | 0.75429 | 98.235961 |
min | NaN | NaN | 8.000000 | 1.00000 | 0.000000 |
25% | NaN | NaN | 30.000000 | 1.00000 | 0.000000 |
50% | NaN | NaN | 45.000000 | 1.00000 | 0.000000 |
75% | NaN | NaN | 65.000000 | 1.00000 | 72.000000 |
max | NaN | NaN | 139.000000 | 6.00000 | 340.000000 |
berlin_airbnb.shape
(19820, 5)
berlin_airbnb
neighbourhood_group | room_type | price | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|
0 | Mitte | Entire home/apt | 60 | 4 | 141 |
1 | Pankow | Private room | 17 | 1 | 0 |
2 | Pankow | Entire home/apt | 90 | 1 | 220 |
3 | Tempelhof-Schöneberg | Private room | 26 | 1 | 297 |
4 | Pankow | Private room | 42 | 1 | 26 |
... | ... | ... | ... | ... | ... |
19815 | Mitte | Entire home/apt | 60 | 1 | 314 |
19816 | Tempelhof-Schöneberg | Shared room | 20 | 6 | 78 |
19817 | Pankow | Entire home/apt | 85 | 2 | 15 |
19818 | Mitte | Private room | 99 | 3 | 6 |
19819 | Neukölln | Private room | 45 | 1 | 21 |
19820 rows × 5 columns
3.2 Checking the OLS Assumptions¶
3.2.1 Linearity Assumption¶
# Check linearity of price and calculated_host_listings_count
plt.scatter(berlin_airbnb['calculated_host_listings_count'],berlin_airbnb['price'])
plt.scatter(berlin_airbnb['availability_365'],berlin_airbnb['price'])
# The OLS linearity assumption is violated here.
# A possible remedy is a log transformation.
# A log transformation is generally recommended when the data is exponentially distributed.
# An exponential distribution cannot be established here either.
# However, iterative testing shows that the accuracy of the multiple linear regression
# improves when the model is fit on the log-transformed data.
# ==> The price is therefore log-transformed.
log_price = np.log(berlin_airbnb['price'])
berlin_airbnb['log_price'] = log_price
berlin_airbnb
neighbourhood_group | room_type | price | calculated_host_listings_count | availability_365 | log_price | |
---|---|---|---|---|---|---|
0 | Mitte | Entire home/apt | 60 | 4 | 141 | 4.094345 |
1 | Pankow | Private room | 17 | 1 | 0 | 2.833213 |
2 | Pankow | Entire home/apt | 90 | 1 | 220 | 4.499810 |
3 | Tempelhof-Schöneberg | Private room | 26 | 1 | 297 | 3.258097 |
4 | Pankow | Private room | 42 | 1 | 26 | 3.737670 |
... | ... | ... | ... | ... | ... | ... |
19815 | Mitte | Entire home/apt | 60 | 1 | 314 | 4.094345 |
19816 | Tempelhof-Schöneberg | Shared room | 20 | 6 | 78 | 2.995732 |
19817 | Pankow | Entire home/apt | 85 | 2 | 15 | 4.442651 |
19818 | Mitte | Private room | 99 | 3 | 6 | 4.595120 |
19819 | Neukölln | Private room | 45 | 1 | 21 | 3.806662 |
19820 rows × 6 columns
plt.scatter(berlin_airbnb['calculated_host_listings_count'],berlin_airbnb['log_price'])
plt.scatter(berlin_airbnb['availability_365'],berlin_airbnb['log_price'])
berlin_airbnb = berlin_airbnb.drop(['price'], axis=1)
3.2.2 "No Multicollinearity" Assumption¶
berlin_airbnb.columns.values
array(['neighbourhood_group', 'room_type', 'calculated_host_listings_count', 'availability_365', 'log_price'], dtype=object)
# All VIF values are < 10 -> no multicollinearity
variables = berlin_airbnb[['calculated_host_listings_count','availability_365','log_price']]
vif = pd.DataFrame()
vif['VIF']= [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif['Features'] = variables.columns
vif
VIF | Features | |
---|---|---|
0 | 4.021130 | calculated_host_listings_count |
1 | 1.440053 | availability_365 |
2 | 3.923877 | log_price |
Checking for Singleton Values in neighbourhood_group¶
# No category occurs only once, so nothing has to be removed.
berlin_airbnb.groupby('neighbourhood_group').count()
room_type | calculated_host_listings_count | availability_365 | log_price | |
---|---|---|---|---|
neighbourhood_group | ||||
Charlottenburg-Wilmersdorf | 1302 | 1302 | 1302 | 1302 |
Friedrichshain-Kreuzberg | 4964 | 4964 | 4964 | 4964 |
Lichtenberg | 625 | 625 | 625 | 625 |
Marzahn-Hellersdorf | 110 | 110 | 110 | 110 |
Mitte | 3848 | 3848 | 3848 | 3848 |
Neukölln | 3313 | 3313 | 3313 | 3313 |
Pankow | 3110 | 3110 | 3110 | 3110 |
Reinickendorf | 206 | 206 | 206 | 206 |
Spandau | 78 | 78 | 78 | 78 |
Steglitz-Zehlendorf | 368 | 368 | 368 | 368 |
Tempelhof-Schöneberg | 1357 | 1357 | 1357 | 1357 |
Treptow-Köpenick | 539 | 539 | 539 | 539 |
3.3 Create Dummy Variables¶
berlin_airbnb.head()
neighbourhood_group | room_type | calculated_host_listings_count | availability_365 | log_price | |
---|---|---|---|---|---|
0 | Mitte | Entire home/apt | 4 | 141 | 4.094345 |
1 | Pankow | Private room | 1 | 0 | 2.833213 |
2 | Pankow | Entire home/apt | 1 | 220 | 4.499810 |
3 | Tempelhof-Schöneberg | Private room | 1 | 297 | 3.258097 |
4 | Pankow | Private room | 1 | 26 | 3.737670 |
berlin_airbnb_with_dummies=pd.get_dummies(berlin_airbnb, drop_first=True)
berlin_airbnb_with_dummies.head()
calculated_host_listings_count | availability_365 | log_price | neighbourhood_group_Friedrichshain-Kreuzberg | neighbourhood_group_Lichtenberg | neighbourhood_group_Marzahn-Hellersdorf | neighbourhood_group_Mitte | neighbourhood_group_Neukölln | neighbourhood_group_Pankow | neighbourhood_group_Reinickendorf | neighbourhood_group_Spandau | neighbourhood_group_Steglitz-Zehlendorf | neighbourhood_group_Tempelhof-Schöneberg | neighbourhood_group_Treptow-Köpenick | room_type_Private room | room_type_Shared room | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4 | 141 | 4.094345 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 2.833213 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 1 | 220 | 4.499810 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1 | 297 | 3.258097 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
4 | 1 | 26 | 3.737670 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
berlin_airbnb_with_dummies.shape
(19820, 16)
berlin_airbnb_with_dummies.columns.values
array(['calculated_host_listings_count', 'availability_365', 'log_price', 'neighbourhood_group_Friedrichshain-Kreuzberg', 'neighbourhood_group_Lichtenberg', 'neighbourhood_group_Marzahn-Hellersdorf', 'neighbourhood_group_Mitte', 'neighbourhood_group_Neukölln', 'neighbourhood_group_Pankow', 'neighbourhood_group_Reinickendorf', 'neighbourhood_group_Spandau', 'neighbourhood_group_Steglitz-Zehlendorf', 'neighbourhood_group_Tempelhof-Schöneberg', 'neighbourhood_group_Treptow-Köpenick', 'room_type_Private room', 'room_type_Shared room'], dtype=object)
cols = ['log_price', 'calculated_host_listings_count', 'availability_365',
'neighbourhood_group_Friedrichshain-Kreuzberg',
'neighbourhood_group_Lichtenberg',
'neighbourhood_group_Marzahn-Hellersdorf',
'neighbourhood_group_Mitte', 'neighbourhood_group_Neukölln',
'neighbourhood_group_Pankow', 'neighbourhood_group_Reinickendorf',
'neighbourhood_group_Spandau',
'neighbourhood_group_Steglitz-Zehlendorf',
'neighbourhood_group_Tempelhof-Schöneberg',
'neighbourhood_group_Treptow-Köpenick', 'room_type_Private room',
'room_type_Shared room']
# Reorder the columns so that the target log_price comes first
lin_reg_berlin_airbnb_with_dummies = berlin_airbnb_with_dummies[cols]
lin_reg_berlin_airbnb_with_dummies.head()
log_price | calculated_host_listings_count | availability_365 | neighbourhood_group_Friedrichshain-Kreuzberg | neighbourhood_group_Lichtenberg | neighbourhood_group_Marzahn-Hellersdorf | neighbourhood_group_Mitte | neighbourhood_group_Neukölln | neighbourhood_group_Pankow | neighbourhood_group_Reinickendorf | neighbourhood_group_Spandau | neighbourhood_group_Steglitz-Zehlendorf | neighbourhood_group_Tempelhof-Schöneberg | neighbourhood_group_Treptow-Köpenick | room_type_Private room | room_type_Shared room | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4.094345 | 4 | 141 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 2.833213 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 4.499810 | 1 | 220 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 3.258097 | 1 | 297 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
4 | 3.737670 | 1 | 26 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4. Modeling and Evaluation¶
Residuals indicate how accurately the dependent variable can be estimated in a regression. They measure how far the predicted value deviates from the actual value. The smaller the residuals, the better the model.
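As a minimal numeric sketch (toy values, not taken from the Airbnb data), a residual is simply the actual value minus the predicted value:

```python
import numpy as np

# Toy values, purely illustrative (not from the dataset)
actual = np.array([58.0, 25.0, 36.0])
predicted = np.array([35.1, 36.6, 35.3])

# Residual = actual minus predicted; positive residuals mean the model under-predicts
residuals = actual - predicted
print(residuals)
```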
4.1 Multiple Linear Regression¶
targets = lin_reg_berlin_airbnb_with_dummies['log_price']
inputs = lin_reg_berlin_airbnb_with_dummies.drop(['log_price'],axis= 1)
Feature Scaling¶
scaler = StandardScaler()
scaler.fit(inputs)
input_scaled = scaler.transform(inputs)
Train-Test Split (guards against overfitting)¶
x_train, x_test, y_train, y_test = train_test_split(input_scaled, targets,test_size=0.2, random_state=365)
Fitting the Regression¶
reg = LinearRegression()
reg.fit(x_train,y_train)
y_hat = reg.predict(x_train)
plt.scatter(y_train,y_hat)
plt.xlabel('Targets')
plt.ylabel('Predictions')
plt.show()
print(" Intercept:\n", reg.intercept_)
Intercept: 3.781225116168546
print("Koeffizienten Weights: \n", reg.coef_)
Koeffizienten Weights: [ 0.00598836 0.06202005 0.04087519 -0.0213382 -0.01248334 0.02006909 -0.01683975 0.02953642 -0.01720474 -0.00819149 -0.00947675 0.0019844 -0.0109827 -0.28451957 -0.08382892]
reg_summary = pd.DataFrame(inputs.columns, columns=['Features'])
reg_summary['Weights']= reg.coef_
reg_summary
Features | Weights | |
---|---|---|
0 | calculated_host_listings_count | 0.005988 |
1 | availability_365 | 0.062020 |
2 | neighbourhood_group_Friedrichshain-Kreuzberg | 0.040875 |
3 | neighbourhood_group_Lichtenberg | -0.021338 |
4 | neighbourhood_group_Marzahn-Hellersdorf | -0.012483 |
5 | neighbourhood_group_Mitte | 0.020069 |
6 | neighbourhood_group_Neukölln | -0.016840 |
7 | neighbourhood_group_Pankow | 0.029536 |
8 | neighbourhood_group_Reinickendorf | -0.017205 |
9 | neighbourhood_group_Spandau | -0.008191 |
10 | neighbourhood_group_Steglitz-Zehlendorf | -0.009477 |
11 | neighbourhood_group_Tempelhof-Schöneberg | 0.001984 |
12 | neighbourhood_group_Treptow-Köpenick | -0.010983 |
13 | room_type_Private room | -0.284520 |
14 | room_type_Shared room | -0.083829 |
R² Value¶
## Coefficient of determination: the share of the variance of the dependent variable
## that is explained by the independent variables.
## R² lies in the range between 0 and 1.
## If R² equals 1, all points lie exactly on the regression line.
# R² value
R_2=reg.score(x_train, y_train)
print('Train Performance')
print('R2: ',R_2)
Train Performance R2: 0.37528899462129495
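The definition above can be checked by hand. A minimal sketch with toy numbers (illustrative only, not from the model) computing R² as 1 − SS_res/SS_tot:

```python
import numpy as np

# Toy numbers, illustrative only: R^2 = 1 - SS_res / SS_tot
y_true = np.array([3.0, 4.0, 5.0, 6.0])
y_pred = np.array([3.2, 3.9, 5.1, 5.8])

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
r2 = 1 - ss_res / ss_tot
print(r2)
```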
# Residuals (= error terms)
sns.histplot(y_train - y_hat, kde=True)  # distplot is deprecated in newer seaborn versions
plt.title("Residuals")
plt.show()
Testing¶
y_hat_test = reg.predict(x_test)
plt.scatter(y_test,y_hat_test, alpha=0.2)
plt.xlabel('Targets Test')
plt.ylabel('Predictions Test')
plt.show()
R² Value¶
## Coefficient of determination: the share of the variance of the dependent variable
## that is explained by the independent variables.
## R² lies in the range between 0 and 1.
## The closer R² is to 1, the better the model fits the data.
R_2=reg.score(x_test, y_test)
print('Test Performance')
print('R2: ',R_2)
Test Performance R2: 0.374654184475112
Mean Absolute Error¶
## Average absolute deviation of the predicted values from the observed values.
## Over- and underestimations do not cancel each other out.
## MAE = 0 would correspond to a perfect forecast.
print("MAE: ", mean_absolute_error(y_test, y_hat_test))
MAE: 0.3113972654675436
Mean Squared Error¶
## Mean of the squared errors.
## Large deviations receive a disproportionate weight compared to small deviations.
## MSE = 0 would correspond to a perfect forecast.
print("MSE: ", mean_squared_error(y_test, y_hat_test))
MSE: 0.15559724128358376
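The difference between the two error measures can be seen on toy numbers (illustrative only): a single large error dominates the MSE far more than the MAE:

```python
import numpy as np

# Toy errors, illustrative only
errors = np.array([1.0, -2.0, 10.0])

mae = np.mean(np.abs(errors))  # (1 + 2 + 10) / 3
mse = np.mean(errors ** 2)     # (1 + 4 + 100) / 3
print(mae, mse)
```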
Predictions, Targets, Residuals, Difference%¶
# Back-transform the log predictions to the original price scale
df_pf = pd.DataFrame(np.exp(y_hat_test), columns=['Predictions'])
df_pf.head()
Predictions | |
---|---|
0 | 35.086682 |
1 | 36.611418 |
2 | 35.460856 |
3 | 35.259933 |
4 | 34.746644 |
y_test = y_test.reset_index(drop=True)
df_pf['Targets']=np.exp(y_test)
df_pf
Predictions | Targets | |
---|---|---|
0 | 35.086682 | 58.0 |
1 | 36.611418 | 25.0 |
2 | 35.460856 | 25.0 |
3 | 35.259933 | 36.0 |
4 | 34.746644 | 22.0 |
... | ... | ... |
3959 | 38.419789 | 55.0 |
3960 | 61.810220 | 35.0 |
3961 | 34.471874 | 15.0 |
3962 | 31.783167 | 35.0 |
3963 | 34.471874 | 50.0 |
3964 rows × 2 columns
df_pf['Residuals'] = df_pf['Targets'] - df_pf['Predictions']
df_pf
Predictions | Targets | Residuals | |
---|---|---|---|
0 | 35.086682 | 58.0 | 22.913318 |
1 | 36.611418 | 25.0 | -11.611418 |
2 | 35.460856 | 25.0 | -10.460856 |
3 | 35.259933 | 36.0 | 0.740067 |
4 | 34.746644 | 22.0 | -12.746644 |
... | ... | ... | ... |
3959 | 38.419789 | 55.0 | 16.580211 |
3960 | 61.810220 | 35.0 | -26.810220 |
3961 | 34.471874 | 15.0 | -19.471874 |
3962 | 31.783167 | 35.0 | 3.216833 |
3963 | 34.471874 | 50.0 | 15.528126 |
3964 rows × 3 columns
df_pf['Difference%'] = np.absolute(df_pf['Residuals'] / df_pf['Targets']*100)
df_pf
Predictions | Targets | Residuals | Difference% | |
---|---|---|---|---|
0 | 35.086682 | 58.0 | 22.913318 | 39.505721 |
1 | 36.611418 | 25.0 | -11.611418 | 46.445670 |
2 | 35.460856 | 25.0 | -10.460856 | 41.843425 |
3 | 35.259933 | 36.0 | 0.740067 | 2.055743 |
4 | 34.746644 | 22.0 | -12.746644 | 57.939292 |
... | ... | ... | ... | ... |
3959 | 38.419789 | 55.0 | 16.580211 | 30.145838 |
3960 | 61.810220 | 35.0 | -26.810220 | 76.600629 |
3961 | 34.471874 | 15.0 | -19.471874 | 129.812490 |
3962 | 31.783167 | 35.0 | 3.216833 | 9.190951 |
3963 | 34.471874 | 50.0 | 15.528126 | 31.056253 |
3964 rows × 4 columns
df_pf.describe(include ='all')
Predictions | Targets | Residuals | Difference% | |
---|---|---|---|---|
count | 3964.000000 | 3964.000000 | 3964.000000 | 3964.000000 |
mean | 45.798296 | 49.726791 | 3.928495 | 32.477736 |
std | 14.211519 | 25.308897 | 20.494517 | 30.655094 |
min | 21.093085 | 9.000000 | -52.302901 | 0.021569 |
25% | 33.437396 | 30.000000 | -9.246543 | 12.345680 |
50% | 37.021530 | 45.000000 | -0.004178 | 24.734889 |
75% | 60.595064 | 65.000000 | 13.498191 | 43.312768 |
max | 78.612874 | 139.000000 | 94.864778 | 523.029014 |
Error Terms¶
sns.histplot(y_test - y_hat_test, kde=True)  # distplot is deprecated in newer seaborn versions
plt.title("Residuals")
plt.show()
Actual vs. Predicted Values Plotted¶
test = pd.DataFrame({'Predicted':reg.predict(x_test),'Actual':y_test})
fig= plt.figure(figsize=(16,8))
test = test.reset_index()
test = test.drop(['index'],axis=1)
plt.plot(test[:50])
plt.legend(['Actual','Predicted'])
sns.jointplot(x='Actual', y='Predicted', data=test, kind='reg')
4.2 Random Forest Regressor¶
# The starting point is the dataset data_feature_auswahl;
# it contains only the relevant features.
# An approximation to the normal distribution has not yet taken place.
random_forest_berlin_airbnb = data_feature_auswahl
random_forest_berlin_airbnb.head()
neighbourhood_group | room_type | price | calculated_host_listings_count | availability_365 | |
---|---|---|---|---|---|
0 | Mitte | Entire home/apt | 60 | 4 | 141 |
1 | Pankow | Private room | 17 | 1 | 0 |
2 | Pankow | Entire home/apt | 90 | 1 | 220 |
3 | Tempelhof-Schöneberg | Private room | 26 | 1 | 297 |
4 | Pankow | Private room | 42 | 1 | 26 |
random_forest_berlin_airbnb_with_dummies = pd.get_dummies(random_forest_berlin_airbnb)
Feature Scaling¶
targets = random_forest_berlin_airbnb_with_dummies['price']
inputs = random_forest_berlin_airbnb_with_dummies.drop(['price'],axis= 1)
scaler = StandardScaler()
scaler.fit(inputs)
inputs_scaled = scaler.transform(inputs)
Train-Test Split (guards against overfitting)¶
x_train, x_test, y_train, y_test = train_test_split(inputs_scaled, targets,test_size=0.2, random_state=365)
Fitting the Regression¶
from sklearn.ensemble import RandomForestRegressor  # required import, missing above

rfr = RandomForestRegressor(bootstrap=True, max_depth=None, max_features='auto',
                            min_samples_leaf=4, min_samples_split=2)
rfr.fit(x_train, y_train)
y_predictions=rfr.predict(x_test)
R_2 = rfr.score(x_train,y_train)
print('Train Performance')
print('R2: ',R_2)
Train Performance R2: 0.6994935521191867
Testing¶
## Coefficient of determination: the share of the variance of the dependent variable
## that is explained by the independent variables.
## R² lies in the range between 0 and 1.
## If R² equals 1, all points lie exactly on the regression line.
R_2 = rfr.score(x_test,y_test)
print('Test Performance')
print('R2: ',R_2)
Test Performance R2: 0.6186421639510178
feature_list = list(inputs.columns)
importances = list(rfr.feature_importances_)
feature_importances = [(feature, round(importance,2)) for feature, importance in zip(feature_list, importances)]
feature_importances = sorted(feature_importances, key = lambda x: x[1])
df = pd.DataFrame(feature_importances,columns=['Features','Weights'])
print (df)
mean_importance_room_type = round(df[df['Features'].str.contains('room_type')]['Weights'].mean(),5)
mean_importance_neighbourhood_group = round(df[df['Features'].str.contains('neighbourhood_group')]['Weights'].mean(),5)
mean_importance_availability_365 = round(df[df['Features'].str.contains('availability_365')]['Weights'].mean(),5)
mean_importance_calculated_host_listings_count = round(df[df['Features'].str.contains('calculated_host_listings_count')]['Weights'].mean(),5)
print('\nAverage weights:')
print('{:20} Weight of neighbourhood_group'.format(mean_importance_neighbourhood_group))
print('{:20} Weight of room_type'.format(mean_importance_room_type))
print('{:20} Weight of calculated_host_listings_count'.format(mean_importance_calculated_host_listings_count))
print('{:20} Weight of availability_365'.format(mean_importance_availability_365))
Features  Weights
0 neighbourhood_group_Friedrichshain-Kreuzberg 0.00
1 neighbourhood_group_Lichtenberg 0.00
2 neighbourhood_group_Marzahn-Hellersdorf 0.00
3 neighbourhood_group_Mitte 0.00
4 neighbourhood_group_Neukölln 0.00
5 neighbourhood_group_Pankow 0.00
6 neighbourhood_group_Reinickendorf 0.00
7 neighbourhood_group_Spandau 0.00
8 neighbourhood_group_Steglitz-Zehlendorf 0.00
9 neighbourhood_group_Treptow-Köpenick 0.00
10 room_type_Shared room 0.00
11 neighbourhood_group_Charlottenburg-Wilmersdorf 0.01
12 neighbourhood_group_Tempelhof-Schöneberg 0.02
13 availability_365 0.10
14 room_type_Entire home/apt 0.20
15 room_type_Private room 0.25
16 calculated_host_listings_count 0.41

Average weights:
0.0025 Weight of neighbourhood_group
0.15 Weight of room_type
0.41 Weight of calculated_host_listings_count
0.1 Weight of availability_365
Mean Absolute Error¶
## Average absolute deviation of the predicted values from the observed values.
## Over- and underestimations do not cancel each other out.
## MAE = 0 would correspond to a perfect forecast.
print("MAE: ", mean_absolute_error(y_test, y_predictions))
MAE: 28.16271419814525
Mean Squared Error¶
## Mean of the squared errors.
## Large deviations receive a disproportionate weight compared to small deviations.
## MSE = 0 would correspond to a perfect forecast.
print("MSE: ", mean_squared_error(y_test, y_predictions))
MSE:  9828.498150732032
Predictions, Targets, Residuals, Difference%¶
# The random forest was trained on the price in its original scale (no log
# transformation), so no exponential back-transform is applied here:
# np.exp on these predictions would overflow and produce meaningless values.
df_rf = pd.DataFrame(y_predictions, columns=['Predictions'])
y_test = y_test.reset_index(drop=True)
df_rf['Targets'] = y_test
df_rf['Residuals'] = df_rf['Targets'] - df_rf['Predictions']
df_rf['Difference%'] = np.absolute(df_rf['Residuals'] / df_rf['Targets']*100)
df_rf.describe(include='all')
4.3 Neural Network with Keras¶
neuronales_netz_berlin_airbnb = data_feature_auswahl
neuronales_netz_berlin_airbnb_with_dummies = pd.get_dummies(neuronales_netz_berlin_airbnb)
targets = neuronales_netz_berlin_airbnb_with_dummies['price']
inputs = neuronales_netz_berlin_airbnb_with_dummies.drop(['price'],axis= 1)
Feature Scaling¶
scaler = StandardScaler()
scaler.fit(inputs)
inputs_scaled = scaler.transform(inputs)
Train-Test Split (prevents overfitting)¶
x_train, x_test, y_train, y_test = train_test_split(inputs_scaled, targets,test_size=0.2, random_state=365)
Building the Model¶
# Defining the Keras model: the Sequential API is used
from tensorflow import keras  # required import, missing above

model = keras.Sequential([keras.layers.Dense(512, input_dim = x_train.shape[1], kernel_initializer="normal", activation="relu"),
keras.layers.Dense(512, kernel_initializer="normal", activation="relu", kernel_regularizer=keras.regularizers.l1(0.1)),
keras.layers.Dense(256, kernel_initializer="normal", activation="relu", kernel_regularizer=keras.regularizers.l1(0.1)),
keras.layers.Dense(128, kernel_initializer="normal", activation="relu", kernel_regularizer=keras.regularizers.l1(0.1)),
keras.layers.Dense(64, kernel_initializer="normal", activation="relu", kernel_regularizer=keras.regularizers.l1(0.1)),
keras.layers.Dense(32, kernel_initializer="normal", activation="relu", kernel_regularizer=keras.regularizers.l1(0.1)),
keras.layers.Dense(1, kernel_initializer="normal", activation="linear")])
#model.compile(optimizer='adam', loss='mean_squared_error')
model.compile(loss='mse', optimizer='adam', metrics=['mse', 'mae', 'mape'])
history = model.fit(x_train, y_train, epochs=100, validation_split=0.25, shuffle=True)
Epoch 1/100 423/423 [==============================] - 3s 6ms/step - loss: 52293.3281 - mse: 51679.7109 - mae: 41.9098 - mape: 66.6437 - val_loss: 60160.3984 - val_mse: 59678.9336 - val_mae: 37.0176 - val_mape: 48.9715 Epoch 2/100 423/423 [==============================] - 3s 7ms/step - loss: 50323.0234 - mse: 49811.2852 - mae: 40.1004 - mape: 64.6385 - val_loss: 50109.9961 - val_mse: 49417.6133 - val_mae: 41.6000 - val_mape: 63.4633 Epoch 3/100 423/423 [==============================] - 2s 6ms/step - loss: 47327.1953 - mse: 46472.9922 - mae: 40.1668 - mape: 63.5447 - val_loss: 35793.2812 - val_mse: 34904.3281 - val_mae: 47.5639 - val_mape: 91.9253 Epoch 4/100 423/423 [==============================] - 2s 6ms/step - loss: 37613.8594 - mse: 36620.3633 - mae: 38.1361 - mape: 61.4937 - val_loss: 65582.8203 - val_mse: 64661.3555 - val_mae: 56.5424 - val_mape: 71.7409 Epoch 5/100 423/423 [==============================] - 2s 6ms/step - loss: 51873.2305 - mse: 50971.9766 - mae: 36.8921 - mape: 57.2477 - val_loss: 58424.3047 - val_mse: 57536.7148 - val_mae: 38.3821 - val_mape: 50.7613 Epoch 6/100 423/423 [==============================] - 2s 6ms/step - loss: 44660.4570 - mse: 43756.8984 - mae: 39.0159 - mape: 62.5984 - val_loss: 28785.3887 - val_mse: 27864.2656 - val_mae: 40.2119 - val_mape: 69.1342 Epoch 7/100 423/423 [==============================] - 2s 6ms/step - loss: 50004.3828 - mse: 49094.9805 - mae: 38.4008 - mape: 60.9216 - val_loss: 58432.8477 - val_mse: 57541.4648 - val_mae: 41.5801 - val_mape: 64.1060 Epoch 8/100 423/423 [==============================] - 2s 6ms/step - loss: 40808.0352 - mse: 39920.8555 - mae: 37.0300 - mape: 57.7393 - val_loss: 54740.3281 - val_mse: 53866.2266 - val_mae: 35.8979 - val_mape: 49.6711 Epoch 9/100 423/423 [==============================] - 3s 6ms/step - loss: 34173.5273 - mse: 33306.4023 - mae: 34.1722 - mape: 58.0062 - val_loss: 18571.2930 - val_mse: 17711.9570 - val_mae: 32.7611 - val_mape: 59.6569 Epoch 10/100 423/423 
[==============================] - 2s 6ms/step - loss: 25154.3594 - mse: 24312.6270 - mae: 30.9568 - mape: 52.0851 - val_loss: 20163.3105 - val_mse: 19342.6387 - val_mae: 33.5411 - val_mape: 59.3138 Epoch 11/100 423/423 [==============================] - 3s 6ms/step - loss: 24786.3809 - mse: 23993.9004 - mae: 30.3907 - mape: 51.3523 - val_loss: 32159.5215 - val_mse: 31389.2715 - val_mae: 33.4724 - val_mape: 55.5347 Epoch 12/100 423/423 [==============================] - 2s 6ms/step - loss: 28230.5352 - mse: 27493.5137 - mae: 31.1535 - mape: 51.5877 - val_loss: 15377.5752 - val_mse: 14659.0166 - val_mae: 31.9807 - val_mape: 59.0547 Epoch 13/100 423/423 [==============================] - 3s 6ms/step - loss: 25546.8945 - mse: 24853.3574 - mae: 30.3278 - mape: 49.2608 - val_loss: 19614.4863 - val_mse: 18944.6074 - val_mae: 34.9710 - val_mape: 67.4844 Epoch 14/100 423/423 [==============================] - 2s 6ms/step - loss: 24069.3594 - mse: 23423.2012 - mae: 29.9839 - mape: 50.7390 - val_loss: 31546.3262 - val_mse: 30934.7988 - val_mae: 33.3991 - val_mape: 51.3432 Epoch 15/100 423/423 [==============================] - 3s 6ms/step - loss: 23306.0059 - mse: 22712.5801 - mae: 29.8233 - mape: 50.1048 - val_loss: 17770.7031 - val_mse: 17205.6211 - val_mae: 29.9857 - val_mape: 48.6256 Epoch 16/100 423/423 [==============================] - 2s 5ms/step - loss: 39370.7852 - mse: 38834.6953 - mae: 32.9705 - mape: 51.7740 - val_loss: 16907.8496 - val_mse: 16395.3438 - val_mae: 29.0541 - val_mape: 46.7519 Epoch 17/100 423/423 [==============================] - 3s 6ms/step - loss: 22942.4434 - mse: 22443.5234 - mae: 30.0097 - mape: 50.6736 - val_loss: 25841.3438 - val_mse: 25377.4805 - val_mae: 31.3007 - val_mape: 34.1006 Epoch 18/100 423/423 [==============================] - 2s 5ms/step - loss: 22652.2891 - mse: 22197.5742 - mae: 29.4728 - mape: 49.5637 - val_loss: 34043.7852 - val_mse: 33617.9219 - val_mae: 31.2155 - val_mape: 41.8524 Epoch 19/100 423/423 
[Training log for epochs 20–100 abridged; the validation loss plateaus at roughly 14,300. Final epoch:]
Epoch 100/100
423/423 [==============================] - 3s 6ms/step - loss: 20149.7188 - mse: 19939.8672 - mae: 28.4345 - mape: 49.4230 - val_loss: 14407.6641 - val_mse: 14198.5283 - val_mae: 28.0850 - val_mape: 47.2621
# Plot training vs. validation loss to check for overfitting as well as underfitting.
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()
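Beyond eyeballing the curves, the best epoch can be read off numerically. A minimal sketch (the helper `best_epoch` is hypothetical; it only assumes that `history.history['val_loss']` is a plain list of per-epoch values, as Keras provides):

```python
def best_epoch(val_losses):
    """Return the 1-based epoch index and value of the lowest validation loss."""
    idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return idx + 1, val_losses[idx]

# Illustrative values taken from the abridged training log above:
epoch, loss = best_epoch([16737.6, 14978.3, 14281.8, 14407.7])
print(epoch, loss)  # → 3 14281.8
```

In a real run one would call `best_epoch(history.history['val_loss'])`; if the minimum occurs well before the last epoch, an early-stopping callback (e.g. `tf.keras.callbacks.EarlyStopping` with `restore_best_weights=True`) could save training time.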