1. Business Understanding

A retailer needs very high sales and just-in-time delivery of products to be successful. This calls for accurate forecasts of how much revenue each store generates per week in each department. The question is therefore: What is the weekly sales forecast for all stores of a fictional chain, taking into account the data of the last three years and demographic information (CPI, unemployment, temperature, etc.)? And how high is the weekly revenue per department in one of these stores?

2. Data and Data Understanding

The data used in this notebook is split across three files. They contain different features and share the Store attribute as a common key.

2.1. Importing the Relevant Modules

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import sklearn  
from sklearn.linear_model import LinearRegression
%matplotlib inline
sns.set()

2.2. Reading the Training Data

In [2]:
walmart_features_data = pd.read_csv('https://storage.googleapis.com/ml-service-repository-datastorage/Sales_Forecast_for_retail_store_features.csv') 
walmart_train_data = pd.read_csv('https://storage.googleapis.com/ml-service-repository-datastorage/Sales_Forecast_for_retail_store_train.csv')
walmart_stores_data = pd.read_csv('https://storage.googleapis.com/ml-service-repository-datastorage/Sales_Forecast_for_retail_store_stores.csv')
In [3]:
walmart_train_data.head(5)
Out[3]:
Store Dept Date Weekly_Sales IsHoliday
0 1 1 2010-02-05 24924.50 False
1 1 1 2010-02-12 46039.49 True
2 1 1 2010-02-19 41595.55 False
3 1 1 2010-02-26 19403.54 False
4 1 1 2010-03-05 21827.90 False
In [4]:
walmart_train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421570 entries, 0 to 421569
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Store         421570 non-null  int64  
 1   Dept          421570 non-null  int64  
 2   Date          421570 non-null  object 
 3   Weekly_Sales  421570 non-null  float64
 4   IsHoliday     421570 non-null  bool   
dtypes: bool(1), float64(1), int64(2), object(1)
memory usage: 13.3+ MB
In [5]:
walmart_train_data.describe(include='all')
Out[5]:
Store Dept Date Weekly_Sales IsHoliday
count 421570.000000 421570.000000 421570 421570.000000 421570
unique NaN NaN 143 NaN 2
top NaN NaN 2011-12-23 NaN False
freq NaN NaN 3027 NaN 391909
mean 22.200546 44.260317 NaN 15981.258123 NaN
std 12.785297 30.492054 NaN 22711.183519 NaN
min 1.000000 1.000000 NaN -4988.940000 NaN
25% 11.000000 18.000000 NaN 2079.650000 NaN
50% 22.000000 37.000000 NaN 7612.030000 NaN
75% 33.000000 74.000000 NaN 20205.852500 NaN
max 45.000000 99.000000 NaN 693099.360000 NaN
In [6]:
walmart_train_data.columns
Out[6]:
Index(['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday'], dtype='object')
In [7]:
walmart_train_data[walmart_train_data.duplicated(keep=False)] # are there duplicated rows?
Out[7]:
Store Dept Date Weekly_Sales IsHoliday
In [8]:
walmart_train_data.isnull().sum() # are there null values? --> none present
Out[8]:
Store           0
Dept            0
Date            0
Weekly_Sales    0
IsHoliday       0
dtype: int64

2.2.1. Descriptive Analysis

In [9]:
sns.histplot(walmart_train_data['Weekly_Sales'], kde=True, stat='density')   # histplot replaces the deprecated distplot
Out[9]:
<AxesSubplot:xlabel='Weekly_Sales', ylabel='Density'>
[Figure: density plot of Weekly_Sales]
In [10]:
sns.heatmap(walmart_train_data.corr())  # inspect the correlations via a heatmap
Out[10]:
<AxesSubplot:>
[Figure: correlation heatmap of walmart_train_data]

2.3. Reading the Feature Data

In [11]:
walmart_features_data.head(5)
Out[11]:
Store Date Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment IsHoliday
0 1 2010-02-05 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 False
1 1 2010-02-12 38.51 2.548 NaN NaN NaN NaN NaN 211.242170 8.106 True
2 1 2010-02-19 39.93 2.514 NaN NaN NaN NaN NaN 211.289143 8.106 False
3 1 2010-02-26 46.63 2.561 NaN NaN NaN NaN NaN 211.319643 8.106 False
4 1 2010-03-05 46.50 2.625 NaN NaN NaN NaN NaN 211.350143 8.106 False
In [12]:
walmart_features_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8190 entries, 0 to 8189
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Store         8190 non-null   int64  
 1   Date          8190 non-null   object 
 2   Temperature   8190 non-null   float64
 3   Fuel_Price    8190 non-null   float64
 4   MarkDown1     4032 non-null   float64
 5   MarkDown2     2921 non-null   float64
 6   MarkDown3     3613 non-null   float64
 7   MarkDown4     3464 non-null   float64
 8   MarkDown5     4050 non-null   float64
 9   CPI           7605 non-null   float64
 10  Unemployment  7605 non-null   float64
 11  IsHoliday     8190 non-null   bool   
dtypes: bool(1), float64(9), int64(1), object(1)
memory usage: 712.0+ KB
In [13]:
walmart_features_data.describe(include='all')
Out[13]:
Store Date Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment IsHoliday
count 8190.000000 8190 8190.000000 8190.000000 4032.000000 2921.000000 3613.000000 3464.000000 4050.000000 7605.000000 7605.000000 8190
unique NaN 182 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2
top NaN 2011-09-23 NaN NaN NaN NaN NaN NaN NaN NaN NaN False
freq NaN 45 NaN NaN NaN NaN NaN NaN NaN NaN NaN 7605
mean 23.000000 NaN 59.356198 3.405992 7032.371786 3384.176594 1760.100180 3292.935886 4132.216422 172.460809 7.826821 NaN
std 12.987966 NaN 18.678607 0.431337 9262.747448 8793.583016 11276.462208 6792.329861 13086.690278 39.738346 1.877259 NaN
min 1.000000 NaN -7.290000 2.472000 -2781.450000 -265.760000 -179.260000 0.220000 -185.170000 126.064000 3.684000 NaN
25% 12.000000 NaN 45.902500 3.041000 1577.532500 68.880000 6.600000 304.687500 1440.827500 132.364839 6.634000 NaN
50% 23.000000 NaN 60.710000 3.513000 4743.580000 364.570000 36.260000 1176.425000 2727.135000 182.764003 7.806000 NaN
75% 34.000000 NaN 73.880000 3.743000 8923.310000 2153.350000 163.150000 3310.007500 4832.555000 213.932412 8.567000 NaN
max 45.000000 NaN 101.950000 4.468000 103184.980000 104519.540000 149483.310000 67474.850000 771448.100000 228.976456 14.313000 NaN
In [14]:
walmart_features_data.columns
Out[14]:
Index(['Store', 'Date', 'Temperature', 'Fuel_Price', 'MarkDown1', 'MarkDown2',
       'MarkDown3', 'MarkDown4', 'MarkDown5', 'CPI', 'Unemployment',
       'IsHoliday'],
      dtype='object')
In [15]:
walmart_features_data.isnull().sum() # are there null values?
Out[15]:
Store              0
Date               0
Temperature        0
Fuel_Price         0
MarkDown1       4158
MarkDown2       5269
MarkDown3       4577
MarkDown4       4726
MarkDown5       4140
CPI              585
Unemployment     585
IsHoliday          0
dtype: int64
In [16]:
null = pd.DataFrame(walmart_features_data.dtypes).T.rename(index={0: 'column Type'})
null = pd.concat([null,   # pd.concat instead of the deprecated DataFrame.append
                  pd.DataFrame(walmart_features_data.isnull().sum()).T.rename(index={0: 'null values (nb)'}),
                  pd.DataFrame(walmart_features_data.isnull().sum() / walmart_features_data.shape[0] * 100).T
                    .rename(index={0: 'null values (%)'})])
null
Out[16]:
Store Date Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment IsHoliday
column Type int64 object float64 float64 float64 float64 float64 float64 float64 float64 float64 bool
null values (nb) 0 0 0 0 4158 5269 4577 4726 4140 585 585 0
null values (%) 0 0 0 0 50.7692 64.3346 55.8852 57.7045 50.5495 7.14286 7.14286 0
In [17]:
walmart_features_data['CPI'] = walmart_features_data['CPI'].fillna(walmart_features_data['CPI'].mean())   # Series.mean() skips NaN; statistics.mean would return NaN here, making the fill a no-op
walmart_features_data['Unemployment'] = walmart_features_data['Unemployment'].fillna(walmart_features_data['Unemployment'].mean())
# a missing markdown most plausibly means "no promotion that week", so fill with 0
walmart_features_data['MarkDown1'] = walmart_features_data['MarkDown1'].fillna(0)
walmart_features_data['MarkDown2'] = walmart_features_data['MarkDown2'].fillna(0)
walmart_features_data['MarkDown3'] = walmart_features_data['MarkDown3'].fillna(0)
walmart_features_data['MarkDown4'] = walmart_features_data['MarkDown4'].fillna(0)
walmart_features_data['MarkDown5'] = walmart_features_data['MarkDown5'].fillna(0)

2.3.1. Descriptive Analysis

In [18]:
sns.heatmap(walmart_features_data.corr())  # inspect the correlations via a heatmap
Out[18]:
<AxesSubplot:>
[Figure: correlation heatmap of walmart_features_data]

2.4. Reading the Store Data

In [19]:
walmart_stores_data.head(5)
Out[19]:
Store Type Size
0 1 A 151315
1 2 A 202307
2 3 B 37392
3 4 A 205863
4 5 B 34875
In [20]:
walmart_stores_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45 entries, 0 to 44
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Store   45 non-null     int64 
 1   Type    45 non-null     object
 2   Size    45 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 1.2+ KB
In [21]:
walmart_stores_data.describe(include='all')
Out[21]:
Store Type Size
count 45.000000 45 45.000000
unique NaN 3 NaN
top NaN A NaN
freq NaN 22 NaN
mean 23.000000 NaN 130287.600000
std 13.133926 NaN 63825.271991
min 1.000000 NaN 34875.000000
25% 12.000000 NaN 70713.000000
50% 23.000000 NaN 126512.000000
75% 34.000000 NaN 202307.000000
max 45.000000 NaN 219622.000000

2.4.1. Descriptive Analysis

In [22]:
sns.pairplot(walmart_stores_data, vars=['Store', 'Size'], hue='Type')   # check whether Type depends on Size
Out[22]:
<seaborn.axisgrid.PairGrid at 0x215682e9670>
[Figure: pairplot of Store and Size, colored by Type]
In [23]:
sns.swarmplot(x='Type',y='Size',data=walmart_stores_data,palette='Set2')
Out[23]:
<AxesSubplot:xlabel='Type', ylabel='Size'>
[Figure: swarm plot of Size by Type]
In [24]:
# reassign store types purely by size; the order matters, as later lines overwrite earlier ones
walmart_stores_data.loc[walmart_stores_data.Size < 50000, "Type"] = "C"
walmart_stores_data.loc[walmart_stores_data.Size >= 50000, "Type"] = "B"
walmart_stores_data.loc[walmart_stores_data.Size >= 150000, "Type"] = "A"
In [25]:
sns.swarmplot(x='Type',y='Size',data=walmart_stores_data,palette='Set2')
Out[25]:
<AxesSubplot:xlabel='Type', ylabel='Size'>
[Figure: swarm plot of Size by reassigned Type]
In [26]:
sns.countplot(x='Type',data=walmart_stores_data)   # how many stores of each type there are
Out[26]:
<AxesSubplot:xlabel='Type', ylabel='count'>
[Figure: count of stores per Type]
In [27]:
sizes = walmart_stores_data.count()['Size']   # number of non-null Size entries, i.e. the store count
print(sizes)
45
In [28]:
g = sns.FacetGrid(data=walmart_stores_data,col='Type')   # how Size is distributed within each Type
g.map(plt.hist,'Size')
Out[28]:
<seaborn.axisgrid.FacetGrid at 0x2156962b730>
[Figure: histogram of Size per Type]
In [29]:
walmart_stores_data.columns
Out[29]:
Index(['Store', 'Type', 'Size'], dtype='object')
In [30]:
walmart_stores_data.isnull().sum() # are there null values?
Out[30]:
Store    0
Type     0
Size     0
dtype: int64

3. Data Preparation

In [31]:
festo_data = walmart_features_data.merge(walmart_stores_data, how='inner', on='Store')   # attach the store meta data to the features
comb_data = walmart_train_data.merge(festo_data, how='inner', on=['Store','Date','IsHoliday']).sort_values(by=['Store',
                                                                            'Dept', 'Date']).reset_index(drop=True)
In [32]:
comb_data.head(5)
Out[32]:
Store Dept Date Weekly_Sales IsHoliday Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Type Size
0 1 1 2010-02-05 24924.50 False 42.31 2.572 0.0 0.0 0.0 0.0 0.0 211.096358 8.106 A 151315
1 1 1 2010-02-12 46039.49 True 38.51 2.548 0.0 0.0 0.0 0.0 0.0 211.242170 8.106 A 151315
2 1 1 2010-02-19 41595.55 False 39.93 2.514 0.0 0.0 0.0 0.0 0.0 211.289143 8.106 A 151315
3 1 1 2010-02-26 19403.54 False 46.63 2.561 0.0 0.0 0.0 0.0 0.0 211.319643 8.106 A 151315
4 1 1 2010-03-05 21827.90 False 46.50 2.625 0.0 0.0 0.0 0.0 0.0 211.350143 8.106 A 151315
In [33]:
comb_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421570 entries, 0 to 421569
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Store         421570 non-null  int64  
 1   Dept          421570 non-null  int64  
 2   Date          421570 non-null  object 
 3   Weekly_Sales  421570 non-null  float64
 4   IsHoliday     421570 non-null  bool   
 5   Temperature   421570 non-null  float64
 6   Fuel_Price    421570 non-null  float64
 7   MarkDown1     421570 non-null  float64
 8   MarkDown2     421570 non-null  float64
 9   MarkDown3     421570 non-null  float64
 10  MarkDown4     421570 non-null  float64
 11  MarkDown5     421570 non-null  float64
 12  CPI           421570 non-null  float64
 13  Unemployment  421570 non-null  float64
 14  Type          421570 non-null  object 
 15  Size          421570 non-null  int64  
dtypes: bool(1), float64(10), int64(3), object(2)
memory usage: 48.6+ MB
In [ ]:
sns.pairplot(comb_data, vars=['Weekly_Sales', 'Dept', 'Store', 'Size', 'CPI', 'Unemployment', 'Fuel_Price', 'Temperature'], hue='Type')
(Execution interrupted manually: the pairplot over all ~420,000 rows takes too long to render, so the cell ends in a KeyboardInterrupt.)
In [33]:
holiday = sns.FacetGrid(comb_data, col='IsHoliday')
holiday.map(plt.hist, 'Weekly_Sales', bins=30)
Out[33]:
<seaborn.axisgrid.FacetGrid at 0x21569758a60>
[Figure: histograms of Weekly_Sales for holiday vs. non-holiday weeks]
In [34]:
comb_data['IsHoliday'] = comb_data['IsHoliday'].astype(int)   # encode the only boolean column as 0/1
In [35]:
comb_data.Date = pd.to_datetime(comb_data.Date)   # parse the Date strings into datetime64
In [36]:
clean_data = comb_data[comb_data['Weekly_Sales']>0]   # keep only positive sales; required for the log transform later
In [37]:
clean_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 420212 entries, 0 to 421569
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   Store         420212 non-null  int64         
 1   Dept          420212 non-null  int64         
 2   Date          420212 non-null  datetime64[ns]
 3   Weekly_Sales  420212 non-null  float64       
 4   IsHoliday     420212 non-null  int64         
 5   Temperature   420212 non-null  float64       
 6   Fuel_Price    420212 non-null  float64       
 7   MarkDown1     420212 non-null  float64       
 8   MarkDown2     420212 non-null  float64       
 9   MarkDown3     420212 non-null  float64       
 10  MarkDown4     420212 non-null  float64       
 11  MarkDown5     420212 non-null  float64       
 12  CPI           420212 non-null  float64       
 13  Unemployment  420212 non-null  float64       
 14  Type          420212 non-null  object        
 15  Size          420212 non-null  int64         
dtypes: datetime64[ns](1), float64(10), int64(4), object(1)
memory usage: 54.5+ MB
In [38]:
clean_data1 = clean_data[clean_data["Type"] == "A"].copy()   # .copy() so that adding columns later does not raise SettingWithCopyWarning
In [39]:
clean_data1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 202286 entries, 0 to 391059
Data columns (total 16 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   Store         202286 non-null  int64         
 1   Dept          202286 non-null  int64         
 2   Date          202286 non-null  datetime64[ns]
 3   Weekly_Sales  202286 non-null  float64       
 4   IsHoliday     202286 non-null  int64         
 5   Temperature   202286 non-null  float64       
 6   Fuel_Price    202286 non-null  float64       
 7   MarkDown1     202286 non-null  float64       
 8   MarkDown2     202286 non-null  float64       
 9   MarkDown3     202286 non-null  float64       
 10  MarkDown4     202286 non-null  float64       
 11  MarkDown5     202286 non-null  float64       
 12  CPI           202286 non-null  float64       
 13  Unemployment  202286 non-null  float64       
 14  Type          202286 non-null  object        
 15  Size          202286 non-null  int64         
dtypes: datetime64[ns](1), float64(10), int64(4), object(1)
memory usage: 26.2+ MB
In [40]:
clean_data1.describe(include='all')
<ipython-input-40-fbf7f56c5ae5>:1: FutureWarning: Treating datetime data as categorical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify `datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
  clean_data1.describe(include='all')
Out[40]:
Store Dept Date Weekly_Sales IsHoliday Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Type Size
count 202286.000000 202286.000000 202286 202286.000000 202286.000000 202286.000000 202286.000000 202286.000000 202286.000000 202286.000000 202286.000000 202286.000000 202286.000000 202286.000000 202286 202286.000000
unique NaN NaN 143 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN
top NaN NaN 2011-11-25 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN A NaN
freq NaN NaN 1451 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 202286 NaN
first NaN NaN 2010-02-05 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
last NaN NaN 2012-10-26 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean 20.934326 44.485570 NaN 20962.814036 0.070549 59.682803 3.340252 3290.207260 1146.760225 584.821624 1409.380958 2247.480669 174.610036 7.766532 NaN 191149.743017
std 12.754509 30.036683 NaN 26898.626654 0.256070 17.991029 0.456896 6827.151864 6000.218287 6194.194569 4537.213118 5338.825440 39.249815 1.744491 NaN 21961.735467
min 1.000000 1.000000 NaN 0.010000 0.000000 5.540000 2.514000 0.000000 -265.760000 -29.100000 0.000000 0.000000 126.064000 3.879000 NaN 151315.000000
25% 8.000000 20.000000 NaN 3756.900000 0.000000 46.420000 2.906000 0.000000 0.000000 0.000000 0.000000 0.000000 132.751667 6.901000 NaN 184109.000000
50% 20.000000 37.500000 NaN 10879.935000 0.000000 62.010000 3.415000 0.000000 0.000000 0.000000 0.000000 0.000000 189.662884 7.806000 NaN 203007.000000
75% 31.000000 72.000000 NaN 27566.712500 0.000000 74.100000 3.717000 4864.300000 32.040000 11.850000 919.320000 3430.560000 214.118080 8.326000 NaN 204184.000000
max 41.000000 99.000000 NaN 474330.100000 1.000000 99.220000 4.468000 78124.500000 97740.990000 109030.750000 67474.850000 108519.280000 227.036936 14.313000 NaN 219622.000000
In [41]:
clean_data1['Week'] = clean_data1.Date.dt.isocalendar().week.astype(int)   # Series.dt.week is deprecated; isocalendar().week is the replacement
clean_data1['Year'] = clean_data1.Date.dt.year
In [42]:
weekly_sales_2010 = clean_data1[clean_data1.Year==2010]['Weekly_Sales'].groupby(clean_data1['Week']).mean()
weekly_sales_2011 = clean_data1[clean_data1.Year==2011]['Weekly_Sales'].groupby(clean_data1['Week']).mean()
weekly_sales_2012 = clean_data1[clean_data1.Year==2012]['Weekly_Sales'].groupby(clean_data1['Week']).mean()
plt.figure(figsize=(20,8))
sns.lineplot(x=weekly_sales_2010.index, y=weekly_sales_2010.values)
sns.lineplot(x=weekly_sales_2011.index, y=weekly_sales_2011.values)
sns.lineplot(x=weekly_sales_2012.index, y=weekly_sales_2012.values)
plt.grid()
plt.xticks(np.arange(1, 53, step=1))
plt.legend(['2010', '2011', '2012'], loc='best', fontsize=16)
plt.title('Average Weekly Sales - Per Year', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Week', fontsize=16)
plt.show()
[Figure: average weekly sales per calendar week, one line per year]
In [43]:
clean_data1[['Dept', 'Weekly_Sales']].groupby(['Dept'], as_index=False).mean().sort_values(by='Weekly_Sales', ascending=False)
Out[43]:
Dept Weekly_Sales
73 92 119434.208462
76 95 103465.496752
71 90 74586.846629
36 38 69920.036122
60 72 62415.554941
... ... ...
49 51 23.226359
43 45 22.558960
63 78 10.762018
37 39 9.998667
41 43 0.705000

81 rows × 2 columns

In [44]:
weekly_sales = clean_data1['Weekly_Sales'].groupby(clean_data1['Dept']).mean()
plt.figure(figsize=(25,8))
sns.barplot(x=weekly_sales.index, y=weekly_sales.values, palette='dark')
plt.grid()
plt.title('Average Sales per Department', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Dept', fontsize=16)
plt.show()
[Figure: average sales per department]
In [45]:
sns.histplot(clean_data1['Weekly_Sales'], kde=True, stat='density')   # histplot replaces the deprecated distplot
Out[45]:
<AxesSubplot:xlabel='Weekly_Sales', ylabel='Density'>
[Figure: density plot of Weekly_Sales for type-A stores]
In [46]:
clean_data2 = clean_data1[clean_data1["Dept"] == 92]   # department 92 has the highest average weekly sales
In [47]:
plt.figure(figsize=(20,1))
sns.lineplot(x="Week", y="IsHoliday", hue="Year", data=clean_data2)
plt.grid()
plt.xticks(np.arange(1, 53, step=1))
plt.title('holidays - Per Year', fontsize=18)
plt.ylabel('holidays', fontsize=16)
plt.xlabel('Week', fontsize=16)
plt.show()
[Figure: holiday weeks per year]
In [48]:
plt.figure(figsize=(20,8))
sns.lineplot(x="Week", y="MarkDown1", hue="Year", data=clean_data2, marker="s") 
#sns.lineplot(x="Week", y="MarkDown2", hue="Year", data=clean_data2, marker=">") #, palette="tab10")
#sns.lineplot(x="Week", y="MarkDown3", hue="Year", data=clean_data2, marker="+", palette="tab10")
#sns.lineplot(x="Week", y="MarkDown4", hue="Year", data=clean_data2, marker="o")
#sns.lineplot(x="Week", y="MarkDown5", hue="Year", data=clean_data2, marker="x", palette="Spectral")
plt.grid()
plt.xticks(np.arange(1, 53, step=1))
plt.title('MarkDowns - Per Year', fontsize=18)
plt.ylabel('MarkDowns', fontsize=16)
plt.xlabel('Week', fontsize=16)
plt.show()
[Figure: MarkDown1 per calendar week, one line per year]
In [49]:
plt.figure(figsize=(20,8))
sns.lineplot(x="Date", y="MarkDown1", data=clean_data2)
sns.lineplot(x="Date", y="MarkDown2", data=clean_data2)
sns.lineplot(x="Date", y="MarkDown3", data=clean_data2)
sns.lineplot(x="Date", y="MarkDown4", data=clean_data2)
sns.lineplot(x="Date", y="MarkDown5", data=clean_data2)

sns.lineplot(x="Date", y="Weekly_Sales", data=clean_data2)
plt.grid()
plt.legend(['MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5', 'Weekly_Sales'], loc='best', fontsize=16)
plt.title('Weekly Sales - MarkDowns', fontsize=18)
plt.ylabel('Sales', fontsize=16)
plt.xlabel('Date', fontsize=16)
plt.show()
[Figure: weekly sales and markdowns over time]
In [50]:
sns.lmplot(x='CPI',y='Weekly_Sales',data=clean_data2,hue='Type')
Out[50]:
<seaborn.axisgrid.FacetGrid at 0x2156dc92c70>
[Figure: regression of Weekly_Sales on CPI]
In [51]:
sns.lmplot(x='Temperature',y='Weekly_Sales',data=clean_data2,hue='Type')
Out[51]:
<seaborn.axisgrid.FacetGrid at 0x2156dcaca60>
[Figure: regression of Weekly_Sales on Temperature]
In [52]:
sns.lmplot(x='Fuel_Price',y='Weekly_Sales',data=clean_data2,hue='Type')
Out[52]:
<seaborn.axisgrid.FacetGrid at 0x2156cbe8f40>
[Figure: regression of Weekly_Sales on Fuel_Price]
In [53]:
sns.lmplot(x='Unemployment',y='Weekly_Sales',data=clean_data2,hue='Type')
Out[53]:
<seaborn.axisgrid.FacetGrid at 0x2156a22ee50>
[Figure: regression of Weekly_Sales on Unemployment]
In [54]:
sns.lmplot(x='Week',y='Weekly_Sales',data=clean_data2,hue='Type')
Out[54]:
<seaborn.axisgrid.FacetGrid at 0x2156ca66a00>
[Figure: regression of Weekly_Sales on Week]
In [55]:
sns.histplot(clean_data2['Weekly_Sales'], kde=True, stat='density')   # histplot replaces the deprecated distplot
Out[55]:
<AxesSubplot:xlabel='Weekly_Sales', ylabel='Density'>
[Figure: density plot of Weekly_Sales for department 92]

3.1. Testing for Correlation

In [56]:
sns.heatmap(clean_data2.corr())
Out[56]:
<AxesSubplot:>
[Figure: correlation heatmap of clean_data2]
In [57]:
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(clean_data2.corr()[['Weekly_Sales']].sort_values(by='Weekly_Sales', ascending=False), vmin=-1, vmax=1, annot=True, cmap='PiYG')
heatmap.set_title('Which features influence Weekly Sales the most?', fontdict={'fontsize':18}, pad=16);
[Figure: correlation of each feature with Weekly_Sales, sorted]
In [58]:
clean_data2.isnull().sum()
Out[58]:
Store           0
Dept            0
Date            0
Weekly_Sales    0
IsHoliday       0
Temperature     0
Fuel_Price      0
MarkDown1       0
MarkDown2       0
MarkDown3       0
MarkDown4       0
MarkDown5       0
CPI             0
Unemployment    0
Type            0
Size            0
Week            0
Year            0
dtype: int64
In [59]:
clean_data3 = clean_data2.drop(['CPI', 'Fuel_Price', 'Dept'], axis = 1)   # CPI and Fuel_Price barely correlate with Weekly_Sales; Dept is constant (92) here
In [60]:
clean_data3.isnull().sum() 
Out[60]:
Store           0
Date            0
Weekly_Sales    0
IsHoliday       0
Temperature     0
MarkDown1       0
MarkDown2       0
MarkDown3       0
MarkDown4       0
MarkDown5       0
Unemployment    0
Type            0
Size            0
Week            0
Year            0
dtype: int64
In [61]:
sns.heatmap(clean_data3.corr(),cmap="PiYG")
Out[61]:
<AxesSubplot:>
[Figure: correlation heatmap of clean_data3]
In [62]:
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(clean_data3.corr()[['Weekly_Sales']].sort_values(by='Weekly_Sales', ascending=False), vmin=-1, vmax=1, annot=True, cmap='PiYG')
heatmap.set_title('Which features influence Weekly Sales the most?', fontdict={'fontsize':18}, pad=16);
[Figure: correlation of each feature with Weekly_Sales after dropping features]

3.2. Resetting the Index

  • When observations are deleted, the original indices are preserved
  • Deleting the observations with indices 2 and 3 leaves an index like: 0, 1, 4, 5, 6
  • Resetting the index creates a new column that contains the old index
  • Use drop=True to discard the old index entirely (see the sketch below)
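
A minimal sketch of this behavior (illustrative, not part of the original notebook):

# Illustrative only: reset_index(drop=True) renumbers rows after deletions
demo = pd.DataFrame({'x': [10, 11, 12, 13, 14]})
demo = demo.drop(index=[2, 3])      # the index is now 0, 1, 4
demo = demo.reset_index(drop=True)  # the index is now 0, 1, 2; no 'index' column is added
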
In [63]:
clean_data4 = clean_data3.reset_index(drop=True)  
clean_data4.describe(include='all')
<ipython-input-63-a745ee53004c>:2: FutureWarning: Treating datetime data as categorical rather than numeric in `.describe` is deprecated and will be removed in a future version of pandas. Specify `datetime_is_numeric=True` to silence this warning and adopt the future behavior now.
  clean_data4.describe(include='all')
Out[63]:
Store Date Weekly_Sales IsHoliday Temperature MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 Unemployment Type Size Week Year
count 2860.000000 2860 2860.000000 2860.000000 2860.000000 2860.000000 2860.000000 2860.000000 2860.000000 2860.000000 2860.000000 2860 2860.000000 2860.000000 2860.000000
unique NaN 143 NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN
top NaN 2011-07-15 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN A NaN NaN NaN
freq NaN 20 NaN NaN NaN NaN NaN NaN NaN NaN NaN 2860 NaN NaN NaN
first NaN 2010-02-05 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
last NaN 2012-10-26 00:00:00 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
mean 21.000000 NaN 119434.208462 0.069930 59.737063 3274.485661 1136.812409 570.622937 1396.637923 2246.015262 7.765653 NaN 190992.500000 25.818182 2010.965035
std 12.761542 NaN 37562.653935 0.255074 18.000214 6803.391970 5966.226600 6110.689225 4506.434693 5378.014318 1.745066 NaN 21992.587981 14.130574 0.797097
min 1.000000 NaN 45263.340000 0.000000 5.540000 0.000000 -265.760000 -29.100000 0.000000 0.000000 3.879000 NaN 151315.000000 1.000000 2010.000000
25% 10.250000 NaN 91257.565000 0.000000 46.507500 0.000000 0.000000 0.000000 0.000000 0.000000 6.901000 NaN 177610.250000 14.000000 2010.000000
50% 22.000000 NaN 115690.535000 0.000000 62.045000 0.000000 0.000000 0.000000 0.000000 0.000000 7.806000 NaN 202756.000000 26.000000 2011.000000
75% 31.250000 NaN 147571.427500 0.000000 74.142500 4833.170000 31.337500 11.165000 916.982500 3422.680000 8.326000 NaN 203910.250000 38.000000 2012.000000
max 41.000000 NaN 293966.050000 1.000000 99.220000 78124.500000 97740.990000 109030.750000 67474.850000 108519.280000 14.313000 NaN 219622.000000 52.000000 2012.000000

3.3. Testing for Multicollinearity

No non-significant variables remain, so the final model can be built once the OLS assumptions are verified below.

In [64]:
plt.scatter(clean_data4['Size'], clean_data4['Weekly_Sales'])  # we expect a roughly linear relationship, not an exponential one
Out[64]:
<matplotlib.collections.PathCollection at 0x2156b7200a0>
[Figure: scatter of Size vs. Weekly_Sales]

The OLS linearity assumption is violated (the relationship is not linear); the remedy is to log-transform the target.
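
For later interpretation it helps to note the standard reading of a log-transformed target (a general result, not from the original notebook): in a model log(y) = b0 + b1*x1 + ..., a one-unit increase in x1 multiplies y by exp(b1), i.e. changes it by roughly 100*(exp(b1) - 1) percent.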

In [65]:
log_sales = np.log(clean_data4['Weekly_Sales'])   # safe: all remaining sales are > 0
clean_data4['log_sales'] = log_sales
clean_data4
Out[65]:
Store Date Weekly_Sales IsHoliday Temperature MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 Unemployment Type Size Week Year log_sales
0 1 2010-02-05 139884.94 0 42.31 0.00 0.00 0.00 0.00 0.00 8.106 A 151315 5 2010 11.848576
1 1 2010-02-12 143081.42 1 38.51 0.00 0.00 0.00 0.00 0.00 8.106 A 151315 6 2010 11.871169
2 1 2010-02-19 135066.75 0 39.93 0.00 0.00 0.00 0.00 0.00 8.106 A 151315 7 2010 11.813524
3 1 2010-02-26 125048.08 0 46.63 0.00 0.00 0.00 0.00 0.00 8.106 A 151315 8 2010 11.736454
4 1 2010-03-05 132945.44 0 46.50 0.00 0.00 0.00 0.00 0.00 8.106 A 151315 9 2010 11.797694
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2855 41 2012-09-28 128422.98 0 56.08 4578.46 51.73 3.00 1902.88 14423.97 6.432 A 196321 39 2012 11.763085
2856 41 2012-10-05 142134.06 0 50.14 6094.23 0.00 33.94 2887.65 3853.33 6.195 A 196321 40 2012 11.864526
2857 41 2012-10-12 145984.90 0 39.38 1570.23 0.00 26.31 834.80 14421.12 6.195 A 196321 41 2012 11.891258
2858 41 2012-10-19 132903.53 0 49.56 3138.86 0.00 6.17 129.04 8206.10 6.195 A 196321 42 2012 11.797379
2859 41 2012-10-26 131128.24 0 41.80 4864.30 101.34 250.60 47.24 1524.43 6.195 A 196321 43 2012 11.783931

2860 rows × 16 columns

In [66]:
plt.scatter(clean_data4['Size'], clean_data4['log_sales'])
Out[66]:
<matplotlib.collections.PathCollection at 0x2156c482c40>
[Figure: scatter of Size vs. log_sales]
In [67]:
clean_data4 = clean_data4.drop(['Weekly_Sales'], axis=1)
In [68]:
plt.scatter(clean_data4['Week'], clean_data4['log_sales'])
Out[68]:
<matplotlib.collections.PathCollection at 0x215696f10a0>
[Figure: scatter of Week vs. log_sales]
In [69]:
plt.scatter(clean_data4['IsHoliday'], clean_data4['log_sales'])
Out[69]:
<matplotlib.collections.PathCollection at 0x215682e9dc0>
[Figure: scatter of IsHoliday vs. log_sales]

Assumption: no multicollinearity among the predictors.
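
For reference (standard definition, not from the original notebook): the variance inflation factor of feature j is VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R² from regressing feature j on all the other features. A common rule of thumb treats VIF > 10 as a sign of problematic multicollinearity.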

In [70]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

variables = clean_data4[['Week', 'Year', 'Size']]   

vif = pd.DataFrame() 

vif["VIF"] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]

vif["Features"] = variables.columns

vif
Out[70]:
VIF Features
0 4.338373 Week
1 79.762781 Year
2 76.444362 Size
In [71]:
data_no_multi = clean_data4.drop(['Year'], axis=1)   # Year has the highest VIF, so it is dropped

3.4. Creating Dummy Variables

In [72]:
data_with_dummies = pd.get_dummies(data_no_multi, drop_first=True)   # drop_first=True omits the first level of every categorical column
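
A quick toy illustration of drop_first (illustrative data, not from the notebook): a column with k levels becomes k-1 indicator columns, so a single-level column like Type here (only 'A' is left) disappears entirely.

# Illustrative only
pd.get_dummies(pd.DataFrame({'Type': ['A', 'B', 'C']}), drop_first=True)   # columns: Type_B, Type_C
pd.get_dummies(pd.DataFrame({'Type': ['A', 'A', 'A']}), drop_first=True)   # no dummy columns remain
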
In [73]:
data_with_dummies.head()
Out[73]:
Store Date IsHoliday Temperature MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 Unemployment Size Week log_sales
0 1 2010-02-05 0 42.31 0.0 0.0 0.0 0.0 0.0 8.106 151315 5 11.848576
1 1 2010-02-12 1 38.51 0.0 0.0 0.0 0.0 0.0 8.106 151315 6 11.871169
2 1 2010-02-19 0 39.93 0.0 0.0 0.0 0.0 0.0 8.106 151315 7 11.813524
3 1 2010-02-26 0 46.63 0.0 0.0 0.0 0.0 0.0 8.106 151315 8 11.736454
4 1 2010-03-05 0 46.50 0.0 0.0 0.0 0.0 0.0 8.106 151315 9 11.797694
In [74]:
data_with_dummies.columns.values
Out[74]:
array(['Store', 'Date', 'IsHoliday', 'Temperature', 'MarkDown1',
       'MarkDown2', 'MarkDown3', 'MarkDown4', 'MarkDown5', 'Unemployment',
       'Size', 'Week', 'log_sales'], dtype=object)
In [76]:
cols=['log_sales', 'Store', 'Date', 'IsHoliday', 'Size', 'Week', 'Temperature', 'Unemployment', 'MarkDown1', 'MarkDown2','MarkDown3', 'MarkDown4', 'MarkDown5']
#cols=['log_sales', 'Store', 'Date', 'IsHoliday', 'Size', 'Week', 'Temperature', 'Unemployment'] # because the R² improves if we keep these values in
In [77]:
data_preprocessed = data_with_dummies[cols]
data_preprocessed.head()
Out[77]:
log_sales Store Date IsHoliday Size Week Temperature Unemployment MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5
0 11.848576 1 2010-02-05 0 151315 5 42.31 8.106 0.0 0.0 0.0 0.0 0.0
1 11.871169 1 2010-02-12 1 151315 6 38.51 8.106 0.0 0.0 0.0 0.0 0.0
2 11.813524 1 2010-02-19 0 151315 7 39.93 8.106 0.0 0.0 0.0 0.0 0.0
3 11.736454 1 2010-02-26 0 151315 8 46.63 8.106 0.0 0.0 0.0 0.0 0.0
4 11.797694 1 2010-03-05 0 151315 9 46.50 8.106 0.0 0.0 0.0 0.0 0.0

Local file

df_dummies.to_csv('train_dummies.csv', index = False) # use a full path if the file should not sit in the same folder as the notebook

Cloud file

project.save_data("train_dummies.csv", df_dummies.to_csv(index=False))

4. Modeling and Evaluation

4.1. Multiple Linear Regression Model

In [78]:
targets = data_preprocessed['log_sales']   # target variable
inputs = data_preprocessed.drop(['log_sales', 'Date'], axis=1)

4.2. Feature Scaling

StandardScaler standardizes our data: each feature is transformed to zero mean and unit variance, i.e. z = (x - mean) / std.

In [79]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(inputs)
input_scaled = scaler.transform(inputs)
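
As a sanity check (illustrative, not part of the original run): StandardScaler is equivalent to subtracting each column's mean and dividing by its standard deviation (with ddof=0):

# Illustrative only: verify the scaler against the manual formula z = (x - mean) / std
assert np.allclose(input_scaled,
                   (inputs.values - inputs.values.mean(axis=0)) / inputs.values.std(axis=0))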

4.3. Splitting into Training and Test Data

In [80]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(input_scaled, targets, test_size=0.2, random_state=365)
In [81]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x_train,y_train)
Out[81]:
LinearRegression()
In [82]:
y_hat = reg.predict(x_train)
In [83]:
plt.scatter(y_train, y_hat)
plt.xlabel('Targets')
plt.ylabel('Predictions')
plt.show()
[Figure: training targets vs. predictions]
In [84]:
sns.histplot(y_train - y_hat, kde=True, stat='density')   # histplot replaces the deprecated distplot
plt.title("Residuals")
Out[84]:
Text(0.5, 1.0, 'Residuals')
[Figure: distribution of the training residuals]
In [85]:
reg.score(x_train, y_train)
Out[85]:
0.35803414778302856

The coefficient of determination R² is only about 35.8%. A different model probably has to be found, since this data appears to be of limited use for the purpose. It is also likely due to the fact that we only considered type-A stores and only department 92, so a good prediction for all stores is not possible in the first place.
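
A quick follow-up worth running (not executed in the original notebook) is the same score on the held-out split, to check whether the weak fit is at least stable out of sample:

reg.score(x_test, y_test)   # R² on the test set; a value close to the training score indicates no overfitting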

In [87]:
reg.intercept_   
Out[87]:
11.638698055432007
In [88]:
reg.coef_  
Out[88]:
array([-0.05460284, -0.00297525,  0.1697551 ,  0.03078332, -0.06801271,
       -0.04136023,  0.02414852, -0.01160719,  0.01454898,  0.00087434,
        0.01764553])
In [89]:
reg_summary = pd.DataFrame(inputs.columns, columns=['Features'])
reg_summary['Weights'] = reg.coef_
reg_summary
Out[89]:
Features Weights
0 Store -0.054603
1 IsHoliday -0.002975
2 Size 0.169755
3 Week 0.030783
4 Temperature -0.068013
5 Unemployment -0.041360
6 MarkDown1 0.024149
7 MarkDown2 -0.011607
8 MarkDown3 0.014549
9 MarkDown4 0.000874
10 MarkDown5 0.017646

Size has by far the largest influence on sales, followed by Temperature (with a negative weight) and Week. Note that the inputs are standardized and the target is log-transformed: a one-standard-deviation increase in Size raises log sales by about 0.17, which multiplies sales by e^0.17 ≈ 1.185, i.e. roughly +18.5%.

In [90]:
y_hat_test = reg.predict(x_test)
In [91]:
plt.scatter(y_test, y_hat_test)
plt.xlabel('Targets Test')
plt.ylabel('Predictions Test')
plt.show()
[Figure: test targets vs. predictions]
In [92]:
df_pf = pd.DataFrame(np.exp(y_hat_test), columns=['Predictions'])   # np.exp undoes the log transform, back to sales in dollars
df_pf.head() 
Out[92]:
Predictions
0 116819.417145
1 130998.419879
2 92313.264681
3 90837.833968
4 117826.572101
In [93]:
df_pf['Target'] = np.exp(y_test)   # NaNs appear because y_test still carries its original (shuffled) index
df_pf.head()
Out[93]:
Predictions Target
0 116819.417145 NaN
1 130998.419879 143081.42
2 92313.264681 NaN
3 90837.833968 NaN
4 117826.572101 NaN
In [94]:
y_test = y_test.reset_index(drop=True)   # reset the index so the targets align with the predictions' 0..n-1 index
y_test.head()
Out[94]:
0    11.703337
1    12.038666
2    11.550857
3    11.463528
4    11.968262
Name: log_sales, dtype: float64
In [95]:
df_pf['Target'] = np.exp(y_test)   # the indices now align, so no NaNs remain
df_pf.head()
Out[95]:
Predictions Target
0 116819.417145 120974.69
1 130998.419879 169171.18
2 92313.264681 103866.03
3 90837.833968 95180.23
4 117826.572101 157670.36
In [96]:
df_pf['Residuals'] = df_pf['Target'] - df_pf['Predictions']
df_pf
Out[96]:
Predictions Target Residuals
0 116819.417145 120974.69 4155.272855
1 130998.419879 169171.18 38172.760121
2 92313.264681 103866.03 11552.765319
3 90837.833968 95180.23 4342.396032
4 117826.572101 157670.36 39843.787899
... ... ... ...
567 109907.197654 116989.93 7082.732346
568 111919.500560 80209.05 -31710.450560
569 85980.050529 126907.41 40927.359471
570 140304.667673 155777.29 15472.622327
571 121242.374520 88335.66 -32906.714520

572 rows × 3 columns

In [97]:
df_pf['Difference%'] = np.absolute(df_pf['Residuals'] / df_pf['Target']*100)   # absolute percentage error per row
df_pf   
Out[97]:
Predictions Target Residuals Difference%
0 116819.417145 120974.69 4155.272855 3.434828
1 130998.419879 169171.18 38172.760121 22.564576
2 92313.264681 103866.03 11552.765319 11.122756
3 90837.833968 95180.23 4342.396032 4.562288
4 117826.572101 157670.36 39843.787899 25.270309
... ... ... ... ...
567 109907.197654 116989.93 7082.732346 6.054138
568 111919.500560 80209.05 -31710.450560 39.534754
569 85980.050529 126907.41 40927.359471 32.249779
570 140304.667673 155777.29 15472.622327 9.932528
571 121242.374520 88335.66 -32906.714520 37.251903

572 rows × 4 columns

In [98]:
df_pf.describe()
Out[98]:
Predictions Target Residuals Difference%
count 572.000000 572.000000 572.000000 572.000000
mean 115670.583849 120025.710717 4355.126868 21.517588
std 22941.875293 39330.902460 31032.826964 17.482142
min 71040.429117 45263.340000 -63813.487107 0.127541
25% 94828.761143 90459.230000 -17799.809897 8.428694
50% 118220.278893 116320.655000 1400.734279 18.218184
75% 129494.110343 148368.807500 24456.722143 28.399692
max 208616.696105 293966.050000 163920.222512 98.055631
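
To condense the table above into two headline numbers, one could compute the RMSE and the mean absolute percentage error from the df_pf columns defined above (a sketch; MAPE here is simply the mean of Difference%, ≈21.5% per the describe output):

rmse = np.sqrt((df_pf['Residuals'] ** 2).mean())   # root mean squared error in sales units
mape = df_pf['Difference%'].mean()                 # mean absolute percentage error
print(f'RMSE: {rmse:,.2f}  MAPE: {mape:.2f}%')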