1. Business Understanding

Uber Technologies Inc. is a company founded in 2009 and headquartered in San Francisco. It has 91 million active users and 3.9 million drivers worldwide, and its app is available in 63 countries. A data mining project first requires a basic understanding of the goals from the business perspective; the requirements for the project can then be derived from those goals. Uber aims to optimize its supply and demand forecasts so that vehicles are always available and the service for its users is maintained. This dataset is intended to demonstrate the feasibility of that goal.

2. Data and Data Understanding

The dataset consists of four base variables: dispatching_base_number, date, active_vehicles, and trips. The variable dispatching_base_number is a code assigned by the TLC (the New York City Taxi and Limousine Commission) that identifies an Uber base in New York City. The codes map to the following bases: B02512: Unter, B02598: Hinter, B02617: Neben, B02682: Taste, B02764: Nach-NY, B02765: Grun, B02835: Dreist, B02836: Inside. A first look at the available data already invites speculation about possible dependencies between the number of active vehicles, the number of trips, and the date.

2.1 Importing Relevant Modules

In [ ]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import math
import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter
sns.set()
from sklearn.linear_model import LinearRegression

2.2 Reading the Data

In [2]:
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
In [3]:
raw_data =pd.read_csv('https://storage.googleapis.com/ml-service-repository-datastorage/Forecast_of_required_vehicles_in_the_city_center_data.csv')
In [4]:
#Output record head
raw_data.head()
Out[4]:
dispatching_base_number date active_vehicles trips
0 B02512 1/1/2015 190 1132
1 B02765 1/1/2015 225 1765
2 B02764 1/1/2015 3427 29421
3 B02682 1/1/2015 945 7679
4 B02617 1/1/2015 1228 9537
In [5]:
#data description
raw_data.describe(include='all')
Out[5]:
dispatching_base_number date active_vehicles trips
count 354 354 354.000000 354.000000
unique 6 59 NaN NaN
top B02598 1/27/2015 NaN NaN
freq 59 6 NaN NaN
mean NaN NaN 1307.435028 11667.316384
std NaN NaN 1162.510626 10648.284865
min NaN NaN 112.000000 629.000000
25% NaN NaN 296.750000 2408.500000
50% NaN NaN 1077.000000 9601.000000
75% NaN NaN 1417.000000 13711.250000
max NaN NaN 4395.000000 45858.000000

2.3 Data Cleaning

In [6]:
#check null values
raw_data.isnull().sum()
Out[6]:
dispatching_base_number    0
date                       0
active_vehicles            0
trips                      0
dtype: int64
In [7]:
raw_data.duplicated().sum()
Out[7]:
0
In [8]:
# View data distribution for active vehicles
sns.histplot(raw_data['active_vehicles'], kde=True, stat='density')
Out[8]:
<AxesSubplot:xlabel='active_vehicles', ylabel='Density'>
In [9]:
# View data distribution for trips
sns.histplot(raw_data['trips'], kde=True, stat='density')
Out[9]:
<AxesSubplot:xlabel='trips', ylabel='Density'>
In [10]:
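#95% quantile of active_vehicles
quant1= raw_data['active_vehicles'].quantile(0.95)
quant1
Out[10]:
3904.7499999999986
In [11]:
#95% quantile of trips. Note: this reuses the name quant1, so the active_vehicles quantile above is overwritten
quant1= raw_data['trips'].quantile(0.95)
quant1
Out[11]:
36025.34999999999
In [12]:
#Quantile filter for active_vehicles. Note: quant1 now holds the trips quantile (36025.35), so no row is removed here
data1= raw_data[raw_data['active_vehicles']<quant1]
In [13]:
#Quantile filter for trips. This starts from raw_data again, so it is the only filter that takes effect
data1=raw_data[raw_data['trips']<quant1]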
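In the cells above the name quant1 is assigned twice, so only the trips threshold is ever used for filtering. If the intent was to cap each column at its own 95% quantile, a minimal sketch could look like the following. The helper name `filter_below_quantile` and the toy frame are assumptions for illustration only; applied to raw_data it would give a different row count than the 336 rows reported in this notebook.

```python
import pandas as pd

def filter_below_quantile(df, columns, q=0.95):
    """Keep rows whose value is below the q-quantile in every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        # Each column gets its own threshold, computed on the unfiltered data
        mask &= df[col] < df[col].quantile(q)
    return df[mask]

# Made-up toy data, not the Uber dataset
toy = pd.DataFrame({
    'active_vehicles': [100, 200, 300, 400, 5000],
    'trips': [1000, 90000, 3000, 4000, 5000],
})
filtered = filter_below_quantile(toy, ['active_vehicles', 'trips'])
print(filtered)
```

Computing all thresholds on the unfiltered frame keeps the result independent of the order in which the columns are listed.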
In [14]:
data1.describe(include='all')
Out[14]:
dispatching_base_number date active_vehicles trips
count 336 336 336.000000 336.000000
unique 6 59 NaN NaN
top B02598 1/27/2015 NaN NaN
freq 59 6 NaN NaN
mean NaN NaN 1163.815476 10141.514881
std NaN NaN 1006.895472 8548.456980
min NaN NaN 112.000000 629.000000
25% NaN NaN 288.250000 2265.750000
50% NaN NaN 1056.500000 9467.500000
75% NaN NaN 1368.250000 12707.000000
max NaN NaN 4093.000000 35990.000000
In [15]:
#Distribution plot after the quantile filter: the irregular distribution is still visible
sns.histplot(data1['active_vehicles'], kde=True, stat='density')
Out[15]:
<AxesSubplot:xlabel='active_vehicles', ylabel='Density'>
In [16]:
#Next step: restrict active_vehicles to the range 600 to 4000.
#The main mass of the distribution lies on the left side of the plot.
#Note: the second assignment starts from data1 again, so only the lower bound (>=600) is applied
data2= data1[data1['active_vehicles']<=4000]
data2= data1[data1['active_vehicles']>=600]
plt.xlim(0,4000)
sns.histplot(data2['active_vehicles'], kde=True, stat='density')
Out[16]:
<AxesSubplot:xlabel='active_vehicles', ylabel='Density'>
In [17]:
#A quantile filter on the number of trips would go here (currently disabled)
#data2= data1[data1['trips']<quant1]
In [18]:
sns.histplot(data2['trips'], kde=True, stat='density')
Out[18]:
<AxesSubplot:xlabel='trips', ylabel='Density'>
In [19]:
#Define value ranges for trips (2000 to 20000).
#Note: the second assignment starts from data1 again, so only the lower bound (>=2000) is applied to data2
data2= data1[data1['trips']<=20000]
data2= data1[data1['trips']>=2000]
plt.xlim(1000,17000)
sns.histplot(data2['trips'], kde=True, stat='density')
Out[19]:
<AxesSubplot:xlabel='trips', ylabel='Density'>
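Each assignment to data2 above starts again from data1, so only the last condition (trips >= 2000) determines the result. If the upper and lower bounds on both columns were meant to apply together, the conditions can be combined into one mask. The following is a sketch on made-up values; the bounds are the ones used in the cells above, and the names `toy` and `toy_ranged` are hypothetical.

```python
import pandas as pd

# Made-up toy data, not the Uber dataset
toy = pd.DataFrame({
    'active_vehicles': [500, 700, 1500, 4200, 3000],
    'trips': [1500, 6000, 12000, 18000, 25000],
})

# between() is inclusive on both ends by default
in_vehicle_range = toy['active_vehicles'].between(600, 4000)
in_trip_range = toy['trips'].between(2000, 20000)

# Combine both conditions so that neither overwrites the other
toy_ranged = toy[in_vehicle_range & in_trip_range]
print(toy_ranged)
```

Applying such a combined mask to data1 would remove more rows than the 266 kept in this notebook, so all downstream numbers would change.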

3. Data Preparation

In [20]:
#Reset the index of the filtered data.
#Note: 'date' is left as a string column here; no type conversion is applied
data_cleaned = data2.reset_index(drop=True)
data_cleaned.describe(include='all')
Out[20]:
dispatching_base_number date active_vehicles trips
count 266 266 266.000000 266.000000
unique 6 59 NaN NaN
top B02598 2/5/2015 NaN NaN
freq 59 6 NaN NaN
mean NaN NaN 1411.646617 12414.560150
std NaN NaN 992.671651 8213.348133
min NaN NaN 236.000000 2016.000000
25% NaN NaN 925.500000 7620.500000
50% NaN NaN 1181.000000 10727.000000
75% NaN NaN 1439.500000 13842.500000
max NaN NaN 4093.000000 35990.000000
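The date column is still a string at this point, which is why describe reports it with unique/top/freq instead of date statistics. If parsed dates are wanted, the converted column has to be written back into the frame. A minimal sketch, assuming the month/day/year layout seen in the head output; the toy values and the `weekday` column are hypothetical additions, not part of the original analysis.

```python
import pandas as pd

# Two toy rows in the same month/day/year layout as the 'date' column
toy = pd.DataFrame({'date': ['1/1/2015', '2/28/2015'], 'trips': [100, 200]})

# assign() returns a new frame with the parsed column, so the result is not lost
toy = toy.assign(date=pd.to_datetime(toy['date'], format='%m/%d/%Y'))

# Hypothetical follow-up feature: day of week (Monday=0 ... Sunday=6)
toy['weekday'] = toy['date'].dt.dayofweek
print(toy.dtypes)
```

Note that a datetime column is not expanded by pd.get_dummies, so the later dummy encoding would then produce far fewer columns.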
In [21]:
#With the data cleaned as far as possible, we continue with checking the OLS assumptions.
#First, a scatter plot shows the spread and density of the values.
#Main feature: most points are concentrated in the range x<=2000 and y<=20000.
#However, values in the range from 3000 to over 4000 must also be covered by the model.
plt.scatter(data_cleaned['active_vehicles'], data_cleaned['trips'])
Out[21]:
<matplotlib.collections.PathCollection at 0x11ed7d310>
In [22]:
#Zoom into the range described above, where most of the data lies (active_vehicles up to 2000)
plt.scatter(data_cleaned['active_vehicles'], data_cleaned['trips'])
plt.xlim(0,2000)
plt.ylim(0,20000)
Out[22]:
(0.0, 20000.0)
In [23]:
#A few more checks on the other variables
plt.scatter(data_cleaned['trips'], data_cleaned['dispatching_base_number'])
Out[23]:
<matplotlib.collections.PathCollection at 0x11ef43e80>
In [24]:
plt.scatter(data_cleaned['dispatching_base_number'], data_cleaned['active_vehicles'])
Out[24]:
<matplotlib.collections.PathCollection at 0x11f02a340>
In [25]:
#Transformation for continuous variables
#Log: the log transformation helps to reduce the skewness of skewed data (see the distribution plots above)

log_trips = np.log(data_cleaned['trips'])
data_cleaned['log_trips'] = log_trips


#Linearity is not present -> violation of an OLS assumption
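Whether the log transformation actually reduces skewness can be checked numerically with `Series.skew()`. The sketch below uses synthetic log-normal values as a stand-in for the trips column, so the numbers it prints say nothing about the Uber data itself.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for 'trips' (not the Uber data)
rng = np.random.default_rng(0)
toy_trips = pd.Series(rng.lognormal(mean=9.0, sigma=0.7, size=1000))

skew_raw = toy_trips.skew()          # clearly positive for right-skewed data
skew_log = np.log(toy_trips).skew()  # close to 0 if the log removes the skew

print(f"skew before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

The same two calls on data_cleaned['trips'] and data_cleaned['log_trips'] would give the corresponding check for this notebook.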
In [26]:
plt.scatter(data_cleaned['dispatching_base_number'], data_cleaned['log_trips'])
Out[26]:
<matplotlib.collections.PathCollection at 0x11f099d90>
In [27]:
#Output of the column names -> 'trips' now appears twice: as the original column and as 'log_trips'
data_cleaned.columns.values
Out[27]:
array(['dispatching_base_number', 'date', 'active_vehicles', 'trips',
       'log_trips'], dtype=object)
In [28]:
#data_cleaned = data_cleaned.drop(['trips'], axis=1)
In [29]:
#Variance Inflation Factor (VIF).
#A VIF between 5 and 10 suggests that multicollinearity is probably present and that dropping the variable should be considered.

from statsmodels.stats.outliers_influence import variance_inflation_factor

variables = data_cleaned[['active_vehicles', 'log_trips']]
vif = pd.DataFrame()

vif["VIF"]= [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]

vif["Features"]= variables.columns
In [30]:
#The VIF values are displayed; both are below 5
vif
Out[30]:
VIF Features
0 3.684423 active_vehicles
1 3.684423 log_trips
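For reference, the VIF of feature j is defined as 1 / (1 - R_j²), where R_j² comes from regressing feature j on the remaining features including an intercept. The statsmodels function works on the matrix it is given and does not add a constant column itself, so values computed on raw columns, as above, can differ from this definition. The following plain NumPy sketch makes the definition explicit; the function name `vif_with_intercept` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF per column: 1 / (1 - R^2) of that column regressed on the others plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + remaining features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic example: x2 is strongly related to x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.3 * rng.normal(size=500)
x3 = rng.normal(size=500)
vifs = vif_with_intercept(np.column_stack([x1, x2, x3]))
print(vifs)
```

It is also worth noting that VIF is normally computed among the input features only; in the cell above one of the two columns is the target active_vehicles.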
In [31]:
data_cleaned.describe(include='all')
Out[31]:
dispatching_base_number date active_vehicles trips log_trips
count 266 266 266.000000 266.000000 266.000000
unique 6 59 NaN NaN NaN
top B02598 2/5/2015 NaN NaN NaN
freq 59 6 NaN NaN NaN
mean NaN NaN 1411.646617 12414.560150 9.208195
std NaN NaN 992.671651 8213.348133 0.697698
min NaN NaN 236.000000 2016.000000 7.608871
25% NaN NaN 925.500000 7620.500000 8.938593
50% NaN NaN 1181.000000 10727.000000 9.280516
75% NaN NaN 1439.500000 13842.500000 9.535498
max NaN NaN 4093.000000 35990.000000 10.490996
In [32]:
#The log transformation improved linearity
plt.scatter(data_cleaned['active_vehicles'], data_cleaned['log_trips'])
Out[32]:
<matplotlib.collections.PathCollection at 0x11f1e4d90>

3.1 Recoding Categorical Variables

In [33]:
#Convert categorical data into dummy/indicator variables. Focus on the base number for a more general model
#Note: 'date' is still a string column and is therefore encoded as well
data_with_dummies = pd.get_dummies(data_cleaned, drop_first = True)
In [34]:
# Dummy variables head output
data_with_dummies.head()
Out[34]:
active_vehicles trips log_trips dispatching_base_number_B02598 dispatching_base_number_B02617 dispatching_base_number_B02682 dispatching_base_number_B02764 dispatching_base_number_B02765 date_1/10/2015 date_1/11/2015 ... date_2/26/2015 date_2/27/2015 date_2/28/2015 date_2/3/2015 date_2/4/2015 date_2/5/2015 date_2/6/2015 date_2/7/2015 date_2/8/2015 date_2/9/2015
0 3427 29421 10.289464 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 945 7679 8.946245 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 1228 9537 9.162934 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 870 6903 8.839711 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 785 4768 8.469682 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 66 columns
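Called without a `columns` argument, pd.get_dummies encodes every string column, which is where the 58 date_ columns in the output above come from. Restricting the encoding to the base number avoids creating them in the first place. A minimal sketch on made-up rows (the values are not taken from the dataset):

```python
import pandas as pd

# Made-up toy rows with the same column names as the notebook's data
toy = pd.DataFrame({
    'dispatching_base_number': ['B02512', 'B02598', 'B02617'],
    'date': ['1/1/2015', '1/1/2015', '1/2/2015'],
    'active_vehicles': [100, 200, 300],
})

# Encode only the base number; drop_first removes the reference category (B02512)
encoded = pd.get_dummies(toy, columns=['dispatching_base_number'], drop_first=True)

# 'date' is left untouched and can be dropped explicitly if it is not needed
encoded = encoded.drop(columns=['date'])
print(encoded.columns.tolist())
```

The resulting model inputs are the same as those selected via cols further down, just without the intermediate 66-column frame.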

In [35]:
data_with_dummies.columns.values
Out[35]:
array(['active_vehicles', 'trips', 'log_trips',
       'dispatching_base_number_B02598', 'dispatching_base_number_B02617',
       'dispatching_base_number_B02682', 'dispatching_base_number_B02764',
       'dispatching_base_number_B02765', 'date_1/10/2015',
       'date_1/11/2015', 'date_1/12/2015', 'date_1/13/2015',
       'date_1/14/2015', 'date_1/15/2015', 'date_1/16/2015',
       'date_1/17/2015', 'date_1/18/2015', 'date_1/19/2015',
       'date_1/2/2015', 'date_1/20/2015', 'date_1/21/2015',
       'date_1/22/2015', 'date_1/23/2015', 'date_1/24/2015',
       'date_1/25/2015', 'date_1/26/2015', 'date_1/27/2015',
       'date_1/28/2015', 'date_1/29/2015', 'date_1/3/2015',
       'date_1/30/2015', 'date_1/31/2015', 'date_1/4/2015',
       'date_1/5/2015', 'date_1/6/2015', 'date_1/7/2015', 'date_1/8/2015',
       'date_1/9/2015', 'date_2/1/2015', 'date_2/10/2015',
       'date_2/11/2015', 'date_2/12/2015', 'date_2/13/2015',
       'date_2/14/2015', 'date_2/15/2015', 'date_2/16/2015',
       'date_2/17/2015', 'date_2/18/2015', 'date_2/19/2015',
       'date_2/2/2015', 'date_2/20/2015', 'date_2/21/2015',
       'date_2/22/2015', 'date_2/23/2015', 'date_2/24/2015',
       'date_2/25/2015', 'date_2/26/2015', 'date_2/27/2015',
       'date_2/28/2015', 'date_2/3/2015', 'date_2/4/2015',
       'date_2/5/2015', 'date_2/6/2015', 'date_2/7/2015', 'date_2/8/2015',
       'date_2/9/2015'], dtype=object)
In [36]:
#Columns for the later model are provided
cols=['active_vehicles','log_trips','dispatching_base_number_B02598',
       'dispatching_base_number_B02617', 'dispatching_base_number_B02682',
       'dispatching_base_number_B02764', 'dispatching_base_number_B02765']
In [37]:
#-> At this point it was decided that the variable 'date' is not useful for the model and too costly to include
#-> It is therefore left out of the model
In [38]:
#Column order -> the first column is the target variable
data_preprocessed = data_with_dummies[cols]
data_preprocessed.head()
Out[38]:
active_vehicles log_trips dispatching_base_number_B02598 dispatching_base_number_B02617 dispatching_base_number_B02682 dispatching_base_number_B02764 dispatching_base_number_B02765
0 3427 10.289464 0 0 0 1 0
1 945 8.946245 0 0 1 0 0
2 1228 9.162934 0 1 0 0 0
3 870 8.839711 1 0 0 0 0
4 785 8.469682 1 0 0 0 0

3.2 Creating Test and Training Data

In [40]:
targets = data_preprocessed['active_vehicles']
inputs = data_preprocessed.drop(['active_vehicles'], axis=1)
In [41]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(inputs)
input_scaled = scaler.transform(inputs)
In [42]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(input_scaled, targets, test_size=0.20, random_state=365)
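The scaler above is fitted on all rows before the split, so the mean and standard deviation of the test rows influence the scaling of the training data. One common alternative is to split first and let a scikit-learn Pipeline fit the scaler on the training rows only. The sketch below shows that variant on synthetic data; it is not the procedure used for the results in this notebook, and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for inputs/targets (not the Uber data)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1000 + X @ np.array([400.0, -50.0, 10.0]) + rng.normal(scale=20.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=365
)

# fit() learns the scaling parameters from the training rows only;
# score() and predict() reuse them for unseen rows
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
r2_test = model.score(X_test, y_test)
print(round(r2_test, 3))
```

With only 266 rows and a plain linear model the practical difference is likely small, but the pipeline keeps the test set strictly unseen.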

4. Modeling and Evaluation

4.1 Linear Regression

In [44]:
reg = LinearRegression()
reg.fit(x_train, y_train)
Out[44]:
LinearRegression()
In [45]:
y_hat = reg.predict(x_train)
In [46]:
#The split of the values into separate groups is also noticeable here
plt.scatter(y_train, y_hat)
plt.xlabel('Targets')
plt.ylabel('Predictions')
plt.show()
In [47]:
#Error term (residuals): training targets minus predictions
sns.histplot(y_train - y_hat, kde=True, stat='density')
plt.title("Residuals")
#plt.show()
Out[47]:
Text(0.5, 1.0, 'Residuals')
In [48]:
#R-squared: share of the variance of the target variable explained by the model
reg.score(x_train, y_train)
Out[48]:
0.9747815666217599
In [49]:
# Intercept of the regression line
reg.intercept_
Out[49]:
1413.3913426862414
In [50]:
reg.score(x_test, y_test)
Out[50]:
0.9849060957803961
In [51]:
print('training Performance')
print(reg.score(x_train,y_train))
print('test Performance')
print(reg.score(x_test,y_test))
training Performance
0.9747815666217599
test Performance
0.9849060957803961
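The value returned by reg.score is the coefficient of determination R² = 1 - SS_res / SS_tot. Writing it out makes clear what the two numbers above measure. The function name `r_squared` and the four value pairs below are made up for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up targets and predictions
toy_targets = [1000, 1200, 1400, 1600]
toy_predictions = [1010, 1190, 1420, 1580]
r2_toy = r_squared(toy_targets, toy_predictions)
print(r2_toy)
```

A model that always predicts the mean of the targets scores 0, and a perfect model scores 1.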
In [52]:
#Regression coefficients for the model 
reg.coef_
Out[52]:
array([432.4680536 , -55.50352875,  11.88834549, -20.8881854 ,
       611.37694067, -50.75412983])
In [53]:
#Which weights matter most for the model (coefficients divided by 1000 for display only):
reg_summary = pd.DataFrame(inputs.columns, columns = ['Features'])
reg_summary['Weights']= reg.coef_/1000
reg_summary
Out[53]:
Features Weights
0 log_trips 0.432468
1 dispatching_base_number_B02598 -0.055504
2 dispatching_base_number_B02617 0.011888
3 dispatching_base_number_B02682 -0.020888
4 dispatching_base_number_B02764 0.611377
5 dispatching_base_number_B02765 -0.050754
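Because the inputs were standardized, each entry of reg.coef_ is the change in predicted active_vehicles per one standard deviation of that feature, not per unit of the original feature. Dividing a coefficient by the feature's standard deviation converts it back to original units. The sketch below checks this relationship on synthetic data with a small least-squares helper; `ols` and all values are assumptions for illustration.

```python
import numpy as np

# Synthetic features and target (not the Uber data)
rng = np.random.default_rng(0)
X = rng.normal(loc=[9.0, 0.2], scale=[0.7, 0.4], size=(300, 2))
y = 50 + X @ np.array([600.0, 100.0]) + rng.normal(scale=5.0, size=300)

def ols(X, y):
    """Least-squares fit with intercept; returns (intercept, coefficients)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[0], beta[1:]

# Fit on standardized features (mean 0, standard deviation 1)
mu, sigma = X.mean(axis=0), X.std(axis=0)
_, coef_std = ols((X - mu) / sigma, y)

# Dividing by each feature's standard deviation gives the slope per original unit
coef_raw_units = coef_std / sigma
_, coef_direct = ols(X, y)
print(coef_std, coef_raw_units)
```

With the fitted scaler from this notebook, the corresponding expression would be reg.coef_ / scaler.scale_.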
In [54]:
#Testing the model
In [55]:
y_hat_test = reg.predict(x_test)
In [56]:
plt.scatter(y_test, y_hat_test, alpha=0.2)
plt.xlabel('Targets Test')
plt.ylabel('Predictions Test')
plt.show()
In [57]:
#Predictions on the test set
df_pf = pd.DataFrame((y_hat_test), columns = ['Predictions'])
df_pf.head()
Out[57]:
Predictions
0 1091.364516
1 1057.719388
2 1170.036887
3 1259.987262
4 1078.100005
In [58]:
y_test=y_test.reset_index(drop=True)
y_test.head()
Out[58]:
0    1218
1    1151
2    1027
3    1330
4     945
Name: active_vehicles, dtype: int64
In [59]:
df_pf['Targets'] = y_test
df_pf.head()
Out[59]:
Predictions Targets
0 1091.364516 1218
1 1057.719388 1151
2 1170.036887 1027
3 1259.987262 1330
4 1078.100005 945
In [60]:
#Overview of how much the predictions and the actual values differ:
y_pred= reg.predict(x_test)
test=pd.DataFrame({'Predicted':y_pred, 'Actual':y_test})
fig=plt.figure(figsize=(16,8))
test=test.reset_index(drop=True)
plt.plot(test[:50])
#Legend order follows the column order of the DataFrame
plt.legend(['Predicted', 'Actual'])
sns.jointplot(x='Actual', y='Predicted', data=test, kind='reg');

4.2 Evaluation

In [61]:
df_pf
Out[61]:
Predictions Targets
0 1091.364516 1218
1 1057.719388 1151
2 1170.036887 1027
3 1259.987262 1330
4 1078.100005 945
5 1394.956731 1223
6 1506.660989 1526
7 1358.148148 1418
8 990.063854 975
9 1384.551075 1350
10 1343.005058 1300
11 967.122403 974
12 1426.599642 1321
13 1200.898158 1295
14 985.642063 1084
15 271.859854 322
16 672.261307 746
17 114.609218 299
18 1476.113622 1551
19 1289.180154 1356
20 1279.616087 1248
21 3584.319579 3658
22 1434.194940 1456
23 1349.242107 1405
24 162.148057 309
25 114.034467 252
26 1358.412778 1471
27 237.654959 269
28 1479.194768 1452
29 892.269643 786
30 3653.514821 3820
31 3478.878964 3473
32 3534.709964 3186
33 900.021763 685
34 214.674054 256
35 639.482237 566
36 1475.478612 1293
37 872.131134 944
38 1226.884458 1186
39 1195.639823 1044
40 1432.790074 1383
41 3575.285742 3736
42 1098.867095 1039
43 1095.121958 994
44 1296.851003 1429
45 1458.344504 1532
46 1257.183801 1262
47 1294.327396 1214
48 3587.913110 3478
49 3572.568763 3427
50 694.462976 521
51 1241.508649 1216
52 957.733177 909
53 1277.755356 1188
In [62]:
df_pf['Residuals'] = df_pf['Targets']-df_pf['Predictions']
df_pf
Out[62]:
Predictions Targets Residuals
0 1091.364516 1218 126.635484
1 1057.719388 1151 93.280612
2 1170.036887 1027 -143.036887
3 1259.987262 1330 70.012738
4 1078.100005 945 -133.100005
5 1394.956731 1223 -171.956731
6 1506.660989 1526 19.339011
7 1358.148148 1418 59.851852
8 990.063854 975 -15.063854
9 1384.551075 1350 -34.551075
10 1343.005058 1300 -43.005058
11 967.122403 974 6.877597
12 1426.599642 1321 -105.599642
13 1200.898158 1295 94.101842
14 985.642063 1084 98.357937
15 271.859854 322 50.140146
16 672.261307 746 73.738693
17 114.609218 299 184.390782
18 1476.113622 1551 74.886378
19 1289.180154 1356 66.819846
20 1279.616087 1248 -31.616087
21 3584.319579 3658 73.680421
22 1434.194940 1456 21.805060
23 1349.242107 1405 55.757893
24 162.148057 309 146.851943
25 114.034467 252 137.965533
26 1358.412778 1471 112.587222
27 237.654959 269 31.345041
28 1479.194768 1452 -27.194768
29 892.269643 786 -106.269643
30 3653.514821 3820 166.485179
31 3478.878964 3473 -5.878964
32 3534.709964 3186 -348.709964
33 900.021763 685 -215.021763
34 214.674054 256 41.325946
35 639.482237 566 -73.482237
36 1475.478612 1293 -182.478612
37 872.131134 944 71.868866
38 1226.884458 1186 -40.884458
39 1195.639823 1044 -151.639823
40 1432.790074 1383 -49.790074
41 3575.285742 3736 160.714258
42 1098.867095 1039 -59.867095
43 1095.121958 994 -101.121958
44 1296.851003 1429 132.148997
45 1458.344504 1532 73.655496
46 1257.183801 1262 4.816199
47 1294.327396 1214 -80.327396
48 3587.913110 3478 -109.913110
49 3572.568763 3427 -145.568763
50 694.462976 521 -173.462976
51 1241.508649 1216 -25.508649
52 957.733177 909 -48.733177
53 1277.755356 1188 -89.755356
In [63]:
#The residuals show the deviation between target and prediction
#Absolute deviation as a percentage of the target
df_pf['Difference%']=np.absolute(df_pf['Residuals']/df_pf['Targets']*100)
df_pf
Out[63]:
Predictions Targets Residuals Difference%
0 1091.364516 1218 126.635484 10.397002
1 1057.719388 1151 93.280612 8.104310
2 1170.036887 1027 -143.036887 13.927642
3 1259.987262 1330 70.012738 5.264116
4 1078.100005 945 -133.100005 14.084657
5 1394.956731 1223 -171.956731 14.060240
6 1506.660989 1526 19.339011 1.267301
7 1358.148148 1418 59.851852 4.220864
8 990.063854 975 -15.063854 1.545011
9 1384.551075 1350 -34.551075 2.559339
10 1343.005058 1300 -43.005058 3.308081
11 967.122403 974 6.877597 0.706119
12 1426.599642 1321 -105.599642 7.993917
13 1200.898158 1295 94.101842 7.266551
14 985.642063 1084 98.357937 9.073610
15 271.859854 322 50.140146 15.571474
16 672.261307 746 73.738693 9.884543
17 114.609218 299 184.390782 61.669158
18 1476.113622 1551 74.886378 4.828264
19 1289.180154 1356 66.819846 4.927717
20 1279.616087 1248 -31.616087 2.533340
21 3584.319579 3658 73.680421 2.014227
22 1434.194940 1456 21.805060 1.497600
23 1349.242107 1405 55.757893 3.968533
24 162.148057 309 146.851943 47.524901
25 114.034467 252 137.965533 54.748227
26 1358.412778 1471 112.587222 7.653788
27 237.654959 269 31.345041 11.652431
28 1479.194768 1452 -27.194768 1.872918
29 892.269643 786 -106.269643 13.520311
30 3653.514821 3820 166.485179 4.358251
31 3478.878964 3473 -5.878964 0.169276
32 3534.709964 3186 -348.709964 10.945071
33 900.021763 685 -215.021763 31.390038
34 214.674054 256 41.325946 16.142948
35 639.482237 566 -73.482237 12.982727
36 1475.478612 1293 -182.478612 14.112808
37 872.131134 944 71.868866 7.613227
38 1226.884458 1186 -40.884458 3.447256
39 1195.639823 1044 -151.639823 14.524887
40 1432.790074 1383 -49.790074 3.600150
41 3575.285742 3736 160.714258 4.301773
42 1098.867095 1039 -59.867095 5.761992
43 1095.121958 994 -101.121958 10.173235
44 1296.851003 1429 132.148997 9.247655
45 1458.344504 1532 73.655496 4.807800
46 1257.183801 1262 4.816199 0.381632
47 1294.327396 1214 -80.327396 6.616754
48 3587.913110 3478 -109.913110 3.160239
49 3572.568763 3427 -145.568763 4.247702
50 694.462976 521 -173.462976 33.294237
51 1241.508649 1216 -25.508649 2.097751
52 957.733177 909 -48.733177 5.361186
53 1277.755356 1188 -89.755356 7.555165
In [64]:
df_pf.describe()
Out[64]:
Predictions Targets Residuals Difference%
count 54.000000 54.000000 54.000000 54.000000
mean 1388.186984 1379.592593 -8.594392 10.443333
std 929.385603 918.172272 112.470052 12.721770
min 114.034467 252.000000 -348.709964 0.169276
25% 971.752318 952.250000 -87.398366 3.485480
50% 1258.585532 1220.500000 -0.531383 6.941653
75% 1431.242466 1426.250000 73.674190 12.650153
max 3653.514821 3820.000000 184.390782 61.669158
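The summary above can be condensed into the usual error metrics. The mean of the Difference% column (10.44) is the mean absolute percentage error (MAPE) over all 54 test rows. As a sketch of how MAE, RMSE and MAPE follow from the residuals, the block below uses only the first five rows of the df_pf table above, so its results differ from the full-table figures.

```python
import numpy as np

# First five rows of the df_pf table above (Targets and Predictions)
targets = np.array([1218, 1151, 1027, 1330, 945], dtype=float)
predictions = np.array([1091.364516, 1057.719388, 1170.036887, 1259.987262, 1078.100005])

residuals = targets - predictions
mae = np.mean(np.abs(residuals))                    # mean absolute error, in vehicles
rmse = np.sqrt(np.mean(residuals ** 2))             # penalizes large errors more strongly
mape = np.mean(np.abs(residuals / targets)) * 100   # mean of the Difference% column

print(f"MAE={mae:.1f}  RMSE={rmse:.1f}  MAPE={mape:.1f}%")
```

On the full test set the same expressions can be applied directly to df_pf['Targets'] and df_pf['Predictions'].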