1. Business Understanding
Uber Technologies Inc., founded in 2009 and headquartered in San Francisco, has 91 million active users and 3.9 million drivers worldwide. Its app is available in 63 countries. A data mining project starts from a basic understanding of the company's goals; the requirements for the project are then derived from those goals so that they can be implemented. Uber aims to optimize its forecasts of supply and demand so that vehicles are always available and the service for its users is maintained. This dataset is used to demonstrate the feasibility of that goal.
2. Data and Data Understanding
The dataset consists of four base variables: dispatching_base_number, date, active_vehicles, and trips. The variable dispatching_base_number is a code assigned by the TLC (New York City Taxi and Limousine Commission) that identifies an Uber base in New York City. The codes map to the following bases: B02512: Unter, B02598: Hinter, B02617: Neben, B02682: Taste, B02764: Nach-NY, B02765: Grun, B02835: Dreist, B02836: Inside. A first look at the available data already suggests possible dependencies between the number of active vehicles, the number of trips, and the date.
2.1 Importing Relevant Modules
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import math
import matplotlib.dates as mdates
from matplotlib.dates import DateFormatter
sns.set()
from sklearn.linear_model import LinearRegression
2.2 Reading the Data
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
raw_data =pd.read_csv('https://storage.googleapis.com/ml-service-repository-datastorage/Forecast_of_required_vehicles_in_the_city_center_data.csv')
#Output record head
raw_data.head()
| | dispatching_base_number | date | active_vehicles | trips |
---|---|---|---|---|
0 | B02512 | 1/1/2015 | 190 | 1132 |
1 | B02765 | 1/1/2015 | 225 | 1765 |
2 | B02764 | 1/1/2015 | 3427 | 29421 |
3 | B02682 | 1/1/2015 | 945 | 7679 |
4 | B02617 | 1/1/2015 | 1228 | 9537 |
#data description
raw_data.describe(include='all')
| | dispatching_base_number | date | active_vehicles | trips |
---|---|---|---|---|
count | 354 | 354 | 354.000000 | 354.000000 |
unique | 6 | 59 | NaN | NaN |
top | B02598 | 1/27/2015 | NaN | NaN |
freq | 59 | 6 | NaN | NaN |
mean | NaN | NaN | 1307.435028 | 11667.316384 |
std | NaN | NaN | 1162.510626 | 10648.284865 |
min | NaN | NaN | 112.000000 | 629.000000 |
25% | NaN | NaN | 296.750000 | 2408.500000 |
50% | NaN | NaN | 1077.000000 | 9601.000000 |
75% | NaN | NaN | 1417.000000 | 13711.250000 |
max | NaN | NaN | 4395.000000 | 45858.000000 |
2.3 Data Cleaning
#check null values
raw_data.isnull().sum()
dispatching_base_number    0
date                       0
active_vehicles            0
trips                      0
dtype: int64
raw_data.duplicated().sum()
0
# View the distribution of active vehicles
sns.histplot(raw_data['active_vehicles'], kde=True, stat='density')
<AxesSubplot:xlabel='active_vehicles', ylabel='Density'>
# View the distribution of trips
sns.histplot(raw_data['trips'], kde=True, stat='density')
<AxesSubplot:xlabel='trips', ylabel='Density'>
#95% quantiles for active vehicles and trips, kept in separate variables
quant_vehicles = raw_data['active_vehicles'].quantile(0.95)
quant_vehicles
3904.7499999999986
quant_trips = raw_data['trips'].quantile(0.95)
quant_trips
36025.34999999999
#Quantile cut-off for trips
#Only the trips cut-off is applied to data1; quant_vehicles is computed for reference and is not used as a filter,
#which is why active_vehicles values above 3904.75 remain in the statistics that follow
data1 = raw_data[raw_data['trips'] < quant_trips]
data1.describe(include='all')
| | dispatching_base_number | date | active_vehicles | trips |
---|---|---|---|---|
count | 336 | 336 | 336.000000 | 336.000000 |
unique | 6 | 59 | NaN | NaN |
top | B02598 | 1/27/2015 | NaN | NaN |
freq | 59 | 6 | NaN | NaN |
mean | NaN | NaN | 1163.815476 | 10141.514881 |
std | NaN | NaN | 1006.895472 | 8548.456980 |
min | NaN | NaN | 112.000000 | 629.000000 |
25% | NaN | NaN | 288.250000 | 2265.750000 |
50% | NaN | NaN | 1056.500000 | 9467.500000 |
75% | NaN | NaN | 1368.250000 | 12707.000000 |
max | NaN | NaN | 4093.000000 | 35990.000000 |
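The cells above compute one 95% quantile per column. A minimal sketch of how both cut-offs could be combined in a single boolean mask follows. It uses toy values and hypothetical names (`toy`, `q_vehicles`, `q_trips`, `mask`); the tables in this notebook were not produced with it.

```python
import pandas as pd

# Toy data in the shape of the dataset (illustrative values, not the real file)
toy = pd.DataFrame({
    "active_vehicles": [100, 200, 300, 400, 5000],
    "trips": [1000, 2000, 3000, 90000, 4000],
})

# One cut-off per column, stored separately so neither overwrites the other
q_vehicles = toy["active_vehicles"].quantile(0.95)
q_trips = toy["trips"].quantile(0.95)

# Both conditions must hold for a row to be kept
mask = (toy["active_vehicles"] < q_vehicles) & (toy["trips"] < q_trips)
filtered = toy[mask]
print(filtered)
```

Combining the conditions with `&` avoids the situation where a second assignment silently replaces the result of the first filter.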
#Distribution plot after the quantile cut-off; the distribution is still irregular
sns.histplot(data1['active_vehicles'], kde=True, stat='density')
<AxesSubplot:xlabel='active_vehicles', ylabel='Density'>
#Next, restrict active_vehicles to the range 600 to 4000 (both bounds inclusive)
#The main mass of the distribution lies on the left side
data2 = data1[data1['active_vehicles'].between(600, 4000)]
plt.xlim(0, 4000)
sns.histplot(data2['active_vehicles'], kde=True, stat='density')
<AxesSubplot:xlabel='active_vehicles', ylabel='Density'>
#Distribution of trips after the filtering steps so far
sns.histplot(data2['trips'], kde=True, stat='density')
<AxesSubplot:xlabel='trips', ylabel='Density'>
#Define value ranges
#Only the lower bound (trips >= 2000) is applied to data2; an upper bound of 20000 is not enforced,
#which is why the statistics that follow still contain trips up to 35990
data2 = data1[data1['trips'] >= 2000]
plt.xlim(1000, 17000)
sns.histplot(data2['trips'], kde=True, stat='density')
<AxesSubplot:xlabel='trips', ylabel='Density'>
3. Data Preparation
#Only the index is reset here; 'date' and 'dispatching_base_number' remain string columns
data_cleaned = data2.reset_index(drop=True)
data_cleaned.describe(include='all')
| | dispatching_base_number | date | active_vehicles | trips |
---|---|---|---|---|
count | 266 | 266 | 266.000000 | 266.000000 |
unique | 6 | 59 | NaN | NaN |
top | B02598 | 2/5/2015 | NaN | NaN |
freq | 59 | 6 | NaN | NaN |
mean | NaN | NaN | 1411.646617 | 12414.560150 |
std | NaN | NaN | 992.671651 | 8213.348133 |
min | NaN | NaN | 236.000000 | 2016.000000 |
25% | NaN | NaN | 925.500000 | 7620.500000 |
50% | NaN | NaN | 1181.000000 | 10727.000000 |
75% | NaN | NaN | 1439.500000 | 13842.500000 |
max | NaN | NaN | 4093.000000 | 35990.000000 |
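In the statistics above, `date` still behaves like a text column (it reports `unique` and `top`). A minimal sketch of converting such a column to real timestamps and deriving calendar features is shown here on toy rows. The column names `weekday` and `is_weekend` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy rows using the month/day/year format seen in the dataset
toy = pd.DataFrame({"date": ["1/1/2015", "1/27/2015", "2/5/2015"]})

# The converted values must be assigned back to the column to take effect
toy["date"] = pd.to_datetime(toy["date"], format="%m/%d/%Y")

# Calendar features that could replace one dummy column per date
toy["weekday"] = toy["date"].dt.dayofweek   # Monday = 0, Sunday = 6
toy["is_weekend"] = toy["weekday"] >= 5
print(toy)
```

A weekday feature needs at most six dummy columns, compared with one column per distinct date when the raw date string is encoded.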
#Assuming the data has been cleaned as far as possible, we continue with the OLS assumptions.
#First, a scatter plot showing the spread and density of the values
#Main feature: most data points lie below x=2000 and y=20000
#However, values in the range of 3000 to over 4000 must also be covered by the model
plt.scatter(data_cleaned['active_vehicles'], data_cleaned['trips'])
<matplotlib.collections.PathCollection at 0x11ed7d310>
#Related to statement(s) from the line above.
#Main distribution here is in the 2000 range
plt.scatter(data_cleaned['active_vehicles'], data_cleaned['trips'])
plt.xlim(0,2000)
plt.ylim(0,20000)
(0.0, 20000.0)
#A few more checks on the other variables
plt.scatter(data_cleaned['trips'], data_cleaned['dispatching_base_number'])
<matplotlib.collections.PathCollection at 0x11ef43e80>
plt.scatter(data_cleaned['dispatching_base_number'], data_cleaned['active_vehicles'])
<matplotlib.collections.PathCollection at 0x11f02a340>
#Transformation for continuous variables
#Log: the log transformation reduces skewness in skewed data (see the distributions above)
log_trips = np.log(data_cleaned['trips'])
data_cleaned['log_trips'] = log_trips
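The effect of the log transformation on skewness can be checked numerically. The sketch below uses made-up values that double from one row to the next (not taken from the dataset), so their logarithms are evenly spaced; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy right-skewed values: each one is twice the previous one
trips = pd.Series([2000, 4000, 8000, 16000, 32000], dtype=float)
log_trips = np.log(trips)

# Series.skew() returns the sample skewness; positive means a long right tail
print(trips.skew(), log_trips.skew())

# np.exp inverts np.log, so predictions on the log scale can be mapped back
restored = np.exp(log_trips)
```

On real data the skewness rarely drops to zero, but a clearly smaller value after the transformation supports using `log_trips` as an input.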
#Linearity is not present-> Violation of OLS Assumption
plt.scatter(data_cleaned['dispatching_base_number'], data_cleaned['log_trips'])
<matplotlib.collections.PathCollection at 0x11f099d90>
#List the columns: after the log transformation, trips appears twice ('trips' and 'log_trips')
data_cleaned.columns.values
array(['dispatching_base_number', 'date', 'active_vehicles', 'trips', 'log_trips'], dtype=object)
#data_cleaned = data_cleaned.drop(['trips'], axis=1)
#Variance Inflation Factor (VIF).
#If the VIF is between 5 and 10, multicollinearity is probably present and you should consider dropping the variable.
from statsmodels.stats.outliers_influence import variance_inflation_factor
variables = data_cleaned[['active_vehicles', 'log_trips']]
vif = pd.DataFrame()
vif["VIF"]= [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif["Features"]= variables.columns
#VIF is output and the values are <5
vif
| | VIF | Features |
---|---|---|
0 | 3.684423 | active_vehicles |
1 | 3.684423 | log_trips |
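The VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on the remaining ones. The sketch below spells this out with NumPy only and includes an intercept in each auxiliary regression. The helper `vif_with_intercept` and the synthetic data are assumptions made for illustration; this is not the statsmodels implementation, which, as far as its documentation describes, uses the design matrix exactly as passed, so a constant column has to be added by the caller if an intercept is wanted.

```python
import numpy as np

def vif_with_intercept(X):
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on an intercept plus all other columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic, strongly correlated predictors (illustrative only)
rng = np.random.default_rng(0)
x1 = rng.normal(loc=100.0, scale=10.0, size=200)
x2 = 0.5 * x1 + rng.normal(scale=1.0, size=200)
vifs = vif_with_intercept(np.column_stack([x1, x2]))
print(vifs)
```

With exactly two predictors both VIF values coincide, which matches the identical entries in the table above.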
data_cleaned.describe(include='all')
| | dispatching_base_number | date | active_vehicles | trips | log_trips |
---|---|---|---|---|---|
count | 266 | 266 | 266.000000 | 266.000000 | 266.000000 |
unique | 6 | 59 | NaN | NaN | NaN |
top | B02598 | 2/5/2015 | NaN | NaN | NaN |
freq | 59 | 6 | NaN | NaN | NaN |
mean | NaN | NaN | 1411.646617 | 12414.560150 | 9.208195 |
std | NaN | NaN | 992.671651 | 8213.348133 | 0.697698 |
min | NaN | NaN | 236.000000 | 2016.000000 | 7.608871 |
25% | NaN | NaN | 925.500000 | 7620.500000 | 8.938593 |
50% | NaN | NaN | 1181.000000 | 10727.000000 | 9.280516 |
75% | NaN | NaN | 1439.500000 | 13842.500000 | 9.535498 |
max | NaN | NaN | 4093.000000 | 35990.000000 | 10.490996 |
#Linearity could be improved by LOG transformation
plt.scatter(data_cleaned['active_vehicles'], data_cleaned['log_trips'])
<matplotlib.collections.PathCollection at 0x11f1e4d90>
3.1 Recoding Categorical Variables
#Convert categorical data into dummy/indicator variables. Focus on the base number for a more general model
data_with_dummies = pd.get_dummies(data_cleaned, drop_first = True)
# Dummy variables head output
data_with_dummies.head()
| | active_vehicles | trips | log_trips | dispatching_base_number_B02598 | dispatching_base_number_B02617 | dispatching_base_number_B02682 | dispatching_base_number_B02764 | dispatching_base_number_B02765 | date_1/10/2015 | date_1/11/2015 | ... | date_2/26/2015 | date_2/27/2015 | date_2/28/2015 | date_2/3/2015 | date_2/4/2015 | date_2/5/2015 | date_2/6/2015 | date_2/7/2015 | date_2/8/2015 | date_2/9/2015 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3427 | 29421 | 10.289464 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 945 | 7679 | 8.946245 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1228 | 9537 | 9.162934 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 870 | 6903 | 8.839711 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 785 | 4768 | 8.469682 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 66 columns
data_with_dummies.columns.values
array(['active_vehicles', 'trips', 'log_trips', 'dispatching_base_number_B02598', 'dispatching_base_number_B02617', 'dispatching_base_number_B02682', 'dispatching_base_number_B02764', 'dispatching_base_number_B02765', 'date_1/10/2015', 'date_1/11/2015', 'date_1/12/2015', 'date_1/13/2015', 'date_1/14/2015', 'date_1/15/2015', 'date_1/16/2015', 'date_1/17/2015', 'date_1/18/2015', 'date_1/19/2015', 'date_1/2/2015', 'date_1/20/2015', 'date_1/21/2015', 'date_1/22/2015', 'date_1/23/2015', 'date_1/24/2015', 'date_1/25/2015', 'date_1/26/2015', 'date_1/27/2015', 'date_1/28/2015', 'date_1/29/2015', 'date_1/3/2015', 'date_1/30/2015', 'date_1/31/2015', 'date_1/4/2015', 'date_1/5/2015', 'date_1/6/2015', 'date_1/7/2015', 'date_1/8/2015', 'date_1/9/2015', 'date_2/1/2015', 'date_2/10/2015', 'date_2/11/2015', 'date_2/12/2015', 'date_2/13/2015', 'date_2/14/2015', 'date_2/15/2015', 'date_2/16/2015', 'date_2/17/2015', 'date_2/18/2015', 'date_2/19/2015', 'date_2/2/2015', 'date_2/20/2015', 'date_2/21/2015', 'date_2/22/2015', 'date_2/23/2015', 'date_2/24/2015', 'date_2/25/2015', 'date_2/26/2015', 'date_2/27/2015', 'date_2/28/2015', 'date_2/3/2015', 'date_2/4/2015', 'date_2/5/2015', 'date_2/6/2015', 'date_2/7/2015', 'date_2/8/2015', 'date_2/9/2015'], dtype=object)
#Columns for the later model are provided
cols=['active_vehicles','log_trips','dispatching_base_number_B02598',
'dispatching_base_number_B02617', 'dispatching_base_number_B02682',
'dispatching_base_number_B02764', 'dispatching_base_number_B02765']
#-> At this point it was decided that the variable 'Date' is no longer useful and too costly for model building
#-> It is therefore not considered for the model
#Arrangement-> First column becomes target variable
data_preprocessed = data_with_dummies[cols]
data_preprocessed.head()
| | active_vehicles | log_trips | dispatching_base_number_B02598 | dispatching_base_number_B02617 | dispatching_base_number_B02682 | dispatching_base_number_B02764 | dispatching_base_number_B02765 |
---|---|---|---|---|---|---|---|
0 | 3427 | 10.289464 | 0 | 0 | 0 | 1 | 0 |
1 | 945 | 8.946245 | 0 | 0 | 1 | 0 | 0 |
2 | 1228 | 9.162934 | 0 | 1 | 0 | 0 | 0 |
3 | 870 | 8.839711 | 1 | 0 | 0 | 0 | 0 |
4 | 785 | 8.469682 | 1 | 0 | 0 | 0 | 0 |
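Since only the base-number dummies are kept for the model, the encoding can be limited to that column from the start. A minimal sketch on toy rows, using the `columns` parameter of `pd.get_dummies`; the frame `toy` and the name `encoded` are illustrative.

```python
import pandas as pd

# Toy rows in the shape of the dataset
toy = pd.DataFrame({
    "dispatching_base_number": ["B02512", "B02598", "B02617", "B02598"],
    "date": ["1/1/2015", "1/1/2015", "1/2/2015", "1/2/2015"],
    "active_vehicles": [190, 870, 1228, 785],
})

# columns=[...] restricts the encoding to the listed column; 'date' is left as it is
encoded = pd.get_dummies(toy, columns=["dispatching_base_number"], drop_first=True)
print(encoded.columns.tolist())
```

With `drop_first=True` the first category (here B02512) becomes the reference level and gets no column of its own, as in the table above.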
3.2 Creating Test and Training Data
targets = data_preprocessed['active_vehicles']
inputs = data_preprocessed.drop(['active_vehicles'], axis=1)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(inputs)
input_scaled = scaler.transform(inputs)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(input_scaled, targets, test_size=0.20, random_state=365)
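In the cell above the scaler is fitted on all rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative, sketched here on synthetic data with illustrative names, splits first and fits the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic inputs and targets (illustrative only)
rng = np.random.default_rng(365)
X = rng.normal(loc=9.0, scale=0.7, size=(100, 2))
y = rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=365
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from the training rows only
X_test_scaled = scaler.transform(X_test)        # test rows reuse the training mean and std
```

Whether this changes the scores noticeably depends on the data; the point of the ordering is that the test set stays unseen during every fitting step.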
4. Modeling and Evaluation
4.1 Linear Regression
reg = LinearRegression()
reg.fit(x_train, y_train)
LinearRegression()
y_hat = reg.predict(x_train)
#The division of values is also noticeable here
plt.scatter(y_train, y_hat)
plt.xlabel('Targets')
plt.ylabel('Predictions')
plt.show()
#Error term: training targets minus predicted values
sns.histplot(y_train - y_hat, kde=True, stat='density')
plt.title("Residuals")
#plt.show()
Text(0.5, 1.0, 'Residuals')
#R-squared: share of the variance of the target variable that the model explains
reg.score(x_train, y_train)
0.9747815666217599
# Intercept of the regression line
reg.intercept_
1413.3913426862414
reg.score(x_test, y_test)
0.9849060957803961
print('training Performance')
print(reg.score(x_train,y_train))
print('test Performance')
print(reg.score(x_test,y_test))
training Performance
0.9747815666217599
test Performance
0.9849060957803961
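For a regressor, `score` returns the coefficient of determination R² = 1 - SS_res / SS_tot. The sketch below recomputes it by hand on synthetic data to make the definition concrete; the data and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic linear relationship with noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=50)

reg = LinearRegression().fit(X, y)
y_hat = reg.predict(X)

ss_res = np.sum((y - y_hat) ** 2)     # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, reg.score(X, y))
```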
#Regression coefficients for the model
reg.coef_
array([432.4680536 , -55.50352875, 11.88834549, -20.8881854 , 611.37694067, -50.75412983])
#Which weights matter most to the model (coefficients divided by 1000 for display):
reg_summary = pd.DataFrame(inputs.columns, columns = ['Features'])
reg_summary['Weights']= reg.coef_/1000
reg_summary
| | Features | Weights |
---|---|---|
0 | log_trips | 0.432468 |
1 | dispatching_base_number_B02598 | -0.055504 |
2 | dispatching_base_number_B02617 | 0.011888 |
3 | dispatching_base_number_B02682 | -0.020888 |
4 | dispatching_base_number_B02764 | 0.611377 |
5 | dispatching_base_number_B02765 | -0.050754 |
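Because the inputs were standardized before fitting, each weight describes the change in the prediction per one standard deviation of its input, not per original unit. A minimal sketch of converting such coefficients back to original units, assuming a `StandardScaler` and an exactly linear toy relationship; the names `coef_original` and `intercept_original` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy data with a known relationship: y = 400 * x + 50
rng = np.random.default_rng(1)
X = rng.normal(loc=9.0, scale=0.7, size=(200, 1))
y = 400.0 * X[:, 0] + 50.0

scaler = StandardScaler().fit(X)
reg = LinearRegression().fit(scaler.transform(X), y)

# Undo the scaling: divide by the per-feature standard deviation
coef_original = reg.coef_ / scaler.scale_
intercept_original = reg.intercept_ - np.sum(reg.coef_ * scaler.mean_ / scaler.scale_)
print(coef_original, intercept_original)
```

Standardized weights are convenient for comparing the relative influence of inputs, while the converted values are easier to read as "vehicles per unit of input".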
#Testing the model
y_hat_test = reg.predict(x_test)
plt.scatter(y_test, y_hat_test, alpha=0.2)
plt.xlabel('Targets Test')
plt.ylabel('Predictions Test')
plt.show()
#Collect the test-set predictions in a DataFrame
df_pf = pd.DataFrame((y_hat_test), columns = ['Predictions'])
df_pf.head()
| | Predictions |
---|---|
0 | 1091.364516 |
1 | 1057.719388 |
2 | 1170.036887 |
3 | 1259.987262 |
4 | 1078.100005 |
y_test=y_test.reset_index(drop=True)
y_test.head()
0    1218
1    1151
2    1027
3    1330
4     945
Name: active_vehicles, dtype: int64
df_pf['Targets'] = y_test
df_pf.head()
| | Predictions | Targets |
---|---|---|
0 | 1091.364516 | 1218 |
1 | 1057.719388 | 1151 |
2 | 1170.036887 | 1027 |
3 | 1259.987262 | 1330 |
4 | 1078.100005 | 945 |
#Overview of how much the predictions and the actual values differ
y_pred = reg.predict(x_test)
test = pd.DataFrame({'Predicted': y_pred, 'Actual': y_test})
fig = plt.figure(figsize=(16, 8))
test = test.reset_index(drop=True)
plt.plot(test[:50])
#The legend order follows the column order of the DataFrame
plt.legend(['Predicted', 'Actual'])
sns.jointplot(x='Actual', y='Predicted', data=test, kind='reg');
4.2 Evaluation
df_pf
| | Predictions | Targets |
---|---|---|
0 | 1091.364516 | 1218 |
1 | 1057.719388 | 1151 |
2 | 1170.036887 | 1027 |
3 | 1259.987262 | 1330 |
4 | 1078.100005 | 945 |
5 | 1394.956731 | 1223 |
6 | 1506.660989 | 1526 |
7 | 1358.148148 | 1418 |
8 | 990.063854 | 975 |
9 | 1384.551075 | 1350 |
10 | 1343.005058 | 1300 |
11 | 967.122403 | 974 |
12 | 1426.599642 | 1321 |
13 | 1200.898158 | 1295 |
14 | 985.642063 | 1084 |
15 | 271.859854 | 322 |
16 | 672.261307 | 746 |
17 | 114.609218 | 299 |
18 | 1476.113622 | 1551 |
19 | 1289.180154 | 1356 |
20 | 1279.616087 | 1248 |
21 | 3584.319579 | 3658 |
22 | 1434.194940 | 1456 |
23 | 1349.242107 | 1405 |
24 | 162.148057 | 309 |
25 | 114.034467 | 252 |
26 | 1358.412778 | 1471 |
27 | 237.654959 | 269 |
28 | 1479.194768 | 1452 |
29 | 892.269643 | 786 |
30 | 3653.514821 | 3820 |
31 | 3478.878964 | 3473 |
32 | 3534.709964 | 3186 |
33 | 900.021763 | 685 |
34 | 214.674054 | 256 |
35 | 639.482237 | 566 |
36 | 1475.478612 | 1293 |
37 | 872.131134 | 944 |
38 | 1226.884458 | 1186 |
39 | 1195.639823 | 1044 |
40 | 1432.790074 | 1383 |
41 | 3575.285742 | 3736 |
42 | 1098.867095 | 1039 |
43 | 1095.121958 | 994 |
44 | 1296.851003 | 1429 |
45 | 1458.344504 | 1532 |
46 | 1257.183801 | 1262 |
47 | 1294.327396 | 1214 |
48 | 3587.913110 | 3478 |
49 | 3572.568763 | 3427 |
50 | 694.462976 | 521 |
51 | 1241.508649 | 1216 |
52 | 957.733177 | 909 |
53 | 1277.755356 | 1188 |
df_pf['Residuals'] = df_pf['Targets']-df_pf['Predictions']
df_pf
| | Predictions | Targets | Residuals |
---|---|---|---|
0 | 1091.364516 | 1218 | 126.635484 |
1 | 1057.719388 | 1151 | 93.280612 |
2 | 1170.036887 | 1027 | -143.036887 |
3 | 1259.987262 | 1330 | 70.012738 |
4 | 1078.100005 | 945 | -133.100005 |
5 | 1394.956731 | 1223 | -171.956731 |
6 | 1506.660989 | 1526 | 19.339011 |
7 | 1358.148148 | 1418 | 59.851852 |
8 | 990.063854 | 975 | -15.063854 |
9 | 1384.551075 | 1350 | -34.551075 |
10 | 1343.005058 | 1300 | -43.005058 |
11 | 967.122403 | 974 | 6.877597 |
12 | 1426.599642 | 1321 | -105.599642 |
13 | 1200.898158 | 1295 | 94.101842 |
14 | 985.642063 | 1084 | 98.357937 |
15 | 271.859854 | 322 | 50.140146 |
16 | 672.261307 | 746 | 73.738693 |
17 | 114.609218 | 299 | 184.390782 |
18 | 1476.113622 | 1551 | 74.886378 |
19 | 1289.180154 | 1356 | 66.819846 |
20 | 1279.616087 | 1248 | -31.616087 |
21 | 3584.319579 | 3658 | 73.680421 |
22 | 1434.194940 | 1456 | 21.805060 |
23 | 1349.242107 | 1405 | 55.757893 |
24 | 162.148057 | 309 | 146.851943 |
25 | 114.034467 | 252 | 137.965533 |
26 | 1358.412778 | 1471 | 112.587222 |
27 | 237.654959 | 269 | 31.345041 |
28 | 1479.194768 | 1452 | -27.194768 |
29 | 892.269643 | 786 | -106.269643 |
30 | 3653.514821 | 3820 | 166.485179 |
31 | 3478.878964 | 3473 | -5.878964 |
32 | 3534.709964 | 3186 | -348.709964 |
33 | 900.021763 | 685 | -215.021763 |
34 | 214.674054 | 256 | 41.325946 |
35 | 639.482237 | 566 | -73.482237 |
36 | 1475.478612 | 1293 | -182.478612 |
37 | 872.131134 | 944 | 71.868866 |
38 | 1226.884458 | 1186 | -40.884458 |
39 | 1195.639823 | 1044 | -151.639823 |
40 | 1432.790074 | 1383 | -49.790074 |
41 | 3575.285742 | 3736 | 160.714258 |
42 | 1098.867095 | 1039 | -59.867095 |
43 | 1095.121958 | 994 | -101.121958 |
44 | 1296.851003 | 1429 | 132.148997 |
45 | 1458.344504 | 1532 | 73.655496 |
46 | 1257.183801 | 1262 | 4.816199 |
47 | 1294.327396 | 1214 | -80.327396 |
48 | 3587.913110 | 3478 | -109.913110 |
49 | 3572.568763 | 3427 | -145.568763 |
50 | 694.462976 | 521 | -173.462976 |
51 | 1241.508649 | 1216 | -25.508649 |
52 | 957.733177 | 909 | -48.733177 |
53 | 1277.755356 | 1188 | -89.755356 |
#Residuals shows the deviation
#Display difference in percent for Prediction & Target
df_pf['Difference%']=np.absolute(df_pf['Residuals']/df_pf['Targets']*100)
df_pf
| | Predictions | Targets | Residuals | Difference% |
---|---|---|---|---|
0 | 1091.364516 | 1218 | 126.635484 | 10.397002 |
1 | 1057.719388 | 1151 | 93.280612 | 8.104310 |
2 | 1170.036887 | 1027 | -143.036887 | 13.927642 |
3 | 1259.987262 | 1330 | 70.012738 | 5.264116 |
4 | 1078.100005 | 945 | -133.100005 | 14.084657 |
5 | 1394.956731 | 1223 | -171.956731 | 14.060240 |
6 | 1506.660989 | 1526 | 19.339011 | 1.267301 |
7 | 1358.148148 | 1418 | 59.851852 | 4.220864 |
8 | 990.063854 | 975 | -15.063854 | 1.545011 |
9 | 1384.551075 | 1350 | -34.551075 | 2.559339 |
10 | 1343.005058 | 1300 | -43.005058 | 3.308081 |
11 | 967.122403 | 974 | 6.877597 | 0.706119 |
12 | 1426.599642 | 1321 | -105.599642 | 7.993917 |
13 | 1200.898158 | 1295 | 94.101842 | 7.266551 |
14 | 985.642063 | 1084 | 98.357937 | 9.073610 |
15 | 271.859854 | 322 | 50.140146 | 15.571474 |
16 | 672.261307 | 746 | 73.738693 | 9.884543 |
17 | 114.609218 | 299 | 184.390782 | 61.669158 |
18 | 1476.113622 | 1551 | 74.886378 | 4.828264 |
19 | 1289.180154 | 1356 | 66.819846 | 4.927717 |
20 | 1279.616087 | 1248 | -31.616087 | 2.533340 |
21 | 3584.319579 | 3658 | 73.680421 | 2.014227 |
22 | 1434.194940 | 1456 | 21.805060 | 1.497600 |
23 | 1349.242107 | 1405 | 55.757893 | 3.968533 |
24 | 162.148057 | 309 | 146.851943 | 47.524901 |
25 | 114.034467 | 252 | 137.965533 | 54.748227 |
26 | 1358.412778 | 1471 | 112.587222 | 7.653788 |
27 | 237.654959 | 269 | 31.345041 | 11.652431 |
28 | 1479.194768 | 1452 | -27.194768 | 1.872918 |
29 | 892.269643 | 786 | -106.269643 | 13.520311 |
30 | 3653.514821 | 3820 | 166.485179 | 4.358251 |
31 | 3478.878964 | 3473 | -5.878964 | 0.169276 |
32 | 3534.709964 | 3186 | -348.709964 | 10.945071 |
33 | 900.021763 | 685 | -215.021763 | 31.390038 |
34 | 214.674054 | 256 | 41.325946 | 16.142948 |
35 | 639.482237 | 566 | -73.482237 | 12.982727 |
36 | 1475.478612 | 1293 | -182.478612 | 14.112808 |
37 | 872.131134 | 944 | 71.868866 | 7.613227 |
38 | 1226.884458 | 1186 | -40.884458 | 3.447256 |
39 | 1195.639823 | 1044 | -151.639823 | 14.524887 |
40 | 1432.790074 | 1383 | -49.790074 | 3.600150 |
41 | 3575.285742 | 3736 | 160.714258 | 4.301773 |
42 | 1098.867095 | 1039 | -59.867095 | 5.761992 |
43 | 1095.121958 | 994 | -101.121958 | 10.173235 |
44 | 1296.851003 | 1429 | 132.148997 | 9.247655 |
45 | 1458.344504 | 1532 | 73.655496 | 4.807800 |
46 | 1257.183801 | 1262 | 4.816199 | 0.381632 |
47 | 1294.327396 | 1214 | -80.327396 | 6.616754 |
48 | 3587.913110 | 3478 | -109.913110 | 3.160239 |
49 | 3572.568763 | 3427 | -145.568763 | 4.247702 |
50 | 694.462976 | 521 | -173.462976 | 33.294237 |
51 | 1241.508649 | 1216 | -25.508649 | 2.097751 |
52 | 957.733177 | 909 | -48.733177 | 5.361186 |
53 | 1277.755356 | 1188 | -89.755356 | 7.555165 |
df_pf.describe()
| | Predictions | Targets | Residuals | Difference% |
---|---|---|---|---|
count | 54.000000 | 54.000000 | 54.000000 | 54.000000 |
mean | 1388.186984 | 1379.592593 | -8.594392 | 10.443333 |
std | 929.385603 | 918.172272 | 112.470052 | 12.721770 |
min | 114.034467 | 252.000000 | -348.709964 | 0.169276 |
25% | 971.752318 | 952.250000 | -87.398366 | 3.485480 |
50% | 1258.585532 | 1220.500000 | -0.531383 | 6.941653 |
75% | 1431.242466 | 1426.250000 | 73.674190 | 12.650153 |
max | 3653.514821 | 3820.000000 | 184.390782 | 61.669158 |
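The residual table can be condensed into a few standard error metrics. The sketch below computes MAE, RMSE, and MAPE for the first five rows of the table above only, so the numbers it prints describe those five rows and not the whole test set; the variable names are illustrative.

```python
import numpy as np

# First five rows of the prediction table above
predictions = np.array([1091.364516, 1057.719388, 1170.036887, 1259.987262, 1078.100005])
targets = np.array([1218, 1151, 1027, 1330, 945], dtype=float)

residuals = targets - predictions

mae = np.mean(np.abs(residuals))                    # mean absolute error, in vehicles
rmse = np.sqrt(np.mean(residuals ** 2))             # root mean squared error, in vehicles
mape = np.mean(np.abs(residuals / targets)) * 100   # mean absolute percentage error
print(mae, rmse, mape)
```

MAPE is the mean of the `Difference%` column, which makes it sensitive to rows with small targets such as the bases with only a few hundred active vehicles.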