1. Business Understanding¶
Cancellations are part of everyday life for hotels. For some time now, hotels have charged little or no cancellation fee, which increasingly encourages customers to make bookings that they later cancel or simply do not show up for at the last moment. Can machine learning be used to predict, from the attributes of a booking, whether it will be cancelled, so that the hotel can overbook accordingly and still be fully occupied when cancellations and no-shows occur?
2. Data and Data Understanding¶
The data set contains a wide range of information recorded about the guests and their bookings. Characteristics such as travelling with children or booking through a travel agency could indicate whether a booking is more likely to be cancelled.
2.1. Import of Relevant Modules¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
2.2. Read Data¶
df = pd.read_csv("https://storage.googleapis.com/ml-service-repository-datastorage/Prediction_cancellation_of_hotel_bookings_data.csv")
df.head()
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 |
2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 |
4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 |
5 rows × 32 columns
def attribute_description(data):
    longestColumnName = len(max(np.array(data.columns), key=len))
    for col in data.columns:
        description = ''
        col_dropna = data[col].dropna()
        example = col_dropna.sample(1).values[0]
        if type(example) == str:
            description = 'str '
            if len(col_dropna.unique()) < 10:
                description += '['
                description += '; '.join([f'"{name}"' for name in col_dropna.unique()])
                description += ']'
            else:
                description += '[ example: "' + example + '" ]'
        else:
            description = str(type(example))
        print(col.ljust(longestColumnName) + f': {description}')
attribute_description(df)
hotel                         : str ["Resort Hotel"; "City Hotel"]
is_canceled                   : <class 'numpy.int64'>
lead_time                     : <class 'numpy.int64'>
arrival_date_year             : <class 'numpy.int64'>
arrival_date_month            : str [ example: "May" ]
arrival_date_week_number      : <class 'numpy.int64'>
arrival_date_day_of_month     : <class 'numpy.int64'>
stays_in_weekend_nights       : <class 'numpy.int64'>
stays_in_week_nights          : <class 'numpy.int64'>
adults                        : <class 'numpy.int64'>
children                      : <class 'numpy.float64'>
babies                        : <class 'numpy.int64'>
meal                          : str ["BB"; "FB"; "HB"; "SC"; "Undefined"]
country                       : str [ example: "SWE" ]
market_segment                : str ["Direct"; "Corporate"; "Online TA"; "Offline TA/TO"; "Complementary"; "Groups"; "Undefined"; "Aviation"]
distribution_channel          : str ["Direct"; "Corporate"; "TA/TO"; "Undefined"; "GDS"]
is_repeated_guest             : <class 'numpy.int64'>
previous_cancellations        : <class 'numpy.int64'>
previous_bookings_not_canceled: <class 'numpy.int64'>
reserved_room_type            : str [ example: "A" ]
assigned_room_type            : str [ example: "C" ]
booking_changes               : <class 'numpy.int64'>
deposit_type                  : str ["No Deposit"; "Refundable"; "Non Refund"]
agent                         : <class 'numpy.float64'>
company                       : <class 'numpy.float64'>
days_in_waiting_list          : <class 'numpy.int64'>
customer_type                 : str ["Transient"; "Contract"; "Transient-Party"; "Group"]
adr                           : <class 'numpy.float64'>
required_car_parking_spaces   : <class 'numpy.int64'>
total_of_special_requests     : <class 'numpy.int64'>
reservation_status            : str ["Check-Out"; "Canceled"; "No-Show"]
reservation_status_date       : str [ example: "2017-02-27" ]
df.describe(include='all')
hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 119390 | 119390.000000 | 119390.000000 | 119390.000000 | 119390 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | ... | 119390 | 103050.000000 | 6797.000000 | 119390.000000 | 119390 | 119390.000000 | 119390.000000 | 119390.000000 | 119390 | 119390 |
unique | 2 | NaN | NaN | NaN | 12 | NaN | NaN | NaN | NaN | NaN | ... | 3 | NaN | NaN | NaN | 4 | NaN | NaN | NaN | 3 | 926 |
top | City Hotel | NaN | NaN | NaN | August | NaN | NaN | NaN | NaN | NaN | ... | No Deposit | NaN | NaN | NaN | Transient | NaN | NaN | NaN | Check-Out | 2015-10-21 |
freq | 79330 | NaN | NaN | NaN | 13877 | NaN | NaN | NaN | NaN | NaN | ... | 104641 | NaN | NaN | NaN | 89613 | NaN | NaN | NaN | 75166 | 1461 |
mean | NaN | 0.370416 | 104.011416 | 2016.156554 | NaN | 27.165173 | 15.798241 | 0.927599 | 2.500302 | 1.856403 | ... | NaN | 86.693382 | 189.266735 | 2.321149 | NaN | 101.831122 | 0.062518 | 0.571363 | NaN | NaN |
std | NaN | 0.482918 | 106.863097 | 0.707476 | NaN | 13.605138 | 8.780829 | 0.998613 | 1.908286 | 0.579261 | ... | NaN | 110.774548 | 131.655015 | 17.594721 | NaN | 50.535790 | 0.245291 | 0.792798 | NaN | NaN |
min | NaN | 0.000000 | 0.000000 | 2015.000000 | NaN | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | ... | NaN | 1.000000 | 6.000000 | 0.000000 | NaN | -6.380000 | 0.000000 | 0.000000 | NaN | NaN |
25% | NaN | 0.000000 | 18.000000 | 2016.000000 | NaN | 16.000000 | 8.000000 | 0.000000 | 1.000000 | 2.000000 | ... | NaN | 9.000000 | 62.000000 | 0.000000 | NaN | 69.290000 | 0.000000 | 0.000000 | NaN | NaN |
50% | NaN | 0.000000 | 69.000000 | 2016.000000 | NaN | 28.000000 | 16.000000 | 1.000000 | 2.000000 | 2.000000 | ... | NaN | 14.000000 | 179.000000 | 0.000000 | NaN | 94.575000 | 0.000000 | 0.000000 | NaN | NaN |
75% | NaN | 1.000000 | 160.000000 | 2017.000000 | NaN | 38.000000 | 23.000000 | 2.000000 | 3.000000 | 2.000000 | ... | NaN | 229.000000 | 270.000000 | 0.000000 | NaN | 126.000000 | 0.000000 | 1.000000 | NaN | NaN |
max | NaN | 1.000000 | 737.000000 | 2017.000000 | NaN | 53.000000 | 31.000000 | 19.000000 | 50.000000 | 55.000000 | ... | NaN | 535.000000 | 543.000000 | 391.000000 | NaN | 5400.000000 | 8.000000 | 5.000000 | NaN | NaN |
11 rows × 32 columns
2.3. Data Cleaning¶
# Correlation map of the numeric features
f, ax = plt.subplots(figsize=(18, 18))
sns.heatmap(df.corr(), annot=True, linewidths=0.5, fmt=".1f", ax=ax)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Correlation Map')
plt.show()
df.isnull().sum()
hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company                           112593
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_parking_spaces            0
total_of_special_requests              0
reservation_status                     0
reservation_status_date                0
dtype: int64
# Drop reservation_status and reservation_status_date (they directly reveal the
# outcome of a booking), the sparsely populated agent and company columns, and
# further columns that are not used as predictors.
df = df.drop(['reservation_status', 'reservation_status_date',
              'stays_in_weekend_nights', 'arrival_date_day_of_month',
              'arrival_date_year', 'arrival_date_month',
              'arrival_date_week_number', 'required_car_parking_spaces',
              'previous_bookings_not_canceled', 'total_of_special_requests',
              'agent', 'company', 'adr'], axis=1)
# Drop the remaining rows with missing values (children and country)
df = df.dropna(axis=0)
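A quick sanity check, as a minimal sketch, confirms that no missing values remain after the sparse columns are dropped and the incomplete rows removed:
# Sketch: verify that the cleaned DataFrame contains no missing values
assert df.isnull().sum().sum() == 0
print('Remaining rows and columns:', df.shape)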
2.4. Test for Multicollinearity¶
from statsmodels.stats.outliers_influence import variance_inflation_factor
variables = df[['lead_time', 'is_repeated_guest', 'adults', 'booking_changes', 'previous_cancellations', 'is_canceled', 'stays_in_week_nights', 'babies', 'days_in_waiting_list']]
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif['Features'] = variables.columns
vif
VIF | Features | |
---|---|---|
0 | 2.285568 | lead_time |
1 | 1.033605 | is_repeated_guest |
2 | 3.354523 | adults |
3 | 1.143147 | booking_changes |
4 | 1.037159 | previous_cancellations |
5 | 1.759416 | is_canceled |
6 | 2.680081 | stays_in_week_nights |
7 | 1.015332 | babies |
8 | 1.049190 | days_in_waiting_list |
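All VIF values lie well below the common rule-of-thumb threshold of 5, so multicollinearity between these numeric features is unlikely to be a problem. Such a check could also be automated; an illustrative sketch (the threshold of 5 is a convention, and high_vif is just an illustrative name):
# Sketch: flag features whose VIF exceeds a rule-of-thumb threshold of 5
high_vif = vif[vif['VIF'] > 5]
print(high_vif if not high_vif.empty else 'No feature exceeds a VIF of 5')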
2.5. Descriptive Analysis¶
df.hist(figsize=(25,25), bins=50)
[Histograms of the numeric features]
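Beyond the histograms, the class balance of the target is worth noting: according to the summary statistics above, roughly 37% of the bookings are cancelled. A short check, as a sketch:
# Sketch: share of cancelled (1) vs. not cancelled (0) bookings
print(df['is_canceled'].value_counts(normalize=True))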
3. Data Preparation¶
3.1. Recoding of Categorical Variables¶
df_dummies = pd.get_dummies(df, drop_first=True) # 0-1 encoding for categorical values
df_dummies.head()
is_canceled | lead_time | stays_in_week_nights | adults | children | babies | is_repeated_guest | previous_cancellations | booking_changes | days_in_waiting_list | ... | assigned_room_type_H | assigned_room_type_I | assigned_room_type_K | assigned_room_type_L | assigned_room_type_P | deposit_type_Non Refund | deposit_type_Refundable | customer_type_Group | customer_type_Transient | customer_type_Transient-Party | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 342 | 0 | 2 | 0.0 | 0 | 0 | 0 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 0 | 737 | 0 | 2 | 0.0 | 0 | 0 | 0 | 4 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 0 | 7 | 1 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 0 | 13 | 1 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 0 | 14 | 2 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 226 columns
#df_dummies.to_csv('train_dummies.csv', index = False)
df_dummies.axes[0]
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, ... 119380, 119381, 119382, 119383, 119384, 119385, 119386, 119387, 119388, 119389], dtype='int64', length=118898)
df_dummies.axes[1]
Index(['is_canceled', 'lead_time', 'stays_in_week_nights', 'adults', 'children', 'babies', 'is_repeated_guest', 'previous_cancellations', 'booking_changes', 'days_in_waiting_list', ... 'assigned_room_type_H', 'assigned_room_type_I', 'assigned_room_type_K', 'assigned_room_type_L', 'assigned_room_type_P', 'deposit_type_Non Refund', 'deposit_type_Refundable', 'customer_type_Group', 'customer_type_Transient', 'customer_type_Transient-Party'], dtype='object', length=226)
4. Modelling and Evaluation¶
4.1. Test and Train Data¶
target = df_dummies['is_canceled'] # feature to be predicted
predictors = df_dummies.drop(['is_canceled'], axis = 1) # all other features are used as predictors
predictors.head()
lead_time | stays_in_week_nights | adults | children | babies | is_repeated_guest | previous_cancellations | booking_changes | days_in_waiting_list | hotel_Resort Hotel | ... | assigned_room_type_H | assigned_room_type_I | assigned_room_type_K | assigned_room_type_L | assigned_room_type_P | deposit_type_Non Refund | deposit_type_Refundable | customer_type_Group | customer_type_Transient | customer_type_Transient-Party | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 342 | 0 | 2 | 0.0 | 0 | 0 | 0 | 3 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 737 | 0 | 2 | 0.0 | 0 | 0 | 0 | 4 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 7 | 1 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 13 | 1 | 1 | 0.0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 14 | 2 | 2 | 0.0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 225 columns
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, random_state=123)
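Because only about 37% of the bookings are cancellations, a stratified split would keep the cancellation rate identical in the training and test sets. This is an optional variant with illustrative variable names; the models below use the unstratified split above:
# Optional variant: stratify on the target so that train and test share the
# same cancellation rate (~37%); not used for the results reported below.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    predictors, target, test_size=0.2, random_state=123, stratify=target)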
4.2. Decision Tree¶
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
DecisionTreeClassifier()
tn, fp, fn, tp = confusion_matrix(y_test, tree.predict(X_test)).ravel()
print(tn, fp, fn, tp)
12983 2041 2295 6461
print(classification_report(y_train, tree.predict(X_train)))
              precision    recall  f1-score   support

           0       0.97      0.99      0.98     59721
           1       0.99      0.94      0.97     35397

    accuracy                           0.98     95118
   macro avg       0.98      0.97      0.97     95118
weighted avg       0.98      0.98      0.98     95118
print(classification_report(y_test, tree.predict(X_test)))
              precision    recall  f1-score   support

           0       0.85      0.86      0.86     15024
           1       0.76      0.74      0.75      8756

    accuracy                           0.82     23780
   macro avg       0.80      0.80      0.80     23780
weighted avg       0.82      0.82      0.82     23780
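The gap between the training accuracy (0.98) and the test accuracy (0.82) shows that the unconstrained tree overfits the training data. One way to counter this is to limit the tree's depth and minimum leaf size; the following sketch uses illustrative, untuned values and an illustrative variable name (pruned_tree):
# Sketch: a depth-limited tree trades training accuracy for better generalisation
pruned_tree = DecisionTreeClassifier(max_depth=10, min_samples_leaf=20, random_state=123)
pruned_tree.fit(X_train, y_train)
print(classification_report(y_test, pruned_tree.predict(X_test)))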
4.3. Logistic Regression¶
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
C:\Users\alexa\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:765: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
LogisticRegression()
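The ConvergenceWarning above indicates that the lbfgs solver stopped before converging. As the warning itself suggests, increasing max_iter or scaling the features helps; the following sketch shows the scaling variant (illustrative name logreg_scaled; the results reported below come from the unscaled model above):
# Sketch: scale the features so that lbfgs converges more easily
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
logreg_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
logreg_scaled.fit(X_train, y_train)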
print(confusion_matrix(y_test, logreg.predict(X_test)))
[[13634  1390]
 [ 3870  4886]]
conf_mat = confusion_matrix(y_test, logreg.predict(X_test))
df_cm = pd.DataFrame(conf_mat, index=['0','1'], columns=['0', '1'],)
fig = plt.figure(figsize=[10,7])
heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=14)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=14)
plt.ylabel('True label')
plt.xlabel('Predicted label')
Text(0.5, 39.5, 'Predicted label')
print(classification_report(y_test, logreg.predict(X_test)))
              precision    recall  f1-score   support

           0       0.78      0.91      0.84     15024
           1       0.78      0.56      0.65      8756

    accuracy                           0.78     23780
   macro avg       0.78      0.73      0.74     23780
weighted avg       0.78      0.78      0.77     23780
print(classification_report(y_train, logreg.predict(X_train)))
              precision    recall  f1-score   support

           0       0.78      0.91      0.84     59721
           1       0.78      0.56      0.66     35397

    accuracy                           0.78     95118
   macro avg       0.78      0.74      0.75     95118
weighted avg       0.78      0.78      0.77     95118
4.4. Random Forest¶
tree_depth = [5, 10, 20]
for i in tree_depth:
    rf = RandomForestClassifier(max_depth=i)
    rf.fit(X_train, y_train)
    print('Max tree depth: ', i)
    print('Train results: ', classification_report(y_train, rf.predict(X_train)))
    print('Test results: ', classification_report(y_test, rf.predict(X_test)))
Max tree depth:  5
Train results:
              precision    recall  f1-score   support

           0       0.73      1.00      0.84     59721
           1       1.00      0.37      0.54     35397

    accuracy                           0.77     95118
   macro avg       0.86      0.69      0.69     95118
weighted avg       0.83      0.77      0.73     95118

Test results:
              precision    recall  f1-score   support

           0       0.73      1.00      0.84     15024
           1       1.00      0.36      0.53      8756

    accuracy                           0.76     23780
   macro avg       0.86      0.68      0.69     23780
weighted avg       0.83      0.76      0.73     23780

Max tree depth:  10
Train results:
              precision    recall  f1-score   support

           0       0.75      0.98      0.85     59721
           1       0.94      0.45      0.61     35397

    accuracy                           0.78     95118
   macro avg       0.85      0.71      0.73     95118
weighted avg       0.82      0.78      0.76     95118

Test results:
              precision    recall  f1-score   support

           0       0.75      0.98      0.85     15024
           1       0.94      0.44      0.59      8756

    accuracy                           0.78     23780
   macro avg       0.84      0.71      0.72     23780
weighted avg       0.82      0.78      0.76     23780

Max tree depth:  20
Train results:
              precision    recall  f1-score   support

           0       0.81      0.97      0.88     59721
           1       0.92      0.62      0.74     35397

    accuracy                           0.84     95118
   macro avg       0.87      0.79      0.81     95118
weighted avg       0.85      0.84      0.83     95118

Test results:
              precision    recall  f1-score   support

           0       0.80      0.96      0.87     15024
           1       0.90      0.58      0.70      8756

    accuracy                           0.82     23780
   macro avg       0.85      0.77      0.79     23780
weighted avg       0.83      0.82      0.81     23780
rf = RandomForestClassifier(max_depth=20)
rf.fit(X_train, y_train)
print('Max tree depth: ', rf.max_depth)
print('Train results: ', classification_report(y_train, rf.predict(X_train)))
print('Test results: ', classification_report(y_test, rf.predict(X_test)))
Max tree depth:  20
Train results:
              precision    recall  f1-score   support

           0       0.80      0.98      0.88     59721
           1       0.94      0.60      0.73     35397

    accuracy                           0.84     95118
   macro avg       0.87      0.79      0.81     95118
weighted avg       0.85      0.84      0.83     95118

Test results:
              precision    recall  f1-score   support

           0       0.79      0.97      0.87     15024
           1       0.91      0.56      0.69      8756

    accuracy                           0.82     23780
   macro avg       0.85      0.76      0.78     23780
weighted avg       0.83      0.82      0.80     23780
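To see which predictors the final random forest relies on most, its feature_importances_ attribute can be inspected; a short sketch (importances is an illustrative name, and the column labels come from the dummy-encoded predictors above):
# Sketch: the ten most important features of the fitted random forest
importances = pd.Series(rf.feature_importances_, index=predictors.columns)
print(importances.sort_values(ascending=False).head(10))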