%% Cell type:markdown id: tags:
# 1. Business Understanding
%% Cell type:markdown id: tags:
Estimating customer behavior with respect to hotel cancellations in order to plan capacity. The use case is intended to test whether hotel cancellations can be predicted. Important goals for any business are maintaining valuable customer relationships, positive rankings, and capacity planning. Businesses therefore need an assessment of customer behavior regarding booking cancellations: if the risk that a customer cancels a booking can be estimated in advance, countermeasures can be initiated.
%% Cell type:markdown id: tags:
# 2. Data and Data Understanding
%% Cell type:markdown id: tags:
The data frame covers hotel bookings. The dataset for this demo was published on the Kaggle data science platform as a CSV file. The records capture a wide range of guest-related attributes: features such as bookings by families with children or bookings made through travel agencies can indicate whether these groups show a higher cancellation rate. The data frame contains booking information from two different hotels, with 119,390 observations and 32 features.
A correlation analysis shows that stays_in_weekend_nights and stays_in_week_nights correlate at about 0.5 (see the correlation map in section 2.3).
%% Cell type:markdown id: tags:
## 2.1. Import of Relevant Modules
%% Cell type:code id: tags:
``` python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
```
%% Cell type:markdown id: tags:
## 2.2. Read Data
%% Cell type:code id: tags:
``` python
# my_file = project.get_file()
# my_file.seek(0)
df = pd.read_csv("https://storage.googleapis.com/ml-service-repository-datastorage/Prediction_cancellation_of_hotel_bookings_data.csv")
df.head()
```
%% Output
hotel is_canceled lead_time arrival_date_year arrival_date_month \
0 Resort Hotel 0 342 2015 July
1 Resort Hotel 0 737 2015 July
2 Resort Hotel 0 7 2015 July
3 Resort Hotel 0 13 2015 July
4 Resort Hotel 0 14 2015 July
arrival_date_week_number arrival_date_day_of_month \
0 27 1
1 27 1
2 27 1
3 27 1
4 27 1
stays_in_weekend_nights stays_in_week_nights adults ... deposit_type \
0 0 0 2 ... No Deposit
1 0 0 2 ... No Deposit
2 0 1 1 ... No Deposit
3 0 1 1 ... No Deposit
4 0 2 2 ... No Deposit
agent company days_in_waiting_list customer_type adr \
0 NaN NaN 0 Transient 0.0
1 NaN NaN 0 Transient 0.0
2 NaN NaN 0 Transient 75.0
3 304.0 NaN 0 Transient 75.0
4 240.0 NaN 0 Transient 98.0
required_car_parking_spaces total_of_special_requests reservation_status \
0 0 0 Check-Out
1 0 0 Check-Out
2 0 0 Check-Out
3 0 0 Check-Out
4 0 1 Check-Out
reservation_status_date
0 2015-07-01
1 2015-07-01
2 2015-07-02
3 2015-07-02
4 2015-07-03
[5 rows x 32 columns]
%% Cell type:code id: tags:
``` python
def attribute_description(data):
    # For each column, print a short description: string columns list all
    # unique values (if fewer than 10) or show one sampled example; other
    # columns report the Python type of a sampled value.
    longestColumnName = len(max(np.array(data.columns), key=len))
    for col in data.columns:
        description = ''
        col_dropna = data[col].dropna()
        example = col_dropna.sample(1).values[0]
        if type(example) == str:
            description = 'str '
            if len(col_dropna.unique()) < 10:
                description += '['
                description += '; '.join([f'"{name}"' for name in col_dropna.unique()])
                description += ']'
            else:
                description += '[ example: "' + example + '" ]'
        else:
            description = str(type(example))
        print(col.ljust(longestColumnName) + f': {description}')
```
%% Cell type:code id: tags:
``` python
attribute_description(df)
```
%% Output
hotel : str ["Resort Hotel"; "City Hotel"]
is_canceled : <class 'numpy.int64'>
lead_time : <class 'numpy.int64'>
arrival_date_year : <class 'numpy.int64'>
arrival_date_month : str [ example: "May" ]
arrival_date_week_number : <class 'numpy.int64'>
arrival_date_day_of_month : <class 'numpy.int64'>
stays_in_weekend_nights : <class 'numpy.int64'>
stays_in_week_nights : <class 'numpy.int64'>
adults : <class 'numpy.int64'>
children : <class 'numpy.float64'>
babies : <class 'numpy.int64'>
meal : str ["BB"; "FB"; "HB"; "SC"; "Undefined"]
country : str [ example: "SWE" ]
market_segment : str ["Direct"; "Corporate"; "Online TA"; "Offline TA/TO"; "Complementary"; "Groups"; "Undefined"; "Aviation"]
distribution_channel : str ["Direct"; "Corporate"; "TA/TO"; "Undefined"; "GDS"]
is_repeated_guest : <class 'numpy.int64'>
previous_cancellations : <class 'numpy.int64'>
previous_bookings_not_canceled: <class 'numpy.int64'>
reserved_room_type : str [ example: "A" ]
assigned_room_type : str [ example: "C" ]
booking_changes : <class 'numpy.int64'>
deposit_type : str ["No Deposit"; "Refundable"; "Non Refund"]
agent : <class 'numpy.float64'>
company : <class 'numpy.float64'>
days_in_waiting_list : <class 'numpy.int64'>
customer_type : str ["Transient"; "Contract"; "Transient-Party"; "Group"]
adr : <class 'numpy.float64'>
required_car_parking_spaces : <class 'numpy.int64'>
total_of_special_requests : <class 'numpy.int64'>
reservation_status : str ["Check-Out"; "Canceled"; "No-Show"]
reservation_status_date : str [ example: "2017-02-27" ]
%% Cell type:code id: tags:
``` python
df.describe(include='all')
```
%% Output
hotel is_canceled lead_time arrival_date_year \
count 119390 119390.000000 119390.000000 119390.000000
unique 2 NaN NaN NaN
top City Hotel NaN NaN NaN
freq 79330 NaN NaN NaN
mean NaN 0.370416 104.011416 2016.156554
std NaN 0.482918 106.863097 0.707476
min NaN 0.000000 0.000000 2015.000000
25% NaN 0.000000 18.000000 2016.000000
50% NaN 0.000000 69.000000 2016.000000
75% NaN 1.000000 160.000000 2017.000000
max NaN 1.000000 737.000000 2017.000000
arrival_date_month arrival_date_week_number \
count 119390 119390.000000
unique 12 NaN
top August NaN
freq 13877 NaN
mean NaN 27.165173
std NaN 13.605138
min NaN 1.000000
25% NaN 16.000000
50% NaN 28.000000
75% NaN 38.000000
max NaN 53.000000
arrival_date_day_of_month stays_in_weekend_nights \
count 119390.000000 119390.000000
unique NaN NaN
top NaN NaN
freq NaN NaN
mean 15.798241 0.927599
std 8.780829 0.998613
min 1.000000 0.000000
25% 8.000000 0.000000
50% 16.000000 1.000000
75% 23.000000 2.000000
max 31.000000 19.000000
stays_in_week_nights adults ... deposit_type agent \
count 119390.000000 119390.000000 ... 119390 103050.000000
unique NaN NaN ... 3 NaN
top NaN NaN ... No Deposit NaN
freq NaN NaN ... 104641 NaN
mean 2.500302 1.856403 ... NaN 86.693382
std 1.908286 0.579261 ... NaN 110.774548
min 0.000000 0.000000 ... NaN 1.000000
25% 1.000000 2.000000 ... NaN 9.000000
50% 2.000000 2.000000 ... NaN 14.000000
75% 3.000000 2.000000 ... NaN 229.000000
max 50.000000 55.000000 ... NaN 535.000000
company days_in_waiting_list customer_type adr \
count 6797.000000 119390.000000 119390 119390.000000
unique NaN NaN 4 NaN
top NaN NaN Transient NaN
freq NaN NaN 89613 NaN
mean 189.266735 2.321149 NaN 101.831122
std 131.655015 17.594721 NaN 50.535790
min 6.000000 0.000000 NaN -6.380000
25% 62.000000 0.000000 NaN 69.290000
50% 179.000000 0.000000 NaN 94.575000
75% 270.000000 0.000000 NaN 126.000000
max 543.000000 391.000000 NaN 5400.000000
required_car_parking_spaces total_of_special_requests \
count 119390.000000 119390.000000
unique NaN NaN
top NaN NaN
freq NaN NaN
mean 0.062518 0.571363
std 0.245291 0.792798
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 1.000000
max 8.000000 5.000000
reservation_status reservation_status_date
count 119390 119390
unique 3 926
top Check-Out 2015-10-21
freq 75166 1461
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
[11 rows x 32 columns]
%% Cell type:markdown id: tags:
## 2.3. Data Cleaning
%% Cell type:code id: tags:
``` python
f,ax=plt.subplots(figsize = (18,18))
sns.heatmap(df.corr(),annot= True,linewidths=0.5,fmt = ".1f",ax=ax)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Correlation Map')
plt.show()
```
%% Output
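%% Cell type:markdown id: tags:
The pairwise value quoted in section 2 can be spot-checked directly from the loaded `df`; a minimal sketch:
%% Cell type:code id: tags:
``` python
# Spot-check the correlation between the two stay-length features
# (section 2 quotes a value of roughly 0.5).
print(df[['stays_in_weekend_nights', 'stays_in_week_nights']].corr())
```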
%% Cell type:code id: tags:
``` python
df.isnull().sum()
```
%% Output
hotel 0
is_canceled 0
lead_time 0
arrival_date_year 0
arrival_date_month 0
arrival_date_week_number 0
arrival_date_day_of_month 0
stays_in_weekend_nights 0
stays_in_week_nights 0
adults 0
children 4
babies 0
meal 0
country 488
market_segment 0
distribution_channel 0
is_repeated_guest 0
previous_cancellations 0
previous_bookings_not_canceled 0
reserved_room_type 0
assigned_room_type 0
booking_changes 0
deposit_type 0
agent 16340
company 112593
days_in_waiting_list 0
customer_type 0
adr 0
required_car_parking_spaces 0
total_of_special_requests 0
reservation_status 0
reservation_status_date 0
dtype: int64
%% Cell type:code id: tags:
``` python
df = df.drop(['reservation_status'], axis=1)
```
%% Cell type:code id: tags:
``` python
df = df.drop(['stays_in_weekend_nights'], axis=1)
```
%% Cell type:code id: tags:
``` python
df = df.drop(['reservation_status_date'], axis=1)
```
%% Cell type:code id: tags:
``` python
df = df.drop(['arrival_date_day_of_month'], axis=1)
```
%% Cell type:code id: tags:
``` python
df = df.drop(['arrival_date_year'], axis=1)
```
%% Cell type:code id: tags:
``` python
df = df.drop(['arrival_date_month'], axis=1)
```
%% Cell type:code id: tags:
``` python
df = df.drop(['arrival_date_week_number'], axis=1)
```
%% Cell type:code id: tags:
``` python
df = df.drop(['required_car_parking_spaces'], axis=1)
```
%% Cell type:code id: tags:
``` python
df = df.drop(['previous_bookings_not_canceled'], axis=1)
```
%% Cell type:code id: tags:
``` python
df = df.drop(['total_of_special_requests'], axis=1)
```
%% Cell type:code id: tags:
``` python
df = df.drop(['agent'], axis=1)
```
%% Cell type:code id: tags:
``` python
df = df.drop(['company'], axis=1)
```
%% Cell type:code id: tags:
``` python
df = df.drop(['adr'], axis=1)
```
%% Cell type:code id: tags:
``` python
df = df.dropna(axis=0)
```
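%% Cell type:markdown id: tags:
The column-by-column drops above could equivalently be written as a single step. A minimal sketch of the same cleaning (left commented out because the cells above have already been executed):
%% Cell type:code id: tags:
``` python
# Equivalent, consolidated form of the cleaning above: drop the unused columns
# in one call, then remove rows with missing values.
# cols_to_drop = ['reservation_status', 'stays_in_weekend_nights',
#                 'reservation_status_date', 'arrival_date_day_of_month',
#                 'arrival_date_year', 'arrival_date_month',
#                 'arrival_date_week_number', 'required_car_parking_spaces',
#                 'previous_bookings_not_canceled', 'total_of_special_requests',
#                 'agent', 'company', 'adr']
# df = df.drop(columns=cols_to_drop).dropna(axis=0)
```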
%% Cell type:markdown id: tags:
## 2.4. Test for Multicollinearity
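The variance inflation factor (VIF) computed below measures how strongly each feature is linearly explained by the remaining features:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2}$$

where $R_j^2$ is the coefficient of determination from regressing feature $j$ on all other features. Values close to 1 indicate little multicollinearity, while values above roughly 5 to 10 are commonly regarded as problematic; the VIF values reported below are all well under 5.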
%% Cell type:code id: tags:
``` python
from statsmodels.stats.outliers_influence import variance_inflation_factor
variables = df[['lead_time', 'is_repeated_guest', 'adults', 'booking_changes', 'previous_cancellations', 'is_canceled', 'stays_in_week_nights', 'babies', 'days_in_waiting_list']]
vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif['Features'] = variables.columns
```
%% Cell type:code id: tags:
``` python
vif
```
%% Output
VIF Features
0 2.285568 lead_time
1 1.033605 is_repeated_guest
2 3.354523 adults
3 1.143147 booking_changes
4 1.037159 previous_cancellations
5 1.759416 is_canceled
6 2.680081 stays_in_week_nights
7 1.015332 babies
8 1.049190 days_in_waiting_list
%% Cell type:markdown id: tags:
## 2.5. Descriptive Analysis
%% Cell type:code id: tags:
``` python
df.hist(figsize=(25,25), bins=50)
```
%% Output
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37350C08>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37C47A88>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37C572C8>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37C913C8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37CC8508>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37D02608>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37D3B708>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37D73808>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37D80408>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37DB85C8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37E1EB48>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37E54D08>]],
dtype=object)
%% Cell type:markdown id: tags:
# 3. Data Preparation
%% Cell type:markdown id: tags:
First, the data types are checked after the data has been read into the notebook, and read-in errors are corrected. Dimensionality reduction: attributes without a description are removed. Missing data: rows with missing values are removed. Data conversion: dummy variables are created.
%% Cell type:markdown id: tags:
## 3.1. Recoding of Categorical Variables
%% Cell type:code id: tags:
``` python
df_dummies = pd.get_dummies(df, drop_first=True) # 0-1 encoding for categorical values
df_dummies.head()
```
%% Output
is_canceled lead_time stays_in_week_nights adults children babies \
0 0 342 0 2 0.0 0
1 0 737 0 2 0.0 0
2 0 7 1 1 0.0 0
3 0 13 1 1 0.0 0
4 0 14 2 2 0.0 0
is_repeated_guest previous_cancellations booking_changes \
0 0 0 3
1 0 0 4
2 0 0 0
3 0 0 0
4 0 0 0
days_in_waiting_list ... assigned_room_type_H assigned_room_type_I \
0 0 ... 0 0
1 0 ... 0 0
2 0 ... 0 0
3 0 ... 0 0
4 0 ... 0 0
assigned_room_type_K assigned_room_type_L assigned_room_type_P \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
deposit_type_Non Refund deposit_type_Refundable customer_type_Group \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
customer_type_Transient customer_type_Transient-Party
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
[5 rows x 226 columns]
%% Cell type:code id: tags:
``` python
#df_dummies.to_csv('train_dummies.csv', index = False)
```
%% Cell type:code id: tags:
``` python
df_dummies.axes[0]
```
%% Output
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7,
8, 9,
...
119380, 119381, 119382, 119383, 119384, 119385, 119386, 119387,
119388, 119389],
dtype='int64', length=118898)
%% Cell type:code id: tags:active_ipynb
``` python
df_dummies.axes[1]
```
%% Output
Index(['is_canceled', 'lead_time', 'stays_in_week_nights', 'adults',
'children', 'babies', 'is_repeated_guest', 'previous_cancellations',
'booking_changes', 'days_in_waiting_list',
...
'assigned_room_type_H', 'assigned_room_type_I', 'assigned_room_type_K',
'assigned_room_type_L', 'assigned_room_type_P',
'deposit_type_Non Refund', 'deposit_type_Refundable',
'customer_type_Group', 'customer_type_Transient',
'customer_type_Transient-Party'],
dtype='object', length=226)
%% Cell type:markdown id: tags:
# 4. Modeling and Evaluation
%% Cell type:markdown id: tags:
The dataset with its dummy variables is loaded and split into a training set and a test set. The training and testing process is then carried out and evaluated with three different algorithms: logistic regression, decision tree, and random forest.
Evaluation and hyperparameters:
- Output: supervised learning, classification
- Data split: 80% training data, 20% test data
- Evaluation metrics, decision tree: accuracy = 0.82, recall = 0.74, precision = 0.76
- Evaluation metrics, logistic regression: accuracy = 0.78, recall = 0.55, precision = 0.78
- Evaluation metrics, random forest: accuracy = 0.82, recall = 0.57, precision = 0.90
%% Cell type:markdown id: tags:
## 4.1. Test and Train Data
%% Cell type:code id: tags:
``` python
target = df_dummies['is_canceled'] # feature to be predicted
predictors = df_dummies.drop(['is_canceled'], axis = 1) # all other features are used as predictors
```
%% Cell type:code id: tags:
``` python
predictors.head()
```
%% Output
lead_time stays_in_week_nights adults children babies \
0 342 0 2 0.0 0
1 737 0 2 0.0 0
2 7 1 1 0.0 0
3 13 1 1 0.0 0
4 14 2 2 0.0 0
is_repeated_guest previous_cancellations booking_changes \
0 0 0 3
1 0 0 4
2 0 0 0
3 0 0 0
4 0 0 0
days_in_waiting_list hotel_Resort Hotel ... assigned_room_type_H \
0 0 1 ... 0
1 0 1 ... 0
2 0 1 ... 0
3 0 1 ... 0
4 0 1 ... 0
assigned_room_type_I assigned_room_type_K assigned_room_type_L \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
assigned_room_type_P deposit_type_Non Refund deposit_type_Refundable \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
customer_type_Group customer_type_Transient customer_type_Transient-Party
0 0 1 0
1 0 1 0
2 0 1 0
3 0 1 0
4 0 1 0
[5 rows x 225 columns]
%% Cell type:code id: tags:active_ipynb
``` python
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, random_state=123)
```
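%% Cell type:markdown id: tags:
As a compact overview of the evaluation described in section 4, the three candidate models can also be fitted and scored in a single loop. This is only a sketch alongside the detailed per-model cells in sections 4.2 to 4.4; the `max_iter` and `max_depth` settings here are illustrative choices, not taken from the original cells.
%% Cell type:code id: tags:
``` python
# Sketch: fit each candidate model on the training split and print its
# test-set classification report (the detailed analysis follows below).
candidate_models = {
    'DecisionTree': DecisionTreeClassifier(),
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForest': RandomForestClassifier(max_depth=20),
}
for name, model in candidate_models.items():
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```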
%% Cell type:markdown id: tags:
## 4.2. Decision Tree
%% Cell type:code id: tags:active_ipynb
``` python
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
```
%% Output
DecisionTreeClassifier()
%% Cell type:code id: tags:
``` python
tn, fp, fn, tp = confusion_matrix(y_test, tree.predict(X_test)).ravel()
print(tn, fp, fn, tp)
```
%% Output
12983 2041 2295 6461
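%% Cell type:markdown id: tags:
The test-set metrics reported below can be derived directly from these four counts. A minimal sketch using the `tn`, `fp`, `fn`, `tp` variables from the previous cell:
%% Cell type:code id: tags:
``` python
# Derive accuracy and the positive-class (is_canceled = 1) precision and recall
# from the confusion-matrix counts above.
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f'accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}')
```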
%% Cell type:code id: tags:active_ipynb
``` python
print(classification_report(y_train, tree.predict(X_train)))
```
%% Output
precision recall f1-score support
0 0.97 0.99 0.98 59721
1 0.99 0.94 0.97 35397
accuracy 0.98 95118
macro avg 0.98 0.97 0.97 95118
weighted avg 0.98 0.98 0.98 95118
%% Cell type:code id: tags:active_ipynb
``` python
print(classification_report(y_test, tree.predict(X_test)))
```
%% Output
precision recall f1-score support
0 0.85 0.86 0.86 15024
1 0.76 0.74 0.75 8756
accuracy 0.82 23780
macro avg 0.80 0.80 0.80 23780
weighted avg 0.82 0.82 0.82 23780
%% Cell type:markdown id: tags:
## 4.3. Logistic Regression
%% Cell type:code id: tags:active_ipynb
``` python
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
```
%% Output
C:\Users\alexa\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:765: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
LogisticRegression()
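%% Cell type:markdown id: tags:
The convergence warning above can be addressed by raising `max_iter` or by scaling the features, as the warning message itself suggests. A minimal sketch of both options (left commented out so that the model fitted above is the one evaluated below):
%% Cell type:code id: tags:
``` python
# Option 1: allow the lbfgs solver more iterations.
# logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Option 2: standardize the features before fitting, as recommended in the warning.
# from sklearn.pipeline import make_pipeline
# from sklearn.preprocessing import StandardScaler
# logreg = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)
```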
%% Cell type:code id: tags:active_ipynb
``` python
print(confusion_matrix(y_test, logreg.predict(X_test)))
```
%% Output
[[13634 1390]
[ 3870 4886]]
%% Cell type:code id: tags:active_ipynb
``` python
conf_mat = confusion_matrix(y_test, logreg.predict(X_test))
df_cm = pd.DataFrame(conf_mat, index=['0','1'], columns=['0', '1'],)
fig = plt.figure(figsize=[10,7])
heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=14)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=14)
plt.ylabel('True label')
plt.xlabel('Predicted label')
```
%% Output
Text(0.5, 39.5, 'Predicted label')
%% Cell type:code id: tags:
``` python
print(classification_report(y_test, logreg.predict(X_test)))
```
%% Output
precision recall f1-score support
0 0.78 0.91 0.84 15024
1 0.78 0.56 0.65 8756
accuracy 0.78 23780
macro avg 0.78 0.73 0.74 23780
weighted avg 0.78 0.78 0.77 23780
%% Cell type:code id: tags:
``` python
print(classification_report(y_train, logreg.predict(X_train)))
```
%% Output
precision recall f1-score support
0 0.78 0.91 0.84 59721
1 0.78 0.56 0.66 35397
accuracy 0.78 95118
macro avg 0.78 0.74 0.75 95118
weighted avg 0.78 0.78 0.77 95118
%% Cell type:markdown id: tags:
## 4.4. Random Forest
%% Cell type:code id: tags:active_ipynb
``` python
tree_depth = [5, 10, 20]
for i in tree_depth:
    rf = RandomForestClassifier(max_depth=i)
    rf.fit(X_train, y_train)
    print('Max tree depth: ', i)
    print('Train results: ', classification_report(y_train, rf.predict(X_train)))
    print('Test results: ', classification_report(y_test, rf.predict(X_test)))
```
%% Output
Max tree depth: 5
Train results: precision recall f1-score support
0 0.73 1.00 0.84 59721
1 1.00 0.37 0.54 35397
accuracy 0.77 95118
macro avg 0.86 0.69 0.69 95118
weighted avg 0.83 0.77 0.73 95118
Test results: precision recall f1-score support
0 0.73 1.00 0.84 15024
1 1.00 0.36 0.53 8756
accuracy 0.76 23780
macro avg 0.86 0.68 0.69 23780
weighted avg 0.83 0.76 0.73 23780
Max tree depth: 10
Train results: precision recall f1-score support
0 0.75 0.98 0.85 59721
1 0.94 0.45 0.61 35397
accuracy 0.78 95118
macro avg 0.85 0.71 0.73 95118
weighted avg 0.82 0.78 0.76 95118
Test results: precision recall f1-score support
0 0.75 0.98 0.85 15024
1 0.94 0.44 0.59 8756
accuracy 0.78 23780
macro avg 0.84 0.71 0.72 23780
weighted avg 0.82 0.78 0.76 23780
Max tree depth: 20
Train results: precision recall f1-score support
0 0.81 0.97 0.88 59721
1 0.92 0.62 0.74 35397
accuracy 0.84 95118
macro avg 0.87 0.79 0.81 95118
weighted avg 0.85 0.84 0.83 95118
Test results: precision recall f1-score support
0 0.80 0.96 0.87 15024
1 0.90 0.58 0.70 8756
accuracy 0.82 23780
macro avg 0.85 0.77 0.79 23780
weighted avg 0.83 0.82 0.81 23780
%% Cell type:code id: tags:active_ipynb
``` python
rf = RandomForestClassifier(max_depth=20)
rf.fit(X_train, y_train)
print('Max tree depth: ', 20)
print('Train results: ', classification_report(y_train, rf.predict(X_train)))
print('Test results: ',classification_report(y_test, rf.predict(X_test)))
```
%% Output
Max tree depth: 20
Train results: precision recall f1-score support
0 0.80 0.98 0.88 59721
1 0.94 0.60 0.73 35397
accuracy 0.84 95118
macro avg 0.87 0.79 0.81 95118
weighted avg 0.85 0.84 0.83 95118
Test results: precision recall f1-score support
0 0.79 0.97 0.87 15024
1 0.91 0.56 0.69 8756
accuracy 0.82 23780
macro avg 0.85 0.76 0.78 23780
weighted avg 0.83 0.82 0.80 23780
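%% Cell type:markdown id: tags:
The manual loop over `max_depth` above is effectively a small grid search. As a hypothetical alternative (not used for the results above), the same tuning could be done with scikit-learn's `GridSearchCV`:
%% Cell type:code id: tags:
``` python
# Hypothetical alternative to the manual depth loop: cross-validated grid search
# over max_depth, refitting the best model on the full training split.
# from sklearn.model_selection import GridSearchCV
# grid = GridSearchCV(RandomForestClassifier(), {'max_depth': [5, 10, 20]}, cv=3)
# grid.fit(X_train, y_train)
# print(grid.best_params_)
# print(classification_report(y_test, grid.predict(X_test)))
```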