KI_LAB / machine-learning-services
Commit c70bb8b9, authored Jul 6, 2024 by chris waisi
tagsnachoben
parent fcae07cc
Tourism/Prediction cancellation of hotel bookings/notebook.ipynb (3 additions, 3 deletions)
@@ -119,7 +119,7 @@
  "tags": []
  },
  "source": [
- "## 2.2. Read Data"
+ "## 2.2. Daten Auslesen"
  ]
  },
  {
@@ -893,7 +893,7 @@
  "tags": []
  },
  "source": [
- "## 2.3. Data Cleaning"
+ "## 2.3. Daten bereingung"
  ]
  },
  {
@@ -1369,7 +1369,7 @@
  "tags": []
  },
  "source": [
- "## 2.5. Descriptive Analysis"
+ "## 2.5. Deskriptive Analyse"
  ]
  },
  {
%% Cell type:markdown id: tags:
# 1. Business Understanding
%% Cell type:markdown id: tags:
Estimating customer behavior with respect to hotel cancellations supports capacity planning. This use case tests whether hotel cancellations can be predicted.
An important goal for every company is maintaining valuable customer relationships, positive rankings, and sound capacity planning.
Companies therefore need an assessment of how customers behave when it comes to cancelling bookings:
if the risk that a customer cancels a booking can be estimated in advance, countermeasures can be taken.
%% Cell type:markdown id: tags:
# 2. Data and Data Understanding
%% Cell type:markdown id: tags:
The data frame covers hotel bookings. The dataset for this demo was published on the Kaggle data science platform as a CSV file.
The records capture a wide range of guest data. Features such as families with children or bookings made through travel agencies can indicate
whether those guests have a higher cancellation rate. The data frame contains booking information from 2 different hotels.
Number of observations = 119390, number of features = 32.
Correlation analysis: `stays_in_weekend_nights` and `stays_in_week_nights` correlate at 0.5.
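%% Cell type:markdown id: tags:
The correlation quoted above is presumably a Pearson coefficient, which is what pandas `DataFrame.corr()` computes by default. As a minimal sketch of that calculation, the following pure-Python function works on two lists. The toy values are hypothetical and are not taken from the dataset.
%% Cell type:code id: tags:
```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy values, not taken from the dataset
weekend_nights = [0, 1, 2, 2, 4]
week_nights = [1, 2, 5, 3, 10]
r = pearson(weekend_nights, week_nights)
print(f"Pearson r = {r:.2f}")
```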
%% Cell type:markdown id: tags:
## 2.1. Import of Relevant Modules
%% Cell type:code id: tags:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.metrics import confusion_matrix, classification_report
```
%% Cell type:markdown id: tags:
## 2.2. Reading the Data
%% Cell type:code id: tags:
```python
# my_file = project.get_file()
# my_file.seek(0)
df = pd.read_csv("https://storage.googleapis.com/ml-service-repository-datastorage/Prediction_cancellation_of_hotel_bookings_data.csv")
df.head()
```
%% Output
hotel is_canceled lead_time arrival_date_year arrival_date_month \
0 Resort Hotel 0 342 2015 July
1 Resort Hotel 0 737 2015 July
2 Resort Hotel 0 7 2015 July
3 Resort Hotel 0 13 2015 July
4 Resort Hotel 0 14 2015 July
arrival_date_week_number arrival_date_day_of_month \
0 27 1
1 27 1
2 27 1
3 27 1
4 27 1
stays_in_weekend_nights stays_in_week_nights adults ... deposit_type \
0 0 0 2 ... No Deposit
1 0 0 2 ... No Deposit
2 0 1 1 ... No Deposit
3 0 1 1 ... No Deposit
4 0 2 2 ... No Deposit
agent company days_in_waiting_list customer_type adr \
0 NaN NaN 0 Transient 0.0
1 NaN NaN 0 Transient 0.0
2 NaN NaN 0 Transient 75.0
3 304.0 NaN 0 Transient 75.0
4 240.0 NaN 0 Transient 98.0
required_car_parking_spaces total_of_special_requests reservation_status \
0 0 0 Check-Out
1 0 0 Check-Out
2 0 0 Check-Out
3 0 0 Check-Out
4 0 1 Check-Out
reservation_status_date
0 2015-07-01
1 2015-07-01
2 2015-07-02
3 2015-07-02
4 2015-07-03
[5 rows x 32 columns]
%% Cell type:code id: tags:
```python
def attribute_description(data):
    """Print each column with its type; string columns also show their values or one example."""
    longest_column_name = max(len(col) for col in data.columns)
    for col in data.columns:
        col_dropna = data[col].dropna()
        example = col_dropna.sample(1).values[0]  # one random non-null value
        if isinstance(example, str):
            description = 'str '
            unique_values = col_dropna.unique()
            if len(unique_values) < 10:
                # Few distinct values: list all of them
                description += '[' + '; '.join(f'"{name}"' for name in unique_values) + ']'
            else:
                # Many distinct values: show a single example
                description += '[ example: "' + example + '" ]'
        else:
            description = str(type(example))
        print(col.ljust(longest_column_name) + f': {description}')
```
%% Cell type:code id: tags:
```python
attribute_description(df)
```
%% Output
hotel : str ["Resort Hotel"; "City Hotel"]
is_canceled : <class 'numpy.int64'>
lead_time : <class 'numpy.int64'>
arrival_date_year : <class 'numpy.int64'>
arrival_date_month : str [ example: "May" ]
arrival_date_week_number : <class 'numpy.int64'>
arrival_date_day_of_month : <class 'numpy.int64'>
stays_in_weekend_nights : <class 'numpy.int64'>
stays_in_week_nights : <class 'numpy.int64'>
adults : <class 'numpy.int64'>
children : <class 'numpy.float64'>
babies : <class 'numpy.int64'>
meal : str ["BB"; "FB"; "HB"; "SC"; "Undefined"]
country : str [ example: "SWE" ]
market_segment : str ["Direct"; "Corporate"; "Online TA"; "Offline TA/TO"; "Complementary"; "Groups"; "Undefined"; "Aviation"]
distribution_channel : str ["Direct"; "Corporate"; "TA/TO"; "Undefined"; "GDS"]
is_repeated_guest : <class 'numpy.int64'>
previous_cancellations : <class 'numpy.int64'>
previous_bookings_not_canceled: <class 'numpy.int64'>
reserved_room_type : str [ example: "A" ]
assigned_room_type : str [ example: "C" ]
booking_changes : <class 'numpy.int64'>
deposit_type : str ["No Deposit"; "Refundable"; "Non Refund"]
agent : <class 'numpy.float64'>
company : <class 'numpy.float64'>
days_in_waiting_list : <class 'numpy.int64'>
customer_type : str ["Transient"; "Contract"; "Transient-Party"; "Group"]
adr : <class 'numpy.float64'>
required_car_parking_spaces : <class 'numpy.int64'>
total_of_special_requests : <class 'numpy.int64'>
reservation_status : str ["Check-Out"; "Canceled"; "No-Show"]
reservation_status_date : str [ example: "2017-02-27" ]
%% Cell type:code id: tags:
```python
df.describe(include='all')
```
%% Output
hotel is_canceled lead_time arrival_date_year \
count 119390 119390.000000 119390.000000 119390.000000
unique 2 NaN NaN NaN
top City Hotel NaN NaN NaN
freq 79330 NaN NaN NaN
mean NaN 0.370416 104.011416 2016.156554
std NaN 0.482918 106.863097 0.707476
min NaN 0.000000 0.000000 2015.000000
25% NaN 0.000000 18.000000 2016.000000
50% NaN 0.000000 69.000000 2016.000000
75% NaN 1.000000 160.000000 2017.000000
max NaN 1.000000 737.000000 2017.000000
arrival_date_month arrival_date_week_number \
count 119390 119390.000000
unique 12 NaN
top August NaN
freq 13877 NaN
mean NaN 27.165173
std NaN 13.605138
min NaN 1.000000
25% NaN 16.000000
50% NaN 28.000000
75% NaN 38.000000
max NaN 53.000000
arrival_date_day_of_month stays_in_weekend_nights \
count 119390.000000 119390.000000
unique NaN NaN
top NaN NaN
freq NaN NaN
mean 15.798241 0.927599
std 8.780829 0.998613
min 1.000000 0.000000
25% 8.000000 0.000000
50% 16.000000 1.000000
75% 23.000000 2.000000
max 31.000000 19.000000
stays_in_week_nights adults ... deposit_type agent \
count 119390.000000 119390.000000 ... 119390 103050.000000
unique NaN NaN ... 3 NaN
top NaN NaN ... No Deposit NaN
freq NaN NaN ... 104641 NaN
mean 2.500302 1.856403 ... NaN 86.693382
std 1.908286 0.579261 ... NaN 110.774548
min 0.000000 0.000000 ... NaN 1.000000
25% 1.000000 2.000000 ... NaN 9.000000
50% 2.000000 2.000000 ... NaN 14.000000
75% 3.000000 2.000000 ... NaN 229.000000
max 50.000000 55.000000 ... NaN 535.000000
company days_in_waiting_list customer_type adr \
count 6797.000000 119390.000000 119390 119390.000000
unique NaN NaN 4 NaN
top NaN NaN Transient NaN
freq NaN NaN 89613 NaN
mean 189.266735 2.321149 NaN 101.831122
std 131.655015 17.594721 NaN 50.535790
min 6.000000 0.000000 NaN -6.380000
25% 62.000000 0.000000 NaN 69.290000
50% 179.000000 0.000000 NaN 94.575000
75% 270.000000 0.000000 NaN 126.000000
max 543.000000 391.000000 NaN 5400.000000
required_car_parking_spaces total_of_special_requests \
count 119390.000000 119390.000000
unique NaN NaN
top NaN NaN
freq NaN NaN
mean 0.062518 0.571363
std 0.245291 0.792798
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 0.000000 1.000000
max 8.000000 5.000000
reservation_status reservation_status_date
count 119390 119390
unique 3 926
top Check-Out 2015-10-21
freq 75166 1461
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
[11 rows x 32 columns]
%% Cell type:markdown id: tags:
## 2.3. Data Cleaning
%% Cell type:code id: tags:
```python
f, ax = plt.subplots(figsize=(18, 18))
# numeric_only=True limits the correlation to numeric columns;
# pandas 2.x raises an error on string columns otherwise
sns.heatmap(df.corr(numeric_only=True), annot=True, linewidths=0.5, fmt=".1f", ax=ax)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Correlation Map')
plt.show()
```
%% Output
%% Cell type:code id: tags:
```python
df.isnull().sum()
```
%% Output
hotel 0
is_canceled 0
lead_time 0
arrival_date_year 0
arrival_date_month 0
arrival_date_week_number 0
arrival_date_day_of_month 0
stays_in_weekend_nights 0
stays_in_week_nights 0
adults 0
children 4
babies 0
meal 0
country 488
market_segment 0
distribution_channel 0
is_repeated_guest 0
previous_cancellations 0
previous_bookings_not_canceled 0
reserved_room_type 0
assigned_room_type 0
booking_changes 0
deposit_type 0
agent 16340
company 112593
days_in_waiting_list 0
customer_type 0
adr 0
required_car_parking_spaces 0
total_of_special_requests 0
reservation_status 0
reservation_status_date 0
dtype: int64
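%% Cell type:markdown id: tags:
To judge how severe the gaps are, the counts above can be related to the 119390 observations. The sketch below is illustrative only: it hard-codes the four non-zero counts from the output rather than recomputing them from `df`.
%% Cell type:code id: tags:
```python
n_rows = 119390  # number of observations stated in the data description
# Non-zero missing-value counts copied from the output above
missing_counts = {"children": 4, "country": 488, "agent": 16340, "company": 112593}

# Share of rows with a missing value, per column
missing_share = {col: count / n_rows for col, count in missing_counts.items()}
for col, share in missing_share.items():
    print(f"{col:<10}{share:8.2%}")
```
%% Cell type:markdown id: tags:
With roughly 94% of `company` and 14% of `agent` missing, removing rows with missing values would discard most of the data, which is consistent with these two features being dropped as columns in the cells that follow.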
%% Cell type:code id: tags:
```python
df = df.drop(['reservation_status'], axis=1)
```
%% Cell type:code id: tags:
```python
df = df.drop(['stays_in_weekend_nights'], axis=1)
```
%% Cell type:code id: tags:
```python
df = df.drop(['reservation_status_date'], axis=1)
```
%% Cell type:code id: tags:
```python
df = df.drop(['arrival_date_day_of_month'], axis=1)
```
%% Cell type:code id: tags:
```python
df = df.drop(['arrival_date_year'], axis=1)
```
%% Cell type:code id: tags:
```python
df = df.drop(['arrival_date_month'], axis=1)
```
%% Cell type:code id: tags:
```python
df = df.drop(['arrival_date_week_number'], axis=1)
```
%% Cell type:code id: tags:
```python
df = df.drop(['required_car_parking_spaces'], axis=1)
```
%% Cell type:code id: tags:
```python
df = df.drop(['previous_bookings_not_canceled'], axis=1)
```
%% Cell type:code id: tags:
```python
df = df.drop(['total_of_special_requests'], axis=1)
```
%% Cell type:code id: tags:
```python
df = df.drop(['agent'], axis=1)
```
%% Cell type:code id: tags:
```python
df = df.drop(['company'], axis=1)
```
%% Cell type:code id: tags:
```python
df = df.drop(['adr'], axis=1)
```
%% Cell type:code id: tags:
```python
# Remove the remaining rows that contain missing values
df = df.dropna(axis=0)
```
%% Cell type:markdown id: tags:
## 2.4. Test for Multicollinearity
%% Cell type:code id: tags:
```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

variables = df[['lead_time', 'is_repeated_guest', 'adults', 'booking_changes',
                'previous_cancellations', 'is_canceled', 'stays_in_week_nights',
                'babies', 'days_in_waiting_list']]
vif = pd.DataFrame()
# One VIF per column: column i is regressed on all remaining columns
vif['VIF'] = [variance_inflation_factor(variables.values, i) for i in range(variables.shape[1])]
vif['Features'] = variables.columns
```
%% Cell type:code id: tags:
```python
vif
```
%% Output
VIF Features
0 2.285568 lead_time
1 1.033605 is_repeated_guest
2 3.354523 adults
3 1.143147 booking_changes
4 1.037159 previous_cancellations
5 1.759416 is_canceled
6 2.680081 stays_in_week_nights
7 1.015332 babies
8 1.049190 days_in_waiting_list
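%% Cell type:markdown id: tags:
For reading the table above: the variance inflation factor of a feature is defined as 1 / (1 - R²), where R² comes from regressing that feature on the remaining ones. The sketch below only illustrates this relationship; the helper names are hypothetical. The exact values also depend on whether a constant column is part of the inputs, so the numbers are best read as indicative. Thresholds of 5 or 10 are commonly quoted rules of thumb, and all values above stay below them.
%% Cell type:code id: tags:
```python
def vif_from_r_squared(r_squared):
    """VIF = 1 / (1 - R^2) for a feature regressed on the other features."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("r_squared must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


def r_squared_from_vif(vif_value):
    """Invert the definition to recover R^2 from a reported VIF."""
    return 1.0 - 1.0 / vif_value


# Largest VIF in the table above (feature 'adults')
adults_r2 = r_squared_from_vif(3.354523)
print(f"R^2 implied by VIF 3.354523: {adults_r2:.3f}")
```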
%% Cell type:markdown id: tags:
## 2.5. Descriptive Analysis
%% Cell type:code id: tags:
```python
df.hist(figsize=(25, 25), bins=50)
```
%% Output
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37350C08>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37C47A88>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37C572C8>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37C913C8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37CC8508>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37D02608>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37D3B708>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37D73808>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37D80408>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37DB85C8>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37E1EB48>,
<matplotlib.axes._subplots.AxesSubplot object at 0x000001DE37E54D08>]],
dtype=object)
%% Cell type:markdown id: tags:
# 3. Data Preparation
%% Cell type:markdown id: tags:
First, the data types are checked after the data has been read into the notebook. Read errors are corrected accordingly.
Dimensionality reduction: attributes without a description were removed.
Missing data: rows with missing data are removed.
Data conversion: dummy variables are created.
%% Cell type:markdown id: tags:
## 3.1. Recoding of Categorical Variables
%% Cell type:code id: tags:
```python
df_dummies = pd.get_dummies(df, drop_first=True)  # 0-1 encoding for categorical values
df_dummies.head()
```
%% Output
is_canceled lead_time stays_in_week_nights adults children babies \
0 0 342 0 2 0.0 0
1 0 737 0 2 0.0 0
2 0 7 1 1 0.0 0
3 0 13 1 1 0.0 0
4 0 14 2 2 0.0 0
is_repeated_guest previous_cancellations booking_changes \
0 0 0 3
1 0 0 4
2 0 0 0
3 0 0 0
4 0 0 0
days_in_waiting_list ... assigned_room_type_H assigned_room_type_I \
0 0 ... 0 0
1 0 ... 0 0
2 0 ... 0 0
3 0 ... 0 0
4 0 ... 0 0
assigned_room_type_K assigned_room_type_L assigned_room_type_P \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
deposit_type_Non Refund deposit_type_Refundable customer_type_Group \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
customer_type_Transient customer_type_Transient-Party
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
[5 rows x 226 columns]
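%% Cell type:markdown id: tags:
To make the effect of `drop_first=True` concrete, here is a small pure-Python sketch of dummy encoding. It is not pandas' implementation: `dummy_encode` is a hypothetical helper that assumes categories are ordered alphabetically and that the first one is dropped as the implicit baseline.
%% Cell type:code id: tags:
```python
def dummy_encode(values, drop_first=True):
    """Return {category: [0/1, ...]} indicator columns for a list of labels."""
    categories = sorted(set(values))
    if drop_first:
        # The first category becomes the baseline: all indicators are 0 for it
        categories = categories[1:]
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


deposit_type = ["No Deposit", "Refundable", "Non Refund", "No Deposit"]
encoded = dummy_encode(deposit_type)
for name, column in encoded.items():
    print(f"deposit_type_{name}: {column}")
```
%% Cell type:markdown id: tags:
Under these assumptions the sketch yields indicators for "Non Refund" and "Refundable" only, which mirrors the `deposit_type_Non Refund` and `deposit_type_Refundable` columns in the output above.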
%% Cell type:code id: tags:
```python
# df_dummies.to_csv('train_dummies.csv', index=False)
```
%% Cell type:code id: tags:
```python
df_dummies.axes[0]
```
%% Output
Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7,
8, 9,
...
119380, 119381, 119382, 119383, 119384, 119385, 119386, 119387,
119388, 119389],
dtype='int64', length=118898)
%% Cell type:code id: tags:active_ipynb
```python
df_dummies.axes[1]
```
%% Output
Index(['is_canceled', 'lead_time', 'stays_in_week_nights', 'adults',
'children', 'babies', 'is_repeated_guest', 'previous_cancellations',
'booking_changes', 'days_in_waiting_list',
...
'assigned_room_type_H', 'assigned_room_type_I', 'assigned_room_type_K',
'assigned_room_type_L', 'assigned_room_type_P',
'deposit_type_Non Refund', 'deposit_type_Refundable',
'customer_type_Group', 'customer_type_Transient',
'customer_type_Transient-Party'],
dtype='object', length=226)
%% Cell type:markdown id: tags:
# 4. Modeling and Evaluation
%% Cell type:markdown id: tags:
The dataset with its dummy variables is loaded and split into a training set and a test set. The training and testing process is then carried out and evaluated with 3 different algorithms: logistic regression, decision tree, and random forest.
Evaluation and hyperparameters:
Output: supervised learning, classification.
Data split: 80% training data, 20% test data.
Evaluation metrics for the decision tree (cancelled class, test data): accuracy = 0.82, recall = 0.74, precision = 0.76.
Evaluation metrics for logistic regression: accuracy = 0.78, recall = 0.56, precision = 0.78.
Evaluation metrics for the random forest: accuracy = 0.82, recall = 0.57, precision = 0.90.
%% Cell type:markdown id: tags:
## 4.1. Test and Train Data
%% Cell type:code id: tags:
```python
target = df_dummies['is_canceled']  # feature to be predicted
predictors = df_dummies.drop(['is_canceled'], axis=1)  # all other features are used as predictors
```
%% Cell type:code id: tags:
```python
predictors.head()
```
%% Output
lead_time stays_in_week_nights adults children babies \
0 342 0 2 0.0 0
1 737 0 2 0.0 0
2 7 1 1 0.0 0
3 13 1 1 0.0 0
4 14 2 2 0.0 0
is_repeated_guest previous_cancellations booking_changes \
0 0 0 3
1 0 0 4
2 0 0 0
3 0 0 0
4 0 0 0
days_in_waiting_list hotel_Resort Hotel ... assigned_room_type_H \
0 0 1 ... 0
1 0 1 ... 0
2 0 1 ... 0
3 0 1 ... 0
4 0 1 ... 0
assigned_room_type_I assigned_room_type_K assigned_room_type_L \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
assigned_room_type_P deposit_type_Non Refund deposit_type_Refundable \
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
customer_type_Group customer_type_Transient customer_type_Transient-Party
0 0 1 0
1 0 1 0
2 0 1 0
3 0 1 0
4 0 1 0
[5 rows x 225 columns]
%% Cell type:code id: tags:active_ipynb
```python
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, random_state=123)
```
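%% Cell type:markdown id: tags:
The following is a rough pure-Python sketch of what an 80/20 shuffled split does. It assumes the test share is rounded up to a whole number of rows, and it will not reproduce the exact rows that `train_test_split` selects, since the shuffling differs. `split_indices` is a hypothetical helper.
%% Cell type:code id: tags:
```python
import math
import random


def split_indices(n_samples, test_size=0.2, seed=123):
    """Shuffle row indices reproducibly and split them into train and test parts."""
    n_test = math.ceil(n_samples * test_size)  # assumption: test share is rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable
    return indices[n_test:], indices[:n_test]


# 118898 rows remain after cleaning (see the index length above)
train_idx, test_idx = split_indices(118898)
print(len(train_idx), len(test_idx))
```
%% Cell type:markdown id: tags:
Under that rounding assumption the sizes come out as 95118 and 23780, the same totals that appear in the support columns of the classification reports.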
%% Cell type:markdown id: tags:
## 4.2. DecisionTree
%% Cell type:code id: tags:active_ipynb
```python
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
```
%% Output
DecisionTreeClassifier()
%% Cell type:code id: tags:
```python
# ravel() flattens the 2x2 matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, tree.predict(X_test)).ravel()
print(tn, fp, fn, tp)
```
%% Output
12983 2041 2295 6461
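%% Cell type:markdown id: tags:
The four counts above are enough to derive accuracy, precision, and recall for the cancelled class by hand. A minimal sketch, using the printed numbers as literals:
%% Cell type:code id: tags:
```python
# Decision tree counts printed above, in the order tn, fp, fn, tp
tn, fp, fn, tp = 12983, 2041, 2295, 6461

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all correct predictions
precision = tp / (tp + fp)  # predicted cancellations that were real
recall = tp / (tp + fn)  # real cancellations that were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```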
%% Cell type:code id: tags:active_ipynb
```python
print(classification_report(y_train, tree.predict(X_train)))
```
%% Output
precision recall f1-score support
0 0.97 0.99 0.98 59721
1 0.99 0.94 0.97 35397
accuracy 0.98 95118
macro avg 0.98 0.97 0.97 95118
weighted avg 0.98 0.98 0.98 95118
%% Cell type:code id: tags:active_ipynb
```python
print(classification_report(y_test, tree.predict(X_test)))
```
%% Output
precision recall f1-score support
0 0.85 0.86 0.86 15024
1 0.76 0.74 0.75 8756
accuracy 0.82 23780
macro avg 0.80 0.80 0.80 23780
weighted avg 0.82 0.82 0.82 23780
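%% Cell type:markdown id: tags:
The f1-score and weighted-average rows of the report can be retraced from the other columns. This sketch uses the rounded values printed above as inputs, so its results can deviate slightly in the last digit from what scikit-learn computes on unrounded numbers. Both helper functions are illustrative.
%% Cell type:code id: tags:
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def weighted_average(values, supports):
    """Average of per-class values, weighted by the number of samples per class."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


# Class 1 (cancelled) on the test data: precision 0.76, recall 0.74
f1_cancelled = f1_score(0.76, 0.74)
# Per-class f1-scores 0.86 and 0.75 with supports 15024 and 8756
weighted_f1 = weighted_average([0.86, 0.75], [15024, 8756])

print(f"f1 (class 1) = {f1_cancelled:.2f}, weighted f1 = {weighted_f1:.2f}")
```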
%% Cell type:markdown id: tags:
## 4.3. Logistic Regression
%% Cell type:code id: tags:active_ipynb
```python
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
```
%% Output
C:\Users\alexa\Anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:765: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
LogisticRegression()
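%% Cell type:markdown id: tags:
The warning above names two remedies: raising `max_iter` or scaling the data. As an illustration of the second one, here is a pure-Python sketch of z-score standardization for a single column, using the population standard deviation. It is not applied in this notebook, and the reported results were produced without it. `standardize` is a hypothetical helper.
%% Cell type:code id: tags:
```python
from statistics import fmean, pstdev


def standardize(column):
    """Rescale a numeric column to mean 0 and standard deviation 1."""
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation
    if std == 0:
        # Constant column: every value equals the mean
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# First five lead_time values shown in df.head()
lead_time = [342, 737, 7, 13, 14]
scaled = standardize(lead_time)
print([round(x, 2) for x in scaled])
```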
%% Cell type:code id: tags:active_ipynb
```python
print(confusion_matrix(y_test, logreg.predict(X_test)))
```
%% Output
[[13634 1390]
[ 3870 4886]]
%% Cell type:code id: tags:active_ipynb
```python
conf_mat = confusion_matrix(y_test, logreg.predict(X_test))
df_cm = pd.DataFrame(conf_mat, index=['0', '1'], columns=['0', '1'])
fig = plt.figure(figsize=[10, 7])
heatmap = sns.heatmap(df_cm, annot=True, fmt="d")
heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=14)
heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=14)
plt.ylabel('True label')
plt.xlabel('Predicted label')
```
%% Output
Text(0.5, 39.5, 'Predicted label')
%% Cell type:code id: tags:
```python
print(classification_report(y_test, logreg.predict(X_test)))
```
%% Output
precision recall f1-score support
0 0.78 0.91 0.84 15024
1 0.78 0.56 0.65 8756
accuracy 0.78 23780
macro avg 0.78 0.73 0.74 23780
weighted avg 0.78 0.78 0.77 23780
%% Cell type:code id: tags:
```python
print(classification_report(y_train, logreg.predict(X_train)))
```
%% Output
precision recall f1-score support
0 0.78 0.91 0.84 59721
1 0.78 0.56 0.66 35397
accuracy 0.78 95118
macro avg 0.78 0.74 0.75 95118
weighted avg 0.78 0.78 0.77 95118
%% Cell type:markdown id: tags:
## 4.4. Random Forest
%% Cell type:code id: tags:active_ipynb
```python
tree_depth = [5, 10, 20]
for i in tree_depth:
    rf = RandomForestClassifier(max_depth=i)
    rf.fit(X_train, y_train)
    print('Max tree depth:', i)
    print('Train results:', classification_report(y_train, rf.predict(X_train)))
    print('Test results:', classification_report(y_test, rf.predict(X_test)))
```
%% Output
Max tree depth: 5
Train results: precision recall f1-score support
0 0.73 1.00 0.84 59721
1 1.00 0.37 0.54 35397
accuracy 0.77 95118
macro avg 0.86 0.69 0.69 95118
weighted avg 0.83 0.77 0.73 95118
Test results: precision recall f1-score support
0 0.73 1.00 0.84 15024
1 1.00 0.36 0.53 8756
accuracy 0.76 23780
macro avg 0.86 0.68 0.69 23780
weighted avg 0.83 0.76 0.73 23780
Max tree depth: 10
Train results: precision recall f1-score support
0 0.75 0.98 0.85 59721
1 0.94 0.45 0.61 35397
accuracy 0.78 95118
macro avg 0.85 0.71 0.73 95118
weighted avg 0.82 0.78 0.76 95118
Test results: precision recall f1-score support
0 0.75 0.98 0.85 15024
1 0.94 0.44 0.59 8756
accuracy 0.78 23780
macro avg 0.84 0.71 0.72 23780
weighted avg 0.82 0.78 0.76 23780
Max tree depth: 20
Train results: precision recall f1-score support
0 0.81 0.97 0.88 59721
1 0.92 0.62 0.74 35397
accuracy 0.84 95118
macro avg 0.87 0.79 0.81 95118
weighted avg 0.85 0.84 0.83 95118
Test results: precision recall f1-score support
0 0.80 0.96 0.87 15024
1 0.90 0.58 0.70 8756
accuracy 0.82 23780
macro avg 0.85 0.77 0.79 23780
weighted avg 0.83 0.82 0.81 23780
%% Cell type:code id: tags:active_ipynb
```python
max_depth = 20  # deepest setting tried in the loop above
rf = RandomForestClassifier(max_depth=max_depth)
rf.fit(X_train, y_train)
# Print the depth actually used instead of the leftover loop variable i
print('Max tree depth:', max_depth)
print('Train results:', classification_report(y_train, rf.predict(X_train)))
print('Test results:', classification_report(y_test, rf.predict(X_test)))
```
%% Output
Max tree depth: 20
Train results: precision recall f1-score support
0 0.80 0.98 0.88 59721
1 0.94 0.60 0.73 35397
accuracy 0.84 95118
macro avg 0.87 0.79 0.81 95118
weighted avg 0.85 0.84 0.83 95118
Test results: precision recall f1-score support
0 0.79 0.97 0.87 15024
1 0.91 0.56 0.69 8756
accuracy 0.82 23780
macro avg 0.85 0.76 0.78 23780
weighted avg 0.83 0.82 0.80 23780
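%% Cell type:markdown id: tags:
To put the accuracies above in context, it can help to compare them with a trivial baseline that always predicts the majority class. The sketch below derives that baseline from the test-set support counts in the reports. It is an added illustration and not part of the original evaluation.
%% Cell type:code id: tags:
```python
# Test-set class counts from the support column of the reports above
support = {0: 15024, 1: 8756}

majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())

print(f"Always predicting class {majority_class} gives accuracy {baseline_accuracy:.2f}")
```
%% Cell type:markdown id: tags:
An accuracy of 0.82 therefore lies well above the roughly 0.63 that results from never predicting a cancellation.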