Analysis of the movement and activity of free-ranging cattle¶
1. Business Understanding¶
Farmers want to analyze how their livestock moves and behaves on the pasture. (see the readme for further information)
import numpy as np # standard for data processing
import pandas as pd # standard for data processing
import plotly.graph_objects as go # creates plots
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
Read in the data¶
Source of the data:
Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium, 24-26 April 2013. Downloaded from Kaggle: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones
There are two separate datasets from the test group and the training group. These are first merged again for the further analysis. The observations are flagged with whether they come from the training or the test dataset so that they can be split again later.
df_train = pd.read_csv('https://storage.googleapis.com/ml-service-repository-datastorage/Analysis_of_the_movement_and_activity_of_free-ranging_cattle_train.csv', delimiter=',')
df_test = pd.read_csv('https://storage.googleapis.com/ml-service-repository-datastorage/Analysis_of_the_movement_and_activity_of_free-ranging_cattle_test.csv', delimiter=',')
df_train['type']='train'
df_test['type']='test'
df = pd.concat([df_train, df_test], axis=0).reset_index(drop=True)
df.head()
tBodyAcc-mean()-X | tBodyAcc-mean()-Y | tBodyAcc-mean()-Z | tBodyAcc-std()-X | tBodyAcc-std()-Y | tBodyAcc-std()-Z | tBodyAcc-mad()-X | tBodyAcc-mad()-Y | tBodyAcc-mad()-Z | tBodyAcc-max()-X | ... | angle(tBodyAccMean,gravity) | angle(tBodyAccJerkMean),gravityMean) | angle(tBodyGyroMean,gravityMean) | angle(tBodyGyroJerkMean,gravityMean) | angle(X,gravityMean) | angle(Y,gravityMean) | angle(Z,gravityMean) | subject | Activity | type | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.288585 | -0.020294 | -0.132905 | -0.995279 | -0.983111 | -0.913526 | -0.995112 | -0.983185 | -0.923527 | -0.934724 | ... | -0.112754 | 0.030400 | -0.464761 | -0.018446 | -0.841247 | 0.179941 | -0.058627 | 1 | STANDING | train |
1 | 0.278419 | -0.016411 | -0.123520 | -0.998245 | -0.975300 | -0.960322 | -0.998807 | -0.974914 | -0.957686 | -0.943068 | ... | 0.053477 | -0.007435 | -0.732626 | 0.703511 | -0.844788 | 0.180289 | -0.054317 | 1 | STANDING | train |
2 | 0.279653 | -0.019467 | -0.113462 | -0.995380 | -0.967187 | -0.978944 | -0.996520 | -0.963668 | -0.977469 | -0.938692 | ... | -0.118559 | 0.177899 | 0.100699 | 0.808529 | -0.848933 | 0.180637 | -0.049118 | 1 | STANDING | train |
3 | 0.279174 | -0.026201 | -0.123283 | -0.996091 | -0.983403 | -0.990675 | -0.997099 | -0.982750 | -0.989302 | -0.938692 | ... | -0.036788 | -0.012892 | 0.640011 | -0.485366 | -0.848649 | 0.181935 | -0.047663 | 1 | STANDING | train |
4 | 0.276629 | -0.016570 | -0.115362 | -0.998139 | -0.980817 | -0.990482 | -0.998321 | -0.979672 | -0.990441 | -0.942469 | ... | 0.123320 | 0.122542 | 0.693578 | -0.615971 | -0.847865 | 0.185151 | -0.043892 | 1 | STANDING | train |
5 rows × 564 columns
2. Data Exploration and Preparation¶
print('Total number of observations: ' + str(df.shape[0]))
Total number of observations: 10299
Check for missing values¶
df.isna().sum()
tBodyAcc-mean()-X 0 tBodyAcc-mean()-Y 0 tBodyAcc-mean()-Z 0 tBodyAcc-std()-X 0 tBodyAcc-std()-Y 0 .. angle(Y,gravityMean) 0 angle(Z,gravityMean) 0 subject 0 Activity 0 type 0 Length: 564, dtype: int64
df.isna().sum().sum()
0
--> no missing values in the data frame
Check for duplicates¶
df[df.duplicated()]
0 rows × 564 columns
--> no duplicate rows in the data
Target variable¶
We have a classification problem and the target is the column "Activity".
#Possible classes/labels
df['Activity'].unique()
array(['STANDING', 'SITTING', 'LAYING', 'WALKING', 'WALKING_DOWNSTAIRS', 'WALKING_UPSTAIRS'], dtype=object)
df['Activity'].value_counts().plot.bar()
<AxesSubplot:>
- In the smart farming context of this task, a cow would normally not walk on stairs. Therefore the rows with the labels "WALKING_DOWNSTAIRS" and "WALKING_UPSTAIRS" are removed:
indexNames = df[(df['Activity'] == 'WALKING_DOWNSTAIRS') | (df['Activity'] == 'WALKING_UPSTAIRS')].index
df.drop(indexNames , inplace=True)
df = df.reset_index(drop=True)
df['Activity'].value_counts().plot.bar()
<AxesSubplot:>
- Roughly the same number of observations for each activity. We could oversample all minority classes before modeling to obtain a perfectly balanced dataset (this is done in the oversampling section below).
How many observations are there from each subject?¶
df['subject'].value_counts().plot.bar()
<AxesSubplot:>
The figure above is interesting. Normally all subjects performed the same series of experiments, so one would assume that the number of observations from each subject must be nearly equal. However, there is a spread of 200 to 300 observations across the subjects. One reason could be that the transition from one activity to the next in the sequence was not sharp or distinct enough, and the observers deleted the unclear observations from these unstable phases afterwards.
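To quantify this spread, a quick look at the per-subject counts (a minimal sketch, assuming df as loaded above):
# spread of the number of observations per subject
subject_counts = df['subject'].value_counts()
print('min: ' + str(subject_counts.min()) + ' | max: ' + str(subject_counts.max()) + ' | mean: ' + str(round(subject_counts.mean(), 1)))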
Total number of observations¶
print('Number of observations: ' + str(df.shape[0]))
Number of observations: 7349
PCA for visualization¶
- PCA is a simple method to visualize high-dimensional data in a low-dimensional space. (Caution: we pay for this with a loss of information --> but for visualization purposes this is OK)
data_visualisation = df.drop('subject', axis =1).drop('Activity', axis=1).drop('type', axis =1)
s = StandardScaler()
data_visualisation = s.fit_transform(data_visualisation)
# We want 3 principal components
p = PCA(n_components = 3)
data_visualisation_transformed = p.fit_transform(data_visualisation)
#data_visualisation_transformed = p.transform(data_visualisation)
print('Features before PCA: ' + str(data_visualisation.shape[1]))
Features before PCA: 561
print('Features after PCA: ' + str(data_visualisation_transformed.shape[1]))
Features after PCA: 3
p.explained_variance_ratio_.sum()
0.619964301867613
x = []
for i in range(0, len(p.explained_variance_ratio_)):
x.append('PC' + str(i + 1))
y = np.array(p.explained_variance_ratio_)
z = np.cumsum(p.explained_variance_ratio_)
plt.xlabel('Principal Components')
plt.ylabel('Variance Explained')
plt.bar(x, y)
plt.plot(x, z)
[<matplotlib.lines.Line2D at 0x1c5212bc790>]
#labels 'STANDING', 'SITTING', 'LAYING', 'WALKING'
#fig, ax = plt.subplots()
plt.figure(figsize=(15,6))
for activity in df['Activity'].unique():
filtered_val = data_visualisation_transformed[df['Activity']==activity,:]
plt.scatter(filtered_val[:,0], filtered_val[:,1], label=activity, s=1.5)
plt.legend()
plt.show()
fig = plt.figure(figsize = (15, 6))
ax = fig.add_subplot(111, projection='3d')
for activity in df['Activity'].unique():
filtered_val = data_visualisation_transformed[df['Activity'] == activity, :]
ax.scatter(
filtered_val[:, 0],
filtered_val[:, 1],
filtered_val[:, 2],
label = activity,
s = 4
)
plt.legend()
plt.show()
Interactive 3D plot with plotly¶
# Note: interactive 3D representations require a lot of computing power
si = np.ones(7349)-0.7  # optional marker sizes (see the commented-out size argument below)
fig = px.scatter_3d(data_visualisation_transformed,
x=data_visualisation_transformed[:, 0],
y=data_visualisation_transformed[:, 1],
z=data_visualisation_transformed[:, 2],
color=df['Activity'],
#size=si
)
fig.update_traces(marker=dict(size=2.5,line=dict(width=0.05,color='azure')),selector=dict(mode='markers'))
fig.show()
Results of the visualization with PCA¶
- After the PCA / transformation into 3 principal components we can visually separate the classes reasonably well.
- However, the 3 principal components describe only 62% of the variance of the original data. This means we have a relatively large loss of information (see the sketch below).
- A second problem is that models built on the output of a PCA are difficult to interpret
--> we use PCA only for visualization and not for modeling
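To put the information loss into perspective, a PCA without a component limit shows how many components would be needed to retain e.g. 90% of the variance (a minimal sketch, assuming data_visualisation is the scaled feature matrix from above):
# cumulative explained variance of a full PCA
p_full = PCA().fit(data_visualisation)
cum_var = np.cumsum(p_full.explained_variance_ratio_)
print('Components needed for 90% of the variance: ' + str(int(np.argmax(cum_var >= 0.90)) + 1))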
Feature overview¶
From the original paper on the data / the source of the data we can obtain the information that there are the following 17 main features in the time and frequency domain of the signal (a quick cross-check via the column names follows the table):
Name | Time | Freq.
---|---|---
Body Acc | 1 | 1
Gravity Acc | 1 | 0
Body Acc Jerk | 1 | 1
Body Angular Speed | 1 | 1
Body Angular Acc | 1 | 0
Body Acc Magnitude | 1 | 1
Gravity Acc Mag | 1 | 0
Body Acc Jerk Mag | 1 | 1
Body Angular Speed Mag | 1 | 1
Body Angular Acc Mag | 1 | 1
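As a cross-check, the time- and frequency-domain features can be counted via the t/f prefix of their column names (a minimal sketch; the remaining feature columns are the angle(...) features):
# count time-domain ('t...') and frequency-domain ('f...') features by name prefix
feature_cols = [c for c in df.columns if c not in ('subject', 'Activity', 'type')]
n_time = sum(c.startswith('t') for c in feature_cols)
n_freq = sum(c.startswith('f') for c in feature_cols)
print('time domain: ' + str(n_time) + ' | frequency domain: ' + str(n_freq) + ' | other: ' + str(len(feature_cols) - n_time - n_freq))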
data_temp = df.drop('subject', axis =1).drop('Activity', axis=1).drop('type', axis =1)
print('Features: ' + str(data_temp.shape[1]))
Features: 561
data_temp.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7349 entries, 0 to 7348 Columns: 561 entries, tBodyAcc-mean()-X to angle(Z,gravityMean) dtypes: float64(561) memory usage: 31.5 MB
Check for multicollinearity¶
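The variance inflation factor of a feature i is defined as VIF_i = 1 / (1 - R_i^2), where R_i^2 is the coefficient of determination of a regression of feature i on all other features; statsmodels' variance_inflation_factor computes exactly this. As an illustration, the value for a single feature can be computed by hand (a minimal sketch, assuming data_temp from above; statsmodels regresses without an intercept, so the numbers can differ slightly):
from sklearn.linear_model import LinearRegression

# VIF of the first feature by hand: regress it on all remaining features
X_others = data_temp.iloc[:, 1:].values
y_first = data_temp.iloc[:, 0].values
r2 = LinearRegression().fit(X_others, y_first).score(X_others, y_first)
# if r2 is numerically 1, this divides by zero (cf. the RuntimeWarning below)
print('VIF of ' + data_temp.columns[0] + ': ' + str(1.0 / (1.0 - r2)))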
variables = data_temp
vif = pd.DataFrame()
# This takes many minutes to compute
# This is very cpu intensive
tempList = list()
total = variables.shape[1]
for i in range(total):
print(i, ' out of ', total-1)
x = variance_inflation_factor(variables.values, i)
tempList.append(x)
0 out of 560  1 out of 560  2 out of 560  ... (progress output truncated) ...  560 out of 560
C:\Users\eebal\Anaconda3\lib\site-packages\statsmodels\stats\outliers_influence.py:193: RuntimeWarning: divide by zero encountered in double_scalars
# add name of features to vif dataframe
vif["Features"] = variables.columns
# add the computed VIF Values to the vif dataframe
vif["VIF"] = tempList
vif[vif['VIF']>10]
VIF | Features | |
---|---|---|
0 | 1.673655e+02 | tBodyAcc-mean()-X |
2 | 2.458447e+01 | tBodyAcc-mean()-Z |
3 | 2.000913e+06 | tBodyAcc-std()-X |
4 | 5.503812e+05 | tBodyAcc-std()-Y |
5 | 5.281580e+05 | tBodyAcc-std()-Z |
... | ... | ... |
552 | 7.882126e+01 | fBodyBodyGyroJerkMag-skewness() |
553 | 1.882061e+02 | fBodyBodyGyroJerkMag-kurtosis() |
558 | 1.310881e+03 | angle(X,gravityMean) |
559 | 3.877562e+02 | angle(Y,gravityMean) |
560 | 2.950325e+02 | angle(Z,gravityMean) |
522 rows × 2 columns
#Drop all features with a VIF > 10, as they correlate too much with other features
features_to_drop = vif[vif['VIF']>10].Features
data_with_important_features = data_temp.drop(features_to_drop, axis = 1)
data_with_important_features.head()
tBodyAcc-mean()-Y | tBodyAcc-correlation()-X,Y | tBodyAcc-correlation()-X,Z | tBodyAcc-correlation()-Y,Z | tGravityAcc-correlation()-X,Y | tGravityAcc-correlation()-X,Z | tGravityAcc-correlation()-Y,Z | tBodyAccJerk-mean()-X | tBodyAccJerk-mean()-Y | tBodyAccJerk-mean()-Z | ... | fBodyAccJerk-maxInds-X | fBodyAccJerk-maxInds-Y | fBodyAccJerk-maxInds-Z | fBodyGyro-meanFreq()-X | fBodyGyro-meanFreq()-Y | fBodyGyro-meanFreq()-Z | angle(tBodyAccMean,gravity) | angle(tBodyAccJerkMean),gravityMean) | angle(tBodyGyroMean,gravityMean) | angle(tBodyGyroJerkMean,gravityMean) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.020294 | 0.376314 | 0.435129 | 0.660790 | 0.570222 | 0.439027 | 0.986913 | 0.077996 | 0.005001 | -0.067831 | ... | 1.00 | -0.24 | -1.00 | -0.257549 | 0.097947 | 0.547151 | -0.112754 | 0.030400 | -0.464761 | -0.018446 |
1 | -0.016411 | -0.013429 | -0.072692 | 0.579382 | -0.831284 | -0.865711 | 0.974386 | 0.074007 | 0.005771 | 0.029377 | ... | -0.32 | -0.12 | -0.32 | -0.048167 | -0.401608 | -0.068178 | 0.053477 | -0.007435 | -0.732626 | 0.703511 |
2 | -0.019467 | -0.124698 | -0.181105 | 0.608900 | -0.181090 | 0.337936 | 0.643417 | 0.073636 | 0.003104 | -0.009046 | ... | -0.16 | -0.48 | -0.28 | -0.216685 | -0.017264 | -0.110720 | -0.118559 | 0.177899 | 0.100699 | 0.808529 |
3 | -0.026201 | -0.305693 | -0.362654 | 0.507459 | -0.991309 | -0.968821 | 0.984256 | 0.077321 | 0.020058 | -0.009865 | ... | -0.12 | -0.56 | -0.28 | 0.216862 | -0.135245 | -0.049728 | -0.036788 | -0.012892 | 0.640011 | -0.485366 |
4 | -0.016570 | -0.155804 | -0.189763 | 0.599213 | -0.408330 | -0.184840 | 0.964797 | 0.073444 | 0.019122 | 0.016780 | ... | -0.32 | -0.08 | 0.04 | -0.153343 | -0.088403 | -0.162230 | 0.123320 | 0.122542 | 0.693578 | -0.615971 |
5 rows × 39 columns
data_with_important_features.describe(include='all').transpose()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
tBodyAcc-mean()-Y | 7349.0 | -0.016299 | 0.038629 | -1.000000 | -0.021403 | -0.017029 | -0.012643 | 1.000000 |
tBodyAcc-correlation()-X,Y | 7349.0 | -0.024801 | 0.360763 | -1.000000 | -0.242409 | -0.058502 | 0.172579 | 1.000000 |
tBodyAcc-correlation()-X,Z | 7349.0 | -0.185987 | 0.338450 | -1.000000 | -0.395440 | -0.171765 | 0.020009 | 1.000000 |
tBodyAcc-correlation()-Y,Z | 7349.0 | 0.096760 | 0.415218 | -1.000000 | -0.187455 | 0.138724 | 0.405517 | 1.000000 |
tGravityAcc-correlation()-X,Y | 7349.0 | 0.103601 | 0.735996 | -1.000000 | -0.665946 | 0.231863 | 0.851334 | 1.000000 |
tGravityAcc-correlation()-X,Z | 7349.0 | -0.166366 | 0.726580 | -1.000000 | -0.879864 | -0.367198 | 0.567089 | 1.000000 |
tGravityAcc-correlation()-Y,Z | 7349.0 | 0.075266 | 0.735481 | -1.000000 | -0.688232 | 0.185269 | 0.823623 | 1.000000 |
tBodyAccJerk-mean()-X | 7349.0 | 0.077434 | 0.108154 | -0.581496 | 0.071259 | 0.075963 | 0.081112 | 0.855403 |
tBodyAccJerk-mean()-Y | 7349.0 | 0.009156 | 0.124891 | -0.948714 | 0.000186 | 0.010928 | 0.021387 | 0.835172 |
tBodyAccJerk-mean()-Z | 7349.0 | -0.003619 | 0.113880 | -0.807273 | -0.015246 | -0.001097 | 0.012727 | 0.735045 |
tBodyAccJerk-correlation()-X,Y | 7349.0 | -0.074377 | 0.242017 | -1.000000 | -0.228501 | -0.078196 | 0.075621 | 1.000000 |
tBodyAccJerk-correlation()-X,Z | 7349.0 | 0.039575 | 0.284099 | -1.000000 | -0.145351 | 0.062285 | 0.232011 | 1.000000 |
tBodyAccJerk-correlation()-Y,Z | 7349.0 | 0.054372 | 0.274488 | -1.000000 | -0.127156 | 0.042153 | 0.222467 | 1.000000 |
tBodyGyro-mean()-X | 7349.0 | -0.028776 | 0.074857 | -0.757669 | -0.034259 | -0.027690 | -0.021722 | 0.524123 |
tBodyGyro-mean()-Y | 7349.0 | -0.075976 | 0.086467 | -0.851973 | -0.089776 | -0.074841 | -0.062999 | 1.000000 |
tBodyGyro-entropy()-X | 7349.0 | -0.238753 | 0.401771 | -1.000000 | -0.550684 | -0.290427 | 0.102805 | 0.944059 |
tBodyGyro-entropy()-Y | 7349.0 | -0.205995 | 0.364969 | -1.000000 | -0.457364 | -0.209115 | 0.067220 | 1.000000 |
tBodyGyro-entropy()-Z | 7349.0 | -0.196045 | 0.459841 | -1.000000 | -0.563556 | -0.272572 | 0.228766 | 1.000000 |
tBodyGyro-correlation()-X,Y | 7349.0 | -0.188702 | 0.405215 | -1.000000 | -0.483125 | -0.208589 | 0.084417 | 1.000000 |
tBodyGyro-correlation()-X,Z | 7349.0 | 0.041870 | 0.408883 | -1.000000 | -0.220456 | 0.018146 | 0.302452 | 1.000000 |
tBodyGyro-correlation()-Y,Z | 7349.0 | -0.168283 | 0.440983 | -1.000000 | -0.530728 | -0.177793 | 0.150636 | 1.000000 |
tBodyGyroJerk-mean()-X | 7349.0 | -0.098036 | 0.086125 | -0.968976 | -0.104611 | -0.098272 | -0.092254 | 1.000000 |
tBodyGyroJerk-mean()-Y | 7349.0 | -0.041384 | 0.083025 | -1.000000 | -0.047188 | -0.040452 | -0.033379 | 0.720665 |
tBodyGyroJerk-mean()-Z | 7349.0 | -0.055329 | 0.087333 | -0.749480 | -0.064313 | -0.054572 | -0.047179 | 0.600600 |
tBodyGyroJerk-correlation()-X,Y | 7349.0 | 0.017183 | 0.278593 | -1.000000 | -0.169257 | 0.013469 | 0.198487 | 1.000000 |
tBodyGyroJerk-correlation()-X,Z | 7349.0 | 0.063215 | 0.266236 | -1.000000 | -0.106320 | 0.057007 | 0.226712 | 1.000000 |
tBodyGyroJerk-correlation()-Y,Z | 7349.0 | -0.111144 | 0.262790 | -1.000000 | -0.279851 | -0.112190 | 0.051061 | 1.000000 |
fBodyAcc-meanFreq()-Y | 7349.0 | 0.057152 | 0.247243 | -1.000000 | -0.099065 | 0.053560 | 0.228640 | 1.000000 |
fBodyAcc-meanFreq()-Z | 7349.0 | 0.104336 | 0.265157 | -1.000000 | -0.069322 | 0.118173 | 0.288123 | 1.000000 |
fBodyAccJerk-maxInds-X | 7349.0 | -0.294016 | 0.275268 | -1.000000 | -0.480000 | -0.280000 | -0.120000 | 1.000000 |
fBodyAccJerk-maxInds-Y | 7349.0 | -0.354530 | 0.273795 | -1.000000 | -0.560000 | -0.360000 | -0.200000 | 1.000000 |
fBodyAccJerk-maxInds-Z | 7349.0 | -0.261875 | 0.274175 | -1.000000 | -0.440000 | -0.280000 | -0.080000 | 1.000000 |
fBodyGyro-meanFreq()-X | 7349.0 | -0.064359 | 0.263692 | -1.000000 | -0.229611 | -0.054360 | 0.112420 | 1.000000 |
fBodyGyro-meanFreq()-Y | 7349.0 | -0.168671 | 0.274139 | -1.000000 | -0.357676 | -0.174484 | 0.018202 | 1.000000 |
fBodyGyro-meanFreq()-Z | 7349.0 | -0.013731 | 0.263507 | -1.000000 | -0.191156 | -0.019524 | 0.161677 | 1.000000 |
angle(tBodyAccMean,gravity) | 7349.0 | 0.010961 | 0.267908 | -0.977973 | -0.074345 | 0.008115 | 0.096642 | 0.977810 |
angle(tBodyAccJerkMean),gravityMean) | 7349.0 | 0.005884 | 0.369485 | -0.979326 | -0.206153 | 0.010343 | 0.211965 | 0.991899 |
angle(tBodyGyroMean,gravityMean) | 7349.0 | 0.013123 | 0.488843 | -1.000000 | -0.344634 | 0.010699 | 0.381334 | 0.994532 |
angle(tBodyGyroJerkMean,gravityMean) | 7349.0 | -0.006028 | 0.460188 | -1.000000 | -0.359519 | 0.001035 | 0.335377 | 1.000000 |
plt.figure(figsize=(28,15))
boxplot = sns.boxplot(x="variable", y="value", data=pd.melt(data_with_important_features.iloc[:,:19]))
boxplot.xaxis.set_ticklabels(boxplot.xaxis.get_ticklabels(), rotation=90, ha='right', fontsize=15)
plt.ylabel('values')
plt.xlabel('Features')
Text(0.5, 0, 'Features')
plt.figure(figsize=(28,15))
boxplot = sns.boxplot(x="variable", y="value", data=pd.melt(data_with_important_features.iloc[:,19:]))
boxplot.xaxis.set_ticklabels(boxplot.xaxis.get_ticklabels(), rotation=90, ha='right', fontsize=15)
plt.ylabel('values')
plt.xlabel('Features')
Text(0.5, 0, 'Features')
Summary of the data understanding¶
- 1 column with the subject ID 'subject' --> only used for filtering, must be removed
- 1 target column 'Activity' --> 4 classes (WALKING, SITTING, LAYING, STANDING), slightly imbalanced
- we have a multi-class classification problem --> supervised machine learning
- 1 "support column" 'type' --> indicates whether an observation comes from the train or the test group --> train-test split along this value
- the features were computed, where possible, for each of the three axes x, y and z
- in total this gives 561 features --> analyzing each feature in detail is very difficult due to the large number of features; in addition, we have multicollinearity in the data
- after the multicollinearity check only 39 features remain. The boxplots show outliers for some features, but these outliers are hard to analyze and interpret because the data apparently has already been partially standardized and we have no measurement units. Therefore we accept these outliers for now
Train-Test Split¶
data = df.copy()
data_train = data[data['type']=='train']
data_test = data[data['type']=='test']
X_train = data_train.drop('subject', axis =1).drop('Activity', axis=1).drop('type', axis =1)
X_train = X_train.drop(features_to_drop, axis =1)
y_train = data_train['Activity']
X_test = data_test.drop('subject', axis =1).drop('Activity', axis=1).drop('type', axis =1)
X_test = X_test.drop(features_to_drop, axis =1)
y_test = data_test['Activity']
print('X_train: ' + str(X_train.shape))
print('y_train: ' + str(y_train.shape))
print('X_test: ' + str(X_test.shape))
print('y_test: ' + str(y_test.shape))
X_train: (5293, 39) y_train: (5293,) X_test: (2056, 39) y_test: (2056,)
Oversampling the data¶
ros = RandomOverSampler(random_state=0)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)
y_train_resampled.value_counts().plot.bar()
<AxesSubplot:>
4. Modeling¶
To build different models it is good practice to use pipelines and GridSearchCV. These two tools are an easy way to perform hyperparameter tuning and k-fold cross-validation on different models. In this case the pipelines consist of a StandardScaler and a one-vs-rest classifier.
Building & evaluating different models¶
KNN¶
K-Nearest Neighbors
# Pipeline for KNN
pipeline1 = Pipeline([
("scaler", StandardScaler()),
("knn", OneVsRestClassifier(KNeighborsClassifier()))
])
pipeline1.get_params().keys()
dict_keys(['memory', 'steps', 'verbose', 'scaler', 'knn', 'scaler__copy', 'scaler__with_mean', 'scaler__with_std', 'knn__estimator__algorithm', 'knn__estimator__leaf_size', 'knn__estimator__metric', 'knn__estimator__metric_params', 'knn__estimator__n_jobs', 'knn__estimator__n_neighbors', 'knn__estimator__p', 'knn__estimator__weights', 'knn__estimator', 'knn__n_jobs'])
#GridSearchCV-Object
clf = GridSearchCV(pipeline1, verbose=10, param_grid = {
"knn__estimator__n_neighbors": [8, 10, 20, 30]
})
clf.fit(X_train_resampled, y_train_resampled)
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] knn__estimator__n_neighbors=8 ... score=0.769 ... (per-fold progress truncated; the per-fold scores appear in results_KNN below)
[Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 1.2min finished
GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()), ('knn', OneVsRestClassifier(estimator=KNeighborsClassifier()))]), param_grid={'knn__estimator__n_neighbors': [8, 10, 20, 30]}, verbose=10)
print('Best parameters ' + str(clf.best_params_))
print('Best model score: ' + str(clf.best_score_))
Best parameters {'knn__estimator__n_neighbors': 20} Best model score: 0.7988723899743437
results_KNN = pd.DataFrame(clf.cv_results_)
results_KNN
mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_knn__estimator__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.291415 | 0.020557 | 3.064704 | 0.088606 | 8 | {'knn__estimator__n_neighbors': 8} | 0.769094 | 0.753996 | 0.801066 | 0.832000 | 0.817778 | 0.794787 | 0.029247 | 3 |
1 | 0.305789 | 0.014108 | 3.177124 | 0.047759 | 10 | {'knn__estimator__n_neighbors': 10} | 0.769094 | 0.748668 | 0.815275 | 0.827556 | 0.824000 | 0.796919 | 0.031973 | 2 |
2 | 0.291104 | 0.008228 | 3.165741 | 0.071301 | 20 | {'knn__estimator__n_neighbors': 20} | 0.781528 | 0.733570 | 0.827709 | 0.819556 | 0.832000 | 0.798872 | 0.037207 | 1 |
3 | 0.291344 | 0.011009 | 3.311857 | 0.029414 | 30 | {'knn__estimator__n_neighbors': 30} | 0.768206 | 0.726465 | 0.819716 | 0.819556 | 0.825778 | 0.791944 | 0.038806 | 4 |
classification_report_test = classification_report(y_test, clf.best_estimator_.predict(X_test), output_dict=True)
classification_report_train = classification_report(y_train_resampled, clf.best_estimator_.predict(X_train_resampled), output_dict=True)
#helper function
def make_results(classification_report_train, classification_report_test, model_type):
    # combine the train and test classification reports into one dataframe,
    # indexed by metric, source (train/test) and model type
    df_rep1 = pd.DataFrame(classification_report_train)
    df_rep1['source'] = 'train'
    df_rep1 = df_rep1.set_index([df_rep1.index, 'source'])
    df_rep2 = pd.DataFrame(classification_report_test)
    df_rep2['source'] = 'test'
    df_rep2 = df_rep2.set_index([df_rep2.index, 'source'])
    frames = [df_rep1, df_rep2]
    df_rep = pd.concat(frames)
    df_rep['model_type'] = model_type
    df_rep = df_rep.set_index([df_rep.index, 'model_type'])
    return df_rep.transpose()
results_KNN = make_results(classification_report_train, classification_report_test, 'KNN')
results_KNN
precision | recall | f1-score | support | precision | recall | f1-score | support | |
---|---|---|---|---|---|---|---|---|
source | train | train | train | train | test | test | test | test |
model_type | KNN | KNN | KNN | KNN | KNN | KNN | KNN | KNN |
LAYING | 0.912238 | 0.805259 | 0.855417 | 1407.000000 | 0.850000 | 0.633147 | 0.725720 | 537.000000 |
SITTING | 0.810458 | 0.793177 | 0.801724 | 1407.000000 | 0.631295 | 0.714868 | 0.670487 | 491.000000 |
STANDING | 0.808673 | 0.901208 | 0.852437 | 1407.000000 | 0.757679 | 0.834586 | 0.794275 | 532.000000 |
WALKING | 0.976405 | 1.000000 | 0.988062 | 1407.000000 | 0.957198 | 0.991935 | 0.974257 | 496.000000 |
accuracy | 0.874911 | 0.874911 | 0.874911 | 0.874911 | 0.791342 | 0.791342 | 0.791342 | 0.791342 |
macro avg | 0.876944 | 0.874911 | 0.874410 | 5628.000000 | 0.799043 | 0.793634 | 0.791185 | 2056.000000 |
weighted avg | 0.876944 | 0.874911 | 0.874410 | 5628.000000 | 0.799743 | 0.791342 | 0.790227 | 2056.000000 |
Random Forest¶
# Pipeline
pipeline2 = Pipeline([
("scaler", StandardScaler()),
("ranfo", OneVsRestClassifier(RandomForestClassifier()))
])
param_grid = {'ranfo__estimator__max_depth': [5, 15, 30]}
#GridSearchCV-Object
clf2 = GridSearchCV(pipeline2, param_grid = param_grid, verbose=10)
clf2.fit(X_train_resampled, y_train_resampled)
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ranfo__estimator__max_depth=5 ... score=0.793 ... (per-fold progress truncated)
[Parallel(n_jobs=1)]: Done 15 out of 15 | elapsed: 2.9min finished
GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()), ('ranfo', OneVsRestClassifier(estimator=RandomForestClassifier()))]), param_grid={'ranfo__estimator__max_depth': [5, 15, 30]}, verbose=10)
print('Best parameters ' + str(clf2.best_params_))
print('Best model score: ' + str(clf2.best_score_))
Best parameters {'ranfo__estimator__max_depth': 30} Best model score: 0.8583949477008093
classification_report_test = classification_report(y_test, clf2.best_estimator_.predict(X_test), output_dict=True)
classification_report_train = classification_report(y_train_resampled, clf2.best_estimator_.predict(X_train_resampled), output_dict=True)
results_RandomForest = make_results(classification_report_train, classification_report_test, 'RandomForest')
results_RandomForest
precision | recall | f1-score | support | precision | recall | f1-score | support | |
---|---|---|---|---|---|---|---|---|
source | train | train | train | train | test | test | test | test |
model_type | RandomForest | RandomForest | RandomForest | RandomForest | RandomForest | RandomForest | RandomForest | RandomForest |
LAYING | 1.0 | 1.0 | 1.0 | 1407.0 | 0.821705 | 0.789572 | 0.805318 | 537.00000 |
SITTING | 1.0 | 1.0 | 1.0 | 1407.0 | 0.784519 | 0.763747 | 0.773994 | 491.00000 |
STANDING | 1.0 | 1.0 | 1.0 | 1407.0 | 0.829710 | 0.860902 | 0.845018 | 532.00000 |
WALKING | 1.0 | 1.0 | 1.0 | 1407.0 | 0.970588 | 0.997984 | 0.984095 | 496.00000 |
accuracy | 1.0 | 1.0 | 1.0 | 1.0 | 0.852140 | 0.852140 | 0.852140 | 0.85214 |
macro avg | 1.0 | 1.0 | 1.0 | 5628.0 | 0.851631 | 0.853051 | 0.852106 | 2056.00000 |
weighted avg | 1.0 | 1.0 | 1.0 | 5628.0 | 0.850813 | 0.852140 | 0.851239 | 2056.00000 |
Logistic Regression¶
# Pipeline
pipeline3 = Pipeline([
("scaler", StandardScaler()),
("logreg", OneVsRestClassifier(LogisticRegression(solver='newton-cg')))
])
#GridSearchCV-Object
clf3 = GridSearchCV(pipeline3, param_grid = {})
clf3.fit(X_train_resampled, y_train_resampled)
GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()), ('logreg', OneVsRestClassifier(estimator=LogisticRegression(solver='newton-cg')))]), param_grid={})
print('Best parameters ' + str(clf3.best_params_))
print('Best model score: ' + str(clf3.best_score_))
Best parameters {} Best model score: 0.7738230905861456
classification_report_test = classification_report(y_test, clf3.best_estimator_.predict(X_test), output_dict=True)
classification_report_train = classification_report(y_train_resampled, clf3.best_estimator_.predict(X_train_resampled), output_dict=True)
results_LogisticRegression = make_results(classification_report_train, classification_report_test, 'LogReg')
results_LogisticRegression
precision | recall | f1-score | support | precision | recall | f1-score | support | |
---|---|---|---|---|---|---|---|---|
source | train | train | train | train | test | test | test | test |
model_type | LogReg | LogReg | LogReg | LogReg | LogReg | LogReg | LogReg | LogReg |
LAYING | 0.761204 | 0.772566 | 0.766843 | 1407.000000 | 0.690702 | 0.677840 | 0.684211 | 537.000000 |
SITTING | 0.747636 | 0.730633 | 0.739037 | 1407.000000 | 0.687970 | 0.745418 | 0.715543 | 491.000000 |
STANDING | 0.791356 | 0.754797 | 0.772645 | 1407.000000 | 0.833684 | 0.744361 | 0.786495 | 532.000000 |
WALKING | 0.924477 | 0.974414 | 0.948789 | 1407.000000 | 0.873563 | 0.919355 | 0.895874 | 496.000000 |
accuracy | 0.808102 | 0.808102 | 0.808102 | 0.808102 | 0.769455 | 0.769455 | 0.769455 | 0.769455 |
macro avg | 0.806169 | 0.808102 | 0.806828 | 5628.000000 | 0.771480 | 0.771743 | 0.770530 | 2056.000000 |
weighted avg | 0.806169 | 0.808102 | 0.806828 | 5628.000000 | 0.771161 | 0.769455 | 0.769222 | 2056.000000 |
Decision Tree¶
# Pipeline
pipeline4 = Pipeline([
("scaler", StandardScaler()),
("decisiontree", OneVsRestClassifier(DecisionTreeClassifier()))
])
#GridSearchCV-Object
clf4 = GridSearchCV(pipeline4, param_grid = {})
clf4.fit(X_train_resampled, y_train_resampled)
GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()), ('decisiontree', OneVsRestClassifier(estimator=DecisionTreeClassifier()))]), param_grid={})
print('Best parameters ' + str(clf4.best_params_))
print('Best model score: ' + str(clf4.best_score_))
Best parameters {} Best model score: 0.6778698243536609
classification_report_test = classification_report(y_test, clf4.best_estimator_.predict(X_test), output_dict=True)
classification_report_train = classification_report(y_train_resampled, clf4.best_estimator_.predict(X_train_resampled), output_dict=True)
results_DecisionTree = make_results(classification_report_train, classification_report_test, 'DecisionTree')
results_DecisionTree
precision | recall | f1-score | support | precision | recall | f1-score | support | |
---|---|---|---|---|---|---|---|---|
source | train | train | train | train | test | test | test | test |
model_type | DecisionTree | DecisionTree | DecisionTree | DecisionTree | DecisionTree | DecisionTree | DecisionTree | DecisionTree |
LAYING | 1.0 | 1.0 | 1.0 | 1407.0 | 0.803797 | 0.472998 | 0.595545 | 537.000000 |
SITTING | 1.0 | 1.0 | 1.0 | 1407.0 | 0.642132 | 0.515275 | 0.571751 | 491.000000 |
STANDING | 1.0 | 1.0 | 1.0 | 1407.0 | 0.695391 | 0.652256 | 0.673133 | 532.000000 |
WALKING | 1.0 | 1.0 | 1.0 | 1407.0 | 0.578512 | 0.987903 | 0.729710 | 496.000000 |
accuracy | 1.0 | 1.0 | 1.0 | 1.0 | 0.653696 | 0.653696 | 0.653696 | 0.653696 |
macro avg | 1.0 | 1.0 | 1.0 | 5628.0 | 0.679958 | 0.657108 | 0.642535 | 2056.000000 |
weighted avg | 1.0 | 1.0 | 1.0 | 5628.0 | 0.682790 | 0.653696 | 0.642306 | 2056.000000 |
SVM¶
Support Vector Machine
# Pipeline
pipeline5 = Pipeline([
("scaler", StandardScaler()),
("svm", OneVsRestClassifier(svm.SVC()))
])
param_grid = {
'svm__estimator__kernel': ['linear']}
#GridSearchCV-Object
clf5 = GridSearchCV(pipeline5, param_grid)
clf5.fit(X_train_resampled, y_train_resampled)
GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()), ('svm', OneVsRestClassifier(estimator=SVC()))]), param_grid={'svm__estimator__kernel': ['linear']})
print('Best parameters ' + str(clf5.best_params_))
print('Best model score: ' + str(clf5.best_score_))
Best parameters {'svm__estimator__kernel': 'linear'} Best model score: 0.776133570159858
results_SVM= pd.DataFrame(clf5.cv_results_)
results_SVM
mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_svm__estimator__kernel | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6.828282 | 0.334218 | 0.329354 | 0.019774 | linear | {'svm__estimator__kernel': 'linear'} | 0.740675 | 0.723801 | 0.784192 | 0.818667 | 0.813333 | 0.776134 | 0.038089 | 1 |
classification_report_test = classification_report(y_test, clf5.best_estimator_.predict(X_test), output_dict=True)
classification_report_train = classification_report(y_train_resampled, clf5.best_estimator_.predict(X_train_resampled), output_dict=True)
results_SVM = make_results(classification_report_train, classification_report_test, 'SVM')
results_SVM
precision | recall | f1-score | support | precision | recall | f1-score | support | |
---|---|---|---|---|---|---|---|---|
source | train | train | train | train | test | test | test | test |
model_type | SVM | SVM | SVM | SVM | SVM | SVM | SVM | SVM |
LAYING | 0.773826 | 0.773276 | 0.773551 | 1407.000000 | 0.689394 | 0.677840 | 0.683568 | 537.000000 |
SITTING | 0.753613 | 0.741294 | 0.747402 | 1407.000000 | 0.675978 | 0.739308 | 0.706226 | 491.000000 |
STANDING | 0.790169 | 0.765458 | 0.777617 | 1407.000000 | 0.837607 | 0.736842 | 0.784000 | 532.000000 |
WALKING | 0.931525 | 0.976546 | 0.953505 | 1407.000000 | 0.871893 | 0.919355 | 0.894995 | 496.000000 |
accuracy | 0.814144 | 0.814144 | 0.814144 | 0.814144 | 0.766051 | 0.766051 | 0.766051 | 0.766051 |
macro avg | 0.812283 | 0.814144 | 0.813019 | 5628.000000 | 0.768718 | 0.768336 | 0.767197 | 2056.000000 |
weighted avg | 0.812283 | 0.814144 | 0.813019 | 5628.000000 | 0.768568 | 0.766051 | 0.765972 | 2056.000000 |
Summary¶
frames = [results_KNN,
results_RandomForest,
results_LogisticRegression,
results_DecisionTree,
results_SVM,
]
final_results = pd.concat(frames, axis=1).reindex(frames[0].index).transpose()
final_results.sort_index(level=1)
LAYING | SITTING | STANDING | WALKING | accuracy | macro avg | weighted avg | |||
---|---|---|---|---|---|---|---|---|---|
source | model_type | ||||||||
f1-score | test | DecisionTree | 0.595545 | 0.571751 | 0.673133 | 0.729710 | 0.653696 | 0.642535 | 0.642306 |
KNN | 0.725720 | 0.670487 | 0.794275 | 0.974257 | 0.791342 | 0.791185 | 0.790227 | ||
LogReg | 0.684211 | 0.715543 | 0.786495 | 0.895874 | 0.769455 | 0.770530 | 0.769222 | ||
RandomForest | 0.805318 | 0.773994 | 0.845018 | 0.984095 | 0.852140 | 0.852106 | 0.851239 | ||
SVM | 0.683568 | 0.706226 | 0.784000 | 0.894995 | 0.766051 | 0.767197 | 0.765972 | ||
precision | test | DecisionTree | 0.803797 | 0.642132 | 0.695391 | 0.578512 | 0.653696 | 0.679958 | 0.682790 |
KNN | 0.850000 | 0.631295 | 0.757679 | 0.957198 | 0.791342 | 0.799043 | 0.799743 | ||
LogReg | 0.690702 | 0.687970 | 0.833684 | 0.873563 | 0.769455 | 0.771480 | 0.771161 | ||
RandomForest | 0.821705 | 0.784519 | 0.829710 | 0.970588 | 0.852140 | 0.851631 | 0.850813 | ||
SVM | 0.689394 | 0.675978 | 0.837607 | 0.871893 | 0.766051 | 0.768718 | 0.768568 | ||
recall | test | DecisionTree | 0.472998 | 0.515275 | 0.652256 | 0.987903 | 0.653696 | 0.657108 | 0.653696 |
KNN | 0.633147 | 0.714868 | 0.834586 | 0.991935 | 0.791342 | 0.793634 | 0.791342 | ||
LogReg | 0.677840 | 0.745418 | 0.744361 | 0.919355 | 0.769455 | 0.771743 | 0.769455 | ||
RandomForest | 0.789572 | 0.763747 | 0.860902 | 0.997984 | 0.852140 | 0.853051 | 0.852140 | ||
SVM | 0.677840 | 0.739308 | 0.736842 | 0.919355 | 0.766051 | 0.768336 | 0.766051 | ||
support | test | DecisionTree | 537.000000 | 491.000000 | 532.000000 | 496.000000 | 0.653696 | 2056.000000 | 2056.000000 |
KNN | 537.000000 | 491.000000 | 532.000000 | 496.000000 | 0.791342 | 2056.000000 | 2056.000000 | ||
LogReg | 537.000000 | 491.000000 | 532.000000 | 496.000000 | 0.769455 | 2056.000000 | 2056.000000 | ||
RandomForest | 537.000000 | 491.000000 | 532.000000 | 496.000000 | 0.852140 | 2056.000000 | 2056.000000 | ||
SVM | 537.000000 | 491.000000 | 532.000000 | 496.000000 | 0.766051 | 2056.000000 | 2056.000000 | ||
f1-score | train | DecisionTree | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
KNN | 0.855417 | 0.801724 | 0.852437 | 0.988062 | 0.874911 | 0.874410 | 0.874410 | ||
LogReg | 0.766843 | 0.739037 | 0.772645 | 0.948789 | 0.808102 | 0.806828 | 0.806828 | ||
RandomForest | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ||
SVM | 0.773551 | 0.747402 | 0.777617 | 0.953505 | 0.814144 | 0.813019 | 0.813019 | ||
precision | train | DecisionTree | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
KNN | 0.912238 | 0.810458 | 0.808673 | 0.976405 | 0.874911 | 0.876944 | 0.876944 | ||
LogReg | 0.761204 | 0.747636 | 0.791356 | 0.924477 | 0.808102 | 0.806169 | 0.806169 | ||
RandomForest | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ||
SVM | 0.773826 | 0.753613 | 0.790169 | 0.931525 | 0.814144 | 0.812283 | 0.812283 | ||
recall | train | DecisionTree | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
KNN | 0.805259 | 0.793177 | 0.901208 | 1.000000 | 0.874911 | 0.874911 | 0.874911 | ||
LogReg | 0.772566 | 0.730633 | 0.754797 | 0.974414 | 0.808102 | 0.808102 | 0.808102 | ||
RandomForest | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ||
SVM | 0.773276 | 0.741294 | 0.765458 | 0.976546 | 0.814144 | 0.814144 | 0.814144 | ||
support | train | DecisionTree | 1407.000000 | 1407.000000 | 1407.000000 | 1407.000000 | 1.000000 | 5628.000000 | 5628.000000 |
KNN | 1407.000000 | 1407.000000 | 1407.000000 | 1407.000000 | 0.874911 | 5628.000000 | 5628.000000 | ||
LogReg | 1407.000000 | 1407.000000 | 1407.000000 | 1407.000000 | 0.808102 | 5628.000000 | 5628.000000 | ||
RandomForest | 1407.000000 | 1407.000000 | 1407.000000 | 1407.000000 | 1.000000 | 5628.000000 | 5628.000000 | ||
SVM | 1407.000000 | 1407.000000 | 1407.000000 | 1407.000000 | 0.814144 | 5628.000000 | 5628.000000 |
final_results.sort_index(level=1).to_csv('Results_VIF_lower10.csv')
5. Evaluation¶
#helper function
def make_confusion_matrix(y_test, X_test, the_clf):
y_pred = the_clf.best_estimator_.predict(X_test)
cf_matrix = confusion_matrix(y_test,y_pred)
df_cf_matrix = pd.DataFrame(cf_matrix,columns=the_clf.best_estimator_.steps[1][1].classes_)
df_cf_matrix.index = the_clf.best_estimator_.steps[1][1].classes_
plt.figure(figsize=(28,7))
plt.subplot(1,3,1) # first heatmap
heatmap1 = sns.heatmap(df_cf_matrix, annot=True,fmt='d');
heatmap1.yaxis.set_ticklabels(heatmap1.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=10)
heatmap1.xaxis.set_ticklabels(heatmap1.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=10)
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.subplot(1,3,2) # second heatmap
heatmap2 = sns.heatmap(df_cf_matrix/np.sum(df_cf_matrix), annot=True, fmt='.2%')
heatmap2.yaxis.set_ticklabels(heatmap2.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=10)
heatmap2.xaxis.set_ticklabels(heatmap2.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=10)
plt.ylabel('True label')
plt.xlabel('Predicted label')
fig_name = str(the_clf.best_estimator_.steps[1][1].estimators_[0]) + '_confusionMatrix.png'
plt.savefig(fig_name, dpi=150, format='png')
Achieved accuracy¶
accuracy_train_test = pd.DataFrame()
accuracy_train_test['accuracy_train'] = final_results.xs("train", level=1).xs("precision", level=0)['accuracy']
accuracy_train_test['accuracy_test'] = final_results.xs("test", level=1).xs("precision", level=0)['accuracy']
accuracy_train_test['model'] = accuracy_train_test.index
accuracy_train_test['diff_train_test'] = accuracy_train_test['accuracy_train'] - accuracy_train_test['accuracy_test']
accuracy_train_test
accuracy_train | accuracy_test | model | diff_train_test | |
---|---|---|---|---|
model_type | ||||
KNN | 0.874911 | 0.791342 | KNN | 0.083569 |
RandomForest | 1.000000 | 0.852140 | RandomForest | 0.147860 |
LogReg | 0.808102 | 0.769455 | LogReg | 0.038647 |
DecisionTree | 1.000000 | 0.653696 | DecisionTree | 0.346304 |
SVM | 0.814144 | 0.766051 | SVM | 0.048093 |
plt.figure(figsize=(10,7))
df_temp = accuracy_train_test.melt('model', var_name='cols', value_name='vals')
ax = sns.barplot(x="model", y="vals", hue="cols", data=df_temp)
#for index, row in df_temp:
# ax.text(row.name,row.vals, round(row.vals,2), color='black', ha="center")
for p in ax.patches:
ax.annotate(format(p.get_height(), '.3f'),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha = 'center', va = 'center',
xytext = (0, 9),
textcoords = 'offset points')
plt.ylabel('Accuracy')
plt.xlabel('Model')
fig_name = 'Accuracy_performance.png'
plt.savefig(fig_name, dpi=150, format='png')
df_temp
model | cols | vals | |
---|---|---|---|
0 | KNN | accuracy_train | 0.874911 |
1 | RandomForest | accuracy_train | 1.000000 |
2 | LogReg | accuracy_train | 0.808102 |
3 | DecisionTree | accuracy_train | 1.000000 |
4 | SVM | accuracy_train | 0.814144 |
5 | KNN | accuracy_test | 0.791342 |
6 | RandomForest | accuracy_test | 0.852140 |
7 | LogReg | accuracy_test | 0.769455 |
8 | DecisionTree | accuracy_test | 0.653696 |
9 | SVM | accuracy_test | 0.766051 |
10 | KNN | diff_train_test | 0.083569 |
11 | RandomForest | diff_train_test | 0.147860 |
12 | LogReg | diff_train_test | 0.038647 |
13 | DecisionTree | diff_train_test | 0.346304 |
14 | SVM | diff_train_test | 0.048093 |
- KNN --> performance on the test data is about 8% lower than on the training data. The model probably overfits the training data and does not generalize well to unseen test data.
- Random Forest --> performance on the test data is 15% lower than on the training data. The model overfits the training data and does not generalize well to the unseen test data.
- Logistic Regression --> training and test performance are very similar (diff = 3.9%). This probably means that the resulting model generalizes well to new data.
- Decision Tree --> performance on the test data is 34% lower than on the training data. The model overfits the training data and does not generalize well to the unseen test data.
- Support Vector Machine --> training and test performance are very similar (diff = 4.8%). This probably means that the resulting model generalizes well to new data. --> let's check the confusion matrices of the two best models
Confusion matrix - Logistic Regression¶
make_confusion_matrix(y_test,X_test, clf3)
Confusion matrix - SVM¶
make_confusion_matrix(y_test,X_test, clf5)
final_results.loc[(slice(None), 'test', ('LogReg','SVM')), :].to_csv('LogReg_SVM_results.csv')
final_results.loc[(slice(None), 'test', ('LogReg','SVM')), :]
LAYING | SITTING | STANDING | WALKING | accuracy | macro avg | weighted avg | |||
---|---|---|---|---|---|---|---|---|---|
source | model_type | ||||||||
precision | test | LogReg | 0.690702 | 0.687970 | 0.833684 | 0.873563 | 0.769455 | 0.771480 | 0.771161 |
recall | test | LogReg | 0.677840 | 0.745418 | 0.744361 | 0.919355 | 0.769455 | 0.771743 | 0.769455 |
f1-score | test | LogReg | 0.684211 | 0.715543 | 0.786495 | 0.895874 | 0.769455 | 0.770530 | 0.769222 |
support | test | LogReg | 537.000000 | 491.000000 | 532.000000 | 496.000000 | 0.769455 | 2056.000000 | 2056.000000 |
precision | test | SVM | 0.689394 | 0.675978 | 0.837607 | 0.871893 | 0.766051 | 0.768718 | 0.768568 |
recall | test | SVM | 0.677840 | 0.739308 | 0.736842 | 0.919355 | 0.766051 | 0.768336 | 0.766051 |
f1-score | test | SVM | 0.683568 | 0.706226 | 0.784000 | 0.894995 | 0.766051 | 0.767197 | 0.765972 |
support | test | SVM | 537.000000 | 491.000000 | 532.000000 | 496.000000 | 0.766051 | 2056.000000 | 2056.000000 |
Feature importance¶
- A one-vs-rest classifier consists of n individual classifiers (with n = number of classes). This makes it difficult to interpret the feature importance for the OvR classifier as a whole. One possibility is to take the coefficients of each individual model and average each coefficient over the models.
def get_the_feature_importances(clfobject, est_name, functype):
    feat_impts = []
    # functype 1: the estimators have a feature_importances_ attribute
    if functype == 1:
        for esti in clfobject.best_estimator_.named_steps[est_name].estimators_:
            feat_impts.append(np.abs(esti.feature_importances_))
    # functype 2: the estimators have a coef_ attribute
    if functype == 2:
        for esti in clfobject.best_estimator_.named_steps[est_name].estimators_:
            feat_impts.append(np.abs(esti.coef_[0]))
    # mean absolute importance per feature across the OvR sub-models
    return np.mean(feat_impts, axis=0)
feature_importance = pd.DataFrame()
feature_importance['Features'] = X_test.columns
feature_importance['Importance'] = get_the_feature_importances(clf3,'logreg',2)
Results¶
- The best model is the OvR classifier with logistic regression.
- It has a good accuracy (0.77: test, 0.81: train) and a good precision (0.87: test) for the class 'WALKING'.
- The 10 most important features of this OvR classifier are the following:
feature_importance.sort_values('Importance',ascending=0).head(10).to_csv('FeatureImportanceTop10.csv')
feature_importance.sort_values('Importance',ascending=0).to_csv('FeatureImportanceAll.csv')
feature_importance.sort_values('Importance',ascending=0).head(10)
Features | Importance | |
---|---|---|
15 | tBodyGyro-entropy()-X | 1.170518 |
16 | tBodyGyro-entropy()-Y | 1.070693 |
3 | tBodyAcc-correlation()-Y,Z | 0.791616 |
17 | tBodyGyro-entropy()-Z | 0.790347 |
14 | tBodyGyro-mean()-Y | 0.643740 |
32 | fBodyGyro-meanFreq()-X | 0.537990 |
29 | fBodyAccJerk-maxInds-X | 0.510996 |
18 | tBodyGyro-correlation()-X,Y | 0.501925 |
13 | tBodyGyro-mean()-X | 0.490046 |
19 | tBodyGyro-correlation()-X,Z | 0.468363 |
--> Most of these important features come from the gyroscope sensor in the smartphone.
Final model for deployment¶
final_model = clf3.best_estimator_
6. Deployment¶
Test the prediction locally¶
# recap: print first rows of predictors (here: test data without the target column "Activity")
X_test.head(4)
tBodyAcc-mean()-Y | tBodyAcc-correlation()-X,Y | tBodyAcc-correlation()-X,Z | tBodyAcc-correlation()-Y,Z | tGravityAcc-correlation()-X,Y | tGravityAcc-correlation()-X,Z | tGravityAcc-correlation()-Y,Z | tBodyAccJerk-mean()-X | tBodyAccJerk-mean()-Y | tBodyAccJerk-mean()-Z | ... | fBodyAccJerk-maxInds-X | fBodyAccJerk-maxInds-Y | fBodyAccJerk-maxInds-Z | fBodyGyro-meanFreq()-X | fBodyGyro-meanFreq()-Y | fBodyGyro-meanFreq()-Z | angle(tBodyAccMean,gravity) | angle(tBodyAccJerkMean),gravityMean) | angle(tBodyGyroMean,gravityMean) | angle(tBodyGyroJerkMean,gravityMean) | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5293 | -0.023285 | 0.076989 | -0.490546 | -0.709003 | 0.980580 | -0.996352 | -0.960117 | 0.072046 | 0.045754 | -0.106043 | ... | -0.52 | 0.08 | 0.32 | 0.184035 | -0.059323 | 0.438107 | 0.006462 | 0.162920 | -0.825886 | 0.271151 |
5294 | -0.013163 | -0.104983 | -0.429134 | 0.399177 | 0.945233 | -0.911415 | -0.738535 | 0.070181 | -0.017876 | -0.001721 | ... | -0.16 | -0.32 | -0.40 | 0.018109 | -0.227266 | -0.151698 | -0.083495 | 0.017500 | -0.434375 | 0.920593 |
5295 | -0.026050 | 0.305653 | -0.323848 | 0.279786 | 0.548432 | -0.334864 | 0.590302 | 0.069368 | -0.004908 | -0.013673 | ... | -0.64 | -0.40 | -0.44 | -0.479145 | -0.210084 | 0.049310 | -0.034956 | 0.202302 | 0.064103 | 0.145068 |
5296 | -0.032614 | -0.063792 | -0.167111 | 0.544916 | 0.985534 | 0.653169 | 0.746518 | 0.074853 | 0.032274 | 0.012141 | ... | -0.44 | -0.56 | -0.48 | -0.496954 | -0.499906 | -0.258896 | -0.017067 | 0.154438 | 0.340134 | 0.296407 |
4 rows × 39 columns
# recap: print first rows of the true labels of the test data
data_test['Activity'].head(4)
5293 STANDING 5294 STANDING 5295 STANDING 5296 STANDING Name: Activity, dtype: object
final_model.predict(X_test.head(4))
array(['LAYING', 'STANDING', 'STANDING', 'STANDING'], dtype='<U8')
--> final_model works and can be deployed
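For an actual deployment the fitted pipeline can be persisted and reloaded, e.g. with joblib (a minimal sketch; the file name is an assumption):
import joblib

# persist the fitted pipeline (scaler + OvR logistic regression) to disk
joblib.dump(final_model, 'final_model_activity.joblib')  # hypothetical file name

# reload it later, e.g. inside a prediction service
loaded_model = joblib.load('final_model_activity.joblib')
print(loaded_model.predict(X_test.head(4)))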