Analysis of the movement and activity of free-ranging cattle¶

1. Business Understanding¶

Farmers want to analyze how their livestock move and behave on the pasture. (see the Readme for further information)

2. Data and Data Understanding¶

2.1. Import of relevant data¶

In [34]:
import numpy as np # standard for data processing
import pandas as pd # standard for data processing

import plotly.graph_objects as go # creates plots
import seaborn as sns
from matplotlib import pyplot as plt
import plotly.express as px
from mpl_toolkits.mplot3d import Axes3D

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.ensemble import RandomForestClassifier

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report

Read data¶

Source of the data:

Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra and Jorge L. Reyes-Ortiz. A Public Domain Dataset for Human Activity Recognition Using Smartphones. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2013. Bruges, Belgium, 24-26 April 2013. Downloaded from Kaggle: https://www.kaggle.com/uciml/human-activity-recognition-with-smartphones

There are two separate data sets, one from the test group and one from the training group. For the further analysis they are first combined again. Each row is marked with whether it comes from the training or the test data set so that the two can be split apart again later.

In [35]:
df_train = pd.read_csv('https://storage.googleapis.com/ml-service-repository-datastorage/Analysis_of_the_movement_and_activity_of_free-ranging_cattle_train.csv', delimiter=',') # training split (the *_train.csv file name is assumed; the test split is loaded in the next cell)
In [36]:
df_test = pd.read_csv('https://storage.googleapis.com/ml-service-repository-datastorage/Analysis_of_the_movement_and_activity_of_free-ranging_cattle_test.csv', delimiter=',')
In [37]:
df_train['type']='train'
In [38]:
df_test['type']='test'
In [39]:
df = pd.concat([df_train, df_test], axis=0).reset_index(drop=True)
In [40]:
df.head()
Out[40]:
tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X tBodyAcc-std()-Y tBodyAcc-std()-Z tBodyAcc-mad()-X tBodyAcc-mad()-Y tBodyAcc-mad()-Z tBodyAcc-max()-X ... angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean) angle(X,gravityMean) angle(Y,gravityMean) angle(Z,gravityMean) subject Activity type
0 0.288585 -0.020294 -0.132905 -0.995279 -0.983111 -0.913526 -0.995112 -0.983185 -0.923527 -0.934724 ... -0.112754 0.030400 -0.464761 -0.018446 -0.841247 0.179941 -0.058627 1 STANDING train
1 0.278419 -0.016411 -0.123520 -0.998245 -0.975300 -0.960322 -0.998807 -0.974914 -0.957686 -0.943068 ... 0.053477 -0.007435 -0.732626 0.703511 -0.844788 0.180289 -0.054317 1 STANDING train
2 0.279653 -0.019467 -0.113462 -0.995380 -0.967187 -0.978944 -0.996520 -0.963668 -0.977469 -0.938692 ... -0.118559 0.177899 0.100699 0.808529 -0.848933 0.180637 -0.049118 1 STANDING train
3 0.279174 -0.026201 -0.123283 -0.996091 -0.983403 -0.990675 -0.997099 -0.982750 -0.989302 -0.938692 ... -0.036788 -0.012892 0.640011 -0.485366 -0.848649 0.181935 -0.047663 1 STANDING train
4 0.276629 -0.016570 -0.115362 -0.998139 -0.980817 -0.990482 -0.998321 -0.979672 -0.990441 -0.942469 ... 0.123320 0.122542 0.693578 -0.615971 -0.847865 0.185151 -0.043892 1 STANDING train

5 rows × 564 columns

3. Data exploration and preparation¶
In [41]:
print('Total number of observations: ' + str(df.shape[0]))
Total number of observations: 10299

Check for missing values¶

In [42]:
df.isna().sum()
Out[42]:
tBodyAcc-mean()-X       0
tBodyAcc-mean()-Y       0
tBodyAcc-mean()-Z       0
tBodyAcc-std()-X        0
tBodyAcc-std()-Y        0
                       ..
angle(Y,gravityMean)    0
angle(Z,gravityMean)    0
subject                 0
Activity                0
type                    0
Length: 564, dtype: int64
In [43]:
df.isna().sum().sum()
Out[43]:
0

--> No missing values in the data frame

Check for duplicates¶

In [44]:
df[df.duplicated()]
Out[44]:
tBodyAcc-mean()-X tBodyAcc-mean()-Y tBodyAcc-mean()-Z tBodyAcc-std()-X tBodyAcc-std()-Y tBodyAcc-std()-Z tBodyAcc-mad()-X tBodyAcc-mad()-Y tBodyAcc-mad()-Z tBodyAcc-max()-X ... angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean) angle(X,gravityMean) angle(Y,gravityMean) angle(Z,gravityMean) subject Activity type

0 rows × 564 columns

--> No duplicate rows in the data

Target variable¶

We have a classification problem, and the target is the column "Activity".

In [45]:
#Possible classes/labels
df['Activity'].unique()
Out[45]:
array(['STANDING', 'SITTING', 'LAYING', 'WALKING', 'WALKING_DOWNSTAIRS',
       'WALKING_UPSTAIRS'], dtype=object)
In [46]:
df['Activity'].value_counts().plot.bar()
Out[46]:
<AxesSubplot:>
[Figure: bar plot of observation counts per activity]
  • In the smart farming context of this task, a cow would normally not walk on stairs. Therefore the rows labeled "WALKING_DOWNSTAIRS" and "WALKING_UPSTAIRS" are removed:
In [47]:
indexNames = df[(df['Activity'] == 'WALKING_DOWNSTAIRS') | (df['Activity'] == 'WALKING_UPSTAIRS')].index
df.drop(indexNames , inplace=True)
df = df.reset_index(drop=True)
In [48]:
df['Activity'].value_counts().plot.bar()
Out[48]:
<AxesSubplot:>
[Figure: bar plot of observation counts per activity after removing the stair-walking classes]
  • There is roughly the same number of observations for each activity. Before modeling we could oversample all minority classes to obtain a perfectly balanced data set (a sketch of this step follows right below; the actual resampling is done just before modeling).
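A minimal sketch of such an oversampling step with imblearn's RandomOverSampler (illustrative only; the feature/label split shown here is an assumption, the real resampling of the training data happens later in the modeling part):

from imblearn.over_sampling import RandomOverSampler

# illustrative feature matrix / label split (not part of the original cells)
X_all = df.drop(['subject', 'Activity', 'type'], axis=1)
y_all = df['Activity']

ros = RandomOverSampler(random_state=0)
X_balanced, y_balanced = ros.fit_resample(X_all, y_all)
print(y_balanced.value_counts())   # every class now appears equally often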

How many observations are there from each test subject?¶

In [49]:
df['subject'].value_counts().plot.bar()
Out[49]:
<AxesSubplot:>
[Figure: bar plot of observation counts per subject]

The figure above is interesting. Normally, all test subjects went through the same series of experiments, so one would assume that the number of observations from each subject should be nearly identical. However, there is a spread of roughly 200 to 300 observations across the subjects. One reason could be that the transition from one activity to the next in the sequence was not sharp or distinct enough, and the experimenters deleted the unclear observations from these unstable phases afterwards.
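A quick way to check the claimed spread (a sketch; the exact numbers depend on the data):

counts = df['subject'].value_counts()
print('min:', counts.min(), 'max:', counts.max())   # spread of observations per subject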

Total number of observations¶

In [50]:
print('Number of observations: ' + str(df.shape[0]))
Number of observations: 7349

PCA for visualization¶

  • PCA is a simple method for visualizing high-dimensional data in a low-dimensional space. (Caution: we pay for this with a loss of information --> but for visualization purposes that is acceptable)
In [51]:
data_visualisation = df.drop('subject', axis =1).drop('Activity', axis=1).drop('type', axis =1)
In [52]:
s = StandardScaler()
data_visualisation = s.fit_transform(data_visualisation)
In [53]:
# We want 3 principal components
p = PCA(n_components = 3)
data_visualisation_transformed = p.fit_transform(data_visualisation)
#data_visualisation_transformed = p.transform(data_visualisation)
In [54]:
print('Features before PCA: ' + str(data_visualisation.shape[1]))
Features before PCA: 561
In [55]:
print('Features after PCA: ' + str(data_visualisation_transformed.shape[1]))
Features after PCA: 3
In [56]:
p.explained_variance_ratio_.sum()
Out[56]:
0.619964301867613
In [57]:
x = []
for i in range(0, len(p.explained_variance_ratio_)):
    x.append('PC' + str(i + 1))
y = np.array(p.explained_variance_ratio_)
z = np.cumsum(p.explained_variance_ratio_)
plt.xlabel('Principal Components')
plt.ylabel('Variance Explained')
plt.bar(x, y)
plt.plot(x, z)
Out[57]:
[<matplotlib.lines.Line2D at 0x1c5212bc790>]
[Figure: explained variance per principal component with cumulative curve]
In [58]:
#labels 'STANDING', 'SITTING', 'LAYING', 'WALKING'

#fig, ax = plt.subplots()
plt.figure(figsize=(15,6))
for activity in df['Activity'].unique():
    filtered_val = data_visualisation_transformed[df['Activity']==activity,:]
    plt.scatter(filtered_val[:,0], filtered_val[:,1], label=activity, s=1.5)

plt.legend()
plt.show()
[Figure: 2D scatter plot of the first two principal components, colored by activity]
In [59]:
fig = plt.figure(figsize = (15, 6))
ax = fig.add_subplot(111, projection='3d')

for activity in df['Activity'].unique():
    filtered_val = data_visualisation_transformed[df['Activity'] == activity, :]
    ax.scatter(
        filtered_val[:, 0], 
        filtered_val[:, 1], 
        filtered_val[:, 2], 
        label = activity, 
        s = 4
    )

plt.legend()
plt.show()
[Figure: 3D scatter plot of the three principal components, colored by activity]
In [60]:
### Interactive 3D plot with plotly

# these representations require a lot of computing power

si = np.ones(7349)-0.7
fig = px.scatter_3d(data_visualisation_transformed, 
                    x=data_visualisation_transformed[:, 0], 
                    y=data_visualisation_transformed[:, 1], 
                    z=data_visualisation_transformed[:, 2],
                    color=df['Activity'],
                    #size=si
                    )
fig.update_traces(marker=dict(size=2.5,line=dict(width=0.05,color='azure')),selector=dict(mode='markers'))
fig.show()

Results of the visualization with PCA¶

  • After the PCA / transformation into 3 principal components we can separate the classes visually in a reasonable way.
  • However, the 3 principal components only describe about 62% of the variance of the original data. This means we have a relatively large loss of information.
  • A second problem is that models built on the output of a PCA are difficult to interpret.

--> we use the PCA only for visualization and not for modeling
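To quantify the information loss, one can also ask how many principal components would be needed to explain, say, 90% of the variance; a minimal sketch (the 90% threshold is only an example):

# let PCA choose the number of components that explains 90% of the variance
p90 = PCA(n_components=0.9)
p90.fit(data_visualisation)          # data_visualisation is already standardized above
print('Components for 90% variance:', p90.n_components_)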

Feature overview¶

From the original paper on the data / the source of the data we can obtain the information that there are the following 17 main features in the time and frequency domain of the signal:

Name                   | Time | Freq.
-----------------------|------|------
Body Acc               | 1    | 1
Gravity Acc            | 1    | 0
Body Acc Jerk          | 1    | 1
Body Angular Speed     | 1    | 1
Body Angular Acc       | 1    | 0
Body Acc Magnitude     | 1    | 1
Gravity Acc Mag        | 1    | 0
Body Acc Jerk Mag      | 1    | 1
Body Angular Speed Mag | 1    | 1
Body Angular Acc Mag   | 1    | 1
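To relate these main signals to the 561 columns, one can count how many columns share each signal prefix; a minimal sketch over the column names (the split on '-' is an assumption about the naming scheme):

feature_cols = df.drop(['subject', 'Activity', 'type'], axis=1).columns
prefixes = pd.Series([c.split('-')[0] for c in feature_cols])   # e.g. 'tBodyAcc', 'fBodyGyro'
print(prefixes.value_counts())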

In [61]:
data_temp = df.drop('subject', axis =1).drop('Activity', axis=1).drop('type', axis =1)
In [62]:
print('Features: ' + str(data_temp.shape[1]))
Features: 561
In [63]:
data_temp.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7349 entries, 0 to 7348
Columns: 561 entries, tBodyAcc-mean()-X to angle(Z,gravityMean)
dtypes: float64(561)
memory usage: 31.5 MB

Check for multicollinearity¶

In [65]:
variables = data_temp
vif = pd.DataFrame()
In [66]:
# This takes many minutes to compute 
# This is very cpu intensive

tempList = list()
total = variables.shape[1]
for i in range(total):
    print(i, ' out of ', total-1)
    x = variance_inflation_factor(variables.values, i)
    tempList.append(x)
0  out of  560
1  out of  560
2  out of  560
[... per-feature progress output truncated ...]
C:\Users\eebal\Anaconda3\lib\site-packages\statsmodels\stats\outliers_influence.py:193: RuntimeWarning:

divide by zero encountered in double_scalars

[... per-feature progress output truncated ...]
559  out of  560
560  out of  560
In [67]:
# add name of features to vif dataframe
vif["Features"] = variables.columns

# add the computed VIF Values to the vif dataframe
vif["VIF"] = tempList
In [68]:
vif[vif['VIF']>10]
Out[68]:
VIF Features
0 1.673655e+02 tBodyAcc-mean()-X
2 2.458447e+01 tBodyAcc-mean()-Z
3 2.000913e+06 tBodyAcc-std()-X
4 5.503812e+05 tBodyAcc-std()-Y
5 5.281580e+05 tBodyAcc-std()-Z
... ... ...
552 7.882126e+01 fBodyBodyGyroJerkMag-skewness()
553 1.882061e+02 fBodyBodyGyroJerkMag-kurtosis()
558 1.310881e+03 angle(X,gravityMean)
559 3.877562e+02 angle(Y,gravityMean)
560 2.950325e+02 angle(Z,gravityMean)

522 rows × 2 columns
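For reference, the variance inflation factor of feature i is VIF_i = 1 / (1 - R_i²), where R_i² is obtained by regressing feature i on all other features; values above roughly 10 are a common rule of thumb for problematic multicollinearity, and an R_i² of exactly 1 explains the divide-by-zero warning above. A minimal sketch for a single column (equivalent in spirit to variance_inflation_factor, though statsmodels works on the raw design matrix without an extra intercept):

from sklearn.linear_model import LinearRegression

def vif_single(data, col):
    # regress the chosen column on all remaining columns and convert R^2 to a VIF
    X_other = data.drop(col, axis=1).values
    y_col = data[col].values
    r2 = LinearRegression().fit(X_other, y_col).score(X_other, y_col)
    return np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)

print(vif_single(data_temp, 'tBodyAcc-mean()-X'))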

In [69]:
# Drop all features with a VIF > 10, as they correlate too much with other features
features_to_drop = vif[vif['VIF']>10].Features
In [70]:
data_with_important_features = data_temp.drop(features_to_drop, axis = 1)
In [71]:
data_with_important_features.head()
Out[71]:
tBodyAcc-mean()-Y tBodyAcc-correlation()-X,Y tBodyAcc-correlation()-X,Z tBodyAcc-correlation()-Y,Z tGravityAcc-correlation()-X,Y tGravityAcc-correlation()-X,Z tGravityAcc-correlation()-Y,Z tBodyAccJerk-mean()-X tBodyAccJerk-mean()-Y tBodyAccJerk-mean()-Z ... fBodyAccJerk-maxInds-X fBodyAccJerk-maxInds-Y fBodyAccJerk-maxInds-Z fBodyGyro-meanFreq()-X fBodyGyro-meanFreq()-Y fBodyGyro-meanFreq()-Z angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean)
0 -0.020294 0.376314 0.435129 0.660790 0.570222 0.439027 0.986913 0.077996 0.005001 -0.067831 ... 1.00 -0.24 -1.00 -0.257549 0.097947 0.547151 -0.112754 0.030400 -0.464761 -0.018446
1 -0.016411 -0.013429 -0.072692 0.579382 -0.831284 -0.865711 0.974386 0.074007 0.005771 0.029377 ... -0.32 -0.12 -0.32 -0.048167 -0.401608 -0.068178 0.053477 -0.007435 -0.732626 0.703511
2 -0.019467 -0.124698 -0.181105 0.608900 -0.181090 0.337936 0.643417 0.073636 0.003104 -0.009046 ... -0.16 -0.48 -0.28 -0.216685 -0.017264 -0.110720 -0.118559 0.177899 0.100699 0.808529
3 -0.026201 -0.305693 -0.362654 0.507459 -0.991309 -0.968821 0.984256 0.077321 0.020058 -0.009865 ... -0.12 -0.56 -0.28 0.216862 -0.135245 -0.049728 -0.036788 -0.012892 0.640011 -0.485366
4 -0.016570 -0.155804 -0.189763 0.599213 -0.408330 -0.184840 0.964797 0.073444 0.019122 0.016780 ... -0.32 -0.08 0.04 -0.153343 -0.088403 -0.162230 0.123320 0.122542 0.693578 -0.615971

5 rows × 39 columns

In [72]:
data_with_important_features.describe(include='all').transpose()
Out[72]:
count mean std min 25% 50% 75% max
tBodyAcc-mean()-Y 7349.0 -0.016299 0.038629 -1.000000 -0.021403 -0.017029 -0.012643 1.000000
tBodyAcc-correlation()-X,Y 7349.0 -0.024801 0.360763 -1.000000 -0.242409 -0.058502 0.172579 1.000000
tBodyAcc-correlation()-X,Z 7349.0 -0.185987 0.338450 -1.000000 -0.395440 -0.171765 0.020009 1.000000
tBodyAcc-correlation()-Y,Z 7349.0 0.096760 0.415218 -1.000000 -0.187455 0.138724 0.405517 1.000000
tGravityAcc-correlation()-X,Y 7349.0 0.103601 0.735996 -1.000000 -0.665946 0.231863 0.851334 1.000000
tGravityAcc-correlation()-X,Z 7349.0 -0.166366 0.726580 -1.000000 -0.879864 -0.367198 0.567089 1.000000
tGravityAcc-correlation()-Y,Z 7349.0 0.075266 0.735481 -1.000000 -0.688232 0.185269 0.823623 1.000000
tBodyAccJerk-mean()-X 7349.0 0.077434 0.108154 -0.581496 0.071259 0.075963 0.081112 0.855403
tBodyAccJerk-mean()-Y 7349.0 0.009156 0.124891 -0.948714 0.000186 0.010928 0.021387 0.835172
tBodyAccJerk-mean()-Z 7349.0 -0.003619 0.113880 -0.807273 -0.015246 -0.001097 0.012727 0.735045
tBodyAccJerk-correlation()-X,Y 7349.0 -0.074377 0.242017 -1.000000 -0.228501 -0.078196 0.075621 1.000000
tBodyAccJerk-correlation()-X,Z 7349.0 0.039575 0.284099 -1.000000 -0.145351 0.062285 0.232011 1.000000
tBodyAccJerk-correlation()-Y,Z 7349.0 0.054372 0.274488 -1.000000 -0.127156 0.042153 0.222467 1.000000
tBodyGyro-mean()-X 7349.0 -0.028776 0.074857 -0.757669 -0.034259 -0.027690 -0.021722 0.524123
tBodyGyro-mean()-Y 7349.0 -0.075976 0.086467 -0.851973 -0.089776 -0.074841 -0.062999 1.000000
tBodyGyro-entropy()-X 7349.0 -0.238753 0.401771 -1.000000 -0.550684 -0.290427 0.102805 0.944059
tBodyGyro-entropy()-Y 7349.0 -0.205995 0.364969 -1.000000 -0.457364 -0.209115 0.067220 1.000000
tBodyGyro-entropy()-Z 7349.0 -0.196045 0.459841 -1.000000 -0.563556 -0.272572 0.228766 1.000000
tBodyGyro-correlation()-X,Y 7349.0 -0.188702 0.405215 -1.000000 -0.483125 -0.208589 0.084417 1.000000
tBodyGyro-correlation()-X,Z 7349.0 0.041870 0.408883 -1.000000 -0.220456 0.018146 0.302452 1.000000
tBodyGyro-correlation()-Y,Z 7349.0 -0.168283 0.440983 -1.000000 -0.530728 -0.177793 0.150636 1.000000
tBodyGyroJerk-mean()-X 7349.0 -0.098036 0.086125 -0.968976 -0.104611 -0.098272 -0.092254 1.000000
tBodyGyroJerk-mean()-Y 7349.0 -0.041384 0.083025 -1.000000 -0.047188 -0.040452 -0.033379 0.720665
tBodyGyroJerk-mean()-Z 7349.0 -0.055329 0.087333 -0.749480 -0.064313 -0.054572 -0.047179 0.600600
tBodyGyroJerk-correlation()-X,Y 7349.0 0.017183 0.278593 -1.000000 -0.169257 0.013469 0.198487 1.000000
tBodyGyroJerk-correlation()-X,Z 7349.0 0.063215 0.266236 -1.000000 -0.106320 0.057007 0.226712 1.000000
tBodyGyroJerk-correlation()-Y,Z 7349.0 -0.111144 0.262790 -1.000000 -0.279851 -0.112190 0.051061 1.000000
fBodyAcc-meanFreq()-Y 7349.0 0.057152 0.247243 -1.000000 -0.099065 0.053560 0.228640 1.000000
fBodyAcc-meanFreq()-Z 7349.0 0.104336 0.265157 -1.000000 -0.069322 0.118173 0.288123 1.000000
fBodyAccJerk-maxInds-X 7349.0 -0.294016 0.275268 -1.000000 -0.480000 -0.280000 -0.120000 1.000000
fBodyAccJerk-maxInds-Y 7349.0 -0.354530 0.273795 -1.000000 -0.560000 -0.360000 -0.200000 1.000000
fBodyAccJerk-maxInds-Z 7349.0 -0.261875 0.274175 -1.000000 -0.440000 -0.280000 -0.080000 1.000000
fBodyGyro-meanFreq()-X 7349.0 -0.064359 0.263692 -1.000000 -0.229611 -0.054360 0.112420 1.000000
fBodyGyro-meanFreq()-Y 7349.0 -0.168671 0.274139 -1.000000 -0.357676 -0.174484 0.018202 1.000000
fBodyGyro-meanFreq()-Z 7349.0 -0.013731 0.263507 -1.000000 -0.191156 -0.019524 0.161677 1.000000
angle(tBodyAccMean,gravity) 7349.0 0.010961 0.267908 -0.977973 -0.074345 0.008115 0.096642 0.977810
angle(tBodyAccJerkMean),gravityMean) 7349.0 0.005884 0.369485 -0.979326 -0.206153 0.010343 0.211965 0.991899
angle(tBodyGyroMean,gravityMean) 7349.0 0.013123 0.488843 -1.000000 -0.344634 0.010699 0.381334 0.994532
angle(tBodyGyroJerkMean,gravityMean) 7349.0 -0.006028 0.460188 -1.000000 -0.359519 0.001035 0.335377 1.000000
In [73]:
plt.figure(figsize=(28,15))
boxplot = sns.boxplot(x="variable", y="value", data=pd.melt(data_with_important_features.iloc[:,:19]))
boxplot.xaxis.set_ticklabels(boxplot.xaxis.get_ticklabels(), rotation=90, ha='right', fontsize=15)
plt.ylabel('values')
plt.xlabel('Features')
Out[73]:
Text(0.5, 0, 'Features')
[Figure: box plots of the first 19 retained features]
In [74]:
plt.figure(figsize=(28,15))
boxplot = sns.boxplot(x="variable", y="value", data=pd.melt(data_with_important_features.iloc[:,19:]))
boxplot.xaxis.set_ticklabels(boxplot.xaxis.get_ticklabels(), rotation=90, ha='right', fontsize=15)
plt.ylabel('values')
plt.xlabel('Features')
Out[74]:
Text(0.5, 0, 'Features')
[Figure: box plots of the remaining retained features]

Summary of the data understanding¶

  • 1 column with the subject ID 'subject' --> only used for filtering, must be removed

  • 1 target column 'Activity' --> 4 classes (WALKING, SITTING, LAYING, STANDING), slightly imbalanced

  • we have a multi-class classification problem --> supervised machine learning

  • 1 "support column" 'type' --> indicates whether an observation comes from the train or the test group --> train-test split along this value

  • The main features were calculated, where possible, for each of the three axes x, y and z

  • In total this gives 561 features --> analyzing the individual features in detail is very difficult because of their sheer number. Finally, we have multicollinearity in the data

  • After the multicollinearity check only 39 features remain. The box plots show outliers for some features, but these outliers are difficult to analyze and interpret because the data apparently have already been partially standardized and we have no measurement units (see the quick check below). Therefore we accept these outliers for the moment
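A quick check that the retained features are already scaled to a common range (a sketch):

# overall minimum and maximum over all retained features
print(data_with_important_features.min().min(), data_with_important_features.max().max())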

Train-test split¶

In [75]:
data = df.copy()
In [76]:
data_train = data[data['type']=='train']
data_test = data[data['type']=='test']
In [78]:
X_train = data_train.drop('subject', axis =1).drop('Activity', axis=1).drop('type', axis =1)
X_train = X_train.drop(features_to_drop, axis =1)
y_train = data_train['Activity']

X_test = data_test.drop('subject', axis =1).drop('Activity', axis=1).drop('type', axis =1)
X_test = X_test.drop(features_to_drop, axis =1)
y_test = data_test['Activity']
In [79]:
print('X_train: ' + str(X_train.shape))
print('y_train: ' + str(y_train.shape))
print('X_test: ' + str(X_test.shape))
print('y_test: ' + str(y_test.shape))
X_train: (5293, 39)
y_train: (5293,)
X_test: (2056, 39)
y_test: (2056,)

Oversampling the data¶

In [80]:
ros = RandomOverSampler(random_state=0)
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)
In [81]:
y_train_resampled.value_counts().plot.bar()
Out[81]:
<AxesSubplot:>
[Figure: bar plot of class counts after oversampling]

4. Modeling¶

To build different models it is good practice to use pipelines and GridSearchCV. These two tools are an easy way to perform hyperparameter tuning and k-fold cross-validation on different models. In this case the pipelines consist of a StandardScaler and a One-vs-Rest classifier.

Building & evaluating different models

KNN¶

K-Nearest Neighbors

In [82]:
# Pipeline for KNN
pipeline1 = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", OneVsRestClassifier(KNeighborsClassifier()))
])
In [83]:
pipeline1.get_params().keys()
Out[83]:
dict_keys(['memory', 'steps', 'verbose', 'scaler', 'knn', 'scaler__copy', 'scaler__with_mean', 'scaler__with_std', 'knn__estimator__algorithm', 'knn__estimator__leaf_size', 'knn__estimator__metric', 'knn__estimator__metric_params', 'knn__estimator__n_jobs', 'knn__estimator__n_neighbors', 'knn__estimator__p', 'knn__estimator__weights', 'knn__estimator', 'knn__n_jobs'])
In [84]:
#GridSearchCV-Object
clf = GridSearchCV(pipeline1, verbose=10, param_grid = {
    "knn__estimator__n_neighbors": [8, 10, 20, 30]
})
clf.fit(X_train_resampled, y_train_resampled)
Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] knn__estimator__n_neighbors=8 ...................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ....... knn__estimator__n_neighbors=8, score=0.769, total=   3.5s
[CV] knn__estimator__n_neighbors=8 ...................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    3.4s remaining:    0.0s
[CV] ....... knn__estimator__n_neighbors=8, score=0.754, total=   3.3s
[CV] knn__estimator__n_neighbors=8 ...................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    6.7s remaining:    0.0s
[CV] ....... knn__estimator__n_neighbors=8, score=0.801, total=   3.3s
[CV] knn__estimator__n_neighbors=8 ...................................
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   10.0s remaining:    0.0s
[CV] ....... knn__estimator__n_neighbors=8, score=0.832, total=   3.5s
[CV] knn__estimator__n_neighbors=8 ...................................
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   13.4s remaining:    0.0s
[CV] ....... knn__estimator__n_neighbors=8, score=0.818, total=   3.3s
[CV] knn__estimator__n_neighbors=10 ..................................
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   16.7s remaining:    0.0s
[CV] ...... knn__estimator__n_neighbors=10, score=0.769, total=   3.4s
[CV] knn__estimator__n_neighbors=10 ..................................
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   20.1s remaining:    0.0s
[CV] ...... knn__estimator__n_neighbors=10, score=0.749, total=   3.5s
[CV] knn__estimator__n_neighbors=10 ..................................
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:   23.6s remaining:    0.0s
[CV] ...... knn__estimator__n_neighbors=10, score=0.815, total=   3.4s
[CV] knn__estimator__n_neighbors=10 ..................................
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:   27.1s remaining:    0.0s
[CV] ...... knn__estimator__n_neighbors=10, score=0.828, total=   3.5s
[CV] knn__estimator__n_neighbors=10 ..................................
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   30.6s remaining:    0.0s
[CV] ...... knn__estimator__n_neighbors=10, score=0.824, total=   3.5s
[CV] knn__estimator__n_neighbors=20 ..................................
[CV] ...... knn__estimator__n_neighbors=20, score=0.782, total=   3.5s
[CV] knn__estimator__n_neighbors=20 ..................................
[CV] ...... knn__estimator__n_neighbors=20, score=0.734, total=   3.5s
[CV] knn__estimator__n_neighbors=20 ..................................
[CV] ...... knn__estimator__n_neighbors=20, score=0.828, total=   3.5s
[CV] knn__estimator__n_neighbors=20 ..................................
[CV] ...... knn__estimator__n_neighbors=20, score=0.820, total=   3.3s
[CV] knn__estimator__n_neighbors=20 ..................................
[CV] ...... knn__estimator__n_neighbors=20, score=0.832, total=   3.4s
[CV] knn__estimator__n_neighbors=30 ..................................
[CV] ...... knn__estimator__n_neighbors=30, score=0.768, total=   3.6s
[CV] knn__estimator__n_neighbors=30 ..................................
[CV] ...... knn__estimator__n_neighbors=30, score=0.726, total=   3.6s
[CV] knn__estimator__n_neighbors=30 ..................................
[CV] ...... knn__estimator__n_neighbors=30, score=0.820, total=   3.6s
[CV] knn__estimator__n_neighbors=30 ..................................
[CV] ...... knn__estimator__n_neighbors=30, score=0.820, total=   3.7s
[CV] knn__estimator__n_neighbors=30 ..................................
[CV] ...... knn__estimator__n_neighbors=30, score=0.826, total=   3.5s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:  1.2min finished
Out[84]:
GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('knn',
                                        OneVsRestClassifier(estimator=KNeighborsClassifier()))]),
             param_grid={'knn__estimator__n_neighbors': [8, 10, 20, 30]},
             verbose=10)
In [85]:
print('Best parameters ' + str(clf.best_params_))
print('Best model score: ' + str(clf.best_score_))
Best parameters {'knn__estimator__n_neighbors': 20}
Best model score: 0.7988723899743437
In [86]:
results_KNN = pd.DataFrame(clf.cv_results_)
results_KNN
Out[86]:
mean_fit_time std_fit_time mean_score_time std_score_time param_knn__estimator__n_neighbors params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 0.291415 0.020557 3.064704 0.088606 8 {'knn__estimator__n_neighbors': 8} 0.769094 0.753996 0.801066 0.832000 0.817778 0.794787 0.029247 3
1 0.305789 0.014108 3.177124 0.047759 10 {'knn__estimator__n_neighbors': 10} 0.769094 0.748668 0.815275 0.827556 0.824000 0.796919 0.031973 2
2 0.291104 0.008228 3.165741 0.071301 20 {'knn__estimator__n_neighbors': 20} 0.781528 0.733570 0.827709 0.819556 0.832000 0.798872 0.037207 1
3 0.291344 0.011009 3.311857 0.029414 30 {'knn__estimator__n_neighbors': 30} 0.768206 0.726465 0.819716 0.819556 0.825778 0.791944 0.038806 4
In [87]:
classification_report_test = classification_report(y_test, clf.best_estimator_.predict(X_test), output_dict=True)
classification_report_train = classification_report(y_train_resampled, clf.best_estimator_.predict(X_train_resampled), output_dict=True)
In [88]:
#helper function
def make_results(classification_report_train, classification_report_test, model_type):
    df_rep1 = pd.DataFrame(classification_report_train)
    df_rep1['source'] ='train'
    df_rep1 = df_rep1.set_index([df_rep1.index,'source'])
    df_rep2 = pd.DataFrame(classification_report_test)
    df_rep2['source'] ='test'
    df_rep2 = df_rep2.set_index([df_rep2.index,'source'])
    frames = [df_rep1, df_rep2]
    df_rep = pd.concat(frames)
    df_rep['model_type'] = model_type
    df_rep = df_rep.set_index([df_rep.index,'model_type'])
    return df_rep.transpose()
In [89]:
results_KNN = make_results(classification_report_train, classification_report_test, 'KNN')
results_KNN
Out[89]:
precision recall f1-score support precision recall f1-score support
source train train train train test test test test
model_type KNN KNN KNN KNN KNN KNN KNN KNN
LAYING 0.912238 0.805259 0.855417 1407.000000 0.850000 0.633147 0.725720 537.000000
SITTING 0.810458 0.793177 0.801724 1407.000000 0.631295 0.714868 0.670487 491.000000
STANDING 0.808673 0.901208 0.852437 1407.000000 0.757679 0.834586 0.794275 532.000000
WALKING 0.976405 1.000000 0.988062 1407.000000 0.957198 0.991935 0.974257 496.000000
accuracy 0.874911 0.874911 0.874911 0.874911 0.791342 0.791342 0.791342 0.791342
macro avg 0.876944 0.874911 0.874410 5628.000000 0.799043 0.793634 0.791185 2056.000000
weighted avg 0.876944 0.874911 0.874410 5628.000000 0.799743 0.791342 0.790227 2056.000000

Random Forest¶

In [90]:
# Pipeline
pipeline2 = Pipeline([
    ("scaler", StandardScaler()),
    ("ranfo", OneVsRestClassifier(RandomForestClassifier()))
])
In [91]:
param_grid = {'ranfo__estimator__max_depth': [5, 15, 30]}
In [92]:
#GridSearchCV-Object
clf2 = GridSearchCV(pipeline2, param_grid = param_grid, verbose=10)
clf2.fit(X_train_resampled, y_train_resampled)
Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] ranfo__estimator__max_depth=5 ...................................
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] ....... ranfo__estimator__max_depth=5, score=0.793, total=   7.9s
[CV] ranfo__estimator__max_depth=5 ...................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    7.8s remaining:    0.0s
[CV] ....... ranfo__estimator__max_depth=5, score=0.752, total=   7.6s
[CV] ranfo__estimator__max_depth=5 ...................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   15.4s remaining:    0.0s
[CV] ....... ranfo__estimator__max_depth=5, score=0.850, total=   7.7s
[CV] ranfo__estimator__max_depth=5 ...................................
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:   23.1s remaining:    0.0s
[CV] ....... ranfo__estimator__max_depth=5, score=0.843, total=   7.7s
[CV] ranfo__estimator__max_depth=5 ...................................
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:   30.7s remaining:    0.0s
[CV] ....... ranfo__estimator__max_depth=5, score=0.830, total=   8.0s
[CV] ranfo__estimator__max_depth=15 ..................................
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:   38.8s remaining:    0.0s
[CV] ...... ranfo__estimator__max_depth=15, score=0.827, total=  13.2s
[CV] ranfo__estimator__max_depth=15 ..................................
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed:   52.0s remaining:    0.0s
[CV] ...... ranfo__estimator__max_depth=15, score=0.813, total=  13.3s
[CV] ranfo__estimator__max_depth=15 ..................................
[Parallel(n_jobs=1)]: Done   7 out of   7 | elapsed:  1.1min remaining:    0.0s
[CV] ...... ranfo__estimator__max_depth=15, score=0.877, total=  13.3s
[CV] ranfo__estimator__max_depth=15 ..................................
[Parallel(n_jobs=1)]: Done   8 out of   8 | elapsed:  1.3min remaining:    0.0s
[CV] ...... ranfo__estimator__max_depth=15, score=0.876, total=  13.4s
[CV] ranfo__estimator__max_depth=15 ..................................
[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  1.5min remaining:    0.0s
[CV] ...... ranfo__estimator__max_depth=15, score=0.886, total=  13.6s
[CV] ranfo__estimator__max_depth=30 ..................................
[CV] ...... ranfo__estimator__max_depth=30, score=0.831, total=  13.3s
[CV] ranfo__estimator__max_depth=30 ..................................
[CV] ...... ranfo__estimator__max_depth=30, score=0.821, total=  13.6s
[CV] ranfo__estimator__max_depth=30 ..................................
[CV] ...... ranfo__estimator__max_depth=30, score=0.877, total=  13.8s
[CV] ranfo__estimator__max_depth=30 ..................................
[CV] ...... ranfo__estimator__max_depth=30, score=0.876, total=  13.8s
[CV] ranfo__estimator__max_depth=30 ..................................
[CV] ...... ranfo__estimator__max_depth=30, score=0.888, total=  14.3s
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:  2.9min finished
Out[92]:
GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('ranfo',
                                        OneVsRestClassifier(estimator=RandomForestClassifier()))]),
             param_grid={'ranfo__estimator__max_depth': [5, 15, 30]},
             verbose=10)
In [93]:
print('Best parameters ' + str(clf2.best_params_))
print('Best model score: ' + str(clf2.best_score_))
Best parameters {'ranfo__estimator__max_depth': 30}
Best model score: 0.8583949477008093
In [94]:
classification_report_test = classification_report(y_test, clf2.best_estimator_.predict(X_test), output_dict=True)
classification_report_train = classification_report(y_train_resampled, clf2.best_estimator_.predict(X_train_resampled), output_dict=True)
In [95]:
results_RandomForest = make_results(classification_report_train, classification_report_test, 'RandomForest')
results_RandomForest
Out[95]:
precision recall f1-score support precision recall f1-score support
source train train train train test test test test
model_type RandomForest RandomForest RandomForest RandomForest RandomForest RandomForest RandomForest RandomForest
LAYING 1.0 1.0 1.0 1407.0 0.821705 0.789572 0.805318 537.00000
SITTING 1.0 1.0 1.0 1407.0 0.784519 0.763747 0.773994 491.00000
STANDING 1.0 1.0 1.0 1407.0 0.829710 0.860902 0.845018 532.00000
WALKING 1.0 1.0 1.0 1407.0 0.970588 0.997984 0.984095 496.00000
accuracy 1.0 1.0 1.0 1.0 0.852140 0.852140 0.852140 0.85214
macro avg 1.0 1.0 1.0 5628.0 0.851631 0.853051 0.852106 2056.00000
weighted avg 1.0 1.0 1.0 5628.0 0.850813 0.852140 0.851239 2056.00000

Logistic Regression¶

In [96]:
# Pipeline
pipeline3 = Pipeline([
    ("scaler", StandardScaler()),
    ("logreg", OneVsRestClassifier(LogisticRegression(solver='newton-cg')))
])
In [97]:
#GridSearchCV-Object
clf3 = GridSearchCV(pipeline3, param_grid = {})
clf3.fit(X_train_resampled, y_train_resampled)
Out[97]:
GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('logreg',
                                        OneVsRestClassifier(estimator=LogisticRegression(solver='newton-cg')))]),
             param_grid={})
In [98]:
print('Best parameters ' + str(clf3.best_params_))
print('Best model score: ' + str(clf3.best_score_))
Best parameters {}
Best model score: 0.7738230905861456
In [99]:
classification_report_test = classification_report(y_test, clf3.best_estimator_.predict(X_test), output_dict=True)
classification_report_train = classification_report(y_train_resampled, clf3.best_estimator_.predict(X_train_resampled), output_dict=True)
In [100]:
results_LogisticRegression = make_results(classification_report_train, classification_report_test, 'LogReg')
results_LogisticRegression
Out[100]:
precision recall f1-score support precision recall f1-score support
source train train train train test test test test
model_type LogReg LogReg LogReg LogReg LogReg LogReg LogReg LogReg
LAYING 0.761204 0.772566 0.766843 1407.000000 0.690702 0.677840 0.684211 537.000000
SITTING 0.747636 0.730633 0.739037 1407.000000 0.687970 0.745418 0.715543 491.000000
STANDING 0.791356 0.754797 0.772645 1407.000000 0.833684 0.744361 0.786495 532.000000
WALKING 0.924477 0.974414 0.948789 1407.000000 0.873563 0.919355 0.895874 496.000000
accuracy 0.808102 0.808102 0.808102 0.808102 0.769455 0.769455 0.769455 0.769455
macro avg 0.806169 0.808102 0.806828 5628.000000 0.771480 0.771743 0.770530 2056.000000
weighted avg 0.806169 0.808102 0.806828 5628.000000 0.771161 0.769455 0.769222 2056.000000

Decision Tree¶

In [101]:
# Pipeline
pipeline4 = Pipeline([
    ("scaler", StandardScaler()),
    ("decisiontree", OneVsRestClassifier(DecisionTreeClassifier()))
])
In [102]:
#GridSearchCV-Object
clf4 = GridSearchCV(pipeline4, param_grid = {})
clf4.fit(X_train_resampled, y_train_resampled)
Out[102]:
GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('decisiontree',
                                        OneVsRestClassifier(estimator=DecisionTreeClassifier()))]),
             param_grid={})
In [103]:
print('Best parameters ' + str(clf4.best_params_))
print('Best model score: ' + str(clf4.best_score_))
Best parameters {}
Best model score: 0.6778698243536609
In [104]:
classification_report_test = classification_report(y_test, clf4.best_estimator_.predict(X_test), output_dict=True)
classification_report_train = classification_report(y_train_resampled, clf4.best_estimator_.predict(X_train_resampled), output_dict=True)
In [105]:
results_DecisionTree = make_results(classification_report_train, classification_report_test, 'DecisionTree')
results_DecisionTree
Out[105]:
precision recall f1-score support precision recall f1-score support
source train train train train test test test test
model_type DecisionTree DecisionTree DecisionTree DecisionTree DecisionTree DecisionTree DecisionTree DecisionTree
LAYING 1.0 1.0 1.0 1407.0 0.803797 0.472998 0.595545 537.000000
SITTING 1.0 1.0 1.0 1407.0 0.642132 0.515275 0.571751 491.000000
STANDING 1.0 1.0 1.0 1407.0 0.695391 0.652256 0.673133 532.000000
WALKING 1.0 1.0 1.0 1407.0 0.578512 0.987903 0.729710 496.000000
accuracy 1.0 1.0 1.0 1.0 0.653696 0.653696 0.653696 0.653696
macro avg 1.0 1.0 1.0 5628.0 0.679958 0.657108 0.642535 2056.000000
weighted avg 1.0 1.0 1.0 5628.0 0.682790 0.653696 0.642306 2056.000000

SVM¶

Support Vector Machine

In [106]:
# Pipeline
pipeline5 = Pipeline([
    ("scaler", StandardScaler()),
    ("svm", OneVsRestClassifier(svm.SVC()))
])
In [107]:
param_grid = {
    'svm__estimator__kernel': ['linear']}
In [108]:
#GridSearchCV-Object
clf5 = GridSearchCV(pipeline5, param_grid)
clf5.fit(X_train_resampled, y_train_resampled)
Out[108]:
GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('svm',
                                        OneVsRestClassifier(estimator=SVC()))]),
             param_grid={'svm__estimator__kernel': ['linear']})
In [109]:
print('Best parameters ' + str(clf5.best_params_))
print('Best model score: ' + str(clf5.best_score_))
Best parameters {'svm__estimator__kernel': 'linear'}
Best model score: 0.776133570159858
In [110]:
results_SVM= pd.DataFrame(clf5.cv_results_)
results_SVM
Out[110]:
mean_fit_time std_fit_time mean_score_time std_score_time param_svm__estimator__kernel params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 6.828282 0.334218 0.329354 0.019774 linear {'svm__estimator__kernel': 'linear'} 0.740675 0.723801 0.784192 0.818667 0.813333 0.776134 0.038089 1
In [111]:
classification_report_test = classification_report(y_test, clf5.best_estimator_.predict(X_test), output_dict=True)
classification_report_train = classification_report(y_train_resampled, clf5.best_estimator_.predict(X_train_resampled), output_dict=True)
In [112]:
results_SVM = make_results(classification_report_train, classification_report_test, 'SVM')
results_SVM
Out[112]:
precision recall f1-score support precision recall f1-score support
source train train train train test test test test
model_type SVM SVM SVM SVM SVM SVM SVM SVM
LAYING 0.773826 0.773276 0.773551 1407.000000 0.689394 0.677840 0.683568 537.000000
SITTING 0.753613 0.741294 0.747402 1407.000000 0.675978 0.739308 0.706226 491.000000
STANDING 0.790169 0.765458 0.777617 1407.000000 0.837607 0.736842 0.784000 532.000000
WALKING 0.931525 0.976546 0.953505 1407.000000 0.871893 0.919355 0.894995 496.000000
accuracy 0.814144 0.814144 0.814144 0.814144 0.766051 0.766051 0.766051 0.766051
macro avg 0.812283 0.814144 0.813019 5628.000000 0.768718 0.768336 0.767197 2056.000000
weighted avg 0.812283 0.814144 0.813019 5628.000000 0.768568 0.766051 0.765972 2056.000000

Summary¶

In [113]:
frames = [results_KNN, 
           results_RandomForest,
           results_LogisticRegression,
           results_DecisionTree,
           results_SVM,
           ]
final_results = pd.concat(frames, axis=1).reindex(frames[0].index).transpose()
In [114]:
final_results.sort_index(level=1)
Out[114]:
LAYING SITTING STANDING WALKING accuracy macro avg weighted avg
source model_type
f1-score test DecisionTree 0.595545 0.571751 0.673133 0.729710 0.653696 0.642535 0.642306
KNN 0.725720 0.670487 0.794275 0.974257 0.791342 0.791185 0.790227
LogReg 0.684211 0.715543 0.786495 0.895874 0.769455 0.770530 0.769222
RandomForest 0.805318 0.773994 0.845018 0.984095 0.852140 0.852106 0.851239
SVM 0.683568 0.706226 0.784000 0.894995 0.766051 0.767197 0.765972
precision test DecisionTree 0.803797 0.642132 0.695391 0.578512 0.653696 0.679958 0.682790
KNN 0.850000 0.631295 0.757679 0.957198 0.791342 0.799043 0.799743
LogReg 0.690702 0.687970 0.833684 0.873563 0.769455 0.771480 0.771161
RandomForest 0.821705 0.784519 0.829710 0.970588 0.852140 0.851631 0.850813
SVM 0.689394 0.675978 0.837607 0.871893 0.766051 0.768718 0.768568
recall test DecisionTree 0.472998 0.515275 0.652256 0.987903 0.653696 0.657108 0.653696
KNN 0.633147 0.714868 0.834586 0.991935 0.791342 0.793634 0.791342
LogReg 0.677840 0.745418 0.744361 0.919355 0.769455 0.771743 0.769455
RandomForest 0.789572 0.763747 0.860902 0.997984 0.852140 0.853051 0.852140
SVM 0.677840 0.739308 0.736842 0.919355 0.766051 0.768336 0.766051
support test DecisionTree 537.000000 491.000000 532.000000 496.000000 0.653696 2056.000000 2056.000000
KNN 537.000000 491.000000 532.000000 496.000000 0.791342 2056.000000 2056.000000
LogReg 537.000000 491.000000 532.000000 496.000000 0.769455 2056.000000 2056.000000
RandomForest 537.000000 491.000000 532.000000 496.000000 0.852140 2056.000000 2056.000000
SVM 537.000000 491.000000 532.000000 496.000000 0.766051 2056.000000 2056.000000
f1-score train DecisionTree 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
KNN 0.855417 0.801724 0.852437 0.988062 0.874911 0.874410 0.874410
LogReg 0.766843 0.739037 0.772645 0.948789 0.808102 0.806828 0.806828
RandomForest 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
SVM 0.773551 0.747402 0.777617 0.953505 0.814144 0.813019 0.813019
precision train DecisionTree 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
KNN 0.912238 0.810458 0.808673 0.976405 0.874911 0.876944 0.876944
LogReg 0.761204 0.747636 0.791356 0.924477 0.808102 0.806169 0.806169
RandomForest 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
SVM 0.773826 0.753613 0.790169 0.931525 0.814144 0.812283 0.812283
recall train DecisionTree 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
KNN 0.805259 0.793177 0.901208 1.000000 0.874911 0.874911 0.874911
LogReg 0.772566 0.730633 0.754797 0.974414 0.808102 0.808102 0.808102
RandomForest 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
SVM 0.773276 0.741294 0.765458 0.976546 0.814144 0.814144 0.814144
support train DecisionTree 1407.000000 1407.000000 1407.000000 1407.000000 1.000000 5628.000000 5628.000000
KNN 1407.000000 1407.000000 1407.000000 1407.000000 0.874911 5628.000000 5628.000000
LogReg 1407.000000 1407.000000 1407.000000 1407.000000 0.808102 5628.000000 5628.000000
RandomForest 1407.000000 1407.000000 1407.000000 1407.000000 1.000000 5628.000000 5628.000000
SVM 1407.000000 1407.000000 1407.000000 1407.000000 0.814144 5628.000000 5628.000000
In [115]:
final_results.sort_index(level=1).to_csv('Results_VIF_lower10.csv')

5. Evaluation¶

In [116]:
#helper function
def make_confusion_matrix(y_test, X_test, the_clf):
    y_pred = the_clf.best_estimator_.predict(X_test)
    cf_matrix = confusion_matrix(y_test,y_pred)
    df_cf_matrix = pd.DataFrame(cf_matrix,columns=the_clf.best_estimator_.steps[1][1].classes_)
    df_cf_matrix.index = the_clf.best_estimator_.steps[1][1].classes_
    
    plt.figure(figsize=(28,7))

    plt.subplot(1,3,1) # first heatmap
    heatmap1 = sns.heatmap(df_cf_matrix, annot=True,fmt='d');
    heatmap1.yaxis.set_ticklabels(heatmap1.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=10)
    heatmap1.xaxis.set_ticklabels(heatmap1.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=10)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

    plt.subplot(1,3,2) # second heatmap
    heatmap2 = sns.heatmap(df_cf_matrix/np.sum(df_cf_matrix), annot=True, fmt='.2%')
    heatmap2.yaxis.set_ticklabels(heatmap2.yaxis.get_ticklabels(), rotation=0, ha='right', fontsize=10)
    heatmap2.xaxis.set_ticklabels(heatmap2.xaxis.get_ticklabels(), rotation=45, ha='right', fontsize=10)
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

    fig_name = str(the_clf.best_estimator_.steps[1][1].estimators_[0]) + '_confusionMatrix.png'
    plt.savefig(fig_name, dpi=150, format='png')

Achieved accuracy¶

In [117]:
accuracy_train_test = pd.DataFrame()
accuracy_train_test['accuracy_train'] = final_results.xs("train", level=1).xs("precision", level=0)['accuracy']
accuracy_train_test['accuracy_test'] = final_results.xs("test", level=1).xs("precision", level=0)['accuracy']
accuracy_train_test['model'] = accuracy_train_test.index

accuracy_train_test['diff_train_test'] = accuracy_train_test['accuracy_train'] - accuracy_train_test['accuracy_test']
accuracy_train_test
Out[117]:
accuracy_train accuracy_test model diff_train_test
model_type
KNN 0.874911 0.791342 KNN 0.083569
RandomForest 1.000000 0.852140 RandomForest 0.147860
LogReg 0.808102 0.769455 LogReg 0.038647
DecisionTree 1.000000 0.653696 DecisionTree 0.346304
SVM 0.814144 0.766051 SVM 0.048093
In [118]:
plt.figure(figsize=(10,7))
df_temp = accuracy_train_test.melt('model', var_name='cols',  value_name='vals')
ax = sns.barplot(x="model", y="vals", hue="cols", data=df_temp)
#for index, row in df_temp:
#    ax.text(row.name,row.vals, round(row.vals,2), color='black', ha="center")
for p in ax.patches:
    ax.annotate(format(p.get_height(), '.3f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', 
                   xytext = (0, 9), 
                   textcoords = 'offset points')
plt.ylabel('Accuracy')
plt.xlabel('Model')

fig_name = 'Accuracy_performance.png'
plt.savefig(fig_name, dpi=150, format='png')
[Figure: train and test accuracy and their difference per model]
In [119]:
df_temp
Out[119]:
model cols vals
0 KNN accuracy_train 0.874911
1 RandomForest accuracy_train 1.000000
2 LogReg accuracy_train 0.808102
3 DecisionTree accuracy_train 1.000000
4 SVM accuracy_train 0.814144
5 KNN accuracy_test 0.791342
6 RandomForest accuracy_test 0.852140
7 LogReg accuracy_test 0.769455
8 DecisionTree accuracy_test 0.653696
9 SVM accuracy_test 0.766051
10 KNN diff_train_test 0.083569
11 RandomForest diff_train_test 0.147860
12 LogReg diff_train_test 0.038647
13 DecisionTree diff_train_test 0.346304
14 SVM diff_train_test 0.048093
  • KNN --> The performance on the test data is about 8% lower than on the training data. The model probably fits the training data too closely and does not generalize well to unseen test data.
  • Random Forest --> The performance on the test data is about 15% lower than on the training data. The model overfits the training data and does not generalize well to the unseen test data.
  • Logistic Regression --> Training and test performance are very similar (diff = 3.9%). This probably means that the model generalizes well to new data.
  • Decision Tree --> The performance on the test data is about 34% lower than on the training data. The model fits the training data too closely and does not generalize well to the unseen test data.
  • Support Vector Machine --> Training and test performance are very similar (diff = 4.8%). This probably means that the model generalizes well to new data. --> Let's check the confusion matrices of the two best models

Confusion matrix - Logistic Regression¶

In [120]:
make_confusion_matrix(y_test,X_test, clf3)
[Figure: confusion matrices (absolute and relative) for the logistic regression model]

Confusion matrix - SVM¶

In [121]:
make_confusion_matrix(y_test,X_test, clf5)
[Figure: confusion matrices (absolute and relative) for the SVM model]
In [122]:
final_results.loc[(slice(None), 'test', ('LogReg','SVM')), :].to_csv('LogReg_SVM_results.csv')
final_results.loc[(slice(None), 'test', ('LogReg','SVM')), :]
Out[122]:
LAYING SITTING STANDING WALKING accuracy macro avg weighted avg
source model_type
precision test LogReg 0.690702 0.687970 0.833684 0.873563 0.769455 0.771480 0.771161
recall test LogReg 0.677840 0.745418 0.744361 0.919355 0.769455 0.771743 0.769455
f1-score test LogReg 0.684211 0.715543 0.786495 0.895874 0.769455 0.770530 0.769222
support test LogReg 537.000000 491.000000 532.000000 496.000000 0.769455 2056.000000 2056.000000
precision test SVM 0.689394 0.675978 0.837607 0.871893 0.766051 0.768718 0.768568
recall test SVM 0.677840 0.739308 0.736842 0.919355 0.766051 0.768336 0.766051
f1-score test SVM 0.683568 0.706226 0.784000 0.894995 0.766051 0.767197 0.765972
support test SVM 537.000000 491.000000 532.000000 496.000000 0.766051 2056.000000 2056.000000

Feature importance¶

  • A One-vs-Rest classifier consists of n individual classifiers (with n = number of classes). This makes it difficult to interpret the feature importance for the entire OvR classifier. One possibility is to determine the coefficients for each individual model and take the mean over each coefficient.
In [123]:
def get_the_feature_importances(clfobject,est_name, functype):
    feat_impts = []
    #estimator have feature_importances_ attribute
    if functype ==1:
        for esti in clfobject.best_estimator_.named_steps[est_name].estimators_:
            feat_impts.append(np.abs(esti.feature_importances_))
    #estimator have coef_ attribute
    if functype ==2:
        for esti in clfobject.best_estimator_.named_steps[est_name].estimators_:
            feat_impts.append(np.abs(esti.coef_[0]))
        
    return np.mean(feat_impts, axis=0)
In [124]:
feature_importance = pd.DataFrame({},[])
feature_importance['Features'] = X_test.columns
feature_importance['Importance'] = get_the_feature_importances(clf3,'logreg',2)

Results¶

  • The best model is the OvR classifier with logistic regression.
  • It has a good accuracy (0.77: test, 0.81: train) and a good precision (0.87: test) for the class 'WALKING'.
  • The 10 most important features of this OvR classifier are the following:
In [125]:
feature_importance.sort_values('Importance',ascending=0).head(10).to_csv('FeatureImportanceTop10.csv')
feature_importance.sort_values('Importance',ascending=0).to_csv('FeatureImportanceAll.csv')
feature_importance.sort_values('Importance',ascending=0).head(10)
Out[125]:
Features Importance
15 tBodyGyro-entropy()-X 1.170518
16 tBodyGyro-entropy()-Y 1.070693
3 tBodyAcc-correlation()-Y,Z 0.791616
17 tBodyGyro-entropy()-Z 0.790347
14 tBodyGyro-mean()-Y 0.643740
32 fBodyGyro-meanFreq()-X 0.537990
29 fBodyAccJerk-maxInds-X 0.510996
18 tBodyGyro-correlation()-X,Y 0.501925
13 tBodyGyro-mean()-X 0.490046
19 tBodyGyro-correlation()-X,Z 0.468363

--> Most of these important features come from the gyroscope sensor in the smartphone.
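A quick sanity check of this statement over the top-10 list (a sketch):

top10 = feature_importance.sort_values('Importance', ascending=False).head(10)
print(top10['Features'].str.contains('Gyro').sum(), 'of the 10 most important features are gyroscope signals')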

Final model for deployment¶

In [126]:
final_model = clf3.best_estimator_

6. Deployment¶

Test the prediction locally

In [127]:
# recap: print first rows of the predictors (here: test data without the target column "Activity")
X_test.head(4)
Out[127]:
tBodyAcc-mean()-Y tBodyAcc-correlation()-X,Y tBodyAcc-correlation()-X,Z tBodyAcc-correlation()-Y,Z tGravityAcc-correlation()-X,Y tGravityAcc-correlation()-X,Z tGravityAcc-correlation()-Y,Z tBodyAccJerk-mean()-X tBodyAccJerk-mean()-Y tBodyAccJerk-mean()-Z ... fBodyAccJerk-maxInds-X fBodyAccJerk-maxInds-Y fBodyAccJerk-maxInds-Z fBodyGyro-meanFreq()-X fBodyGyro-meanFreq()-Y fBodyGyro-meanFreq()-Z angle(tBodyAccMean,gravity) angle(tBodyAccJerkMean),gravityMean) angle(tBodyGyroMean,gravityMean) angle(tBodyGyroJerkMean,gravityMean)
5293 -0.023285 0.076989 -0.490546 -0.709003 0.980580 -0.996352 -0.960117 0.072046 0.045754 -0.106043 ... -0.52 0.08 0.32 0.184035 -0.059323 0.438107 0.006462 0.162920 -0.825886 0.271151
5294 -0.013163 -0.104983 -0.429134 0.399177 0.945233 -0.911415 -0.738535 0.070181 -0.017876 -0.001721 ... -0.16 -0.32 -0.40 0.018109 -0.227266 -0.151698 -0.083495 0.017500 -0.434375 0.920593
5295 -0.026050 0.305653 -0.323848 0.279786 0.548432 -0.334864 0.590302 0.069368 -0.004908 -0.013673 ... -0.64 -0.40 -0.44 -0.479145 -0.210084 0.049310 -0.034956 0.202302 0.064103 0.145068
5296 -0.032614 -0.063792 -0.167111 0.544916 0.985534 0.653169 0.746518 0.074853 0.032274 0.012141 ... -0.44 -0.56 -0.48 -0.496954 -0.499906 -0.258896 -0.017067 0.154438 0.340134 0.296407

4 rows × 39 columns

In [128]:
# recap: print the true labels of the first rows of the test data
data_test['Activity'].head(4)
Out[128]:
5293    STANDING
5294    STANDING
5295    STANDING
5296    STANDING
Name: Activity, dtype: object
In [129]:
final_model.predict(X_test.head(4))
Out[129]:
array(['LAYING', 'STANDING', 'STANDING', 'STANDING'], dtype='<U8')

--> final_model works and can be deployed
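One possible next step, sketched here under the assumption that joblib is available in the deployment environment, is to persist the fitted pipeline so a serving application can load it; the file name is only an example:

import joblib

# save the fitted pipeline (scaler + OvR logistic regression)
joblib.dump(final_model, 'final_model_ovr_logreg.joblib')

# later / in the serving application: load it and predict on new feature rows
loaded_model = joblib.load('final_model_ovr_logreg.joblib')
print(loaded_model.predict(X_test.head(4)))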