1. Business Understanding

Returns are part of the business model of many fashion retailers. The return rate of a fictitious company is around 50%, which costs the company an unnecessary amount of money. On top of that, the fashion industry in particular has many customers who know before they order that they will very likely send the item back. For example, they order one and the same shirt in several sizes because they are not sure which one will fit. Is it possible to use machine learning to determine the size from certain parameters and thereby minimize returns caused by wrong sizes?

2. Data and Data Understanding

The dataset was downloaded from Kaggle and comes from ModCloth, an online store for women's clothing. The data covers clothing that was actually sold, together with additional information on how each item fit the customer. It contains the following fields: item_id, waist, size, quality, cup size, hips, bra size, category, bust, height, user_name, length, fit, user_id, shoe size, shoe width, review_summary, and review_text.

2.1. Importing the Relevant Modules

In [55]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import confusion_matrix, classification_report

2.2. Reading the Data

In [56]:
my_file = "https://storage.googleapis.com/ml-service-repository-datastorage/Size_prediction_for_online_fashion_retailer_modcloth_final_data.json"
df = pd.read_json(my_file, lines = True)

df.head()
Out[56]:
item_id waist size quality cup size hips bra size category bust height user_name length fit user_id shoe size shoe width review_summary review_text
0 123373 29.0 7 5.0 d 38.0 34.0 new 36 5ft 6in Emily just right small 991571 NaN NaN NaN NaN
1 123373 31.0 13 3.0 b 30.0 36.0 new NaN 5ft 2in sydneybraden2001 just right small 587883 NaN NaN NaN NaN
2 123373 30.0 7 2.0 b NaN 32.0 new NaN 5ft 7in Ugggh slightly long small 395665 9.0 NaN NaN NaN
3 123373 NaN 21 5.0 dd/e NaN NaN new NaN NaN alexmeyer626 just right fit 875643 NaN NaN NaN NaN
4 123373 NaN 18 5.0 b NaN 36.0 new NaN 5ft 2in dberrones1 slightly long small 944840 NaN NaN NaN NaN
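The `lines=True` argument tells pandas to treat each line of the file as a separate JSON object (the JSON Lines layout). A minimal sketch with two made-up records, assuming pandas is installed; the sample values are illustrative and not taken from the dataset:

```python
import io

import pandas as pd

# Two hypothetical records in JSON Lines form: one JSON object per line.
sample = (
    '{"item_id": 1, "size": 7, "fit": "small"}\n'
    '{"item_id": 2, "size": 13, "hips": 30.0, "fit": "fit"}\n'
)

# lines=True parses each line separately; keys missing from a record become NaN.
demo = pd.read_json(io.StringIO(sample), lines=True)
print(demo)
```

This also explains the many NaN entries in the preview above: a customer who did not supply a measurement simply has no value for that field.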

2.3. Cleaning the Data

In [57]:
df[df.duplicated(keep=False)]
Out[57]:
item_id waist size quality cup size hips bra size category bust height user_name length fit user_id shoe size shoe width review_summary review_text
1230 126885 NaN 32 3.0 d 53.0 42.0 new NaN 5ft 3in Brandy slightly long fit 94385 NaN NaN NaN NaN
1231 126885 NaN 32 3.0 d 53.0 42.0 new NaN 5ft 3in Brandy slightly long fit 94385 NaN NaN NaN NaN
1264 126885 NaN 26 3.0 d NaN 40.0 new NaN 5ft 9in megmattmt just right fit 67002 10.5 wide NaN NaN
1265 126885 NaN 26 3.0 d NaN 40.0 new NaN 5ft 9in megmattmt just right fit 67002 10.5 wide NaN NaN
1370 126885 NaN 38 4.0 c 49.0 48.0 new NaN 5ft 11in kelli.andrews very long large 826087 NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
81598 806479 NaN 20 5.0 c 42.0 38.0 outerwear NaN 5ft 6in karinabuendia slightly long large 756267 10.0 NaN I loved this coat since t I loved this coat since the moment I saw it bu...
82085 806856 NaN 20 5.0 j 47.0 36.0 outerwear 47 5ft 10in bookworm.bedore just right fit 315616 10.5 average Love it! Fit and material Love it! Fit and material is A+.
82086 806856 NaN 20 5.0 j 47.0 36.0 outerwear 47 5ft 10in bookworm.bedore just right fit 315616 10.5 average Love it! Fit and material Love it! Fit and material is A+.
82398 806856 NaN 4 5.0 NaN NaN NaN outerwear NaN 5ft 5in ktanner779 just right large 858162 NaN NaN LOVE this coat. Super war LOVE this coat. Super warm, beautiful color an...
82399 806856 NaN 4 5.0 NaN NaN NaN outerwear NaN 5ft 5in ktanner779 just right large 858162 NaN NaN LOVE this coat. Super war LOVE this coat. Super warm, beautiful color an...

754 rows × 18 columns

In [58]:
df = df.drop_duplicates(keep='first')
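The two cells above first list every row that has an exact copy and then keep one row per duplicate group. The difference between the two `keep` settings can be sketched on a made-up three-row frame (the values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 1 are exact copies of each other.
toy = pd.DataFrame({
    "item_id": [126885, 126885, 806856],
    "size": [32, 32, 20],
    "fit": ["fit", "fit", "large"],
})

# keep=False flags every member of a duplicate group, so both copies are listed.
flagged = toy[toy.duplicated(keep=False)]

# keep='first' retains the first copy of each group and drops the later ones.
deduped = toy.drop_duplicates(keep="first")

print(len(flagged), len(deduped))
```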
In [59]:
df.describe(include='all')
Out[59]:
item_id waist size quality cup size hips bra size category bust height user_name length fit user_id shoe size shoe width review_summary review_text
count 82413.000000 2881.000000 82413.000000 82345.000000 76185 55804.000000 76421.000000 82413 11796 81308 82413 82378 82413 82413.000000 27790.000000 18521 75704 75704
unique NaN NaN NaN NaN 12 NaN NaN 7 40 41 32429 5 3 NaN NaN 3 61713 73313
top NaN NaN NaN NaN c NaN NaN new 36 5ft 4in Sarah just right fit NaN NaN average Love it! Love it!
freq NaN NaN NaN NaN 18270 NaN NaN 21395 2042 11876 727 61660 56516 NaN NaN 13030 184 152
mean 469417.251295 31.319681 12.659714 3.949092 NaN 40.358559 35.971605 NaN NaN NaN NaN NaN NaN 498819.325701 8.145592 NaN NaN NaN
std 214067.804253 5.303712 8.270768 0.992837 NaN 5.827906 3.224445 NaN NaN NaN NaN NaN NaN 286325.583966 1.335898 NaN NaN NaN
min 123373.000000 20.000000 0.000000 1.000000 NaN 30.000000 28.000000 NaN NaN NaN NaN NaN NaN 6.000000 5.000000 NaN NaN NaN
25% 314980.000000 28.000000 8.000000 3.000000 NaN 36.000000 34.000000 NaN NaN NaN NaN NaN NaN 252928.000000 7.000000 NaN NaN NaN
50% 454030.000000 30.000000 12.000000 4.000000 NaN 39.000000 36.000000 NaN NaN NaN NaN NaN NaN 497756.000000 8.000000 NaN NaN NaN
75% 658440.000000 34.000000 15.000000 5.000000 NaN 43.000000 38.000000 NaN NaN NaN NaN NaN NaN 744641.000000 9.000000 NaN NaN NaN
max 807722.000000 50.000000 38.000000 5.000000 NaN 60.000000 48.000000 NaN NaN NaN NaN NaN NaN 999972.000000 38.000000 NaN NaN NaN
In [60]:
df.isnull().sum()
Out[60]:
item_id               0
waist             79532
size                  0
quality              68
cup size           6228
hips              26609
bra size           5992
category              0
bust              70617
height             1105
user_name             0
length               35
fit                   0
user_id               0
shoe size         54623
shoe width        63892
review_summary     6709
review_text        6709
dtype: int64
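Turning these counts into shares makes it easier to see which columns are mostly empty. The sketch below copies the larger counts from the output above and divides by the 82413 rows that remain after deduplication; only the variable names are new:

```python
# Missing-value counts copied from the output above.
n_rows = 82413  # row count after drop_duplicates
missing = {
    "waist": 79532,
    "bust": 70617,
    "shoe width": 63892,
    "shoe size": 54623,
    "hips": 26609,
    "cup size": 6228,
    "bra size": 5992,
}

# Share of missing values per column, printed from largest to smallest.
shares = {col: count / n_rows for col, count in missing.items()}
for col, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:<12}{share:6.1%}")
```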
In [61]:
df.dtypes
Out[61]:
item_id             int64
waist             float64
size                int64
quality           float64
cup size           object
hips              float64
bra size          float64
category           object
bust               object
height             object
user_name          object
length             object
fit                object
user_id             int64
shoe size         float64
shoe width         object
review_summary     object
review_text        object
dtype: object
In [62]:
df.shape[1]
Out[62]:
18
In [63]:
df.shape[0]
Out[63]:
82413

3. Data Preparation

3.1. Testing for Multicollinearity

In [64]:
f,ax=plt.subplots(figsize = (18,18))
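# Correlate numeric columns only; pandas >= 2.0 raises on object columns otherwise
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, linewidths=0.5, fmt=".1f", ax=ax)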
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Correlation Map')
plt.show()
In [65]:
df = df.drop(['user_id'], axis=1)
In [66]:
df = df.drop(['user_name'], axis=1)
In [67]:
df = df.drop(['item_id'], axis=1)
In [68]:
df = df.drop(['review_text'], axis=1)
In [69]:
df = df.drop(['review_summary'], axis=1)
In [70]:
df = df.drop(['shoe width'], axis=1)
In [71]:
df = df.drop(['shoe size'], axis=1)
In [72]:
df = df.drop(['waist'], axis=1)
In [73]:
df = df.drop(['bust'], axis=1)
In [74]:
df = df.dropna(axis = 0, subset = ['hips'])
In [75]:
df = df.dropna(axis = 0, subset = ['cup size'])
In [76]:
df = df.dropna(axis = 0, subset = ['height'])
In [77]:
df = df.drop(['quality'], axis=1)
In [78]:
df = df.drop(['length'], axis=1)
In [79]:
df = df.dropna(axis = 0, subset = ['bra size'])
In [80]:
# Encode the fit label: 'fit' becomes 1, 'small' and 'large' become 0
df.loc[(df.fit == 'small'),'fit']=0
In [81]:
df.loc[(df.fit == 'fit'),'fit']=1
In [82]:
df.loc[(df.fit == 'large'),'fit']=0
In [83]:
df["fit"] = df["fit"].astype(str).astype(int)
In [84]:
# Keep only the rows where the item fit
df = df[df.fit != 0]
In [85]:
df.fit.unique()
Out[85]:
array([1], dtype=int64)
In [86]:
df.isnull().sum()
Out[86]:
size        0
cup size    0
hips        0
bra size    0
category    0
height      0
fit         0
dtype: int64
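The cell-by-cell preparation above can also be written as a few chained calls. The following is a sketch on a small made-up frame, not the notebook's actual pipeline: the helper name `prepare` and the sample rows are assumptions for illustration, and the sketch filters on the original `'fit'` label instead of recoding it to 0/1 first, which leaves the same rows.

```python
import pandas as pd

# Columns removed in the cells above, and columns that must not be missing.
DROP_COLS = ["user_id", "user_name", "item_id", "review_text", "review_summary",
             "shoe width", "shoe size", "waist", "bust", "quality", "length"]
REQUIRED = ["hips", "cup size", "height", "bra size"]


def prepare(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper mirroring the drop / dropna / filter steps above."""
    out = frame.drop(columns=DROP_COLS, errors="ignore")  # tolerate columns absent from toy data
    out = out.dropna(subset=REQUIRED)                     # require all four measurements
    return out[out["fit"] == "fit"]                       # keep only purchases that fit


# Made-up rows: row 1 is labelled 'small', row 2 lacks a cup size.
toy = pd.DataFrame({
    "size": [13, 3, 27],
    "cup size": ["dd/e", "b", None],
    "hips": [41.0, 36.0, 50.0],
    "bra size": [36.0, 34.0, 40.0],
    "category": ["new", "new", "new"],
    "height": ["5ft 6in", "5ft 3in", "5ft 4in"],
    "fit": ["fit", "small", "fit"],
    "user_id": [1, 2, 3],
})

clean = prepare(toy)
print(clean)
```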
In [87]:
df.describe(include='all')
Out[87]:
size cup size hips bra size category height fit
count 37187.000000 37187 37187.000000 37187.000000 37187 37187 37187.0
unique NaN 12 NaN NaN 7 32 NaN
top NaN c NaN NaN new 5ft 4in NaN
freq NaN 8949 NaN NaN 9463 5625 NaN
mean 11.794310 NaN 40.066529 35.669024 NaN NaN 1.0
std 7.661827 NaN 5.692938 3.101530 NaN NaN 0.0
min 0.000000 NaN 30.000000 28.000000 NaN NaN 1.0
25% 8.000000 NaN 36.000000 34.000000 NaN NaN 1.0
50% 10.000000 NaN 39.000000 36.000000 NaN NaN 1.0
75% 15.000000 NaN 43.000000 38.000000 NaN NaN 1.0
max 38.000000 NaN 60.000000 48.000000 NaN NaN 1.0
In [88]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
variable = df[['bra size', 'fit', 'hips', 'size']]
vif = pd.DataFrame()
vif["VIF"] = [variance_inflation_factor(variable.values, i) for i in range(variable.shape[1])]
vif["Features"] = variable.columns
In [89]:
vif
Out[89]:
VIF Features
0 2.666437 bra size
1 306.634974 fit
2 2.368706 hips
3 3.356016 size
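For reference, the textbook variance inflation factor of feature j is 1 / (1 - R_j²), where R_j² comes from regressing feature j on the remaining features plus an intercept. The NumPy sketch below implements that definition on synthetic data. It is an illustration rather than a re-implementation of the statsmodels function, and both the data and the function name `vif` are made up. Under this definition a constant column has no variance to explain, so after the filter above `fit` (always 1) is not a meaningful VIF input, which is presumably why its value stands out in the table.

```python
import numpy as np


def vif(X: np.ndarray) -> np.ndarray:
    """Textbook VIF: regress each column on the others plus an intercept."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])      # intercept + remaining features
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares fit
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)  # nearly a linear combination of a and b
d = rng.normal(size=500)                     # unrelated to the other columns

values = vif(np.column_stack([a, b, c, d]))
print(np.round(values, 1))
```

The three entangled columns receive large values, while the unrelated column stays close to 1.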
In [90]:
df = df.drop(['fit'], axis=1)
In [91]:
df.hist(figsize=(25,25), bins=50)
Out[91]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001A78129BAC8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001A7818C6A08>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x000001A7818B7248>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001A7818A6288>]],
      dtype=object)
In [92]:
# Create a figure instance
fig = plt.figure(1, figsize=(9, 6))

# Create an axes instance
ax = fig.add_subplot(111)

# Create the boxplot
bp = ax.boxplot(df['size'])
In [93]:
df.head()
Out[93]:
size cup size hips bra size category height
9 13 dd/e 41.0 36.0 new 5ft 6in
14 3 b 36.0 34.0 new 5ft 3in
15 27 c 50.0 40.0 new 5ft 4in
16 18 d 44.0 36.0 new 5ft 2in
19 9 dd/e 35.0 34.0 new 5ft 5in

3.2. Recoding Categorical Variables

In [94]:
df_dummies = pd.get_dummies(df, drop_first=True) # 0-1 encoding for categorical values
df_dummies.head()
Out[94]:
size hips bra size cup size_aa cup size_b cup size_c cup size_d cup size_dd/e cup size_ddd/f cup size_dddd/g ... height_6ft height_6ft 1in height_6ft 2in height_6ft 3in height_6ft 4in height_6ft 5in height_6ft 6in height_6ft 8in height_7ft 11in height_7ft 7in
9 13 41.0 36.0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
14 3 36.0 34.0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
15 27 50.0 40.0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
16 18 44.0 36.0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19 9 35.0 34.0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 51 columns
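The 51 columns follow from the category counts reported earlier: the 3 numeric columns stay as they are, and with `drop_first=True` the 12 cup sizes, 7 categories and 32 heights contribute 11, 6 and 31 dummy columns. A small sketch on made-up rows shows the mechanism:

```python
import pandas as pd

# Hypothetical three-row sample with one numeric and one categorical column.
toy = pd.DataFrame({"size": [3, 27, 18], "cup size": ["b", "c", "d"]})

# drop_first=True omits the first category ('b'), which becomes the baseline:
# a row whose dummy columns are all 0 belongs to that category.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per variable avoids a set of dummy columns that always sums to 1, which would be perfectly collinear with the intercept of a linear model.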

In [95]:
#project.save_data("df_dummies.csv", df_dummies.to_csv(index=False))
In [96]:
df_dummies.describe(include="all")
Out[96]:
size hips bra size cup size_aa cup size_b cup size_c cup size_d cup size_dd/e cup size_ddd/f cup size_dddd/g ... height_6ft height_6ft 1in height_6ft 2in height_6ft 3in height_6ft 4in height_6ft 5in height_6ft 6in height_6ft 8in height_7ft 11in height_7ft 7in
count 37187.000000 37187.000000 37187.000000 37187.000000 37187.000000 37187.000000 37187.000000 37187.000000 37187.000000 37187.000000 ... 37187.000000 37187.000000 37187.000000 37187.000000 37187.000000 37187.000000 37187.000000 37187.000000 37187.000000 37187.000000
mean 11.794310 40.066529 35.669024 0.004410 0.202517 0.240649 0.210047 0.157932 0.072929 0.025708 ... 0.007072 0.001748 0.001183 0.000054 0.000027 0.000134 0.000054 0.000027 0.000215 0.000027
std 7.661827 5.692938 3.101530 0.066263 0.401881 0.427483 0.407347 0.364682 0.260023 0.158265 ... 0.083801 0.041772 0.034378 0.007334 0.005186 0.011595 0.007334 0.005186 0.014666 0.005186
min 0.000000 30.000000 28.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 8.000000 36.000000 34.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 10.000000 39.000000 36.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 15.000000 43.000000 38.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 38.000000 60.000000 48.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 51 columns

In [97]:
df.shape[1]
Out[97]:
6
In [98]:
df.shape[0]
Out[98]:
37187
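Each `height_*` dummy column above treats a height string as an unrelated category, so the model cannot use the ordering between, say, `5ft 4in` and `5ft 5in`. One possible alternative, not used in this notebook, is to convert the strings into a single numeric column in inches. The helper name `height_to_inches` is hypothetical, and the sketch assumes that heights always follow the `Xft` or `Xft Yin` pattern seen in the data:

```python
import re
from typing import Optional

# Matches '5ft 6in' as well as a bare '6ft'.
_HEIGHT = re.compile(r"^\s*(\d+)ft(?:\s+(\d+)in)?\s*$")


def height_to_inches(text: str) -> Optional[int]:
    """Convert a height string to inches; return None if it does not match."""
    match = _HEIGHT.match(text)
    if match is None:
        return None
    feet = int(match.group(1))
    inches = int(match.group(2) or 0)  # a missing inch part counts as 0
    return feet * 12 + inches


print(height_to_inches("5ft 6in"), height_to_inches("6ft"))
```

Applied with `df['height'].map(height_to_inches)`, this would replace the 31 height dummies with one column.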

4. Modeling and Evaluation

In [99]:
df_dummies.head()
Out[99]:
size hips bra size cup size_aa cup size_b cup size_c cup size_d cup size_dd/e cup size_ddd/f cup size_dddd/g ... height_6ft height_6ft 1in height_6ft 2in height_6ft 3in height_6ft 4in height_6ft 5in height_6ft 6in height_6ft 8in height_7ft 11in height_7ft 7in
9 13 41.0 36.0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
14 3 36.0 34.0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
15 27 50.0 40.0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
16 18 44.0 36.0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
19 9 35.0 34.0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 51 columns

4.1. Test and Training Data

In [100]:
target = df_dummies['size']
predictors = df_dummies.drop(['size'], axis = 1)
In [101]:
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, random_state=365) 

4.2. Feature Scaling

In [102]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
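The scaler is fitted on the training data only and then applied to both splits, so no statistics from the test set enter the preprocessing. The NumPy sketch below reproduces the same arithmetic on made-up numbers, assuming the default `StandardScaler` behaviour of subtracting the column mean and dividing by the population standard deviation:

```python
import numpy as np

# Made-up single-feature splits (for example, hip measurements in inches).
train = np.array([[36.0], [40.0], [44.0]])
test = np.array([[40.0], [48.0]])

# Statistics come from the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# The same training statistics are applied to both splits.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
print(test_scaled.ravel())
```

The scaled training column has mean 0 and standard deviation 1; the test column generally does not, because it is measured against the training statistics.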

4.3. Evaluation

In [103]:
reg = LinearRegression()
reg.fit(X_train, y_train)
Out[103]:
LinearRegression()
In [104]:
print('training performance')
print(reg.score(X_train,y_train))
print('test performance')
print(reg.score(X_test,y_test))
training performance
0.7372938259154406
test performance
0.7370702534364146
In [105]:
y_pred = reg.predict(X_test)
test = pd.DataFrame({'Predicted':y_pred,'Actual':y_test})
fig= plt.figure(figsize=(16,8))
test = test.reset_index()
test = test.drop(['index'],axis=1)
plt.plot(test[:50])
# Lines are drawn in column order, so 'Predicted' comes first, then 'Actual'
plt.legend(['Predicted','Actual'])
sns.jointplot(x='Actual',y='Predicted',data=test,kind='reg',);
In [106]:
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
Out[106]:
0.7370702534364146
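Both `reg.score` and `r2_score` report the coefficient of determination, R² = 1 - SS_res / SS_tot, which is why the two test values above are identical. A minimal NumPy sketch of that formula on made-up numbers (the function name `r_squared` is hypothetical):

```python
import numpy as np


def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


actual = [8, 12, 15, 20]
perfect = r_squared(actual, actual)                  # predictions equal the targets
baseline = r_squared(actual, [np.mean(actual)] * 4)  # always predict the mean size
print(perfect, baseline)
```

A perfect model scores 1 and a model that always predicts the mean scores 0, so the value of about 0.74 means the linear model explains roughly three quarters of the variance in size on the test split.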