1. Business Understanding

The audio-streaming service was founded in 2006 by the Swedish startup Spotify Technology S.A. and is available in more than 90 countries. One of Spotify's greatest opportunities, and at the same time one of its biggest challenges, is creating a unique, personalized playlist for each user that matches that listener's musical taste. Spotify aims to meet this challenge with AI and machine-learning algorithms that generate a tailored playlist for every user. The overarching question of this case study is: How does Spotify use machine-learning technologies to improve its customers' listening experience?

2. Data and Data Understanding

The data collected for Spotify are merged into a single, uniform dataset and examined against the problem at hand to see whether insights can be extracted from them. The attributes span very different value ranges. The data are therefore carefully prepared so that the final dataset can be built and processed with data-mining methods. To this end, the data are loaded and analyzed; the result in turn forms the basis for the subsequent modeling phase.

2.1 Importing the Relevant Modules

In [3]:
# @hidden_cell
# The project token is an authorization token used to access project resources such as data sources and connections; it is used by platform APIs.
# from project_lib import Project
# project = Project(project_id='148a63b8-8cd0-4ae9-acd7-382d6f8da2d1', project_access_token='p-b4ba4eff97677a6d061b1f9eb562ffce54a4e527')
# pc = project.project_context
In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

2.2 Reading the Data

In [10]:
#Fetch the local file
# my_file = project.get_file("data.csv")

#Read the CSV data file from the object storage into a pandas DataFrame
#my_file.seek(0)

#raw_data = pd.read_csv(my_file)
raw_data = pd.read_csv("https://storage.googleapis.com/ml-service-repository-datastorage/Generation_of_Individual_Playlists_Generation-of-Individual-Playlists-data.csv")

raw_data.head()
Out[10]:
valence year acousticness artists danceability duration_ms energy explicit id instrumentalness key liveness loudness mode name popularity release_date speechiness tempo
0 0.0594 1921 0.982 ['Sergei Rachmaninoff', 'James Levine', 'Berli... 0.279 831667 0.211 0 4BJqT0PrAfrxzMOxytFOIz 0.878000 10 0.665 -20.096 1 Piano Concerto No. 3 in D Minor, Op. 30: III. ... 4 1921 0.0366 80.954
1 0.9630 1921 0.732 ['Dennis Day'] 0.819 180533 0.341 0 7xPhfUan2yNtyFG0cUWkt8 0.000000 7 0.160 -12.441 1 Clancy Lowered the Boom 5 1921 0.4150 60.936
2 0.0394 1921 0.961 ['KHP Kridhamardawa Karaton Ngayogyakarta Hadi... 0.328 500062 0.166 0 1o6I8BglA6ylDMrIELygv1 0.913000 3 0.101 -14.850 1 Gati Bali 5 1921 0.0339 110.339
3 0.1650 1921 0.967 ['Frank Parker'] 0.275 210000 0.309 0 3ftBPsC5vPBKxYSee08FDH 0.000028 5 0.381 -9.316 1 Danny Boy 3 1921 0.0354 100.109
4 0.2530 1921 0.957 ['Phil Regan'] 0.418 166693 0.193 0 4d6HGyGT8e121BsdKmw9v6 0.000002 3 0.229 -10.096 1 When Irish Eyes Are Smiling 2 1921 0.0380 101.665
In [11]:
#raw_data.head(5)
In [12]:
#descriptive statistics for all columns
raw_data.describe(include = 'all')
Out[12]:
valence year acousticness artists danceability duration_ms energy explicit id instrumentalness key liveness loudness mode name popularity release_date speechiness tempo
count 170653.000000 170653.000000 170653.000000 170653 170653.000000 1.706530e+05 170653.000000 170653.000000 170653 170653.000000 170653.000000 170653.000000 170653.000000 170653.000000 170653 170653.000000 170653 170653.000000 170653.000000
unique NaN NaN NaN 34088 NaN NaN NaN NaN 170653 NaN NaN NaN NaN NaN 133638 NaN 11244 NaN NaN
top NaN NaN NaN ['Эрнест Хемингуэй'] NaN NaN NaN NaN 6Af5jOEKb2bef8XE74ptnw NaN NaN NaN NaN NaN White Christmas NaN 1945 NaN NaN
freq NaN NaN NaN 1211 NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN 73 NaN 1446 NaN NaN
mean 0.528587 1976.787241 0.502115 NaN 0.537396 2.309483e+05 0.482389 0.084575 NaN 0.167010 5.199844 0.205839 -11.467990 0.706902 NaN 31.431794 NaN 0.098393 116.861590
std 0.263171 25.917853 0.376032 NaN 0.176138 1.261184e+05 0.267646 0.278249 NaN 0.313475 3.515094 0.174805 5.697943 0.455184 NaN 21.826615 NaN 0.162740 30.708533
min 0.000000 1921.000000 0.000000 NaN 0.000000 5.108000e+03 0.000000 0.000000 NaN 0.000000 0.000000 0.000000 -60.000000 0.000000 NaN 0.000000 NaN 0.000000 0.000000
25% 0.317000 1956.000000 0.102000 NaN 0.415000 1.698270e+05 0.255000 0.000000 NaN 0.000000 2.000000 0.098800 -14.615000 0.000000 NaN 11.000000 NaN 0.034900 93.421000
50% 0.540000 1977.000000 0.516000 NaN 0.548000 2.074670e+05 0.471000 0.000000 NaN 0.000216 5.000000 0.136000 -10.580000 1.000000 NaN 33.000000 NaN 0.045000 114.729000
75% 0.747000 1999.000000 0.893000 NaN 0.668000 2.624000e+05 0.703000 0.000000 NaN 0.102000 8.000000 0.261000 -7.183000 1.000000 NaN 48.000000 NaN 0.075600 135.537000
max 1.000000 2020.000000 0.996000 NaN 0.988000 5.403500e+06 1.000000 1.000000 NaN 1.000000 11.000000 1.000000 3.855000 1.000000 NaN 100.000000 NaN 0.970000 243.507000

2.3 Data Cleaning

In [13]:
# check for duplicate rows (an empty result means the dataset contains none)
raw_data[raw_data.duplicated(keep=False)]
Out[13]:
valence year acousticness artists danceability duration_ms energy explicit id instrumentalness key liveness loudness mode name popularity release_date speechiness tempo

Checking for Missing Values

In [14]:
raw_data.isnull().sum()
Out[14]:
valence             0
year                0
acousticness        0
artists             0
danceability        0
duration_ms         0
energy              0
explicit            0
id                  0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
name                0
popularity          0
release_date        0
speechiness         0
tempo               0
dtype: int64
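
No column contains missing values, so no imputation is required. Had any nulls appeared, a simple hypothetical handling step could look like the following sketch (not part of the original analysis):

In [ ]:
# hypothetical: fill numeric gaps with the column median (a no-op on this dataset)
raw_data = raw_data.fillna(raw_data.median(numeric_only=True))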

Selecting the Predictors

In [15]:
# plot a correlation matrix to see which variables are correlated with each other
# (non-numeric columns are ignored by corr() here)
f, ax = plt.subplots(figsize=(18, 18))
sns.heatmap(raw_data.corr(), annot=True, linewidths=0.2, fmt=".1f", ax=ax)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Correlation Map')
plt.show()
[Figure: Correlation Map, a heatmap of pairwise correlations between all numeric features]
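
To read the relationships with the target off numerically rather than only from the heatmap, the correlation column for popularity can be sorted; a small sketch that is not part of the original notebook:

In [ ]:
# correlation of every numeric feature with the target, strongest positive first
raw_data.corr()['popularity'].sort_values(ascending=False)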

Removing Features

In [17]:
# drop text columns that cannot serve directly as numeric predictors
raw_data = raw_data.drop(['name', 'release_date'], axis=1)
In [18]:
# drop the unique track identifier, which carries no predictive information
raw_data = raw_data.drop(['id'], axis=1)
In [19]:
# 'artists' is a free-text list; 'energy' is strongly correlated with 'loudness' in the map above
raw_data = raw_data.drop(['artists', 'energy'], axis=1)
In [20]:
# 'year' duplicates the information of the already dropped 'release_date'
raw_data = raw_data.drop(['year'], axis=1)
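
Note that re-running any of these drop cells raises a KeyError, because the column is already gone after the first execution. Passing errors='ignore' makes such a cell safe to re-execute; a minimal sketch:

In [ ]:
# idempotent variant: dropping an already-removed column is silently skipped
raw_data = raw_data.drop(columns=['id'], errors='ignore')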
In [21]:
raw_data.head()
Out[21]:
valence acousticness danceability duration_ms explicit instrumentalness key liveness loudness mode popularity speechiness tempo
0 0.0594 0.982 0.279 831667 0 0.878000 10 0.665 -20.096 1 4 0.0366 80.954
1 0.9630 0.732 0.819 180533 0 0.000000 7 0.160 -12.441 1 5 0.4150 60.936
2 0.0394 0.961 0.328 500062 0 0.913000 3 0.101 -14.850 1 5 0.0339 110.339
3 0.1650 0.967 0.275 210000 0 0.000028 5 0.381 -9.316 1 3 0.0354 100.109
4 0.2530 0.957 0.418 166693 0 0.000002 3 0.229 -10.096 1 2 0.0380 101.665
In [23]:
f, ax = plt.subplots(figsize=(18, 18))
sns.heatmap(raw_data.corr(), annot=True, linewidths=0.5, fmt=".1f", ax=ax)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Correlation Map')
plt.show()
[Figure: Correlation Map after feature removal]
In [24]:
# inspect the distributions of all features to spot potential outliers
raw_data.hist(figsize=(25, 25), bins=50);
[Figure: histograms of all remaining features]
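
The histograms are only inspected here; duration_ms in particular is heavily right-skewed (its maximum is more than 25 times the median). If outliers actually had to be handled, an IQR-based filter would be one common option; a hypothetical sketch that is not applied in the original analysis:

In [ ]:
# hypothetical IQR filter for one skewed feature
q1, q3 = raw_data['duration_ms'].quantile([0.25, 0.75])
iqr = q3 - q1
within = raw_data['duration_ms'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(f"{(~within).sum()} rows would be removed as duration outliers")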
In [25]:
#raw_data.describe()

3. Data Preparation

In [26]:
df_dummies = pd.get_dummies(raw_data, drop_first=True)  # 0-1 encoding for categorical values
# note: after the feature removal above no text columns remain, so the data pass through unchanged
df_dummies.head()
Out[26]:
valence acousticness danceability duration_ms explicit instrumentalness key liveness loudness mode popularity speechiness tempo
0 0.0594 0.982 0.279 831667 0 0.878000 10 0.665 -20.096 1 4 0.0366 80.954
1 0.9630 0.732 0.819 180533 0 0.000000 7 0.160 -12.441 1 5 0.4150 60.936
2 0.0394 0.961 0.328 500062 0 0.913000 3 0.101 -14.850 1 5 0.0339 110.339
3 0.1650 0.967 0.275 210000 0 0.000028 5 0.381 -9.316 1 3 0.0354 100.109
4 0.2530 0.957 0.418 166693 0 0.000002 3 0.229 -10.096 1 2 0.0380 101.665
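
That the encoding is effectively a no-op can be verified by listing the remaining non-numeric columns; a quick check that is not part of the original notebook:

In [ ]:
# sanity check: an empty Index confirms there are no text columns left to encode
raw_data.select_dtypes(include='object').columns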

4. Data Modeling

In [28]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
In [29]:
# 'popularity' is the prediction target; all remaining columns serve as predictors
target = df_dummies['popularity']
predictors = df_dummies.drop(['popularity'], axis=1)

4.1 Training and Test Sets

In [30]:
from sklearn.model_selection import train_test_split
# hold out 20% of the data for testing; a fixed random_state keeps the split reproducible
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, random_state=365)

Standardizing the Columns Before Model Training

In [31]:
# fit the scaler on the training data only, to avoid leaking test-set statistics
scaler = StandardScaler()
scaler.fit(X_train)

X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
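
As an alternative to scaling by hand, scaler and model can be chained in a scikit-learn Pipeline, which guarantees the scaler is only ever fit on training data. A minimal sketch; X_train_raw and X_test_raw are hypothetical names for the splits before the manual scaling above:

In [ ]:
from sklearn.pipeline import make_pipeline

# chain standardization and regression into a single estimator
pipe = make_pipeline(StandardScaler(), LinearRegression())
# pipe.fit(X_train_raw, y_train)    # hypothetical: unscaled training split
# pipe.score(X_test_raw, y_test)    # hypothetical: unscaled test split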

4.2 Linear Regression Model and Evaluation

In [33]:
reg = LinearRegression()
reg.fit(X_train, y_train)
Out[33]:
LinearRegression()
In [34]:
# score() returns the coefficient of determination R^2
print('training performance')
print(reg.score(X_train,y_train))
print('test performance')
print(reg.score(X_test,y_test))
training performance
0.44567392235354486
test performance
0.4477587073403365
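
The R² of roughly 0.45 on both splits shows no sign of overfitting, but says little about the size of the error on the 0-100 popularity scale. The metrics module imported above is unused so far; MAE and RMSE express the error directly in popularity points. A small sketch:

In [ ]:
# absolute error metrics on the test set, in units of the popularity score
y_pred = reg.predict(X_test)
print('MAE :', metrics.mean_absolute_error(y_test, y_pred))
print('RMSE:', metrics.mean_squared_error(y_test, y_pred) ** 0.5)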
In [35]:
y_pred = reg.predict(X_test)
test = pd.DataFrame({'Predicted': y_pred, 'Actual': y_test}).reset_index(drop=True)
fig = plt.figure(figsize=(16, 8))
plt.plot(test[:50])
plt.legend(['Predicted', 'Actual'])  # legend order must match the column order of the DataFrame
sns.jointplot(x='Actual', y='Predicted', data=test, kind='reg');
[Figure: line plot of the first 50 predicted vs. actual popularity values]
[Figure: joint plot of actual vs. predicted popularity with a regression fit]

5. Deployment

In [ ]:
# Not carried out, as this step was optional
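
Although deployment was skipped, persisting the fitted scaler and model would be the usual first step. A minimal sketch, assuming joblib is available (the file names are illustrative):

In [ ]:
import joblib

joblib.dump(scaler, 'scaler.joblib')           # fitted StandardScaler
joblib.dump(reg, 'popularity_model.joblib')    # fitted LinearRegression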