1. Business Understanding
Spotify, the audio streaming service behind Spotify Technology S.A., was founded as a Swedish startup in 2006 and is available in more than 90 countries. For Spotify, one of the greatest opportunities, and one of the greatest challenges, is creating a unique, personalized playlist for every user that matches the listener's taste in music. The company is considering addressing this challenge with AI and machine-learning algorithms that generate a tailored playlist for each user. The overarching question of this case study is: how does Spotify use machine-learning technologies to improve its customers' listening experience?
2. Data and Data Understanding
The data collected for Spotify is merged into a single dataset and examined with respect to the problem to see whether insights can be extracted from it. The attributes have very different value ranges: valence, for example, lies between 0 and 1, whereas duration_ms runs from about 5,000 to 5.4 million. The data is carefully prepared so that the final dataset can be created or processed with data-mining methods. To this end, the data is loaded and analyzed. The results in turn form the basis for the next phase, modeling.
2.1 Importing the relevant modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
2.2 Reading the data
# Read the CSV data file from remote storage into a pandas DataFrame
raw_data = pd.read_csv("https://storage.googleapis.com/ml-service-repository-datastorage/Generation_of_Individual_Playlists_Generation-of-Individual-Playlists-data.csv")
raw_data.head()
index | valence | year | acousticness | artists | danceability | duration_ms | energy | explicit | id | instrumentalness | key | liveness | loudness | mode | name | popularity | release_date | speechiness | tempo |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0594 | 1921 | 0.982 | ['Sergei Rachmaninoff', 'James Levine', 'Berli... | 0.279 | 831667 | 0.211 | 0 | 4BJqT0PrAfrxzMOxytFOIz | 0.878000 | 10 | 0.665 | -20.096 | 1 | Piano Concerto No. 3 in D Minor, Op. 30: III. ... | 4 | 1921 | 0.0366 | 80.954 |
1 | 0.9630 | 1921 | 0.732 | ['Dennis Day'] | 0.819 | 180533 | 0.341 | 0 | 7xPhfUan2yNtyFG0cUWkt8 | 0.000000 | 7 | 0.160 | -12.441 | 1 | Clancy Lowered the Boom | 5 | 1921 | 0.4150 | 60.936 |
2 | 0.0394 | 1921 | 0.961 | ['KHP Kridhamardawa Karaton Ngayogyakarta Hadi... | 0.328 | 500062 | 0.166 | 0 | 1o6I8BglA6ylDMrIELygv1 | 0.913000 | 3 | 0.101 | -14.850 | 1 | Gati Bali | 5 | 1921 | 0.0339 | 110.339 |
3 | 0.1650 | 1921 | 0.967 | ['Frank Parker'] | 0.275 | 210000 | 0.309 | 0 | 3ftBPsC5vPBKxYSee08FDH | 0.000028 | 5 | 0.381 | -9.316 | 1 | Danny Boy | 3 | 1921 | 0.0354 | 100.109 |
4 | 0.2530 | 1921 | 0.957 | ['Phil Regan'] | 0.418 | 166693 | 0.193 | 0 | 4d6HGyGT8e121BsdKmw9v6 | 0.000002 | 3 | 0.229 | -10.096 | 1 | When Irish Eyes Are Smiling | 2 | 1921 | 0.0380 | 101.665 |
#raw_data.head(5)
#descriptive statistics for all columns
raw_data.describe(include = 'all')
statistic | valence | year | acousticness | artists | danceability | duration_ms | energy | explicit | id | instrumentalness | key | liveness | loudness | mode | name | popularity | release_date | speechiness | tempo |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 170653.000000 | 170653.000000 | 170653.000000 | 170653 | 170653.000000 | 1.706530e+05 | 170653.000000 | 170653.000000 | 170653 | 170653.000000 | 170653.000000 | 170653.000000 | 170653.000000 | 170653.000000 | 170653 | 170653.000000 | 170653 | 170653.000000 | 170653.000000 |
unique | NaN | NaN | NaN | 34088 | NaN | NaN | NaN | NaN | 170653 | NaN | NaN | NaN | NaN | NaN | 133638 | NaN | 11244 | NaN | NaN |
top | NaN | NaN | NaN | ['Эрнест Хемингуэй'] | NaN | NaN | NaN | NaN | 6Af5jOEKb2bef8XE74ptnw | NaN | NaN | NaN | NaN | NaN | White Christmas | NaN | 1945 | NaN | NaN |
freq | NaN | NaN | NaN | 1211 | NaN | NaN | NaN | NaN | 1 | NaN | NaN | NaN | NaN | NaN | 73 | NaN | 1446 | NaN | NaN |
mean | 0.528587 | 1976.787241 | 0.502115 | NaN | 0.537396 | 2.309483e+05 | 0.482389 | 0.084575 | NaN | 0.167010 | 5.199844 | 0.205839 | -11.467990 | 0.706902 | NaN | 31.431794 | NaN | 0.098393 | 116.861590 |
std | 0.263171 | 25.917853 | 0.376032 | NaN | 0.176138 | 1.261184e+05 | 0.267646 | 0.278249 | NaN | 0.313475 | 3.515094 | 0.174805 | 5.697943 | 0.455184 | NaN | 21.826615 | NaN | 0.162740 | 30.708533 |
min | 0.000000 | 1921.000000 | 0.000000 | NaN | 0.000000 | 5.108000e+03 | 0.000000 | 0.000000 | NaN | 0.000000 | 0.000000 | 0.000000 | -60.000000 | 0.000000 | NaN | 0.000000 | NaN | 0.000000 | 0.000000 |
25% | 0.317000 | 1956.000000 | 0.102000 | NaN | 0.415000 | 1.698270e+05 | 0.255000 | 0.000000 | NaN | 0.000000 | 2.000000 | 0.098800 | -14.615000 | 0.000000 | NaN | 11.000000 | NaN | 0.034900 | 93.421000 |
50% | 0.540000 | 1977.000000 | 0.516000 | NaN | 0.548000 | 2.074670e+05 | 0.471000 | 0.000000 | NaN | 0.000216 | 5.000000 | 0.136000 | -10.580000 | 1.000000 | NaN | 33.000000 | NaN | 0.045000 | 114.729000 |
75% | 0.747000 | 1999.000000 | 0.893000 | NaN | 0.668000 | 2.624000e+05 | 0.703000 | 0.000000 | NaN | 0.102000 | 8.000000 | 0.261000 | -7.183000 | 1.000000 | NaN | 48.000000 | NaN | 0.075600 | 135.537000 |
max | 1.000000 | 2020.000000 | 0.996000 | NaN | 0.988000 | 5.403500e+06 | 1.000000 | 1.000000 | NaN | 1.000000 | 11.000000 | 1.000000 | 3.855000 | 1.000000 | NaN | 100.000000 | NaN | 0.970000 | 243.507000 |
2.3 Data cleaning
# check for duplicate rows
raw_data[raw_data.duplicated(keep = False)]
(empty DataFrame with 0 rows: no duplicate rows were found)
Checking for missing values
raw_data.isnull().sum()
valence             0
year                0
acousticness        0
artists             0
danceability        0
duration_ms         0
energy              0
explicit            0
id                  0
instrumentalness    0
key                 0
liveness            0
loudness            0
mode                0
name                0
popularity          0
release_date        0
speechiness         0
tempo               0
dtype: int64
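The duplicate and missing-value checks above could be bundled into one small helper. The following is only a minimal sketch: the function name `quality_report` and the toy data are invented for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarize duplicate rows and missing values (hypothetical helper)."""
    missing = df.isnull().sum()
    return {
        "n_rows": len(df),
        # Rows that are exact copies of an earlier row
        "n_duplicate_rows": int(df.duplicated().sum()),
        # Only the columns that actually contain missing values
        "missing_per_column": missing[missing > 0].to_dict(),
    }


# Toy data (invented): one duplicated row and one missing tempo value
toy = pd.DataFrame({
    "valence": [0.1, 0.5, 0.5, 0.9],
    "tempo": [80.0, 120.0, 120.0, np.nan],
})
report = quality_report(toy)
print(report)
```

For the Spotify dataset, such a report would show zero duplicates and an empty missing-value mapping, which matches the two checks above.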
Selecting the predictors
#predictive modeling
#plot a correlation matrix to find out which variables are correlated to each other
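# Correlate numeric columns only; text columns such as 'artists' and 'name' have no correlation
f, ax = plt.subplots(figsize=(18, 18))
sns.heatmap(raw_data.select_dtypes(include='number').corr(), annot=True, linewidths=0.2, fmt=".1f", ax=ax)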
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Correlation Map')
plt.show()
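Besides reading the heatmap by eye, the correlation of each feature with the target can be ranked numerically. The sketch below uses synthetic stand-in data (all values are invented, and only the column names are borrowed from the dataset), so it illustrates the idea rather than the notebook's actual result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data (invented): 'loudness' drives the target, 'key' is pure noise
loudness = rng.normal(size=n)
key = rng.integers(0, 12, size=n)
popularity = 3.0 * loudness + rng.normal(scale=0.5, size=n)
toy = pd.DataFrame({"loudness": loudness, "key": key, "popularity": popularity})

# Absolute Pearson correlation of every feature with the target, strongest first
corr_with_target = (
    toy.corr()["popularity"]
    .drop("popularity")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

A ranking like this only captures linear, pairwise relationships, so it is a starting point for choosing predictors rather than a final selection rule.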
Removing features
# Drop identifier and free-text columns; errors='ignore' keeps the cell safe to re-run
raw_data = raw_data.drop(columns=['name', 'release_date', 'id'], errors='ignore')
raw_data = raw_data.drop(['artists', 'energy'], axis=1)
raw_data = raw_data.drop(['year'], axis=1)
raw_data.head()
index | valence | acousticness | danceability | duration_ms | explicit | instrumentalness | key | liveness | loudness | mode | popularity | speechiness | tempo |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0594 | 0.982 | 0.279 | 831667 | 0 | 0.878000 | 10 | 0.665 | -20.096 | 1 | 4 | 0.0366 | 80.954 |
1 | 0.9630 | 0.732 | 0.819 | 180533 | 0 | 0.000000 | 7 | 0.160 | -12.441 | 1 | 5 | 0.4150 | 60.936 |
2 | 0.0394 | 0.961 | 0.328 | 500062 | 0 | 0.913000 | 3 | 0.101 | -14.850 | 1 | 5 | 0.0339 | 110.339 |
3 | 0.1650 | 0.967 | 0.275 | 210000 | 0 | 0.000028 | 5 | 0.381 | -9.316 | 1 | 3 | 0.0354 | 100.109 |
4 | 0.2530 | 0.957 | 0.418 | 166693 | 0 | 0.000002 | 3 | 0.229 | -10.096 | 1 | 2 | 0.0380 | 101.665 |
f,ax=plt.subplots(figsize = (18,18))
sns.heatmap(raw_data.corr(),annot= True,linewidths=0.5,fmt = ".1f",ax=ax)
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.title('Correlation Map')
plt.show()
# Plot one histogram per feature to inspect the distributions for outliers
raw_data.hist(figsize=(25, 25), bins=50)
plt.show()
[Figure: histograms of valence, acousticness, danceability, duration_ms, explicit, instrumentalness, key, liveness, loudness, mode, popularity, speechiness, and tempo]
#raw_data.describe()
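The notebook inspects the histograms visually but does not go on to handle outliers. One common way to flag them is the interquartile-range (IQR) rule, sketched below. This technique is swapped in for illustration and is not what the original notebook does; the helper name `iqr_outlier_mask` and the sample durations are invented.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (values < lower) | (values > upper)


# Invented track durations in milliseconds; the last one is extremely long
durations_ms = np.array([180_000, 200_000, 210_000, 220_000, 240_000, 5_400_000])
mask = iqr_outlier_mask(durations_ms)
print(durations_ms[mask])
```

Whether flagged rows should be dropped, capped, or kept depends on the modeling goal; very long tracks may be legitimate recordings rather than errors.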
3. Data Preparation
# One-hot (0/1) encode categorical columns; a no-op here because all remaining columns are numeric
df_dummies = pd.get_dummies(raw_data, drop_first=True)
df_dummies.head()
index | valence | acousticness | danceability | duration_ms | explicit | instrumentalness | key | liveness | loudness | mode | popularity | speechiness | tempo |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0594 | 0.982 | 0.279 | 831667 | 0 | 0.878000 | 10 | 0.665 | -20.096 | 1 | 4 | 0.0366 | 80.954 |
1 | 0.9630 | 0.732 | 0.819 | 180533 | 0 | 0.000000 | 7 | 0.160 | -12.441 | 1 | 5 | 0.4150 | 60.936 |
2 | 0.0394 | 0.961 | 0.328 | 500062 | 0 | 0.913000 | 3 | 0.101 | -14.850 | 1 | 5 | 0.0339 | 110.339 |
3 | 0.1650 | 0.967 | 0.275 | 210000 | 0 | 0.000028 | 5 | 0.381 | -9.316 | 1 | 3 | 0.0354 | 100.109 |
4 | 0.2530 | 0.957 | 0.418 | 166693 | 0 | 0.000002 | 3 | 0.229 | -10.096 | 1 | 2 | 0.0380 | 101.665 |
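Because every remaining column is numeric, the encoded frame is identical to the input. To show what `pd.get_dummies(..., drop_first=True)` does when a categorical column is present, here is a small sketch; the `genre` column and its values are hypothetical and do not exist in the dataset.

```python
import pandas as pd

# Toy frame with one numeric and one hypothetical categorical column
toy = pd.DataFrame({
    "tempo": [80.9, 60.9, 110.3],
    "genre": ["rock", "jazz", "rock"],
})

# Each category becomes its own indicator column; drop_first removes the first
# category ('jazz', alphabetically) so the indicators are not redundant
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded)
```

Numeric columns such as `tempo` pass through unchanged, and only the `genre_rock` indicator is added.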
4. Modeling
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
target = df_dummies['popularity']
predictors = df_dummies.drop(['popularity'], axis=1)
4.1 Test and training sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.2, random_state=365)
Standardizing the columns before fitting models
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
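The scaler is fitted on the training set only, and the same statistics are then applied to the test set, so no information from the test data leaks into preprocessing. The NumPy sketch below spells out the arithmetic that `StandardScaler` performs with its default settings (column mean and population standard deviation); the arrays are invented toy values.

```python
import numpy as np

# Invented toy data: two features on very different scales
X_train_toy = np.array([[100.0, 0.2], [120.0, 0.4], [140.0, 0.6]])
X_test_toy = np.array([[160.0, 0.8]])

# Statistics come from the training data only
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population standard deviation (ddof=0)

# The same shift and scale are applied to both sets
X_train_scaled = (X_train_toy - mean) / std
X_test_scaled = (X_test_toy - mean) / std

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

After scaling, the training columns have mean 0 and standard deviation 1, while test values may fall outside that range, as the last row shows.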
4.2 Linear regression model and evaluation
reg = LinearRegression()
reg.fit(X_train, y_train)
LinearRegression()
print('training performance')
print(reg.score(X_train,y_train))
print('test performance')
print(reg.score(X_test,y_test))
training performance
0.44567392235354486
test performance
0.4477587073403365
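`LinearRegression.score` returns the coefficient of determination R², so a test score of about 0.45 means the model explains roughly 45% of the variance in popularity. The following sketch computes R² and the root mean squared error by hand on invented numbers; the helper name `r2_score_manual` is an assumption made for illustration.

```python
import numpy as np


def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the quantity reported by reg.score."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot


# Invented popularity values and predictions
y_true_toy = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_toy = np.array([12.0, 18.0, 33.0, 37.0])

r2 = r2_score_manual(y_true_toy, y_pred_toy)
rmse = np.sqrt(np.mean((y_true_toy - y_pred_toy) ** 2))
print(r2, rmse)
```

A model that always predicts the mean of the target scores exactly 0 under this definition, which gives a baseline for judging the 0.45 above.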
y_pred = reg.predict(X_test)
# Collect predictions and ground truth in one frame with a fresh 0..n-1 index
test = pd.DataFrame({'Predicted': y_pred, 'Actual': y_test}).reset_index(drop=True)
fig = plt.figure(figsize=(16, 8))
# Plot the first 50 test samples; the legend follows the column order of the frame
plt.plot(test[:50])
plt.legend(['Predicted', 'Actual'])
sns.jointplot(x='Actual', y='Predicted', data=test, kind='reg')
plt.show()
5. Deployment
# Not carried out because this step was optional