Image obtained from Udacity’s Data Science Nanodegree Course

Data Science, Music Streaming, and Big Data. Behold the Sparkify.

Luís Garavaso

--

Motivation

As I work in a company where Big Data is fundamental, learning Spark was almost mandatory. Luckily, Udacity presented me with this Capstone Project called Sparkify, a fictional music streaming service that would like to predict Churn based on its users’ behaviors.
I had a lot of fun doing this project and, after this post, I can proudly announce that I am getting my Udacity Data Science Nanodegree Certificate!
For those who are not familiar with Spark, I hope you can get a glimpse of this powerful tool, and for those who know it well, I look forward to your feedback!
So let’s get to it!

Introduction

First of all, Churn is a measurement of accounts that cancel or choose not to renew their subscriptions. Since we are talking about a music streaming service, high Churn rates can badly affect our revenue.

The data was provided by Udacity and the complete Jupyter Notebook can be found on my GitHub Repository.

With the data, I intend to answer the following questions:

  1. Are there powerful features to predict Churn?
  2. Which model performs better? The RandomForest or the LogisticRegression?
  3. Which features did the models give more importance to?

Code

Imports and Setup

ETL — Extract, Transform, and Load

As always, start reading the data:

sparkify = spark.read.json('mini_sparkify_event_data.json')

Note: This Spark DataFrame is a subset (128 MB) of a much larger file (12 GB), and it is used to quickly test a Pipeline that can then be scaled up and sent to a cluster on AWS or IBM Cloud.

This dataframe contains 286,500 records with the following schema:

Now let’s take a row as a sample:

We see that every row is related to a song a user listened to; this is what we're going to call a session from now on. Since we want to predict whether a user is likely to Churn, this dataset must be transformed from one-row-per-session to one-row-per-user, but we'll get to that later on. First, we need to do some clean-up to find powerful features.

Drop Missing UserIds

From the previous code block, we see that userId is a string of digits. Of the 286,500 records, however, 8,346 are missing a userId and must be dropped, since we have no information about those users. This can be done with a single line of code:

sparkify = sparkify.where("userId != ''")

Clean Timestamp Columns

From the sample row we took, we see that the columns ts and registration are Unix timestamps in milliseconds. Let's convert them to seconds and then to a complete date format.

EDA — Exploratory Data Analysis

Now that we have a clean dataset, we can look for powerful features.

I divided my features into two groups:

  1. Pivot Features: obtained from pivot tables transformations
  2. GroupBy Features: obtained from GroupBy transformations

Pivot Features

To obtain them, I will rely on two functions I defined: make_pivot and histogram.

First, let’s pivot the page column and define Churn.

# Pivot Transformation
page_pivot = make_pivot(sparkify, 'page', fill_na=True)
# Renaming column to define Churn
page_pivot = page_pivot.withColumnRenamed('Cancellation Confirmation', 'Churn')

The page_pivot looks like this:

Now I am going to call histogram and display here the results for Thumbs Down, Error, Home, NextSong, Add Friend, and About (the other columns' plots are available on my GitHub Repository).

We can see from the plots above that some thresholds help identify users who did not Churn. For example:

  1. If a user clicked NextSong more than 2,500 times, it is unlikely that he/she will Churn.
  2. If a user added more than 50 friends, it is unlikely that he/she will Churn.

This logic can be extended to other features obtained from other pivot columns, such as auth, status, and method. Their plots will not be included in the post, but they can be found on my GitHub Repository.

GroupBy Features

The GroupBy features do not need a predefined function, but they do require some preprocessing, and they need a Left Join with the page_pivot dataset obtained previously to bring in the Churn information.

Starting with Gender:

The resulting plot looks like this:

Now, let’s evaluate each gender’s behavior per session:

From the two previous results, we see that fewer women use Sparkify, but they use the app more often and are less likely to Churn than men.

Now let’s evaluate the time features, specifically the time between sessions:

To do that, we need to use a Window:

The resulting plot looks like this:

We can see that the max and avg time between sessions distributions are different for users who Churned and those who didn’t.

EDA’s Conclusion

In this section, I wanted to identify powerful features. Fortunately, we found multiple features that can be included in the Machine Learning models:

  1. Pivot Features: page, auth, status, and method
  2. GroupBy Features: max(tbs), avg(tbs), gender

So this gets us the answer to our first question:

1. Are there powerful features to predict Churn?

Yes, there are!!

Complete Machine Learning Pipeline

Now that we know what we want, let’s build an entire Pipeline.

Note: The six following code blocks contain some amazing PySpark features that I decided to add to the post in case someone wants to use them as a reference, but feel free to skip them and go straight to the Pipeline. Also, I would like to acknowledge where I got part of the code from: this Stack Overflow post.

Starting with the ETL, where I will define multiple transformations:

  1. DropNullId: Drops the null Ids
  2. GenderEncoder: Encodes Females as 1 and Males as 0
  3. TimeFeaturesTransform: Cleans Time Features
  4. PivotTransform: Builds the EDA’s pivot tables
  5. GroupByTransform: Builds EDA’s GroupBy tables
  6. JoinTransform: Joins all the transformations together

With the classes defined above, we can perform the entire dataset transformation in a few lines of code:

Note that this time I renamed the Cancellation Confirmation column to label instead of Churn; this gives better compatibility with PySpark’s Machine Learning models.

TrainTestSplit

We know from previous plots that the data is unbalanced, so we need to do a Stratified Split. I’d like to acknowledge where I got the code from: this Stack Overflow post.

Now we’re ready to build Machine Learning models.

Evaluation Metric

Before getting to modeling, we first need to determine which metric to use. There are four possible outcomes of a binary classification:

  1. True Positives (TP): The user Churned and the model predicted so.
  2. True Negatives (TN): The user did not Churn and the model predicted so.
  3. False Positives (FP): The model predicted the user Churned, but he/she stayed on Sparkify.
  4. False Negatives (FN): The model predicted the user did not Churn, but he/she stopped using Sparkify.

Let’s say our strategy is to send vouchers to people likely to Churn. Now we need to estimate the cost associated with each possible outcome:

  1. TP: We spent money on someone who was going to leave Sparkify, and some of them will probably want to stay after receiving the voucher, maintaining/increasing our revenue.
  2. TN: We did not spend money on someone who will stay on Sparkify, maintaining our revenue.
  3. FP: We spent money on someone who was going to stay on Sparkify anyway, maintaining our revenue and affecting our profit just a little.
  4. FN: We did not spend money on someone who decided to leave Sparkify. This is the worst-case scenario, since we never reached this customer. Maybe he/she will give Sparkify a bad review or convince family and friends to use another music streaming service.

It seems clear that we want to correctly identify users likely to Churn, and we also want to identify as many of them as possible. A metric that does this well is the Area Under the ROC Curve (usually just called AUC), which indicates how well the probabilities of the positive class are separated from those of the negative class (check this post on evaluation metrics).

Another advantage of this metric is that it deals well with class imbalance, which, from the plots above, we know is our case. Besides, it measures the quality of the model’s predictions irrespective of the classification threshold, unlike the F1 score or accuracy. This makes AUC a good starting metric, since we don’t know the costs associated with TNs and FNs yet.

Note: The models’ performance will be evaluated using Spark’s BinaryClassificationEvaluator, which defaults to AUC.

Now that we have our metric, we are ready to start modeling.

Modeling

Here we are going to answer our second and third questions:

2. Which model performs better? The RandomForest or the LogisticRegression?

3. Which features did the models give more importance to?

Let’s build our models’ pipelines:

Note: in the code block above you can see ParamGridBuilder. This instance is responsible for performing GridSearch for model refinement.

Now let’s start with the LogisticRegression:

Not bad!! An AUC of 0.91975. Let’s see which features it gave more importance to:

It seems that the LogisticRegression gave little or zero importance to features found to be good during the EDA, such as NextSong, avg(tbs), and gender. Therefore, even though it performed well, this model is not very robust, since it relies on a small subset of features.

Now let’s do the same for the RandomForest:

Amazing!! An AUC of 0.96605, way better than the LogisticRegression. Let’s see which features it gave more importance to:

The RandomForest gave more importance to time-related features, such as the max and avg time between sessions, and did not ignore any feature. Also, gender was not considered a powerful feature after all. This model can be considered more robust than the LogisticRegression, since it takes more features into account when making a prediction about Churn.

Possible Improvements

From the results we got, I can picture four possible improvements:

  1. Testing other models, such as SupportVectorMachines and GradientBoostingMachines.
  2. Dropping features, such as gender, that turned out not to be important.
  3. Engineering other features, such as more time-based features like average song length.
  4. Ensembling models, e.g. combining the LogisticRegression and RandomForest predictions.

Let’s drop gender and see how the model performs.

Wow! That was a big AUC drop.

A lesson was learned today: just because the feature was the least important, it does not mean it was not important.

Conclusion

Hooray!! If you think it was a long post to read, you should know it took me a whole day to write, but it was worth it!!

I know all the code and the analysis can be overwhelming, but here I gathered many powerful (and important) Spark features, so I think this post can be used as a reference for many future applications.

Finally, in this post I tried to answer the following questions:

  1. Are there powerful features to predict Churn?
  2. Which model performs better? The RandomForest or the LogisticRegression?
  3. Which features did the models give more importance to?

And the answers were:

  1. Yes, there are!!
  2. RandomForest performed better and it is more robust.
  3. The RandomForest gave more importance to time-related features, such as the max and avg time between sessions, and did not ignore any feature. The LogisticRegression, on the other hand, gave little or zero importance to features found to be good during the EDA, such as NextSong, avg(tbs), and gender.

Besides, a lesson was learned regarding feature importance: just because a feature is considered the least important, it does not mean it is not important.

I am still learning Spark, so if you want to give me some advice or feedback, here is my LinkedIn.

Also, I invite you to clone the GitHub repository and do some analysis yourself. Please tell me if you find any interesting results.

Thank you for your attention and I hope you liked it!!

References

  1. GitHub Repository
  2. Udacity’s Data Science Nanodegree
  3. Create a Custom PySpark Transformer
  4. Stratified Sampling With PySpark
  5. Evaluation Metrics

--


Luís Garavaso

Mechanical Engineering Undergraduate who loves Python and Data Science.