Project Definition
Project Overview
Sending offers to customers is a good strategy for attracting new customers and retaining existing ones. However, not all customers respond in the same way, and each customer has their own preferences about the types of offers they receive. One customer might be interested in saving money with discounts, another might prefer getting more items for less, while a third might only want to hear about what is new without caring about savings. This project, the capstone of the Udacity Data Scientist nanodegree program, focuses on analyzing the different offers provided by Starbucks and how customers respond to them.
Problem Statement
In this project I aimed to identify how Starbucks customers behave in response to the offers they are sent. This includes analyzing and grouping Starbucks customers so that customized offers can be provided that have a higher chance of attracting and retaining customers and influencing their future purchases. My strategy was to generate statistical analytics and draw conclusions about whom Starbucks should target. In addition, I implemented a classification model that predicts the likelihood of a user completing an offer. For this, I considered several common classification algorithms that are well suited to problems where data must be categorized into a given number of classes and a class assigned to any newly added data point.
Analysis
Data Exploration
Starbucks provided a dataset that simulates people's purchase patterns and the influence of offers on those purchases. Each customer action triggers an event, such as receiving an offer, viewing an offer, or making a purchase.
There are three types of offers, delivered through multiple channels, that a customer might receive: Buy-one-get-one free (BOGO), Discount, or Informational.
There are three datasets that we worked with (a loading sketch follows the list):
- portfolio: ten offers sent during a 30-day test period, with information such as the offer reward (money awarded for the amount spent), channels (web, email, mobile and social), difficulty (money that must be spent to receive the reward), duration the offer stays open in days, offer type and an id.
- profiles: 17,000 users in the rewards program, with information such as gender, age, id, membership date and income.
- transcript: 306,534 logged events such as receiving, viewing or completing an offer, or making a transaction. Each event also carries extra information such as the offer associated with it (if any), the amount of money spent in a “transaction”, the reward gained from an “offer completed” event, and the time in hours since the start of the test.
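For reference, the three datasets above can be loaded with pandas; this is a minimal sketch, assuming the line-delimited JSON file names that Udacity typically provides:

```python
import pandas as pd

# Assumed file names/paths; adjust to wherever the JSON files live.
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

print(portfolio.shape, profile.shape, transcript.shape)
```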
Before starting the data analysis, the data was cleaned up and prepared by performing the following steps:
portfolio dataset cleaning: This dataset had no missing values, but it still needed a few enhancements.
- The major change in this dataset was splitting the values in the channels column into four separate columns (mobile, email, web and social) holding 0 or 1, and then dropping the channels column.
- I also renamed the “id” column to “offer_id” for later merging with the other datasets (both steps are sketched below).
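A minimal sketch of these two steps, continuing from the loading snippet above and assuming the channels column holds lists such as ['web', 'email']:

```python
# Expand the list-valued channels column into four binary columns,
# then drop it and rename id -> offer_id for later merges.
for channel in ['web', 'email', 'mobile', 'social']:
    portfolio[channel] = portfolio['channels'].apply(lambda chs: int(channel in chs))

portfolio = portfolio.drop(columns='channels').rename(columns={'id': 'offer_id'})
```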
profiles dataset cleaning: This dataset had missing values in three columns (gender, income, and age, where a missing age was encoded as 118). Worth mentioning that whenever a value is missing in one of these columns, it is also missing in the other two. The changes I made to this dataset include the following:
- Replaced the values in the age column with age groups (YA, 20s, 30s, … etc.) for better analysis
- Replaced the became_member_on date value with the membership year only and renamed that column to “year”
- Filled the missing values with “NA” for gender and age, and with the mean value for income, which is approximately 65,404.99
- Renamed the “id” column to “person” for later merging with the other dataframes (a sketch of these steps follows)
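A rough sketch of the profiles clean-up; the exact age-group bin edges and the YYYYMMDD form of became_member_on are my assumptions:

```python
import pandas as pd

def age_group(age):
    """Map a raw age to a coarse group; 118 encodes a missing age (bin edges assumed)."""
    if age == 118:
        return 'NA'
    if age < 18:
        return 'children'
    if age < 20:
        return 'YA'
    return '80+' if age >= 80 else f'{(age // 10) * 10}s'

profile['age'] = profile['age'].apply(age_group)
profile['year'] = pd.to_datetime(profile['became_member_on'].astype(str),
                                 format='%Y%m%d').dt.year
profile = profile.drop(columns='became_member_on')
profile['gender'] = profile['gender'].fillna('NA')
profile['income'] = profile['income'].fillna(profile['income'].mean())
profile = profile.rename(columns={'id': 'person'})
```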
transcript dataset cleaning: This dataset had no missing values, but I needed to extract the contents of the “value” column and create a new column for each value type. A few notes about my work here:
- During this process I needed to make a few extra changes, since the value column held objects with the keys (offer id, offer_id, amount, reward). Because “offer id” and “offer_id” refer to the same thing, I merged them into a single column (offer_id).
- Each record has only one of these keys, which leaves null values in the other columns. I filled the null values for offer_reward and amount with 0.
- Renamed the “reward” column to “gained_reward” to avoid confusion later with the portfolio reward column (see the sketch after this list).
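A sketch of the transcript expansion described above (the exact helper used in the notebook may differ):

```python
import pandas as pd

# Expand the `value` dictionaries into their own columns.
values = pd.json_normalize(transcript['value'].tolist())

# 'offer id' (received/viewed events) and 'offer_id' (completed events)
# refer to the same thing, so collapse them into one column.
values['offer_id'] = values['offer_id'].fillna(values['offer id'])
values = values.drop(columns='offer id')

transcript = pd.concat([transcript.drop(columns='value'), values], axis=1)
transcript[['reward', 'amount']] = transcript[['reward', 'amount']].fillna(0)
transcript = transcript.rename(columns={'reward': 'gained_reward'})
```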
Data Analysis and Visualization
In this section, we go through two phases of statistical data analysis. First, a general analysis aimed at improving our understanding of the customers' characteristics (or features). Then, in the second phase, we try to infer their preferences based on their behavior during the experiment.
Phase 1: General Exploratory Data Analysis
This part provides an overview of Starbucks customers, analyzing the customer base and identifying the groups to which the majority of customers belong. Here, we are interested in the main customer characteristics, which are gender, income, age and membership, and in the relationships between them.
First we examined each characteristic separately, starting with gender: males represented almost half of the customers, 36% were females, and the rest identified as Other or preferred not to say.
As for the age groups, we surprisingly found that the majority of customers were either in their 50s or 60s, followed by customers in their 40s. The group of customers who did not provide an age was larger than any of the remaining age groups.
The income information also provided good indicators: most customers' incomes range between 50,000 and 75,000. More customers earn below 50,000 than above 75,000, and only a very small number of customers earn more than 100,000.
The last individual characteristic was the membership year, where we saw that many customers joined the program in 2017. It seems that 2017 had a good advertising strategy, as the number of new customers almost doubled compared to 2016.
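These per-feature breakdowns can be reproduced with simple counts on the cleaned profiles dataframe; the income bin edges below mirror the ranges discussed and are otherwise my own choice:

```python
import pandas as pd

gender_share = profile['gender'].value_counts(normalize=True)   # share of M / F / O / NA
age_counts = profile['age'].value_counts()                      # customers per age group
members_per_year = profile['year'].value_counts().sort_index()  # joins per membership year

income_bins = pd.cut(profile['income'],
                     bins=[0, 50_000, 75_000, 100_000, float('inf')],
                     labels=['<50k', '50k-75k', '75k-100k', '100k+'])
income_counts = income_bins.value_counts().sort_index()
```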
Next, we looked at the relationships between the main characteristics above in order to find better indicators of who Starbucks customers are. From this information, Starbucks can take either or both of these steps:
- Knowing who is most interested in the rewards/offers program, Starbucks can reach out to more people in those categories.
- Knowing the underrepresented groups, Starbucks can investigate why they are not as interested and how the program could be improved to increase their numbers.
For example, we noticed in these two figures that in each age group the number of males exceeded the number of females, except for customers aged 80 and older. The gap is large for younger customers (young adults to 40s) compared to older ones.
Next, we clearly saw the difference in income across the age groups: customers in their 50s and older had higher incomes than the younger groups.
And finally, we noticed that a higher percentage of males earn between 0 and 80,000, while a higher percentage of females earn 80,000 or more. However, the first income group contains far more customers than the second.
Phase 2: Detailed Statistical Data Analysis
This part provides a detailed analysis of Starbucks customers' behavior toward the program in relation to different factors.
We started by examining, for a set of 40 random customers, the different events they were involved in during the experiment. From the figure below, we can observe that:
- Some people never completed an offer even though they made transactions
- In general, many customers who received offers viewed them
- People with more transactions are more likely to complete offers
After that, we looked at which types of offers got more views and fulfillments. Even though people were more interested in viewing BOGO offers than other offers, they completed (or used) discount offers more often. However, this comparison is not entirely accurate, because people might complete offers without ever viewing them, and those completions still count toward the total number of completed offers.
From these figures, we also noticed that many customers who received informational offers were interested in viewing them.
Then, we looked at which three offers were the most popular (i.e., had the most completion events) and compared their attributes, such as offer type and channels. From the figure and the table, we can conclude that discount offers sent via all channels with a longer duration are more popular, even when the reward is lower and the difficulty is higher.
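A sketch of how the view/completion comparison and the top three offers can be derived from the cleaned dataframes (column names follow the cleaning steps above):

```python
# Count received / viewed / completed events per offer type.
events = transcript.merge(portfolio, on='offer_id', how='left')
by_type = (events[events['event'] != 'transaction']
           .groupby(['offer_type', 'event'])
           .size()
           .unstack(fill_value=0))
print(by_type)

# The three most-completed offers, with their attributes.
top3 = (events[events['event'] == 'offer completed']['offer_id']
        .value_counts()
        .head(3))
print(portfolio[portfolio['offer_id'].isin(top3.index)])
```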
Next, we compared two sets of customers:
- Top 30 spenders: the customers who paid more in general for Starbucks products
- Top 30 respondents: the customers who used and completed more offers
Top spenders and top respondents are mostly in their 50s or 60s. People in their 30s and 40s make up a mid-sized share of both sets. People in their 70s are more likely to spend more, while people in their 20s are more likely to use offers. We also saw that the top spenders had slightly more males (53%) than females (47%), while the top respondents showed the exact opposite split.
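The two sets can be built roughly as follows (a sketch; the notebook's exact selection logic may differ):

```python
# Top 30 spenders: customers with the highest total transaction amounts.
top_spenders = (transcript[transcript['event'] == 'transaction']
                .groupby('person')['amount'].sum()
                .nlargest(30))

# Top 30 respondents: customers with the most completed offers.
top_respondents = (transcript[transcript['event'] == 'offer completed']
                   .groupby('person').size()
                   .nlargest(30))

# Join back to the profiles to inspect age group, gender and income of each set.
print(profile[profile['person'].isin(top_spenders.index)][['age', 'gender', 'income']])
print(profile[profile['person'].isin(top_respondents.index)][['age', 'gender', 'income']])
```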
Finally, we drilled down into each category to identify which offers they responded to and which event they triggered for each. These figures show, for each age group and each gender, which offer types they received, viewed or completed.
From this we can say that:
- People are more likely to view BOGO offers but more likely to use discount offers. This holds across all age groups and genders.
- People are interested in viewing the informational offers
Building Models
In this part I built a model to predict whether a new customer would view and then complete an offer, based on multiple factors. I started by preparing the data (a sketch follows the list below):
- Merged all dataframes into one main dataframe.
- Created binary columns for the categorical columns such as offer_type, gender, event and age.
- Created a new dataframe that consists only of the records of customers who viewed then completed the offers.
- Dropped unneeded columns (offer_type, gender, year, event, gained_reward, person, offer_id, time, offer received, offer viewed, age)
- Checked for null values and filled them if any.
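A sketch of this preparation, reusing the cleaned dataframes from earlier; the viewed-then-completed labelling and the removal of transaction rows are handled in the notebook and not reproduced here:

```python
import pandas as pd

# Merge everything into one main dataframe.
df = (transcript
      .merge(portfolio, on='offer_id', how='left')
      .merge(profile, on='person', how='left'))

# One-hot encode the categorical columns.
df = pd.concat([df,
                pd.get_dummies(df['gender']),
                pd.get_dummies(df['offer_type']),
                pd.get_dummies(df['age']),
                pd.get_dummies(df['event'])], axis=1)

# Drop columns that are not model inputs and fill any remaining gaps.
df = df.drop(columns=['offer_type', 'gender', 'year', 'event', 'gained_reward',
                      'person', 'offer_id', 'time', 'offer received',
                      'offer viewed', 'age'])
df = df.fillna(0)
```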
The input data frame to the model will have the following columns:
amount, income, reward, difficulty, duration, mobile, email, web, social, bogo, discount, informational, F, M, NA, O, children, YA, 20s, 30s, 40s, 50s, 60s, 70s, 80+, NA and offer completed.
I created a function that expects a classifier and a dataframe. It extracts the features and target, splits the data into training and testing sets, trains the model, predicts and finally scores the model. I then ran multiple classifiers (AdaBoostClassifier, GaussianNB, DecisionTreeClassifier and RandomForestClassifier) on the final dataframe. Here is an outline of the models' implementation:
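This is a sketch of that function rather than the exact notebook code; the name train_and_score, the 70/30 split and the random seed are my assumptions:

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def train_and_score(clf, df, target='offer completed'):
    """Split the dataframe, fit the classifier and report train/test accuracy."""
    X = df.drop(columns=target)
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=42)
    print(f'Training with {clf}...')
    clf.fit(X_train, y_train)
    print('Training Accuracy:', accuracy_score(y_train, clf.predict(X_train)))
    print('Testing Accuracy :', accuracy_score(y_test, clf.predict(X_test)))
    return clf

for clf in [AdaBoostClassifier(), GaussianNB(),
            DecisionTreeClassifier(), RandomForestClassifier()]:
    train_and_score(clf, df)
```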
After running this code, I found that AdaBoost performed best, and I selected it for further improvement. More details about why I selected it and how I improved it are discussed in the next section.
Worth mentioning that this part of the project was in fact a bit challenging for me, as I had not acquired any advanced machine learning knowledge before this Udacity course. In addition, implementing models is both resource and time consuming and requires a great deal of patience.
Evaluation
Metrics
To evaluate the different models in this project, I used accuracy_score as the metric. According to the scikit-learn documentation, accuracy_score is a function that returns the fraction of correctly classified samples.
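Concretely, accuracy is just the number of correct predictions divided by the total number of predictions, e.g.:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))  # 3 of 4 correct -> 0.75
```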
After running all four classifiers, I got the following accuracy scores for each:
Training with AdaBoostClassifier...
Training Accuracy: 0.8097855632739354
Testing Accuracy : 0.8087702343561247

Training with GaussianNB...
Training Accuracy: 0.7738447598912715
Testing Accuracy : 0.7712611741966658

Training with DecisionTreeClassifier...
Training Accuracy: 0.8507198228128461
Testing Accuracy : 0.7831601836192317

Training with RandomForestClassifier...
Training Accuracy: 0.8506996879089902
Testing Accuracy : 0.7949383909156801
Based on these scores, I decided that AdaBoostClassifier is the best fit for our goal, as its training and testing scores are stable compared to the other models. The other models' scores indicate that we might be facing an overfitting issue.
Model Improvement
Once I decided to go with AdaBoost, I ran a grid search to find the optimal parameters. When I first compared the models, I had run the classifier with its default parameters, which are:
- n_estimators=50
- learning_rate=1.0
- algorithm='SAMME.R'
I tried running the grid search with many combinations of parameters, but it kept returning the defaults as the best parameters. I then decided to use values very close to the defaults to see whether they would yield a better accuracy. I ran the search with the following values (sketched below):
- n_estimators=40, 45, 50, 55, 60
- learning_rate=0.8, 0.9, 1, 1.1, 1.2
- algorithm='SAMME', 'SAMME.R'
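A sketch of this second search (the cross-validation settings and the reuse of the earlier train/test split are my assumptions):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X = df.drop(columns='offer completed')
y = df['offer completed']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Note: 'SAMME.R' exists only in older scikit-learn releases.
param_grid = {
    'n_estimators': [40, 45, 50, 55, 60],
    'learning_rate': [0.8, 0.9, 1, 1.1, 1.2],
    'algorithm': ['SAMME', 'SAMME.R'],
}
grid = GridSearchCV(AdaBoostClassifier(), param_grid, scoring='accuracy', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
```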
This time the result was actually different. The grid search reported the following best parameters:
- n_estimators=45
- learning_rate=0.9
- algorithm='SAMME.R'
I then retrained the model with the new parameters and got a very minor accuracy improvement:
Training with AdaBoostClassifier(learning_rate=0.9, n_estimators=45)...
Training Accuracy: 0.8101479915433404
Testing Accuracy : 0.8090722396714182
Results
- In the modeling part of this project, I tried multiple algorithms: AdaBoostClassifier, GaussianNB, DecisionTreeClassifier and RandomForestClassifier.
- I found that AdaBoost was the best fit for our case based on its accuracy results.
- After running the grid search, the test accuracy improved only marginally (from about 0.8088 to 0.8091).
Conclusion
Reflection
To close this article, I would like to highlight the following observations and conclusions about our problem, “which Starbucks customers should we send offers to?”:
- People in their 50s and 60s are more likely to respond to offers
- Even though people might be more excited to view BOGO offers than discount offers, the discount offers have a higher chance of being used; it seems people are more interested in paying less than in getting twice the number of items.
- The behavior of males and females is almost the same. The slight differences in the results come from the fact that males represent a larger portion of the dataset. Because of this, I would suggest sending offers to all users regardless of gender.
- Offers sent through all channels with longer durations have a higher chance of being used, even if the reward is lower and the difficulty is higher.
- A model is available to predict users' behavior regarding Starbucks offers with an accuracy of about 80%.
This was a very interesting project. I particularly enjoyed examining the intricate statistical relationships between the customers' features and how, combined, they provide insight into the overall customer base. I also feel that this project strengthened my understanding of data modeling and classification techniques, which I hope to keep developing to solve other complex real-world problems.
Improvements
I think a larger dataset with more records would provide better analytics and lead to an improved future strategy. In addition, having more customer characteristics beyond age, gender, income, etc. would further benefit our goal.
I would also suggest building a machine learning pipeline to classify new data instantly, and creating and publishing a web/mobile application based on the model.
Resources
- Banner image from https://www.youtube.com/watch?v=BcUVLce3TOU
- Data from Starbucks via Udacity: https://www.udacity.com
- Project GitHub repository: https://github.com/wandki/DSND-Capstone-Porject-Starbucks
- accuracy_score https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html