Churn analytics in the technology and telecom industries using Google Vertex AI
Oct 26, 2022 • 14 min read
A company’s ability to generate a steady flow of new customers, and retain existing customers at the same time, can decisively determine whether a business is successful or not. Customer retention is particularly important because the cost of user acquisition is much higher than the cost of user retention for most businesses. A good retention strategy generally requires understanding why customers decide to stay or churn, what characterizes the churners, how potential churners can be identified, what can be done to change their decisions, and how all these insights can be efficiently operationalized. In this blog post, we develop a reference churn analytics pipeline that helps to evaluate the churn risk for individual users and recommends personalized churn treatment plans that can be executed by marketing teams. We start with a brief discussion of how churn can be defined and measured, then introduce the reference solution architecture that includes several ML models, and finally we show how to implement these models using the AutoML capabilities provided by Vertex AI.
Churn scenarios
The concept of churn is applicable to any business that assumes regular interactions between users and services. At a high level, churn can be defined as the absence of expected interaction events over a certain period of time. This definition requires us to specify the interaction events that need to be tracked, as well as the business logic for detecting churn based on these records. This task is not trivial, and both parts depend heavily on the business model at hand.
The first common scenario is a subscription-based business model where the client is obligated to pay a fixed or consumption-dependent fee for each time interval. This model is depicted in the figure below. In this scenario, churn can be defined as the absence of a payment during even a single time interval.
The alternative scenario is a transactional model that does not assume recurring payment or interaction commitments. In the transactional model, churn can be defined as a time gap between two successive interaction events that exceeds a certain threshold, as shown in the figure below. For example, we may observe this scenario in the ad tech industry, where a user is considered to be a churner if they do not interact with an ad during a certain time interval.
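As a rough illustration of the transactional definition, the sketch below flags a user as churned when the time elapsed since their most recent interaction exceeds a threshold. The function name `is_churned`, the 30-day default, and the rule that a user with no events counts as churned are assumptions made for this example only.

```python
from datetime import date, timedelta


def is_churned(event_dates, as_of, max_gap_days=30):
    """Return True if the gap since the last interaction exceeds the threshold.

    event_dates: list of datetime.date objects, one per interaction event.
    as_of: the date on which churn status is evaluated.
    max_gap_days: hypothetical threshold; it should come from the business model.
    """
    if not event_dates:
        return True  # assumption: a user with no recorded events is treated as churned
    last_event = max(event_dates)
    return (as_of - last_event) > timedelta(days=max_gap_days)


events = [date(2022, 6, 1), date(2022, 6, 20), date(2022, 7, 5)]
print(is_churned(events, as_of=date(2022, 7, 25)))  # 20 days since the last event
print(is_churned(events, as_of=date(2022, 9, 1)))   # 58 days since the last event
```

In practice, the threshold is usually chosen by inspecting the distribution of gaps between successive events for users who are known to have returned.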
Finally, churn can be defined using undesirable events such as account cancellation, a bad review, product return, or service downgrade. In this blog post, we focus on the subscription scenario with variable monthly payment. This model is most typical for software, gaming, and telecom services.
Measuring churn
Once we have the churn conditions defined, the next logical step is to evaluate whether the business has a churn problem, followed by measuring its severity. This assessment will drive all further churn analytics and prevention activities. For example, it is sometimes recommended that a B2C business should have at least 100 instances of churn for a specific cohort of users or products to start churn analytics activities, and until this threshold is reached, the primary focus should be on user acquisition.
The most basic measure of churn is the $churn \quad rate$, that is, the percentage of users who churned from a given cohort during one subscription interval:
$churn \quad rate = (n_{start} - n_{end}) / n_{start}$
where $n_{start}$ is the number of users at the beginning of the interval and $n_{end}$ is the number of retained (non-churned) users at the end of the interval. There are different variations of this equation. For example, the number of users in the denominator can be averaged over the subscription interval. The shortcoming of the basic churn rate metric is that it does not tell us how much revenue we lost due to the churners. We can overcome this limitation by introducing a new measure that incorporates the revenue considerations. For example, the revenue churn rate can be defined as the percentage of expected recurring revenue lost during the time interval:
$revenue \quad churn \quad rate = (revenue_{start} - revenue_{end}) / revenue_{start}$
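Both metrics share the same form, so they can be computed with one small helper. The sketch below is only an illustration: the function name `loss_rate` and the sample numbers are made up for the example.

```python
def loss_rate(start, end):
    """Fraction of the starting quantity (users or revenue) lost over the interval."""
    if start <= 0:
        raise ValueError("the starting value must be positive")
    return (start - end) / start


# Hypothetical cohort: 1000 users at the start of the month, 950 retained at the end.
user_churn = loss_rate(1000, 950)

# Hypothetical recurring revenue for the same cohort at the start and end of the month.
revenue_churn = loss_rate(50_000.0, 46_000.0)

print(f"user churn rate: {user_churn:.1%}")
print(f"revenue churn rate: {revenue_churn:.1%}")
```

In this made-up case, the revenue churn rate is higher than the user churn rate, which would suggest that the users who left were paying more than average.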
The user churn and revenue churn rates are usually broken down by user cohorts and products to provide better insights into business health. For example, a company can separately track churn rates for customers acquired a year ago, two years ago, and so on, to monitor how the quality of the customer base changes over time. An increasing churn rate over time can be a major issue even if the absolute value of the churn rate stays below the acceptable level. Defining churn metrics that clearly quantify the dynamics of churn and its impact on business health is extremely important, and it is common to develop customized metrics.
Once we have specified the measure of churn and assessed the associated business impact, we are able to predict churn more effectively and help reduce churn rates.
Preparing the dataset
For prototyping purposes, we generate a synthetic dataset that emulates the main properties of the user account data of a real-world telecom company. In this dataset, each user is represented by a time series with three signals: phone usage duration, internet usage duration, and number of service calls per month. We also generate account attributes including social demographic data, and information about contract and payment. Finally, we augment the user data with email texts to demonstrate how churn signals can be extracted from textual data.
In our dataset, we assume that the churners exhibit three dynamic behavioral patterns (shown in the figure below): increasing usage, decreasing usage, and interrupted usage. Increasing usage can be attributed to users who are spending their free or unused resources (in this case, minutes or gigabytes of data) before churning. Decreasing usage can be attributed to users who find the service to be too expensive. Finally, users who have problems setting up their equipment, or who experience recurring interruption of service, may decide to churn as well. We also include additional signals such as the increasing number of customer support calls for high-risk users, and add noise by mixing normal and churning behavior patterns for a certain percentage of users.
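The three behavioral patterns can be emulated with a very small generator. The sketch below is not the generator used for the dataset in this post; the function name `usage_series`, the 5% monthly slope, the every-fourth-month outage, and the Gaussian noise level are all assumptions chosen for illustration.

```python
import random


def usage_series(pattern, months=12, base=100.0, noise=5.0, seed=0):
    """Generate a monthly usage series for one of the assumed behavior patterns."""
    rng = random.Random(seed)  # seeded so that the example is reproducible
    series = []
    for t in range(months):
        if pattern == "increasing":
            level = base * (1.0 + 0.05 * t)       # usage grows 5% of base per month
        elif pattern == "decreasing":
            level = base * (1.0 - 0.05 * t)       # usage shrinks 5% of base per month
        elif pattern == "interrupted":
            level = 0.0 if t % 4 == 3 else base   # every fourth month has no usage
        elif pattern == "normal":
            level = base                          # stable usage for non-churners
        else:
            raise ValueError(f"unknown pattern: {pattern}")
        # Add noise and clip at zero because usage cannot be negative.
        series.append(max(0.0, level + rng.gauss(0.0, noise)))
    return series


for name in ("increasing", "decreasing", "interrupted", "normal"):
    print(name, [round(v) for v in usage_series(name)])
```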
We show the relationship between the churn rate and certain user cohorts in the figure below. These charts indicate that, in our dataset, the churn rate is positively correlated with the number of calls and negatively correlated with cohort tenure. This behavior is expected in many domains. For example, the higher number of service calls can indicate some problem that users face before they decide to churn. The relationship between the churn rate and the other two metrics (internet and phone usage) is not easily interpretable using the basic charts, although there are persistent patterns across the cohorts. We will analyze these behavioral patterns using advanced methods in the next sections.
Finally, for the sake of specificity, we generated three types of treatments: upgrade of package, discount, and free devices. We simulate the decision of how specific treatments are matched with users by implementing basic rules that use churn metrics. This mimics the real-world marketing process of companies that use targeting rules hand-crafted by churn prevention experts.
Architecture of the churn analytics pipeline
Churn analytics is not just a prediction model or collection of metrics. It is a comprehensive process of analyzing user behavior, evaluating potential actions and outcomes, and making actionable decisions. For the case in this blog post, we implement a simplified version of such a pipeline that includes four major parts, as shown in the figure below: data consolidation, churn risk evaluation, churn behavior analytics, and a decision making layer.
Our primary focus is the churn risk evaluation models and downstream decision optimization. However, we also discuss and implement the basic sentiment analysis model that demonstrates how the main pipeline can be augmented with natural language processing (NLP) capabilities to enable the extraction of churn triggers from call transcripts and other user-generated content.
Short-term churn prediction model
The architecture presented in the previous section includes the event model. The main purpose of this model is to detect users who have already made a decision to churn; that is, the users whose next interaction event will be the churn event. This model is typically more accurate than the longer-term lifetime model discussed in the next section because the predictions are made just for the immediate user action. Therefore, the model deals with less noise during the process of mapping recent behavioral patterns with users’ final decisions.
We build the event model using the AutoML Tabular Classification service [1]. The model is designed to consume the following information: most recent events from user history (phone duration, internet duration, and number of service calls in the last month), average statistics of monthly payments, tenure, sociodemographic data, and contract information. For each user, the model estimates the churn probability. This design is depicted in the figure below.
We use the $area \quad under \quad the \quad precision-recall \quad curve$ (‘maximize-au-prc’) objective function, which is an appropriate choice for imbalanced datasets where we need to optimize the results for the less common classes (in this case, churners). Alternatively, we can target specific precision or recall points using ‘maximize-recall-at-precision’ or ‘maximize-precision-at-recall’ objectives, respectively. The training job produces the following metrics:
- Area under precision-recall curve: 0.99
- Area under ROC curve: 0.99
- Log-loss: 0.12
- Precision: 0.9
- Recall: 0.87
We get good results for both precision (0.9) and recall (0.87). In many real-world environments, however, it might not be possible to attain such results, and we would need to search for a trade-off between these two metrics. The trade-off optimization generally requires knowing the cost of inaccurate classification. In other words, we need to construct or estimate the cost matrix. For example, let us assume that the cost of a retention offer (treatment) is 10, the probability of the offer acceptance is 30%, and the cost of replacing a churned user with a new one is 40. An accepted offer thus saves 40 - 10 = 30, so the expected cost of a false negative is (40 - 10) * 0.3 = 9, which represents the average loss associated with each non-detected churner. On the other hand, the expected cost of a false positive is 10 for each non-churner that is incorrectly flagged as a churner.
Once the cost matrix is defined, we are able to search through different thresholds and find the one that provides the best trade-off between false positive and false negative rates. The process of finding the optimal threshold for the treatment is illustrated in the figure below.
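A minimal sketch of this threshold search is shown below. It uses the cost figures from the example above (10 per false positive, 9 per false negative); the function names, the candidate grid, and the toy scores and labels are assumptions made for illustration.

```python
def expected_cost(scores, labels, threshold, cost_fp=10.0, cost_fn=9.0):
    """Total misclassification cost when users with score >= threshold are treated."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def best_threshold(scores, labels, candidates=None):
    """Return the candidate threshold with the lowest total cost (first one on ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return min(candidates, key=lambda t: expected_cost(scores, labels, t))


# Toy churn probabilities and true labels (1 = churner); hypothetical values.
scores = [0.05, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.95]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

threshold = best_threshold(scores, labels)
print(threshold, expected_cost(scores, labels, threshold))
```

On real data, the search would be run on a held-out validation set so that the chosen threshold is not tuned to the training examples.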
Long-term churn prediction model
The second model in the pipeline quantifies how the churn risk evolves over time for a given user, so we can determine the optimal treatment time. Problems related to the analysis of the expected duration of time until an event are extensively studied in a dedicated branch of statistics known as survival analysis, and there are many specialized models and algorithms developed specifically for this purpose. In this section, we discuss how this type of analysis can be performed using generic components provided by Vertex AI AutoML services.
We build the lifetime model that estimates how the churn probability evolves within the time horizon of six months. This can be done using both the AutoML Tabular and AutoML Forecasting services, and we provide two separate implementations in the reference notebook.
Let us examine the AutoML Tabular implementation first. The inputs for the model are designed as follows (also, see the figure below):
- Duplicate each row as many times as there are months in the time horizon (in our case, six).
- Add lag variables that cover at least the length of the time horizon. The lag variables include features such as phone usage, internet usage, and number of customer support calls.
- Add a “time horizon” column populated with offsets within the time horizon.
- For each row, set the target value to the event of the future month that the model needs to predict, as shown in the figure.
The model also utilizes other attributes used in the event model, including gender, payment method, and duration of contract. The model is a binary classification model that predicts risks of churn for various time horizons, as shown in the figure below.
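The row construction described above can be sketched in plain Python as follows. This is not the code from the reference notebook: the function name, the column names, and the choice to label each row with the churn flag of the exact future month (rather than a cumulative flag) are assumptions made for this example.

```python
def build_lifetime_rows(user_id, phone, internet, calls, churn_flags,
                        cutoff, horizon=6, n_lags=6):
    """Expand one user's history into `horizon` training rows.

    phone, internet, calls: monthly values, index 0 is the first observed month.
    churn_flags: monthly 0/1 churn labels with the same indexing.
    cutoff: index of the first future month; months before it feed the lag features.
    """
    if cutoff < n_lags or cutoff + horizon > len(churn_flags):
        raise ValueError("not enough history or future months for this cutoff")

    # Lag features are shared by all rows generated for this user.
    lag_features = {}
    for lag in range(1, n_lags + 1):
        lag_features[f"phone_lag_{lag}"] = phone[cutoff - lag]
        lag_features[f"internet_lag_{lag}"] = internet[cutoff - lag]
        lag_features[f"calls_lag_{lag}"] = calls[cutoff - lag]

    rows = []
    for h in range(1, horizon + 1):
        row = {"user_id": user_id, "time_horizon": h}
        row.update(lag_features)
        # Target: the churn flag observed h months after the end of the history.
        row["target"] = churn_flags[cutoff + h - 1]
        rows.append(row)
    return rows


months = 12
rows = build_lifetime_rows(
    user_id="u1",
    phone=[100 + 5 * m for m in range(months)],
    internet=[20 + m for m in range(months)],
    calls=[0, 0, 1, 0, 1, 2, 2, 3, 3, 4, 0, 0],
    churn_flags=[0] * 10 + [1, 1],
    cutoff=6,
)
for row in rows:
    print(row["time_horizon"], row["phone_lag_1"], row["target"])
```

Static attributes such as payment method or contract duration would simply be copied into every row alongside the lag features.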
The training job achieves the following metrics for the above design:
- Area under precision-recall curve: 0.97
- Area under ROC curve: 0.97
- Log-loss: 0.22
- Precision: 0.74
- Recall: 0.53
These results are clearly worse than the precision and recall obtained for the event model. The reason is that the model is trained to predict churn over a longer time horizon, so its inputs are more imbalanced and noisy than the inputs of the short-term model. Moreover, for a time horizon longer than one month, the model cannot pick up the behavioral signals that precede the decision of the user to churn. We should generally interpret the outputs of this model as potential risk estimates rather than a strict classification.
The charts below show the outputs of the lifetime model for one user drawn from the test dataset. The survival chart (upper left figure) shows that the user is at higher risk of churning after six months. Moreover, a closer inspection of the user’s behavior during the prediction period (this information is not present during the inference time) reveals the increasing usage pattern discussed in the previous sections. This technique can be used to better understand the behavior of churners and identify temporal patterns that indicate high or low churn risk.
The alternative approach is to use the AutoML Forecasting service [2]. The time series forecasting formulation is a convenient choice for the long-term churn prediction problem because both inputs (past interaction events) and outputs (future churn probabilities) can be viewed as time series. However, the AutoML Forecasting service is designed to solve regression problems where the output values are real values, not probabilities. We work around this problem by normalizing the regression outputs into the probabilities using the sigmoid function (see the reference notebook for details). In our example, the forecasting model outperforms the classification one, but both approaches are generally valid and worth evaluating.
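The normalization step can be illustrated with a plain sigmoid applied to the raw regression outputs. The sketch below is an assumption about how such a mapping might look, not the code from the reference notebook, which may scale or shift the values before squashing them.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function that maps any real value into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Hypothetical raw forecasts for the next four months of one user.
raw_forecast = [-2.0, -0.5, 0.0, 1.5]
churn_probabilities = [sigmoid(v) for v in raw_forecast]
print([round(p, 3) for p in churn_probabilities])
```

Because the sigmoid is monotonic, the ranking of users and months by risk is preserved; only the scale of the scores changes.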
Once the long-term risk values are estimated, we can determine the optimal treatment times for individual users. One possible approach is to set the thresholds using cost matrices like we did in the previous section. Another approach is to select the month with the highest uptick of the risk score.
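The second approach can be sketched as a search for the largest month-over-month increase in the risk curve. The function name and the sample risk values below are hypothetical, and whether the treatment should be applied in that month or shortly before it remains a business decision.

```python
def month_of_largest_uptick(risk_by_month):
    """Return the 1-based month whose risk rose the most relative to the previous month."""
    if len(risk_by_month) < 2:
        raise ValueError("at least two monthly risk values are required")
    increases = [
        risk_by_month[i] - risk_by_month[i - 1]
        for i in range(1, len(risk_by_month))
    ]
    best = max(range(len(increases)), key=increases.__getitem__)
    # increases[0] describes the change from month 1 to month 2, hence the offset of 2.
    return best + 2


# Hypothetical six-month risk scores for one user.
risk = [0.05, 0.07, 0.10, 0.30, 0.35, 0.40]
print(month_of_largest_uptick(risk))
```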
Uplift model
The models described in the previous sections provide scores and insights that can help marketing teams make better decisions. In practice, marketing teams often have challenges with reconciling multiple scores and converting insights into actionable decisions. In this section, we develop the third component of our churn risk evaluation pipeline, called the uplift model, that aims to address this challenge.
The uplift model is designed to evaluate the impact of potential churn prevention actions (conditional uplift) and provide marketing teams with personalized treatment recommendations. The uplift model can also be integrated with online systems and marketing automation software to automatically personalize web content and offers.
The main idea of uplift modeling is to evaluate the difference between the churn probabilities conditioned on a specific treatment and the no-treatment baseline:
$uplift(treatment \quad x) = p(churn | treatment \quad x) - p(churn | no \quad treatment)$
The practical implementation of this approach is usually associated with two challenges: selection bias and high cardinality of the treatment set. The selection bias problem arises because the historical data are typically collected under some treatment targeting rules, not randomized experiments. In many cases, this problem can be addressed using bias correction techniques (see [3] for a comprehensive overview and [4] for a real-world case study). In some cases, the bias in the historical data cannot be corrected using analytical methods, and we either need to collect more data using randomized experiments or deem the uplift modeling approach infeasible, and discard it.
The second problem comes from the fact that companies usually have a large number of treatment variants, and it is infeasible to evaluate each of them separately. This problem can often be alleviated by introducing some kind of hierarchy that allows grouping similar treatments together, reducing the cardinality of the action space. We have only three treatments (discount, package upgrade, and free device) in our synthetic dataset, so we do not discuss this problem further in this blog post.
The design of the uplift model is similar to the event model, but we include the treatment type as an input feature and use the after-treatment effect as the target label. This design is presented in the figure below. The same approach can be used to extend the lifetime model, so that both the treatment type and its timing can be prescribed.
Once the model is trained, we do a sanity check and examine the importance of the treatment feature. This step is implemented using Vertex Explainable AI. The feature attribution outputs produced by Vertex Explainable AI reveal that the treatment is one of the most important features, beside tenure and internet usage. This means that the model is sensitive enough and can be used for further uplift analyses.
The uplift scores enable us to filter out the users with high risk of churning (lost causes) and users that have a high chance of retention without any treatment (false positive cases) in the system. As illustrated in the following waterfall graph, the main targets are the users that have a relatively low probability of retention without any treatment, and relatively high probability when the treatment is provided. After the users are selected, each treatment is compared with the baseline and a final treatment recommendation is made.
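The selection logic can be sketched as follows, assuming that the uplift model has already produced a churn probability for each treatment and for the no-treatment baseline. The function name, the 0.05 minimum reduction, and the sample probabilities are assumptions for illustration. With the uplift definition used in this post, a beneficial treatment has a negative uplift, because it lowers the churn probability.

```python
def recommend_treatment(churn_probs, baseline_key="none", min_reduction=0.05):
    """Pick the treatment with the largest reduction in churn probability.

    churn_probs: dict mapping a treatment name (and the baseline) to p(churn).
    Returns (treatment or None, uplift per treatment).
    """
    baseline = churn_probs[baseline_key]
    uplift = {
        t: p - baseline for t, p in churn_probs.items() if t != baseline_key
    }
    best = min(uplift, key=uplift.get)  # most negative uplift = largest reduction
    if uplift[best] > -min_reduction:
        return None, uplift  # no treatment changes the outcome enough
    return best, uplift


persuadable = {"none": 0.60, "discount": 0.35, "upgrade": 0.50, "device": 0.55}
loyal = {"none": 0.05, "discount": 0.04, "upgrade": 0.05, "device": 0.04}
lost_cause = {"none": 0.90, "discount": 0.89, "upgrade": 0.90, "device": 0.88}

for name, probs in [("persuadable", persuadable), ("loyal", loyal),
                    ("lost cause", lost_cause)]:
    print(name, recommend_treatment(probs)[0])
```

Users for whom no treatment is returned correspond to the two filtered groups described above: those likely to stay anyway and those unlikely to be retained by any offer.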
Optimization engine
The uplift model can be used to optimize treatment decisions at the level of individual users. However, we usually need to optimize not only the user-level decisions, but entire churn prevention campaigns and programs that can be the subject of budgetary and other business constraints. Performing such an optimization can be a separate complex problem because treatment decisions become interdependent. In this section, we augment our pipeline with an additional optimization component that implements campaign-level optimization.
The campaign-level optimization can usually be represented as an integer programming (IP) or mixed integer programming (MIP) problem. In the most basic case, we can use the following formulation for the budget-constrained optimization:
maximize $\sum\limits_{i}\sum\limits_{j} x_{j,i} \times p_{j,i}$
subject to
$\sum\limits_{i} x_{j,i} c_{j} \leq b_{j}$ for $j$ in {device, discount, upgrade, none}
$\sum\limits_{j} x_{j,i} = 1$ for all $i$
$x_{j,i} \in \{0,1\}$ for all $i$, $j$
where $i$ iterates over users, $j$ iterates over treatments, $x_{j,i}$ are binary decision variables, $p_{j,i}$ is the probability that user $i$ will be retained under the treatment $j$, $b_j$ is the budget constraint for treatment $j$, and $c_j$ is the treatment cost. The main goal is to maximize the overall uplift score represented by the probability of user retention. This goal can be further enhanced using lifetime value (LTV) to optimize long-term revenues.
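For a handful of users, this formulation can be checked by exhaustive enumeration, as in the sketch below. This is only a toy illustration: the costs, budgets, and retention probabilities are made up, and the brute-force search grows exponentially with the number of users, so a real campaign would need a dedicated IP or MIP solver.

```python
from itertools import product


def optimize_campaign(retention_probs, costs, budgets):
    """Find the assignment of one treatment per user that maximizes total retention.

    retention_probs: list of dicts, retention_probs[i][j] = p(retained) for user i, treatment j.
    costs: dict treatment -> cost per treated user (c_j).
    budgets: dict treatment -> total budget for that treatment (b_j).
    """
    treatments = list(costs)
    best_value, best_assignment = -1.0, None
    # Each candidate assigns exactly one treatment to every user (sum_j x_ji = 1).
    for assignment in product(treatments, repeat=len(retention_probs)):
        spend = {t: 0.0 for t in treatments}
        for t in assignment:
            spend[t] += costs[t]
        if any(spend[t] > budgets[t] for t in treatments):
            continue  # violates a per-treatment budget constraint
        value = sum(retention_probs[i][t] for i, t in enumerate(assignment))
        if value > best_value:
            best_value, best_assignment = value, assignment
    return best_assignment, best_value


# Hypothetical inputs: the budgets allow one discount, one upgrade, and no devices.
costs = {"none": 0.0, "discount": 10.0, "upgrade": 20.0, "device": 30.0}
budgets = {"none": 0.0, "discount": 10.0, "upgrade": 20.0, "device": 0.0}
retention_probs = [
    {"none": 0.30, "discount": 0.60, "upgrade": 0.70, "device": 0.80},
    {"none": 0.50, "discount": 0.55, "upgrade": 0.90, "device": 0.90},
    {"none": 0.80, "discount": 0.85, "upgrade": 0.85, "device": 0.90},
]

assignment, value = optimize_campaign(retention_probs, costs, budgets)
print(assignment, round(value, 2))
```

The example shows why the decisions are interdependent: the first user would benefit most from an upgrade in isolation, but the single upgrade in the budget is better spent on the second user.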
Advanced techniques
In the previous sections, we developed models for identifying churners and determining optimal treatments. The careful analysis of these models and their outputs can provide valuable insights into churn triggers and user behavior, but this is mainly a byproduct of the risk scoring activity. We can augment these capabilities with more specialized models that provide deeper insights into churner behavior. These extended insights can be used not only to fight churn, but also to identify a broader range of opportunities for improving customer experience, e.g. eliminating bottlenecks in user journeys, detecting product usability issues, and improving customer support services. In this section, we demonstrate one such extension.
Sentiment model
One common component that can be included into the churn analytics pipeline is a sentiment model. The purpose of a sentiment model is to extract useful signals from the user generated data such as product reviews, emails, text messages, and call transcripts. These signals can be used to perform manual analysis or enhance inputs of the risk scoring models.
We develop a reference sentiment model using a public “Crowdflower Claritin-Twitter” dataset [5]. This dataset includes a large number of tweets that mention Claritin (allergy relief product) tagged with sentiment labels. We use this dataset to train an AutoML Text Classification job and obtain a generic model that can be used for detecting negative sentiment in user emails. The sentiment score produced by the model can be used, in particular, as an input into the downstream churn risk scoring models.
We can also use the trained sentiment model to identify potential churn triggers. For example, we can define several characteristic keywords such as “price” and “quality” and then measure the distances between these keywords and emails in the space of text embeddings that can be produced using the model we developed. These distances can be used as additional signals for churn behavior analytics. The sentiment scoring and churn trigger analytics workflows are summarized in the figure below.
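The distance computation itself is simple once the embeddings are available. The sketch below uses cosine distance and tiny hand-written vectors as stand-ins; in a real workflow the vectors would come from the text model, and both the choice of cosine distance and the sample values are assumptions made for this example.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical 3-dimensional embeddings; real embeddings have many more dimensions.
keyword_vectors = {
    "price": [1.0, 0.0, 0.2],
    "quality": [0.0, 1.0, 0.1],
}
email_vector = [0.9, 0.1, 0.3]

distances = {
    keyword: cosine_distance(email_vector, vector)
    for keyword, vector in keyword_vectors.items()
}
closest_keyword = min(distances, key=distances.get)
print(closest_keyword, {k: round(d, 3) for k, d in distances.items()})
```

The per-keyword distances can then be appended to the user record as additional numeric features.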
Conclusion
In this blog post, we showcased our starter kit for the churn prediction problem that many businesses face, using common best practices in this area. The solution has a flexible, open architecture and can be customized for a variety of business scenarios. To a large extent, this is made possible by Vertex AI AutoML components, which let us focus on the business problem and workflow rather than low-level feature engineering and model parameter tuning.
References
- [1] https://cloud.google.com/vertex-ai/docs/tabular-data/classification-regression/overview
- [2] https://cloud.google.com/vertex-ai/docs/tabular-data/forecasting/overview
- [3] Mark W. Fraser and Shenyang Guo, Propensity Score Analysis: Statistical Methods and Applications, 2nd ed., SAGE Publications, 2014
- [5] https://data.world/crowdflower/claritin-twitter