STAT 430 - Final Project

Market Segmentation of Credit Card Customers

Matthew Hoyle / Shuyuan Shen / Junseok Yang

1. Introduction and Dataset Research

Motivation & Research Question

Retaining existing customers and finding new ones are crucial to the sustainability and profitability of any business, and banking is no exception. Intense competition in the credit card market makes it extremely difficult for banks to attract new customers and prevent attrition among existing users (Magatef and Tomalieh 2015). This is probably one reason why most of us have received piles of credit card marketing mail from various banks.

However, it is economically inefficient and wasteful to send promotional mail repeatedly to every individual in the country, since not everyone needs or wants a credit card. Being able to identify the individuals who are most likely to be credit card customers would significantly improve the efficiency of marketing in the banking industry.

The question then becomes how to identify individuals who are most likely to continue being credit card customers. A widely used marketing strategy is called market segmentation (Tynan and Drayton 1987; Yankelovich and Meer 2006). According to Tynan and Drayton (1987, p.301), the goal of market segmentation is to identify and delineate market segments or “sets of buyers,” which would then become targets for the company’s marketing plans. Buyers in each set tend to be homogeneous and share some common characteristics.

The traditional market segmentation technique uses surveys to collect data from the population and employs multiple discriminant analysis, multiple regression analysis, or other analytical procedures to establish the profile of the segments (Tynan and Drayton 1987). In recent years, scholars and banks have begun to use machine learning techniques and big data to further improve the accuracy and efficiency of market segmentation (Elrefai, Elgazzar, Khodeir 2021; Dawood, Elfakhrany, Maghraby 2019; Kaminskyi, Nehrey, Zomchak 2021; Mishra et al. 2020).

This project constitutes one of the efforts to use machine learning techniques in market segmentation in the banking industry. There are two central research questions that this project intends to answer:

  1. Can we perform market segmentation and uncover demographic differences among credit card customers based solely on their banking information?
  2. How well can we distinguish existing customers from the attrited ones using clustering algorithms?

To answer these questions, we use a dataset that documents the demographics and usage behaviors of about 9,000 active credit card holders during the last six months. The file is at the customer level with 20 variables, including a variable that indicates whether the holder has attrited. We will start with data cleaning and some descriptive analytics of the dataset. Then, we will answer the two research questions step by step.

2. Data Cleaning and Data Manipulation

Import

The dataset's documentation says to ignore the last two columns and suggests deleting them. We therefore deleted the last two columns as well as the first column, which contains the card holders' IDs.

Since the dataset does not have any missing values (NaN), we are good to go!
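A minimal sketch of this import step, assuming the data is read with pandas. The small inline CSV and its column names are hypothetical stand-ins for the real file; only the positional logic (drop the first column and the last two, then count missing values) follows the description above.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV: an ID column first, two trailing
# columns to discard, and illustrative feature columns in between.
csv_text = """id,Customer_Age,Credit_Limit,extra_1,extra_2
1001,45,12691.0,0.1,0.9
1002,49,8256.0,0.2,0.8
1003,51,3418.0,0.3,0.7
"""

df = pd.read_csv(io.StringIO(csv_text))

# Drop the first column (IDs) and the last two columns by position.
df = df.iloc[:, 1:-2]

# Count missing values per column to confirm there are none.
missing_per_column = df.isna().sum()
print(df.columns.tolist())
print(missing_per_column.sum())
```

Selecting by position rather than by name avoids typing out the long names of the discarded columns, at the cost of silently breaking if the column order of the file ever changes.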

3. Basic Descriptive Analytics

Numerical Attributes' Summary Statistics

The table above shows the descriptive statistics of the numerical variables in the dataset. The mean age of credit card customers is 46, older than one may expect since many young people are using credit cards nowadays.

One thing to notice in this table is the difference in means and variances across variables. Some variables, such as "Dependent count" and "Months Inactive", have very small means and standard deviations, while others, such as "Credit Limit" and "Total Transaction Amount", have very large ones. We should therefore be mindful of these scale differences and standardize the variables when needed.

Categorical Attributes' Value Count

The histograms above show the count of each category in each categorical variable. As we can see, the dataset is quite balanced in terms of gender: the numbers of male and female customers are about the same.

People with a graduate degree are overrepresented in this dataset, and the numbers of customers who are married or single are much higher than the numbers who are divorced or whose marital status is unknown.

A majority of people in this dataset earn less than 40,000 dollars annually.

Most of them hold a Blue card rather than one of the more advanced Gold, Silver, or Platinum cards.

4. Random Sample & Data Scaling Decisions (Only applies to Research Question 2)

Random Sample

Because our dataset is fairly large, we work with a smaller random sample drawn from the original dataset.

Furthermore, considering the significant imbalance between 'Existing' and 'Attrited' customers (as shown below), we sample 800 rows from each customer status so that the two groups are equally represented.
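The balanced sampling described above can be sketched as follows, assuming a pandas DataFrame with an 'Attrition_Flag' column. The data here is synthetic, and the group sizes and random seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in with an imbalanced status column (sizes are made up).
n_existing, n_attrited = 5000, 1000
df = pd.DataFrame({
    "Attrition_Flag": ["Existing"] * n_existing + ["Attrited"] * n_attrited,
    "Credit_Limit": rng.uniform(1000, 30000, n_existing + n_attrited),
})

# Draw 800 rows without replacement from each status group.
sample = df.groupby("Attrition_Flag").sample(n=800, random_state=0)

counts = sample["Attrition_Flag"].value_counts()
print(counts)
```

Fixing `random_state` makes the sample reproducible, which matters because later results such as t-SNE plots depend on which rows were drawn.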

Do we need to scale the dataset...?

Yes. Some attributes, such as 'Credit_Limit', have relatively large values and standard deviations and could dominate the other attributes, so we scale the dataset before analyzing it.
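One common way to do this is z-score standardization, sketched below with NumPy. This is an assumption about the scaling method, and the numbers are made up purely to show a small-scale column next to a large-scale one.

```python
import numpy as np

# Toy numeric matrix: column 0 mimics a small-scale count,
# column 1 mimics a large-scale dollar amount (values are made up).
X = np.array([
    [1.0, 3000.0],
    [2.0, 12000.0],
    [3.0, 8000.0],
    [4.0, 25000.0],
])

# Z-score each column: subtract the column mean, divide by the column std.
mean = X.mean(axis=0)
std = X.std(axis=0)
X_scaled = (X - mean) / std

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so no single attribute dominates a distance computation just because of its units.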

5. Clusterability and Clustering Structure

Research Question 1

The variables used in this analysis are the banking information variables. Because these variables include both numerical and categorical attributes, we will create a Gower's distance matrix from the data.

Next, we will use this Gower's distance matrix to build t-SNE plots, which will give us insight into the clustering structure of our banking information features. We do not need to scale here because Gower's distance already normalizes each numerical variable by its range. We will also calculate the Hopkins statistic, which will give us insight into the clusterability of our data.

We select the t-SNE plot that best represents the clustering structure. From this plot and the low Hopkins statistic, we see that the banking information on its own is likely clusterable.

We next plot the banking information variables onto our chosen t-SNE plot. The small cluster at the top left consists primarily of non-Blue card categories. We also see higher total relationship counts towards the bottom of the plot, higher credit limits towards the top left, higher total revolving balances towards the right, higher average open-to-buy towards the top left, higher total transaction amounts and counts towards the top, and higher average utilization ratios towards the right. The variables included in this analysis therefore vary in distinct ways across the plot, which suggests that clustering is possible.

Research Question 2

Because our dataset mixes numerical and categorical attributes, we use Gower's distance to measure the distance between customers. Gower's distances range from 0 to 1, with 0 indicating that two customers are identical on all variables and 1 indicating that every categorical value differs and that, on every numerical attribute, the two customers are as far apart as any pair in the dataset.
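To make the definition concrete, here is a minimal NumPy sketch of Gower's distance with equal weights for all variables. The function name and the three toy customers are hypothetical; in practice a library implementation may have been used instead.

```python
import numpy as np

def gower_matrix(num, cat):
    """Pairwise Gower's distances for numeric array `num` and categorical array `cat`."""
    num = np.asarray(num, dtype=float)
    cat = np.asarray(cat)
    # Numeric part: absolute difference divided by each column's range.
    ranges = num.max(axis=0) - num.min(axis=0)
    ranges[ranges == 0] = 1.0  # avoid dividing by zero for constant columns
    num_part = np.abs(num[:, None, :] - num[None, :, :]) / ranges
    # Categorical part: 0 if the categories match, 1 otherwise.
    cat_part = (cat[:, None, :] != cat[None, :, :]).astype(float)
    # Average the per-variable dissimilarities over all variables.
    total = num_part.sum(axis=2) + cat_part.sum(axis=2)
    return total / (num.shape[1] + cat.shape[1])

# Three toy customers: the first and third are identical, the second is
# at the opposite extreme on every variable.
num = np.array([[30.0, 2000.0], [60.0, 20000.0], [30.0, 2000.0]])
cat = np.array([["F", "Blue"], ["M", "Gold"], ["F", "Blue"]])
D = gower_matrix(num, cat)
print(D)
```

In this toy example the identical customers are at distance 0, and the pair that differs on every category and sits at opposite ends of every numerical range is at distance 1, matching the two endpoints described above.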

Hopkins Statistic

To assess the clusterability of the dataset, we first calculate the Hopkins statistic. The five Hopkins statistics above are all closer to 0 than to 0.5, which indicates that the dataset is clusterable.

Next, we run the t-SNE algorithm and use the resulting plots to further explore the clusterability of the dataset.
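As an illustration of how such a statistic behaves, the sketch below implements one common form of the Hopkins statistic under the convention used here (values near 0 suggest clusters, values near 0.5 suggest no structure). It is an assumption-laden simplification: it uses Euclidean distances on purely numerical synthetic data rather than Gower's distances, and the function name, sample size `m`, and seeds are arbitrary.

```python
import numpy as np

def hopkins(X, m, seed=0):
    """Hopkins statistic; values near 0 suggest clusters, near 0.5 suggest uniform noise."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # w: distance from m sampled real points to their nearest other real point.
    idx = rng.choice(n, size=m, replace=False)
    dist_real = np.linalg.norm(X[idx, None, :] - X[None, :, :], axis=2)
    dist_real[np.arange(m), idx] = np.inf  # ignore each point's distance to itself
    w = dist_real.min(axis=1)
    # u: distance from m uniform random points (in the bounding box) to the nearest real point.
    U = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = np.linalg.norm(U[:, None, :] - X[None, :, :], axis=2).min(axis=1)
    return w.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(1)
# Two tight, well-separated blobs versus uniform noise.
blobs = np.vstack([rng.normal(0, 0.05, (100, 2)), rng.normal(5, 0.05, (100, 2))])
noise = rng.uniform(0, 1, (200, 2))
h_blobs = hopkins(blobs, m=20)
h_noise = hopkins(noise, m=20)
print(h_blobs, h_noise)
```

Because the statistic depends on a random draw of points, computing it several times with different seeds, as with the five values reported above, gives a more reliable picture than a single run.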

t-SNE Algorithm

From the t-SNE plots above, we can see that clusters are forming, which suggests that the dataset is clusterable. Most t-SNE plots with a perplexity of 20 or more seem to display two main clusters in the dataset, possibly with some small sub-clusters.

We present below the t-SNE plot with a perplexity of 50 and a random state of 23, as it has one of the clearest clustering structures, with two main clusters and some sub-clusters.
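A plot with these settings could be produced roughly as follows, assuming scikit-learn's `TSNE` is run on a precomputed distance matrix. The matrix below is a synthetic stand-in for the Gower's distance matrix; only the perplexity of 50 and the random state of 23 come from the text.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for a Gower's distance matrix: two groups of points,
# pairwise distances rescaled into [0, 1].
points = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(8, 1, (60, 3))])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
D = D / D.max()

# metric="precomputed" tells t-SNE that D already holds pairwise distances;
# init must be "random" in that case because PCA init needs raw features.
tsne = TSNE(n_components=2, metric="precomputed", init="random",
            perplexity=50, random_state=23)
embedding = tsne.fit_transform(D)
print(embedding.shape)
```

The two columns of `embedding` are the plot coordinates; coloring the points by a variable such as 'Gender' or 'Attrition_Flag' then shows how that variable lines up with the clustering structure.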

Surprisingly, the clustering structure of the t-SNE plot is almost perfectly captured by 'Gender', and the sub-clusters reflect 'Income_Category' very well. Customers with high 'Credit Limit' and 'Average Open to Buy' concentrate in the left cluster. The clustering structure does not distinguish the other numerical variables very well.

Based on these visualizations, we can make a naive prediction: the clustering structure of the t-SNE plot heavily reflects 'Gender' and 'Income Category' rather than 'Attrition_Flag' (whether a customer has churned or not), so we expect a weak or no association between 'Attrition_Flag' and customers' demographics.

6. Algorithm Selection Motivation

Research Question 1

Since the t-SNE plot shows potential clusters that are of different sizes, non-spherical, and potentially not well separated, an approach based on Euclidean distances will not suffice. Thus, we will use Hierarchical Agglomerative Clustering with the Gower's distance matrix.
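A minimal sketch of hierarchical agglomerative clustering on a precomputed distance matrix, assuming SciPy. The distance matrix is a synthetic stand-in for the Gower's matrix, and both the average linkage and the cut into two clusters are illustrative assumptions rather than choices stated in this section.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Synthetic stand-in for a Gower's distance matrix (symmetric, zero diagonal).
points = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(6, 0.5, (10, 2))])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
D = D / D.max()

# linkage expects the condensed (upper-triangle) form of the distance matrix.
condensed = squareform(D, checks=False)
Z = linkage(condensed, method="average")

# Cut the dendrogram into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Because the algorithm only needs pairwise distances, it works directly with Gower's distances and makes no assumption that clusters are spherical or equally sized.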

Research Question 2

Because the dataset mixes categorical and numerical attributes, we can use:

  1. K-Prototype Algorithm
  2. Hierarchical Agglomerative Clustering using Gower's Distance Matrix
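For intuition about the first option, the K-Prototypes algorithm measures how far a customer is from a cluster prototype by combining a squared Euclidean distance on the numerical attributes with a weighted count of mismatches on the categorical attributes. The sketch below shows only that dissimilarity, not the full algorithm; the function name, the toy values, and the weight `gamma=0.5` are hypothetical.

```python
import numpy as np

def kprototypes_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    """Squared Euclidean distance on numeric parts plus gamma times categorical mismatches."""
    diff = np.asarray(x_num, dtype=float) - np.asarray(proto_num, dtype=float)
    numeric_cost = np.sum(diff ** 2)
    categorical_cost = np.sum(np.asarray(x_cat) != np.asarray(proto_cat))
    return numeric_cost + gamma * categorical_cost

# Hypothetical scaled customer and cluster prototype.
x_num, x_cat = [0.5, -1.0], ["F", "Blue"]
proto_num, proto_cat = [0.0, 0.0], ["M", "Blue"]

d = kprototypes_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=0.5)
print(d)
```

The numerical part is why scaling matters for this algorithm: an unscaled attribute with large values would dominate the squared differences and drown out the categorical mismatches.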