Cluster Analysis
Cluster Analysis is a statistical technique that groups respondents so that they are as similar as possible within each cluster (internally homogeneous) and as different as possible between clusters (externally heterogeneous). It is particularly useful in consumer segmentation when consumers are similar in many ways (e.g., demographically) but differ attitudinally.
In the media industry, Cluster Analysis can be used to segment audiences, enabling the creation of targeted content or advertisements. It helps in maximizing engagement and return on investment (ROI) by focusing efforts on specific audience segments.
What are the benefits of using Cluster analysis?
Identifying new or differentiated audience segments:
Cluster analysis can segment, for example, the respondents in a product category into mutually exclusive groups based on their responses to key lifestyle and attitudinal statements. Each audience group is made up of like-minded people.
Understanding buying behaviour:
Cluster analysis can be used to identify groups of buyers. The buying behaviour of each group can then be examined separately on measures such as favourite stores, brand loyalty, price they are willing to pay, frequency of purchase, etc.
Identifying new product opportunities:
By clustering brands and products, competitive sets within the market can be identified. Brands in the same cluster compete more with each other than with brands in other clusters. A business can examine its current offerings compared to those of its competitors to identify potential new product opportunities.
Attitudinal profiling:
Attitudinal profiling within Cluster Analysis involves categorizing individuals based on their attitudes, beliefs, and preferences rather than solely on demographic or behavioral characteristics. This method helps identify distinct groups of individuals who share similar attitudes or psychographic traits, allowing marketers to tailor their strategies to specific mindset segments. This enables more personalized and resonant campaigns, leading to increased customer satisfaction, loyalty, and overall marketing effectiveness.
The types of survey questions used in clustering
Likert scale
Likert scale questions ask for the level of agreement with a statement, generally on a range such as 1 to 5. Telmar weights this type of question by applying values from low to high (1 to 5). Typical Likert scale answers:
- Definitely agree
- Tend to agree
- Neither
- Tend to disagree
- Definitely disagree
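The coding step described above can be sketched as follows. This is a minimal illustration, not Telmar's actual implementation: the direction of the mapping (1 = definitely agree) and the names LIKERT_VALUES and likert_average are assumptions made for the example.

```python
from statistics import mean

# Assumed coding: 1 = strongest agreement, 5 = strongest disagreement.
LIKERT_VALUES = {
    "Definitely agree": 1,
    "Tend to agree": 2,
    "Neither": 3,
    "Tend to disagree": 4,
    "Definitely disagree": 5,
}

def likert_average(answers):
    """Return the mean coded value for a list of Likert answer labels."""
    return mean(LIKERT_VALUES[a] for a in answers)

# Hypothetical answers from four respondents to one statement.
answers = ["Definitely agree", "Tend to agree", "Definitely agree", "Neither"]
print(likert_average(answers))  # 1.75
```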
Volumetrics
These are numerical values traditionally known in marketing as volumetrics. Volumetrics questions are typically:
- How much time was spent with different activities
- How much money spent on different items
- Number of visits
- Amount of product consumed
- Number of issues read
- etc
Binary questions
Binary questions have only two mutually exclusive answer options, e.g. agree or disagree, yes or no. Examples of binary questions:
- Do you drink beer (yes/no)
- People who answered either definitely or tend to agree to an attitudinal question
- Read Time magazine (yes)
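The second example, collapsing an attitudinal question into a binary one, might look like the sketch below. The set AGREE_LABELS and the function to_binary are hypothetical names used only for illustration.

```python
# Assumed: the two "agree" answers count as a positive (1), everything else as 0.
AGREE_LABELS = {"Definitely agree", "Tend to agree"}

def to_binary(answer):
    """Return 1 if the answer is one of the agree labels, otherwise 0."""
    return 1 if answer in AGREE_LABELS else 0

# Hypothetical answers from four respondents.
answers = ["Definitely agree", "Neither", "Tend to agree", "Tend to disagree"]
coded = [to_binary(a) for a in answers]
print(coded)  # [1, 0, 1, 0]
```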
The criteria for determining quality groups in clustering
The metrics that determine the quality of the cluster groups are called Determination and CH-Index. The higher the value of either criterion, the better the solution. Both are shown on the same scale, from 1 to 100%. The chart of these two metrics helps decide the best number of clusters; the recommended number is shown by the Rec. solution button.
CH-Index
The CH-Index (Calinski-Harabasz Index) in Cluster Analysis is a metric used to evaluate the quality of clustering. Its purpose is to measure the separation between clusters and the compactness within clusters. A higher CH-Index indicates better-defined and more distinct clusters, helping to identify the optimal number of clusters in a dataset.
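As a rough sketch of what the index measures, the standard Calinski-Harabasz formula divides the between-cluster dispersion (per cluster, minus one) by the within-cluster dispersion (per respondent, minus the number of clusters). The code below computes that raw value in plain Python. The raw index is unbounded; how the tool rescales it to a percentage is not described here, so that step is left out. The function names and the sample points are assumptions for illustration.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def calinski_harabasz(clusters):
    """clusters: list of clusters, each a list of points (tuples).

    Needs at least two clusters and more points than clusters.
    """
    all_points = [p for c in clusters for p in c]
    n, k = len(all_points), len(clusters)
    overall = centroid(all_points)
    # Between-cluster dispersion: how far cluster centres sit from the overall centre.
    between = sum(len(c) * sq_dist(centroid(c), overall) for c in clusters)
    # Within-cluster dispersion: how far points sit from their own cluster centre.
    within = sum(sq_dist(p, centroid(c)) for c in clusters for p in c)
    return (between / (k - 1)) / (within / (n - k))

# Hypothetical data: the same six points grouped well and grouped badly.
tight = [[(1, 1), (1, 2), (2, 1)], [(8, 8), (8, 9), (9, 8)]]
mixed = [[(1, 1), (8, 8), (2, 1)], [(1, 2), (8, 9), (9, 8)]]
print(calinski_harabasz(tight) > calinski_harabasz(mixed))  # True
```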
Determination
Determination in Cluster Analysis refers to the process of determining the number of clusters or groups within a dataset. Its purpose is to find the optimal cluster number that best represents the underlying patterns and structure in the data. This aids in meaningful data segmentation and analysis.
If the Determination scores for different numbers of clusters are very similar, it often means that the underlying data is too similar to create more clusters. In other words, the data may not exhibit distinct patterns or groupings, making it challenging to identify additional clusters.
If the underlying data is more varied, it typically results in more cluster group recommendations. Greater data variation can lead to more distinguishable patterns and groupings, making it easier to identify and recommend multiple clusters.
Silhouette-Score
A silhouette score in cluster analysis is a measure that assesses the quality of clustering. It quantifies how similar data points are to their assigned cluster compared to other clusters. The purpose of the silhouette score is to determine the optimal number of clusters and evaluate the separation between clusters. It helps identify well-defined and distinct clusters.
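A minimal sketch of the usual silhouette calculation follows. For each point, a is the mean distance to the other points in its own cluster and b is the lowest mean distance to the points of any other cluster; the point's score is (b - a) / max(a, b), and the overall score is the mean over all points, ranging from -1 to 1. The function name, the Euclidean distance, and the sample data are assumptions for illustration.

```python
from math import dist
from statistics import mean

def silhouette_score(clusters):
    """clusters: list of at least two clusters, each a list of points (tuples)."""
    scores = []
    for i, own in enumerate(clusters):
        for idx, p in enumerate(own):
            if len(own) == 1:
                scores.append(0.0)  # common convention for single-point clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = mean(dist(p, q) for jdx, q in enumerate(own) if jdx != idx)
            # b: mean distance to the nearest other cluster
            b = min(
                mean(dist(p, q) for q in other)
                for j, other in enumerate(clusters)
                if j != i
            )
            scores.append((b - a) / max(a, b))
    return mean(scores)

# Hypothetical data: the same six points grouped well and grouped badly.
tight = [[(1, 1), (1, 2), (2, 1)], [(8, 8), (8, 9), (9, 8)]]
mixed = [[(1, 1), (8, 8), (2, 1)], [(1, 2), (8, 9), (9, 8)]]
print(silhouette_score(tight) > silhouette_score(mixed))  # True
```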
Inertia
Inertia in Cluster Analysis, often referred to as the Within-Cluster Sum of Squares (WCSS), is a metric that measures the compactness of clusters. Its purpose is to quantify how close data points are to the centroids (center points) of their respective clusters. The lower the inertia, the more tightly grouped the data points are within clusters.
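The calculation can be sketched in a few lines: for every cluster, find the centroid, then add up the squared distances from each point to it. The function name and the sample points are assumptions for illustration.

```python
def inertia(clusters):
    """Within-cluster sum of squares for a list of clusters of points (tuples)."""
    total = 0.0
    for cluster in clusters:
        n = len(cluster)
        # Centroid: the component-wise mean of the cluster's points.
        center = [sum(p[d] for p in cluster) / n for d in range(len(cluster[0]))]
        # Add the squared distance from every point to that centroid.
        total += sum((x - c) ** 2 for p in cluster for x, c in zip(p, center))
    return total

# Hypothetical data: two clusters of two points each.
clusters = [[(1, 1), (1, 3)], [(5, 5), (7, 5)]]
print(inertia(clusters))  # 4.0
```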
Interpreting Averages for a Cluster Analysis result
- How to interpret AVERAGES for Likert scale questions
Respondents give their agreement on a Likert scale, typically from 1 to 5, where 1 represents “definitely agree” and 5 represents “definitely disagree”. The lowest average value (1) therefore indicates the strongest agreement. A score of 1.52 for “fast food is junk” means the average answer lies between “definitely agree” and “tend to agree”.
In some surveys, such as Numeris and Vividata, the Likert scale is reversed: 5 represents “definitely agree” and 1 “definitely disagree”. In these cases, Explore will re-organise the answers to match the typical 1 = “definitely agree” and 5 = “definitely disagree” structure.
- How to interpret AVERAGES for Volumetric questions
These values or averages represent a volume, such as the “number of pieces bought” or the amount spent, and depend on the actual question in the survey used in the Cluster analysis. For example, a score of 2.12 could represent the number of drinks bought in a month.
- How to interpret AVERAGES for Binary questions
Binary variables have a value of 0 or 1, i.e. the two values are mutually exclusive (the answer is either yes or no). The average value is therefore the proportion of positive answers (yes or agree).
For example, 0.75 for Cluster 1 for ‘Price comparison - Internet Visited - Shopping, Retail and Finance’ means that 75% of the Cluster 1 respondents said that they compare prices online, whereas in Cluster 2 only 10% of the respondents go online to compare prices. This suggests that the respondents in Cluster 2 are not price sensitive, whereas those in Cluster 1 are very price sensitive.
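The link between a binary average and a percentage can be checked with a small sketch. The answer lists below are made-up data chosen to reproduce the 0.75 and 0.10 averages, and share_yes is a hypothetical helper name.

```python
from statistics import mean

def share_yes(coded_answers):
    """Average of 0/1 answers, i.e. the proportion of positive answers."""
    return mean(coded_answers)

# Hypothetical coded answers (1 = yes, 0 = no).
cluster_1 = [1, 1, 1, 0]                      # 3 of 4 respondents said yes
cluster_2 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]    # 1 of 10 respondents said yes

print(share_yes(cluster_1))  # 0.75
print(share_yes(cluster_2))  # 0.1
```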
Index
The index allows comparison between different variables in a simpler form than standardized scores. It is calculated as follows:
Index = 100 * (Average value in cluster / Average value in population for the given variable)
For example, if the index for ‘Buying or selling at an online auction’ is 176 in Cluster 1 and 58 in Cluster 2, respondents in Cluster 1 are 76% more likely than the total population to buy or sell at an online auction, whereas respondents in Cluster 2 are 42% less likely than the total population to do so.
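The index formula and its reading as a percentage difference can be sketched as follows. An index of 100 means the cluster matches the population, so the difference from 100 gives the percentage more or less likely. The averages used here are hypothetical values picked to produce indices of 176 and 58, and the function names are assumptions.

```python
def cluster_index(cluster_avg, population_avg):
    """Index = 100 * (average value in cluster / average value in population)."""
    return 100 * cluster_avg / population_avg

def percent_vs_population(index_value):
    """Positive: more likely than the population; negative: less likely."""
    return index_value - 100

population_avg = 0.25   # hypothetical average in the total population
cluster_1_avg = 0.44    # hypothetical average in Cluster 1
cluster_2_avg = 0.145   # hypothetical average in Cluster 2

idx_1 = round(cluster_index(cluster_1_avg, population_avg))  # 176
idx_2 = round(cluster_index(cluster_2_avg, population_avg))  # 58
print(percent_vs_population(idx_1))  # 76
print(percent_vs_population(idx_2))  # -42
```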
Standard Scores
The standardized score shows how far the average value in a cluster lies from the average value of the whole population (which, once standardized, is always equal to zero). This distance is measured in standard deviations. A value of 0.5, for example, means that the average value in the cluster is half a standard deviation above the total average; a value of -1 means that it is one standard deviation below the total average. The further the standardized scores are from zero, the more distinguishable the clusters are for the given variable; the closer to zero, the less difference between clusters is observed.
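A small sketch of this calculation, under the assumption that the score is the cluster mean minus the population mean, divided by the population standard deviation. Whether the tool uses the population or the sample standard deviation is not stated, so the use of pstdev here is an assumption, as are the function name and the sample values.

```python
from statistics import mean, pstdev

def standard_score(cluster_values, population_values):
    """Distance of the cluster mean from the population mean, in standard deviations."""
    return (mean(cluster_values) - mean(population_values)) / pstdev(population_values)

# Hypothetical answers for one variable: population mean 5, standard deviation 2.
population = [2, 4, 4, 4, 5, 5, 7, 9]
cluster_a = [5, 7]   # mean 6, above the population mean
cluster_b = [2, 4]   # mean 3, below the population mean

print(standard_score(cluster_a, population))  # 0.5
print(standard_score(cluster_b, population))  # -1.0
```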
Standard Deviation
The standard deviation is a measure of variability – the higher its value in a Cluster, the less compact this cluster is.
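As a brief illustration, the sketch below compares two hypothetical clusters with the same average but different spreads. The use of the population standard deviation (pstdev) is an assumption.

```python
from statistics import mean, pstdev

# Hypothetical answers for one variable in two clusters; both average 3.5.
compact_cluster = [3, 3, 4, 4]
spread_cluster = [1, 2, 5, 6]

print(mean(compact_cluster) == mean(spread_cluster))      # True
print(pstdev(compact_cluster) < pstdev(spread_cluster))   # True: the first cluster is more compact
```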