Cluster Analysis, Correspondence Analysis, Factor Analysis, Audience Effects
Segmentation Apps
The Segmentation Apps in Explore are a suite of specialized, advanced analytical tools found within the Segmentations Module. This module, accessed via the left bar, is dedicated to performing sophisticated market segmentation and data reduction techniques.
These are the core Segmentation Apps:
- Cluster Analysis
- Correspondence Analysis
- Factor Analysis
- Audience Effects
1. Cluster Analysis
Cluster Analysis is a statistical technique used to group respondents together. The goal is to create groups (clusters) whose members are as similar as possible within their own group (internally homogeneous) and as different as possible from the members of other groups (externally heterogeneous). It is particularly useful in consumer segmentation when consumers are similar demographically but differ attitudinally.
Use Cases and Benefits of Cluster Analysis
Cluster analysis offers several key benefits in market research and the media industry:
- Identifying Differentiated Audience Segments: It divides a population, such as the users of a product category, into mutually exclusive groups of like-minded people based on their responses to key lifestyle and attitudinal statements.
- Understanding Buying Behavior: It can identify groups of buyers, allowing analysis of each group's behavior separately across measures like favorite stores, brand loyalty, price willingness, and purchase frequency.
- Identifying New Product Opportunities: By clustering brands and products, businesses can identify competitive sets within the market. Brands within the same cluster compete more heavily with each other than with brands in other clusters, allowing a business to examine current offerings and identify potential new product opportunities.
- Attitudinal Profiling: This involves categorizing individuals based on their attitudes, beliefs, and preferences (psychographic traits) rather than only demographics or behavioral characteristics. This helps identify distinct groups that share similar mindsets, allowing marketers to create more personalized and resonant campaigns that lead to increased customer satisfaction, loyalty, and overall marketing effectiveness.
- Media Industry Segmentation: In the media industry, it is used to segment audiences, which enables the creation of targeted content or advertisements. This helps maximize engagement and return on investment (ROI) by focusing efforts on specific audience segments.
Types of Survey Questions Used in Clustering
The data used in clustering typically comes from three types of survey questions:
- Likert Scale Questions: These measure the level of agreement with a statement, usually on a scale from 1 to 5 (e.g., "Definitely agree" to "Definitely disagree"). Explore may re-organize scales to match the typical structure of 1 = "definitely agree" and 5 = "definitely disagree". With this coding, a lower average means stronger agreement, and an average of 1 is the strongest possible agreement.
- Volumetrics: These are numerical values representing volume, such as how much time or money was spent, the number of visits, or the amount of product consumed.
- Binary Questions: These have only two mutually exclusive answer options (e.g., yes/no, agree/disagree). The average value of a binary variable (0 or 1) represents the frequency of positive answers. For instance, an average of 0.75 means 75% of the cluster agreed.
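To make the reading of cluster averages concrete, here is a minimal sketch that profiles two hypothetical clusters across the three question types. The respondent records, field names, and the `reverse_likert` helper are assumptions made for illustration and do not describe how Explore stores or rescales data.

```python
from statistics import mean

def reverse_likert(score, scale_max=5):
    """Flip a Likert score so that 1 and scale_max swap places (5 becomes 1)."""
    return scale_max + 1 - score

# Hypothetical respondents: cluster label, Likert answer (1 = definitely agree),
# weekly spend (volumetric), and a binary yes/no answer coded 1/0.
respondents = [
    {"cluster": "A", "likert": 1, "spend": 40.0, "buys_online": 1},
    {"cluster": "A", "likert": 2, "spend": 55.0, "buys_online": 1},
    {"cluster": "A", "likert": 1, "spend": 35.0, "buys_online": 1},
    {"cluster": "A", "likert": 2, "spend": 50.0, "buys_online": 0},
    {"cluster": "B", "likert": 4, "spend": 10.0, "buys_online": 0},
    {"cluster": "B", "likert": 5, "spend": 20.0, "buys_online": 1},
]

def profile(rows, cluster):
    """Average each question type over the members of one cluster."""
    members = [r for r in rows if r["cluster"] == cluster]
    return {
        "likert_mean": mean(r["likert"] for r in members),      # lower = stronger agreement
        "spend_mean": mean(r["spend"] for r in members),        # average volume
        "share_yes": mean(r["buys_online"] for r in members),   # proportion answering yes
    }

profile_a = profile(respondents, "A")
profile_b = profile(respondents, "B")
print(profile_a)
print(profile_b)
```

In this made-up data, cluster A has a binary average of 0.75, which reads as 75% of the cluster answering yes, and a Likert average much closer to 1 than cluster B, which reads as stronger agreement.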
Metrics for Determining Cluster Quality
The quality of the Cluster groups is determined using several statistical metrics:
| Metric | Purpose and Interpretation |
| --- | --- |
| Determination | Refers to finding the optimal number of clusters that best represents the underlying data structure. The higher the value (shown on a scale from 1 to 100%), the better the solution. If Determination scores are similar, the underlying data may be too similar to create more clusters; if the data is more varied, it results in more cluster group recommendations. |
| CH-Index (Calinski-Harabasz Index) | Measures the separation between clusters and the compactness within clusters. A higher CH-Index indicates better-defined and more distinct clusters and helps identify the optimal number of clusters. |
| Silhouette Score | Assesses the quality of clustering by quantifying how similar data points are to their assigned cluster compared to other clusters. Scores range from -1 to 1; values closer to 1 indicate points that fit their own cluster well and are clearly separated from other clusters. Its purpose is to determine the optimal number of clusters and evaluate cluster separation. |
| Inertia (Within-Cluster Sum of Squares) | Measures the compactness of clusters. A lower inertia means the data points are more tightly grouped around their cluster centroids (center points). |
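The three standard metrics above can be sketched from their textbook definitions. The example below is an illustration only: the points and labels are invented, the function names are hypothetical, and nothing here is claimed to match Explore's internal calculations. Determination is specific to Explore and is not reproduced.

```python
from math import dist

# Hypothetical 2-D respondent scores with a good and a poor cluster assignment.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 1, 0, 1, 0, 1]

def centroid(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def group_by_label(pts, labs):
    groups = {}
    for p, lab in zip(pts, labs):
        groups.setdefault(lab, []).append(p)
    return groups

def inertia(pts, labs):
    """Within-cluster sum of squared distances to each cluster centroid."""
    total = 0.0
    for members in group_by_label(pts, labs).values():
        c = centroid(members)
        total += sum(dist(p, c) ** 2 for p in members)
    return total

def calinski_harabasz(pts, labs):
    """Between-cluster over within-cluster dispersion, each divided by its degrees of freedom."""
    groups = group_by_label(pts, labs)
    n, k = len(pts), len(groups)
    overall = centroid(pts)
    between = sum(len(m) * dist(centroid(m), overall) ** 2 for m in groups.values())
    return (between / (k - 1)) / (inertia(pts, labs) / (n - k))

def silhouette(pts, labs):
    """Mean of (b - a) / max(a, b), where a is the mean distance to the point's own
    cluster and b is the mean distance to the nearest other cluster."""
    groups = group_by_label(pts, labs)
    scores = []
    for p, lab in zip(pts, labs):
        own = groups[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-member clusters
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in groups.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

for name, labs in (("good", labels), ("poor", bad_labels)):
    print(name, inertia(points, labs), calinski_harabasz(points, labs), silhouette(points, labs))
```

Running both assignments side by side shows the pattern the table describes: the well-separated solution has lower inertia and a higher CH-Index and Silhouette Score than the poor one.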
2. Correspondence Analysis
Correspondence Analysis (CA) is a statistical technique used to analyze and visualize relationships between categorical variables. It condenses information efficiently, allowing professionals to explore, visualize, and understand complex relationships between variables quickly.
Use Cases and Benefits of Correspondence Analysis
CA is highly valuable in advertising, media, and marketing research:
- Segmentation and Targeting: It helps identify distinct audience segments and their preferences. By analyzing consumption habits (like media consumption), CA allows marketers to fine-tune targeting strategies, ensuring the right message reaches the right audience.
- Media Channel Optimization: CA assists in identifying the most effective media channels for advertising campaigns. Researchers can analyze how different media platforms relate to consumer responses or brand metrics to allocate resources efficiently and make informed budget decisions.
- Competitor Analysis: Like Cluster Analysis, CA can cluster brands and products to identify competitive sets within the market. Brands in the same "factors" compete more with each other, helping businesses identify potential new opportunities by comparing their offerings to competitors.
- Identifying Psychographic Drivers: It is used prior to running a Cluster analysis to identify which psychographic variables have the strongest associations with specific audience segments and products. This information is valuable for understanding what drives certain audience behaviors, brand loyalty, and adoption.
Interpreting Correspondence Analysis
The interpretation focuses on factor contribution and plotting items on a map:
- Factors: CA typically plots results using Factor 1 (X-axis, left to right) and Factor 2 (Y-axis, top to bottom).
- Proximity on the Map: The proximity of data items on the map indicates the degree of similarity. For example, brands or age groups positioned near each other are considered very similar.
- Relationship Between Items: To understand the relationship between active rows and columns (e.g., a brand and a demographic), draw a line from each item to the origin (zero) of the graph and look at the angle between the two lines. An obtuse angle (greater than 90°) indicates a negative correlation, while an acute angle (less than 90°) indicates a positive one.
- Distance from the Center: The closer a data item is to the center of the map, the less information the map reveals about it. The further a data item is from the middle, the more strongly it is related to something else on the map.
- Variance Explained: A robust analysis aims for a 'variance explained' of 70% or above by a combination of Factor 1 and Factor 2.
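The angle and variance-explained checks described above can be sketched as follows. The map coordinates and per-factor percentages are invented for illustration, the function names are hypothetical, and the acute/right/obtuse reading follows the usual interpretation of a correspondence map rather than any documented Explore output.

```python
from math import acos, degrees, hypot

def angle_between(p, q):
    """Angle in degrees between the lines drawn from the origin to p and to q.
    Both points must lie away from the origin."""
    cosine = (p[0] * q[0] + p[1] * q[1]) / (hypot(*p) * hypot(*q))
    return degrees(acos(max(-1.0, min(1.0, cosine))))  # clamp against rounding error

def describe(p, q):
    angle = angle_between(p, q)
    if angle < 90:
        return "positive association"   # acute angle
    if angle > 90:
        return "negative association"   # obtuse angle
    return "no association"             # right angle

def variance_explained(factor_shares, threshold=70.0):
    """Sum the shares of Factor 1 and Factor 2 and compare with the 70% guideline."""
    first_two = factor_shares[0] + factor_shares[1]
    return first_two, first_two >= threshold

# Hypothetical (Factor 1, Factor 2) coordinates read off a map.
brand_a = (0.8, 0.3)
age_18_24 = (0.6, 0.5)
age_65_plus = (-0.7, 0.2)

print(describe(brand_a, age_18_24))
print(describe(brand_a, age_65_plus))
print(variance_explained([48.0, 27.5, 14.0, 10.5]))
```

Because both checks use only the two plotted factors, a low combined variance explained is a reminder that angles and distances on the map may hide structure carried by later factors.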
Key metrics in the CA results table include:
| Metric | Definition |
| --- | --- |
| % INF (Percent Influence) | Shows the contribution each variable makes to the overall analysis (the higher the value, the more relevant it is to the market differences). The table is often sorted by % INF prior to Cluster Analysis to determine the most influential variables. |
| ABS (Absolute Contribution) | The "value" of each statement/brand within the factor. The higher the ABS, the better the factor is at explaining that particular variable. It can be interpreted like the %Col in Crosstab. |
| REL (Relative Contribution) | Determines which side of the graph the brand or lifestyle statement will appear on. For Factor 1, "-" appears on the left and "+" appears on the right of the map. It explains which factor best explains the brand/demographic and can be read like the %Row in Explore. |
3. Factor Analysis
Factor Analysis (FA) is a statistical method used primarily for data reduction and summarization. It aims to discover the underlying structure or dimensions (called "factors") in a large set of measured variables.
FA takes a large number of correlated variables (e.g., attitudinal statements, brand ratings) and reduces them to a smaller number of conceptual, unobserved variables (the factors).
Core Purpose and Mechanism
- Simplification: Instead of analyzing 50 separate attitude questions, FA might group them into 5-10 meaningful factors (e.g., "Price Sensitivity," "Innovation Seeker," "Brand Loyalist").
- Identification of Underlying Constructs: It assumes that the observed correlation between variables is due to their relationship with one or more common underlying factors.
- Data Input: It works best with continuous or interval data, such as Likert-scale responses or volumetric data, where the variables exhibit correlation.
Use Cases and Benefits in Research
- Attitudinal Segmentation Input (Pre-Clustering Step): FA is frequently used before Cluster Analysis. It reduces the dimensionality of psychographic/attitudinal data, providing a more robust, meaningful, and less noisy input for the clustering algorithm. Instead of clustering based on 50 raw variables, you cluster based on 5-10 composite factor scores.
- Scale Construction and Validation: It helps confirm if a set of survey items designed to measure a single construct (like "Trust in a Brand") actually measures that single factor.
- Identifying Key Drivers: It quickly highlights which groups of statements or attributes move together, indicating a shared psychological or behavioral driver.
- Data Interpretation: It makes large datasets more manageable and interpretable by researchers, allowing them to name and characterize the latent factors.
Interpreting Factor Analysis
The interpretation focuses on the Factor Loadings and the Eigenvalues.
- Factors: These are the new, unobserved variables created by the analysis. They are ordered by the amount of variance they explain in the data.
- Eigenvalue: Represents the amount of total variance in the original data explained by each factor. Typically, factors with an Eigenvalue greater than 1 are considered significant and retained for interpretation (Kaiser Criterion).
- Factor Loadings: These are the correlation coefficients between the original variables and the newly created factors.
- High Loadings (close to 1 or -1): Indicate a strong relationship, meaning the variable contributes significantly to defining that factor. Variables with high loadings on the same factor are conceptually grouped together.
- Naming Factors: Researchers examine the content of the variables that load highly on a factor and assign a descriptive, conceptual name to the factor (e.g., "Health Consciousness").
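As a rough illustration of these interpretation steps, the sketch below applies the Kaiser Criterion to a set of eigenvalues and groups variables by their strongest loading. The eigenvalues, loadings, statement wording, and the 0.5 loading cutoff are all assumptions chosen for the example, not values produced or recommended by Explore.

```python
# Hypothetical eigenvalues from a factor analysis of 6 standardized variables.
# For standardized variables the eigenvalues sum to the number of variables.
eigenvalues = [2.7, 1.5, 0.8, 0.5, 0.3, 0.2]

def kaiser_retained(eigs):
    """Return the 1-based numbers of the factors with an eigenvalue greater than 1."""
    return [i + 1 for i, e in enumerate(eigs) if e > 1.0]

def variance_share(eigs):
    """Percentage of total variance explained by each factor."""
    total = sum(eigs)
    return [100.0 * e / total for e in eigs]

# Hypothetical loadings on the two retained factors: variable -> (Factor 1, Factor 2).
loadings = {
    "I compare prices before buying": (0.82, 0.10),
    "I wait for sales": (0.76, -0.05),
    "I like to try new products first": (0.08, 0.79),
    "I follow technology news": (-0.12, 0.71),
    "I shop on weekends": (0.21, 0.18),
}

def group_by_factor(loads, cutoff=0.5):
    """Assign each variable to the factor on which it loads most strongly,
    skipping variables whose strongest loading is below the assumed cutoff."""
    groups = {}
    for variable, row in loads.items():
        best = max(range(len(row)), key=lambda i: abs(row[i]))
        if abs(row[best]) >= cutoff:
            groups.setdefault(best + 1, []).append(variable)
    return groups

retained = kaiser_retained(eigenvalues)
shares = variance_share(eigenvalues)
groups = group_by_factor(loadings)
print(retained, [round(s, 1) for s in shares[:2]])
print(groups)
```

Here the first group of statements might be named "Price Sensitivity" and the second "Innovation Seeker", while the weekend-shopping statement loads weakly on both factors and is left out of either group.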
Relationship to Crosstab Tools
When factor analysis is completed, the resulting Factor Scores (each respondent's score on the newly created factors) can be treated as new, meaningful variables.
- These Factor Scores can then be saved as a custom audience in Explore and used as Column or Row Variables.
- This allows researchers to profile the factors (e.g., "Which demographics are high on the 'Innovation Seeker' factor?") or use the scores as input for further analysis, like regression or Cluster Analysis.
4. Audience Effects
Audience Effects (AE) is an advanced Segmentation tool in Explore. AE provides a quick method to identify the most promising groups or segments of people for targeted advertising for a product or service.
Mechanism and Setup:
- It analyzes a large set of variables (which can include demographics, media usage, attitudinal statements, product usage, and even digital personas) in the rows against a specific target defined in the column.
- When setting up the analysis, only binary variables (yes/no, agree/disagree) are allowed; any volume or mean/median codes are automatically removed before the analysis runs.
- The initial result is a scatter plot based on reach and index. This result then feeds into a CHAID analysis to create the segments.
Why Audience Effects is Useful (Use Cases)
AE streamlines complex segmentation and planning processes:
- Quick Segment Identification: It provides a rapid way to find the most promising segments for advertising. The dual screening process of reach and index, which is often tedious when done manually across thousands of variables, can be completed within minutes.
- Advanced Variable Analysis: It allows users to work across thousands of demographics, product usage, attitudinal data, and digital personas for digital planning.
- Targeted Planning: The resulting segments (each defined by a short list of variables) are used to create a Gain table of mutually exclusive audience segments.
- Integration into Media Planning: The identified segments can be moved into planning tools, such as the optional Audience Planner, for media planning of the newly discovered audience segments. These segments can also be passed to media and channel planning tools and actuators, such as online publishing companies and Demand Side Platforms (DSPs).
Explanation of Audience Effects Graphs and Tables
The core output of the Audience Effects analysis relies on three primary visual and tabular outputs: the Pareto Chart, the Gain Table, and the Gain Plot.
Pareto Chart
The Pareto Chart is the initial scatter plot used in the analysis.
- Variables Plotted: It shows all variables that were selected for the analysis on the scatter plot, including the NOT (negative condition) of each variable.
- Axes: The plot is based on Index (x-axis) and Audience (y-axis).
- Utility: The chart highlights Recommended Variables (shown as green dots) that have both high indexes and high audience reach. These "best variables" are subsequently used as the input for the CHAID analysis.
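The reach-and-index screen behind the chart can be sketched as below. This is only an illustration: the counts are invented, "audience" is assumed to mean the share of the target that has the trait, and the `min_index` and `min_reach` thresholds are hypothetical, since the rule Explore uses to mark Recommended Variables is not stated here.

```python
BASE_POP = 10_000    # hypothetical total respondents (base)
TARGET_POP = 1_200   # hypothetical respondents in the target column

# variable -> (base respondents with the trait, target respondents with the trait)
counts = {
    "Age 18-34": (3_000, 600),
    "Visits gym weekly": (2_000, 480),
    "Reads print newspapers": (2_500, 150),
}

def screen(counts, base_pop, target_pop, min_index=120.0, min_reach=25.0):
    """Compute reach and index for every variable and its NOT condition,
    flagging those that clear both assumed thresholds."""
    rows = []
    for name, (base_n, target_n) in counts.items():
        candidates = (
            (name, base_n, target_n),
            (f"NOT {name}", base_pop - base_n, target_pop - target_n),
        )
        for label, b, t in candidates:
            base_pct = 100.0 * b / base_pop   # share of the base with the trait
            reach = 100.0 * t / target_pop    # share of the target with the trait
            index = 100.0 * reach / base_pct  # 100 means the target matches the base
            rows.append({
                "variable": label,
                "reach": reach,
                "index": index,
                "recommended": index >= min_index and reach >= min_reach,
            })
    return rows

rows = screen(counts, BASE_POP, TARGET_POP)
recommended = [r["variable"] for r in rows if r["recommended"]]
print(recommended)
```

Screening on both measures matters because either one alone is misleading: "NOT Reads print newspapers" reaches most of the target but barely over-indexes, so it fails the screen.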
Gain Analysis
The Gain Lift Analysis is a section broken into two parts: the Gain Table and the Gain Plot. The segments shown here are paths through a CHAID tree.
1. Gain Table (The Segments and Lift)
The Gain Table displays the resulting segments derived from the CHAID analysis:
- Segment Values: It shows values comparing the base audiences to the target audiences.
- Sorting: The identified segments are sorted in descending order based on the Index. The first segment is therefore the most promising group.
- Index / Lift: The table displays the resulting Index, also referred to as the "Lift". The Index is calculated as $100 \times (\text{Target percentage in segment} / \text{Base percentage in segment})$. For example, if a segment represents $3.3\%$ of the total population but contains $15.1\%$ of Red Bull drinkers, the Index is $100 \times 15.1 / 3.3 \approx 458$.
- Target Accumulation: A separate Target Accumulation chart displays accumulated segments with details like Title, Full Code, Population, Target Population, and Index.
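The Index formula and the descending sort can be checked with a few lines of code. In this sketch only the 3.3% and 15.1% figures come from the Red Bull example; the other segments and all field names are hypothetical.

```python
def index(target_pct, base_pct):
    """Index (lift) = 100 x target percentage in segment / base percentage in segment."""
    return 100.0 * target_pct / base_pct

# The worked example: 3.3% of the population, 15.1% of Red Bull drinkers.
red_bull_index = index(target_pct=15.1, base_pct=3.3)
print(round(red_bull_index, 1))

# Hypothetical segments, sorted the way the Gain Table presents them.
segments = [
    {"title": "Segment X", "base_pct": 10.0, "target_pct": 18.0},
    {"title": "Segment Y", "base_pct": 3.3, "target_pct": 15.1},
    {"title": "Segment Z", "base_pct": 20.0, "target_pct": 22.0},
]
gain_table = sorted(
    ({**s, "index": index(s["target_pct"], s["base_pct"])} for s in segments),
    key=lambda row: row["index"],
    reverse=True,
)
for row in gain_table:
    print(row["title"], round(row["index"]))
```

An Index of 100 means the segment holds the same share of the target as of the population, so the example segment, at roughly 458, contains target members at more than four times the rate expected by chance.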
2. Gain Plot (The Visual Check)
The Gain Plot is a visual aid that works alongside the Gain Table:
- Baseline Check: This plot quickly shows whether the segments are performing above the baseline for the target. Being above the baseline indicates that the segments are a good predictor of potential target members (e.g., Red Bull drinkers).
- Required Action: If the segments in the Gain Plot appear flat (meaning they are close to the baseline line), it is an indication that the user should add more or different variables to mine the data story and rerun the AE analysis.
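One common way to build such a plot is a cumulative gains curve, sketched below under stated assumptions: the baseline is taken to be the diagonal where the cumulative share of the target equals the cumulative share of the population, the segment figures are invented, and the 5-point "flat" tolerance is arbitrary. Explore's own plot construction may differ.

```python
from itertools import accumulate

# Hypothetical mutually exclusive segments, already sorted by descending index:
# (share of total population in %, share of target audience in %)
segments = [(5.0, 20.0), (10.0, 25.0), (25.0, 30.0), (60.0, 25.0)]

def gain_curve(segs):
    """Cumulative (population %, target %) after adding each segment in order."""
    cum_pop = accumulate(p for p, _ in segs)
    cum_target = accumulate(t for _, t in segs)
    return list(zip(cum_pop, cum_target))

def max_gap_over_baseline(curve):
    """Largest vertical distance between the curve and the diagonal baseline."""
    return max(t - p for p, t in curve)

def looks_flat(curve, tolerance=5.0):
    """True when the curve never rises more than `tolerance` points above the baseline."""
    return max_gap_over_baseline(curve) < tolerance

curve = gain_curve(segments)
gap = max_gap_over_baseline(curve)
print(curve)
print(gap, looks_flat(curve))
```

With these figures the first 40% of the population already contains 75% of the target, so the curve sits well above the baseline; a curve that hugged the diagonal would be the signal to add or change variables and rerun the analysis.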
Overall, the best way to think of Audience Effects is this: if traditional crosstab analysis is like panning for gold (sifting through all the dirt manually), Audience Effects is like using a specialized metal detector. The Pareto Chart acts as the initial broad scan, showing where the most promising material (high index/high audience variables) might be. The Gain Lift Analysis then isolates the actual nuggets (the specific, powerful audience segments), and the Gain Plot provides a quick, visual confirmation that the location you chose is indeed rich (above the baseline).