An Application of Model-based Clustering in Market Segmentation

Module by: Sean Zeng, Sarah J. Thomas.

Summary: This report summarizes work done as part of a Computational Finance PFUG under Rice University's VIGRE program. VIGRE is a program of Vertically Integrated Grants for Research and Education in the Mathematical Sciences under the direction of the National Science Foundation. A PFUG is a group of Postdocs, Faculty, Undergraduates and Graduate students formed around the study of a common problem. In this module we present a model-based clustering scheme for market research. We use observation-driven Poisson regression to model purchase patterns of customers over time. Based on those models, customers are segmented into groups. We illustrate the methods on purchases of two related grocery products: bacon and eggs.

Introduction

Time Series of Counts Data in Marketing

Time series of counts (TSC) arise whenever countable, time-dependent observations can be made about a phenomenon. For example, daily catches of a fisherman, weekly swine flu cases admitted into a hospital, and monthly used cars sold by a salesman are all instances of TSC. Count data are abundant in the study of consumer behavior. Observations such as a consumer's product purchase patterns, online click streams, store visits, and rate of product consumption are examples of count data and can be analyzed for marketing purposes. In fact, when creating a company's marketing mix or marketing strategy, managers rely heavily on relevant, accurate and timely (RAT) information about consumer behavior[1]. In an increasingly technology- and information-dependent society, companies scramble to acquire RAT information in order to understand the new, emerging demands of consumers and to gain and keep a competitive edge.

Perhaps the most important use for consumer behavior data is market segmentation, where consumers are identified and grouped based on their distinctive traits and characteristics. Market segmentation begins with the realization that there exists a heterogeneity in consumer demand for goods and services and that the heterogeneous market is made up of “a number of smaller homogeneous markets, in response to differing preferences, attributable to the desires of consumers for more precise satisfaction of their varying wants[9]”. Market segmentation, if conducted correctly and effectively, is an extremely powerful and informative tool for managers, since segmentation analysis can reveal both the size and the traits of the consumer groups, for whom appropriate group-specific marketing strategies can be devised. Correct and relevant market segmentation is critical for businesses, since incorrect evaluations of consumer groups and their characteristics lead to an unproductive marketing mix and wasted resources. Numerous publications and books have been written on effective methods of consumer segmentation. For a survey and review of market segmentation methods, see [11].

Current Market Segmentation Methods

Through a review of relevant literature on the subject, it is evident that the statistical process of cluster analysis lies at the core of market segmentation. Though classification of consumers into a preset number of groups with already identified traits is also common in segmentation, clustering offers the advantages of flexibility and accuracy because the number of groups, their size and their characteristics are data-driven.

Thus, there has been substantial development in clustering models and methodology for marketing research in the past 60 years. During the initial stages of market segmentation research in the 1950s, the statistical models proposed depended heavily on existing Operations Research and Management Science methods. These models were either too complicated to be implemented practically or too unrealistic to accurately represent real-world situations. As researchers gained more computing power, models became more realistic and implementable[7]. However, current methods also have some shortcomings. For example, many segmentation models have difficulty capturing the relationship between the exogenous variables (consumer traits) and the response variables (e.g. purchase frequencies and profits) in a segment[3]. Also, even if the model successfully relates segment traits to response, the time-dependent nature of consumer behavior is often ignored. Finally, many of the models involve only one dependent variable. Consider the purchase of printers among different groups of consumers. Such a narrowly defined analysis may be sufficient for, say, a printer manufacturer. However, for an electronics store manager deciding how to stock, shelve and promote printers, it is helpful to generalize the analysis to cover products closely related to printers (printing paper, scanners, toners, etc.), since consumers who buy printers are likely to be interested in these products and vice versa. Thus, a multivariate segmentation approach can be more informative and useful.

Model-Based Clustering

A novel model-based method of clustering TSC was proposed by Thomas, Ray and Ensor[10], who applied the method to Houston air pollution monitoring data. In that study, air quality monitoring stations, represented by time series of pollution readings, were clustered to identify regions of the city with similar patterns of pollution. The results from the pollution study were promising and motivate the use of model-based clustering (MBC) in other applications. In this study, our goal is to address the shortcomings of current market segmentation methods by applying MBC to consumer purchase data.

Outline for remainder of the discussion

We explain MBC in more detail in Section 2. The data are described in Section 3. Ongoing work and project implications are presented in Sections 4 and 5, respectively. Finally, directions for future research are laid out in Section 6.

Methodology

The name “model-based clustering” implies two components to the method: the modeling component and the clustering component. An appropriate model is fit to each time series, and then a dissimilarity metric based on the likelihood of those models is used to cluster the TSC.

Modeling TSC

A classical model for count data is Poisson regression. Recently, Fokianos and Kedem[2] proposed a model for TSC in the generalized linear model (GLM) framework, which can be called “observation-driven” Poisson regression. In a GLM, a transformation of the mean of the response $\{Y_t, t=1,\ldots,N\}$ is modeled as a linear function of the covariates $\{X_t, t=1,\ldots,N\}$. There are two components to a time series following a GLM:

  1. Random Component: The conditional distribution $\{Y_t \mid Y_{t-1}, \ldots\}$ belongs to the exponential family of distributions,
    $$f(y_t; \theta_t, \varphi \mid F) = \exp\left\{ \frac{y_t\theta_t - b(\theta_t)}{a_t(\varphi)} + c(y_t, \varphi) \right\},$$
    (1)
    where $\theta_t$ is the parameter of the distribution and $\varphi$ is a dispersion parameter. In the Poisson case, we have: $a_t(\varphi) = 1$; $\theta_t = \log(\mu_t)$; $b(\theta_t) = \mu_t$; $c(y_t; \varphi) = -\log(y_t!)$.
  2. Systematic Component: $\mu_t$, the mean of $Y_t$, is modeled through a monotone link function $g(\cdot)$ such that
    $$g(\mu_t) = X_t^T \beta,$$
    (2)
    where $X$ is the set of covariates and $\beta$ is a vector of coefficients. In the Poisson case $\mu = \lambda$ and $g(\lambda) = \log(\lambda)$.

In the case of time series data, we can augment the exogenous covariates in the model with lagged values of the response variable, i.e. the observed counts at previous time points. Thus the model is “observation-driven.” Lags of exogenous covariates can also be included. For instance let the new covariate matrix be represented by Z, where

$$Z_t^T = (X_t, X_{t-1}, \ldots, X_{t-p}, Y_{t-1}, \ldots, Y_{t-n}).$$
(3)

For more details see [2].
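As a rough illustration of how the lagged covariate vector in (3) feeds a Poisson regression, the sketch below builds the rows $Z_t$ and maximizes the Poisson log-likelihood by Newton's method with step halving. It is a minimal sketch under stated assumptions, not the estimation procedure of [2]: the function names, the added intercept column, the starting values and the toy data are all hypothetical.

```python
import math

def build_design(x, y, p=0, n=1):
    """Return rows Z_t = (1, x_t..x_{t-p}, y_{t-1}..y_{t-n}) and the aligned responses."""
    start = max(p, n)
    rows, resp = [], []
    for t in range(start, len(y)):
        row = [1.0]                                        # intercept (an assumption)
        row += [float(x[t - k]) for k in range(p + 1)]     # x_t, ..., x_{t-p}
        row += [float(y[t - k]) for k in range(1, n + 1)]  # y_{t-1}, ..., y_{t-n}
        rows.append(row)
        resp.append(y[t])
    return rows, resp

def linear_predictor(beta, z):
    return sum(b * v for b, v in zip(beta, z))

def log_lik(beta, rows, resp):
    """Poisson log-likelihood under the log link; -inf if exp() overflows."""
    try:
        return sum(y * linear_predictor(beta, z) - math.exp(linear_predictor(beta, z))
                   - math.lgamma(y + 1) for z, y in zip(rows, resp))
    except OverflowError:
        return float("-inf")

def score(beta, rows, resp):
    """Gradient of the log-likelihood: sum over t of (y_t - mu_t) * z_t."""
    grad = [0.0] * len(beta)
    for z, y in zip(rows, resp):
        mu = math.exp(linear_predictor(beta, z))
        for i, v in enumerate(z):
            grad[i] += (y - mu) * v
    return grad

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, m):
            f = aug[r][c] / aug[c][c]
            for k in range(c, m + 1):
                aug[r][k] -= f * aug[c][k]
    sol = [0.0] * m
    for r in reversed(range(m)):
        rest = sum(aug[r][k] * sol[k] for k in range(r + 1, m))
        sol[r] = (aug[r][m] - rest) / aug[r][r]
    return sol

def fit_poisson(rows, resp, iters=50, tol=1e-10):
    """Newton's method with step halving for the Poisson log-likelihood."""
    d = len(rows[0])
    beta = [0.0] * d
    beta[0] = math.log(max(sum(resp) / len(resp), 1e-8))  # start at log of the mean count
    for _ in range(iters):
        grad = score(beta, rows, resp)
        info = [[0.0] * d for _ in range(d)]               # Fisher information
        for z in rows:
            mu = math.exp(linear_predictor(beta, z))
            for i in range(d):
                for j in range(d):
                    info[i][j] += mu * z[i] * z[j]
        step = solve(info, grad)
        base = log_lik(beta, rows, resp)
        scale = 1.0
        while scale > 1e-8:                                # halve until no real decrease
            trial = [b + scale * s for b, s in zip(beta, step)]
            if log_lik(trial, rows, resp) >= base - 1e-9:
                break
            scale /= 2
        beta = trial
        if max(abs(scale * s) for s in step) < tol:
            break
    return beta

x = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # made-up display indicator
y = [1, 3, 0, 2, 4, 2, 1, 5, 0, 1, 3, 1]   # made-up weekly counts
rows, resp = build_design(x, y, p=0, n=1)
beta_hat = fit_poisson(rows, resp)
print([round(b, 3) for b in beta_hat])
```

The three fitted coefficients correspond to the intercept, the current covariate $X_t$ and the lagged count $Y_{t-1}$. In practice a statistical package would normally replace the hand-written solver.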

Clustering of TSC models

Armed with the GLM model for Poisson regression, we can begin clustering the TSC. In order to determine the similarity or dissimilarity between two TSC, a metric is needed to measure the “distance.” The classic Euclidean metric is not adequate for data with time dependence. We will use the empirical Kullback-Leibler (KL) likelihood metric[12], which calculates the distance between two TSC by evaluating the relative fit of their respective models.

Let $\lambda_j$ be a given “model structure” for the data, i.e. an observation-driven Poisson model with specified covariates. The KL metric has the following expression:

$$D_K(\lambda_k, \lambda_j) = \frac{1}{|Y_K|} \sum_{y \in Y_K} \left( \log p(y \mid \lambda_k) - \log p(y \mid \lambda_j) \right)$$
(4)

where $Y_K$ is the set of data objects which belong to cluster $k$. Note that $\log p(y \mid \lambda_k)$ is the log-likelihood of $y$ under model $\lambda_k$. See [2] for discussion on the likelihood of observation-driven Poisson models. The measure is made symmetric by

$$D_{SK} = \frac{D_K(\lambda_k, \lambda_j) + D_K(\lambda_j, \lambda_k)}{2}$$
(5)
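The two formulas above can be sketched in a few lines of Python. To keep the example short, each cluster model is a constant-rate Poisson distribution fitted by the sample mean, which is a deliberate simplification of the observation-driven regression models; the function names and the toy counts are hypothetical. The sketch also assumes that the reversed distance in (5) is averaged over the data of cluster $j$.

```python
import math

def poisson_logpmf(y, rate):
    """log p(y | rate) for a Poisson distribution with rate > 0."""
    return y * math.log(rate) - rate - math.lgamma(y + 1)

def kl_distance(data_k, loglik_k, loglik_j):
    """Eq. (4): average log-likelihood gap over the observations in cluster k."""
    return sum(loglik_k(y) - loglik_j(y) for y in data_k) / len(data_k)

def symmetric_kl(data_k, loglik_k, data_j, loglik_j):
    """Eq. (5): average of the two directed distances."""
    return 0.5 * (kl_distance(data_k, loglik_k, loglik_j)
                  + kl_distance(data_j, loglik_j, loglik_k))

# Toy clusters of counts (made up); each model is the MLE of a constant rate.
data_k = [2, 3, 1, 4, 2]
data_j = [0, 1, 0, 2, 1]
rate_k = sum(data_k) / len(data_k)
rate_j = sum(data_j) / len(data_j)

def model_k(y):
    return poisson_logpmf(y, rate_k)

def model_j(y):
    return poisson_logpmf(y, rate_j)

print(symmetric_kl(data_k, model_k, data_j, model_j))
```

For two constant-rate Poisson models fitted by their sample means, the symmetrized distance reduces to $\tfrac{1}{2}(r_k - r_j)\log(r_k / r_j)$, which gives a quick sanity check on the implementation.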

With the KL metric, we apply a hierarchical bottom-up clustering algorithm. A flowchart of the algorithm is displayed in Figure 1.

Figure 1: The MBC algorithm
Figure 1 (flowchart.png)

The algorithm produces a cluster tree similar to the one in Figure 2. The bottom-up clustering method is easy to visualize, makes it simple to break objects down into groups, and eliminates the need for any stopping criterion.

Figure 2: Sample hierarchical cluster tree
Figure 2 (tree.png)
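The flowchart itself is not reproduced here, so the following is only a generic sketch of bottom-up (agglomerative) clustering driven by the symmetrized KL distance, not a transcription of the authors' algorithm. It again substitutes a constant-rate Poisson model for the observation-driven regression, refits the model after every merge, and keeps merging until a single cluster remains. All names and data are hypothetical.

```python
import math
from itertools import combinations

def series_loglik(series, rate):
    """Log-likelihood of a whole count series under a constant-rate Poisson model."""
    return sum(y * math.log(rate) - rate - math.lgamma(y + 1) for y in series)

def fit_rate(cluster):
    """Pooled MLE of the rate, floored to avoid log(0) for all-zero clusters."""
    counts = [y for series in cluster for y in series]
    return max(sum(counts) / len(counts), 1e-8)

def directed(cluster_k, rate_k, rate_j):
    """Eq. (4), with each time series treated as one data object."""
    gaps = (series_loglik(s, rate_k) - series_loglik(s, rate_j) for s in cluster_k)
    return sum(gaps) / len(cluster_k)

def agglomerate(series_list):
    """Merge the closest pair of clusters until one remains; return the merge history."""
    clusters = {i: [s] for i, s in enumerate(series_list)}
    members = {i: [i] for i in clusters}
    merges = []
    while len(clusters) > 1:
        rates = {c: fit_rate(clusters[c]) for c in clusters}  # refit every cluster model

        def dist(a, b):
            return 0.5 * (directed(clusters[a], rates[a], rates[b])
                          + directed(clusters[b], rates[b], rates[a]))

        a, b = min(combinations(sorted(clusters), 2), key=lambda pair: dist(*pair))
        merges.append((sorted(members[a]), sorted(members[b]), dist(a, b)))
        clusters[a] += clusters.pop(b)
        members[a] += members.pop(b)
    return merges

# Four made-up purchase series: two light buyers and two heavy buyers.
toy_series = [[0, 1, 0, 1], [1, 0, 1, 0], [5, 6, 4, 5], [6, 5, 5, 4]]
for left, right, height in agglomerate(toy_series):
    print(left, right, round(height, 3))
```

The merge history plays the role of the cluster tree: each entry records which groups were joined and at what distance, so the tree can be cut at any height after the fact.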

Data

Finding Relevant Data

Though count data are prevalent in consumer behavior, obtaining commercial data for MBC is expensive. Thus, for this project, we use results from previous studies on marketing data to create a data set that realistically mimics consumer behavior.

Data Simulation

Niraj et al. [8] proposed an economic model for consumer purchases of bacon and eggs. Based on store scanner data, the authors studied the consumer sensitivities to various variables such as personal utility, product prices, product displays, and purchase history. For the purpose of data simulation, key elements from this economic model were borrowed to create our own consumer bacon and eggs purchase data.

Let $Y_{b,t}$ and $Y_{e,t}$ denote a consumer's purchases of bacon and eggs, respectively, during time window $t$. The pair $(Y_{b,t}, Y_{e,t})$ is a bivariate Poisson random variable, which can be modeled using a trivariate reduction[4]:

$$\begin{aligned} Y_{b,t} &= U_{b,t} + U_{be,t} \\ Y_{e,t} &= U_{e,t} + U_{be,t} \end{aligned}$$
(6)

where $U_{i,t} \sim \mathrm{Pois}(\lambda_{i,t})$ independently for $i = e, b, be$. $U_{b,t}$ and $U_{e,t}$ represent the consumer's tendency to buy bacon or eggs independently, while $U_{be,t}$ represents the consumer's tendency to buy the two products together. Note that $Y_{b,t}$ and $Y_{e,t}$ are marginally Poisson, since the sum of two independent Poisson variables is still Poisson.
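A quick way to see the trivariate reduction at work is to simulate it and check the moments: each margin should have mean $\lambda_i + \lambda_{be}$, and the covariance between the two counts should equal $\lambda_{be}$. The sketch below uses only the Python standard library; the rates, the seed and the function names are illustrative assumptions, and the simple Poisson sampler is only suitable for small rates.

```python
import math
import random

def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def bivariate_poisson(rng, lam_b, lam_e, lam_be):
    """Trivariate reduction: the shared component induces the correlation."""
    shared = poisson_draw(rng, lam_be)
    y_b = poisson_draw(rng, lam_b) + shared
    y_e = poisson_draw(rng, lam_e) + shared
    return y_b, y_e

rng = random.Random(0)
draws = [bivariate_poisson(rng, 1.0, 2.0, 0.5) for _ in range(20000)]
mean_b = sum(b for b, _ in draws) / len(draws)
mean_e = sum(e for _, e in draws) / len(draws)
cov_be = sum((b - mean_b) * (e - mean_e) for b, e in draws) / len(draws)
print(round(mean_b, 2), round(mean_e, 2), round(cov_be, 2))
```

With rates 1.0, 2.0 and 0.5, the sample means should land near 1.5 and 2.5 and the sample covariance near 0.5, up to simulation noise.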

Recall the Poisson log link function for GLM: $\log\lambda = Z^T\beta$. In the simulation, $\lambda_{b,t}$ and $\lambda_{e,t}$ are modeled using exogenous covariates (utility, price and product displays) as well as one lag of the response, $Y_{t-1}$, i.e. the quantities of each product purchased in the previous time period:

$$\begin{aligned} \log\lambda_{b,t} &= \beta_{b,0} + \beta_{b,1}\mathrm{Util}_{b,t} + \beta_{b,2}\mathrm{Price}_{b,t} + \beta_{b,3}\mathrm{Disp}_{b,t} + \beta_{b,4}\mathrm{Disp}_{e,t} + \beta_{b,5}Y_{b,t-1} + \beta_{b,6}Y_{e,t-1} \\ \log\lambda_{e,t} &= \beta_{e,0} + \beta_{e,1}\mathrm{Util}_{e,t} + \beta_{e,2}\mathrm{Price}_{e,t} + \beta_{e,3}\mathrm{Disp}_{b,t} + \beta_{e,4}\mathrm{Disp}_{e,t} + \beta_{e,5}Y_{b,t-1} + \beta_{e,6}Y_{e,t-1} \end{aligned}$$
(7)

For simplicity, $\log\lambda_{be,t} = \beta_{be,0}$.

The consumer's utility, $\mathrm{Util}$, is assumed to follow a Gumbel distribution [8] with location $=0$ and scale=0. After consulting local grocery stores, we let $\mathrm{Price}_b \sim N(4, 0.7)$ and $\mathrm{Price}_e \sim N(3, 0.3)$. $\mathrm{Display}$ indicates whether the product was advertised in store. This indicator variable is either on (1) or off (0) with probability $p$.

Figure 3: Simulated purchases of bacon and eggs for 100 weeks for one consumer with sensitivities: $\beta_b = (-2.4, 1, -0.05, 0.8, 0.3, -0.5, 0.2)$, $\beta_e = (-0.6, 1, -0.02, 0.3, 1.50, 0.2, -0.5)$ and $\beta_{be} = -0.2$.
Figure 3 (be_purchase.png)

A realization of one consumer's purchases over time is plotted in Figure 3. We notice a few things in this plot that make it “realistic”: only small quantities are purchased; when a higher quantity was purchased in a previous period, fewer units were purchased during the next period; and the purchases of the two products seem correlated, as a number of peaks overlap.
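The simulation described in this section can be sketched as follows, using the sensitivities listed in the caption of Figure 3 and the link functions in (7). Several inputs are not recoverable from the text and are assumptions of this sketch: the Gumbel scale is set to 1 (the value printed above, 0, would make the distribution degenerate), the second argument of each normal price distribution is treated as a standard deviation, the display probability $p$ is set to 0.2, utilities are drawn separately for each product, and purchases before the first week are taken to be zero.

```python
import math
import random

BETA_B = (-2.4, 1, -0.05, 0.8, 0.3, -0.5, 0.2)   # bacon sensitivities (Figure 3)
BETA_E = (-0.6, 1, -0.02, 0.3, 1.50, 0.2, -0.5)  # egg sensitivities (Figure 3)
BETA_BE = -0.2                                    # log-rate of the shared component

def poisson_draw(rng, lam):
    """Knuth's multiplication method; intended for small rates."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def gumbel_draw(rng, scale=1.0):
    """Inverse-CDF draw from a Gumbel distribution with location 0 (scale assumed)."""
    u = rng.random()
    while u == 0.0:            # avoid log(0)
        u = rng.random()
    return -scale * math.log(-math.log(u))

def rate(beta, covariates):
    """Log link: lambda = exp(Z^T beta)."""
    return math.exp(sum(b * v for b, v in zip(beta, covariates)))

def simulate(weeks=100, display_prob=0.2, seed=1):
    rng = random.Random(seed)
    y_b, y_e = 0, 0                                   # assumed initial purchases
    purchases = []
    for _ in range(weeks):
        disp_b = 1 if rng.random() < display_prob else 0
        disp_e = 1 if rng.random() < display_prob else 0
        z_b = (1, gumbel_draw(rng), rng.gauss(4, 0.7), disp_b, disp_e, y_b, y_e)
        z_e = (1, gumbel_draw(rng), rng.gauss(3, 0.3), disp_b, disp_e, y_b, y_e)
        shared = poisson_draw(rng, math.exp(BETA_BE))  # joint bacon-and-eggs purchases
        y_b = poisson_draw(rng, rate(BETA_B, z_b)) + shared
        y_e = poisson_draw(rng, rate(BETA_E, z_e)) + shared
        purchases.append((y_b, y_e))
    return purchases

history = simulate()
print(history[:10])
```

Because last week's counts enter this week's rates with negative own-product coefficients, a large purchase tends to be followed by a smaller one, which is the pattern noted in the discussion of Figure 3.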

Ongoing Work

Modeling consumer purchases

Similar to data simulation, we model the consumer purchases of bacon and eggs using a trivariate reduction.

$$\begin{aligned} Y_{b,t} &= U_{b,t} + U_{be,t} \\ Y_{e,t} &= U_{e,t} + U_{be,t} \end{aligned}$$
(8)

where $U_{i,t} \sim \mathrm{Pois}(\lambda_{i,t})$ for $i = b, e, be$. However, we constrain the covariates to include only observable variables: price, display and past purchase. Thus, in the log link function for the Poisson GLM, we model $\lambda$ as follows:

$$\begin{aligned} \log(\lambda_{b,t}) &= \beta_{b0} + \beta_{b1}\mathrm{Price}_{b,t} + \beta_{b2}\mathrm{Disp}_{b,t} + \beta_{b3}\mathrm{Disp}_{e,t} + \beta_{b4}Y_{b,t-1} + \beta_{b5}Y_{e,t-1} \\ \log(\lambda_{e,t}) &= \beta_{e0} + \beta_{e1}\mathrm{Price}_{e,t} + \beta_{e2}\mathrm{Disp}_{b,t} + \beta_{e3}\mathrm{Disp}_{e,t} + \beta_{e4}Y_{b,t-1} + \beta_{e5}Y_{e,t-1} \\ \log(\lambda_{be,t}) &= \beta_{be0} \end{aligned}$$
(9)

Methods for estimating bivariate Poisson regression models are available in the R package “bivpois” [5]. We can compare the $\hat{\lambda}$ generated by the model against the “real” $\lambda$ used in the simulation. Note that $\hat{\lambda}_{i,t} = \widehat{\lambda_{i,t}} + \widehat{\lambda_{be,t}}$ for $i = b, e$.

Figure 4: Comparing fitted $\hat{\lambda}$ (solid line) with the “real” $\lambda$ (black dots) used in data simulation
Figure 4 (be_purchase_fit.png)

We also tested the robustness of the regression model to varying strengths of $\lambda_{be,t}$, the covariance term. A summary of the simulation studies is presented in Figure 5.

Figure 5: Median value of estimated regression coefficients compared to “real” betas used in simulation. (Top row: bacon; Bottom row: egg)
Figure 5 (beta_change.png)

The regression is slightly more accurate with a lower $\lambda_{be,t}$.

Next Steps

We are currently working to extend the univariate MBC method to the bivariate case. The extension process consists of developing the bivariate model for the simulated consumer TSC data, deriving the bivariate KL metric, and improving the clustering algorithm to cluster bivariate models. So far, we have developed a working bivariate Poisson regression model using the bivpois package. The clustering algorithm is still under development.

Implications of MBC

Our application of MBC in market segmentation offers insights about consumers in several ways. First, the models that describe each cluster are easy to interpret, since the parameters that make up the model directly represent the consumer's “sensitivities” to observable exogenous variables. There are three exogenous variables considered in our simple bivariate model fitted to the simulated data: product price, product display, and quantity purchased during the previous time period. Several useful interpretations can be made from studying the parameters. For example, a negative $\beta_{b1}$ reveals that price affects bacon purchases negatively, while a positive $\beta_{b2}$ implies that advertising bacon promotes the sale of bacon. We can also compare the magnitudes of $\beta$ across consumer segments to detect how differently consumers react to price, display and past purchases.
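To make the reading of the coefficients concrete, here is a short worked calculation under the log link. It borrows the bacon display and price coefficients of the simulated consumer in Figure 3 (0.8 and -0.05), so the numbers illustrate simulation inputs rather than fitted estimates, and they describe the product-specific rate $\lambda_{b,t}$ rather than the marginal mean $\lambda_{b,t} + \lambda_{be,t}$.

```latex
\lambda_{b,t} = \exp(Z_t^T \beta_b)
\quad\Longrightarrow\quad
\frac{\lambda_{b,t}\big|_{\mathrm{Disp}_{b,t}=1}}{\lambda_{b,t}\big|_{\mathrm{Disp}_{b,t}=0}}
  = e^{0.8} \approx 2.23,
\qquad
\frac{\lambda_{b,t}\big|_{\mathrm{Price}_{b,t}=x+1}}{\lambda_{b,t}\big|_{\mathrm{Price}_{b,t}=x}}
  = e^{-0.05} \approx 0.951 .
```

In words, with all other covariates held fixed, an in-store display roughly doubles this consumer's independent bacon purchase rate, while a one-unit price increase lowers it by about 5 percent.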

Second, the bivariate nature of the model also contains information about the cross-product effects between products. This is useful if the two products are closely related, which is true in our case of bacon and eggs. For example, examining $\beta_{b3}$, the parameter for the display of eggs in (9), indicates how much egg advertisements influence the consumer's bacon purchases. We incorporate time series components into the model by including the quantities of bacon and eggs purchased in the previous time period. We might expect the products to have a negative autocorrelation but a positive cross-correlation. For instance, if a customer purchased a large quantity of bacon on the last shopping trip, he might purchase less bacon but more eggs on the subsequent trip. In addition, the parameter for the correlation term acts as an overall indicator of how bacon and egg purchases are related. This information is especially useful in deciding whether a cross-product marketing strategy is appropriate in the first place.

Finally, the hierarchical bottom-up clustering tree allows the manager to group consumers in a way that makes most sense. It is easy to visualize the size and memberships of consumer groups with this clustering algorithm.

Directions for Future Research

To discover the true potential for MBC in marketing research settings, we still need to apply it to actual consumer data. If studies using simulated data are promising, we would like to collaborate with marketing researchers to implement MBC on actual store scanner data and other marketing TSC data.

One interesting type of TSC data is e-commerce data. As more business is conducted online, companies must adapt to respond optimally to consumer behavior on the web. Many market segmentation protocols are already in place for some companies. For example, Google displays “Sponsored links” or advertisements related to the user's search term. Large online stores such as Amazon.com, Buy.com and eBay.com all offer customized recommendations for users. MBC's ability to reveal consumer sensitivities can help make online marketing more effective.

With new sets of data available for MBC research, we will likely encounter TSC data that contain many extra zeroes. Zero-inflation is a common feature in count data and can cause problems for simple models that do not treat the extra zeroes properly. In the development of univariate MBC in the pollution study [10], a technique called zero-inflated Poisson (ZIP) regression [6] was used to accurately capture the zeroes that were prevalent in the TSC. Consumer TSC data that might have extra zeroes include purchases of durable goods such as appliances or cars. Applying the ZIP regression model in these instances may broaden the scope of applications of MBC in marketing research.

From a statistical modeling point of view, an immediate extension for the project is to generalize the model beyond the bivariate case. The model for multivariate Poisson regression becomes more complex as the dimension increases because the number of correlation terms increases. The problem can be illustrated using the trivariate case. In the Data section, we described how the bivariate Poisson is constructed via the trivariate reduction. For the trivariate case the multivariate reduction becomes more complicated:

$$\begin{aligned} Y_{a,t} &= U_{a,t} + U_{ab,t} + U_{ac,t} + U_{abc,t} \\ Y_{b,t} &= U_{b,t} + U_{ab,t} + U_{bc,t} + U_{abc,t} \\ Y_{c,t} &= U_{c,t} + U_{ac,t} + U_{bc,t} + U_{abc,t} \end{aligned}$$
(10)

now with $U_{ab,t}$, $U_{ac,t}$, $U_{bc,t}$, and $U_{abc,t}$ as correlation terms. In general, for the $N$-variate case, the total number of correlation terms is $\sum_{i=2}^{N} \binom{N}{i} = 2^N - N - 1$.
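The growth in the number of latent terms is easy to tabulate. The short sketch below enumerates the index sets of the latent Poisson variables for a given list of products and counts those shared by two or more products; the function names are hypothetical and the code only illustrates the counting argument.

```python
from itertools import combinations
from math import comb

def latent_terms(products):
    """Index sets of the latent Poisson variables in the multivariate reduction."""
    return [subset
            for size in range(1, len(products) + 1)
            for subset in combinations(products, size)]

def correlation_term_count(n):
    """Number of latent terms shared by at least two of n products."""
    return sum(comb(n, i) for i in range(2, n + 1))

terms = latent_terms("abc")
# Terms entering Y_a are exactly those whose index set contains 'a', as in (10).
print([t for t in terms if "a" in t])
for n in (2, 3, 5, 10):
    print(n, correlation_term_count(n))
```

The count grows exponentially in $N$, which is why a direct extension of the reduction quickly becomes impractical without restricting the correlation structure.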

Acknowledgements

The authors would like to thank Dr. Kathy Ensor, Department of Statistics, Rice University, and Dr. Bonnie Ray, IBM T. J. Watson Research Center, for their valuable insights and guidance.

This Connexions module describes work conducted as part of Rice University's VIGRE program, supported by National Science Foundation grant 0739420.

References

  1. Aaker, David A. and Kumar, V. and Day, George S. (2000). Marketing Research. (7th ed.). John Wiley & Sons, Inc.
  2. Fokianos, Konstantinos and Kedem, Benjamin. (2004). Partial Likelihood Inference for Time Series Following Generalized Linear Models. Journal of Time Series Analysis, 25(2), 173-197.
  3. Krieger, Abba M. and Green, Paul E. (1996, August). Modifying Cluster-Based Segments to Enhance Agreement with an Exogenous Response Variable. Journal of Marketing Research, 33(3), 351-363.
  4. Kocherlakota, Subrahmaniam and Kocherlakota, Kathleen. (1992). Bivariate Discrete Distributions. Marcel Dekker, Inc.
  5. Karlis, Dimitris and Ntzoufras, Ioannis. (2005, September). Bivariate Poisson and Diagonal Inflated Bivariate Poisson Regression Models in R. Journal of Statistical Software, 14,
  6. Lambert, Diane. (1992, February). Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing. Technometrics, 34(1), 1-14.
  7. Leeflang, Peter S.H. and Wittink, Dick R. (2000). Building Models for Marketing Decisions: Past, Present and Future. International Journal of Research in Marketing, 17, 105-126.
  8. Niraj, Rakesh and Padmanabhan, V. and Seetharaman, P.B. (2008, March-April). A Cross-Category Model of Households' Incidence and Quantity Decisions. Marketing Science, 27(2), 225-235.
  9. Smith, Wendell R. (1956, July). Product Differentiation and Market Segmentation as Alternative Marketing Strategies. The Journal of Marketing, 21(1), 3-8.
  10. Thomas, Sarah J. and Ray, Bonnie K. and Ensor, Kathy B. (2009). A Model-based Approach for Clustering Air Quality Monitoring Networks in Houston, Texas. Technical report. Rice University.
  11. Wedel, Michel and Kamakura, Wagner A. (1998). Market Segmentation: Conceptual and Methodological Foundations. Kluwer Academic Publishers.
  12. Zhong, Shi and Ghosh, Joydeep. (2003, November). A Unified Framework for Model-based Clustering. Journal of Machine Learning Research, 4, 1001-1037.
