Volume 4 (2020): Issue 4 (December 2020)
Journal Details
License
Format
Journal
First Published
30 Mar 2017
Publication timeframe
4 times per year
Languages
English
access type Open Access

Making Meaningful User Segments from Datasets Using Product Dissemination and Product Impact

Published Online: 27 Nov 2020
Page range: 237 - 249
Received: 24 Aug 2020
Accepted: 29 Oct 2020

Online companies face large user populations, making segmentation a daunting exercise. Demonstrating an approach that facilitates user segmentation, this research leverages product dissemination and product impact metrics with normalized Shannon entropy. Using 4,653 products from an international news and media organization with 134,364,449 user-product engagements, we isolate the key products with the widest product dissemination and the least product impact using entropy-based measures, effectively capturing the engagement levels. We demonstrate that a small percentage (0.33% in our dataset) of products are so widely disseminated that they are non-discriminatory, and a large percentage of products (17.02%) are discriminatory but have so little dissemination that their impact is negligible. Our approach reduces the product dataset by 17.35% and the number of user segments by 8.18%. Implications are that organizations can isolate impactful products useful for user segmentation to enhance the user focus.

Introduction

A critical task in domains such as marketing, advertising, system design, social media content publishing, and website architecture, among others, is understanding the users, and substantial work in many of these domains aims to understand user behavior patterns (Gu, Gao, Tan, & Peng, 2020; Warnaby & Shi, 2019). User understanding aids, for example, in future product creation, marketing, and advertising, and in virtually any consumer-facing aspect of a business (Chan, Green, Lekwijit, Lu, & Escobar, 2019; Liang & Wu, 2019; Wu & Yu, 2020). Nevertheless, “understanding the user” is often a misnomer, as in many contexts, there is not one user but many user groups (Morisada, Miwa, & Dahana, 2019). These user groups are referred to as user segments.

However, similar to finding irrelevant documents (Li, Chen, & Qi, 2019), identifying user segments in many situations is difficult for various reasons, from “lack of data,” to “privacy concerns,” to isolating “what data to use” (Böttcher, Spott, Nauck, & Kruse, 2009). Addressing this last difficulty is our research goal, in support of improving user segmenting and related activities that require segmentation of user populations. This is the “why” of our research – determining what portion of the often overly abundant product data one should focus on for user segmentation tasks.

Determining what product interactions to use for user segmentation can be surprisingly challenging (Simester, Timoshenko, & Zoumpoulis, 2020). Many companies have multiple products, and users interact with more than one product offering from a given company. By products, we mean anything produced for users, including tangible goods, videos, services, social media postings, advertisements, etc. Some products may be wildly popular (i.e., many purchases, going viral, or countless clicks). Yet, such a product can be, at the same time, non-discriminatory in that nearly all user segments interact with it. Such products are not effective for segmenting because segmentation relies on information that differentiates users (Jenkinson, 1994).

Similarly, other products may be extremely discriminatory, but the volume of user engagements (e.g., purchases, views, clicks, trips, etc.) may be so low as to not be impactful (i.e., few people from the user population interact with the product). Using products of either type introduces unneeded complexity into the segmenting process. These two types of products are typical in online environments in which product consumption tends to follow power-law distributions (Ratkiewicz, Fortunato, Flammini, Menczer, & Vespignani, 2010) – a few products that are hits and a lot of products that attract little attention.

It is this dichotomy that motivates our research approach to identify products that are both discriminatory and impactful. We aim to isolate the meaningful (i.e., useful) products for practical decision-making concerning user segmenting. These useful products are identified via the discriminatory products, in that they have balanced dissemination and impact for the overall user population.

In this research, the specific products we investigate are online videos produced by an international media organization and posted on YouTube (a major social media platform). These videos have associated user demographic attributes and product engagement (in this case, views), which we use to identify the optimal values for product discrimination. We develop entropy-based measures for identifying discriminatory products while considering each user segment's level of engagement with those products. We identify meaningful user segments by deriving product dissemination and product impact metrics. These two metrics are as follows:

Product dissemination – the number of user segments to which a product has been distributed.

Product impact – the number of user engagements with a given product from all users within a population.

Beginning with a prior research section, we then introduce our research objectives, data collection site, and we then present our methods and results. We end with the discussion, implications, and directions for our next stages of research.

Prior Literature

User segmenting (Mellor, 2006) is the process of dividing a group of people into homogeneous subgroups that differ from other subgroups (Jenkinson, 1994), typically based on behaviors and demographics, grounded on some product, brand, advertisement (García-Sánchez, Colomo-Palacios, & Valencia-García, 2020), or content (Stern, 1994), with many factors affecting product engagement by users (Kalaivani & Sumathi, 2019). The identification of user segments has been important in marketing and advertising (Smith, 1956) for some time, and it is increasingly important in the technology and online content publishing domains, such as online news media (Kwak, An, Salminen, Jung, & Jansen, 2018) and online content. The identification of user segments is typically aimed at the understanding of a subset of people's reactions, interactions, uses, etc., based on one or more key performance indicators (KPI) (Jiang, Chi, & Gao, 2017; Wei & Wei, 2017), to achieve some goal or objective, such as increasing revenue, increasing market share, or designing future content.

In pursuit of our research objectives, we investigate the use of user behavior analytics, specifically using dissemination (Song, Lin, Tseng, & Sun, 2005), to identify impactful user segments (i.e., impactful as in meaningful in terms of user population size) based on online products from a major news and media corporation. Addressing product dissemination is a continuation of efforts within many disciplines to effectively and efficiently use user analytics data. Product dissemination can be conceptually defined as measuring the product reach across user segments. Thus, products that appeal to large numbers of segments achieve higher product dissemination.

However, users’ product consumption profiles are more complicated – they might be similar in their preference for popular products but dissimilar in terms of non-popular products. Moreover, niche segments are not interested in the most popular products, but focusing on the “hits” can easily hide these nuances. So, new methods are needed to discover these user segments and simplify user analytics (Gu et al., 2020) for improved decision-making.

Bawden and Robinson (2015b) analyzed the relationship between information and complexity and highlighted that although information is used in the measure of complexity, there is no agreed-upon definition for the multifaceted concept of complexity within the analytics domain. In terms of our research, complexity is measured as the number of products interacted with by a user population. An increase in the number of products for segmentation increases complexity, which is especially detrimental because product metrics are intended for use by people, and complexity decreases the ability to take meaningful actions based on those metrics. Information theory (Shannon, 1948) is the foundation for Shannon's entropy, which attempts to quantify the amount of information in a variable.

Although a seemingly valid measure for analyzing user behavior data, Shannon's entropy has only limited use. This may be because it is primarily a networking and computer science measure (Jansen & Rieh, 2010). Bawden and Robinson (2015a) pointed out that information is not synonymous with entropy, although information and entropy are strongly related. In this research, we directly use the concept of Shannon entropy in the field of user analytics. Cooper (1983) proposed that information systems could be overcome by leveraging the maximum entropy principle. The system would be sensitive to the frequencies of products in the collection resulting in increased power and expressiveness without an increase in complexity. This increase in effectiveness without an increase in complexity is why we use entropy in the research presented here.

The concept has limited use in earlier research in the user behavior domains. Concerning user product expectations, entropy was used to isolate the importance of discounts (Haghighatnia, Abdolvand, & Rajaee Harandi, 2018). Yang, Shan, Jiang, Yang, and Yao (2018) used entropy to align new products and consumer needs, with an aim, similar to ours, of managing complexity. Entropy has also been used to align consumer product reviews that are helpful and important, based on the novelty of information (Fresneda & Gefen, 2019).

This research builds on this prior work, leveraging the concepts of dissemination, impact, and entropy within product collections. However, we take our research in the rather novel direction of user segmentation, which is more of a Web analytics area, increasingly important in the management domain with the growth in online businesses. We identify specific products, in this case, videos, to discern meaningful user segments that can then be used for advertising, marketing, content creation, and system development (Jansen, Salminen, & Jung, 2020). The increased availability of Web analytics data has opened a fresh avenue within the field of management science for the use of entropy-based metrics. We demonstrate that normalized Shannon's entropy provides insights beyond standard Web analytics metrics, such as the number of views, and serves as a measure to determine impactful user engagements with products. This research substantially builds on prior research (Jansen, Jung, Salminen, An, & Kwak, 2017), specifically by employing an entirely new and larger product dataset and by adding entropy measures to address user segmenting in a novel way, showing that this measure has substantial advantages relative to simply using user engagement counts.

Research Goal and Objectives

Our research goal is to reduce the complexity of user product datasets to facilitate more meaningful user segmentation. For this, we develop an approach for identifying such meaningful products by creating metrics for identifying product dissemination and product impact values. The use of these analytic metrics would be valuable in various venues needing to address segmenting of user populations (An, Kwak, Jung, Salminen, & Jansen, 2018).

In support of this research goal, we investigate the following objectives:

Research Objective 1: create metrics to identify products that do not meaningfully contribute to user population segmentation;

Research Objective 1a: identify products that are so prevalent across segments in the user population that they are not meaningful for segmentation.

Our premise is that some products are so widely disseminated within a population that they are not worthwhile attributes for segmentation, as they cross multiple segments. The concept of user segmenting is isolating differences among segments. We define product dissemination, which allows us to calculate an upper bound on product discrimination and prevalence.

Research Objective 1b: identify products that are so uncommon in the user population that they are not meaningful for segmentation.

Our premise is that products have such a marginal impact within a population that they do not meaningfully contribute to segmentation, as so few users interact with these products. We view this as product impact, which allows us to calculate a lower bound on product discrimination and prevalence.

Research Objective 2: develop measures determining values for the upper and lower bounds of the created metrics;

Research Objective 2a: develop a measure for determining the appropriate value for product dissemination;

Research Objective 2b: develop a measure for determining the appropriate value for product impact.

Based on a review of prior work, our premise is that rather than employing heuristics to determine what products to discard for user segmentation, we can statistically determine these values using normalized Shannon's entropy.

Methodology and Analysis

We validate our premise about the value of product dissemination and product impact for user segmentation using actual user data from a major online media and mobile channel based in the United States. Understanding users is notably important in the highly competitive online media industry to increase the consumption of digital content and acquire relevant information concerning news events that may be important to users.

Data Collection Site

Our user population data are from a major international media and content production company. We use the company's YouTube channel as the data collection site for the research reported here. The technique employed is generalizable to most social media channels and to in-house customer relationship management (CRM) data. The chief reason to focus on YouTube products is that the analytics platform gives detailed statistics for every product (i.e., video) posted, and the user population is large. Hence, it provides a good case study to evaluate the viability of our approach. As an example of a YouTube video, see Figure 1, noting specifically the number of views.

Figure 1

Sample YouTube video from an online media company's YouTube channel, with the number of views, which we use as the user engagements for this research.

The YouTube analytics platform is quite robust in the spectrum of product variables, and it also provides user profile attributes (e.g., gender, age, and country location) for each product, although at an aggregate level. We employ these user and product attributes to determine product dissemination and product impact in addressing our research objectives. We can then identify meaningful user segments based on product engagements and related user demographics.

To access the YouTube analytics platform data, one uses the YouTube application programming interface (API).

https://developers.google.com/youtube/analytics/

Although there are various product KPIs, we focus only on viewCount (the number of views). However, depending on the product dataset, the metric could be any form of user-product engagement, such as purchases, reviews, clicks, reservations, etc.

We collected 4,653 video products produced during the period from June 2014 to May 2020, with 134,364,449 user-product engagements, which are views, for our data analysis. We note that the YouTube analytics data are aggregated, so the analytics is privacy-preserving for the individual user (no personally identifiable information is accessed). The data analytics metric values for the YouTube channel are private and available only to the owner of the YouTube channel and thus not publicly accessible. However, the data are available to every account holder of each YouTube channel, so the analysis is directly applicable. Additionally, the overall approach is transferable to nearly any user dataset arranged via products and engagements. The specific user and product attributes that we use for this research are as follows:

User attributes

Age group – YouTube users are classified into multiple age categories (13–17 years, 18–24 years, 25–34 years, 35–44 years, 45–54 years, 55–64 years, and ≥65 years), so seven possible age categories for a user.

Gender – YouTube users are classified into either male or female, so two possible categories.

Country – YouTube uses the two-letter ISO-3166-1 country code index to classify where users are from, with 249 current officially assigned country codes at the time of this study.

Product engagements

Count of engagements – YouTube provides the number of views per video segregated by gender and by age grouping.

The data were collected and processed automatically using a script and system used to automatically generate personas from in-house user data, Web analytics, and social media analytics (Jansen et al., 2020; Jung et al., 2017). User segmentation is the first step in creating these data-driven personas, so the persona analytics system is a use case for the applicability of the research methodology presented here for user segmentation.

Methods

We now begin our approach for identifying the impactful products within the dataset to discover meaningful user segments. Specifically, we create and define the two concepts of product dissemination and product discrimination and define bounds for these measures to identify the impactful products. We first walk through the creation of metrics based solely on views and then present a metric using normalized Shannon's entropy that we use to more robustly show the distinctness of the user segments derived from this product reduction.

Calculation of PDissemination and PImpact.

For this, we calculate values for the product dissemination (PDissemination) and product impact (PImpact) levels of each product. Conceptually, we define our metrics as

PDissemination is a measure of how widely a product is disseminated within a user population as measured by the number of user segments containing at least one user that has interacted with that product (i.e., breadth).

PImpact is a measure of a product's popularity within a user population as measured by the number of user engagements with that product. A user engagement can be a view, a purchase, a download, etc. (i.e., volume).

PDiscrimination is a measure for identifying products within a product dataset that contribute to meaningful user segmenting. Specifically, PDiscrimination for such products is bounded by PDissemination below a given level and PImpact above a given level. This means that a product must have a reasonable spread among user segments, and the product must also have a reasonable level of engagement.

We first look at how widely the product is disseminated (although there may be a correlation between the breadth that a product is interacted with among all user segments and volume of engagements, this is not a necessary condition). We more formally define product dissemination (PDissemination) as follows:

$$P_{Dissemination} = \frac{Segments_{Product}}{Segments_{Total}}$$

where SegmentsProduct is the number of user segments that interact with a product from all segments of the user population, and SegmentsTotal is the number of user segments that have interacted with any product from the complete set of products.

We define PDissemination for each product in the collection, giving us a measure of how widely the product has been disseminated relatively based on the number of user segments containing users that have interacted with each product.

We then look at the volume of product engagements. Again, there may be a correlation between the breadth that a product is engaged with among all user segments and the volume of engagements, but this is not a necessary condition. We more formally define product impact (PImpact) as follows:

$$P_{Impact} = \frac{Engagements_{Product}}{Engagements_{Totals}}$$

where EngagementsProduct is the number of user interactions with a specific product, and EngagementsTotals is the total number of engagements from all users with all products or any product from the complete set of products.

We define PImpact for each product in the collection, giving us a measure of how impactful a product is in the set of products relatively based on the number of user engagements with each product.

With these two measures, we can then identify those products that are meaningful for user segmentation, in that they are impactful but not too widely disseminated. These are the products that have product discrimination (PDiscrimination) defined as follows:

$$P_{Impact} < P_{Discrimination} < P_{Dissemination},$$

where $P_{Dissemination} < \tau_U$ and $P_{Impact} > \tau_L$, and τL is a predefined lower limit of product engagements, while τU is a predefined upper limit of a number of user segments.

In particular, τL is determined by the number of product engagements across the user populations, and τU is determined by the number of user segments, specifically those products with a large number of user segments. We first determine τL and τU heuristically to show the approach conceptually. We then develop a more rigorous analysis under Research Objective 2.

PDiscrimination allows us to find an optimal range of products, specifically identifying products that are somewhat associated with specific user segments but also have a reasonable number of engagements. However, the approach could apply to any distributed product for which one is interested in user segmentation. Products that have a large PDissemination would not be good candidates to identify unique user segments, as these products are prevalent for many user segments, which informs our choice of τU. Products with a small PImpact may not have enough effect, as measured by a metric such as count of views, to make the user segment identification worthwhile, which informs our choice of τL.
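The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The row layout `(product, segment, count)`, the function names, the sample data, and the threshold values are all assumptions made for the example. When `segments_total` is not supplied, the sketch counts every (segment, product) pair with at least one engagement, which is one reading of the baseline used in the Results section.

```python
from collections import defaultdict


def dissemination_and_impact(rows, segments_total=None):
    """Return {product: (p_dissemination, p_impact)}.

    rows: iterable of (product, segment, count) tuples (hypothetical layout).
    """
    segments = defaultdict(set)  # product -> segments with >= 1 engagement
    volume = defaultdict(int)    # product -> total engagements
    for product, segment, count in rows:
        if count > 0:
            segments[product].add(segment)
            volume[product] += count
    if segments_total is None:
        # Assumption: Segments_Total counts every (segment, product) pair.
        segments_total = sum(len(s) for s in segments.values())
    engagements_total = sum(volume.values())
    return {
        p: (len(segments[p]) / segments_total, volume[p] / engagements_total)
        for p in segments
    }


def discriminative_products(metrics, tau_u, tau_l):
    """Keep products with P_Dissemination < tau_u and P_Impact > tau_l."""
    return sorted(p for p, (d, i) in metrics.items() if d < tau_u and i > tau_l)


# Toy data: v1 is a widely viewed "hit", v3 is a niche video with few views.
rows = [
    ("v1", "US-male-25-34", 900), ("v1", "US-female-25-34", 800),
    ("v1", "QA-male-18-24", 700), ("v2", "US-male-25-34", 500),
    ("v2", "QA-male-18-24", 95), ("v3", "QA-male-18-24", 5),
]
metrics = dissemination_and_impact(rows)
kept = discriminative_products(metrics, tau_u=0.5, tau_l=0.01)
print(kept)
```

With these illustrative thresholds, the hit is dropped for being too widely disseminated and the niche video for having too little impact, leaving only the middle product.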

With this product dissemination and product impact approach, we can analyze our product dataset to determine the appropriate PDissemination and PImpact values to identify both distinct and meaningful products for user segmentation. In a sense, our approach is an adaptation of the concept of term frequency – inverse document frequency (tf – idf) (Havrlant & Kreinovich, 2017) from information retrieval to the user segmentation area. A product is a word, and a user segment is a document. Following that analogy, term frequency (i.e., product engagements) maps to the number of engagements for a product by a given user segment, and inverse document frequency maps to SegmentsProduct. Thus, it maps also to PDissemination because SegmentsTotal is constant. Then, just as we search for a document by a keyword, we can search for a user segment via a product. So, our approach is built on sound prior conceptual algorithmic work in related fields.

Calculation of PEntropy

Although PDissemination indicates how many user segments interact with a product and PImpact indicates the overall level of user engagement, they do not consider the level of product engagement within user segments. For example, consider two products with the same total engagement counts but with a different distribution of user segments’ engagement. For one product, the engagement counts could come equally from many user segments. In this case, each user segment contributes 1/n of the engagement counts for this product, where n is the number of user segments. For the other product, 99% of the engagement counts could come from one segment and 1% from all the other n - 1 user segments. Although these two products are highly different, they have the same PDissemination, which is n/n = 1.0. A similar analysis can be done for PImpact.

In order to capture such differences in the level of engagements, we leverage the concept of the normalized Shannon entropy and define an additional metric, PEntropy, as follows:

$$P_{Entropy} = -\sum_{i=1}^{n} \frac{p_i \log p_i}{\log n}$$

where $p_i$ is the proportion of the interactions count from the user segment i over the total view counts, and n is the number of user segments. As the Shannon entropy is maximized when $p_i = 1/n$ for all i, we normalize it by log n so that the resulting value has the range from 0 to 1, where 0 means low diversity (i.e., most of the engagements come from a few segments) and 1 means high diversity (i.e., all segments equally contribute to view count).

Theoretically, PEntropy captures how equally a given product is engaged with by each of the user segments. In other words, products with a high PEntropy would not be good candidates to identify unique user segments, as these products are interacted with equally by user segments. By contrast, products with low PEntropy would be good candidates to identify unique segments, as user segments unequally engage with them. Among other potential advantages, we employ PEntropy for developing a more rigorous algorithmic approach for calculating τL and τU (as discussed above). From a theoretical perspective, PEntropy is a step in incorporating Shannon's entropy concept within the broader field of user behaviors, especially within the field of management analytics.
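As a minimal sketch of the PEntropy definition, the following Python function computes the normalized Shannon entropy from a list of per-segment engagement counts. It then applies the function to the two-product example described earlier (engagements spread evenly versus 99% from one segment). The function name and the sample counts are illustrative assumptions. Here n is taken as the length of the list passed in, so segments with zero engagements must be included explicitly if they are meant to count toward n.

```python
import math


def normalized_entropy(counts):
    """Normalized Shannon entropy of per-segment engagement counts, in [0, 1]."""
    n = len(counts)
    total = sum(counts)
    if n < 2 or total <= 0:
        return 0.0  # entropy is undefined or trivially zero in these cases
    h = 0.0
    for c in counts:
        if c > 0:  # the term p * log(p) tends to 0 as p tends to 0
            p = c / total
            h -= p * math.log(p)
    return h / math.log(n)


# Two hypothetical products, each with 1,000 views across n = 11 segments.
even = [1000 / 11] * 11      # every segment contributes 1/n of the views
skewed = [990] + [1] * 10    # 99% of the views come from a single segment

print(normalized_entropy(even))    # close to 1: poor candidate for segmentation
print(normalized_entropy(skewed))  # close to 0: engagement is concentrated
```

Both products reach all 11 segments, so they share the same PDissemination, yet their entropy values sit at opposite ends of the range.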

Results

We define a user segment as a unique combination of demographics (country, gender, ageGroup) with a given unique product for our dataset. With 2 gender groups, 7 age groups, and 249 countries, we have an upper limit of 3,486 demographic user segments (i.e., 2 × 7 × 249). Our dataset has 1,313 user segments based on demographics, as not all combinations of gender, age groups, and countries are represented. Also, not all demographic groups engaged with all 4,653 products. Given the 1,313 demographic group engagements with the 4,653 products, we have 83,625 as the maximum number of user segments (SegmentsTotal), which we used as the baseline. Naturally, this is a simplistic measure of user segmentation, with many approaches possible. However, this is an acceptable baseline for demonstrative purposes for this research, as the process remains the same regardless of the segmenting approach actually employed.

We then calculate PDissemination and PImpact for each of the 4,653 products in our dataset to reduce the dataset to identify meaningful user segments.

We first determine the number of unique user segments (SegmentsProduct) for each product, which allows us to calculate PDissemination. We then determine EngagementsProduct for each product, which allows us to calculate PImpact. The results of these calculations are shown in Table 1.

Table 1

Average, Max, Min, Median, and Standard Deviation of SegmentsProduct and EngagementsProduct for the 4,653 Products in the Dataset

| Measure | SegmentsProduct | EngagementsProduct |
|---------|-----------------|--------------------|
| Average | 17.89           | 28,876.95          |
| Max     | 934             | 6,050,914          |
| Min     | 1               | 49                 |
| Median  | 13              | 3,735              |
| SD      | 29.92           | 173,077.37         |

It is interesting to note that the product with the maximum number of user segments has a PDissemination of 1.12%, meaning that users from 934 of the 1,313 demographic user segments (71.14%) engaged with this product. This product would not be a discriminative candidate in our dataset. Conversely, the product(s) with the minimum number of user segments has a PDissemination of 0.001%, meaning that just 1 user segment engaged with this product. So, this product has good discriminative power. However, when we examine the PImpact of this product, in total, it represents 0.000003% of total engagements, which is not impactful. Therefore, this product is not meaningful.
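The two percentages quoted for the most widely disseminated product use different denominators, which is easy to overlook. As a worked check, using only the figures reported in this section (83,625 for SegmentsTotal and 1,313 demographic segments):

```latex
\begin{aligned}
P_{Dissemination}^{\max} &= \frac{934}{83{,}625} \approx 0.0112 = 1.12\% \\
\frac{934}{1{,}313} &\approx 0.711 \quad \text{(share of the demographic segments reached)} \\
P_{Dissemination}^{\min} &= \frac{1}{83{,}625} \approx 0.000012 = 0.0012\%
\end{aligned}
```

The last value is the one the text rounds to 0.001%.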

We visualize the PDissemination and PImpact for all products in a rank-frequency plot as shown in Figure 2(a) and (b).

Figure 2

(a) Plot of rank versus PDissemination and (b) PImpact for each product in the dataset.

From Figure 2, the rank–frequency plots, as expected, follow a power-law distribution, with a head consisting of a relatively small number of products that are very popular with many user segments. Then, the long tail consists of a relatively large number of products that are each popular with only a small number of user segments. This highly skewed pattern of popularity is repeatedly reported in many online services (Cha, Kwak, Rodriguez, Ahn, & Moon, 2007).

When we graph the rank–frequency of PDissemination in a log–log plot, we find PDissemination results in the expected power-law outcome of a nearly straight line, as shown in Figure 3(a) and (b).

As shown in Figure 3, the log–log plot gives us a nearly straight line, with a slight negative slope, although with some curvature at both ends, which is to be expected with only a few thousand data points. The trend line (the dashed line) is a linear equation to which the data points conform quite well (R² = 0.957).
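A straight-line check of this kind can be sketched as an ordinary least-squares fit of log(value) against log(rank). The function name and the synthetic data below are assumptions for illustration; the source does not state which tool produced the trend line. The synthetic values follow an exact power law with exponent 0.8, so the fit should recover a slope near -0.8 with R² near 1.

```python
import math


def loglog_fit(values):
    """Fit log10(value) = slope * log10(rank) + intercept by least squares.

    Returns (slope, intercept, r_squared). Non-positive values are dropped
    because their logarithm is undefined.
    """
    ranked = sorted((v for v in values if v > 0), reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(v) for v in ranked]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Synthetic rank-frequency data following value = 1000 / rank^0.8.
synthetic = [1000.0 / rank ** 0.8 for rank in range(1, 501)]
slope, intercept, r_squared = loglog_fit(synthetic)
print(slope, r_squared)
```

On real data, curvature at the head and tail of the distribution lowers R² below 1, as the paragraph above notes.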

Figure 3

(a) Log–log plot of rank versus PDissemination and (b) PImpact for each product in the dataset.

In Figure 4, to illustrate the dataset, we have highlighted and labeled three areas of interest.

Figure 4

Conceptual plot highlighting products with meaningful PDiscrimination.

The products in the middle region are the products with good PDiscrimination. To the far left, we have the products with high PDissemination, which are products with high dissemination and, therefore, low discriminative power. To the far right, we have the products with low PDissemination, which are products with high discriminatory factors but which may have low PImpact, so the user segments would not be meaningful. To investigate this possible correlation, we need to examine the number of engagements for these low PDissemination products. In our dataset, the volume of engagements positively correlates with the number of user segments (Pearson's rs = 0.93, p < 0.05), so if the exact number of interactions for each product were not available, we could directly use the number of user segments.
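For readers who want to reproduce this kind of check on their own product data, the Pearson correlation between per-product segment counts and engagement counts can be computed as below. This is a generic sketch; the function name and the five sample products are invented for illustration and are not drawn from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-product values: number of segments and number of views.
segments = [1, 5, 13, 40, 120]
views = [60, 900, 3500, 20000, 90000]
r = pearson_r(segments, views)
print(r)
```

A coefficient near 1 would support using the segment count as a stand-in for engagement volume when exact interaction counts are unavailable.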

We first factor out products with very high PDissemination. We identify products with a high PDissemination using the second derivative approach for local extremes, which is a form of the angle method for detecting elbows (Zhao, Xu, & Fränti, 2008). A second derivative test is an approach for determining whether a critical point is a relative minimum or a maximum. The second derivative gives the rate of change of the slope of the curve at any point on a plot. If the second derivative is positive at a point, the plot is concave up, and if negative at a point, the plot is concave down. The second derivative will be zero at an inflection point if the plot is continuous; otherwise, an inflection point can be where the slope of the curve changes.
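The text does not spell out how the second derivative is discretized, so the following is only one common variant, offered as a sketch. It ranks the values in descending order, approximates the second derivative at each interior point with the central second difference, and takes the point of largest positive curvature as the elbow. The function name and the sample values are assumptions for illustration.

```python
def elbow_by_second_difference(values):
    """Return (index, value) of the elbow in a descending rank-ordered curve.

    The second derivative is approximated by the central second difference
    y[i-1] - 2*y[i] + y[i+1]; the elbow is where this is largest.
    """
    ranked = sorted(values, reverse=True)
    if len(ranked) < 3:
        raise ValueError("need at least three points to estimate curvature")
    best_index, best_curvature = 1, float("-inf")
    for i in range(1, len(ranked) - 1):
        curvature = ranked[i - 1] - 2 * ranked[i] + ranked[i + 1]
        if curvature > best_curvature:
            best_index, best_curvature = i, curvature
    return best_index, ranked[best_index]


# Hypothetical P_Dissemination values: a short head, then a flat tail.
values = [0.0112, 0.0100, 0.0090, 0.0040, 0.0038, 0.0036, 0.0035, 0.0034]
elbow = elbow_by_second_difference(values)
print(elbow)
```

In this toy series the curve flattens after the third product, so the elbow falls at the fourth value, and products ranked above it would be treated as too widely disseminated.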

Using our second derivative-based approach, we identify

τU to be equal to or greater than a PDissemination of 0.004. This results in 9 products (0.19% of all products) being removed, representing 13.5% of the total interactions;

τL to be equal to or less than a PImpact of 0.0001. This results in 751 products (16.14% of all products) being removed, representing 0.31% of the total interactions.

Overall, these results are consistent with expectations, given the power-law distribution of our content collection. A few products each have a large number of engagements, so their combined share of engagements is sizeable. Conversely, many products individually have few engagements; collectively, they represent a sizeable portion of the content but only a small share of the interactions. Applying this approach, we reduce our number of user segments by 7.14% (5,945 user segments).

However, as discussed earlier, this approach does not consider each user segment's level of engagement, which leads us to investigate PEntropy as a more robust measure.

Relationship among PDissemination, PImpact, and PEntropy

To demonstrate what our measures capture and do not capture, we compare PDissemination, PImpact, and PEntropy.

Figure 5(a) and (b) show the cumulative distribution function (CDF) of PDissemination and PEntropy, respectively. The curve of PDissemination increases quickly, while that of PEntropy increases slowly. This difference in slope indicates that PEntropy, by definition, can capture more complex dynamics of the interaction of user segments, as discussed in the earlier example with two products of the same product engagement counts but different levels of engagement from the user segments (see “Calculation of PEntropy” subsection).
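To make the contrast concrete, the sketch below computes a normalized Shannon entropy for two hypothetical products with the same total engagement count but different distributions across user segments. The normalization used here (dividing by the logarithm of the number of segments with nonzero engagement) is an assumption made for illustration; the definition used in this study is the one given in the “Calculation of PEntropy” subsection.

```python
import math


def normalized_entropy(counts):
    """Shannon entropy of per-segment engagement counts, scaled to [0, 1].

    Assumption: normalization by log(n), where n is the number of segments
    with nonzero engagement for the product.
    """
    positive = [c for c in counts if c > 0]
    n = len(positive)
    if n <= 1:
        return 0.0  # a single segment carries no uncertainty
    total = sum(positive)
    h = -sum((c / total) * math.log(c / total) for c in positive)
    return h / math.log(n)


# Two hypothetical products, each with 1,000 engagements over four segments.
even = [250, 250, 250, 250]
skewed = [970, 10, 10, 10]
print(normalized_entropy(even), normalized_entropy(skewed))
```

Both products reach the same four segments and have the same engagement count, yet the evenly spread product scores close to 1 while the skewed one scores much lower, which is the distinction that a count-based measure cannot express.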

Figure 5

(a) Cumulative distribution function (CDF) of PDissemination and (b) PEntropy.

We then directly compare the measures. First, we depict the relationship between PEntropy and the product interactions.

In Figure 6, we can see that a higher product engagement does not guarantee a higher PEntropy. By contrast, as anticipated, higher product engagements have a much higher probability of leading to higher PDissemination, as shown in Figure 7.

Figure 6

Product engagements versus normalized entropy of products.

Figure 7

Product engagement versus PDissemination of products.

Although the overall patterns across the entire product dataset show an interesting trend, the top (popular) products matter more than the whole set of products for effectively identifying the most impactful or the most discriminatory products. We thus focus on them for comparison.

For comparison between two sets of products, we use the Jaccard similarity coefficient (J), which is defined as the size of the intersection divided by the size of the union of the two sets. The Jaccard coefficient thus measures the overlap of two sets relative to their combined size. In this case, the sets are Set 1 = top k products chosen by PDissemination and Set 2 = top k products chosen by PEntropy. We use the following equation to calculate the Jaccard coefficient:

$$J = \frac{N_c}{N_a + N_b - N_c},$$

where N_a is the number of top k products chosen by PDissemination, N_b is the number of top k products chosen by PEntropy, and N_c is the number of top k products at the intersection.
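The comparison of top k sets can be sketched as follows. The helper names and the four sample products with their scores are hypothetical; only the Jaccard definition itself comes from the text.

```python
def top_k(scores, k):
    """IDs of the k highest-scoring products; `scores` maps product id -> score."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def jaccard(a, b):
    """|a ∩ b| / |a ∪ b|, which equals N_c / (N_a + N_b - N_c)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical scores; not data from the paper.
dissemination = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "p4": 0.1}
entropy = {"p1": 0.2, "p2": 0.9, "p3": 0.8, "p4": 0.7}

j = jaccard(top_k(dissemination, 3), top_k(entropy, 3))
print(j)  # 2 shared products out of 4 distinct ones
```

Repeating the calculation for increasing values of k gives one curve per pair of measures.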

Figure 8 shows the trend of the Jaccard coefficients of the top k products chosen by views, PDissemination, and PEntropy. A higher coefficient between two measures means that the top products according to those measures have more products in common. For example, the engagement–dissemination line corresponds to the Jaccard coefficient between the top products according to PDissemination and PImpact. The line shows a high percentage of overlap between the two measures.

Figure 8

Jaccard coefficients of top k products according to product engagement, PDissemination, and PEntropy.

Figure 9 shows the top 100 products by rank (by PImpact) and their PEntropy (left) and PDissemination (right). Almost no pattern emerges in the left figure: the correlation between PImpact and PEntropy is almost zero (Pearson's rs = −0.066, p < 0.05). In the right figure, by contrast, the correlation between product engagement counts and PDissemination is still high (Pearson's rs = 0.521, p < 0.05). This reaffirms that entropy is a more resilient measure that is not directly correlated with PImpact.
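For reference, a Pearson correlation of the kind reported above can be computed as in the following sketch. The function name and the sample vectors are hypothetical and only demonstrate the calculation; they do not reproduce the reported coefficients.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-product values; not data from the paper.
engagements = [10, 20, 30, 40]
dissemination = [0.1, 0.2, 0.3, 0.4]
print(pearson_r(engagements, dissemination))
```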

Figure 9

Top 100 products by product engagement count and their (a) PDissemination and (b) PEntropy.

Therefore, we recalculate our cutoff values at the points where there is a notable change in the slope of the curve of PEntropy, again using the second derivative approach. At the upper cutoff, we identify 14 products (0.33% of all products) for removal. At the lower cutoff, we identify 798 products (17.02% of all products) for removal. For example, a segment Male 25–34, Finland with Product #1001 was dropped because, while it is distinct, it accounts for only a very small proportion of the whole dataset's engagements. Our method demonstrates that a small percentage (0.33% in our dataset) of products are so widely disseminated that they are non-discriminatory, and a large percentage of products (17.02%) are discriminatory but have so little dissemination that their impact is negligible. Our approach reduces the product dataset by 17.35% and the number of user segments by 8.18%, a nontrivial percentage (6,814 user segments removed).

Discussion and Implications

Using actual user data from an online media outlet, we validate our premise concerning the discriminatory value of product dissemination and product impact for determining meaningful user segments for products, including a novel application of the normalized Shannon's entropy within the field of web analytics. As such, our empirical results add to the theoretical grounding of user segmenting by isolating meaningful differences among user groupings based on meaningful engagement with products (Claycamp & Massy, 1968).

Our findings show that, by contrasting the dissemination and the impact of products, content engagement metrics can, in fact, be used for both crucial segmentation subtasks: (a) differentiating one user group from another (where a user group is defined by some commonalities, such as similar age) and (b) identifying content that matters (i.e., engagement is above a reasonable threshold). The major implication is that our approach provides a straightforward yet effective technique for simplifying a complex dataset to allow decision-makers to focus on the key products for segmenting and related activities.

In our case, for example, we reduce the dataset by 17.35%, allowing detailed focus and analysis on the portion of the product dataset that was most impactful in terms of user segmentation and achievement of critical KPIs. The practical implication is that, for further processing of the data for purposes such as machine learning, where lower dimensionality is typically considered an advantage, this smaller dataset captures the variation in the original dataset reasonably well. Similarly, decision-makers can now deal with less data and thus run elaborate analyses of user segments more easily. Our approach reduced the number of user segments by 8.18%. We were able to determine the most impactful content in terms of resonating with both unique user segments and the overall user population. The research results are also quite promising, with potential impact for understanding how various users engage with products, aiding businesses both in producing products that consumers want to engage with and, simultaneously, in producing products in a manner appealing to that user segment.

In addition, we note that the skewed popularity pattern, the so-called long tail, is not unique to our dataset; it is prevalent in various web content services, such as YouTube, Amazon, Netflix, Facebook, and Twitter. We provide an effective way to deal with power-law consumption patterns by considering both the popularity and the discriminative nature of online content. Thus, our findings have high generalizability for identifying user segments for various online publishing services in the contexts of advertising, marketing, and mobile app distribution.

Naturally, there are limitations to this research. First, we evaluated our approach on only a single dataset and only a single product type. Although conceptually the approach should be independent of product types, this should be evaluated. Also, we only incorporated views as user interaction. The approach would need to be evaluated on other user interactions or, perhaps, a set of user interactions. Also, given only one dataset, the effect on datasets of other sizes remains to be seen. Finally, the method is rather intensive for an approximately 10% reduction in the number of segments, so automated methods need to be developed.

The research's strength is that we use real content and real user data from a major social media channel for identifying products that can be leveraged to categorize both distinct and impactful user segments via product dissemination and product discrimination. Our research is one of the few to apply entropy to analyze product engagement patterns of online consumers. Entropy-based measures offer a robust approach to finding discriminatory content while considering each user segment's level of engagement. The resulting user segments are a compromise between the high contribution of the product engagements and the uniqueness of user preferences.

Conclusions and Future Work

In this research, we show that the concepts of product dissemination and product discrimination can be used as underlying constructs for meaningful user segmentation based on online products, when combined with specific KPI performance metrics. We further demonstrate that normalized entropy can be effectively employed in determining meaningful user segments. User segmentation can be accomplished rapidly and dynamically using large-scale, real-time, user data from major online social media platforms, resulting in an analysis reflecting real user segments’ behavior. Although specifically focusing on digital products, our approach is flexible and resilient for application in a wide range of contexts.

Indeed, additional work can be done, such as a more rigorous procedure for determining the cutoff values and the integration of other KPIs to define users’ behavioral aspects. Our calculations of the cutoff points, even leveraging slope-based approaches, still involve a measure of heuristics, judgment, and domain knowledge. It would be interesting to see if there is value in determining more rigorous methods. This would involve aligning KPIs with specifically related metrics. Here, we limited our focus to understanding the meaningful user segments with current levels of product dissemination. It would be interesting to apply the approach to the long tail of user segments to investigate, for example, possible product diffusion (Rogers, 1976).

Figure 1

Sample YouTube video from an online media company's YouTube channel, with the number of views, which we use as the user engagements for this research.

Figure 2

(a) Plot of rank versus PDissemination and (b) PImpact for each product in the dataset.

Figure 3

(a) Log–log plot of rank versus PDissemination and (b) PImpact for each product in the dataset.

Figure 4

Conceptual plot highlighting products with meaningful PDiscrimination.

Figure 5

(a) Cumulative distribution function (CDF) of PDissemination and (b) PEntropy.

Figure 6

Product engagements versus normalized entropy of products.

Figure 7

Product engagement versus PDissemination of products.

Figure 8

Jaccard coefficients of top k products according to product engagement, PDissemination, and PEntropy.

Figure 9

Top 100 products by product engagement count and their (a) PDissemination and (b) PEntropy.

Average, Max, Min, Median, and Standard Deviation of SegmentsProduct and EngagementsProduct for the 4,653 Products in the Dataset

Measure | SegmentsProduct | EngagementsProduct
Average | 17.89 | 28,876.95
Max | 934 | 6,050,914
Min | 1 | 49
Median | 13 | 3,735
SD | 29.92 | 173,077.37

An, J., Kwak, H., Jung, S. G., Salminen, J., & Jansen, B. J. (2018). Customer segmentation using online platforms: Isolating behavioral and demographic segments for persona creation via aggregated user data. Social Network Analysis and Mining, 8(1), 1–19.

Bawden, D., & Robinson, L. (2015a). “A few exciting words”: Information and entropy revisited. Journal of the Association for Information Science and Technology, 66(10), 1965–1987.

Bawden, D., & Robinson, L. (2015b). “Waiting for Carnot”: Information and complexity. Journal of the Association for Information Science and Technology, 66(11), 2177–2186.

Böttcher, M., Spott, M., Nauck, D., & Kruse, R. (2009). Mining changing customer segments in dynamic markets. Expert Systems with Applications, 36(1), 155–164.

Cha, M., Kwak, H., Rodriguez, P., Ahn, Y. Y., & Moon, S. (2007). I tube, you tube, everybody tubes: Analyzing the world's largest user generated content video system. Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, 1–14. doi:10.1145/1298306.1298309

Chan, C. W., Green, L. V., Lekwijit, S., Lu, L., & Escobar, G. (2019). Assessing the impact of service level when customer needs are uncertain: An empirical investigation of hospital step-down units. Management Science, 65(2), 751–775.

Claycamp, H. J., & Massy, W. F. (1968). A theory of market segmentation. Journal of Marketing Research, 5(4), 388–394.

Cooper, W. S. (1983). Exploiting the maximum entropy principle to increase retrieval effectiveness. Journal of the American Society for Information Science, 34(1), 31–39.

Fresneda, J. E., & Gefen, D. (2019). A semantic measure of online review helpfulness and the importance of message entropy. Decision Support Systems, 125, 1–11.

García-Sánchez, F., Colomo-Palacios, R., & Valencia-García, R. (2020). A social-semantic recommender system for advertisements. Information Processing & Management, 57(2), 1–16. doi:10.1016/j.ipm.2019.102153

Gu, X., Gao, F., Tan, M., & Peng, P. (2020). Fashion analysis and understanding with artificial intelligence. Information Processing & Management, 57(5), 1–15. doi:10.1016/j.ipm.2020.102276

Haghighatnia, S., Abdolvand, N., & Rajaee Harandi, S. (2018). Evaluating discounts as a dimension of customer behavior analysis. Journal of Marketing Communications, 24(4), 321–336.

Havrlant, L., & Kreinovich, V. (2017). A simple probabilistic explanation of term frequency-inverse document frequency (tfidf) heuristic (and variations motivated by this explanation). International Journal of General Systems, 46(1), 27–36.

Jansen, B. J., Jung, S. G., Salminen, J., An, J., & Kwak, H. (2017). Viewed by too many or viewed too little: Using information dissemination for audience segmentation. Proceedings of the Association for Information Science and Technology, 54(1), 189–196.

Jansen, B. J., & Rieh, S. (2010). The seventeen theoretical constructs of information searching and information retrieval. Journal of the American Society for Information Science and Technology, 61(8), 1517–1534.

Jansen, B. J., Salminen, J. O., & Jung, S. G. (2020). Data-driven personas for enhanced user understanding: Combining empathy with rationality for better insights to analytics. Data and Information Management, 4(1), 1–17.

Jenkinson, A. (1994). Beyond segmentation. Journal of Targeting, Measurement and Analysis for Marketing, 31(1), 60–72.

Jiang, T., Chi, Y., & Gao, H. (2017). A clickstream data analysis of Chinese academic library OPAC users’ information behavior. Library & Information Science Research, 39(3), 213–223. doi:10.1016/j.lisr.2017.07.004

Jung, S. G., An, J., Kwak, H., Ahmad, M., Nielsen, L., & Jansen, B. J. (2017). Persona generation from aggregated social media data. Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, 1748–1755. doi:10.1145/3027063.3053120

Kalaivani, D., & Sumathi, P. (2019). Factor based prediction model for customer behavior analysis. International Journal of System Assurance Engineering and Management, 10(4), 519–524.

Kwak, H., An, J., Salminen, J., Jung, S. G., & Jansen, B. J. (2018). What we read, what we search: Media attention and public attention among 193 countries. Proceedings of the 2018 World Wide Web Conference, 893–902. doi:10.1145/3178876.3186137

Li, C., Chen, S., & Qi, Y. (2019). Filtering and classifying relevant short text with a few seed words. Data and Information Management, 3(3), 165–186.

Liang, S., & Wu, D. (2019). Predicting academic digital library OPAC users’ cross-device transitions. Data and Information Management, 3(1), 40–49. doi:10.2478/dim-2019-0001

Mellor, V. (2006). Mastering Audience Segmentation: How to Apply Segmentation Techniques to Improve Internal Communication. London: Melcrum.

Morisada, M., Miwa, Y., & Dahana, W. D. (2019). Identifying valuable customer segments in online fashion markets: An implication for customer tier programs. Electronic Commerce Research and Applications, 33, 1–11.

Ratkiewicz, J., Fortunato, S., Flammini, A., Menczer, F., & Vespignani, A. (2010). Characterizing and modeling the dynamics of online popularity. Physical Review Letters, 105(15), 1–4.

Rogers, E. M. (1976). New product adoption and diffusion. Journal of Consumer Research, 2(1), 290–301.

Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.

Simester, D., Timoshenko, A., & Zoumpoulis, S. I. (2020). Targeting prospective customers: Robustness of machine-learning methods to typical data challenges. Management Science, 66(6), 2495–2522.

Smith, W. R. (1956). Product differentiation and market segmentation as alternative marketing strategies. Journal of Marketing, 21(1), 3–8.

Song, X., Lin, C. Y., Tseng, B. L., & Sun, M. T. (2005). Modeling and predicting personal information dissemination behavior. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 479–488. doi:10.1145/1081870.1081925

Stern, B. B. (1994). A revised communication model for advertising: Multiple dimensions of the source, the message, and the recipient. Journal of Advertising, 23(2), 5–15.

Warnaby, G., & Shi, C. (2019). Changing customer behaviour: Changing retailer response? The potential for pop-up retailing. Journal of Customer Behaviour, 18(1), 7–16.

Wei, S., & Wei, K.-K. (2017). The contingency effect of relational competency on the relationship between information technology competency and firm performance. Data and Information Management, 1(1), 3–16. doi:10.1515/dim-2017-0003

Wu, I. C., & Yu, H. K. (2020). Sequential analysis and clustering to investigate users’ online shopping behaviors based on need-states. Information Processing & Management, 57(6), 1–18. doi:10.1016/j.ipm.2020.102323

Yang, Q., Shan, C., Jiang, B., Yang, N., & Yao, T. (2018). Managing the complexity of new product development project from the perspectives of customer needs and entropy. Concurrent Engineering, 26(4), 328–340.

Zhao, Q., Xu, M., & Fränti, P. (2008). Knee point detection on Bayesian information criterion. Proceeding of 2008 20th IEEE International Conference on Tools with Artificial Intelligence (Vol. 2), 431–438. doi:10.1109/ICTAI.2008.154
