Fed-UserPro: A user profile construction method based on federated learning

Published Online: 20 May 2022
Volume & Issue: AHEAD OF PRINT
Received: 14 Feb 2022
Accepted: 10 Apr 2022
Abstract

User profiles constructed using vast network behaviour data are widely used in various fields. However, data island and central server capacity problems limit the implementation of centralised big data training. This paper proposes a user profile construction method, Fed-UserPro, based on federated learning, which uses non-independent and identically distributed unstructured user text to jointly construct user profiles. Latent Dirichlet allocation model and softmax multi-classification regression method are introduced into the federated learning structure to train data. The results show that the accuracy of the Fed-UserPro method is 8.69%–19.71% higher than that of single-party machine learning methods.

Introduction

With the rapid development of the mobile internet, online social behaviours have grown strongly, and how to make full use of the information shared by users participating in online society has become a popular topic in recent years. User profiles document users’ social attributes, living habits, consumption behaviours and other characteristics abstracted from vast amounts of user network behaviour data. These profiles are widely used in e-commerce, social networking, internet financing, product development and other fields, providing an important basis for accurate advertising, personalised recommendations and risk control. In practical applications, user network behaviour profiles are built using big data and machine learning technology: a vast amount of user network behaviour data is collected, and after the data are cleaned and fused, machine learning algorithms are used to model behaviours and build profiles. Existing user network behaviour profiles tend to be domain oriented, focusing on improving the accuracy of the profiles derived from the available user information, but profiles built on a single enterprise's data usually have difficulty fully reflecting user characteristics. As the internet society grows and changes, integrating user network behaviour across multiple fields and enterprises has become a research trend; comprehensive and accurate user profiles are therefore widely used in many network social governance services.

The following key issues exist in the use of large datasets to construct user profiles. First, the requirement for vast amounts of data places significant performance demands on central server equipment and networks. If each service provider's large quantity of user data is to be combined and synthesised, the data must be stored and processed centrally; thus, the centralised method of constructing user profiles poses severe challenges to storage capacity, computing power and network transmission capability. Second, fragmented data storage among enterprises causes a data island problem [2]. The behaviour data that users generate on the network are collected and stored by different service providers; owing to competition, security restrictions and approval processes, a barrier that is difficult to overcome exists around the collection of network user behaviour data (i.e., the data island problem). Even if companies intend to exchange data, they may encounter policy accountability issues that prevent it. User network behaviour data that remain split across separate stores clearly cannot support comprehensive user profiles, greatly reducing profile availability and accuracy. Third, there are large differences between the data held by different enterprises. Because enterprises position themselves differently in the market, the user groups they attract also differ, and each enterprise's private data reflect only how its own users behave on its services. Therefore, the private data distribution of any particular enterprise cannot fully reflect the global data distribution of the entire industry, and it is difficult to build a comprehensive and accurate user profile by relying exclusively on the private data of a single enterprise.

Recently, the federated learning architecture proposed by Google [5, 17] has provided inroads to solving the data island and load capacity problems. This architecture ensures that each participant's data remain stored in a decentralised manner, with no need for centralisation, and it builds a machine learning model over the global data without sharing the original data [11]. In real environments, and in contrast to traditional machine learning and distributed machine learning, the data of the participants in a federated learning scenario are mostly not independent and identically distributed (non-IID). On this basis, this paper proposes Fed-UserPro, a global user profile construction method based on federated learning. The main contributions of this paper are summarised as follows:

This paper proposes, for the first time, a method for constructing user profiles based on unstructured data, and it uses a federated learning architecture to cooperate with multiparty data to construct global user profiles. Compared with user profiles in a single field, it describes user characteristics more comprehensively.

Based on the federated learning architecture, Fed-UserPro is proposed. Its federated learning architecture employs a horizontal division that is used to construct a global user profile using multiparty data. With this method, a latent Dirichlet allocation (LDA) model is used to mine potential user topic information to obtain a topic probability distribution, and users are grouped by softmax regression multi-classification. This method is extended to the federated learning architecture. When the data of each participant are not independent and identically distributed (IID), the accuracy of the model is improved with parameter transfer and aggregation.

The Fed-UserPro algorithm was subject to experiments using the Sina Weibo dataset. The experiment verified the accuracy and running time of the algorithm on non-IID data and compared it to the user profile algorithm, UserPro, trained by a single data holder. The results show that the accuracy of the Fed-UserPro algorithm was significantly higher than that of UserPro. The accuracy was increased by 19.71% in the best case and by 8.69% in the worst case; the running time showed a linear increase with the increase of participants.

Related works

Based on the need for federated internet user profiling, this paper provides an improved federated learning algorithm.

User profiles are widely used in recommendation, advertising and marketing services. In recent years, such profiles have also been used in the construction of smart libraries, smart campuses, emergency public opinion management and personalised insurance services. With the increasing use of user profiles, the types of data and methods used to construct them are also increasing. In addition to using statistical learning methods to obtain user profile tags, big data machine learning methods can be used to mine detailed and versatile user behaviour information from different sources. Zeng and Sun [12] built user profiles and embedded them into a library's recommendation service by collecting user behaviour information, such as access logs and search keywords in the library, improving user retrieval efficiency and the quality of recommendations. He et al. [13] constructed user profiles by collecting basic real-world information about the urban elderly through a smart elderly care platform in order to predict their service needs and provide customised services. Ren et al. [14] used crawler technology to obtain the static and dynamic attributes of Weibo users and applied machine learning methods to analyse their emotional tendencies and build user profiles; they also leveraged the user portrait information to predict emotions, helping the platform develop targeted public opinion guidance strategies. Lin and Xie [9] analysed user behaviours based on social identity theory and used an LDA topic model to mine user interests and preferences to construct profiles of Weibo groups.

Federated learning has been proposed to solve the problem of training and updating local models under privacy constraints. It can train a global model using multiple participants’ data simply by aggregating the gradient or parameter information gathered from each participant during the model training stage, without manipulating the original local data. For a machine learning task, the goal is to find an optimal solution that minimises the loss function. For complex problems with many model parameters, optimisation algorithms are usually used to find a numerical solution of the loss function, the most common being stochastic gradient descent (SGD). The FedSGD [19] algorithm applies SGD to the federated learning framework for the aggregation and optimisation of model parameters. After receiving the current round of global model parameters sent by the central server, each local participant performs a gradient calculation over all of its local data and uploads the gradient back to the server to complete that round of global model aggregation and updating. Although this method is computationally efficient, it requires many communication rounds to reach a satisfactory model; in federated learning, communication cost is a problem that must be solved. Two optimisation directions exist for reducing the communication cost of model training [1, 4, 6, 10]: reducing the number of communication rounds, and reducing the amount of information transmitted in each round. The number of rounds can be reduced by increasing the parallelism of the participants and increasing their local computation; the information transmitted per round can be reduced through parameter compression. The FedAvg algorithm proposed by McMahan et al. [4] reduces communication costs by increasing the local computation of each participant per round, and it achieves effects equivalent to centralised learning [22]. However, when applied to real data, defects pertaining to equipment heterogeneity arise; for example, participants often lack the resources needed for the current training stage, cannot complete the training task within the specified time and are therefore abandoned by the server [3]. Consequently, for the problem of equipment heterogeneity, some scholars have proposed the FedProx algorithm [7, 23], which dynamically adjusts the number of local training iterations of each participant according to the resource status of its equipment and improves the stability of federated learning. In federated learning, the original data are stored in different places, and differences in time, space and individual behaviour easily cause the data to exhibit non-IID characteristics [1, 21]. In response to this problem, Zhao et al. [16] proposed sharing a small amount of data, which improved FedAvg performance when the data were non-IID.

In contrast with the above research, this paper examines a more comprehensive user profile that is based on unstructured text data under a horizontally federated learning architecture, with the goal of improving model accuracy by designing parameter transmission and aggregation when the data of each participant are non-IID.

Preliminaries

This section provides preparatory knowledge of user profiles and federated learning.

User profile

The concept of the user profile was first proposed by Alan Cooper, the “father” of interaction design, who viewed user profiles as virtual representations of real users [8]. Creating a user portrait requires a vast amount of real user data, from which usable information is obtained through statistical analysis and data mining. The portrait then describes individuals or groups by establishing tags along different dimensions, forming the prototype of a user group.

Definition 1

User profile UPi refers to a set composed of a user, a category and feature labels, which can be expressed as UPi = {Ui, Ci, < l1, p1 >, < l2, p2 >, …, < lk, pk >}, where Ui represents the user ID in the profile, Ci represents the category to which Ui belongs and < l1, p1 >, < l2, p2 >, …, < lk, pk > are the top-k important feature labels of category Ci; li represents the i-th feature tag and pi represents the probability of having feature tag li.

The content posted by Weibo users on their personal accounts reflects their personal characteristics (e.g., hobbies, values and social needs). It is feasible to extract important information from such data to construct user profiles. However, users do not store much content on Weibo, and only the most important aspects of their personalities can be used to identify their behavioural characteristics. Therefore, this article selects only the top-k important tags as features.
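As a simple illustration of Definition 1, the following sketch (hypothetical names, not the authors' code) assembles a profile UP_i by keeping only the top-k feature labels from a user's topic probability vector:

```python
import numpy as np

def build_user_profile(user_id, category, topic_probs, topic_names, k=5):
    """Assemble UP_i = {U_i, C_i, <l_1,p_1>, ..., <l_k,p_k>}: keep only the
    k feature labels with the highest probability for this user."""
    top_idx = np.argsort(topic_probs)[::-1][:k]      # indices of the k largest probabilities
    labels = [(topic_names[i], float(topic_probs[i])) for i in top_idx]
    return {"U": user_id, "C": category, "labels": labels}

# Toy usage: 6 topic labels, keep the 3 most probable as the user's feature tags.
probs = np.array([0.05, 0.40, 0.10, 0.25, 0.15, 0.05])
names = ["travel", "finance", "sports", "food", "fashion", "digital"]
print(build_user_profile("u_001", "food", probs, names, k=3))
```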

Federated learning

A federated learning architecture includes servers and N participants, P1, P2, …, Pn, as shown in Figure 1. Under the coordination of the server, federated learning allows multiple participants to use their local data collaboratively to train a global model. The process of federated learning can be simply described as [18, 20]: (1) The server distributes the parameters of the current global model to the clients participating in this round of model update training; (2) The client receives the parameters, updates the local model, performs model training based on the local data and updates the local model parameters; (3) The client uploads the parameters to the server after updating the local model parameters of this round; (4) After the server receives all model parameters participating in this round of training, it aggregates them according to certain rules and updates the global model again.

Fig. 1

Federated learning architecture

According to the distribution of participants’ local data, federated learning is divided into horizontal, vertical and transfer types [5, 21]. The essence of horizontal federated learning is the collaborative learning of samples. The sample characteristics of each participant are similar, and model training is carried out by combining different data samples among participants. The essence of vertical federated learning is the collaborative learning of features. There is much overlap in user IDs, but the characteristics of user data held by different participants differ. Federated transfer learning solves the problem of each participant having few ID and sample feature overlaps. It can deal with insufficient sample sizes for a certain issue under the premise of ensuring data privacy and security.

Although federated learning uses a distributed learning framework, it differs from distributed machine learning in several respects. Under federated learning, the central server does not have the right to allocate and control all of the data used for training; it acts only as an honest-but-curious third party that performs the modelling task, distributes parameters and aggregates the locally updated parameters. In distributed machine learning, the data of each work node are IID, and the number of work nodes is far lower than the number of training data samples. In federated learning, each participant is a work node, and the data are non-IID in the following situations [1]: covariate shift, prior probability shift, concept shift and data imbalance. Among these, a prior probability shift refers to the category labels being distributed differently across clients. The non-IID setting in this article is based on the prior probability shift, expressed as Ptrain(x|y) = Ptest(x|y) and Ptrain(y) ≠ Ptest(y).

Problem definitions

The problem addressed here is how to use a federated learning architecture to combine the data of multiple participants and construct user profiles when the data are horizontally partitioned.

The central server holds a pre-training dataset, uniformly distributed across the user category characteristics, that is provided by all participants. The dataset is desensitised to delete user identities and is stored only on the central server. The pre-training dataset can be expressed as S = {< Ū1, C1 >, < Ū2, C2 >, …, < ŪM, CK >}, where Ūi denotes the unstructured data describing user i, Ci represents the category label and the total number of categories is K. The central server trains the global LDA model on the pre-training dataset, which maps the participants' user text data into vectors in the topic feature space. Each participant Pi holds part of the user data, Di = {Ui, < l1, p1 >, < l2, p2 >, …, < ln, pn >}, where li represents the i-th feature label and pi represents the probability that user Ui has label li. The central server initialises a softmax regression multi-classification model and uses the FedAvg federated learning algorithm to train, together with all participants, a global model covering all types of labels. The central server distributes the trained global model to each participant, and each participant uses its local data and the global model to obtain the distribution of group user interest labels in each category as a group user profile.

The symbols used in this article are described in Table 1.

Symbols used in this article and their meanings

Symbols Meanings

Pi The i-th participant in federated learning
Di Data set held by the i-th participant
S Pre-training data set held by the server
N Total number of participants
li Topic feature tags
pi Probability of having topic feature label li
Ui User ID in user data
Ci Category i
K Total categories
w Softmax model parameters
M The total number of documents in the pre-training dataset
Ūi Unstructured data describing user Ui in S
Fed-UserPro algorithm

This section introduces the Fed-UserPro algorithm, which is divided into server and client sides.

Fed-UserPro server-side algorithm

The server-side algorithm is divided into preprocessing and training sub-algorithms. The server holds the pre-training dataset, S = {< Ū1, C1 >, < Ū2, C2 >, …, < ŪM, CK >}. First, during the preprocessing stage, a global LDA model is trained to assist the participants’ local data in the vector representation of the topic feature space; then, a classification model is jointly trained in the training stage.

Server-side preprocessing algorithm

The data held by the server are the users' unstructured text data (i.e., all Weibo posts published by user Ui). The text data of Weibo users can intuitively reflect their interests and preferences. However, owing to noise in microblog content, low word frequencies and nonstandard word use, traditional text mining techniques cannot be applied effectively. Moreover, the high-dimensional representation of text data in the feature space is a major challenge for subsequent model training. Topic models show a strong advantage in mining this type of text; therefore, this paper uses the LDA topic model [15] to preprocess the text data, mine topic features and obtain a low-dimensional topic feature space.

The LDA model is an unsupervised Bayesian generative learning model that includes a text–topic–word distribution. In recent years, it has been widely used for text dimensionality reduction, topic mining and text representation. All user document data contain several topics and probabilities corresponding to different topics. Each topic contains multiple feature tags, and the feature tags have corresponding probability distributions in the topics. Figure 2 shows the process of document generation using the LDA model.

Fig. 2

LDA graphical model representation. LDA, latent Dirichlet allocation

In Figure 2, α and β are corpus-level parameters that are shared by all user documents; θ is a document-level variable, and each document corresponds to its own θ. Thus, the probability of generating each topic z differs across documents, and θ is sampled once for each generated document. Both z and w are word-level variables: z is generated from θ, w is generated from both z and β, and each word w corresponds to a topic z.

According to the LDA probability model, the joint distribution of all the variables is

P(W_n, z_n, \theta_d \mid \alpha, \beta) = P(\theta_d \mid \alpha) \prod_{n=1}^{N_d} P(z_n \mid \theta_d)\, P(W_n \mid z_n, \beta) \quad (1)

The training process of the topic model learns the parameters of the model in the existing document set, and the Gibbs sampling method is most often used to solve the distribution parameters. It randomly assigns a topic number to each feature word in the document set and modifies the topic number of each word by scanning and updating the entire corpus. It repeats the process until convergence to obtain document topic distribution parameters.

On the federated learning server side, the word segmenter is used to preprocess each user's data (e.g., word segmentation and stop-word removal) so that each user document becomes a bag of words. It then uses the LDA model for training and obtaining the latent semantic information of the dataset. It then calculates the probability distribution of each user under each topic to realise the vector representation of the original data in the topic feature space.
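As an illustration of this server-side preprocessing pipeline, the sketch below uses jieba for Chinese word segmentation and gensim's LDA implementation; the tooling and function names are assumptions, since the paper does not specify its implementation:

```python
# Sketch of the server-side preprocessing step: segment each user's text,
# build bag-of-words documents, train a global LDA model and map every user
# document to a topic-probability vector. Assumes jieba and gensim are available.
import jieba
from gensim import corpora, models

def preprocess(raw_texts, stop_words):
    """Tokenise each user document and drop stop words."""
    return [[w for w in jieba.lcut(text) if w.strip() and w not in stop_words]
            for text in raw_texts]

def train_global_lda(docs, num_topics):
    dictionary = corpora.Dictionary(docs)             # vocabulary of the pre-training set
    bows = [dictionary.doc2bow(d) for d in docs]      # bag-of-words representation
    lda = models.LdaModel(corpus=bows, id2word=dictionary,
                          num_topics=num_topics, passes=10, random_state=0)
    return lda, dictionary

def to_topic_vector(lda, dictionary, doc, num_topics):
    """Represent one tokenised user document in the topic feature space."""
    bow = dictionary.doc2bow(doc)
    dist = lda.get_document_topics(bow, minimum_probability=0.0)
    vec = [0.0] * num_topics
    for topic_id, prob in dist:
        vec[topic_id] = prob
    return vec
```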

It is essential that an appropriate number of topics be selected for the LDA model. At present, the number of topics is mostly determined through experience and experiments. In this paper, two indicators, perplexity and coherence, were used jointly to determine the number of topics. Perplexity measures how uncertain the trained model is about the topic to which a document belongs; the lower the perplexity, the better. In general, however, a larger number of topics yields lower perplexity, which leads the model to overfit the training set and reduces topic interpretability. Coherence reveals the strength of the semantic relationship between the words in a topic; the higher the coherence, the better. Therefore, a score combining perplexity and coherence determines the number of topics in the model.
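As an illustration of this selection procedure, the sketch below trains one LDA model per candidate topic number and reports both indicators using gensim (the library choice and the c_v coherence measure are assumptions; the paper does not state how the two scores are combined):

```python
import numpy as np
from gensim import corpora, models
from gensim.models import CoherenceModel

def score_topic_numbers(docs, candidate_ks):
    """Train one LDA model per candidate topic number and report perplexity
    (lower is better) and c_v coherence (higher is better) for each."""
    dictionary = corpora.Dictionary(docs)
    bows = [dictionary.doc2bow(d) for d in docs]
    results = {}
    for k in candidate_ks:
        lda = models.LdaModel(corpus=bows, id2word=dictionary,
                              num_topics=k, passes=10, random_state=0)
        # log_perplexity returns a per-word likelihood bound; exp(-bound) is the perplexity
        perplexity = np.exp(-lda.log_perplexity(bows))
        coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                                   coherence="c_v").get_coherence()
        results[k] = {"perplexity": perplexity, "coherence": coherence}
    return results
```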

The server distributes the trained LDA model to all participants, and each participant uses it to preprocess its local data. After preprocessing, all user text data are mapped into the topic feature space so that the server and the participants can begin federated learning to jointly construct a global user profile.

Server-side training algorithm

The server collects the parameters uploaded by the participants in each round and aggregates them as a weighted average according to the proportion of each participant's dataset in the global dataset. It then sends the updated parameters to the participants for the next iteration. The specific procedure is shown as Algorithm 1.

Fed-UserPro Server-side training algorithm

Input: Local model parameters of each participant $w_t^1, w_t^2, \ldots, w_t^N$
Output: Global user profile model
  1: Initialise the global model parameters w_1;
  2: for t = 1, 2, …, T do
  3:   m ← max(γ · N, 1), γ ∈ [0, 1]; // m is the number of participants selected this round
  4:   P_t ← the set of participants selected for this round;
  5:   for each participant C ∈ P_t do
  6:     $w_{t+1}^C \leftarrow \mathrm{ClientUpdate}(w_t)$;
  7:   end for
  8:   $w_{t+1} \leftarrow \sum_{C \in P_t} \frac{|D_C|}{|D|}\, w_{t+1}^C$;
  9: end for
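The server loop of Algorithm 1 can be sketched in numpy as follows; the sampling fraction γ, the call to the client update and the weighted aggregation mirror the pseudocode above, while the concrete helper and the toy usage are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

def server_train(clients, dim, rounds, gamma, client_update):
    """Server loop of Algorithm 1: each round, sample a fraction gamma of the
    clients, collect their locally trained parameters and aggregate them
    weighted by each selected client's share of the data (FedAvg rule)."""
    rng = np.random.default_rng(0)
    w = np.zeros(dim)                                   # initialise global parameters w_1
    for _ in range(rounds):
        m = max(int(gamma * len(clients)), 1)           # number of participants this round
        chosen = rng.choice(len(clients), size=m, replace=False)
        total = sum(len(clients[i]["data"]) for i in chosen)
        w_next = np.zeros_like(w)
        for i in chosen:
            w_i = client_update(w, clients[i]["data"])          # local training (Algorithm 2)
            w_next += (len(clients[i]["data"]) / total) * w_i   # weighted aggregation
        w = w_next
    return w

# Toy usage with a placeholder local update (a real client would run Algorithm 2).
toy_clients = [{"data": np.random.randn(n, 4)} for n in (50, 30, 20)]
w_final = server_train(toy_clients, dim=4, rounds=3, gamma=0.6,
                       client_update=lambda w, d: w + 0.01 * d.mean(axis=0))
print(w_final)
```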
Fed-UserPro client algorithm

This paper uses the softmax multi-class regression algorithm to train the user classification model. Softmax regression is a generalised form of logistic regression: logistic regression is used for binary classification tasks, and softmax regression extends it to multi-classification tasks.

The client Pi holds an input training dataset containing K categories, Di = {< U1, C1 >, < U2, C2 >, …, < Um, Cm >}, where Ci ∈ {1, 2, …, K} and m is the number of samples. The output of the softmax multi-class regression model is the probability that user Ui belongs to each category, expressed as h_w(U_i) in the form of Eq. (2):

h_w(U_i) = \begin{bmatrix} p(C_i = 1 \mid U_i; w) \\ p(C_i = 2 \mid U_i; w) \\ \vdots \\ p(C_i = K \mid U_i; w) \end{bmatrix} = \frac{1}{\sum_{j=1}^{K} e^{w_j^{T} U_i}} \begin{bmatrix} e^{w_1^{T} U_i} \\ e^{w_2^{T} U_i} \\ \vdots \\ e^{w_K^{T} U_i} \end{bmatrix} \quad (2)

where w = [w_1^T, w_2^T, \ldots, w_K^T] are the parameters of the model. The probability of user Ui belonging to category j is given in Eq. (3):

p(C_i = j \mid U_i; w) = \frac{e^{w_j^{T} U_i}}{\sum_{l=1}^{K} e^{w_l^{T} U_i}} \quad (3)

The probabilities of a sample belonging to the K categories sum to one. In softmax regression, the cross-entropy loss function is typically used to measure the gap between the predicted result and the real result. L(w), the cost function of softmax regression, is given in Eq. (4):

L(w) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{K} 1\{C_i = j\} \log \frac{e^{w_j^{T} U_i}}{\sum_{l=1}^{K} e^{w_l^{T} U_i}} \right] \quad (4)

where 1{·} is an indicator function (i.e., 1{true expression} = 1 and 1{false expression} = 0). The gradient of L(w) with respect to w_j, used to update the parameters, is given in Eq. (5):

\frac{\partial L(w)}{\partial w_j} = -\frac{1}{m} \frac{\partial}{\partial w_j} \left[ \sum_{i=1}^{m} \sum_{j=1}^{K} 1\{C_i = j\} \log \frac{e^{w_j^{T} U_i}}{\sum_{l=1}^{K} e^{w_l^{T} U_i}} \right] = -\frac{1}{m} \left[ \sum_{i=1}^{m} U_i \left( 1\{C_i = j\} - p(C_i = j \mid U_i; w) \right) \right] \quad (5)
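For concreteness, Eqs. (2)–(5) can be written in a few lines of numpy; this is an illustrative sketch rather than the authors' code:

```python
import numpy as np

def softmax_probs(W, X):
    """Eqs. (2)-(3): row i gives p(C_i = j | U_i; w) for every class j.
    X has shape (m, d); W has shape (K, d), one row per w_j."""
    logits = X @ W.T                              # w_j^T U_i for every sample and class
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability; probabilities unchanged
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def loss_and_grad(W, X, y, K):
    """Eq. (4): cross-entropy loss; Eq. (5): gradient with respect to each w_j."""
    m = X.shape[0]
    P = softmax_probs(W, X)                       # shape (m, K)
    onehot = np.eye(K)[y]                         # 1{C_i = j}
    loss = -np.sum(onehot * np.log(P + 1e-12)) / m
    grad = -(onehot - P).T @ X / m                # shape (K, d); row j is dL/dw_j
    return loss, grad
```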

The algorithm of the Fed-UserPro client is shown in Algorithm 2.

Client Pk training algorithm ClientUpdate(w)

Input: Local training set Dk = {Ui, < l1, p1 >, < l2, p2 >, …, < ln, pn >}
Output: Local training model parameters $w_t^k$
  1: The client Pk downloads the parameters w_t used for this round of model training from the server;
  2: Divide Dk into mini-batch training sets B of size batch_size;
  3: $w_t^k \leftarrow w_t$;
  4: for e = 1, 2, …, E do
  5:   for each batch b ∈ B do
  6:     $w_t^k \leftarrow w_t^k - \eta \nabla L(w_t^k, b)$;
  7:   end for
  8: end for
  9: Upload $w_t^k$ to the server;

The client Pk receives the training parameters sent by the server for local training. The client divides its data Dk into mini-batch training sets B of size batch_size, takes one batch b at a time and updates the parameters $w_t^k$ using the gradient of the loss function L(w) given in Eq. (5). After the local task is executed, the trained parameters are uploaded to the server. Finally, the user portrait is constructed from the global data.
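A sketch of this local update, i.e. the mini-batch SGD loop of Algorithm 2 (the function name, the grad_fn hook and the default arguments are assumptions for illustration):

```python
import numpy as np

def client_update(w_global, X, y, K, epochs, batch_size, lr, grad_fn):
    """Algorithm 2: start from the downloaded global parameters, run E local
    epochs of mini-batch SGD on the client's own data, then return the
    trained parameters for upload. grad_fn(W, Xb, yb, K) -> (loss, gradient),
    e.g. the loss_and_grad sketch shown earlier."""
    rng = np.random.default_rng(0)
    w = w_global.copy()                            # w_t^k <- w_t
    n = X.shape[0]
    for _ in range(epochs):                        # for e = 1, ..., E
        order = rng.permutation(n)
        for start in range(0, n, batch_size):      # for each batch b in B
            idx = order[start:start + batch_size]
            _, grad = grad_fn(w, X[idx], y[idx], K)
            w = w - lr * grad                      # w_t^k <- w_t^k - eta * grad L(w_t^k, b)
    return w                                       # uploaded to the server
```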

Experiments

In this study, the Fed-UserPro method was experimentally verified on a real dataset. This section introduces the experimental environment, parameter settings and experimental results.

Experimental data acquisition

This study used crawler technology to collect the text data published by 30,000 active Sina Weibo users across 10 fields (i.e., tourism, military, finance, sports, film, sports, childcare, food, fashion and digital). The text data published by the users were preprocessed mainly through Chinese word segmentation and stop-word removal. Each user was labelled with an interest type according to the field (category) from which the user was sourced.

The above data were divided into training and test sets. In each category, 10% of the data were selected and provided to the central server for LDA model pre-training. The data published by 21,000 users were partitioned horizontally and used as the training sets of the participants in federated learning. Reflecting the practical situation, each participant's data exhibited a non-IID prior probability shift: each participant held data from all 10 categories, for a total of 4,200 training samples, of which two categories contributed 1,500 training samples each and the remaining eight contributed 150 training samples each.
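For illustration, the following sketch generates one participant's prior-probability-shifted index set under the sizes described above (the helper name and the toy labels are hypothetical):

```python
import numpy as np

def make_noniid_partition(labels, major_classes, n_major=1500, n_minor=150, seed=0):
    """Build one participant's index set with a prior probability shift:
    two 'major' categories contribute n_major samples each and the remaining
    categories contribute n_minor samples each (2*1500 + 8*150 = 4200 samples
    for 10 classes, matching the setup described above)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        take = n_major if c in major_classes else n_minor
        chosen.append(rng.choice(idx, size=take, replace=False))
    return np.concatenate(chosen)

# Toy usage: 10 classes with 2,100 samples each; this participant over-represents classes 0 and 1.
toy_labels = np.repeat(np.arange(10), 2100)
part = make_noniid_partition(toy_labels, major_classes={0, 1})
print(len(part))   # 4200
```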

The text data published by 770 users were selected from each category as the test set for federated learning. It was assumed that the central server was credible and that there was no data interaction between the central server and the participants, or among the participants, during the model training stage, so that the original data would not be leaked. The number of topics in the LDA model was jointly determined by the perplexity and coherence indicators, and the topic–word distribution was obtained after model training was completed. A user's field was taken as the user's category, and the number of topics in the LDA model was taken as the number of user features.

Experimental parameters

For multi-classification tasks, precision, recall and F1 score are generally used as model evaluation indicators. The formulas for precision (P) and recall (R) are given in Eq. (6), where TP is true positive, FP is false positive and FN is false negative:

P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \quad (6)

Owing to the differences in the number of samples held by each participant during federated learning, classification accuracy alone does not reflect the performance of the model in each category. This study therefore used a macro-averaging index to evaluate the classification performance of the model; the macro-average is the F1 score computed from the arithmetic means of the per-class precision and recall, as given in Eq. (7):

\text{Macro-avg} = \frac{2 \times \frac{1}{n}\sum_{i=1}^{n} P_i \times \frac{1}{n}\sum_{i=1}^{n} R_i}{\frac{1}{n}\sum_{i=1}^{n} P_i + \frac{1}{n}\sum_{i=1}^{n} R_i} \quad (7)

where Pi and Ri, respectively, represent the precision and recall of the i-th category. To track how the differences between the participants' updated parameters change during training, the average variance of the participants' parameters was also measured, as given in Eq. (8):

V = \frac{1}{K} \sum_{j=1}^{K} \frac{\sum_{i=1}^{N} (v_{ij} - \bar{v}_j)^2}{N} \quad (8)

where K is the total number of categories, N is the total number of participants, \bar{v}_j is the average value of the j-th category parameter and v_{ij} is the parameter value of the i-th participant for the j-th category.
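Both evaluation quantities are straightforward to compute; the following numpy sketch (with hypothetical helper names) mirrors Eqs. (7) and (8):

```python
import numpy as np

def macro_avg(precisions, recalls):
    """Eq. (7): F1 computed from the arithmetic means of per-class precision and recall."""
    p_bar = np.mean(precisions)
    r_bar = np.mean(recalls)
    return 2 * p_bar * r_bar / (p_bar + r_bar)

def param_variance(v):
    """Eq. (8): average over the K categories of the variance of the
    participants' parameter values; v has shape (N participants, K categories)."""
    v = np.asarray(v, dtype=float)
    return float(np.mean(np.mean((v - v.mean(axis=0)) ** 2, axis=0)))

# Toy usage.
print(macro_avg([0.8, 0.7, 0.9], [0.75, 0.65, 0.85]))
print(param_variance([[0.2, 0.5], [0.4, 0.3], [0.3, 0.4]]))
```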

Analysis of experimental results

The experiment compared and analysed Fed-UserPro, the user profile construction method based on federated learning, and UserPro, a user profile construction method using single-party data. Each algorithm iterated for 1,000 rounds, and the results are shown in Table 2. P1–P5 represent the results of five single parties executing the UserPro algorithm on their respective data; the data of these five parties all followed prior-probability-shifted distributions, and the distributions differed from one another, as described in Section 5.1. Fed-UserPro represents the user profile construction method based on federated learning proposed in this paper. The experiment analysed the accuracy and macro-average of the algorithms; the results of the first (r-1), 15th (r-15), 25th (r-25), 35th (r-35), 50th (r-50), 100th (r-100) and 1,000th (r-1,000) rounds are presented for comparative analysis. The experimental results show that the accuracy of Fed-UserPro reached 79.77% in the first round of iteration, which is significantly better than the first-round accuracy of any single party. After 100 iterations, the accuracy of Fed-UserPro reached 98.44%, whereas the single-party accuracy was 84.32% at best and 71.12% at worst, showing the advantage of federated learning in terms of accuracy. Similar results can be seen for the macro-average, because the participants uploaded representative samples to the central server to train the global LDA model; this reduced the inaccurate topic distributions that arise in single-point learning from small sample sizes and poor sample diversity when the test set is reduced to the topic feature space. At the same time, federated learning let multiple participants train together, which increased the amount of training data of each type relative to single-point learning and improved the accuracy and macro-average of the model on the test set. Compared with the r-1,000 accuracy of P5, Fed-UserPro improved accuracy by 19.71% in the best case; compared with the r-1,000 accuracy of P3, it improved accuracy by 8.69% in the worst case.

Accuracy and macro-average results of federated learning and single-point learning

P1 Index r-1 r-15 r-25 r-35 r-50 r-100 r-1000
Precision 20.00% 22.45% 49.15% 64.39% 74.31% 77.58% 79.02%
Macro-average 7.35% 12.19% 49.04% 62.03% 72.01% 74.43% 73.65%

P2 Index r-1 r-15 r-25 r-35 r-50 r-100 r-1000
Precision 20.00% 20.00% 38.11% 47.50% 65.53% 79.75% 85.76%
Macro-average 6.66% 7.31% 37.00% 47.83% 66.10% 79.79% 84.46%

P3 Index r-1 r-15 r-25 r-35 r-50 r-100 r-1000
Precision 19.98% 20.00% 35.85% 50.58% 73.49% 84.32% 89.88%
Macro-average 7.46% 6.71% 33.69% 50.52% 72.17% 83.98% 88.95%

P4 Index r-1 r-15 r-25 r-35 r-50 r-100 r-1000
Precision 20.00% 20.64% 41.40% 45.63% 57.89% 75.62% 85.88%
Macro-average 10.56% 11.92% 45.18% 48.80% 60.73% 77.49% 85.92%

P5 Index r-1 r-15 r-25 r-35 r-50 r-100 r-1000
Precision 20.00% 20.19% 35.15% 43.57% 57.37% 71.12% 80.32%
Macro-average 7.16% 7.54% 32.51% 43.45% 57.59% 69.91% 77.93%
Fed-UserPro Index r-1 r-15 r-25 r-35 r-50 r-100 r-1000
Precision 79.77% 94.85% 97.02% 97.87% 98.20% 98.44% 98.28%
Macro-average 73.05% 94.69% 96.97% 97.84% 98.19% 98.43% 98.28%

This paper also experimentally verified how the performance of federated learning changes when its parameters are changed, namely the number of local epochs of each participant's local training, the batch size used for model parameter updates and the number of participants taking part in parameter updates in each round. Performance testing covered algorithm availability (i.e., precision and macro-average) and efficiency (i.e., running time). The experimental results are shown in Figures 3–5, where the abscissa in each case indicates the number of iterations.

Fig. 3

Experiment of changing the local epoch size of participants. (a) Accuracy (b) Macro-average

Fig. 4

Experiment of changing the batch size of the participants. (a) Accuracy (b) Macro-average

Fig. 5

Experiments to change the number of participants. (a) Accuracy (b) Macro-average

Figure 3 shows the accuracy and macro-average score of the Fed-UserPro algorithm when the local epoch of the participants’ local training iterations was changed. At this time, the number of participants participating in the model update in each round of federated learning was fixed at five, and the experiment was iterated for 1,000 rounds. The figure shows the results of the first 200 rounds. When the local epoch was set to five, the model converged the fastest, and the accuracy reached 97.72% after 20 iterations. When the local epoch was set to one, the accuracy was only 85.33% after 20 iterations. This is because, in each round of global communication, increasing the number of calculations of the local model can reduce the global communication cost, and the model can achieve higher accuracy in fewer communication rounds. Similar results can be seen in the average macro score. When the local epoch was five, the average macro score increased the fastest.

Figure 4 shows the changes in accuracy and macro-average resulting from changing the batch size used for model parameter updates. When the batch size was 120, the model converged fastest; at 20 iterations, its accuracy reached 96.27%. When the batch size was 600, the accuracy at iteration 20 was only 80.24%, and with a batch size of 300 the accuracy lay between those of batch sizes 120 and 600. Thus, reducing the number of samples used for each parameter update can speed up model convergence. Similar results can be seen for the macro-average, although it converged slightly more slowly: at 20 iterations, the macro-average was 96.19% with a batch size of 120 but only 73.78% with a batch size of 600.

Figure 5 shows the changes in accuracy and macro-average resulting from changing the number of participants in each update round. The results show that when all five participants took part in each round of global model update training, the accuracy of the model on the test dataset and the macro-average improved steadily, and the model converged faster. When the number was three or four, the accuracy and macro-average fluctuated greatly, and the smaller the number, the greater the fluctuation; more communication rounds were also required for the model to converge stably, because some participants did not take part in the parameter update in a given round, and when they participated again, the accuracy of the model improved greatly.

To track the changes in the differences between the participants’ updated parameters during training, the average variance between the participants' parameters during federated learning was measured experimentally, as shown in Figure 6. The abscissa represents the number of iterations, and the ordinate represents the average variance between the participants' parameters. Owing to differences in the amounts of local training data of the various categories at the beginning of training, each participant focused more on training the categories with large local sample sizes; therefore, as training proceeded, the average variance of the parameters among participants increased. As federated learning advanced, the central server aggregated and updated the parameters of all categories in each round, and the average variance of the parameters among participants decreased; the aggregated parameters also improved each participant's accuracy on its small-sample categories, further improving the overall classification performance of the local models.

Fig. 6

Average variance between various parameters of participants in federated learning

In this paper, the running times of the Fed-UserPro and UserPro algorithms were tested, and the results are shown in Figure 7. Figure 7(a) shows the time required for the Fed-UserPro and UserPro algorithms to run 200 rounds when the batch size used for model parameter updates was varied and the number of local training iterations was fixed at three. The running time was longest when the batch size was 120, because the number of samples involved in each update was small: training on the local data then requires more iterations and therefore more computation time.

Fig. 7

Running time experiment. (a) Change batch size (b) Change local epoch

Figure 7(b) shows the running times of the Fed-UserPro and UserPro algorithms over 200 rounds when the number of local training iterations of the participants was varied and the batch size for model parameter updates was fixed at 120. The smaller the number of local training iterations, the shorter the algorithm's running time. According to the results in Figure 7, when the number of participants increased, the running time of the Fed-UserPro algorithm increased linearly, showing good scalability.

Summary

User profiling is widely used in e-commerce, social networking, internet financing, product development and other fields, providing an important basis for accurate advertising, personalised recommendations and risk control. Building a user profile requires a vast amount of data to portray users accurately. However, data island problems have become the biggest obstacle to building user profiles in a centralised fashion. The emergence of federated learning allows multiple parties to jointly train user portrait models without sharing local data. This paper proposed Fed-UserPro, a federated learning user portrait method based on a multi-classification model, and verified it experimentally on a real dataset. The experimental results showed that this method can significantly improve accuracy relative to a single-party model trained only on local data. This not only ensures that the participants' data are not shared but also improves model accuracy, which helps build powerful user group profiles.

Notably, Fed-UserPro has room for improvement, which motivates future research. First, the data privacy of the participants needs strengthening: the intermediate parameters should be protected with encryption or differential privacy before being aggregated and sent to the server. The user portrait algorithm also needs improvement, for example by applying an unsupervised clustering algorithm so that the number of user categories need not be determined in advance. In summary, user profile technology under federated learning is still a new research field, and many problems remain worthy of in-depth study.

References

Kairouz P, McMahan H B, Avent B, et al. (2020), Advances and Open Problems in Federated Learning. Foundations and Trends in Machine Learning, 14(1–2): 1–210.
Li Q, Wen Z, Wu Z, et al. (2019), A Survey on Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection. CoRR abs/1907.09693.
Aledhari M, Razzak R, Parizi R M, et al. (2020), Federated Learning: A Survey on Enabling Technologies, Protocols, and Applications. IEEE Access, 8: 140699–140725.
McMahan B, Moore E, Ramage D, et al. (2017), Communication-efficient Learning of Deep Networks from Decentralized Data. AISTATS: 1273–1282.
Yang Q, Liu Y, Chen T, et al. (2019), Federated Machine Learning: Concept and Applications. ACM Transactions on Intelligent Systems and Technology, 10(2): 1–19.
Li T, Sahu A K, Talwalkar A, et al. (2020), Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine, 37(3): 50–60.
Li T, Sahu A K, Zaheer M, et al. (2020), Federated Optimization in Heterogeneous Networks. Proceedings of Machine Learning and Systems, 2: 429–450.
Cooper A, Reimann M R. (2005), Software Concept Revolution: The Essence of Interaction Design. Beijing: Electronic Industry Press.
Lin Y, Xie X. (2018), User Portrait of Diversified Groups in Micro-blog Based on Social Identity Theory. Information Studies: Theory & Application, 41(3): 142–148.
Konečný J, McMahan H B, Yu F X, et al. (2016), Federated Learning: Strategies for Improving Communication Efficiency. arXiv preprint arXiv:1610.05492.
Liu Y, Kang Y, Xing C, et al. (2018), Secure Federated Transfer Learning. arXiv preprint arXiv:1812.03337.
Zeng Z, Sun S. (2020), Research on Personalized Mobile Visual Search of Smart Library Based on User Portrait. Library & Information, (4): 8.
He Z, Zhu Q, Bai M. (2021), The Construction of Urban Elderly User Portrait from the Perspective of Pension Service. Journal of Intelligence, 40(9): 154–160.
Ren Z, Zhang P, Lan Y, et al. (2019), Emotional Tendency Prediction of Emergencies Based on the Portraits of Weibo Users: Taking the “8 12” Accident in Tianjin as an Example. Journal of Intelligence, 38(11): 130–137.
Blei D M, Ng A Y, Jordan M I. (2003), Latent Dirichlet Allocation. Journal of Machine Learning Research, 3: 993–1022.
Zhao Y, Li M, Lai L, et al. (2018), Federated Learning with Non-IID Data. arXiv preprint arXiv:1806.00582.
Bonawitz K, Eichner H, Grieskamp W, et al. (2019), Towards Federated Learning at Scale: System Design. CoRR abs/1902.01046.
Sattler F, Müller K R, Samek W. (2021), Clustered Federated Learning: Model-Agnostic Distributed Multi-task Optimization under Privacy Constraints. IEEE Transactions on Neural Networks and Learning Systems, 32(8): 3710–3722.
Liu L, Zheng F. (2021), A Bayesian Federated Learning Framework with Multivariate Gaussian Product. CoRR abs/2102.01936.
Wang J, Kong L, Huang Z, et al. (2020), Research Review of Federated Learning Algorithms. Big Data Research, 6(6): 64–82.
Hahn S J, Lee J. (2019), Privacy-preserving Federated Bayesian Learning of a Generative Model for Imbalanced Clinical Data. CoRR abs/1910.08489.
Nilsson A, Smith S, Ulm G, et al. (2018), A Performance Evaluation of Federated Learning Algorithms. DIDL at Middleware: 1–8.
Sahu A K, Li T, Sanjabi M, et al. (2018), On the Convergence of Federated Optimization in Heterogeneous Networks. CoRR abs/1812.06127.
