With the rapid development of computer and network technology, the number of network information services and applications is growing quickly. According to statistics from the China Internet Network Information Center, as of June 2016 the number of Internet users in China had reached 710 million, with 21.32 million new users added in half a year [1]. As the number of people online has increased, Internet information has also grown explosively. Finding interesting and useful information in this vast amount of data is very difficult. To address this problem, academia and industry have proposed personalized recommendation systems [2]. Based on a user’s personal information and historical habits, such a system can discover the user’s potential interests and actively recommend resources of interest.
A personalized recommendation system is a special form of information filtering system [3]. Recommendation systems can be divided into three categories: collaborative filtering, content-based, and hybrid recommendation systems. Because of its wide applicability, strong interpretability and good stability, the collaborative filtering recommendation system based on the neighborhood model is widely used in various fields. This paper therefore focuses on collaborative filtering recommendation based on the neighborhood model.
The accuracy of the recommendation results is the main index for measuring recommendation quality. Sarwar et al. [4] proposed an item-based collaborative filtering recommendation algorithm that uses cosine-based similarity to compute the similarity between products. This method provides dramatically better performance than traditional recommendation algorithms, while at the same time providing better accuracy. Chen and Cheng [5] use the rating data to compute the similarity between users, and use the ranking data as the weight in the similarity calculation. Yang and Gu [6] propose to use user behavior information to construct the user’s interest points and use these interest points to compute the similarity between users. Experiments show that these methods outperform the classic collaborative filtering algorithm. However, they consider only the user-item behavioral data and neglect the user portrait and item features, which limits the accuracy of the recommendation results. This paper improves the similarity calculation in the neighborhood-based collaborative filtering recommendation algorithm by using the user portrait, item characteristics and user-item behavior data to compute similarity. We evaluate the method experimentally and compare it with the classic collaborative filtering algorithm. The experiments suggest that the improved similarity calculation improves the accuracy of the recommendation results.
A collaborative filtering recommendation algorithm finds groups of users with the same habits and interests and uses their experience to meet the needs of the target user, thereby reducing overhead. Typically, the workflow of a collaborative filtering system is:
Compute the similarity between users.
Determine the neighbor set. Rank the users by similarity, find the k users whose interests are most similar to those of the target user, and take these users as the neighbor set.
Predict ratings from the neighbor set. The system recommends items that the neighbors have rated highly but that the target user has not yet rated.
The most important step in the collaborative filtering algorithm is the similarity calculation, for which researchers have put forward a variety of methods.
Cosine-based similarity: For user
Pearson correlation coefficient[7-8]: In this case, similarity is computed based on the rating vectors. Among them,
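The two similarity measures named above can be illustrated with a minimal sketch. It assumes each user's ratings are stored as a dictionary from item identifier to rating, and that both the norms and the means are taken over the co-rated items only. This is one common variant and may differ in detail from the formulas used in this paper. The function names are illustrative.

```python
import math


def cosine_similarity(ratings_u, ratings_v):
    """Cosine of the angle between two users' rating vectors.

    Assumption: only items rated by both users are considered.
    """
    common = set(ratings_u) & set(ratings_v)
    if not common:
        return 0.0
    dot = sum(ratings_u[i] * ratings_v[i] for i in common)
    norm_u = math.sqrt(sum(ratings_u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings_v[i] ** 2 for i in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation of two users' ratings on co-rated items.

    Assumption: each user's mean is computed over the co-rated items.
    """
    common = set(ratings_u) & set(ratings_v)
    if len(common) < 2:
        return 0.0  # correlation is undefined for fewer than two points
    mean_u = sum(ratings_u[i] for i in common) / len(common)
    mean_v = sum(ratings_v[i] for i in common) / len(common)
    num = sum((ratings_u[i] - mean_u) * (ratings_v[i] - mean_v) for i in common)
    den_u = math.sqrt(sum((ratings_u[i] - mean_u) ** 2 for i in common))
    den_v = math.sqrt(sum((ratings_v[i] - mean_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0  # a user with constant ratings has no variance
    return num / (den_u * den_v)
```

Unlike the cosine measure, the Pearson coefficient subtracts each user's mean, so it is not affected by a user who rates everything consistently higher or lower than others.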
The similarity between users can be calculated by the above formulas. Ranking the other users by their similarity to a given user yields that user's nearest-neighbor set. After the neighbor set is obtained, the next step is to compute the interest prediction. We can denote the prediction
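Neighbor selection and rating prediction can be sketched as follows. The prediction rule shown is the mean-centered weighted average commonly used in neighborhood-based collaborative filtering. It is an assumption made for illustration, not necessarily the prediction formula of this paper. All function and parameter names are hypothetical.

```python
def top_k_neighbors(similarities, k):
    """Return the k (user, similarity) pairs with the highest similarity.

    similarities: dict mapping each candidate user to its similarity
    with the target user.
    """
    ranked = sorted(similarities.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def predict_rating(target_mean, neighbors, ratings, means, item):
    """Predict the target user's rating of `item`.

    Assumed rule: target mean plus the similarity-weighted average of the
    neighbors' deviations from their own mean ratings.
    """
    num = 0.0
    den = 0.0
    for user, sim in neighbors:
        if item in ratings[user]:
            num += sim * (ratings[user][item] - means[user])
            den += abs(sim)
    if den == 0:
        return target_mean  # no neighbor rated the item: fall back to the mean
    return target_mean + num / den
```

Items the target user has not rated can then be ranked by predicted rating, and the highest-ranked ones recommended.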
This paper considers the user rating data as a whole and introduces personal habits, item quality and category characteristics to improve the similarity computation formula. The approximated correlation coefficient is then given by:
Here,
In formula (6) above,
Because the model proposed in this paper differs from the traditional one, the data matrix cannot be applied directly to model training. The training data matrix is therefore restructured by adding personal habits, item quality and category bias. For example, when a movie i was released, it was labeled with elements such as comics, entertainment and suspense. These category data are useful for the model but cannot be used in their original form. By transforming the data format, this information is made usable: it is vectorized according to the classification categories. Each row of the transformed training matrix can be expressed as: {(
The training data matrix is shown in Table I.
TRAINING DATA MATRIX
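The vectorization of category information described above can be illustrated with a multi-hot encoding. The row layout used here (user, item, rating, followed by one 0/1 flag per category) and the category vocabulary are hypothetical, since the paper's exact row format is not reproduced in this text.

```python
def encode_categories(item_categories, vocabulary):
    """Return a 0/1 vector with one position per category in `vocabulary`."""
    present = set(item_categories)
    return [1 if category in present else 0 for category in vocabulary]


def build_row(user_id, item_id, rating, item_categories, vocabulary):
    """Build one training row (hypothetical layout):
    (user, item, rating, flag for each category in vocabulary order).
    """
    return (user_id, item_id, rating, *encode_categories(item_categories, vocabulary))


# Example vocabulary, taken from the categories mentioned in the text.
VOCABULARY = ["comics", "entertainment", "suspense"]
row = build_row(1, 10, 4, ["comics", "suspense"], VOCABULARY)
```

With this encoding, a movie labeled with comics and suspense contributes the flags 1, 0, 1 to its row.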
The complete algorithm steps are described as follows:
To verify the actual recommendation performance of the proposed algorithm (Improved Pearson Similarity Collaborative Filtering, IP-CF), the MovieLens movie data set was used. This data set has the following properties: ① it contains 100,000 ratings (1-5) from 943 users on 1682 movies; ② the user data and item data have simple feature portraits; ③ users with incomplete personal portraits and few ratings have been removed. The dataset information is shown in Table II.
MOVIELENS DATASET INFORMATION
We use the root mean square error (RMSE) and the mean absolute error (MAE) to evaluate the accuracy of the recommendation algorithm on the rating data[12-13]. The smaller the value, the higher the prediction accuracy. For a user
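The two metrics can be computed as in the following minimal sketch, which uses the standard definitions of RMSE and MAE over paired lists of predicted and actual ratings. The function names are illustrative.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))


def mae(predicted, actual):
    """Mean absolute error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)
```

Because RMSE squares each error before averaging, it penalizes large individual errors more heavily than MAE does.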
The traditional cosine similarity algorithm (Cosine-CF) and Pearson similarity algorithm (Pearson-CF) were compared with the proposed algorithm (IP-CF). We tested them on our data set by computing RMSE and MAE, with the size k of the similar user set ranging from 5 to 180. Figure 1 shows the experimental results.
Comparison of precision between the IP-CF algorithm and the traditional algorithms
The results show that the RMSE and MAE values of the improved similarity algorithm proposed in this paper decrease as the neighborhood size increases, and tend to a fixed value once the number of neighbors reaches a certain level. For the traditional collaborative filtering algorithms (Pearson similarity and cosine similarity), an optimal neighborhood size has to be found: if the number of neighbors is too large, the accuracy of the recommendation results suffers. Overall, the RMSE and MAE of the rating prediction are 0.82% and 1.16% lower, respectively, than those of the traditional algorithms. The IP-CF algorithm therefore has better accuracy.
Comparison between the baseline predictors model and the traditional model
The following experiments verify the effectiveness of the baseline predictors model. The experiment compares the user mean model (BU), basic baseline predictors model (BP), and improved baseline predictors model (UBP). Experimental results show that the improved baseline predictors model significantly improves the accuracy of the baseline prediction.
In this paper, a collaborative filtering recommendation algorithm based on improved similarity computation is proposed, which takes user portraits, item characteristics and user behavior data into account in the recommendation process. Experiments have shown that user portraits and item features play an important role in improving the accuracy of recommendations and are an important basis for analyzing potential needs. We also found that in Top-N recommendation the relationship between the number of neighbors and the evaluation index is neither simply positive nor simply negative, and that the neighborhood size affects the accuracy of the recommendation. Our future work will study the relationship between the number of neighbors and the effectiveness of recommendations, especially how to choose the best neighborhood size to improve accuracy.