With the rapid development and popularisation of internet technology, a huge number of information resources have appeared on the internet, and users find it difficult to locate the specific information they need. Users who are short of time often fail to find suitable resources, and many cannot judge whether a resource they have found is correct, which wastes both their time and the information resources themselves. This phenomenon, in which information resources boom but their utilisation rate stays low, is known as ‘information overload’. The traditional solution is the search engine, but search engines do not take users’ personalised demands into consideration: different people who search with the same keywords obtain the same results. Therefore, the problem of information overload is still not well solved. To address it, this paper puts forward a personalised service technology that aims to provide different information resources for different users. By relying on users’ own information, this technology identifies the differences between users and the characteristics of each user, and provides targeted, differentiated services according to these features.
At present, many scholars have carried out extensive research on personalised services, the core of which is the personalised recommendation system. The recommendation system is therefore an important part of research on personalised service technology. The recommendation algorithm plays the most important role in the recommendation system [1], and the strengths and weaknesses of the algorithm determine the recommendation performance of the system.
Existing recommendation systems can be divided into recommendation systems based on knowledge content [2], collaborative filtering recommendation systems and hybrid recommendation systems. Many systems in the manufacturing industry already benefit from recommendation systems, but recommendation systems also face severe security problems. Because a recommendation system is open by nature and sensitive to user information, the trusted recommendation mechanism of collaborative filtering recommendation systems is attracting more and more attention and has gradually become a hot research topic.
Similarity connection (also known as similarity join) finds, in a given data set and under a specified similarity function, all groups of data objects whose similarity is greater than a threshold given by the user. Similarity connection has been widely used in many fields. For example, in geography, it can be used to detect the collision or proximity of geographical features such as landmarks, houses and roads. In medical imaging, it can be used to detect whether the distances between certain cancer cells are less than a given threshold. In a recommendation system, it can be used to compare the similarity of recommended items. However, if similarity connection is applied to unordered and unindexed data sets, the computation is very costly.
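The definition above can be illustrated with a minimal brute-force sketch. It is not the method of this paper: the Jaccard similarity, the function names and the toy fabric keywords are assumptions chosen only for illustration. It also shows why unindexed data is costly, since every pair of records must be compared.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (an assumed similarity function)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def threshold_self_join(records, threshold, sim=jaccard):
    """Return all (i, j, similarity) with i < j whose similarity exceeds the threshold.

    Every pair is compared, so the cost grows quadratically with the
    number of records; this is the cost that filtering and indexing avoid.
    """
    result = []
    for (i, x), (j, y) in combinations(enumerate(records), 2):
        s = sim(x, y)
        if s > threshold:
            result.append((i, j, s))
    return result


# Hypothetical keyword sets describing three fabrics.
items = [
    {"cotton", "blue", "twill"},
    {"cotton", "blue", "plain"},
    {"silk", "red", "satin"},
]
pairs = threshold_self_join(items, threshold=0.4)
```

Here only the first two records share enough keywords to be reported as a similar pair.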
Because the scale of data has grown very fast in recent years, similarity connection has become a bottleneck that hinders scaling applications up further. A similarity connection query over mass data searches all groups of similar objects from mass data sources. It is a basic operation in mass data processing and has attracted wide attention.
Similarity connection queries can be divided into three main categories according to different classification standards: (1) by the definition of “similarity”, into threshold join queries and top-k join queries; (2) by the number of data sources, into single-source (self-join), double-source and multi-source join queries, where a single-source join query is one in which both objects of every similar pair found come from the same data source; and (3) by data type, into set, character string, vector, graph and other join queries. Set join queries, for instance, deal with data that are sets. The categories can be combined: a threshold join query, for example, can be either a single-source or a dual-source join query.
Many problems arising from new characteristics of data and of applications have not yet been studied, for example, (1) similarity connection technology for high-dimensional data and (2) massive online real-time similarity connection technology. Similarity connection queries over high-dimensional data face great challenges because of the curse of dimensionality, and traditional query algorithms based on index structures no longer work, although such queries will be basic operations of many data mining and machine learning tasks. For massive online real-time similarity connection, the MapReduce framework, which offers good extensibility, fault tolerance and usability, is currently used to perform the similarity connection of mass data. However, as a batch processing model, MapReduce is not suitable for real-time data processing, so the online real-time processing of similarity connection of mass data needs further study. To solve these two problems, we should consider the following questions. (1) Can the piecewise cumulative approximation method be used to reduce high-dimensional data to a low-dimensional space, which would overcome the rapid decline in performance caused by increasing dimensionality and allow effective filtering at a lower cost, and can a parallel join query algorithm be used to process large-scale high-dimensional data? (2) Can the data that survive dimension reduction filtering be further divided into several subspaces according to certain rules, in preparation for in-memory streaming data processing in the next step? (3) Can the inverted index list of character strings be regarded as a state with which to iteratively process the incremental character strings of streaming data, and can the new states be used to carry out similarity connection for the incremental data and to process online data rapidly in real time within limited memory?
Many innovative results have been achieved in similarity connection query technology. However, there is no unified definition of similarity connection query in the domestic and foreign literature. For example, Shim
There are also some research results on the similarity connection of massive data. Ma
This approach iteratively partitions the original data sets so that the resulting data sets are as small as possible. An efficient in-memory similarity connection algorithm then processes each of them to obtain partial results, and the partial results are finally merged into the final result. Albeanu [9] proposed a two-step method based on MapReduce and filtering technology to complete similarity connection queries over single-source multisets, sets, character strings or vectors. The basic idea is that the similarity of multiset pairs can be obtained by calculating the unilateral functions of the multisets and their conjunctive functions: calculating a unilateral function requires only a scan of the multisets, whereas calculating a conjunctive function requires a scan of their intersection. Therefore, the unilateral functions of the multisets and their conjunctive functions are calculated first, and the similarity and the final result are derived from them. Silva and Reed [10] proposed a MapReduce algorithm that uses the Euclidean distance function to answer single-source vector top-k similarity connection queries.
The fundamental concept is as follows: the original data set is divided into different groups, with copies of the data placed in more than one group where needed. The top-k similar vector pairs of each group are then computed. Finally, the results of all groups are aggregated and ranked to obtain the top-k similar vector pairs of the original data set.
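The group-then-merge idea described above can be sketched on a single machine as follows. This is an illustrative sketch, not the MapReduce algorithm of [10]: the chunking rule, the function names `local_topk` and `grouped_topk` and the sample vectors are assumptions. Each chunk is copied into every group that pairs it with another chunk, so every vector pair is examined in at least one group.

```python
import heapq
import math
from itertools import combinations


def local_topk(vectors, ids, k):
    """Top-k closest pairs (distance, a, b) among the vectors indexed by ids."""
    candidates = (
        (math.dist(vectors[a], vectors[b]), a, b)
        for a, b in combinations(sorted(ids), 2)
    )
    return heapq.nsmallest(k, candidates)


def grouped_topk(vectors, k, n_chunks=2):
    """Split the data into chunks, form one group per pair of chunks,
    compute the local top-k of each group, then merge and rank."""
    chunks = [list(range(c, len(vectors), n_chunks)) for c in range(n_chunks)]
    partial = set()  # the same pair may be reported by several groups
    for p in range(n_chunks):
        for q in range(p, n_chunks):
            ids = set(chunks[p]) | set(chunks[q])
            partial.update(local_topk(vectors, ids, k))
    return heapq.nsmallest(k, partial)


vectors = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 10)]
closest = grouped_topk(vectors, k=2, n_chunks=3)
```

Because any pair in the global top-k is also in the local top-k of the group that contains it, the merged result equals the result of comparing all pairs directly; in a distributed setting each group would be processed by a separate worker.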
There are many types of similarity connection. According to the data type of the objects connected, it can be divided into character string similarity connection, set similarity connection and graph similarity connection. In the recommendation system experiments studied in this paper, recommendation and ranking are mainly implemented by character string matching.
To simplify the problem, this paper ignores the consistency between contents and topics and uses the degree of character string matching between the keywords entered by the user and the index keywords.
Definition of the edit distance of character strings
Normalised edit distance:
Edit similarity
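The three quantities named above can be sketched in code. The exact formulas of this paper are not available here, so the sketch assumes common conventions: the edit distance is the Levenshtein distance, the normalised edit distance divides it by the length of the longer string, and the edit similarity is one minus the normalised distance. The function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions that turn a into b (two-row dynamic programme)."""
    if len(a) < len(b):
        a, b = b, a  # keep the rows as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def normalised_edit_distance(a: str, b: str) -> float:
    """Assumed normalisation: divide by the length of the longer string."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else edit_distance(a, b) / longest


def edit_similarity(a: str, b: str) -> float:
    """Assumed definition: 1 minus the normalised edit distance, in [0, 1]."""
    return 1.0 - normalised_edit_distance(a, b)
```

Under these assumptions, `edit_distance("kitten", "sitting")` is 3, so the normalised distance is 3/7 and the edit similarity is 4/7.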
Two data sets,
The filter-verification framework is often used in practice. However, this framework has several shortcomings: (1) it cannot efficiently process short character strings (character strings with an average length below 30); (2) the algorithm is inefficient if the data sets are dynamically updated; and (3) many indexing operations need to be done [11].
Because the continuous growth of mass data makes retrieval difficult, and disordered and unindexed data sets make retrieval expensive, research on similarity connection has become more and more important. Although traditional methods can effectively carry out the similarity connection of character strings, the continuously updated state needs to be preserved during execution, which takes a lot of storage space in practical applications, especially in the age of big data.
As time goes by, enterprises in the manufacturing industry will spend more and more time and energy on this method. In addition to the rapid growth of data size, data models are becoming increasingly complex and dense in practical applications. Worse, the increase in data dimensionality makes calculation more complex. Moreover, existing studies mainly focus on disk-based similarity connection algorithms, which lack effectiveness and scalability for in-memory connection calculation. Because of the urgent demand for real-time similarity connection of mass data, similarity connection is still a bottleneck in many scientific applications.
To meet growing application demands, two problems must still be solved before similarity connection can be carried out effectively on high-dimensional, massive and real-time data.
If the m-dimensional mapping distance of D-dimensional vectors
A new dimension reduction method, based on the piecewise cumulative approximation method for time series, can be used to reduce high-dimensional data to a low-dimensional space. Moreover, the distance in the low-dimensional space is a lower bound of the original distance.
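The lower-bound property can be illustrated with a short sketch. It assumes that the piecewise cumulative approximation behaves like the piecewise aggregate approximation (PAA) known from time series analysis, in which a D-dimensional vector is replaced by the means of m equal-length segments; this reading, the function names and the requirement that m divides D are assumptions, not details confirmed by this paper.

```python
import math


def paa(x, m):
    """Reduce a D-dimensional vector to m dimensions by averaging
    m consecutive segments of equal length (assumes m divides D)."""
    d = len(x)
    if m <= 0 or d % m != 0:
        raise ValueError("m must be positive and divide the dimension of x")
    w = d // m  # segment length
    return [sum(x[s * w:(s + 1) * w]) / w for s in range(m)]


def reduced_distance(x, y, m):
    """Distance in the reduced space, scaled by sqrt(D / m).

    For PAA this value never exceeds the Euclidean distance between x and y,
    so a pair can be safely discarded when it is already above the threshold.
    """
    w = len(x) // m
    return math.sqrt(w) * math.dist(paa(x, m), paa(y, m))


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 2, 1, 5, 9, 6, 7, 0]
low = reduced_distance(x, y, m=4)  # cheap, computed on 4 dimensions
full = math.dist(x, y)             # exact, computed on 8 dimensions
```

In a filter-verification setting, only the pairs whose reduced distance is within the distance threshold need the exact distance to be computed.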
On the basis of the previous research, this paper finds an effective distance space to calculate the distance between two vectors in the m-dimensional space. If Δm(
As shown in Figure 1, dimension reduction can be carried out first. m random vectors
With the rapid development of internet and mobile technology, there are more and more online application systems. Big data brings challenges to existing application systems, and streaming data makes it urgent to improve batch processing methods and technologies. However, the existing character string similarity connection algorithms all assume a limited memory space, which requires that the data be read into memory at one time. In the era of big data, this is not feasible.
This paper adopts an incremental character string similarity connection method based on memory computation to process streaming data. It takes the inverted index list of character strings as the state and iteratively updates the states obtained from processing historical data. The new states are then used to carry out similarity connection for incremental data. The state consists of
This paper adopts a space-efficient algorithm for the longest common subsequence (LCS) in
Given two sequences
A similar problem is the LCS at least k problem (LCS ≥
LCSk can be solved by using a dynamic programming algorithm. Let
LCSk








return 
By using the optimised incremental similarity connection method based on memory computation described above to process streaming data, we can easily obtain the longest common subsequence of two given input sequences. On the basis of the recursive equation, this paper uses a very simple linear space algorithm to solve this problem and adopts a new state to carry out similarity connection of incremental data.
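The combination of a linear space LCS computation with a state that is updated as strings arrive can be sketched as follows. This is a hedged illustration rather than the algorithm of this paper: it uses the plain LCS (not LCSk), assumes a similarity of LCS length divided by the longer string length, and uses a q-gram inverted index as the state. The class name `IncrementalJoin` and its parameters are hypothetical. The q-gram filter with q > 1 is a heuristic that may miss some similar pairs; with q = 1 and a positive threshold no pair is missed.

```python
from collections import defaultdict


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a  # rows have the length of the shorter string
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed similarity: LCS length over the length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


class IncrementalJoin:
    """Joins each arriving string against all earlier ones.

    State: the strings seen so far plus an inverted index q-gram -> string ids.
    """

    def __init__(self, threshold: float, q: int = 2):
        self.threshold = threshold
        self.q = q
        self.strings = []
        self.index = defaultdict(set)

    def _grams(self, s: str) -> set:
        return {s[i:i + self.q] for i in range(max(len(s) - self.q + 1, 1))}

    def add(self, s: str):
        """Return (old_id, new_id, similarity) for every earlier string that is
        similar to s, then fold s into the state."""
        grams = self._grams(s)
        candidates = set()
        for g in grams:
            candidates |= self.index.get(g, set())
        new_id = len(self.strings)
        matches = []
        for c in sorted(candidates):
            sim = lcs_similarity(self.strings[c], s)
            if sim >= self.threshold:
                matches.append((c, new_id, sim))
        self.strings.append(s)
        for g in grams:
            self.index[g].add(new_id)
        return matches


join = IncrementalJoin(threshold=0.7)
first = join.add("cotton twill")   # nothing to compare with yet
second = join.add("cotton twil")   # close to the first string
third = join.add("silk satin")     # shares a q-gram but is not similar
```

Each call touches only the strings that share a q-gram with the new one, so historical data is never rescanned in full; bounding the memory used by the state would need an additional eviction policy, which is not shown.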
The experiment uses fabric data sets collected in large textile factories; such data can be used for fabric recommendation. The data set contains 972 categories, of which 486 are selected semi-randomly as the training set and the remaining 486 are used as the test set. This paper compares the content-based recommendation method BMR, the collaborative filtering-based recommendation method BFR, the improved similarity connection technology recommendation method TBRR [13] and the LCS similarity connection technology recommendation method LSCR proposed in this paper.
In addition, three metrics are compared on the fabric data sets: recall (RECALL), precision (PRECISION) and mean absolute error (MAE), which are commonly used to measure the prediction accuracy of recommendation systems.
Eq. (6) gives the calculation method of RECALL.
Eq. (9) gives the calculation method of MAE.
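For reference, the metrics can be computed as in the following sketch. It assumes the standard definitions (recall and precision over the sets of recommended and relevant items, F1 as their harmonic mean, MAE as the mean absolute difference between predicted and actual ratings), which may differ in detail from Eqs. (6)–(9) of this paper; the function names and sample values are illustrative.

```python
def precision_recall_f1(recommended, relevant):
    """Precision, recall and F1 of a recommendation list against the
    set of items that are actually relevant (standard definitions)."""
    rec, rel = set(recommended), set(relevant)
    hits = len(rec & rel)
    precision = hits / len(rec) if rec else 0.0
    recall = hits / len(rel) if rel else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def mean_absolute_error(predicted, actual):
    """Mean absolute difference between predicted and actual ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical example: four recommended fabrics, three relevant ones.
p, r, f1 = precision_recall_f1(["a", "b", "c", "d"], ["b", "d", "e"])
mae = mean_absolute_error([3, 4, 5], [2, 4, 7])
```

Higher recall, precision and F1 indicate better recommendations, whereas a lower MAE indicates more accurate rating predictions.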
As shown in Figures 2–5, we compare the recall (RECALL), precision (PRECISION), F1 (F1MEASURE) and mean absolute error (MAE) of the content-based recommendation method BMR, the collaborative filtering-based recommendation method BFR, the time-related composite filtering recommendation TBRR [13] and the LCS similarity connection technology recommendation method LSCR proposed by us on the fabric data sets, with recommended values of
It can be seen that the LSCR proposed in this paper combines the advantages of the optimised incremental similarity connection method and of the longest common subsequence of two given input sequences, and that its technical indicators are superior to those of the traditional BMR and BFR and of our previous algorithm TBRR.
In order to solve the timeliness problem of the recommendation system, this paper extracts time series from the data according to their time distribution and adopts an incremental processing method, which greatly reduces the amount of calculation while preserving the precision of the real-time recommendation system. On the fabric data sets, the technology used in this paper outperforms the traditional BMR, BFR and TBRR algorithms in recall, precision and MAE. The contributions of this paper are a dimensionality reduction method for massive similarity connection and the use of the longest common substring algorithm in memory computing. The time series are cut by piecewise accumulation, and the optimised incremental similarity connection method based on memory computing is used to process streaming data. The longest common increasing subsequence of two given input sequences is calculated simply: based on the recursive equation, a simpler algorithm is found, and the new state is used to connect the incremental data. Applied in a recommendation system for the manufacturing industry, this method can respond to users’ changing behaviours and adjust the ranking of recommendation results in real time, which continuously improves users’ experience of the recommendation system.
