With the rapid development of the Internet and related fields, large amounts of data are continuously uploaded to the network, the vast majority of which is text. In this environment of information explosion, obtaining news through the network has become an important way for people to understand the world, and how to effectively analyze and use these data has become an urgent problem. Text classification maps a text to one or more categories by analyzing its inherent characteristic information; it is the basis of intelligent text reading and an indispensable, challenging part of processing large volumes of text data. It serves a variety of downstream tasks, such as machine translation, text summarization and recommendation systems, and is a focus of both academia and industry. Accurate text classification therefore not only benefits these downstream tasks, but also shortens the time users spend searching for textual information and helps avoid information overload.
In the field of natural language processing (NLP), scholars have tried to take the characters, words and sentences contained in documents as the starting point and use their intrinsic and associated features to learn document representations for the final classification task. Text classification methods can be divided into two categories: manual classification and automatic classification [1]. Manual classification relies on professionals to interpret the content, compile a set of language rules, and then classify according to those rules. Relying solely on manual processing, it faces great challenges on large data sets: it usually costs a lot of time and money, and the classification accuracy is easily affected by human factors. Automatic classification methods can be further divided into machine learning based and deep learning based methods. Although machine learning based methods rely on a variety of statistical techniques, such as Bayesian inference, Laplace smoothing, least squares estimation and Markov chains, to process information and can bring significant performance improvements, they are still limited by hand-designed features. With the continuous development of deep learning, neural networks provide a new way to solve complex problems in various fields, such as intelligently extracting text content and learning document representations with Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) or Graph Neural Networks (GNN).
In recent years, further research on GNNs and graph embedding has attracted extensive attention from scholars [2-3]. In text classification, feature engineering usually expresses text information as sequence data; how to effectively express text with a graph data structure, that is, the graph representation of text, is an important basis for the further development of graph neural networks and graph embedding in NLP. This paper constructs a large heterogeneous graph that models both the self-information and the auxiliary information of the text, and connects the explicit information and latent semantic information of the text through co-occurrence. At the same time, a text graph convolution classification algorithm is proposed. The method effectively captures the three kinds of nodes in the heterogeneous information network (documents, topics and words) and the relationships between them, enhances the semantic information contained in document nodes, and effectively alleviates the semantic deficiency caused by representing text as a graph.
The text classification problem is divided into two parts: feature engineering and the classifier. Feature engineering transforms text into representation data; traditional methods rely mainly on manual processing, while deep learning methods take word embedding as the core. The classifier uses a Support Vector Machine (SVM), a decision tree or a neural network to classify the featured text representations. Deep learning based text classification takes pre-labeled samples as training data, constructs a learning model that learns and infers the internal relationship between texts and their labels, and classifies new texts according to the learned correlations [1]. It can complete text representation in large-scale text classification tasks. Recent studies have shown that the effectiveness of word embedding can largely affect the accuracy of deep learning classification methods [4].
Traditional feature engineering uses word2vec and Bag-of-Words methods to realize text representation. However, the words in a text are not simply listed; they follow certain rules. Kim [5] proposed the TextCNN (Text Convolutional Neural Network) model, which maps the text into vectors, uses multiple filters to capture local semantic information as features, and finally obtains the probability distribution over class labels through a fully connected layer. Lai et al. [6] proposed the Recurrent Convolutional Neural Network (RCNN), which considers the contextual representation of words to obtain the probability representation of labels. These text representations give priority to the order or local information of the text after preprocessing. Although they capture the semantic and grammatical information in continuous sequences well, they do not analyze the internal structure of the text.
In natural language, text has not only the sequential expression of context but also an internal graph structure. For example, syntactic and semantic parse trees define the syntactic and semantic relationships between words in sentences. TextRank [7], the earliest graph-based model, proposed representing natural language text as a graph whose nodes are text units such as words or sentences and whose edges encode the relations between them, and ranking the nodes with a graph-based algorithm.
To sum up, in order to improve classification performance, this paper proposes a topic-fusion graph convolutional neural network text classification algorithm. The method considers all words in the corpus and, by inserting topic information, generates an LDA Topic Heterogeneous Graph (LDA-THG) to further improve the accuracy of text classification.
In this paper, a text classification model based on LDA-THG is proposed. After preprocessing the text documents, the Latent Dirichlet Allocation (LDA) topic model is used to train topics on the global corpus. The probability distributions obtained from training are then combined with the self-information of each text for text modeling. This method constructs semantic association information among words, topics and documents, and uses a graph convolutional network to process the embedded information of document nodes and their neighborhood nodes, thereby alleviating the problem of insufficient semantic information when a graph is used as the text representation. The framework of the LDA-THG-based graph convolution text classification model is shown in Figure 1.
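As an illustration of the LDA stage, the sketch below trains a topic model on a toy corpus with scikit-learn's LatentDirichletAllocation. The corpus, the choice of K = 2 topics, and the variable names theta (document-topic distribution) and phi (topic-word distribution) are illustrative assumptions, not the paper's settings:

```python
# Sketch of the LDA stage: learn topic distributions over a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stock market finance trade economy",
    "football match team sports score",
    "bank finance economy market growth",
    "tennis player sports match win",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)                       # document-term count matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)                      # P(topic | doc), shape (4, 2)
# Normalize topic-word pseudo-counts into P(word | topic).
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```

Each row of theta is the topic mixture of one document, and each row of phi is a distribution over the vocabulary; both are later used to weight topic-related edges in the graph.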
Framework of Graph Convolution Text Categorization Algorithm Based on LDA-THG
Graph Convolutional Network (GCN) [11] is a multi-layer neural network that generalizes the traditional convolutional neural network, which processes regular spatial data, to graph-structured data. When processing a graph G, each node is assumed to be connected to itself, that is, the adjacency matrix A is augmented with self-loops: Ã = A + I.
For a one-layer GCN, the new k-dimensional node feature matrix L^(1) is computed as
L^(1) = ρ(ÂXW_0)    (1)
where Â = D^(-1/2)ÃD^(-1/2) is the symmetrically normalized adjacency matrix, D is the degree matrix of Ã, X is the input node feature matrix, W_0 is a trainable weight matrix, and ρ is an activation function such as ReLU. Stacking multiple GCN layers aggregates information from higher-order neighborhoods:
L^(l+1) = ρ(ÂL^(l)W_l)    (2)
where l represents the layer number, L^(0) = X, and W_l is the weight matrix of layer l.
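A single propagation step of this form can be sketched in NumPy. The graph, the weight matrix, and the tanh activation below are arbitrary illustrative choices, not the paper's configuration:

```python
import numpy as np

def gcn_layer(A, H, W, activation=np.tanh):
    """One GCN layer: H' = act(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])                # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt      # symmetric normalization
    return activation(A_norm @ H @ W)

# Tiny 3-node path graph: edges 0-1 and 1-2.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3)                                     # one-hot node features
W = np.random.default_rng(0).normal(size=(3, 2))  # project 3 -> 2 dims
H1 = gcn_layer(A, H, W)
```

After one layer, each node's new feature is a weighted average over itself and its first-order neighbors; stacking layers extends this to higher-order neighborhoods.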
In the graph-based convolution algorithm, LDA-THG constructs a large heterogeneous graph that fuses topics and contains three types of nodes: words, documents and topics. The structure of the LDA-THG graph is shown in Figure 2. This method explicitly models topic and global word co-occurrence, making it easier to use GCN for joint learning and reasoning.
The structure of LDA-THG
In Figure 2, Doc_n represents a document node, Topic_m represents a topic node extracted from all documents in the corpus by the LDA model, and Word_k represents the node of a unique word in the corpus. The curves in the figure are the edges of graph G, connecting document nodes with word nodes, word nodes with topic nodes, word nodes with word nodes, and document nodes with topic nodes, respectively. The total number of nodes is |V| = N_doc + N_topic + N_word, where N_doc, N_topic and N_word are the numbers of documents, topics and unique words, respectively.
In this paper, one-hot encoding is used for the initial feature representation: the feature matrix of the graph is simply set to an identity matrix, so that each node is represented as a one-hot vector as the input of the GCN. For global word co-occurrence information, a fixed-size sliding window is used to collect co-occurrence statistics over all documents in the corpus. For topic information, the whole corpus is processed by the LDA topic model to obtain the document-topic distribution θ and the topic-word distribution φ.
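The global co-occurrence statistics can be collected with a simple sliding window; a minimal sketch (the window size and sample sentence are illustrative):

```python
from collections import Counter
from itertools import combinations

def window_cooccurrence(tokens, window=3):
    """Count word and word-pair occurrences over a fixed-size sliding window."""
    word_count, pair_count, n_windows = Counter(), Counter(), 0
    for i in range(max(1, len(tokens) - window + 1)):
        win = tokens[i:i + window]
        n_windows += 1
        for w in set(win):                      # each word once per window
            word_count[w] += 1
        for a, b in combinations(sorted(set(win)), 2):
            pair_count[(a, b)] += 1             # unordered pair per window
    return word_count, pair_count, n_windows

wc, pc, nw = window_cooccurrence("the cat sat on the mat".split(), window=3)
```

These window counts later feed the PMI calculation for word-word edge weights.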
The process diagram of topic and word distribution obtained by LDA model
According to the word frequency in each document, the co-occurrence of words across the whole corpus, and the topic and word distributions obtained by applying LDA to the whole corpus, the edges between nodes are constructed. The weight of the edge between two nodes i and j is defined as:

A_ij = PMI(i, j)    if i, j are words and PMI(i, j) > 0
A_ij = TF-IDF_ij    if i is a document and j is a word
A_ij = φ_ij         if i is a topic and j is a word
A_ij = θ_ij         if i is a document and j is a topic
A_ij = 1            if i = j
A_ij = 0            otherwise    (3)

In formula (3), PMI [8] is pointwise mutual information, which measures the strength of association between two words and is used to compute the weight between word nodes i and j; an edge is added between two word nodes only when their PMI value is greater than zero. TF-IDF_ij is the term frequency-inverse document frequency of word j in document i, θ_ij is the probability of topic j in document i, and φ_ij is the probability of word j under topic i.
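The PMI weight can be computed directly from sliding-window counts; a minimal sketch with made-up counts (the numbers are illustrative, not corpus statistics):

```python
import math

# Toy sliding-window statistics (illustrative, not from the paper's corpus).
n_windows = 100
word_count = {"finance": 20, "market": 25, "tennis": 10}
pair_count = {("finance", "market"): 15, ("finance", "tennis"): 1}

def pmi(i, j):
    """PMI(i, j) = log( p(i, j) / (p(i) * p(j)) ) over sliding windows."""
    key = (i, j) if (i, j) in pair_count else (j, i)
    if pair_count.get(key, 0) == 0:
        return 0.0
    p_ij = pair_count[key] / n_windows
    p_i = word_count[i] / n_windows
    p_j = word_count[j] / n_windows
    return math.log(p_ij / (p_i * p_j))
```

A pair that co-occurs more often than chance gets positive PMI and thus an edge; a pair that co-occurs less often than chance gets negative PMI and is discarded.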
After the heterogeneous graph is constructed, the text graph is input into the GCN layer. As the input is a heterogeneous graph with three types of nodes and interaction information between them, a large feature matrix X is constructed as the initial representation of all nodes by splicing the word, topic and document feature matrices:

X = X_word ⊕ X_topic ⊕ X_doc

where ⊕ is the concatenation symbol that splices the three matrices.
In this paper, a three-layer GCN is used in the graph convolutional layer to process the input features, and the result is fed into a softmax classifier. Following formulas (1) and (2), with W_0, W_1 and W_2 denoting the weight parameters of the first, second and third layers respectively, the predicted label distribution is

Z = softmax(Âρ(Âρ(ÂXW_0)W_1)W_2)    (4)

In formula (4), Â is the normalized adjacency matrix of LDA-THG, X is the initial node feature matrix, and each row of Z gives the predicted probability distribution of a document node over the class labels. The model is trained by minimizing the cross-entropy loss over the labeled document nodes.
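A toy forward pass of the three-layer propagation in formula (4) can be sketched in NumPy. The random graph, random weights, and ReLU activation are illustrative assumptions, not the trained model:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))  # numerically stable
    return e / e.sum(axis=1, keepdims=True)

def sym_norm(A):
    """A^ = D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def three_layer_gcn(A, X, W0, W1, W2):
    """Z = softmax(A^ ReLU(A^ ReLU(A^ X W0) W1) W2)."""
    A_n = sym_norm(A)
    H1 = np.maximum(A_n @ X @ W0, 0)
    H2 = np.maximum(A_n @ H1 @ W1, 0)
    return softmax(A_n @ H2 @ W2)

rng = np.random.default_rng(0)
n, c = 5, 2                                       # 5 nodes, 2 classes
A = (rng.random((n, n)) > 0.5).astype(float)
A = np.triu(A, 1); A = A + A.T                    # symmetric, zero diagonal
X = np.eye(n)                                     # one-hot features
Z = three_layer_gcn(A, X, rng.normal(size=(n, 8)),
                    rng.normal(size=(8, 8)), rng.normal(size=(8, c)))
```

Each row of Z is a valid probability distribution over the classes, which is what the cross-entropy loss is computed against during training.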
In order to verify the effectiveness of the model, three data sets are selected for experimental verification: two public benchmark English text data sets, 20NewsGroup and AGNews, and a self-collected Chinese dataset, Ch_News. The 20NewsGroup data set contains about 18,846 newsgroup documents divided into 20 newsgroups with different topics; each sample carries three kinds of information (document content, document title and document label), and there are no duplicate documents in the corpus. The English dataset AGNews is drawn from the AG news corpus collected in 2004, which contains nearly one million news articles divided mainly into four categories: World, Sports, Business and Sci/Tech. In this paper, 8,000 news articles were selected as the corpus, with an equal number of articles in each of the four categories. The Chinese dataset Ch_News was built by manually compiling and tagging about 1,000 news texts from Ecns.cn and english.chinamil.com.cn, covering five categories: culture and education, finance and economics, military equipment, biotechnology and sports.
Chinese and English datasets are processed differently according to their language characteristics. For the two English datasets, words are already separated by spaces, so low-frequency words and stop words are removed directly. For the Chinese dataset, since Chinese has no spaces between characters and the news contains many domain-specific entities, an entity dictionary containing the domain nouns in the dataset is built first, and then stop-word removal, Chinese word segmentation and low-frequency word removal are carried out in turn.
Table 1 summarizes the statistics of the three data sets. For each data set, 70% of the data is used as the training set and 30% as the test set, and 10% of the training set is then randomly selected as the validation set.
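The 70/30/10 split described above can be sketched with scikit-learn's train_test_split; the documents and labels below are synthetic placeholders:

```python
from sklearn.model_selection import train_test_split

docs = [f"doc_{i}" for i in range(100)]           # placeholder documents
labels = [i % 5 for i in range(100)]              # 5 balanced classes

# 70% train / 30% test, stratified by class label.
train_x, test_x, train_y, test_y = train_test_split(
    docs, labels, test_size=0.3, random_state=42, stratify=labels)
# 10% of the training set held out for validation.
train_x, val_x, train_y, val_y = train_test_split(
    train_x, train_y, test_size=0.1, random_state=42)
```

Stratifying the first split keeps the class proportions of the corpus in both the training and test sets.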
Statistical table of data sets used in experiments
Dataset | Total Docs | Training | Test | Vocabulary | Classes |
20NewsGroup | 18,846 | 13,192 | 5,654 | 42,739 | 20 |
AGNews | 8,000 | 5,600 | 2,400 | 32,653 | 4 |
Ch_News | 1,080 | 756 | 324 | 13,324 | 5 |
In order to verify the effect of the model on the text classification task, experiments are carried out based on TensorFlow and Keras, with Python 3.6 as the programming language. SVM+LDA, TextCNN and Text-GCN are used as baselines for comparison with the proposed method. In the experiments on the three benchmark models, the parameters set in the original papers or the default parameters are used for reproduction. The three algorithms are briefly introduced as follows.
Text classification is a basic and classical task with well-established evaluation metrics. Since this model addresses multi-class text classification, accuracy is selected as the evaluation index, calculated as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

where TP is the number of correctly identified positive samples, FP the number of negative samples incorrectly identified as positive, FN the number of positive samples incorrectly identified as negative, and TN the number of correctly identified negative samples.
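On a multi-class task, this metric reduces to the fraction of correct predictions; a minimal sketch:

```python
def accuracy(y_true, y_pred):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN), i.e. fraction correct."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 3 of 4 predictions match the true labels.
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```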
Table 2 shows the accuracy of the text classification task on the three data sets using different methods. By comparison, the proposed method achieves better classification results on all three data sets, especially on the self-built Chinese data set, which demonstrates the effectiveness of the model.
Comparison of experimental results
Method | 20NewsGroup | AGNews | Ch_News |
SVM+LDA | 0.6827 | 0.7241 | 0.6956 |
TextCNN | 0.7160 | 0.8059 | 0.6893 |
Text-GCN | 0.8634 | 0.9245 | 0.7402 |
LDA-THG+GCN | 0.8796 | 0.9310 | 0.7594 |
The main reasons why the LDA-THG and graph convolution method is superior to the other three methods are: 1) the text graph simultaneously captures the self-information of documents and topic-related auxiliary information; 2) GCN captures information from high-order neighborhoods well: with three layers, the new features of a node are computed as a weighted average over itself and its up-to-third-order neighbors. Moreover, the document and topic nodes in LDA-THG integrate each other's latent information and enrich the expression of text features.
The impact of different topic numbers K on the classification results for the three data sets is shown in Figure 4. The numbers of categories in the three datasets are 20, 4 and 5, respectively, and the ranges of K with high classification accuracy are [14,20], [4,12] and [4,12]; that is, accuracy is relatively high when K lies in a range around the number of categories of the data set itself. When K moves outside this range, accuracy decreases to varying degrees. The main reason is that with too many or too few topics, the discovered topics become too fine-grained or too coarse, and in both cases the document loses some relevant topic information. Meanwhile, differences in the grammatical and expressive features of English and Chinese text lead to a large gap between the classification accuracy of AGNews and Ch_News under otherwise similar conditions.
Influence of topic number on classification results of LDA-THG+GCN model on three data sets
On the Chinese data set Ch_News, the influence of the sliding window size W on the classification results is shown in Figure 5. As the sliding window grows, the accuracy first improves and then slowly decreases, peaking when the window length equals 10. This shows that when the sliding window is too small, sufficient global word co-occurrence information cannot be generated to represent the probability of word co-occurrence, while when the window is too large, edges are added between nodes that are not closely related, making the information to be trained redundant. This not only increases the difficulty of training and storage, but also prevents the features in the text representation from highlighting the main features of the text itself.
Influence of W on classification results of LDA-THG+GCN model on Ch_News data set
In this paper, a new news text classification method based on a graph convolutional neural network is proposed by combining text topics with the words in the text. The method uses the topic distribution and the corresponding topic-word distribution from the LDA model to construct the nodes and edge weights of the graph model, generating a heterogeneous information network graph. The node and edge information of the graph is then integrated through a multilayer graph convolutional network and fed into a softmax classifier to complete the classification. Experiments on two English data sets and one self-built Chinese data set show that the proposed method achieves higher classification accuracy than the baseline models.