Email has become a major means of communication, but it has brought the problem of spam with it. Spam does harm mainly in two ways: it occupies bandwidth, congesting email servers and reducing network efficiency, and it consumes users' time, lowering their work efficiency. Distinguishing normal email from spam effectively, so that as much spam as possible is filtered out, has therefore become an active research topic.
The naive Bayes algorithm is a frequently used, statistics-based classification algorithm for email[1-3] that is simple to implement and fast at classification. However, it assumes that the attributes are independent of each other given the target value[4]. This assumption clearly does not hold for email, where the occurrences of words are correlated, so the accuracy of email classification based on naive Bayes is relatively low. To address this problem, researchers have proposed new email classification algorithms, one of which is based on the deep neural network (DNN).
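The independence assumption can be made concrete with a minimal sketch of a word-count naive Bayes classifier. The toy messages, the labels "spam" and "ham", the add-one smoothing, and the function names `train_naive_bayes` and `classify` are illustrative assumptions, not the setup used in the experiments.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Estimate log priors and add-one smoothed log word likelihoods."""
    vocab = {w for d in docs for w in d}
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(word_counts[c].values()) + len(vocab)  # add-one smoothing
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like

def classify(words, log_prior, log_like):
    """Return the class maximizing log P(c) + sum_i log P(w_i | c)."""
    scores = {}
    for c in log_prior:
        # Independence assumption: per-word log likelihoods are simply summed.
        # Words outside the training vocabulary are ignored.
        scores[c] = log_prior[c] + sum(
            log_like[c][w] for w in words if w in log_like[c])
    return max(scores, key=scores.get)

# Hypothetical toy corpus, for illustration only.
docs = [["win", "money", "now"], ["cheap", "money", "offer"],
        ["meeting", "at", "noon"], ["project", "meeting", "notes"]]
labels = ["spam", "spam", "ham", "ham"]
log_prior, log_like = train_naive_bayes(docs, labels)
prediction = classify(["cheap", "money"], log_prior, log_like)
```

Because each word contributes to the score separately, correlated words such as "cheap" and "offer" are counted as independent evidence, which is the weakness discussed above.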
The basic concept of the artificial neural network is based on hypotheses and models of how the human brain responds to complex problems[4-6]. The deep neural network is an artificial neural network with full connections between adjacent layers, and its structure is shown in figure 1. Full connection means that every neuron in the ith layer is connected to every neuron in the (i + 1)th layer. Although the deep neural network looks complex, each small local unit is still the same as a perceptron.
Figure 1. Structure diagram of deep neural network
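The fully connected structure can be sketched as a forward pass in which every neuron computes a weighted sum of all outputs of the previous layer, adds a bias, and applies an activation. This is a minimal sketch: the layer sizes, the random initialization range, and the names `dense_layer`, `init_layer`, and `forward` are hypothetical choices for illustration, not the configuration of this paper.

```python
import math
import random

def sigmoid(z):
    """Logistic activation; adequate for the small values used here."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases):
    """Each output neuron is a perceptron-like unit: sigmoid(w . x + b)."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """One weight row per output neuron, so every input feeds every output."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """Propagate the input through all layers in order."""
    activation = x
    for weights, biases in layers:
        activation = dense_layer(activation, weights, biases)
    return activation

rng = random.Random(0)
layer_sizes = [4, 3, 3, 1]  # hypothetical: input, two hidden layers, output
layers = [init_layer(n_in, n_out, rng)
          for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
output = forward([0.2, -0.1, 0.4, 0.0], layers)
```

Each row of a weight matrix holds one neuron's incoming connections, so a layer with `n_in` inputs and `n_out` neurons has `n_in * n_out` weights.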
We use
We assume that
Here σ represents the non-linear activation function of the nodes in the hidden layers. The traditional DNN usually uses the sigmoid function, as shown in expression (3). Because the sigmoid function is monotonically increasing, as is its inverse, it is often used as a threshold function in neural networks. It maps its input to a value between 0 and 1. The sigmoid function curve is shown in figure 2:
Figure 2. The sigmoid function curve
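The properties mentioned above can be checked numerically. The sketch below assumes the standard logistic form σ(x) = 1 / (1 + e^(-x)), which is presumably what expression (3) denotes; the helper name `logit` for the inverse and the sample points are illustrative choices.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), evaluated without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0 here, so exp(x) cannot overflow
    return e / (1.0 + e)

def logit(p):
    """Inverse of the sigmoid on the open interval (0, 1)."""
    return math.log(p / (1.0 - p))

xs = [-6.0, -2.0, 0.0, 2.0, 6.0]
ys = [sigmoid(x) for x in xs]

# Outputs lie strictly between 0 and 1 and increase with the input.
in_unit_interval = all(0.0 < y < 1.0 for y in ys)
is_increasing = all(a < b for a, b in zip(ys, ys[1:]))
```

Splitting on the sign of `x` avoids computing `exp` of a large positive number, which would otherwise raise an overflow error for strongly negative inputs.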
The implementation process of the email classification algorithm based on the deep neural network is shown in Figure 3.
Figure 3. Algorithm execution process
To verify the effect of the email classification algorithm based on DNN, we constructed a DNN with 2 hidden layers of 30 nodes each. For training, we used 2000 batches, each containing 3 training samples. We used the well-known SpamBase data set from the UCI Machine Learning Repository at the University of California, Irvine, USA. The details are shown in table I.
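The batch settings above amount to 2000 × 3 = 6000 sample presentations during training. The paper does not state how samples are assigned to batches, so the sketch below assumes each batch is drawn at random from the training set; the function name `sample_batches`, the seed, and the training-set size are hypothetical.

```python
import random

def sample_batches(n_samples, n_batches=2000, batch_size=3, seed=0):
    """Yield lists of sample indices, one list per mini-batch."""
    rng = random.Random(seed)
    for _ in range(n_batches):
        # Indices within one batch are distinct; different batches may overlap.
        yield rng.sample(range(n_samples), batch_size)

n_train = 460  # hypothetical training-set size, for illustration only
batches = list(sample_batches(n_train))
total_draws = sum(len(batch) for batch in batches)
```

Under this assumption, a training set smaller than 6000 samples is visited more than once on average, so the batch count acts much like a fixed training budget.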
TABLE I. SPAMBASE DATA SET
We compared the two email filtering algorithms, DNN and naive Bayes, in terms of accuracy, which is the main evaluation criterion for email filtering. Accuracy is defined as follows:
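As a minimal sketch, and assuming the usual definition of accuracy as the fraction of emails whose predicted class matches the true class, the measure can be computed as follows. The label encoding (1 for spam, 0 for normal email) and the example vectors are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels: 1 = spam, 0 = normal email.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc = accuracy(y_true, y_pred)  # 6 of 8 predictions are correct
```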
We carried out five groups of experiments. The selection of the training set and testing set in each experiment is shown in table II.
TABLE II. THE SELECTION OF TRAINING SET AND TESTING SET
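One way to produce the five training and testing sets is a random split by proportion. The sketch below assumes that the training proportions are the 10% to 50% values reported with the results, that the remaining emails form the testing set, and that the data set holds 4601 emails as SpamBase is commonly distributed; the function name `split_indices` and the seed are hypothetical.

```python
import random

def split_indices(n_samples, train_fraction, seed=0):
    """Shuffle sample indices and split them into training and testing parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train_fraction)  # round down to whole samples
    return indices[:n_train], indices[n_train:]

n_emails = 4601  # assumed size of the SpamBase data set
fractions = (0.1, 0.2, 0.3, 0.4, 0.5)
splits = {f: split_indices(n_emails, f) for f in fractions}
train_sizes = {f: len(train) for f, (train, test) in splits.items()}
```

Fixing the seed keeps the splits reproducible, so both classifiers can be evaluated on exactly the same emails in each experiment.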
The experimental results are shown in Figure 4.
Figure 4. Comparison of the accuracy of the two algorithms
The experimental results show that DNN achieved higher email classification accuracy than naive Bayes at every training-set proportion tested (10%, 20%, 30%, 40% and 50%), indicating a good classification effect.
This paper studied the application of an email classification algorithm based on the deep neural network. The algorithm constructs multiple hidden layers and generates a DNN classifier through training. The experimental results show that its accuracy is clearly higher than that of the naive Bayes algorithm.
With the development of science and technology, spam takes many forms and does increasingly serious damage, which places higher demands on the accuracy of spam recognition. Future research will focus on combining various algorithms to further improve the effect of email classification.