Rough set theory is a mathematical tool for dealing with vague and uncertain knowledge, introduced by Pawlak in 1982 [1]. It can mine hidden knowledge from mass data and has been successfully applied in fields such as data mining, pattern recognition, and decision support systems [2,3]. In recent years it has also been used in big data [4], information sharing [5], and other fields. Attribute reduction is one of the core topics of rough set theory. Finding all reductions, or a minimal reduction, has been proved to be an NP-hard problem [6], so the common approach is to adopt heuristic algorithms. These normally fall into two main classes: algorithms based on discernibility matrices and algorithms based on the positive region.
Because attribute reduction algorithms based on the discernibility matrix are succinct and easy to understand, they have attracted a great deal of attention from scholars [7]. However, these algorithms must construct the discernibility matrix, which contains many repeated elements. When processing big data sets, they not only consume a great deal of memory but also waste much computing time during reduction. Because the discernibility matrix need not be constructed, attribute reduction algorithms based on the positive region have attracted increasing attention [8]. However, the time complexity of these algorithms is still far from satisfactory when processing large data sets [9].
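To make the memory cost concrete: for a table of |U| objects, the discernibility matrix stores an attribute set for every pair of objects with different decisions, i.e. up to O(|U|²) entries. A minimal sketch (the tiny table below is hypothetical, not from the paper):

```python
from itertools import combinations

def discernibility_matrix(objects, decision):
    """Build the (upper-triangular) discernibility matrix of a decision table.

    objects: list of condition-attribute tuples; decision: list of labels.
    Entry (i, j) is the set of attribute indices on which objects i and j
    differ, recorded only when their decisions differ.
    """
    matrix = {}
    for i, j in combinations(range(len(objects)), 2):
        if decision[i] != decision[j]:
            diff = {k for k, (a, b) in enumerate(zip(objects[i], objects[j]))
                    if a != b}
            matrix[(i, j)] = diff
    return matrix

# Hypothetical 4-object, 3-attribute decision table.
U = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (0, 1, 0)]
d = [0, 0, 1, 1]
M = discernibility_matrix(U, d)
# M holds one entry per discernible pair, e.g. M[(0, 2)] == {0}
```

Even for this toy table the matrix holds an entry for every cross-decision pair, which is why the matrix-based approach scales poorly with |U|.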
If unnecessary elements are removed from the decision table and only essential elements are extracted for attribute reduction, high efficiency of reduction algorithms can be achieved. Hart first put forward the concept of sample extraction; typical algorithms include CNN [10] and RNN [11]. Sample extraction algorithms can be broadly divided into two kinds: classification competence enhancement approaches and classification competence preservation approaches [12]. At present, many scholars have carried out related research on sample extraction.
In [9], condition attributes are clustered first, the clusters are randomly sampled, and attribute reduction is then conducted on the sample data set. The literature [13] proposes an attribute reduction algorithm based on sample extraction, with three steps: first, important instances are selected from the set of objects; second, a discernibility matrix is constructed using the selected instances; finally, all reductions of the decision table are computed. In [14], an attribute reduction algorithm based on granular computing and discernibility ability is presented. It uses stratified sampling from statistics to divide the original large data set into several sample subsets and conducts attribute reduction on each subset; finally, weight fusion combines the subsets' reduction results into the final reduction of the original data set. In [15], useless samples are filtered out to reduce memory space and running time during reduction. The condensation algorithm based on fuzzy rough sets (FRSC) is introduced in [16]. In rough set theory, samples near the decision boundary are often those with smaller lower approximations; based on this characteristic, FRSC uses the lower approximation as the basis for sample extraction.
Most sample extraction algorithms use random sampling or sampling based on certain criteria, but the reasons for these choices, and how to achieve the optimal sampling effect, have not been fully explained in theory or practice. Building on the above research, this paper presents an attribute reduction method based on sample extraction and priority. Because samples near the classification boundary carry great uncertainty, they provide more information on classification and play an important role in it. Therefore, in this paper, the samples near the classification boundary are extracted and used for attribute reduction. Experiments illustrate the validity and efficiency of this method.
This paper makes some improvements on previous attribute reduction algorithms based on sample extraction, in order to compress the decision table, improve the sampling method, and enhance reduction efficiency. These improvements are primarily reflected in the following aspects.
In some references, it is believed that sample extraction can remove redundant objects from the decision table. In fact, the selected samples are likely to remain redundant if only sample extraction is used. Therefore, data preprocessing must be performed by other means before sample extraction to remove redundant objects from the decision table completely. This paper proposes removing redundant objects by sorting: after a quick sort, identical objects become adjacent, so equivalence classes of identical objects can be divided in a single pass (Algorithm 1).
Algorithm 1 (redundancy detection by sorting). Let the objects of the decision table be stored in an array. Sort them with quick sort so that identical objects become adjacent; then scan the sorted array once: if the current object equals the previous one, append it to the current equivalence class, else open a new equivalence class; finally, return the resulting equivalence classes.
The running time of Algorithm 1 consists of two parts, quick sort and equivalence class division, of which the main cost is the quick sort. Because the total cost of quick sort on |U| objects is O(|U| log |U|) comparisons, and each comparison examines at most |C ∪ D| attribute values, the time complexity of Algorithm 1 is O(|C||U| log |U|).
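The sorting idea behind Algorithm 1 can be sketched as follows. This is my reading rather than the authors' exact pseudocode: sorting makes identical records adjacent, so the equivalence classes fall out of a single linear scan; the example records are the first four rows of the redundant decision table above.

```python
def equivalence_classes_by_sorting(records):
    """Sort records so equal ones become adjacent, then split them into
    equivalence classes of identical records in one linear scan
    (the idea behind Algorithm 1)."""
    order = sorted(range(len(records)), key=lambda i: records[i])
    classes = []
    for i in order:
        if classes and records[classes[-1][-1]] == records[i]:
            classes[-1].append(i)   # same record value: extend current class
        else:
            classes.append([i])     # new record value: open a new class
    return classes

# First four rows of the redundant table (condition + decision values).
recs = [(1, 0, 1, 0), (2, 0, 1, 1), (1, 0, 1, 0), (0, 0, 0, 0)]
# → objects 0 and 2 fall into one class; the cost is dominated by the sort.
```

The scan itself is linear, which is why the overall cost is dominated by the O(|U| log |U|) sort.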
Table 1. A redundant decision table.

a1 | a2 | a3 | d
---|---|---|---
1 | 0 | 1 | 0
2 | 0 | 1 | 1
1 | 0 | 1 | 0
0 | 0 | 0 | 0
1 | 2 | 0 | 1
1 | 2 | 0 | 1
2 | 2 | 0 | 1
On the basis of sorting, non-redundant data are chosen to form the compressed decision table, which is defined as follows.
Definition 1. In decision table S = (U, C ∪ D), the compressed decision table retains exactly one representative object from each group of objects that take identical values on all attributes in C ∪ D; all duplicate objects are removed.
According to Definition 1, generation algorithm of the compressed decision table is given below.
Algorithm 2 (generation of the compressed decision table). Compute the equivalence classes of identical objects using Algorithm 1; for each equivalence class, copy one representative object into the compressed decision table; return the compressed decision table.
For Algorithm 2, the time complexity is O(|C||U|): each object is examined once, and copying a representative touches at most |C ∪ D| attribute values.
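The compression step can be sketched as follows, using the seven rows of the redundant decision table above; this keeps one representative per group of identical records (my reading of Definition 1), in first-occurrence order rather than sorted order.

```python
def compress_decision_table(records):
    """Keep one representative per group of identical records; redundant
    duplicates are dropped and first-occurrence order is preserved."""
    seen, compressed = set(), []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            compressed.append(rec)
    return compressed

# The seven rows of the redundant decision table (Table 1).
table = [(1, 0, 1, 0), (2, 0, 1, 1), (1, 0, 1, 0), (0, 0, 0, 0),
         (1, 2, 0, 1), (1, 2, 0, 1), (2, 2, 0, 1)]
# compress_decision_table(table) keeps 5 of the 7 rows, the same rows as Table 2.
```

A hash set gives the linear scan here; the paper's version works directly on the classes produced by the sorting step, with the same result.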
Table 2. Compressed decision table.

a1 | a2 | a3 | d
---|---|---|---
2 | 0 | 1 | 1
1 | 0 | 1 | 0
0 | 0 | 0 | 0
1 | 2 | 0 | 1
2 | 2 | 0 | 1
After removing redundant objects in the previous step, to improve the efficiency of attribute reduction, we try to further reduce the number of objects by sampling. In some papers, instances are selected by random sampling or according to a fixed criterion; in this paper, the samples near the classification boundary are extracted.
Algorithm 3 (sample extraction). Cluster the objects in the compressed decision table; for each cluster, and for each object in the cluster, if the object lies within the sampling distance of the classification boundary, add it to the sample set; return the sample set.
In Algorithm 3, the sampling distance controls how many objects near the classification boundary are selected.
Through sample extraction, objects that have an important influence on the reduction are selected, which greatly reduces the number of objects and improves the efficiency of reduction.
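One plausible reading of this boundary-based extraction (the exact selection rule of Algorithm 3 is not given above) keeps an object when its nearest differently-labelled neighbour lies within a sampling distance delta; the function name, the Euclidean metric, and the example data are assumptions for illustration.

```python
def boundary_samples(objects, decision, delta):
    """Keep objects whose nearest differently-labelled neighbour is within
    `delta` (Euclidean distance), i.e. objects close to the classification
    boundary. A hedged sketch of the sampling step, not the paper's exact rule."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    kept = []
    for i, obj in enumerate(objects):
        # Distance to the closest object carrying a different decision.
        near = min((dist(obj, objects[j])
                    for j in range(len(objects)) if decision[j] != decision[i]),
                   default=float("inf"))
        if near <= delta:
            kept.append(i)
    return kept

# Hypothetical 2-D objects: two near the class boundary, one far from it.
U = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
d = [0, 1, 0]
boundary_samples(U, d, 1.5)  # → [0, 1]: object 2 is far from the boundary
```

Under this reading, a larger delta keeps more boundary objects; the experiments below study how the choice of sampling distance trades sample size against reduction quality.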
After sample extraction, attribute reduction can be conducted. Some works reduce attributes using the discernibility matrix; however, that method occupies a large amount of memory and has high time complexity, which lowers its efficiency and limits wider application. Attribute reduction algorithms based on the positive region have better time and space complexity because they do not need to build the discernibility matrix. Based on the concept of the positive region, an attribute reduction algorithm based on priority is put forward in this paper; its effect is illustrated by time complexity analysis and experiments.
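For reference, the positive region itself can be computed without any matrix: an object belongs to the positive region of an attribute subset when its equivalence class under that subset is consistent with the decision. A minimal sketch over a hypothetical table:

```python
from collections import defaultdict

def positive_region(objects, decision, attrs):
    """POS_B(d): objects whose equivalence class under the attribute subset
    `attrs` is consistent, i.e. all of its members share one decision value."""
    classes = defaultdict(list)
    for i, obj in enumerate(objects):
        classes[tuple(obj[a] for a in attrs)].append(i)
    pos = set()
    for members in classes.values():
        if len({decision[i] for i in members}) == 1:
            pos.update(members)
    return pos

# Hypothetical 4-object table with two condition attributes.
U = [(1, 0), (1, 1), (0, 0), (1, 0)]
d = [0, 1, 1, 0]
positive_region(U, d, [0])     # → {2}: objects 0, 1, 3 share a class with mixed decisions
positive_region(U, d, [0, 1])  # → {0, 1, 2, 3}: both attributes make the table consistent
```

Because the partition can be built with hashing in one pass, each positive-region evaluation is linear in the number of objects, which is the source of the positive-region approach's space advantage over the matrix.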
This paper introduces the concept of priority to measure the discernibility ability. The definition of priority is presented here.
For decision table
It can be seen from the above definition that the more equivalence classes the attribute set divides the object set into, and the more balanced those classes are, the stronger the attribute set's ability to distinguish the objects.
For decision table
Clearly, the smaller the priority is, the stronger the discernibility ability of the attribute set. Theorem 1 is given on this basis.
According to the above definitions, Theorem 1 holds clearly.
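The priority formula itself is not reproduced above, so the sketch below uses a stand-in measure with the stated properties: the sum of squared block fractions of the partition an attribute subset induces, which decreases as the partition becomes finer and more balanced (so smaller priority means stronger discernibility). The function name and example data are assumptions.

```python
from collections import Counter

def priority(objects, attrs):
    """Stand-in priority measure: the sum of squared block fractions of the
    partition that `attrs` induces on the objects. It shrinks as the
    partition becomes finer and more balanced, so a smaller value means a
    stronger discernibility ability, matching the property stated above."""
    n = len(objects)
    blocks = Counter(tuple(obj[a] for a in attrs) for obj in objects)
    return sum((size / n) ** 2 for size in blocks.values())

# Hypothetical 4-object table with two binary condition attributes.
U = [(1, 0), (1, 1), (0, 0), (0, 1)]
priority(U, [0])     # 0.5 : two balanced blocks of two objects each
priority(U, [0, 1])  # 0.25: four singleton blocks (finer, hence smaller)
```

Any measure with this monotone behaviour would order attribute subsets the same way for the greedy selection that follows.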
According to the above theoretical foundation, the attribute reduction algorithm which is based on priority is given below.
Algorithm 4 (attribute reduction based on priority). Calculate the positive region of the full condition attribute set; while the positive region of the current reduct differs from it, calculate the priority of each remaining candidate attribute, choose the attribute with the smallest priority, and add it to the reduct; return the reduct.
For Algorithm 4, the time complexity is O(|C|²|U|): the loop adds at most |C| attributes, and each iteration evaluates the priority of at most |C| candidate attributes, each evaluation taking O(|U|) time with hashing.
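Algorithm 4's greedy loop can be sketched as follows; the priority function is the same stand-in as before (the paper's exact formula is not given above), and the stopping condition follows the positive-region criterion. The example table is hypothetical.

```python
from collections import defaultdict

def partition(objects, attrs):
    """Blocks of the equivalence relation induced by the attribute subset."""
    blocks = defaultdict(list)
    for i, obj in enumerate(objects):
        blocks[tuple(obj[a] for a in attrs)].append(i)
    return list(blocks.values())

def pos_region(objects, decision, attrs):
    """Objects whose block is consistent with the decision."""
    pos = set()
    for block in partition(objects, attrs):
        if len({decision[i] for i in block}) == 1:
            pos.update(block)
    return pos

def priority(objects, attrs):
    """Stand-in priority: sum of squared block fractions (smaller = finer,
    more balanced partition = stronger discernibility)."""
    n = len(objects)
    return sum((len(b) / n) ** 2 for b in partition(objects, attrs))

def reduce_by_priority(objects, decision):
    """Greedy forward selection in the spirit of Algorithm 4: repeatedly add
    the attribute that minimises the priority of the candidate reduct, until
    its positive region matches that of the full attribute set."""
    all_attrs = range(len(objects[0]))
    target = pos_region(objects, decision, list(all_attrs))
    red = []
    while pos_region(objects, decision, red) != target:
        best = min((a for a in all_attrs if a not in red),
                   key=lambda a: priority(objects, red + [a]))
        red.append(best)
    return red

# Hypothetical table: the decision simply copies condition attribute 1.
U = [(0, 0), (0, 1), (1, 2), (1, 0)]
d = [0, 1, 2, 0]
reduce_by_priority(U, d)  # → [1]: attribute 1 alone already determines d
```

The loop always terminates, since the full attribute set trivially satisfies the stopping condition; in the example, attribute 1 induces the finer partition, so it is chosen first and the reduct is found in one step.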
To determine a reasonable value of the sampling distance, experiments are conducted on three UCI data sets, whose characteristics are listed in Table 3.
Table 3. UCI data sets.

Data set | Number of objects | Number of condition attributes
---|---|---
Statlog | 6435 | 36
Winequality | 4898 | 11
Contraceptive | 1473 | 9
First, an experiment exploring the relationship between different sampling distances and reduction effectiveness is conducted on the Statlog data set, and a reasonable value of the sampling distance is determined. The results are listed in Table 4.
Table 4. The relationship between different sampling distances and reduction effectiveness for Statlog.

Sampling distance | Number of samples | Number of reduced attributes | Reduction time (s)
---|---|---|---
0.6 | 5473 | 9 | 329.61
0.8 | 3965 | 8 | 195.13
 | 2648 | 8 | 111.42
1.05 | 2344 | 7 | 79.82
1.1 | 2077 | 7 | 65.15
1.2 | 1628 | 6 | 54.59
From Table 4 it can be seen that the larger the sampling distance is, the fewer samples are selected and the greater the difference between the number of reduced attributes and that of the traditional algorithm becomes. To keep the reduction result consistent with that of the traditional algorithm as far as possible, a proper sampling distance must be chosen: small enough to preserve the reduction result, yet large enough to cut the sample size substantially.
Next, the traditional attribute reduction algorithm based on the positive region and Algorithm 4 of this paper are separately used to compute a reduction, and the time required is compared. The corresponding experimental results on the Winequality data set are listed in Table 5. There are 11 condition attributes in this data set, and the computing time of our method is the sum of the redundancy removal, sampling, and reduction times.
As seen from Table 5, the original Winequality data set contains nearly 20% redundant data, a very common phenomenon in real data. At the same time, the sampling rate is 45%, so the number of objects is greatly decreased after redundancy removal and sampling. Moreover, the reduction computed from the samples basically coincides with that of the original data set, while the computing time is only 29% of that on the original data set. Overall, reduction efficiency is greatly improved.
Table 5. Comparison on the effect of reduction for Winequality data set.

Objects (original) | Objects (after redundancy removal) | Objects (after sampling) | Reduction result (traditional) | Reduction time (s, traditional) | Reduction result (this paper) | Computing time (s, this paper)
---|---|---|---|---|---|---
4898 | 3961 | 1788 | 5 | 113.62 | 4 | 33.25
The experimental results on the Contraceptive data set are listed in Table 6. There are nine condition attributes in this data set, and the computing time of our method is again the sum of the redundancy removal, sampling, and reduction times.
The result shown in Table 6 is similar to that in Table 5. Redundant data account for about 3% of the original Contraceptive data set, and the sampling rate reaches 47%, so the number of objects drops significantly. Consequently, the computing time of Algorithm 4 is only 33% of that on the original data set, which again illustrates the efficiency of the proposed approach. Meanwhile, comparing Table 5 with Table 6, it is found that the larger the data volume, the more obvious the advantage of Algorithm 4.
Table 6. Comparison on the effect of reduction for Contraceptive data set.

Objects (original) | Objects (after redundancy removal) | Objects (after sampling) | Reduction result (traditional) | Reduction time (s, traditional) | Reduction result (this paper) | Computing time (s, this paper)
---|---|---|---|---|---|---
1473 | 1425 | 665 | 9 | 12.62 | 8 | 4.13
For large data sets, redundant data consume a great deal of computing time and memory during attribute reduction. To overcome this issue, this paper puts forward an attribute reduction method based on sample extraction and priority. First, the method removes redundant data using quick sort. Then, the samples near the classification boundary are extracted and used for the reduction calculation. Based on the concept of the positive region, an attribute reduction algorithm based on priority is proposed and applied to the extracted samples. Experiments show that this method is well suited to attribute reduction on large data sets. With the rise of smart medical care, a large number of medical data resources have been integrated, and knowledge mining on medical big data has become a research focus. The method proposed in this paper can be used to reduce the dimensionality of massive medical data effectively and improve the efficiency of knowledge mining; this is the next application and research direction.