Attribute Reduction Method Based on Sample Extraction and Priority

Rough sets theory is a mathematical tool to deal with vague and uncertain knowledge presented by Pawlak in 1982 [1]. It can mine the hidden knowledge from the mass data and has been successfully applied in fields such as data mining, pattern recognition, decision support system [2,3], etc. In recent years, it is also used in big data [4], information sharing [5], and other fields. Attribute reduction is one of the core contents of the Rough sets theory. It has been proved to be a NP-Hard problem to find all reductions or a minimal reduction in Rough Sets [6]. So, the common method is to adopt the heuristic algorithm. Normally there are two main classes, the algorithms that are based on the discernibility matrices and the algorithms that are based on positive region.

Because the attribute reduction algorithms based on discernibility matrix is succinct and easy to understand, it has attracted a great deal of attention from scholars [7]. But in this kind of algorithms the discernibility matrix need be constructed and there are a lot of repeat elements in the discernibility matrix. In disposing big datasets, this kind of algorithms not only cost a great deal of memory space, but also waste much computer time in reduction. Because the discernibility matrix need not be constructed, the attribute reduction algorithms based on positive region attract more and more attention from scholars [8]. But the time complexity of this kind of algorithms are still far from satisfactory in processing large data sets [9].

If unnecessary elements are removed from decision table and only essential elements are extracted for attribute reduction, high efficiency of reduction algorithms can be reached. Hart first puts forward the concept of the sample extraction. There are some typic algorithms, such as the CNN [10], the RNN [11], etc. Numerous sample extraction algorithms can be broadly divided into two kinds: classification competence enhancement approaches and classification competence preservation approaches [12]. At present, many scholars have done lots of correlative investigations to the sample extraction.

In the paper [9], condition attributes are clustered firstly. Then the cluster are randomly sampled. Finally, attribute reduction is conducted on the sample data set. The literature [13] proposes an attribute reduction algorithm that is based on sample extraction. There are three steps in this algorithm. First, important instances are selected from the set of objects. Second, discernibility matrix is constructed by the use of the selected instances. Finally, all reductions of decision table are computed. In [14], an attribute reduction algorithm based on granular computing and discernibility ability is presented. This algorithm uses stratified sampling in statistics to divide original large dataset into several sample subsets and conduct attribute reduction on each subset. In the end, weight fusion was made to get final reduction result of original large dataset according to reduction results of all the subset. In [15], it is stated that useless samples will be filtered to reduce memory space and running time during reduction calculation. The algorithm called condensation rule based on fuzzy rough sets(FRSC) is introduced in [16]. In rough sets theory, the samples near the decision boundary are often these samples that have smaller lower approximation. Based on this characteristic, FRSC uses lower approximation as a basis for sample extraction.

Most sample extraction algorithms use random sampling or sampling methods based on a certain criteria. But the reasons for that and how to achieve optimum sample effect have not been fully interpreted in theory and practice. According to the above scholars’ researches, an attribute reduction method based on sample extraction and priority is presented in this paper. Because the samples near the classification boundary have great uncertainty, they can provide more information on classification and play an important role in classification. So, in this paper, the samples near the classification boundary are extracted and used for attribute reduction. Experiments illustrate the validity and efficiency of this method.

Proposed methodology

This paper makes some improvements on the previous attribute reduction algorithms which are based on sample extraction to compress decision table, improve sampling method and enhance the reduction efficiency. These improvements primarily reflected in the following aspects.

2.1

Data preprocessing

In some references, it is believed that sample extraction can removes redundant objects from decision table. But in fact the selected samples are far more likely to be still redundant if only the sample extraction method is used. Therefore data preprocessing must be done by using other methods before sample extraction to remove redundant objects from decision table completely. This paper proposes that U is classified according to C to solve the problem of data redundancy, namely computation U/C. That is, U is sorted by each condition attribute in order. Then, the sorted U is analysed and equivalence classes are divided. Next, non-redundant objects can be selected out. The quick sort method is used in the sorting stage.

Let S=(U, C∪D, V, f) be a decision table to be ordered and U={u₁, u₂, . . . , u_n}. Let ${u_{1}^{'}, u_{2}^{'}, \dots, u_{n}^{'}}$ \left\{ {u_1^\prime,u_2^\prime, \ldots ,u_n^\prime} \right\} be the set of objects which are ordered by C. H_i denotes the i-th equivalence class and H is the set of H_i. The basic design ideas are in Algorithm 1.

Algorithm 1

Algorithm of computing U/C using quick sort.

Input: S, U={u₁, u₂, . . . , u_n}, C={a₁, a₂, . . . , a_k}.

Output: U/C.

for(i=1;i<=k;i++)

sort U with quick sort by a_i;

t=1; $H_{1} = {u_{1}^{'}}$ {H_1} = \left\{ {u_1^\prime} \right\} ;

for(j=2;j<=n;j++)

if $(f (u_{j}^{'}, a_{i}) = = f (u_{j - 1}^{'}, a_{i}) for \forall a_{i})$ \left( {f\left( {u_j^\prime,{a_i}} \right) = = f\left( {u_{j - 1}^\prime,{a_i}} \right)\;{\rm{for}}\;\forall {a_i}} \right)

$H_{t} = H_{t} \cup {u_{j}^{'}};$ {H_t} = {H_t} \cup \left\{ {u_j^\prime} \right\};

else ${t + +; H_{t} = {u_{j}^{'}}$ \left\{ {t + + ;{H_t} = \left\{ {u_j^\prime} \right\}} \right.

return H.

The running time of Algorithm 1 consists of two parts: quick sort and equivalence class division, among which the main cost is quick sort. Because the total cost of |C| times quick sort is O(|C||U|log|U|), the time complexity of Algorithm 1 is O(|C||U|log|U|) consequently. An example of equivalence class partition is given later. Table 1 is a redundant decision table. Among the objects, u₁ and u₃ are duplicate, and u₅ and u₆ are duplicate.

Table 1

A redundant decision table.

U	a	b	c	D
u₁	1	0	1	0
u₂	2	0	1	1
u₃	1	0	1	0
u₄	0	0	0	0
u₅	1	2	0	1
u₆	1	2	0	1
u₇	2	2	0	1

U/C is computed through Algorithm 1. First, U is sorted through condition attribute a using quick sort. After sorting, U is {u₄, u₃, u₁, u₅, u₆, u₂, u₇}. Similarly, U is sorted through other condition attributes in order. Eventually, U/{a,b,c,} is {{u₇}, {u₆, u₅}, {u₄}, {u₃, u₁}, {u₂}}.

On the basis of sorting, non-redundant data are chosen to form the compressed decision table. The compressed decision table is defined as follows.

Definition 1

In decision table S, $U / C = {[u_{1}^{'}] c, [u_{2}^{'} c], \dots, [u_{d}^{'} c]}$ U/C = \left\{ {\left[ {u_1^\prime} \right]c,\left[ {u_2^\prime c} \right], \ldots ,\left[ {u_d^\prime c} \right]} \right\} , where $u_{1}^{'}, u_{2}^{'}, \dots, u_{d}^{'}$ u_1^\prime,u_2^\prime, \ldots ,u_d^\prime are certain elements in various equivalence classes. ${u_{1}^{'}, u_{2}^{'}, \dots, u_{d}^{'}}$ \left\{ {u_1^\prime,u_2^\prime, \ldots ,u_d^\prime} \right\} is denoted as U′. S′ = (U′,C ∪ D,V ′, f ′) is defined as compressed decision table, where U′ is the domain of S′;V ′; the set of values of all attributes in S′; f ′; the information function of S′; f ′.

According to Definition 1, generation algorithm of the compressed decision table is given below.

Algorithm 2

Generation algorithm of the compressed decision table

Input: S

Output: S′ = (U′,C ∪ D,V ′, f ′)

U′ = ∅;

computing $U / C = {{[u_{2}^{'}]}_{C}, {[u_{2}^{'}]}_{C}, \dots, {[u_{d}^{'}]}_{C}}$ U/C = \left\{ {{{\left[ {u_2^\prime} \right]}_C},{{\left[ {u_2^\prime} \right]}_C}, \ldots ,{{\left[ {u_d^\prime} \right]}_C}} \right\} using Algorithm 1;

for(i=1;i<=d;i++)

$U_{i}^{'} = U^{'} \cup {u_{i}^{'}}$ U_i^\prime = U' \cup \left\{ {u_i^\prime} \right\}

return S′.

For Algorithm 2, time complexity is O(|C||U|log|U|). The compressed decision table of Table 1 can be got through Algorithm 2, which is shown in Table 2.

Table 2

Compressed decision table.

U	a	b	c	D
u₂	2	0	1	1
u₃	1	0	1	0
u₄	0	0	0	0
u₅	1	2	0	1
u₇	2	2	0	1

2.2

Sample extraction

After removing redundant objects in the previous step, to improve the efficiency of attribute reduction, we try to further reduce the number of objects by sampling. In some papers, instances are selected by K-Means. The value of K needs be determined artificially beforehand and cannot be changed during the whole computation process. Furthermore, the value of K often cannot be precisely estimated for large data sets with high dimensionality. But, iterative self-organising data analysis technique algorithm (ISODATA) makes improvement, which is just aiming at this question. Thus, ISODATA is chosen to extract samples in this paper. Sample extraction algorithm based on ISODATA is provided below. Let λ be the number of clusters. N_i is the number of objects in the i-th cluster. D_ij is the distance between object u_j in the i-th cluster and cluster centre. The average distance from all objects in the i-th cluster to cluster centre is D_i. The sampling distance is mD_i.

Algorithm 3

Sample extraction algorithm based on ISODATA.

Input: S′ = (U′,∪D,V ′, f ′)

Output: S″ = (U″,∪,D,V″, f″)

U″ = ∅

cluster the objects in U′ using the ISODATA;

for(i=1;i<= λ ; j + +)

for(j=1;j<=N_i; j + +)

if(D_ij>mD_i)

U″ = U″ ∪ u_ij;

return S″.

In Algorithm 3, m is a constant and its reasonable value can be determined by the following experimental research. As time complexity of ISODATA is O(|U′|²) [17] and time complexity of the later dual loop is O((|U′|N_i × N_i) = O(|U′|) time complexity of sample extraction algorithm is O(|U′|² = O(|U|²).

Through sample extraction, objects that have an important influence on the reduction are selected, which greatly reduces the number of objects and improves the efficiency of reduction.

2.3

Attribute reduction

After sample extraction, attribute reduction can be conducted. Some literature reduce the attributes using the discernibility matrix. However, the discernibility matrix method shows some disadvantages that lead to its lower efficiency and thus largely limit the wider applications, such as it occupies a large amount of RAM and has high time complexity. The attribute reduction algorithms based on positive region have better time and space complexity because they do not need to establish the discernibility matrix. According to the concept of positive region, an attribute reduction algorithm based on priority is put forward in this paper. Then the effect of this algorithm is illustrated by the time complexity analysis and experiments.

This paper introduces the concept of priority to measure the discernibility ability. The definition of priority is presented here.

Definition 2

For decision table S, attribute set P ⊆ C divides U in k equivalence classes. The number of elements of each equivalence classes are |E₁|, |E₂|, . . . , |E_k|, respectively. The priority of P is defined as $pri (P) = \sum_{i = 1}^{k} (| E_{i} | \frac{| E_{i} |}{| U |})$ pri\left( P \right) = \sum\limits_{i = 1}^k {\left( {\left| {{E_i}} \right|{{\left| {{E_i}} \right|} \over {\left| U \right|}}} \right)} Particularly, pri (∅)= |U|.

It can be seen from the above definition that the more equivalence classes are divided and the more balanced, the stronger the ability of the attribute set to distinguish the object set.

Definition 3

For decision table S, the priority of attribute a with regard to set of attributes P ⊆ C is defined as pri(P∪{a}).

Clearly, the smaller the priority is, the larger the discernibility ability of attribute set is. Theorem 1 is given on this basis.

Theorem 1

In S, for R ⊆ C and ∀a∈ R, R is an attribute reduction of C if and only if pri(R) = pri(C) and pri(R − {a}) > pri(C).

Proof

According to the above definitions, Theorem 1 holds clearly.

According to the above theoretical foundation, the attribute reduction algorithm which is based on priority is given below.

Algorithm 4

Attribute reduction algorithm based on priority

Input: S″ = (U″,∪D,V″, f″)

Output: an attribute reduction R

R∅;

calculate pri(C);

while(pri(R)>pri(C))

{calculate priority of each attribute in C with regard to R;

choose the attribute a that has the smallest priority;

R=R∪{a}, C=C-{a};}

return R.

For Algorithm 4, time complexity is O(|C|²U″/C|)=O(|C|²|U/C|)=O(|C||U|).When Algorithms 1–4 are all took into account, total time complexity of reduction is O(|U|²), because normally |C|<<|U| in the big data environment. That is to say, most of the time is spent on sample extraction using ISODATA. The higher the data volume is, the quicker the time increases.

Results and discussions

To determine the reasonable value of m and compare the time complexity of the algorithms, we use 3 data sets from UCI machine learning databases to make experiments with MATLAB. The experimental conditions are as follows: Windows 10 OS, Inter Core i7 CPU and 8 GB RAM. The basic information about the data sets is shown in Table 3. Further to ensure the accuracy of reduction, the values of attributes are pre-treated. The results are average values of five experiments.

Table 3

UCI data sets.

Data sets	\|U\|	\|C\|
Statlog	6435	36
Winequality	4898	11
Contraceptive	1473	9

First, experiment to explore the relationship between different sampling distances and reduction effectiveness with Statlog data set is conducted and reasonable value of m is determined. Reduction result obtained by the traditional attribute reduction algorithm based on positive region [18] is that reduction contains 10 attributes and reduction time is 849.51 s. Reduction result obtained by the Algorithm 4 in this paper is shown in Table 4, where computing time is the sum of sampling time and reduction time, D_i is the average distance from all objects in the i-th cluster to cluster centre.

Table 4

The relationship between different sampling distances and reduction effectiveness for Statlog.

Sampling distance	Sample number	Attribute number of minimum reduction	Computing time (s)
0.6D_i	5473	9	329.61
0.8D_i	3965	8	195.13
D_i	2648	8	111.42
1.05D_i	2344	7	79.82
1.1D_i	2077	7	65.15
1.2D_i	1628	6	54.59

From Table 4 it can be seen that the smaller the sampling distance is, the fewer the number of samples is and the greater the difference between the number of the reduced attributes and traditional algorithm is. To bring reduction effectiveness in line with traditional algorithm as far as possible, a proper sampling distance must be chosen. Clearly, if sampling distance is D_i, namely m=1, Algorithm 4 achieves a good balance between the number of samples and reduction effectiveness. In this case, the number of samples is relatively small and time complexity of Algorithm 4 is low correspondingly. Reduction time accounts for just 13% of traditional algorithm. But at the same time, reduction effectiveness basically coincides with the traditional algorithm. The experimental research results on other data sets are similar to the Statlog. Hence the value of m is 1 in all the later experiments. That is to say, the sampling distance is D_i.

Next, traditional attribute reduction algorithm based on positive region and Algorithm 4 in this paper are separately used to calculate the time required for getting reduction and compare the efficiency of the algorithms. The corresponding experimental results on Winequality data set are listed. There are 11 condition attributes in database and computing time of this paper is the sum of redundancy removal time, sampling time and reduction time.

As seen from Table 5, the original Winequality data set has nearly 20% redundant data, which is a very common phenomenon in real data. At the same time, the sampling rate is 45%. So, the number of objects of the original data set has been greatly decreased after redundancy removal and sampling. Moreover, reduction of samples basically coincides with the original data set, but computing time just accounts for 29% of the original data set. Overall, it is clear that reduction efficiency is greatly improved.

Table 5

Comparison on the effect of reduction for Winequality data set

\|U\|			Traditional algorithm		Algorithm 4
Original data set	Redundancy removal	Sampling	Reduction result	Reduction time (s)	Reduction result	Computing time (s)
4898	3961	1788	5	113.62	4	33.25

The experimental results on Contraceptive data set are listed here. There are nine condition attributes in database and computing time of this paper is the sum of redundancy removal time, sampling time and reduction time.

The result shown in Table 6 is similar to that in Table 5. Redundant data provide 3% of the original Contraceptive data set and the sampling rate reaches 47%. Thus, the number of objects drops significantly. Consequently, computing time of Algorithm 4 in this paper just accounts for 33% of the original data set. This also illustrates the efficiency of the proposed approach. Meanwhile, comparing Table 5 with Table 6, it is found that the higher the data volume is, the more obvious the advantage of Algorithm 4 is.

Table 6

Comparison on the effect of reduction for Contraceptive data set.

\|U\|			Traditional algorithm		Algorithm 4
Original data set	Redundancy removal	Sampling	Reduction result	Reduction time (s)	Reduction result	Computing time (s)
1473	1425	665	9	12.62	8	4.13

Conclusions

For large data sets, redundant data expend a lot of computer run time and memory in attribute reduction. To overcome this issue, the paper puts forward an attribute reduction method based on sample extraction and priority. To start with, this method removes redundant data using quick sort. Then, the samples near the classification boundary are extracted and used for reduction calculation. According to the concept of positive region, an attribute reduction algorithm that is based on priority is proposed in this paper and used for reduction. Experiments show that this method suits attribute reduction calculation on large data sets well, specifically. With the rise of smart medical care, a large number of medical data resources have been integrated. Medical big data knowledge mining has become a research focus in the current academic field. The method proposed in this article can be used to effectively reduce the dimensionality of massive medical data and improve the efficiency of knowledge mining. This is the next application and research direction.

eISSN:: 2444-8656
Langue:: Anglais

Périodicité:: Volume Open
Sujets de la revue:: Life Sciences, other, Mathematics, Applied Mathematics, General Mathematics, Physics

RSS Feed de la revue

Attribute Reduction Method Based on Sample Extraction and Priority

Publié en ligne: 12 avr. 2021

Pages: 219 - 226

Reçu: 28 nov. 2020

Accepté: 31 janv. 2021

DOI: https://doi.org/10.2478/amns.2021.1.00036

Mots clésrough sets, sample extraction, priority, attribute reduction, algorithm

© 2021 Biqing Wang, published by Sciendo

This work is licensed under the Creative Commons Attribution 4.0 International License.

Mots clés
rough sets, sample extraction, priority, attribute reduction, algorithm