Attribute Reduction Method Based on Sample Extraction and Priority

Published Online: 12 Apr 2021
Page range: 219 - 226
Received: 28 Nov 2020
Accepted: 31 Jan 2021
Abstract

Attribute reduction is a key issue in rough set research. To address the shortcomings of attribute reduction algorithms based on the discernibility matrix, an attribute reduction method based on sample extraction and priority is presented. First, equivalence classes are divided using quick sort to compute a compressed decision table. Second, important samples are extracted from the compressed decision table using the iterative self-organizing data analysis technique algorithm (ISODATA). Finally, attribute reduction of the sample decision table is conducted based on the concept of priority. Experimental results show that the proposed method significantly reduces the overall execution time and improves reduction efficiency.

Keywords

Introduction

Rough set theory is a mathematical tool for dealing with vague and uncertain knowledge, proposed by Pawlak in 1982 [1]. It can mine hidden knowledge from mass data and has been successfully applied in fields such as data mining, pattern recognition and decision support systems [2,3]. In recent years, it has also been used in big data [4], information sharing [5] and other fields. Attribute reduction is one of the core topics of rough set theory. Finding all reductions, or a minimal reduction, has been proved to be an NP-hard problem [6], so the common approach is to adopt heuristic algorithms. These normally fall into two main classes: algorithms based on the discernibility matrix and algorithms based on the positive region.

Because attribute reduction algorithms based on the discernibility matrix are succinct and easy to understand, they have attracted a great deal of attention from scholars [7]. However, these algorithms must construct the discernibility matrix, which contains many repeated elements. When processing big data sets, they not only consume a great deal of memory but also waste much computing time during reduction. Because the discernibility matrix need not be constructed, attribute reduction algorithms based on the positive region have attracted more and more attention [8]. Yet the time complexity of these algorithms is still far from satisfactory when processing large data sets [9].

If unnecessary elements are removed from the decision table and only essential elements are extracted for attribute reduction, high reduction efficiency can be achieved. Hart first put forward the concept of sample extraction, and typical algorithms include CNN [10] and RNN [11]. The numerous sample extraction algorithms can be broadly divided into two kinds: classification competence enhancement approaches and classification competence preservation approaches [12]. At present, many scholars have carried out extensive research related to sample extraction.

In [9], condition attributes are clustered first, the clusters are then randomly sampled, and finally attribute reduction is conducted on the sample data set. The literature [13] proposes an attribute reduction algorithm based on sample extraction, consisting of three steps: first, important instances are selected from the set of objects; second, a discernibility matrix is constructed from the selected instances; finally, all reductions of the decision table are computed. In [14], an attribute reduction algorithm based on granular computing and discernibility ability is presented. This algorithm uses stratified sampling from statistics to divide the original large data set into several sample subsets and conducts attribute reduction on each subset; in the end, weighted fusion of the subset results yields the final reduction of the original data set. In [15], useless samples are filtered out to reduce memory consumption and running time during reduction. The condensation rule based on fuzzy rough sets (FRSC) is introduced in [16]. In rough set theory, the samples near the decision boundary are often those with a smaller lower approximation; based on this characteristic, FRSC uses the lower approximation as the criterion for sample extraction.

Most sample extraction algorithms use random sampling or sampling based on a certain criterion, but why those criteria work and how to achieve the optimum sampling effect have not been fully explained in theory or practice. Building on the above research, an attribute reduction method based on sample extraction and priority is presented in this paper. Because the samples near the classification boundary carry great uncertainty, they provide more information about classification and play an important role in it. Therefore, in this paper, the samples near the classification boundary are extracted and used for attribute reduction. Experiments illustrate the validity and efficiency of this method.

Proposed methodology

This paper improves on previous attribute reduction algorithms based on sample extraction by compressing the decision table, improving the sampling method and enhancing reduction efficiency. These improvements are primarily reflected in the following aspects.

Data preprocessing

Some references assume that sample extraction alone can remove redundant objects from the decision table. In fact, the selected samples are quite likely to remain redundant if only sample extraction is used. Therefore, data preprocessing with other methods must be performed before sample extraction to remove redundant objects completely. This paper proposes classifying U according to C, namely computing U/C, to solve the data redundancy problem. That is, U is sorted by each condition attribute in turn; the sorted U is then analysed and equivalence classes are divided, after which the non-redundant objects can be selected. Quick sort is used in the sorting stage.

Let S = (U, C ∪ D, V, f) be the decision table to be ordered, with U = {u1, u2, . . . , un}. Let {u′1, u′2, . . . , u′n} be the set of objects ordered by C. Hi denotes the i-th equivalence class and H is the set of all Hi. The basic design is given in Algorithm 1.

Algorithm 1. Computing U/C using quick sort.

Input: S, U = {u1, u2, . . . , un}, C = {a1, a2, . . . , ak}.

Output: U/C.

for (i = 1; i <= k; i++)
    sort U with quick sort by ai;
t = 1; H1 = {u′1};
for (j = 2; j <= n; j++)
    if ( f(u′j, ai) == f(u′j−1, ai) for ∀ai ∈ C )
        Ht = Ht ∪ {u′j};
    else { t++; Ht = {u′j}; }
return H.

The running time of Algorithm 1 consists of two parts, quick sort and equivalence class division, of which quick sort is the dominant cost. Because the total cost of |C| quick sorts is O(|C||U|log|U|), the time complexity of Algorithm 1 is O(|C||U|log|U|). An example of equivalence class partition follows. Table 1 is a redundant decision table in which u1 and u3 are duplicates, as are u5 and u6.

A redundant decision table.

U a b c D
u1 1 0 1 0
u2 2 0 1 1
u3 1 0 1 0
u4 0 0 0 0
u5 1 2 0 1
u6 1 2 0 1
u7 2 2 0 1

U/C is computed through Algorithm 1. First, U is sorted on condition attribute a using quick sort, giving {u4, u3, u1, u5, u6, u2, u7}. U is then sorted on the remaining condition attributes in turn. Eventually, U/{a,b,c} is {{u7}, {u6, u5}, {u4}, {u3, u1}, {u2}}.
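For concreteness, the partition above can be reproduced with a short Python sketch of Algorithm 1. The function name and the data layout (each object mapped to a dictionary of its attribute values) are illustrative assumptions, not part of the original algorithm; a single sort on the tuple of condition values plays the role of the |C| successive quick sorts.

def compute_U_div_C(objects, cond_attrs):
    """Divide U into equivalence classes under the condition attributes C
    by sorting (Algorithm 1).  objects maps each object name to a dict of
    attribute values; cond_attrs lists the condition attributes C."""
    key = lambda name: tuple(objects[name][a] for a in cond_attrs)
    # After sorting, objects with identical condition values are adjacent,
    # so one linear scan suffices to divide the equivalence classes.
    ordered = sorted(objects, key=key)
    classes = []
    for name in ordered:
        if classes and key(classes[-1][0]) == key(name):
            classes[-1].append(name)   # same condition values: same class
        else:
            classes.append([name])     # condition values changed: new class
    return classes

# Table 1 (condition attributes a, b, c and decision D):
table1 = {"u1": {"a": 1, "b": 0, "c": 1, "D": 0},
          "u2": {"a": 2, "b": 0, "c": 1, "D": 1},
          "u3": {"a": 1, "b": 0, "c": 1, "D": 0},
          "u4": {"a": 0, "b": 0, "c": 0, "D": 0},
          "u5": {"a": 1, "b": 2, "c": 0, "D": 1},
          "u6": {"a": 1, "b": 2, "c": 0, "D": 1},
          "u7": {"a": 2, "b": 2, "c": 0, "D": 1}}
print(compute_U_div_C(table1, ["a", "b", "c"]))
# [['u4'], ['u1', 'u3'], ['u5', 'u6'], ['u2'], ['u7']] -- the same classes
# as U/{a,b,c} in the text, merely listed in a different order.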

On the basis of sorting, non-redundant data are chosen to form the compressed decision table. The compressed decision table is defined as follows.

Definition 1

In decision table S, let U/C = {[u′1]C, [u′2]C, . . . , [u′d]C}, where u′1, u′2, . . . , u′d are representative elements chosen from the respective equivalence classes, and denote {u′1, u′2, . . . , u′d} by U′. Then S′ = (U′, C ∪ D, V′, f′) is defined as the compressed decision table, where U′ is the universe of S′, V′ is the set of values of all attributes in S′ and f′ is the information function of S′.

According to Definition 1, the generation algorithm of the compressed decision table is given below.

Algorithm 2. Generation of the compressed decision table.

Input: S.

Output: S′ = (U′, C ∪ D, V′, f′).

U′ = ∅;
compute U/C = {[u′1]C, [u′2]C, . . . , [u′d]C} using Algorithm 1;
for (i = 1; i <= d; i++)
    U′ = U′ ∪ {u′i};
return S′.

The time complexity of Algorithm 2 is O(|C||U|log|U|). The compressed decision table of Table 1 obtained through Algorithm 2 is shown in Table 2.

Compressed decision table.

U a b c D
u2 2 0 1 1
u3 1 0 1 0
u4 0 0 0 0
u5 1 2 0 1
u7 2 2 0 1
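Given the partition produced by Algorithm 1, Algorithm 2 amounts to keeping one representative object per equivalence class. A minimal sketch, reusing compute_U_div_C from the previous listing and the same assumed data layout:

def compress_decision_table(objects, cond_attrs):
    """Build the compressed decision table S' (Algorithm 2) by keeping one
    representative object from every equivalence class of U/C."""
    classes = compute_U_div_C(objects, cond_attrs)   # Algorithm 1
    representatives = [cls[0] for cls in classes]    # any member of each class will do
    return {name: objects[name] for name in representatives}

# For Table 1 this keeps five of the seven objects (one of u1/u3 and one of
# u5/u6 is dropped), matching the compressed decision table in Table 2.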
Sample extraction

After removing redundant objects in the previous step, we try to further reduce the number of objects by sampling in order to improve the efficiency of attribute reduction. In some papers, instances are selected by K-means, where the value of K must be fixed manually beforehand and cannot change during the computation; moreover, K often cannot be precisely estimated for large, high-dimensional data sets. The iterative self-organising data analysis technique algorithm (ISODATA) addresses exactly this problem, so ISODATA is chosen to extract samples in this paper. The sample extraction algorithm based on ISODATA is given below. Let λ be the number of clusters, Ni the number of objects in the i-th cluster, Dij the distance between object uj of the i-th cluster and the cluster centre, and Di the average distance from all objects in the i-th cluster to the cluster centre. The sampling distance is mDi.

Algorithm 3. Sample extraction based on ISODATA.

Input: S′ = (U′, C ∪ D, V′, f′).

Output: S″ = (U″, C ∪ D, V″, f″).

U″ = ∅;
cluster the objects in U′ using ISODATA;
for (i = 1; i <= λ; i++)
    for (j = 1; j <= Ni; j++)
        if (Dij > mDi)
            U″ = U″ ∪ {uij};
return S″.

In Algorithm 3, m is a constant whose reasonable value is determined by the experiments below. Since the time complexity of ISODATA is O(|U|²) [17], and the subsequent double loop visits every object of U′ exactly once at a cost of O(|U′|), the time complexity of the sample extraction algorithm is O(|U|²).
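The selection rule Dij > mDi can be sketched independently of the clustering step. The snippet below is an illustrative sketch, not the authors' implementation: it assumes the clustering (ISODATA in the paper) has already produced cluster labels and centres for the objects of U′, and only performs the distance-based sampling.

import numpy as np

def extract_boundary_samples(X, labels, centres, m=1.0):
    """Select the objects lying near the classification boundary: those whose
    distance to their cluster centre exceeds m times the average distance of
    that cluster (the rule Dij > m * Di in Algorithm 3).

    X       : (n, d) array of condition-attribute values of the objects in U'
    labels  : (n,) array of cluster indices from the clustering step
    centres : (lambda, d) array of cluster centres
    Returns the indices of the selected objects, i.e. the universe U''."""
    selected = []
    for i in range(centres.shape[0]):
        members = np.where(labels == i)[0]
        if members.size == 0:
            continue
        # Dij: distance of each member of cluster i to the cluster centre.
        dists = np.linalg.norm(X[members] - centres[i], axis=1)
        # Di: average distance of the cluster; keep objects farther than m * Di.
        selected.extend(members[dists > m * dists.mean()].tolist())
    return sorted(selected)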

Through sample extraction, objects that have an important influence on the reduction are selected, which greatly reduces the number of objects and improves the efficiency of reduction.

Attribute reduction

After sample extraction, attribute reduction can be conducted. Some works reduce attributes using the discernibility matrix; however, this method has disadvantages that lower its efficiency and largely limit its wider application: it occupies a large amount of memory and has high time complexity. Attribute reduction algorithms based on the positive region have better time and space complexity because they do not need to build the discernibility matrix. Based on the concept of the positive region, an attribute reduction algorithm based on priority is put forward in this paper, and its effect is illustrated by time complexity analysis and experiments.

This paper introduces the concept of priority to measure the discernibility ability. The definition of priority is presented here.

Definition 2

For decision table S, attribute set P ⊆ C divides U into k equivalence classes whose sizes are |E1|, |E2|, . . . , |Ek|, respectively. The priority of P is defined as

pri(P) = Σ_{i=1}^{k} |Ei| · (|Ei| / |U|).

In particular, pri(∅) = |U|.

It can be seen from this definition that the more equivalence classes an attribute set induces and the more balanced their sizes, the smaller its priority and the stronger its ability to distinguish the objects.
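As an illustration, the priority of an attribute subset can be computed directly from the sizes of the equivalence classes it induces. A minimal Python sketch, using the same assumed data layout as the earlier listings:

from collections import Counter

def priority(objects, attrs):
    """pri(P) = sum over the classes E_i of U/P of |E_i| * |E_i| / |U|
    (Definition 2), with pri of the empty set equal to |U|."""
    if not attrs:
        return len(objects)
    # Size of every equivalence class induced by the attributes in P.
    sizes = Counter(tuple(vals[a] for a in sorted(attrs)) for vals in objects.values())
    n = len(objects)
    return sum(s * s / n for s in sizes.values())

# For the compressed Table 2 and P = {'a'}, the class sizes are 2, 2 and 1,
# so pri(P) = (4 + 4 + 1) / 5 = 1.8, whereas pri({'a', 'b', 'c'}) = 1.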

Definition 3

For decision table S, the priority of attribute a with regard to an attribute set P ⊆ C is defined as pri(P ∪ {a}).

Clearly, the smaller the priority, the stronger the discernibility ability of the attribute set. Theorem 1 is given on this basis.

Theorem 1

In S, for R ⊆ C and ∀a ∈ R, R is an attribute reduction of C if and only if pri(R) = pri(C) and pri(R − {a}) > pri(C).

Proof

Since U/C refines U/R for every R ⊆ C, and splitting an equivalence class can only decrease the priority, pri(R) ≥ pri(C), with equality exactly when R induces the same partition as C. Thus pri(R) = pri(C) means that R discerns the objects exactly as C does, while pri(R − {a}) > pri(C) for every a ∈ R means that no attribute of R can be removed without losing this discernibility. These are precisely the conditions for R to be a reduction of C.

According to the above theoretical foundation, the attribute reduction algorithm which is based on priority is given below.

Algorithm 4. Attribute reduction based on priority.

Input: S″ = (U″, C ∪ D, V″, f″).

Output: an attribute reduction R.

R = ∅;
calculate pri(C);
while (pri(R) > pri(C))
    { calculate the priority of each attribute in C with regard to R;
      choose the attribute a with the smallest priority;
      R = R ∪ {a}; C = C − {a}; }
return R.

The time complexity of Algorithm 4 is O(|C|²|U″/C|) = O(|C||U|). When Algorithms 1–4 are all taken into account, the total time complexity of the reduction is O(|U|²), because normally |C| ≪ |U| in the big data environment. That is to say, most of the time is spent on sample extraction using ISODATA, and the larger the data volume, the faster this cost grows.
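The greedy loop of Algorithm 4 can be sketched on top of the priority function above. This is a minimal illustration under the same assumed data layout, not the authors' implementation; like the pseudocode, it stops as soon as pri(R) reaches pri(C).

def reduce_attributes(objects, cond_attrs):
    """Greedy attribute reduction based on priority (Algorithm 4): repeatedly
    add the attribute whose addition yields the smallest priority until the
    reduct discerns the objects as well as the full condition attribute set."""
    target = priority(objects, set(cond_attrs))     # pri(C)
    reduct, remaining = set(), set(cond_attrs)
    while priority(objects, reduct) > target and remaining:
        # The smallest pri(R ∪ {a}) marks the attribute with the strongest
        # discernibility relative to the current reduct R.
        best = min(remaining, key=lambda a: priority(objects, reduct | {a}))
        reduct.add(best)
        remaining.discard(best)
    return reduct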

Results and discussions

To determine a reasonable value of m and to compare the time complexity of the algorithms, we conduct experiments in MATLAB on three data sets from the UCI machine learning repository. The experimental environment is Windows 10, an Intel Core i7 CPU and 8 GB of RAM. The basic information about the data sets is shown in Table 3. To ensure the accuracy of reduction, the attribute values are pre-processed. The reported results are averages over five runs.

UCI data sets.

Data sets |U| |C|
Statlog 6435 36
Winequality 4898 11
Contraceptive 1473 9

First, an experiment on the Statlog data set explores the relationship between different sampling distances and reduction effectiveness, and a reasonable value of m is determined. The traditional attribute reduction algorithm based on the positive region [18] produces a reduction containing 10 attributes with a reduction time of 849.51 s. The results obtained by Algorithm 4 of this paper are shown in Table 4, where the computing time is the sum of sampling time and reduction time, and Di is the average distance from all objects in the i-th cluster to the cluster centre.

The relationship between different sampling distances and reduction effectiveness for Statlog.

Sampling distance Sample number Attribute number of minimum reduction Computing time (s)
0.6Di 5473 9 329.61
0.8Di 3965 8 195.13
Di 2648 8 111.42
1.05Di 2344 7 79.82
1.1Di 2077 7 65.15
1.2Di 1628 6 54.59

From Table 4 it can be seen that the larger the sampling distance, the fewer samples are selected and the greater the difference between the size of the resulting reduction and that of the traditional algorithm. To keep the reduction effectiveness as close as possible to the traditional algorithm, a proper sampling distance must be chosen. Clearly, with a sampling distance of Di, namely m = 1, Algorithm 4 achieves a good balance between the number of samples and reduction effectiveness. In this case, the number of samples is relatively small and the running time of Algorithm 4 is correspondingly low; the computing time is just 13% of that of the traditional algorithm, while the reduction result remains close to the traditional one. The experimental results on the other data sets are similar to those for Statlog. Hence m = 1, that is, a sampling distance of Di, is used in all later experiments.

Next, the traditional attribute reduction algorithm based on the positive region and Algorithm 4 of this paper are used to measure the time required to obtain a reduction and to compare the efficiency of the algorithms. The experimental results on the Winequality data set are listed below. The data set has 11 condition attributes, and the computing time of this paper's method is the sum of redundancy-removal time, sampling time and reduction time.

As seen from Table 5, the original Winequality data set contains nearly 20% redundant data, a very common phenomenon in real data, and the sampling rate is 45%. The number of objects is therefore greatly decreased after redundancy removal and sampling. Moreover, the reduction computed from the samples is close to that of the original data set, while the computing time is only 29% of that on the original data set. Overall, reduction efficiency is greatly improved.

Comparison of the effect of reduction for the Winequality data set.

|U|: original data set 4898, after redundancy removal 3961, after sampling 1788
Traditional algorithm: reduction result 5, reduction time 113.62 s
Algorithm 4: reduction result 4, computing time 33.25 s

The experimental results on the Contraceptive data set are listed below. The data set has nine condition attributes, and the computing time of this paper's method is again the sum of redundancy-removal time, sampling time and reduction time.

The result shown in Table 6 is similar to that in Table 5. Redundant data make up about 3% of the original Contraceptive data set and the sampling rate reaches 47%, so the number of objects drops significantly. Consequently, the computing time of Algorithm 4 is only 33% of that on the original data set, which again illustrates the efficiency of the proposed approach. Comparing Table 5 with Table 6, it is also found that the larger the data volume, the more obvious the advantage of Algorithm 4.

Comparison of the effect of reduction for the Contraceptive data set.

|U|: original data set 1473, after redundancy removal 1425, after sampling 665
Traditional algorithm: reduction result 9, reduction time 12.62 s
Algorithm 4: reduction result 8, computing time 4.13 s
Conclusions

For large data sets, redundant data consume a great deal of running time and memory during attribute reduction. To overcome this issue, this paper puts forward an attribute reduction method based on sample extraction and priority. The method first removes redundant data using quick sort; it then extracts the samples near the classification boundary and uses them for the reduction calculation. Based on the concept of the positive region, an attribute reduction algorithm based on priority is proposed and used for the reduction. Experiments show that this method is well suited to attribute reduction on large data sets. With the rise of smart medical care, a large number of medical data resources have been integrated, and knowledge mining from medical big data has become a research focus. The method proposed in this paper can be used to reduce the dimensionality of massive medical data effectively and improve the efficiency of knowledge mining; this is the next direction of application and research.


References

[1] Pawlak Z. Rough sets. International Journal of Computer and Information Science, 1982, 11(5), pp. 341–356.
[2] Bera S., Roy S.K. Fuzzy Rough Soft Set and Its Application to Lattice. Granular Computing, 2020, 5, pp. 217–223.
[3] Sun B., Ma W., Chen X., et al. Multigranulation Vague Rough Set over Two Universes and Its Application to Group Decision Making. Soft Computing, 2019, 23, pp. 8927–8956.
[4] Chelly Dagdia Z., Zarges C., Beck G., et al. A Scalable and Effective Rough Set Theory-based Approach for Big Data Pre-processing. Knowledge and Information Systems, 2020, 62, pp. 3321–3386.
[5] Wafo Soh C., Njilla L.L., Kwiat K.K., et al. Learning Quasi-identifiers for Privacy-preserving Exchanges: a Rough Set Theory Approach. Granular Computing, 2020, 5, pp. 71–84.
[6] Wong S.K.M., Ziarko W. Optimal decision rules in decision table. Bulletin of Polish Academy of Sciences, 1985, 33(11–12), pp. 693–696.
[7] Liu W.J., Gu Y.D., Feng Y.B., et al. An improved attribute reduction algorithm of decision table. Pattern Recognition and Artificial Intelligence, 2004, 17(1), pp. 119–123.
[8] Du J.L., Chi Z.X., Zhai W. An improved algorithm for reduction of knowledge based on significance of attribution. Mini-Micro System, 2003, 24(6), pp. 976–978.
[9] Li Hua. Research on the Model and Algorithms of Rough Set for Multi-label Data. Ph.D. thesis, Shanxi University, Taiyuan, China, 2017.
[10] Hart P. The condensed nearest neighbor rule. IEEE Transactions on Information Theory, 1968, 14(5), pp. 515–516.
[11] Gates G.W. The reduced nearest neighbor rule. IEEE Transactions on Information Theory, 1972, 18(3), pp. 431–433.
[12] Brighton H., Mellish C. Advances in instance selection for instance-based learning algorithms. Data Mining and Knowledge Discovery, 2002, 6(2), pp. 153–172.
[13] Wang Xizhao, Wang Tingting, Zhai Junhai. An Attribute Reduction Algorithm Based on Instance Selection. Journal of Computer Research and Development, 2012, 49(11), pp. 2305–2310.
[14] Ji Su-Qin, Shi Hong-Bo, Lv Ya-Li. An Attribute Reduction Algorithm Based on Granular Computing and Discernibility. Pattern Recognition and Artificial Intelligence, 2015, 28(4), pp. 327–334.
[15] Yang Yanyan. Rough Set Based Mechanisms and Algorithms for Incremental Attribute Reduction. Ph.D. thesis, North China Electric Power University, Beijing, China, 2017.
[16] Pan Wei. Rough Sets Model with Entropy for Multi-criteria Ordinal Decision System. Ph.D. thesis, University of Electronic Science and Technology of China, Chengdu, China, 2017.
[17] Jin Ping, Zong Yu, Jiang He, et al. Multi-space FCM Algorithm. Computer Engineering and Applications, 2007, 43(28), pp. 162–165.
[18] Zhang Wenxiu, Wu Weizhi, Liang Jiye, et al. Rough Set Theory and Method. Beijing: Science Press, 2001.
