Otwarty dostęp

Application of Incremental Updating Association Mining Algorithm in Geological Disasters System


Zacytuj

INTRODUCTION

Geological disasters are geological phenomena that cause serious harm or potential threat to human lives and property. Human activities can affect the occurrence of geological disasters and the extent of their damage, but they can not be completely eliminated or prevented. In addition to earthquakes, volcanoes, tsunamis and other sudden geological disasters can cause unmanageable destructive disaster to humans. Some slowly changing geological disasters such as landslides, land subsidence and ground fissures will also bring huge losses to the lives and property of the people, the economic development in cities and areas. Geological disasters have seriously affected the sustainable development of society and affected social stability. Therefore, geological disaster data is an important basic resource related to national economy and the people’s livelihood, because the data contains a lot of useful information. But understanding and relying on these data to make scientific decisions is beyond human capacity. How to make full use of geological exploration data to make relevant prediction and scientific decision-making has become one of the concerns of production decision makers.

Association Rule Mining[1] is one of the most explored, most commonly used data mining technology. This method is one of the most active research areas in the field of data mining. It mainly helps to discover the implicit and valuable relationships between data items in massive databases to guide decision makers in various fields such as commerce Strategic analysis. Researching association rule algorithms in big data technology environment is a very important and challenging research topic. Considering that the data in the database of geological disasters always change constantly, the incremental association rule updating techniques have been proposed to effectively maintain the association rules of updated data. The technology should have some characteristics:

The association rules should vary with the data;

The rule updating should avoid dealing with the old data again, as much as possible to use the previous processing results;

Updating maintenance methods should be applied to a variety of occasions as much as possible. FUP algorithm is iterative update algorithm based on Apriori algorithm.

In order to solve the problem of incremental updating mining of association rules, Cheung and other scholars proposed a fast update algorithm (FUP)[2]. Later, domestic scholars such as Zhu Yuquan proposed a method FUKFIA[3] for rapidly updating frequent item sets based on the FUP algorithm. They define a new set of frequent items that reduce the number of database scans to a certain extent.

Inevitably, the above incremental updating algorithm of association rules, like the Apriori algorithm, requires layer-by-layer traversal to generate frequent item sets. Therefore, there is also the problem of overhead in processing databases and generating huge candidate item sets. FUP2 algorithm[4] is based on the FUP algorithm to improve, put forward an improved algorithm for transaction records in the constantly updated, correct and delete operations. Feng Yucai and other scholars put forward incremental update association mining algorithm IUA and PIUA[5]. In the algorithm, splicing and pruning techniques are used to solve the generation of candidate item sets. When the transaction database is unchanged and the minimum support threshold is changed. CATS-tree algorithm[6], IUAR algorithm[7] are through a variety of methods to reduce the number of scanning the original database to achieve incremental update association rules maintenance issues.

To sum up, the FUP algorithm and the FUP2 algorithm are used to maintain the incremental updating association mining when the minimum support threshold and the minimum confidence threshold do not change, and the transaction records are continuously updated. IUA algorithm, PIUA algorithm and CATS-tree algorithm handle the maintenance of incremental updating association mining when the transaction record data does not change, the minimum support threshold and the minimum confidence threshold change. The IUAR algorithm is used to solve the problem of maintaining and updating the association mining when the transaction database is continuously updated and the minimum support threshold is changed. The basic idea of the algorithm is to obtain the extended set of candidate frequent items by reducing the support degree. When accessing the updated database, the association rules are incrementally updated by constantly updating the candidate frequent item sets. Although this algorithm has been greatly optimized in terms of the large number of candidate item sets, there are still some shortcomings that it is necessary to retrieve the original transaction database multiple times and produce a large number of candidate item sets. So far, there has been relatively little research on incremental updates in the context of big data environments when both the transaction database and the minimum support threshold are changed. Therefore, it is very necessary to find an incremental update mining method that can effectively solve such problems.

In this paper, we propose an incremental update mining algorithm for inverted index tree, which is applied to geological disaster monitoring system. Firstly, the frequent item sets are excavated according to the database, and then the association rules are obtained and analyzed. Finally, the association rules are applied to the early warning work.

INVERTED INDEX AND INVERTED INDEX TREE
Inverted index

The inverted index is the most commonly used data structure in the information retrieval system. In the index, each index item consists of the attribute value and the location information that appears, <Key, storage address>. At the time of querying, you can get all the documents that contain the keyword at once, so the retrieval efficiency is higher than the forward index. For example, Table 1 is a partial transaction log of the groundwater table (derived from the original transaction database of the Geological Disaster Monitoring System), and Figure 1 is the inverted index map (IIP)[8] of TABLE I.

TABLE I.

GROUNDWATER TABLE IN GEOLOGICAL DISASTERS DATABASE (MM)

Figure 1.

TABLE. I corresponds to the inverted index map(IIP)

Inverted index tree based on B+ tree implementation

B+ tree is the deformation of the B-tree. A m-order B+ tree[9] should meet the following characteristics:

The number of keywords per node is equal to the number of children. The keywords of all non-lowest inner nodes are the largest keywords on the corresponding sub-tree, and the bottom node contains all the keywords;

Branch nodes can be placed (m-1) keyword, leaf nodes can put m keywords;

All leaf nodes are in the same layer of the number structure and do not contain any information. Thus, the tree height of the tree is balanced.

B+ tree is a commonly used index mechanism in the database and a one-dimensional data index structure design[10].

As shown in Figure 2, a B+ tree consisting of the divided character string is m=3.

Figure 2.

Order three of the B+ tree

ALGORITHM OF INCREMENTAL UPDATING RULE MINING FOR INVERTED INDEX TREE

The above traditional frequent item sets mining algorithm has disadvantages when generating association rules mining will generate a large number of candidate item sets and repeat retrieval processing of the original database. In this paper, we implement the inverted index tree based on the characteristics of B+ tree and put forward the incremental update association mining algorithm of inverted index tree to deal with the association rules efficiently.

Basic ideas

Statistics in the database transaction items, get transaction items set.

constructs the inverted index tree (II Tree) based on the B+ tree creation method.

The bottom-most leaf node of II Tree contains all the item sets. The frequent item sets are obtained by comparing the number of items with the minimum support, and other infrequent item sets remain in their leaf nodes to ensure that future data updating become frequent nodes. The different item sets in the frequent item sets are combined and their confidences are calculated. When the confidence is greater than the minimum confidence, the item set combination is the association rule.

Compared with the previous incremental updating correlation mining algorithm, Inverted Index Tree Incremental Updating Association mining algorithm introduces B+ tree structure, as well as the database settings. When adding new data, we only need to retrieve the frequent item sets that deal with the new part of the data without the need to re-retrieve the entire database.

The algorithm takes advantage of the B+ tree’s balanced tree properties. The leaf nodes at the bottom of the tree contain all the keywords of the whole tree, and they are linked together like a doubly linked list. The inverted index[9] is realized based on the B+ tree.

Set the threshold

The minimum support threshold and the minimum confidence threshold will change with different user needs and database updates. When the threshold is set too low, the more rules that are excavated, the lower the usefulness of the rules. Conversely, when the threshold is set too high, there are few rules for mining, so some useful rules will be lost. Therefore, setting the appropriate threshold is very important when dealing with incremental databases.

When mining association rules for the first time, setting the minimum support threshold is a trial and error. Select a small part of data sets randomly from the entire database to be excavated, set initial support thresholds and confidence thresholds according to the user’s requirements or experience, and obtain n number of association rules. Compare n with the number of association rules the user expects m. If n/n′ ≥ d, it is considered that the threshold set by the user is smaller, so that the excess number of rules dug up is expected. We should increase the support threshold by a certain value and rerun it. If b < n/n′ < d, it is considered that the user is substantially satisfied with the result of the association rule mined at this support threshold. If n/n′ < b, it is considered that the set threshold is too large, some important rules may be lost, and then a slightly smaller support threshold is selected to re-excavate the algorithm.

When the transaction database is updated, it is possible that the previously set threshold is no longer applicable, so the threshold needs to be reset. Based on the support threshold, confidence threshold, the number of association rules output by the algorithm at the last time and the current mining targets, the Newton interpolation formula is used to predict the support threshold that should be adopted currently, which makes the mining association rules more effective.

Algorithm implementation

In order to achieve the inverted index[11] with the B+ tree, Inverted index tree incremental update association mining algorithm is proposed.

The algorithm steps are briefly described as follows:

Traverse the inverted index map, get the item set. Based on the data of groundwater level, soil moisture content, rainfall and topography in the database of geological disaster monitoring system, inverted index maps were established. Traversing the index graph to get the item set, and then by the B+ tree to build inverted index tree.

Get all frequent item sets based on the generated II Tree. In the B+ tree, the bottom leaf node contains all the keywords. Then, the confidence level is calculated for the combinations of different item sets in frequent item sets, and the correlation rules are obtained, and the correlation analysis of the rules is carried out to obtain a more realistic association rule.

When the transaction database update records, in accordance with the above steps to retrieve some of the new data processing, the item set inserted II Tree. Add a keyword to the leaf node at the bottom of the II Tree. If the number of keywords contained in the node does not exceed M, the insertion is successful. Otherwise, the node needs to be split. The number of keywords included in the two new nodes after splitting should not be less than (M/2 + 1). If the parent node of the node has enough space (M/2 ~ M-1), the parent node should contain the smallest keyword in the right node. Otherwise, the parent node is split and a new keyword is added to the parent of the parent node, followed by a higher order until the new root node is generated (1 ~ M-1).

After the database is updated, repeat the Step2 operation and then generate the association rules.

Incremental update is illustrated by groundwater level data. TABLE II is the database updating data, Figure 3 is the entire database corresponding to the II Tree.

TABLE II.

UPDATE THE RECORD DATA (MM)

Figure 3.

Inverted index tree of updated data

Algorithm flow chart shown in Figure 4:

Figure 4.

Flow chart of incremental index updating algorithm of inverted index tree mining

Correlation Analysis of Association Rules

Although, association rules have metrics of interest such as support and confidence. However, association rules may include causal association as well as random association or even negative correlation. Here’s an example to illustrate:

In a supermarket database system, the customer’s product purchase information is recorded. Of the 10000 purchases, 8000 of them included bread, 6000 had cookies, 4800 had both breads and biscuits. If you set a minimum support of 30% and a minimum confidence of 40%, then can get the following association rules:

Rule 1: Buy bread ⇒ Buy biscuits {support = 48%, confidence = 60%}

In reality buying bread and buying biscuits may be negative because buying bread will reduce the number of people buying biscuits. At the same time, consider the following negative correlation rules:

Rule 2: Buy bread ⇒ Do not buy cookies {support = 32%, confidence = 40%}

In a sense, the second rule is more realistic. Thus, under given threshold conditions, two contradictory rules are obtained.

It can be seen from the above examples that judging the true meaning of association rules can not be based solely on the measure of support and confidence, but rather on a comprehensive examination of the data set. To do this, put forward some other methods such as chi-square statistics or correlation analysis[12]. The core idea of these methods is to measure the correlation between data items. Chi-square statistics calculation formula is as follows(χ2):χ2(A,B)=[P(A)*P(B)P(AB)]2P(A)*P(B)

If chi-square statistics is zero, then there is no dependency between the data item A and the data item B, and they are independent of each other. Otherwise, the data items are interdependent. Relevance calculations more clearly show that this dependence is mutual promotion or mutual restraint. Correlation is calculated as follows:corr(A,B)=p(AB)P(A)*P(B)

If corr(A, B) is equal to 1, then data item A and data item B are independent; if corr(A, B) is greater than 1, data item A and data item B are positively correlated; if corr(A, B) is less than 1, then data item A and data item B are negative Related.

Support-Confidence Frame Theory is not perfect: Some rules are of no practical value, even if both support and confidence are high. The association rule AB does not give the user information whether A and B are constructive or counterproductive. Relevance analysis of association rules is to overcome this deficiency, allowing users to rationally view the association rules. Therefore, after the association rules of the geological disaster monitoring system database are excavated, the correlation analysis should be carried out to ensure the practicability of this rule. If it is determined that the rule is positively correlated, the value of the rule is used as the alarm threshold of the geological hazard monitoring system. In the future of new data acquisition, if the conditions of this rule, then the alarm. People can prevent it in advance.

EXPERIMENT RESULTS AND ANALYSIS

The algorithm uses the record of rainfall, groundwater level, soil water content and topography data of Shang Nan County in Shang Luo City in the geological disaster monitoring system of last year as the experimental data set. Use C language in Win7, dual-core 2.3GHZ CPU, 4GB memory on the PC for simulation.

Comparison of II Tree algorithm and IUAR algorithm

The analysis of time complexity of the algorithm:

When the data is updated, only in the database to scan the updated data;

When building an inverted index tree, scan the tree structure once and insert the new item set. Analysis shows that when the minimum support is constant, the execution time of the algorithm is related to the amount of data updated each time. To extract a small number of experimental samples from the data set, the minimum support for controlling IUAR algorithm is unchanged (0.1), increase the amount of updated data in turn. Record the experiment time of IUAR algorithm and II Tree algorithm respectively, time comparison shown in Figure 5:

TABLE III.

IUAR, IITREE ALGORITHM TO RUN THE EXPERIMENTAL TIME(S)

Figure 5.

Algorithms time comparison for IUAR and II Tree

As shown in Figure 5, when the minimum support is constant, the execution time of the IUAR algorithm increases rapidly but the ones of the II Tree algorithm grows more slowly when the amount of updated data increases. In the same amount of data, IUAR algorithm takes more time than II Tree.

The analysis of spatial complexity of thealgorithm:

In the inverted index map, only the updated data is stored, so the size of the memory space is related to the amount of updated data;

In the II Tree algorithm, the frequent item sets determined by the minimum support are stored, so the memory space is associated with the minimum support. This experiment mainly studies the effect of minimum support on memory usage. When the minimum support of IUAR algorithm is changed from 0.2 to 0.6, the data samples and increments remain unchanged. 800 new records have been added to the raw groundwater level data set as test samples. The memory usage of IUAR algorithm and II Tree algorithm is compared according to the experimental results.

Figure 6 shows that the smaller the minimum support, the more the number of item sets produced, the greater the memory footprint. In the case of support change, the IUAR algorithm updates the candidate set with the change of the support degree, saves the candidate item set and the frequent item set in the memory space. The II Tree algorithm does not generate the candidate item set in the change of the support degree, so the occupied memory space by IUAR algorithm is greater than the ones by II Tree algorithm.

Figure 6.

Memory usage comparison

Application of II Tree Algorithm in Geological Disaster Monitoring System

If the relationship between the table properties are Boolean attributes, then mining rules from this relational table are Boolean rules. The problem now is that geological disaster monitoring system databases are numerical data. The quantitative attributes must be dealt with in a necessary way so that the mining of quantitative rules can be transformed into the mining of Boolean rules. The main strategy is to divide the range of the number attribute into intervals, and to decompose a quantity attribute into several Boolean attributes. In order to reduce the computational workload, the original data are standardized and divided into different sections. The data are grouped according to the sections and the frequency is recorded in the II Tree. Then the frequent item sets can be excavated. The frequent item sets are divided according to the average value and divided into low value area and high value area respectively. Data of groundwater level, rainfall, soil moisture content and ground deformation in the past year were selected from the database of geological disasters and the association rules were excavated. Select some of the data for analysis, as shown in TABLE IV:

TABLE IV.

DISASTER MONITORING DATA TO BOOLEAN DATA CONVERSION TABLE

According to the data of geological hazard monitoring system, the association rule mining is carried out and a series of rules are obtained. Some rules are as follows:

Rule 1: If G2=1 and R1=1 and S2=1 then D2=1

Rule 2: If G1=1 and R2=1 and S1=1 then D2=1

Rule 3: If G2=1 and R2=1 and S2=1 then D2=1

Rule 4: If G2=1 and R1=1 and S2=1 then D2=1

Rule 5: If G1=1 and R2=1 and S1=1 then D1=1

Rule 6: If G1=1 and R1=1 and S1=1 then D2=1

After analyzing the above rules, we can get:

In the case of high groundwater tables, heavy rainfall, and high soil moisture levels, large-scale deformation of the ground is promoted (according to rules 1 and 3);

Under conditions of heavy rainfall, the deformation of the ground may also be induced, even if the groundwater level and soil moisture are not high (according to Rule 2);

When the groundwater level and soil moisture content is high, it is possible to promote the occurrence of ground deformation (according to Rule 4);

When the groundwater level is low, the soil moisture content is low and the rainfall is very little, the ground will be deformed due to dryness (according to Rule 6).

In summary, when the data of the local water table, rainfall, soil moisture content and ground deformation reach the value of the association rules, further analysis can be used as the warning threshold of landslide or ground fissure.

CONCLUSION

In this paper, An algorithm of Inverted Index Tree Incremental Updating Association Mining (II Tree) is proposed. The algorithm is effectively implemented when the database is updated, without having to scan the original database. The new data will be inserted into the original B+ tree to get frequent item sets. Experiment results show that the II Tree algorithm consumes less than 2/5 of the IUAR algorithm when the number of transactions and the amount of new data are the same, which improves the efficiency of data processing. When the minimum support of IUAR algorithm changes, II Tree algorithm takes up less memory than IUAR algorithm. The application of II Tree algorithm to the data analysis of geological disaster monitoring system has some improvements in efficiency and memory usage and can be better applied to the early warning of the system.

eISSN:
2470-8038
Język:
Angielski
Częstotliwość wydawania:
4 razy w roku
Dziedziny czasopisma:
Computer Sciences, other