
Deep Integration of Health Information Service System and Data Mining Analysis Technology


Introduction

Innovating medical and health services through advances in computer software and information technology is one of the important topics for future research. Developed countries such as the United States have been studying this for decades. On April 27, 2004, President Bush of the United States issued Executive Order 13335, which explicitly stated that all US health records should be digitised within 10 years, that is, converted into electronic health records (EHRs). In 2005, the United States Government established an advisory committee, AHIC, to provide recommendations and solutions for the problem of medical informatisation in the United States. In 2009, President Barack Obama issued Executive Order 13507, and the HITECH Act was promulgated as part of the health care reform of the United States; in 2010, the United States signed the Patient Protection and Affordable Care Act, followed by the creation of the federal Medicare and Medicaid EHR incentive programs. Thus, the United States pays close attention to the application of electronic health records and electronic medical records when shaping health information policy. Japan is the leading nation in Asia in terms of health care policy and informatisation: its National Health Insurance of 1961 and Elderly Health Act of 1983 made public social insurance available to all citizens. The European Commission's 'Action Plan for European Informatisation 2002 and 2005' called for a focus on medical informatisation, which was strongly supported by the European Council. In May 2003, the European Commission and Member States put forward the concept of a 'High Level Conference on Medical and Health Informatisation'; the purpose of these meetings is to provide expertise and knowledge to leaders and to showcase the latest technological achievements in the construction of medical informatisation in the European Union. In the same year, the Ministers of Health at the Brussels Conference announced their firm commitment to building information-based health care. In 2004, Ireland introduced the European Action Plan on Health, setting EU targets for 2008 and beyond. In 2017, the World Health Organization revised the Strategy for the Prevention and Control of New Diseases in the Asia-Pacific Region, proposing to improve the efficiency of cross-sectoral emergency response through health integration, systematic coordination and information-sharing mechanisms.

Currently, the most common tool for pushing health information is mobile terminal software. With the rapid development of smart terminal devices, health apps have gradually become one of the most important applications on mobile phones. Users can easily check health data, consult health problems online and search health knowledge bases. Some personal health management software also provides 'mass messaging' and 'intelligent reply' functions to ensure the authenticity and uniqueness of users. The technical roadmap of the health information precision service model is shown in Figure 1.

Fig. 1

Technical Roadmap of Health Information Precision Service Model

On the other hand, with the progress of society and the rapid advancement of science and technology, computer, communication and Internet technologies have penetrated all sectors and are changing the everyday life of mankind [1,2]. Health information service systems generate, collect and store large amounts of data using new computing technologies. The ever-increasing volume of data has become a perennial problem for many industries: abundant information benefits all sectors of society, but it also brings a number of problems:

There is too much information to digest.

It is difficult to distinguish true information from false.

Information security is difficult to guarantee.

Information formats are inconsistent and difficult to process uniformly.

Current database systems can efficiently perform data entry, query, statistics and other functions, but they cannot find the causal relationships and rules hidden in the data, nor can they predict future trends from the existing data. People therefore began to put forward a new slogan, 'learn to discard information', and at the same time began to ask: 'How can we avoid being inundated with information, find useful knowledge in time and improve the utilisation rate of information?' Faced with the challenge of 'abundant data but scarce knowledge', data mining technology emerged at this historic moment, accompanied by new computer technologies and theories, and is thriving in the information era. Its application in telecommunications, banking, biology, fraud detection, supermarkets and other fields demonstrates its strong vitality [3,4,5].

Relevant Research Based on Computer Software Technology and Data Mining Technology
Characteristics of Computer Software Technology

Computer software technology refers to the programs and related data set up to ensure the normal operation of a computer. Software is the interface between users and hardware: it is the core component that keeps the computer running normally and the channel of communication between users and computers, and it can improve the overall structure, precision and reliability of the computer [6,7,8]. Computer software technology is a branch of computer technology that covers data processing, artificial intelligence, process control and scientific calculation. It has the following characteristics:

The cost of development continues to increase.

The difficulty of development is increasing.

The internal structure is becoming more and more complex.

Technical Analysis of Data Mining

Data mining is the process of extracting the required data from large amounts of data stored in databases, data warehouses or other information repositories. It is generally accepted that data mining is the process of extracting hidden, potential, effective, novel, useful and ultimately understandable knowledge from large amounts of incomplete, noisy, fuzzy and random data [9, 10]. The general flow of data mining is shown in Figure 2.

Fig. 2

Data Mining Process

Data preprocessing

This stage includes raw data collection, data cleaning, data extraction and data transformation. The purpose of raw data collection is to determine the object on which the discovery task operates, namely the target data; data are collected according to specific needs and requirements.
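A minimal sketch of this preprocessing stage is given below; the file name, column names and cleaning rules are illustrative assumptions for this sketch and are not taken from the actual system.

import pandas as pd

# Raw data collection: load records exported from the health information
# service system (file name and columns are hypothetical).
raw = pd.read_csv("health_records.csv")

# Data cleaning: drop exact duplicates and rows missing key fields, then
# fill remaining gaps in a numeric column with the column median.
clean = raw.drop_duplicates().dropna(subset=["user_id", "blood_pressure"])
clean["heart_rate"] = clean["heart_rate"].fillna(clean["heart_rate"].median())

# Data extraction: keep only the attributes needed by the mining task.
target = clean[["user_id", "age", "blood_pressure", "heart_rate", "risk_level"]].copy()

# Data transformation: map the categorical risk level to a numeric code.
target["risk_level"] = target["risk_level"].map({"low": 0, "medium": 1, "high": 2})

target.to_csv("target_data.csv", index=False)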

Data Mining

Data mining first determines the mining task, such as data summarisation, classification, clustering, association rule discovery or sequential pattern discovery. After the task is determined, the mining algorithm to be used must be chosen; the same mining task can be implemented with different mining algorithms.

Model Evaluation and Knowledge Representation

Patterns are the results of data mining. The interesting patterns that represent knowledge are identified according to some measure of interestingness; this process is called pattern evaluation, and the specific values of the measures are given by domain experts, users or domain standards. The result is 'knowledge', which can be delivered to users through visualisation technology or by converting the results into user-friendly representations.

Construction of Health Information Resource Module

Currently, every country has been strengthening the construction of information resource services and continuously standardising the content of information services. Based on the research of other scholars, this paper builds an accurate health knowledge base and ensures its accuracy from four main aspects. As shown in Figure 3, health information is drawn from multiple authoritative sources; we develop a web crawler to update the knowledge base in real time, select content-scoring tools to measure the quality of information, and integrate heterogeneous digital health resources.

Fig. 3

Process of Health Information Resource Module

The health knowledge base draws on a wide range of information sources so that more accurate and valuable health information can be extracted from more comprehensive information. Faced with numerous health information sources, this study designed a web crawler program to obtain the latest health information released by the various sources in a timely manner. Current web crawler technologies include the General Purpose Web Crawler, Focused Web Crawler, Incremental Web Crawler and Deep Web Crawler. A General Purpose Web Crawler crawls a wide range of pages in large numbers and is typically used by search engines, but for commercial reasons such crawlers are not completely open to the public. A Focused Web Crawler operates on the premise of a predefined crawling topic, so it is also known as a topic (theme) crawler; because the purpose of crawling is very clear, it only crawls web resources related to the topic, which gives it fast crawling speed and low hardware consumption. An Incremental Web Crawler only crawls pages that are new or have changed since the previous crawl, which is more real-time than periodic full crawling. A Deep Web Crawler is aimed at deep web pages, such as those accessible only with a user account, and therefore requires appropriate access rights. Most of the information sources of the health knowledge base constructed in this study are websites of government agencies and universities, so the amount of information released on a single day is small, the crawling workload is light and the process control of crawling is not cumbersome. Given these characteristics, the strategy of 'Focused Web Crawler + Incremental Web Crawler' is selected to crawl the health information sources.
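A minimal sketch of this 'Focused + Incremental' crawling strategy, using only the Python standard library, is given below; the seed URL, topic keywords and the file used to remember previously crawled pages are illustrative assumptions rather than details of the actual system.

import json
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

TOPIC_KEYWORDS = ("health", "disease", "nutrition")   # focused-crawler topic filter (assumed)
SEEN_FILE = "seen_urls.json"                          # incremental-crawler memory (assumed)

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def load_seen():
    try:
        with open(SEEN_FILE) as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()

def crawl(seed_urls):
    seen = load_seen()                                # incremental: remember what was already fetched
    for seed in seed_urls:
        html = urllib.request.urlopen(seed, timeout=10).read().decode("utf-8", "ignore")
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            url = urljoin(seed, link)
            if url in seen:
                continue                              # incremental: crawl only new pages
            if not any(k in url.lower() for k in TOPIC_KEYWORDS):
                continue                              # focused: keep only topic-related pages
            page = urllib.request.urlopen(url, timeout=10).read()
            # ... hand the page content to the knowledge-base update pipeline ...
            seen.add(url)
    with open(SEEN_FILE, "w") as f:
        json.dump(sorted(seen), f)

if __name__ == "__main__":
    crawl(["https://www.who.int/news"])               # example seed; replace with the real information sources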

For the data integration part, this research combines three technologies, XML, Web Service and message middleware, to integrate heterogeneous data. First, the heterogeneous data sources are shielded by the message middleware; then the client software generates data in a standard XML format; next, a Web Service is used to integrate the data; finally, the processed data are output by the middleware for storage and invocation.
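As an illustration of the XML normalisation step only (the record fields and element names below are assumptions made for this sketch, not the schema of the actual system), heterogeneous source records can be mapped to a single standard XML document before being passed to the Web Service layer.

import xml.etree.ElementTree as ET

def to_standard_xml(records):
    """Convert heterogeneous source records (dicts with differing keys)
    into one standard XML document for the Web Service layer."""
    root = ET.Element("HealthRecords")
    for rec in records:
        item = ET.SubElement(root, "Record")
        # Normalise differing field names coming from heterogeneous sources.
        ET.SubElement(item, "PatientId").text = str(rec.get("patient_id") or rec.get("pid", ""))
        ET.SubElement(item, "Measure").text = str(rec.get("measure") or rec.get("item_name", ""))
        ET.SubElement(item, "Value").text = str(rec.get("value", ""))
    return ET.tostring(root, encoding="unicode")

# Example: two records coming from systems that use different field names.
print(to_standard_xml([
    {"patient_id": "P001", "measure": "blood_pressure", "value": 120},
    {"pid": "P002", "item_name": "heart_rate", "value": 72},
]))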

Establishment of Data Mining Algorithmic Model
Overview of Decision Tree Algorithmic Model

Decision tree learning is one of the most widely used classification algorithms in the field of data mining. It is a method of approximating discrete-valued functions; it is robust to noisy data and its rules are expressed as disjunctive expressions that are easy to understand. Since the 1960s, decision tree methods have been widely used in classification, prediction, rule extraction and other fields, and especially after Quinlan proposed the ID3 algorithm they have been further applied and developed in machine learning and knowledge discovery. Other representative decision tree methods include CART, SLIQ and SPRINT. In this paper, we adopt the simple ID3 algorithm. A significant improvement of ID3 over the CLS algorithm is that the 'certain rules for selecting test attributes' left open in CLS are made concrete: test attributes are selected according to information gain. ID3 also introduces a windowing method for incremental learning, which addresses the problem that the whole training set cannot be held in memory when it is too large.

Algorithmic Model Computation

The flow chart of the algorithm is shown in Figure 4. Let T be a set of t data samples whose class attribute takes m different values, corresponding to m different classes C_i, i ∈ {1, 2, ..., m}, and let t_i be the number of samples in class C_i. The amount of information needed to classify a given sample is
I(t_1, t_2, \ldots, t_m) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the probability that an arbitrary sample belongs to class C_i, estimated by t_i / t.

Let attribute A have k different values (a_1, a_2, ..., a_k) and let A divide T into k subsets (T_1, T_2, ..., T_k), where the samples in T_j take the value a_j (j = 1, 2, ..., k) on A. If attribute A is selected as the test attribute (i.e. the best partitioning attribute), these subsets correspond to the branches grown from the node containing T. Let t_ij denote the number of samples of class C_i in subset T_j. It is known that the smaller the entropy, the higher the purity of the subset partition. The entropy, or expected information, of the partition induced by A is
E(A) = \sum_{j=1}^{k} \frac{t_{1j} + \cdots + t_{mj}}{t} \, I(t_{1j}, \ldots, t_{mj})
where
I(t_{1j}, \ldots, t_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2(p_{ij})

The information gain obtained by branching on A is
Gain(A) = I(t_1, t_2, \ldots, t_m) - E(A)

That is, Gain(A) is the reduction in information entropy obtained by partitioning the sample set according to the values of attribute A. The ID3 algorithm computes the information gain of every attribute, selects the attribute with the largest information gain as the test attribute of the given set T, creates the corresponding node marked with that attribute and generates one branch for each of its values, each branch representing a subset of the partitioned samples. The information gain of an attribute therefore describes how much the amount of information needed to identify the class of a sample from T is reduced once T has been partitioned by that attribute, and through it the classification ability of each attribute over the training sample set can be measured. The algorithm flow is shown in Figure 4, and a small computational sketch of the formulas above is given after the figure.

Fig. 4

Algorithm flow
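For concreteness, the following self-contained Python sketch implements the three formulas above; the toy sample set and attribute values are invented for illustration and are not data from this study.

from collections import Counter
from math import log2

def info(labels):
    """I(t1, ..., tm): entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def expected_info(samples, labels, attr):
    """E(A): weighted entropy after partitioning the samples by attribute attr."""
    total = len(samples)
    e = 0.0
    for value in set(s[attr] for s in samples):
        subset = [lab for s, lab in zip(samples, labels) if s[attr] == value]
        e += len(subset) / total * info(subset)
    return e

def gain(samples, labels, attr):
    """Gain(A) = I(t1, ..., tm) - E(A)."""
    return info(labels) - expected_info(samples, labels, attr)

# Toy example: ID3 picks the test attribute with the largest information gain.
samples = [{"A1": 0, "A2": 1}, {"A1": 0, "A2": 0}, {"A1": 1, "A2": 1}, {"A1": 1, "A2": 0}]
labels = ["C1", "C1", "C2", "C2"]
best = max(["A1", "A2"], key=lambda a: gain(samples, labels, a))
print(best, gain(samples, labels, best))   # A1 separates the classes perfectly, gain = 1.0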

Integration of Computer Software Technology and Data Mining Analysis Technology in the Internet Age
Selection of Characteristic Attributes of Computer Software

In a large software engineering project, the defect reports in the software system should be described in detail and in full. A Bug in software may have the following attributes: report source, author and responsible person, submission time and solution time, report content and description information, notes, priority, severity and the Bug's current status. Using Bugzilla's export function, we obtain XML files of all Bug reports; the attributes selected from them and their interpretation are shown in Table 1, and the correspondence between the functions involved in code modifications and their identifiers is shown in Table 2.

Five Attributes of the Bug Report Used to Construct the Classifier and Their Interpretation

Attribute type | Number | Attribute | Description
Personnel attributes | A1 | Solution personnel | Person who repaired and registered the Bug
Bug report and solution attributes | A2 | Bug priority | Priority level
Bug report and solution attributes | A3 | Bug state | Current status of the Bug report
Bug report and solution attributes | A4 | Modification instructions | Problem description
Code modification call attributes | A5 | Notes | Function call sequences, e.g. F1F3, F1F3F4, F1F2F4, ...

Source Code Entities – Numbers corresponding to sequence of function operations

Function | Login() | Show() | Modify() | Statistic() | Add()
Number | F1 | F2 | F3 | F4 | F5

Decision tree learning methods are generally easy to transform into if-then rules, and the resulting classification rules are easy to understand. Based on the above analysis, we use the ID3 decision tree learning method to construct the classifier, which in theory can achieve a good classification effect. Although decision tree learning algorithms can handle continuous-valued attributes, discretising their values improves learning efficiency and classification accuracy. The discretisation intervals are shown in Table 3; the values of each attribute are mapped to 0, 1 and 2, and a sketch of this mapping is given after the table.

Discrete Processing of Characteristic Attributes

Feature attribute | Quantised value 0 | Quantised value 1 | Quantised value 2
A1 | Same person | Changed to another person | More than two persons
A2 | Common | Major | Serious
A3 | Revised only once | Two revisions | More than two revisions
A4 | Contains general errors or miswritings | Contains variable or function errors | Contains class or function errors
A5 | F1F3 | F1F3, F1F2 | F1F2F4
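A minimal sketch of the discretisation mapping is shown below; the raw field names and exact thresholds are assumptions made for this sketch, following the intent of Table 3.

def discretise(bug):
    """Map the raw attributes of a Bug report to the quantised values 0/1/2 of Table 3.
    The raw field names (n_fixers, priority, ...) are hypothetical."""
    a1 = 0 if bug["n_fixers"] == 1 else (1 if bug["n_fixers"] == 2 else 2)
    a2 = {"common": 0, "major": 1, "serious": 2}[bug["priority"]]
    a3 = 0 if bug["n_revisions"] == 1 else (1 if bug["n_revisions"] == 2 else 2)
    a4 = {"general": 0, "variable_or_function": 1, "class_or_function": 2}[bug["error_kind"]]
    a5 = {"F1F3": 0, "F1F2": 1, "F1F2F4": 2}.get(bug["call_sequence"], 1)
    return [a1, a2, a3, a4, a5]

print(discretise({"n_fixers": 2, "priority": "major", "n_revisions": 1,
                  "error_kind": "general", "call_sequence": "F1F2F4"}))   # -> [1, 1, 0, 0, 2]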
Classification and Analysis of Computer Software Defects

To help software maintainers quickly understand the running status of the current system from the software version system and the defect tracking system, we classify the Bug repair reports of the current software system stored in Bugzilla into two categories: fixed Bugs and potential fixed Bugs.

A fixed Bug is one that has been repaired and has never been reopened while the software system is running. Such Bugs are minor defects in the software maintenance process, but maintenance personnel still need to understand them so that, when similar problems occur during the operation of the system, they can repair them promptly and locate the source code responsible for the Bug.

A potential fixed Bug is one that has been reopened, has a long modification cycle and whose repairer has changed several times. Such Bugs are potential defects that may cause problems at any time during the operation of the software system; moreover, fixing them may introduce new defects, requiring the Bug report to be opened again. Software maintainers should therefore pay more attention to potential fixed Bugs, locate each repair version through SVN and the Bug number, and analyse whether new defects may still be introduced.

Experiments and analysis
Experimental data

Using a SAX parser based on the event model to parse the generated XML files, we obtain 600 change history records. From these 600 records we select two kinds of small training sample sets to train the classifier and form a pattern library, which allows future defects and potential defects to be predicted, as shown in Table 4. The decision tree generated from the training sample set is shown in Figure 5, where C1 and C2 in the leaf nodes represent defects and potential defects, respectively; a small training sketch is given after the figure.

Training Set for Discrete Processing of Characteristic Attributes

Category | Attribute values of A1–A5
C1 (0) | (0,0,0,0,0); (0,1,1,0,1); (1,0,0,1,1); (0,1,2,1,0); (0,2,1,0,1); (1,1,2,2,0)
C2 (1) | (1,2,0,1,2); (1,1,1,2,2); (1,1,2,2,1); (0,2,1,2,2); (1,0,1,2,1); (1,1,2,1,2)

Fig. 5

Decision Tree Generated by Defect Report
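A minimal sketch of training such a classifier on the discretised vectors is shown below. scikit-learn's DecisionTreeClassifier with the entropy criterion is used here as a stand-in for ID3 (scikit-learn actually builds CART-style trees, so this only approximates the algorithm described above), and the assignment of the twelve Table 4 vectors to C1 and C2 follows the reconstruction above, which is itself an assumption.

from sklearn.tree import DecisionTreeClassifier, export_text

# Discretised attribute vectors (A1..A5) taken from Table 4.
X = [
    [0, 0, 0, 0, 0], [0, 1, 1, 0, 1], [1, 0, 0, 1, 1],      # assumed class C1 (defect)
    [0, 1, 2, 1, 0], [0, 2, 1, 0, 1], [1, 1, 2, 2, 0],
    [1, 2, 0, 1, 2], [1, 1, 1, 2, 2], [1, 1, 2, 2, 1],      # assumed class C2 (potential defect)
    [0, 2, 1, 2, 2], [1, 0, 1, 2, 1], [1, 1, 2, 1, 2],
]
y = ["C1"] * 6 + ["C2"] * 6

# The entropy criterion mirrors the information-gain idea of ID3.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print(export_text(clf, feature_names=["A1", "A2", "A3", "A4", "A5"]))
print(clf.predict([[1, 1, 2, 2, 2]]))   # classify a new, unseen defect report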

Experimental result

In this paper, we filter out 60 records from the 600 Bug reports, label their categories and use the generated decision tree to classify them. The test results are shown in Table 5: the classification accuracy for the two kinds of defects is 85.7% and 63.6%, respectively, and the overall classification accuracy is 81.7%.

Classification accuracy of decision tree

Category | Sample size | Correctly classified | Accuracy
Defect | 49 | 42 | 85.7%
Potential defects | 11 | 7 | 63.6%
Total | 60 | 49 | 81.7%

The closed Bug reports are thus categorised into ordinary Bugs and potential Bugs (those that have been modified several times and may be reopened), and software maintainers need to understand both kinds of Bugs in the system. For ordinary Bugs, maintainers only care about the Bug number and the Bug-number annotations of the source code repairs in the SVN library, which allow them to quickly locate the changed source files. For potential Bugs, maintainers need to pay closer attention: once such a Bug is reopened, the locations of the changed source code files can be found in the SVN library in time.

Conclusion

With the progress of data mining technology and the expansion of data in software engineering knowledge bases, researchers of health information service systems have paid more and more attention to algorithms suitable for mining software knowledge bases. This paper first gives a brief overview of data mining technology and computer software technology, then uses a decision tree mining algorithm to mine the function call graphs of the classes defined in the software system and adds source code annotations to the relevant call relationships. In this way, software developers can quickly understand the system architecture and the associated modifications of system source code files from the annotated function call dependency graph, while software maintainers can understand the current defect status of the system, discover in time the changes that may introduce defects, and detect potential bugs in the code through these dependency graphs. This paper lays a foundation for the deep integration of computer software technology and data mining technology in the Internet era.
