Applied Mathematics and Nonlinear Sciences

In the past two years, with the development of big data technology and the revolutionary innovations it has brought to various industries, government statistical work faces both new opportunities and new challenges. This paper studies the problems of government statistics and corresponding reform strategies, using a data mining decision tree classification algorithm to analyze government statistical indicators.


Introduction
The year 2013 is often called the first year of "big data": the information storm brought by big data is changing how we live, work, and think. Big data has opened a major transformation of the times; its core is prediction, which will be the source of further innovation and development, and more changes in social life are waiting to happen [1][2]. Big data is not only a revolution in the field of information technology but also a powerful tool for building transparent government, accelerating enterprise innovation, and leading social change worldwide. The research and application of big data are gradually penetrating all areas of society [3][4].
Deepening the reform of the administrative approval system, continuing to streamline administration and delegate power, and promoting the transformation of government functions are needed to create a good development environment, provide quality public services, and maintain social justice [5][6]. Reform of large departmental systems should be steadily promoted and the system of departmental responsibilities improved. However, current government statistical institutions still carry a strong imprint of the planned-economy system; they are increasingly ill-adapted to economic and social development and, to a certain extent, even constrain it [7][8]. Research on the construction of government statistical institutions is an objective requirement for building a service-oriented government [9][10]. Big data technology can greatly improve the quality of statistical services, enabling statistical institutions at all levels to concentrate professional and technical forces on developing and applying statistical information and to provide more high-quality statistical products to the general public and to other users of data and information [11][12].
The literature [13] uses the "TOE" framework to derive suggestions for improving the quality of statistical data: reforming traditional statistical methods and systems, computerizing statistical data, establishing a scientific data-quality monitoring system, thoroughly reviewing data coverage, strengthening statistical legal publicity and law-enforcement inspections, and improving the professional competence and ethics of statisticians. The literature [14] examines the current quality of Chinese local government statistics, identifies six problems affecting it from the perspective of government management, analyzes their causes in depth, and proposes four major countermeasures to strengthen quality management. The literature [15] proposes innovative ideas for government statistical departments by analyzing the internal structure of the big data supply chain: a preliminary model of the supply chain is introduced, covering the main processes of forming data products, the main functions of data-bearing subjects at each stage, and the main relationships between government statistical departments and other subjects in the chain; finally, innovative ideas and suggestions for statistical departments are given.
The literature [16] builds a multidimensional data model that facilitates various analyses around a statistics-oriented OLAM implementation framework. A general aggregation algorithm based on the OLAP principle completes the aggregation of grassroots statistics and yields multidimensional data sets containing statistical groupings and measures; the algorithm relies on no database-specific API and is fully cross-platform, helping statisticians discover more potentially valuable information in statistical data. The literature [17], responding to the growing impact of big data on government statistics, proposes a framework for big data statistical algorithms based on sparse representation: Bootstrap sampling over samples and features forms different data subsets, heterogeneous data within each subset are fused with a polymorphism-preserving similarity method, the fused data are transformed to make them tractable and informative, and the transformed data finally form a primitive dictionary that is weighted into a sparse-representation dictionary matrix. The literature [18] argues that blockchain technology improves the efficiency of government statistics and fits well with data security and information sharing; the integration of "blockchain + government statistics" should be accelerated, and China should strive to be at the theoretical forefront of blockchain innovation. This paper uses a data mining decision tree classification algorithm to conduct a detailed, in-depth study of government statistics. A theoretical framework for government statistics is constructed with four elements: statistical subjects, statistical objects, statistical processes, and statistical results. The decision tree algorithm analyzes the indicators of government statistical work from three aspects: grassroots statistical institutions, government revenue, and government forestry construction statistics. The study verifies that the decision tree algorithm improves the efficiency of government statistics, thus contributing to its reform and construction.

Big data technology
The concept of "big data" can be traced back to the early 1980s, when the American futurist Alvin Toffler mentioned it in "The Third Wave". Not until the mid-1990s, after sustained development of information technology, did it attract wide attention from the information industry and academia. After 2000, the rapid development of electronic information technology triggered explosive growth of data, which gradually penetrated various fields of people's lives and drew broader concern; governments, world-renowned institutions, and experts focused on big data, producing rich research results across many fields. Owing to differences in government statistical systems, big data research and development in China started somewhat later than abroad, but the Chinese government attaches great importance to applying big data to government statistics and is making fruitful efforts in the statistical system, the informatization of statistical work, and research on big data applications. This paper carries out a detailed and in-depth study of government statistics using a data mining decision tree classification algorithm.

Overview of decision tree classification algorithm
First, the definition of a decision tree: its structure closely resembles a flow chart. Each non-leaf node tests an attribute, each branch of the tree is one outcome of that test, the root node is the initial node, and each leaf node holds a class or class distribution. A decision tree classification algorithm generally consists of two major steps: tree construction and pruning. In the whole data mining process it is crucial to classify the data reasonably, because the quality of the classification directly affects the accuracy of the mining results. Classification maps data items of unknown data sets to categories. Both classification and regression techniques can be used for prediction; the difference is that classification handles discrete data while regression handles continuous data.

Decision tree construction algorithm
The decision tree construction algorithm takes as input a sample dataset with class labels; the resulting tree is binary or multiway. The internal nodes are usually logical tests of the form (A = a), where A is an attribute and a is one of its values, and the outcomes of the test are the edges of the tree. The tree is built by recursive partitioning, and the recursion terminates when one of the following conditions occurs: 1) All data at a node belong to the same class.
2) No attributes remain for splitting the samples further; the node is converted to a leaf and labeled, by majority vote, with the class to which the largest number of samples at the node belongs.
3) A branch test attribute has no sample data; a leaf node is created directly, again by majority vote, with the class label held by the majority of the sample data set. Choosing a good test attribute or judgment logic is the difficult part of constructing a good decision tree, and the same data can give rise to different trees. Usually, the smaller the tree structure, the better its predictive power.
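The recursive construction with the three stopping conditions above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the row format (dicts with a "label" key) and the pluggable attribute-selection function are assumptions for the sketch.

```python
from collections import Counter

def majority_class(rows):
    """Return the most common class label among the rows (majority vote)."""
    return Counter(r["label"] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, choose_attribute):
    """Recursively partition labeled rows into a decision tree.

    rows: list of dicts mapping attribute names to values, plus a "label" key.
    attributes: attribute names still available for splitting.
    choose_attribute: function picking the split attribute (e.g. by
        information gain); any selection rule works for this sketch.
    """
    labels = {r["label"] for r in rows}
    if len(labels) == 1:          # stopping condition 1: one class left
        return labels.pop()
    if not attributes:            # stopping condition 2: no attributes left
        return majority_class(rows)
    best = choose_attribute(rows, attributes)
    node = {"attribute": best, "branches": {},
            "default": majority_class(rows)}  # condition 3: empty branches
    remaining = [a for a in attributes if a != best]
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        node["branches"][value] = build_tree(subset, remaining, choose_attribute)
    return node
```

Unseen attribute values at classification time fall back to the node's "default" majority class, which realizes stopping condition 3 without storing empty branches.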

Decision tree pruning algorithm
In practice, the sample data obtained are generally imperfect: some records have missing values in attribute fields, necessary data may be absent, making the data incomplete, and some data may be inaccurate, i.e., contain noise.
The basic decision tree generation algorithm does not account for noise, so the generated tree fits the sample data exactly. Because of the interference of anomalous data, pruning, a basic technique for overcoming noise, is necessary.
Currently, there are two main pruning methods: 1) Pre-pruning strategy: while generating the tree, decide whether to split a noisy training subset further or stop.
2) Post-pruning strategy: a fit-then-simplify approach. A fully fitted decision tree is first generated from the data, after which branches are pruned from the bottom up. During pruning, the tree is evaluated on a test set; if the accuracy (or another measure) does not decrease when a node is removed, the node is removed, otherwise pruning stops.
Theoretically, the pre-pruning strategy is less effective than the post-pruning strategy, but post-pruning has higher time complexity, and the pruning process usually requires thresholds or statistical parameters. Note that pruning does not necessarily bring benefits on every data set, just as the smallest decision tree is not necessarily the best one. When the data are relatively sparse, pruning can have side effects and should be avoided.

Introduction to information theory of decision tree algorithm
Shannon proposed and developed information theory in 1948, a mathematical approach that uses probability theory and mathematical statistics to measure and study information and information entropy. The amount of information is measured by the uncertainty about the source symbols that is eliminated by communication, and the concepts of self-information, conditional entropy, information entropy, and average mutual information were introduced.
Self-information: before a symbol $x_i$ is received, $I(x_i)$ measures the recipient's uncertainty about the source emitting $x_i$, i.e., the self-information of the message symbol $x_i$:

$$I(x_i) = -\log p(x_i) \tag{1}$$

where $p(x_i)$ is the probability that the source emits $x_i$.

1) Information entropy. By the definition above, self-information measures the uncertainty of a single symbol, while information entropy measures the uncertainty of all symbols of the source $X$:

$$H(X) = -\sum_{i} p(x_i) \log p(x_i)$$

where $x_i$ ranges over all symbols of the source $X$; information entropy is thus the average self-information of the symbols emitted by the source.

2) Conditional entropy. If a variable $Y$ is associated with the source $X$, the remaining uncertainty about $Y$ after the recipient receives $X$ can be expressed as the conditional entropy $H(Y|X)$. Let $x_i$ be a symbol of $X$, $y_j$ a symbol of $Y$, and $p(y_j|x_i)$ the probability of $y_j$ given $x_i$; then:

$$H(Y|X) = -\sum_{i}\sum_{j} p(x_i, y_j) \log p(y_j|x_i) \tag{2}$$

3) Average mutual information, which indicates how much information $X$ provides about $Y$:

$$I(X;Y) = H(Y) - H(Y|X) \tag{3}$$
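The three quantities above can be estimated from observed symbol sequences with a few lines of code. This is a minimal sketch using empirical frequencies (base-2 logarithms, so entropy is in bits); the function names are our own.

```python
import math
from collections import Counter

def entropy(symbols):
    """H(X) = -sum_i p(x_i) log2 p(x_i), estimated from observed symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())

def conditional_entropy(xs, ys):
    """H(Y|X): uncertainty about Y remaining after X is known, i.e. the
    frequency-weighted entropy of Y within each group of equal X."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(x, []).append(y)
    n = len(xs)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def mutual_information(xs, ys):
    """I(X;Y) = H(Y) - H(Y|X): information X provides about Y."""
    return entropy(ys) - conditional_entropy(xs, ys)
```

A fair binary source gives one bit of entropy; when Y is fully determined by X, the mutual information equals H(Y), and when they are independent it is zero.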

Information gain calculation
At the initial stage of government statistics there is only an empty decision tree, and we do not yet know how to classify instances by attributes, so the decision tree model obtained from previous statistics is used to partition the whole attribute space. Let the training set $C$ be divided into classes $C_1, C_2, \ldots, C_m$, let $c_i$ denote an instance of class $C_i$, and let $|C|$ denote the total number of instances in $C$. The probability that an unknown instance belongs to class $C_i$ is then

$$p_i = \frac{|C_i|}{|C|}$$

and the uncertainty of the partition of $C$ is measured by the entropy

$$H(C) = -\sum_{i=1}^{m} p_i \log_2 p_i$$

From the whole learning process of the decision tree it can be seen that the uncertainty of the tree in classifying the data set becomes smaller and smaller. If a test is performed on attribute $A$ with values $a_1, \ldots, a_k$, the number of samples with $A = a_j$ that belong to class $C_i$ can be counted, giving the probability

$$p_{ij} = p(C_i \mid A = a_j) \tag{9}$$

that is, $p_{ij}$ represents the probability of belonging to class $C_i$ when $A = a_j$. The conditional entropy of each branch is the remaining uncertainty of the partition after the test:

$$H(C \mid A = a_j) = -\sum_{i=1}^{m} p_{ij} \log_2 p_{ij} \tag{10}$$

The information entropy of all branches extended after selecting test attribute $A$ is the weighted sum

$$E(A) = \sum_{j=1}^{k} p(A = a_j)\, H(C \mid A = a_j) \tag{11}$$

and the information gain provided by attribute $A$ for classification is

$$\mathrm{gain}(A) = H(C) - E(A) \tag{12}$$

The smaller the value of equation (11), the larger the value of equation (12), which indicates that test attribute $A$ provides good classification information and that the uncertainty remaining after the test is small. The algorithm chooses as the test attribute the attribute with the largest information gain, i.e., the one with the smallest value of equation (11).
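The information-gain calculation above translates directly into code. This is a minimal sketch under the assumption that training rows are dicts with a "label" key; the function names are our own.

```python
import math
from collections import Counter

def entropy(labels):
    """H(C) = -sum_i p_i log2 p_i over the class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute):
    """gain(A) = H(C) - E(A), where E(A) is the weighted entropy of the
    branches produced by splitting on attribute A."""
    base = entropy([r["label"] for r in rows])       # H(C)
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r["label"])
    n = len(rows)
    expected = sum(len(g) / n * entropy(g)           # E(A)
                   for g in groups.values())
    return base - expected
```

An attribute that separates the classes perfectly achieves the maximum gain H(C), while an attribute whose branches preserve the class mixture yields a gain of zero; the algorithm splits on the attribute with the largest gain.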

Government statistical work combined with "big data" research process
To study the problems of government statistics and their countermeasures in the context of big data technology, we should first grasp the two keywords of the subject, big data and government statistics, and fully understand the relevant concepts and meanings before further research; second, on that basis, we should find the combination point of the two so that the problems, and the countermeasures to solve them, can be derived. Figure 1 shows the flow chart of government statistical work combined with "big data" research. First, the concept of "big data" is analyzed in depth: its meaning, its characteristics, and the fields in which it can be widely applied; then the current situation of government statistics is carefully sorted out and analyzed through relevant management theories, combined with the statistical work practice of a city.
Second, through the research above, we identify the impact of "big data" on statistical concepts, the legal system, institutions, mechanisms, security, and data application, summarize the opportunities and challenges that statistics faces in the big data context, and find the combination point of "big data" and government statistics. Third, the current situation of the city's statistical work is analyzed in depth from five aspects: the composition of the elements of statistical work, the institutional situation, the operating mechanism, construction, and data release, analysis, and application; through this analysis, the city's problems in the big data context are identified, and their causes are sought through exploration and reflection on statistical concepts, the legal system, institutions, mechanisms, construction, and data use.
Fourth, taking "big data" thinking as the problem-solving perspective and starting from the city's government statistical work practice, with solving actual grassroots statistical work as the main purpose and the previously identified problems as the focus, solution measures of practical significance for grassroots statistical work are put forward.

Composition of the elements of municipal government statistics
Figure 2 shows the operation flow chart of government statistical work. From the concept of government statistics, it is clear that government statistics is a process in which a statistical subject performs statistical work on a statistical object and achieves the desired statistical function. This process includes four elements: statistical subject, statistical object, statistical process, and statistical result. The statistical subject refers to the statistical agency itself; taking a city as an example, it is the municipal Bureau of Statistics together with, along the main line of statistical business, the district (county) statistics bureaus (sections), street statistical departments, and community statisticians, forming a multi-level statistical subject. The statistical process generally includes four steps: statistical design, statistical survey, statistical collation, and statistical analysis. First, statistical design at the municipal level mainly implements the statistical regulations; the development and interpretation of the statistical system are also mainly based on the national and provincial statistical institutions, but since municipal statistical functions differ slightly from provincial ones, the municipal level can, when necessary, also develop and use a statistical system of its own; statistical law enforcement, an important part of ensuring the quality of statistical data, is applied in municipal statistical work in accordance with the law.
Second, the types of statistical surveys are mainly periodic surveys, special surveys, spot checks, and censuses. Specifically, periodic surveys require various types of statistical objects to report specific data statements annually, quarterly, or monthly at fixed points in time; special surveys collect data on a separate situation or class of points and can be periodic or one-time; spot checks generally refer to survey activities for units that do not reach the statistical reporting standards and can also be special sample surveys; a census is a comprehensive enumeration of certain important indicators, for example, the economic census and the population census.
Third, statistical collation is a data-flow process of reminding, querying, returning for revision, and re-reporting statistical reports; a collation pass is needed at each level of the statistical organization to ensure the quality of statistical data.
Fourth, statistical analysis realizes the statistical functions and is an important link in achieving the value of statistical data. The information function of statistics mainly refers to releasing statistical data to provide society with corresponding data services; the consulting function mainly provides data-analysis suggestions for government decision-making and digs deeper into the value of the data, its core being to see social development trends through data and enhance the foresight of government decisions; the monitoring function objectively reflects whether, and how well, the policies promulgated by the government are effective; the supervision function objectively supervises policy effectiveness so as to prompt the relevant government departments to make corresponding adjustments to their policies.

Construction of theoretical framework for municipal government statistics
Figure 3 shows the research idea diagram for the development of a city government's statistics in the context of big data technology. The two core concepts of this study are "big data" and government statistics; both have relatively broad extensions, which requires a high level of conceptual and theoretical understanding. A deep understanding of the connotation of the core concepts is needed. The first step is to find the characteristics of each concept, its practical meaning, and the scope of its application by understanding its connotations.
The second step is to analyze the extension of the core concept based on its many conditions.
The third step is to find the intersection of the two extensions through comparison and analysis, lock onto the main theoretical area of the research, and use this point as the breakthrough of the study. First comes an in-depth analysis of "big data": its concept and meaning, its characteristics, and the areas in which it can be widely applied.
Second, the concept, characteristics, functions, and basic tasks of government statistics are examined in depth; relevant management knowledge is used to analyze government statistics, and statistical work is divided and studied from the perspective of "big data", yielding the conceptual framework of this study and laying the theoretical foundation for the work that follows.
Third, through the research in the previous section, we will find the impact of "big data" on various elements of statistics, analyze and summarize the opportunities and challenges faced by statistics in the context of "big data", and find the combination of "big data" and government statistics.
Fourth, an in-depth analysis of the current status of the city's statistical work is conducted; through this analysis, the problems of the city's statistical work in the context of "big data" are identified from the main aspects of the composition of statistical work, and the causes of the problems are sought.
Fifth, taking "big data" thinking as the problem-solving perspective and starting from the city's statistical work practice, with solving actual grassroots statistical work as the main purpose and the previously identified problems as the focus, solution measures of practical significance for grassroots statistical work are put forward.

Analysis of grassroots statistical institutions
Among the statistical staff of the district and county statistics bureaus, street offices and township statistical stations, community committees, and village committees, the proportion of staff with formal undergraduate degrees or above decreases at each successive level, and even fewer of them studied economics or statistics. The educational level, working ability, and professional level of grassroots statistical staff are relatively low, which is out of step with the strong professionalism that statistical work demands. Table 1 compares personnel and professions in Guiyang City, Huaxi District, and Nanming District (a). The transparency of the Chinese government's statistical work has been low: the operation and management of statistical work lack public supervision, and government information released to the public rarely provides detailed interpretation of statistical methods and systems, policies and regulations, or data quality; the statistical and accounting methods of some important indicators are not public enough, and reliability analysis of important statistical data is not sound, so the sources of statistical data are unclear to the public and the data indicators are viewed with skepticism. Table 3 compares the GDP data of Guangxi over the past three years: the value added of the secondary industry in 2018 was published only as a growth rate without a specific value, and although the 2019 total was lower than that of 2017, the growth rate trended upward year on year. When the comprehensive accounting staff of the X City Bureau of Statistics were asked in an interview why the 2018 total value added of the secondary industry was not published while the 2019 growth rate still rose, the following answer was given: "2018 falls in the period of the fourth national economic census, so the data may be adjusted and are not published for the time being; moreover, the growth rate mentioned above is not computed by our usual algorithm. Value added is a price-inclusive concept calculated at comparable prices, and the growth rate cannot be obtained directly by division, so the two calculations are independent of each other."
In recent years, statistical data, especially economic indicators such as GDP and fiscal revenue, have increasingly been used to assess the performance of leading cadres. As long as performance assessment relies on such economic indicators, some leading cadres will find ways to modify the data to burnish their political record, and "numbers produce officials, officials produce numbers" becomes a goal that leading cadres compete to chase. To compare the data before and after the inflated figures were squeezed out more intuitively, the changes in the general fiscal budget revenue of three provinces from 2010 to 2019 are selected, as shown in Figure 4. The total general budget revenue of Liaoning, Tianjin, and Inner Mongolia rises, then falls, then rises again, with a trough between 2015 and 2017, which reflects the real data of each province after the inflated figures were squeezed out. Had the 19th National Congress not proposed "high-quality development", squeezing out the water and removing falsehoods would not have been put on the agenda so quickly. This reflects the contradiction between central and local government performance appraisal: because the central government evaluates local performance by the level of GDP, local governments one-sidedly pursue GDP growth, generating, to a certain extent, a GDP index bubble. Falsified economic data distort the central government's judgment of the economic situation and overdraw development potential; data produced in this way can guarantee neither authenticity nor validity, and cannot serve the country and the people.

Performance analysis of decision tree algorithm
The experiments tested the data mining performance of the decision tree algorithm using "parcel" data provided by a municipal statistics bureau, with test data added for subsequent larger-scale runs. The experiment was run 10 times at each scale, and the average of the 10 time overheads was plotted in Figure 5 to show the mining performance of the decision tree approach. Analyzing the time-performance curves for the different dimensions, it can be seen that as the scale grows, the time overhead grows faster, similar to the typical performance curve of decision tree data mining. Comparing the curves for different numbers of dimensions, the performance differences are small at small scales and widen as the scale increases; even in the three-dimensional case, which has the largest overhead among these multi-dimensional analyses, the time overhead is only about 8 seconds for a data volume of 160,000 rows. This is not a short time, but the algorithm is very effective for the statistical analysis of this government statistics project, since the "parcel" data table is not of that scale.
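The measurement methodology above (10 runs per scale, mean wall-clock overhead reported) can be sketched as a small timing harness. The `mine` function here is a hypothetical stand-in for the decision-tree mining step, not the paper's implementation.

```python
import random
import time

def time_mining(mine, dataset, runs=10):
    """Average wall-clock overhead of a mining function over several runs,
    mirroring the experiment: run 10 times per scale, report the mean."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        mine(dataset)
        total += time.perf_counter() - start
    return total / runs

# Hypothetical stand-in for the decision-tree mining step.
def mine(dataset):
    return sorted(dataset)

if __name__ == "__main__":
    for scale in (10_000, 40_000, 160_000):
        data = [random.random() for _ in range(scale)]
        print(scale, round(time_mining(mine, data), 4))
```

`time.perf_counter()` is used rather than `time.time()` because it is a monotonic high-resolution clock suited to measuring short intervals.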
In recent years, urban and road construction has aggravated the destruction of natural resources, seriously affecting people's normal life and work, so the decision tree algorithm is used to evaluate the performance of government forestry construction statistics. When the number of classes in the sample data cannot be judged as a whole but the categories are clear, the decision tree algorithm can identify data outliers; large-class samples are classified as normal samples, and the workload is reduced by checking the accuracy of the small-class samples. We use the decision tree algorithm for statistical performance testing to judge whether the statistical function is good or bad. Figure 6 shows the performance analysis of government forestry construction statistics under different algorithms: the accuracy metrics of the three algorithms change as the number of training sessions increases. For the decision tree algorithm, accuracy stays between 83% and 91%, a range of 8%. For the BP neural network algorithm, accuracy stays between 72% and 81%, a range of 9%. For the genetic algorithm, accuracy stays between 63% and 70%, a range of 7%. Overall, the decision tree algorithm performed best in the performance analysis of government forestry construction statistics; it can improve the efficiency and quality of government statistics and thereby help the government better serve society.

Conclusion
This paper uses a decision tree algorithm to analyze the data indicators of government statistics. The decision tree algorithm is used to identify data outliers, classify large-class samples as normal samples, and reduce the workload by checking the accuracy of small-class samples. The study tests the statistical performance of forestry construction work with the decision tree algorithm to judge the quality of the statistical function. In the performance analysis of government forestry construction statistics under different algorithms, the accuracy metrics of the three algorithms change as the number of training sessions increases; the accuracy of the decision tree algorithm stays between 83% and 91%, a range of 8%, more accurate than the other two algorithms. The results provide a good template for government policy-making and overall planning, reduce the pressure on government staff handling data, and offer a high-accuracy reference, promoting and guiding the reform strategy for problems in government statistical work.

Research the problems and reform strategies of government statistics based on big data technology

Figure 1 .
Figure 1. Flow chart of government statistics combined with "big data" research

Figure 2 .
Figure 2. Operation Flow Chart of Government Statistics

Figure 3 .
Figure 3. Thinking Chart of Shenyang Statistical Work Development Research under the Background of Big Data

Figure 4 .
Figure 4. Comparison of General Public Budget Revenue of Some Provinces and Cities in China in Recent Ten Years

Figure 5 .
Figure 5. Mining performance of the decision tree method

Figure 6 .
Figure 6. Performance analysis of government forestry construction statistics under different algorithms

Table 1 .
Comparison of personnel and specialties between Guiyang, Hua Xi District and Nan Ming District (a). There are fewer statistics staff at the grassroots level, and the lower the level, the fewer full-time staff there are. The Guiyang Bureau of Statistics has more than 100 people, of whom 87 are engaged in specific statistical work. The Nanming District Bureau of Statistics has the smallest staffing in the city, with 27 people in total, of whom only 13 are engaged in specific work, each responsible for one professional statistical area; its 13 subordinate street offices and township statistical stations generally have only one full-time staff member, or one full-time and one part-time, responsible for all professional statistical work; and the subordinate community neighborhood committees and village committees each have only one person doing statistical work part-time, responsible for all professional statistics. Table 2 compares personnel and specialties between Guiyang City, Huaxi District, and Nanming District (b).

Table 2 .
Comparison Table of Personnel and Professions among Guiyang, Hua Xi District and Nan Ming District (b). At the municipal level, only one division analyzes statistical information and coordinates statistical work, and the Regulatory Division handles statistical law enforcement, supervision, and policy research; nearly 89% of the other divisions (excluding offices and other divisions engaged in supporting work) are engaged in professional statistical surveys. In the district and county statistics bureaus, there are generally only 1-2 full-time or part-time personnel engaged in statistical data analysis and at most 1 person engaged part-time in statistical law enforcement, while nearly 94% of the other personnel (excluding administrative leaders) are engaged in statistical surveys. In this situation, these functions, and even the daily construction and management work, cannot operate properly, falling further behind the international level. The fundamental reason is that the functions of statistical bureaus at all levels are not clearly positioned: are they statistical data collection and investigation departments, or statistical work management services? The lack of clear positioning of comprehensive statistical agencies has led to disharmony between cost and benefit, and between statistical supply and demand, in statistical work.

Table 3 .
Comparison of secondary industry data in Guangxi in recent three years