As the volume of big data grows, data containing different types of information require effective filtration and separation, improved data storage and management capabilities, optimized data storage space, and improved identification of large-data information. Research on information classification and filtration methods for big data is therefore of great significance for large database construction and cloud storage model design.
The data filter model is based on mathematical statistical analysis and applies a statistical regression analysis model of data filtration. Using probability distributions and adaptive fuzzy clustering analysis methods, the filtration of data is optimized. Typical large-data filtration models mainly include the quantitative regression analysis model, the autoregressive moving average (ARMA) model, the correlation statistical analysis model, the Backlund model, etc. Based on the design of the probability mathematical model, combined with pattern identification and characteristic clustering methods, data filtering is performed; however, the traditional method suffers from strong interference with pattern identification during data filtration. In response to this problem, this paper proposes a data filtration method based on the probability mathematical model: a statistical characteristic analysis model for large data filtration is constructed, combined with mathematical modeling and statistical analysis methods, and the probability mathematical model of data filtering is optimized. Finally, data analysis is performed to draw conclusions on the effectiveness of the model [1, 2].
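To make the general idea concrete, the following minimal sketch (not the paper's actual model; the function and parameter names are hypothetical) filters a one-dimensional sample by fitting a normal distribution and discarding points whose fitted probability density falls below a floor:

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Probability density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def filter_by_density(samples, density_floor=0.01):
    """Keep samples whose fitted-normal density is at least density_floor."""
    mu = sum(samples) / len(samples)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in samples) / len(samples))
    return [x for x in samples if gaussian_pdf(x, mu, sigma) >= density_floor]

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(500)] + [25.0, -30.0]  # inliers plus two gross outliers
kept = filter_by_density(data)
```

The density floor plays the role of the decision threshold discussed later; a real implementation would choose it from the desired false-rejection probability rather than fix it by hand.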
Financial data are typical time series data, covering stock prices, exchange rates, interest rates, futures prices, etc., in various one-dimensional or multidimensional forms, and are characterized by nonlinear dynamics. The analysis of financial data is not only a social hot topic but also a subject of academic research. For decades, a large number of scholars in economics, statistics, and computer science have made great contributions to its development. There are many kinds of financial data, such as stocks, futures, and exchange rates, characterized by nonlinear dynamic changes driven by social development and the economic policies of various countries; correspondingly, the research methods are equally varied. Data are endless, and statistical and computer science scholars hope to establish relatively reliable models from large amounts of financial data to predict and analyze financial markets, thereby reducing investors' mistakes in transaction decision-making and avoiding operational risks.
At the same time, these models explore the internal laws of the financial market, providing efficient solutions for managers and decision-makers in the financial market to improve the market economy system. Statistical models of real-world data usually involve continuous probability distributions, such as the normal, Laplace, or exponential distribution. Such distributions are supported by many probabilistic modeling formalisms, including probabilistic database systems. However, the traditional theoretical framework of probabilistic databases is concerned entirely with finite probabilistic databases [3]. Yue et al. [4] developed mathematical theories for infinite probabilistic databases. A previous paper (Grohe, Lindner; ICDT 2020) proposed a very general framework for probabilistic databases that may involve continuous probability distributions, in which queries have a clear definition. Probabilistic programming languages that generate data records (PODS 2020) were extended to continuous probability distributions, and such programs were shown to generate a continuous probabilistic database generation model [4]. Zhang [5] developed a new nonprobabilistic convex model whose boundaries accurately enclose all the uncertain parameter data extracted in an actual project. The method is based on conventional statistical methods and correlation analysis techniques. First, the mean value and the correlation coefficient of the uncertain parameters are calculated using all the given data.
Then, a simple but effective optimization procedure is introduced into the mathematical modeling process to induce the uncertain parameters to their exact boundaries. This procedure operates on all the given data by optimizing the convex model, yielding an effective mathematical expression for the final formulation of the convex model. To test the predictive capacity and generalization ability of the proposed convex model, the evaluation criteria used are the volume ratio, the standard volume ratio, and the prediction accuracy. The performance of the proposed method is studied on standard tests and compared with other existing competing methods; the results prove its effectiveness and efficiency [5]. Herrera-Vega et al. [6] propose portfolio selection models based on probabilistic hesitant financial data (PHFD), namely the probabilistic hesitant portfolio selection (PHPS) model and the risk PHPS (RPHPS) model.
In addition, investment decision methods are provided to show their practical applications in the financial market. The PHPS model, built for general investors on the highest-score or minimum-deviation principle, obtains the best investment ratio, and the RPHPS model provides the best investment ratio for the best return or for the risk undertaken. Finally, an empirical study based on actual data from China's stock market is presented in detail; the results verify the effectiveness and practicality of the proposed method [6]. Li et al. [7] used mathematical statistical methods to study the effectiveness of single-stage and multistage hydraulic fracturing technology. These studies are based on the geological and physical properties of the productive formation, which makes it possible to justify the choice of hydraulic fracturing technology.
Based on probability and statistical analysis of field efficiency data, quantitative criteria for selecting the fracture parameters (proppant length, height, and injection volume) have been established. They can be used to predict the applicability of multistage and single-stage hydraulic fracturing methods. The productive horizon AS12-3 has a complex geological and physical structure and low filtration and storage performance, which complicates development and contributes to the active formation of hard-to-recover reserves. The effect of single-stage hydraulic fracturing lasts only 3 to 4 years; multistage hydraulic fracturing technology is more effective than the single-stage technology [7].
The mathematical modeling analysis of data filtration is performed with the probability mathematical model. First, the descriptive statistical analysis method is used to analyze large data samples, the sampled data are screened, and the statistical features of the characteristic attributes that effectively reflect data filtration are determined. Sample regression analysis and adaptive processing are then performed on the sampled data; the combined information fusion method implements pattern identification for data filtration, and regression analysis is conducted according to the probability density of the detection statistic distribution in a massive data environment; the data are then filtered according to the pattern identification results. Based on the above analysis, the overall structure of the probability mathematical model of data filtration is shown in Figure 1.
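The first stage described above, descriptive statistical analysis followed by screening of the sample, can be sketched roughly as follows. This is an illustrative z-score screen under a normality assumption, not the paper's exact procedure, and all names are hypothetical:

```python
import statistics

def descriptive_stats(samples):
    """Descriptive statistical analysis of a data sample."""
    return {
        "mean": statistics.fmean(samples),
        "std": statistics.pstdev(samples),
        "min": min(samples),
        "max": max(samples),
    }

def screen_samples(samples, z_max=2.0):
    """Screen the sample: keep points within z_max standard deviations of the mean."""
    s = descriptive_stats(samples)
    if s["std"] == 0:
        return list(samples)
    return [x for x in samples if abs(x - s["mean"]) / s["std"] <= z_max]

data = [1.0, 1.2, 0.9, 1.1, 1.05, 0.95, 12.0]  # last point is a gross error
clean = screen_samples(data)
```

In the full model, the screened sample would then feed the regression and information-fusion stages rather than being used directly.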
According to the probability mathematical model shown in Figure 1, the mathematical information flow of data filtration is analyzed, and a sample time series is assumed for the sampled data information stream.
Statistical analysis of data filtration is performed in Leslie–Gower space. Using the association attribute rule analysis method, the data filtration algorithm is designed. The feature extraction algorithm for data filtration is described as follows.
Input: construct the data filtering table with the least-squares method
Steps 1–9: …
Step 10: fusion decisions are performed within a limited field
Step 11: output the descriptive statistics matrix of data filtration [8]
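As a hedged illustration of Steps 10 and 11 (a fusion decision and the output descriptive statistics matrix), the sketch below computes a per-attribute descriptive statistics matrix and a simple majority-vote fusion; the helper names are hypothetical:

```python
import statistics

def stats_matrix(records):
    """Descriptive statistics matrix: one (mean, std, min, max) row per attribute column."""
    columns = list(zip(*records))
    return [
        (statistics.fmean(col), statistics.pstdev(col), min(col), max(col))
        for col in columns
    ]

def fuse_decisions(decisions):
    """Fusion decision over a limited field: majority vote over per-attribute votes."""
    return sum(decisions) > len(decisions) / 2

records = [(5, 3, 2, 2), (3, 6, 3, 3), (3, 4, 4, 3)]  # illustrative attribute rows
matrix = stats_matrix(records)
```

A majority vote is only one possible fusion rule; the paper's limited-field fusion could equally be weighted or probability-based.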
A series of factors affecting financial data change is sought as the input of the hidden Markov model (HMM), i.e., the observed sequence. The observed values are assumed to follow a mixed Gaussian distribution, so the emission density of state j takes the standard mixture form b_j(o_t) = Σ_{m=1}^{M} c_{jm} N(o_t; μ_{jm}, σ_{jm}²), where c_{jm} are the mixture weights of state j and N(·; μ, σ²) is the normal density.
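The mixed Gaussian emission density can be evaluated as a weighted sum of component normal densities. The sketch below is a generic illustration with made-up parameters, not the fitted HMM of the paper:

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density N(x; mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(x, weights, means, sigmas):
    """Mixed-Gaussian emission density: weighted sum of component densities."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))

# Two-component mixture with illustrative parameters
d = mixture_density(0.0, [0.6, 0.4], [0.0, 3.0], [1.0, 1.0])
```

In a full HMM the weights, means, and variances per state would be estimated, e.g., by the Baum–Welch algorithm, rather than fixed as here.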
On the basis of the constructed statistical characteristic analysis model for large data filtration, the optimization design of the data filtering model is performed: this paper proposes a data filtration method based on the probability mathematical model and gives the probability density feature distribution mapping of data filtration.
In the formulas
Information entropy feature extraction is performed according to the association of the attributes. The scalar data time-sequence attribute of data filtering is
Assuming that the decision threshold of
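The entropy feature and a threshold decision of the kind described here can be sketched as follows; this is illustrative only (the paper's exact threshold expression is defined by its formulas), and the names are hypothetical:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Information entropy (in bits) of a discrete attribute's value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def threshold_decision(statistic, threshold):
    """Accept the record when its detection statistic stays below the decision threshold."""
    return statistic < threshold

attr = [2, 3, 3, 3, 4, 2, 5, 4, 1, 1]  # illustrative attribute values
h = shannon_entropy(attr)
```

A low entropy marks a nearly constant attribute; comparing the detection statistic against the threshold then decides whether a record passes the filter.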
1. … // set the initial value of the probability mathematical model of data filtering
2. …
3. …
4. … // calculate the data characteristic positioning matrix to obtain the fuzzy decision power index
5. …
6. …
7. …
else
8. …
9. … // with the independent principal component analysis method, obtain the fuzzy decision and probability statistics of data filtering, and output the fuzzy identification matrix of data filtration
10. …
11. …
12. … // with the SVD decomposition method and power-index statistics, execute the following loop
13. …
14. …
15. …
End
16. Output the feature statistic of data filtering
17. …
18. … // end of iteration; realize the statistical regression analysis of large data filtration
19. …
20. …
21. Seek the main feature
22. If the convergence condition is met, the algorithm ends.
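Steps 21 and 22 amount to iteratively seeking the main feature until a convergence condition holds. A standard way to sketch this is power iteration for the dominant eigenvector of a matrix; this generic routine is an assumption about the intent, not the paper's exact loop:

```python
def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def norm(v):
    return sum(x * x for x in v) ** 0.5

def main_feature(A, tol=1e-10, max_iter=1000):
    """Power iteration: the dominant eigenvector ('main feature') of a square matrix.
    Iterates until the convergence condition (change below tol) is met."""
    v = [1.0] * len(A)
    for _ in range(max_iter):
        w = matvec(A, v)
        nw = norm(w)
        w = [x / nw for x in w]
        if norm([a - b for a, b in zip(w, v)]) < tol:
            return w
        v = w
    return v

A = [[2.0, 1.0], [1.0, 2.0]]  # symmetric; eigenvalues 3 and 1, dominant eigenvector along (1, 1)
v = main_feature(A)
```

The same loop structure applies if the matrix is a data covariance matrix, in which case the main feature is the first principal component.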
In order to test the application performance in data filtration, the data sample distribution set is database2016, the test data set is CSLOGS SET2, and the number of data samples is
Classification attribute set of data filtration
Discourse domain | Attribute 1 | Attribute 2 | Attribute 3 | Attribute 4
---|---|---|---|---
1 | 5 | 3 | 2 | 2
2 | 3 | 6 | 3 | 3
3 | 3 | 4 | 4 | 3
4 | 5 | 3 | 6 | 3
5 | 3 | 2 | 5 | 4
6 | 2 | 5 | 4 | 5
7 | 6 | 6 | 5 | 4
8 | 6 | 5 | 2 | 3
9 | 7 | 3 | 1 | 4
10 | 5 | 4 | 1 | 5
According to the statistical rules of the large data sheet described above, the data are filtered, and the detection statistics rules for data filtration are obtained as shown in Table 2.
Detection statistics rules table for data filtration
No. | Attribute 1 | Attribute 2 | Attribute 3 | Attribute 4
---|---|---|---|---
1 | 4 | * | 2 | 1
2 | 3 | * | 3 | 5
3 | 6 | * | 2 | 4
4 | * | 4 | 6 | 4
5 | 3 | * | 5 | 5
6 | 4 | 4 | * | 4
7 | 5 | * | 6 | 5
8 | 3 | 4 | 6 | 6
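Rules of the kind shown in Table 2, where * appears to act as a wildcard position, can be applied to filter records as in the following sketch (hypothetical helper names; the rule and record values below are illustrative):

```python
def rule_matches(record, rule):
    """A rule matches when every non-wildcard ('*') entry equals the record's attribute."""
    return all(r == "*" or r == a for r, a in zip(rule, record))

def filter_records(records, rules):
    """Keep records that match at least one detection-statistics rule."""
    return [rec for rec in records if any(rule_matches(rec, r) for r in rules)]

# Rules in the style of Table 2: '*' is a don't-care position
rules = [(4, "*", 2, 1), (3, "*", 3, 5), (3, 4, 6, 6)]
records = [(4, 9, 2, 1), (3, 4, 6, 6), (5, 5, 5, 5)]
matched = filter_records(records, rules)
```

Whether matching records are kept or discarded depends on whether the rules describe valid data or interference; the sketch assumes the former.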
According to the statistical rule table and the above procedure, data filtration is performed and the filtration error rate is obtained, as shown in Figure 2. Analysis of Figure 2 shows that the data filtration accuracy of this method is high and the error is low.
Effective filtration and separation of different types of big data improve data storage and management capabilities, optimize data storage space, and improve large-data information recognition capability. This paper proposed a data filtration method based on the probability mathematical model: using the descriptive statistical analysis method, a statistical feature analysis model of large data filtration was constructed; combined with mathematical modeling and statistical analysis, the probability mathematical model of the data filter was implemented; and threshold testing and threshold decision methods were used to carry out data filtering. The study shows that this data filtration method performs well and its error is low.