As the volume of big data grows, data containing different types of information require effective filtration and separation, improved data storage and management capabilities, optimized data storage space, and improved identification of large-data information. Research on information classification and filtration methods for big data is therefore of great significance for large database construction and cloud storage model design.
The data filter model is based on mathematical statistical analysis and applies a statistical regression analysis model of data filtration. Using probability distributions and adaptive fuzzy clustering analysis methods, the filtration of data is optimized. Typical large-data filtration models mainly include the quantitative regression analysis model, the autoregressive moving average (ARMA) model, the correlation statistical analysis model, the Backlund model, etc. Based on the design of the probability mathematical model, combined with pattern identification and characteristic clustering methods, data filtering is performed; however, the traditional method suffers from strong interference with pattern identification during data filtration. In response to this problem, this paper proposes a data filtration method based on the probability mathematical model: a statistical characteristic analysis model for large data filtration is constructed, combined with mathematical modeling and statistical analysis methods, and the probability mathematical model of data filtering is optimized. Finally, data analysis is performed to draw conclusions on the effectiveness of the model [1, 2].
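To make the general idea concrete, the following minimal sketch (not the paper's actual model; the function and parameter names are hypothetical) filters a one-dimensional sample by fitting a normal distribution and discarding points whose fitted probability density falls below a floor:

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    """Probability density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def filter_by_density(samples, density_floor=0.01):
    """Keep samples whose fitted-normal density is at least density_floor."""
    mu = sum(samples) / len(samples)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in samples) / len(samples))
    return [x for x in samples if gaussian_pdf(x, mu, sigma) >= density_floor]

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(500)] + [25.0, -30.0]  # inliers plus two gross outliers
kept = filter_by_density(data)
```

The density floor plays the role of the decision threshold discussed later; a real implementation would choose it from the desired false-rejection probability rather than fix it by hand.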
Financial data are typical time series data, covering stock prices, exchange rates, interest rates, futures prices, etc., in various one-dimensional or multidimensional forms, and are characterized by nonlinear dynamics. The analysis of financial data is not only a social hot topic but also a subject of academic research. For decades, a large number of scholars in economics, statistics, and computer science have made great contributions to its development. There are many kinds of financial data, such as stocks, futures, and exchange rates, characterized by nonlinear dynamic changes driven by social development and the economic policies of various countries; correspondingly, the research methods are equally varied. Data are endless, and statistical and computer science scholars hope to establish relatively reliable models from large amounts of financial data to predict and analyze financial markets, thereby reducing investors' mistakes in transaction decision-making and avoiding operational risks.
At the same time, these models explore the internal laws of the financial market, providing efficient solutions for managers and decision-makers in the financial market to improve the market economy system. Statistical models of real-world data usually involve continuous probability distributions, such as the normal, Laplace, or exponential distribution. Such distributions are supported by many probabilistic modeling formalisms, including probabilistic database systems. However, the traditional theoretical framework of probabilistic databases is concerned entirely with finite probabilistic databases [3]. Yue et al. [4] developed mathematical theories for infinite probabilistic databases. A previous paper (Grohe, Lindner; ICDT 2020) proposed a very general framework for probabilistic databases that may involve continuous probability distributions, in which queries have a clear definition. Probabilistic programming languages that generate data records (PODS 2020) were extended to continuous probability distributions, and such programs were shown to generate a continuous probabilistic database generation model [4]. Zhang [5] developed a new nonprobabilistic convex model whose boundaries accurately enclose all the uncertain parameter data extracted in an actual project. The method is based on conventional statistical methods and correlation analysis techniques. First, the mean value and the correlation coefficient of the uncertain parameters are calculated using all the given data.
Then, a simple but effective optimization procedure is introduced into the mathematical modeling process to induce the uncertain parameters to their exact boundaries. This procedure operates on all the given data by optimizing the convex model, yielding an effective mathematical expression for the final formulation of the convex model. To test the predictive capacity and generalization ability of the proposed convex model, the evaluation criteria used are the volume ratio, the standard volume ratio, and the prediction accuracy. The performance of the proposed method is studied on standard tests and compared with other existing competing methods; the results prove its effectiveness and efficiency [5]. Herrera-Vega et al. [6] propose portfolio selection models based on probabilistic hesitant financial data (PHFD), namely the probabilistic hesitant portfolio selection (PHPS) model and the risk PHPS (RPHPS) model.
In addition, investment decision methods are provided to show their practical applications in the financial market. The PHPS model, built for general investors on the highest-score or minimum-deviation principle, obtains the best investment ratio, and the RPHPS model provides the best investment ratio for the best return or for the risk undertaken. Finally, an empirical study based on actual data from China's stock market is presented in detail; the results verify the effectiveness and practicality of the proposed method [6]. Li et al. [7] used mathematical statistical methods to study the effectiveness of single-stage and multistage hydraulic fracturing technology. These studies are based on the geological and physical properties of the productive formation, which makes it possible to justify the choice of hydraulic fracturing technology.
Based on probability and statistical analysis of field efficiency data, quantitative criteria for selecting the fracture parameters (proppant length, height, and injection volume) have been established. They can be used to predict the applicability of multistage and single-stage hydraulic fracturing methods. The productive horizon AS12-3 has a complex geological and physical structure and low filtration and storage performance, which complicates development and contributes to the active formation of hard-to-recover reserves. The effect of single-stage hydraulic fracturing lasts only 3 to 4 years; multistage hydraulic fracturing technology is more effective than the single-stage technology [7].
The mathematical modeling analysis of data filtration is performed with the probability mathematical model. First, the descriptive statistical analysis method is used to analyze large data samples, the sampled data are screened, and the statistical features of the characteristic attributes that effectively reflect data filtration are determined. Sample regression analysis and adaptive processing are then performed on the sampled data; the combined information fusion method implements pattern identification for data filtration, and regression analysis is conducted according to the probability density of the detection statistic distribution in a massive data environment; the data are then filtered according to the pattern identification results. Based on the above analysis, the overall structure of the probability mathematical model of data filtration is shown in Figure 1.
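The first stage described above, descriptive statistical analysis followed by screening of the sample, can be sketched roughly as follows. This is an illustrative z-score screen under a normality assumption, not the paper's exact procedure, and all names are hypothetical:

```python
import statistics

def descriptive_stats(samples):
    """Descriptive statistical analysis of a data sample."""
    return {
        "mean": statistics.fmean(samples),
        "std": statistics.pstdev(samples),
        "min": min(samples),
        "max": max(samples),
    }

def screen_samples(samples, z_max=2.0):
    """Screen the sample: keep points within z_max standard deviations of the mean."""
    s = descriptive_stats(samples)
    if s["std"] == 0:
        return list(samples)
    return [x for x in samples if abs(x - s["mean"]) / s["std"] <= z_max]

data = [1.0, 1.2, 0.9, 1.1, 1.05, 0.95, 12.0]  # last point is a gross error
clean = screen_samples(data)
```

In the full model, the screened sample would then feed the regression and information-fusion stages rather than being used directly.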
According to the probability mathematical model shown in Figure 1, the mathematical information flow of data filtration is analyzed, and a sample time series is assumed for the sampled data information stream.
Statistical analysis of data filtration is performed in Leslie–Gower space. Using the association attribute rule analysis method, the data filtration algorithm is designed. The feature extraction algorithm for data filtration is described as follows.
Input: construct the data filtering table with the least-squares method
Steps 1–9: …
Step 10: fusion decisions are performed within a limited field
Step 11: output the descriptive statistics matrix of data filtration [8]
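As a hedged illustration of Steps 10 and 11 (a fusion decision and the output descriptive statistics matrix), the sketch below computes a per-attribute descriptive statistics matrix and a simple majority-vote fusion; the helper names are hypothetical:

```python
import statistics

def stats_matrix(records):
    """Descriptive statistics matrix: one (mean, std, min, max) row per attribute column."""
    columns = list(zip(*records))
    return [
        (statistics.fmean(col), statistics.pstdev(col), min(col), max(col))
        for col in columns
    ]

def fuse_decisions(decisions):
    """Fusion decision over a limited field: majority vote over per-attribute votes."""
    return sum(decisions) > len(decisions) / 2

records = [(5, 3, 2, 2), (3, 6, 3, 3), (3, 4, 4, 3)]  # illustrative attribute rows
matrix = stats_matrix(records)
```

A majority vote is only one possible fusion rule; the paper's limited-field fusion could equally be weighted or probability-based.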
A series of factors affecting financial data change is sought as the input of the hidden Markov model (HMM), i.e., the observed sequence. The observed values are assumed to follow a mixed Gaussian distribution, so the emission density of state j takes the standard mixture form b_j(o_t) = Σ_{m=1}^{M} c_{jm} N(o_t; μ_{jm}, σ_{jm}²), where c_{jm} are the mixture weights of state j and N(·; μ, σ²) is the normal density.
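The mixed Gaussian emission density can be evaluated as a weighted sum of component normal densities. The sketch below is a generic illustration with made-up parameters, not the fitted HMM of the paper:

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density N(x; mu, sigma^2)."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mixture_density(x, weights, means, sigmas):
    """Mixed-Gaussian emission density: weighted sum of component densities."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sigmas))

# Two-component mixture with illustrative parameters
d = mixture_density(0.0, [0.6, 0.4], [0.0, 3.0], [1.0, 1.0])
```

In a full HMM the weights, means, and variances per state would be estimated, e.g., by the Baum–Welch algorithm, rather than fixed as here.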
On the basis of the constructed statistical characteristic analysis model for large data filtration, the optimization design of the data filtering model is performed: this paper proposes a data filtration method based on the probability mathematical model and gives the probability density feature distribution mapping of data filtration.
In the formulas
Information entropy feature extraction is performed according to the association of the attributes. The scalar data time-sequence attribute of data filtering is
Assuming that the decision threshold of
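The entropy feature and a threshold decision of the kind described here can be sketched as follows; this is illustrative only (the paper's exact threshold expression is defined by its formulas), and the names are hypothetical:

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Information entropy (in bits) of a discrete attribute's value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def threshold_decision(statistic, threshold):
    """Accept the record when its detection statistic stays below the decision threshold."""
    return statistic < threshold

attr = [2, 3, 3, 3, 4, 2, 5, 4, 1, 1]  # illustrative attribute values
h = shannon_entropy(attr)
```

A low entropy marks a nearly constant attribute; comparing the detection statistic against the threshold then decides whether a record passes the filter.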
1. … // set the initial value of the probability mathematical model of data filtering
2. …
3. …
4. … // calculate the data characteristic positioning matrix to obtain the fuzzy decision power index
5. …
6. …
7. …
else
8. …
9. … // with the independent principal component analysis method, obtain the fuzzy decision and probability statistics of data filtering, and output the fuzzy identification matrix of data filtration
10. …
11. …
12. … // with the SVD decomposition method and power-index statistics, execute the following loop
13. …
14. …
15. …
End
16. Output the feature statistic of data filtering
17. …
18. … // end of iteration; realize the statistical regression analysis of large data filtration
19. …
20. …
21. Seek the main feature
22. If the convergence condition is met, the algorithm ends.
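Steps 21 and 22 amount to iteratively seeking the main feature until a convergence condition holds. A standard way to sketch this is power iteration for the dominant eigenvector of a matrix; this generic routine is an assumption about the intent, not the paper's exact loop:

```python
def matvec(A, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def norm(v):
    return sum(x * x for x in v) ** 0.5

def main_feature(A, tol=1e-10, max_iter=1000):
    """Power iteration: the dominant eigenvector ('main feature') of a square matrix.
    Iterates until the convergence condition (change below tol) is met."""
    v = [1.0] * len(A)
    for _ in range(max_iter):
        w = matvec(A, v)
        nw = norm(w)
        w = [x / nw for x in w]
        if norm([a - b for a, b in zip(w, v)]) < tol:
            return w
        v = w
    return v

A = [[2.0, 1.0], [1.0, 2.0]]  # symmetric; eigenvalues 3 and 1, dominant eigenvector along (1, 1)
v = main_feature(A)
```

The same loop structure applies if the matrix is a data covariance matrix, in which case the main feature is the first principal component.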
In order to test the application performance in data filtration, the data sample distribution set is database2016, the test data set is CSLOGS SET2, and the number of data samples is
Classification attribute set of data filtration
Discourse domain | Attribute 1 | Attribute 2 | Attribute 3 | Attribute 4
---|---|---|---|---
1 | 5 | 3 | 2 | 2
2 | 3 | 6 | 3 | 3
3 | 3 | 4 | 4 | 3
4 | 5 | 3 | 6 | 3
5 | 3 | 2 | 5 | 4
6 | 2 | 5 | 4 | 5
7 | 6 | 6 | 5 | 4
8 | 6 | 5 | 2 | 3
9 | 7 | 3 | 1 | 4
10 | 5 | 4 | 1 | 5
According to the statistical rules of the large data sheet described above, the data are filtered, and the detection statistics rules for data filtration are obtained as shown in Table 2.
Detection statistics rules table for data filtration
No. | Attribute 1 | Attribute 2 | Attribute 3 | Attribute 4
---|---|---|---|---
1 | 4 | * | 2 | 1
2 | 3 | * | 3 | 5
3 | 6 | * | 2 | 4
4 | * | 4 | 6 | 4
5 | 3 | * | 5 | 5
6 | 4 | 4 | * | 4
7 | 5 | * | 6 | 5
8 | 3 | 4 | 6 | 6
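Rules of the kind shown in Table 2, where * appears to act as a wildcard position, can be applied to filter records as in the following sketch (hypothetical helper names; the rule and record values below are illustrative):

```python
def rule_matches(record, rule):
    """A rule matches when every non-wildcard ('*') entry equals the record's attribute."""
    return all(r == "*" or r == a for r, a in zip(rule, record))

def filter_records(records, rules):
    """Keep records that match at least one detection-statistics rule."""
    return [rec for rec in records if any(rule_matches(rec, r) for r in rules)]

# Rules in the style of Table 2: '*' is a don't-care position
rules = [(4, "*", 2, 1), (3, "*", 3, 5), (3, 4, 6, 6)]
records = [(4, 9, 2, 1), (3, 4, 6, 6), (5, 5, 5, 5)]
matched = filter_records(records, rules)
```

Whether matching records are kept or discarded depends on whether the rules describe valid data or interference; the sketch assumes the former.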
According to the statistical rule table and the above procedure, data filtration is performed and the filtration error rate is obtained, as shown in Figure 2. Analysis of Figure 2 shows that the data filtration accuracy of this method is high and the error is low.
Effective filtration and separation of different types of big data improve data storage and management capabilities, optimize data storage space, and improve large-data information recognition capability. This paper proposed a data filtration method based on the probability mathematical model: using the descriptive statistical analysis method, a statistical feature analysis model of large data filtration was constructed; combined with mathematical modeling and statistical analysis, the probability mathematical model of the data filter was implemented; and threshold testing and threshold decision methods were used to carry out data filtering. The study shows that this data filtration method performs well and its error is low.