Modelling bid-ask spread conditional distributions using hierarchical correlation reconstruction

While we would like to predict exact values, available incomplete information is rarely sufficient - usually allowing only to predict conditional probability distributions. This article discusses hierarchical correlation reconstruction (HCR) methodology for such prediction on example of usually unavailable bid-ask spreads, predicted from more accessible data like closing price, volume, high/low price, returns. In HCR methodology we first normalize marginal distributions to nearly uniform like in copula theory. Then we model (joint) densities as linear combinations of orthonormal polynomials, getting its decomposition into (mixed) moments. Then here we model each moment (separately) of predicted variable as a linear combination of mixed moments of known variables using least squares linear regression - getting accurate description with interpretable coefficients describing linear relations between moments. Combining such predicted moments we get predicted density as a polynomial, for which we can e.g. calculate expected value, but also variance to evaluate uncertainty of such prediction, or we can use the entire distribution e.g. for more accurate further calculations or generating random values. There were performed 10-fold cross-validation log-likelihood tests for 22 DAX companies, leading to very accurate predictions, especially when using individual models for each company as there were found large differences between their behaviors. Additional advantage of the discussed methodology is being computationally inexpensive, finding and evaluation a model with hundreds of parameters and thousands of data points takes a second on a laptop.


I. INTRODUCTION
While it is more convenient to work on exact values, real life predictions usually have some uncertainty, controlling of which could allow e.g. to distinguish nearly certain predictions from the practically worthless ones.Generally, wanting to predict Y variable from X = (X 1 , . . ., X d ) variables, if there is no a strict relation, they often come from some complicated joint probability distribution -knowing X = x, we can only predict Pr(Y |X = x) conditional probability distribution.This article discusses such prediction of conditional probability distributions on example of bid-ask spreads, which is often publicly unavailable, from a Figure 1.General concept, some firsts functions of the used 1 and 2 dimensional basis of orthornormal polynomials (f j 1 j 2 (x) = f j 1 (x 1 )f j 2 (x 2 )), and application example.For simplicity we assume working on variables normalized to nearly uniform marginal densities on [0, 1] as e.g. in copula theory.We would like to model distortion from this uniform distribution for predicted variable Y based on the context X: as a linear combination e.g. of orthornormal polynomials here, for which coefficients have similar interpretation as moments/cumulants: a 1 shifts right/left like expected value, a 2 increases/decreases probability of extreme values as variance etc.Each such coefficient is separately modelled using analogous coefficients of X variable: ρ(y|x) = j f j (y) k β jk f k (x), e.g. using least-squares linear regression here.Such predicted density as polynomial sometimes gets below zero, hence there is used e.g.ρ = max(ρ, 0.03)/N instead, with N normalization factor to integrate to 1. Example of application of such predicted density is just taking its expected value, getting a conservative prediction of value (avoiding extremes), also with estimated uncertainty if additionally calculating variance of the predicted density.few variables which are available.There is used hierarchical correlation reconstruction (HCR) [1] methodology as briefly presented in Fig. 1: first normalize all variables to nearly uniform distribution like in copula theory [2], then model densities as polynomials using basis of orthonormal polynomials -for which coefficients are analogous to (mixed) moments.Then predict such coefficients of density of Y Top: examples of predicted conditional densities -predicted ρ(y|x i ) = j f j (y) k β jk f k (x i ) polynomial for i-th datapoint undergoes ρ = max(ρ, 0.03)/N to remove negative densities, and normalization to integrate to 1 = 1 0 ρ(y|x)dy.Each diagram contains 10 example predictions, vertical lines show the actual values (y i , ρ(y i |x i )): the higher the better prediction, without prediction all would have height 1.There were chosen companies having best/worst prediction.The best ones predict mainly narrow unimodal distributions in line with the actual values, weaker ones can rather only predict wide often multimodal distributions.We can see rapid growths at the ends -they are likely artifacts of using polynomials, their additional removal might improve prediction.Bottom: their sorted predicted densities in the actual values {ρ(y i |x i )} i , with marked gray ρ = 1 line of using no prediction and green exp(log-likelihood) line corresponding to average improvement over no prediction.The points are of different colors denoting one of 10 rounds of 10-fold cross-validation.from coefficients of x, for example using least-squares linear regression here, separately for each modelled moment of Y .The evaluation is performed using 10-fold cross-validation for log-likelihood of normalized variables: average natural logarithm of predicted conditional density in the actual value, what can be interpreted as estimated minus conditional entropy E XY (ln(ρ(Y |X))) = −H(Y |X). Figure 2 contains examples of such predicted densities, Figure 3 contains main evaluation of the used variables and models.
Having predicted conditional probability distributions, a basic application can be just taking expected values -getting conservative predictions of values, e.g.avoiding predicting extremes, as presented in Figure 7.We can additionally calculate variance of such predicted density to estimate uncertainty of value predicted as expected value.We can also handle more sophisticated situations like bimodal distribution with two (or more) separate maxima: when pointing expected value might not be a good choice (can have much lower density), a better prediction might be e.g. one of the maxima, or maybe both: providing prediction as alternative of two (or more) possibilities.Finally, we can also work on complete predicted densities, e.g. to generate Figure 3. Top: Log-likelihoods from 10-fold cross-validation for individual models for companies using various types of information, e.g.'123' denotes using all basic 1,2,3 variables, where '1' denotes closing price (P ), '2' volume (V ), and '3' difference between high and low price normalized by dividing by closing price: (H − L)/P .The '+' denotes using also 3 additional variables: depth, midpoint changes intraday and midpoint volatility.The last column presents averaged evaluation for using common model for all data.Bottom: examples of pairwise dependencies in dataset for the 3 variables (columns) for the least and the most dependent companies for a given variable (heights of corresponding dots).For example volume ('2') does not help for RWEG (nearly uncorrelated -blue dot is near zero), but is useful for FREG.We can see that there are large differences between companies, hence we will mostly focus on building individual models for each company.its random values for Monte-Carlo methods, or processing such entire probability distributions through some further functions for their more accurate modelling, as e.g.generally E(f (X)) = f (E(X)) for nonlinear functions: expected value of function is not equal function of expected value.
To model such conditional distributions we will use HCR methodology, which combines advantages of classical statistics and machine learning.While the former allows for well controlled and interpretable but relatively small (rough) models/descriptions, machine learning allows for very accurate descriptions using huge models, but usually lacks uniqueness of solution, control and interpretability of coefficients, and often is computationally costly.HCR allows to inexpensively work on huge models obtained from (unique) least-squares optimization, using well interpretable coefficients: as mixed moments of variables, starting e.g. with moments of single variables and correlations of coefficients.
More specifically, HCR conveniently starts with normalization of all marginal distributions to nearly uniform distributions like in copula theory -so they can be interpreted as quantiles, compactifying tails problematic for linear regression.Now we can model distortion from uniform distribution on this [0, 1] d hypercube with a linear combination, e.g. of orthonormal polynomials, for which coefficients can be interpreted analogously to (mixed) moments.E.g. for 3 variables, '000' coefficient is always 1 -corresponding to normalization, '100' is analogous to expected value of 1st variable, '020' to variance of 2nd variable, '011' to correlation coefficient between 2nd and 3rd variable, '202' is large if with large variance of 1st variable there comes large variance of 3rd variable (like heteroskedasticity in ARCH model), and so on also for higher moments and dependencies between 3 or more variables, getting hierarchical decomposition of statistical dependencies (joint distribution) into mixed moments.
While we could directly extract and exploit (X, Y ) joint distribution with HCR, experimental tests have shown that alternative approach from [3] gives slightly better evaluation by extracting additional dependencies, hence we will focus only on it: use separate bases of (mixed) moments for Y, X and predict each considered coefficient for Y with least squares linear regression using coefficients of X.While [3] has used only moments of separate variables for X, here we expand this methodology by using also their mixed moments -starting with '11' correlation-like coefficient.
Its advantage over modelling joint distribution is being able to notice and exploit e.g. that difference of two variables has some useful relation with the predicted variable.Looking e.g. at RWEG in Fig. 3, '2'-nd variable is practically noise, its (blue) dot is nearly zero.However, the difference between '13' and '123' model (red and orange dot) is much larger: relations with other variables allowed to extract more information from this noise.
In the discussed example we would like to predict conditional probability distributions for (nearly inaccessible) bid-ask spreads (relative quoted) from more available information.The basic considered '123' model began with 5 classically used variables: closing price (P ), high and low value (H, L), volume (V ) and log-return (R).Surprisingly, it has turned out that using R and L alone did not help improving evaluation (log-likelihood in 10-fold cross-validation), hence finally there were used 3 variables P, V, (H − L)/P .The second considered '123+' model complements this information with other relatively available variables, searching through which has lead to final use of 3 additional variables: (market) depth, midpoint changes intraday, and midpoint volatility.
The choice of basis of moments is a difficult question: too small leads to underfitting by not being able to express dependencies of data behavior, too large leads to overfitting by representing features of training set which do not generalize to test set.For '123' and '123+' models there was chosen a compromise to optimize for all companies: for '123' predicting 8 first moments of Y using 53 mixed moments of 3 variables of X, hence using 8 • 53 = 424 coefficient models.For '123+' predicting 6 moments from 205 mixed moments of 6 variables of X, getting 6 • 205 = 1230 coefficient models.
The used data is for 22 DAX companies for which large enough dataset was available, arbitrarily chosen as containing at least 2000 daily datapoints.As there were observed large difference between models for different companies, corresponding to different behaviors of traders of its stock, there were mainly considered individual models for each company.There was also performed hierarchical search for combinations of companies for which using common model leads to the smallest loss of evaluation, however, such loss often turns out significant.

II. DATASET AND BASIC CONCEPTS
This section discusses dataset and reminds standard concepts, to be used for building the used methodology in the next Section.
The basic set of variables is P -closing price, V -volume, R -return, H, L -high/low price.However, it has turned out that trying to exploit dependence on R and L alone did improve evaluation, hence finally the basic considered model: '123' uses only P as '1'-st variable, V as '2'ndvariable and normalized (H − L)/P as '3'-rd variable.
There were also performed trials to improve the prediction by using information from some additional relatively available variables -3 were found helpful in predictions: (market) depth, midpoint changes intraday and midpoint volatility.

B. Bid-ask spread and some its standard predictors
Bid-ask spread is the difference between the lowest asking price (ask, offered by a seller) and the highest bid price (bid, offered by a buyer).While this value is important because it is a main measure of market quality [4], [5], this information is usually publicly unavailable.Therefore, there is an interest in being able to predict this value based on other, more accessible data.
More specifically, we work on relative quoted spread, which is normalized by dividing by midpoint (ask +bid)/2: Simple examples of its predictor based on the 5 basic variables are AM I [6], [7], HLR [8]: They are intended for a simpler task than discussed: to predict values, while here we want to predict entire conditional probability distributions.We can reduce predicted probability distributions into predicted values e.g. as expected value, median, or positions of maxima (especially for multimodal distributions).Figure 7 presents comparisons using such predictions reduced with expected value.However, in practice such prediction is often further processed through some functions, generally E(f (X)) = f (E(X)) for nonlinear, hence it is more accurate to process the probability distribution (e.g. on a lattice) through the functions before e.g.taking expected value.

C. Normalization to nearly uniform marginal distributions
Like in copula theory, in HCR methodology it is convenient to initially normalize all variables to nearly uniform marginal distributions in [0, 1], hence we further only work on such normalized variables, what beside usually better prediction, also allows for better presentation of evaluation: e.g.density without prediction is 1, log-likelihood is 0.
This standard normalization requires estimations of cumulative distribution function (CDF), individually for each variable, and applying this CDF function to the original values.Finally, having a prediction we can go back to the original variable using CDF −1 , for example as in the original [3] article, however, for simplicity we omit this step hereworking only on normalized variables.Also AM I, HLR predictions underwent such normalization for the purpose of Fig. 7 visual performance comparison -making that an ideal predictor would give diagonal.
There was used empirical distribution function (EDF) for such normalization here: for each variable its n observed values are sorted, then i-th value in such order obtains (i − 0.5)/n normalized value.Hence values become their estimated quantiles this way, difference of two normalized values describes percentage of population between these two values.
Having predicted density for normalized variable, we can transform it to the original variable e.g. by discretizing this density to probability distribution on {(i − 0.5)/n} i=1..n lattice, and assigning probability of its i-th position to i-th ordered original value.For simplicity it is omitted in this article.

D. Evaluation: log-likelihood with 10-fold cross-validation
The most standard evaluation of probability distributions is log-likelihood like in ML estimation: average (natural) logarithm of (predicted) density in the actually observed value.Hence we will use this evaluation here.
Working on variables normalized to ρ ≈ 1 marginal distributions, without prediction they would have practically zero log-likelihood.It allows to imagine gains from predictions as averaged improvement over this ρ ≈ 1, as in Fig. 2. For example the best observed log-likelihood ≈ 1.2 corresponds to ≈ exp(1.2) ≈ 3.3 times better density than without prediction, the same as if we could squeeze [0, 1] range 3.3 times to a 0.3 wide range.Sorting predicted densities in the actually observed values, we can get additional information about distribution of prediction, as presented in this Figure .We predict here conditional density -denoted as ρ(Y = y|X = x) for density of Y predicted based on known value of X. Hence the used evaluation can be seen as estimation of E XY (ln(ρ(Y |X)), which is minus conditional entropy −H(Y |X).While being unknown, random variables have some concrete value of conditional entropy -we can hopefully try to approach it with better and better models.
We are focusing here on large models using hundreds of coefficients, hence we need to be careful not to overfit: represent only behavior which indeed generalizes, is not just a statistical artifact of the training set.Machine learning also builds large models, usually evaluating using crossvalidation: randomly split dataset into training and test set, training set is used to build the model, then test (or validation) set is used to evaluate the built model.
However, such evaluation depends on the random splitting into training and test set.There is used standard 10-fold cross validation to weaken this random effect: dataset is randomly split into 10 nearly equal size subsets, evaluation is average from 10 cross validations: using successive subsets as the test set and the remaining as the training set.However, there is still observed scale ≈ 0.01 randomness of such evaluation, hence only two digits after coma are being presented.

III. USED HCR-BASED METHODOLOGY
This section discusses the used methodology, being expansion of the one used in [3].We decompose X and Y variables into mixed moments and model separately each moments of Y using least-squares linear regression of moments of X, then combine them into predicted conditional probability distribution of Y .

A. Decomposing joint distribution into mixed moments
After normalization of marginal distributions of all variables to nearly uniform on [0, 1], for d variables their joint distribution on [0, 1] d would be also nearly uniform if they were statistically independent.
Distortion from uniform joint distribution corresponds to statistical dependencies between these variables -we would like to model and exploit it.In HCR we model it as just a linear combination using an orthornormal basis e.g. of polynomials, which gives the coefficients similar interpretation as moments and mixed moments: dependencies between moments for multiple variables.
Decomposing density ρ(x) = j a j f j (x), we need a 0 = 1 normalization to integrate to 1. Due to orthogonality, 1 0 f j (x)dx = 0 for j > 0, hence the following coefficients do not affect normalization.As we can see in their plots in Fig. 1, positive a 1 shifts density toward right -acting analogously as expected value.Positive a 2 increases probability of extreme values at cost of central values -analogously as variance.Skewness-like higher order asymmetry is brought by a 3 and so on -we can intuitively interpret these coefficients as moments (cumulants).This is only an approximation, but useful for interpretation of discussed models.
In multiple dimensions we can use product basis: leading to model of joint distribution: where B is the basis of mixed moments we are using for our modelling.It is required to contain (0, . . ., 0) for normalization.Beside, there is a freedom of choosing this basis, what allows to hierarchically decompose statistical dependencies of multiple variables into mixed moments.Figure 1 contains some first 5 functions of such product basis for d = 2 variables: f 00 corresponds to normalization and requires a 00 = 1.Coefficients of f 10 , f 20 describe expected value and variance of the first variable, f 01 and f 02 analogously of the second.Then we can start including moment dependencies, starting with a 11 which determines decrease/increase of expected value of one variable with growth of expected value of the second variables -analogously to correlation coefficient.We have also dependencies between higher moments, like asymmetric a 12 relating expected value of the first variable and variance of the second.
And analogously for more variables, e.g. a 010010 describes correlation between 2nd and 5th out of 6 variables.Finally we can hierarchical decompose statistical dependencies between multiple variables into their mixed moments.However, to fully describe general joint distribution, we would need B = N d infinite number of mixed moments this way -for practical applications we need to choose some finite basis B of moments to focus on.

B. Estimation with least squares linear regression
Having a data sample X , we would like to estimate such mixed moments as coefficients for linear combination of some orthonomal basis of functions e.g.polynomials.Smoothing the sample with kernel density estimation, finding linear combination minimizing square distance to such smoothened sample, and performing limit to zero width of the used kernel, we get convenient and inexpensive MSE estimation [1]: independently for each coefficient j as just average over dataset of value of the corresponding function: We could use such obtained model for predicting conditional distribution: substitute the known variables to the modeled joint distribution, after normalization getting (conditional) density of the unknown variables.However, for the bid-ask spread prediction problem, slightly better evaluation was obtained by generalizing alternative approach from [3], which allows to additionally exploit subtle variable dependencies, hence we will focus on it.
Specifically, to model ρ(Y = y|X = x), let us use separate bases of (mixed) moments: B X for X, B Y for Y , and model relations between them.While there could be considered more sophisticated models for such relations including neural networks, for simplicity and interpretability we focus on linear models here, treating f j (x) as inter-pretable features: hence the model is defined by the β matrix, which examples are visualized in Fig. 4 for It allows for good interpretability: β jk coefficient is linear contribution of k-th mixed moment of X to j-th (mixed) moment of Y .We focus on one-dimensional Y , but the formalism allows to analogously predict multidimensional.
To find the β model we use least-squares optimization here -it is very inexpensive, can be made independently for each j ∈ B Y thanks to using orthonormal basis, and intuitively is a proper heuristic: least-squares optimization estimates the mean -exactly as we would like for coefficient estimation (6).However, this is not necessarily the optimal choice -it might be worth to explore also more sophisticated ways.
Such least-squares optimization has to be performed separately for each j ∈ B Y .Denoting β j• = (β jk ) k∈B X as coefficient vector for j-th moment and Z = {(y i , x i )} i=1..n as (e.g.training) dataset of (y, x) pairs: .n matrix M and vector b j for j ∈ B Y .Such least-squares optimization has unique solution: Separately calculated for each j ∈ B Y , leading to the entire model as β matrix with β j• rows, like in Fig. 4.

C. Applying the model, enforcing nonnegativity
Having such model β we can apply it to (e.g.test) datapoints as in 7, getting predicted conditional density for y on [0, 1] as a polynomial.However, sometimes it can get below 0, so let us refer to it as ρ and then enforce nonnegativity required for densities: Such obtained polynomial always integrates to 1.However, it occasionally can get below zero, what should be interpreted as corresponding to some low positive density.Such interpretation to nonnegative density ρ is referred as calibration, and can be optimized based on dataset as discussed in [9].For simplicity there was just used: ), the numbers above names are log-likelihoods.The 'common' is the model built for combined all data -presents general trends.Trying to split all companies into subsets of similar behavior, as visualized in tree Fig. 6, splitting into two subsets we get the presented comL and comR models correspondingly for left (DPWG, BEIG, HNKG, FME, SAPG, DB1G, RWEG, FREG, HEIG, DTEG, IFXG) and right (DAIG, SIEG, TKA, CONG, MRCG, LHAG, VOWG, MUVG, ALVG, BMW, DBKG) subtree of this tree.Then there presented individual models for 5 selected companies.Rows correspond to predicted moments of Y , as a linear combinations of mixed moments of X corresponding to columns.The zeroth row has always only 000 nonzero coefficient equal 1 for normalization.The next row describes prediction of expected value, the next one of variance and so on.In the top model, common for all companies, we can e.g.see large positive 001 → 1 coefficient: spread increases with growth of H − L, negative 010 → 1: spread decreases with growth of V , and negative 011 → 2: variance of spread decreases for correlated V and H − L. Blue 100 → 3 for FREG denotes reduction of skewness of spread with growth of price.Generally, we can see quite individual behavior for different companies, starting with 100 → 1 analogous to price-spread correlation, which seems the main dividing factor between comL and comR companies.
where N normalization factor is chosen to integrate to 1: N = 1 0 max (ρ(y|x), 0.03) dy.The 0.03 threshold was experimentally chosen as a compromise for the used dataset, its tuning can slightly improve evaluation.
Figure 2 contains examples of such ρ(y|x i ) predicted densities on the test set with y i actual values marked as vertical lines.Flat near zero regions come from max(•, 0.03) thresholding.While they are relatively frequent in such predicted densities, in plots of sorted {ρ(y i |x i )} i below we can see that these close to zero densities are very rare among the actual values: prediction properly excludes these ρ < 0.03 regions as unlikely.
Integration is relatively costly to compute, especially in higher dimensions, hence for efficient calculation the predicted polynomial ρ was discretized here into 100 values on ((i − 0.5)/100) i=1..100 lattice, what corresponds to approximating density with piecewise constant function on length 1/100 subranges.Then max(•, 0.03) was applied, and division by sum for normalization.Finally density in discretized 100y i /100 position was used as ρ(y i |x i ) in log-likelihood evaluation.

D. Basic basis selection
The optimal choice of basis is a difficult open question.As the basic choice there was used combinatorial family: where m i chooses how many first moments to use for i-th variable, s bounds the sum of used moments (and formally degree of corresponding polynomial), r bounds the number of nonzero j i : to include dependencies of up to r variables.

IV. BID-ASK SPREAD MODELLING
This section discusses application of the presented methodology to model conditional distribution of (relative quoted) bid-ask spread.

A. '123' model using basic variables
The initial plan for this article was to improve prediction from standard models: AM I, HLR, trying to predict conditional distribution of spread from their values using the discussed methodology.However, the results were disappointing, especially for AM I, as we can see in Fig. 7.
Therefore, we have decided to use the original variables (P, V, L, H, R) instead, what has turned out to lead to essentially better predictions.There was manually performed search for parameters using B basic basis selection (III-D) to maximize averaged log-likelihood in 10-fold cross validation.This search has finally lead to B X = B((4, 4, 4), 5, 3) basis for only 3 variables: P, V, (H − L)/P to predict up to 8-th moment of Y .Surprisingly, adding dependence on R  moments (denoted by colors) using some first of mixed moments (sorted lexicographically) of 3 X variables: P, V, (H − L)/P .We can see that we should predict ≈ 8 moments, higher moments are necessary to represent more complex distributions.Middle: selective removal of successive mixed moments to maximize log-likelihood -we can see that we can slightly improve evaluation this way, additionally reducing model size.However, it rather requires individual optimization for each company.Bottom: analogously as top, but using size 181 larger B X = B((5, 5, 5), 10, 3), also trying different order of mixed moments: accordingly to i (j i ) p .While using all such mixed moments clearly leads to overfitting, selectively using some first of them can lead to slightly improved evaluation.and L alone was worsening evaluation -their dependence did not generalize from training to test sets.
While the optimal choice of basis seems a difficult open problem, exhaustive search over all subsets is rather impractically costly, Fig. 5 presents some heuristic approaches.The B family seems generally a good start, e.g.successively modifying some parameter by one as long as observing improvement.In this Figure we can see large improvement while rising the number of predicted moments up to ≈ 7, what suggests that complexity of conditional distributions for the considered problem requires this degree of polynomial for proper description.This Figure also contains trials of using some first of such mixed moments accordingly to different orders.A heuristic optimization of a reasonable cost is the presented selective removal: for each mixed moment from B X calculate evaluation when it is removed, finally remove the one leading to the best evaluation, and so on as long as evaluation improves.

B. Individual vs common models, universality
A natural question is how helpful for prediction is a given variable -Figure 3 presents some answers by calculating log-likelihood also for models using only some of the variables.We can see different companies can have very different behavior here, e.g. for some V is helpful, for some it is not, what we can also see in the presented points from dataset. Figure 4 shows that they can even have opposite behavior: e.g. 100 → 1 dependence on price.This is a general lesson that while we would like predictors as nice simple formulas, the reality might be much more complicated -models found here are results of cultures of traders of stocks of individual companies, which can essentially vary between companies.
Therefore, to get the most accurate predictions we should build individual models for each company.Even more, a specific behavior of a given company can additionally evolve in time -what could be exploited e.g. by building separate models for shorter time periods, or using adaptive leastsquares linear regression [10], and is planned for future investigation.
However, building such models requires training data, which in case of variables like bid-ask spread might be difficult to access.Hence it is also important to search for universality -e.g.try to guess a model for a company for which we lack such data, based on data available for other companies.This generally seems a very difficult problem, Fig. 6 shows that even having all the data, using common model for multiple companies we should expect large evaluation drop.For example we can see that behavior of DTEG completely disagrees with common model for all.
As we can see in this tree Figure, the one common model situation improves if we can cluster companies into groups of similar behavior (orange dots) -there are also presented results for splitting companies into just two groups with separate models (comL, comR in Fig. 4), also visually leading to slightly better prediction as we can see comparing 3rd and 4th column in Fig. 7.

C. '123+' model with additional variables
The information from P, V, H, L, R basic variables can often be complemented with some additional -a natural approach is checking if we can improve log-likelihood in the discussed methodology if adding information from some new variables.
The size of basis can even grow exponentially with the number of variables here: there are (m+1) d mixed moments Visualization of optimized hierarchical grouping and loss of using common models for multiple companies, height denotes loglikelihoods.Heights of names show evaluation of using individual model for a given company, orange dots show successive reduction of loglikelihood for a given company while using common models for subsets growing accordingly to the presented tree.The lowest dots correspond to using one common model for all (common in Fig. 4) -we can see that only for DTEG it is worse than zero (using no prediction at all).Splitting companies into left and right subtree and using separate two models for them (comL and comR in Fig. 4), we get essentially better prediction (one dot up).The tree structure was calculated by combining subsets to maximize (log-likelihood of common model / average loglikelihood of individual models) -grouping companies into pairs and then further, up to a single common model for all.Positions of lines represent such grouped companies: light-gray line their averaged log-likelihoods of individual models, dark-gray line their log-likelihood for a common model.The difference between these two lines represent loss of using common model.The common models are fixed hence there is no cross validation (CV), what artificially improves performance, for example for the first dot of FME corresponding to common model with HNKG -making it above CV individual model, generally suggesting large time inhomogeneities -to be included in future adaptive models.
if using all up to m-th moment for all d variables.The r parameter: maximal number of interacting variables in d i=1 sgn(j i ) ≤ r allows to bound it by O((dm) r ).The sum s limitation in d i=1 j i ≤ s seems also very useful, bounding degree of used polynomials.
Due to quickly growing basis size for increasing number of variables, we could easily exceed the size of datasetexperimentally seen as overfitting: decreasing performance.Manual search for using additional variables has started with B X = B((4, 4, 4), 5, 3) basis for the standard variables, and carefully increasing m in B X = B((4, 4, 4, m), 5, 3) basis with separate single additional variable to consider.
The most promising variables were later considered together, by modifying parameters by 1 as long as improvement was observed (of averaged log-likelihood over individual models for all companies).
Due to rapid growth of the number of coefficients, for adding further variables it is worth to consider e.g.building some features from multiple variables to be directly used here, or use some alternative way for choosing basis for X, e.g.directly optimized on the dataset like PCA or other dimensionality reduction.

V. CONCLUSIONS AND FURTHER WORK
There was presented a general methodology for extracting and exploiting complex statistical dependencies between multiple variables in inexpensive and interpretable way for predicting conditional probability distributions, on example of difficult problem of predicting bid-ask spread from more available information.It expands approach form [3] by inferring from mixed moments, and searching for the basis in large spaces of possibilities.
Figure 7 presents its comparison with standard methods when using only expected value from such predicted conditional density -perfect predictor would lead to diagonal, standard methods give rather a noise instead, while the predictions from the discussed approaches indeed often resemble diagonal, especially when using individual models.Predicted conditional probability density provides much more information: e.g.allows to additionally estimate uncertainty of such prediction, or provide or-or prediction for multimodal densities, or allows for generating its random values e.g. for Monte-Carlo simulations, or just provide the entire density for accurate considerations if transforming such random variables through some further functions.
There are many directions for further development of this relatively new general methodology, for example: • Optimal choice of basis is a difficult problem, necessary to be automatized especially for a larger number of variables -selecting from discussed basis of orthonormal polynomials, or maybe automatically optimizing a completely different basis based on dataset.• There are observed large differences between behaviors of individual companies -bringing difficult questions of trying to optimize for common behavior, optimize models based on incomplete information, etc.Additionally, such behavior has probably also time inhomogeneitythe models should evolve in time, requiring adaptive models to improve performance, where the problem of data availability becomes even more crucial.• The discussed models rapidly grow with the number of variables, what requires some modifications for exploiting high dimensional information -like extracting features from these variables, dimensionality reduction like PCA, etc. • We have predicted conditional distribution for onedimensional variable, but the methodology was introduced to be more general: predicting for multidimensional Y should be just a matter of using proper B Y , what is planned to be tested in future.• The predicted densities as polynomials have often rapid growths at the ends of [0, 1] -their removal might improve performance.• There was assumed linear relation between moments with least-squares optimization, what is inexpensive and has good interpretability, but is not necessarily optimal -there could be considered e.g. using neural networks instead, and optimizing criteria closer to loglikelihood of final predictions.The "1 common" column uses 1 model for all, "2 common" groups companies into two subsets and uses one of two models (as in Fig. 6, using models comL, comR from Fig. 4).The last two columns use models individually optimized for each company.

Figure 2 .
Figure 2.Top: examples of predicted conditional densities -predicted ρ(y|x i ) = j f j (y) k β jk f k (x i ) polynomial for i-th datapoint undergoes ρ = max(ρ, 0.03)/N to remove negative densities, and normalization to integrate to 1 = 1 0 ρ(y|x)dy.Each diagram contains 10 example predictions, vertical lines show the actual values (y i , ρ(y i |x i )): the higher the better prediction, without prediction all would have height 1.There were chosen companies having best/worst prediction.The best ones predict mainly narrow unimodal distributions in line with the actual values, weaker ones can rather only predict wide often multimodal distributions.We can see rapid growths at the ends -they are likely artifacts of using polynomials, their additional removal might improve prediction.Bottom: their sorted predicted densities in the actual values {ρ(y i |x i )} i , with marked gray ρ = 1 line of using no prediction and green exp(log-likelihood) line corresponding to average improvement over no prediction.The points are of different colors denoting one of 10 rounds of 10-fold cross-validation.

Figure 4 .
Figure 4.Visualized coefficients of '123' models (9 × 53 matrix β for ρ(y|x) = j f j (y) k β jk f k (x)), the numbers above names are log-likelihoods.The 'common' is the model built for combined all data -presents general trends.Trying to split all companies into subsets of similar behavior, as visualized in tree Fig.6, splitting into two subsets we get the presented comL and comR models correspondingly for left (DPWG, BEIG, HNKG, FME, SAPG, DB1G, RWEG, FREG, HEIG, DTEG, IFXG) and right (DAIG, SIEG, TKA, CONG, MRCG, LHAG, VOWG, MUVG, ALVG, BMW, DBKG) subtree of this tree.Then there presented individual models for 5 selected companies.Rows correspond to predicted moments of Y , as a linear combinations of mixed moments of X corresponding to columns.The zeroth row has always only 000 nonzero coefficient equal 1 for normalization.The next row describes prediction of expected value, the next one of variance and so on.In the top model, common for all companies, we can e.g.see large positive 001 → 1 coefficient: spread increases with growth of H − L, negative 010 → 1: spread decreases with growth of V , and negative 011 → 2: variance of spread decreases for correlated V and H − L. Blue 100 → 3 for FREG denotes reduction of skewness of spread with growth of price.Generally, we can see quite individual behavior for different companies, starting with 100 → 1 analogous to price-spread correlation, which seems the main dividing factor between comL and comR companies.

Figure 5 .
Figure 5. Top: Optimizing basis and model size on example of FREG company and B X = B((4, 4, 4), 5, 3) size 53 basis of mixed moments from '123' model.Log-likelihoods for predicting first 1 . ..10 moments (denoted by colors) using some first of mixed moments (sorted lexicographically) of 3 X variables: P, V, (H − L)/P .We can see that we should predict ≈ 8 moments, higher moments are necessary to represent more complex distributions.Middle: selective removal of successive mixed moments to maximize log-likelihood -we can see that we can slightly improve evaluation this way, additionally reducing model size.However, it rather requires individual optimization for each company.Bottom: analogously as top, but using size 181 larger B X = B((5, 5, 5), 10, 3), also trying different order of mixed moments: accordingly to i (j i ) p .While using all such mixed moments clearly leads to overfitting, selectively using some first of them can lead to slightly improved evaluation.

10
Figure 5. Top: Optimizing basis and model size on example of FREG company and B X = B((4, 4, 4), 5, 3) size 53 basis of mixed moments from '123' model.Log-likelihoods for predicting first 1 . ..10 moments (denoted by colors) using some first of mixed moments (sorted lexicographically) of 3 X variables: P, V, (H − L)/P .We can see that we should predict ≈ 8 moments, higher moments are necessary to represent more complex distributions.Middle: selective removal of successive mixed moments to maximize log-likelihood -we can see that we can slightly improve evaluation this way, additionally reducing model size.However, it rather requires individual optimization for each company.Bottom: analogously as top, but using size 181 larger B X = B((5, 5, 5), 10, 3), also trying different order of mixed moments: accordingly to i (j i ) p .While using all such mixed moments clearly leads to overfitting, selectively using some first of them can lead to slightly improved evaluation.

Figure 6 .
Figure 6.Visualization of optimized hierarchical grouping and loss of using common models for multiple companies, height denotes loglikelihoods.Heights of names show evaluation of using individual model for a given company, orange dots show successive reduction of loglikelihood for a given company while using common models for subsets growing accordingly to the presented tree.The lowest dots correspond to using one common model for all (common in Fig.4) -we can see that only for DTEG it is worse than zero (using no prediction at all).Splitting companies into left and right subtree and using separate two models for them (comL and comR in Fig.4), we get essentially better prediction (one dot up).The tree structure was calculated by combining subsets to maximize (log-likelihood of common model / average loglikelihood of individual models) -grouping companies into pairs and then further, up to a single common model for all.Positions of lines represent such grouped companies: light-gray line their averaged log-likelihoods of individual models, dark-gray line their log-likelihood for a common model.The difference between these two lines represent loss of using common model.The common models are fixed hence there is no cross validation (CV), what artificially improves performance, for example for the first dot of FME corresponding to common model with HNKG -making it above CV individual model, generally suggesting large time inhomogeneities -to be included in future adaptive models.

Figure 7 .
Figure 7.Comparison of spread predictors on dataset for visual evaluation: perfect predictor would give a diagonal, a completely useless one would give uniform distribution.All variables are normalized to nearly uniform marginal distributions, including outcomes of standard methods: AM I, HLR.The following 3 columns use expected values of predicted densities from discussed '123' model (using P, V, (H − L)/P variables, 8 • 53 = 424 coefficients), the last one is for '123+' model (using also depth, midpoint changes intraday and midpoint volatility, 6 × 205 = 1230 coefficients).The "1 common" column uses 1 model for all, "2 common" groups companies into two subsets and uses one of two models (as in Fig.6, using models comL, comR from Fig.4).The last two columns use models individually optimized for each company.