Behind every biological system, there is a program governing the expression of different genes in the genome. In normal phenotypes, these programs will lead to a state of homoeostasis. Nevertheless, alterations to the regulatory mechanisms may lead to pathological conditions. One example of these pathologies is breast cancer.
Gene regulatory programs (GRPs) are composed of several genomic and epigenomic mechanisms interacting in a complex, nonlinear fashion. In recent years, genomic technologies have allowed the study of these systems. Such massive data may be studied and analyzed through the use of computational methods derived from Information Theory, in order to identify large-scale features related to the organization of the system.
We have studied the deregulation of GRPs in breast cancer in terms of the influence of physical distance between genes on coexpression. Our recent work has shown that there is a loss of coexpression between genes that are farther apart, either in different chromosomes (sometimes known as trans-regulation) or at greater distances within the same chromosome (sometimes known as cis-regulation) [4].
In this work, we characterize the relationship between intrachromosomal gene coexpression and distance. We identify a distance-dependent gene coexpression decay in a breast cancer phenotype, a phenomenon not observed in healthy breast tissue. We fit these relationships to known models for heavy-tailed distributions. Through a model discrimination analysis, we observe that the distance-dependent gene coexpression decay may exist in an intermediate regime between a power law and a Weibull distribution, which may have important implications in terms of the organization of the GRP in health and disease.
One relevant problem in contemporary computational biology is the probabilistic inference of the best (i.e., the maximum-likelihood or maximum-entropy) set of regulatory interactions between genes starting from a large but still partial data corpus, as given, for instance, by RNA sequencing experiments over whole-genome transcriptomes. We will call this problem gene regulatory program deconvolution (GRPD). Solving the GRPD problem involves large-scale probabilistic inference in highly noisy data sets, and it thus remains a challenge for common probabilistic modelling approaches.
In order to circumvent these limitations, a number of algorithms, some of them with information theoretical foundations, have been developed. These approaches include mutual information maximization, Markov random fields, use of the data processing inequality, minimum description length, and Kullback–Leibler divergence. Also relevant to the GRPD problem are machine learning techniques, as well as Monte Carlo methods, variational methods, and hidden Markov models or stochastic linear dynamical systems. Common to these latter models is the fact that, conditioned on the hidden state vector, the past, present, and future observations are statistically independent [8].
Approaches with information theoretical foundations are useful for the tasks of feature selection and network inference. Feature selection methods applied to transcriptomics aim to improve molecular diagnosis and prognosis in complex diseases (such as cancer) by identifying a minimum set of features (a molecular signature) that characterizes the underlying biological phenomenon. Network inference, in contrast, usually tries to present the full set of statistical dependencies between genes by means of a probabilistic graphical model of a gene regulatory network.
A first step towards solving either the feature selection or network inference subproblems of a GRPD is to have a detailed knowledge of the joint and marginal gene expression probability distributions. In what follows, we present an information theoretical approach to the GRPD problem via mutual information distributions.
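As a rough illustration of how joint and marginal distributions enter a pairwise mutual information estimate, the following Python sketch uses a simple histogram (plug-in) estimator on synthetic expression vectors. The estimator, the number of bins, the function names, and the simulated data are all assumptions made here for illustration; they are not the estimator or the data used in this work.

```python
import numpy as np


def mutual_information(x, y, bins=8):
    """Plug-in estimate of I(X;Y) in nats from a 2D histogram (illustrative only)."""
    joint_counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint_counts / joint_counts.sum()   # joint distribution
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal of X, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal of Y, shape (1, bins)
    independent = p_x @ p_y                    # product of marginals
    nonzero = p_xy > 0                         # empty cells contribute zero
    return float(np.sum(p_xy[nonzero] * np.log(p_xy[nonzero] / independent[nonzero])))


def mutual_information_matrix(expr, bins=8):
    """expr has shape (n_genes, n_samples); returns a symmetric MI matrix."""
    n_genes = expr.shape[0]
    mi = np.zeros((n_genes, n_genes))
    for i in range(n_genes):
        for j in range(i, n_genes):
            mi[i, j] = mi[j, i] = mutual_information(expr[i], expr[j], bins)
    return mi


# Synthetic example: gene 2 is coexpressed with gene 1, gene 3 is independent.
rng = np.random.default_rng(0)
g1 = rng.normal(size=500)
g2 = g1 + 0.3 * rng.normal(size=500)
g3 = rng.normal(size=500)
expr = np.vstack([g1, g2, g3])
mi = mutual_information_matrix(expr)
```

In this toy setting the entry for the coexpressed pair should be clearly larger than the entry for the independent pair. The off-diagonal entries of such a matrix are the quantities whose dependence on gene–gene distance is examined in the rest of the text.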
Let
Here,
We can also define the off-diagonal mutual information,
Here (i
A GRP is the solution of a GRPD problem, i.e., the full set of interactions among genes that give rise to a transcriptional phenotype. Within the context of the theoretical and experimental settings we have just described, let us define what the solution of a GRPD problem is.
Definition 1. Here we define a gene regulatory program as the functional
Biological phenotypes are the result of a large set of regulatory interactions that are controlled by diverse genomic and epigenomic elements. Perturbation of these elements is involved in the origin and maintenance of the pathological phenotype observed in cancer. One of these elements is the spatial configuration of the genome. As we have previously mentioned, in a recent work [4], we have shown the existence of a major difference in the relationship between gene interactions and physical distance between breast cancer and healthy breast phenotypes.
Starting from the solution
The functional
By analyzing
Visual inspection, however, may mislead us into attributing power-law behaviour to other long-tailed data distributions [2], and even common regression techniques may prove deceptive when establishing functional relationships in heterogeneous variance settings [7, 25]. For these reasons, we decided to implement a comprehensive approach to model discrimination analysis.
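The general idea of discriminating between candidate decay laws, rather than judging by eye, can be sketched in Python as follows. This is a deliberately simplified stand-in: it fits two candidate curves by nonlinear least squares in log space with `scipy.optimize.curve_fit` and compares them with a Gaussian-error Bayesian information criterion. The synthetic data, the parameter values, the distance units, and the function names are hypothetical, and the procedure is not the one used in this work.

```python
import numpy as np
from scipy.optimize import curve_fit


def log_power_law(d, log_a, alpha):
    # log of a * d**(-alpha)
    return log_a - alpha * np.log(d)


def log_weibull(d, log_a, scale, shape):
    # log of a * exp(-(d / scale)**shape), a Weibull-type (stretched exponential) decay
    return log_a - (d / scale) ** shape


def bic(y, y_hat, n_params):
    """BIC under Gaussian errors, up to an additive constant shared by all models."""
    n = y.size
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + n_params * np.log(n)


# Hypothetical data: Weibull-type decay with multiplicative noise.
rng = np.random.default_rng(1)
distance = np.linspace(0.1, 50.0, 400)  # arbitrary units (assumption)
signal = 0.5 * np.exp(-(distance / 10.0) ** 0.7)
observed = signal * np.exp(0.05 * rng.normal(size=distance.size))
log_obs = np.log(observed)

# Each entry: (model, initial guess, lower bounds, upper bounds).
models = {
    "power law": (log_power_law, [0.0, 0.5], [-np.inf, 0.0], [np.inf, np.inf]),
    "Weibull": (log_weibull, [0.0, 5.0, 1.0], [-np.inf, 1e-6, 1e-6], [np.inf, np.inf, np.inf]),
}

scores = {}
for name, (model, p0, lower, upper) in models.items():
    params, _ = curve_fit(model, distance, log_obs, p0=p0, bounds=(lower, upper))
    scores[name] = bic(log_obs, model(distance, *params), len(params))

best = min(scores, key=scores.get)  # the lower BIC is preferred
```

Fitting in log space keeps the residuals roughly homoscedastic when the noise is multiplicative, which is the simplest way to respect the heterogeneous-variance caveat raised above. On real data the candidate set would also include the other long-tailed families, and the variance itself could be modelled explicitly.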
To do this, we modelled the chromosomal gene–gene distance
In the present context, the use of GAMLSSs allows us to perform differential goodness-of-fit analyses of the chromosome-wise gene–gene mutual information distributions. The data were fitted to the following models: power law (PL), log-normal (LN), Weibull (WB), and linear exponential (LE), using polynomials up to order 4. This analysis generated 4
Once we have validated more formally the loss of long-range gene regulatory interactions in breast cancer, it is pertinent to analyze the distance dependence of the correlations. This may allow us to determine whether there is a preferred
The inference of the chromosome-wise GRPs
Figure 1 shows the full mutual information distributions for interactions in basal breast cancer and healthy breast tissue for chromosomes 1, 10, and 18. The strength of gene–gene correlations decays dramatically with distance in cancer, whereas in the healthy distribution those values remain almost constant. The regulatory program in cancer is thus strongly altered (something that is, of course, widely known). One possible contribution to this deregulation relates to spatial chromosome rearrangements (i.e., changes in the 3D structure of chromatin).
In this regard, Figure 1 shows a long-tail decay of gene–gene mutual information correlations in the basal tumour distribution for chromosomes 1, 10, and 18 (upper panels), whereas a fundamentally constant (i.e., distance-independent) behaviour is observed in the nontumour distributions (lower panels).
As can be observed in Figure 1 and in Supplementary materials 1, the decay in the tumour intrachromosomal relationships occurs in a similar fashion in the different chromosomes. However, as can be seen in Figure 2 and in Supplementary materials 2, there are subtle differences in the decay regimes among chromosomes in breast cancer transcriptomes.
The model discrimination analysis we performed indicates that the best goodness of fit (assessed through a combination of extensive coefficient of determination [R2] calculations and Bayesian information content predictors [16]; see Supplementary materials 2) corresponds to either the power law or the Weibull distribution (see Figure 2). The indicators for these distributions come so close that it is not unreasonable to consider that the
Power-law decay of spatial correlations is a well-established phenomenon in physical and biological settings [17,22]. Power-law decay may result from the action of a generalized central limit theorem [5] for multiplicative growth processes [13,15,18,20], which may [26,27] or may not [21] result from self-organization processes.
The Weibull distribution has been extensively used in recent times to characterize a variety of skew-distributed phenomena in the physical, biological, and economic sciences [9, 11, 19]. As in the case of the power law, the Weibull distribution may arise as a consequence of generalized central limit theorems, but in this case for branching, rather than multiplicative growth, processes, as has been proved by Jo et al. [9] from the asymptotics of the Galton–Watson branching process of simple replicative systems. The authors have also shown that the branching process can be mapped into a process of aggregation of clusters. Weibull distributions may also arise in the context of percolation theory. Sornette [20] has shown how the existence of intermittency in continuum percolation changes the distribution from extreme exponential to a smoother Weibull-like form.
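For reference, the two candidate decay laws can be written in their textbook forms. The symbols $C$, $\alpha$, $\lambda$, and $k$ are notation introduced here for illustration and are not necessarily the parameterization used in the fits.

```latex
% Power-law decay with exponent \alpha > 0
f_{\mathrm{PL}}(d) = C\, d^{-\alpha}

% Weibull survival (stretched exponential) form with scale \lambda > 0 and shape k > 0
S_{\mathrm{WB}}(d) = \exp\!\left[-\left(\frac{d}{\lambda}\right)^{k}\right]

% Linearizing transforms
\ln f_{\mathrm{PL}}(d) = \ln C - \alpha \ln d
\qquad
\ln\!\left[-\ln S_{\mathrm{WB}}(d)\right] = k \ln d - k \ln \lambda
```

A power law is therefore a straight line of slope $-\alpha$ on log–log axes, whereas the Weibull form becomes linear in $\ln d$ only after the double-logarithm transform. For small $k$ the stretched exponential bends only slowly on log–log axes, which may be one reason the two families are hard to separate over a finite range of distances.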
By considering the constructive processes associated with fat-tailed probability distributions (in particular, power law and Weibull-like decay), we can hypothesize that the behaviour of gene–gene correlations in breast cancer tumours may arise from a combination of multiplicative growth, branching, and clustered aggregation processes. These hints are relevant for designing experimental strategies to probe the actual biological and physical processes behind the dramatic changes in the gene regulation patterns in cancer.
We have examined the phenomenon of decay of intrachromosomal regulation in breast cancer from an information theoretical perspective. In this work, we performed a systematic study of the phenomenon for each chromosome in breast cancer. By considering the whole GRP, defined in terms of mutual information, without resorting to any form of thresholding, binning, or any other method of feature selection or aggregation, we were able to model the relationship of intrachromosomal gene coexpression in terms of distance using the computationally demanding GAMLSS approach.
Comprehensive model discrimination analysis allowed us to identify that the distance-dependent gene coexpression decay lies in an intermediate regime between power law and Weibull distributions. This allows us to hypothesize that the divergence found between the healthy breast phenotype and breast cancer in terms of distance-dependent gene coexpression may arise from a combination of multiplicative growth, branching, and aggregation processes in the regulatory program. It can be argued that changes in the chromosomal structure, as well as changes in the topologically associating domains in DNA, epigenomic phenomena, and chromosomal aberrations, may be among the more likely phenomena behind the loss of long-range regulation in cancer. However, a great deal of experimental work, as well as theoretical and computational data analysis, must be done before arriving at a definite answer to this conundrum.