
Learning Context-based Embeddings for Knowledge Graph Completion



Introduction

Knowledge graphs (KGs) are becoming a foremost driving force enabling Artificial Intelligence (AI). Like human brains, KGs will serve as the brains of machines that connect to each other, perform cognitive inference and, most importantly, derive insights from vast amounts of heterogeneous data. KGs are built for many AI-related applications, for instance, recommender systems (Li et al., 2019), question answering (Hao et al., 2017), semantic search and ranking (Gjergji et al., 2018; Xiong, Russell, & Jamie, 2017), and reading comprehension (Yang & Tom, 2017). However, regardless of how a KG is created, the data extracted for the initial KG is almost never complete. Predicting the missing links of the resulting graph is therefore an important step, since finding all valid triples manually is costly. This problem is known as knowledge graph completion.

Many techniques have attempted to model relation patterns, and KG embedding has so far proved to be a powerful approach for KG completion. TransE (Antoine et al., 2013) is a well-known embedding method which views every relation as a translation from subject to object. For a factual triple (h,r,t), r is viewed as a translation vector r such that the embedding vector h plus r should be approximately equal to t. Thus, the scoring function is ψ(h,r,t) = ||h + r − t||. Following this translational schema, multiple models have been presented that extend TransE with entity or relation projections, translating the head entity into the tail entity in either the entity space or the relation space. TransH (Wang et al., 2014) proposes translations on hyperplanes, projecting entities onto a relation-specific hyperplane to handle one-to-many, many-to-one and many-to-many relations effectively. TransD (Ji, He, & Xu, 2015) decomposes the two projection matrices into products of two mapping vectors each. These matrices depend on both entities and relations, which lets the projection vectors interact sufficiently. KGs are characterised by heterogeneity: some relations link many entity pairs, while others link only a few. TranSparse (Ji et al., 2016) addresses this with adaptive sparse projection matrices, where the number of entity pairs linked by a relation determines the sparseness of its projection matrices.

To model pairwise interactions between entities and relations, DistMult (Yang et al., 2015) represents each relation as a matrix. For a valid triple (h,r,t), its score is defined as ψ(h,r,t) = hTMrt. To simplify ψ(h,r,t), DistMult restricts the matrix Mr to be diagonal. However, modeling compositional reasoning over relations, which carry richer latent semantics, requires a non-diagonal matrix Mr. ComplEx (Trouillon et al., 2016) uses complex-valued rather than real-valued embeddings to better model asymmetric relations. Rotate (Sun et al., 2019) views each relation as a rotation from the source entity to the target entity in complex vector space. The vector of the target entity t equals the Hadamard product of the vectors of the source entity h and the relation r, that is, t = h ○ r, where ○ is the element-wise Hadamard product. Rotate can capture symmetry, antisymmetry, inversion as well as composition patterns. HAKE (Zhang et al., 2020) represents semantic hierarchies by mapping entities into a polar coordinate system, where the modulus information represents different levels of the hierarchy and the phase information distinguishes entities at the same level of the hierarchy.

Recently, many attempts (Chang et al., 2014; Liu, Wu, & Yang, 2017; Minervini et al., 2017; Zhong et al., 2015, 2017) have been made to incorporate extra information beyond KG triples. RUGE (Guo et al., 2018) uses the AMIE+ tool (Galarraga et al., 2015) to extract soft rules from the training sets. The embeddings of entities and relations are learned both from labeled triples in a KG and from unlabeled triples whose labels are learned iteratively. As suggested by RUGE, length-1 and length-2 rules with PCA confidence higher than 0.8 are utilized. ComplEx-NNE+AER (Ding et al., 2018) uses two types of constraints: approximate entailment constraints for representing relations and non-negativity constraints for representing entities. The non-negativity constraints aim to learn simple, understandable entity representations, while the approximate entailment constraints further encode logical regularities into relation representations. Such constraints impose prior knowledge introduced by association rules on the embedding space and thereby improve embedding learning. LineaRE (Peng & Zhang, 2020) represents a relation as a linear function of the two entities, viewing KG embedding as a linear regression task. That is, LineaRE represents a relation as two weight vectors and a bias vector. This model is simple and has powerful reasoning ability.

The work most relevant to ours is SimplE (Seyed & David, 2018). To model the different position contexts of each relation, SimplE takes two vectors he and te as two position-dependent embeddings of each entity e. he captures the forward impact of r on e when e is at the head of r; similarly, te captures the backward impact of r on e when e is at the tail of r. To let information flow between he and te of an entity e, SimplE additionally uses the inverse of each relation. Let he,te ∈ ℝd be the embeddings of entity e when e is in the head or the tail of a relationship, respectively. Let vr ∈ ℝd be the embedding of relation r and νr−1 ∈ ℝd the embedding of its inverse relation r−1. In SimplE, the scoring function ψ of a fact (ei,r,ej) is defined as the average of two terms: < hei, νr, tej >, the score of (ei,r,ej), and < hej, νr−1, tei >, the score of (ej,r−1,ei). That is, ψ(ei,r,ej) = (< hei, νr, tej > + < hej, νr−1, tei >) / 2. SimplE still falls short of the expected outcome, however, because an entity can be multifaceted, connecting to different neighborhoods. SimplE only takes into account the position information of each entity e, i.e., whether e is in the head or in the tail of relation r. Each entity has two embedding vectors, representing the forward and the backward context of a relation, respectively, and he and te are usually unequal when entity e appears in different positions. However, in addition to the position within a relation, the different relations that an entity is associated with are also key context for that entity. For instance, consider two facts <e1,r1,e2> and <e1,r2,e3>, where e1, r1, and e2 stand for “Bank”, “issued”, and “bonds”, respectively, and e1, r2, and e3 stand for “Bank”, “played”, and “basketball”, respectively. SimplE encodes e1 with the same embedding vector even though e1 connects to two different relations r1 and r2. The meaning of the head entity e1 for relation r1 is obviously different from its meaning for relation r2, yet the two meanings of the polysemous entity e1 share one embedding vector. Hence, SimplE ignores the relational context, which could otherwise facilitate learning KG embeddings.

To handle the aforementioned issues, one must consider both the relations that an entity connects to and the position within a relation at which the entity occurs when encoding the entity. We present ContE, a contextualized embedding model for the link prediction task in KGs. To capture the multiple meanings of entities, it is necessary to distinguish entities in two circumstances: (1) entities in different positions of the same relation, and (2) entities in different relations. The forward and backward impacts of a relationship r in ContE are mapped to two different embedding vectors fr and br, which represent the contextual information of the relationship. Given a triple <e1,r,e2>, we define its scoring function as ψ(e1,r,e2) = < fr + νe1, br + νe2 >, where vr = (fr,br). Because fr + ve ≠ br + ve in general, the embedding of entity e differs between the two positions of relation r. Similarly, for triples <e1,r1,t1> and <e1,r2,t2>, the embedding of entity e1 in triple <e1,r1,t1> differs from the one in triple <e1,r2,t2>, since fr1 + νe1 ≠ fr2 + νe1 in general. For triples <e1,r1,t1> and <e2,r2,t1>, we reach a similar conclusion: the embedding of entity t1 in triple <e1,r1,t1> usually differs from the one in triple <e2,r2,t1>, because br1 + νt1 ≠ br2 + νt1. Therefore, ContE considers two different relational contexts when learning entity embeddings.

The parameters and scoring functions of SOTA models and of the ContE model are summarized in Table 1. The notation is as follows: < νh, νr, νt > = ∑i νhi νri νti denotes the tri-linear dot product of vectors. ||u||q denotes the q-norm of the vector u. ○ denotes the element-wise (Hadamard) product. * is a convolution operator. • denotes the dot product of vectors. g denotes an activation function used in neural networks. concat is the concatenation operation on vectors. Ω denotes a set of filters. ν̂ denotes a 2D reshaping of the vector ν.

Parameters and scoring functions in SOTA baselines and in ContE model.

Model Scoring function ψ(e1,r,e2) Parameters
TransE ||νe1 + νrνe2||p νe1, νr, νe2 ∈ ℝd
ComplEx Re(< νe1, νr, ν̄e2 >) νe1, νr, νe2 ∈ ℂd
SimplE (< he1, νr, te2 > + < he2, νr−1, te1 >) / 2 he1, νr, te2, he2, νr−1, te1 ∈ ℝd
ConvE g(vec(g(concat(ν̂e1, νr) * Ω))W) · νe2 νe1, νr, νe2 ∈ ℝd
ConvKB concat(g([νe1, νr, νe2] * Ω)) · w νe1, νr, νe2 ∈ ℝd
Rotate ||νe1 ○ νr − νe2|| νe1, νe2, νr ∈ ℂd
HAKE ||νe1m ○ νrm − νe2m||2 + λ ||sin(νe1p + νrp − νe2p)||1 νe1m, νe2m ∈ ℝd, νrm ∈ ℝ+d, νe1p, νrp, νe2p ∈ [0, 2π)d
ContE < fr + νe1, br + νe2 > νe1, νe2, fr, br ∈ ℝd
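To make the notation in Table 1 concrete, the following minimal numpy sketch evaluates a few of the listed scoring functions on toy vectors. The function names, toy vectors, and plain-numpy style are assumptions made for illustration; they are not the implementations released with the original papers.

```python
# Toy numpy sketch of several scoring functions from Table 1 (illustrative only).
import numpy as np

def score_transe(v_h, v_r, v_t, p=1):
    # TransE: negative p-norm of the translation residual h + r - t
    return -np.linalg.norm(v_h + v_r - v_t, ord=p)

def score_distmult(v_h, v_r, v_t):
    # DistMult: tri-linear dot product <h, r, t> = sum_i h_i * r_i * t_i
    return np.sum(v_h * v_r * v_t)

def score_complex(v_h, v_r, v_t):
    # ComplEx: real part of <h, r, conj(t)> with complex-valued embeddings
    return np.real(np.sum(v_h * v_r * np.conj(v_t)))

def score_simple(h_e1, v_r, t_e2, h_e2, v_r_inv, t_e1):
    # SimplE: average of the forward and inverse tri-linear scores
    return 0.5 * (np.sum(h_e1 * v_r * t_e2) + np.sum(h_e2 * v_r_inv * t_e1))

d = 4
rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, d))
print(score_transe(h, r, t), score_distmult(h, r, t))
```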

It is worth mentioning that the previous approaches are static, i.e., there is only one embedding vector for each entity regardless of its context. This poses a notable problem: all meanings of a polysemous entity share one embedding vector. As shown in (Devlin et al., 2019; Vaswani et al., 2017), replacing static word embeddings with contextualized word representations has achieved great improvements on multiple natural language processing (NLP) tasks. This success suggests that considering the contexts of entities will be beneficial for learning KG embeddings. We previously proposed the KGCR model (Pu et al., 2020), which encodes the contextual information of a relation through its forward and backward embedding vectors; this helps capture an entity's different semantics when it appears in different relations or in different positions of a relation. However, KGCR uses complex-valued representations for entities and relations via a specially defined operation between two complex vectors, so it has high complexity and is not fully expressive. Additionally, KGCR cannot model the composition pattern of relationships, and the performance of compositional inference largely depends on the ability to model compositional relationships. To overcome these shortcomings, we propose the ContE model to improve on KGCR. ContE encodes entities and relations in real vector space with lower complexity than KGCR, is fully expressive, and can model compositional relationships.

In sum, the contributions of this paper are threefold:

We propose the ContE model—a novel polysemous entity embedding method with consideration of relational contexts for KG completion.

ContE is fully expressive; that is, given any ground truth over the triples, there are embedding assignments to entities and relations that precisely separate the true triples from the false ones.

ContE is a bilinear model. Its number of parameters grows linearly with the number of entities and relations and with the dimension of the embedding vectors. ContE outperforms existing SOTA embedding models on benchmark datasets for KG completion.

We organize the rest of this paper as follows. In Section 2, we summarize related work. In Section 3, we detail the ContE model. In Section 4, we present experiments and results. Finally, in Section 5, we draw conclusions.

Related work

Link prediction with KG embedding based approaches has been extensively explored in recent years, and plenty of approaches have been designed for this task. Integrating deep neural networks with KGs has attracted wide attention in many NLP tasks, for instance, relation extraction (Ji et al., 2017; Zhang et al., 2019), language modeling (Liu et al., 2020; Sun et al., 2019), and knowledge fusion (Dong et al., 2014). NTN (Socher et al., 2013) is the first neural network based approach for reasoning over relationships between two entities. NTN utilizes a neural tensor network for relation classification and represents entities with word vectors trained by pre-trained models such as Word2Vec. ConvE (Dettmers et al., 2018) first uses a CNN model for KG completion. It is a CNN with three layers: the lowest is the convolution layer performing 2D reshaping, the middle is the projection layer, and the top is the inner product layer. The 2D convolution can capture more interactions between different dimensional entries of vh and vr, concentrating on their local relationships. Instead of extracting local relationships between h and r, ConvKB (Nguyen et al., 2018) applies the convolution over same-dimensional entries of vh, vr and vt so as to extract their global relationship. Graph Convolutional Networks (GCNs) are based on CNNs and graph embedding; they operate on the graph structure to collectively aggregate information from the graph. R-GCN (Schlichtkrull, Thomas, & Bloem, 2018) first applies GCNs to relational data. For the missing link prediction task it acts as an auto-encoder: R-GCN produces latent entity representations like an encoder, and a tensor factorization model uses these entity representations to predict relation labels like a decoder. For the entity classification task, R-GCN introduces linear relation-specific transformations to accumulate feature embeddings from neighboring nodes. However, R-GCN does not outperform CNN based models such as ConvE on the link prediction task. CapsE (Nguyen et al., 2019) uses a capsule network with a deep architecture to model the entries of triple-based data at the same dimension. It can effectively represent complex relationships such as many-to-many relationships. KBAT (Nathani et al., 2019) focuses on the multi-hop neighborhood of each entity, which can capture multi-hop and similar relations. By extending the graph attention mechanism to entities and relations within each entity's neighborhood, KBAT collects multi-hop relation information from the n-hop neighborhood of the entity, which is particularly suitable for relation prediction tasks. It is worth noting that, as described in (Sun et al., 2020), some neural network-based methods such as ConvKB, CapsE and KBAT use inappropriate evaluation protocols, so their reported performance greatly exceeds other SOTA baselines. The main reason is that when multiple triples receive the same score from the model, these protocols place the correct triple at the beginning of the candidate set, whereas the correct triple should be placed randomly within the candidate set.

PTransE (Lin et al., 2015) utilizes relation paths to learn entity and relation representations. It considers not only one-step paths but also multi-step paths. Since there may be many relation paths between two entities, PTransE uses a resource-allocation algorithm to estimate the importance score of each path and selects the relation paths whose importance score exceeds a certain threshold. Three encodings of a relation path are used for relation representations: multiplication of the relation representations on the path, addition of the relation representations on the path, and a recurrent neural network-based relation representation. PTransE can make full use of local path information, but it falls short of the expected outcome because it only uses relation paths to enhance representation learning and still neglects the relational contexts of entities. Since many entities link to only a few relations (a.k.a. long-tailed entities), RSNs (Guo, Sun, & Hu, 2019) present recurrent skipping networks to learn the embeddings of long-tailed entities. Unlike other models, head entities can predict both their subsequent relations and their tail entities by skipping their connections directly. Relation path based learning methods such as PTransE suffer from high computational cost, and the length of a relation path is limited to at most three steps. RSNs leverage biased random walks to model deep paths, which can capture more relational dependencies. It is worth mentioning that RSNs only use relational paths to learn KG embeddings, whereas the textual information of relations is still neglected.

So far, most existing approaches are static. Static representation learning has limitations: it learns entity and relation embeddings from a single-triple view, and the contexts of relations are neglected. On the one hand, an entity can appear in two different positions of a relation. On the other hand, entities connect to different relational neighbors, i.e., different entities may link to different relations. Hence, exploiting the interactions between entities and relations is important for KG embedding.

ContE model

Notation Let ℛ be the relation set and ℰ the entity set. Let 𝒲 ⊂ ℰ × ℛ × ℰ represent the set of factual triples (i.e., triples true in a world), and let 𝒲𝒞 be the complement of 𝒲. A KG 𝒢 = {(e1,r,e2)} ⊆ 𝒲 is a subset of 𝒲. Let 𝒢′ be the complement of 𝒢. Each triple of 𝒢 is denoted by (head entity, relation, tail entity). KG completion is formalized as the following problem: for any triple (e1,r,e2), infer from 𝒢 whether (e1,r,e2) ∈ 𝒲 (or (e1,r,e2) ∈ 𝒲𝒞). For each triple (e1,r,e2), a scoring function ψ: ℰ × ℛ × ℰ ↦ ℝ is defined to measure its plausibility. Triples of 𝒢 should receive higher scores than those not in 𝒢.

We use TransE's procedure to generate negative triples. For each positive triple (e1,r,e2) ∈ 𝒢, we generate negative triples by substituting either entity e1 or entity e2 with an entity e1′ (or e2′) randomly selected from ℰ − {e1} (or ℰ − {e2}). Let 𝒢− be the set of negative triples. Then
𝒢− = {(e1′, r, e2) | e1′ ∈ ℰ ∧ e1′ ≠ e1 ∧ (e1, r, e2) ∈ 𝒢} ∪ {(e1, r, e2′) | e2′ ∈ ℰ ∧ e2′ ≠ e2 ∧ (e1, r, e2) ∈ 𝒢}
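A minimal sketch of this negative sampling procedure, assuming triples are stored as (head, relation, tail) tuples; the helper name and toy data are illustrative assumptions.

```python
# Generate negatives by corrupting the head or the tail of a positive triple.
import random

def corrupt_triple(triple, entities, corrupt_head=True):
    h, r, t = triple
    original = h if corrupt_head else t
    replacement = random.choice([e for e in entities if e != original])
    return (replacement, r, t) if corrupt_head else (h, r, replacement)

entities = ["Bank", "bonds", "basketball", "e4"]
positive = ("Bank", "issued", "bonds")
negatives = [corrupt_triple(positive, entities, corrupt_head=flag) for flag in (True, False)]
print(negatives)
```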

Note that negative triples can also be produced by randomly corrupting relations. However, this way of generating negative triples may introduce false negatives, that is, triples that are actually positive.

When taking the contextual circumstance of an entity into account, we must consider both the relations that the entity connects to and the position of a relation at which the entity occurs. We propose an embedding model based on relational contexts, named ContE, for KG completion. In the ContE model, each relation r is divided into two vectors fr and br, representing r's forward and backward impacts, respectively. The scoring function of ContE is as follows: for any triple (h,r,t), ψ(h,r,t) = < fr + νh, br + νt >, where vh ∈ ℝd, vt ∈ ℝd and vr = (fr,br) ∈ ℝ2d are the respective embeddings of h, t and r. vr is the concatenation of fr and br, where fr ∈ ℝd and br ∈ ℝd represent the forward and backward context vectors of the relation r, respectively, and d is the dimension of the embedding vectors. The bilinear product < fr + νh, br + νt > = ∑i=1..d [fr + νh]i [br + νt]i, where [x]i is the i-th entry of vector x.
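The following short numpy sketch shows how the ContE score can be computed from an entity embedding table and a relation embedding table whose rows concatenate fr and br; the initialisation and variable names are assumptions made for illustration.

```python
# ContE scoring: psi(h, r, t) = <f_r + v_h, b_r + v_t> (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
n_ent, n_rel, d = 5, 3, 4
E = rng.normal(size=(n_ent, d))       # entity embeddings v_e
R = rng.normal(size=(n_rel, 2 * d))   # relation embeddings v_r = (f_r, b_r)

def conte_score(h_idx, r_idx, t_idx):
    f_r, b_r = R[r_idx, :d], R[r_idx, d:]
    return np.sum((f_r + E[h_idx]) * (b_r + E[t_idx]))

print(conte_score(0, 1, 2))
```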

For each relation r, fr and br capture both the position information and the context of r. The role of an entity that appears at the head of relation r differs from its role at the tail of relation r; this follows from fr + ve ≠ br + ve in general. On the other hand, the same entity can also play different roles at the same position in different relations.

Optimizing model

After defining the scoring function, the model can be optimized. One can use a negative sampling loss, similar to (Mikolov et al., 2013), to effectively minimize a margin-based objective:
ℒ = −∑k=1n (1/d) [log σ(ψ(hk′, r, tk) − γ) + log σ(ψ(hk, r, tk′) − γ)] − log σ(γ − ψ(h, r, t))
where σ is the sigmoid function, d is the dimension of the entity embeddings, γ is a fixed margin, and (hk′, r, tk) and (hk, r, tk′) are the k-th corrupted triples.

We draw negative samples according to the following distribution:
p((hk′, r, tk′) | (hi, r, ti)) = softmax(ψ(hk′, r, tk′)) = exp ψ(hk′, r, tk′) / ∑i exp ψ(hi′, r, ti′)
where (hk′, r, tk′) stands for either (hk′, r, tk) or (hk, r, tk′). The loss function with weighted negative samples is then:
ℒ = −∑k=1n [p(hk′, r, tk) log σ(ψ(hk′, r, tk) − γ) + p(hk, r, tk′) log σ(ψ(hk, r, tk′) − γ)] − log σ(γ − ψ(h, r, t)) + λ ||θ||22

Here, θ denotes the parameters of the model, that is, the entity and relation embeddings. ||θ||22 is the regularization term of the model, and λ is the corresponding regularization coefficient.
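A hedged numpy sketch of this objective, combining the softmax weights p(·), the margin γ, and the L2 regularisation term; the argument names are assumptions, and in practice the weights are usually treated as constants during back-propagation.

```python
# Weighted negative-sampling loss for one positive triple and its n negatives.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conte_loss(pos_score, neg_scores, params, gamma=6.0, lam=0.0):
    neg_scores = np.asarray(neg_scores, dtype=float)
    w = np.exp(neg_scores - neg_scores.max())        # softmax weights p(.)
    w = w / w.sum()
    neg_term = -np.sum(w * np.log(sigmoid(neg_scores - gamma)))
    pos_term = -np.log(sigmoid(gamma - pos_score))
    reg = lam * sum(np.sum(p ** 2) for p in params)  # lambda * ||theta||_2^2
    return pos_term + neg_term + reg

print(conte_loss(5.0, [1.0, 2.0, -0.5], params=[np.ones(4)], gamma=6.0, lam=0.0))
```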

Full expressiveness

ContE is a simple yet expressive model. The following theorem establishes the full expressiveness of ContE.

Theorem 1

Given an arbitrary ground truth over ℰ and ℛ that contains ρ true facts, the ContE model has embeddings of size min(ρ + 1, |ℰ| · |ℛ|) which represent this ground truth.

Proof

We first prove the |ℰ| · |ℛ| bound. For each head entity ep and relation rq, let νep and frq (the forward context vector of relation rq) be embedding vectors of size |ℰ| · |ℛ| = N, so that the relation embedding νrq has size 2N. We set the m-th element of νep to 1/2 if m mod |ℰ| = p, and 0 otherwise. In the same way, we set frq[m] = 1/2 if m mod |ℰ| = p, and 0 otherwise. Furthermore, for a tail entity es and brq (the backward context vector of relation rq), we set
νes[m] = brq[m] = 1/2 if m div |ℰ| = s and (ep,rq,es) ∈ 𝒢;
νes[m] = brq[m] = −1/2 if m div |ℰ| = s and (ep,rq,es) ∈ 𝒢′;
νes[m] = brq[m] = 0 if m div |ℰ| ≠ s.

Then the (s · |ℰ| + p)-th term of the product < frq + νep, brq + νes > equals 1 if (ep,rq,es) holds, and −1 otherwise. It follows that ψ(ep,rq,es) = 1 if (ep,rq,es) holds, and −1 otherwise.

Now we prove the ρ + 1 bound by induction on ρ. For the base case, let ρ be zero. Give each relation r and entity e embedding vectors of size 1, with context vectors fr and br also of size 1; set the value for every entity to 1/2, for fr to 1/2, and for br to −3/2. Then < frq + νep, brq + νes > = −1 for all entities ep, es and relations rq. Hence there are embeddings of size ρ + 1 that represent this ground truth, so the claim holds for ρ = 0. Now suppose that for an arbitrary ground truth with ρ = m − 1 (1 ≤ m ≤ |ℛ|·|ℰ|2) there exists an assignment to embedding vectors of size m that represents it (induction hypothesis). We must verify that for an arbitrary ground truth with ρ = m there exists an assignment to embedding vectors of size m + 1 that represents it. Suppose (ep,rq,es) is one of the m facts, and consider the modified ground truth with m − 1 facts in which (ep,rq,es) is assigned false. By the induction hypothesis, it can be represented by embedding vectors of size m. Let u = < frq + νep, brq + νes >, where frq + νep and brq + νes are formed from the embedding vectors representing the modified ground truth. Append one element, set to 0, to the end of every vector νep, νes, fr and br. This increases the length of the vectors to m + 1 but changes no score. Now set the last element of νep to 1/2, of fr to 1/2, of br to 1, and of νes to −u. This ensures that < frq + νep, brq + νes > = 1 for the triple (ep,rq,es), while no other score is changed.

As stated in (Seyed & David, 2018), many translational models, for instance STransE, TransE, TransR, FTransE, and TransH, are not fully expressive. Since DistMult forces relations to be symmetric, it is also not fully expressive. ComplEx and SimplE are both fully expressive with embedding vectors of size at most |ℰ| · |ℛ|.

Ability to model relational patterns

Using contextualized representations of entities allows us to easily model primary relation patterns. Four patterns of relations, that is, inversion, symmetry, antisymmetry as well as composition, can be modeled by ContE.

r2 is an inverse relation of r1 on entities ℰ, if ∀ e1,e2, (e1,r1,e2)∈𝒢 ⇔ (e2,r2,e1) ∈ 𝒢.

r is an antisymmetric relation on entities ℰ, if ∀e1,e2, (e1,r,e2)∈𝒢 ⇒ (e2,r,e1) ∈𝒢′.

r is a symmetric relation on entities ℰ, if ∀e1,e2, (e1,r,e2)∈𝒢 ⇔ (e2,r,e1) ∈𝒢.

r3 is a compositional relation of r1 and r2 on entities ℰ, if ∀e1,e2,e3, (e1,r1,e2)∈𝒢 and (e2,r2,e3)∈𝒢 ⇒ (e1,r3,e3) ∈ 𝒢.

Theorem 2

Given a symmetric relation r on entities ℰ, r can be encoded in the ContE model by tying the parameters fr = br.

Proof

If (e1,r,e2) ∈ 𝒢, then ψ(e1,r,e2) = < fr + νe1, br + νe2 >. By tying fr = br, we conclude that < fr + νe2, br + νe1 > = < fr + νe1, br + νe2 >. So, ψ(e2,r,e1) = ψ(e1,r,e2). It implies that (e2,r,e1) is a positive triple in ContE.

Theorem 3

Given an antisymmetric relation r on entities ℰ, r can be encoded in the ContE model by tying the parameters fr = −br.

Proof

For any entities e1, e2 ∈ ℰ, if fr = −br, proving the antisymmetry pattern amounts to proving the inequality ψ(e1,r,e2) ≠ ψ(e2,r,e1).

Firstly, we expand the left term:
ψ(e1,r,e2) = < fr + νe1, br + νe2 >
= ∑i=1d [fr + νe1]i [br + νe2]i
= ∑i=1d [fr br + νe1 br + fr νe2 + νe1 νe2]i

We then expand the right term:
ψ(e2,r,e1) = < fr + νe2, br + νe1 >
= ∑i=1d [fr + νe2]i [br + νe1]i
= ∑i=1d [fr br + νe2 br + fr νe1 + νe1 νe2]i
= ∑i=1d [fr br − νe1 br − fr νe2 + νe1 νe2]i   (from fr = −br)

These two sums are not equal in general, since the cross terms νe1 br and fr νe2 appear with opposite signs in the two expansions.
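A small numeric check (with random toy vectors, purely for illustration) of Theorems 2 and 3: tying fr = br makes the ContE score symmetric in its two entity arguments, while tying fr = −br makes the two orderings generally score differently.

```python
# Numeric check of the symmetry / antisymmetry parameter tyings.
import numpy as np

rng = np.random.default_rng(1)
d = 4
v_e1, v_e2 = rng.normal(size=d), rng.normal(size=d)

def psi(f_r, b_r, v_h, v_t):
    return np.sum((f_r + v_h) * (b_r + v_t))

f = rng.normal(size=d)
print(np.isclose(psi(f, f, v_e1, v_e2), psi(f, f, v_e2, v_e1)))    # True: symmetric
print(np.isclose(psi(f, -f, v_e1, v_e2), psi(f, -f, v_e2, v_e1)))  # False in general
```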

Theorem 4

Assume that relation r1 is the inverse of relation r2. Then, ContE can encode r2 and r1 via making fr1 ∈ ℝd, fr2 ∈ ℝd, br1 ∈ ℝd and br2 ∈ ℝd satisfy one of the following conditions.

fr2 = br1 and br2 = fr1 = 0

br2 = fr1 and fr2 = br1 = 0

Proof

We only prove case (1), since case (2) is analogous. We have
ψ(e1,r1,e2) = < fr1 + νe1, br1 + νe2 >
= ∑i=1d [fr1 + νe1]i [br1 + νe2]i
= ∑i=1d [fr1 br1 + fr1 νe2 + νe1 br1 + νe1 νe2]i
= ∑i=1d [νe1 br1 + νe1 νe2]i   (from fr1 = 0)
= ∑i=1d [νe1 fr2 + νe1 νe2]i   (from fr2 = br1)
= ∑i=1d [fr2 br2 + νe2 br2 + νe1 fr2 + νe1 νe2]i   (from br2 = 0)
= < fr2 + νe2, br2 + νe1 > = ψ(e2,r2,e1)

Hence, (e1,r1,e2) ∈ 𝒢 ⇔ (e2,r2,e1) ∈ 𝒢.

Composition Pattern We analyze the abilities of existing models to infer and model these four relation patterns; the results are summarized in Table 2. Since deep neural networks (DNNs) are black boxes, it is hard to analyze which patterns they can model, so the corresponding entries for DNN based models such as ConvE and ConvKB are denoted as “−”, i.e., “unknown”. It is worth mentioning that TransE, Rotate and HAKE have drawbacks in modeling composition patterns. We use the Rotate model to illustrate this issue; the other two models are similar. For example, suppose there are three persons/entities e1, e2, and e3: e1 is the father of e2 (relation r1) and e2 is the brother of e3 (relation r2); then e1 is the father of e3 (relation r3). Note that r1 is the same relation as r3. Formally, for entity vectors νe1, νe2, νe3 and relation vectors νr1, νr2, νr3 satisfying νe2 = νe1 ○ νr1, νe3 = νe2 ○ νr2, and νe3 = νe1 ○ νr3, it follows that νr3 = νr1 ○ νr2. Assuming r3 = r1, the Rotate model then forces νr2 to be an all-one vector; that is, every relation r2 satisfying the above conditions must be the all-one vector. This is unreasonable, as the small numeric illustration below shows. ContE can model the composition pattern without this problem.
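The constraint discussed above can be checked numerically. The snippet below (an illustration added here, not an experiment from the paper) builds a random Rotate-style relation r1 with unit-modulus complex entries, sets r3 = r1, and solves the composition equation elementwise; r2 is then forced to be the all-one (identity) rotation.

```python
# Rotate-style composition: if r3 = r1 o r2 and r3 = r1, then r2 must be all ones.
import numpy as np

rng = np.random.default_rng(2)
d = 4
r1 = np.exp(1j * rng.uniform(0, 2 * np.pi, size=d))  # unit-modulus relation vector
r3 = r1                                              # e.g. "father of" used twice
r2 = r3 / r1                                         # solve r1 o r2 = r3 elementwise
print(np.allclose(r2, np.ones(d)))                   # True: identity rotation
```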

Relation pattern modeling and inference abilities of baseline models.

Model Symmetry Antisymmetry Inversion Composition
TransE (Antoine et al., 2013) ×
DistMult (Yang et al., 2015) × × ×
ComplEx (Trouillon et al., 2016) ×
SimplE (Seyed & David, 2018) ×
ConvE (Dettmers et al., 2018) − − − −
ConvKB (Nguyen et al., 2018) − − − −
RotatE (Sun et al., 2019)
HAKE (Zhang et al., 2020)
KGCR (Pu et al., 2020) ×
LineaRE (Peng & Zhang, 2020)
ContE ✓ ✓ ✓ ✓

Time Complexity We further analyse the time complexity of the ContE model. Let the numbers of entities, relations and triples of 𝒢 be ne, nr and nt, respectively, and let d and 2d be the dimensions of the entity and relation embedding vectors, respectively. From the scoring function, scoring a single triple costs O(d), so scoring all triples costs O(d · nt); the per-triple complexity of ContE is thus the same as that of TransE. The number of parameters needed by ContE is O(ned + 2nrd). Since ContE has few parameters and no matrix multiplication operations, it can be applied to large-scale KGs. The comparison between SOTA baselines and ContE in terms of parameters and time complexity is summarized in Table 3.

Comparison of SOTA baselines and ContE model in terms of time complexity and number of parameters.

Models #Parameters Time Complexity
TransE O(ned + nrd) O(d)
NTN O(ned + nrd²k) O(d³)
ComplEx O(ned + nrd) O(d)
TransR O(ned + nrdk) O(dk)
SimplE O(ned + nrd) O(d)
ContE O(ned + 2nrd) O(d)
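The parameter counts in Table 3 can be made concrete with a small sketch; the dataset sizes plugged in below are taken from Table 4, and the counts ignore any optimizer state or bias terms (an assumption for illustration).

```python
# Parameter counts implied by Table 3 for TransE-style and ContE embeddings.
def params_transe(ne, nr, d):
    return ne * d + nr * d        # one d-dim vector per entity and per relation

def params_conte(ne, nr, d):
    return ne * d + 2 * nr * d    # relations carry forward and backward context vectors

ne, nr, d = 14541, 237, 1000      # FB15K-237 sizes with d = 1000
print(params_transe(ne, nr, d))   # 14,778,000
print(params_conte(ne, nr, d))    # 15,015,000
```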

Summary of the datasets.

Dataset |E| |R| |training| |validation| |test|
FB15K-237 14,541 237 272,115 17,535 20,466
UMLS 135 46 5,216 652 661
Nations 14 55 1,592 199 201
Countries_S1 271 2 1,111 24 24
Countries_S2 271 2 1,063 24 24
Countries_S3 271 2 985 24 24
Experiments

We evaluate the performance of ContE on four standard benchmarks, i.e., UMLS, Nations, FB15K-237 and Countries. Table 4 summarizes the statistics of these datasets. UMLS (Kok & Domingos, 2007) contains data from the Unified Medical Language System, a biomedical ontology database. It contains 135 concepts and 46 relations, of which 6,529 atoms are true. FB15K (Antoine et al., 2013) is extracted from Freebase, a large KG containing general knowledge facts. FB15K-237 removes inverse relations from FB15K, because inverse relations cause test leakage in FB15K (Dettmers et al., 2018). The Countries dataset (Guillaume, Sameer, & Theo, 2015) contains 272 entities and 2 relations and is used to test the ability to model and reason over the composition pattern. The two relations are isInside(e1,e2) and isNeighbor(e1,e2). There are three versions of the Countries dataset, called S1, S2, and S3, which test the ability for 1-step, 2-step, and 3-step reasoning, respectively. We compare the reasoning ability of ContE with SOTA baselines using the Countries dataset.

Evaluation Protocol We use the following ranking metrics over correct entities to compare ContE with the baselines. To predict e1 in a test triple (e1,r,e2), a KG model scores all corrupted triples in the set T′ = {(e1′, r, e2) | e1′ ∈ ℰ}. Sorting the scores of the correct and corrupted triples in descending order gives the rank of the valid triple in the list. Analogously, we obtain another ranking of the triple (e1,r,e2) by replacing the tail entity e2. Aggregated over all test triples, five metrics can be reported: MR (Mean Rank) is the average rank of the valid triples, and MRR (Mean Reciprocal Rank) is the average inverse rank of the valid triples; MRR is usually a better metric than MR. Hits@n is the proportion of valid triples whose rank is not larger than n, where n = 1, 3, 10. We use MRR and Hits@n in our experiments. Higher Hits@n, higher MRR, and lower MR indicate better performance; we report the best MRR as the experimental result for all datasets. Notice that a corrupted triple may itself exist in the KG, meaning it is actually correct, and it is then reasonable for it to rank above the original triple. If corrupted triples are used regardless of whether they are correct, the setting is called “raw”; if only corrupted triples that do not appear in the training, validation, or test sets are used, the setting is called “filtered”. We report experimental results in the filtered setting (Antoine et al., 2013).
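A minimal sketch of the filtered ranking protocol described above, for head prediction only; the scoring function, the container of known true triples, and the tie handling are illustrative assumptions (ties should be broken randomly, as discussed in Section 2).

```python
# Filtered rank of a test triple plus aggregated MRR / Hits@n.
import numpy as np

def filtered_head_rank(score_fn, triple, entities, known_true):
    h, r, t = triple
    true_score = score_fn(h, r, t)
    # count corrupted heads that outrank the test triple, skipping known true triples
    higher = sum(1 for e in entities
                 if e != h and (e, r, t) not in known_true
                 and score_fn(e, r, t) > true_score)
    return higher + 1

def mrr_and_hits(ranks, ns=(1, 3, 10)):
    ranks = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / ranks)), {n: float(np.mean(ranks <= n)) for n in ns}

print(mrr_and_hits([1, 2, 5, 20]))  # example ranks
```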

Experimental Setup ContE uses the Adam optimizer to minimize the loss function and tunes its hyper-parameters on the validation set. We conduct a grid search over hyper-parameters to maximize MRR. The hyper-parameters are: k (embedding dimension), lr (learning rate), γ (maximal margin), λ (regularization coefficient), batch_size (size of each batch), max_steps (maximum number of iteration steps) and n (negative sample size).
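A hedged sketch of this grid search, assuming a hypothetical train_and_evaluate(cfg) helper that trains ContE with a given configuration and returns the validation MRR; the grid values shown are only a small subset of the ranges listed below.

```python
# Grid search over hyper-parameters, keeping the configuration with the best MRR.
from itertools import product

grid = {
    "k": [500, 1000],        # embedding dimension
    "lr": [0.0005, 0.001],   # learning rate
    "gamma": [6, 9],         # maximal margin
    "n": [5, 10],            # negative sample size
}

best_mrr, best_cfg = -1.0, None
for values in product(*grid.values()):
    cfg = dict(zip(grid.keys(), values))
    mrr = train_and_evaluate(cfg)   # hypothetical helper returning validation MRR
    if mrr > best_mrr:
        best_mrr, best_cfg = mrr, cfg
```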

Results We compare ContE with several baseline models to empirically evaluate its ability to model relation patterns and predict missing links.

Performance on UMLS Table 5 summarizes the comparison results between ContE and the listed SOTA baselines on the UMLS dataset. The results of DistMult, ConvE, NeuralLP (Yang, Zhang, & Cohen, 2017), NTP-λ (Rocktaschel & Riedel, 2017), MINERVA (Das et al., 2018), KGRRS+ComplEx and KGRRS+ConvE are from (Lin, Socher, & Xiong, 2018). “−” means no result for this metric is reported in the original paper. For TransE, we rerun the code given by (Sun et al., 2019) and do a grid search to find the best values of hyper-parameters. The range of hyper-parameters is batch_size = 20, k = 500, max_steps = 10000, n ∈ {1,2,3,4,5,6,7,8,9,10}, γ ∈ {1,2,3,4,5,6,7,8,9,10}, lr ∈ {0.0005,0.001}. For ComplEx, the range of hyper-parameters is batch_size = 20, k ∈ {500,1000}, max_steps = 10000, n ∈ {1,2,3,4,5,6,7,8}, γ ∈ {1,2,4,6,8,10,11,12,13,14}, lr ∈ {0.0005,0.0008,0.001,0.002}. For Rotate, the range of hyper-parameters is batch_size = 20, k ∈ {500,1000}, max_steps = 10000, n ∈ {1,2,3,4,5,6,7,8,9,10,11,12}, γ ∈ {1,2,3,4,5,6,7,8,9,10,11,12,13}, lr ∈ {0.0005,0.001,0.002}. For HAKE, the range of hyper-parameters is batch_size ∈ {20,30,100,256}, k = 1000, max_steps ∈ {9000,10000}, n ∈ {4,5,6,7,8,9,10,20}, γ ∈ {0.4,0.5,0.6,0.8,0.9,1,2,3,4,5,6,7,8}, lr ∈ {0.001,0.002}, modulus_weight ∈ {1.0,2.5,3.5} and phase_weight ∈ {0.5,1.0,2.5}. For LineaRE, the range of hyper-parameters is k ∈ {250,500,1000}, temperature of sampling α ∈ {0.5,1.0}, batch_size b ∈ {128,256,512}, learning_rate lr = 0.001, drop_rate = 0.05, n ∈ {128,256,512}, γ ∈ {9,12}. For ContE, the range of hyper-parameters is λ = 0, batch_size ∈ {16,20,32,50,128,512}, k ∈ {500,1000,2000,3000,3500}, max_steps ∈ {9000,10000}, n ∈ {3,4,5,6,7,8,9,10,20,30,50,100}, γ ∈ {1,2,3,4,5,6,7,8,9,10}, lr ∈ {0.0005,0.0008,0.001,0.0015,0.002,0.003}.

Experimental results for UMLS.

Model MRR Hits@1 Hits@3 Hits@10
TransE (Antoine et al., 2013) 0.7966 0.6452 0.9418 0.9841
DistMult (Yang et al., 2015) 0.868 0.821 - 0.967
ComplEx (Trouillon et al., 2016) 0.8753 0.7942 0.9531 0.9713
ConvE (Dettmers et al., 2018) 0.957 0.932 - 0.994
NeuralLP (Yang, Zhang, & Cohen, 2017) 0.778 0.643 - 0.962
NTP-λ (Rocktaschel et al., 2017) 0.912 0.843 - 1.0
MINERVA (Das et al., 2018) 0.825 0.728 - 0.968
KGRRS+ComplEx (Lin et al., 2018) 0.929 0.887 - 0.985
KGRRS+ConvE (Lin et al., 2018) 0.940 0.902 - 0.992
Rotate (Sun et al., 2019) 0.9274 0.8744 0.9788 0.9947
HAKE (Zhang et al., 2020) 0.8928 0.8366 0.9387 0.9849
LineaRE (Peng & Zhang, 2020) 0.9508 0.9145 0.9856 0.9992
ContE 0.9677 0.9501 0.9811 1.0

The ContE model achieves the best performance on all metrics except metric Hits@3 in Table 5.

Performance on Nations Table 6 summarizes the comparison results between ContE and the listed SOTA baselines on the Nations dataset.

Experimental results for Nations.

Model MRR Hits@1 Hits@3 Hits@10
TransE (Antoine et al., 2013) 0.4813 0.2189 0.6667 0.9801
DistMult (Yang et al., 2015) 0.7131 0.5970 0.7761 0.9776
ComplEx (Trouillon et al., 2016) 0.6677 0.5274 0.7413 0.9776
ConvE (Dettmers et al., 2018) 0.5616 0.3470 0.7155 0.9946
Rotate (Sun et al., 2019) 0.7155 0.5796 0.7985 1.0
HAKE (Zhang et al., 2020) 0.7157 0.5945 0.7786 0.9851
LineaRE (Peng & Zhang, 2020) 0.8146 0.7114 0.8881 0.9975
ContE 0.8412 0.7587 0.9179 1.0

For all listed baselines except ConvE, we rerun the code given by their original papers and do a grid search to find the best values of hyper-parameters. The results of ConvE are obtained with PyKEEN (Ali et al., 2021). For TransE, the range of hyper-parameters is batch_size = 20, k ∈ {500,1000}, max_steps = 10000, n ∈ {4,64,256,512,1024}, γ ∈ {0.5,1,6,9,10,12,24,100}, lr ∈ {0.00005,0.001}. For DistMult, the range of hyper-parameters is batch_size = 20, k ∈ {500,1000}, n ∈ {4,64,128,256,512}, max_steps = 10000, γ ∈ {0.5,2,6,9,10}, lr ∈ {0.00005,0.001}. For ComplEx, the range of hyper-parameters is batch_size = 20, k ∈ {500,1000}, max_steps = 10000, n ∈ {4,64,128,256,512,1024}, γ ∈ {0.5,1,2,6,9,10,12,13,24,100}, lr ∈ {0.00005,0.001}. For Rotate, the range of hyper-parameters is batch_size = 20, k ∈ {500,1000}, max_steps = 10000, n ∈ {4,5,6,7,8,9,10}, γ ∈ {1,2,3,4,5,6,7,8,9,10}, lr ∈ {0.0005,0.001}. For HAKE, the range of hyper-parameters is batch_size ∈ {20,30,100,256}, k ∈ {500,1000}, n ∈ {4,5,6,7,8,9,10}, max_steps = 10000, γ ∈ {1,2,3,4,5,6,7,8,9,10}, lr ∈ {0.0005,0.001}, modulus_weight ∈ {1.0,2.5,3.5} and phase_weight ∈ {0.5,1.0,2.5}. For LineaRE, the range of hyper-parameters is k ∈ {250,500,1000}, temperature of sampling α ∈ {0.5,1.0}, batch_size b ∈ {128,256,512}, learning_rate lr = 0.001, drop_rate = 0.05, n ∈ {128,256,512}, γ ∈ {9,12}. For ContE, the range of hyper-parameters is λ = 0, batch_size b ∈ {16,20,30}, k ∈ {1000,2000}, max_steps ∈ {6000,7000,8000,9000}, n ∈ {3,4,5,6,7,8,9,10}, γ ∈ {1,2,3,4,5,6,7,8,9,10}, lr ∈ {0.0005,0.0006,0.0007,0.001}.

As shown in Table 6, ContE is the best model, outperforming the best SOTA baseline by 2.66% on MRR, 4.73% on Hits@1, and 2.98% on Hits@3.

Performance on FB15K-237 Table 7 summarizes the comparison results between ContE and the listed SOTA baselines on the FB15K-237 dataset. The results of the baselines TransE, DistMult, ComplEx, and Rotate are from (Sun et al., 2019). The results of R-GCN are from its original paper, and those of CapsE, ConvKB and ConvE are from (Sun et al., 2020). For SimplE, we rerun the code given by its original paper and do a grid search to find the best values of hyper-parameters. The range of its hyper-parameters is n ∈ {1,2,5,10}, lr ∈ {0.001,0.05,0.01,0.02}, k ∈ {100,200,500}, number_of_epochs = 1000. For ContE, the range of hyper-parameters is λ ∈ {0,0.01,0.05,0.1}, batch_size ∈ {64,128,256,512,1024}, k ∈ {1000,2000}, max_steps ∈ {6000,7000,8000,9000}, n ∈ {256,512,1024}, γ ∈ {1,5,10,12,14,18,20}, lr ∈ {0.00005,0.0005,0.001,0.002}.

Experimental results for FB15K-237.

Model MRR Hits@1 Hits@3 Hits@10
TransE (Antoine et al., 2013) 0.279 0.198 0.376 0.441
DistMult (Yang et al., 2015) 0.281 0.199 0.301 0.446
ComplEx (Trouillon et al., 2016) 0.278 0.194 0.297 0.45
ConvE (Dettmers et al., 2018) 0.312 0.225 0.341 0.497
ConvKB (Nguyen et al., 2018) 0.289 0.198 0.324 0.471
R-GCN (Schlichtkrull et al., 2018) 0.164 0.10 0.181 0.30
SimplE (Seyed & David, 2018) 0.169 0.095 0.179 0.327
CapsE (Nguyen et al., 2019) 0.150 - - 0.356
Rotate (Sun et al., 2019) 0.338 0.241 0.375 0.533
ContE 0.3445 0.2454 0.3823 0.5383

Table 7 shows that ContE achieves the best MRR as well as Hits@k (k=1,3,10) compared with the SOTA baselines.

Performance on Countries Table 8 and Table 9 summarize the comparison results between ContE and the listed SOTA baselines on the Countries_S1, Countries_S2 and Countries_S3 datasets, respectively. For all baselines, the range of hyper-parameters is similar to that used for UMLS, Nations, and FB15K-237.

Experimental results for Countries_S1.

Model MRR Hits@1 Hits@3 Hits@10
TransE (Antoine et al., 2013) 0.8785 0.7708 1.0 1.0
DistMult (Yang et al., 2015) 0.9028 0.8125 1.0 1.0
ComplEx (Trouillon et al., 2016) 0.9792 0.9583 1.0 1.0
Rotate (Sun et al., 2019) 0.8750 0.7708 1.0 1.0
HAKE (Zhang et al., 2020) 0.9045 0.8333 0.9792 1.0
LineaRE (Peng & Zhang, 2020) 1.0 1.0 1.0 1.0
ContE 1.0 1.0 1.0 1.0

Experimental results for Countries_S2 and Countries_S3.

Model Countries_S2: MRR Hits@1 Hits@3 Hits@10 Countries_S3: MRR Hits@1 Hits@3 Hits@10
TransE 0.6997 0.50 0.9375 1.0 0.1206 0.00 0.0833 0.3542
DistMult 0.7813 0.5833 1.0 1.0 0.2496 0.0625 0.333 0.6250
ComplEx 0.7934 0.6042 0.9792 1.0 0.2731 0.0833 0.3958 0.6667
Rotate 0.6979 0.4792 0.9583 1.0 0.1299 0.00 0.0833 0.4792
HAKE 0.6667 0.4583 0.8333 0.9583 0.2472 0.0625 0.3333 0.5417
LineaRE 0.7873 0.6458 0.9583 0.9792 0.2393 0.0625 0.3542 0.5208
ContE 0.8370 0.7292 0.9583 0.9792 0.4695 0.3542 0.5 0.625

As can be seen from Tables 8 and 9, ContE shows stronger reasoning ability than the other baselines. Specifically, on Countries_S2, ContE outperforms the best SOTA baseline by 4.36% on MRR and 8.32% on Hits@1. On Countries_S3, ContE outperforms the best SOTA baseline by 19.64% on MRR, 27.09% on Hits@1, and 10.42% on Hits@3.

Inferring relation patterns We evaluate ContE on the UMLS and Nations datasets to test its ability to infer relation patterns. Table 10 and Table 11 summarize the comparison results between ContE and the listed SOTA baselines in terms of MRR. Note that UMLS and Nations have no symmetric relations, so the corresponding column is denoted as “−”.

Testing the ability of inferring relation patterns on UMLS.

UMLS

Model Symmetry Antisymmetry Inversion Composition
ComplEx - 0.8806 0.8615 0.8732
Rotate - 0.9065 0.9069 0.9141
HAKE - 0.8558 0.8650 0.8723
LineaRE - 0.9446 0.9441 0.9490
ContE - 0.9644 0.9622 0.9645

Testing the ability of inferring relation patterns on Nations.

Nations

Model Symmetry Antisymmetry Inversion Composition
ComplEx - 0.6499 0.6487 0.6499
Rotate - 0.6825 0.6970 0.6907
HAKE - 0.6953 0.6835 0.6844
LineaRE - 0.8189 0.8240 0.8333
ContE - 0.8256 0.8326 0.8303

From Tables 10 and 11, ContE shows a stronger capability of inferring relation patterns than the other baselines.

Conclusion

We propose ContE, a novel entity embedding model that considers the contexts of relations for KG completion. The forward and backward impacts of a relationship in ContE are mapped to two different embedding vectors, which represent the contextual information of the relationship. Then, according to the position of the entity, the entity's polysemous representation is obtained by adding its static embedding vector to the corresponding context vector of the relationship. ContE is a simple bilinear model: as the number of entities or relations increases, its number of parameters grows linearly for a given embedding dimension. ContE is also fully expressive, unlike methods such as TransE and HAKE. It is worth mentioning that ContE can model symmetry, antisymmetry, inversion as well as composition patterns, and it shows stronger reasoning capability than SOTA baselines.

The empirical results demonstrate that, for the link prediction task, ContE achieves better performance than existing SOTA baselines. In future work, we will leverage an entity's neighborhood to explore local graph structure, use attention to capture the main sense of an entity from its heterogeneous neighborhood, and further extend the ContE model to multi-modal knowledge graphs (Liu, Li, & Alberto, 2019) that contain not only (links to) images but also numerical features for entities.
