Background

Knowledge graphs are graph-based knowledge bases whose facts are modeled as relationships between entities. Knowledge graph research led to broad-scope graphs such as DBpedia [ABK+07], WordNet [Pri10], and YAGO [SKW07]. Countless domain-specific knowledge graphs have also been published on the web, giving birth to the so-called Web of Data [BHBL11].

Formally, a knowledge graph \(\mathcal{G}=\{ (sub,pred,obj)\} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}\) is a set of \((sub,pred,obj)\) triples, each including a subject \(sub \in \mathcal{E}\), a predicate \(pred \in \mathcal{R}\), and an object \(obj \in \mathcal{E}\). \(\mathcal{E}\) and \(\mathcal{R}\) are the sets of all entities and relation types of \(\mathcal{G}\).
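As a concrete illustration of this definition, the short Python sketch below (with made-up entities and relations) represents a toy graph as a set of triples and derives \(\mathcal{E}\) and \(\mathcal{R}\) from it:

# Toy knowledge graph as a set of (sub, pred, obj) triples.
# Entity and relation names are invented for illustration only.
graph = {
    ("Mike", "bornIn", "Liverpool"),
    ("Mike", "friendWith", "George"),
    ("Liverpool", "cityOf", "UK"),
}

# The sets of entities and relation types can be derived from the triples.
entities = {s for s, _, _ in graph} | {o for _, _, o in graph}
relations = {p for _, p, _ in graph}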

Knowledge graph embedding models are neural architectures that encode concepts from a knowledge graph (i.e. entities \(\mathcal{E}\) and relation types \(\mathcal{R}\)) into low-dimensional, continuous vectors \(\in \mathbb{R}^k\). Such knowledge graph embeddings have applications in knowledge graph completion, entity resolution, and link-based clustering, to name a few [NMTG16]. Knowledge graph embeddings are learned by training a neural architecture over a graph. Although such architectures vary, the training phase always consists of minimizing a loss function \(\mathcal{L}\) that includes a scoring function \(f_{m}(t)\), i.e. a model-specific function that assigns a score to a triple \(t=(sub,pred,obj)\).

The goal of the optimization procedure is to learn optimal embeddings, such that the scoring function assigns high scores to positive statements and low scores to statements unlikely to be true. Existing models propose scoring functions that combine the embeddings \(\mathbf{e}_{sub},\mathbf{e}_{pred}, \mathbf{e}_{obj} \in \mathbb{R}^k\) of the subject, predicate, and object of a triple \(t=(sub,pred,obj)\) following different intuitions: TransE [BUGD+13] relies on distances, DistMult [YYH+14] and ComplEx [TWR+16] are bilinear-diagonal models, and HolE [NRP+16] uses circular correlation. While the above models can be interpreted as multilayer perceptrons, others such as ConvE include convolutional layers [DMSR18].
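As an illustration of the bilinear-diagonal intuition, DistMult scores a triple with the trilinear dot product of its embeddings; a minimal NumPy sketch for a single triple is shown below (the function name is hypothetical and this is not the library's implementation):

import numpy as np

def distmult_score(e_sub, e_pred, e_obj):
    # Bilinear-diagonal scoring: trilinear dot product of the three embeddings,
    # computed here for a single triple (simplified, non-batched sketch).
    return np.sum(e_sub * e_pred * e_obj)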

For example, the scoring function of TransE computes a similarity between the embedding of the subject \(\mathbf{e}_{sub}\) translated by the embedding of the predicate \(\mathbf{e}_{pred}\) and the embedding of the object \(\mathbf{e}_{obj}\), using the \(L_1\) or \(L_2\) norm \(||\cdot||\):

\[f_{TransE}=-||\mathbf{e}_{sub} + \mathbf{e}_{pred} - \mathbf{e}_{obj}||_n\]
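This scoring function can be sketched in NumPy for a single triple as follows (illustrative only; the function name and signature are assumptions, not the library's API):

import numpy as np

def transe_score(e_sub, e_pred, e_obj, norm_order=1):
    # Negative L1 (or L2) distance between the translated subject embedding
    # and the object embedding, as in the equation above.
    return -np.linalg.norm(e_sub + e_pred - e_obj, ord=norm_order)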

The scoring function is then applied to positive and negative triples \(t^+, t^-\) in the loss function, which can be, for example, a pairwise margin-based loss, as shown in the equation below:

\[\mathcal{L}(\Theta) = \sum_{t^+ \in \mathcal{G}}\sum_{t^- \in \mathcal{N}}\max(0, [\gamma + f_{m}(t^-;\Theta) - f_{m}(t^+;\Theta)])\]

where \(\Theta\) are the embeddings learned by the model, \(f_{m}\) is the model-specific scoring function, \(\gamma \in \mathbb{R}\) is the margin, and \(\mathcal{N}\) is a set of negative triples generated with a corruption heuristic [BUGD+13].
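Assuming the scores of positive and negative triples have already been computed, the loss and a simple corruption heuristic can be sketched as follows (the function names are hypothetical and the code is illustrative, not the library's implementation):

import random
import numpy as np

def pairwise_margin_loss(pos_scores, neg_scores, margin=1.0):
    # Sum of max(0, gamma + f_m(t-) - f_m(t+)) over pairs of positive and
    # negative scores, as in the equation above.
    return np.sum(np.maximum(0.0, margin + np.asarray(neg_scores) - np.asarray(pos_scores)))

def corrupt_triple(triple, entities):
    # Corruption heuristic sketch [BUGD+13]: replace the subject or the object
    # with a random entity to generate a synthetic negative triple.
    sub, pred, obj = triple
    random_entity = random.choice(sorted(entities))
    if random.random() < 0.5:
        return (random_entity, pred, obj)
    return (sub, pred, random_entity)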