AmpliGraph¶
Open source Python library that predicts links between concepts in a knowledge graph.
Join the conversation on Slack
AmpliGraph is a suite of neural machine learning models for relational learning, a branch of machine learning that deals with supervised learning on knowledge graphs.

Use AmpliGraph if you need to:
Discover new knowledge from an existing knowledge graph.
Complete large knowledge graphs with missing statements.
Generate stand-alone knowledge graph embeddings.
Develop and evaluate a new relational model.
AmpliGraph’s machine learning models generate knowledge graph embeddings, vector representations of concepts in a metric space:

The models then combine these embeddings with model-specific scoring functions to predict unseen and novel links:

Key Features¶
Intuitive APIs: AmpliGraph APIs are designed to reduce the amount of code required to learn models that predict links in knowledge graphs.
GPU-Ready: AmpliGraph is based on TensorFlow and is designed to run seamlessly on CPU and GPU devices to speed up training.
Extensible: Roll your own knowledge graph embeddings model by extending AmpliGraph base estimators.
Modules¶
AmpliGraph includes the following submodules:
Datasets: helper functions to load datasets (knowledge graphs).
Models: knowledge graph embedding models. AmpliGraph contains TransE, DistMult, ComplEx, HolE, ConvE, ConvKB (More to come!)
Evaluation: metrics and evaluation protocols to assess the predictive power of the models.
Discovery: High-level convenience APIs for knowledge discovery (discover new facts, cluster entities, predict near duplicates).
How to Cite¶
If you like AmpliGraph and you use it in your project, why not star the project on GitHub?
If you instead use AmpliGraph in an academic publication, cite as:
@misc{ampligraph,
author= {Luca Costabello and
Sumit Pai and
Chan Le Van and
Rory McGrath and
Nick McCarthy and
Pedro Tabacof},
title = {{AmpliGraph: a Library for Representation Learning on Knowledge Graphs}},
month = mar,
year = 2019,
doi = {10.5281/zenodo.2595043},
url = {https://doi.org/10.5281/zenodo.2595043}
}
Installation¶
Prerequisites¶
Linux, macOS, Windows
Python ≥ 3.7
Provision a Virtual Environment¶
Create and activate a virtual environment (conda)
conda create --name ampligraph python=3.7
source activate ampligraph
Install TensorFlow¶
AmpliGraph is built on TensorFlow 1.x. Install from pip or conda:
CPU-only
pip install "tensorflow>=1.15.2,<2.0"
or
conda install "tensorflow>=1.15.2,<2.0.0"
GPU support
pip install "tensorflow-gpu>=1.15.2,<2.0"
or
conda install "tensorflow-gpu>=1.15.2,<2.0.0"
Install AmpliGraph¶
Install the latest stable release from pip:
pip install ampligraph
If instead you want the most recent development version, you can clone the repository and install from source as below (also see the How to Contribute guide for details):
git clone https://github.com/Accenture/AmpliGraph.git
cd AmpliGraph
git checkout develop
pip install -e .
Sanity Check¶
>>> import ampligraph
>>> ampligraph.__version__
'1.3.2'
Background¶
For a comprehensive theoretical and hands-on overview of KGE models and AmpliGraph, check out our ECAI-20 Tutorial (Slides + Recording + Colab Notebook).
Knowledge graphs are graph-based knowledge bases whose facts are modeled as relationships between entities. Knowledge graph research led to broad-scope graphs such as DBpedia [ABK+07], WordNet [Pri10], and YAGO [SKW07]. Countless domain-specific knowledge graphs have also been published on the web, giving birth to the so-called Web of Data [BHBL11].
Formally, a knowledge graph \(\mathcal{G}=\{ (sub,pred,obj)\} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}\) is a set of \((sub,pred,obj)\) triples, each including a subject \(sub \in \mathcal{E}\), a predicate \(pred \in \mathcal{R}\), and an object \(obj \in \mathcal{E}\). \(\mathcal{E}\) and \(\mathcal{R}\) are the sets of all entities and relation types of \(\mathcal{G}\).
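As a minimal illustration of this definition, the sketch below builds a toy graph as a Python set of triples and derives the entity set and the relation set from it. The entity and relation names are invented for the example and do not come from any AmpliGraph dataset.

```python
# Toy knowledge graph as a set of (subject, predicate, object) triples.
# All names below are hypothetical, chosen only for illustration.
triples = {
    ("Alice", "worksFor", "Acme"),
    ("Bob", "worksFor", "Acme"),
    ("Acme", "locatedIn", "Dublin"),
}

# E: every entity that appears as a subject or an object.
entities = {s for s, _, _ in triples} | {o for _, _, o in triples}
# R: every relation type that appears as a predicate.
relations = {p for _, p, _ in triples}

# By construction, every triple is an element of E x R x E.
in_product = all(s in entities and p in relations and o in entities
                 for s, p, o in triples)

print(sorted(entities))
print(sorted(relations))
print(in_product)
```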
Knowledge graph embedding models are neural architectures that encode concepts from a knowledge graph (i.e. entities \(\mathcal{E}\) and relation types \(\mathcal{R}\)) into low-dimensional, continuous vectors \(\in \mathbb{R}^k\). Such knowledge graph embeddings have applications in knowledge graph completion, entity resolution, and link-based clustering, just to cite a few [NMTG16]. Knowledge graph embeddings are learned by training a neural architecture over a graph. Although such architectures vary, the training phase always consists of minimizing a loss function \(\mathcal{L}\) that includes a scoring function \(f_{m}(t)\), i.e. a model-specific function that assigns a score to a triple \(t=(sub,pred,obj)\).
The goal of the optimization procedure is learning optimal embeddings, such that the scoring function is able to assign high scores to positive statements and low scores to statements unlikely to be true. Existing models propose scoring functions that combine the embeddings \(\mathbf{e}_{sub},\mathbf{e}_{pred}, \mathbf{e}_{obj} \in \mathbb{R}^k\) of the subject, predicate, and object of triple \(t=(sub,pred,obj)\) using different intuitions: TransE [BUGD+13] relies on distances, DistMult [YYH+14] and ComplEx [TWR+16] are bilinear-diagonal models, HolE [NRP+16] uses circular correlation. While the above models can be interpreted as multilayer perceptrons, others such as ConvE include convolutional layers [DMSR18].
As an example, the scoring function of TransE computes a similarity between the embedding of the subject \(\mathbf{e}_{sub}\) translated by the embedding of the predicate \(\mathbf{e}_{pred}\) and the embedding of the object \(\mathbf{e}_{obj}\), using the \(L_1\) or \(L_2\) norm \(||\cdot||\):
\[f_{TransE}=-||\mathbf{e}_{sub} + \mathbf{e}_{pred} - \mathbf{e}_{obj}||_n\]
This scoring function is then used on positive and negative triples \(t^+, t^-\) in the loss function. This can be, for example, a pairwise margin-based loss, as shown in the equation below:
\[\mathcal{L}(\Theta) = \sum_{t^+ \in \mathcal{G}}\sum_{t^- \in \mathcal{N}} \max(0, [\gamma + f_{m}(t^-;\Theta) - f_{m}(t^+;\Theta)])\]
where \(\Theta\) are the embeddings learned by the model, \(f_{m}\) is the model-specific scoring function, \(\gamma \in \mathbb{R}\) is the margin and \(\mathcal{N}\) is a set of negative triples generated with a corruption heuristic [BUGD+13].
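The TransE score and the pairwise margin-based loss described above can be sketched in a few lines of plain Python. This is an illustrative re-implementation on hand-picked 2-dimensional vectors, not AmpliGraph's TensorFlow code; the function names `transe_score` and `pairwise_margin_loss` and the toy embeddings are assumptions made for the example.

```python
import math

def transe_score(e_sub, e_pred, e_obj, norm=1):
    """Negative L1 (norm=1) or L2 (norm=2) distance between e_sub + e_pred and e_obj."""
    diffs = [s + p - o for s, p, o in zip(e_sub, e_pred, e_obj)]
    if norm == 1:
        return -sum(abs(d) for d in diffs)
    return -math.sqrt(sum(d * d for d in diffs))

def pairwise_margin_loss(pos_scores, neg_scores, margin=1.0):
    """Sum of max(0, margin + f(t-) - f(t+)) over aligned positive/negative pairs."""
    return sum(max(0.0, margin + n - p) for p, n in zip(pos_scores, neg_scores))

# Hypothetical embeddings: a translated by y lands exactly on b, but far from c.
e_a, e_b, e_c = [0.0, 0.0], [1.0, 1.0], [3.0, 0.0]
r_y = [1.0, 1.0]

pos = transe_score(e_a, r_y, e_b)  # positive triple (a, y, b)
neg = transe_score(e_a, r_y, e_c)  # corrupted triple (a, y, c)

# The loss is zero once the positive outscores the negative by at least the margin.
loss_small_margin = pairwise_margin_loss([pos], [neg], margin=0.5)
loss_large_margin = pairwise_margin_loss([pos], [neg], margin=4.0)
print(pos, neg, loss_small_margin, loss_large_margin)
```

With a small margin the pair is already separated and contributes nothing to the loss; with a larger margin the same pair is penalized.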
API¶
AmpliGraph includes the following submodules:
Datasets¶
Helper functions to load knowledge graphs.
Note
It is recommended to set the AMPLIGRAPH_DATA_HOME environment variable:
export AMPLIGRAPH_DATA_HOME=/YOUR/PATH/TO/datasets
When attempting to load a dataset, the module will first check if AMPLIGRAPH_DATA_HOME is set. If it is, it will search this location for the required dataset. If the dataset is not found, it will be downloaded and placed in this directory.
If AMPLIGRAPH_DATA_HOME has not been set, the datasets will be saved in the following directory:
~/ampligraph_datasets
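The lookup order described in the note can be sketched as follows. The helper `resolve_data_home` is hypothetical and only mirrors the documented behaviour; it is not AmpliGraph's internal function.

```python
import os

def resolve_data_home(environ=None):
    """Return the directory that would be searched for datasets.

    Hypothetical helper: uses AMPLIGRAPH_DATA_HOME when set,
    otherwise falls back to ~/ampligraph_datasets.
    """
    environ = os.environ if environ is None else environ
    data_home = environ.get("AMPLIGRAPH_DATA_HOME")
    if data_home:
        return os.path.expanduser(data_home)
    return os.path.join(os.path.expanduser("~"), "ampligraph_datasets")

# Passing an explicit mapping keeps the example independent of the real environment.
print(resolve_data_home({"AMPLIGRAPH_DATA_HOME": "/tmp/datasets"}))
print(resolve_data_home({}))
```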
Benchmark Datasets Loaders¶
Use these helper functions to load datasets used in graph representation learning literature.
The functions will automatically download the datasets if they are not present in ~/ampligraph_datasets or at the location set in AMPLIGRAPH_DATA_HOME.
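Function | Description |
---|---|
load_fb15k_237 | Load the FB15k-237 dataset |
load_wn18rr | Load the WN18RR dataset |
load_yago3_10 | Load the YAGO3-10 dataset |
load_fb15k | Load the FB15k dataset |
load_wn18 | Load the WN18 dataset |
load_wn11 | Load the WordNet11 (WN11) dataset |
load_fb13 | Load the Freebase13 (FB13) dataset |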
Datasets Summary
Dataset | Train | Valid | Test | Entities | Relations |
---|---|---|---|---|---|
FB15K-237 | 272,115 | 17,535 | 20,466 | 14,541 | 237 |
WN18RR | 86,835 | 3,034 | 3,134 | 40,943 | 11 |
FB15K | 483,142 | 50,000 | 59,071 | 14,951 | 1,345 |
WN18 | 141,442 | 5,000 | 5,000 | 40,943 | 18 |
YAGO3-10 | 1,079,040 | 5,000 | 5,000 | 123,182 | 37 |
WN11 | 110,361 | 5,215 | 21,035 | 38,194 | 11 |
FB13 | 316,232 | 11,816 | 47,464 | 75,043 | 13 |
Warning
WN18 and FB15k include a large number of inverse relations, and their use in experiments has been deprecated. Use WN18RR and FB15K-237 instead.
Warning
FB15K-237’s validation set contains 8 unseen entities over 9 triples. The test set has 29 unseen entities, distributed over 28 triples. WN18RR’s validation set contains 198 unseen entities over 210 triples. The test set has 209 unseen entities, distributed over 210 triples.
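To make the notion of unseen entities concrete, here is a small sketch in plain Python, independent of AmpliGraph, that partitions held-out triples by whether both of their entities occur in the training set. The function name `split_seen_unseen` and the toy triples are assumptions made for illustration.

```python
def split_seen_unseen(train, heldout):
    """Partition held-out triples by whether both entities occur in the training set."""
    train_entities = {e for s, _, o in train for e in (s, o)}
    seen, unseen = [], []
    for s, p, o in heldout:
        if s in train_entities and o in train_entities:
            seen.append((s, p, o))
        else:
            unseen.append((s, p, o))
    return seen, unseen

train = [("a", "y", "b"), ("b", "y", "c")]
test = [("a", "y", "c"), ("a", "y", "z"), ("q", "y", "b")]

seen, unseen = split_seen_unseen(train, test)
# Entities of the held-out set that never appear in training.
unseen_entities = {e for s, _, o in test for e in (s, o)} - {e for s, _, o in train for e in (s, o)}
print(seen, unseen, sorted(unseen_entities))
```

A model trained on `train` has no embedding for `z` or `q`, which is why triples containing them cannot be scored.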
Note
WN11 and FB13 also provide true and negative labels for the triples in the validation and test sets. In both cases the positive base rate is close to 50%.
Loaders for Custom Knowledge Graphs¶
Functions to load custom knowledge graphs from disk.
Function | Description |
---|---|
load_from_csv | Load a knowledge graph from a csv file |
load_from_ntriples | Load RDF ntriples |
load_from_rdf | Load an RDF file |
Hint
AmpliGraph includes a helper function to split a generic knowledge graph into training, validation, and test sets. See ampligraph.evaluation.train_test_split_no_unseen().
Models¶
Knowledge Graph Embedding Models¶
Class | Description |
---|---|
RandomBaseline | Random baseline |
TransE | Translating Embeddings (TransE) |
DistMult | The DistMult model |
ComplEx | Complex embeddings (ComplEx) |
HolE | Holographic Embeddings |
ConvE | Convolutional 2D KG Embeddings |
ConvKB | Convolution-based model |
Anatomy of a Model¶
Knowledge graph embeddings are learned by training a neural architecture over a graph. Although such architectures vary, the training phase always consists in minimizing a loss function \(\mathcal{L}\) that includes a scoring function \(f_{m}(t)\), i.e. a model-specific function that assigns a score to a triple \(t=(sub,pred,obj)\).
AmpliGraph models include the following components:
Scoring function \(f(t)\)
Loss function \(\mathcal{L}\)
AmpliGraph comes with a number of such components. They can be used in any combination to come up with a model that performs sufficiently well for the dataset of choice.
AmpliGraph features a number of abstract classes that can be extended to design new models:
Class | Description |
---|---|
EmbeddingModel | Abstract class for embedding models. |
Loss | Abstract class for loss functions. |
Regularizer | Abstract class for regularizers. |
Initializer | Abstract class for initializers. |
Scoring functions¶
Existing models propose scoring functions that combine the embeddings \(\mathbf{e}_{s},\mathbf{r}_{p}, \mathbf{e}_{o} \in \mathbb{R}^k\) of the subject, predicate, and object of a triple \(t=(s,p,o)\) according to different intuitions:
TransE [BUGD+13] relies on distances. The scoring function computes a similarity between the embedding of the subject translated by the embedding of the predicate and the embedding of the object, using the \(L_1\) or \(L_2\) norm \(||\cdot||\):
\[f_{TransE}=-||\mathbf{e}_{s} + \mathbf{r}_{p} - \mathbf{e}_{o}||_n\]
ConvE [DMSR18] uses convolutional layers (in its scoring function, \(g\) is a non-linear activation function, \(\ast\) is the linear convolution operator, and \(vec\) indicates 2D reshaping).
Loss Functions¶
AmpliGraph includes a number of loss functions commonly used in the literature. Each function can be used with any of the implemented models. Loss functions are passed to models as hyperparameters, and can thus be used during model selection.
Class | Description |
---|---|
PairwiseLoss | Pairwise, max-margin loss. |
AbsoluteMarginLoss | Absolute margin, max-margin loss. |
SelfAdversarialLoss | Self adversarial sampling loss. |
NLLLoss | Negative log-likelihood loss. |
NLLMulticlass | Multiclass NLL Loss. |
BCELoss | Binary Cross Entropy Loss. |
Regularizers¶
AmpliGraph includes a number of regularizers that can be used with the loss function.
LPRegularizer supports L1, L2, and L3.
Class | Description |
---|---|
LPRegularizer | Performs LP regularization |
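As a rough illustration of what an LP penalty computes, the sketch below sums the p-th power of the absolute parameter values and scales the result by a weight. This is one common formulation written in plain Python; the function name `lp_regularization` is hypothetical and the exact expression used inside AmpliGraph may differ.

```python
def lp_regularization(params, p=2, lam=1e-5):
    """Return lam * sum(|w| ** p) over all parameter values, for p in {1, 2, 3}."""
    if p not in (1, 2, 3):
        raise ValueError("p must be 1, 2 or 3")
    return lam * sum(abs(w) ** p for w in params)

# Toy parameter vector; in practice this would be the flattened embeddings.
weights = [1.0, -2.0, 3.0]
for p in (1, 2, 3):
    print(p, lp_regularization(weights, p=p, lam=0.5))
```

The arguments `p` and `lam` play the role of the `p` and `lambda` entries passed in `regularizer_params`.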
Initializers¶
AmpliGraph includes a number of initializers that can be used to initialize the embeddings. They can be passed as hyperparameters, and can thus be used during model selection.
Class | Description |
---|---|
RandomNormal | Initializes from a normal distribution with specified mean and standard deviation. |
RandomUniform | Initializes from a uniform distribution with specified lower and upper bounds. |
Xavier | Follows the Xavier strategy for initialization of layers [GB10]. |
Constant | Initializes with the constant values provided by the user. |
Optimizers¶
The goal of the optimization procedure is learning optimal embeddings, such that the scoring function is able to assign high scores to positive statements and low scores to statements unlikely to be true.
We support SGD-based optimizers provided by TensorFlow, selected by setting the optimizer argument in a model initializer.
Best results are currently obtained with Adam.
Class | Description |
---|---|
AdamOptimizer | Wrapper around Adam Optimizer |
AdagradOptimizer | Wrapper around Adagrad Optimizer |
SGDOptimizer | Wrapper around SGD Optimizer |
MomentumOptimizer | Wrapper around Momentum Optimizer |
Evaluation¶
The module includes performance metrics for neural graph embeddings models, along with model selection routines, negatives generation, and an implementation of the learning-to-rank-based evaluation protocol used in literature.
Metrics¶
Learning-to-rank metrics to evaluate the performance of neural graph embedding models.
Function | Description |
---|---|
rank_score | Rank of a triple |
mrr_score | Mean Reciprocal Rank (MRR) |
mr_score | Mean Rank (MR) |
hits_at_n_score | Hits@N |
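The three aggregate metrics are simple functions of the ranks assigned to the positive test triples. The sketch below recomputes them in plain Python to show the definitions; the function names are deliberately different from AmpliGraph's and the list of ranks is made up.

```python
def mean_rank(ranks):
    """MR: arithmetic mean of the ranks (lower is better)."""
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """MRR: mean of 1/rank (higher is better, at most 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_n(ranks, n=10):
    """Hits@N: fraction of ranks that are less than or equal to n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

# Hypothetical ranks of four positive triples against their corruptions.
ranks = [1, 2, 4, 10]

print("MR:", mean_rank(ranks))
print("MRR:", mean_reciprocal_rank(ranks))
print("Hits@3:", hits_at_n(ranks, n=3))
print("Hits@10:", hits_at_n(ranks, n=10))
```

MR is sensitive to a few very poor ranks, whereas MRR and Hits@N emphasize how often the correct triple lands near the top.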
Negatives Generation¶
Negatives generation routines. These are corruption strategies based on the Local Closed-World Assumption (LCWA).
Function | Description |
---|---|
generate_corruptions_for_eval | Generate corruptions for evaluation. |
generate_corruptions_for_fit | Generate corruptions for training. |
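A corruption replaces the subject or the object of a positive triple with another entity; triples produced this way are treated as negatives unless they are already known to be true. The following sketch shows the idea in plain Python. The helper names `generate_corruptions` and `filter_known_positives` are hypothetical and much simpler than AmpliGraph's routines.

```python
def generate_corruptions(triple, entities, sides=("s", "o")):
    """Replace the subject and/or the object with every other entity, one side at a time."""
    s, p, o = triple
    corruptions = []
    if "s" in sides:
        corruptions += [(e, p, o) for e in entities if e != s]
    if "o" in sides:
        corruptions += [(s, p, e) for e in entities if e != o]
    return corruptions

def filter_known_positives(corruptions, known_triples):
    """Drop corruptions that are actually known positive statements."""
    known = set(known_triples)
    return [t for t in corruptions if t not in known]

entities = ["a", "b", "c"]
known = [("a", "y", "b"), ("a", "y", "c")]

raw = generate_corruptions(("a", "y", "b"), entities)
filtered = filter_known_positives(raw, known)
print(raw)
print(filtered)
```

Here the object corruption `('a', 'y', 'c')` is removed by the filter because it is itself a known positive.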
Evaluation & Model Selection¶
Functions to evaluate the predictive power of knowledge graph embedding models, and routines for model selection.
Function | Description |
---|---|
evaluate_performance | Evaluate the performance of an embedding model. |
select_best_model_ranking | Model selection routine for embedding models via either grid search or random search. |
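Grid search amounts to enumerating every combination of the candidate hyperparameter values and training one model per combination. The sketch below expands a flat grid with `itertools.product`; the helper `expand_grid` is hypothetical, and unlike a full model selection routine it does not handle nested dictionaries such as `loss_params`.

```python
from itertools import product

def expand_grid(param_grid):
    """Return one dict per combination of values in a flat parameter grid.

    Scalar values are treated as a single-element list.
    """
    keys = sorted(param_grid)
    value_lists = [param_grid[k] if isinstance(param_grid[k], list) else [param_grid[k]]
                   for k in keys]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]

# Hypothetical flat grid: 2 values of k times 2 values of eta, with a fixed seed.
grid = {"k": [100, 50], "eta": [5, 10], "seed": 0}
combos = expand_grid(grid)

print(len(combos))
for combo in combos:
    print(combo)
```

The number of combinations is the product of the list lengths, which is why large grids are often replaced by random search.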
Helper Functions¶
Utilities and support functions for evaluation procedures.
Function | Description |
---|---|
train_test_split_no_unseen | Split into train and test sets. |
create_mappings | Create string-IDs mappings for entities and relations. |
to_idx | Convert statements (triples) into integer IDs. |
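The last two helpers deal with turning string labels into the integer IDs that embedding lookups need. A minimal sketch of that idea, in plain Python with hypothetical function names and toy triples, could look like this:

```python
def build_mappings(triples):
    """Assign a consecutive integer ID to every relation and entity label."""
    entities = sorted({e for s, _, o in triples for e in (s, o)})
    relations = sorted({p for _, p, _ in triples})
    rel_to_idx = {r: i for i, r in enumerate(relations)}
    ent_to_idx = {e: i for i, e in enumerate(entities)}
    return rel_to_idx, ent_to_idx

def triples_to_ids(triples, ent_to_idx, rel_to_idx):
    """Convert (s, p, o) string triples into integer ID triples."""
    return [(ent_to_idx[s], rel_to_idx[p], ent_to_idx[o]) for s, p, o in triples]

triples = [("b", "y", "a"), ("a", "z", "c")]
rel_to_idx, ent_to_idx = build_mappings(triples)
ids = triples_to_ids(triples, ent_to_idx, rel_to_idx)
print(ent_to_idx, rel_to_idx, ids)
```

Sorting the labels before numbering them makes the mapping deterministic across runs.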
Discovery¶
This module includes a number of functions to perform knowledge discovery in graph embeddings.
Functions provided include discover_facts, which will generate candidate statements using one of several defined strategies and return triples that perform well when evaluated against corruptions; find_clusters, which will perform link-based cluster analysis on a knowledge graph; find_duplicates, which will find duplicate entities in a graph based on their embeddings; and query_topn, which when given two elements of a triple will return the top_n results of all possible completions ordered by predicted score.
Function | Description |
---|---|
discover_facts | Discover new facts from an existing knowledge graph. |
find_clusters | Perform link-based cluster analysis on a knowledge graph. |
find_duplicates | Find duplicate entities, relations or triples in a graph based on their embeddings. |
query_topn | Queries the model with two elements of a triple and returns the top_n results of all possible completions ordered by score predicted by the model. |
Utils¶
This module contains utility functions for neural knowledge graph embedding models.
Saving/Restoring Models¶
Models can be saved and restored from disk. This is useful to avoid re-training a model.
Function | Description |
---|---|
save_model | Save a trained model to disk. |
restore_model | Restore a saved model from disk. |
Visualization¶
Functions to visualize embeddings.
Function | Description |
---|---|
create_tensorboard_visualizations | Export embeddings to Tensorboard. |
Others¶
Function to convert a pandas DataFrame with headers into triples.
Function | Description |
---|---|
dataframe_to_triples | Convert DataFrame into triple format. |
How to Contribute¶
Git Repo and Issue Tracking¶
AmpliGraph repository is available on GitHub.
A list of open issues is available here.
How to Contribute¶
We welcome community contributions, whether they are new models, tests, or documentation.
You can contribute to AmpliGraph in many ways:
Raise a bug report
File a feature request
Help other users by commenting on the issue tracking system
Add unit tests
Improve the documentation
Add a new graph embedding model (see below)
Adding Your Own Model¶
The landscape of knowledge graph embeddings evolves rapidly. We welcome new models as a contribution to AmpliGraph, which has been built to provide a shared codebase that guarantees a fair evaluation and comparison across models.
You can add your own model by raising a pull request.
To get started, read the documentation on how current models have been implemented.
Developer Notes¶
Additional documentation on data adapters, AmpliGraph support for large graphs, and further technical details is available here.
Clone and Install in editable mode¶
Clone the repository and check out the develop branch.
Install from source with pip. Use the -e flag to enable editable mode:
git clone https://github.com/Accenture/AmpliGraph.git
cd AmpliGraph
git checkout develop
pip install -e .
Unit Tests¶
To run all the unit tests:
$ pytest tests
See pytest documentation for additional arguments.
Documentation¶
The project documentation is based on Sphinx and can be built on your local working copy as follows:
cd docs
make clean autogen html
The above generates an HTML version of the documentation under docs/_build/html.
Packaging¶
To build an AmpliGraph custom wheel, do the following:
pip wheel --wheel-dir dist --no-deps .
Examples¶
These examples show how to get started with AmpliGraph APIs, and cover a range of useful tasks. Note that additional tutorials are also available.
Train and evaluate an embedding model¶
import numpy as np
from ampligraph.datasets import load_wn18
from ampligraph.latent_features import ComplEx
from ampligraph.evaluation import evaluate_performance, mrr_score, hits_at_n_score

def main():
    # Load the WordNet18 dataset:
    X = load_wn18()

    # Initialize a ComplEx neural embedding model with pairwise loss function.
    # The model will be trained for the number of epochs set below (epochs=20).
    model = ComplEx(batches_count=10, seed=0, epochs=20, k=150, eta=10,
                    # Use adam optimizer with learning rate 1e-3
                    optimizer='adam', optimizer_params={'lr':1e-3},
                    # Use pairwise loss with margin 0.5
                    loss='pairwise', loss_params={'margin':0.5},
                    # Use L2 regularizer with regularizer weight 1e-5
                    regularizer='LP', regularizer_params={'p':2, 'lambda':1e-5},
                    # Enable stdout messages (set to False if you don't want to display them)
                    verbose=True)

    # For evaluation, we can use a filter to remove the positive statements
    # accidentally created by the corruption procedure.
    # Here we define the filter set by concatenating all the positives.
    # (Named positives_filter to avoid shadowing the built-in filter().)
    positives_filter = np.concatenate((X['train'], X['valid'], X['test']))

    # Fit the model on the training set, using the validation set for early stopping
    model.fit(X['train'],
              early_stopping = True,
              early_stopping_params = \
                  {
                      'x_valid': X['valid'],       # validation set
                      'criteria':'hits10',         # Uses hits10 criteria for early stopping
                      'burn_in': 100,              # early stopping kicks in after 100 epochs
                                                   # (never reached with epochs=20 above)
                      'check_interval':20,         # validates every 20th epoch
                      'stop_interval':5,           # stops if 5 successive validation checks are bad.
                      'x_filter': positives_filter,  # Use filter for filtering out positives
                      'corruption_entities':'all', # corrupt using all entities
                      'corrupt_side':'s+o'         # corrupt subject and object (but not at once)
                  }
              )

    # Run the evaluation procedure on the test set (with filtering).
    # To disable filtering: filter_triples=None
    # Usually, we corrupt subject and object sides separately and compute ranks
    ranks = evaluate_performance(X['test'],
                                 model=model,
                                 filter_triples=positives_filter,
                                 use_default_protocol=True,  # corrupt subj and obj separately while evaluating
                                 verbose=True)

    # Compute and print metrics:
    mrr = mrr_score(ranks)
    hits_10 = hits_at_n_score(ranks, n=10)
    print("MRR: %f, Hits@10: %f" % (mrr, hits_10))
    # Output: MRR: 0.886406, Hits@10: 0.935000

if __name__ == "__main__":
    main()
Model selection¶
from ampligraph.datasets import load_wn18
from ampligraph.latent_features import ComplEx
from ampligraph.evaluation import select_best_model_ranking

def main():
    # Load the WordNet18 dataset:
    X_dict = load_wn18()
    model_class = ComplEx

    # Use the template given below for doing grid search.
    param_grid = {
        "batches_count": [10],
        "seed": 0,
        "epochs": [4000],
        "k": [100, 50],
        "eta": [5,10],
        "loss": ["pairwise", "nll", "self_adversarial"],
        # We take care of mapping the params to corresponding classes
        "loss_params": {
            # margin corresponding to both pairwise and adversarial loss
            "margin": [0.5, 20],
            # alpha corresponding to adversarial loss
            "alpha": [0.5]
        },
        "embedding_model_params": {
            # generate corruption using all entities during training
            "negative_corruption_entities":"all"
        },
        "regularizer": [None, "LP"],
        "regularizer_params": {
            "p": [2],
            "lambda": [1e-4, 1e-5]
        },
        "optimizer": ["adam"],
        "optimizer_params":{
            "lr": [0.01, 0.0001]
        },
        "verbose": True
    }

    # Train the model on all possible combinations of hyperparameters.
    # Models are validated on the validation set.
    # It returns a model re-trained on training and validation sets.
    best_model, best_params, best_mrr_train, \
    ranks_test, mrr_test = select_best_model_ranking(model_class,  # Class handle of the model to be used
                                                     # Dataset
                                                     X_dict['train'],
                                                     X_dict['valid'],
                                                     X_dict['test'],
                                                     # Parameter grid
                                                     param_grid,
                                                     # Use filtered set for eval
                                                     use_filter=True,
                                                     # corrupt subject and objects separately during eval
                                                     use_default_protocol=True,
                                                     # Log all the model hyperparams and evaluation stats
                                                     verbose=True)
    print(type(best_model).__name__, best_params, best_mrr_train, mrr_test)

if __name__ == "__main__":
    main()
Get the embeddings¶
import numpy as np
from ampligraph.latent_features import ComplEx
model = ComplEx(batches_count=1, seed=555, epochs=20, k=10)
X = np.array([['a', 'y', 'b'],
['b', 'y', 'a'],
['a', 'y', 'c'],
['c', 'y', 'a'],
['a', 'y', 'd'],
['c', 'y', 'd'],
['b', 'y', 'c'],
['f', 'y', 'e']])
model.fit(X)
model.get_embeddings(['f','e'], embedding_type='entity')
Save and restore a model¶
import numpy as np
from ampligraph.latent_features import ComplEx
from ampligraph.utils import save_model, restore_model
model = ComplEx(batches_count=2, seed=555, epochs=20, k=10)
X = np.array([['a', 'y', 'b'],
['b', 'y', 'a'],
['a', 'y', 'c'],
['c', 'y', 'a'],
['a', 'y', 'd'],
['c', 'y', 'd'],
['b', 'y', 'c'],
['f', 'y', 'e']])
model.fit(X)
# Use the trained model to predict
y_pred_before = model.predict(np.array([['f', 'y', 'e'], ['b', 'y', 'd']]))
print(y_pred_before)
#[-0.29721245, 0.07865551]
# Save the model
example_name = "helloworld.pkl"
save_model(model, model_name_path = example_name)
# Restore the model
restored_model = restore_model(model_name_path = example_name)
# Use the restored model to predict
y_pred_after = restored_model.predict(np.array([['f', 'y', 'e'], ['b', 'y', 'd']]))
print(y_pred_after)
# [-0.29721245, 0.07865551]
Split dataset into train/test or train/valid/test¶
import numpy as np
from ampligraph.evaluation import train_test_split_no_unseen
from ampligraph.datasets import load_from_csv
'''
Assume we have a knowledge graph stored in my_folder/my_graph.csv,
and that the content of such file is:
a,y,b
f,y,e
b,y,a
a,y,c
c,y,a
a,y,d
c,y,d
b,y,c
f,y,e
'''
# Load the graph in memory
X = load_from_csv('my_folder', 'my_graph.csv', sep=',')
# To split the graph in train and test sets:
# (In this toy example the test set will include only two triples)
X_train, X_test = train_test_split_no_unseen(X, test_size=2)
print(X_train)
'''
X_train:[['a' 'y' 'b']
['f' 'y' 'e']
['b' 'y' 'a']
['c' 'y' 'a']
['c' 'y' 'd']
['b' 'y' 'c']
['f' 'y' 'e']]
'''
print(X_test)
'''
X_test: [['a' 'y' 'c']
['a' 'y' 'd']]
'''
# To split the graph in train, validation, and test the method must be called twice:
X_train_valid, X_test = train_test_split_no_unseen(X, test_size=2)
X_train, X_valid = train_test_split_no_unseen(X_train_valid, test_size=2)
print(X_train)
'''
X_train: [['a' 'y' 'b']
['b' 'y' 'a']
['c' 'y' 'd']
['b' 'y' 'c']
['f' 'y' 'e']]
'''
print(X_valid)
'''
X_valid: [['f' 'y' 'e']
['c' 'y' 'a']]
'''
print(X_test)
'''
X_test: [['a' 'y' 'c']
['a' 'y' 'd']]
'''
Clustering and projecting embeddings into 2D space¶
Embedding training¶
import numpy as np
import pandas as pd
import requests
from ampligraph.datasets import load_from_csv
from ampligraph.latent_features import ComplEx
from ampligraph.evaluation import evaluate_performance
from ampligraph.evaluation import mr_score, mrr_score, hits_at_n_score
from ampligraph.evaluation import train_test_split_no_unseen
# International football matches triples
url = 'https://ampligraph.s3-eu-west-1.amazonaws.com/datasets/football.csv'
# Download the file; the with-block makes sure it is closed before it is read back
with open('football.csv', 'wb') as f:
    f.write(requests.get(url).content)
X = load_from_csv('.', 'football.csv', sep=',')[:, 1:]
# Train test split
X_train, X_test = train_test_split_no_unseen(X, test_size=10000)
# ComplEx model
model = ComplEx(batches_count=50,
epochs=300,
k=100,
eta=20,
optimizer='adam',
optimizer_params={'lr':1e-4},
loss='multiclass_nll',
regularizer='LP',
regularizer_params={'p':3, 'lambda':1e-5},
seed=0,
verbose=True)
model.fit(X_train)
Embedding evaluation¶
filter_triples = np.concatenate((X_train, X_test))
ranks = evaluate_performance(X_test,
model=model,
filter_triples=filter_triples,
use_default_protocol=True,
verbose=True)
mr = mr_score(ranks)
mrr = mrr_score(ranks)
print("MRR: %.2f" % (mrr))
print("MR: %.2f" % (mr))
hits_10 = hits_at_n_score(ranks, n=10)
print("Hits@10: %.2f" % (hits_10))
hits_3 = hits_at_n_score(ranks, n=3)
print("Hits@3: %.2f" % (hits_3))
hits_1 = hits_at_n_score(ranks, n=1)
print("Hits@1: %.2f" % (hits_1))
'''
MRR: 0.25
MR: 4927.33
Hits@10: 0.35
Hits@3: 0.28
Hits@1: 0.19
'''
Clustering and 2D projections¶
Please install the adjustText library first with pip install adjustText.
For incf.countryutils, do the following steps:
git clone https://github.com/wyldebeast-wunderliebe/incf.countryutils.git
cd incf.countryutils
pip install .
incf.countryutils is used to map countries to the corresponding continents.
import re
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
from adjustText import adjust_text
from incf.countryutils import transformations
from ampligraph.discovery import find_clusters
# Get the teams entities and their corresponding embeddings
triples_df = pd.DataFrame(X, columns=['s', 'p', 'o'])
teams = triples_df.s[triples_df.s.str.startswith('Team')].unique()
team_embeddings = dict(zip(teams, model.get_embeddings(teams)))
team_embeddings_array = np.array([i for i in team_embeddings.values()])
# Project embeddings into 2D space via PCA
embeddings_2d = PCA(n_components=2).fit_transform(team_embeddings_array)
# Cluster embeddings (on the original space)
clustering_algorithm = KMeans(n_clusters=6, n_init=100, max_iter=500, random_state=0)
clusters = find_clusters(teams, model, clustering_algorithm, mode='entity')
# This function maps country to continent
def cn_to_ctn(country):
    try:
        original_name = ' '.join(re.findall('[A-Z][^A-Z]*', country[4:]))
        return transformations.cn_to_ctn(original_name)
    except KeyError:
        return "unk"
plot_df = pd.DataFrame({"teams": teams,
"embedding1": embeddings_2d[:, 0],
"embedding2": embeddings_2d[:, 1],
"continent": pd.Series(teams).apply(cn_to_ctn),
"cluster": "cluster" + pd.Series(clusters).astype(str)})
# Top 20 teams in 2019 according to FIFA rankings
top20teams = ["TeamBelgium", "TeamFrance", "TeamBrazil", "TeamEngland", "TeamPortugal",
"TeamCroatia", "TeamSpain", "TeamUruguay", "TeamSwitzerland", "TeamDenmark",
"TeamArgentina", "TeamGermany", "TeamColombia", "TeamItaly", "TeamNetherlands",
"TeamChile", "TeamSweden", "TeamMexico", "TeamPoland", "TeamIran"]
np.random.seed(0)
# Plot 2D embeddings with country labels
def plot_clusters(hue):
    plt.figure(figsize=(12, 12))
    plt.title("{} embeddings".format(hue).capitalize())
    ax = sns.scatterplot(data=plot_df[plot_df.continent!="unk"],
                         x="embedding1", y="embedding2", hue=hue)
    texts = []
    for i, point in plot_df.iterrows():
        if point["teams"] in top20teams or np.random.random() < 0.1:
            texts.append(plt.text(point['embedding1']+0.02,
                                  point['embedding2']+0.01,
                                  str(point["teams"])))
    adjust_text(texts)
Tutorials¶
For a comprehensive theoretical and hands-on overview of KGE models and AmpliGraph, check out our ECAI-20 Tutorial (Slides + Recording + Colab Notebook).
The following Jupyter notebooks will guide you through the most important features of AmpliGraph:
AmpliGraph basics: training, saving and restoring a model, evaluating a model, discover new links, visualize embeddings. [Jupyter notebook] [Colab notebook]
Link-based clustering and classification: how to use the knowledge embeddings generated by a graph of international football matches in clustering and classification tasks. [Jupyter notebook] [Colab notebook]
Additional examples and code snippets are available here.
Performance¶
Predictive Performance¶
We report the filtered MR, MRR, Hits@1,3,10 for the most common datasets used in the literature.
Results are computed assigning the worst rank to a positive test triple in case of a tie with its synthetic negatives. Although this is the most conservative approach, some published literature may adopt an evaluation protocol that assigns the best rank instead. Check out the documentation of ampligraph.evaluation.evaluate_performance() for details.
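The effect of the tie-breaking choice is easy to see on a small example. The sketch below ranks a positive score against the scores of its corruptions under both conventions; the function `rank_of_positive` and its `strategy` argument are hypothetical names used only for this illustration.

```python
def rank_of_positive(pos_score, neg_scores, strategy="worst"):
    """Rank of the positive among its negatives (1 is best).

    'worst' places the positive after every negative with an equal score,
    'best' places it before them.
    """
    better = sum(1 for s in neg_scores if s > pos_score)
    ties = sum(1 for s in neg_scores if s == pos_score)
    if strategy == "worst":
        return better + ties + 1
    if strategy == "best":
        return better + 1
    raise ValueError("strategy must be 'worst' or 'best'")

# One negative scores higher than the positive and two negatives tie with it.
pos_score = 0.5
neg_scores = [0.9, 0.5, 0.5, 0.1]

print(rank_of_positive(pos_score, neg_scores, "worst"))
print(rank_of_positive(pos_score, neg_scores, "best"))
```

When a model outputs many identical scores, the two conventions can differ substantially, which is why the protocol matters when comparing with published numbers.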
Note
On ConvE Evaluation. Results reported in the literature for ConvE are based on the alternative 1-N evaluation protocol which requires that reciprocal relations are added to the dataset [DMSR18]:
During training each unique pair of subject and predicate can predict all possible object scores for that pairs, and therefore object corruptions evaluation can be performed with a single forward pass:
In the standard corruption procedure the subject entity is replaced by corruptions:
However in the 1-N protocol subject corruptions are interpreted as object corruptions of the reciprocal relation:
To reproduce the results reported in the literature using the 1-N evaluation protocol, add reciprocal relations by specifying add_reciprocal_rels in the dataset loader function, e.g. load_fb15k(add_reciprocal_rels=True), and run the evaluation protocol with object corruptions only by specifying corrupt_side='o'.
Results obtained with the standard evaluation protocol are labeled ConvE, while those obtained with the 1-N protocol are marked ConvE(1-N).
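Adding reciprocal relations means that every triple gains a mirrored counterpart with subject and object swapped and a new, inverse relation label. A minimal sketch of that transformation follows; the helper name and the `_reciprocal` suffix are assumptions for illustration and may not match the labels produced by the dataset loaders.

```python
def add_reciprocal_relations(triples, suffix="_reciprocal"):
    """Append (o, p + suffix, s) for every (s, p, o) in the input."""
    triples = list(triples)
    return triples + [(o, p + suffix, s) for s, p, o in triples]

original = [("a", "y", "b"), ("b", "z", "c")]
augmented = add_reciprocal_relations(original)
print(augmented)
```

With the augmented dataset, a subject corruption of `(a, y, b)` can be scored as an object corruption of `(b, y_reciprocal, a)`, so only object corruptions need to be evaluated.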
FB15K-237¶
Model | MR | MRR | Hits@1 | Hits@3 | Hits@10 | Hyperparameters |
---|---|---|---|---|---|---|
TransE | 208 | 0.31 | 0.22 | 0.35 | 0.50 | k: 400; epochs: 4000; eta: 30; loss: multiclass_nll; regularizer: LP; regularizer_params: lambda: 0.0001; p: 2; optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params: norm: 1; normalize_ent_emb: false; seed: 0; batches_count: 64; |
DistMult | 199 | 0.31 | 0.22 | 0.35 | 0.49 | k: 300; epochs: 4000; eta: 50; loss: multiclass_nll; regularizer: LP; regularizer_params: lambda: 0.0001; p: 3; optimizer: adam; optimizer_params: lr: 0.00005; seed: 0; batches_count: 50; normalize_ent_emb: false; |
ComplEx | 184 | 0.32 | 0.23 | 0.35 | 0.50 | k: 350; epochs: 4000; eta: 30; loss: multiclass_nll; optimizer: adam; optimizer_params: lr: 0.00005; seed: 0; regularizer: LP; regularizer_params: lambda: 0.0001; p: 3; batches_count: 64; |
HolE | 184 | 0.31 | 0.22 | 0.34 | 0.49 | k: 350; epochs: 4000; eta: 50; loss: multiclass_nll; regularizer: LP; regularizer_params: lambda: 0.0001; p: 2; optimizer: adam; optimizer_params: lr: 0.0001; seed: 0; batches_count: 64; |
ConvKB | 327 | 0.23 | 0.15 | 0.25 | 0.40 | k: 200; epochs: 500; eta: 10; loss: multiclass_nll; loss_params: {} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ num_filters: 32, filter_sizes: 1, dropout: 0.1}; seed: 0; batches_count: 300; |
ConvE | 1060 | 0.26 | 0.19 | 0.28 | 0.38 | k: 200; epochs: 4000; loss: bce; loss_params: {label_smoothing=0.1} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ conv_filters: 32, conv_kernel_size: 3, dropout_embed: 0.2, dropout_conv: 0.1, dropout_dense: 0.3, use_batchnorm: True, use_bias: True}; seed: 0; batches_count: 100; |
ConvE(1-N) | 234 | 0.32 | 0.23 | 0.35 | 0.50 | k: 200; epochs: 4000; loss: bce; loss_params: {label_smoothing=0.1} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ conv_filters: 32, conv_kernel_size: 3, dropout_embed: 0.2, dropout_conv: 0.1, dropout_dense: 0.3, use_batchnorm: True, use_bias: True}; seed: 0; batches_count: 100; |
Note
FB15K-237 validation and test sets include triples with entities that do not occur in the training set. We found 8 unseen entities in the validation set and 29 in the test set. In the experiments we excluded the triples where such entities appear (9 triples from the validation set and 28 from the test set).
WN18RR¶
Model | MR | MRR | Hits@1 | Hits@3 | Hits@10 | Hyperparameters |
---|---|---|---|---|---|---|
TransE | 2692 | 0.22 | 0.03 | 0.37 | 0.54 | k: 350; epochs: 4000; eta: 30; loss: multiclass_nll; optimizer: adam; optimizer_params: lr: 0.0001; regularizer: LP; regularizer_params: lambda: 0.0001; p: 2; seed: 0; normalize_ent_emb: false; embedding_model_params: norm: 1; batches_count: 150; |
DistMult | 5531 | 0.47 | 0.43 | 0.48 | 0.53 | k: 350; epochs: 4000; eta: 30; loss: multiclass_nll; optimizer: adam; optimizer_params: lr: 0.0001; regularizer: LP; regularizer_params: lambda: 0.0001; p: 2; seed: 0; normalize_ent_emb: false; batches_count: 100; |
ComplEx | 4177 | 0.51 | 0.46 | 0.53 | 0.58 | k: 200; epochs: 4000; eta: 20; loss: multiclass_nll; loss_params: margin: 1; optimizer: adam; optimizer_params: lr: 0.0005; seed: 0; regularizer: LP; regularizer_params: lambda: 0.05; p: 3; batches_count: 10; |
HolE | 7028 | 0.47 | 0.44 | 0.48 | 0.53 | k: 200; epochs: 4000; eta: 20; loss: self_adversarial; loss_params: margin: 1; optimizer: adam; optimizer_params: lr: 0.0005; seed: 0; batches_count: 50; |
ConvKB | 3652 | 0.39 | 0.33 | 0.42 | 0.48 | k: 200; epochs: 500; eta: 10; loss: multiclass_nll; loss_params: {} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ num_filters: 32, filter_sizes: 1, dropout: 0.1}; seed: 0; batches_count: 300; |
ConvE | 5346 | 0.45 | 0.42 | 0.47 | 0.52 | k: 200; epochs: 4000; loss: bce; loss_params: {label_smoothing=0.1} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ conv_filters: 32, conv_kernel_size: 3, dropout_embed: 0.2, dropout_conv: 0.1, dropout_dense: 0.3, use_batchnorm: True, use_bias: True}; seed: 0; batches_count: 100; |
ConvE(1-N) | 4842 | 0.48 | 0.45 | 0.49 | 0.54 | k: 200; epochs: 4000; loss: bce; loss_params: {label_smoothing=0.1} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ conv_filters: 32, conv_kernel_size: 3, dropout_embed: 0.2, dropout_conv: 0.1, dropout_dense: 0.3, use_batchnorm: True, use_bias: True}; seed: 0; batches_count: 100; |
Note
WN18RR validation and test sets include triples with entities that do not occur in the training set. We found 198 unseen entities in the validation set and 209 in the test set. In the experiments we excluded the triples where such entities appear (210 triples from the validation set and 210 from the test set).
YAGO3-10¶
Model | MR | MRR | Hits@1 | Hits@3 | Hits@10 | Hyperparameters |
---|---|---|---|---|---|---|
TransE | 1264 | 0.51 | 0.41 | 0.57 | 0.67 | k: 350; epochs: 4000; eta: 30; loss: multiclass_nll; optimizer: adam; optimizer_params: lr: 0.0001; regularizer: LP; regularizer_params: lambda: 0.0001; p: 2; embedding_model_params: norm: 1; normalize_ent_emb: false; seed: 0; batches_count: 100; |
DistMult | 1107 | 0.50 | 0.41 | 0.55 | 0.66 | k: 350; epochs: 4000; eta: 50; loss: multiclass_nll; optimizer: adam; optimizer_params: lr: 5e-05; regularizer: LP; regularizer_params: lambda: 0.0001; p: 3; seed: 0; normalize_ent_emb: false; batches_count: 100; |
ComplEx | 1227 | 0.49 | 0.40 | 0.54 | 0.66 | k: 350; epochs: 4000; eta: 30; loss: multiclass_nll; optimizer: adam; optimizer_params: lr: 5e-05; regularizer: LP; regularizer_params: lambda: 0.0001; p: 3; seed: 0; batches_count: 100 |
HolE | 6776 | 0.50 | 0.42 | 0.56 | 0.65 | k: 350; epochs: 4000; eta: 30; loss: self_adversarial; loss_params: alpha: 1; margin: 0.5; optimizer: adam; optimizer_params: lr: 0.0001; seed: 0; batches_count: 100 |
ConvKB | 2820 | 0.30 | 0.21 | 0.34 | 0.50 | k: 200; epochs: 500; eta: 10; loss: multiclass_nll; loss_params: {} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ num_filters: 32, filter_sizes: 1, dropout: 0.1}; seed: 0; batches_count: 3000; |
ConvE | 6063 | 0.40 | 0.33 | 0.42 | 0.53 | k: 300; epochs: 4000; loss: bce; loss_params: {label_smoothing=0.1} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ conv_filters: 32, conv_kernel_size: 3, dropout_embed: 0.2, dropout_conv: 0.1, dropout_dense: 0.3, use_batchnorm: True, use_bias: True}; seed: 0; batches_count: 300; |
ConvE(1-N) | 2741 | 0.55 | 0.48 | 0.60 | 0.69 | k: 300; epochs: 4000; loss: bce; loss_params: {label_smoothing=0.1} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ conv_filters: 32, conv_kernel_size: 3, dropout_embed: 0.2, dropout_conv: 0.1, dropout_dense: 0.3, use_batchnorm: True, use_bias: True}; seed: 0; batches_count: 300; |
Note
YAGO3-10 validation and test sets include triples with entities that do not occur in the training set. We found 22 unseen entities in the validation set and 18 in the test set. In the experiments we excluded the triples where such entities appear (22 triples from the validation set and 18 from the test set).
FB15K¶
Warning
The dataset includes a large number of inverse relations, and its use in experiments has been deprecated. Use FB15k-237 instead.
Model | MR | MRR | Hits@1 | Hits@3 | Hits@10 | Hyperparameters |
---|---|---|---|---|---|---|
TransE | 44 | 0.63 | 0.50 | 0.73 | 0.85 | k: 150; epochs: 4000; eta: 10; loss: multiclass_nll; optimizer: adam; optimizer_params: lr: 5e-5; regularizer: LP; regularizer_params: lambda: 0.0001; p: 3; embedding_model_params: norm: 1; normalize_ent_emb: false; seed: 0; batches_count: 100; |
DistMult | 179 | 0.78 | 0.74 | 0.82 | 0.86 | k: 200; epochs: 4000; eta: 20; loss: self_adversarial; loss_params: margin: 1; optimizer: adam; optimizer_params: lr: 0.0005; seed: 0; normalize_ent_emb: false; batches_count: 50; |
ComplEx | 184 | 0.80 | 0.76 | 0.82 | 0.86 | k: 200; epochs: 4000; eta: 20; loss: self_adversarial; loss_params: margin: 1; optimizer: adam; optimizer_params: lr: 0.0005; seed: 0; batches_count: 100; |
HolE | 216 | 0.80 | 0.76 | 0.83 | 0.87 | k: 200; epochs: 4000; eta: 20; loss: self_adversarial; loss_params: margin: 1; optimizer: adam; optimizer_params: lr: 0.0005; seed: 0; batches_count: 50; |
ConvKB | 331 | 0.65 | 0.55 | 0.71 | 0.82 | k: 200; epochs: 500; eta: 10; loss: multiclass_nll; loss_params: {} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ num_filters: 32, filter_sizes: 1, dropout: 0.1}; seed: 0; batches_count: 300; |
ConvE | 385 | 0.50 | 0.42 | 0.52 | 0.66 | k: 300; epochs: 4000; loss: bce; loss_params: {label_smoothing=0.1} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ conv_filters: 32, conv_kernel_size: 3, dropout_embed: 0.2, dropout_conv: 0.1, dropout_dense: 0.3, use_batchnorm: True, use_bias: True}; seed: 0; batches_count: 100; |
ConvE(1-N) | 55 | 0.80 | 0.74 | 0.84 | 0.89 | k: 300; epochs: 4000; loss: bce; loss_params: {label_smoothing=0.1} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ conv_filters: 32, conv_kernel_size: 3, dropout_embed: 0.2, dropout_conv: 0.1, dropout_dense: 0.3, use_batchnorm: True, use_bias: True}; seed: 0; batches_count: 100; |
WN18¶
Warning
The dataset includes a large number of inverse relations, and its use in experiments has been deprecated. Use WN18RR instead.
Model | MR | MRR | Hits@1 | Hits@3 | Hits@10 | Hyperparameters |
---|---|---|---|---|---|---|
TransE | 260 | 0.66 | 0.44 | 0.88 | 0.95 | k: 150; epochs: 4000; eta: 10; loss: multiclass_nll; optimizer: adam; optimizer_params: lr: 5e-5; regularizer: LP; regularizer_params: lambda: 0.0001; p: 3; embedding_model_params: norm: 1; normalize_ent_emb: false; seed: 0; batches_count: 100; |
DistMult | 675 | 0.82 | 0.73 | 0.92 | 0.95 | k: 200; epochs: 4000; eta: 20; loss: nll; loss_params: margin: 1; optimizer: adam; optimizer_params: lr: 0.0005; seed: 0; normalize_ent_emb: false; batches_count: 50; |
ComplEx | 726 | 0.94 | 0.94 | 0.95 | 0.95 | k: 200; epochs: 4000; eta: 20; loss: nll; loss_params: margin: 1; optimizer: adam; optimizer_params: lr: 0.0005; seed: 0; batches_count: 50; |
HolE | 665 | 0.94 | 0.93 | 0.94 | 0.95 | k: 200; epochs: 4000; eta: 20; loss: self_adversarial; loss_params: margin: 1; optimizer: adam; optimizer_params: lr: 0.0005; seed: 0; batches_count: 50; |
ConvKB | 331 | 0.80 | 0.69 | 0.90 | 0.94 | k: 200; epochs: 500; eta: 10; loss: multiclass_nll; loss_params: {} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ num_filters: 32, filter_sizes: 1, dropout: 0.1}; seed: 0; batches_count: 300; |
ConvE | 492 | 0.93 | 0.91 | 0.94 | 0.95 | k: 300; epochs: 4000; loss: bce; loss_params: {label_smoothing=0.1} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ conv_filters: 32, conv_kernel_size: 3, dropout_embed: 0.2, dropout_conv: 0.1, dropout_dense: 0.3, use_batchnorm: True, use_bias: True}; seed: 0; batches_count: 100; |
ConvE(1-N) | 436 | 0.95 | 0.93 | 0.95 | 0.95 | k: 300; epochs: 4000; loss: bce; loss_params: {label_smoothing=0.1} optimizer: adam; optimizer_params: lr: 0.0001; embedding_model_params:{ conv_filters: 32, conv_kernel_size: 3, dropout_embed: 0.2, dropout_conv: 0.1, dropout_dense: 0.3, use_batchnorm: True, use_bias: True}; seed: 0; batches_count: 100; |
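For readers unfamiliar with the column headers used in the tables above, the following pure-Python sketch shows how MR, MRR and Hits@N are conventionally computed from a list of ranks (rank 1 meaning the true triple scored best among its corruptions). The function names and the toy ranks are illustrative assumptions; AmpliGraph's evaluation module provides its own implementations.

```python
def mean_rank(ranks):
    """MR: arithmetic mean of the ranks (lower is better)."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """MRR: mean of 1/rank (higher is better, at most 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_n(ranks, n):
    """Hits@N: fraction of ranks that are <= n (higher is better)."""
    return sum(1 for r in ranks if r <= n) / len(ranks)


# Toy ranks (made up for illustration): one entry per test triple.
ranks = [1, 2, 10, 50]

print(mean_rank(ranks))
print(round(mean_reciprocal_rank(ranks), 3))
print(hits_at_n(ranks, 1), hits_at_n(ranks, 3), hits_at_n(ranks, 10))
```

A single badly ranked triple inflates MR far more than it lowers MRR, which is why a model can show a high MR next to a competitive MRR in the tables.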
To reproduce the above results:
$ cd experiments
$ python predictive_performance.py
Note
Running predictive_performance.py on all datasets, for all models, takes ~115 hours on an Intel Xeon Gold 6142, 64 GB Ubuntu 16.04 box equipped with a Tesla V100 16GB.
The long running time is mostly due to the early stopping configuration (see the note below).
Note
All of the experiments above were conducted with early stopping on half the validation set.
Typically, the validation set can be found in X['valid'].
We only used half the validation set so the other half is available for hyperparameter tuning.
The exact early stopping configuration is as follows:
x_valid: validation[::2]
criteria: mrr
x_filter: train + validation + test
stop_interval: 4
burn_in: 0
check_interval: 50
Note that early stopping adds a significant computational burden to the learning procedure. To lessen it, you may shrink the validation set, decrease the stop interval, increase the check interval, or increase the burn-in.
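The interplay of the parameters listed above can be sketched in plain Python. This is a simplified simulation under stated assumptions, not the library's training loop: `mrr_at_epoch` is a hypothetical stand-in for evaluating the model on the validation half, and the stopping rule (stop after `stop_interval` consecutive checks without improvement) is our reading of the parameter names.

```python
def split_validation(valid):
    """Every other triple for early stopping (valid[::2]), the rest for tuning."""
    return valid[::2], valid[1::2]


def early_stopping_epoch(mrr_at_epoch, max_epochs,
                         burn_in=0, check_interval=50, stop_interval=4):
    """Return the epoch at which training would stop under the assumed rule."""
    best = float("-inf")
    checks_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        # Validation runs only after burn-in, once every check_interval epochs.
        if epoch < burn_in or epoch % check_interval != 0:
            continue
        current = mrr_at_epoch(epoch)
        if current > best:
            best = current
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement == stop_interval:
                return epoch
    return max_epochs


# Toy validation curve (made up): MRR improves until epoch 200, then plateaus.
def toy_mrr(epoch):
    return min(epoch, 200) / 1000


early_half, tuning_half = split_validation(list(range(5)))
print(early_half, tuning_half)
print(early_stopping_epoch(toy_mrr, max_epochs=4000))
```

Each check costs one full ranking pass over the validation half, so the number of checks before stopping drives the overhead.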
Note
Due to the combination of model and dataset size, it is not possible to evaluate ConvKB on YAGO3-10 on the GPU. The fastest way to replicate the results above is to train ConvKB on YAGO3-10 on a GPU using the hyperparameters described above (~15 hours on a GTX 1080Ti), and then evaluate the model in CPU-only mode (~15 hours on an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz).
Note
ConvKB with the early-stopping evaluation protocol does not fit into GPU memory, so it is instead trained for a fixed number of epochs.
Experiments can be limited to specific model-dataset combinations as follows:
$ python predictive_performance.py -h
usage: predictive_performance.py [-h] [-d {fb15k,fb15k-237,wn18,wn18rr,yago310}]
[-m {complex,transe,distmult,hole,convkb,conve}]
optional arguments:
-h, --help show this help message and exit
-d {fb15k,fb15k-237,wn18,wn18rr,yago310}, --dataset {fb15k,fb15k-237,wn18,wn18rr,yago310}
-m {complex,transe,distmult,hole,convkb,conve}, --model {complex,transe,distmult,hole,convkb,conve}
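As a rough illustration of how such an interface behaves, the sketch below rebuilds an equivalent parser with `argparse` from the help text above. It is a reconstruction, not the source of predictive_performance.py; in particular, the helper `selected_runs` and the assumption that an omitted flag means "run every value" are ours.

```python
import argparse

DATASETS = ["fb15k", "fb15k-237", "wn18", "wn18rr", "yago310"]
MODELS = ["complex", "transe", "distmult", "hole", "convkb", "conve"]


def build_parser():
    """Parser mirroring the -d/--dataset and -m/--model options shown above."""
    parser = argparse.ArgumentParser(prog="predictive_performance.py")
    parser.add_argument("-d", "--dataset", choices=DATASETS)
    parser.add_argument("-m", "--model", choices=MODELS)
    return parser


def selected_runs(argv):
    """Hypothetical helper: expand the parsed flags into (dataset, model) pairs.

    Assumption: leaving a flag out selects all of its values.
    """
    args = build_parser().parse_args(argv)
    datasets = [args.dataset] if args.dataset else DATASETS
    models = [args.model] if args.model else MODELS
    return [(d, m) for d in datasets for m in models]


# Equivalent to: python predictive_performance.py -d wn18rr -m complex
print(selected_runs(["-d", "wn18rr", "-m", "complex"]))
print(len(selected_runs([])))
```

Because both options use `choices`, a misspelled dataset or model name is rejected by the parser before any training starts.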
Runtime Performance¶
Training the models on FB15K-237 (k=100, eta=10, batches_count=100, loss=multiclass_nll), on an Intel Xeon Gold 6142, 64 GB Ubuntu 16.04 box equipped with a Tesla V100 16GB, gives the following runtime report:
model | seconds/epoch |
---|---|
ComplEx | 1.33 |
TransE | 1.22 |
DistMult | 1.20 |
HolE | 1.30 |
ConvKB | 2.83 |
ConvE | 1.13 |
Note
ConvE is trained with bce loss instead of multiclass_nll.
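To put the per-epoch figures in perspective, the following sketch converts them into projected training time and relative cost. The epoch counts are chosen for illustration only; the runtime benchmark uses k=100 and eta=10, so runs with the larger hyperparameters from the predictive performance tables, or with early stopping enabled, will take longer than these estimates.

```python
# Seconds per epoch, copied from the runtime report above.
SECONDS_PER_EPOCH = {
    "ComplEx": 1.33,
    "TransE": 1.22,
    "DistMult": 1.20,
    "HolE": 1.30,
    "ConvKB": 2.83,
    "ConvE": 1.13,
}


def training_hours(model, epochs):
    """Projected pure-training time in hours (no evaluation, no early stopping)."""
    return SECONDS_PER_EPOCH[model] * epochs / 3600


def relative_cost(model, baseline="DistMult"):
    """Per-epoch cost of `model` as a multiple of the baseline model."""
    return SECONDS_PER_EPOCH[model] / SECONDS_PER_EPOCH[baseline]


print(round(training_hours("ComplEx", 4000), 2))   # hours for 4000 epochs
print(round(training_hours("ConvKB", 500), 2))     # hours for 500 epochs
print(round(relative_cost("ConvKB"), 2))           # ConvKB vs DistMult per epoch
```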
Changelog¶
1.3.2¶
25 Aug 2020
Added constant initializer (#205)
Ranking strategies for breaking ties (#212)
ConvE Bug Fixes (#210, #194)
Efficient batch sampling (#202)
Added pointer to documentation for large graph mode and Docs for Optimizer (#216)
1.3.0¶
9 Mar 2020
ConvE model Implementation (#178)
Changes to evaluate_performance API (#183)
Option to add reciprocal relations (#181)
Minor fixes to ConvKB (#168, #167)
Minor fixes in large graph mode (#174, #172, #169)
Option to skip unseen entities checks when train_test_split is used (#163)
Stability of NLL losses (#170)
ICLR-20 calibration paper experiments added in branch paper/ICLR-20
1.2.0¶
22 Oct 2019
Probability calibration using Platt scaling, both with provided negatives or synthetic negative statements (#131)
Added ConvKB model
Added WN11, FB13 loaders (datasets with ground truth positive and negative triples) (#138)
Continuous integration with CircleCI, integrated on GitHub (#127)
Refactoring of models into separate files (#104)
Fixed bug where the number of epochs did not exactly match the number provided by the user (#135)
Fixed some bugs on RandomBaseline model (#133, #134)
Fixed some bugs on discover_facts with strategies “exhaustive” and “graph_degree”
Fixed bug on subsequent calls of model.predict on the GPU (#137)
1.1.0¶
16 Aug 2019
Support for large number of entities (#61, #113)
Faster evaluation protocol (#74)
New Knowledge discovery APIs: discover facts, clustering, near-duplicates detection, topn query (#118)
API change: model.predict() does not return ranks anymore
API change: friendlier ranking API output (#101)
Implemented nuclear-3 norm (#23)
Jupyter notebook tutorials: AmpliGraph basics (#67) and Link-based clustering
Random search for hyper-parameter tuning (#106)
Additional initializers (#112)
Experiment campaign with multiclass-loss
System-wide bugfixes and minor improvements
1.0.3¶
7 Jun 2019
Fixed regression in RandomBaseline (#94)
Added TensorBoard Embedding Projector support (#86)
Minor bugfixing (#87, #47)
1.0.2¶
19 Apr 2019
Added multiclass loss (#24 and #22)
Updated the negative generation to speed up evaluation for the default protocol. (#74)
Support for visualization of embeddings using TensorBoard (#16)
Save models with custom names. (#71)
Quick fix for the overflow issue for datasets with a million entities. (#61)
Fixed issues in train_test_split_no_unseen API and updated the API (#68)
Added unit test cases for better coverage of the code (#75)
Corrupt_sides: can now generate corruptions for training on both sides, or only on subject or object
Better error messages
Reduced logging verbosity
Added YAGO3-10 experiments
Added MD5 checksum for datasets (#47)
Addressed issue of ambiguous dataset loaders (#59)
Renamed ‘type’ parameter in models.get_embeddings to fix masking built-in function
Updated String comparison to use equality instead of identity.
Moved save_model and restore_model to ampligraph.utils (but existing API will remain for several releases).
Other minor issues (#63, #64, #65, #66)
1.0.1¶
22 Mar 2019
Evaluation protocol now ranks subject and object corruptions separately
Corruption generation can now use entities from current batch only
FB15k-237, WN18RR loaders filter out unseen triples by default
Removed some unused arguments
Improved documentation
Minor bugfixing
1.0.0¶
16 Mar 2019
TransE
DistMult
ComplEx
FB15k, WN18, FB15k-237, WN18RR, YAGO3-10 loaders
generic loader for csv files
RDF, ntriples loaders
Learning to rank evaluation protocol
TensorFlow-based negatives generation
save/restore capabilities for models
pairwise loss
nll loss
self-adversarial loss
absolute margin loss
Model selection routine
LCWA corruption strategy for training and eval
rank, Hits@N, MRR scores functions
About¶
AmpliGraph is developed and maintained by Accenture Labs Dublin.
How to Cite¶
If you like AmpliGraph and you use it in your project, why not star the project on GitHub?
If you instead use AmpliGraph in an academic publication, cite as:
@misc{ampligraph,
author= {Luca Costabello and
Sumit Pai and
Chan Le Van and
Rory McGrath and
Nicholas McCarthy and
Pedro Tabacof},
title = {{AmpliGraph: a Library for Representation Learning on Knowledge Graphs}},
month = mar,
year = 2019,
doi = {10.5281/zenodo.2595043},
url = {https://doi.org/10.5281/zenodo.2595043}
}
License¶
AmpliGraph is licensed under the Apache 2.0 License.