Datasets

Helper functions to load knowledge graphs.

Note

It is recommended to set the AMPLIGRAPH_DATA_HOME environment variable:

export AMPLIGRAPH_DATA_HOME=/YOUR/PATH/TO/datasets

When loading a dataset, the module first checks whether AMPLIGRAPH_DATA_HOME is set. If it is, the module searches that location for the required dataset. If the dataset is not found there, it is downloaded and placed in that directory.

If AMPLIGRAPH_DATA_HOME has not been set, the datasets will be saved in the following directory:

~/ampligraph_datasets
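
The lookup order described in this note can be sketched in a few lines of Python. This is only an illustration of the rule, not AmpliGraph's own code: resolve_data_home is a hypothetical helper name, and the optional env argument exists only so the example can be run without touching the real environment.

```python
import os
from pathlib import Path


def resolve_data_home(env=None):
    """Return the directory where datasets would be searched for and stored.

    Hypothetical helper (not part of AmpliGraph): it only mirrors the
    lookup order described in the note.
    """
    env = os.environ if env is None else env
    custom = env.get("AMPLIGRAPH_DATA_HOME")
    if custom:
        # Variable set: datasets are looked up in, and downloaded to, this path.
        return Path(custom).expanduser()
    # Variable not set: fall back to ~/ampligraph_datasets.
    return Path.home() / "ampligraph_datasets"


print(resolve_data_home({"AMPLIGRAPH_DATA_HOME": "/YOUR/PATH/TO/datasets"}))
print(resolve_data_home({}))
```

Calling resolve_data_home() with no argument reads the variable from the current process environment.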

Benchmark Datasets Loaders

Use these helper functions to load datasets used in graph representation learning literature. The functions will automatically download the datasets if they are not present in ~/ampligraph_datasets or at the location set in AMPLIGRAPH_DATA_HOME.

load_fb15k_237([check_md5hash, …]) Load the FB15k-237 dataset
load_wn18rr([check_md5hash, clean_unseen, …]) Load the WN18RR dataset
load_yago3_10([check_md5hash, clean_unseen, …]) Load the YAGO3-10 dataset
load_fb15k([check_md5hash, add_reciprocal_rels]) Load the FB15k dataset
load_wn18([check_md5hash, add_reciprocal_rels]) Load the WN18 dataset
load_wn11([check_md5hash, clean_unseen, …]) Load the WordNet11 (WN11) dataset
load_fb13([check_md5hash, clean_unseen, …]) Load the Freebase13 (FB13) dataset
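
As a rough illustration of how the output of a loader might be inspected, the sketch below counts the triples in each split. It assumes that a loader returns a dictionary-like object with "train", "valid" and "test" entries, each holding a sequence of (subject, predicate, object) triples. That layout is an assumption made for this example and is not stated on this page, so a toy dictionary stands in for a real dataset. split_sizes is a hypothetical helper.

```python
def split_sizes(dataset):
    """Return the number of triples in each split of a loaded dataset."""
    return {name: len(dataset[name]) for name in ("train", "valid", "test")}


# Toy stand-in for the object a loader is assumed to return.
toy_dataset = {
    "train": [("a", "likes", "b"), ("b", "likes", "c"), ("c", "knows", "a")],
    "valid": [("a", "knows", "b")],
    "test": [("b", "knows", "c")],
}

print(split_sizes(toy_dataset))
```

If the assumption holds, the same helper could be applied to the result of any loader listed above to cross-check the figures in the summary table.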

Datasets Summary

Dataset     Train      Valid   Test    Entities  Relations
FB15K-237   272,115    17,535  20,466  14,541    237
WN18RR      86,835     3,034   3,134   40,943    11
FB15K       483,142    50,000  59,071  14,951    1,345
WN18        141,442    5,000   5,000   40,943    18
YAGO3-10    1,079,040  5,000   5,000   123,182   37
WN11        110,361    5,215   21,035  38,194    11
FB13        316,232    11,816  47,464  75,043    13

Warning

WN18 and FB15k include a large number of inverse relations, and their use in experiments has been deprecated. Use WN18RR and FB15K-237 instead.
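
To make the notion of inverse relations concrete: when a relation r2 always holds in the opposite direction of r1, a held-out triple (o, r2, s) can often be answered simply by flipping a training triple (s, r1, o). The sketch below detects such pairs in a toy graph. The helper name, the min_overlap threshold and the toy triples are all made up for illustration. It does not reproduce how WN18RR or FB15K-237 were actually derived.

```python
from collections import defaultdict


def inverse_relation_pairs(triples, min_overlap=1.0):
    """Find pairs (r1, r2) where r2 mirrors at least `min_overlap` of r1's triples."""
    pairs_by_relation = defaultdict(set)
    for s, p, o in triples:
        pairs_by_relation[p].add((s, o))

    found = []
    for r1, pairs1 in pairs_by_relation.items():
        # Swap subject and object of every r1 triple.
        flipped = {(o, s) for s, o in pairs1}
        for r2, pairs2 in pairs_by_relation.items():
            if r1 != r2 and len(flipped & pairs2) / len(pairs1) >= min_overlap:
                found.append((r1, r2))
    return found


toy = [
    ("dog", "hypernym", "animal"),
    ("animal", "hyponym", "dog"),
    ("cat", "hypernym", "animal"),
    ("animal", "hyponym", "cat"),
    ("dog", "similar_to", "wolf"),
]
print(inverse_relation_pairs(toy))
```

In this toy graph, hypernym and hyponym are reported as inverses of each other, while similar_to has no mirrored counterpart.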

Warning

FB15K-237’s validation set contains 8 unseen entities over 9 triples. The test set has 29 unseen entities, distributed over 28 triples. WN18RR’s validation set contains 198 unseen entities over 210 triples. The test set has 209 unseen entities, distributed over 210 triples.
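
Counts like these can be reproduced by comparing the entities of a held-out split against those of the training set. The following is a minimal sketch on toy data. unseen_entities is a hypothetical helper, and triples are assumed to be (subject, predicate, object) tuples.

```python
def unseen_entities(train, other):
    """Return the entities of `other` that never occur in `train`,
    together with the triples of `other` that mention them."""
    seen = {e for s, _, o in train for e in (s, o)}
    unseen = {e for s, _, o in other for e in (s, o)} - seen
    affected = [t for t in other if t[0] in unseen or t[2] in unseen]
    return unseen, affected


train = [("a", "r1", "b"), ("b", "r2", "c")]
valid = [("a", "r2", "c"), ("c", "r1", "d"), ("e", "r1", "d")]

unseen, affected = unseen_entities(train, valid)
print(sorted(unseen), len(affected))
```

The clean_unseen argument that appears in several loader signatures above presumably controls whether such triples are filtered out. That reading is an assumption, since the argument is not described on this page.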

Note

WN11 and FB13 also provide true and negative labels for the triples in the validation and test sets. In both cases the positive base rate is close to 50%.

Benchmark Datasets Loaders (Knowledge Graphs With Numeric Values on Edges)

These helper functions load benchmark datasets with numeric values on edges, as described in [PC21] (the figure below shows a toy example).

[Figure: toy example of a knowledge graph with numeric values on edges (_images/kg_eg.png)]

Hint

To process a knowledge graph with numeric values associated with its edges, enable the FocusE layer when training a knowledge graph embedding model [PC21].

The functions will automatically download the datasets if they are not present in ~/ampligraph_datasets or at the location set in AMPLIGRAPH_DATA_HOME.

load_onet20k([check_md5hash, clean_unseen, …]) Load the O*NET20K dataset
load_ppi5k([check_md5hash, clean_unseen, …]) Load the PPI5K dataset
load_nl27k([check_md5hash, clean_unseen, …]) Load the NL27K dataset
load_cn15k([check_md5hash, clean_unseen, …]) Load the CN15K dataset

Datasets Summary (KGs with numeric values on edges)

Dataset    Train    Valid   Test    Entities  Relations
O*NET20K   461,932  138     2,000   20,643    19
PPI5K      230,929  19,017  21,720  4,999     7
NL27K      149,100  12,274  14,026  27,221    405
CN15K      199,417  16,829  19,224  15,000    36

Loaders for Custom Knowledge Graphs

Functions to load custom knowledge graphs from disk.

load_from_csv(directory_path, file_name[, …]) Load a knowledge graph from a csv file
load_from_ntriples(folder_name, file_name[, …]) Load RDF ntriples
load_from_rdf(folder_name, file_name[, …]) Load an RDF file
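
To show the kind of input a custom loader works with, the sketch below reads a delimited text file with one triple per row, using only the Python standard library. It is not AmpliGraph's load_from_csv: read_triples is a hypothetical stand-in, and the tab separator and the file name toy_kg.tsv are assumptions made for this example.

```python
import csv
import os
import tempfile


def read_triples(directory_path, file_name, sep="\t"):
    """Read one (subject, predicate, object) triple per row from a delimited file."""
    path = os.path.join(directory_path, file_name)
    with open(path, newline="", encoding="utf-8") as handle:
        # Skip empty rows so a trailing newline does not produce an empty triple.
        return [tuple(row) for row in csv.reader(handle, delimiter=sep) if row]


# Write a tiny knowledge graph to a temporary folder, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "toy_kg.tsv"), "w", encoding="utf-8") as handle:
        handle.write("alice\tknows\tbob\nbob\tworks_at\tacme\n")
    triples = read_triples(tmp_dir, "toy_kg.tsv")

print(triples)
```

The directory and file name are passed separately here only to echo the argument order shown in the listing above.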

Hint

AmpliGraph includes a helper function to split a generic knowledge graph into training, validation, and test sets. See ampligraph.evaluation.train_test_split_no_unseen().
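
The idea behind a split that leaves no unseen entities can be sketched with a simple greedy pass: a triple is moved to the held-out set only if every entity and relation it mentions still occurs in at least one remaining training triple. The code below is a hypothetical illustration of that idea, not the implementation of train_test_split_no_unseen(). The function name, its arguments and the toy graph are invented for the example.

```python
import random
from collections import Counter


def split_no_unseen(triples, test_size, seed=0):
    """Move up to `test_size` triples to a test set while keeping every
    entity and relation present in the training set."""
    rng = random.Random(seed)
    candidates = list(triples)
    rng.shuffle(candidates)

    # Number of training triples that still mention each entity / relation.
    entity_count = Counter(e for s, _, o in candidates for e in {s, o})
    relation_count = Counter(p for _, p, _ in candidates)

    train, test = [], []
    for s, p, o in candidates:
        entities = {s, o}
        removable = (
            len(test) < test_size
            and relation_count[p] > 1
            and all(entity_count[e] > 1 for e in entities)
        )
        if removable:
            test.append((s, p, o))
            relation_count[p] -= 1
            for e in entities:
                entity_count[e] -= 1
        else:
            train.append((s, p, o))
    return train, test


toy = [
    ("a", "r", "b"), ("b", "r", "c"), ("c", "r", "a"),
    ("a", "q", "c"), ("b", "q", "a"), ("c", "q", "b"),
]
train, test = split_no_unseen(toy, test_size=2)
train_entities = {e for s, _, o in train for e in (s, o)}
print(len(train), len(test))
print(all(s in train_entities and o in train_entities for s, _, o in test))
```

Applying the same function a second time to the resulting training set would yield a validation set with the same guarantee.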