Datasets¶

Support for loading and managing datasets.

Loaders for Custom Knowledge Graphs¶

These are functions to load custom knowledge graphs from disk. They load the data from the specified files and store it as a numpy array. These loaders are recommended when the datasets to load are small in size (approx 1M entities and millions of triples), i.e., as long as they can be stored in memory. In case the dataset is too big to fit in memory, use the GraphDataLoader class instead (see the Advanced Topics section for more).

`load_from_csv`(directory_path, file_name[, ...])	Load a knowledge graph from a .csv file.
`load_from_ntriples`(folder_name, file_name[, ...])	Load a dataset of RDF ntriples.
`load_from_rdf`(folder_name, file_name[, ...])	Load an RDF file.

Benchmark Datasets Loaders¶

The following helper functions allow to load datasets used in graph representation learning literature as benchmarks.
Among the various datasets, some include additional content to the usual triples. WN11 and FB13 provide true and negative labels for the triples in the validation and tests sets. CODEX-M contains also ground truths negative triples for test and validation sets (more information about the dataset in [SK20]).
Finally, even though some of them are nowadays deprecated (WN18 and FB15k), they are kept in the library as these were the first benchmarks to be used in literature.

`load_fb15k_237`([check_md5hash, ...])	Load the FB15k-237 dataset (with option to load human labeled test subset).
`load_wn18rr`([check_md5hash, clean_unseen, ...])	Load the WN18RR dataset.
`load_yago3_10`([check_md5hash, clean_unseen, ...])	Load the YAGO3-10 dataset.
`load_wn11`([check_md5hash, clean_unseen, ...])	Load the WordNet11 (WN11) dataset.
`load_fb13`([check_md5hash, clean_unseen, ...])	Load the Freebase13 (FB13) dataset.
`load_codex`([check_md5hash, clean_unseen, ...])	Load the CoDEx-M dataset.
`load_fb15k`([check_md5hash, add_reciprocal_rels])	Load the FB15k dataset.
`load_wn18`([check_md5hash, add_reciprocal_rels])	Load the WN18 dataset.

Datasets Summary

Dataset	Train	Valid	Test	Entities	Relations
FB15K-237	272,115	17,535	20,466	14,541	237
WN18RR	86,835	3,034	3,134	40,943	11
YAGO3-10	1,079,040	5,000	5,000	123,182	37
WN11	110,361	5,215	21,035	38,194	11
FB13	316,232	11,816	47,464	75,043	13
CODEX-M	185,584	10,310	10,311	17,050	51
FB15K	483,142	50,000	59,071	14,951	1,345
WN18	141,442	5,000	5,000	40,943	18

Hint

It is recommended to set the AMPLIGRAPH_DATA_HOME environment variable:

export AMPLIGRAPH_DATA_HOME=/YOUR/PATH/TO/datasets

When attempting to load a dataset, the module will first check if AMPLIGRAPH_DATA_HOME is set. If so, it will search this location for the required dataset. If not, the dataset will be downloaded and placed in this directory.

If AMPLIGRAPH_DATA_HOME is not set, the datasets will be saved in the ~/ampligraph_datasets directory.

Warning

FB15K-237’s validation set contains 8 unseen entities over 9 triples. The test set has 29 unseen entities, distributed over 28 triples.

WN18RR’s validation set contains 198 unseen entities over 210 triples. The test set has 209 unseen entities, distributed over 210 triples.

Benchmark Datasets With Numeric Values on Edges Loader¶

These helper functions load benchmark datasets with numeric values on edges (the figure below shows a toy example). Numeric values associated to edges of a knowledge graph have been used to represent uncertainty, edge importance, and even out-of-band knowledge in a growing number of scenarios, ranging from genetic data to social networks.

Hint

To process a knowledge graphs with numeric values associated to edges, enable the FocusE layer [PC21] when training a knowledge graph embedding model.

The functions will automatically download the datasets if they are not present in ~/ampligraph_datasets or at the location set in the AMPLIGRAPH_DATA_HOME.

`load_onet20k`([check_md5hash, clean_unseen, ...])	Load the O*NET20K dataset.
`load_ppi5k`([check_md5hash, clean_unseen, ...])	Load the PPI5K dataset.
`load_nl27k`([check_md5hash, clean_unseen, ...])	Load the NL27K dataset.
`load_cn15k`([check_md5hash, clean_unseen, ...])	Load the CN15K dataset.

Datasets Summary

Dataset	Train	Valid	Test	Entities	Relations
O*NET20K	461,932	138	2,000	20,643	19
PPI5K	230,929	19,017	21,720	4,999	7
NL27K	149,100	12,274	14,026	27,221	405
CN15K	199,417	16,829	19,224	15,000	36