load_codex

ampligraph.datasets.datasets.load_codex(check_md5hash=False, clean_unseen=True, add_reciprocal_rels=False, return_mapper=False)

Load the CoDEx-M dataset.

The dataset is described in [SK20].

Note

CoDEx-M also contains ground-truth negative triples for the test and validation sets. For more information, see the original paper referenced above.

The CoDEx dataset is loaded from file if it exists at the AMPLIGRAPH_DATA_HOME location. If AMPLIGRAPH_DATA_HOME is not set, the default ~/ampligraph_datasets is checked. If the dataset is not found at either location, it is downloaded and placed in AMPLIGRAPH_DATA_HOME or ~/ampligraph_datasets.
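The lookup order described above can be sketched as follows. This is only an illustration of the documented behaviour, not AmpliGraph's own code: `resolve_data_home` is a hypothetical helper, and the `/tmp/kg_data` path is a made-up example.

```python
import os


def resolve_data_home(environ=None):
    """Return the folder a loader would check first (hypothetical helper)."""
    env = os.environ if environ is None else environ
    # Prefer AMPLIGRAPH_DATA_HOME when it is set and non-empty.
    configured = env.get("AMPLIGRAPH_DATA_HOME")
    if configured:
        return os.path.expanduser(configured)
    # Otherwise fall back to the documented default location.
    return os.path.expanduser(os.path.join("~", "ampligraph_datasets"))


# With the variable set, the custom folder is used.
custom_home = resolve_data_home({"AMPLIGRAPH_DATA_HOME": "/tmp/kg_data"})
# Without it, the default under the user's home directory is used.
default_home = resolve_data_home({})
```

In practice, setting AMPLIGRAPH_DATA_HOME in the environment before calling load_codex should be enough to control where the files are read from or downloaded to.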

This dataset is divided into three splits:

  • train: 185,584 triples

  • valid: 10,310 triples

  • test: 10,311 triples

Both the validation and test splits are associated with labels (binary ndarrays), with True for positive statements and False for negatives:

  • valid_labels

  • test_labels

Dataset    Train      Valid     Valid-negatives    Test      Test-negatives    Entities    Relations
CoDEx-M    185,584    10,310    10,310             10,311    10,311            17,050      51

Parameters:
  • clean_unseen (bool) – If True, filters out triples in the validation and test sets that include entities not present in the training set (default: True).

  • check_md5hash (bool) – If True, check the MD5 hash of the dataset files (default: False).

  • add_reciprocal_rels (bool) – Flag which specifies whether to add reciprocal relations. For every <s, p, o> in the dataset this creates a corresponding triple with reciprocal relation <o, p_reciprocal, s> (default: False).

  • return_mapper (bool) – Whether to return human-readable labels in the form of a dictionary in the X["mapper"] field (default: False).
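To make the effect of clean_unseen and add_reciprocal_rels concrete, here is a minimal sketch on toy triples. It uses plain Python tuples rather than ndarrays to stay dependency-free, and it is not the library's implementation: `filter_unseen`, `add_reciprocals`, and the `_reciprocal` suffix are assumptions chosen for illustration.

```python
def filter_unseen(train, split):
    """Drop triples whose subject or object never occurs in the training set."""
    seen = {entity for s, _, o in train for entity in (s, o)}
    return [(s, p, o) for s, p, o in split if s in seen and o in seen]


def add_reciprocals(triples, suffix="_reciprocal"):
    """Append <o, p_reciprocal, s> for every <s, p, o> (suffix is an assumption)."""
    return list(triples) + [(o, p + suffix, s) for s, p, o in triples]


# Toy data: entity "z" appears only in the test split.
train = [("a", "likes", "b"), ("b", "knows", "c")]
test = [("a", "knows", "c"), ("a", "likes", "z")]

clean_test = filter_unseen(train, test)    # the triple mentioning "z" is removed
augmented_train = add_reciprocals(train)   # the training set doubles in size
```

Filtering unseen entities matters because an embedding model has no learned representation for an entity it never saw during training.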

Returns:

splits – The dataset splits: {'train': train, 'valid': valid, 'valid_negatives': valid_negatives, 'test': test, 'test_negatives': test_negatives}. Each split is an ndarray of shape (n, 3).

Return type:

dict
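One way to turn a positive split and its negatives into a single labelled evaluation set is sketched below. The dictionary mimics only the key names of the returned splits; the triples are toy data (the negative one is invented), plain lists stand in for ndarrays, and `with_labels` is a hypothetical helper.

```python
def with_labels(positives, negatives):
    """Concatenate positives and negatives and build matching boolean labels."""
    triples = list(positives) + list(negatives)
    labels = [True] * len(positives) + [False] * len(negatives)
    return triples, labels


# Toy stand-in for the dict returned by the loader (same keys, invented content).
splits = {
    "valid": [("Q60684", "P106", "Q4964182")],
    "valid_negatives": [("Q60684", "P106", "Q_toy_negative")],
}

valid_triples, valid_labels = with_labels(splits["valid"], splits["valid_negatives"])
```

The resulting pair can be fed to any binary triple-classification metric that expects one label per triple.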

Example

>>> from ampligraph.datasets import load_codex
>>> X = load_codex()
>>> X["valid"][0]
array(['Q60684', 'P106', 'Q4964182'], dtype=object)
>>> X = load_codex(return_mapper=True)
>>> [X['mapper'][elem]['label'] for elem in X['valid'][0]]
['Novalis', 'occupation', 'philosopher']
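The mapper lookup from the example above can be wrapped in a small helper that tolerates missing identifiers. This is a sketch under assumptions: `to_labels` is hypothetical, and `toy_mapper` reproduces only the three entries shown in the example rather than the real mapper contents.

```python
def to_labels(triple, mapper):
    """Translate each ID in a triple to its label, keeping the raw ID if unknown."""
    return [mapper.get(elem, {}).get("label", elem) for elem in triple]


# Toy mapper with the same nesting as X["mapper"] in the example: id -> {"label": ...}.
toy_mapper = {
    "Q60684": {"label": "Novalis"},
    "P106": {"label": "occupation"},
    "Q4964182": {"label": "philosopher"},
}

readable = to_labels(["Q60684", "P106", "Q4964182"], toy_mapper)
```

Falling back to the raw identifier keeps the output aligned with the input triple even when an ID has no label.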