load_ppi5k

ampligraph.datasets.load_ppi5k(check_md5hash=False, clean_unseen=True, split_test_into_top_bottom=True, split_threshold=0.1)

Load the PPI5K dataset.

Originally proposed in [CCS+19], PPI5K is a subset of the protein-protein interactions (PPI) knowledge graph [SM+16]. Numeric values represent the confidence of the link based on existing scientific literature evidence.

PPI5K is loaded from file if it exists at the AMPLIGRAPH_DATA_HOME location. If AMPLIGRAPH_DATA_HOME is not set, the default ~/ampligraph_datasets is checked. If the dataset is not found at either location, it is downloaded and placed in AMPLIGRAPH_DATA_HOME or ~/ampligraph_datasets.
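The lookup order described above can be sketched as follows. This is a minimal illustration, not the library's actual code: resolve_data_home is a hypothetical helper name, and only the environment variable and the default path are taken from the text.

```python
import os

def resolve_data_home(env_var="AMPLIGRAPH_DATA_HOME",
                      default="~/ampligraph_datasets"):
    """Hypothetical helper: return the directory a loader would check.

    Assumption: the environment variable takes precedence when it is set;
    otherwise the default location under the user's home directory is used.
    """
    data_home = os.environ.get(env_var)
    if data_home is None:
        data_home = default
    # Expand "~" and normalise to an absolute path.
    return os.path.abspath(os.path.expanduser(data_home))
```

Calling resolve_data_home() with AMPLIGRAPH_DATA_HOME unset returns the expanded form of ~/ampligraph_datasets; with the variable set, it returns that location instead.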

It is divided into three splits:

train: 230,929 triples
valid: 19,017 triples
test: 21,720 triples

Each triple in these splits is associated with a numeric value that carries additional information about the fact (for example, the importance or relevance of the link).

Dataset	Train	Valid	Test	Entities	Relations
PPI5K	230929	19017	21720	4999	7

Parameters:

check_md5hash (bool) – If True, check the MD5 hash of the files (default: False).
clean_unseen (bool) – If True, filters out triples in the validation and test sets that include entities not present in the training set (default: True).
split_test_into_top_bottom (bool) – If True, the function also returns test_topk and test_bottomk, subsets of the test set that contain only the top-k% and bottom-k% triples by numeric value, together with their numeric values. The subsets are generated by sorting the test set by numeric value and taking the top and bottom k% of triples, where k is specified by the split_threshold argument (default: True).
split_threshold (float) – Specifies the top and bottom percentage of triples to return (default: 0.1).
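As an illustration of what clean_unseen describes, the following is a minimal NumPy sketch under stated assumptions. filter_unseen_entities is a hypothetical helper, not part of AmpliGraph, and the toy triples are invented for the example. It assumes triples are stored as (subject, predicate, object) rows and that the numeric values must be filtered with the same mask to stay aligned.

```python
import numpy as np

def filter_unseen_entities(train, other):
    """Hypothetical helper: drop rows of `other` that mention an entity
    (subject or object) that never occurs in `train`.

    Returns the filtered triples and the boolean mask that was applied,
    so the mask can be reused on the matching numeric values.
    """
    # Entities are taken from the subject (column 0) and object (column 2).
    seen = np.unique(np.concatenate([train[:, 0], train[:, 2]]))
    mask = np.isin(other[:, 0], seen) & np.isin(other[:, 2], seen)
    return other[mask], mask

# Toy data, invented for illustration only.
train = np.array([["a", "r1", "b"], ["b", "r2", "c"]])
test = np.array([["a", "r1", "c"], ["a", "r1", "z"]])  # "z" is unseen
test_numeric = np.array([0.9, 0.2])

test_clean, mask = filter_unseen_entities(train, test)
test_numeric_clean = test_numeric[mask]
```

Here the second test triple is removed because "z" does not appear in the training set, and its numeric value is dropped with it.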

Returns:

splits – The dataset splits: {'train': train, 'valid': valid, 'test': test, 'test_topk': test_topk, 'test_bottomk': test_bottomk, 'train_numeric_values': train_numeric_values, 'valid_numeric_values': valid_numeric_values, 'test_numeric_values': test_numeric_values, 'test_topk_numeric_values': test_topk_numeric_values, 'test_bottomk_numeric_values': test_bottomk_numeric_values}. Each dataset split is an ndarray of shape (n,3). Each *_numeric_values split is an ndarray of shape (n) that contains the numeric values associated with the corresponding dataset split. The *_topk and *_bottomk splits are only returned when split_test_into_top_bottom=True and contain the test triples with the highest and lowest associated numeric values, respectively. They are typically used at evaluation time: a good model should rank the _topk triples as high as possible and the _bottomk triples as low as possible.
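The top/bottom split can be sketched as below. This is a rough illustration rather than the library's implementation: split_top_bottom is a hypothetical helper, and it assumes that split_threshold is a fraction (so 0.1 means 10%), that the top split is ordered from highest to lowest value, and that the bottom split is ordered from lowest to highest.

```python
import numpy as np

def split_top_bottom(test, test_numeric_values, split_threshold=0.1):
    """Hypothetical helper: return the top and bottom fraction of test
    triples, ranked by their numeric values."""
    values = np.asarray(test_numeric_values).reshape(-1)
    k = int(len(values) * split_threshold)  # number of triples per split
    order = np.argsort(values)              # indices, ascending by value
    bottom_idx = order[:k]                  # k lowest values
    top_idx = order[::-1][:k]               # k highest values, descending
    return {
        "test_topk": test[top_idx],
        "test_topk_numeric_values": values[top_idx],
        "test_bottomk": test[bottom_idx],
        "test_bottomk_numeric_values": values[bottom_idx],
    }

# Toy data, invented for illustration: 10 triples with values 0.0 .. 0.9.
test = np.array([[f"e{i}", "r", f"e{i + 1}"] for i in range(10)])
test_numeric = np.arange(10) / 10

parts = split_top_bottom(test, test_numeric, split_threshold=0.2)
```

With 10 triples and a threshold of 0.2, each split holds 2 triples: the top split carries the values 0.9 and 0.8, the bottom split 0.0 and 0.1.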

Return type:

dict

Example

>>> from ampligraph.datasets import load_ppi5k
>>> X = load_ppi5k()
>>> X["train"][0]
['4001' '5' '4176']
>>> X['train_numeric_values'][0]
[0.329]