load_cn15k
- ampligraph.datasets.datasets.load_cn15k(check_md5hash=False, clean_unseen=True, split_test_into_top_bottom=True, split_threshold=0.1)
Load the CN15K dataset.
CN15K was originally proposed in [CCS+19]. It is a subset of ConceptNet [SCH16], a common-sense knowledge graph built to represent general human knowledge. The numeric values on triples represent uncertainty.
The CN15K dataset is loaded from file if it exists at the AMPLIGRAPH_DATA_HOME location. If AMPLIGRAPH_DATA_HOME is not set, the default ~/ampligraph_datasets is checked. If the dataset is not found at either location, it is downloaded and placed in AMPLIGRAPH_DATA_HOME or ~/ampligraph_datasets.
It is divided into three splits:
train: 199,417 triples
valid: 16,829 triples
test: 19,224 triples
Each triple in these splits is associated with a numeric value that represents the importance/relevance of the link.
Dataset   Train     Valid    Test     Entities   Relations
CN15K     199,417   16,829   19,224   15,000     36
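The lookup order for the data directory described above can be sketched as follows. This is a minimal illustration using only the standard library, not AmpliGraph's own implementation; the helper name resolve_data_home and its env parameter are hypothetical.

```python
import os
from pathlib import Path


def resolve_data_home(env=None):
    """Return the directory that would be searched for datasets.

    Hypothetical helper: mirrors the documented lookup order only
    (AMPLIGRAPH_DATA_HOME first, then ~/ampligraph_datasets).
    """
    env = os.environ if env is None else env
    configured = env.get("AMPLIGRAPH_DATA_HOME")
    if configured:
        # The environment variable takes precedence when it is set.
        return Path(os.path.expanduser(configured))
    # Fall back to the documented default location.
    return Path(os.path.expanduser("~/ampligraph_datasets"))


# Passing a plain dict instead of os.environ keeps the example deterministic.
custom = resolve_data_home({"AMPLIGRAPH_DATA_HOME": "/tmp/ag_data"})
default = resolve_data_home({})
```

To store the dataset somewhere other than the default, set AMPLIGRAPH_DATA_HOME in the environment before calling load_cn15k.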
- Parameters:
check_md5hash (bool) – If True, check the md5hash of the files (default: False).
clean_unseen (bool) – If True, remove triples from the validation and test sets that include entities not present in the training set (default: True).
split_test_into_top_bottom (bool) – If True, sort the test triples by their numeric values and additionally return the top and bottom k% as test_topk and test_bottomk, where k is set by the split_threshold argument (default: True).
split_threshold (float) – Specifies the share of test triples returned in each of the top and bottom splits (default: 0.1).
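The effect of clean_unseen can be illustrated with a small sketch. It assumes that "entities" means the subject and object of each triple and that the numeric values are filtered in lockstep with the triples; the function filter_unseen below is a hypothetical stand-in written with plain Python lists, not the library's own routine.

```python
def filter_unseen(train, triples, values):
    """Drop triples whose subject or object never occurs in `train`.

    Hypothetical helper: `values` holds one numeric value per triple and is
    filtered together with `triples` so the two stay aligned.
    """
    # Collect every entity (subject or object) seen in the training set.
    seen = {entity for s, _, o in train for entity in (s, o)}
    kept = [i for i, (s, _, o) in enumerate(triples) if s in seen and o in seen]
    return [triples[i] for i in kept], [values[i] for i in kept]


train = [("a", "r1", "b"), ("b", "r2", "c")]
valid = [("a", "r1", "c"), ("a", "r1", "z"), ("y", "r2", "b")]
valid_values = [0.9, 0.4, 0.7]

clean_valid, clean_values = filter_unseen(train, valid, valid_values)
```

Only the first validation triple survives here, because "z" and "y" never appear in the training triples.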
- Returns:
splits – The dataset splits: {'train': train, 'valid': valid, 'test': test, 'test_topk': test_topk, 'test_bottomk': test_bottomk, 'train_numeric_values': train_numeric_values, 'valid_numeric_values': valid_numeric_values, 'test_numeric_values': test_numeric_values, 'test_topk_numeric_values': test_topk_numeric_values, 'test_bottomk_numeric_values': test_bottomk_numeric_values}. Each dataset split is an ndarray of shape (n, 3). Each *_numeric_values split contains the numeric values associated with the corresponding dataset split and is an ndarray of shape (n). The *_topk and *_bottomk splits are only returned when split_test_into_top_bottom=True and contain the test triples with the highest and lowest associated numeric values, respectively. They are typically used at evaluation time to check whether a model assigns the highest possible rank to the _topk triples and the lowest possible rank to the _bottomk triples.
- Return type:
dict
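How the *_topk and *_bottomk splits relate to the test set can be sketched as below. This is an illustration under stated assumptions, not the library's code: split_threshold is treated as a fraction of the test set, k is rounded down, and the top split is returned highest value first. The function split_top_bottom is hypothetical and uses plain Python lists in place of ndarrays.

```python
def split_top_bottom(triples, values, split_threshold=0.1):
    """Return (top, top_values, bottom, bottom_values) by numeric value.

    Hypothetical helper. Assumes split_threshold is a fraction in [0, 1]
    and that k = floor(split_threshold * number of triples).
    """
    k = int(split_threshold * len(triples))
    # Indices of the triples, ordered from lowest to highest numeric value.
    ascending = sorted(range(len(triples)), key=lambda i: values[i])
    bottom_idx = ascending[:k]
    top_idx = ascending[::-1][:k]  # highest value first
    return (
        [triples[i] for i in top_idx],
        [values[i] for i in top_idx],
        [triples[i] for i in bottom_idx],
        [values[i] for i in bottom_idx],
    )


test = [(str(i), "r", str(i + 100)) for i in range(10)]
test_values = [0.5, 0.9, 0.1, 0.7, 0.3, 0.8, 0.2, 0.6, 0.4, 1.0]

top, top_values, bottom, bottom_values = split_top_bottom(
    test, test_values, split_threshold=0.2
)
```

With ten toy triples and a threshold of 0.2, each of the two splits holds two triples.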
Example
>>> from ampligraph.datasets import load_cn15k
>>> X = load_cn15k()
>>> X["train"][0]
['260' '2' '13895']
>>> X['train_numeric_values'][0]
[0.8927088]
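After loading, it can be useful to confirm that every split lines up with its *_numeric_values companion. The sketch below relies only on the key naming shown in the Returns section; check_splits is a hypothetical helper, and a toy dictionary stands in for the real return value so the example runs without downloading the dataset.

```python
def check_splits(splits):
    """Return {split_name: number_of_triples} after checking alignment.

    Hypothetical helper: raises ValueError if a split and its
    '<name>_numeric_values' entry differ in length.
    """
    sizes = {}
    for name, triples in splits.items():
        if name.endswith("_numeric_values"):
            continue  # visited through the split it belongs to
        values = splits.get(name + "_numeric_values")
        if values is not None and len(values) != len(triples):
            raise ValueError(
                f"{name}: {len(triples)} triples but {len(values)} numeric values"
            )
        sizes[name] = len(triples)
    return sizes


# Toy stand-in for the dictionary returned by load_cn15k().
toy = {
    "train": [["260", "2", "13895"], ["1", "0", "7"]],
    "train_numeric_values": [[0.89], [0.31]],
    "test": [["5", "3", "9"]],
    "test_numeric_values": [[0.52]],
}
sizes = check_splits(toy)
```

Because the helper only uses len() and dictionary access, the same call should also accept the ndarray-valued dictionary returned by load_cn15k().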