ampligraph.datasets.load_onet20k(check_md5hash=False, clean_unseen=True, split_test_into_top_bottom=True, split_threshold=0.1)

Load the O*NET20K dataset

O*NET20K was originally proposed in [PC21]. It is a subset of O*NET, a dataset that includes job descriptions, skills, and labeled binary relations between such concepts. Each triple is labeled with a numeric value that indicates the importance of that link.

The O*NET20K dataset is loaded from file if it exists at the AMPLIGRAPH_DATA_HOME location. If AMPLIGRAPH_DATA_HOME is not set, the default ~/ampligraph_datasets is checked.

If the dataset is not found at either location, it is downloaded and placed in AMPLIGRAPH_DATA_HOME or ~/ampligraph_datasets.
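
The lookup order described above can be sketched as follows. This is a minimal illustration, not AmpliGraph's internal code: the helper name resolve_data_home and its env parameter are hypothetical, and only the environment variable name and the default path come from this page.

```python
import os


def resolve_data_home(env=None):
    """Return the directory that would be checked for cached datasets.

    Hypothetical helper: it mirrors the documented rule (use
    AMPLIGRAPH_DATA_HOME if set, otherwise ~/ampligraph_datasets).
    """
    # Allow a plain dict to be passed in so the rule is easy to try out.
    env = os.environ if env is None else env
    configured = env.get("AMPLIGRAPH_DATA_HOME")
    if configured is not None:
        return os.path.expanduser(configured)
    # Fall back to the documented default location.
    return os.path.expanduser(os.path.join("~", "ampligraph_datasets"))


print(resolve_data_home())
```

Setting AMPLIGRAPH_DATA_HOME before calling the loader is therefore the way to control where the files are read from and downloaded to.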

It is divided into three splits:

  • train
  • valid
  • test

Each triple in these splits is associated with a numeric value that represents the importance/relevance of the link.

Dataset     Train     Valid   Test    Entities   Relations
O*NET20K    461,932   850     2,000   20,643     19

Parameters:

  • check_md5hash (bool) – If True, checks the MD5 hash of the files. Defaults to False.
  • clean_unseen (bool) – If True, removes triples from the validation and test sets that include entities not present in the training set. Defaults to True.
  • split_test_into_top_bottom (bool) – If True, sorts the test set by numeric value and additionally returns the top and bottom k% of triples as test_topk and test_bottomk, where k is given by split_threshold. Defaults to True.
  • split_threshold (float) – The share of test triples to return in each of the top and bottom splits, expressed as a fraction. Defaults to 0.1.
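
To make the effect of clean_unseen concrete, here is a minimal sketch of that kind of filtering on plain Python lists. The function filter_unseen and the toy triples are hypothetical and only illustrate the behaviour described above; the library's own implementation may differ.

```python
def filter_unseen(train, triples, values):
    """Drop triples whose subject or object never occurs in `train`.

    Hypothetical helper: `values` holds one numeric value per triple and
    is filtered in step so that both lists stay aligned.
    """
    # Every entity that appears as subject or object in the training set.
    seen = {entity for s, _, o in train for entity in (s, o)}
    kept = [i for i, (s, _, o) in enumerate(triples) if s in seen and o in seen]
    return [triples[i] for i in kept], [values[i] for i in kept]


# Toy data, invented for illustration only.
train = [("a", "r", "b"), ("b", "r", "c")]
test = [("a", "r", "c"), ("a", "r", "z"), ("y", "r", "b")]
test_values = [0.5, 0.9, 0.1]

clean_test, clean_values = filter_unseen(train, test, test_values)
print(clean_test, clean_values)
```

The second and third toy triples are dropped because "z" and "y" do not appear in the training set.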

Returns:

splits – The dataset splits: {'train': train, 'valid': valid, 'test': test, 'test_topk': test_topk, 'test_bottomk': test_bottomk, 'train_numeric_values': train_numeric_values, 'valid_numeric_values': valid_numeric_values, 'test_numeric_values': test_numeric_values, 'test_topk_numeric_values': test_topk_numeric_values, 'test_bottomk_numeric_values': test_bottomk_numeric_values}.

Each *_numeric_values split contains the numeric values associated with the corresponding dataset split and is an ndarray of shape [n].

Each dataset split is an ndarray of shape [n,3].

The *_topk and *_bottomk splits are only returned when split_test_into_top_bottom=True.
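
As an illustration of what a top/bottom split by numeric value looks like, the sketch below sorts toy triples by value and takes a fraction from each end. The function split_top_bottom and the rounding rule (truncating n * split_threshold to an integer) are assumptions made for this example, not a statement of how the library computes the splits.

```python
def split_top_bottom(triples, values, split_threshold=0.1):
    """Return (top_triples, top_values, bottom_triples, bottom_values).

    Hypothetical helper: selects the highest- and lowest-valued fraction
    of triples, each of size int(len(values) * split_threshold).
    """
    # Indices of the triples, ordered by ascending numeric value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    k = int(len(values) * split_threshold)
    bottom_idx = order[:k]
    top_idx = order[len(order) - k:]
    return (
        [triples[i] for i in top_idx],
        [values[i] for i in top_idx],
        [triples[i] for i in bottom_idx],
        [values[i] for i in bottom_idx],
    )


# Ten invented triples with distinct values 0..9 in shuffled order.
toy_values = [5, 1, 9, 3, 7, 2, 8, 4, 6, 0]
toy_triples = [("s%d" % i, "r", "o%d" % i) for i in range(10)]

top, top_vals, bottom, bottom_vals = split_top_bottom(
    toy_triples, toy_values, split_threshold=0.2
)
print(top_vals, bottom_vals)
```

With ten triples and a threshold of 0.2, each returned split holds two triples.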

Return type:

dict

>>> from ampligraph.datasets import load_onet20k
>>> X = load_onet20k()
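
Once the dictionary is loaded, a quick sanity check is to confirm that every triple split has as many rows as its *_numeric_values companion. The helper below is a hypothetical sketch that relies only on the key naming shown on this page; it works on any mapping whose values support len(), so it is demonstrated here on a small stand-in dictionary rather than on the real dataset.

```python
def summarize_splits(splits):
    """Map each triple split name to its row count.

    Hypothetical helper: raises ValueError if a split and its
    `<name>_numeric_values` companion differ in length.
    """
    summary = {}
    for name, triples in splits.items():
        if name.endswith("_numeric_values"):
            continue  # handled together with its triple split
        values = splits.get(name + "_numeric_values")
        if values is not None and len(values) != len(triples):
            raise ValueError("length mismatch in split %r" % name)
        summary[name] = len(triples)
    return summary


# Stand-in data with the same key layout, invented for illustration.
toy = {
    "train": [("a", "r", "b"), ("b", "r", "c")],
    "train_numeric_values": [0.4, 0.7],
    "test": [("a", "r", "c")],
    "test_numeric_values": [0.5],
}
print(summarize_splits(toy))
```

Calling summarize_splits(X) on the dictionary from the example above should work the same way, since ndarrays also support len().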