train_test_split_no_unseen

ampligraph.evaluation.train_test_split_no_unseen(X, test_size=100, seed=0, allow_duplication=False, filtered_test_predicates=None)

Split into train and test sets.

This function carves out a test set that contains only entities and relations which also occur in the training set.

This version is much faster than the earlier, sampling-based approach. Instead of sampling, it shuffles the triple indices and builds a test set of the required size by selecting, in shuffled order, only those triples whose removal does not disconnect any entity or relation from the training set.
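The shuffle-and-select idea described above can be sketched in plain Python. This is an illustration under stated assumptions, not AmpliGraph's actual implementation: the helper name `split_no_unseen_sketch` is hypothetical, triples are plain `(subject, predicate, object)` tuples rather than an ndarray, and the `allow_duplication` and `filtered_test_predicates` options are left out.

```python
import random
from collections import Counter


def split_no_unseen_sketch(triples, test_size, seed=0):
    """Hypothetical helper: split (s, p, o) tuples so that every entity and
    relation in the test set still occurs in the training set."""
    # Count how often each entity and relation occurs in the whole dataset.
    entity_count = Counter()
    relation_count = Counter()
    for s, p, o in triples:
        entity_count[s] += 1
        entity_count[o] += 1
        relation_count[p] += 1

    # Shuffle the indices once instead of sampling repeatedly.
    indices = list(range(len(triples)))
    random.Random(seed).shuffle(indices)

    train_idx, test_idx = [], []
    for idx in indices:
        s, p, o = triples[idx]
        if len(test_idx) < test_size:
            # Tentatively remove the triple from the training counts.
            entity_count[s] -= 1
            entity_count[o] -= 1
            relation_count[p] -= 1
            if entity_count[s] > 0 and entity_count[o] > 0 and relation_count[p] > 0:
                test_idx.append(idx)
                continue
            # Removal would leave an entity or relation unseen: undo it.
            entity_count[s] += 1
            entity_count[o] += 1
            relation_count[p] += 1
        train_idx.append(idx)

    if len(test_idx) < test_size:
        raise ValueError("cannot build a test set of the requested size")
    return [triples[i] for i in train_idx], [triples[i] for i in test_idx]


triples = [("a", "y", "b"), ("f", "y", "e"), ("b", "y", "a"),
           ("a", "y", "c"), ("c", "y", "a"), ("a", "y", "d"),
           ("c", "y", "d"), ("b", "y", "c"), ("f", "y", "e")]
train, test = split_no_unseen_sketch(triples, test_size=2)
print(len(train), len(test))
```

Keeping running counts makes each candidate check constant-time, which is why a single pass over the shuffled indices is enough.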

Parameters:

X (ndarray, shape (n, 3)) – The dataset to split.
test_size (int, float) – If int, the number of triples in the test set. If float, the fraction of the total number of triples (e.g. 0.1 for 10%).
seed (int) – A random seed used to split the dataset.
allow_duplication (bool) – Flag to indicate if the test set can contain duplicated triples.
filtered_test_predicates (None, list) – If None, all predicate types will be considered for the test set. If list, only the predicate types in the list will be considered for the test set.
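As a rough illustration of how `test_size` and `filtered_test_predicates` could be interpreted, the sketch below uses two hypothetical helpers, `resolve_test_size` and `eligible_for_test`, which are not part of AmpliGraph. It assumes that a float `test_size` is a fraction between 0 and 1 that is truncated to a whole number of triples; check the library source for the exact rounding behaviour.

```python
def resolve_test_size(n_triples, test_size):
    """Hypothetical helper: turn test_size into an absolute triple count."""
    if isinstance(test_size, float):
        # Assumption: a float is a fraction of the dataset, truncated.
        return int(n_triples * test_size)
    return test_size


def eligible_for_test(triples, filtered_test_predicates=None):
    """Hypothetical helper: triples that may be moved to the test set."""
    if filtered_test_predicates is None:
        return list(triples)  # every predicate type is considered
    allowed = set(filtered_test_predicates)
    return [t for t in triples if t[1] in allowed]


triples = [("a", "y", "b"), ("b", "z", "c"), ("c", "y", "a"), ("a", "z", "c")]
print(resolve_test_size(len(triples), 2))      # 2
print(resolve_test_size(len(triples), 0.5))    # 2
print(eligible_for_test(triples, ["z"]))       # only the two 'z' triples
```

Triples whose predicate is not in the list would stay in the training set under this reading.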

Returns:

X_train (ndarray, shape (n, 3)) – The training set.
X_test (ndarray, shape (n, 3)) – The test set.

Example

>>> import numpy as np
>>> from ampligraph.evaluation import train_test_split_no_unseen
>>> # load your dataset to X
>>> X = np.array([['a', 'y', 'b'],
...               ['f', 'y', 'e'],
...               ['b', 'y', 'a'],
...               ['a', 'y', 'c'],
...               ['c', 'y', 'a'],
...               ['a', 'y', 'd'],
...               ['c', 'y', 'd'],
...               ['b', 'y', 'c'],
...               ['f', 'y', 'e']])
>>> # if you want to split into train/test datasets
>>> X_train, X_test = train_test_split_no_unseen(X, test_size=2)
>>> X_train
array([['a', 'y', 'd'],
       ['b', 'y', 'a'],
       ['a', 'y', 'c'],
       ['f', 'y', 'e'],
       ['a', 'y', 'b'],
       ['c', 'y', 'a'],
       ['b', 'y', 'c']], dtype='<U1')
>>> X_test
array([['f', 'y', 'e'],
       ['c', 'y', 'd']], dtype='<U1')
>>> # if you want to split into train/valid/test datasets, call it twice
>>> # (backward_compatible is not in the signature shown above; omit it if your version rejects it)
>>> X_train_valid, X_test = train_test_split_no_unseen(X, test_size=2, backward_compatible=True)
>>> X_train, X_valid = train_test_split_no_unseen(X_train_valid, test_size=2, backward_compatible=True)
>>> X_train
array([['a', 'y', 'b'],
       ['a', 'y', 'd'],
       ['a', 'y', 'c'],
       ['c', 'y', 'a'],
       ['f', 'y', 'e']], dtype='<U1')
>>> X_valid
array([['c', 'y', 'd'],
       ['f', 'y', 'e']], dtype='<U1')
>>> X_test
array([['b', 'y', 'c'],
       ['b', 'y', 'a']], dtype='<U1')
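
To confirm that a split has the "no unseen" property, you can compare the entities and relations of the two sets. The helper below, `unseen_in_test`, is a hypothetical name and not part of AmpliGraph. It is shown on plain lists copied from the first train/test split in the example, and it should work the same way on ndarrays of shape (n, 3).

```python
def unseen_in_test(train, test):
    """Hypothetical helper: entities and relations that occur in the test
    set but not in the training set (both empty for a valid split)."""
    train_entities = {s for s, _, _ in train} | {o for _, _, o in train}
    train_relations = {p for _, p, _ in train}
    test_entities = {s for s, _, _ in test} | {o for _, _, o in test}
    test_relations = {p for _, p, _ in test}
    return test_entities - train_entities, test_relations - train_relations


# Rows copied from the first train/test split shown in the example.
X_train = [['a', 'y', 'd'], ['b', 'y', 'a'], ['a', 'y', 'c'], ['f', 'y', 'e'],
           ['a', 'y', 'b'], ['c', 'y', 'a'], ['b', 'y', 'c']]
X_test = [['f', 'y', 'e'], ['c', 'y', 'd']]

missing_entities, missing_relations = unseen_in_test(X_train, X_test)
print(missing_entities, missing_relations)  # set() set()
```

A non-empty result would mean the model is evaluated on entities or relations it never saw during training.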