GraphDataLoader

class ampligraph.datasets.GraphDataLoader(data_source, batch_size=1, dataset_type='train', backend=None, root_directory=None, use_indexer=True, verbose=False, remap=False, name='main_partition', parent=None, in_memory=False, use_filter=False)

Data loader for models to ingest graph data.

This class is internally used by the model to store the data passed by the user and batch over it during training and evaluation, and to obtain filters during evaluation.

Advanced users can also use it to load large custom datasets for partitioned training. In that case the complete dataset is not loaded into memory; the data is loaded in chunks, according to the partition currently being trained.

Example

>>> from ampligraph.datasets import GraphDataLoader, BucketGraphPartitioner
>>> from ampligraph.datasets.sqlite_adapter import SQLiteAdapter
>>> from ampligraph.latent_features import ScoringBasedEmbeddingModel
>>> AMPLIGRAPH_DATA_HOME='/your/path/to/datasets/'
>>> # Graph loader: loads the data from a file, NumPy array, etc. and generates batches for iteration
>>> path_to_training = AMPLIGRAPH_DATA_HOME + 'fb15k-237/train.txt'
>>> dataset_loader = GraphDataLoader(path_to_training,
...                                  backend=SQLiteAdapter, # type of backend to use
...                                  batch_size=1000,       # batch size to use while iterating over this dataset
...                                  dataset_type='train',  # dataset type
...                                  use_filter=False,      # whether to use a filter or not
...                                  use_indexer=True)      # indicates that the data needs to be mapped to indices
>>>
>>> # Choose the partitioner - in this case we choose the BucketGraphPartitioner
>>> partitioner = BucketGraphPartitioner(dataset_loader, k=3)
>>> partitioned_model = ScoringBasedEmbeddingModel(eta=2,
...                                                k=50,
...                                                scoring_type='DistMult')
>>> partitioned_model.compile(optimizer='adam', loss='multiclass_nll')
>>> partitioned_model.fit(partitioner,            # pass the partitioner object to fit; it generates data for the model during training
...                       epochs=10)              # number of epochs
>>> indexer = partitioned_model.data_handler.get_mapper()    # get the mapper from the trained model
>>> path_to_test = AMPLIGRAPH_DATA_HOME + 'fb15k-237/test.txt'
>>> dataset_loader_test = GraphDataLoader(path_to_test,
...                                       backend=SQLiteAdapter,                         # type of backend to use
...                                       batch_size=400,                                # batch size to use while iterating over this dataset
...                                       dataset_type='test',                           # dataset type
...                                       use_indexer=indexer                            # mapper to map test concepts to the same indices used during training
...                                       )
>>> ranks = partitioned_model.evaluate(dataset_loader_test, # pass the dataloader object to generate data for the model during evaluation
...                                    batch_size=400)
>>> print(ranks)
[[  85    7]
 [  95    9]
 [1074   22]
 ...
 [ 546   95]
 [9961 7485]
 [1494    2]]

Attributes

`max_entities`	Maximum number of entities present in the dataset mapper.
`max_relations`	Maximum number of relations present in the dataset mapper.

Methods

`__init__`(data_source[, batch_size, ...])	Initialise persistent/in-memory data storage.
`add_dataset`(data_source, dataset_type)	Adds the dataset to the backend (if possible).
`clean`()	Cleans up the temporary files created for training/evaluation.
`get_batch_generator`([dataset_type, use_filter])	Get batch generator from the backend.
`get_complementary_entities`(triples[, use_filter])	Get subjects and objects complementary to triples (?,p,?).
`get_complementary_objects`(triples[, use_filter])	Get objects complementary to triples (s,p,?).
`get_complementary_subjects`(triples[, use_filter])	Get subjects complementary to triples (?,p,o).
`get_data_size`()	Returns number of triples.
`get_participating_entities`(triples[, sides, ...])	Get entities from triples with fixed subjects or fixed objects or both fixed.
`get_tf_generator`()	Generates a tensorflow.data.Dataset object.
`get_triples`([subjects, objects, entities])	Get triples whose subject is in subjects and whose object is in objects, or triples in which either the subject or the object is in entities.
`intersect`(dataloader)	Returns the intersection between the current data loader and another one specified in `dataloader`.
`on_complete`()
`on_epoch_end`()
`reload`([use_filter, dataset_type])	Reinstantiate batch iterator.

__init__(data_source, batch_size=1, dataset_type='train', backend=None, root_directory=None, use_indexer=True, verbose=False, remap=False, name='main_partition', parent=None, in_memory=False, use_filter=False)

Initialise persistent/in-memory data storage.

Parameters:

data_source (str or np.array or GraphDataLoader or AbstractGraphPartitioner) – Data to load. Can be a path to a file (e.g. CSV), data already loaded as a NumPy array, a GraphDataLoader or an AbstractGraphPartitioner instance.
batch_size (int) – Size of batch.
dataset_type (str) – Kind of data provided (“train” | “test” | “valid”).
backend (str) – Name of backend class (NoBackend, SQLiteAdapter) or already initialised backend. If None, NoBackend is used (in-memory processing).
root_directory (str) – Path to a directory where the database will be created, and the data and mappings will be stored. If None, the root directory is obtained through the tempfile.gettempdir() method (default: None).
use_indexer (bool or DataIndexer) – Flag to tell whether data should be indexed. If the DataIndexer object is passed, the mappings defined in the indexer will be reused to generate mappings for the current data.
verbose (bool) – Verbosity.
remap (bool) – Flag used by the graph partitioner. Indicates whether previously indexed data in a partition has to be remapped to new indexes (0, <size_of_partition>). It must not be used with use_indexer=True. The new remappings will be persisted.
name (str) – Name of the partition. This is internally used when the data is partitioned.
parent (GraphDataLoader) – Parent dataloader. This is internally used when the data is partitioned.
in_memory (bool) – Whether to keep the indexes in memory (True) or persist them (False).
use_filter (bool or dict) – If True, current dataset will be used as filter. If dict, the datasets specified in the dict will be used for filtering. If False, the true positives will not be filtered from corruptions.
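
To illustrate what reusing mappings through use_indexer means, here is a minimal pure-Python sketch. It is not AmpliGraph's DataIndexer: build_index and apply_index are hypothetical helpers, and the id assignment order (first appearance) is an assumption made only for this example. The point is that test triples are translated with the mapping built on the training data, so the same entity receives the same index in both sets.

```python
def build_index(triples):
    """Assign consecutive integer ids to entities and relations,
    in order of first appearance (an assumption of this sketch)."""
    ent_to_idx, rel_to_idx = {}, {}
    for s, p, o in triples:
        for entity in (s, o):
            ent_to_idx.setdefault(entity, len(ent_to_idx))
        rel_to_idx.setdefault(p, len(rel_to_idx))
    return ent_to_idx, rel_to_idx


def apply_index(triples, ent_to_idx, rel_to_idx):
    """Translate raw (s, p, o) triples to index triples with an existing mapping."""
    return [(ent_to_idx[s], rel_to_idx[p], ent_to_idx[o]) for s, p, o in triples]


train = [("a", "likes", "b"), ("b", "likes", "c"), ("c", "knows", "a")]
test = [("c", "likes", "a")]

# Build the mapping once on the training data...
ent_to_idx, rel_to_idx = build_index(train)
train_ids = apply_index(train, ent_to_idx, rel_to_idx)
# ...and reuse it for the test data instead of building a new one.
test_ids = apply_index(test, ent_to_idx, rel_to_idx)
```

In the library itself the equivalent step is passing the trained model's mapper as use_indexer, as in the example at the top of this page.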

add_dataset(data_source, dataset_type)

Adds the dataset to the backend (if possible).

clean()

Cleans up the temporary files created for training/evaluation.

get_batch_generator(dataset_type='train', use_filter=False)

Get batch generator from the backend.

Parameters: dataset_type (str) – Specifies whether data are generated for the “train”, “valid” or “test” set.
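
As a rough picture of what a batch generator does, the following pure-Python sketch yields consecutive slices of a list of index triples. It is an illustration only, not the backend's implementation: batch_generator is a hypothetical function, and letting the last batch be smaller than batch_size is an assumption of this sketch.

```python
def batch_generator(triples, batch_size):
    """Yield consecutive chunks of at most batch_size triples."""
    for start in range(0, len(triples), batch_size):
        yield triples[start:start + batch_size]


triples = [(0, 0, 1), (1, 0, 2), (2, 1, 0), (0, 1, 2), (1, 1, 0)]
batches = list(batch_generator(triples, batch_size=2))
batch_sizes = [len(batch) for batch in batches]
```

Because the generator is lazy, only one batch needs to be materialised at a time, which is the property that makes chunked loading of large datasets possible.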

get_complementary_entities(triples, use_filter=False)

Get subjects and objects complementary to triples (?,p,?).

Returns the participating entities in the relation ?-p-o and s-p-?.

Parameters: triples (ndarray, shape (N, 3)) – N triples (s-p-o) that we are querying.
Returns: entities – Tuple containing two lists, one with the subjects and one with the objects participating in the relations ?-p-o and s-p-?.
Return type: tuple

get_complementary_objects(triples, use_filter=False)

Get objects complementary to triples (s,p,?).

For a given triple retrieve all objects coming from triples with the same subjects and predicates. Function used during evaluation.

Parameters: triples (list or array) – List or array of arrays with 3 elements (subject, predicate, object).
Returns: objects – Objects present in the input triples.
Return type: list

get_complementary_subjects(triples, use_filter=False)

Get subjects complementary to triples (?,p,o).

For a given triple retrieve all subjects coming from triples with the same objects and predicates.

Parameters: triples (list or array) – List or array of arrays with 3 elements (subject, predicate, object).
Returns: subjects – Subjects present in the input triples.
Return type: list
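
The (s,p,?) and (?,p,o) lookups can be pictured with the small pure-Python sketch below. It is not the library's code: complementary_objects and complementary_subjects are hypothetical stand-ins, and returning one sorted list per query triple is an assumption made for readability.

```python
def complementary_objects(known, query):
    """For each query triple (s, p, _), collect objects o such that (s, p, o) is known."""
    return [
        sorted({o for s, p, o in known if s == qs and p == qp})
        for qs, qp, _ in query
    ]


def complementary_subjects(known, query):
    """For each query triple (_, p, o), collect subjects s such that (s, p, o) is known."""
    return [
        sorted({s for s, p, o in known if p == qp and o == qo})
        for _, qp, qo in query
    ]


known = [(0, 0, 1), (0, 0, 2), (3, 0, 1), (0, 1, 3)]
query = [(0, 0, 1)]

objs = complementary_objects(known, query)   # objects seen with subject 0, predicate 0
subs = complementary_subjects(known, query)  # subjects seen with predicate 0, object 1
```

Lookups of this kind are what allows known true triples to be filtered out of the corruptions during evaluation.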

get_data_size()

Returns number of triples.

get_participating_entities(triples, sides='s,o', use_filter=False)

Get entities from triples with fixed subjects or fixed objects or both fixed.

Parameters:

triples (list or array) – List or array of arrays with 3 elements (subject, predicate, object).
sides (str) – String specifying what entities to retrieve: “s” - subjects, “o” - objects, “s,o” - subjects and objects, “o,s” - objects and subjects.

Returns:

entities – List of subjects (if sides="s") or objects (if sides="o") or two lists with both (if sides="s,o" or sides="o,s").

Return type:

list
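
The effect of the sides argument on the order of the returned lists can be sketched as follows. This is a hypothetical pure-Python stand-in, not the library's implementation; in particular, merging the results of all query triples into one sorted list per side is an assumption of the sketch.

```python
def participating_entities(known, query, sides="s,o"):
    """Return subjects and/or objects matching the query, ordered as in sides."""
    if sides not in ("s", "o", "s,o", "o,s"):
        raise ValueError("sides must be one of 's', 'o', 's,o', 'o,s'")
    # Subjects that share predicate and object with some query triple.
    subjects = sorted(
        {s for s, p, o in known for _, qp, qo in query if (p, o) == (qp, qo)}
    )
    # Objects that share subject and predicate with some query triple.
    objects = sorted(
        {o for s, p, o in known for qs, qp, _ in query if (s, p) == (qs, qp)}
    )
    by_side = {"s": subjects, "o": objects}
    selected = [by_side[side] for side in sides.split(",")]
    # One list for a single side, two lists (in the requested order) otherwise.
    return selected[0] if len(selected) == 1 else selected


known = [(0, 0, 1), (0, 0, 2), (3, 0, 1), (0, 1, 3)]
query = [(0, 0, 1)]

only_subjects = participating_entities(known, query, sides="s")
objects_then_subjects = participating_entities(known, query, sides="o,s")
```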

get_tf_generator()

Generates a tensorflow.data.Dataset object.

get_triples(subjects=None, objects=None, entities=None)

Get triples whose subject is in subjects and whose object is in objects, or triples in which either the subject or the object is in entities.

Parameters:

subjects (list) – List of entities that the subject of each triple should belong to.
objects (list) – List of entities that the object of each triple should belong to.
entities (list) – List of entities that the subject and object of each triple should belong to.

Returns:

triples – List of triples constrained by subjects and objects.

Return type:

list
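
A minimal pure-Python sketch of this kind of selection is shown below. It is not the backend's query: select_triples is a hypothetical function, it follows the summary line's reading of entities (a triple matches if either its subject or its object is listed), and treating an omitted subjects or objects argument as "no constraint" is an assumption of the sketch.

```python
def select_triples(known, subjects=None, objects=None, entities=None):
    """Filter known (s, p, o) triples by subject/object membership."""
    if entities is not None:
        members = set(entities)
        # Either endpoint of the triple is one of the given entities.
        return [t for t in known if t[0] in members or t[2] in members]
    subj = None if subjects is None else set(subjects)
    obj = None if objects is None else set(objects)
    # Both constraints must hold; a missing constraint matches everything.
    return [
        t for t in known
        if (subj is None or t[0] in subj) and (obj is None or t[2] in obj)
    ]


known = [(0, 0, 1), (0, 0, 2), (3, 0, 1), (0, 1, 3)]

by_sides = select_triples(known, subjects=[0], objects=[1, 3])
by_entities = select_triples(known, entities=[3])
```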

intersect(dataloader)

Returns the intersection between the current data loader and another one specified in dataloader.

Parameters: dataloader (GraphDataLoader) – Dataloader with which to calculate the intersection.
Returns: intersection – Array of intersecting elements.
Return type: ndarray

on_complete()

on_epoch_end()

reload(use_filter=False, dataset_type='train')

Reinstantiate batch iterator.