Advanced Topics

Support for loading and managing datasets.

This section is a brief introduction to AmpliGraph’s data pipeline. Advanced users may use it as a starting point to understand how to train their models on custom datasets that are too large to fit in either CPU or GPU memory.

The first element of AmpliGraph’s data pipeline is a data handler that uses the GraphDataLoader class to load large datasets. This data loader takes data from a source and stores it in a backend. If we initialize the GraphDataLoader with backend=NoBackend (the default), the data is stored in memory, i.e., no backend is used. If we instead set backend=SQLiteAdapter, we initialize a backend that relies on SQLite. In this case, data is persisted on disk and is later loaded into memory in chunks, so as to avoid exhausting the RAM. This is the option to choose for handling massive datasets.
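The idea behind a disk-backed backend can be illustrated with a minimal sketch that uses only Python's standard sqlite3 module. It is not AmpliGraph's SQLiteAdapter: the table layout and the helper names persist_triples and iter_chunks are hypothetical and chosen for illustration only.

```python
import os
import sqlite3
import tempfile


def persist_triples(db_path, triples):
    """Write (subject, predicate, object) triples to an SQLite file on disk."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits the transaction on success
            conn.execute("CREATE TABLE IF NOT EXISTS triples (s TEXT, p TEXT, o TEXT)")
            conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    finally:
        conn.close()


def iter_chunks(db_path, chunk_size):
    """Yield lists of at most chunk_size triples, one chunk in memory at a time."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute("SELECT s, p, o FROM triples ORDER BY rowid")
        while True:
            rows = cursor.fetchmany(chunk_size)
            if not rows:
                break
            yield rows
    finally:
        conn.close()


# Toy dataset standing in for a knowledge graph that would not fit in memory.
triples = [("a", "knows", "b"), ("b", "knows", "c"), ("c", "likes", "a"),
           ("a", "likes", "d"), ("d", "knows", "b")]
db_path = os.path.join(tempfile.mkdtemp(), "triples.db")
persist_triples(db_path, triples)

chunk_sizes = [len(chunk) for chunk in iter_chunks(db_path, chunk_size=2)]
print(chunk_sizes)
```

Only chunk_size rows are held in memory at any moment; the full dataset lives in the SQLite file.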

Instantiating a backend is not sufficient by itself. It is also essential to specify how the chunks loaded into memory are defined. This is equivalent to tackling the problem of graph partitioning. Partitioning a graph amounts to splitting its nodes into \(P\) partitions, each sized to fit in memory. When the data is loaded, the partitions are created and individually persisted on disk. Then, during training, a single partition is loaded into memory and the model is trained on it. Once the model finishes operating on one partition, it unloads it and loads the next one.
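The load, train, unload cycle described above can be sketched in plain Python. This is a generic illustration under simple assumptions (round-robin assignment of nodes to buckets, one pickle file per group of triples), not the algorithm AmpliGraph uses; every function name here is hypothetical.

```python
import os
import pickle
import tempfile
from collections import defaultdict


def partition_edges(triples, num_partitions):
    """Assign nodes round-robin to buckets, then group triples by
    the pair (bucket of subject, bucket of object)."""
    nodes = sorted({n for s, _, o in triples for n in (s, o)})
    bucket_of = {node: i % num_partitions for i, node in enumerate(nodes)}
    groups = defaultdict(list)
    for s, p, o in triples:
        groups[(bucket_of[s], bucket_of[o])].append((s, p, o))
    return bucket_of, dict(groups)


def persist_partitions(groups, directory):
    """Write each group of triples to its own file and return the file paths."""
    paths = []
    for key in sorted(groups):
        path = os.path.join(directory, "partition_%d_%d.pkl" % key)
        with open(path, "wb") as f:
            pickle.dump(groups[key], f)
        paths.append(path)
    return paths


def train_over_partitions(paths, train_step):
    """Load one partition at a time, train on it, then release it."""
    for path in paths:
        with open(path, "rb") as f:
            partition = pickle.load(f)
        train_step(partition)
        del partition  # drop the reference before loading the next partition


triples = [("a", "knows", "b"), ("b", "knows", "c"), ("c", "likes", "a"),
           ("a", "likes", "d"), ("d", "knows", "b")]
bucket_of, groups = partition_edges(triples, num_partitions=2)
paths = persist_partitions(groups, tempfile.mkdtemp())

seen = []  # stand-in for a real training step: record each partition's size
train_over_partitions(paths, lambda partition: seen.append(len(partition)))
print(seen)
```

Splitting the nodes into \(P\) buckets yields up to \(P^2\) groups of triples in this sketch, and at most one of them is in memory at a time.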

There are many possible strategies for partitioning a graph, but in AmpliGraph we recommend using the default option, the BucketGraphPartitioner strategy, as its runtime performance is much better than that of the other baselines.
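Putting the two components together might look like the sketch below. The class names and the arguments data_source, batch_size, backend and k come from this page; the import path for SQLiteAdapter and the wrapper function build_partitioned_data are assumptions, so check them against the installed AmpliGraph version.

```python
def build_partitioned_data(train_triples, k=2, batch_size=1000):
    """Wrap training triples in a disk-backed loader and a bucket partitioner.

    Hypothetical helper: imports are local so that defining this function
    does not require AmpliGraph to be installed.
    """
    from ampligraph.datasets import GraphDataLoader, BucketGraphPartitioner
    # Assumed import path for the SQLite backend.
    from ampligraph.datasets.sqlite_adapter import SQLiteAdapter

    # Persist the triples on disk instead of keeping them all in memory.
    loader = GraphDataLoader(train_triples,
                             batch_size=batch_size,
                             backend=SQLiteAdapter)
    # Split the graph into k partitions that are loaded one at a time.
    return BucketGraphPartitioner(loader, k=k)
```

The returned partitioner is intended to be used as the training data for a model in place of an in-memory array of triples.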

For more details about the data pipeline components see the API below:

`GraphDataLoader`(data_source[, batch_size, ...])	Data loader for models to ingest graph data.
`BucketGraphPartitioner`(data[, k])	Bucket-based partition strategy.