discover_facts

ampligraph.discovery.discover_facts(X, model, top_n=10, strategy='random_uniform', max_candidates=0.3, target_rel=None, seed=0)

Discover new facts from an existing knowledge graph.

You should use this function when you already have a model trained on a knowledge graph and you want to discover potentially true statements in that knowledge graph.

The general procedure of this function is to generate a set of candidate statements C according to the chosen sampling strategy, then rank them against a set of corruptions using the ampligraph.evaluation.evaluate_performance() function. Candidates that appear in the top_n positions of the resulting ranking are returned as likely true statements.
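
The following is a simplified sketch of that ranking step, not the exact internals of discover_facts. It assumes a set of candidate triples has already been generated and that evaluate_performance returns one rank per corruption side (shape [n, 2]); the exact return shape may vary across AmpliGraph versions.

    import numpy as np
    from ampligraph.evaluation import evaluate_performance

    def rank_and_filter(candidates, model, top_n=10):
        # Rank each candidate triple against corruptions of its subject and object.
        ranks = evaluate_performance(candidates, model=model, verbose=False)
        ranks = np.asarray(ranks).reshape(len(candidates), -1)
        # Keep candidates whose best rank across corruption sides falls within top_n.
        # (A simplification: discover_facts applies its own acceptance criterion.)
        return candidates[ranks.min(axis=1) <= top_n]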

The majority of the strategies are implemented with the same underlying principle of searching for candidate statements (a minimal sketch of this idea follows the warning below):

  • from among the less frequent entities (‘entity_frequency’),

  • less connected entities (‘graph_degree’, ‘cluster_coefficient’),

  • less frequent local graph structures (‘cluster_triangles’, ‘cluster_squares’),

on the assumption that densely connected entities are less likely to have missing true statements.

The remaining strategies (‘random_uniform’, ‘exhaustive’) generate candidate statements by uniform random sampling of entities and relations, or exhaustively, respectively.

    Warning

    Due to the significant amount of computation required to evaluate all triples using the ‘exhaustive’ strategy, we do not recommend its use at this time.
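
As a rough illustration of the frequency-based principle above (a sketch only, not the library's implementation), entities could be sampled with probability inversely proportional to how often they appear in X:

    import numpy as np

    def sample_low_frequency_entities(X, n, seed=0):
        # Count how often each entity appears as subject or object.
        entities, counts = np.unique(np.concatenate([X[:, 0], X[:, 2]]),
                                     return_counts=True)
        # Rarer entities are assumed more likely to have missing true statements,
        # so weight them inversely to their frequency.
        weights = 1.0 / counts
        weights = weights / weights.sum()
        rng = np.random.default_rng(seed)
        return rng.choice(entities, size=n, replace=True, p=weights)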

The function will automatically filter out entities that haven’t been seen by the model. It operates on the assumption that the provided model has been fit on the data X (this is determined heuristically), although X may be a subset of the original data, in which case a warning is shown.
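
A quick way to check which triples in X the model can actually score (assuming an AmpliGraph 1.x EmbeddingModel, which exposes an ent_to_idx mapping once fitted):

    import numpy as np

    known = list(model.ent_to_idx.keys())             # entities seen during training
    mask = np.isin(X[:, 0], known) & np.isin(X[:, 2], known)
    X_seen = X[mask]                                  # triples the model can score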

The target_rel argument indicates what relation to generate candidate statements for. If this is set to None then all target relations will be considered for sampling.
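
A minimal usage sketch of the two modes, assuming model has already been fit on X as in the example further below:

    # Discover facts for a single relation type (here 'ALLIED_WITH', as in the
    # example below) versus sampling candidates for every relation in the graph.
    X_one = discover_facts(X, model, top_n=5, max_candidates=100,
                           strategy='random_uniform', target_rel='ALLIED_WITH', seed=0)
    X_all = discover_facts(X, model, top_n=5, max_candidates=100,
                           strategy='random_uniform', target_rel=None, seed=0)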

Parameters:
  • X (ndarray, shape [n, 3]) – The input knowledge graph used to train model, or a subset of it.
  • model (EmbeddingModel) – The trained model that will be used to score candidate facts.
  • top_n (int) – The cutoff position in the ranking used to consider a candidate triple as a true positive.
  • strategy (string) –

    The candidates generation strategy:

    • ’exhaustive’ : generates all possible candidates given the target_rel and consolidate_sides parameters.
    • ’random_uniform’ : generates N candidates (N <= max_candidates) based on a uniform random sampling of head and tail entities.
    • ’entity_frequency’ : generates candidates by sampling entities with low frequency.
    • ’graph_degree’ : generates candidates by sampling entities with a low graph degree.
    • ’cluster_coefficient’ : generates candidates by sampling entities with a low clustering coefficient.
    • ’cluster_triangles’ : generates candidates by sampling entities with a low number of cluster triangles.
    • ’cluster_squares’ : generates candidates by sampling entities with a low number of cluster squares.
  • max_candidates (int or float) – The maximum number of candidates generated by ‘strategy’. Can be an absolute number (int) or a fraction in [0, 1] (float).
  • target_rel (str) – Target relation to focus on. The function will discover facts only for that specific relation type. If None, the function attempts to discover new facts for all relation types in the graph.
  • seed (int) – Seed to use for reproducible results.
Returns:

X_pred – A list of new facts predicted to be true.

Return type:

ndarray, shape [n, 3]

Examples

>>> import requests
>>> from ampligraph.datasets import load_from_csv
>>> from ampligraph.latent_features import ComplEx
>>> from ampligraph.discovery import discover_facts
>>>
>>> # Game of Thrones relations dataset
>>> url = 'https://ampligraph.s3-eu-west-1.amazonaws.com/datasets/GoT.csv'
>>> open('GoT.csv', 'wb').write(requests.get(url).content)
>>> X = load_from_csv('.', 'GoT.csv', sep=',')
>>>
>>> model = ComplEx(batches_count=10, seed=0, epochs=200, k=150, eta=5,
...                 optimizer='adam', optimizer_params={'lr':1e-3},
...                 loss='multiclass_nll', regularizer='LP',
...                 regularizer_params={'p':3, 'lambda':1e-5},
...                 verbose=True)
>>> model.fit(X)
>>>
>>> discover_facts(X, model, top_n=3, max_candidates=20000, strategy='entity_frequency',
...                target_rel='ALLIED_WITH', seed=42)
array([['House Reed of Greywater Watch', 'ALLIED_WITH', 'Sybelle Glover'],
       ['Hugo Wull', 'ALLIED_WITH', 'House Norrey'],
       ['House Grell', 'ALLIED_WITH', 'Delonne Allyrion'],
       ['Lorent Lorch', 'ALLIED_WITH', 'House Ruttiger']], dtype=object)
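
A possible follow-up, assuming the call above was assigned to X_pred: the accepted predictions can be folded back into the graph, e.g. for retraining or further analysis.

>>> import numpy as np
>>> X_augmented = np.concatenate([X, X_pred])  # original triples plus discovered facts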