find_duplicates

ampligraph.discovery.find_duplicates(X, model, mode='e', metric='l2', tolerance='auto', expected_fraction_duplicates=0.1, verbose=False)

Find duplicate entities, relations or triples in a graph based on their embeddings.

For example, say you have a movie dataset that was scraped off the web and may contain duplicate movies. The movies in this case are the entities, so you would use the “e” mode to find all the movies that could be duplicates of each other.

Duplicates are defined as points whose distance in the embedding space is smaller than a given threshold (called the tolerance).
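
The definition above can be illustrated with a minimal, library-free sketch. The helper names (l2, find_close_groups) and the toy embeddings are hypothetical, and grouping each point with its neighbourhood is an assumption of this sketch rather than a description of the library's internals.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_close_groups(embeddings, tolerance):
    """Return the neighbourhoods (as frozensets) that hold at least two keys.

    A key's neighbourhood is every key, itself included, whose embedding
    lies closer than `tolerance`. This grouping rule is an assumption.
    """
    groups = set()
    for key, vec in embeddings.items():
        neighbours = frozenset(
            other
            for other, other_vec in embeddings.items()
            if l2(vec, other_vec) < tolerance
        )
        if len(neighbours) > 1:
            groups.add(neighbours)
    return groups


# Hypothetical 2-dimensional embeddings for three movie IDs.
toy_embeddings = {
    "ID1": [0.0, 0.0],
    "ID2": [0.1, 0.0],
    "ID3": [5.0, 5.0],
}
dups = find_close_groups(toy_embeddings, tolerance=0.45)
```

Here ID1 and ID2 are 0.1 apart and end up in one group, while ID3 has no neighbour within the tolerance and is left out.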

The tolerance can be defined a priori or found via an optimisation procedure given an expected fraction of duplicates. The optimisation algorithm applies a root-finding routine to find the tolerance that yields a fraction of duplicates closest to the expected one. The routine always converges.
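
As a rough illustration of the idea, the sketch below searches for a tolerance by bisection. Bisection is used here as a stand-in, since the text does not say which root-finding routine is applied. The helper names and the definition of "fraction of duplicates" (share of points with at least one other point within the tolerance) are assumptions of this sketch.

```python
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fraction_duplicates(points, tolerance):
    """Fraction of points with at least one other point closer than `tolerance`."""
    flagged = 0
    for i, p in enumerate(points):
        if any(i != j and l2(p, q) < tolerance for j, q in enumerate(points)):
            flagged += 1
    return flagged / len(points)


def auto_tolerance(points, expected_fraction, iterations=50):
    """Bisect on the tolerance until the duplicate fraction meets the target."""
    lo = 0.0  # no point is flagged at tolerance 0
    # Just above the largest pairwise distance, every point is flagged.
    hi = max(l2(p, q) for p in points for q in points) + 1e-9
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if fraction_duplicates(points, mid) < expected_fraction:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical embeddings: two close pairs and one isolated point.
points = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.3, 1.0], [5.0, 5.0]]
tol = auto_tolerance(points, expected_fraction=0.4)
frac = fraction_duplicates(points, tol)
```

The duplicate fraction is a step function of the tolerance, so the target may not be reachable exactly; this sketch returns the smallest tolerance it finds whose fraction is at least the target. Each bisection step re-runs the duplicate search, which is why an automatic tolerance costs more than a fixed one.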

Distance is defined by the chosen metric, which by default is the Euclidean distance (L2 norm).

As the distances are calculated on the embedding space, the embeddings must be meaningful for this routine to work properly. Therefore, it is suggested to evaluate the embeddings first using a metric such as MRR before considering applying this method.
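
For reference, MRR (mean reciprocal rank) is the average of 1/rank over the test triples. A minimal sketch follows; the function name and the ranks are made up for illustration, and in practice the ranks would come from evaluating the fitted model.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all ranks (ranks start at 1)."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Hypothetical ranks of the true triples among their corruptions.
ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(ranks)  # (1 + 1/2 + 1/4) / 3
```

An MRR close to 1 means the true triples are ranked near the top; values near 0 suggest the embeddings carry little signal, in which case distances between them are unlikely to identify duplicates reliably.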

Parameters:

X (ndarray, shape (n, 3) or (n)) – The input to be clustered. X can either be the triples of a knowledge graph, its entities, or its relations. The argument mode defines whether X is supposed to be an array of triples or an array of either entities or relations.
model (EmbeddingModel) – The fitted model that will be used to generate the embeddings. This model must have been fully trained already, be it directly with fit() or from a helper function such as ampligraph.evaluation.select_best_model_ranking().
mode (str) –
Specifies the kind of item (entities, relations or triples) among which to look for duplicates.

Choose from:
- ’e’ (default): the algorithm will find duplicates of the provided entities based on their embeddings.
- ’r’: the algorithm will find duplicates of the provided relations based on their embeddings.
- ’t’: the algorithm will find duplicates of the provided triples, each represented by the concatenation of the embeddings of its subject, predicate and object.
metric (str) – A distance metric used to compare entity distance in the embedding space. See options here.
tolerance (float or str) – Distance threshold (according to the chosen metric) below which one entity is considered a duplicate of another. If ‘auto’, it is determined automatically so that the fraction of duplicates found is as close as possible to expected_fraction_duplicates. The ‘auto’ option can be much slower than a fixed tolerance, as the internal duplicate-finding procedure is repeated multiple times.
expected_fraction_duplicates (float) – Expected fraction of duplicates to be found. It is used only when tolerance='auto'. Should be between 0 and 1 (default: 0.1).
verbose (bool) – Whether to print evaluation messages during optimisation when tolerance='auto' (default: False).
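
To make the ’t’ mode concrete, the sketch below builds the vector for one triple by concatenating the embeddings of its subject, predicate and object. The lookup tables, their values and the helper name are hypothetical; a fitted model would supply the real embeddings.

```python
def triple_embedding(triple, entity_emb, relation_emb):
    """Concatenate subject, predicate and object embeddings into one vector."""
    subject, predicate, obj = triple
    return entity_emb[subject] + relation_emb[predicate] + entity_emb[obj]


# Hypothetical 2-dimensional embeddings, for illustration only.
entity_emb = {"ID1": [0.1, 0.2], "Drama": [0.7, 0.8]}
relation_emb = {"hasGenre": [0.4, 0.5]}

vec = triple_embedding(("ID1", "hasGenre", "Drama"), entity_emb, relation_emb)
```

With embeddings of size k for both entities and relations, each triple becomes a vector of length 3k, and distances between triples are computed on those vectors.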

Returns:

duplicates (set of frozensets) – Each entry in the duplicates set is a frozenset containing all entities that were found to be duplicates according to the metric and tolerance. Each frozenset will contain at least two entities.
tolerance (float) – Tolerance used to find the duplicates (useful if the automatic tolerance option is selected).
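
A short sketch of how the two return values might be consumed. The values below are made up to mimic the documented shapes (a set of frozensets and a float); they are not real output.

```python
# Hypothetical return values with the documented shapes.
duplicates = {
    frozenset({"ID1", "ID2"}),
    frozenset({"ID7", "ID8", "ID9"}),
}
tolerance = 0.45

# Sets are unordered, so sort before printing for a stable listing.
groups = sorted(sorted(group) for group in duplicates)
for members in groups:
    print(f"{len(members)} candidates within tolerance {tolerance}: {members}")

# Map every member to its group for quick lookup.
member_to_group = {member: group for group in duplicates for member in group}
```

Because neither sets nor frozensets preserve order, code that slices or indexes the result directly may see a different order from run to run.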

Example

>>> import pandas as pd
>>> import numpy as np
>>> import re
>>> from ampligraph.latent_features.models import ScoringBasedEmbeddingModel
>>> # The IMDB dataset used here is part of the Movies5 dataset found on:
>>> # The Magellan Data Repository (https://sites.google.com/site/anhaidgroup/projects/data)
>>> import requests
>>> url = 'http://pages.cs.wisc.edu/~anhai/data/784_data/movies5.tar.gz'
>>> open('movies5.tar.gz', 'wb').write(requests.get(url).content)
>>> import tarfile
>>> tar = tarfile.open('movies5.tar.gz', "r:gz")
>>> tar.extractall()
>>> tar.close()
>>>
>>> # Reading tabular dataset of IMDB movies and filling the missing values
>>> imdb = pd.read_csv("movies5/csv_files/imdb.csv")
>>> imdb["directors"] = imdb["directors"].fillna("UnknownDirector")
>>> imdb["actors"] = imdb["actors"].fillna("UnknownActor")
>>> imdb["genre"] = imdb["genre"].fillna("UnknownGenre")
>>> imdb["duration"] = imdb["duration"].fillna("0")
>>>
>>> # Creating knowledge graph triples from tabular dataset
>>> imdb_triples = []
>>>
>>> for _, row in imdb.iterrows():
>>>     movie_id = "ID" + str(row["id"])
>>>     directors = row["directors"].split(",")
>>>     actors = row["actors"].split(",")
>>>     genres = row["genre"].split(",")
>>>     duration = "Duration" + str(int(re.sub(r"\D", "", row["duration"])) // 30)
>>>
>>>     directors_triples = [(movie_id, "hasDirector", d) for d in directors]
>>>     actors_triples = [(movie_id, "hasActor", a) for a in actors]
>>>     genres_triples = [(movie_id, "hasGenre", g) for g in genres]
>>>     duration_triple = (movie_id, "hasDuration", duration)
>>>
>>>     imdb_triples.extend(directors_triples)
>>>     imdb_triples.extend(actors_triples)
>>>     imdb_triples.extend(genres_triples)
>>>     imdb_triples.append(duration_triple)
>>>
>>> # Training knowledge graph embedding with ComplEx model
>>> from ampligraph.latent_features import ScoringBasedEmbeddingModel
>>>
>>> imdb_triples = np.array(imdb_triples)
>>> model = ScoringBasedEmbeddingModel(eta=5,
>>>                                    k=300,
>>>                                    scoring_type='ComplEx')
>>> model.compile(optimizer='adam', loss='multiclass_nll')
>>> model.fit(imdb_triples,
>>>           batch_size=10000,
>>>           epochs=10)
>>>
>>> # Finding duplicate movies (entities)
>>> from ampligraph.discovery import find_duplicates
>>>
>>> entities = np.unique(imdb_triples[:, 0])
>>> dups, _ = find_duplicates(entities, model, mode='e', tolerance=0.45)
>>> id_list = []
>>> for data in dups:
>>>     for i in data:
>>>         id_list.append(int(i[2:]))
>>> print(imdb.iloc[id_list[:6]][['movie_name', 'year']])
Epoch 1/10
7/7 [==============================] - 1s 122ms/step - loss: 15612.8799
Epoch 2/10
7/7 [==============================] - 0s 20ms/step - loss: 15610.5010
Epoch 3/10
7/7 [==============================] - 0s 19ms/step - loss: 15607.7412
Epoch 4/10
7/7 [==============================] - 0s 19ms/step - loss: 15604.0674
Epoch 5/10
7/7 [==============================] - 0s 20ms/step - loss: 15598.9365
Epoch 6/10
7/7 [==============================] - 0s 19ms/step - loss: 15591.7188
Epoch 7/10
7/7 [==============================] - 0s 19ms/step - loss: 15581.6055
Epoch 8/10
7/7 [==============================] - 0s 20ms/step - loss: 15567.6807
Epoch 9/10
7/7 [==============================] - 0s 20ms/step - loss: 15548.8184
Epoch 10/10
7/7 [==============================] - 0s 21ms/step - loss: 15523.8721
           movie_name  year
5198    Duel to Death  1983
5199    Duel to Death  1983
2649   The Eliminator  2004
2650   The Eliminator  2004
3967  Lipstick Camera  1994
3968  Lipstick Camera  1994