find_duplicates
- ampligraph.discovery.find_duplicates(X, model, mode='e', metric='l2', tolerance='auto', expected_fraction_duplicates=0.1, verbose=False)
Find duplicate entities, relations or triples in a graph based on their embeddings.
For example, say you have a movie dataset that was scraped off the web and may contain duplicate movies. The movies in this case are the entities. Therefore, you would use the 'e' mode to find all the movies that could be duplicates of each other.
Duplicates are defined as points whose distance in the embedding space is smaller than some given threshold (called the tolerance).
The tolerance can be defined a priori or be found via an optimisation procedure given an expected fraction of duplicates. The optimisation algorithm applies a root-finding routine to find the tolerance that yields the fraction of duplicates closest to the expected one. The routine always converges.
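To make the tolerance search concrete, here is a minimal sketch in plain NumPy. It is not AmpliGraph's implementation: the text only says "a root-finding routine", so bisection is swapped in as one simple choice, and the helper names `fraction_with_neighbour` and `search_tolerance`, as well as the definition of "fraction of duplicates" as the share of points with at least one neighbour within the tolerance, are assumptions made for illustration.

```python
import numpy as np


def fraction_with_neighbour(points, tolerance):
    """Share of points that have at least one other point within `tolerance` (L2).

    Hypothetical helper: this definition of the duplicate fraction is an assumption.
    """
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # ignore the zero distance of a point to itself
    return float((dist.min(axis=1) <= tolerance).mean())


def search_tolerance(points, expected_fraction, n_steps=50):
    """Bisection on the tolerance (hypothetical stand-in for the root finder).

    The fraction can only grow as the tolerance grows, so bisection narrows in
    on the smallest tolerance that reaches `expected_fraction`.
    """
    low = 0.0
    # The bounding-box diagonal is at least as large as any pairwise distance.
    high = float(np.sqrt(((points.max(axis=0) - points.min(axis=0)) ** 2).sum()))
    for _ in range(n_steps):
        mid = (low + high) / 2
        if fraction_with_neighbour(points, mid) < expected_fraction:
            low = mid
        else:
            high = mid
    return high


rng = np.random.default_rng(0)
points = rng.normal(size=(50, 4))  # stand-in for entity embeddings
tol = search_tolerance(points, expected_fraction=0.1)
```

Because the duplicate fraction is a step function of the tolerance, the search can only reach the expected fraction approximately, which is consistent with the "closest" wording above.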
Distance is defined by the chosen metric, which by default is the Euclidean distance (L2 norm).
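As an illustration of the definition above, the following sketch flags pairs of points whose Euclidean distance falls below a fixed tolerance. The embeddings, entity names and tolerance value are made up for the example, and the way the pairs are collected is an assumption, not a description of how the library groups its results.

```python
from itertools import combinations

import numpy as np

# Toy 2-dimensional embeddings (hypothetical values, not produced by a model).
embeddings = {
    "movie_a": np.array([0.10, 0.90]),
    "movie_a_copy": np.array([0.12, 0.88]),
    "movie_b": np.array([0.80, 0.20]),
}
tolerance = 0.1

# Keep every pair whose L2 distance is smaller than the tolerance.
close_pairs = {
    frozenset((u, v))
    for u, v in combinations(embeddings, 2)
    if np.linalg.norm(embeddings[u] - embeddings[v]) < tolerance
}
```

Here only `movie_a` and `movie_a_copy` are close enough (distance of about 0.03) to be flagged; `movie_b` is almost a full unit away from both.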
As the distances are calculated in the embedding space, the embeddings must be meaningful for this routine to work properly. Therefore, it is suggested to evaluate the embeddings first using a metric such as the mean reciprocal rank (MRR) before applying this method.
- Parameters:
X (ndarray, shape (n, 3) or (n)) – The input to be clustered. X can either be the triples of a knowledge graph, its entities, or its relations. The argument mode defines whether X is supposed to be an array of triples or an array of either entities or relations.
model (EmbeddingModel) – The fitted model that will be used to generate the embeddings. This model must have been fully trained already, be it directly with fit() or from a helper function such as ampligraph.evaluation.select_best_model_ranking().
mode (str) – Specifies which type of element to search for duplicates. Choose from:
- 'e' (default): the algorithm will find duplicates of the provided entities based on their embeddings.
- 'r': the algorithm will find duplicates of the provided relations based on their embeddings.
- 't': the algorithm will find duplicates of the provided triples based on the concatenation of the embeddings of the subject, predicate and object of each triple.
metric (str) – The distance metric used to compare embeddings (default: 'l2'). See options here.
tolerance (int or str) – Distance threshold (in the chosen metric) below which one entity is considered a duplicate of another. If 'auto', it is determined automatically so that the fraction of duplicates found is as close as possible to expected_fraction_duplicates. The 'auto' option can be much slower than passing a fixed value, because the internal duplicate-finding procedure is repeated multiple times.
expected_fraction_duplicates (float) – Expected fraction of duplicates to be found. It is used only when tolerance='auto'. Should be between 0 and 1 (default: 0.1).
verbose (bool) – Whether to print evaluation messages during optimisation when tolerance='auto' (default: False).
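The 't' mode compares triples through the concatenation of their subject, predicate and object embeddings. A minimal sketch of that concatenation is shown below; the lookup tables `entity_emb` and `relation_emb` and their values are hypothetical stand-ins for the embeddings a trained model would provide.

```python
import numpy as np

# Hypothetical embedding lookups; a real model would supply these vectors.
entity_emb = {"ID1": np.array([0.1, 0.2]), "Drama": np.array([0.3, 0.4])}
relation_emb = {"hasGenre": np.array([0.5, 0.6])}

triples = np.array([["ID1", "hasGenre", "Drama"]])

# One vector per triple: subject, predicate and object embeddings side by side.
triple_vectors = np.array([
    np.concatenate([entity_emb[s], relation_emb[p], entity_emb[o]])
    for s, p, o in triples
])
```

Each triple is thus represented by a single vector three times the embedding size (assuming entities and relations share the same dimensionality), and distances between these vectors are what the chosen metric compares.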
- Returns:
duplicates (set of frozensets) – Each entry in the duplicates set is a frozenset containing all entities that were found to be duplicates according to the metric and tolerance. Each frozenset will contain at least two entities.
tolerance (float) – Tolerance used to find the duplicates (useful if the automatic tolerance option is selected).
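One way to consume the returned set of frozensets is to map every entity to a single representative of its group. The sketch below uses a hand-written `duplicates` value in the documented shape (the identifiers are illustrative) and assumes that the groups do not overlap; if an entity appeared in several groups, the last group processed would win.

```python
# Hand-written stand-in for the `duplicates` return value (illustrative IDs).
duplicates = {
    frozenset({"ID5198", "ID5199"}),
    frozenset({"ID2649", "ID2650"}),
}

# Map each entity to one representative per group (here: the smallest label).
canonical = {}
for group in duplicates:
    representative = min(group)
    for entity in group:
        canonical[entity] = representative
```

The resulting dictionary can then be used to rewrite triples so that every duplicate points to the same identifier.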
Example
>>> import pandas as pd
>>> import numpy as np
>>> import re
>>> # The IMDB dataset used here is part of the Movies5 dataset found on:
>>> # The Magellan Data Repository (https://sites.google.com/site/anhaidgroup/projects/data)
>>> import requests
>>> url = 'http://pages.cs.wisc.edu/~anhai/data/784_data/movies5.tar.gz'
>>> open('movies5.tar.gz', 'wb').write(requests.get(url).content)
>>> import tarfile
>>> tar = tarfile.open('movies5.tar.gz', "r:gz")
>>> tar.extractall()
>>> tar.close()
>>>
>>> # Reading tabular dataset of IMDB movies and filling the missing values
>>> imdb = pd.read_csv("movies5/csv_files/imdb.csv")
>>> imdb["directors"] = imdb["directors"].fillna("UnknownDirector")
>>> imdb["actors"] = imdb["actors"].fillna("UnknownActor")
>>> imdb["genre"] = imdb["genre"].fillna("UnknownGenre")
>>> imdb["duration"] = imdb["duration"].fillna("0")
>>>
>>> # Creating knowledge graph triples from tabular dataset
>>> imdb_triples = []
>>>
>>> for _, row in imdb.iterrows():
...     movie_id = "ID" + str(row["id"])
...     directors = row["directors"].split(",")
...     actors = row["actors"].split(",")
...     genres = row["genre"].split(",")
...     # Bucket the duration into 30-minute bins (raw string avoids an invalid escape)
...     duration = "Duration" + str(int(re.sub(r"\D", "", row["duration"])) // 30)
...
...     directors_triples = [(movie_id, "hasDirector", d) for d in directors]
...     actors_triples = [(movie_id, "hasActor", a) for a in actors]
...     genres_triples = [(movie_id, "hasGenre", g) for g in genres]
...     duration_triple = (movie_id, "hasDuration", duration)
...
...     imdb_triples.extend(directors_triples)
...     imdb_triples.extend(actors_triples)
...     imdb_triples.extend(genres_triples)
...     imdb_triples.append(duration_triple)
>>>
>>> # Training knowledge graph embedding with ComplEx model
>>> from ampligraph.latent_features import ScoringBasedEmbeddingModel
>>>
>>> imdb_triples = np.array(imdb_triples)
>>> model = ScoringBasedEmbeddingModel(eta=5,
...                                    k=300,
...                                    scoring_type='ComplEx')
>>> model.compile(optimizer='adam', loss='multiclass_nll')
>>> model.fit(imdb_triples,
...           batch_size=10000,
...           epochs=10)
>>>
>>> # Finding duplicate movies (entities)
>>> from ampligraph.discovery import find_duplicates
>>>
>>> entities = np.unique(imdb_triples[:, 0])
>>> dups, _ = find_duplicates(entities, model, mode='e', tolerance=0.45)
>>> id_list = []
>>> for data in dups:
...     for i in data:
...         # Strip the "ID" prefix to recover the numeric movie id
...         id_list.append(int(i[2:]))
>>> print(imdb.iloc[id_list[:6]][['movie_name', 'year']])
Epoch 1/10
7/7 [==============================] - 1s 122ms/step - loss: 15612.8799
Epoch 2/10
7/7 [==============================] - 0s 20ms/step - loss: 15610.5010
Epoch 3/10
7/7 [==============================] - 0s 19ms/step - loss: 15607.7412
Epoch 4/10
7/7 [==============================] - 0s 19ms/step - loss: 15604.0674
Epoch 5/10
7/7 [==============================] - 0s 20ms/step - loss: 15598.9365
Epoch 6/10
7/7 [==============================] - 0s 19ms/step - loss: 15591.7188
Epoch 7/10
7/7 [==============================] - 0s 19ms/step - loss: 15581.6055
Epoch 8/10
7/7 [==============================] - 0s 20ms/step - loss: 15567.6807
Epoch 9/10
7/7 [==============================] - 0s 20ms/step - loss: 15548.8184
Epoch 10/10
7/7 [==============================] - 0s 21ms/step - loss: 15523.8721
           movie_name  year
5198    Duel to Death  1983
5199    Duel to Death  1983
2649   The Eliminator  2004
2650   The Eliminator  2004
3967  Lipstick Camera  1994
3968  Lipstick Camera  1994