find_duplicates

ampligraph.discovery.find_duplicates(X, model, mode='entity', metric='l2', tolerance='auto', expected_fraction_duplicates=0.1, verbose=False)

Find duplicate entities, relations, or triples in a graph based on their embeddings.
For example, say you have a movie dataset that was scraped off the web with possible duplicate movies. The movies in this case are the entities. Therefore, you would use the 'entity' mode to find all the movies that could be duplicates of each other.
Duplicates are defined as points whose distance in the embedding space is smaller than some given threshold (called the tolerance).
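Conceptually, for a fixed tolerance and the default L2 metric, the grouping can be sketched as a naive O(n²) pass over pairwise distances. This is an illustrative helper (`naive_duplicates` is a hypothetical name, not part of the ampligraph API, and the library's internal clustering may differ):

```python
import numpy as np

def naive_duplicates(embeddings, labels, tolerance):
    """Sketch: group points whose L2 distance is below `tolerance`."""
    # Pairwise Euclidean (L2) distances between all embeddings.
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    duplicates = set()
    for i in range(len(labels)):
        # Every other point closer than `tolerance` is a duplicate of point i.
        close = {labels[j] for j in range(len(labels))
                 if j != i and dists[i, j] < tolerance}
        if close:
            duplicates.add(frozenset(close | {labels[i]}))
    return duplicates

emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
dups = naive_duplicates(emb, ['a', 'b', 'c'], tolerance=0.5)
# 'a' and 'b' are within 0.5 of each other; 'c' is far from both.
```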
The tolerance can be defined a priori or found via an optimisation procedure, given an expected fraction of duplicates. The optimisation algorithm applies a root-finding routine to find the tolerance that yields the closest match to the expected fraction. The routine always converges.
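Assuming (reasonably) that the fraction of duplicates found is non-decreasing in the tolerance, such a root-finding step can be sketched as a bisection. `tune_tolerance` and `fraction_at` are hypothetical names, not the library's internals; the point is only that the 'auto' option repeatedly re-runs the duplicate search, which is why it is slower than passing a fixed tolerance:

```python
def tune_tolerance(fraction_at, target, lo=0.0, hi=10.0, iters=30):
    """Bisection sketch: find a tolerance t with fraction_at(t) close to target.

    `fraction_at(t)` stands for one full duplicate search at tolerance t,
    returning the fraction of points flagged as duplicates. It is assumed
    non-decreasing: a larger tolerance can only merge more points.
    """
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if fraction_at(mid) < target:
            lo = mid  # too few duplicates found: loosen the tolerance
        else:
            hi = mid  # too many: tighten it
    return (lo + hi) / 2.0
```

Bisection halves the bracketing interval each step, so it always converges to a fixed precision in a bounded number of iterations.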
Distance is defined by the chosen metric, which by default is the Euclidean distance (L2 norm).
As the distances are calculated in the embedding space, the embeddings must be meaningful for this routine to work properly. It is therefore suggested to evaluate the embeddings first, using a metric such as MRR, before applying this method.
Parameters:
    X (ndarray, shape [n, 3] or [n]) – The input to be clustered. X can be the triples of a knowledge graph, its entities, or its relations. The mode argument defines whether X is an array of triples or an array of either entities or relations.
    model (EmbeddingModel) – The fitted model that will be used to generate the embeddings. This model must already be fully trained, either directly with fit() or through a helper function such as ampligraph.evaluation.select_best_model_ranking().
    mode (string) – Choose from:
        'entity' (default): the algorithm will find duplicates of the provided entities based on their embeddings.
        'relation': the algorithm will find duplicates of the provided relations based on their embeddings.
        'triple': the algorithm will find duplicates of the concatenation of the embeddings of the subject, predicate, and object of each provided triple.
    metric (str) – A distance metric used to compare entity distance in the embedding space. See options here.
    tolerance (int or str) – Minimum distance (under the chosen metric) below which one entity is considered a duplicate of another. If 'auto', the tolerance is determined automatically so that the expected_fraction_duplicates is obtained. The 'auto' option can be much slower than passing a fixed value, as the internal duplicate-finding procedure is repeated multiple times.
    expected_fraction_duplicates (float) – Expected fraction of duplicates to be found. Used only when tolerance is 'auto'. Should be between 0 and 1 (default: 0.1).
    verbose (bool) – Whether to print evaluation messages during optimisation (when tolerance is 'auto'). Default: False.
Returns:
    duplicates (set of frozensets) – Each entry in the duplicates set is a frozenset containing all entities that were found to be duplicates according to the metric and tolerance. Each frozenset contains at least two entities.
    tolerance (float) – The tolerance used to find the duplicates (useful in the case of the automatic tolerance option).
Examples
>>> import pandas as pd
>>> import numpy as np
>>> import re
>>>
>>> # The IMDB dataset used here is part of the Movies5 dataset found on:
>>> # The Magellan Data Repository (https://sites.google.com/site/anhaidgroup/projects/data)
>>> import requests
>>> url = 'http://pages.cs.wisc.edu/~anhai/data/784_data/movies5.tar.gz'
>>> open('movies5.tar.gz', 'wb').write(requests.get(url).content)
>>> import tarfile
>>> tar = tarfile.open('movies5.tar.gz', "r:gz")
>>> tar.extractall()
>>> tar.close()
>>>
>>> # Reading tabular dataset of IMDB movies and filling the missing values
>>> imdb = pd.read_csv("movies5/csv_files/imdb.csv")
>>> imdb["directors"] = imdb["directors"].fillna("UnknownDirector")
>>> imdb["actors"] = imdb["actors"].fillna("UnknownActor")
>>> imdb["genre"] = imdb["genre"].fillna("UnknownGenre")
>>> imdb["duration"] = imdb["duration"].fillna("0")
>>>
>>> # Creating knowledge graph triples from tabular dataset
>>> imdb_triples = []
>>>
>>> for _, row in imdb.iterrows():
...     movie_id = "ID" + str(row["id"])
...     directors = row["directors"].split(",")
...     actors = row["actors"].split(",")
...     genres = row["genre"].split(",")
...     duration = "Duration" + str(int(re.sub(r"\D", "", row["duration"])) // 30)
...     directors_triples = [(movie_id, "hasDirector", d) for d in directors]
...     actors_triples = [(movie_id, "hasActor", a) for a in actors]
...     genres_triples = [(movie_id, "hasGenre", g) for g in genres]
...     duration_triple = (movie_id, "hasDuration", duration)
...     imdb_triples.extend(directors_triples)
...     imdb_triples.extend(actors_triples)
...     imdb_triples.extend(genres_triples)
...     imdb_triples.append(duration_triple)
>>>
>>> # Training knowledge graph embedding with ComplEx model
>>> from ampligraph.latent_features import ComplEx
>>>
>>> model = ComplEx(batches_count=10,
...                 seed=0,
...                 epochs=200,
...                 k=150,
...                 eta=5,
...                 optimizer='adam',
...                 optimizer_params={'lr': 1e-3},
...                 loss='multiclass_nll',
...                 regularizer='LP',
...                 regularizer_params={'p': 3, 'lambda': 1e-5},
...                 verbose=True)
>>>
>>> imdb_triples = np.array(imdb_triples)
>>> model.fit(imdb_triples)
>>>
>>> # Finding duplicate movies (entities)
>>> from ampligraph.discovery import find_duplicates
>>>
>>> entities = np.unique(imdb_triples[:, 0])
>>> dups, _ = find_duplicates(entities, model, mode='entity', tolerance=0.4)
>>> print(list(dups)[:3])
[frozenset({'ID4048', 'ID4049'}), frozenset({'ID5994', 'ID5993'}), frozenset({'ID6447', 'ID6448'})]
>>> print(imdb[imdb.id.isin((4048, 4049, 5994, 5993, 6447, 6448))][['movie_name', 'year']])
                    movie_name  year
4048          Ulterior Motives  1993
4049          Ulterior Motives  1993
5993          Chinese Hercules  1973
5994          Chinese Hercules  1973
6447  The Stranglers of Bombay  1959
6448  The Stranglers of Bombay  1959