RandomBaseline
class ampligraph.latent_features.RandomBaseline(seed=0, verbose=False)
Random baseline
A dummy model that assigns each triple a pseudo-random score between 0 and 1, drawn from a uniform distribution.
The model is useful whenever you need to compare the performance of another model on a custom knowledge graph, and no other baseline is available.
Note
Although the model still requires invoking the fit() method, no actual training will be carried out.
Examples
>>> import numpy as np
>>> from ampligraph.latent_features import RandomBaseline
>>> model = RandomBaseline()
>>> X = np.array([['a', 'y', 'b'],
...               ['b', 'y', 'a'],
...               ['a', 'y', 'c'],
...               ['c', 'y', 'a'],
...               ['a', 'y', 'd'],
...               ['c', 'y', 'd'],
...               ['b', 'y', 'c'],
...               ['f', 'y', 'e']])
>>> model.fit(X)
>>> model.predict(np.array([['f', 'y', 'e'], ['b', 'y', 'd']]))
[0.5488135039273248, 0.7151893663724195]
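The idea behind the baseline can be sketched without AmpliGraph. The snippet below is an illustration under stated assumptions, not the library's implementation: `random_scores` is a hypothetical helper that draws one uniform score per triple from a seeded generator, so the scores depend only on the seed and the number of triples, never on their content.

```python
import random


def random_scores(triples, seed=0):
    """Hypothetical helper: return one uniform score in [0, 1) per triple."""
    rng = random.Random(seed)  # seeded generator, so runs are reproducible
    return [rng.random() for _ in triples]


triples = [("f", "y", "e"), ("b", "y", "d")]
scores_a = random_scores(triples, seed=0)
scores_b = random_scores(triples, seed=0)  # same seed, same scores
scores_c = random_scores(triples, seed=1)  # different seed, different scores
```

Because the generator here is Python's `random` module rather than the one AmpliGraph uses internally, the actual numbers will not match the example output above.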
Methods
__init__([seed, verbose])    Initialize the model.
fit(X[, early_stopping, …])    Train the random model.
predict(X[, from_idx])    Predict the scores of triples using a trained embedding model.
get_hyperparameter_dict()    Returns hyperparameters of the model.
__init__(seed=0, verbose=False)
Initialize the model.
Parameters:
- seed (int) – The seed used by the internal random number generator.
- verbose (bool) – Verbose mode.
fit(X, early_stopping=False, early_stopping_params={}, focusE_numeric_edge_values=None, tensorboard_logs_path=None)
Train the random model.
There is no actual training involved in practice and the early stopping parameters won’t have any effect.
Parameters:
- X (ndarray, shape [n, 3]) – The training triples.
- early_stopping (bool) – Flag to enable early stopping (default: False).
  If set to True, the training loop adopts the following early stopping heuristic:
  - The model will be trained regardless of early stopping for burn_in epochs.
  - Every check_interval epochs the method will compute the metric specified in criteria.
  If such metric decreases for stop_interval checks, we stop training early.
  Note the metric is computed on x_valid. This is usually a validation set that you held out.
  Also, because criteria is a ranking metric, it requires generating negatives. The entities used to generate corruptions can be specified, as well as the side(s) of a triple to corrupt. The method supports filtered metrics, by passing an array of positives to x_filter. This will be used to filter the negatives generated on the fly (i.e. the corruptions).
  Note
  Keep in mind the early stopping criteria may introduce a certain overhead (caused by the metric computation). The goal is to strike a good trade-off between such overhead and saving training epochs.
  A common approach is to use MRR unfiltered:
  early_stopping_params={'x_valid': X['valid'], 'criteria': 'mrr'}
  Note the size of the validation set also contributes to such overhead. In most cases a smaller validation set would be enough.
- early_stopping_params (dictionary) – Dictionary of hyperparameters for the early stopping heuristics.
  The following string keys are supported:
  - 'x_valid': ndarray, shape [n, 3] : Validation set to be used for early stopping.
  - 'criteria': string : Criteria for early stopping: 'hits10', 'hits3', 'hits1' or 'mrr' (default).
  - 'x_filter': ndarray, shape [n, 3] : Positive triples to use as filter if a 'filtered' early stopping criteria is desired (i.e. filtered-MRR if 'criteria': 'mrr'). Note this will affect training time (no filter by default).
  - 'burn_in': int : Number of epochs to pass before kicking in early stopping (default: 100).
  - 'check_interval': int : Early stopping interval after burn-in (default: 10).
  - 'stop_interval': int : Stop if criteria is performing worse over n consecutive checks (default: 3).
  - 'corruption_entities': List of entities to be used for corruptions. If 'all', it uses all entities (default: 'all').
  - 'corrupt_side': Specifies which side to corrupt: 's', 'o' or 's+o' (default).
  Example:
  early_stopping_params={'x_valid': X['valid'], 'criteria': 'mrr'}
- focusE_numeric_edge_values (ndarray, shape [n, 1]) – Numeric values associated with links. Semantically, the numeric value can signify importance, uncertainty, significance, confidence, etc. If the numeric value is unknown, pass a NaN weight; the model will then assign a numeric value uniformly at random. One can also think about assigning numeric values by looking at their distribution per predicate.
- tensorboard_logs_path (str or None) – Path to store tensorboard logs, e.g. average training loss tracking per epoch (default: None, indicating no logs will be collected). When provided, a folder is created under the given path and the tensorboard files are saved there. To then view the loss, run in the terminal: tensorboard --logdir <tensorboard_logs_path>
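The early stopping heuristic described by the burn_in, check_interval and stop_interval keys can be sketched as follows. This is a minimal illustration, not AmpliGraph's code: `early_stopping_epoch` is a hypothetical function, and it assumes that a check happens on every epoch at or after burn_in that is a multiple of check_interval, and that a check counts as "worse" when the metric does not improve on the best value seen so far. As noted above, none of this has any effect on RandomBaseline itself.

```python
def early_stopping_epoch(metric_per_epoch, burn_in=100, check_interval=10,
                         stop_interval=3):
    """Hypothetical helper: return the 1-based epoch at which training
    would stop early, or None if it runs to the end."""
    best = float("-inf")
    worse_checks = 0  # consecutive checks without improvement
    for epoch, metric in enumerate(metric_per_epoch, start=1):
        # Assumption: no check during burn-in or between check intervals.
        if epoch < burn_in or epoch % check_interval != 0:
            continue
        if metric > best:
            best = metric
            worse_checks = 0
        else:
            worse_checks += 1
            if worse_checks == stop_interval:
                return epoch
    return None


# Toy validation metric: improves until epoch 120, then degrades.
history = [min(e, 120) / 200 - max(0, e - 120) / 1000 for e in range(1, 201)]
stop_epoch = early_stopping_epoch(history)

# A metric that keeps improving never triggers the stop.
never_stops = early_stopping_epoch([e / 1000 for e in range(1, 201)])
```

With the toy history, the checks at epochs 130, 140 and 150 are each worse than the best value reached at epoch 120, so training stops at the third of them.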
predict(X, from_idx=False)
Predict the scores of triples using a trained embedding model. The function returns raw scores generated by the model.
Note
To obtain probability estimates, calibrate the model with calibrate(), then call predict_proba().
Parameters:
- X (ndarray, shape [n, 3]) – The triples to score.
- from_idx (bool) – If True, will skip conversion to internal IDs (default: False).
Returns: scores_predict – The predicted scores for input triples X.
Return type: ndarray, shape [n]
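To see why uniformly random scores make a usable reference point, the sketch below turns scores into ranks and computes a mean reciprocal rank (MRR). It is a standalone illustration that does not call AmpliGraph: `rank_of_true` and `mean_reciprocal_rank` are hypothetical helpers, the scores come from Python's `random` module, and the choice of 9 corruptions per triple is arbitrary.

```python
import random


def rank_of_true(true_score, corruption_scores):
    """1-based rank of the true triple; a higher score means a better rank."""
    return 1 + sum(1 for s in corruption_scores if s > true_score)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)


rng = random.Random(0)
n_corruptions = 9  # arbitrary: each true triple competes with 9 corruptions
ranks = []
for _ in range(1000):
    true_score = rng.random()  # random score for the true triple
    corruption_scores = [rng.random() for _ in range(n_corruptions)]
    ranks.append(rank_of_true(true_score, corruption_scores))

baseline_mrr = mean_reciprocal_rank(ranks)
```

With random scores the true triple is equally likely to land at any of the 10 positions, so the MRR settles near the chance level for that candidate count. A trained model evaluated on the same triples should score clearly above it.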
get_hyperparameter_dict()
Returns hyperparameters of the model.
Returns: hyperparam_dict – Dictionary of hyperparameters that were used for training.
Return type: dict