Evaluating Rating Predictions
While rating prediction is no longer a widely studied recommendation task, LensKit provides support for evaluating predictions, both for completeness and to support reproducing and comparing against historical research.
The lenskit.metrics.predict module contains the prediction accuracy
metrics, including RMSE() and
MAE(). They support both global
(micro-averaged) and per-user (macro-averaged) computation.
Changed in version 2025.1: The prediction accuracy metric interface has changed to use item lists.
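To make the averaging distinction concrete, here is a small NumPy sketch (the ratings are made up for illustration) showing how the two averages can differ:

import numpy as np

# Made-up prediction errors for two users: user A has three test
# ratings, user B has one.
errors = {
    "A": np.array([3.5, 4.0, 2.0]) - np.array([4.0, 4.0, 3.0]),
    "B": np.array([1.0]) - np.array([3.0]),
}

# Global (micro-averaged) RMSE: pool every prediction together, so
# users with more test ratings carry more weight.
pooled = np.concatenate(list(errors.values()))
micro = np.sqrt(np.mean(pooled ** 2))

# Per-user (macro-averaged) RMSE: compute RMSE per user, then average
# the per-user values, so each user counts equally.
macro = np.mean([np.sqrt(np.mean(e ** 2)) for e in errors.values()])

print(micro, macro)  # micro is pulled toward user A's low error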
Calling Metrics
There are two ways to directly call a prediction accuracy metric (both forms appear in the sketch below):

1. Pass two item lists, the first containing predicted ratings (as the list’s scores()) and the second containing ground-truth ratings as a rating field.
2. Pass a single item list with scores and a rating field.
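A minimal sketch of both forms; the item IDs and ratings are made up, and it assumes ItemList accepts array-like keyword fields as shown:

from lenskit.data import ItemList
from lenskit.metrics.predict import RMSE

# Made-up predictions and ground truth for three items.
preds = ItemList(item_ids=[10, 20, 30], scores=[3.5, 4.2, 2.8])
truth = ItemList(item_ids=[10, 20, 30], rating=[4.0, 4.0, 3.0])

# Form 1: separate prediction and ground-truth lists.
RMSE(preds, truth)

# Form 2: a single list carrying both scores and a rating field.
both = ItemList(item_ids=[10, 20, 30], scores=[3.5, 4.2, 2.8],
                rating=[4.0, 4.0, 3.0])
RMSE(both)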
For evaluation, you will usually want to use RunAnalysis,
which takes care of calling the prediction metric for you.
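A sketch of what that typically looks like, assuming RunAnalysis lives in lenskit.metrics and exposes add_metric() and measure() methods; predictions and test stand in for the prediction output and test data from your own experiment:

from lenskit.metrics import RunAnalysis
from lenskit.metrics.predict import RMSE

# `predictions` and `test` are placeholders for per-user pipeline
# output and the corresponding test data.
analysis = RunAnalysis()
analysis.add_metric(RMSE())
results = analysis.measure(predictions, test)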
Missing Data
There are two important missing data cases for evaluating predictions:
Missing predictions (the test data has a rating for which the system could not generate a prediction).
Missing ratings (the system generated a prediction with no corresponding test rating).
By default, LensKit throws an error in both of these cases, to help you catch
bad configurations. We recommend using a fallback predictor, such as setting up
FallbackScorer with
BiasScorer, when measuring rating predictions, to
ensure that all items are scored. The alternative design of ignoring missing
predictions means that different scorers may be evaluated on different items,
and a scorer can appear to perform exceptionally well simply by only scoring
the items it is most confident about.
Note
Both topn_pipeline() and
predict_pipeline() default to using a
BiasScorer() as a fallback when rating prediction is
enabled.
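For example, a sketch assuming predict_pipeline() takes the scoring component as its first argument (as topn_pipeline() does); BiasScorer stands in for whatever scorer you actually want to evaluate:

from lenskit.basic import BiasScorer
from lenskit.pipeline import predict_pipeline

# The pipeline wires in a BiasScorer fallback by default, so every test
# item receives a prediction even when the primary scorer cannot score it.
pipe = predict_pipeline(BiasScorer())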
If you want to skip missing predictions, pass missing_scores="ignore" to the
metric function:
RMSE(user_preds, user_ratings, missing_scores="ignore")
The corresponding missing_truth="ignore" will cause the metric to ignore
predictions with no corresponding rating (this case is unlikely to skew
eval results, but may indicate a misconfiguration in how you determine
the items to score).
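For example:
RMSE(user_preds, user_ratings, missing_truth="ignore")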