Evaluating Rating Predictions#
While rating prediction is no longer a widely studied recommendation task, LensKit supports evaluating predictions, both for completeness and for reproducing or comparing against historical research.
LensKit provides two prediction accuracy metrics:
RMSE() and MAE().  They support both global (micro-averaged)
and per-user (macro-averaged) computation.
Changed in version 2025.1: The prediction accuracy metric interface has changed to use item lists.
Calling Metrics#
There are two ways to directly call a prediction accuracy metric:
- Pass two item lists, the first containing predicted ratings (as the list’s scores()) and the second containing ground-truth ratings as a rating field.
- Pass a single item list with scores and a rating field.
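For example, both calling styles look like this in a minimal sketch; the lenskit.metrics and lenskit.data import locations and the ItemList keyword arguments (item_ids=, scores=, and the rating= field) are assumptions to adapt to your own data:

from lenskit.data import ItemList
from lenskit.metrics import MAE, RMSE

# Predicted ratings go in the first list's scores.
preds = ItemList(item_ids=[10, 20, 30], scores=[3.5, 4.2, 2.8])
# Ground-truth ratings go in a rating field on the second list.
truth = ItemList(item_ids=[10, 20, 30], rating=[4.0, 4.0, 3.0])

print(RMSE(preds, truth))
print(MAE(preds, truth))

# Alternatively, a single item list can carry both scores and a rating field.
both = ItemList(item_ids=[10, 20, 30], scores=[3.5, 4.2, 2.8], rating=[4.0, 4.0, 3.0])
print(RMSE(both))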
For evaluation, you will usually want to use RunAnalysis,
which takes care of calling the prediction metric for you.
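A rough sketch of that flow follows; all_preds and all_test are placeholders for the prediction outputs and test data of your own experiment, and the add_metric() and measure() calls are assumptions to check against the RunAnalysis documentation:

from lenskit.metrics import MAE, RMSE, RunAnalysis

# Register the prediction accuracy metrics to compute.
ran = RunAnalysis()
ran.add_metric(RMSE)
ran.add_metric(MAE)

# all_preds: per-user prediction item lists from a batch run (placeholder).
# all_test: the corresponding held-out test ratings (placeholder).
results = ran.measure(all_preds, all_test)

# Per-list (per-user) metric values; aggregate them as needed.
print(results.list_metrics())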
Missing Data#
There are two important missing data cases for evaluating predictions:
- Missing predictions (the test data has a rating for which the system could not generate a prediction). 
- Missing ratings (the system generated a prediction with no corresponding test rating). 
By default, LensKit throws an error in both of these cases, to help you catch
bad configurations.  We recommend using a fallback predictor, such as setting up
FallbackScorer with
BiasScorer, when measuring rating predictions to
ensure that all items are scored.  The alternative design, ignoring missing
predictions, means that different scorers may be evaluated on different items,
and a scorer can perform exceptionally well simply by only scoring the items for
which it has high confidence.
Note
Both topn_pipeline() and
predict_pipeline() default to using a
BiasScorer() as a fallback when rating prediction is
enabled.
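A sketch of that recommended setup follows; ItemKNNScorer and its max_nbrs option are stand-ins for whatever primary scorer you are measuring, and the exact predict_pipeline() arguments are assumptions to verify against the pipeline documentation:

from lenskit.knn import ItemKNNScorer
from lenskit.pipeline import predict_pipeline

# The standard prediction pipeline adds a BiasScorer fallback by default
# (see the note above), so items the k-NN scorer cannot score still
# receive predictions and RMSE/MAE are computed over every test item.
pipe = predict_pipeline(ItemKNNScorer(max_nbrs=20))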
If you want to skip missing predictions, pass missing_scores="ignore" to the
metric function:
RMSE(user_preds, user_ratings, missing_scores="ignore")
The corresponding missing_truth="ignore" will cause the metric to ignore
predictions with no corresponding test rating (this case is less likely to
produce misleading eval results, but it may indicate a misconfiguration in how
you determine the items to score).
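The two options can be combined to compute the metric only over matched prediction/rating pairs:
RMSE(user_preds, user_ratings, missing_scores="ignore", missing_truth="ignore")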
