Evaluator Metrics

Prediction Accuracy Metrics

LensKit provides several metrics for measuring prediction accuracy. They are implemented by the classes in the o.g.l.eval.metrics.predict package, and include:

  • Coverage (CoveragePredictMetric)
  • RMSE (RMSEPredictMetric)
  • MAE (MAEPredictMetric)
  • nDCG (NDCGPredictMetric) — applies nDCG as a rank-weighted measure of prediction accuracy
  • Half-life utility (HLUtilityPredictMetric) — like nDCG, but using Breese’s half-life discounting
  • Entropy (EntropyPredictMetric) — measures mutual information between ratings and predictions

To use one of these metrics, just mention its class by name in your trainTest block:

metric RMSEPredictMetric
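
For context, a minimal trainTest block showing where such metric lines sit might look like the following sketch. The data file, delimiter, partition count, and algorithm name are hypothetical placeholders; only the metric declarations come from the list above.

trainTest {
    dataset crossfold("ml-100k") {
        // hypothetical input file; adjust to your data
        source csvfile("ml-100k/u.data") {
            delimiter "\t"
        }
        partitions 5
    }

    algorithm("ItemItem") {
        // component bindings for the algorithm under test go here
    }

    // prediction accuracy metrics are added by class name
    metric CoveragePredictMetric
    metric RMSEPredictMetric
    metric MAEPredictMetric

    output "eval-results.csv"
}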

Top-N Metrics

The metrics discussed above all evaluate the rating predictor, either at ranking items or at predicting the user’s rating for individual items. LensKit also supports metrics over recommendation lists; these are called Top-N metrics, although the recommendation list may be generated by means other than top-N ranking.

Configuring a top-N metric is a bit more involved than configuring a prediction accuracy metric. It requires you to specify a few things:

  • The length of recommendation list to consider
  • The items to consider as candidates for recommendation
  • The items to exclude from recommendation
  • For some metrics, the items considered ‘good’ or ‘bad’

For this reason, you cannot just add a top-N metric by its class. To compute top-N nDCG of 10-item lists over all items the user has not rated in the training set, you instead do this:

metric topNnDCG {
    listSize 10
    candidates ItemSelectors.allItems()
    exclude ItemSelectors.trainingItems()
}

More complex configurations are also possible. The following computes the mean reciprocal rank over 10-item recommendation lists, where the candidates are the user’s test items plus 100 random decoys, and an item is considered relevant if the user rated it at least 3.5 stars. The Matchers.greaterThanOrEqualTo method comes from Hamcrest.

metric topNMRR {
    listSize 10
    candidates ItemSelectors.addNRandom(ItemSelectors.testItems(), 100)
    exclude ItemSelectors.trainingItems()
    goodItems ItemSelectors.testRatingMatches(Matchers.greaterThanOrEqualTo(3.5d))
}
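
If these helpers are not already in scope in your eval script, they can be imported explicitly. The import lines below assume the standard package locations: ItemSelectors lives in the top-N metrics package (see the end of this section) and Matchers is Hamcrest’s matcher factory.

import org.grouplens.lenskit.eval.metrics.topn.ItemSelectors
import org.hamcrest.Matchers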

Note: a training item can appear among the 100 random decoys. It will be removed by the exclude set, but the resulting recommendation run will then have fewer than 100 decoys. This is probably not desired, and is tracked by #759.

As of LensKit 2.2, the following Top-N metrics are available:

  • topNnDCG — normalized discounted cumulative gain, applied to top-N lists (its more typical application)
  • topNLength — actual length of the top-N list (to measure lists truncated by low coverage)
  • topNRecallPrecision — precision and recall at N; requires a good set
  • topNPopularity — measures popularity of recommended items
  • topNMAP — mean average precision; requires a good set
  • topNMRR — mean reciprocal rank; requires a good set

Each of these is defined by a class in o.g.l.eval.metrics.topn. The available item selectors are in ItemSelectors.
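
The metrics that require a good set are configured the same way as the MRR example above. As a further illustration, here is a sketch of topNRecallPrecision that treats items the user rated at least 4.0 stars as good; the list size and rating threshold are arbitrary choices for illustration.

metric topNRecallPrecision {
    listSize 10
    candidates ItemSelectors.allItems()
    exclude ItemSelectors.trainingItems()
    // "good" items: test items rated >= 4.0 stars (threshold is illustrative)
    goodItems ItemSelectors.testRatingMatches(Matchers.greaterThanOrEqualTo(4.0d))
}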