Collecting and Aggregating Metrics#
Computing metrics over individual lists isn’t enough to evaluate a recommender system; we usually want to compute metrics for a pipeline over an entire test set.
For simple metrics, it is possible to do this yourself: call an appropriate metric with each recommendation list and its corresponding truth, and compute aggregate statistics over those metrics. However, this has a few limitations:
You need to implement correct logic to handle cases such as missing recommendations, missing truth data, etc.
You need to implement the aggregation logic.
Aggregation logic beyond simple statistical aggregates, such as computing the total number of unique recommended items across all recommendation lists, requires your code to have knowledge of the specific requirements and structure of each metric.
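To make these limitations concrete, here is a minimal sketch of the do-it-yourself approach in plain Python. The `reciprocal_rank` and `evaluate_run` helpers and the toy data are hypothetical, and the policies for missing data (a user with no recommendations scores 0, a user with no truth is skipped) are assumptions chosen for illustration, not a description of LensKit's behavior.

```python
from statistics import mean


def reciprocal_rank(recs, truth):
    """Return 1/rank of the first relevant item in recs, or 0.0 if none is relevant."""
    for rank, item in enumerate(recs, start=1):
        if item in truth:
            return 1.0 / rank
    return 0.0


def evaluate_run(all_recs, all_truth):
    """Compute per-user reciprocal rank plus two aggregates over a whole test set."""
    per_list = {}
    for user, truth in all_truth.items():
        if not truth:
            # Assumed policy: with no truth data the metric is undefined, so skip.
            continue
        # Assumed policy: a test user with no recommendations gets an empty list (score 0).
        recs = all_recs.get(user, [])
        per_list[user] = reciprocal_rank(recs, truth)

    # Distinct items across every list: an aggregate that cannot be derived
    # from per-list scores, so it needs its own pass over the raw lists.
    unique_items = set()
    for recs in all_recs.values():
        unique_items.update(recs)

    summary = {
        "mean_rr": mean(per_list.values()) if per_list else 0.0,
        "unique_items": len(unique_items),
    }
    return per_list, summary


# Hypothetical toy data: user -> ranked item IDs, and user -> set of relevant items.
recs = {"u1": ["a", "b", "c"], "u2": ["d", "a"]}
truth = {"u1": {"b"}, "u2": {"d"}, "u3": {"a"}, "u4": set()}
per_list, summary = evaluate_run(recs, truth)
```

Even in this small case, the missing-data rules and the second aggregation pass have to be written and kept consistent by hand for every metric.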
LensKit provides the MeasurementCollector to help with all of this, and
to provide a unified way to collect metric measurements across all lists in a
recommendation run.
Changed in version 2026.1: MeasurementCollector was introduced in LensKit 2025.5.0, and
replaced RunAnalysis as the primary metric analysis tool in LensKit
2026.1.0.
Basic Principles#
A single measurement collector collects metrics for recommendation lists in a single run: evaluating one pipeline on one test set. The basic use pattern is as follows:
Create a MeasurementCollector.
Add metrics to the collector with add_metric().
Measure individual lists and their corresponding truth with measure_list(), or an entire collection of recommendations with measure_collection().
Obtain individual list metrics with list_metrics() (returning a data frame with one row per list), or aggregate metrics and summary statistics with summary_metrics().
There are three ways to measure multiple runs (e.g., the results of different recommendation pipelines):
Create a fresh MeasurementCollector for each pipeline.
Create an empty copy of a base collector with empty_copy().
Reset the collector with reset().
The empty copy method is usually the easiest: create the measurement collector, and then create a copy for each run to measure. Getting Started provides an example of using a measurement collector to measure two runs, and reporting on the results.
Note
Design Goals for Aggregation
There are many different ways of organizing experiments, and supporting complex aggregations inside the measurement collector would effectively reimplement aggregation logic that data frame libraries already provide. We have therefore focused LensKit’s facilities on collecting and aggregating metrics within a single run, producing summary statistics at the highest level that may be specific to recommendation. Further analysis can be done by collecting metric results (either summary or per-list) into larger data frames and analyzing them with your preferred analytics library.
Example#
If you have a dictionary of recommendation results in run_recs (keyed by run name) and the corresponding test data in test, you can
measure them with:
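import pandas as pd
# Import path assumed; adjust it to match your LensKit version.
from lenskit.metrics import NDCG, RBP, MeasurementCollector, RecipRank

# Configure the metrics once on a base collector.
base_mc = MeasurementCollector()
base_mc.add_metric(NDCG(n=10))
base_mc.add_metric(RBP(n=10))
base_mc.add_metric(RecipRank(n=10))

run_list_metrics = {}
run_summaries = {}
for name, recs in run_recs.items():
    # Measure each run with its own empty copy of the base collector.
    mc = base_mc.empty_copy()
    mc.measure_collection(recs, test)
    run_list_metrics[name] = mc.list_metrics()
    run_summaries[name] = mc.summary_metrics()

# Stack the per-list frames, labeling the new outer index level with the run name.
list_metrics = pd.concat(run_list_metrics, names=['recommender'])
metrics = pd.DataFrame.from_dict(run_summaries, orient="index")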
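Once per-list results from several runs are combined into one data frame, further analysis is ordinary pandas work. The sketch below uses small hypothetical frames in place of real list_metrics() output; the `user_id` index name, the run names, and the metric values are assumptions made for illustration and may not match what LensKit actually returns.

```python
import pandas as pd

# Hypothetical per-list results for two runs: one frame per run, one row per list.
users = pd.Index([1, 2], name="user_id")
run_list_metrics = {
    "ALS": pd.DataFrame({"NDCG": [0.5, 0.25], "RecipRank": [1.0, 0.5]}, index=users),
    "Pop": pd.DataFrame({"NDCG": [0.25, 0.25], "RecipRank": [0.5, 0.25]}, index=users),
}

# The dictionary keys become a new outer index level named "recommender".
list_metrics = pd.concat(run_list_metrics, names=["recommender"])

# Average each metric within each run.
means = list_metrics.groupby("recommender").mean()
print(means)
```

The same combined frame can feed significance tests or plots, since every row keeps both its run label and its list identifier.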