Collecting and Aggregating Metrics

Computing metrics over individual lists isn’t enough for evaluating a recommender system; we usually want to compute metrics for pipelines over entire test sets.

For simple metrics, it is possible to do this yourself: call an appropriate metric with each recommendation list and its corresponding truth, and compute aggregate statistics over those metrics. However, this has a few limitations:

You need to implement correct logic to handle cases such as missing recommendations, missing truth data, etc.
You need to implement the aggregation logic.
Aggregation logic beyond simple statistical aggregates, such as computing the total number of unique recommended items across all recommendation lists, requires your code to have knowledge of the specific requirements and structure of each metric.
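
To make the do-it-yourself approach concrete, here is a minimal sketch in plain Python. The reciprocal_rank function, the recs and truth dictionaries, and the decision to score a missing recommendation list as 0.0 are all illustrative assumptions, not LensKit API. The sketch shows the kind of edge-case and aggregation logic you would need to write and maintain yourself.

```python
from statistics import mean


def reciprocal_rank(recs, truth):
    """Reciprocal rank of the first relevant item, or 0.0 if none is found."""
    for rank, item in enumerate(recs, start=1):
        if item in truth:
            return 1.0 / rank
    return 0.0


def manual_mean_rr(recs_by_user, truth_by_user):
    """Per-user reciprocal rank and its mean over users with truth data.

    Assumptions for this sketch: a user with truth data but no
    recommendation list scores 0.0, and a user with no truth data is skipped.
    """
    scores = {}
    for user, truth in truth_by_user.items():
        if not truth:
            continue  # no truth data: nothing to measure
        scores[user] = reciprocal_rank(recs_by_user.get(user, []), truth)
    return scores, mean(scores.values())


# Hypothetical toy data: user "c" received no recommendations.
recs = {"a": [10, 20, 30], "b": [40, 50, 60]}
truth = {"a": {20}, "b": {70}, "c": {10}}

per_user, overall = manual_mean_rr(recs, truth)
print(per_user)  # per-list scores
print(overall)   # mean over users with truth data
```

Even in this small case, the treatment of the missing list for user "c" is a policy decision buried in the aggregation code, and it changes the reported mean.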

LensKit provides the MeasurementCollector to help with all of this, and to provide a unified way to collect metric measurements across all lists in a recommendation run.

Changed in version 2026.1: MeasurementCollector was introduced in LensKit 2025.5.0, and replaced RunAnalysis as the primary metric analysis tool in LensKit 2026.1.0.

Basic Principles

A single measurement collector collects metrics for recommendation lists in a single run: evaluating one pipeline on one test set. The simple way to use a collector is as follows:

Create a MeasurementCollector.
Add metrics to the collector with add_metric().
Measure a run with measure_run() to get both summary metrics and per-list metrics.

Example

If you have a dictionary of recommendation results in run_recs (keyed by recommender name) and the corresponding test data in test, you can measure them with:

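import pandas as pd
# Assumed import path; adjust to match your LensKit version.
from lenskit.metrics import NDCG, RBP, MeasurementCollector, RecipRank

# Configure the metrics once on a base collector.
base_mc = MeasurementCollector()
base_mc.add_metric(NDCG(n=10))
base_mc.add_metric(RBP(n=10))
base_mc.add_metric(RecipRank(n=10))

run_list_metrics = {}
run_summaries = {}
for name, recs in run_recs.items():
    # Measure each run on a fresh, empty copy of the base collector.
    mc = base_mc.empty_copy()
    result = mc.measure_run(recs, test)
    run_summaries[name] = result.summary_metrics
    run_list_metrics[name] = result.list_metrics

# Combine per-list metrics, labelling each run in a "recommender" index level.
list_metrics = pd.concat(run_list_metrics, names=['recommender'])
# One row of summary metrics per run.
metrics = pd.DataFrame.from_dict(run_summaries, orient="index")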

Advanced Usage

The measurement collector is stateful, and you can work with that state directly as follows:

Create a MeasurementCollector.
Add metrics to the collector with add_metric().
Measure individual lists and their corresponding truth with add_list_measurement(), or an entire collection of recommendations with add_collection_measurements().
Obtain individual list metrics with list_metrics() (returning a data frame with one row per list), or aggregate metrics and summary statistics with summary_metrics().

To measure multiple runs (e.g., the results of different recommendation pipelines), there are three ways:

Create a fresh MeasurementCollector for each pipeline.
Create an empty copy of a base collector with empty_copy().
Reset the collector with reset().

The empty copy method is usually the easiest: create the measurement collector, and then create a copy for each run to measure. Getting Started provides an example of using a measurement collector to measure two runs, and reporting on the results.
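
The difference between the empty-copy and reset approaches can be sketched with a toy stand-in. ToyCollector below is a hypothetical class written for this illustration only: it borrows the method names used above, but its signatures and behavior are assumptions and it is not LensKit's implementation. The hit function and the runs data are likewise made up.

```python
from statistics import mean


class ToyCollector:
    """Illustrative stand-in for a stateful collector (not LensKit code)."""

    def __init__(self):
        self.metrics = {}  # metric label -> function(recs, truth) -> float
        self.rows = []     # one dict of metric values per measured list

    def add_metric(self, label, fn):
        self.metrics[label] = fn

    def add_list_measurement(self, recs, truth):
        self.rows.append({label: fn(recs, truth) for label, fn in self.metrics.items()})

    def summary_metrics(self):
        return {label: mean(r[label] for r in self.rows) for label in self.metrics}

    def empty_copy(self):
        # Same metric configuration, no accumulated measurements.
        clone = ToyCollector()
        clone.metrics = dict(self.metrics)
        return clone

    def reset(self):
        # Drop accumulated measurements, keep the metric configuration.
        self.rows.clear()


def hit(recs, truth):
    """1.0 if any recommended item is relevant, else 0.0."""
    return 1.0 if any(item in truth for item in recs) else 0.0


base = ToyCollector()
base.add_metric("hit", hit)

# Hypothetical runs: each is a list of (recommendations, truth) pairs.
runs = {
    "pipeline-1": [([1, 2], {2}), ([3, 4], {9})],
    "pipeline-2": [([2, 1], {2}), ([9, 4], {9})],
}

# Approach 2: an empty copy of the base collector for each run.
summaries = {}
for name, lists in runs.items():
    mc = base.empty_copy()
    for recs, truth in lists:
        mc.add_list_measurement(recs, truth)
    summaries[name] = mc.summary_metrics()

# Approach 3: reuse one collector, resetting it before each run.
reused = {}
for name, lists in runs.items():
    base.reset()
    for recs, truth in lists:
        base.add_list_measurement(recs, truth)
    reused[name] = base.summary_metrics()
```

Both loops yield the same summaries. The empty-copy version leaves the base collector untouched, so a forgotten reset() cannot leak measurements from one run into the next.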

Note

Design Goals for Aggregation

Experiments can be organized in many different ways, and supporting complex aggregations inside the measurement collector would effectively reimplement aggregation logic that data frame libraries already provide. We have therefore focused LensKit’s facilities on collecting and aggregating metrics within a single run, producing the highest-level summary statistics that may be specific to recommendation. Further analysis can be done by collecting metric results (either summary or per-list) into larger data frames and analyzing them with your preferred analytics library.
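
As a sketch of that kind of downstream analysis, the following uses pandas on per-list metrics from two runs. The metric values, user IDs, and run names are made up for illustration; in practice the frames would be the per-list results collected from each run.

```python
import pandas as pd

# Hypothetical per-list metric values for two runs, indexed by user.
run_list_metrics = {
    "pipeline-1": pd.DataFrame({"user": ["a", "b"], "NDCG": [0.5, 0.25]}).set_index("user"),
    "pipeline-2": pd.DataFrame({"user": ["a", "b"], "NDCG": [0.75, 0.25]}).set_index("user"),
}

# Stack the runs into one frame with an outer "recommender" index level.
list_metrics = pd.concat(run_list_metrics, names=["recommender"])

# Any data-frame aggregation is now available, e.g. mean and std per run.
by_run = list_metrics.groupby("recommender")["NDCG"].agg(["mean", "std"])

# Per-user difference between the two runs, useful for paired comparisons.
wide = list_metrics["NDCG"].unstack("recommender")
diff = wide["pipeline-2"] - wide["pipeline-1"]

print(by_run)
print(diff)
```

Keeping the per-list values rather than only the summaries is what makes paired comparisons like diff possible.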
