.. _eval-collection:

Collecting and Aggregating Metrics
==================================

.. py:currentmodule:: lenskit.metrics

Computing metrics over individual lists isn't enough for evaluating a
recommender system — we usually want to compute metrics for pipelines over
entire test sets.

For simple metrics, it is possible to do this yourself: call an appropriate
metric with each recommendation list and its corresponding truth, and compute
aggregate statistics over those metrics.  However, this has a few limitations:

- You need to implement correct logic to handle cases such as missing
  recommendations, missing truth data, etc.
- You need to implement the aggregation logic.
- Aggregation logic beyond simple statistical aggregates, such as computing the
  total number of unique recommended items across all recommendation lists,
  requires your code to have knowledge of the specific requirements and
  structure of each metric.

LensKit provides the :class:`MeasurementCollector` to help with all of this, and
to provide a unified way to collect metric measurements across all lists in a
recommendation run.

.. versionchanged:: 2026.1

    :class:`MeasurementCollector` was introduced in LensKit :ref:`2025.5.0`, and
    replaced :class:`RunAnalysis` as the primary metric analysis tool in LensKit
    :ref:`2026.1.0`.

Basic Principles
~~~~~~~~~~~~~~~~

A single measurement collector collects metrics for recommendation lists in a
**single run**: evaluating one pipeline on one test set.  The basic use pattern
is as follows:

1.  Create a :class:`MeasurementCollector`.
2.  Add metrics to collector with :meth:`~MeasurementCollector.add_metric`.
3.  Measure individual lists and their corresponding truth with
    :meth:`~MeasurementCollector.measure_list` or an entire collection of
    recommendations with :meth:`~MeasurementCollector.measure_collection`.
4.  Obtain individual list metrics with
    :meth:`~MeasurementCollector.list_metrics` (returning data frame with one
    row per list), or aggregate metrics and summary statistics with
    :meth:`~MeasurementCollector.summary_metrics`.

To measure multiple runs (e.g., the results of different recommendation
pipelines), there are three ways:

- Create a fresh :class:`MeasurementCollector` for each pipeline.
- Create an empty copy of a base collector with :meth:`~MeasurementCollector.empty_copy`.
- Reset the collector with :meth:`~MeasurementCollector.reset`.

The empty copy method is usually the easiest: create the measurement collector,
and then create a copy for each run to measure. :ref:`getting-started` provides
an example of using a measurement collector to measure two runs, and reporting
on the results.

.. note:: Design Goals for Aggregation

    Since there are many different ways of organizing experiments, and
    supporting complex aggregations inside the measurement collector would
    effectively be a reimplementation of the kinds of aggregation logic already
    supported by data frame libraries, we have focused LensKit's facilities on
    collecting and aggregating metrics within a single run, to produce summary
    statistics at the highest level that may be specific to recommendation.
    Further analysis can be done by collecting metric results (either summary or
    per-list) into larger data frames and analyzing with your preferred
    analytics library.

Example
~~~~~~~

If you have a dictionary of recommendation results in ``run_recs``, you can
measure them with:

.. code:: python

    base_mc = MeasurementCollector()
    base_mc.add_metric(NDCG(n=10))
    base_mc.add_metric(RBP(n=10))
    base_mc.add_metric(RecipRank(n=10))

    run_list_metrics = {}
    run_summaries = {}
    for name, recs in run_recs.items():
        mc = base_mc.empty_copy()
        mc.measure_collection(recs, test)
        run_list_metrics[name] = mc.list_metrics()
        run_summaries[name] = mc.summary_metrics()

    list_metrics = pd.concat(run_list_metrics, name=['recommender'])
    metrics = pd.DataFrame.from_dict(run_summaries, orient="index")