lenskit.metrics.ranking#

LensKit ranking (and list) metrics.

Classes#

RankingMetricBase

Base class for most ranking metrics, implementing an n parameter for truncation.

DCG

Compute the _unnormalized_ discounted cumulative gain [JarvelinKekalainen02].

NDCG

Compute the normalized discounted cumulative gain [JarvelinKekalainen02].

Entropy

Evaluate diversity using Shannon entropy over item categories.

RankBiasedEntropy

Evaluate diversity using rank-biased Shannon entropy over item categories.

ExposureGini

Measure exposure distribution of recommendations with the Gini coefficient.

ListGini

Measure item diversity of recommendations with the Gini coefficient.

Hit

Compute whether or not a list is a hit; any list with at least one relevant item is scored as 1.

ILS

Evaluate recommendation diversity using intra-list similarity (ILS).

AveragePrecision

Compute Average Precision (AP) for a single user's recommendations. This is the average of the precision at each relevant item in the ranked list.

MeanPopRank

Compute the _obscurity_ (mean popularity rank) of the recommendations.

Precision

Compute recommendation precision.

Recall

Compute recommendation recall.

RBP

Evaluate recommendations with rank-biased precision [MZ08].

RecipRank

Compute the reciprocal rank [KV97] of the first relevant item in a list of recommendations.

GeometricRankWeight

Geometric cascade weighting for result ranks.

LogRankWeight

Logarithmic weighting for result ranks, as used in NDCG.

RankWeight

Base class for rank weighting models.

Functions#

rank_biased_precision(good, weights[, normalization])

Compute rank-biased precision given explicit weights.

Package Contents#

class lenskit.metrics.ranking.RankingMetricBase(n=None, *, k=None)#

Bases: lenskit.metrics._base.Metric

Base class for most ranking metrics, implementing an n parameter for truncation.

Parameters:
  • n (int | None) – Specify the length cutoff for rankings. Rankings longer than this will be truncated prior to measurement.

  • k (int | None) – Deprecated alias for n.

Stability:
Caller (see Stability Levels).
n: int | None = None#

The maximum length of rankings to consider.

property k#
property label#

Default name — class name, optionally @N.

truncate(items)#

Truncate an item list if it is longer than n.

Parameters:

items (lenskit.data.ItemList)

class lenskit.metrics.ranking.DCG(n=None, *, k=None, weight=LogRankWeight(), discount=None, gain=None)#

Bases: lenskit.metrics.ranking._base.ListMetric, lenskit.metrics.ranking._base.RankingMetricBase

Compute the _unnormalized_ discounted cumulative gain [JarvelinKekalainen02].

Discounted cumulative gain is computed as:

\[\begin{align*} \mathrm{DCG}(L,u) & = \sum_{i=1}^{|L|} \frac{r_{ui}}{d(i)} \end{align*}\]

Unrated items are assumed to have a utility of 0; if no rating values are provided in the truth frame, item ratings are assumed to be 1.

This metric does not normalize by ideal DCG. For that, use NDCG. See Jeunen et al. [JPU24] for an argument for using the unnormalized version.

Parameters:
  • n (int | None) – The maximum recommendation list length to consider (longer lists are truncated).

  • discount (Discount | None) – The discount function to use. The default, base-2 logarithm, is the original function used by Järvelin and Kekäläinen [JarvelinKekalainen02]. It is deprecated in favor of the weight option.

  • gain (str | None) – The field on the test data to use for gain values. If None (the default), all items present in the test data have a gain of 1. If set to a string, it is the name of a field (e.g. 'rating'). In all cases, items not present in the truth data have a gain of 0.

  • k (int | None)

  • weight (lenskit.metrics.ranking._weighting.RankWeight)

Stability:
Caller (see Stability Levels).
weight: lenskit.metrics.ranking._weighting.RankWeight#
discount: Discount | None#
gain: str | None#
property label#

The metric’s default label in output. The base implementation returns the class name by default.

measure_list(recs, test)#

Compute measurements for a single list.

Returns:

  • A float for simple metrics

  • Intermediate data for decomposed metrics

  • A dict mapping metric names to values for multi-metric classes

Parameters:
Return type:

float
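As an illustration of the formula above, here is a minimal sketch of unnormalized DCG with the default base-2 logarithmic rank weighting (a standalone helper, not the LensKit implementation; the helper name is hypothetical):

```python
import numpy as np

def dcg_sketch(gains, base=2.0):
    # Discount each gain by log_base(rank), clipping ranks below the base
    # so the first positions are not divided by log(1) = 0.
    gains = np.asarray(gains, dtype=np.float64)
    ranks = np.arange(1, len(gains) + 1)
    discounts = np.log(np.maximum(ranks, base)) / np.log(base)
    return float(np.sum(gains / discounts))

# Binary gains: items at positions 1 and 3 are relevant.
print(dcg_sketch([1, 0, 1, 0]))  # 1/1 + 1/lg(3) ≈ 1.631
```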

class lenskit.metrics.ranking.NDCG(n=None, *, k=None, weight=LogRankWeight(), discount=None, gain=None)#

Bases: lenskit.metrics.ranking._base.ListMetric, lenskit.metrics.ranking._base.RankingMetricBase

Compute the normalized discounted cumulative gain [JarvelinKekalainen02].

Discounted cumulative gain is computed as:

\[\begin{align*} \mathrm{DCG}(L,u) & = \sum_{i=1}^{|L|} \frac{r_{ui}}{d(i)} \end{align*}\]

Unrated items are assumed to have a utility of 0; if no rating values are provided in the truth frame, item ratings are assumed to be 1.

This is then normalized as follows:

\[\begin{align*} \mathrm{nDCG}(L, u) & = \frac{\mathrm{DCG}(L,u)}{\mathrm{DCG}(L_{\mathrm{ideal}}, u)} \end{align*}\]

Note

Negative gains are clipped to zero before computing NDCG. This keeps the metric bounded between 0 and 1 and prevents cases where negative gains can lead to misleading positive scores due to cancellation effects.

Parameters:
  • n (int | None) – The maximum recommendation list length to consider (longer lists are truncated).

  • weight (lenskit.metrics.ranking._weighting.RankWeight) – The rank weighting to use.

  • discount (Discount | None) – The discount function to use. The default, base-2 logarithm, is the original function used by Järvelin and Kekäläinen [JarvelinKekalainen02]. It is deprecated in favor of the weight option.

  • gain (str | None) – The field on the test data to use for gain values. If None (the default), all items present in the test data have a gain of 1. If set to a string, it is the name of a field (e.g. 'rating'). In all cases, items not present in the truth data have a gain of 0.

  • k (int | None)

Stability:
Caller (see Stability Levels).
weight: lenskit.metrics.ranking._weighting.RankWeight#
discount: Discount | None#
gain: str | None#
property label#

The metric’s default label in output. The base implementation returns the class name by default.

measure_list(recs, test)#

Compute measurements for a single list.

Returns:

  • A float for simple metrics

  • Intermediate data for decomposed metrics

  • A dict mapping metric names to values for multi-metric classes

Parameters:
Return type:

float
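The normalization step can be sketched by dividing the list's DCG by the DCG of the ideally reordered gains. This is illustrative only: the real metric computes the ideal ordering from the full truth data, not just the gains present in the list, and the helper name is hypothetical.

```python
import numpy as np

def ndcg_sketch(gains, base=2.0):
    # Clip negative gains to zero, per the note above.
    gains = np.clip(np.asarray(gains, dtype=np.float64), 0.0, None)
    ranks = np.arange(1, len(gains) + 1)
    discounts = np.log(np.maximum(ranks, base)) / np.log(base)
    dcg = np.sum(gains / discounts)
    ideal = np.sum(np.sort(gains)[::-1] / discounts)  # best possible ordering
    return float(dcg / ideal) if ideal > 0 else 0.0

print(ndcg_sketch([0, 3, 1, 2]))  # < 1.0: the list is not ideally ordered
print(ndcg_sketch([3, 2, 1, 0]))  # 1.0: already in ideal order
```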

class lenskit.metrics.ranking.Entropy(dataset, attribute, n=None)#

Bases: lenskit.metrics.ranking._base.ListMetric, lenskit.metrics.ranking._base.RankingMetricBase

Evaluate diversity using Shannon entropy over item categories.

This metric measures the diversity of categories in a recommendation list. Higher entropy indicates a more diverse category distribution.

Parameters:
  • dataset (lenskit.data.Dataset) – The LensKit dataset containing item entities and their attributes.

  • attribute (str) – Name of the attribute to use for categories (e.g., ‘genre’, ‘tag’)

  • n (int | None) – Recommendation list length to evaluate

Stability:
Caller (see Stability Levels).
attribute: str#
property label#

The metric’s default label in output. The base implementation returns the class name by default.

measure_list(recs, test)#

Compute measurements for a single list.

Returns:

  • A float for simple metrics

  • Intermediate data for decomposed metrics

  • A dict mapping metric names to values for multi-metric classes

Parameters:
Return type:

float
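The entropy computation can be sketched as follows (hypothetical helper; the real metric reads categories from the dataset's item attribute):

```python
import numpy as np
from collections import Counter

def entropy_sketch(categories):
    # Shannon entropy (in bits) of the category distribution in one list.
    counts = np.array(list(Counter(categories).values()), dtype=np.float64)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

print(entropy_sketch(["action", "action", "comedy", "drama"]))  # 1.5
```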

class lenskit.metrics.ranking.RankBiasedEntropy(dataset, attribute, n=None, *, weight=None)#

Bases: lenskit.metrics.ranking._base.ListMetric, lenskit.metrics.ranking._base.RankingMetricBase

Evaluate diversity using rank-biased Shannon entropy over item categories.

This metric measures the diversity of categories in a recommendation list with rank-based weighting, giving more importance to items at the top of the recommendation list.

Parameters:
  • dataset (lenskit.data.Dataset) – The LensKit dataset containing item entities and their attributes.

  • attribute (str) – Name of the attribute to use for categories (e.g., ‘genre’, ‘tag’)

  • n (int | None) – Recommendation list length to evaluate

  • weight (lenskit.metrics.ranking._weighting.RankWeight | None) – Rank weighting model. Defaults to GeometricRankWeight(0.85)

Stability:
Caller (see Stability Levels).
attribute: str#
weight: lenskit.metrics.ranking._weighting.RankWeight#
property label#

The metric’s default label in output. The base implementation returns the class name by default.

measure_list(recs, test)#

Compute measurements for a single list.

Returns:

  • A float for simple metrics

  • Intermediate data for decomposed metrics

  • A dict mapping metric names to values for multi-metric classes

Parameters:
Return type:

float

class lenskit.metrics.ranking.ExposureGini(n=None, *, k=None, items, weight=GeometricRankWeight())#

Bases: GiniBase

Measure exposure distribution of recommendations with the Gini coefficient.

This uses a weighting model to compute the exposure of each item in each list, and computes the Gini coefficient of the total exposure.

Parameters:
Stability:
Caller (see Stability Levels).
weight: lenskit.metrics.ranking._weighting.RankWeight#
measure_list(output, test)#

Compute measurements for a single list.

Returns:

  • A float for simple metrics

  • Intermediate data for decomposed metrics

  • A dict mapping metric names to values for multi-metric classes

Parameters:

output (lenskit.data.ItemList)

Return type:

tuple[numpy.typing.NDArray[numpy.int32], numpy.typing.NDArray[numpy.float64]]

class lenskit.metrics.ranking.ListGini(n=None, *, k=None, items)#

Bases: GiniBase

Measure item diversity of recommendations with the Gini coefficient.

This computes the Gini coefficient of the number of lists that each item appears in.

Parameters:
Stability:
Caller (see Stability Levels).
measure_list(output, test)#

Compute measurements for a single list.

Returns:

  • A float for simple metrics

  • Intermediate data for decomposed metrics

  • A dict mapping metric names to values for multi-metric classes

Parameters:

output (lenskit.data.ItemList)

Return type:

tuple[numpy.typing.NDArray[numpy.int32], float]
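Both Gini metrics reduce to the Gini coefficient of a per-item vector: appearance counts for ListGini, or rank-weighted exposure for ExposureGini. A sketch of the coefficient itself (assumed formula; the helper name is illustrative):

```python
import numpy as np

def gini_sketch(values):
    # Gini coefficient: 0 when every item gets equal counts/exposure,
    # approaching 1 when a few items dominate.
    x = np.sort(np.asarray(values, dtype=np.float64))
    n = len(x)
    ranks = np.arange(1, n + 1)
    return float(np.sum((2 * ranks - n - 1) * x) / (n * x.sum()))

print(gini_sketch([1, 1, 1, 1]))  # 0.0 (perfectly even)
print(gini_sketch([0, 0, 0, 4]))  # 0.75 (all appearances on one item)
```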

class lenskit.metrics.ranking.Hit(n=None, *, k=None)#

Bases: lenskit.metrics.ranking._base.ListMetric, lenskit.metrics.ranking._base.RankingMetricBase

Compute whether or not a list is a hit; any list with at least one relevant item in the first \(k\) positions (\(L_{\le k} \cap I_u^{\mathrm{test}} \ne \emptyset\)) is scored as 1, and lists with no relevant items as 0. When averaged over the recommendation lists, this computes the hit rate [DK04].

Stability:
Caller (see Stability Levels).
Parameters:
  • n (int | None)

  • k (int | None)

property label#

The metric’s default label in output. The base implementation returns the class name by default.

measure_list(recs, test)#

Compute measurements for a single list.

Returns:

  • A float for simple metrics

  • Intermediate data for decomposed metrics

  • A dict mapping metric names to values for multi-metric classes

Parameters:
Return type:

float
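A minimal sketch of the hit computation (hypothetical helper operating on plain ID lists rather than ItemList):

```python
def hit_sketch(recs, relevant, n=None):
    # 1.0 if any of the first n recommended items is in the test set.
    topn = recs[:n] if n is not None else recs
    rel = set(relevant)
    return 1.0 if any(item in rel for item in topn) else 0.0

print(hit_sketch([5, 2, 9], [9, 7]))       # 1.0: item 9 is relevant
print(hit_sketch([5, 2, 9], [9, 7], n=2))  # 0.0: no hit in the top 2
```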

class lenskit.metrics.ranking.ILS(dataset, attribute, n=None)#

Bases: lenskit.metrics.ranking._base.ListMetric, lenskit.metrics.ranking._base.RankingMetricBase

Evaluate recommendation diversity using intra-list similarity (ILS).

This metric measures the average pairwise cosine similarity between item vectors in a recommendation list. Lower values indicate more diverse recommendations, while higher values indicate less diverse recommendations.

Parameters:
  • dataset (lenskit.data.Dataset) – The LensKit dataset containing item entities and their attributes.

  • attribute (str) – Name of the attribute or vector source (e.g., ‘genre’, ‘tag’).

  • n (int | None) – Recommendation list length to evaluate.

Stability:
Caller (see Stability Levels).
attribute: str#
property label#

The metric’s default label in output. The base implementation returns the class name by default.

measure_list(recs, test)#

Compute measurements for a single list.

Returns:

  • A float for simple metrics

  • Intermediate data for decomposed metrics

  • A dict mapping metric names to values for multi-metric classes

Parameters:
Return type:

float

class lenskit.metrics.ranking.AveragePrecision(n=None, *, k=None)#

Bases: lenskit.metrics.ranking._base.ListMetric, lenskit.metrics.ranking._base.RankingMetricBase

Compute Average Precision (AP) for a single user’s recommendations. This is the average of the precision at each relevant item in the ranked list.

Parameters:
  • n (int | None)

  • k (int | None)

property label#

The metric’s default label in output. The base implementation returns the class name by default.

measure_list(recs, test)#

Compute measurements for a single list.

Returns:

  • A float for simple metrics

  • Intermediate data for decomposed metrics

  • A dict mapping metric names to values for multi-metric classes

Parameters:
Return type:

float
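The computation can be sketched as follows (illustrative helper; note that conventions differ on the denominator, and some definitions divide by the total number of relevant test items instead of the hits found in the list):

```python
def average_precision_sketch(recs, relevant):
    # Average the precision@rank at each rank where a relevant item appears.
    rel = set(relevant)
    hits = 0
    total = 0.0
    for rank, item in enumerate(recs, start=1):
        if item in rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Hits at ranks 1 and 3: precisions 1/1 and 2/3, averaging to 5/6.
print(average_precision_sketch(["a", "x", "b"], {"a", "b"}))  # ≈ 0.833
```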

class lenskit.metrics.ranking.MeanPopRank(data, *, n=None, k=None, count='users')#

Bases: lenskit.metrics.ranking._base.ListMetric, lenskit.metrics.ranking._base.RankingMetricBase

Compute the _obscurity_ (mean popularity rank) of the recommendations.

Unlike other metrics, this metric requires access to the training dataset in order to compute item popularity metrics. Supply this as a constructor parameter.

This metric represents the popularity rank as a quantile, based on either the number of distinct users who have interacted with the item or the total number of interactions (depending on the count option; distinct users is the default).

Let \(q_i\) be the _popularity rank_, represented as a quantile, of item \(i\). \(q_i = 1\) for the most-popular item; \(q_i = 0\) for an item with no users or interactions (the quantiles are min-max scaled). This metric computes the mean of the quantile popularity ranks for the recommended items:

\[\mathcal{M}(L) = \frac{1}{|L|} \sum_{i \in L} q_i\]

This metric is based on the “obscurity” metric of Ekstrand and Mahant [EM17] and the popularity-based item novelty metric of Vargas and Castells [VC11].

Stability:
Caller (see Stability Levels).
Parameters:
item_ranks: pandas.Series[float]#
measure_list(recs, test)#

Compute measurements for a single list.

Returns:

  • A float for simple metrics

  • Intermediate data for decomposed metrics

  • A dict mapping metric names to values for multi-metric classes

Parameters:
Return type:

float
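The quantile computation can be sketched with plain arrays (hypothetical helper; the real metric builds its ranks from the training dataset and, unlike this sketch, handles ties in the counts):

```python
import numpy as np

def mean_pop_rank_sketch(rec_idx, counts):
    # counts[i]: number of distinct users who interacted with item i.
    counts = np.asarray(counts, dtype=np.float64)
    order = counts.argsort()                     # least to most popular
    ranks = np.empty(len(counts), dtype=np.float64)
    ranks[order] = np.arange(len(counts))
    q = ranks / (len(counts) - 1)                # min-max scaled to [0, 1]
    return float(q[np.asarray(rec_idx)].mean())

# Items 0 (most popular, q=1) and 2 (least popular, q=0):
print(mean_pop_rank_sketch([0, 2], [10, 5, 1]))  # 0.5
```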

class lenskit.metrics.ranking.Precision(n=None, *, k=None)#

Bases: lenskit.metrics.ranking._base.ListMetric, lenskit.metrics.ranking._base.RankingMetricBase

Compute recommendation precision. This is computed as:

\[\frac{|L \cap I_u^{\mathrm{test}}|}{|L|}\]

In the uncommon case that k is specified and len(recs) < k, this metric uses len(recs) as the denominator.

Stability:
Caller (see Stability Levels).
Parameters:
  • n (int | None)

  • k (int | None)

property label#

The metric’s default label in output. The base implementation returns the class name by default.

measure_list(recs, test)#

Compute measurements for a single list.

Returns:

  • A float for simple metrics

  • Intermediate data for decomposed metrics

  • A dict mapping metric names to values for multi-metric classes

Parameters:
Return type:

float
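A sketch of the formula (hypothetical helper on plain ID lists), including the documented behavior of using the actual list length as the denominator when the list is shorter than k:

```python
def precision_sketch(recs, relevant, n=None):
    # |L ∩ test| / |L|, using the actual (possibly shorter) list length.
    L = recs[:n] if n is not None else recs
    if not L:
        return 0.0
    return len(set(L) & set(relevant)) / len(L)

print(precision_sketch([1, 2, 3, 4], [2, 4, 9]))  # 0.5
```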

class lenskit.metrics.ranking.Recall(n=None, *, k=None)#

Bases: lenskit.metrics.ranking._base.ListMetric, lenskit.metrics.ranking._base.RankingMetricBase

Compute recommendation recall. This is computed as:

\[\frac{|L \cap I_u^{\mathrm{test}}|}{\operatorname{min}\{|I_u^{\mathrm{test}}|, k\}}\]
Parameters:
  • n (int | None)

  • k (int | None)

property label#

The metric’s default label in output. The base implementation returns the class name by default.

measure_list(recs, test)#

Compute measurements for a single list.

Returns:

  • A float for simple metrics

  • Intermediate data for decomposed metrics

  • A dict mapping metric names to values for multi-metric classes

Parameters:
Return type:

float

class lenskit.metrics.ranking.RBP(n=None, *, k=None, weight=None, patience=0.85, normalize=False, weight_field=None)#

Bases: lenskit.metrics.ranking._base.ListMetric, lenskit.metrics.ranking._base.RankingMetricBase

Evaluate recommendations with rank-biased precision [MZ08].

If \(r_{ui} \in \{0, 1\}\) is a binary implicit rating, and the weighting is the default geometric weight with patience \(p\), the RBP is computed by:

\[\begin{align*} \operatorname{RBP}_p(L, u) & = (1 - p) \sum_i r_{ui} p^{i-1} \end{align*}\]

The original RBP metric depends on the idea that the rank-biased sum of binary relevance scores in an infinitely-long, perfectly-precise list is \(1/(1 - p)\). If RBP is used with a non-standard weighting that does not have a defined infinite series sum, then this metric will normalize by the sum of the discounts for the recommendation list.

Moffat and Zobel [MZ08] provide an extended discussion on choosing the patience parameter \(p\). This metric defaults to \(p=0.85\), to provide a relatively shallow curve and reward good items on the first few pages of results (in a 10-per-page setting). Recommender systems data has no pooling, so the variance of this estimator may be high, as they note in the paper; however, RBP with high patience should be no worse than nDCG (and perhaps even better) in this regard.

In recommender evaluation, we usually have a small test set, so the maximum achievable RBP is significantly less than the theoretical maximum, and is a function of the number of test items. With normalize=True, the RBP metric will be normalized by the maximum achievable with the provided test data, like NDCG.

Warning

The additional normalization is experimental, and should not yet be used for published research results.

Parameters:
  • n (int | None) – The maximum recommendation list length.

  • weight (lenskit.metrics.ranking._weighting.RankWeight | None) – The rank weighting model to use. Defaults to GeometricRankWeight with the specified patience parameter.

  • patience (float) – The patience parameter \(p\), the probability that the user continues browsing at each point. The default is 0.85.

  • normalize (bool) – Whether to normalize the RBP scores; if True, divides the RBP score by the maximum achievable with the test data (as in nDCG).

  • weight_field (str | None) – Name of a field in the item list to use as weights. If provided, weights are read from this field instead of being computed from the rank model.

  • k (int | None)

Stability:
Caller (see Stability Levels).
weight: lenskit.metrics.ranking._weighting.RankWeight | None#
patience: float#
normalize: bool#
weight_field: str | None#
property label#

The metric’s default label in output. The base implementation returns the class name by default.

measure_list(recs, test)#

Compute measurements for a single list.

Returns:

  • A float for simple metrics

  • Intermediate data for decomposed metrics

  • A dict mapping metric names to values for multi-metric classes

Parameters:
Return type:

float

lenskit.metrics.ranking.rank_biased_precision(good, weights, normalization=1.0)#

Compute rank-biased precision given explicit weights.

Parameters:
  • good (numpy.ndarray) – Boolean array indicating relevant items at each position.

  • weights (numpy.ndarray) – Weight for each item position (same length as good).

  • normalization (float) – Optional normalization factor, defaults to 1.0.

Returns:

RBP score

Return type:

float
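A plausible implementation consistent with the documented signature (a sketch, not the actual LensKit source):

```python
import numpy as np

def rank_biased_precision(good, weights, normalization=1.0):
    # Sum the weights at relevant positions, then apply the normalization.
    good = np.asarray(good, dtype=bool)
    weights = np.asarray(weights, dtype=np.float64)
    return float(weights[good].sum() / normalization)

p = 0.85
weights = (1 - p) * p ** np.arange(5)        # geometric cascade weights
good = [True, False, True, False, False]     # relevant at ranks 1 and 3
print(rank_biased_precision(good, weights))  # (1-p)(1 + p^2) ≈ 0.258
```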

class lenskit.metrics.ranking.RecipRank(n=None, *, k=None)#

Bases: lenskit.metrics.ranking._base.ListMetric, lenskit.metrics.ranking._base.RankingMetricBase

Compute the reciprocal rank [KV97] of the first relevant item in a list of recommendations. Taking the mean of this metric over the recommendation lists in a run yields the MRR (mean reciprocal rank).

Let \(\kappa\) denote the 1-based rank of the first relevant item in \(L\), with \(\kappa=\infty\) if none of the first \(k\) items in \(L\) are relevant; then the reciprocal rank is \(1 / \kappa\). If no elements are relevant, the reciprocal rank is therefore 0. Deshpande and Karypis [DK04] call this the “reciprocal hit rate”.

Stability:
Caller (see Stability Levels).
Parameters:
  • n (int | None)

  • k (int | None)

property label#

The metric’s default label in output. The base implementation returns the class name by default.

measure_list(recs, test)#

Compute measurements for a single list.

Returns:

  • A float for simple metrics

  • Intermediate data for decomposed metrics

  • A dict mapping metric names to values for multi-metric classes

Parameters:
Return type:

float
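A minimal sketch of the computation (hypothetical helper on plain ID lists):

```python
def recip_rank_sketch(recs, relevant, n=None):
    # 1/rank of the first relevant item, or 0.0 if none appears.
    rel = set(relevant)
    L = recs[:n] if n is not None else recs
    for rank, item in enumerate(L, start=1):
        if item in rel:
            return 1.0 / rank
    return 0.0

print(recip_rank_sketch([7, 3, 9], [9]))  # first hit at rank 3: 1/3
print(recip_rank_sketch([7, 3], [9]))     # no relevant item: 0.0
```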

class lenskit.metrics.ranking.GeometricRankWeight(patience=0.85)#

Bases: RankWeight

Geometric cascade weighting for result ranks.

This is the ranking model used by RBP [MZ08].

For patience \(p\), the discount is given by \(p^{k-1}\). The sum of this infinite series is \(\frac{1}{1 - p}\).

Parameters:

patience (Annotated[float, Gt(0.0), Lt(1.0)]) – The patience parameter \(p\).

Stability:
Caller (see Stability Levels).
patience: float#
weight(ranks)#

Compute the discount for the specified ranks.

Ranks must start with 1.

Return type:

lenskit.data.types.NPVector[numpy.float64]

log_weight(ranks)#

Compute the (natural) log of the discount for the specified ranks.

Ranks must start with 1.

Return type:

lenskit.data.types.NPVector[numpy.float64]

series_sum()#

Get the sum of the infinite series of this discount function, if known. Some metrics (e.g. RBP()) will use this to normalize their measurements.

Return type:

float
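The weighting and its series sum fit in a few lines (illustrative sketch, not the LensKit class):

```python
import numpy as np

def geometric_weight(ranks, patience=0.85):
    # p^(k-1) for 1-based ranks k.
    return patience ** (np.asarray(ranks) - 1)

w = geometric_weight(np.arange(1, 4))  # weights p^0, p^1, p^2
print(w)
print(1 / (1 - 0.85))  # the infinite series sum, as in series_sum()
```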

class lenskit.metrics.ranking.LogRankWeight(*, base=2, offset=0)#

Bases: RankWeight

Logarithmic weighting for result ranks, as used in NDCG.

This is the ranking model typically used for DCG and NDCG.

Since \(\operatorname{lg} 1 = 0\), simply taking the log will result in division by 0 when weights are applied. The correction for this in the original NDCG paper [JarvelinKekalainen02] is to clip the ranks, so that both of the first two positions have discount \(\operatorname{lg} 2\). A different correction sometimes seen is to compute \(\operatorname{lg} (k+1)\). This discount supports both; the default is to clip, but if the offset option is set to a positive number, it is added to the ranks instead.

Parameters:
  • base (pydantic.PositiveFloat) – The log base to use.

  • offset (pydantic.NonNegativeInt) – An offset to add to ranks before computing logs.

base: float#
offset: int#
weight(ranks)#

Compute the discount for the specified ranks.

Ranks must start with 1.
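Both corrections can be sketched as follows (hypothetical helper; the weight is the reciprocal of the log discount):

```python
import numpy as np

def log_rank_weight(ranks, base=2.0, offset=0):
    ranks = np.asarray(ranks, dtype=np.float64)
    if offset > 0:
        ranks = ranks + offset           # the lg(k + offset) variant
    else:
        ranks = np.maximum(ranks, base)  # clip: early ranks share lg(base)
    return np.log(base) / np.log(ranks)

w = log_rank_weight(np.arange(1, 5))
print(w[0], w[1])  # both 1.0: ranks 1 and 2 yield the same discount, lg 2
```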

class lenskit.metrics.ranking.RankWeight#

Bases: abc.ABC

Base class for rank weighting models.

This returns multiplicative weights, such that scores should be multiplied by the weights in order to produce weighted scores.

Stability:
Caller (see Stability Levels).
abstractmethod weight(ranks)#

Compute the discount for the specified ranks.

Ranks must start with 1.

Parameters:

ranks (lenskit.data.types.NPVector[numpy.int32])

Return type:

lenskit.data.types.NPVector[numpy.float64]

log_weight(ranks)#

Compute the (natural) log of the discount for the specified ranks.

Ranks must start with 1.

Parameters:

ranks (lenskit.data.types.NPVector[numpy.int32])

Return type:

lenskit.data.types.NPVector[numpy.float64]

series_sum()#

Get the sum of the infinite series of this discount function, if known. Some metrics (e.g. RBP()) will use this to normalize their measurements.

Return type:

float | None