lenskit.splitting#

Splitting data for train-test evaluation.

Classes#

HoldoutMethod

Holdout methods select test rows for a user (or occasionally an item).

LastFrac

Select a fraction of test rows per user/item.

LastN

Select a fixed number of test rows per user/item, based on ordering by a field.

SampleFrac

Randomly select a fraction of test rows per user/item.

SampleN

Randomly select a fixed number of test rows per user/item.

TTSplit

A train-test set from splitting or other sources.

Functions#

crossfold_records(data, partitions, *[, test_only, rng])

Partition a dataset by records into cross-fold partitions.

sample_records(…)

Create a train-test split of data by randomly sampling individual interactions.

split_global_time(…)

Global temporal train-test split.

split_temporal_fraction(data, test_fraction[, ...])

Do a global temporal split of a data set based on a test set size.

crossfold_users(data, partitions, method, *[, ...])

Partition a dataset user-by-user for user-based cross-validation.

sample_users(…)

Create train-test splits by sampling users.

simple_test_pair(ratings[, n_users, n_rates, f_rates, rng])

Return a single, basic train-test pair for some ratings.

Package Contents#

class lenskit.splitting.HoldoutMethod#

Bases: Protocol

Holdout methods select test rows for a user (or occasionally an item). Holdout methods are callable; when called with an item list, they return the test entries.

Stability:
Caller (see Stability Levels).
abstractmethod __call__(items)#

Subset an item list (in the uncommon case of item-based holdouts, the item list actually holds user IDs).

Parameters:
Returns:

The list of test items.

Return type:

lenskit.data.ItemList
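
As a rough illustration of the callable-protocol pattern described above, the sketch below defines a stand-in protocol over plain Python lists. The names `ToyHoldout`, `FirstTwo`, and `apply_holdout` are hypothetical, and plain lists stand in for lenskit.data.ItemList; this is not LensKit's implementation.

```python
from typing import Protocol


class ToyHoldout(Protocol):
    """Hypothetical stand-in for a holdout method: a callable on an item list."""

    def __call__(self, items: list[str]) -> list[str]: ...


class FirstTwo:
    """Toy holdout that returns the first two items as the test rows."""

    def __call__(self, items: list[str]) -> list[str]:
        return items[:2]


def apply_holdout(method: ToyHoldout, items: list[str]) -> list[str]:
    # Any object with a matching __call__ satisfies the protocol structurally.
    return method(items)


test_items = apply_holdout(FirstTwo(), ["i1", "i2", "i3", "i4"])
print(test_items)
```

Because the protocol is structural, `FirstTwo` does not need to inherit from `ToyHoldout` for the call to type-check.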

class lenskit.splitting.LastFrac(frac, field='timestamp')#

Bases: HoldoutMethod

Select a fraction of test rows per user/item.

Stability:
Caller (see Stability Levels).
Parameters:
  • frac (float) – The fraction of items to select for testing.

  • field (str) – The field to order by.

fraction: float#
field: str#
__call__(items)#

Subset an item list (in the uncommon case of item-based holdouts, the item list actually holds user IDs).

Parameters:
Returns:

The list of test items.

Return type:

lenskit.data.ItemList

class lenskit.splitting.LastN(n, field='timestamp')#

Bases: HoldoutMethod

Select a fixed number of test rows per user/item, based on ordering by a field.

Stability:
Caller (see Stability Levels).
Parameters:
  • n (int) – The number of test items to select.

  • field (str) – The field to order by.

n: int#
field: str#
__call__(items)#

Subset an item list (in the uncommon case of item-based holdouts, the item list actually holds user IDs).

Parameters:
Returns:

The list of test items.

Return type:

lenskit.data.ItemList
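
The ordering-based selection can be pictured with a small sketch on plain dictionaries. This assumes that "last" means the rows with the largest values of the ordering field; `last_n` and the sample rows are hypothetical and do not use the LensKit API.

```python
def last_n(rows, n, field="timestamp"):
    """Return the n rows with the largest values of `field` (assumed semantics)."""
    ordered = sorted(rows, key=lambda r: r[field])
    # Guard n <= 0 explicitly, since ordered[-0:] would return every row.
    return ordered[-n:] if n > 0 else []


rows = [
    {"item": "a", "timestamp": 30},
    {"item": "b", "timestamp": 10},
    {"item": "c", "timestamp": 20},
    {"item": "d", "timestamp": 40},
]
test_rows = last_n(rows, 2)
print([r["item"] for r in test_rows])
```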

class lenskit.splitting.SampleFrac(frac, rng=None)#

Bases: HoldoutMethod

Randomly select a fraction of test rows per user/item.

Stability:
Caller (see Stability Levels).
Parameters:
fraction: float#
rng: numpy.random.Generator#
__call__(items)#

Subset an item list (in the uncommon case of item-based holdouts, the item list actually holds user IDs).

Parameters:
Returns:

The list of test items.

Return type:

lenskit.data.ItemList

class lenskit.splitting.SampleN(n, rng=None)#

Bases: HoldoutMethod

Randomly select a fixed number of test rows per user/item.

Stability:
Caller (see Stability Levels).
Parameters:
n: int#
rng: numpy.random.Generator#
__call__(items)#

Subset an item list (in the uncommon case of item-based holdouts, the item list actually holds user IDs).

Parameters:
Returns:

The list of test items.

Return type:

lenskit.data.ItemList
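
For intuition, random fixed-size selection with a reproducible seed might look like the sketch below. It uses the standard library's `random.Random` rather than a NumPy generator, and capping the sample at the list length is an assumption; `sample_n` is a hypothetical helper, not the LensKit class.

```python
import random


def sample_n(items, n, seed=None):
    """Randomly pick up to n items without replacement (illustrative only)."""
    rng = random.Random(seed)
    # Assumption: if fewer than n items exist, take all of them.
    k = min(n, len(items))
    return rng.sample(items, k)


items = ["i1", "i2", "i3", "i4", "i5", "i6"]
first = sample_n(items, 2, seed=42)
second = sample_n(items, 2, seed=42)
print(first == second)  # same seed, same sample
```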

lenskit.splitting.crossfold_records(data, partitions, *, test_only=False, rng=None)#

Partition a dataset by records into cross-fold partitions. This partitions the records (ratings, play counts, clicks, etc.) into k partitions without regard to users or items.

Since record-based random cross-validation doesn’t make much sense with repeated interactions, this splitter only supports operating on the dataset’s interaction matrix.

Stability:
Caller (see Stability Levels).
Parameters:
  • data (lenskit.data.Dataset) – Ratings or other data you wish to partition.

  • partitions (int) – The number of partitions to produce.

  • test_only (bool) – If True, returns splits with empty training sets (useful when you just want to save the test data).

  • rng (lenskit.random.RNGInput) – The random number generator or seed (see Random Seeds).

Returns:

an iterator of train-test pairs

Return type:

iterator
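
The idea of partitioning records without regard to users or items can be sketched on record indices alone. The helper below is a hypothetical, standard-library illustration (round-robin assignment after a shuffle is an assumption), not the function's actual implementation.

```python
import random


def crossfold_indices(n_records, partitions, seed=None):
    """Yield (train, test) index lists for k disjoint test folds."""
    rng = random.Random(seed)
    order = list(range(n_records))
    rng.shuffle(order)
    # Deal the shuffled indices round-robin into k test folds.
    test_folds = [order[i::partitions] for i in range(partitions)]
    for test in test_folds:
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        yield train, test


folds = list(crossfold_indices(10, 3, seed=7))
for train, test in folds:
    print(len(train), len(test))
```

Each record index lands in exactly one test fold, and every training set is the complement of its test fold.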

lenskit.splitting.sample_records(data: lenskit.data.Dataset, size: int, *, disjoint: bool = True, test_only: bool = False, rng: lenskit.random.RNGInput = None, repeats: None = None) -> lenskit.splitting._split.TTSplit#
lenskit.splitting.sample_records(data: lenskit.data.Dataset, size: int, *, repeats: int, disjoint: bool = True, test_only: bool = False, rng: lenskit.random.RNGInput = None) -> Iterator[lenskit.splitting._split.TTSplit]

Create a train-test split of data by randomly sampling individual interactions.

We can loop over a sequence of train-test pairs:

>>> from lenskit.data import load_movielens
>>> from lenskit.splitting import sample_records
>>> movielens = load_movielens('data/ml-latest-small')
>>> for split in sample_records(movielens, 1000, repeats=5):
...     print(sum(len(il) for il in split.test.lists()))
1000
1000
1000
1000
1000

Sometimes for testing, it is useful to just get a single pair:

>>> split = sample_records(movielens, 1000)
>>> sum(len(il) for il in split.test.lists())
1000

Stability:
Caller (see Stability Levels).
Parameters:
  • data – The data set to split.

  • size – The size of each test sample.

  • repeats – The number of data splits to produce. If None, produce a single train-test pair instead of an iterator or list.

  • disjoint – If True, force test samples to be disjoint.

  • test_only – If True, returns splits with empty training sets (useful when you just want to save the test data).

  • rng – The random number generator or seed (see Random Seeds).

Returns:

A train-test pair or iterator of such pairs (depending on repeats).

class lenskit.splitting.TTSplit#

Bases: Generic[TK]

A train-test set from splitting or other sources.

Stability:
Caller (see Stability Levels).
train: lenskit.data.Dataset#

The training data.

test: lenskit.data.ItemListCollection[TK]#

The test data.

name: str | None = None#

A name for this train-test split.

property test_size: int#

Get the number of test pairs.

Return type:

int

property test_df: pandas.DataFrame#

Get the test data as a data frame.

Return type:

pandas.DataFrame

property train_df: pandas.DataFrame#

Get the training data as a data frame.

Return type:

pandas.DataFrame

lenskit.splitting.split_global_time(data: lenskit.data.Dataset, time: int | float | str | datetime.datetime, end: int | float | str | datetime.datetime | None = None, filter_test_users: bool | int | None = False) -> lenskit.splitting._split.TTSplit#
lenskit.splitting.split_global_time(data: lenskit.data.Dataset, time: Sequence[int | float | str | datetime.datetime], end: int | float | str | datetime.datetime | None = None, filter_test_users: bool | int | None = False) -> list[lenskit.splitting._split.TTSplit]

Global temporal train-test split. This splits a data set into train/test pairs using a single global timestamp. When given multiple timestamps, it will return multiple splits, where split \(i\) has training data from before \(t_i\) and testing data on or after \(t_i\) and before \(t_{i+1}\) (the last split has no upper bound on the testing data).

Stability:
Caller (see Stability Levels).
Parameters:
  • data – The dataset to split.

  • time – Time or sequence of times at which to split. Strings must be in ISO format.

  • end – A final cutoff time for the testing data.

  • filter_test_users – Limit test data to users who had items in the training data.

Returns:

The data splits.
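
The multi-timestamp behavior described above (split \(i\) trains on data before \(t_i\) and tests on data from \(t_i\) up to, but not including, \(t_{i+1}\)) can be sketched on plain dictionaries. `split_by_times` is a hypothetical helper; treating `end` as an exclusive upper bound and sorting the split times are assumptions.

```python
def split_by_times(rows, times, end=None):
    """Return a list of (train, test) pairs, one per split time."""
    times = sorted(times)
    splits = []
    for i, t in enumerate(times):
        # The next split time bounds the test window; the last split uses `end`.
        upper = times[i + 1] if i + 1 < len(times) else end
        train = [r for r in rows if r["timestamp"] < t]
        test = [
            r for r in rows
            if r["timestamp"] >= t and (upper is None or r["timestamp"] < upper)
        ]
        splits.append((train, test))
    return splits


rows = [{"item": f"i{ts}", "timestamp": ts} for ts in range(1, 7)]
splits = split_by_times(rows, [3, 5])
for train, test in splits:
    print([r["timestamp"] for r in train], [r["timestamp"] for r in test])
```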

lenskit.splitting.split_temporal_fraction(data, test_fraction, filter_test_users=False)#

Do a global temporal split of a data set based on a test set size.

Parameters:
  • data (lenskit.data.Dataset) – The dataset to split.

  • test_fraction (float) – The fraction of the interactions to put in the testing data.

  • filter_test_users (bool | int | None) – Limit test data to users who had items in the training data.

Return type:

lenskit.splitting._split.TTSplit
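
One way to picture a size-based temporal split is to derive the cutoff time from the desired test fraction and then split at that time. The sketch below is a hypothetical illustration on plain dictionaries; the rounding rule and the handling of tied timestamps are assumptions, not documented behavior.

```python
def temporal_fraction_split(rows, test_fraction):
    """Split rows at the timestamp that leaves about test_fraction in the test set."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    # Assumption: round the target test size to the nearest integer.
    n_test = int(round(len(ordered) * test_fraction))
    if n_test == 0:
        return ordered, []
    cutoff = ordered[-n_test]["timestamp"]
    train = [r for r in ordered if r["timestamp"] < cutoff]
    # Rows tied with the cutoff all go to the test set, so it may exceed n_test.
    test = [r for r in ordered if r["timestamp"] >= cutoff]
    return train, test


rows = [{"item": f"i{ts}", "timestamp": ts} for ts in range(1, 11)]
train, test = temporal_fraction_split(rows, 0.2)
print(len(train), len(test))
```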

lenskit.splitting.crossfold_users(data, partitions, method, *, test_only=False, rng=None)#

Partition a dataset user-by-user for user-based cross-validation.

Stability:
Caller (see Stability Levels).
Parameters:
Return type:

Iterator[lenskit.splitting._split.TTSplit]

Returns:

The train-test pairs.
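
User-based cross-validation combines a partition of the users with a per-user holdout. The following sketch shows one plausible arrangement on plain dictionaries: users are dealt into folds, the holdout picks test items for each test user, and all remaining items stay in training. `crossfold_users_sketch` and the sample data are hypothetical, and the fold assignment scheme is an assumption.

```python
import random


def crossfold_users_sketch(user_items, partitions, holdout, seed=None):
    """Yield (train, test) dicts mapping users to item lists, one pair per fold."""
    rng = random.Random(seed)
    users = sorted(user_items)
    rng.shuffle(users)
    for fold in range(partitions):
        test_users = users[fold::partitions]
        # Apply the holdout callable to each test user's items.
        test = {u: holdout(user_items[u]) for u in test_users}
        # Training keeps every item that was not held out for testing.
        train = {
            u: [i for i in items if i not in test.get(u, [])]
            for u, items in user_items.items()
        }
        yield train, test


user_items = {
    "u1": ["a", "b", "c"],
    "u2": ["b", "c", "d"],
    "u3": ["a", "d", "e"],
    "u4": ["c", "e", "f"],
}
splits = list(crossfold_users_sketch(user_items, 2, lambda items: items[-1:], seed=0))
for train, test in splits:
    print(sorted(test))
```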

lenskit.splitting.sample_users(data: lenskit.data.Dataset, size: int, method: lenskit.splitting._holdout.HoldoutMethod, *, repeats: int, disjoint: bool = True, test_only: bool = False, rng: lenskit.random.RNGInput = None) -> Iterator[lenskit.splitting._split.TTSplit]#
lenskit.splitting.sample_users(data: lenskit.data.Dataset, size: int, method: lenskit.splitting._holdout.HoldoutMethod, *, disjoint: bool = True, rng: lenskit.random.RNGInput = None, test_only: bool = False, repeats: None = None) -> lenskit.splitting._split.TTSplit

Create train-test splits by sampling users. When repeats is None, returns a single train-test split; otherwise, it returns an iterator over multiple splits. If repeats=1, this function returns an iterator that yields a single train-test pair.

Stability:
Caller (see Stability Levels).
Parameters:
  • data – The data set to sample.

  • size – The sample size.

  • method – The method for obtaining user test ratings.

  • repeats – The number of samples to produce.

  • test_only – If True, returns splits with empty training sets (useful when you just want to save the test data).

  • rng – The random number generator or seed (see Random Seeds).

Returns:

The train-test pair(s).

lenskit.splitting.simple_test_pair(ratings, n_users=200, n_rates=5, f_rates=None, rng=None)#

Return a single, basic train-test pair for some ratings. This is intended only for convenience in tests and demos; do not use it for research.

Parameters:
Return type:

lenskit.splitting._split.TTSplit

Exported Aliases#

class lenskit.splitting.Dataset#

Re-exported alias for lenskit.data.Dataset.