lenskit.splitting#

Splitting data for train-test evaluation.

Classes#

`HoldoutMethod`	Holdout methods select test rows for a user (or occasionally an item).
`LastFrac`	Select a fraction of test rows per user/item.
`LastN`	Select a fixed number of test rows per user/item, based on ordering by a
`SampleFrac`	Randomly select a fraction of test rows per user/item.
`SampleN`	Randomly select a fixed number of test rows per user/item.
`TTSplit`	A train-test set from splitting or other sources.
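
To make the difference between the holdout strategies concrete, here is a minimal plain-Python sketch of what "last N", "last fraction", and "sample N" selection mean for one user's rows. The helper names `last_n`, `last_frac`, and `sample_n` are hypothetical and are not the LensKit classes; rounding and edge-case behaviour are assumptions of this sketch.

```python
import random


def last_n(rows, n):
    """Hold out the final n rows of an already time-ordered list."""
    cut = max(len(rows) - n, 0)
    return rows[:cut], rows[cut:]


def last_frac(rows, frac):
    """Hold out the final fraction of rows (count rounded; an assumption)."""
    return last_n(rows, round(len(rows) * frac))


def sample_n(rows, n, rng):
    """Hold out n rows chosen uniformly at random, keeping original order."""
    picked = set(rng.sample(range(len(rows)), min(n, len(rows))))
    train = [r for i, r in enumerate(rows) if i not in picked]
    test = [r for i, r in enumerate(rows) if i in picked]
    return train, test


# One user's interactions as (item, timestamp), sorted by timestamp.
rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]
train_last, test_last = last_n(rows, 2)
train_frac, test_frac = last_frac(rows, 0.2)
train_rand, test_rand = sample_n(rows, 2, random.Random(42))
```

The ordered variants depend on the rows being sorted by some field (here a timestamp), while the sampled variants only need a random number generator.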

Functions#

`crossfold_records`(data, partitions, *[, test_only, rng])	Partition a dataset by records into cross-fold partitions. This
`sample_records`(…)	Create a train-test split of data by randomly sampling individual
`split_global_time`(…)	Global temporal train-test split. This splits a data set into train/test
`split_temporal_fraction`(data, test_fraction[, ...])	Do a global temporal split of a data set based on a test set size.
`crossfold_users`(data, partitions, method, *[, ...])	Partition a dataset user-by-user for user-based cross-validation.
`sample_users`(…)	Create train-test splits by sampling users. When `repeats` is None,
`simple_test_pair`(ratings[, n_users, n_rates, f_rates, rng])	Return a single, basic train-test pair for some ratings. This is only intended

Package Contents#

lenskit.splitting.crossfold_records(data, partitions, *, test_only=False, rng=None)#

Partition a dataset by records into cross-fold partitions. This partitions the records (ratings, play counts, clicks, etc.) into k partitions without regard to users or items.

Since record-based random cross-validation doesn’t make much sense with repeated interactions, this splitter only supports operating on the dataset’s interaction matrix.

Stability:

Caller (see Stability Levels).

Parameters:

data (lenskit.data.Dataset) – Ratings or other data you wish to partition.
partitions (int) – The number of partitions to produce.
test_only (bool) – If True, returns splits with empty training sets (useful when you just want to save the test data).
rng (lenskit.random.RNGInput) – The random number generator or seed (see Random Seeds).

Returns:

an iterator of train-test pairs

Return type:

iterator
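
As an illustration of record-based partitioning, the following sketch shuffles record indices and deals them into k test sets, ignoring which user or item each record belongs to. It is a stand-alone approximation in plain Python, not LensKit's implementation; `crossfold_records_sketch` and its `seed` parameter are hypothetical names.

```python
import random


def crossfold_records_sketch(records, partitions, seed=None):
    """Yield (train, test) pairs; every record is tested exactly once."""
    rng = random.Random(seed)
    order = list(range(len(records)))
    rng.shuffle(order)
    for k in range(partitions):
        # Every partitions-th shuffled index, starting at k, forms fold k.
        test_idx = set(order[k::partitions])
        train = [r for i, r in enumerate(records) if i not in test_idx]
        test = [r for i, r in enumerate(records) if i in test_idx]
        yield train, test


# 20 (user, item) records: 4 users with 5 items each.
records = [(u, i) for u in range(4) for i in range(5)]
folds = list(crossfold_records_sketch(records, 5, seed=0))
```

Because the folds are built from records rather than users, a given user's interactions can land in both the training and test sides of the same fold.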

lenskit.splitting.sample_records(data: lenskit.data.Dataset, size: int, *, disjoint: bool = True, test_only: bool = False, rng: lenskit.random.RNGInput = None, repeats: None = None) → lenskit.splitting._split.TTSplit#

lenskit.splitting.sample_records(data: lenskit.data.Dataset, size: int, *, repeats: int, disjoint: bool = True, test_only: bool = False, rng: lenskit.random.RNGInput = None) → Iterator[lenskit.splitting._split.TTSplit]

Create a train-test split of data by randomly sampling individual interactions.

We can loop over a sequence of train-test pairs:

>>> from lenskit.data import load_movielens
>>> movielens = load_movielens('data/ml-latest-small')
>>> for split in sample_records(movielens, 1000, repeats=5):
...     print(sum(len(il) for il in split.test.lists()))
1000
1000
1000
1000
1000

Sometimes for testing, it is useful to just get a single pair:

>>> split = sample_records(movielens, 1000)
>>> sum(len(il) for il in split.test.lists())
1000

Stability:

Caller (see Stability Levels).

Parameters:

data – The data set to split.
size – The size of each test sample.
repeats – The number of data splits to produce. If None, produce a single train-test pair instead of an iterator.
disjoint – If True, force test samples to be disjoint.
test_only – If True, returns splits with empty training sets (useful when you just want to save the test data).
rng – The random number generator or seed (see Random Seeds).

Returns:

A train-test pair or iterator of such pairs (depending on repeats).
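
The effect of `disjoint` can be sketched in plain Python: with disjoint sampling, all test sets are carved out of one draw without replacement, so no record is tested twice. This is an illustrative sketch only; `sample_records_sketch` is a hypothetical helper, and raising an error when there are too few records is an assumption of the sketch rather than documented LensKit behaviour.

```python
import random


def sample_records_sketch(records, size, repeats, disjoint=True, seed=None):
    """Yield (train, test) pairs with `size` randomly sampled test records."""
    rng = random.Random(seed)
    if disjoint:
        if size * repeats > len(records):
            # Assumption of this sketch: refuse rather than overlap.
            raise ValueError("not enough records for disjoint samples")
        pool = rng.sample(records, size * repeats)
        tests = [pool[k * size:(k + 1) * size] for k in range(repeats)]
    else:
        # Independent draws; test sets may share records.
        tests = [rng.sample(records, size) for _ in range(repeats)]
    for test in tests:
        held = set(test)
        yield [r for r in records if r not in held], test


records = [(u, i) for u in range(4) for i in range(5)]
splits = list(sample_records_sketch(records, 3, 4, seed=1))
```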

lenskit.splitting.split_global_time(…)#

Global temporal train-test split. This splits a data set into train/test pairs using a single global timestamp. When given multiple timestamps, it returns multiple splits, where split \(i\) has training data from before \(t_i\) and testing data on or after \(t_i\) and before \(t_{i+1}\) (the last split has no upper bound on the testing data).

Stability:

Caller (see Stability Levels).

Parameters:

data – The dataset to split.
time – Time or sequence of times at which to split. Strings must be in ISO format.
end – A final cutoff time for the testing data.
filter_test_users – Limit test data to users who have at least one item in the training data.

Returns:

The data splits.
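
The windowing rule for multiple timestamps can be sketched as follows, using plain tuples of `(user, item, timestamp)`. The function `split_global_time_sketch` is a hypothetical stand-in, not the LensKit function; treating `end` as an exclusive upper bound on the last test window is an assumption.

```python
def split_global_time_sketch(rows, times, end=None, filter_test_users=False):
    """Return one (train, test) pair per cut time in `times`."""
    times = sorted(times)
    splits = []
    for i, t in enumerate(times):
        # The next cut time bounds the test window; the last window uses `end`.
        upper = times[i + 1] if i + 1 < len(times) else end
        train = [r for r in rows if r[2] < t]
        test = [r for r in rows
                if r[2] >= t and (upper is None or r[2] < upper)]
        if filter_test_users:
            # Keep only test users that also appear in the training data.
            known = {r[0] for r in train}
            test = [r for r in test if r[0] in known]
        splits.append((train, test))
    return splits


rows = [("u1", "a", 1), ("u1", "b", 5), ("u2", "c", 6), ("u1", "d", 9)]
splits = split_global_time_sketch(rows, [4, 8])
filtered = split_global_time_sketch(rows, [4, 8], filter_test_users=True)
```

In this example the first split tests on timestamps 5 and 6, and the second tests on timestamp 9 while training on everything before 8. With `filter_test_users=True`, the row for `u2` drops out of the first test set because `u2` has no training data before the cut.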

lenskit.splitting.split_temporal_fraction(data, test_fraction, filter_test_users=False)#

Do a global temporal split of a data set based on a test set size.

Parameters:

data (lenskit.data.Dataset) – The dataset to split.
test_fraction (float) – The fraction of the interactions to put in the testing data.
filter_test_users (bool | int | None) – Limit test data to users who have at least one item in the training data.

Return type:

lenskit.splitting._split.TTSplit
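
One way to picture a fraction-based temporal split is to derive a cut time from the sorted timestamps and then split on that time. The sketch below does this in plain Python; `split_temporal_fraction_sketch` is hypothetical, and how LensKit actually picks the cut time (and handles tied timestamps) is not stated on this page, so the details here are assumptions.

```python
def split_temporal_fraction_sketch(rows, test_fraction):
    """Split (user, item, timestamp) rows so the newest fraction is tested."""
    if not rows or not 0 < test_fraction < 1:
        raise ValueError("need rows and a fraction strictly between 0 and 1")
    ordered = sorted(rows, key=lambda r: r[2])
    n_test = max(1, round(len(ordered) * test_fraction))
    # The timestamp of the first test row becomes the global cut time.
    cut_time = ordered[len(ordered) - n_test][2]
    train = [r for r in ordered if r[2] < cut_time]
    test = [r for r in ordered if r[2] >= cut_time]
    return cut_time, train, test


rows = [("u", f"i{t}", t) for t in range(10)]
cut_time, train, test = split_temporal_fraction_sketch(rows, 0.2)
```

With tied timestamps at the cut, this sketch puts all tied rows in the test set, so the test side can be slightly larger than the requested fraction.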

lenskit.splitting.crossfold_users(data, partitions, method, *, test_only=False, rng=None)#

Partition a dataset user-by-user for user-based cross-validation.

Stability:

Caller (see Stability Levels).

Parameters:

data (lenskit.data.Dataset) – The dataset to partition.
partitions (int) – The number of partitions to produce.
method (lenskit.splitting._holdout.HoldoutMethod) – The method for selecting test rows for each user.
test_only (bool) – If True, returns splits with only testing data.
rng (lenskit.random.RNGInput | None) – The random number generator or seed (see Random Seeds).

Return type:

Iterator[lenskit.splitting._split.TTSplit]

Returns:

The train-test pairs.
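
User-based cross-validation can be sketched like this: users are dealt into k groups, and in each fold the holdout method is applied only to that fold's test users, while everyone else contributes all of their rows to training. This is a plain-Python illustration under those assumptions; `crossfold_users_sketch` and `last_two` are hypothetical names, not LensKit APIs.

```python
import random


def crossfold_users_sketch(by_user, partitions, holdout, seed=None):
    """Yield (train, test) dicts mapping each user to a list of rows."""
    rng = random.Random(seed)
    users = sorted(by_user)
    rng.shuffle(users)
    for k in range(partitions):
        test_users = set(users[k::partitions])
        train, test = {}, {}
        for user, rows in by_user.items():
            if user in test_users:
                # Test users keep part of their rows for training.
                train[user], test[user] = holdout(rows)
            else:
                train[user] = list(rows)
        yield train, test


def last_two(rows):
    """Hypothetical holdout: the final two rows become test rows."""
    return rows[:-2], rows[-2:]


by_user = {f"u{n}": [f"i{j}" for j in range(6)] for n in range(10)}
folds = list(crossfold_users_sketch(by_user, 5, last_two, seed=7))
```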

lenskit.splitting.sample_users(data: lenskit.data.Dataset, size: int, method: lenskit.splitting._holdout.HoldoutMethod, *, repeats: int, disjoint: bool = True, test_only: bool = False, rng: lenskit.random.RNGInput = None) → Iterator[lenskit.splitting._split.TTSplit]#

lenskit.splitting.sample_users(data: lenskit.data.Dataset, size: int, method: lenskit.splitting._holdout.HoldoutMethod, *, disjoint: bool = True, rng: lenskit.random.RNGInput = None, test_only: bool = False, repeats: None = None) → lenskit.splitting._split.TTSplit

Create train-test splits by sampling users. When repeats is None, returns a single train-test split; otherwise, it returns an iterator over multiple splits. If repeats=1, this function returns an iterator that yields a single train-test pair.

Stability:

Caller (see Stability Levels).

Parameters:

data – The data set to sample.
size – The sample size.
method – The method for obtaining user test ratings.
repeats – The number of samples to produce.
test_only – If True, returns splits with empty training sets (useful when you just want to save the test data).
rng – The random number generator or seed (see Random Seeds).

Returns:

The train-test pair(s).
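
The return convention described above (a single result when `repeats` is None, an iterator otherwise, even for `repeats=1`) can be mimicked with a small sketch. It only draws test users rather than building full splits, and `sample_users_sketch` is a hypothetical helper, not the LensKit function.

```python
import random


def sample_users_sketch(users, size, repeats=None, seed=None):
    """Return one user sample, or an iterator of samples if repeats is set."""
    rng = random.Random(seed)

    def draws(n):
        for _ in range(n):
            yield rng.sample(users, size)

    if repeats is None:
        # Single-result form: unwrap the only draw.
        return next(draws(1))
    return draws(repeats)


users = [f"u{n}" for n in range(20)]
single = sample_users_sketch(users, 3, seed=0)
iterated = sample_users_sketch(users, 3, repeats=1, seed=0)
iterated_items = list(iterated)
```

Callers that want uniform handling can always pass an integer for `repeats` and loop over the result.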

lenskit.splitting.simple_test_pair(ratings, n_users=200, n_rates=5, f_rates=None, rng=None)#

Return a single, basic train-test pair for some ratings. This is intended only as a convenience for tests and demos; do not use it for research.

Parameters:

ratings (lenskit.data.Dataset)
rng (numpy.random.Generator | None)

Return type:

lenskit.splitting._split.TTSplit

Exported Aliases#

class lenskit.splitting.Dataset#: Re-exported alias for lenskit.data.Dataset.
