lenskit.splitting#
Splitting data for train-test evaluation.
Classes#
HoldoutMethod – Holdout methods select test rows for a user (or occasionally an item).
LastFrac – Select a fraction of test rows per user/item.
LastN – Select a fixed number of test rows per user/item, based on ordering by a field.
SampleFrac – Randomly select a fraction of test rows per user/item.
SampleN – Randomly select a fixed number of test rows per user/item.
TTSplit – A train-test set from splitting or other sources.
Functions#
crossfold_records – Partition a dataset by records into cross-fold partitions.
sample_records – Create a train-test split of data by randomly sampling individual interactions.
split_global_time – Global temporal train-test split.
split_temporal_fraction – Do a global temporal split of a data set based on a test set size.
crossfold_users – Partition a dataset user-by-user for user-based cross-validation.
sample_users – Create train-test splits by sampling users.
simple_test_pair – Return a single, basic train-test pair for some ratings.
Package Contents#
- class lenskit.splitting.HoldoutMethod#
Bases:
Protocol
Holdout methods select test rows for a user (or occasionally an item). Holdout methods are callable; when called with an item list, they return the test entries.
- Stability:
- Caller (see Stability Levels).
- abstractmethod __call__(items)#
Subset an item list (in the uncommon case of item-based holdouts, the item list actually holds user IDs).
- Parameters:
items (lenskit.data.ItemList) – The item list from which holdout items should be selected.
- Returns:
The list of test items.
- Return type:
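The callable-protocol idea can be illustrated with a small standalone sketch. It does not use LensKit: the `Holdout` protocol, the `FirstTwo` class, and the `select_test` helper are hypothetical names, and plain lists of item IDs stand in for an item list.

```python
from typing import Protocol, Sequence


class Holdout(Protocol):
    """Hypothetical stand-in for a holdout protocol: a callable over item IDs."""

    def __call__(self, items: Sequence[str]) -> Sequence[str]: ...


class FirstTwo:
    """Toy holdout that selects the first two items it is given."""

    def __call__(self, items: Sequence[str]) -> Sequence[str]:
        return items[:2]


def select_test(method: Holdout, items: Sequence[str]) -> Sequence[str]:
    # Any object with a matching __call__ satisfies the protocol structurally.
    return method(items)


held_out = select_test(FirstTwo(), ["a", "b", "c", "d"])
```

Because the protocol is structural, a plain function or lambda with the same call shape would work in place of `FirstTwo`.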
- class lenskit.splitting.LastFrac(frac, field='timestamp')#
Bases:
HoldoutMethod
Select a fraction of test rows per user/item.
- Stability:
- Caller (see Stability Levels).
- Parameters:
frac (float) – The fraction of items to select for testing.
field (str)
- __call__(items)#
Subset an item list (in the uncommon case of item-based holdouts, the item list actually holds user IDs).
- Parameters:
items (lenskit.data.ItemList) – The item list from which holdout items should be selected.
- Returns:
The list of test items.
- Return type:
- class lenskit.splitting.LastN(n, field='timestamp')#
Bases:
HoldoutMethod
Select a fixed number of test rows per user/item, based on ordering by a field.
- Stability:
- Caller (see Stability Levels).
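- Parameters:
n – The number of test rows to select.
field – The field by which the rows are ordered (defaults to 'timestamp').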
- __call__(items)#
Subset an item list (in the uncommon case of item-based holdouts, the item list actually holds user IDs).
- Parameters:
items (lenskit.data.ItemList) – The item list from which holdout items should be selected.
- Returns:
The list of test items.
- Return type:
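The ordering-based holdouts (a last fraction or a last fixed number of rows) can be pictured with a plain-Python sketch. It does not call LensKit: `last_n` and `last_frac` are hypothetical helpers, the `(item, timestamp)` tuples stand in for an item list with a timestamp field, and the rounding rule used for the fraction is an assumption.

```python
def last_n(rows, n, key=lambda r: r[1]):
    """Return the n rows with the largest key (hypothetical helper)."""
    ordered = sorted(rows, key=key)
    return ordered[-n:] if n > 0 else []


def last_frac(rows, frac, key=lambda r: r[1]):
    """Return the last `frac` of the rows by key; rounding is an assumption."""
    return last_n(rows, round(len(rows) * frac), key)


# One user's rows as (item, timestamp) pairs.
rows = [("a", 30), ("b", 10), ("c", 50), ("d", 20), ("e", 40)]
test_n = last_n(rows, 2)
test_frac = last_frac(rows, 0.4)
```

With five rows, both calls hold out the two most recent interactions, so everything used for training precedes the test rows in time.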
- class lenskit.splitting.SampleFrac(frac, rng=None)#
Bases:
HoldoutMethod
Randomly select a fraction of test rows per user/item.
- Stability:
- Caller (see Stability Levels).
- Parameters:
frac (float) – The fraction of items to select for testing.
rng (lenskit.random.RNGInput) – The random number generator or seed (see Random Seeds).
- __call__(items)#
Subset an item list (in the uncommon case of item-based holdouts, the item list actually holds user IDs).
- Parameters:
items (lenskit.data.ItemList) – The item list from which holdout items should be selected.
- Returns:
The list of test items.
- Return type:
- class lenskit.splitting.SampleN(n, rng=None)#
Bases:
HoldoutMethod
Randomly select a fixed number of test rows per user/item.
- Stability:
- Caller (see Stability Levels).
- Parameters:
n (int) – The number of test items to select.
rng (lenskit.random.RNGInput) – The random number generator or seed (see Random Seeds).
- __call__(items)#
Subset an item list (in the uncommon case of item-based holdouts, the item list actually holds user IDs).
- Parameters:
items (lenskit.data.ItemList) – The item list from which holdout items should be selected.
- Returns:
The list of test items.
- Return type:
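The random holdouts can be sketched the same way, using only the standard library. The `sample_n` and `sample_frac` helpers and the integer `seed` argument are hypothetical; they illustrate seeded sampling without replacement and are not LensKit's implementation.

```python
import random


def sample_n(rows, n, seed=None):
    """Pick up to n rows uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


def sample_frac(rows, frac, seed=None):
    """Pick a fraction of the rows; the rounding rule is an assumption."""
    return sample_n(rows, round(len(rows) * frac), seed)


rows = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
test_a = sample_n(rows, 3, seed=42)
test_b = sample_n(rows, 3, seed=42)  # same seed, same selection
test_c = sample_frac(rows, 0.2, seed=7)
```

Passing the same seed twice reproduces the same test rows, which is the property that makes a seeded split repeatable across runs.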
- lenskit.splitting.crossfold_records(data, partitions, *, test_only=False, rng=None)#
Partition a dataset by records into cross-fold partitions. This partitions the records (ratings, play counts, clicks, etc.) into k partitions without regard to users or items.
Since record-based random cross-validation doesn’t make much sense with repeated interactions, this splitter only supports operating on the dataset’s interaction matrix.
- Stability:
- Caller (see Stability Levels).
- Parameters:
data (lenskit.data.Dataset) – Ratings or other data you wish to partition.
partitions (int) – The number of partitions to produce.
test_only (bool) – If True, returns splits with empty training sets (useful when you just want to save the test data).
rng (lenskit.random.RNGInput) – The random number generator or seed (see Random Seeds).
- Returns:
An iterator of train-test pairs.
- Return type:
iterator
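As a rough illustration of record-based cross-folding, the sketch below shuffles a list of records once and deals them into k test partitions. The `crossfold` function and the `(user, item)` tuples are hypothetical, and whether LensKit assigns records to partitions this way is not stated here.

```python
import random


def crossfold(records, partitions, seed=None):
    """Yield (train, test) pairs; each record is tested exactly once."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    for i in range(partitions):
        # Every partitions-th record, starting at offset i, goes to test fold i.
        test = shuffled[i::partitions]
        train = [r for j, r in enumerate(shuffled) if j % partitions != i]
        yield train, test


records = [(u, i) for u in range(4) for i in range(5)]  # 20 (user, item) pairs
folds = list(crossfold(records, 5, seed=1))
```

The split ignores which user or item a record belongs to, so a user's records can land in several different test folds.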
- lenskit.splitting.sample_records(data: lenskit.data.Dataset, size: int, *, disjoint: bool = True, test_only: bool = False, rng: lenskit.random.RNGInput = None, repeats: None = None) lenskit.splitting._split.TTSplit#
- lenskit.splitting.sample_records(data: lenskit.data.Dataset, size: int, *, repeats: int, disjoint: bool = True, test_only: bool = False, rng: lenskit.random.RNGInput = None) Iterator[lenskit.splitting._split.TTSplit]
Create a train-test split of data by randomly sampling individual interactions.
We can loop over a sequence of train-test pairs:
>>> from lenskit.data import load_movielens
>>> from lenskit.splitting import sample_records
>>> movielens = load_movielens('data/ml-latest-small')
>>> for split in sample_records(movielens, 1000, repeats=5):
...     print(sum(len(il) for il in split.test.lists()))
1000
1000
1000
1000
1000
Sometimes for testing, it is useful to just get a single pair:
>>> split = sample_records(movielens, 1000)
>>> sum(len(il) for il in split.test.lists())
1000
- Stability:
- Caller (see Stability Levels).
- Parameters:
data – The data set to split.
size – The size of each test sample.
repeats – The number of data splits to produce. If None, produce a single train-test pair instead of an iterator or list.
disjoint – If True, force test samples to be disjoint.
test_only – If True, returns splits with empty training sets (useful when you just want to save the test data).
rng – The random number generator or seed (see Random Seeds).
- Returns:
A train-test pair or iterator of such pairs (depending on repeats).
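One way to picture disjoint repeated samples is to shuffle the records once and cut consecutive slices, as in the standalone sketch below. The `sample_splits` helper is hypothetical and only illustrates the disjointness property; it is not a description of how LensKit draws its samples.

```python
import random


def sample_splits(records, size, repeats, seed=None):
    """Yield `repeats` (train, test) pairs whose test sets do not overlap."""
    if repeats * size > len(records):
        raise ValueError("not enough records for disjoint samples")
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    for i in range(repeats):
        # Consecutive slices of one shuffle can never share a record.
        test = shuffled[i * size:(i + 1) * size]
        held = set(test)
        train = [r for r in records if r not in held]
        yield train, test


records = list(range(100))
splits = list(sample_splits(records, 10, repeats=3, seed=0))
```

Each training set here contains every record outside that split's own test sample, including records that serve as test data in the other splits.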
- class lenskit.splitting.TTSplit#
Bases:
Generic[TK]
A train-test set from splitting or other sources.
- Stability:
- Caller (see Stability Levels).
- train: lenskit.data.Dataset#
The training data.
- test: lenskit.data.ItemListCollection[TK]#
The test data.
- property test_df: pandas.DataFrame#
Get the test data as a data frame.
- Return type:
pandas.DataFrame
- property train_df: pandas.DataFrame#
Get the training data as a data frame.
- Return type:
pandas.DataFrame
- lenskit.splitting.split_global_time(data: lenskit.data.Dataset, time: int | float | str | datetime.datetime, end: int | float | str | datetime.datetime | None = None, filter_test_users: bool | int | None = False) lenskit.splitting._split.TTSplit#
- lenskit.splitting.split_global_time(data: lenskit.data.Dataset, time: Sequence[int | float | str | datetime.datetime], end: int | float | str | datetime.datetime | None = None, filter_test_users: bool | int | None = False) list[lenskit.splitting._split.TTSplit]
Global temporal train-test split. This splits a data set into train/test pairs using a single global timestamp. When given multiple timestamps, it will return multiple splits, where split \(i\) has training data from before \(t_i\) and testing data on or after \(t_i\) and before \(t_{i+1}\) (the last split has no upper bound on the testing data).
- Stability:
- Caller (see Stability Levels).
- Parameters:
data – The dataset to split.
time – Time or sequence of times at which to split. Strings must be in ISO format.
end – A final cutoff time for the testing data.
filter_test_users – Limit test data to users who have items in the training data.
- Returns:
The data splits.
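The windowing rule for multiple timestamps can be written out in a few lines of plain Python. This is a sketch of the rule as described above, not LensKit code: `split_times` is a hypothetical helper, rows are `(user, item, timestamp)` tuples with integer timestamps, and `times` is assumed to be in ascending order.

```python
def split_times(rows, times, end=None):
    """Return one (train, test) pair per cut point in `times`."""
    splits = []
    for i, t in enumerate(times):
        # The test window closes at the next cut point, or at `end` for the last split.
        upper = times[i + 1] if i + 1 < len(times) else end
        train = [r for r in rows if r[2] < t]
        test = [r for r in rows if r[2] >= t and (upper is None or r[2] < upper)]
        splits.append((train, test))
    return splits


rows = [("u1", "a", 1), ("u1", "b", 4), ("u2", "a", 5), ("u2", "c", 8), ("u1", "d", 9)]
splits = split_times(rows, [4, 8])
```

With cut points 4 and 8, the first split tests on timestamps 4 through 7 and the second tests on everything from 8 onward, while each training set holds only rows strictly before its own cut point.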
- lenskit.splitting.split_temporal_fraction(data, test_fraction, filter_test_users=False)#
Do a global temporal split of a data set based on a test set size.
- Parameters:
data (lenskit.data.Dataset) – The dataset to split.
test_fraction (float) – The fraction of the interactions to put in the testing data.
filter_test_users (bool | int | None) – Limit test data to users who have items in the training data.
- Return type:
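A fraction-based temporal split can be sketched by ordering the interactions by time and cutting off the most recent share. The `temporal_fraction_split` helper below is hypothetical, and cutting by position (rather than by a timestamp threshold, which would treat tied timestamps differently) is an assumption made for the illustration.

```python
def temporal_fraction_split(rows, test_fraction):
    """Put the most recent `test_fraction` of rows into the test set."""
    ordered = sorted(rows, key=lambda r: r[2])
    n_test = round(len(ordered) * test_fraction)  # rounding is an assumption
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Ten (user, item, timestamp) rows with distinct timestamps.
rows = [("u%d" % (t % 3), "i%d" % t, t) for t in range(10)]
train, test = temporal_fraction_split(rows, 0.2)
```

Every training row is older than every test row, which is what distinguishes a global temporal split from a random one.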
- lenskit.splitting.crossfold_users(data, partitions, method, *, test_only=False, rng=None)#
Partition a dataset user-by-user for user-based cross-validation.
- Stability:
- Caller (see Stability Levels).
- Parameters:
data (lenskit.data.Dataset) – The dataset to partition.
partitions (int) – The number of partitions to produce.
method (lenskit.splitting._holdout.HoldoutMethod) – The method for selecting test rows for each user.
test_only (bool) – If True, returns splits with only testing data.
rng (lenskit.random.RNGInput | None) – The random number generator or seed (see Random Seeds).
- Return type:
Iterator[lenskit.splitting._split.TTSplit]
- Returns:
The train-test pairs.
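User-based cross-validation combines a partition of the users with a per-user holdout method. The sketch below shows that combination in plain Python; `crossfold_by_user` and `last_one` are hypothetical names, rows are `(user, item, timestamp)` tuples, and the details (such as how users are assigned to partitions) are assumptions rather than LensKit's behavior.

```python
import random
from collections import defaultdict


def crossfold_by_user(rows, partitions, holdout, seed=None):
    """Yield (train, test) pairs; each user is a test user in exactly one fold."""
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)
    users = sorted(by_user)
    random.Random(seed).shuffle(users)
    for i in range(partitions):
        test = []
        for user in users[i::partitions]:
            # The holdout method decides which of this user's rows are tested.
            test.extend(holdout(by_user[user]))
        held = set(test)
        train = [r for r in rows if r not in held]
        yield train, test


def last_one(user_rows):
    """Toy holdout: the single most recent row for a user."""
    return sorted(user_rows, key=lambda r: r[2])[-1:]


rows = [(u, i, 10 * u + i) for u in range(6) for i in range(4)]
folds = list(crossfold_by_user(rows, 3, last_one, seed=0))
```

Rows from test users that the holdout does not select stay in the training data, so a model still sees part of each test user's history.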
- lenskit.splitting.sample_users(data: lenskit.data.Dataset, size: int, method: lenskit.splitting._holdout.HoldoutMethod, *, repeats: int, disjoint: bool = True, test_only: bool = False, rng: lenskit.random.RNGInput = None) Iterator[lenskit.splitting._split.TTSplit]#
- lenskit.splitting.sample_users(data: lenskit.data.Dataset, size: int, method: lenskit.splitting._holdout.HoldoutMethod, *, disjoint: bool = True, rng: lenskit.random.RNGInput = None, test_only: bool = False, repeats: None = None) lenskit.splitting._split.TTSplit
Create train-test splits by sampling users. When repeats is None, returns a single train-test split; otherwise, it returns an iterator over multiple splits. If repeats=1, this function returns an iterator that yields a single train-test pair.
- Stability:
- Caller (see Stability Levels).
- Parameters:
data – The data set to sample.
size – The sample size.
method – The method for obtaining user test ratings.
repeats – The number of samples to produce.
test_only – If True, returns splits with empty training sets (useful when you just want to save the test data).
rng – The random number generator or seed (see Random Seeds).
- Returns:
The train-test pair(s).
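The difference between `repeats=None` and `repeats=1` is easy to miss, so here is a small standalone sketch of that return convention. `sample_user_ids` is a hypothetical helper that only draws user IDs (no holdout and no disjointness), using the standard library rather than LensKit.

```python
import random


def sample_user_ids(users, size, repeats=None, seed=None):
    """Return one sample when repeats is None, otherwise an iterator of samples."""
    rng = random.Random(seed)

    def draw():
        return rng.sample(users, size)

    if repeats is None:
        return draw()  # a single sample, not wrapped in an iterator
    return (draw() for _ in range(repeats))


users = list(range(20))
single = sample_user_ids(users, 3, seed=1)
wrapped = list(sample_user_ids(users, 3, repeats=1, seed=1))
```

Code that loops over the result should pass an explicit `repeats`, even if it is 1, while code that wants the pair directly should leave it unset.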
- lenskit.splitting.simple_test_pair(ratings, n_users=200, n_rates=5, f_rates=None, rng=None)#
Return a single, basic train-test pair for some ratings. This is only intended for convenience use in tests and demos; do not use it for research.
- Parameters:
ratings (lenskit.data.Dataset)
rng (numpy.random.Generator | None)
- Return type:
Exported Aliases#
- class lenskit.splitting.Dataset#
Re-exported alias for
lenskit.data.Dataset.