lenskit.splitting
=================

.. py:module:: lenskit.splitting

.. autoapi-nested-parse::

   Splitting data for train-test evaluation.


Classes
-------

.. autoapisummary::

   lenskit.splitting.HoldoutMethod
   lenskit.splitting.LastFrac
   lenskit.splitting.LastN
   lenskit.splitting.SampleFrac
   lenskit.splitting.SampleN
   lenskit.splitting.TTSplit


Functions
---------

.. autoapisummary::

   lenskit.splitting.crossfold_records
   lenskit.splitting.sample_records
   lenskit.splitting.split_global_time
   lenskit.splitting.split_temporal_fraction
   lenskit.splitting.crossfold_users
   lenskit.splitting.sample_users
   lenskit.splitting.simple_test_pair


Package Contents
----------------

.. py:class:: HoldoutMethod
   :canonical: lenskit.splitting._holdout.HoldoutMethod

   Bases: :py:obj:`Protocol`

   Holdout methods select test rows for a user (or occasionally an item).
   Holdout methods are callable; when called with an item list, they return
   the test items.

   :Stability: Caller

   .. py:method:: __call__(items)
      :abstractmethod:

      Subset an item list (in the uncommon case of item-based holdouts, the
      item list actually holds user IDs).

      :param items: The item list from which holdout items should be selected.

      :returns: The list of test items.


.. py:class:: LastFrac(frac, field = 'timestamp')
   :canonical: lenskit.splitting._holdout.LastFrac

   Bases: :py:obj:`HoldoutMethod`

   Select a fraction of test rows per user/item, based on ordering by a field.

   :Stability: Caller

   :param frac: The fraction of items to select for testing.
   :type frac: float
   :param field: The field to order by.

   .. py:attribute:: fraction
      :type:  float

   .. py:attribute:: field
      :type:  str

   .. py:method:: __call__(items)

      Subset an item list (in the uncommon case of item-based holdouts, the
      item list actually holds user IDs).

      :param items: The item list from which holdout items should be selected.

      :returns: The list of test items.


.. py:class:: LastN(n, field = 'timestamp')
   :canonical: lenskit.splitting._holdout.LastN

   Bases: :py:obj:`HoldoutMethod`

   Select a fixed number of test rows per user/item, based on ordering by a
   field.

   :Stability: Caller

   :param n: The number of test items to select.
   :param field: The field to order by.

   .. py:attribute:: n
      :type:  int

   .. py:attribute:: field
      :type:  str

   .. py:method:: __call__(items)

      Subset an item list (in the uncommon case of item-based holdouts, the
      item list actually holds user IDs).

      :param items: The item list from which holdout items should be selected.

      :returns: The list of test items.


.. py:class:: SampleFrac(frac, rng = None)
   :canonical: lenskit.splitting._holdout.SampleFrac

   Bases: :py:obj:`HoldoutMethod`

   Randomly select a fraction of test rows per user/item.

   :Stability: Caller

   :param frac: The fraction of items to select for testing.
   :param rng: The random number generator or seed (see :ref:`rng`).

   .. py:attribute:: fraction
      :type:  float

   .. py:attribute:: rng
      :type:  numpy.random.Generator

   .. py:method:: __call__(items)

      Subset an item list (in the uncommon case of item-based holdouts, the
      item list actually holds user IDs).

      :param items: The item list from which holdout items should be selected.

      :returns: The list of test items.


.. py:class:: SampleN(n, rng = None)
   :canonical: lenskit.splitting._holdout.SampleN

   Bases: :py:obj:`HoldoutMethod`

   Randomly select a fixed number of test rows per user/item.

   :Stability: Caller

   :param n: The number of test items to select.
   :param rng: The random number generator or seed (see :ref:`rng`).

   .. py:attribute:: n
      :type:  int

   .. py:attribute:: rng
      :type:  numpy.random.Generator

   .. py:method:: __call__(items)

      Subset an item list (in the uncommon case of item-based holdouts, the
      item list actually holds user IDs).

      :param items: The item list from which holdout items should be selected.

      :returns: The list of test items.


.. py:function:: crossfold_records(data, partitions, *, test_only = False, rng = None)

   Partition a dataset by **records** into cross-fold partitions. This
   partitions the records (ratings, play counts, clicks, etc.) into *k*
   partitions without regard to users or items.

   Since record-based random cross-validation doesn't make much sense with
   repeated interactions, this splitter only supports operating on the
   dataset's interaction matrix.

   :Stability: Caller

   :param data: Ratings or other data you wish to partition.
   :param partitions: The number of partitions to produce.
   :param test_only: If ``True``, returns splits with empty training sets
                     (useful when you just want to save the test data).
   :param rng: The random number generator or seed (see :ref:`rng`).

   :returns: An iterator of train-test pairs.
   :rtype: iterator


.. py:function:: sample_records(data: lenskit.data.Dataset, size: int, *, disjoint: bool = True, test_only: bool = False, rng: lenskit.random.RNGInput = None, repeats: None = None) -> lenskit.splitting._split.TTSplit
                 sample_records(data: lenskit.data.Dataset, size: int, *, repeats: int, disjoint: bool = True, test_only: bool = False, rng: lenskit.random.RNGInput = None) -> Iterator[lenskit.splitting._split.TTSplit]

   Create a train-test split of data by randomly sampling individual
   interactions.

   We can loop over a sequence of train-test pairs::

       >>> from lenskit.data import load_movielens
       >>> from lenskit.splitting import sample_records
       >>> movielens = load_movielens('data/ml-latest-small')
       >>> for split in sample_records(movielens, 1000, repeats=5):
       ...     print(sum(len(il) for il in split.test.lists()))
       1000
       1000
       1000
       1000
       1000

   Sometimes for testing, it is useful to just get a single pair::

       >>> split = sample_records(movielens, 1000)
       >>> sum(len(il) for il in split.test.lists())
       1000

   :Stability: Caller

   :param data: The data set to split.
   :param size: The size of each test sample.
   :param repeats: The number of data splits to produce. If ``None``, produce
                   a *single* train-test pair instead of an iterator or list.
   :param disjoint: If ``True``, force test samples to be disjoint.
   :param test_only: If ``True``, returns splits with empty training sets
                     (useful when you just want to save the test data).
   :param rng: The random number generator or seed (see :ref:`rng`).

   :returns: A train-test pair or iterator of such pairs (depending on
             ``repeats``).


.. py:class:: TTSplit
   :canonical: lenskit.splitting._split.TTSplit

   Bases: :py:obj:`Generic`\ [\ :py:obj:`TK`\ ]

   A train-test set from splitting or other sources.

   :Stability: Caller

   .. py:attribute:: train
      :type:  lenskit.data.Dataset

      The training data.

   .. py:attribute:: test
      :type:  lenskit.data.ItemListCollection[TK]

      The test data.

   .. py:attribute:: name
      :type:  str | None
      :value: None

      A name for this train-test split.

   .. py:property:: test_size
      :type: int

      Get the number of test pairs.

   .. py:property:: test_df
      :type: pandas.DataFrame

      Get the test data as a data frame.

   .. py:property:: train_df
      :type: pandas.DataFrame

      Get the training data as a data frame.


.. py:function:: split_global_time(data: lenskit.data.Dataset, time: int | float | str | datetime.datetime, end: int | float | str | datetime.datetime | None = None, filter_test_users: bool | int | None = False) -> lenskit.splitting._split.TTSplit
                 split_global_time(data: lenskit.data.Dataset, time: Sequence[int | float | str | datetime.datetime], end: int | float | str | datetime.datetime | None = None, filter_test_users: bool | int | None = False) -> list[lenskit.splitting._split.TTSplit]

   Global temporal train-test split. This splits a data set into train/test
   pairs using a single global timestamp. When given multiple timestamps, it
   will return multiple splits, where split :math:`i` has training data from
   before :math:`t_i` and testing data on or after :math:`t_i` and before
   :math:`t_{i+1}` (the last split has no upper bound on the testing data).

   :Stability: Caller

   :param data: The dataset to split.
   :param time: Time or sequence of times at which to split. Strings must be
                in ISO format.
   :param end: A final cutoff time for the testing data.
   :param filter_test_users: Limit test data to only have users who had items
                             in the training data.

   :returns: The data splits.


.. py:function:: split_temporal_fraction(data, test_fraction, filter_test_users = False)

   Do a global temporal split of a data set based on a test set size.

   :param data: The dataset to split.
   :param test_fraction: The fraction of the interactions to put in the
                         testing data.
   :param filter_test_users: Limit test data to only have users who had items
                             in the training data.


.. py:function:: crossfold_users(data, partitions, method, *, test_only = False, rng = None)

   Partition a dataset user-by-user for user-based cross-validation.

   :Stability: Caller

   :param data: The dataset to partition.
   :param partitions: The number of partitions to produce.
   :param method: The method for selecting test rows for each user.
   :param test_only: If ``True``, returns splits with only testing data.
   :param rng: The random number generator or seed (see :ref:`rng`).

   :returns: The train-test pairs.


.. py:function:: sample_users(data: lenskit.data.Dataset, size: int, method: lenskit.splitting._holdout.HoldoutMethod, *, repeats: int, disjoint: bool = True, test_only: bool = False, rng: lenskit.random.RNGInput = None) -> Iterator[lenskit.splitting._split.TTSplit]
                 sample_users(data: lenskit.data.Dataset, size: int, method: lenskit.splitting._holdout.HoldoutMethod, *, disjoint: bool = True, rng: lenskit.random.RNGInput = None, test_only: bool = False, repeats: None = None) -> lenskit.splitting._split.TTSplit

   Create train-test splits by sampling users. When ``repeats`` is ``None``,
   returns a single train-test split; otherwise, it returns an iterator over
   multiple splits. If ``repeats=1``, this function returns an iterator that
   yields a single train-test pair.

   :Stability: Caller

   :param data: The data set to sample.
   :param size: The sample size.
   :param method: The method for obtaining user test ratings.
   :param repeats: The number of samples to produce.
   :param disjoint: If ``True``, force test samples to be disjoint.
   :param test_only: If ``True``, returns splits with empty training sets
                     (useful when you just want to save the test data).
   :param rng: The random number generator or seed (see :ref:`rng`).

   :returns: The train-test pair(s).


.. py:function:: simple_test_pair(ratings, n_users=200, n_rates=5, f_rates=None, rng = None)

   Return a single, basic train-test pair for some ratings. This is intended
   only for convenience in tests and demos; do not use it for research.


Exported Aliases
----------------

.. py:class:: lenskit.splitting.Dataset

   Re-exported alias for :py:class:`lenskit.data.Dataset`.
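
The multi-timestamp behaviour documented for ``split_global_time`` can be illustrated in plain Python. The sketch below is not LensKit's implementation: ``global_time_splits`` is a hypothetical helper, records are plain ``(user, item, timestamp)`` tuples rather than a ``Dataset``, and ``end`` is assumed to be an exclusive bound that applies only to the last split.

```python
def global_time_splits(records, times, end=None):
    """Hypothetical helper (not part of LensKit) mimicking the documented
    semantics: split i trains on records before t_i and tests on records
    on or after t_i and before t_{i+1}.

    ``records`` is a list of (user, item, timestamp) tuples and ``times``
    is a sorted list of split points.
    """
    splits = []
    for i, t in enumerate(times):
        # Upper bound of the test window: the next split point, or ``end``
        # for the last split (None means no upper bound). Treating ``end``
        # as exclusive is an assumption of this sketch.
        upper = times[i + 1] if i + 1 < len(times) else end
        train = [r for r in records if r[2] < t]
        test = [
            r for r in records
            if r[2] >= t and (upper is None or r[2] < upper)
        ]
        splits.append((train, test))
    return splits


# Toy interaction log: (user, item, timestamp).
records = [
    ("u1", "i1", 1),
    ("u1", "i2", 5),
    ("u2", "i1", 3),
    ("u2", "i3", 8),
    ("u3", "i2", 10),
]

splits = global_time_splits(records, [4, 9])
for train, test in splits:
    # Prints "2 2" for the first split and "4 1" for the second.
    print(len(train), len(test))
```

With split points 4 and 9, the first test window covers timestamps from 4 up to but not including 9, and the second covers everything from 9 onward, so each split's training data always precedes its test data.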