Datasets#
LensKit provides a unified data model for recommender systems data along with classes and utility functions for working with it, described in this section of the manual.
Changed in version 2025.1: The new Dataset class replaces the Pandas data frames
that were passed to algorithms in the past. It also subsumes
the old support for producing sparse matrices from rating frames.
Getting started with the dataset is fairly straightforward:
>>> from lenskit.data import load_movielens
>>> mlds = load_movielens('data/ml-latest-small')
>>> mlds.item_count
9125
You can then access the data from the various methods of the Dataset class.
For example, if you want to get the ratings as a data frame:
>>> mlds.interaction_matrix(format='pandas', field='rating')
   user_num  item_num  rating
0         0        30     2.5
1         0       833     3.0
2         0       859     3.0
3         0       906     2.0
4         0       931     4.0
...
[100004 rows x 3 columns]
Or obtain item statistics:
>>> mlds.item_stats()
         record_count  user_count  ...           first_time            last_time
item_id                            ...
1                 247         247  ...  1996-03-30 19:00:13  2016-10-06 19:55:11
2                 107         107  ...  1996-03-30 19:12:30  2016-08-01 17:42:33
3                  59          59  ...  1996-06-05 06:19:04  2016-08-16 22:07:21
4                  13          13  ...  1996-06-10 16:45:35  2004-07-27 06:14:12
5                  56          56  ...  1996-04-14 14:23:59  2016-08-16 22:15:47
...
[9125 rows x 7 columns]
Data Model and Key Concepts#
The LensKit data model, detailed in Data Model, consists of entities (often users and items) and interactions, with attributes providing additional (optional) data about each of these entities. The simplest valid LensKit data set is a list of user and item identifiers indicating which items each user has interacted with. These may be augmented with ratings, timestamps, or any other attributes.
Data can be read from a range of sources, but ultimately resolves to a
collection of tables (e.g. Pandas DataFrame) that record user,
item, and interaction data.
Identifiers#
Users and items have two identifiers:
The identifier as presented in the original source table(s). It appears in LensKit data frames as user_id and item_id columns. Identifiers can be integers, strings, or byte arrays, and are represented in LensKit by the ID type.
The number assigned by the dataset handling code. This is a 0-based contiguous user or item number that is suitable for indexing into arrays or matrices, a common operation in recommendation models. In data frames, this appears as a user_num or item_num column. It is the only representation supported by NumPy and PyTorch array formats.
User and item numbers are assigned based on identifiers in the initial data source. Adding all entities at once, or using one of the standard loaders, will sort the identifiers before assigning numbers, so reloading the same data set will yield the same numbers. Loading a subset, however, is not guaranteed to result in the same numbers, as the subset may be missing some users or items.
Adding more users or items to a data set builder assigns numbers to the identifiers that do not yet have them, in sorted order.
Identifiers and numbers can be mapped to each other with the user and item
vocabularies (users and items, see the
Vocabulary class).
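The numbering rules above can be illustrated with a minimal pure-Python sketch. It does not use the LensKit Vocabulary class; build_vocabulary and extend_vocabulary are hypothetical helpers, and the assumption that newly added identifiers are numbered after the existing ones is made for illustration only.

```python
def build_vocabulary(ids):
    """Assign 0-based contiguous numbers to identifiers in sorted order."""
    return {ident: num for num, ident in enumerate(sorted(set(ids)))}


def extend_vocabulary(vocab, new_ids):
    """Number unseen identifiers in sorted order, after the existing ones.

    Assumption for illustration: existing numbers are never reassigned.
    """
    extended = dict(vocab)
    for ident in sorted(set(new_ids) - set(vocab)):
        extended[ident] = len(extended)
    return extended


item_ids = [50, 10, 30, 20]

# Input order does not matter, because identifiers are sorted first.
full = build_vocabulary(item_ids)
reloaded = build_vocabulary(reversed(item_ids))

# A subset is missing some items, so the same identifier gets another number.
subset = build_vocabulary([50, 30])

# Only identifiers without a number (5 and 40) receive new numbers.
grown = extend_vocabulary(full, [40, 5, 10])

# Reverse mapping: position in this list is the number, value is the identifier.
numbers_to_ids = sorted(full, key=full.get)
```

Here item 50 has number 3 in the full vocabulary but number 1 in the subset, which is why numbers should not be compared across differently loaded data sets.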
Dataset Abstraction#
The LensKit Dataset class is the standard LensKit interface to datasets
for training, evaluation, etc. Trainable models and components expect a dataset
instance to be passed to train().
Datasets provide several views of different aspects of a dataset, documented in
more detail in the reference documentation. These include:
Sets of known user and item identifiers, through Vocabulary objects exposed through the Dataset.users and Dataset.items properties.
Access to the entities and relationships (including interactions) defined in the dataset.
Analyzing Interactions#
Dataset allows client code to obtain interactions between entities
(such as users rating items), or other inter-entity relationships, in a variety
of formats (including Pandas data frames and SciPy or PyTorch sparse matrices).
The RelationshipSet and MatrixRelationshipSet classes provide
the primary interfaces to these capabilities.
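To show why contiguous user and item numbers matter for the sparse matrix formats, the following sketch builds compressed sparse row (CSR) arrays from numbered interactions in plain Python. It is not how LensKit builds its matrices; to_csr and the sample data are hypothetical, and in practice the Dataset matrix methods return SciPy or PyTorch structures directly.

```python
def to_csr(rows, cols, values, n_rows):
    """Build CSR arrays (row pointers, column indices, values) from triples."""
    # Order the entries by row, then by column within each row.
    order = sorted(range(len(rows)), key=lambda i: (rows[i], cols[i]))

    # Count the entries per row, then accumulate into row pointers.
    indptr = [0] * (n_rows + 1)
    for r in rows:
        indptr[r + 1] += 1
    for r in range(n_rows):
        indptr[r + 1] += indptr[r]

    indices = [cols[i] for i in order]
    data = [values[i] for i in order]
    return indptr, indices, data


# Hypothetical interactions, already expressed as user and item numbers.
user_nums = [0, 2, 0, 1, 2]
item_nums = [1, 0, 3, 2, 3]
ratings = [4.0, 3.5, 2.0, 5.0, 1.0]

indptr, indices, data = to_csr(user_nums, item_nums, ratings, n_rows=3)

# The items of user number 2 are a contiguous slice of the column indices.
user2_items = indices[indptr[2]:indptr[3]]
```

Because the user number is used directly as a row index, looking up one user's interactions is a slice rather than a search over identifiers.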
Interaction Statistics#
Datasets also provide cached access to various statistics of the entities
involved in an interaction class. These are currently exposed through
MatrixRelationshipSet.row_stats() and
col_stats(); for convenience, the statistics from
the default interaction class are available on Dataset.user_stats() and
Dataset.item_stats().
These statistics include:
count
    The total number of relationships for the entity.
record_count
    The number of relationship or interaction records for the entity. This is equal to count, unless the relationship type has a count attribute, in which case record_count is the number of records and count is the total number of interactions.
<other>_count
    The number of distinct entities of type <other> this entity has interacted with. For example, the user statistics of a normal user-item interaction type will have an item_count column.
rating_count
    The number of explicit rating values (only defined if the interaction type has a rating attribute).
mean_rating
    The mean rating provided by or for this entity (only defined if the interaction type has a rating attribute).
first_time
    The first recorded timestamp for this entity’s interactions (only defined if the interaction type has a timestamp attribute).
last_time
    The last recorded timestamp for this entity’s interactions (only defined if the interaction type has a timestamp attribute).
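As a rough illustration of what these statistics mean, the sketch below computes them per user from a small list of hypothetical interaction records in plain Python. It is not the LensKit implementation; the user_stats function here is a stand-alone helper, timestamps are plain integers, and the records have no count attribute, so count and record_count coincide.

```python
from collections import defaultdict

# Hypothetical records: (user_id, item_id, rating or None, timestamp).
interactions = [
    ("u1", "i1", 4.0, 100),
    ("u1", "i2", None, 250),
    ("u2", "i1", 3.0, 50),
    ("u1", "i1", 5.0, 400),
]


def user_stats(records):
    """Compute per-user statistics matching the definitions listed above."""
    by_user = defaultdict(list)
    for user, item, rating, ts in records:
        by_user[user].append((item, rating, ts))

    stats = {}
    for user, rows in by_user.items():
        ratings = [r for _, r, _ in rows if r is not None]
        times = [t for _, _, t in rows]
        stats[user] = {
            # No count attribute in this sketch, so the two are equal.
            "count": len(rows),
            "record_count": len(rows),
            # Distinct items, not records: repeated items count once.
            "item_count": len({i for i, _, _ in rows}),
            "rating_count": len(ratings),
            "mean_rating": sum(ratings) / len(ratings) if ratings else None,
            "first_time": min(times),
            "last_time": max(times),
        }
    return stats


stats = user_stats(interactions)
```

User u1 has three records but only two distinct items, which shows the difference between record_count and item_count.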
Creating Datasets#
Several functions and classes can create a Dataset from different input
data sources.
Construct data sets from data and tables.
Create a dataset from a data frame of ratings or other user-item interactions.
Loading Common Datasets#
LensKit also provides support for loading several common data sets directly from their source files.
load_movielens(): Load a MovieLens dataset.
Saving Datasets#
LensKit has a native dataset format that datasets can be saved to and loaded from. This format fully represents the internal data structures. See the following methods to use it:
Save the data set in the LensKit native format.
Load a dataset in the LensKit native format.
Save the dataset to disk in the LensKit native format.
Compatibility
The LensKit native dataset format code maintains the following compatibility guarantees:
LensKit can read datasets saved with any earlier minor version in the same major-version series (e.g. 2025.2 can read from 2025.1).
LensKit can usually read datasets saved with a later minor version, but this is not fully guaranteed.
LensKit will read datasets saved with any prior version on a best-effort basis. We may in the future upgrade this to guarantee full backwards compatibility for reading older dataset versions.
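The three rules above can be read as a small decision function. The sketch below is only an interpretation of that policy, not LensKit code: read_support is a hypothetical helper, it assumes version strings of the form "major.minor" such as "2025.2", and it treats a dataset saved by a later major version as unspecified because the rules do not cover that case.

```python
def read_support(saved_version, reader_version):
    """Classify how well reader_version can read a dataset saved by saved_version."""
    saved_major, saved_minor = (int(p) for p in saved_version.split(".")[:2])
    reader_major, reader_minor = (int(p) for p in reader_version.split(".")[:2])

    if saved_major == reader_major:
        if saved_minor <= reader_minor:
            # Same series, same or earlier minor version.
            return "guaranteed"
        # Same series, later minor version.
        return "usually"
    if saved_major < reader_major:
        # Any prior major-version series.
        return "best-effort"
    # Not covered by the stated guarantees.
    return "unspecified"


same_series_older = read_support("2025.1", "2025.2")
same_series_newer = read_support("2025.3", "2025.2")
older_series = read_support("2025.1", "2026.1")
```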