lenskit.data.Dataset

class lenskit.data.Dataset(data)

Representation of a data set for LensKit training, evaluation, etc. Data can be accessed in a variety of formats depending on the needs of a component. See Data Model for details of the LensKit data model.

Dataset objects should not be directly constructed; instead, use a DatasetBuilder, load(), or from_interactions_df().

Note

Zero-copy conversions are used whenever possible, so client code must not modify returned data in-place.
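The note above implies a copy-before-mutate discipline. The following is a minimal sketch of that idea, assuming only the interaction_table(format="numpy") method documented on this page; the helper name private_table is hypothetical and not part of LensKit.

```python
import copy


def private_table(dataset, fields=None):
    """Return a deep copy of the interaction table that is safe to modify.

    `dataset` is any object exposing the documented
    `interaction_table(format="numpy", fields=...)` method.
    `private_table` itself is a hypothetical helper, not a LensKit API.
    """
    # The returned dict may hold zero-copy views of the dataset's own storage.
    shared = dataset.interaction_table(format="numpy", fields=fields)
    # Deep-copy each column so in-place edits cannot reach the shared data.
    return {name: copy.deepcopy(column) for name, column in shared.items()}
```

Copying costs memory, so it is only worth doing when the caller really needs to modify the data; read-only consumers can use the returned views directly.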

Parameters: data (lenskit.data._container.DataContainer | collections.abc.Callable[[], lenskit.data._container.DataContainer | Dataset]) – The container for this dataset’s data, or a function that will return such a container to create a lazy-loaded dataset.

Stability: Caller

This API is at the caller stability level: breaking changes for code calling this function or class will be reserved for annual major version bumps, but minor versions may introduce changes that break subclasses or reimplementations. See Stability Levels for details.

classmethod load(path)

Load a dataset in the LensKit native format.

Parameters: path (str | os.PathLike[str]) – The path to the dataset to load.
Returns: The loaded dataset.
Return type: Dataset

save(path)

Save the data set in the LensKit native format.

Parameters: path (str | os.PathLike[str]) – The path in which to save the data set (will be created as a directory).
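Because save() and load() use the same native format, they can back a simple on-disk cache. The sketch below is one possible pattern, not a LensKit facility: load_or_build, build, and load are hypothetical names, and the caller is assumed to pass Dataset.load as load and a function that constructs the dataset (for example with a DatasetBuilder) as build.

```python
from pathlib import Path


def load_or_build(path, build, load):
    """Load a saved dataset from `path`, building and saving it first if needed.

    `build` is a zero-argument callable returning a dataset (hypothetical);
    `load` is the loader to use, typically `Dataset.load`.
    """
    path = Path(path)
    if path.exists():
        # A previous run already saved the dataset here.
        return load(path)
    dataset = build()
    # save() creates `path` as a directory, per the description above.
    dataset.save(path)
    return dataset
```

Passing the loader as an argument keeps the helper independent of any particular import path and makes it easy to exercise with a stand-in object.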

property name: str | None

Get the dataset’s name.

Return type: str | None

property schema: lenskit.data.schema.DataSchema

Get the schema of this dataset.

Return type: lenskit.data.schema.DataSchema

property items: lenskit.data._vocab.Vocabulary

The items known by this dataset.

Return type: lenskit.data._vocab.Vocabulary

property users: lenskit.data._vocab.Vocabulary

The users known by this dataset.

Return type: lenskit.data._vocab.Vocabulary

property item_count: int

Return type: int

property user_count: int

Return type: int

entities(name)

Get the entities of a particular type / class.

Parameters: name (str)
Return type: lenskit.data._entities.EntitySet

relationships(name)

Get the relationship records of a particular type / class.

Parameters: name (str)
Return type: lenskit.data._relationships.RelationshipSet

interactions(name=None)

Get the interaction records of a particular class. If no class is specified, returns the default interaction class.

Parameters: name (str | None)
Return type: lenskit.data._relationships.RelationshipSet

default_interaction_class()

Return type: str

property interaction_count: int

Count the total number of interactions of the default class, taking into account any count attribute.

Return type: int

abstractmethod interaction_table(*, format: Literal['pandas'], fields: str | list[str] | None = None, original_ids: bool = False) → pandas.DataFrame

abstractmethod interaction_table(*, format: Literal['numpy'], fields: str | list[str] | None = None) → dict[str, numpy.typing.NDArray[Any]]

abstractmethod interaction_table(*, format: Literal['arrow'], fields: str | list[str] | None = None) → pyarrow.Table

Get the user-item interactions as a table in the requested format. The table is not in a specified order. Interactions may be repeated (e.g. the same user may listen to a song multiple times). For a non-repeated “ratings matrix” view of the data, see interaction_matrix().

This is a convenience wrapper on top of interactions() and the methods of RelationshipSet.

Warning

Client code must not perform in-place modifications on the table returned from this method. Whenever possible, it will be a shallow view on top of the underlying storage, and modifications may corrupt data for other code.

Parameters:

format –
The desired data format. Currently-supported formats are:
- "pandas" — returns a pandas.DataFrame. The index is not meaningful.
- "arrow" — returns a PyArrow Table.
- "numpy" — returns a dictionary mapping names to arrays.
fields – Which fields (attributes) to include, or None to include all fields. Commonly-available fields include "rating" and "timestamp".
original_ids – If True, return user and item IDs as represented in the original source data in columns named user_id and item_id, instead of the user and item numbers typically returned.

Returns:

The user-item interaction log in the specified format.
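Since the table has no specified order and may repeat a user-item pair, code that consumes it should aggregate rather than rely on row position. The sketch below counts interaction records per user from the "numpy" format. The column name "user_num" is an assumption about how the user numbers are labelled, and interactions_per_user is a hypothetical helper.

```python
from collections import Counter


def interactions_per_user(dataset):
    """Count interaction records per user number, including repeats.

    `dataset` is any object exposing the documented
    `interaction_table(format="numpy")` method.
    """
    table = dataset.interaction_table(format="numpy")
    # "user_num" is an assumed column name for the user numbers.
    return Counter(int(user) for user in table["user_num"])
```

A user who listened to the same song three times contributes three to that user's count here, which is the difference from the coalesced view returned by interaction_matrix().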

abstractmethod interaction_matrix(*, format: Literal['pandas'], field: str | None = None, original_ids: bool = False) → pandas.DataFrame

abstractmethod interaction_matrix(*, format: Literal['torch'], layout: Literal['csr', 'coo'] = 'csr', field: str | None = None) → torch.Tensor

abstractmethod interaction_matrix(*, format: Literal['scipy'], layout: Literal['coo'], field: str | None = None) → scipy.sparse.coo_array

abstractmethod interaction_matrix(*, format: Literal['scipy'], layout: Literal['csr'] = 'csr', field: str | None = None) → scipy.sparse.csr_array

abstractmethod interaction_matrix(*, format: Literal['structure'], layout: Literal['csr'] = 'csr') → lenskit.data.matrix.CSRStructure

Get the user-item interactions as a “ratings” matrix from the default interaction class. Interactions are not repeated, and are coalesced with the default coalescing strategy for each attribute.

The matrix may be returned in “coordinate” format, in which case it is comparable to interaction_table() but without repeated interactions, or it may be in a compressed sparse row format.

This is a convenience wrapper on top of interactions() and the methods of MatrixRelationshipSet.

Warning

Client code must not perform in-place modifications on the matrix returned from this method. Whenever possible, it will be a shallow view on top of the underlying storage, and modifications may corrupt data for other code.

Parameters:

format –
The desired data format. Currently-supported formats are:
- "pandas" — returns a pandas.DataFrame.
- "torch" — returns a sparse torch.Tensor (see torch.sparse).
- "scipy" — returns a sparse array from scipy.sparse.
- "structure" — returns a CSRStructure containing only the user and item numbers in compressed sparse row format.
field –
Which field to return in the matrix. Common fields include "rating" and "timestamp".

If unspecified (None), this will yield an implicit-feedback indicator matrix, with 1s for observed items, except for the "pandas" format, which will return all attributes. Specify an empty list to return a Pandas data frame with only the user and item attributes.
layout – The layout for a sparse matrix. Can be either csr or coo, or None to use the default for the specified format. Ignored for the Pandas format.
original_ids – True to return user and item IDs instead of numbers in a pandas-format matrix.
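To make the relationship between the coordinate and compressed sparse row views concrete, here is a plain-Python sketch that collapses repeated (user, item) pairs and builds row pointers and column indices, roughly the information a "structure" result carries. It illustrates the layout only and is not LensKit's implementation; csr_structure, row_ptrs, and col_inds are hypothetical names.

```python
def csr_structure(user_nums, item_nums, n_users):
    """Build CSR row pointers and column indices from coordinate pairs.

    Repeated (user, item) pairs are collapsed to a single entry.
    """
    pairs = sorted(set(zip(user_nums, item_nums)))
    row_ptrs = [0] * (n_users + 1)
    for user, _item in pairs:
        # Count the entries that belong to each row.
        row_ptrs[user + 1] += 1
    for user in range(n_users):
        # A running total turns per-row counts into row offsets.
        row_ptrs[user + 1] += row_ptrs[user]
    col_inds = [item for _user, item in pairs]
    return row_ptrs, col_inds


# Three users; user 0 interacted with item 2 twice, user 2 has no interactions.
row_ptrs, col_inds = csr_structure([0, 0, 1, 0], [2, 0, 1, 2], n_users=3)
print(row_ptrs)  # [0, 2, 3, 3]
print(col_inds)  # [0, 2, 1]
```

In this layout the items for user u are col_inds[row_ptrs[u]:row_ptrs[u + 1]], which is why the CSR form suits per-user row access while the coordinate form is closer to the interaction table.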

abstractmethod user_row(user_id: lenskit.data.types.ID) → lenskit.data._items.ItemList | None

abstractmethod user_row(*, user_num: int) → lenskit.data._items.ItemList

Get a user’s row from the interaction matrix for the default interaction class, using default coalescing for repeated interactions. Available interaction attributes are returned as fields of the item list. If the dataset has ratings, these are provided as a rating field, not as the item scores. The item list is unordered (it does not represent a ranking), but its items are returned in order by item number.

Parameters:

user_id – The ID of the user to retrieve.
user_num – The number of the user to retrieve.

Returns:

The user’s interaction matrix row, or None if no user with that ID exists.
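Callers that look users up by ID need to handle the None result for unknown users. A minimal sketch, assuming the returned item list supports len(); history_size is a hypothetical helper, not part of LensKit.

```python
def history_size(dataset, user_id):
    """Return the number of items in a user's row, or 0 for an unknown user.

    `dataset` is any object exposing the documented `user_row(user_id)` method.
    """
    row = dataset.user_row(user_id)
    if row is None:
        # No user with this ID exists in the dataset.
        return 0
    # Assumes the returned item list supports len().
    return len(row)
```

Because the row comes from the coalesced interaction matrix, each item is counted once even if the user interacted with it several times.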

item_stats()

Get item statistics from the default interaction class.

Returns:

A data frame indexed by item ID with the interaction statistics. See Interaction Statistics for a description of the columns returned.

The index is aligned with the vocabulary, so iloc works with item numbers.

Return type:

pandas.DataFrame

user_stats()

Get user statistics from the default interaction class.

Returns:

A data frame indexed by user ID with the interaction statistics. See Interaction Statistics for a description of the columns returned.

The index is aligned with the vocabulary, so iloc works with user numbers.

Return type:

pandas.DataFrame

__getstate__()

Return type: DatasetState

__setstate__(state)

Parameters: state (DatasetState)

__str__()

Return type: str

__repr__()

Return type: str
