lenskit.data.DatasetBuilder
===========================

.. py:class:: lenskit.data.DatasetBuilder(name = None)
   :canonical: lenskit.data._builder.DatasetBuilder

   Construct data sets from data and tables.

   :param name: The name of the new dataset, or a data container or dataset
      to use as the basis for building a new dataset.

   .. py:attribute:: schema
      :type: lenskit.data.schema.DataSchema

      The data schema assembled so far. Do not modify this schema directly.

   .. py:property:: name
      :type: str | None

      Get the dataset name.

   .. py:method:: entity_classes()

      Get the entity classes defined so far.

   .. py:method:: relationship_classes()

      Get the relationship classes defined so far.

   .. py:method:: record_count(class_name)

      Get the number of records for the specified entity or relationship
      class.

   .. py:method:: entity_id_type(name)

      Get the PyArrow data type for an entity class's identifiers.

   .. py:method:: add_entity_class(name)

      Add an entity class to the dataset.

      :param name: The name of the entity class.

   .. py:method:: add_relationship_class(name, entities, allow_repeats = True, interaction = False)

      Add a relationship class to the dataset. This usually doesn't need to
      be called; :meth:`add_relationships` and :meth:`add_interactions` will
      automatically add the relationship class if needed.

      As noted in :ref:`data-model`, a *relationship* records a relationship
      or interaction between two or more *entities*. Interactions are usually
      between users and items. The ``entities`` option to this method defines
      the names of the entity classes participating.

      .. note::

         The order of entity classes in ``entities`` matters, as the
         relationship matrix logic
         (:meth:`lenskit.data.RelationshipSet.matrix`) will default to using
         the first and last entity classes as the rows and columns of the
         matrix.

      :param name: The name of the relationship class.
      :param entities: The entity classes participating in the relationship
         class.
      :param allow_repeats: Whether repeated records for the same combination
         of entities are allowed.
      :param interaction: Whether this is an interaction relationship.

   .. py:method:: add_entities(cls: str, ids: lenskit.data.types.IDSequence | pyarrow.Array[Any] | pyarrow.ChunkedArray[Any], /, *, duplicates: DuplicateAction = 'error') -> None
                  add_entities(cls: str, frame: TableInput, /, *, duplicates: DuplicateAction = 'error') -> None

      Add entities to the data set.

      When called with a data frame or table, this method looks for entity
      IDs in the column ``{cls}_id`` (e.g. ``item_id``). If no such column
      exists, and the data frame is a Pandas data frame, then entity IDs are
      taken from the data frame's index. A warning is issued if the data
      frame index name is not ``{cls}_id``.

      :param cls: The name of the entity class (e.g. ``"item"``).
      :param source: The input data, as an array or list of entity IDs, or a
         data frame of entities with attributes.
      :param duplicates: How to handle duplicate entity IDs.

      :raises DataError: When there is a fatal problem with the supplied
         data.

      :Warns: **DataWarning** -- When the data is valid but suspect, such as
         a data frame with no column or index named ``{cls}_id``.

   .. py:method:: add_relationships(cls, data, *, entities = None, missing = 'error', allow_repeats = True, interaction = False, _warning_parent = 0, remove_repeats = False)

      Add relationship records to the data set.

      This method adds relationship records, provided as a Pandas data frame
      or an Arrow table, to the data set being built. The relationships can
      be of a new class (in which case it will be created), or new
      relationship records for an existing class.

      For each entity ``E`` participating in the relationship, the table must
      have a column named ``E_id`` storing the entity IDs.

      :param cls: The name of the relationship class (e.g. ``rating``,
         ``purchase``).
      :param data: The relationship data.
      :param entities: The entity classes involved in this relationship
         class.
      :param missing: What to do when relationship records reference
         nonexistent entities; can be ``"error"`` or ``"insert"``.
      :param allow_repeats: Whether repeated interactions are allowed.
      :param interaction: Whether this is an interaction relationship or not;
         can be ``"default"`` to indicate this is the default interaction
         relationship.
      :param remove_repeats: If ``True``, repeated interactions will be
         removed. If ``"exact"``, duplicated interactions will be removed.

   .. py:method:: add_interactions(cls, data, *, entities = None, missing = 'error', allow_repeats = True, default = False, remove_repeats = False)

      Add interaction records to the data set.

      This method adds new interaction records, provided as a Pandas data
      frame or an Arrow table, to the data set being built. The interactions
      can be of a new class (in which case it will be created), or new
      interactions for an existing class.

      For each entity ``E`` participating in the interaction, the table must
      have a column named ``E_id`` storing the entity IDs.

      Interactions should usually have user as the first entity and item as
      the last; the default interaction matrix logic uses the first and last
      entities as the rows and columns, respectively, of the interaction
      matrix.

      :param cls: The name of the interaction class (e.g. ``rating``,
         ``purchase``).
      :param data: The interaction data.
      :param entities: The entity classes involved in this interaction class.
      :param missing: What to do when interactions reference nonexistent
         entities; can be ``"error"`` or ``"insert"``.
      :param allow_repeats: Whether repeated interactions are allowed.
      :param default: If ``True``, set this as the default interaction class
         (if the dataset has more than one interaction class).
      :param remove_repeats: If ``True``, repeated interactions will be
         removed. If ``"exact"``, duplicated interactions will be removed.

   .. py:method:: filter_interactions(cls, min_time = None, max_time = None, remove = None)

      Filter interactions based on timestamp or to remove particular
      entities.

      :param cls: The interaction class to filter.
      :param min_time: The minimum interaction time to keep (inclusive).
      :param max_time: The maximum interaction time to keep (exclusive).
      :param remove: Combinations of entity numbers or IDs to remove. The
         interactions are filtered using an anti-join with this table, so
         providing a single column of entity IDs or numbers will remove all
         interactions associated with the listed entities.

   .. py:method:: binarize_ratings(cls = 'rating', min_pos_rating = 3.0, method = 'remove')

      Binarize the ratings in a relationship class.

      :param cls: The relationship class to binarize (default:
         ``"rating"``).
      :param min_pos_rating: Minimum rating to consider as positive.
      :param method: ``'zero'`` to set ratings to 0/1, ``'remove'`` to drop
         rows below ``min_pos_rating``.

   .. py:method:: clear_relationships(cls)

      Remove all records for a specified relationship class.

   .. py:method:: add_scalar_attribute(cls: str, name: str, data: pandas.Series[Any] | TableInput, /, *, dictionary: bool = False) -> None
                  add_scalar_attribute(cls: str, name: str, entities: lenskit.data.types.IDSequence | tuple[lenskit.data.types.IDSequence, Ellipsis], values: numpy.typing.ArrayLike, /, *, dictionary: bool = False) -> None

      Add a scalar attribute to an entity class.

      :param cls: The entity class name.
      :param name: The attribute name.
      :param entities: The IDs for the entities whose attribute should be
         set.
      :param values: The attribute values.
      :param data: A Pandas data frame or Arrow table storing entity IDs and
         attribute values.
      :param dictionary: ``True`` to dictionary-encode the attribute values
         (saves space for string categorical values).

   .. py:method:: add_list_attribute(cls: str, name: str, data: pandas.Series[Any] | TableInput, /, *, dictionary: bool = False) -> None
                  add_list_attribute(cls: str, name: str, entities: lenskit.data.types.IDSequence | tuple[lenskit.data.types.IDSequence, Ellipsis], values: numpy.typing.ArrayLike, /, *, dictionary: bool = False) -> None

      Add a list attribute to an entity class.

      :param cls: The entity class name.
      :param name: The attribute name.
      :param entities: The IDs for the entities whose attribute should be
         set.
      :param values: The attribute values (an array or list of lists).
      :param data: A Pandas data frame or Arrow table storing entity IDs and
         attribute values.
      :param dictionary: ``True`` to dictionary-encode the attribute values
         (saves space for string categorical values).

   .. py:method:: add_vector_attribute(cls, name, entities, values, /, dim_names = None)

      Add a vector attribute to a set of entities.

      .. warning::

         Dense vector attributes are stored densely, even for entities for
         which they are not set. High-dimensional vectors can therefore take
         up a lot of space.

      :param cls: The entity class name.
      :param name: The attribute name.
      :param entities: The entity IDs to which the attribute should be
         attached.
      :param values: The attribute values, as a fixed-length list array or a
         two-dimensional NumPy array (for dense vector attributes) or a SciPy
         sparse array (for sparse vector attributes).
      :param dim_names: The names for the dimensions of the array.

   .. py:method:: build()

      Build the dataset.

   .. py:method:: build_container()

      Build a data container (backing store for a dataset).

   .. py:method:: save(path)

      Save the dataset to disk in the LensKit native format.

      :param path: The path where the dataset will be saved (will be created
         as a directory).
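
The ``filter_interactions`` parameters described above combine a half-open time window (``min_time`` inclusive, ``max_time`` exclusive) with an anti-join against ``remove``. The following plain-Python sketch models only those documented semantics. It does not call LensKit and is not its implementation: the ``filter_rows`` helper, the list-of-dicts layout, and the ``timestamp`` key are assumptions made for illustration.

```python
# Plain-Python model of the documented filter semantics; NOT LensKit code.
# `filter_rows` and the dict-based "tables" are hypothetical stand-ins.


def filter_rows(rows, min_time=None, max_time=None, remove=None):
    """Keep rows inside [min_time, max_time) that match no row of ``remove``.

    A row is dropped when it agrees with some ``remove`` row on every key
    that the ``remove`` row contains (an anti-join on those columns).
    """
    kept = []
    for row in rows:
        ts = row["timestamp"]
        if min_time is not None and ts < min_time:
            continue  # earlier than the inclusive lower bound
        if max_time is not None and ts >= max_time:
            continue  # at or after the exclusive upper bound
        if remove and any(
            all(row.get(key) == value for key, value in r.items()) for r in remove
        ):
            continue  # matched a row of the removal table
        kept.append(row)
    return kept


interactions = [
    {"user_id": "u1", "item_id": "i1", "timestamp": 10},
    {"user_id": "u1", "item_id": "i2", "timestamp": 20},
    {"user_id": "u2", "item_id": "i1", "timestamp": 30},
    {"user_id": "u2", "item_id": "i3", "timestamp": 40},
]

# Time window: 20 is kept (inclusive), 40 is dropped (exclusive).
windowed = filter_rows(interactions, min_time=20, max_time=40)

# Single-column removal: every interaction involving item "i1" is dropped.
without_i1 = filter_rows(interactions, remove=[{"item_id": "i1"}])

# Two-column removal: only the exact (u1, i1) combination is dropped.
without_pair = filter_rows(
    interactions, remove=[{"user_id": "u1", "item_id": "i1"}]
)
```

The single-column case shows why a list of item IDs alone removes all interactions for those items, while a user/item pair removes only that specific combination.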