lenskit.data.DatasetBuilder
===========================

.. py:class:: lenskit.data.DatasetBuilder(name = None)
   :canonical: lenskit.data._builder.DatasetBuilder

   Construct data sets from data and tables.

   :param name: The name of the new dataset, or a data container or dataset to
                use as the basis for building a new dataset.


   .. py:attribute:: schema
      :type:  lenskit.data.schema.DataSchema

      The data schema assembled so far.  Do not modify this schema directly.


   .. py:property:: name
      :type: str | None


      Get the dataset name.


   .. py:method:: entity_classes()

      Get the entity classes defined so far.


   .. py:method:: relationship_classes()

      Get the relationship classes defined so far.


   .. py:method:: record_count(class_name)

      Get the number of records for the specified entity or relationship class.


   .. py:method:: entity_id_type(name)

      Get the PyArrow data type for an entity class's identifiers.


   .. py:method:: add_entity_class(name)

      Add an entity class to the dataset.

      :param name: The name of the entity class.


   .. py:method:: add_relationship_class(name, entities, allow_repeats = True, interaction = False)

      Add a relationship class to the dataset.  This usually doesn't need to
      be called; :meth:`add_relationships` and :meth:`add_interactions` will
      automatically add the relationship class if needed.

      As noted in :ref:`data-model`, a *relationship* records a relationship
      or interaction between two or more *entities*.  Interactions are usually
      between users and items.  The ``entities`` option to this method defines
      the names of the entity classes participating.

      .. note::

          The order of entity classes in ``entities`` matters, as the
          relationship matrix logic
          (:meth:`lenskit.data.RelationshipSet.matrix`) will default to using
          the first and last entity classes as the rows and columns of the
          matrix.

      :param name: The name of the relationship class.
      :param entities: The entity classes participating in the relationship class.
      :param allow_repeats: Whether repeated records for the same combination of entities
                            are allowed.
      :param interaction: Whether this is an interaction relationship.
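
      The ordering rule in the note above can be illustrated with a small
      pure-Python sketch.  ``default_matrix_axes`` is a hypothetical helper
      written for this illustration, not part of LensKit; it only mirrors the
      documented "first and last entity class" default.

      .. code-block:: python

         def default_matrix_axes(entities):
             """Return (row_class, column_class) for an ordered list of entity classes."""
             if not entities:
                 raise ValueError("a relationship needs at least one entity class")
             # The first class indexes the rows, the last class indexes the columns.
             return entities[0], entities[-1]


         axes = default_matrix_axes(["user", "item"])
         swapped = default_matrix_axes(["item", "user"])
         three_way = default_matrix_axes(["user", "session", "item"])

      Listing ``["item", "user"]`` instead of ``["user", "item"]`` would
      therefore transpose the default matrix, which is why the order is worth
      fixing when the class is first defined.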


   .. py:method:: add_entities(cls: str, ids: lenskit.data.types.IDSequence | pyarrow.Array[Any] | pyarrow.ChunkedArray[Any], /, *, duplicates: DuplicateAction = 'error') -> None
                  add_entities(cls: str, frame: TableInput, /, *, duplicates: DuplicateAction = 'error') -> None

      Add entities to the data set.

      When called with a data frame or table, this method looks for
      entity IDs in the column ``{cls}_id`` (e.g. ``item_id``).  If no such
      column exists, and the data frame is a Pandas data frame, then entity IDs
      are taken from the data frame's index.  A warning is issued if the data
      frame index name is not ``{cls}_id``.

      :param cls: The name of the entity class (e.g. ``"item"``).
      :param source: The input data (the second positional argument, shown as ``ids``
                     or ``frame`` in the signatures above), as an array or list of
                     entity IDs, or a data frame of entities with attributes.
      :param duplicates: How to handle duplicate entity IDs.

      :raises DataError: When there is a fatal problem with the supplied data.

      :Warns: **DataWarning** -- When the data is valid but suspect, such as a data frame with no
              column or index named ``{cls}_id``.
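
      The ID lookup rules described above can be sketched in plain Python.
      This is an illustration of the documented behavior, not LensKit's
      implementation: ``resolve_id_source`` is a hypothetical helper,
      ``ValueError`` and ``UserWarning`` stand in for ``DataError`` and
      ``DataWarning``, and raising an error for an input with neither an ID
      column nor an index is an assumption.

      .. code-block:: python

         import warnings


         def resolve_id_source(cls, columns, index_name=None, has_index=True):
             """Decide where entity IDs come from for a tabular input.

             ``columns`` lists the column names; ``index_name`` and ``has_index``
             model a Pandas index (an Arrow table would pass ``has_index=False``).
             """
             id_col = f"{cls}_id"
             if id_col in columns:
                 return ("column", id_col)
             if not has_index:
                 # Stand-in for the library's DataError (assumed behavior).
                 raise ValueError(f"no {id_col} column and no index to fall back on")
             if index_name != id_col:
                 # Stand-in for the library's DataWarning.
                 warnings.warn(f"index is named {index_name!r}, not {id_col!r}", UserWarning)
             return ("index", index_name)


         from_column = resolve_id_source("item", ["item_id", "title"])
         with warnings.catch_warnings(record=True) as caught:
             warnings.simplefilter("always")
             from_index = resolve_id_source("item", ["title"], index_name="movie")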


   .. py:method:: add_relationships(cls, data, *, entities = None, missing = 'error', allow_repeats = True, interaction = False, _warning_parent = 0, remove_repeats = False)

      Add relationship records to the data set.

      This method adds relationship records, provided as a Pandas data frame
      or an Arrow table, to the data set being built.  The relationships can
      be of a new class (in which case it will be created), or new
      relationship records for an existing class.

      For each entity ``E`` participating in the relationship, the table must
      have a column named ``E_id`` storing the entity IDs.

      :param cls: The name of the relationship class (e.g. ``rating``,
                  ``purchase``).
      :param data: The relationship data.
      :param entities: The entity classes involved in this relationship class.
      :param missing: What to do when relationships reference nonexistent entities; can
                      be ``"error"`` or ``"insert"``.
      :param allow_repeats: Whether repeated relationship records are allowed.
      :param interaction: Whether this is an interaction relationship or not; can be
                          ``"default"`` to indicate this is the default interaction
                          relationship.
      :param remove_repeats: If ``True``, repeated interactions will be removed. If ``"exact"``,
                             duplicated interactions will be removed.
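
      The ``E_id`` column convention can be checked before calling this
      method.  The sketch below is a hypothetical pre-flight helper (not a
      LensKit function) that reports which ID columns a table is missing for
      a given list of entity classes.

      .. code-block:: python

         def missing_id_columns(entities, columns):
             """List the ``E_id`` columns that ``columns`` lacks for ``entities``."""
             required = [f"{e}_id" for e in entities]
             present = set(columns)
             return [c for c in required if c not in present]


         ok = missing_id_columns(["user", "item"], ["user_id", "item_id", "rating"])
         bad = missing_id_columns(["user", "item"], ["user", "item_id", "rating"])

      Here ``ok`` is empty, while ``bad`` reports ``user_id`` because the table
      names that column ``user`` instead.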


   .. py:method:: add_interactions(cls, data, *, entities = None, missing = 'error', allow_repeats = True, default = False, remove_repeats = False)

      Add interaction records to the data set.

      This method adds new interaction records, provided as a Pandas data
      frame or an Arrow table, to the data set being built.  The interactions
      can be of a new class (in which case it will be created), or new
      interactions for an existing class.

      For each entity ``E`` participating in the interaction, the table must
      have a column named ``E_id`` storing the entity IDs.

      Interactions should usually have user as the first entity and item as
      the last; the default interaction matrix logic uses the first and last
      entities as the rows and columns, respectively, of the interaction
      matrix.


      :param cls: The name of the interaction class (e.g. ``rating``,
                  ``purchase``).
      :param data: The interaction data.
      :param entities: The entity classes involved in this interaction class.
      :param missing: What to do when interactions reference nonexistent entities; can
                      be ``"error"`` or ``"insert"``.
      :param allow_repeats: Whether repeated interactions are allowed.
      :param default: If ``True``, set this as the default interaction class (if the
                      dataset has more than one interaction class).
      :param remove_repeats: If ``True``, repeated interactions will be removed. If ``"exact"``,
                             duplicated interactions will be removed.
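
      The two ``missing`` actions can be pictured with a small sketch.  It
      models only the documented choice between rejecting and inserting
      unknown entity IDs; ``check_references`` is a hypothetical helper, and
      ``ValueError`` stands in for whatever error the library raises.

      .. code-block:: python

         def check_references(known_ids, referenced_ids, missing="error"):
             """Return the entity IDs known after processing ``referenced_ids``."""
             # De-duplicate the references while keeping first-seen order.
             unknown = [i for i in dict.fromkeys(referenced_ids) if i not in known_ids]
             if not unknown:
                 return list(known_ids)
             if missing == "error":
                 raise ValueError(f"interactions reference unknown entities: {unknown}")
             if missing == "insert":
                 # Append the new IDs after the existing ones.
                 return list(known_ids) + unknown
             raise ValueError(f"unsupported missing action: {missing!r}")


         items = check_references([10, 20], [20, 30, 30, 40], missing="insert")

      With ``missing="insert"`` the unknown items 30 and 40 are added once
      each; with the default ``"error"`` the same call would raise instead.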


   .. py:method:: filter_interactions(cls, min_time = None, max_time = None, remove = None)

      Filter interactions based on timestamp or to remove particular entities.

      :param cls: The interaction class to filter.
      :param min_time: The minimum interaction time to keep (inclusive).
      :param max_time: The maximum interaction time to keep (exclusive).
      :param remove: Combinations of entity numbers or IDs to remove.  The entities
                     are filtered using an anti-join with this table, so providing a
                     single column of entity IDs or numbers will remove all
                     interactions associated with the listed entities.
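
      The filtering semantics (inclusive lower bound, exclusive upper bound,
      and anti-join removal) can be sketched over plain dictionaries.  This
      is an illustration only: ``filter_rows`` is hypothetical, and the
      ``timestamp`` column name is an assumption of the sketch.

      .. code-block:: python

         def filter_rows(rows, min_time=None, max_time=None, remove=None):
             """Keep rows inside [min_time, max_time) that match no ``remove`` entry.

             ``rows`` and ``remove`` are lists of dicts; a ``remove`` entry may
             name only some of the ID columns.
             """
             kept = []
             for row in rows:
                 t = row["timestamp"]
                 if min_time is not None and t < min_time:
                     continue  # before the inclusive lower bound
                 if max_time is not None and t >= max_time:
                     continue  # at or after the exclusive upper bound
                 # Anti-join: drop the row if every column of some entry matches.
                 if remove and any(all(row[k] == v for k, v in r.items()) for r in remove):
                     continue
                 kept.append(row)
             return kept


         rows = [
             {"user_id": 1, "item_id": 10, "timestamp": 100},
             {"user_id": 1, "item_id": 11, "timestamp": 200},
             {"user_id": 2, "item_id": 10, "timestamp": 150},
             {"user_id": 2, "item_id": 12, "timestamp": 300},
         ]
         windowed = filter_rows(rows, min_time=100, max_time=200)
         without_item = filter_rows(rows, remove=[{"item_id": 10}])

      A single-column ``remove`` entry such as ``{"item_id": 10}`` removes
      every interaction with that item, regardless of user.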


   .. py:method:: binarize_ratings(cls = 'rating', min_pos_rating = 3.0, method = 'remove')

      Binarize the ratings in a relationship class.

      :param cls: The relationship class to binarize (default: ``"rating"``).
      :param min_pos_rating: Minimum rating to consider as positive.
      :param method: ``"zero"`` to set ratings to 0/1, or ``"remove"`` to drop rows
                     below ``min_pos_rating``.
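
      The difference between the two methods can be sketched on a list of
      tuples.  ``binarize`` is a hypothetical stand-in, not the library code.
      What happens to the surviving rating values under ``"remove"`` is not
      stated here, so the sketch leaves them unchanged as an assumption.

      .. code-block:: python

         def binarize(ratings, min_pos_rating=3.0, method="remove"):
             """Binarize a list of (user, item, rating) tuples."""
             if method == "zero":
                 # Keep every row; ratings at or above the threshold become 1, others 0.
                 return [(u, i, 1 if r >= min_pos_rating else 0) for u, i, r in ratings]
             if method == "remove":
                 # Drop rows below the threshold; surviving values are left as-is here.
                 return [(u, i, r) for u, i, r in ratings if r >= min_pos_rating]
             raise ValueError(f"unknown method: {method!r}")


         ratings = [("u1", "a", 4.5), ("u1", "b", 2.0), ("u2", "a", 3.0)]
         zeroed = binarize(ratings, method="zero")
         kept = binarize(ratings, method="remove")

      ``"zero"`` preserves the number of rows, while ``"remove"`` shrinks the
      relationship class to the positive records only.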


   .. py:method:: clear_relationships(cls)

      Remove all records for a specified relationship class.


   .. py:method:: add_scalar_attribute(cls: str, name: str, data: pandas.Series[Any] | TableInput, /, *, dictionary: bool = False) -> None
                  add_scalar_attribute(cls: str, name: str, entities: lenskit.data.types.IDSequence | tuple[lenskit.data.types.IDSequence, Ellipsis], values: numpy.typing.ArrayLike, /, *, dictionary: bool = False) -> None

      Add a scalar attribute to an entity class.

      :param cls: The entity class name.
      :param name: The attribute name.
      :param entities: The IDs for the entities whose attribute should be set.
      :param values: The attribute values.
      :param data: A Pandas data frame or Arrow table storing entity IDs and
                   attribute values.
      :param dictionary: ``True`` to dictionary-encode the attribute values (saves space
                         for string categorical values).
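
      Dictionary encoding, as requested by ``dictionary=True``, stores each
      distinct value once and replaces the column with integer codes.  The
      sketch below shows the general technique in plain Python; it is not the
      encoding code the library uses, and ``dictionary_encode`` is a
      hypothetical helper.

      .. code-block:: python

         def dictionary_encode(values):
             """Return (dictionary, indices) for a sequence of hashable values."""
             codes = {}
             indices = []
             for v in values:
                 # Assign the next integer code the first time a value is seen.
                 indices.append(codes.setdefault(v, len(codes)))
             return list(codes), indices


         genres = ["drama", "comedy", "drama", "drama", "comedy"]
         dictionary, indices = dictionary_encode(genres)
         decoded = [dictionary[i] for i in indices]

      The saving comes from storing each repeated string once in
      ``dictionary`` and keeping only small integers per entity.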


   .. py:method:: add_list_attribute(cls: str, name: str, data: pandas.Series[Any] | TableInput, /, *, dictionary: bool = False) -> None
                  add_list_attribute(cls: str, name: str, entities: lenskit.data.types.IDSequence | tuple[lenskit.data.types.IDSequence, Ellipsis], values: numpy.typing.ArrayLike, /, *, dictionary: bool = False) -> None

      Add a list attribute to an entity class.

      :param cls: The entity class name.
      :param name: The attribute name.
      :param entities: The IDs for the entities whose attribute should be set.
      :param values: The attribute values (an array or list of lists).
      :param data: A Pandas data frame or Arrow table storing entity IDs and
                   attribute values.
      :param dictionary: ``True`` to dictionary-encode the attribute values (saves space
                         for string categorical values).


   .. py:method:: add_vector_attribute(cls, name, entities, values, /, dim_names = None)

      Add a vector attribute to a set of entities.

      .. warning::

          Dense vector attributes are stored densely, even for entities for
          which they are not set. High-dimensional vectors can therefore take
          up a lot of space.

      :param cls: The entity class name.
      :param name: The attribute name.
      :param entities: The entity IDs to which the attribute should be attached.
      :param values: The attribute values, as a fixed-length list array or a
                     two-dimensional NumPy array (for dense vector attributes) or a
                     SciPy sparse array (for sparse vector attributes).
      :param dim_names: The names for the dimensions of the array.
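
      The space cost mentioned in the warning can be estimated with simple
      arithmetic.  This is a rough sketch under stated assumptions: 4-byte
      values and no metadata overhead, neither of which is specified here.
      ``dense_vector_bytes`` is a hypothetical helper.

      .. code-block:: python

         def dense_vector_bytes(n_entities, dim, bytes_per_value=4):
             """Approximate size of a dense vector attribute, ignoring metadata."""
             # Every entity in the class gets a full row, whether or not it was set.
             return n_entities * dim * bytes_per_value


         size = dense_vector_bytes(n_entities=1_000_000, dim=512)

      Under these assumptions a 512-dimensional attribute on one million
      entities takes about 2 GB, even if only a few entities have it set.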


   .. py:method:: build()

      Build the dataset.


   .. py:method:: build_container()

      Build a data container (backing store for a dataset).


   .. py:method:: save(path)

      Save the dataset to disk in the LensKit native format.

      :param path: The path where the dataset will be saved (will be created as a
                   directory).