Flexible Matrix Factorization#

Stability: Experimental

The FlexMF model framework is currently provided as an experimental preview. It performs well in our testing, but its interfaces and behavior may be adjusted as we stabilize it and gain more experience over the coming months.

The LensKit FlexMF (Flexible Matrix Factorization) family of models uses matrix factorization in various configurations to realize several scoring models from the literature in a single configurable design, implemented in PyTorch with support for GPU-based training.

The FlexMF components and configuration are in the lenskit.flexmf package.

First Model#

FlexMF works like any other LensKit scorer. To train a simple implicit-feedback scorer with logistic matrix factorization [Joh14], you can do:

>>> from lenskit.flexmf import FlexMFImplicitScorer
>>> from lenskit.data import load_movielens
>>> from lenskit import topn_pipeline, recommend
>>> # load movie data
>>> data = load_movielens('data/ml-latest-small')
>>> # set up model
>>> model = FlexMFImplicitScorer(embedding_size=32, loss="logistic")
>>> pipe = topn_pipeline(model, n=10)
>>> # train the model
>>> pipe.train(data)
>>> # recommend for user 500
>>> recommend(pipe, 500)
<ItemList of 10 items with 2 fields {
  ids: ...
  numbers: [...]
  rank: ...
  score: [...]
}>

Common Configuration#

All FlexMF models share several configuration options in common, defined by FlexMFConfigBase:

Model Structure Options
embedding_size

The dimension of the matrix factorization. FlexMF works best when this is a power of 2.

Regularization Options

FlexMF supports two different forms of regularization: AdamW weight decay and L2 regularization. With L2 regularization, the regularization term is included directly in the loss function, which allows the model to be trained with sparse gradients.

reg_method:

The method to use for regularization (AdamW or L2), or None to disable regularization.

regularization:

The regularization weight.

Tip

Our internal experiments have generally found AdamW regularization to be more effective, and we have seen little to no benefit from the sparse gradients allowed by L2. We may remove the configurable regularization types before removing the experimental label from FlexMF.

Training Options
batch_size:

The size for individual training batches. The optimal batch size is usually much larger than for deep models, because collaborative filtering models are relatively simple.

learning_rate:

The base learning rate for the AdamW or SparseAdam optimizer.

epochs:

The number of training epochs.
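For example, these common options can be combined when constructing a scorer (a sketch; the option values here are illustrative, not recommended defaults):

```python
>>> from lenskit.flexmf import FlexMFImplicitScorer
>>> model = FlexMFImplicitScorer(
...     embedding_size=64,     # powers of 2 work best
...     regularization=0.01,   # regularization weight
...     batch_size=8192,       # CF models tolerate large batches
...     learning_rate=0.01,
...     epochs=10,
... )
```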

Explicit Feedback#

The FlexMFExplicitScorer class provides an explicit-feedback rating prediction model with biased matrix factorization. The model itself is mathematically identical to BiasedMFScorer, but is trained using minibatch gradient descent in PyTorch and can use a GPU. User and item biases are learned jointly with the embeddings, and are attenuated for low-information users and items through regularization instead of an explicit damping term.
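Constructing the explicit-feedback scorer follows the same pattern as the implicit example above (a sketch; the option values are illustrative):

```python
>>> from lenskit.flexmf import FlexMFExplicitScorer
>>> model = FlexMFExplicitScorer(embedding_size=32, epochs=10)
```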

Implicit Feedback#

FlexMFImplicitScorer provides the implicit-feedback scorers in FlexMF. This scorer supports multiple loss functions and training regimes that can be selected to yield logistic matrix factorization, BPR, WARP, or others (see FlexMFImplicitConfig for full options).

Presets#

For easy configuration, FlexMF provides presets (the preset option) that set the defaults for other options to match the originally-published versions of the models FlexMF can realize. Select a preset by configuring e.g. preset="bpr".

Below are the settings controlled by each preset:

| Preset             | bpr      | warp      | lightgcn |
|--------------------|----------|-----------|----------|
| loss               | pairwise | warp      | pairwise |
| negative_strategy  | uniform  | misranked | uniform  |
| user_bias          | False    | False     | False    |
| item_bias          | False    | False     | False    |
| convolution_layers | 0        | 0         | 3        |

These are only defaults; settings from Common Configuration and Additional Configuration Options will override them.
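For example, a sketch of selecting the BPR preset while overriding one of its defaults:

```python
>>> from lenskit.flexmf import FlexMFImplicitScorer
>>> # BPR defaults (pairwise loss, uniform negatives), larger embedding
>>> model = FlexMFImplicitScorer(preset="bpr", embedding_size=64)
```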

Additional Configuration Options#

The two primary options that control the model’s overall behavior are the loss function and the sampling strategy.

Three loss functions (loss) are supported:

"logistic"

Logistic loss, as used in Logistic Matrix Factorization [Joh14].

"pairwise"

Pairwise rank loss, as used in Bayesian Personalized Ranking [RFGSchmidtThieme09].

"warp"

Weighted approximate rank loss, a revised version of pairwise loss used in WARP [WYW13]. Only works with the "misranked" sampling strategy.
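To make the distinction between the first two losses concrete, here is a plain-Python sketch of the logistic and pairwise (BPR) loss terms for single sampled scores; this illustrates the math only and is not the FlexMF implementation:

```python
import math

def logistic_loss(score: float, label: int) -> float:
    """Logistic (binary cross-entropy) loss for one item.

    label is 1 for a positive (observed) item, 0 for a sampled negative.
    """
    p = 1.0 / (1.0 + math.exp(-score))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def pairwise_loss(pos_score: float, neg_score: float) -> float:
    """BPR pairwise loss: -log sigmoid(pos - neg)."""
    return -math.log(1.0 / (1.0 + math.exp(-(pos_score - neg_score))))

# A well-ranked pair (positive scored above negative) has low loss:
low = pairwise_loss(2.0, -1.0)
# A misranked pair has high loss:
high = pairwise_loss(-1.0, 2.0)
```

Note that the pairwise loss only depends on the *difference* between the positive and negative scores, while the logistic loss pushes each score toward its label independently.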

Three sampling strategies (negative_strategy) are supported:

"uniform"

Negative items are sampled uniformly at random from the corpus. This is the default for logistic and pairwise losses.

"popular"

Negative items are sampled proportional to their popularity in the training data.

"misranked"

Negative items are sampled based on their scores from the model so far, looking for misranked items. This strategy comes from WARP, but can be used with other loss functions as well. It is the default (and only) strategy for WARP loss.

Finally, neighborhood aggregation in the style of LightGCN [HDW+20] can be enabled with the convolution_layers option.
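As a rough illustration of what convolution_layers does, each layer replaces every embedding with an aggregate of its graph neighbors' embeddings, and the final embedding averages across layers. The following is a simplified pure-Python sketch that uses a plain mean and ignores LightGCN's degree normalization; it is not the FlexMF code:

```python
def propagate(user_emb, item_emb, interactions, n_layers):
    """Simplified LightGCN-style propagation on a user-item bipartite graph.

    user_emb / item_emb: dicts mapping id -> embedding (list of floats).
    interactions: list of (user, item) pairs.
    Returns layer-averaged user and item embeddings.
    """
    def mean(vecs, dim):
        if not vecs:
            return [0.0] * dim
        return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

    dim = len(next(iter(user_emb.values())))
    u_layers = {u: [e] for u, e in user_emb.items()}
    i_layers = {i: [e] for i, e in item_emb.items()}
    u_cur, i_cur = dict(user_emb), dict(item_emb)
    for _ in range(n_layers):
        # each user's next embedding aggregates its items, and vice versa
        u_next = {u: mean([i_cur[i] for (uu, i) in interactions if uu == u], dim)
                  for u in user_emb}
        i_next = {i: mean([u_cur[u] for (u, ii) in interactions if ii == i], dim)
                  for i in item_emb}
        for u in user_emb:
            u_layers[u].append(u_next[u])
        for i in item_emb:
            i_layers[i].append(i_next[i])
        u_cur, i_cur = u_next, i_next
    # final embedding: average across layers (layer 0 = original embedding)
    u_out = {u: mean(layers, dim) for u, layers in u_layers.items()}
    i_out = {i: mean(layers, dim) for i, layers in i_layers.items()}
    return u_out, i_out
```

With zero layers this reduces to plain matrix factorization, which is why convolution_layers=0 disables LightGCN mode.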

Relationship to Published Models#

FlexMF is designed to provide reasonable implementations of various designs from the literature in an integrated manner, differing primarily in fundamental design points rather than implementation details, and facilitating good code re-use and testing across implementations.

  • The explicit-feedback model implements biased matrix factorization as described in many papers, with the core model appearing in both ALS [PilaszyZT10] and FunkSVD [Fun06]. The description most closely aligning with this implementation is probably that of Koren et al. [KBV09].

  • The default implicit settings are similar to logistic matrix factorization [Joh14]. It is not an exact implementation, because implicit-feedback FlexMF uses negative sampling whereas Johnson [Joh14] trained on entire users with different weights for positive and negative items.

  • Bayesian Personalized Ranking with matrix factorization (BPR-MF) [RFGSchmidtThieme09] uses the pairwise loss and uniform negative sampling; these are the default with the "bpr" preset.

  • Weighted Approximate Rank Loss (WARP) [WBU11, WYW13] uses misranked negative sampling and WARP loss. These are the default with the "warp" preset.

  • Light Graph Convolutional Networks (LightGCN) [HDW+20] modifies the model, not the loss, by adding multiple layers of aggregated neighbor embeddings; it can be used with any supported loss or negative strategy. LightGCN is enabled by the convolution_layers setting, where a positive number of layers puts the model into LightGCN mode. The "lightgcn" preset defaults to 2 layers with pairwise loss. The mixing coefficients are not currently configurable, as He et al. [HDW+20] found that just averaging the embedding layers worked pretty well.