MovieLens Data
The MovieLens data sets are a widely used collection of movie rating data, available from the GroupLens dataset collection. The core of each data set is a matrix of user-provided 5-star ratings of movies, along with movie metadata such as titles and IMDB links. Some sets include user demographics as well, and others include various forms of tag data.
Loading MovieLens Data
The load_movielens() function will load any published
MovieLens dataset, constructing a LensKit Dataset with
its contents. This dataset can then be split, saved in LensKit native format,
used to train models and pipelines, etc.
This function automatically detects which MovieLens dataset is being loaded, and can load them from either the Zip archives published by GroupLens or from a directory where the archive has been unpacked.
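To make the idea of format detection concrete, here is a minimal sketch of how a loader could tell the MovieLens layouts apart from the rating file present in either a Zip archive or an unpacked directory. The helper names `list_member_names` and `guess_movielens_layout` are hypothetical and this is not LensKit's detection logic; it relies only on the published file names (`u.data` in ML100K, `ratings.dat` in ML1M and ML10M, `ratings.csv` in the newer releases).

```python
import zipfile
from pathlib import Path


def list_member_names(source):
    """Return the bare file names found in a Zip archive or an unpacked directory."""
    source = Path(source)
    if source.is_dir():
        return {p.name for p in source.rglob("*") if p.is_file()}
    with zipfile.ZipFile(source) as zf:
        # Skip directory entries, keep only the final path component of each file.
        return {Path(name).name for name in zf.namelist() if not name.endswith("/")}


def guess_movielens_layout(source):
    """Guess the file layout family from the rating file that is present.

    Hypothetical helper for illustration only; not part of LensKit.
    """
    names = list_member_names(source)
    if "u.data" in names:
        return "tab-separated (ML100K)"
    if "ratings.dat" in names:
        return "'::'-delimited (ML1M, ML10M)"
    if "ratings.csv" in names:
        return "CSV (ML20M, ML25M, latest)"
    raise ValueError(f"no MovieLens rating file found in {source}")
```

Because both source types are reduced to a set of file names first, the same check serves an archive and its unpacked directory alike.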
MovieLens Data Model
The MovieLens loader loads the data into the standard user and item
entities, with a rating interaction class storing the user-provided ratings.
The items have the following attributes:
title
    The movie title.
genres
    A list of genres for this movie.
tag_counts
    A sparse vector attribute storing the number of times each tag has been applied to this movie. It is a summary of the tags data provided by MovieLens. The tag names themselves are on the attribute’s names.
tag_genome
    A vector attribute storing the relevance values from the tag genome [VSR12], when it is available (ML20M and 25M).
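As an illustration of what a per-movie tag count summarizes, the sketch below aggregates individual tag applications into counts. The rows are made-up sample data in the (userId, movieId, tag) layout of the MovieLens tags file, `count_tags` is a hypothetical helper, and no tag normalization is applied; how LensKit builds and stores the attribute internally may differ.

```python
from collections import Counter, defaultdict

# Made-up tag applications: (userId, movieId, tag).
tag_rows = [
    (1, 10, "funny"),
    (2, 10, "funny"),
    (2, 10, "heist"),
    (3, 20, "slow"),
]


def count_tags(rows):
    """Count how many times each tag was applied to each movie."""
    counts = defaultdict(Counter)
    for _user, movie, tag in rows:
        counts[movie][tag] += 1
    return counts


tag_counts = count_tags(tag_rows)
print(dict(tag_counts[10]))
```

Each movie only holds entries for tags actually applied to it, which is why a sparse representation suits this data.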
For most data sets, there are no user attributes; ML100K and ML1M have
gender, age, and zip_code attributes. See the MovieLens data
documentation for details on these.
Ratings have two attributes: rating and timestamp. The timestamps
are parsed into Arrow/NumPy/Pandas timestamps.
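The raw MovieLens files store each timestamp as integer seconds since the Unix epoch (UTC). The following standard-library sketch shows the conversion on sample rows in the `ratings.csv` column layout; the output keys `user_id` and `item_id` and the function `parse_ratings` are names chosen for this example, not LensKit API. With Pandas, `pd.to_datetime(values, unit="s")` performs the analogous conversion.

```python
import csv
import io
from datetime import datetime, timezone

# Sample rows in the ratings.csv column layout, for illustration only.
SAMPLE = """userId,movieId,rating,timestamp
1,1,4.0,964982703
1,3,4.5,964981247
"""


def parse_ratings(text):
    """Parse rating rows, converting epoch seconds to timezone-aware datetimes."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        rows.append({
            "user_id": int(rec["userId"]),
            "item_id": int(rec["movieId"]),
            "rating": float(rec["rating"]),
            "timestamp": datetime.fromtimestamp(int(rec["timestamp"]), tz=timezone.utc),
        })
    return rows


ratings = parse_ratings(SAMPLE)
print(ratings[0]["timestamp"].isoformat())  # 2000-07-30T18:45:03+00:00
```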
Todo
A future version of LensKit will likely introduce tags as first-class entities.