Get Started

Data Format

RePlay uses PySpark for its internal data representation. To convert a pandas DataFrame into a Spark one, use the replay.utils.convert2spark function.

By default you don't have to think about the Spark session at all, because one is created automatically. If you want to use a custom session, refer to this page.

RePlay also expects the following column names.

Entity	Name
User identifier	user_idx
Item identifier	item_idx
Date info	timestamp
Rating/weight	relevance
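
As a minimal sketch of meeting these naming requirements, the snippet below renames the columns of a pandas DataFrame before it is converted to Spark. The source column names (customer, product, ts, rating) and the data are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical raw interaction log; the original column names are made up.
raw = pd.DataFrame(
    {
        "customer": [0, 1, 1, 2],
        "product": [0, 0, 1, 2],
        "ts": pd.to_datetime(
            ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]
        ),
        "rating": [5.0, 3.0, 4.0, 1.0],
    }
)

# Map each source column to the name listed in the table above.
column_mapping = {
    "customer": "user_idx",
    "product": "item_idx",
    "ts": "timestamp",
    "relevance_placeholder": "relevance",  # ignored: no such source column
    "rating": "relevance",
}

# rename() skips mapping keys that are not present in the DataFrame.
log = raw.rename(columns=column_mapping)
print(list(log.columns))
```

The renamed DataFrame can then be passed to replay.utils.convert2spark.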

ID requirements

user_idx and item_idx should be numerical indexes that start at zero and have no gaps. This is important for models that use sparse matrices and estimate their dimensions from the largest index seen.

You should convert your data with the Indexer class. It stores label encoders for you, so you can convert raw IDs to indexes and back.
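
To illustrate what a label encoder does, here is a minimal pure-Python sketch of the idea. It is not the Indexer API: the function fit_encoder and the sample IDs are hypothetical, and the sketch only shows how raw IDs can be mapped to gap-free indexes starting at zero and back again.

```python
def fit_encoder(raw_ids):
    """Map each distinct raw ID to a consecutive integer starting at zero."""
    encoder = {}
    for raw_id in raw_ids:
        if raw_id not in encoder:
            # The next free index equals the number of IDs seen so far,
            # so indexes are assigned without gaps.
            encoder[raw_id] = len(encoder)
    return encoder


# Hypothetical raw user IDs as they might appear in a source system.
raw_user_ids = ["u42", "u7", "u42", "u1001"]

user_encoder = fit_encoder(raw_user_ids)
# Inverse mapping, used to translate indexes back to raw IDs.
user_decoder = {idx: raw_id for raw_id, idx in user_encoder.items()}

user_idx = [user_encoder[raw_id] for raw_id in raw_user_ids]
print(user_idx)  # [0, 1, 0, 2]

restored = [user_decoder[idx] for idx in user_idx]
print(restored == raw_user_ids)  # True
```

Because the largest index equals the number of distinct IDs minus one, a model that sizes its sparse matrix from the largest index allocates no unused rows or columns.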

Timestamp requirements

timestamp can be an integer, but a datetime value is preferable.
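
If your timestamps are stored as integers, one possible way to turn them into datetime values in pandas is shown below. This sketch assumes the integers are Unix timestamps in seconds; adjust the unit argument if your data uses milliseconds or another resolution. The sample data is hypothetical.

```python
import pandas as pd

# Hypothetical log whose timestamp column holds Unix seconds.
log = pd.DataFrame(
    {
        "user_idx": [0, 1],
        "item_idx": [1, 0],
        "timestamp": [1609459200, 1609545600],
    }
)

# Interpret the integers as seconds since the Unix epoch.
log["timestamp"] = pd.to_datetime(log["timestamp"], unit="s")
print(log["timestamp"].dtype)
```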