Get Started

Data Format

RePlay uses PySpark for internal data representation. To convert a Pandas DataFrame into a Spark one, you can use the replay.utils.convert2spark function.

By default you don’t have to think about the Spark session at all, because one will be created automatically. If you want to use a custom session, refer to this page.

There are also requirements regarding column names.

Entity              Name

User identifier     user_idx

Item identifier     item_idx

Date info           timestamp

Rating/weight       relevance
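As an illustration, a minimal interaction log with the required column names might look like this. The sketch below uses Pandas; the call to convert2spark is shown only as a comment because it requires a Spark environment:

```python
import pandas as pd

# A toy interaction log using RePlay's expected column names.
log = pd.DataFrame(
    {
        "user_idx": [0, 0, 1, 2],
        "item_idx": [0, 1, 1, 2],
        "timestamp": pd.to_datetime(
            ["2021-01-01", "2021-01-02", "2021-01-02", "2021-01-03"]
        ),
        "relevance": [5.0, 3.0, 4.0, 1.0],
    }
)

print(log.columns.tolist())

# With Spark available, the DataFrame would then be converted:
# from replay.utils import convert2spark
# spark_log = convert2spark(log)
```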

ID requirements

user_idx and item_idx should be numerical indexes starting at zero and without gaps. This is important for models that use sparse matrices and estimate their dimensions from the largest index seen.

You should convert your data with the Indexer class. It stores label encoders for you, so you can convert raw ids to idx and back.
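The effect of such re-indexing can be sketched with pandas.factorize, which produces exactly the kind of zero-based, gap-free codes described above. This is a stand-in for what Indexer does, not its actual implementation:

```python
import pandas as pd

# Raw, non-contiguous user ids as they might appear in source data.
raw_user_ids = pd.Series(["u42", "u7", "u42", "u1000"])

# factorize returns zero-based codes without gaps, plus the label array,
# which lets you map each code back to the original id.
codes, labels = pd.factorize(raw_user_ids)

print(codes.tolist())   # [0, 1, 0, 2] -- zero-based, gap-free indexes
print(labels.tolist())  # ['u42', 'u7', 'u1000'] -- lookup table: idx -> raw id
```

Because the codes start at zero and have no gaps, a model can safely size a sparse matrix as `codes.max() + 1` rows.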

Timestamp requirements

timestamp can be an integer, but a datetime timestamp is preferable.
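For example, string dates or integer Unix times can be normalized to datetimes with pandas before handing the data over. This is a sketch; the column name follows the convention above:

```python
import pandas as pd

# String dates are parsed directly.
log = pd.DataFrame({"timestamp": ["2021-01-01", "2021-06-15"]})
log["timestamp"] = pd.to_datetime(log["timestamp"])

# Integer Unix seconds need an explicit unit.
unix_log = pd.DataFrame({"timestamp": [1609459200, 1623715200]})
unix_log["timestamp"] = pd.to_datetime(unix_log["timestamp"], unit="s")

print(log["timestamp"].dtype)  # datetime64[ns]
```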