Data Preparation

Replay has a number of requirements for input data. Input columns are expected in the form [user_id, item_id, timestamp, relevance]. The internal format is a Spark DataFrame with integer-indexed values in [user_idx, item_idx]. You can convert the ids in your Spark DataFrame to indexes with the Indexer class.

class replay.data_preparator.Indexer(user_col='user_id', item_col='item_id')

This class is used to convert arbitrary ids to numerical idxs and back.

fit(users, items)

Creates indexers that map raw ids to numerical idxs so that Spark can handle them.

Parameters

users (DataFrame) – DataFrame containing the user column

items (DataFrame) – DataFrame containing the item column

Return type

None

inverse_transform(df)

Convert DataFrame to the initial indexes.

Parameters

df (DataFrame) – DataFrame with idxs

Return type

DataFrame

Returns

DataFrame with ids

transform(df)

Convert raw user_id and item_id to numerical user_idx and item_idx.

Parameters

df (DataFrame) – DataFrame with raw ids

Return type

Optional[DataFrame]

Returns

dataframe with converted indexes
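The fit / transform / inverse_transform flow described above can be illustrated without Spark. The following is a minimal pure-Python sketch of the idea only, not the library's implementation: ToyIndexer is a hypothetical stand-in, rows are plain dictionaries rather than a Spark DataFrame, and the choice to assign indexes in order of first appearance is an assumption made for the example.

```python
class ToyIndexer:
    """Hypothetical stand-in mimicking the fit/transform/inverse_transform flow."""

    def __init__(self, user_col="user_id", item_col="item_id"):
        self.user_col = user_col
        self.item_col = item_col
        self.user_to_idx = {}
        self.item_to_idx = {}
        self.idx_to_user = {}
        self.idx_to_item = {}

    def fit(self, users, items):
        # Assign consecutive integers in order of first appearance (an assumption).
        for user in users:
            self.user_to_idx.setdefault(user, len(self.user_to_idx))
        for item in items:
            self.item_to_idx.setdefault(item, len(self.item_to_idx))
        # Build the reverse mappings used by inverse_transform.
        self.idx_to_user = {idx: raw for raw, idx in self.user_to_idx.items()}
        self.idx_to_item = {idx: raw for raw, idx in self.item_to_idx.items()}

    def transform(self, rows):
        # Replace the raw id columns with user_idx / item_idx, keep other columns.
        out = []
        for row in rows:
            new_row = {k: v for k, v in row.items()
                       if k not in (self.user_col, self.item_col)}
            new_row["user_idx"] = self.user_to_idx[row[self.user_col]]
            new_row["item_idx"] = self.item_to_idx[row[self.item_col]]
            out.append(new_row)
        return out

    def inverse_transform(self, rows):
        # Restore the raw ids from the integer indexes.
        out = []
        for row in rows:
            new_row = {k: v for k, v in row.items()
                       if k not in ("user_idx", "item_idx")}
            new_row[self.user_col] = self.idx_to_user[row["user_idx"]]
            new_row[self.item_col] = self.idx_to_item[row["item_idx"]]
            out.append(new_row)
        return out


log = [
    {"user_id": "u42", "item_id": "book", "relevance": 1.0},
    {"user_id": "u7", "item_id": "film", "relevance": 0.5},
    {"user_id": "u42", "item_id": "film", "relevance": 0.2},
]

indexer = ToyIndexer()
indexer.fit(users=[r["user_id"] for r in log], items=[r["item_id"] for r in log])
indexed = indexer.transform(log)
restored = indexer.inverse_transform(indexed)
```

With the real class the same three calls would be made on Spark DataFrames; the point of the sketch is only that transform and inverse_transform are inverses once fit has seen every id.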

If your data is a pandas DataFrame or has different column names, you can either preprocess it yourself with the convert2spark function or use the DataPreparator class.

class replay.data_preparator.DataPreparator

Convert pandas DataFrame to Spark, rename columns and apply indexer.

back(df)

Convert DataFrame to the initial indexes.

Parameters

df (DataFrame) – DataFrame with idxs

Return type

DataFrame

Returns

DataFrame with ids
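The column-renaming step that precedes indexing can be sketched in plain Python. This is an illustration under stated assumptions rather than DataPreparator's actual interface: rename_columns, REQUIRED, and the mapping argument are hypothetical names, and rows are dictionaries instead of a pandas or Spark DataFrame.

```python
# Canonical column names expected for input data.
REQUIRED = ["user_id", "item_id", "timestamp", "relevance"]


def rename_columns(rows, mapping):
    """Rename row keys; `mapping` maps canonical name -> source column name."""
    missing = [col for col in REQUIRED if col not in mapping]
    if missing:
        raise ValueError(f"no source column given for: {missing}")
    # Keep only the mapped columns, under their canonical names.
    return [{canon: row[src] for canon, src in mapping.items()} for row in rows]


raw = [{"uid": "u42", "sku": "book", "ts": 1600000000, "rating": 5.0}]

prepared = rename_columns(
    raw,
    {"user_id": "uid", "item_id": "sku", "timestamp": "ts", "relevance": "rating"},
)
```

After this step the rows carry the expected column names, so an indexer can be fitted and applied to them.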