Data Preparation

RePlay has several requirements for input data. Input columns are expected in the form [user_id, item_id, timestamp, relevance]. The internal format is a Spark DataFrame with indexed integer values for [user_idx, item_idx]. You can convert the raw ids in your Spark DataFrame to indexes with the Indexer class.

class replay.data_preparator.Indexer(user_col='user_id', item_col='item_id')

This class is used to convert arbitrary id to numerical idx and back.

fit(users, items): Creates indexers to map raw id to numerical idx so that spark can handle them. :param user: DataFrame containing user column :param item: DataFrame containing item column :rtype: None :return:

inverse_transform(df)

Convert DataFrame to the initial indexes.

Parameters: df (DataFrame) – DataFrame with idxs
Return type: DataFrame
Returns: DataFrame with ids

transform(df)

Convert raw user_id and item_id to numerical user_idx and item_idx.

Parameters: df (DataFrame) – DataFrame with raw ids
Return type: Optional[DataFrame]
Returns: dataframe with converted indexes
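
To make the fit / transform / inverse_transform contract concrete, here is a minimal pure-Python sketch of the same idea on plain lists of dicts. It is not RePlay's implementation and does not use Spark. The class name ToyIndexer and the choice to assign indexes in first-seen order are assumptions made for illustration only.

```python
class ToyIndexer:
    """Hypothetical stand-in for an id <-> idx mapper (not the RePlay class)."""

    def __init__(self, user_col="user_id", item_col="item_id"):
        self.user_col = user_col
        self.item_col = item_col
        self.user_to_idx = {}
        self.item_to_idx = {}

    def fit(self, users, items):
        # Assumption: consecutive integers are assigned in first-seen order.
        for row in users:
            self.user_to_idx.setdefault(row[self.user_col], len(self.user_to_idx))
        for row in items:
            self.item_to_idx.setdefault(row[self.item_col], len(self.item_to_idx))

    def transform(self, rows):
        # Replace raw ids with integer indexes; other fields pass through.
        id_cols = (self.user_col, self.item_col)
        out = []
        for row in rows:
            new = {k: v for k, v in row.items() if k not in id_cols}
            new["user_idx"] = self.user_to_idx[row[self.user_col]]
            new["item_idx"] = self.item_to_idx[row[self.item_col]]
            out.append(new)
        return out

    def inverse_transform(self, rows):
        # Invert both mappings and restore the raw id columns.
        idx_to_user = {v: k for k, v in self.user_to_idx.items()}
        idx_to_item = {v: k for k, v in self.item_to_idx.items()}
        out = []
        for row in rows:
            new = {k: v for k, v in row.items() if k not in ("user_idx", "item_idx")}
            new[self.user_col] = idx_to_user[row["user_idx"]]
            new[self.item_col] = idx_to_item[row["item_idx"]]
            out.append(new)
        return out


log = [
    {"user_id": "u1", "item_id": "book", "relevance": 1.0},
    {"user_id": "u2", "item_id": "film", "relevance": 0.5},
    {"user_id": "u1", "item_id": "film", "relevance": 2.0},
]
indexer = ToyIndexer()
indexer.fit(users=log, items=log)
indexed = indexer.transform(log)
restored = indexer.inverse_transform(indexed)
```

In this sketch the same log is passed as both users and items, since it contains both columns. The round trip through transform and inverse_transform returns rows with the original ids.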

If your data is a pandas DataFrame with different column names, you can either preprocess it yourself with the convert2spark function or use the DataPreparator class.

class replay.data_preparator.DataPreparator

Convert pandas DataFrame to Spark, rename columns and apply indexer.

back(df)

Convert DataFrame to the initial indexes.

Parameters: df (DataFrame) – DataFrame with idxs
Return type: DataFrame
Returns: DataFrame with ids
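
The column-renaming step that DataPreparator is described as performing can be pictured with plain Python. This is only an illustration of mapping custom column names onto the expected [user_id, item_id, timestamp, relevance] names. The helper rename_columns and its columns_mapping argument are hypothetical and are not taken from the RePlay API.

```python
EXPECTED = ("user_id", "item_id", "timestamp", "relevance")


def rename_columns(rows, columns_mapping):
    """Hypothetical helper: columns_mapping maps expected name -> source name."""
    unknown = set(columns_mapping) - set(EXPECTED)
    if unknown:
        raise ValueError(f"unexpected target columns: {sorted(unknown)}")
    # Keep only the mapped columns, under their expected names.
    return [
        {target: row[source] for target, source in columns_mapping.items()}
        for row in rows
    ]


raw = [
    {"customer": "u1", "sku": "book", "ts": 1700000000, "rating": 4.0},
    {"customer": "u2", "sku": "film", "ts": 1700000500, "rating": 5.0},
]
renamed = rename_columns(
    raw,
    {"user_id": "customer", "item_id": "sku", "timestamp": "ts", "relevance": "rating"},
)
```

After a rename like this, the rows carry the column names the rest of the pipeline expects, and the raw ids can then be converted to indexes.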