Data Preparation
Replay has a number of requirements for input data.
We expect input columns in the form [user_id, item_id, timestamp, relevance].
The internal format is a Spark DataFrame with indexed integer values for [user_idx, item_idx].
You can convert the indexes of your Spark DataFrame with the Indexer class.
- class replay.data_preparator.Indexer(user_col='user_id', item_col='item_id')
This class converts arbitrary ids to numerical idxs and back.
- fit(users, items)
Creates indexers to map raw ids to numerical idxs so that Spark can handle them.
- Parameters
users – DataFrame containing user column
items – DataFrame containing item column
- Return type
None
- inverse_transform(df)
Convert DataFrame to the initial indexes.
- Parameters
df (DataFrame) – DataFrame with idxs
- Return type
DataFrame
- Returns
DataFrame with ids
- transform(df)
Convert raw user_id and item_id to numerical user_idx and item_idx.
- Parameters
df – DataFrame with raw indexes
- Return type
Optional[DataFrame]
- Returns
DataFrame with converted indexes
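The fit / transform / inverse_transform round trip described above can be illustrated with a minimal pure-Python sketch. This is not RePlay's implementation: ToyIndexer is a hypothetical stand-in that works on lists of dicts instead of Spark DataFrames, and it assumes ids are sortable so that the assigned idx values are deterministic. Only the method names and the user_col / item_col defaults are taken from the reference above.

```python
class ToyIndexer:
    """Hypothetical stand-in for illustration; rows are plain dicts, not Spark rows."""

    def __init__(self, user_col="user_id", item_col="item_id"):
        self.user_col = user_col
        self.item_col = item_col
        self.idx_to_user = []   # position in the list is the numerical idx
        self.idx_to_item = []
        self.user_to_idx = {}
        self.item_to_idx = {}

    def fit(self, users, items):
        # Collect unique raw ids and sort them so the mapping is reproducible.
        self.idx_to_user = sorted({row[self.user_col] for row in users})
        self.idx_to_item = sorted({row[self.item_col] for row in items})
        self.user_to_idx = {raw: idx for idx, raw in enumerate(self.idx_to_user)}
        self.item_to_idx = {raw: idx for idx, raw in enumerate(self.idx_to_item)}

    def transform(self, rows):
        # Replace the raw id columns with user_idx / item_idx, keep everything else.
        result = []
        for row in rows:
            new_row = {k: v for k, v in row.items()
                       if k not in (self.user_col, self.item_col)}
            new_row["user_idx"] = self.user_to_idx[row[self.user_col]]
            new_row["item_idx"] = self.item_to_idx[row[self.item_col]]
            result.append(new_row)
        return result

    def inverse_transform(self, rows):
        # Map numerical idxs back to the raw ids seen during fit.
        result = []
        for row in rows:
            new_row = {k: v for k, v in row.items()
                       if k not in ("user_idx", "item_idx")}
            new_row[self.user_col] = self.idx_to_user[row["user_idx"]]
            new_row[self.item_col] = self.idx_to_item[row["item_idx"]]
            result.append(new_row)
        return result


# Made-up interaction log used only for this illustration.
log = [
    {"user_id": "u42", "item_id": "book", "relevance": 1.0},
    {"user_id": "u7", "item_id": "film", "relevance": 0.5},
    {"user_id": "u42", "item_id": "film", "relevance": 2.0},
]

indexer = ToyIndexer()
indexer.fit(users=log, items=log)
indexed = indexer.transform(log)               # rows now carry user_idx / item_idx
restored = indexer.inverse_transform(indexed)  # rows carry the raw ids again
```

With the real Indexer the same three calls would operate on Spark DataFrames; the sketch only shows why a fitted mapping is needed before transform and how inverse_transform recovers the original ids.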
If your DataFrame is a Pandas DataFrame and has different column names, you can either preprocess it yourself with the convert2spark function or use the DataPreparator class.