Splitters

Splitters divide interaction data into train and test sets.

Every splitter returns the two sets from its split method.

replay.splitters.base_splitter.Splitter.split(self, log)

Splits input DataFrame into train and test

Parameters

log (Union[DataFrame, DataFrame]) – input pandas or Spark DataFrame with columns [timestamp, user_id, item_id, relevance]

Return type

Tuple[DataFrame, DataFrame]

Returns

train and test DataFrames

UserSplitter

class replay.splitters.user_log_splitter.UserSplitter(item_test_size=1, user_test_size=None, shuffle=False, drop_cold_items=False, drop_cold_users=False, seed=None)

Split data inside each user’s history separately.

Example:

>>> from replay.session_handler import get_spark_session, State
>>> spark = get_spark_session(1, 1)
>>> state = State(spark)
>>> from replay.splitters import UserSplitter
>>> import pandas as pd
>>> data_frame = pd.DataFrame({"user_idx": [1,1,1,2,2,2],
...    "item_idx": [1,2,3,1,2,3],
...    "relevance": [1,2,3,4,5,6],
...    "timestamp": [1,2,3,3,2,1]})
>>> data_frame
   user_idx  item_idx  relevance  timestamp
0         1         1          1          1
1         1         2          2          2
2         1         3          3          3
3         2         1          4          3
4         2         2          5          2
5         2         3          6          1
>>> from replay.utils import convert2spark
>>> data_frame = convert2spark(data_frame)

By default, the test set contains the last item of each user:

>>> UserSplitter(seed=80083).split(data_frame)[-1].toPandas()
   user_idx  item_idx  relevance  timestamp
0         1         3          3          3
1         2         1          4          3

Random records can be retrieved with shuffle:

>>> UserSplitter(shuffle=True, seed=80083).split(data_frame)[-1].toPandas()
   user_idx  item_idx  relevance  timestamp
0         1         2          2          2
1         2         3          6          1

You can specify the number of items for each user:

>>> UserSplitter(item_test_size=3, shuffle=True, seed=80083).split(data_frame)[-1].toPandas()
   user_idx  item_idx  relevance  timestamp
0         1         2          2          2
1         1         3          3          3
2         1         1          1          1
3         2         3          6          1
4         2         2          5          2
5         2         1          4          3

Or a fraction:

>>> UserSplitter(item_test_size=0.67, shuffle=True, seed=80083).split(data_frame)[-1].toPandas()
   user_idx  item_idx  relevance  timestamp
0         1         2          2          2
1         1         3          3          3
2         2         3          6          1
3         2         2          5          2

user_test_size sets how many users go into the test set, either as an exact number or as a fraction:

>>> UserSplitter(user_test_size=1, item_test_size=2, seed=42).split(data_frame)[-1].toPandas().user_idx.nunique()
1
>>> UserSplitter(user_test_size=0.5, item_test_size=2, seed=42).split(data_frame)[-1].toPandas().user_idx.nunique()
1

__init__(item_test_size=1, user_test_size=None, shuffle=False, drop_cold_items=False, drop_cold_users=False, seed=None)
Parameters
  • item_test_size (Union[float, int]) – fraction or number of items per user to put in test

  • user_test_size (Union[float, int, None]) – fraction or number of users to put in test. None means all available users.

  • shuffle (bool) – take random items instead of the last ones by timestamp

  • drop_cold_items (bool) – flag to drop cold items from test

  • drop_cold_users (bool) – flag to drop cold users from test

  • seed (Optional[int]) – random seed
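The drop_cold_items and drop_cold_users flags can be illustrated with a small plain-Python sketch. It assumes that "cold" means an item or user that never appears in train, which this page does not state explicitly. The function drop_cold and the tuple layout are hypothetical names for illustration, not part of the library.

```python
from typing import List, Tuple

# Assumed record layout: (user_idx, item_idx, relevance, timestamp)
Record = Tuple[int, int, float, int]


def drop_cold(
    train: List[Record],
    test: List[Record],
    drop_cold_items: bool = False,
    drop_cold_users: bool = False,
) -> List[Record]:
    """Remove test records whose item or user never appears in train."""
    train_users = {user for user, _, _, _ in train}
    train_items = {item for _, item, _, _ in train}
    kept = []
    for record in test:
        user, item, _, _ = record
        if drop_cold_users and user not in train_users:
            continue  # user is unknown to train
        if drop_cold_items and item not in train_items:
            continue  # item is unknown to train
        kept.append(record)
    return kept


train = [(1, 1, 1.0, 1), (1, 2, 2.0, 2), (2, 1, 4.0, 3)]
test = [(1, 3, 3.0, 3), (2, 2, 5.0, 4), (3, 1, 6.0, 5)]

# Item 3 is absent from train, so its record is removed.
print(drop_cold(train, test, drop_cold_items=True))
# User 3 is absent from train, so its record is removed.
print(drop_cold(train, test, drop_cold_users=True))
```

With both flags left at False the test set is returned unchanged.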

k_folds

replay.splitters.user_log_splitter.k_folds(log, n_folds=5, seed=None, splitter='user')

Splits each user's log into folds at random.

Parameters
  • log (Union[DataFrame, DataFrame]) – input DataFrame

  • n_folds (Optional[int]) – number of folds

  • seed (Optional[int]) – random seed

  • splitter (Optional[str]) – splitting strategy. Only the 'user' variant is currently available.

Return type

Tuple[DataFrame, DataFrame]

Returns

yields a pair of train and test DataFrames for each fold
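As a rough illustration of per-user folding, the sketch below shuffles each user's records and deals them round-robin into n_folds buckets, then yields each bucket in turn as test. This is a plain-Python approximation under assumed details (the round-robin assignment and the tuple layout are assumptions), not the library's Spark implementation. The name k_folds_sketch is hypothetical.

```python
import random
from collections import defaultdict
from typing import Iterator, List, Optional, Tuple

# Assumed record layout: (user_idx, item_idx, relevance, timestamp)
Record = Tuple[int, int, float, int]


def k_folds_sketch(
    log: List[Record], n_folds: int = 5, seed: Optional[int] = None
) -> Iterator[Tuple[List[Record], List[Record]]]:
    """Yield (train, test) pairs, one per fold, folding inside each user."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for record in log:
        by_user[record[0]].append(record)

    folds: List[List[Record]] = [[] for _ in range(n_folds)]
    for user in sorted(by_user):
        records = by_user[user]
        rng.shuffle(records)  # random order inside one user's history
        for position, record in enumerate(records):
            folds[position % n_folds].append(record)

    for i in range(n_folds):
        test_part = folds[i]
        train_part = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train_part, test_part


log = [(user, item, 1.0, item) for user in (1, 2) for item in range(4)]
for fold_train, fold_test in k_folds_sketch(log, n_folds=2, seed=0):
    print(len(fold_train), len(fold_test))
```

Every record lands in the test part of exactly one fold, so the folds together cover the whole log.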

DateSplitter

class replay.splitters.log_splitter.DateSplitter(test_start, drop_cold_items=False, drop_cold_users=False)

Split into train and test by date.

__init__(test_start, drop_cold_items=False, drop_cold_users=False)
Parameters
  • test_start (Union[datetime, float, str, int]) – string in yyyy-mm-dd format, int unix timestamp, datetime, or a fraction giving the test size, in which case the date is determined automatically

  • drop_cold_items (bool) – flag to drop cold items from test

  • drop_cold_users (bool) – flag to drop cold users from test
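A date-based split can be sketched in plain Python as follows. The sketch covers only the numeric cases (an int timestamp or a float fraction strictly between 0 and 1). It assumes that records at exactly test_start go to test, and that a fraction is resolved by taking a timestamp from the sorted log. Both are assumptions for illustration, and date_split_sketch is a hypothetical name.

```python
from typing import List, Tuple, Union

# Assumed record layout: (user_idx, item_idx, relevance, timestamp)
Record = Tuple[int, int, float, int]


def date_split_sketch(
    log: List[Record], test_start: Union[int, float]
) -> Tuple[List[Record], List[Record]]:
    """Split by timestamp; a float test_start is treated as a test fraction."""
    if isinstance(test_start, float):
        # Pick the timestamp below which about (1 - test_start) of records fall.
        timestamps = sorted(record[3] for record in log)
        cut = min(int(len(timestamps) * (1 - test_start)), len(timestamps) - 1)
        threshold = timestamps[cut]
    else:
        threshold = test_start
    train_part = [r for r in log if r[3] < threshold]
    test_part = [r for r in log if r[3] >= threshold]
    return train_part, test_part


log = [(1, 1, 1.0, 10), (1, 2, 1.0, 20), (2, 1, 1.0, 30), (2, 2, 1.0, 40)]

by_timestamp = date_split_sketch(log, 30)   # explicit timestamp
by_fraction = date_split_sketch(log, 0.5)   # half of the records in test
print(by_timestamp == by_fraction)
```

When several records share a timestamp, the resulting test share can differ from the requested fraction, because the cut is always made on a timestamp boundary.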

RandomSplitter

class replay.splitters.log_splitter.RandomSplitter(test_size, drop_cold_items=False, drop_cold_users=False, seed=None)

Assign records into train and test at random.

__init__(test_size, drop_cold_items=False, drop_cold_users=False, seed=None)
Parameters
  • test_size (float) – fraction of records assigned to test, between 0 and 1

  • drop_cold_items (bool) – flag to drop cold items from test

  • drop_cold_users (bool) – flag to drop cold users from test

  • seed (Optional[int]) – random seed
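One simple way to picture random assignment is an independent draw per record, as in the plain-Python sketch below. Whether the library draws per record in this way is an assumption. The sketch only shows the idea and how a seed makes the split reproducible. The name random_split_sketch is hypothetical.

```python
import random
from typing import List, Optional, Tuple

# Assumed record layout: (user_idx, item_idx, relevance, timestamp)
Record = Tuple[int, int, float, int]


def random_split_sketch(
    log: List[Record], test_size: float, seed: Optional[int] = None
) -> Tuple[List[Record], List[Record]]:
    """Send each record to test with probability test_size."""
    rng = random.Random(seed)
    train_part: List[Record] = []
    test_part: List[Record] = []
    for record in log:
        if rng.random() < test_size:
            test_part.append(record)
        else:
            train_part.append(record)
    return train_part, test_part


log = [(user, item, 1.0, 0) for user in range(10) for item in range(10)]
train_part, test_part = random_split_sketch(log, test_size=0.3, seed=42)
# The test share is close to, but not exactly, test_size.
print(len(train_part), len(test_part))
```

Because each record is drawn independently, the same user can appear in both train and test, and the realized test share only approximates test_size.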

NewUsersSplitter

class replay.splitters.log_splitter.NewUsersSplitter(test_size, drop_cold_items=False)

Only new users will be assigned to test set. Splits log by timestamp so that test has test_size fraction of most recent users.

>>> from replay.splitters import NewUsersSplitter
>>> import pandas as pd
>>> data_frame = pd.DataFrame({"user_idx": [1,1,2,2,3,4],
...    "item_idx": [1,2,3,1,2,3],
...    "relevance": [1,2,3,4,5,6],
...    "timestamp": [20,40,20,30,10,40]})
>>> data_frame
   user_idx  item_idx  relevance  timestamp
0         1         1          1         20
1         1         2          2         40
2         2         3          3         20
3         2         1          4         30
4         3         2          5         10
5         4         3          6         40
>>> train, test = NewUsersSplitter(test_size=0.1).split(data_frame)
>>> train.show()
+--------+--------+---------+---------+
|user_idx|item_idx|relevance|timestamp|
+--------+--------+---------+---------+
|       1|       1|        1|       20|
|       2|       3|        3|       20|
|       2|       1|        4|       30|
|       3|       2|        5|       10|
+--------+--------+---------+---------+

>>> test.show()
+--------+--------+---------+---------+
|user_idx|item_idx|relevance|timestamp|
+--------+--------+---------+---------+
|       4|       3|        6|       40|
+--------+--------+---------+---------+

The train DataFrame can shrink drastically even with a moderate test_size if the number of new users is small.

>>> train, test = NewUsersSplitter(test_size=0.3).split(data_frame)
>>> train.show()
+--------+--------+---------+---------+
|user_idx|item_idx|relevance|timestamp|
+--------+--------+---------+---------+
|       3|       2|        5|       10|
+--------+--------+---------+---------+

__init__(test_size, drop_cold_items=False)
Parameters
  • test_size (float) – fraction of most recent users assigned to test, between 0 and 1

  • drop_cold_items (bool) – flag to drop cold items from test
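The behaviour shown in the NewUsersSplitter examples can be reproduced with the plain-Python sketch below. It is a reconstruction under assumptions, not the library's Spark code. It takes each user's first timestamp, walks these start dates from newest to oldest until at least a test_size share of users is covered, and uses that date as the start of the test period. The name new_users_split_sketch is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Record layout as in the examples: (user_idx, item_idx, relevance, timestamp)
Record = Tuple[int, int, int, int]


def new_users_split_sketch(
    log: List[Record], test_size: float
) -> Tuple[List[Record], List[Record]]:
    """Put users whose first action is recent enough into test."""
    # First interaction time of every user.
    start: Dict[int, int] = {}
    for user, _, _, timestamp in log:
        start[user] = min(timestamp, start.get(user, timestamp))

    users_per_start = Counter(start.values())
    total_users = len(start)

    # Accumulate users from the newest start date backwards until the
    # requested share is reached; that start date opens the test period.
    accumulated = 0
    test_start = None
    for start_ts in sorted(users_per_start, reverse=True):
        accumulated += users_per_start[start_ts]
        test_start = start_ts
        if accumulated >= total_users * test_size:
            break

    # Train keeps only records before the test period; test keeps every
    # record of users who first appeared during the test period.
    train_part = [r for r in log if r[3] < test_start]
    test_part = [r for r in log if start[r[0]] >= test_start]
    return train_part, test_part


log = [(1, 1, 1, 20), (1, 2, 2, 40), (2, 3, 3, 20),
       (2, 1, 4, 30), (3, 2, 5, 10), (4, 3, 6, 40)]
train_part, test_part = new_users_split_sketch(log, test_size=0.1)
print(train_part)
print(test_part)
```

Under these assumptions, records of older users that fall inside the test period (such as user 1 at timestamp 40) end up in neither set, which matches the train output in the first example. Users who share a start date move into test together, which is why test_size=0.3 leaves a single train row.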

ColdUserRandomSplitter

class replay.splitters.log_splitter.ColdUserRandomSplitter(test_size, drop_cold_items=False, drop_cold_users=False)

Test set consists of all actions of randomly chosen users.

__init__(test_size, drop_cold_items=False, drop_cold_users=False)
Parameters
  • test_size (float) – fraction of users to be in test

  • drop_cold_items (bool) – flag to drop cold items from test

  • drop_cold_users (bool) – flag to drop cold users from test
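The idea of a cold-user split can be sketched in plain Python: choose a share of users at random and move all of their actions to test. Picking an exact rounded count of users is an assumption made here for clarity. The seed argument is a hypothetical addition for reproducibility and is not in the class signature above. The name cold_user_random_split_sketch is also hypothetical.

```python
import random
from typing import List, Optional, Tuple

# Assumed record layout: (user_idx, item_idx, relevance, timestamp)
Record = Tuple[int, int, float, int]


def cold_user_random_split_sketch(
    log: List[Record], test_size: float, seed: Optional[int] = None
) -> Tuple[List[Record], List[Record]]:
    """Move every action of randomly chosen users into test."""
    rng = random.Random(seed)  # seed is an illustrative extra, see lead-in
    users = sorted({record[0] for record in log})
    n_test_users = round(len(users) * test_size)
    test_users = set(rng.sample(users, n_test_users))
    train_part = [r for r in log if r[0] not in test_users]
    test_part = [r for r in log if r[0] in test_users]
    return train_part, test_part


log = [(user, item, 1.0, item) for user in (1, 2, 3, 4) for item in (1, 2)]
train_part, test_part = cold_user_random_split_sketch(log, test_size=0.5, seed=7)

# No user appears on both sides of the split.
print({r[0] for r in train_part} & {r[0] for r in test_part})
```

Since test users have no history in train, a split of this kind evaluates how a model handles users it has never seen.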