Splitters
Splits data into train and test.
Splits are returned by the split method.
- replay.splitters.base_splitter.Splitter.split(self, log)
Splits input DataFrame into train and test
- Parameters
log (Union[DataFrame, DataFrame]) – input DataFrame[timestamp, user_id, item_id, relevance]
- Return type
Tuple[DataFrame, DataFrame]
- Returns
train and test DataFrame
UserSplitter
- class replay.splitters.user_log_splitter.UserSplitter(item_test_size=1, user_test_size=None, shuffle=False, drop_cold_items=False, drop_cold_users=False, seed=None)
Split data inside each user’s history separately.
Example:
>>> from replay.session_handler import get_spark_session, State
>>> spark = get_spark_session(1, 1)
>>> state = State(spark)
>>> from replay.splitters import UserSplitter
>>> import pandas as pd
>>> data_frame = pd.DataFrame({"user_idx": [1,1,1,2,2,2],
...                            "item_idx": [1,2,3,1,2,3],
...                            "relevance": [1,2,3,4,5,6],
...                            "timestamp": [1,2,3,3,2,1]})
>>> data_frame
   user_idx  item_idx  relevance  timestamp
0         1         1          1          1
1         1         2          2          2
2         1         3          3          3
3         2         1          4          3
4         2         2          5          2
5         2         3          6          1
>>> from replay.utils import convert2spark
>>> data_frame = convert2spark(data_frame)
By default, the test set contains the last item of each user:
>>> UserSplitter(seed=80083).split(data_frame)[-1].toPandas()
   user_idx  item_idx  relevance  timestamp
0         1         3          3          3
1         2         1          4          3
Random records can be retrieved with shuffle:
>>> UserSplitter(shuffle=True, seed=80083).split(data_frame)[-1].toPandas()
   user_idx  item_idx  relevance  timestamp
0         1         2          2          2
1         2         3          6          1
You can specify the number of items for each user:
>>> UserSplitter(item_test_size=3, shuffle=True, seed=80083).split(data_frame)[-1].toPandas()
   user_idx  item_idx  relevance  timestamp
0         1         2          2          2
1         1         3          3          3
2         1         1          1          1
3         2         3          6          1
4         2         2          5          2
5         2         1          4          3
Or a fraction:
>>> UserSplitter(item_test_size=0.67, shuffle=True, seed=80083).split(data_frame)[-1].toPandas()
   user_idx  item_idx  relevance  timestamp
0         1         2          2          2
1         1         3          3          3
2         2         3          6          1
3         2         2          5          2
user_test_size limits the test set to an exact number (or fraction) of users:
>>> UserSplitter(user_test_size=1, item_test_size=2, seed=42).split(data_frame)[-1].toPandas().user_idx.nunique()
1
>>> UserSplitter(user_test_size=0.5, item_test_size=2, seed=42).split(data_frame)[-1].toPandas().user_idx.nunique()
1
- __init__(item_test_size=1, user_test_size=None, shuffle=False, drop_cold_items=False, drop_cold_users=False, seed=None)
- Parameters
item_test_size (Union[float, int]) – fraction or a number of items per user
user_test_size (Union[float, int, None]) – similar to item_test_size, but corresponds to the number of users. None is all available users.
shuffle – take random items instead of the last ones by timestamp.
drop_cold_items (bool) – flag to drop cold items from test
drop_cold_users (bool) – flag to drop cold users from test
seed (Optional[int]) – random seed
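The default behaviour shown in the examples (hold out the last items of each user by timestamp) can be approximated in plain Python. This is a minimal sketch, not RePlay's implementation: the helper name split_last_items is hypothetical, rows are plain tuples rather than a Spark DataFrame, and rounding a fractional item_test_size to the nearest whole item is an assumption.

```python
from collections import defaultdict


def split_last_items(rows, item_test_size=1):
    """Hypothetical helper: hold out the last items of each user by timestamp.

    rows: iterable of (user_idx, item_idx, relevance, timestamp) tuples.
    item_test_size: int number of items, or float fraction of each user's history.
    """
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    train, test = [], []
    for user in sorted(by_user):
        history = sorted(by_user[user], key=lambda r: r[3])  # oldest first
        if isinstance(item_test_size, float):
            # Assumption: fractions are rounded to the nearest whole item.
            n_test = round(len(history) * item_test_size)
        else:
            n_test = item_test_size
        n_test = max(0, min(n_test, len(history)))
        cut = len(history) - n_test
        train.extend(history[:cut])
        test.extend(history[cut:])
    return train, test


rows = [(1, 1, 1, 1), (1, 2, 2, 2), (1, 3, 3, 3),
        (2, 1, 4, 3), (2, 2, 5, 2), (2, 3, 6, 1)]
train, test = split_last_items(rows)
print(test)  # [(1, 3, 3, 3), (2, 1, 4, 3)]
```

On the sample data this reproduces the default UserSplitter output above: item 3 for user 1 and item 1 for user 2, each being the record with the largest timestamp.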
k_folds
- replay.splitters.user_log_splitter.k_folds(log, n_folds=5, seed=None, splitter='user')
Splits log inside each user into folds at random
- Parameters
log (Union[DataFrame, DataFrame]) – input DataFrame
n_folds (Optional[int]) – number of folds
seed (Optional[int]) – random seed
splitter (Optional[str]) – splitting strategy. Only the user variant is available at the moment.
- Return type
Tuple[DataFrame, DataFrame]
- Returns
yields train and test DataFrames by folds
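To illustrate what "folds inside each user" means, the sketch below deals each user's shuffled records into folds and yields one (train, test) pair per fold. It is an assumption-laden illustration on plain tuples, not the library's code: the name k_folds_sketch and the round-robin assignment are hypothetical, and rows are assumed to be unique.

```python
import random
from collections import defaultdict


def k_folds_sketch(rows, n_folds=5, seed=None):
    """Hypothetical generator: yield (train, test) once per fold.

    Each user's records are shuffled and dealt round-robin into folds,
    so every record is in the test part of exactly one fold.
    rows must be unique, hashable tuples whose first field is the user id.
    """
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for row in rows:
        by_user[row[0]].append(row)

    fold_of = {}
    for user in sorted(by_user):
        history = by_user[user][:]
        rng.shuffle(history)
        for position, row in enumerate(history):
            fold_of[row] = position % n_folds

    for fold in range(n_folds):
        train = [r for r in rows if fold_of[r] != fold]
        test = [r for r in rows if fold_of[r] == fold]
        yield train, test


rows = [(1, 1, 1, 1), (1, 2, 2, 2), (1, 3, 3, 3),
        (2, 1, 4, 3), (2, 2, 5, 2), (2, 3, 6, 1)]
folds = list(k_folds_sketch(rows, n_folds=3, seed=0))
for train, test in folds:
    print(len(train), len(test))  # 4 2 for every fold
```

Because the function is a generator, folds are produced lazily; wrapping it in list() as above materializes all of them at once.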
DateSplitter
- class replay.splitters.log_splitter.DateSplitter(test_start, drop_cold_items=False, drop_cold_users=False)
Split into train and test by date.
- __init__(test_start, drop_cold_items=False, drop_cold_users=False)
- Parameters
test_start (Union[datetime, float, str, int]) – string in yyyy-mm-dd format, int unix timestamp, datetime, or a fraction for test size to determine the date automatically
drop_cold_items (bool) – flag to drop cold items from test
drop_cold_users (bool) – flag to drop cold users from test
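A date-based split with the two cold-start flags might look like the following sketch. Several things here are assumptions rather than documented behaviour: records with a timestamp equal to test_start go to test, "cold" means absent from train, timestamps are unix seconds in UTC, and the fractional form of test_start is left out. The helper names to_unix and split_by_date are hypothetical.

```python
from datetime import datetime, timezone


def to_unix(test_start):
    """Hypothetical helper: turn a 'yyyy-mm-dd' string, datetime or int into unix seconds (UTC)."""
    if isinstance(test_start, float):
        raise NotImplementedError("fractional test_start is not covered by this sketch")
    if isinstance(test_start, str):
        test_start = datetime.strptime(test_start, "%Y-%m-%d")
    if isinstance(test_start, datetime):
        if test_start.tzinfo is None:
            test_start = test_start.replace(tzinfo=timezone.utc)
        return int(test_start.timestamp())
    return int(test_start)


def split_by_date(rows, test_start, drop_cold_items=False, drop_cold_users=False):
    """rows: (user_idx, item_idx, relevance, timestamp) tuples."""
    threshold = to_unix(test_start)
    # Assumption: the boundary timestamp itself belongs to test.
    train = [r for r in rows if r[3] < threshold]
    test = [r for r in rows if r[3] >= threshold]
    if drop_cold_users:
        known_users = {r[0] for r in train}
        test = [r for r in test if r[0] in known_users]
    if drop_cold_items:
        known_items = {r[1] for r in train}
        test = [r for r in test if r[1] in known_items]
    return train, test


day1, day2 = to_unix("2021-01-01"), to_unix("2021-01-02")
rows = [(1, 10, 1.0, day1), (2, 11, 1.0, day1),
        (1, 11, 1.0, day2), (3, 10, 1.0, day2), (1, 12, 1.0, day2)]
train, test = split_by_date(rows, "2021-01-02",
                            drop_cold_items=True, drop_cold_users=True)
print([r[:2] for r in test])  # [(1, 11)]
```

In the usage example, user 3 and item 12 never occur before the split date, so their test records are dropped when both flags are set.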
RandomSplitter
- class replay.splitters.log_splitter.RandomSplitter(test_size, drop_cold_items=False, drop_cold_users=False, seed=None)
Assign records into train and test at random.
- __init__(test_size, drop_cold_items=False, drop_cold_users=False, seed=None)
- Parameters
test_size (float) – test size, from 0 to 1
drop_cold_items (bool) – flag to drop cold items from test
drop_cold_users (bool) – flag to drop cold users from test
seed (Optional[int]) – random seed
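Random assignment of records can be sketched as an independent draw per record. This is an illustration under assumptions, not the library's method: the helper random_split is hypothetical, and with per-record draws the test share only approximates test_size.

```python
import random


def random_split(rows, test_size, seed=None):
    """Hypothetical helper: send each record to test with probability test_size."""
    if not 0 <= test_size <= 1:
        raise ValueError("test_size must be between 0 and 1")
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # rng.random() is in [0, 1), so test_size=0 keeps everything in train
        # and test_size=1 sends everything to test.
        (test if rng.random() < test_size else train).append(row)
    return train, test


rows = [(u, i, 1.0, 0) for u in range(10) for i in range(10)]
train, test = random_split(rows, test_size=0.2, seed=42)
print(len(train) + len(test))  # 100
```

Fixing seed makes the split reproducible, which matters when several models are compared on the same train and test sets.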
NewUsersSplitter
- class replay.splitters.log_splitter.NewUsersSplitter(test_size, drop_cold_items=False)
Only new users are assigned to the test set. The log is split by timestamp so that the test set contains the test_size fraction of the most recent users.
>>> from replay.splitters import NewUsersSplitter
>>> import pandas as pd
>>> data_frame = pd.DataFrame({"user_idx": [1,1,2,2,3,4],
...                            "item_idx": [1,2,3,1,2,3],
...                            "relevance": [1,2,3,4,5,6],
...                            "timestamp": [20,40,20,30,10,40]})
>>> data_frame
   user_idx  item_idx  relevance  timestamp
0         1         1          1         20
1         1         2          2         40
2         2         3          3         20
3         2         1          4         30
4         3         2          5         10
5         4         3          6         40
>>> train, test = NewUsersSplitter(test_size=0.1).split(data_frame)
>>> train.show()
+--------+--------+---------+---------+
|user_idx|item_idx|relevance|timestamp|
+--------+--------+---------+---------+
|       1|       1|        1|       20|
|       2|       3|        3|       20|
|       2|       1|        4|       30|
|       3|       2|        5|       10|
+--------+--------+---------+---------+
>>> test.show()
+--------+--------+---------+---------+
|user_idx|item_idx|relevance|timestamp|
+--------+--------+---------+---------+
|       4|       3|        6|       40|
+--------+--------+---------+---------+
The train DataFrame can shrink drastically even with a moderate test_size if the number of new users is small.
>>> train, test = NewUsersSplitter(test_size=0.3).split(data_frame)
>>> train.show()
+--------+--------+---------+---------+
|user_idx|item_idx|relevance|timestamp|
+--------+--------+---------+---------+
|       3|       2|        5|       10|
+--------+--------+---------+---------+
- __init__(test_size, drop_cold_items=False)
- Parameters
test_size (float) – test size, from 0 to 1
drop_cold_items (bool) – flag to drop cold items from test
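The train and test tables in the examples above are consistent with the following logic: find the latest first-interaction date such that at least test_size of all users started on or after it, keep only earlier records in train, and put all records of the newer users into test. The sketch below reproduces those example outputs on plain tuples; the helper new_users_split is hypothetical and the threshold rule is inferred from the examples, so treat it as an assumption rather than the library's definition.

```python
from collections import Counter


def new_users_split(rows, test_size):
    """Hypothetical helper reproducing the example outputs on plain tuples.

    rows: (user_idx, item_idx, relevance, timestamp) tuples.
    """
    # First interaction time of every user.
    first_seen = {}
    for user, _, _, ts in rows:
        first_seen[user] = min(ts, first_seen.get(user, ts))

    # Walk start dates from newest to oldest until enough users are covered.
    users_per_start = Counter(first_seen.values())
    needed = len(first_seen) * test_size
    covered = 0
    threshold = None
    for start in sorted(users_per_start, reverse=True):
        covered += users_per_start[start]
        threshold = start
        if covered >= needed:
            break

    # Train keeps only records strictly before the threshold.
    train = [r for r in rows if r[3] < threshold]
    # Test takes every record of users who first appeared at or after it.
    test = [r for r in rows if first_seen[r[0]] >= threshold]
    return train, test


rows = [(1, 1, 1, 20), (1, 2, 2, 40), (2, 3, 3, 20),
        (2, 1, 4, 30), (3, 2, 5, 10), (4, 3, 6, 40)]
train, test = new_users_split(rows, test_size=0.1)
print(test)  # [(4, 3, 6, 40)]
train, test = new_users_split(rows, test_size=0.3)
print(train)  # [(3, 2, 5, 10)]
```

With test_size=0.3 the threshold falls back to timestamp 20, because a single user starting at 40 is not enough to cover 30% of four users. Everything from timestamp 20 onward then leaves train, which is the drastic reduction the text warns about.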
ColdUserRandomSplitter
- class replay.splitters.log_splitter.ColdUserRandomSplitter(test_size, drop_cold_items=False, drop_cold_users=False)
Test set consists of all actions of randomly chosen users.
- __init__(test_size, drop_cold_items=False, drop_cold_users=False)
- Parameters
test_size (float) – fraction of users to be in test
drop_cold_items (bool) – flag to drop cold items from test
drop_cold_users (bool) – flag to drop cold users from test
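Moving whole users into test can be sketched as follows. This is a rough illustration, not the library's code: the helper cold_user_split is hypothetical, its seed argument is added here only for reproducibility (the documented signature has none), and rounding the number of test users to the nearest integer is an assumption.

```python
import random


def cold_user_split(rows, test_size, seed=None):
    """Hypothetical helper: move all actions of randomly chosen users to test.

    rows: tuples whose first field is the user id.
    """
    rng = random.Random(seed)
    users = sorted({r[0] for r in rows})
    # Assumption: the number of test users is rounded to the nearest integer.
    n_test_users = round(len(users) * test_size)
    test_users = set(rng.sample(users, n_test_users))
    train = [r for r in rows if r[0] not in test_users]
    test = [r for r in rows if r[0] in test_users]
    return train, test


rows = [(1, 1, 1, 20), (1, 2, 2, 40), (2, 3, 3, 20),
        (2, 1, 4, 30), (3, 2, 5, 10), (4, 3, 6, 40)]
train, test = cold_user_split(rows, test_size=0.5, seed=7)
print(len({r[0] for r in test}))  # 2
```

Since every test user is absent from train by construction, such a split measures how a model handles users it has never seen.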