Filters
Select or remove data by some criteria
- replay.filters.filter_between_dates(log, start_date=None, end_date=None, date_column='timestamp')
Select the part of the data within [start_date, end_date).

>>> from datetime import datetime
>>> import pandas as pd
>>> from replay.utils import convert2spark
>>> from replay.filters import filter_between_dates
>>> log_pd = pd.DataFrame({"user_idx": ["u1", "u2", "u2", "u3", "u3", "u3"],
...                        "item_idx": ["i1", "i2", "i3", "i1", "i2", "i3"],
...                        "rel": [1., 0.5, 3, 1, 0, 1],
...                        "timestamp": ["2020-01-01 23:59:59", "2020-02-01",
...                                      "2020-02-01", "2020-01-01 00:04:15",
...                                      "2020-01-02 00:04:14", "2020-01-05 23:59:59"]})
>>> log_pd["timestamp"] = pd.to_datetime(log_pd["timestamp"])
>>> log_sp = convert2spark(log_pd)
>>> log_sp.show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u2|      i3|3.0|2020-02-01 00:00:00|
|      u3|      i1|1.0|2020-01-01 00:04:15|
|      u3|      i2|0.0|2020-01-02 00:04:14|
|      u3|      i3|1.0|2020-01-05 23:59:59|
+--------+--------+---+-------------------+
>>> filter_between_dates(log_sp, start_date="2020-01-01 14:00:00", end_date=datetime(2020, 1, 3, 0, 0, 0)).show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u3|      i2|0.0|2020-01-02 00:04:14|
+--------+--------+---+-------------------+
- Parameters
  - log (DataFrame) – historical DataFrame
  - start_date (Union[str, datetime, None]) – datetime or str with format “yyyy-MM-dd HH:mm:ss”
  - end_date (Union[str, datetime, None]) – datetime or str with format “yyyy-MM-dd HH:mm:ss”
  - date_column (str) – date column
- Return type
  DataFrame
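Both bounds are optional. A minimal usage sketch, reusing log_sp from the example above and assuming that leaving start_date or end_date as None applies no bound on that side; the renamed column "ts" is introduced here purely for illustration:

>>> january = filter_between_dates(log_sp, end_date="2020-02-01 00:00:00")   # no lower bound (assumed behaviour of None)
>>> renamed = log_sp.withColumnRenamed("timestamp", "ts")
>>> late = filter_between_dates(renamed, start_date="2020-01-03 00:00:00", date_column="ts")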
- replay.filters.filter_by_duration(log, duration_days, first=True, date_column='timestamp')
Select the first or last duration_days days from log.

>>> import pandas as pd
>>> from replay.utils import convert2spark
>>> from replay.filters import filter_by_duration
>>> log_pd = pd.DataFrame({"user_idx": ["u1", "u2", "u2", "u3", "u3", "u3"],
...                        "item_idx": ["i1", "i2", "i3", "i1", "i2", "i3"],
...                        "rel": [1., 0.5, 3, 1, 0, 1],
...                        "timestamp": ["2020-01-01 23:59:59", "2020-02-01",
...                                      "2020-02-01", "2020-01-01 00:04:15",
...                                      "2020-01-02 00:04:14", "2020-01-05 23:59:59"]})
>>> log_pd["timestamp"] = pd.to_datetime(log_pd["timestamp"])
>>> log_sp = convert2spark(log_pd)
>>> log_sp.show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u2|      i3|3.0|2020-02-01 00:00:00|
|      u3|      i1|1.0|2020-01-01 00:04:15|
|      u3|      i2|0.0|2020-01-02 00:04:14|
|      u3|      i3|1.0|2020-01-05 23:59:59|
+--------+--------+---+-------------------+
>>> filter_by_duration(log_sp, 1).show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u3|      i1|1.0|2020-01-01 00:04:15|
|      u3|      i2|0.0|2020-01-02 00:04:14|
+--------+--------+---+-------------------+
>>> filter_by_duration(log_sp, 1, first=False).show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u2|      i3|3.0|2020-02-01 00:00:00|
+--------+--------+---+-------------------+
- Parameters
  - log (DataFrame) – historical DataFrame
  - duration_days (int) – length of selected data in days
  - first (bool) – take either the first duration_days days or the last
  - date_column (str) – date column
- Return type
  DataFrame
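A short sketch with log_sp from the example above. It assumes the day window is anchored at the earliest (first=True) or latest (first=False) timestamp of the whole log, which matches the outputs shown but is not stated explicitly; the column name "dt" is hypothetical:

>>> first_week = filter_by_duration(log_sp, duration_days=7)               # first 7 days of the log
>>> last_week = filter_by_duration(log_sp, duration_days=7, first=False)   # last 7 days of the log
>>> by_dt = filter_by_duration(log_sp.withColumnRenamed("timestamp", "dt"), 7, date_column="dt")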
- replay.filters.filter_by_user_duration(log, days=10, first=True, date_col='timestamp', user_col='user_idx')
Get the first or last days of interactions for each user.

>>> import pandas as pd
>>> from replay.utils import convert2spark
>>> from replay.filters import filter_by_user_duration
>>> log_pd = pd.DataFrame({"user_idx": ["u1", "u2", "u2", "u3", "u3", "u3"],
...                        "item_idx": ["i1", "i2", "i3", "i1", "i2", "i3"],
...                        "rel": [1., 0.5, 3, 1, 0, 1],
...                        "timestamp": ["2020-01-01 23:59:59", "2020-02-01",
...                                      "2020-02-01", "2020-01-01 00:04:15",
...                                      "2020-01-02 00:04:14", "2020-01-05 23:59:59"]})
>>> log_pd["timestamp"] = pd.to_datetime(log_pd["timestamp"])
>>> log_sp = convert2spark(log_pd)
>>> log_sp.orderBy('user_idx', 'item_idx').show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u2|      i3|3.0|2020-02-01 00:00:00|
|      u3|      i1|1.0|2020-01-01 00:04:15|
|      u3|      i2|0.0|2020-01-02 00:04:14|
|      u3|      i3|1.0|2020-01-05 23:59:59|
+--------+--------+---+-------------------+
Get first day:
>>> filter_by_user_duration(log_sp, 1, True).orderBy('user_idx', 'item_idx').show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u2|      i3|3.0|2020-02-01 00:00:00|
|      u3|      i1|1.0|2020-01-01 00:04:15|
|      u3|      i2|0.0|2020-01-02 00:04:14|
+--------+--------+---+-------------------+
Get last day:
>>> filter_by_user_duration(log_sp, 1, False).orderBy('user_idx', 'item_idx').show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u2|      i3|3.0|2020-02-01 00:00:00|
|      u3|      i3|1.0|2020-01-05 23:59:59|
+--------+--------+---+-------------------+
- Parameters
  - log (DataFrame) – historical DataFrame
  - days (int) – how many days to return per user
  - first (bool) – take either the first days or the last
  - date_col (str) – date column
  - user_col (str) – user column
- Return type
  DataFrame
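Unlike filter_by_duration, the window here is computed per user, so every user keeps their own first or last days of activity. A minimal sketch with log_sp from the example above (the exact boundary handling of the per-user window is an assumption):

>>> warmup = filter_by_user_duration(log_sp, days=2, first=True)     # each user's first two days
>>> recent = filter_by_user_duration(log_sp, days=2, first=False)    # each user's last two days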
- replay.filters.filter_user_interactions(log, num_interactions=10, first=True, date_col='timestamp', user_col='user_idx', item_col='item_idx')
Get the first or last num_interactions interactions for each user.

>>> import pandas as pd
>>> from replay.utils import convert2spark
>>> from replay.filters import filter_user_interactions
>>> log_pd = pd.DataFrame({"user_idx": ["u1", "u2", "u2", "u3", "u3", "u3"],
...                        "item_idx": ["i1", "i2", "i3", "i1", "i2", "i3"],
...                        "rel": [1., 0.5, 3, 1, 0, 1],
...                        "timestamp": ["2020-01-01 23:59:59", "2020-02-01",
...                                      "2020-02-01", "2020-01-01 00:04:15",
...                                      "2020-01-02 00:04:14", "2020-01-05 23:59:59"]})
>>> log_pd["timestamp"] = pd.to_datetime(log_pd["timestamp"])
>>> log_sp = convert2spark(log_pd)
>>> log_sp.show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u2|      i3|3.0|2020-02-01 00:00:00|
|      u3|      i1|1.0|2020-01-01 00:04:15|
|      u3|      i2|0.0|2020-01-02 00:04:14|
|      u3|      i3|1.0|2020-01-05 23:59:59|
+--------+--------+---+-------------------+
Only first interaction:
>>> filter_user_interactions(log_sp, 1, True).orderBy('user_idx').show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u3|      i1|1.0|2020-01-01 00:04:15|
+--------+--------+---+-------------------+
Only last interaction:
>>> filter_user_interactions(log_sp, 1, False, item_col=None).orderBy('user_idx').show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u3|      i3|1.0|2020-01-05 23:59:59|
+--------+--------+---+-------------------+
>>> filter_user_interactions(log_sp, 1, False).orderBy('user_idx').show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i3|3.0|2020-02-01 00:00:00|
|      u3|      i3|1.0|2020-01-05 23:59:59|
+--------+--------+---+-------------------+
- Parameters
  - log (DataFrame) – historical interactions DataFrame
  - num_interactions (int) – number of interactions to leave per user
  - first (bool) – take either the first num_interactions or the last
  - date_col (str) – date column
  - user_col (str) – user column
  - item_col (Optional[str]) – item column used to help sort simultaneous interactions; if None, it is ignored
- Return type
  DataFrame
- Returns
  filtered DataFrame
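One common use is a leave-last-out split: keep each user's most recent interaction as a hold-out set and subtract it from the log to obtain the training part. This is only a sketch, not part of the API; DataFrame.subtract relies on exact row equality and also drops duplicate rows:

>>> holdout = filter_user_interactions(log_sp, num_interactions=1, first=False)
>>> train = log_sp.subtract(holdout)   # remaining interactions, assuming rows are distinct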
- replay.filters.min_entries(data_frame, num_entries)
Remove users with fewer than num_entries ratings.

>>> import pandas as pd
>>> from replay.filters import min_entries
>>> data_frame = pd.DataFrame({"user_idx": [1, 1, 2]})
>>> min_entries(data_frame, 2).toPandas()
   user_idx
0         1
1         1
- Return type
  DataFrame
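The doctest above passes a pandas DataFrame and gets a Spark DataFrame back; the same call works on the Spark log from the earlier examples, which, judging by the doctest, is expected to contain a user_idx column:

>>> active = min_entries(log_sp, num_entries=2)   # should keep u2 and u3, who have at least two interactions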
- replay.filters.min_rating(data_frame, value, column='relevance')
Remove records whose value in column is less than value.

>>> import pandas as pd
>>> from replay.filters import min_rating
>>> data_frame = pd.DataFrame({"relevance": [1, 5, 3, 4]})
>>> min_rating(data_frame, 3.5).toPandas()
   relevance
0          5
1          4
- Return type
  DataFrame
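Since every filter returns a Spark DataFrame, they compose into simple preprocessing pipelines. A minimal sketch on log_sp from the examples above; note that min_rating's default column is relevance, so the rel column is passed explicitly:

>>> from replay.filters import filter_by_duration, min_rating, min_entries
>>> prepared = filter_by_duration(log_sp, duration_days=30, first=False)   # most recent 30 days of the log
>>> prepared = min_rating(prepared, 0.5, column="rel")                     # drop records rated below 0.5
>>> prepared = min_entries(prepared, 2)                                    # drop users with fewer than 2 records left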