Filters

Select or remove data according to given criteria.

replay.filters.filter_between_dates(log, start_date=None, end_date=None, date_column='timestamp')

Select the part of the data whose date falls in the half-open interval [start_date, end_date).

>>> from datetime import datetime
>>> import pandas as pd
>>> from replay.filters import filter_between_dates
>>> from replay.utils import convert2spark
>>> log_pd = pd.DataFrame({"user_idx": ["u1", "u2", "u2", "u3", "u3", "u3"],
...                     "item_idx": ["i1", "i2","i3", "i1", "i2","i3"],
...                     "rel": [1., 0.5, 3, 1, 0, 1],
...                     "timestamp": ["2020-01-01 23:59:59", "2020-02-01",
...                                   "2020-02-01", "2020-01-01 00:04:15",
...                                   "2020-01-02 00:04:14", "2020-01-05 23:59:59"]},
...             )
>>> log_pd["timestamp"] = pd.to_datetime(log_pd["timestamp"])
>>> log_sp = convert2spark(log_pd)
>>> log_sp.show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u2|      i3|3.0|2020-02-01 00:00:00|
|      u3|      i1|1.0|2020-01-01 00:04:15|
|      u3|      i2|0.0|2020-01-02 00:04:14|
|      u3|      i3|1.0|2020-01-05 23:59:59|
+--------+--------+---+-------------------+
>>> filter_between_dates(log_sp, start_date="2020-01-01 14:00:00", end_date=datetime(2020, 1, 3, 0, 0, 0)).show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u3|      i2|0.0|2020-01-02 00:04:14|
+--------+--------+---+-------------------+
Parameters
  • log (DataFrame) – historical DataFrame

  • start_date (Union[str, datetime, None]) – inclusive lower bound; datetime or str with format “yyyy-MM-dd HH:mm:ss”.

  • end_date (Union[str, datetime, None]) – exclusive upper bound; datetime or str with format “yyyy-MM-dd HH:mm:ss”.

  • date_column (str) – date column

Return type

DataFrame
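
The half-open interval can be illustrated without Spark. The following is a minimal pandas sketch, not the library's implementation: between_dates_pd is a hypothetical helper, and treating a None bound as "no limit on that side" is an assumption based on the default arguments.

```python
from datetime import datetime

import pandas as pd


def between_dates_pd(log, start_date=None, end_date=None, date_column="timestamp"):
    """Hypothetical pandas stand-in: keep rows with start_date <= date < end_date."""
    keep = pd.Series(True, index=log.index)
    if start_date is not None:  # assumption: None leaves the lower bound open
        keep &= log[date_column] >= pd.Timestamp(start_date)
    if end_date is not None:  # assumption: None leaves the upper bound open
        keep &= log[date_column] < pd.Timestamp(end_date)
    return log[keep]


# Same rows as the example log above, without the "rel" column.
log_pd = pd.DataFrame({
    "user_idx": ["u1", "u2", "u2", "u3", "u3", "u3"],
    "item_idx": ["i1", "i2", "i3", "i1", "i2", "i3"],
    "timestamp": pd.to_datetime([
        "2020-01-01 23:59:59", "2020-02-01 00:00:00", "2020-02-01 00:00:00",
        "2020-01-01 00:04:15", "2020-01-02 00:04:14", "2020-01-05 23:59:59",
    ]),
})

# Bounds may be given as str or datetime, mirroring the documented parameter types.
selected = between_dates_pd(log_pd, "2020-01-01 14:00:00", datetime(2020, 1, 3))
print(selected)
```

With these bounds the sketch keeps the same two rows as the Spark example: (u1, i1) and (u3, i2).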

replay.filters.filter_by_duration(log, duration_days, first=True, date_column='timestamp')

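Select the first or last duration_days days of the log. In the examples below, the window is counted from the earliest timestamp in the log (or back from the latest one), not from calendar-day boundaries.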

>>> import pandas as pd
>>> from replay.filters import filter_by_duration
>>> from replay.utils import convert2spark
>>> log_pd = pd.DataFrame({"user_idx": ["u1", "u2", "u2", "u3", "u3", "u3"],
...                     "item_idx": ["i1", "i2","i3", "i1", "i2","i3"],
...                     "rel": [1., 0.5, 3, 1, 0, 1],
...                     "timestamp": ["2020-01-01 23:59:59", "2020-02-01",
...                                   "2020-02-01", "2020-01-01 00:04:15",
...                                   "2020-01-02 00:04:14", "2020-01-05 23:59:59"]},
...             )
>>> log_pd["timestamp"] = pd.to_datetime(log_pd["timestamp"])
>>> log_sp = convert2spark(log_pd)
>>> log_sp.show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u2|      i3|3.0|2020-02-01 00:00:00|
|      u3|      i1|1.0|2020-01-01 00:04:15|
|      u3|      i2|0.0|2020-01-02 00:04:14|
|      u3|      i3|1.0|2020-01-05 23:59:59|
+--------+--------+---+-------------------+
>>> filter_by_duration(log_sp, 1).show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u3|      i1|1.0|2020-01-01 00:04:15|
|      u3|      i2|0.0|2020-01-02 00:04:14|
+--------+--------+---+-------------------+
>>> filter_by_duration(log_sp, 1, first=False).show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u2|      i3|3.0|2020-02-01 00:00:00|
+--------+--------+---+-------------------+
Parameters
  • log (DataFrame) – historical DataFrame

  • duration_days (int) – length of selected data in days

  • first (bool) – if True, take the first duration_days days; otherwise take the last

  • date_column (str) – date column

Return type

DataFrame
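
The example output suggests that the window is anchored at the earliest (or latest) timestamp rather than at midnight: the row at 2020-01-02 00:04:14 is kept because it falls within one day of 2020-01-01 00:04:15. A minimal pandas sketch of that reading follows; by_duration_pd is a hypothetical helper, and the strict inequalities at the window edges are an assumption.

```python
import pandas as pd


def by_duration_pd(log, duration_days, first=True, date_column="timestamp"):
    """Hypothetical pandas stand-in for a global first/last N-day window."""
    window = pd.Timedelta(days=duration_days)
    dates = log[date_column]
    if first:
        # Window starts at the earliest timestamp in the whole log.
        return log[dates < dates.min() + window]
    # Window ends at the latest timestamp in the whole log.
    return log[dates > dates.max() - window]


log_pd = pd.DataFrame({
    "user_idx": ["u1", "u2", "u2", "u3", "u3", "u3"],
    "item_idx": ["i1", "i2", "i3", "i1", "i2", "i3"],
    "timestamp": pd.to_datetime([
        "2020-01-01 23:59:59", "2020-02-01 00:00:00", "2020-02-01 00:00:00",
        "2020-01-01 00:04:15", "2020-01-02 00:04:14", "2020-01-05 23:59:59",
    ]),
})

first_day = by_duration_pd(log_pd, 1)
last_day = by_duration_pd(log_pd, 1, first=False)
print(first_day)
print(last_day)
```

On the sample log this reproduces the rows shown in the Spark examples above.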

replay.filters.filter_by_user_duration(log, days=10, first=True, date_col='timestamp', user_col='user_idx')

Get the first or last days of each user's interactions.

>>> import pandas as pd
>>> from replay.filters import filter_by_user_duration
>>> from replay.utils import convert2spark
>>> log_pd = pd.DataFrame({"user_idx": ["u1", "u2", "u2", "u3", "u3", "u3"],
...                     "item_idx": ["i1", "i2","i3", "i1", "i2","i3"],
...                     "rel": [1., 0.5, 3, 1, 0, 1],
...                     "timestamp": ["2020-01-01 23:59:59", "2020-02-01",
...                                   "2020-02-01", "2020-01-01 00:04:15",
...                                   "2020-01-02 00:04:14", "2020-01-05 23:59:59"]},
...             )
>>> log_pd["timestamp"] = pd.to_datetime(log_pd["timestamp"])
>>> log_sp = convert2spark(log_pd)
>>> log_sp.orderBy('user_idx', 'item_idx').show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u2|      i3|3.0|2020-02-01 00:00:00|
|      u3|      i1|1.0|2020-01-01 00:04:15|
|      u3|      i2|0.0|2020-01-02 00:04:14|
|      u3|      i3|1.0|2020-01-05 23:59:59|
+--------+--------+---+-------------------+

Get first day:

>>> filter_by_user_duration(log_sp, 1, True).orderBy('user_idx', 'item_idx').show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u2|      i3|3.0|2020-02-01 00:00:00|
|      u3|      i1|1.0|2020-01-01 00:04:15|
|      u3|      i2|0.0|2020-01-02 00:04:14|
+--------+--------+---+-------------------+

Get last day:

>>> filter_by_user_duration(log_sp, 1, False).orderBy('user_idx', 'item_idx').show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u2|      i3|3.0|2020-02-01 00:00:00|
|      u3|      i3|1.0|2020-01-05 23:59:59|
+--------+--------+---+-------------------+
Parameters
  • log (DataFrame) – historical DataFrame

  • days (int) – how many days to return per user

  • first (bool) – if True, take each user's first days; otherwise take the last

  • date_col (str) – date column

  • user_col (str) – user column

Return type

DataFrame
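
The per-user variant differs from filter_by_duration only in where the window is anchored: at each user's own earliest or latest timestamp. A minimal pandas sketch under that reading follows; by_user_duration_pd is a hypothetical helper, and the strict inequalities at the window edges are an assumption.

```python
import pandas as pd


def by_user_duration_pd(log, days=10, first=True,
                        date_col="timestamp", user_col="user_idx"):
    """Hypothetical pandas stand-in for a per-user first/last N-day window."""
    window = pd.Timedelta(days=days)
    per_user = log.groupby(user_col)[date_col]
    if first:
        # Broadcast each user's earliest timestamp back onto that user's rows.
        return log[log[date_col] < per_user.transform("min") + window]
    # Broadcast each user's latest timestamp back onto that user's rows.
    return log[log[date_col] > per_user.transform("max") - window]


log_pd = pd.DataFrame({
    "user_idx": ["u1", "u2", "u2", "u3", "u3", "u3"],
    "item_idx": ["i1", "i2", "i3", "i1", "i2", "i3"],
    "timestamp": pd.to_datetime([
        "2020-01-01 23:59:59", "2020-02-01 00:00:00", "2020-02-01 00:00:00",
        "2020-01-01 00:04:15", "2020-01-02 00:04:14", "2020-01-05 23:59:59",
    ]),
})

first_day = by_user_duration_pd(log_pd, 1, True)
last_day = by_user_duration_pd(log_pd, 1, False)
print(first_day)
print(last_day)
```

Only user u3 has interactions spread over more than one day, so u3 is the only user whose rows are trimmed in either direction.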

replay.filters.filter_user_interactions(log, num_interactions=10, first=True, date_col='timestamp', user_col='user_idx', item_col='item_idx')

Get the first or last num_interactions interactions for each user.

>>> import pandas as pd
>>> from replay.filters import filter_user_interactions
>>> from replay.utils import convert2spark
>>> log_pd = pd.DataFrame({"user_idx": ["u1", "u2", "u2", "u3", "u3", "u3"],
...                     "item_idx": ["i1", "i2","i3", "i1", "i2","i3"],
...                     "rel": [1., 0.5, 3, 1, 0, 1],
...                     "timestamp": ["2020-01-01 23:59:59", "2020-02-01",
...                                   "2020-02-01", "2020-01-01 00:04:15",
...                                   "2020-01-02 00:04:14", "2020-01-05 23:59:59"]},
...             )
>>> log_pd["timestamp"] = pd.to_datetime(log_pd["timestamp"])
>>> log_sp = convert2spark(log_pd)
>>> log_sp.show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u2|      i3|3.0|2020-02-01 00:00:00|
|      u3|      i1|1.0|2020-01-01 00:04:15|
|      u3|      i2|0.0|2020-01-02 00:04:14|
|      u3|      i3|1.0|2020-01-05 23:59:59|
+--------+--------+---+-------------------+

Only first interaction:

>>> filter_user_interactions(log_sp, 1, True).orderBy('user_idx').show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u3|      i1|1.0|2020-01-01 00:04:15|
+--------+--------+---+-------------------+

Only last interaction:

>>> filter_user_interactions(log_sp, 1, False, item_col=None).orderBy('user_idx').show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i2|0.5|2020-02-01 00:00:00|
|      u3|      i3|1.0|2020-01-05 23:59:59|
+--------+--------+---+-------------------+
>>> filter_user_interactions(log_sp, 1, False).orderBy('user_idx').show()
+--------+--------+---+-------------------+
|user_idx|item_idx|rel|          timestamp|
+--------+--------+---+-------------------+
|      u1|      i1|1.0|2020-01-01 23:59:59|
|      u2|      i3|3.0|2020-02-01 00:00:00|
|      u3|      i3|1.0|2020-01-05 23:59:59|
+--------+--------+---+-------------------+
Parameters
  • log (DataFrame) – historical interactions DataFrame

  • num_interactions (int) – number of interactions to leave per user

  • first (bool) – if True, take the first num_interactions; otherwise take the last

  • date_col (str) – date column

  • user_col (str) – user column

  • item_col (Optional[str]) – item column to help sort simultaneous interactions. If None, it is ignored.

Return type

DataFrame

Returns

filtered DataFrame
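
The role of item_col as a tie-breaker can be seen in a small pandas sketch. This is an illustration under assumptions, not the library's code: user_interactions_pd is a hypothetical helper that sorts by date (and by item when item_col is given), reverses the order for first=False, and keeps the top rows per user.

```python
import pandas as pd


def user_interactions_pd(log, num_interactions=10, first=True,
                         date_col="timestamp", user_col="user_idx",
                         item_col="item_idx"):
    """Hypothetical pandas stand-in: first/last N interactions per user."""
    sort_cols = [date_col]
    if item_col is not None:
        sort_cols.append(item_col)  # breaks ties between simultaneous interactions
    # Ascending order for the first interactions, descending for the last ones.
    ordered = log.sort_values(sort_cols, ascending=first)
    # head() keeps the top rows of each user in the sorted order;
    # sort_index() then restores the original row order.
    return ordered.groupby(user_col).head(num_interactions).sort_index()


log_pd = pd.DataFrame({
    "user_idx": ["u1", "u2", "u2", "u3", "u3", "u3"],
    "item_idx": ["i1", "i2", "i3", "i1", "i2", "i3"],
    "timestamp": pd.to_datetime([
        "2020-01-01 23:59:59", "2020-02-01 00:00:00", "2020-02-01 00:00:00",
        "2020-01-01 00:04:15", "2020-01-02 00:04:14", "2020-01-05 23:59:59",
    ]),
})

first_one = user_interactions_pd(log_pd, 1, True)
last_one = user_interactions_pd(log_pd, 1, False)
print(first_one)
print(last_one)
```

User u2 has two interactions with the same timestamp. With item_col set, the descending sort picks i3 as the last interaction and the ascending sort picks i2 as the first, matching the examples above. Without item_col the choice between the two tied rows is not determined by the sort keys.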

replay.filters.min_entries(data_frame, num_entries)

Remove users with fewer than num_entries ratings.

>>> import pandas as pd
>>> from replay.filters import min_entries
>>> data_frame = pd.DataFrame({"user_idx": [1, 1, 2]})
>>> min_entries(data_frame, 2).toPandas()
   user_idx
0         1
1         1
Return type

DataFrame
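
The same filtering can be expressed directly in pandas. This is a minimal sketch rather than the library's implementation: min_entries_pd and its user_col parameter are hypothetical, and keeping users with exactly num_entries rows follows from the "fewer than" wording above.

```python
import pandas as pd


def min_entries_pd(data_frame, num_entries, user_col="user_idx"):
    """Hypothetical pandas stand-in: drop users with fewer than num_entries rows."""
    # Count rows per user, then map each row to its user's count.
    counts = data_frame[user_col].map(data_frame[user_col].value_counts())
    return data_frame[counts >= num_entries]


data_frame = pd.DataFrame({"user_idx": [1, 1, 2]})
kept = min_entries_pd(data_frame, 2)
print(kept)
```

User 2 has a single row and is removed, while both rows of user 1 are kept.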

replay.filters.min_rating(data_frame, value, column='relevance')

Remove records whose value in column is less than value.

>>> import pandas as pd
>>> from replay.filters import min_rating
>>> data_frame = pd.DataFrame({"relevance": [1, 5, 3, 4]})
>>> min_rating(data_frame, 3.5).toPandas()
   relevance
0          5
1          4
Return type

DataFrame
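
For comparison, a threshold filter of this kind is a single boolean mask in pandas. The sketch below uses a hypothetical helper, min_rating_pd. The description does not say whether rows exactly equal to value are kept, so the >= comparison here is an assumption.

```python
import pandas as pd


def min_rating_pd(data_frame, value, column="relevance"):
    """Hypothetical pandas stand-in: drop rows whose column value is below value."""
    # Assumption: rows equal to the threshold are kept.
    return data_frame[data_frame[column] >= value]


data_frame = pd.DataFrame({"relevance": [1, 5, 3, 4]})
kept = min_rating_pd(data_frame, 3.5)
print(kept)
```

With a threshold of 3.5 no row sits on the boundary, so the result matches the example above: the rows with relevance 5 and 4.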