Metrics

Most metrics require a DataFrame with recommendations and a DataFrame with ground truth values, i.e. the items each user actually interacted with.

  • recommendations (Union[pandas.DataFrame, spark.DataFrame]):

    predictions of a recommender system, DataFrame with columns [user_id, item_id, relevance]

  • ground_truth (Union[pandas.DataFrame, spark.DataFrame]):

    test data, DataFrame with columns [user_id, item_id, timestamp, relevance]

Every metric is calculated using the top K items for each user. It is also possible to calculate metrics for several values of K simultaneously; in that case the result is a dictionary rather than a single number.

  • k (Union[Iterable[int], int]):

    a single number or a list specifying the truncation length of the recommendation list for each user
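
A minimal sketch of the two call patterns, using HitRate (defined below) on toy data; the exact formatting of the returned dictionary is an assumption:

>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1, 2, 2],
...                      "item_idx": [1, 2, 1, 3],
...                      "relevance": [0.9, 0.8, 0.7, 0.6]})
>>> true = pd.DataFrame({"user_idx": [1, 2], "item_idx": [2, 4], "relevance": [1, 1]})
>>> HitRate()(pred, true, 2)        # a single K returns a number
0.5
>>> HitRate()(pred, true, [1, 2])   # a list of K returns a dictionary keyed by K
{1: 0.0, 2: 0.5}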

By default metrics are averaged over users, but you can alternatively use the metric.median method. You can also get the lower bound of a confidence interval for a given alpha with conf_interval.
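
A hedged sketch of these alternatives, reusing pred and true from above; it assumes median and conf_interval accept the same arguments as the plain call, with conf_interval additionally taking alpha:

>>> hit_rate = HitRate()
>>> mean_value = hit_rate(pred, true, 2)            # mean over users (default)
>>> median_value = hit_rate.median(pred, true, 2)   # median over users
>>> lower_bound = hit_rate.conf_interval(pred, true, 2, alpha=0.95)  # lower bound of the confidence interval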

Diversity metrics require extra parameters at the initialization stage, but do not use the ground_truth parameter.

You can also add new metrics.

HitRate

class replay.metrics.HitRate

Percentage of users that have at least one correctly recommended item among the top-k.

\[HitRate@K(i) = \max_{j \in [1..K]}\mathbb{1}_{r_{ij}}\]
\[HitRate@K = \frac {\sum_{i=1}^{N}HitRate@K(i)}{N}\]

\(\mathbb{1}_{r_{ij}}\) – indicator function stating that user \(i\) interacted with item \(j\)

Precision

class replay.metrics.Precision

Mean percentage of relevant items among top K recommendations.

\[Precision@K(i) = \frac {\sum_{j=1}^{K}\mathbb{1}_{r_{ij}}}{K}\]
\[Precision@K = \frac {\sum_{i=1}^{N}Precision@K(i)}{N}\]

\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)
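
A small illustration of the definition above on toy data; the value follows directly from the formula:

>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1], "item_idx": [1, 2], "relevance": [0.9, 0.8]})
>>> true = pd.DataFrame({"user_idx": [1], "item_idx": [2], "relevance": [1]})
>>> Precision()(pred, true, 2)
0.5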

MAP

class replay.metrics.MAP

Mean Average Precision – average the Precision at relevant positions for each user, and then calculate the mean across all users.

\[ \begin{align}\begin{aligned}&AP@K(i) = \frac 1K \sum_{j=1}^{K}\mathbb{1}_{r_{ij}}Precision@j(i)\\&MAP@K = \frac {\sum_{i=1}^{N}AP@K(i)}{N}\end{aligned}\end{align} \]

\(\mathbb{1}_{r_{ij}}\) – indicator function showing if user \(i\) interacted with item \(j\)
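
An illustration on the same toy data as for Precision: only the second recommended item is relevant, so AP@2 = (1/2) · Precision@2 = 0.25:

>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1], "item_idx": [1, 2], "relevance": [0.9, 0.8]})
>>> true = pd.DataFrame({"user_idx": [1], "item_idx": [2], "relevance": [1]})
>>> MAP()(pred, true, 2)
0.25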

Recall

class replay.metrics.Recall

Mean percentage of relevant items that were shown among the top K recommendations.

\[Recall@K(i) = \frac {\sum_{j=1}^{K}\mathbb{1}_{r_{ij}}}{|Rel_i|}\]
\[Recall@K = \frac {\sum_{i=1}^{N}Recall@K(i)}{N}\]

\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)

\(|Rel_i|\) – the number of relevant items for user \(i\)
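
An illustration: the user has two relevant items, one of which appears among the top-2 recommendations:

>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1], "item_idx": [1, 2], "relevance": [0.9, 0.8]})
>>> true = pd.DataFrame({"user_idx": [1, 1], "item_idx": [2, 3], "relevance": [1, 1]})
>>> Recall()(pred, true, 2)
0.5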

ROC-AUC

class replay.metrics.RocAuc

Receiver Operating Characteristic / Area Under the Curve is an aggregated performance measure that depends only on the order of recommended items. It can be interpreted as the fraction of object pairs (an object of class 1 and an object of class 0) that were correctly ordered by the model. The bigger the value of AUC, the better the classification model.

\[ROCAUC@K(i) = \frac {\sum_{s=1}^{K}\sum_{t=1}^{K} \mathbb{1}_{r_{si}<r_{ti}} \mathbb{1}_{gt_{si}<gt_{ti}}} {\sum_{s=1}^{K}\sum_{t=1}^{K} \mathbb{1}_{gt_{si}<gt_{ti}}}\]

\(\mathbb{1}_{r_{si}<r_{ti}}\) – indicator function showing that the recommendation score of user \(i\) for item \(t\) is higher than for item \(s\)

\(\mathbb{1}_{gt_{si}<gt_{ti}}\) – indicator function showing that user \(i\) values item \(t\) more than item \(s\).

The metric is averaged over all users.

\[ROCAUC@K = \frac {\sum_{i=1}^{N}ROCAUC@K(i)}{N}\]
>>> import pandas as pd
>>> true=pd.DataFrame({"user_idx": 1,
...                    "item_idx": [4, 5, 6],
...                    "relevance": [1, 1, 1]})
>>> pred=pd.DataFrame({"user_idx": 1,
...                    "item_idx": [1, 2, 3, 4, 5, 6, 7],
...                    "relevance": [0.5, 0.1, 0.25, 0.6, 0.2, 0.3, 0]})
>>> roc = RocAuc()
>>> roc(pred, true, 7)
0.75

MRR

class replay.metrics.MRR

Mean Reciprocal Rank – Reciprocal Rank is the inverse position of the first relevant item among the top-k recommendations, \(\frac {1}{rank_i}\). This value is averaged over all users.

>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1, 1], "item_idx": [3, 2, 1], "relevance": [5 ,5, 5]})
>>> true = pd.DataFrame({"user_idx": [1, 1, 1], "item_idx": [2, 4, 5], "relevance": [5, 5, 5]})
>>> MRR()(pred, true, 3)
0.5
>>> MRR()(pred, true, 1)
0.0
>>> MRR()(true, pred, 1)
1.0

NDCG

class replay.metrics.NDCG

Normalized Discounted Cumulative Gain is a metric that takes into account positions of relevant items.

This is the binary version: it only takes into account whether the item was consumed or not; the relevance value is ignored.

\[DCG@K(i) = \sum_{j=1}^{K}\frac{\mathbb{1}_{r_{ij}}}{\log_2 (j+1)}\]

\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)

To get from \(DCG\) to \(nDCG\) we calculate the biggest possible value of DCG for user \(i\) and recommendation length \(K\).

\[IDCG@K(i) = max(DCG@K(i)) = \sum_{j=1}^{K}\frac{\mathbb{1}_{j\le|Rel_i|}}{\log_2 (j+1)}\]
\[nDCG@K(i) = \frac {DCG@K(i)}{IDCG@K(i)}\]

\(|Rel_i|\) – number of relevant items for user \(i\)

The metric is averaged over users.

\[nDCG@K = \frac {\sum_{i=1}^{N}nDCG@K(i)}{N}\]
>>> import pandas as pd
>>> pred=pd.DataFrame({"user_idx": [1, 1, 2, 2],
...                    "item_idx": [4, 5, 6, 7],
...                    "relevance": [1, 1, 1, 1]})
>>> true=pd.DataFrame({"user_idx": [1, 1, 1, 1, 1, 2],
...                    "item_idx": [1, 2, 3, 4, 5, 8],
...                    "relevance": [0.5, 0.1, 0.25, 0.6, 0.2, 0.3]})
>>> ndcg = NDCG()
>>> ndcg(pred, true, 2)
0.5

Surprisal

class replay.metrics.Surprisal(log)

Measures how surprising (rare) the items in the recommendations are.

\[\textit{Self-Information}(j)= -\log_2 \frac {u_j}{N}\]

\(u_j\) – number of users that interacted with item \(j\). Cold items are treated as if they were rated by one user, so if they appear in recommendations they are considered completely unexpected.

The metric is normalized: the surprisal of item \(j\) is

\[Surprisal(j)= \frac {\textit{Self-Information}(j)}{\log_2 N}\]

Recommendation list surprisal is the average surprisal of items in it.

\[Surprisal@K(i) = \frac {\sum_{j=1}^{K}Surprisal(j)} {K}\]

The final metric is averaged over users.

\[Surprisal@K = \frac {\sum_{i=1}^{N}Surprisal@K(i)}{N}\]
__init__(log)

Calculates self-information for each item in the log.

Parameters

log (Union[pandas.DataFrame, spark.DataFrame]) – historical data
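
A hedged sketch (assuming a Spark session is already set up, as in the Unexpectedness example below): with two users in the log, item 1 was seen by both users (surprisal 0) and item 2 by only one user (surprisal 1), so the average for the recommendation list is 0.5.

>>> import pandas as pd
>>> log = pd.DataFrame({"user_idx": [1, 2, 2],
...                     "item_idx": [1, 1, 2],
...                     "relevance": [1, 1, 1],
...                     "timestamp": [1, 1, 1]})
>>> recs = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [1, 2],
...                      "relevance": [0.9, 0.8]})
>>> metric = Surprisal(log)
>>> round(metric(recs, 2), 2)
0.5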

Unexpectedness

class replay.metrics.Unexpectedness(pred)

Fraction of recommended items that are not present in some baseline recommendations.

>>> import pandas as pd
>>> from replay.session_handler import get_spark_session, State
>>> spark = get_spark_session(1, 1)
>>> state = State(spark)
>>> log = pd.DataFrame({"user_idx": [1, 1, 1], "item_idx": [1, 2, 3], "relevance": [5, 5, 5], "timestamp": [1, 1, 1]})
>>> recs = pd.DataFrame({"user_idx": [1, 1, 1], "item_idx": [0, 0, 1], "relevance": [5, 5, 5], "timestamp": [1, 1, 1]})
>>> metric = Unexpectedness(log)
>>> round(metric(recs, 3), 2)
0.67
__init__(pred)
Parameters

pred (Union[pandas.DataFrame, spark.DataFrame]) – baseline model predictions to compare against

Coverage

class replay.metrics.Coverage(log)

Metric calculation is as follows:

  • take K recommendations with the biggest relevance for each user_id

  • count the number of distinct item_id in these recommendations

  • divide it by the number of items in the whole dataset

__init__(log)
Parameters

log (Union[pandas.DataFrame, spark.DataFrame]) – pandas or Spark DataFrame. It is important for the log to contain all available items.
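
A hedged sketch, assuming Coverage is called like the other log-based metrics (recommendations plus K): the log contains four distinct items and the top-2 recommendations cover two of them.

>>> import pandas as pd
>>> log = pd.DataFrame({"user_idx": [1, 1, 2, 2],
...                     "item_idx": [1, 2, 3, 4],
...                     "relevance": [1, 1, 1, 1],
...                     "timestamp": [1, 1, 1, 1]})
>>> recs = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [1, 2],
...                      "relevance": [0.9, 0.8]})
>>> metric = Coverage(log)
>>> round(metric(recs, 2), 2)
0.5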


Custom Metric

Your metric should inherit from the Metric class and implement the following methods:

  • __init__

  • _get_enriched_recommendations

  • _get_metric_value_by_user

get_enriched_recommendations is already implemented, but you can override it if your metric requires it. _get_metric_value_by_user must be implemented for every metric because this is where the actual calculation happens.
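
A minimal sketch of a custom metric, assuming pred and ground_truth arrive as per-user item lists produced by get_enriched_recommendations (with pred ordered by relevance); the class name and logic are illustrative only:

>>> from replay.metrics.base_metric import Metric
>>> class SimpleHit(Metric):
...     """1.0 if at least one of the top-k recommended items is relevant, else 0.0."""
...     @staticmethod
...     def _get_metric_value_by_user(k, pred, ground_truth) -> float:
...         # pred and ground_truth are assumed to be item lists for a single user
...         if len(pred) == 0 or len(ground_truth) == 0:
...             return 0.0
...         # keep the first k recommended items and check for an intersection
...         return float(len(set(pred[:k]) & set(ground_truth)) > 0)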

replay.metrics.base_metric.get_enriched_recommendations(recommendations, ground_truth)

Merge recommendations and ground truth into a single DataFrame and aggregate items into lists so that each user has only one record.

Parameters
  • recommendations (Union[pandas.DataFrame, spark.DataFrame]) – recommendation list

  • ground_truth (Union[pandas.DataFrame, spark.DataFrame]) – test data

Return type

DataFrame

Returns

[user_id, pred, ground_truth]

class replay.metrics.base_metric.Metric

Base metric class

abstract static _get_metric_value_by_user(k, pred, ground_truth)

Metric calculation for one user.

Parameters
  • k – depth cut-off

  • pred – recommendations

  • ground_truth – test data

Return type

float

Returns

metric value for current user

class replay.metrics.base_metric.RecOnlyMetric(log, *args, **kwargs)

Base class for metrics that do not need holdout data