Metrics
Most metrics require a dataframe with recommendations and a dataframe with ground truth values, that is, the items each user actually interacted with:
- recommendations (Union[pandas.DataFrame, spark.DataFrame]):
predictions of a recommender system, DataFrame with columns
[user_id, item_id, relevance]
- ground_truth (Union[pandas.DataFrame, spark.DataFrame]):
test data, DataFrame with columns
[user_id, item_id, timestamp, relevance]
Every metric is calculated using the top K items for each user.
It is also possible to calculate metrics for multiple values of K simultaneously; in this case the result is a dictionary rather than a single number.
- k (Union[Iterable[int], int]):
a single number or a list specifying the truncation length of the recommendation list for each user
By default metrics are averaged over users, but you can use the metric.median method instead.
You can also get the lower bound of the confidence interval for a given alpha with conf_interval.
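For example, a minimal sketch of the multi-K call (hypothetical data; it assumes that recommendations are ranked by relevance and that the resulting dictionary is keyed by K):

>>> import pandas as pd
>>> from replay.metrics import HitRate
>>> pred = pd.DataFrame({"user_idx": [1, 1, 1],
...                      "item_idx": [1, 2, 3],
...                      "relevance": [0.9, 0.8, 0.7]})
>>> true = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [3, 5],
...                      "relevance": [1, 1]})
>>> HitRate()(pred, true, [1, 3])
{1: 0.0, 3: 1.0}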
Diversity metrics require extra parameters at initialization, but do not use the ground_truth parameter.
You can also add new metrics.
HitRate
- class replay.metrics.HitRate
- Percentage of users that have at least one
correctly recommended item among top-k.
\[HitRate@K(i) = \max_{j \in [1..K]}\mathbb{1}_{r_{ij}}\]
\[HitRate@K = \frac {\sum_{i=1}^{N}HitRate@K(i)}{N}\]
\(\mathbb{1}_{r_{ij}}\) – indicator function stating that user \(i\) interacted with item \(j\)
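A hypothetical doctest sketch; the values follow from the definition above, assuming recommendations are ranked by relevance:

>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1, 1],
...                      "item_idx": [1, 2, 3],
...                      "relevance": [0.9, 0.8, 0.7]})
>>> true = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [3, 5],
...                      "relevance": [1, 1]})
>>> HitRate()(pred, true, 2)
0.0
>>> HitRate()(pred, true, 3)
1.0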
Precision
- class replay.metrics.Precision
Mean percentage of relevant items among top K recommendations.
\[Precision@K(i) = \frac {\sum_{j=1}^{K}\mathbb{1}_{r_{ij}}}{K}\]
\[Precision@K = \frac {\sum_{i=1}^{N}Precision@K(i)}{N}\]
\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)
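A hypothetical doctest sketch; the value follows from the formula above (one of the top 2 recommended items is relevant):

>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1, 1],
...                      "item_idx": [1, 2, 3],
...                      "relevance": [0.9, 0.8, 0.7]})
>>> true = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [1, 3],
...                      "relevance": [1, 1]})
>>> Precision()(pred, true, 2)
0.5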
MAP
- class replay.metrics.MAP
- Mean Average Precision – average the Precision at relevant positions for each user, and then calculate the mean across all users.
\[ \begin{align}\begin{aligned}&AP@K(i) = \frac 1K \sum_{j=1}^{K}\mathbb{1}_{r_{ij}}Precision@j(i)\\&MAP@K = \frac {\sum_{i=1}^{N}AP@K(i)}{N}\end{aligned}\end{align} \]
\(\mathbb{1}_{r_{ij}}\) – indicator function showing if user \(i\) interacted with item \(j\)
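A hypothetical doctest sketch; the value follows from the formula above (the only relevant recommendation within the top 2 is at position 1):

>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1, 1],
...                      "item_idx": [1, 2, 3],
...                      "relevance": [0.9, 0.8, 0.7]})
>>> true = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [1, 3],
...                      "relevance": [1, 1]})
>>> MAP()(pred, true, 2)
0.5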
Recall
- class replay.metrics.Recall
Mean percentage of relevant items that were shown among the top K recommendations.
\[Recall@K(i) = \frac {\sum_{j=1}^{K}\mathbb{1}_{r_{ij}}}{|Rel_i|}\]
\[Recall@K = \frac {\sum_{i=1}^{N}Recall@K(i)}{N}\]
\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)
\(|Rel_i|\) – the number of relevant items for user \(i\)
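A hypothetical doctest sketch; the value follows from the formula above (one of the two relevant items appears in the top 2):

>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1, 1],
...                      "item_idx": [1, 2, 3],
...                      "relevance": [0.9, 0.8, 0.7]})
>>> true = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [1, 3],
...                      "relevance": [1, 1]})
>>> Recall()(pred, true, 2)
0.5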
ROC-AUC
- class replay.metrics.RocAuc
Receiver Operating Characteristic/Area Under the Curve is an aggregated performance measure that depends only on the order of recommended items. It can be interpreted as the fraction of object pairs (object of class 1, object of class 0) that are ordered correctly by the model. The bigger the AUC value, the better the classification model.
\[ROCAUC@K(i) = \frac {\sum_{s=1}^{K}\sum_{t=1}^{K} \mathbb{1}_{r_{si}<r_{ti}} \mathbb{1}_{gt_{si}<gt_{ti}}} {\sum_{s=1}^{K}\sum_{t=1}^{K} \mathbb{1}_{gt_{si}<gt_{ti}}}\]
\(\mathbb{1}_{r_{si}<r_{ti}}\) – indicator function showing that the recommendation score of item \(s\) is bigger than that of item \(t\) for user \(i\)
\(\mathbb{1}_{gt_{si}<gt_{ti}}\) – indicator function showing that user \(i\) values item \(s\) more than item \(t\).
Metric is averaged by all users.
\[ROCAUC@K = \frac {\sum_{i=1}^{N}ROCAUC@K(i)}{N}\]
>>> import pandas as pd
>>> true = pd.DataFrame({"user_idx": 1,
...                      "item_idx": [4, 5, 6],
...                      "relevance": [1, 1, 1]})
>>> pred = pd.DataFrame({"user_idx": 1,
...                      "item_idx": [1, 2, 3, 4, 5, 6, 7],
...                      "relevance": [0.5, 0.1, 0.25, 0.6, 0.2, 0.3, 0]})
>>> roc = RocAuc()
>>> roc(pred, true, 7)
0.75
MRR
- class replay.metrics.MRR
Mean Reciprocal Rank – Reciprocal Rank is the inverse position of the first relevant item among top-k recommendations, \(\frac {1}{rank_i}\). This value is averaged by all users.
>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1, 1], "item_idx": [3, 2, 1], "relevance": [5, 5, 5]})
>>> true = pd.DataFrame({"user_idx": [1, 1, 1], "item_idx": [2, 4, 5], "relevance": [5, 5, 5]})
>>> MRR()(pred, true, 3)
0.5
>>> MRR()(pred, true, 1)
0.0
>>> MRR()(true, pred, 1)
1.0
NDCG
- class replay.metrics.NDCG
Normalized Discounted Cumulative Gain is a metric that takes into account positions of relevant items.
This is the binary version: it only takes into account whether the item was consumed or not; the relevance value is ignored.
\[DCG@K(i) = \sum_{j=1}^{K}\frac{\mathbb{1}_{r_{ij}}}{\log_2 (j+1)}\]
\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)
To get from \(DCG\) to \(nDCG\) we calculate the biggest possible value of DCG for user \(i\) and recommendation length \(K\).
\[IDCG@K(i) = max(DCG@K(i)) = \sum_{j=1}^{K}\frac{\mathbb{1}_{j\le|Rel_i|}}{\log_2 (j+1)}\]
\[nDCG@K(i) = \frac {DCG@K(i)}{IDCG@K(i)}\]
\(|Rel_i|\) – number of relevant items for user \(i\)
Metric is averaged by users.
\[nDCG@K = \frac {\sum_{i=1}^{N}nDCG@K(i)}{N}\]
>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1, 2, 2],
...                      "item_idx": [4, 5, 6, 7],
...                      "relevance": [1, 1, 1, 1]})
>>> true = pd.DataFrame({"user_idx": [1, 1, 1, 1, 1, 2],
...                      "item_idx": [1, 2, 3, 4, 5, 8],
...                      "relevance": [0.5, 0.1, 0.25, 0.6, 0.2, 0.3]})
>>> ndcg = NDCG()
>>> ndcg(pred, true, 2)
0.5
Surprisal
- class replay.metrics.Surprisal(log)
Measures how many surprising rare items are present in recommendations.
\[\textit{Self-Information}(j) = -\log_2 \frac {u_j}{N}\]
\(u_j\) – number of users that interacted with item \(j\). Cold items are treated as if they were rated by one user, so their appearance in recommendations is considered completely unexpected.
Metric is normalized.
Surprisal for item \(j\) is
\[Surprisal(j) = \frac {\textit{Self-Information}(j)}{\log_2 N}\]
Recommendation list surprisal is the average surprisal of the items in it.
\[Surprisal@K(i) = \frac {\sum_{j=1}^{K}Surprisal(j)} {K}\]
The final metric is averaged by users.
\[Surprisal@K = \frac {\sum_{i=1}^{N}Surprisal@K(i)}{N}\]
- __init__(log)
Calculates self-information for each item.
- Parameters
log (Union[pandas.DataFrame, spark.DataFrame]) – historical data
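A hypothetical sketch, assuming that Surprisal, like Unexpectedness below, is called with only the recommendations and K; the value follows from the formulas above:

>>> import pandas as pd
>>> log = pd.DataFrame({"user_idx": [1, 2, 2],
...                     "item_idx": [1, 1, 2],
...                     "relevance": [1, 1, 1],
...                     "timestamp": [1, 1, 1]})
>>> recs = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [1, 2],
...                      "relevance": [0.9, 0.8]})
>>> # item 1 was seen by both users (surprisal 0), item 2 by one of the two (surprisal 1)
>>> Surprisal(log)(recs, 2)
0.5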
Unexpectedness
- class replay.metrics.Unexpectedness(pred)
Fraction of recommended items that are not present in some baseline recommendations.
>>> import pandas as pd
>>> from replay.session_handler import get_spark_session, State
>>> spark = get_spark_session(1, 1)
>>> state = State(spark)
>>> log = pd.DataFrame({"user_idx": [1, 1, 1], "item_idx": [1, 2, 3], "relevance": [5, 5, 5], "timestamp": [1, 1, 1]})
>>> recs = pd.DataFrame({"user_idx": [1, 1, 1], "item_idx": [0, 0, 1], "relevance": [5, 5, 5], "timestamp": [1, 1, 1]})
>>> metric = Unexpectedness(log)
>>> round(metric(recs, 3), 2)
0.67
- __init__(pred)
- Parameters
pred (Union[pandas.DataFrame, spark.DataFrame]) – model predictions
Coverage
- class replay.metrics.Coverage(log)
Metric calculation is as follows:
- take K recommendations with the biggest relevance for each user_id
- count the number of distinct item_id in these recommendations
- divide it by the number of items in the whole data set
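A hypothetical sketch, assuming that Coverage, like the other metrics initialized with a log, is called with only the recommendations and K (2 distinct recommended items out of 4 items in the log):

>>> import pandas as pd
>>> log = pd.DataFrame({"user_idx": [1, 1, 2, 2],
...                     "item_idx": [1, 2, 3, 4],
...                     "relevance": [1, 1, 1, 1],
...                     "timestamp": [1, 1, 1, 1]})
>>> recs = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [1, 2],
...                      "relevance": [0.9, 0.8]})
>>> Coverage(log)(recs, 2)
0.5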
- __init__(log)
- Parameters
log (Union[pandas.DataFrame, spark.DataFrame]) – pandas or Spark DataFrame. It is important for log to contain all available items.
Custom Metric
Your metric should inherit from the Metric class and implement the following methods:
__init__
_get_enriched_recommendations
_get_metric_value_by_user
get_enriched_recommendations is already implemented, but you can override it if your metric requires it.
_get_metric_value_by_user
is required for every metric because this is where the actual calculations happen.
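A minimal sketch of a custom metric (a hypothetical FirstHit metric; it assumes, as described for _get_metric_value_by_user below, that pred and ground_truth arrive as per-user item lists produced by get_enriched_recommendations):

from replay.metrics.base_metric import Metric

class FirstHit(Metric):
    """1.0 if the first recommended item is relevant for the user, else 0.0."""

    @staticmethod
    def _get_metric_value_by_user(k, pred, ground_truth) -> float:
        # pred and ground_truth are per-user item lists
        # prepared by get_enriched_recommendations
        if len(pred) == 0 or len(ground_truth) == 0:
            return 0.0
        return float(pred[0] in ground_truth)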
- replay.metrics.base_metric.get_enriched_recommendations(recommendations, ground_truth)
Merge recommendations and ground truth into a single DataFrame and aggregate items into lists so that each user has only one record.
- Parameters
recommendations (Union[pandas.DataFrame, spark.DataFrame]) – recommendation list
ground_truth (Union[pandas.DataFrame, spark.DataFrame]) – test data
- Return type
DataFrame
- Returns
[user_id, pred, ground_truth]
- class replay.metrics.base_metric.Metric
Base metric class
- abstract static _get_metric_value_by_user(k, pred, ground_truth)
Metric calculation for one user.
- Parameters
k – depth cut-off
pred – recommendations
ground_truth – test data
- Return type
float
- Returns
metric value for current user
- class replay.metrics.base_metric.RecOnlyMetric(log, *args, **kwargs)
Base class for metrics that do not need holdout data