Metrics
Most metrics require a dataframe with recommendations and a dataframe with ground truth values, that is, the items each user actually interacted with:
- recommendations (Union[pandas.DataFrame, spark.DataFrame]):
predictions of a recommender system, DataFrame with columns
[user_id, item_id, relevance]
- ground_truth (Union[pandas.DataFrame, spark.DataFrame]):
test data, DataFrame with columns
[user_id, item_id, timestamp, relevance]
Every metric is calculated using the top K items for each user.
It is also possible to calculate metrics for multiple values of K simultaneously; in this case the result is a dictionary rather than a single number.
- k (Union[Iterable[int], int]):
a single number or a list specifying the truncation length of the recommendation list for each user
By default metrics are averaged over users, but you can use the metric.median method instead.
You can also get the lower bound of the confidence interval for a given alpha with conf_interval.
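For example, a minimal sketch of the multi-K call (hypothetical data; it assumes that recommendations are ranked by relevance and that the resulting dictionary is keyed by K):

>>> import pandas as pd
>>> from replay.metrics import HitRate
>>> pred = pd.DataFrame({"user_idx": [1, 1, 1],
...                      "item_idx": [1, 2, 3],
...                      "relevance": [0.9, 0.8, 0.7]})
>>> true = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [3, 5],
...                      "relevance": [1, 1]})
>>> HitRate()(pred, true, [1, 3])
{1: 0.0, 3: 1.0}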
Diversity metrics require extra parameters at initialization, but do not use the ground_truth parameter.
You can also add new metrics.
HitRate
- class replay.metrics.HitRate
- Percentage of users that have at least one
correctly recommended item among top-k.
\[HitRate@K(i) = \max_{j \in [1..K]}\mathbb{1}_{r_{ij}}\]
\[HitRate@K = \frac {\sum_{i=1}^{N}HitRate@K(i)}{N}\]
\(\mathbb{1}_{r_{ij}}\) – indicator function stating that user \(i\) interacted with item \(j\)
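A hypothetical doctest sketch; the values follow from the definition above, assuming recommendations are ranked by relevance:

>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1, 1],
...                      "item_idx": [1, 2, 3],
...                      "relevance": [0.9, 0.8, 0.7]})
>>> true = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [3, 5],
...                      "relevance": [1, 1]})
>>> HitRate()(pred, true, 2)
0.0
>>> HitRate()(pred, true, 3)
1.0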
Precision
- class replay.metrics.Precision
Mean percentage of relevant items among top K recommendations.
\[Precision@K(i) = \frac {\sum_{j=1}^{K}\mathbb{1}_{r_{ij}}}{K}\]
\[Precision@K = \frac {\sum_{i=1}^{N}Precision@K(i)}{N}\]
\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)
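A hypothetical doctest sketch; the value follows from the formula above (one of the top 2 recommended items is relevant):

>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1, 1],
...                      "item_idx": [1, 2, 3],
...                      "relevance": [0.9, 0.8, 0.7]})
>>> true = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [1, 3],
...                      "relevance": [1, 1]})
>>> Precision()(pred, true, 2)
0.5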
MAP
- class replay.metrics.MAP
- Mean Average Precision – average the Precision at relevant positions for each user, and then calculate the mean across all users.
\[ \begin{align}\begin{aligned}&AP@K(i) = \frac 1K \sum_{j=1}^{K}\mathbb{1}_{r_{ij}}Precision@j(i)\\&MAP@K = \frac {\sum_{i=1}^{N}AP@K(i)}{N}\end{aligned}\end{align} \]
\(\mathbb{1}_{r_{ij}}\) – indicator function showing if user \(i\) interacted with item \(j\)
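A hypothetical doctest sketch; the value follows from the formula above (the only relevant recommendation within the top 2 is at position 1):

>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1, 1],
...                      "item_idx": [1, 2, 3],
...                      "relevance": [0.9, 0.8, 0.7]})
>>> true = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [1, 3],
...                      "relevance": [1, 1]})
>>> MAP()(pred, true, 2)
0.5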
Recall
- class replay.metrics.Recall
Mean percentage of relevant items that were shown among the top K recommendations.
\[Recall@K(i) = \frac {\sum_{j=1}^{K}\mathbb{1}_{r_{ij}}}{|Rel_i|}\]
\[Recall@K = \frac {\sum_{i=1}^{N}Recall@K(i)}{N}\]
\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)
\(|Rel_i|\) – the number of relevant items for user \(i\)
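A hypothetical doctest sketch; the value follows from the formula above (one of the two relevant items appears in the top 2):

>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1, 1],
...                      "item_idx": [1, 2, 3],
...                      "relevance": [0.9, 0.8, 0.7]})
>>> true = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [1, 3],
...                      "relevance": [1, 1]})
>>> Recall()(pred, true, 2)
0.5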
ROC-AUC
- class replay.metrics.RocAuc
Receiver Operating Characteristic/Area Under the Curve is an aggregated performance measure that depends only on the order of recommended items. It can be interpreted as the fraction of object pairs (object of class 1, object of class 0) that are ordered correctly by the model. The bigger the AUC value, the better the classification model.
\[ROCAUC@K(i) = \frac {\sum_{s=1}^{K}\sum_{t=1}^{K} \mathbb{1}_{r_{si}<r_{ti}} \mathbb{1}_{gt_{si}<gt_{ti}}} {\sum_{s=1}^{K}\sum_{t=1}^{K} \mathbb{1}_{gt_{si}<gt_{ti}}}\]
\(\mathbb{1}_{r_{si}<r_{ti}}\) – indicator function showing that the recommendation score of item \(s\) is bigger than that of item \(t\) for user \(i\)
\(\mathbb{1}_{gt_{si}<gt_{ti}}\) – indicator function showing that user \(i\) values item \(s\) more than item \(t\).
Metric is averaged by all users.
\[ROCAUC@K = \frac {\sum_{i=1}^{N}ROCAUC@K(i)}{N}\]
>>> import pandas as pd
>>> true = pd.DataFrame({"user_idx": 1,
...                      "item_idx": [4, 5, 6],
...                      "relevance": [1, 1, 1]})
>>> pred = pd.DataFrame({"user_idx": 1,
...                      "item_idx": [1, 2, 3, 4, 5, 6, 7],
...                      "relevance": [0.5, 0.1, 0.25, 0.6, 0.2, 0.3, 0]})
>>> roc = RocAuc()
>>> roc(pred, true, 7)
0.75
MRR
- class replay.metrics.MRR
Mean Reciprocal Rank – Reciprocal Rank is the inverse position of the first relevant item among top-k recommendations, \(\frac {1}{rank_i}\). This value is averaged by all users.
>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1, 1], "item_idx": [3, 2, 1], "relevance": [5, 5, 5]})
>>> true = pd.DataFrame({"user_idx": [1, 1, 1], "item_idx": [2, 4, 5], "relevance": [5, 5, 5]})
>>> MRR()(pred, true, 3)
0.5
>>> MRR()(pred, true, 1)
0.0
>>> MRR()(true, pred, 1)
1.0
NDCG
- class replay.metrics.NDCG
Normalized Discounted Cumulative Gain is a metric that takes into account positions of relevant items.
This is the binary version: it only takes into account whether the item was consumed or not; the relevance value is ignored.
\[DCG@K(i) = \sum_{j=1}^{K}\frac{\mathbb{1}_{r_{ij}}}{\log_2 (j+1)}\]
\(\mathbb{1}_{r_{ij}}\) – indicator function showing that user \(i\) interacted with item \(j\)
To get from \(DCG\) to \(nDCG\) we calculate the biggest possible value of DCG for user \(i\) and recommendation length \(K\).
\[IDCG@K(i) = max(DCG@K(i)) = \sum_{j=1}^{K}\frac{\mathbb{1}_{j\le|Rel_i|}}{\log_2 (j+1)}\]
\[nDCG@K(i) = \frac {DCG@K(i)}{IDCG@K(i)}\]
\(|Rel_i|\) – number of relevant items for user \(i\)
Metric is averaged by users.
\[nDCG@K = \frac {\sum_{i=1}^{N}nDCG@K(i)}{N}\]
>>> import pandas as pd
>>> pred = pd.DataFrame({"user_idx": [1, 1, 2, 2],
...                      "item_idx": [4, 5, 6, 7],
...                      "relevance": [1, 1, 1, 1]})
>>> true = pd.DataFrame({"user_idx": [1, 1, 1, 1, 1, 2],
...                      "item_idx": [1, 2, 3, 4, 5, 8],
...                      "relevance": [0.5, 0.1, 0.25, 0.6, 0.2, 0.3]})
>>> ndcg = NDCG()
>>> ndcg(pred, true, 2)
0.5
Surprisal
- class replay.metrics.Surprisal(log)
Measures how many surprising rare items are present in recommendations.
\[\textit{Self-Information}(j) = -\log_2 \frac {u_j}{N}\]
\(u_j\) – number of users that interacted with item \(j\). Cold items are treated as if they were rated by one user, so their appearance in recommendations is considered completely unexpected.
Metric is normalized.
Surprisal for item \(j\) is
\[Surprisal(j) = \frac {\textit{Self-Information}(j)}{\log_2 N}\]
Recommendation list surprisal is the average surprisal of the items in it.
\[Surprisal@K(i) = \frac {\sum_{j=1}^{K}Surprisal(j)} {K}\]
The final metric is averaged by users.
\[Surprisal@K = \frac {\sum_{i=1}^{N}Surprisal@K(i)}{N}\]
- __init__(log)
Calculates self-information for each item.
- Parameters
log (Union[pandas.DataFrame, spark.DataFrame]) – historical data
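A hypothetical sketch, assuming that Surprisal, like Unexpectedness below, is called with only the recommendations and K; the value follows from the formulas above:

>>> import pandas as pd
>>> log = pd.DataFrame({"user_idx": [1, 2, 2],
...                     "item_idx": [1, 1, 2],
...                     "relevance": [1, 1, 1],
...                     "timestamp": [1, 1, 1]})
>>> recs = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [1, 2],
...                      "relevance": [0.9, 0.8]})
>>> # item 1 was seen by both users (surprisal 0), item 2 by one of the two (surprisal 1)
>>> Surprisal(log)(recs, 2)
0.5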
Unexpectedness
- class replay.metrics.Unexpectedness(pred)
Fraction of recommended items that are not present in some baseline recommendations.
>>> import pandas as pd
>>> from replay.session_handler import get_spark_session, State
>>> spark = get_spark_session(1, 1)
>>> state = State(spark)
>>> log = pd.DataFrame({"user_idx": [1, 1, 1], "item_idx": [1, 2, 3], "relevance": [5, 5, 5], "timestamp": [1, 1, 1]})
>>> recs = pd.DataFrame({"user_idx": [1, 1, 1], "item_idx": [0, 0, 1], "relevance": [5, 5, 5], "timestamp": [1, 1, 1]})
>>> metric = Unexpectedness(log)
>>> round(metric(recs, 3), 2)
0.67
- __init__(pred)
- Parameters
pred (Union[pandas.DataFrame, spark.DataFrame]) – model predictions
Coverage
- class replay.metrics.Coverage(log)
Metric calculation is as follows:
- take K recommendations with the biggest relevance for each user_id
- count the number of distinct item_id in these recommendations
- divide it by the number of items in the whole data set
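A hypothetical sketch, assuming that Coverage, like the other metrics initialized with a log, is called with only the recommendations and K (2 distinct recommended items out of 4 items in the log):

>>> import pandas as pd
>>> log = pd.DataFrame({"user_idx": [1, 1, 2, 2],
...                     "item_idx": [1, 2, 3, 4],
...                     "relevance": [1, 1, 1, 1],
...                     "timestamp": [1, 1, 1, 1]})
>>> recs = pd.DataFrame({"user_idx": [1, 1],
...                      "item_idx": [1, 2],
...                      "relevance": [0.9, 0.8]})
>>> Coverage(log)(recs, 2)
0.5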
- __init__(log)
- Parameters
log (Union[pandas.DataFrame, spark.DataFrame]) – pandas or Spark DataFrame. It is important for log to contain all available items.
Custom Metric
Your metric should inherit from the Metric class and implement the following methods:
__init__
_get_enriched_recommendations
_get_metric_value_by_user
get_enriched_recommendations is already implemented, but you can override it if your metric requires it.
_get_metric_value_by_user
is required for every metric because this is where the actual calculations happen.
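A minimal sketch of a custom metric (a hypothetical FirstHit metric; it assumes, as described for _get_metric_value_by_user below, that pred and ground_truth arrive as per-user item lists produced by get_enriched_recommendations):

from replay.metrics.base_metric import Metric

class FirstHit(Metric):
    """1.0 if the first recommended item is relevant for the user, else 0.0."""

    @staticmethod
    def _get_metric_value_by_user(k, pred, ground_truth) -> float:
        # pred and ground_truth are per-user item lists
        # prepared by get_enriched_recommendations
        if len(pred) == 0 or len(ground_truth) == 0:
            return 0.0
        return float(pred[0] in ground_truth)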
- replay.metrics.base_metric.get_enriched_recommendations(recommendations, ground_truth)
Merge recommendations and ground truth into a single DataFrame and aggregate items into lists so that each user has only one record.
- Parameters
recommendations (Union[pandas.DataFrame, spark.DataFrame]) – recommendation list
ground_truth (Union[pandas.DataFrame, spark.DataFrame]) – test data
- Return type
DataFrame
- Returns
[user_id, pred, ground_truth]
- class replay.metrics.base_metric.Metric
Base metric class
- abstract static _get_metric_value_by_user(k, pred, ground_truth)
Metric calculation for one user.
- Parameters
k – depth cut-off
pred – recommendations
ground_truth – test data
- Return type
float
- Returns
metric value for current user
- class replay.metrics.base_metric.RecOnlyMetric(log, *args, **kwargs)
Base class for metrics that do not need holdout data