created, $=dv.current().file.ctime
& modified, =this.modified
tags:nlp
ROUGE, short for Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package designed for evaluating automatic summarization that can also be used for machine translation.
The metrics compare an automatically produced summary or translation against one or more reference (high-quality, human-produced) summaries or translations.
ROUGE-N measures the number of matching n-grams between the model-generated text and a human-produced reference.
Consider reference R and candidate summary C:
R: The cat is on the mat
C: The cat and the dog
precision:
ROUGE-1 precision is the ratio of the number of unigrams in C that also appear in R (the words “the”, “cat”, and “the”), over the total number of unigrams in C: 3/5 = 0.6.
recall:
ROUGE-1 recall is the ratio of the number of unigrams in R that also appear in C (again “the”, “cat”, and “the”), over the total number of unigrams in R: 3/6 = 0.5.
f1-score:
Then, the ROUGE-1 F1-score can be obtained directly from ROUGE-1 precision and recall using the standard F1-score formula.
ROUGE-1 F1-score = 2 * (precision * recall) / (precision + recall) = 2 * (0.6 * 0.5) / (0.6 + 0.5) ≈ 0.54
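As a minimal from-scratch sketch (assuming whitespace tokenisation and lower-casing, which the example above implies; the function and variable names are illustrative, not from any official implementation), ROUGE-1 for R and C can be computed like this:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Counter of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=1):
    """ROUGE-N precision, recall and F1 with clipped n-gram counts."""
    ref = ngram_counts(reference.lower().split(), n)
    cand = ngram_counts(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # each n-gram counted at most min(ref, cand) times
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

R = "The cat is on the mat"
C = "The cat and the dog"
print(rouge_n(R, C, n=1))   # (0.6, 0.5, 0.545...) -> the ≈ 0.54 above
```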
ROUGE-2 precision is the ratio of the number of 2-grams in C that also appear in R (the only matching 2-gram is “the cat”), over the total number of 2-grams in C (“the cat”, “cat and”, “and the”, “the dog”).
So, ROUGE-2 precision = 1/4 = 0.25.
ROUGE-2 recall is the ratio of the number of 2-grams in R that also appear in C (“the cat” is again the only one), over the total number of 2-grams in R: 1/5 = 0.2.
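Using the same illustrative `rouge_n` helper from the ROUGE-1 sketch above:

```python
print(rouge_n(R, C, n=2))   # (0.25, 0.2, 0.222...): precision 1/4, recall 1/5
```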
ROUGE-L is based on the longest common subsequence (LCS) between the output and the reference.
Here the LCS is “the cat the” (three words; remember that the words of a common subsequence need not be consecutive), which appears in both R and C.
R: **The cat** is on **the** mat
C: **The cat** and **the** dog
(the bold words form the longest common subsequence)
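A sketch of ROUGE-L using a standard dynamic-programming LCS (again assuming whitespace tokenisation; names are illustrative, not the official implementation):

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, candidate):
    """ROUGE-L precision, recall and F1 from the LCS length."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    lcs = lcs_length(ref, cand)
    precision, recall = lcs / len(cand), lcs / len(ref)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(rouge_l("The cat is on the mat", "The cat and the dog"))
# LCS has length 3 ("the", "cat", "the"), so precision = 3/5, recall = 3/6
```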
ROUGE-S adds a degree of leniency to n-gram matching.
ROUGE-S is a skip-bigram co-occurrence metric: it counts pairs of words that appear in the same order in both the reference and the model output, even when they are separated by one or more other words.
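A rough sketch of the skip-bigram idea (the `max_gap` parameter is an illustrative assumption standing in for ROUGE-S’s skip distance; whitespace tokenisation assumed):

```python
from itertools import combinations
from collections import Counter

def skip_bigrams(tokens, max_gap=None):
    """All in-order word pairs; max_gap limits how many words may sit between them (None = no limit)."""
    return Counter(
        (tokens[i], tokens[j])
        for i, j in combinations(range(len(tokens)), 2)
        if max_gap is None or j - i - 1 <= max_gap
    )

def rouge_s(reference, candidate, max_gap=None):
    """Skip-bigram precision, recall and F1."""
    ref = skip_bigrams(reference.lower().split(), max_gap)
    cand = skip_bigrams(candidate.lower().split(), max_gap)
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# prints (precision, recall, F1) over skip-bigrams of the running example
print(rouge_s("The cat is on the mat", "The cat and the dog"))
```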
- Pros: it correlates positively with human evaluation, is inexpensive to compute, and is language-independent.
- Cons: ROUGE does not account for different words that have the same meaning, as it measures syntactic matches rather than semantics.
Related: BLEU (Bilingual Evaluation Understudy) is a method for comparing a candidate translation to one or more reference translations.
ROUGE
Use ROUGE when you care more about how many n-grams from the reference text appear in the predicted text (recall-oriented). It is usually useful for summarisation, where you care more about keeping the essential message of the original text.
BLEU
Use BLEU when you care more about how many of the predicted n-grams appear in the reference text (precision-oriented). This is useful for machine translation, where you care more about how precise the translation is.
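In practice, off-the-shelf implementations are typically used. A minimal sketch, assuming the third-party `rouge-score` package (pip install rouge-score) and `nltk` are installed; exact APIs may differ across versions:

```python
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu

reference = "The cat is on the mat"
candidate = "The cat and the dog"

# ROUGE-1/2/L precision, recall and F-measure per metric
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
print(scorer.score(reference, candidate))

# BLEU expects tokenised input: a list of reference token lists plus a candidate token list.
# weights=(0.5, 0.5) gives BLEU-2, which is more meaningful for such short sentences.
print(sentence_bleu([reference.lower().split()], candidate.lower().split(), weights=(0.5, 0.5)))
```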
Limitations
From *Limits of Automatic Summarization According to ROUGE*:
> The task is NP-hard, of which we give the first proof. Still, as we show empirically for three central benchmark datasets for the task, greedy algorithms empirically seem to perform optimally according to the metric. Additionally, overall quality assurance is problematic: there is no natural upper bound on the quality of summarisation systems, and even humans are excluded from performing optimal summarisation.