
Rouge score accuracy #2

Closed
pltrdy opened this issue Mar 22, 2017 · 13 comments

Comments

pltrdy (Owner) commented Mar 22, 2017

The results are known to be quite different from the official ROUGE scoring script.

It has been discussed here:
google/seq2seq#89

pltrdy (Owner, Author) commented Feb 16, 2018

It has been improved with #6

I compared the two scorers on multi-sentence files with 10,397 lines and 508,630 words, and I get:

  • Official ROUGE (using files2rouge), took 111 seconds:
---------------------------------------------
1 ROUGE-1 Average_R: 0.34882 (95%-conf.int. 0.34632 - 0.35132)
1 ROUGE-1 Average_P: 0.40104 (95%-conf.int. 0.39803 - 0.40391)
1 ROUGE-1 Average_F: 0.36161 (95%-conf.int. 0.35934 - 0.36383)
---------------------------------------------
1 ROUGE-2 Average_R: 0.13938 (95%-conf.int. 0.13718 - 0.14151)
1 ROUGE-2 Average_P: 0.16228 (95%-conf.int. 0.15968 - 0.16490)
1 ROUGE-2 Average_F: 0.14511 (95%-conf.int. 0.14293 - 0.14729)
---------------------------------------------
1 ROUGE-L Average_R: 0.32234 (95%-conf.int. 0.31998 - 0.32478)
1 ROUGE-L Average_P: 0.37093 (95%-conf.int. 0.36804 - 0.37374)
1 ROUGE-L Average_F: 0.33429 (95%-conf.int. 0.33208 - 0.33647)
  • this code, took 20 seconds:
{
  "rouge-1": {
    "f": 0.3672435871687543,
    "p": 0.40349020487306564,
    "r": 0.3527286721707171
  },
  "rouge-2": {
    "f": 0.14396864450679678,
    "p": 0.16098625779779233,
    "r": 0.13821563233163145
  },
  "rouge-l": {
    "f": 0.32548307280858685,
    "p": 0.3741943564047806,
    "r": 0.32687448001488595
  }
}
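For context, here is a minimal sketch of the per-pair ROUGE-1 computation that both scorers implement variants of. The function name and plain whitespace tokenization are illustrative only; neither tool uses exactly this code, and the averaging and tokenization details are precisely where they diverge.

```python
from collections import Counter

def rouge_1(hyp_tokens, ref_tokens):
    # Clipped unigram overlap, as in ROUGE-1 (illustrative sketch,
    # not the exact code of either scorer).
    hyp_counts, ref_counts = Counter(hyp_tokens), Counter(ref_tokens)
    overlap = sum(min(hyp_counts[w], ref_counts[w]) for w in hyp_counts)
    p = overlap / max(len(hyp_tokens), 1)   # precision
    r = overlap / max(len(ref_tokens), 1)   # recall
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = rouge_1("the cat sat on the mat".split(),
                  "the cat lay on the mat".split())
```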

shijx12 commented Jul 4, 2018

Maybe the difference is caused by

hyp = [" ".join(_.split()) for _ in hyp.split(".") if len(_) > 0]

Splitting on '.' removes all '.' characters from hyp and ref.
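A minimal reproduction of that behaviour, using the comprehension quoted above on a made-up hypothesis string: every '.' is consumed by split("."), so the resulting "sentences" carry no periods, and anything containing an internal period (e.g. abbreviations) is cut into fragments.

```python
# The quoted comprehension applied to a toy hypothesis.
hyp = "See e.g. the results. It works."
sents = [" ".join(s.split()) for s in hyp.split(".") if len(s) > 0]
print(sents)  # ['See e', 'g', 'the results', 'It works']
```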

pltrdy (Owner, Author) commented Jul 4, 2018

@shijx12 It's not the only reason, but you've got a good point: that code does not make sense.

I'm editing it and evaluating the impact. Thanks for pointing this out.

Diego999 commented Aug 7, 2018

Hi @pltrdy ,

Could you run some evaluation to compare the differences between the Perl script and yours? How much do they differ? I would love to get rid of the Perl script! https://github.com/RxNLP/ROUGE-2.0 seems to produce identical scores (apart from a +1 smoothing term they did not implement, because no indication of it was present in the official ROUGE script).

pltrdy (Owner, Author) commented Aug 7, 2018

@Diego999 that's precisely what I did here: #2 (comment).
In addition, results may differ slightly because of how ends of sentences are handled, as suggested in #2 (comment).

Diego999 commented Aug 7, 2018

@pltrdy yes, but that was in February; some modifications have been made since ;) Especially the remark in #2 (comment). Have you re-run the experiments since then?

pltrdy (Owner, Author) commented Aug 7, 2018

It should be similar, if not exactly the same. I'm not sure how punctuation is handled in the official script. I've attempted some fixes, which seem to make things worse. Punctuation may simply be ignored, in which case the naïve implementation may be the right one.
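To illustrate why punctuation handling matters, here is a toy sketch with made-up strings (this is not the official script's tokenizer): whether sentence-final punctuation is kept as tokens or stripped before counting changes the unigram scores.

```python
import re
from collections import Counter

def rouge_1_f(hyp, ref):
    # ROUGE-1 F-score on whitespace tokens (sketch).
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum(min(h[w], r[w]) for w in h)
    p = overlap / max(sum(h.values()), 1)
    rec = overlap / max(sum(r.values()), 1)
    return 2 * p * rec / (p + rec) if p + rec else 0.0

def strip_punct(s):
    # Replace non-word, non-space characters with spaces.
    return re.sub(r"[^\w\s]", " ", s)

hyp, ref = "results improved .", "results improved !"
raw = rouge_1_f(hyp, ref)                              # '.' vs '!' never match
clean = rouge_1_f(strip_punct(hyp), strip_punct(ref))  # punctuation ignored
```

With punctuation kept, the mismatched final tokens drag the score down; with punctuation stripped, the same pair scores a perfect 1.0.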

Diego999 commented Aug 7, 2018

Ok, thank you for your answer!

AlJohri commented Aug 9, 2018

@Diego999

seems to have identical scores

Is it documented somewhere that ROUGE-2.0 produces identical scores?

Diego999 commented Aug 9, 2018

@AlJohri Yes, see the last paragraph of their paper.

Diego999 commented

By the way, I solved this problem here: https://github.com/Diego999/py-rouge. Have a look at the README to understand why the results sometimes differ by ~4e-5.

AlJohri commented Sep 19, 2018

That's great to hear @Diego999! Are you planning to release this as an independent package, or to merge it back into pltrdy/rouge?

Diego999 commented Sep 20, 2018 via email
