[BUG] French DVB subtitles need deduplication #1040

Open
6 tasks done
Liontooth opened this issue Nov 18, 2018 · 0 comments


Liontooth commented Nov 18, 2018

CCExtractor version: 0.85

In raising this issue, I confirm the following:

  • I have read and understood the contributors guide.
  • I have checked that the bug-fix I am reporting can be replicated, or that the feature I am suggesting isn't already present.
  • I have checked that the issue I'm posting isn't already reported.
  • I have checked that the issue I'm reporting isn't already solved and that no duplicates exist in closed or open issues.
  • I have checked the pull requests tab for existing solutions/implementations to my issue/suggestion.

My familiarity with the project is as follows:

  • I am an active contributor to CCExtractor.

Necessary information

  • Is this a regression (did it work before)? [X] NO
  • What platform did you use? [ ] Windows - [X] Linux - [ ] Mac
  • What were the used arguments? -datets -ttxt -UCLA -noru -utf8

Video links
http://vrnewsscape.ucla.edu/dropbox/2017-07-14_1100_FR_TF1_Journal.mpg
http://vrnewsscape.ucla.edu/dropbox/2017-07-14_1100_FR_TF1_Journal.txt

Additional information
CCExtractor 0.85 (compiled 2017-07-29 with liblept4) successfully extracts the DVB captions from the file above, as shown in the accompanying txt file. The txt file is UTF-8; Chrome guesses the encoding wrong and no longer offers a way to correct it. CCExtractor 0.86 and 0.87 fail to find any subtitles in this file (see issue #1039).

However, each line is emitted several times as a growing fragment before it completes, and part of one line is sometimes repeated at the start of the next:

20170714110001.000|20170714110001.360|CC1|distribués gratuitement pour petits,
20170714110001.360|20170714110001.480|CC1|distribués, gratuitement pour petits et
20170714110001.480|20170714110001.880|CC1|distribués, gratuitement pour petits et grands,
20170714110001.880|20170714110002.280|CC1|distribués, gratuitement pour …
20170714110002.280|20170714110002.440|CC1|distribués, gratuitement pour petits et grands,, histoire que
20170714110002.440|20170714110002.840|CC1|petits et grands,, histoire que pe rd u re,
20170714110002.840|20170714110003.120|CC1|petits et grands,, histoire que pe rd u re, cette
20170714110003.120|20170714110003.400|CC1|petits et grands,, histoire que pe rd u re, cette a n n ée
20170714110003.400|20170714110003.800|CC1|petits et grands,, histoire que perdure, cette année encore,
20170714110003.800|20170714110003.880|CC1|petits et grands,, histoire que perdure, cette année encore, la

CCExtractor has already solved this duplication problem for teletext; it is clearly also present in some DVB subtitles, notably those from the French network TF1.
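To make the request concrete, here is a rough Python sketch of the kind of post-processing I have in mind, applied to the ttxt lines quoted above. It is not CCExtractor's teletext deduplication code and makes no claim about how that works; the names `Cue`, `parse_line`, `key` and `collapse_progressive` are made up for this illustration. It assumes that two consecutive cues belong together when one is a prefix of the other after dropping punctuation, spaces and case, which also absorbs OCR noise such as "pe rd u re".

```python
from typing import List, NamedTuple


class Cue(NamedTuple):
    start: str
    end: str
    channel: str
    text: str


def parse_line(line: str) -> Cue:
    # ttxt lines look like: start|end|channel|text
    start, end, channel, text = line.rstrip("\n").split("|", 3)
    return Cue(start, end, channel, text)


def key(text: str) -> str:
    # Comparison key: letters and digits only, case-folded.
    # Assumption: punctuation and spacing differences are OCR noise.
    return "".join(ch for ch in text.casefold() if ch.isalnum())


def collapse_progressive(cues: List[Cue]) -> List[Cue]:
    """Merge runs of cues in which each cue extends (or truncates) the last."""
    merged: List[Cue] = []
    for cue in cues:
        if merged:
            last = merged[-1]
            old, new = key(last.text), key(cue.text)
            related = new.startswith(old) or old.startswith(new)
            if cue.channel == last.channel and old and new and related:
                # Keep the fuller text, the first start and the latest end.
                text = cue.text if len(new) >= len(old) else last.text
                merged[-1] = Cue(last.start, cue.end, last.channel, text)
                continue
        merged.append(cue)
    return merged


SAMPLE = """\
20170714110001.000|20170714110001.360|CC1|distribués gratuitement pour petits,
20170714110001.360|20170714110001.480|CC1|distribués, gratuitement pour petits et
20170714110001.480|20170714110001.880|CC1|distribués, gratuitement pour petits et grands,
20170714110001.880|20170714110002.280|CC1|distribués, gratuitement pour …
20170714110002.280|20170714110002.440|CC1|distribués, gratuitement pour petits et grands,, histoire que
20170714110002.440|20170714110002.840|CC1|petits et grands,, histoire que pe rd u re,
20170714110002.840|20170714110003.120|CC1|petits et grands,, histoire que pe rd u re, cette
20170714110003.120|20170714110003.400|CC1|petits et grands,, histoire que pe rd u re, cette a n n ée
20170714110003.400|20170714110003.800|CC1|petits et grands,, histoire que perdure, cette année encore,
20170714110003.800|20170714110003.880|CC1|petits et grands,, histoire que perdure, cette année encore, la
"""

result = collapse_progressive([parse_line(l) for l in SAMPLE.splitlines()])
for cue in result:
    print("|".join(cue))
```

On the ten lines above this leaves two cues, one per on-screen line. It does not remove the carry-over between them ("petits et grands,, histoire que" still appears in both), so the second half of the problem would need a separate step, for example trimming a leading overlap with the previous cue.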
