Max subtitles duplicating

jinxed

I'm not sure when this started, as I haven't grabbed anything from Max in a few weeks, but I noticed it over the weekend. No matter what I download, movie or TV series, the subtitles are duplicated in the SRT files. It's almost like it's trying to grab both the regular and the CC subtitles and putting them into one file. However, when given the option of what to download, there is only one option to select. I'm getting the same results from 6.2.1.2 and 6.2.1.0. Unfortunately, due to formatting differences in the lines, the duplicates cannot easily be removed. Does anyone know how old a version of SF I would need in order to download correctly? Or I'm guessing this is more of a Max issue.

AGuyWithAComputer

I've seen that on and off since really early versions. Enough that I use Subtitle Edit to fix it. This is the command I run on each SRT in my post-processing script. {0} would be the path to your subtitle file. Just use /MergeSameTexts if you only want to fix the doubles.

SubtitleEdit.exe /Convert "{0}" SubRip /MergeSameTexts /RemoveTextForHI /FixCommonErrors /RedoCasing /overwrite
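
For anyone without a post-processing script yet, here is a minimal sketch of how the command above could be run over a whole folder from Python. The flags are copied from the command above; the function names, and the assumption that SubtitleEdit.exe is reachable on PATH, are mine and not part of Subtitle Edit.

```python
import subprocess
from pathlib import Path

# Flags copied verbatim from the command in the post above.
FULL_FLAGS = (
    "/MergeSameTexts",
    "/RemoveTextForHI",
    "/FixCommonErrors",
    "/RedoCasing",
    "/overwrite",
)


def build_command(srt_path, exe="SubtitleEdit.exe", flags=FULL_FLAGS):
    """Return the argument list for one subtitle file (nothing is executed)."""
    return [exe, "/Convert", str(srt_path), "SubRip", *flags]


def convert_folder(folder, exe="SubtitleEdit.exe", flags=FULL_FLAGS):
    """Run the conversion on every .srt below folder.

    Assumes exe can be found on PATH (hypothetical setup); pass a full
    path to the executable otherwise.
    """
    for srt_path in sorted(Path(folder).rglob("*.srt")):
        subprocess.run(build_command(srt_path, exe, flags), check=True)
```

Calling build_command("movie.srt") only builds the argument list, so it is easy to print and check before letting convert_folder touch any files.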

Germania

Additional info:

In the USA, only English CC subs are offered for your title ..
.. and the SRT is fine (no doubled text), so it is not a Max or SF problem

This happens with mpc-hc and mpc-be but never in VLC player ..
.. so it's a player problem

jinxed

I tried Merge Lines with Same Texts and it did help with a lot of duplicates. However, sometimes it fails because one copy will be "[constable] What are you, some kind of an author?" and the other "What are you, some kind of an author?", so the [constable] tag throws it off, or it may be "-What are you, some kind of an author?", so the - messes it up. I also can't remove entries with the same time codes because they are slightly off, e.g. 8:14.911 vs 8:15.245.

I tried the extra commands and it did clear up a lot of the issues, so thanks for that! However, there are still duplicates that differ only in punctuation. Any suggestions for that?
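
One possible approach to the punctuation problem, sketched in Python: reduce each subtitle text to a comparison key before checking for duplicates, and allow a small tolerance on the start time. The function names and the one-second tolerance are assumptions made for illustration, not something Subtitle Edit or SF provides.

```python
import re


def normalize(text: str) -> str:
    """Reduce a subtitle text to a key that ignores tags, dashes and punctuation."""
    # drop bracketed tags such as [constable] or [Parkins]
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # drop a leading dialogue dash on any line
    text = re.sub(r"^\s*-\s*", "", text, flags=re.MULTILINE)
    # replace punctuation with spaces, keeping apostrophes inside words
    text = re.sub(r"[^\w\s']", " ", text)
    # lower-case and collapse line breaks and repeated spaces
    return " ".join(text.lower().split())


def is_near_duplicate(text_a, start_a, text_b, start_b, max_gap=1.0):
    """True if both texts normalize to the same key and start within max_gap seconds."""
    return abs(start_a - start_b) <= max_gap and normalize(text_a) == normalize(text_b)
```

With this, "[constable] What are you, some kind of an author?", "What are you, some kind of an author?" and "-What are you, some kind of an author?" all reduce to the same key, and start times of 8:14.911 and 8:15.245 (494.911 and 495.245 seconds) fall inside the tolerance.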

jinxed


It is not a player problem. You can see it right here in the SRT file: the lines are duplicated at line 270 and line 573. And as you can see in my post above, even in Salem's Lot lines are duplicated with slightly different punctuation.

And here we are in Plex showing the duplicated subtitles. I'm not sure why VLC doesn't display them, but they are definitely there.

Germania

It is not a player problem.
jinxed

Yes, I was probably premature, because the duplicates are not listed
one after the other - but much later (not sorted chronologically)

try this - here

AGuyWithAComputer

Well, that's odd. Usually when I see duplicates in the SRT file, they are on two consecutive lines, not separated by a few hundred. But no, if there is any difference, like punctuation or line breaks, it won't merge them.

Anyway, I downloaded Salem's Lot just now with the settings below on 6.2.1.2 and there are no issues with duplicate lines that I can see. I went to the timestamp in your first picture and it's fine. The file I downloaded has 2802 lines prior to processing.

You might try clearing your SF temp folder. Path can be found in the general settings of SF.

jinxed


I cleared the temp folder. Closed out SF and reopened. I deleted Salem's Lot and redownloaded it. Still messed up. I even tried downloading a higher quality one, same result. As you can see the time stamps are a fraction of a second off from each other and they are formatted differently with "Uncle!" and "Uncle,"

And here is an even weirder one. Same spot in the file, nearly identical time code, but one has the subtitle as saying [Parkins] and the other says [constable].

ms-dfav

...
Anyway, I downloaded Salem's Lot just now with the settings below on 6.2.1.2 and there are no issues with duplicate lines that I can see. I went to the timestamp in your first picture and it's fine. The file I downloaded has 2802 lines prior to processing.
AGuyWithAComputer

It happens that I also have Salem's Lot downloaded with a date of 2024.X.8, so with some older SF. The English CC SRT file also has exactly 2802 lines, 195 696 bytes. The last few lines from your screenshot are identical.

One example conflict in this file is between subtitle 240:

00:16:22,190 --> 00:16:25,026
We take people at their word
here in the Lot.

and 468:

468
00:16:22,607 --> 00:16:24,609
We take people at their word
here in the Lot.

... among many others.

ms-dfav

I cleared the temp folder. Closed out SF and reopened. I deleted Salem's Lot and redownloaded it. Still messed up. I even tried downloading a higher quality one, same result. As you can see the time stamps are a fraction of a second off from each other and they are formatted differently with "Uncle!" and "Uncle,"
...
And here is an even weirder one. Same spot in the file, nearly identical time code, but one has the subtitle as saying [Parkins] and the other says [constable].
...
jinxed

I can also confirm all of the above.

So the problem has existed for at least 3 months.

jinxed

So here's something weirder still. I went back and checked Doom Patrol, which I was able to grab a few weeks ago, and the episodes I spot-checked are fine. I also checked the first two episodes of Creature Commandos I grabbed a couple of weeks ago and they are fine. But I grabbed Creature Commandos again today to get the updated episodes, and they are messed up with the duplication.

So it seems to be a communication issue between SF and Max. Or a Max issue. Either way, it may be hit or miss, or at least inconsistent. Which means, fixing it won't be easy.

ms-dfav

Which means, fixing it won't be easy.
jinxed

It's not exactly easy vs hard.

More like another hassle to go through while dealing with SF. There's never enough of these.

It is not a typical problem, so there isn't any tool to repair such subtitles.

However, writing a script to detect that situation is probably trivial: the trigger is the time going 'backwards' substantially (like over a minute) between two consecutive lines. A script to fix the issue (by deleting one of the 'variants' for each repeating 'chunk' of lines) would be relatively easy.

Maybe I'm missing something that would complicate the process - but hopefully I'm not.
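
A rough sketch of the detection step described above, using only the Python standard library. It scans the timing lines of an SRT file and reports where the next subtitle starts more than a given number of seconds before the current one ends. The function names and the 60-second default are assumptions for illustration.

```python
import re

# Matches an SRT timing line such as "00:16:22,190 --> 00:16:25,026".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def cue_times(srt_text):
    """Return (start, end) in seconds for every timing line, in file order."""
    times = []
    for match in TIMING.finditer(srt_text):
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
        start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
        end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
        times.append((start, end))
    return times


def backward_jumps(srt_text, threshold=60.0):
    """Return indices i where subtitle i+1 starts more than threshold
    seconds before subtitle i ends (0-based, in file order)."""
    times = cue_times(srt_text)
    return [
        i
        for i in range(len(times) - 1)
        if times[i][1] - times[i + 1][0] > threshold
    ]
```

An empty result means the file is in chronological order as far as this check can tell; each reported index marks the last subtitle before the time jumps back.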

october262


See post #2 in this thread - https://forum.dvdfab.cn/forum/streamfab-support/streamfab/400044-hbomax-duplicate-subtitles-within-srt-file
You can use Subtitle Edit (a free tool) to get rid of duplicate subtitles.

ms-dfav


This is not enough, as they are not simply duplicates, as stated above.

Germania

The root cause is that the MPD contains the label en-US CC twice ..
.. for "Salem's Lot" there is one for the full movie and one for the first additional 15 seconds (the "Skip" part)

But t0 has a duration of only 15.015 seconds (1 segment) - automatically adding t13 (1:53:47) is wrong

SF (and some other tools) download the segments for both (ignoring the duration for t0) and join them together.

The right subs (full CC - like in Max player) are here
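
To illustrate the idea only: if a tool sees two subtitle tracks with the same label, one way to avoid joining them is to keep just the track with the longest declared duration. The sketch below is not how SF or any other downloader works internally. The Track class and function name are hypothetical, and reading t0 as the 15.015-second entry and t13 as the 1:53:47 one is an interpretation of the post above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Track:
    track_id: str    # hypothetical identifier, e.g. "t0" or "t13"
    label: str       # label as listed in the manifest, e.g. "en-US CC"
    duration: float  # declared duration in seconds


def pick_one_per_label(tracks):
    """Keep only the track with the longest declared duration for each label."""
    best = {}
    for track in tracks:
        current = best.get(track.label)
        if current is None or track.duration > current.duration:
            best[track.label] = track
    return list(best.values())


# Values taken from the post: 15.015 s and 1:53:47 (6827 s).
tracks = [
    Track("t0", "en-US CC", 15.015),
    Track("t13", "en-US CC", 1 * 3600 + 53 * 60 + 47),
]
kept = pick_one_per_label(tracks)
```

Here kept contains only the t13 entry, so nothing from the short "Skip" track would be joined into the final SRT.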

Wilson.Wang

I will ask the dev to check it ASAP.

Wilson

jinxed


I appreciate the link but going and finding subtitles for everything I want to download on Max is going to be a pain. But I guess that might be the only solution for now.

jinxed

I will ask the dev to check it ASAP.

Wilson
Wilson.Wang

Awesome, thank you!

jinxed

Just updating: this problem is still present in 6.2.1.3. I thought it had been fixed because, when I first opened SF and selected a title, it listed two options for the subtitles (they were identical). I unchecked one, but the subs are still duplicated weirdly. When I went back to SF, reloaded the title, and checked another title, there was only one subtitle to pick from.

ms-dfav

This problem has been fixed in 6.2.1.4 but how about detecting / fixing the (already downloaded) subtitles from previous versions?

I present a script that will (try to) do both.

It requires Python 3.10+ and the 'srt' and 'chardet' libraries (installed via 'pip install srt chardet').

The instructions follow:

Fix Max srt subtitles downloaded by StreamFab versions 6.2.1.3 and earlier

Some of these files have repeating, overlapping blocks of subtitles coming from
two different sources, so they are not exact duplicates.

This problem affects mostly English subtitles but who knows what else...

Fortunately, the timestamps are grouped together between the two variants, so that
by analyzing negative time jumps conflicts can be detected automatically;
this script attempts to do just that.

This may backfire in rare cases but should be 99% safe.

The script should be executed with either
- directory name, to be traversed recursively for all the srt files inside
or
- srt file name

The fix comes with 2 variants: two srt files will be saved with filename ending with
'.fixed.v1.srt' and '.fixed.v2.srt'. The variants differ by which set of conflicting lines
got removed (v1 removes the former and v2 the latter of the two conflicting groups of lines).

The code is pasted below (I prefer not to attach files as I don't know if these will be visible). Feel free to do whatever you like with this code, hopefully it will be useful.

#!/usr/bin/env python

# this script requires at least Python 3.10

help_string = \
"""Fix Max srt subtitles downloaded by StreamFab versions 6.2.1.3 and earlier

Some of these files have repeating, overlapping blocks of subtitles coming from
two different sources, so they are not exact duplicates.

This problem affects mostly English subtitles but who knows what else...

Fortunately, the timestamps are grouped together between the two variants, so that
by analyzing negative time jumps conflicts can be detected automatically;
this script attempts to do just that.

This may backfire in rare cases but should be 99% safe.

The script should be executed with either
 - directory name, to be traversed recursively for all the srt files inside
or
 - srt file name

The fix comes with 2 variants: two srt files will be saved with filename ending with
'.fixed.v1.srt' and '.fixed.v2.srt'. The variants differ by which set of conflicting lines
got removed (v1 removes the former and v2 the latter of the two conflicting groups of lines).
"""

import os
import sys
import glob

import srt

from chardet.universaldetector import UniversalDetector
detector = UniversalDetector()

append_fixed_v1: str = ".fixed.v1"
append_fixed_v2: str = ".fixed.v2"

def fix_srt(filename: str, srt_input: str, threshold: float = -30.0) -> tuple[str, str] | None :
    """Fix srt_input given as a string

    :param filename: name of srt file (used only for reporting)
    :param srt_input: srt subtitles as a string
    :param threshold: minimal negative time jump (in seconds) to consider as needed to be fixed
    :return: a tuple of fixed srt subtitles (two variants of a fix), or None if no fix is needed
    """

    try:
        # get a list of subtitles from the contents of srt file
        subs: list[srt.Subtitle] = list(srt.parse(srt_input))
    except Exception:
        print("file:", filename)
        print(" ... could not parse srt contents (probably encoding-related problem)")
        print(f" ... encoding detected: {detector.result}")
        return None

    # detect indices where time jumps backwards between two consecutive lines
    diffs_negative: list[int] = []
    for i in range(len(subs)-1):
        if (subs[i + 1].start - subs[i].end).total_seconds() < threshold:
            diffs_negative.append(i)

    # if there are such negative jumps, fix by removing the conflicting subs
    if diffs_negative:
        # contains indices of subs to keep,
        # removing the sub is done via removing its index from this list first
        indices_v1: set[int] = set(range(len(subs)))
        indices_v2: set[int] = set(range(len(subs)))

        ## variant 1 of the fix:
        # index_end is where the time jump occurs
        for index_end in diffs_negative:
            # determine index_start
            timestamp_start_bound: float = subs[index_end + 1].start.total_seconds()

            # ... starting from the end and going backwards,
            # find the line with timestamp early enough for subtitles to not overlap
            index_start: int = index_end
            while (    index_start >= 0
                   and subs[index_start].end.total_seconds() >= timestamp_start_bound
                  ):
                index_start -= 1

            # the line with index_start is the first one not overlapping,
            # so skip to the next one
            index_start += 1

            # remove subs with indices from index_start to index_end,
            # which is the minimal amount to remove so that there is no more
            # jumping backwards in time between consecutive subs
            for i in range(index_start, index_end+1):
                if i in indices_v1:
                    indices_v1.remove(i)

        ## variant 2 of the fix:
        # index_start is where the time jump occurs
        for index_start in diffs_negative:
            timestamp_end_bound: float = subs[index_start].end.total_seconds()
            index_start += 1
            # determine index_end
            # ... starting from index_start and going forward,
            # find the line with timestamp late enough for subtitles to not overlap
            index_end: int = index_start
            while (    index_end < len(subs)
                   and subs[index_end].start.total_seconds() <= timestamp_end_bound
                  ):
                index_end += 1

            # the loop stops at the first line that no longer overlaps,
            # so step back to the last overlapping one
            index_end -= 1

            # remove subs with indices from index_start to index_end,
            # which is the minimal amount to remove so that there is no more
            # jumping backwards in time between consecutive subs
            for i in range(index_start, index_end+1):
                if i in indices_v2:
                    indices_v2.remove(i)

        subs_filtered_v1 = [subs[i] for i in sorted(list(indices_v1))]
        subs_filtered_v2 = [subs[i] for i in sorted(list(indices_v2))]

        try:
            print("file:", filename)
            srt_output_v1: str = srt.compose(subs_filtered_v1,reindex=False)
            print(f" ... srt contents fixed, variant 1: a total of {len(subs)-len(subs_filtered_v1)} lines removed")
            srt_output_v2: str = srt.compose(subs_filtered_v2,reindex=False)
            print(f" ... srt contents fixed, variant 2: a total of {len(subs)-len(subs_filtered_v2)} lines removed")
            return (srt_output_v1, srt_output_v2)
        except Exception:
            print(" ... srt.compose internal error, cannot re-parse subtitles")
            print(f" ... encoding detected: {detector.result}")

    # there is nothing to do (file is already ok / not corrupted)
    # print("file:", filename)
    # print(" ... no problems detected, no fix needed")
    return None

def fix_file(filename: str) -> bool:
    """
    Fix SRT file
    :param filename: full path to the SRT file
    :return: True if fixed, False otherwise
    """
    # strip the extension robustly (works even if it is not exactly '.srt')
    base, _ = os.path.splitext(filename)
    filename_v1 = base + append_fixed_v1 + '.srt'
    filename_v2 = base + append_fixed_v2 + '.srt'
    if os.path.exists(filename_v1) or os.path.exists(filename_v2):
        print("file:", filename)
        print(" ... fixes already exist, skipping")
        return False
    else:
        # (try to) detect the correct encoding of a file, pure magic this!
        detector.reset()
        try:
            with open(filename, 'rb') as f:
                for line in f:
                    detector.feed(line)
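                    if detector.done:
                        break
            detector.close()
            # chardet may fail to detect anything, in which case this is None
            encoding: str | None = detector.result['encoding']
        except OSError:
            print("file:", filename)
            print(" ... !!! ERROR cannot read file contents in binary mode (file access problem?), skipping !!!")
            # without this return, 'encoding' would be undefined below
            return False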

        try:
            with open(filename, 'r', encoding= encoding) as f:
                srt_txt = f.read()
        except Exception:
            print("file:", filename)
            print(" ... !!! ERROR cannot read file contents, skipping !!!")
            print(f" ... encoding detected: {detector.result}")
            return False

        # generate two variants of a fix
        fixes: tuple[str, str] | None = fix_srt(filename, srt_txt)

        # if fixes occurred
        if fixes:
            srt_fix_v1, srt_fix_v2 = fixes
            try:
                with open(filename_v1, 'w', encoding= encoding) as f:
                    f.write(srt_fix_v1)
                with open(filename_v2, 'w', encoding= encoding) as f:
                    f.write(srt_fix_v2)
                return True
            except (OSError, UnicodeError):
                print(" ... !!! ERROR occurred while writing files with fixes !!!")
                print(f" ... encoding detected: {detector.result}")

        # no fix was needed, or writing the fixed files failed
        return False

if __name__ == "__main__":
    # Test the amount of arguments
    if len(sys.argv) != 2:
        print("Exactly one argument required")
        print(help_string)
        sys.exit(-1)

    arg = sys.argv[1]

    # Traverse the directory if given as an arg
    if os.path.isdir(arg):
        print(f"Traversing directory '{arg}'...")
        for filename_relative in glob.iglob("**/*.srt", root_dir=arg, recursive= True):
            filename = os.path.join(arg, filename_relative)
            if os.path.isfile(filename):
                fix_file(filename)

    # ... or just fix the file if given as an arg
    elif os.path.isfile(arg):
        if not fix_file(arg):
            print("file:", arg)
            print(" ... fix not applied / not needed")

    # cover the edge case
    else:
        print(f"'{arg}' is neither a file nor a directory")
        print(help_string)
        sys.exit(-1)
    print(" ... DONE!")