-
-
Notifications
You must be signed in to change notification settings - Fork 3.5k
[Bug] Fix missing subtitle text in manually downloaded *.SRT files. (issue #10030) #12575
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
2c35db7
e1888ed
22ee01b
3516667
71aa6d5
d311fae
300afde
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -54,6 +54,36 @@ private void writeString(final String text) throws IOException { | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| out.write(text.getBytes(charset)); | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| // CHECKSTYLE:OFF checkstyle:JavadocStyle | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| // checkstyle does not understand that span tags are inside a code block | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| /** | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| * <p>Recursive method to extract text from all nodes.</p> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| * <p> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| * This method processes {@link TextNode}s and {@code <br>} tags, | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| * recursively extracting text from nested tags | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| * (e.g. extracting text from nested {@code <span>} tags). | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| * Newlines are added for {@code <br>} tags. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| * </p> | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| * @param node the current node to process | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| * @param text the {@link StringBuilder} to append the extracted text to | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| */ | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| private void extractText(final Node node, final StringBuilder text) { | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| if (node instanceof TextNode textNode) { | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| text.append((textNode).text()); | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| } else if (node instanceof Element element) { | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| // <br> is a self-closing HTML tag used to insert a line break. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| if (element.tagName().equalsIgnoreCase("br")) { | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| // Add a newline for <br> tags | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| text.append(NEW_LINE); | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| // Recursively process child nodes | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| for (final Node child : node.childNodes()) { | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| extractText(child, text); | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| if (node instanceof TextNode textNode) { | |
| text.append((textNode).text()); | |
| } else if (node instanceof Element element) { | |
| // <br> is a self-closing HTML tag used to insert a line break. | |
| if (element.tagName().equalsIgnoreCase("br")) { | |
| // Add a newline for <br> tags | |
| text.append(NEW_LINE); | |
| } | |
| } | |
| // Recursively process child nodes | |
| for (final Node child : node.childNodes()) { | |
| extractText(child, text); | |
| } | |
| final List<Pair<Element, String>> pairList = doc.selectStream("body > div > p") | |
| .map(paragraph -> { | |
| // Element.text extracts from child nodes as well | |
| return new Pair<>(paragraph, paragraph.text()); | |
| }) | |
| .filter(pair -> !ignoreEmptyFrames || !pair.second.isEmpty()) | |
| .toList(); | |
| for (final var pair : pairList) { | |
| final var paragraph = pair.first; | |
| final var text = pair.second; | |
| final String begin = getTimestamp(paragraph, "begin"); | |
| final String end = getTimestamp(paragraph, "end"); | |
| writeFrame(begin, end, text); | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Element.text extracts from child nodes as well
But does it convert <br> tags to a new line? At least not from what I could find in the docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion!
I gave it a try locally, but it failed to build (built with dev branch) because:
Pairrequires thecommons-lang3dependency, which is currently not part of NewPipe.selectStream(...)is only available in newer Jsoup versions. The current setup doesn’t support it.
In addition, the recursive extractText() method is dependency-free, has been tested extensively, and keeps the logic clearer for future modifications (e.g. handling special tags).
For this reason, I’d prefer to keep the current approach extractText() instead of introducing new dependencies or relying on newer APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Android has a built-in Pair type. Also, the current Jsoup version in the project has the method.
@TobiGr You're right, but it works fine for the subtitles from the linked videos. Alternatively, the map operation could be changed to use the existing logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I honestly don't see a good reason to use streams here. It makes the whole code harder to read and I am not sure if we achieve an increased performance by using a stream here. If stream is significantly faster here, we could make use of it though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I changed the parameter to String as well, forgot to add that.
I accidentally edited your reply, sorry about that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switching to paragraph.text() without verifying whether all edge cases (like special characters or tags) are handled properly could introduce new issues.
The extractText() function consolidates NewPipe’s scattered code into a single, recursive function that processes all text content from various tags. This code has been tested for a long time and is well-suited to NewPipe’s design requirements. It guarantees the handling of special characters, especially line breaks (<br>), which are explicitly replaced with \r\n—something that paragraph.text() may not do correctly. Some software, for instance, defines <br> as \n or even as spaces, and we can't be sure that paragraph.text() handles these cases the same way as NewPipe does. And any other special characters, we don't know.
Switching to paragraph.text() introduces the risk of losing control over special character handling. If we need to add special handling later, paragraph.text() could lock us out of such flexibility and we can't change its code. Essentially, it becomes a black box, and we’re unsure of how it’ll behave with our specific use cases.
For these reasons, I recommend continuing to use extractText().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the paragraph.text() definition from https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/nodes/Element.java#L1552
public String text() {
final StringBuilder accum = StringUtil.borrowBuilder();
new TextAccumulator(accum).traverse(this);
return StringUtil.releaseBuilder(accum).trim();
}And it uses the trim() function.
After researching it on Google, I found that trim() is a standard Java String method used to remove leading and trailing whitespace from a string.
Since it removes whitespace, it is not suitable for subtitles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the implementation in this PR is already good enough. @Isira-Seneviratne If you do not have any objections, I'd proceed with merging this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I am also in favour of keeping the current code
Uh oh!
There was an error while loading. Please reload this page.