Skip to content
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,36 @@ private void writeString(final String text) throws IOException {
out.write(text.getBytes(charset));
}

// CHECKSTYLE:OFF checkstyle:JavadocStyle
// checkstyle does not understand that span tags are inside a code block
/**
* <p>Recursive method to extract text from all nodes.</p>
* <p>
* This method processes {@link TextNode}s and {@code <br>} tags,
* recursively extracting text from nested tags
* (e.g. extracting text from nested {@code <span>} tags).
* Newlines are added for {@code <br>} tags.
* </p>
* @param node the current node to process
* @param text the {@link StringBuilder} to append the extracted text to
*/
private void extractText(final Node node, final StringBuilder text) {
if (node instanceof TextNode textNode) {
text.append((textNode).text());
} else if (node instanceof Element element) {
// <br> is a self-closing HTML tag used to insert a line break.
if (element.tagName().equalsIgnoreCase("br")) {
// Add a newline for <br> tags
text.append(NEW_LINE);
}
}
// Recursively process child nodes
for (final Node child : node.childNodes()) {
extractText(child, text);
}
Copy link
Copy Markdown
Member

@Isira-Seneviratne Isira-Seneviratne Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The build method could be rewritten like this:

Suggested change
if (node instanceof TextNode textNode) {
text.append((textNode).text());
} else if (node instanceof Element element) {
// <br> is a self-closing HTML tag used to insert a line break.
if (element.tagName().equalsIgnoreCase("br")) {
// Add a newline for <br> tags
text.append(NEW_LINE);
}
}
// Recursively process child nodes
for (final Node child : node.childNodes()) {
extractText(child, text);
}
final List<Pair<Element, String>> pairList = doc.selectStream("body > div > p")
.map(paragraph -> {
// Element.text extracts from child nodes as well
return new Pair<>(paragraph, paragraph.text());
})
.filter(pair -> !ignoreEmptyFrames || !pair.second.isEmpty())
.toList();
for (final var pair : pairList) {
final var paragraph = pair.first;
final var text = pair.second;
final String begin = getTimestamp(paragraph, "begin");
final String end = getTimestamp(paragraph, "end");
writeFrame(begin, end, text);
}

Copy link
Copy Markdown
Contributor

@TobiGr TobiGr Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Element.text extracts from child nodes as well

But does it convert <br> tags to a new line? At least not from what I could find in the docs.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Isira-Seneviratne

Thanks for the suggestion!

I gave it a try locally, but it failed to build (built with dev branch) because:

  1. Pair requires the commons-lang3 dependency, which is currently not part of NewPipe.
  2. selectStream(...) is only available in newer Jsoup versions. The current setup doesn’t support it.

In addition, the recursive extractText() method is dependency-free, has been tested extensively, and keeps the logic clearer for future modifications (e.g. handling special tags).

For this reason, I’d prefer to keep the current approach extractText() instead of introducing new dependencies or relying on newer APIs.

Copy link
Copy Markdown
Member

@Isira-Seneviratne Isira-Seneviratne Aug 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Android has a built-in Pair type. Also, the current Jsoup version in the project has the method.

@TobiGr You're right, but it works fine for the subtitles from the linked videos. Alternatively, the map operation could be changed to use the existing logic.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I honestly don't see a good reason to use streams here. It makes the whole code harder to read and I am not sure if we achieve an increased performance by using a stream here. If stream is significantly faster here, we could make use of it though.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I changed the parameter to String as well, forgot to add that.

I accidentally edited your reply, sorry about that

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching to paragraph.text() without verifying whether all edge cases (like special characters or tags) are handled properly could introduce new issues.

The extractText() function consolidates NewPipe’s scattered code into a single, recursive function that processes all text content from various tags. This code has been tested for a long time and is well-suited to NewPipe’s design requirements. It guarantees the handling of special characters, especially line breaks (<br>), which are explicitly replaced with \r\n—something that paragraph.text() may not do correctly. Some software, for instance, defines <br> as \n or even as spaces, and we can't be sure that paragraph.text() handles these cases the same way as NewPipe does. And any other special characters, we don't know.

Switching to paragraph.text() introduces the risk of losing control over special character handling. If we need to add special handling later, paragraph.text() could lock us out of such flexibility and we can't change its code. Essentially, it becomes a black box, and we’re unsure of how it’ll behave with our specific use cases.

For these reasons, I recommend continuing to use extractText().

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the paragraph.text() definition from https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/nodes/Element.java#L1552

    public String text() {
        final StringBuilder accum = StringUtil.borrowBuilder();
        new TextAccumulator(accum).traverse(this);
        return StringUtil.releaseBuilder(accum).trim();
    }

And it uses the trim() function.
After researching it on Google, I found that trim() is a standard Java String method used to remove leading and trailing whitespace from a string.

Since it removes whitespace, it is not suitable for subtitles.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the implementation in this PR is already good enough. @Isira-Seneviratne If you do not have any objections, I'd proceed with merging this PR.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I am also in favour of keeping the current code

}
// CHECKSTYLE:ON

public void build(final SharpStream ttml) throws IOException {
/*
* TTML parser with BASIC support
Expand All @@ -74,21 +104,15 @@ public void build(final SharpStream ttml) throws IOException {
final Elements paragraphList = doc.select("body > div > p");

// check if has frames
if (paragraphList.size() < 1) {
if (paragraphList.isEmpty()) {
return;
}

for (final Element paragraph : paragraphList) {
text.setLength(0);

for (final Node children : paragraph.childNodes()) {
if (children instanceof TextNode) {
text.append(((TextNode) children).text());
} else if (children instanceof Element
&& ((Element) children).tagName().equalsIgnoreCase("br")) {
text.append(NEW_LINE);
}
}
// Recursively extract text from all child nodes
extractText(paragraph, text);

if (ignoreEmptyFrames && text.length() < 1) {
continue;
Expand Down