Skip to content

Conversation

@zhengwei143
Copy link
Contributor

@zhengwei143 zhengwei143 commented Jul 18, 2023

When args are written to parameter files, non-ascii values are wrongly encoded again as utf-8. This seems to be unaffected by the JDK20 upgrade of Bazel, and has always been happening.

Repro:

$ cat encoding/BUILD$ cat encoding/BUILD
load("defs.bzl", "cat")

cat(
    name = "test_cat",
    out = "cat.txt",
    content = "привет",
)

sh_binary(
    name = "cat_bin",
    srcs = ["test_cat.sh"],
)

$ cat encoding/defs.bzl
def _test_cat_impl(ctx):
  args = ctx.actions.args()
  args.use_param_file("%s", use_always = True)
  args.add(ctx.attr.content)
  ctx.actions.run(
    inputs = [],
    outputs = [ctx.outputs.out],
    arguments = [args, ctx.outputs.out.path],
    executable = ctx.executable.cat_bin,
  )

cat = rule(
  implementation = _test_cat_impl,
  attrs = {
    "out": attr.output(mandatory = True),
    "content": attr.string(mandatory = True),
    "cat_bin": attr.label(
      executable = True,
      cfg = "exec",
      allow_files = True,
      default = Label("//encoding:cat_bin"),
    ),
  })

$ cat encoding/test_cat.sh
#!/bin/sh
cat "$1" >> "$2";

$ bazel build //encoding:test_cat
$ cat bazel-bin/encoding/cat.txt-0.params
'п�иве�'
$ cat bazel-bin/encoding/cat.txt
'п�иве�'

@zhengwei143 zhengwei143 requested a review from tetromino July 19, 2023 12:36
@zhengwei143 zhengwei143 marked this pull request as ready for review July 19, 2023 12:36
@github-actions github-actions bot added awaiting-review PR is awaiting review from an assigned reviewer team-Performance Issues for Performance teams labels Jul 19, 2023
@zhengwei143
Copy link
Contributor Author

Potential fix for #18792

Copy link
Contributor

@tetromino tetromino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems logically correct but inefficient: writeContentUtf8 is a private method which has exactly 1 call site, so we can certainly avoid double-recoding.

I would suggest reverting the change to writeContent(), and instead changing writeContentUtf8() to the following:

      ...
      if (stringUnsafe.getCoder(line) == StringUnsafe.LATIN1 && isAscii(bytes)) {
        outputStream.write(bytes);
      } else if (!StringUtil.decodeBytestringUtf8(line).equals(line)) {
        // We successfully decoded line from utf8 - meaning it was already encoded as utf8.
        // We do not want to double-encode.
        outputStream.write(bytes);
      } else {
        ByteBuffer encodedBytes = encoder.encode(CharBuffer.wrap(line));
        ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

team-Performance Issues for Performance teams

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants