String values don't properly handle unicode escapes #58

@SteveKommrusch

I am using javalang to tokenize Java files that include Unicode escape sequences. These are correctly tokenized as strings, but item.value does not distinguish an escape sequence from the literal character it denotes. Consider the two cases below:
Case 1: builder.append(text, 0, MAX_TEXT).append('\u2026');
Case 2: builder.append(text, 0, MAX_TEXT).append('…');

In both cases, item.value is identical, and I get an exception if I try to write item.value to a file. I can catch the error and print successfully in Python like this:

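      if token_type == 'String':
          try:
              outfile.write(item.value)
          except UnicodeEncodeError:
              # The file's encoding cannot represent the character, so fall back
              # to an ASCII-only form that uses backslash escapes instead.
              outfile.write(item.value.encode('unicode-escape').decode('utf-8'))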

but the Python code above prints the same value for Cases 1 and 2. I suspect the proper fix is to use raw strings for String token values inside javalang. Below is an example of raw strings solving the problem:

>>> str1 = '…'
>>> str2 = '\u2026'
>>> print("str1: ",str1," str2:",str2)
str1:  …  str2: …
>>> str1 == str2
True
>>> str1 = r'…'
>>> str2 = r'\u2026'
>>> print("str1: ",str1," str2:",str2)
str1:  …  str2: \u2026
>>> str1 == str2
False
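
For reference, here is a minimal sketch of two caller-side workarounds. It assumes item.value already holds the decoded character in both cases, as described above. The helper name java_escape_non_ascii and the file name tokens.txt are hypothetical and not part of javalang. The first workaround opens the output file with an explicit UTF-8 encoding so the write does not raise UnicodeEncodeError. The second re-escapes non-ASCII characters in Java's \uXXXX form. Neither can recover whether the original source used the escape or the literal character.

```python
import os
import tempfile


def java_escape_non_ascii(value: str) -> str:
    # Hypothetical helper (not part of javalang): rewrite each non-ASCII
    # character as Java-style unicode escapes, one per UTF-16 code unit.
    out = []
    for ch in value:
        if ord(ch) < 128:
            out.append(ch)
        else:
            data = ch.encode('utf-16-be')
            for i in range(0, len(data), 2):
                unit = int.from_bytes(data[i:i + 2], 'big')
                out.append('\\u%04x' % unit)
    return ''.join(out)


# Stand-in for item.value: quote, U+2026, quote (assumed identical for Case 1 and Case 2).
decoded = "'\u2026'"

# Workaround 1: write with an explicit encoding that can represent the character.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tokens.txt")
    with open(path, "w", encoding="utf-8") as outfile:
        outfile.write(decoded)
    with open(path, encoding="utf-8") as infile:
        round_trip = infile.read()

# Workaround 2: emit an ASCII-only form using Java escape syntax.
escaped = java_escape_non_ascii(decoded)
print(escaped)
```

Unlike the 'unicode-escape' codec, the helper leaves existing backslashes untouched and always uses the four-digit \u form, so the output stays valid Java source text.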
