I am using javalang to tokenize files that include Unicode escape sequences. These are correctly tokenized as strings, but item.value is not handled cleanly. Consider the two cases below:
Case 1: builder.append(text, 0, MAX_TEXT).append('\u2026');
Case 2: builder.append(text, 0, MAX_TEXT).append('…');
In both cases, item.value is identical, and I get an exception if I try to write item.value to a file. I can catch the error and print successfully using Python like this:
if (token_type == 'String'):
    try:
        outfile.write(item.value)
    except UnicodeEncodeError:
        outfile.write(item.value.encode('unicode-escape').decode('utf-8'))

However, the Python code above prints the same value for Case 1 and Case 2. I suspect the proper fix is to use raw strings for String token values internal to javalang. Below is an example of raw strings solving the problem.
>>> str1 = '…'
>>> str2 = '\u2026'
>>> print("str1: ",str1," str2:",str2)
str1: … str2: …
>>> str1 == str2
True
>>> str1 = r'…'
>>> str2 = r'\u2026'
>>> print("str1: ",str1," str2:",str2)
str1: … str2: \u2026
>>> str1 == str2
False
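
As a side note on the write failure itself: a UnicodeEncodeError on write normally means the output file's encoding cannot represent the character, so it may depend on how outfile was opened rather than on the token. The sketch below is only an illustration under that assumption. The helper name write_value and the in-memory buffers are hypothetical and not part of javalang. It shows the same value written once to a UTF-8 stream and once to an ASCII stream, with the try/except fallback from above.

```python
import io

ELLIPSIS = "\u2026"  # the character that appears in both Case 1 and Case 2


def write_value(outfile, value):
    # Hypothetical helper mirroring the try/except above: write the value,
    # and fall back to a backslash-u escape if the encoding rejects it.
    try:
        outfile.write(value)
    except UnicodeEncodeError:
        outfile.write(value.encode("unicode-escape").decode("ascii"))


# A UTF-8 stream can encode the character, so no exception is raised.
utf8_buffer = io.BytesIO()
utf8_file = io.TextIOWrapper(utf8_buffer, encoding="utf-8")
write_value(utf8_file, ELLIPSIS)
utf8_file.flush()
utf8_bytes = utf8_buffer.getvalue()

# An ASCII stream cannot encode it, so the fallback writes the escape form.
ascii_buffer = io.BytesIO()
ascii_file = io.TextIOWrapper(ascii_buffer, encoding="ascii")
write_value(ascii_file, ELLIPSIS)
ascii_file.flush()
ascii_bytes = ascii_buffer.getvalue()

print(utf8_bytes)   # UTF-8 encoding of the ellipsis character
print(ascii_bytes)  # the six ASCII characters of the escape sequence
```

With a real file, the equivalent would be open(path, "w", encoding="utf-8"). This only avoids the exception. It does not recover which of the two spellings the Java source used, which is the distinction the raw-string session above is about.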