Closed
Labels
detection: Related to the charset detection mechanism, chaos/mess/coherence
Description
Describe the bug
Passing a short string of conventional ASCII text to detect() returns a UTF-16LE encoding.
To Reproduce
import chardet
import charset_normalizer

# charset_normalizer misdetects this two-byte ASCII input (b"(;" behaves the same)
charset_normalizer.detect(b");")
# returns {'encoding': 'utf_16_le', 'language': '', 'confidence': 1.0}

# chardet reports ASCII for the same input
chardet.detect(b");")
# returns {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
Expected behavior
These are standard ASCII characters, so I expect a UTF-8 encoding.
Desktop (please complete the following information):
- macOS 14.5
- Python version 3.12.1 (anaconda build)
- charset_normalizer version 3.3.2
Additional context
Evaluating any of b"(", b")", b";" or b"()" on its own produces the expected result. Other combinations of punctuation characters produce the same error, e.g. b".;".
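To find the other affected combinations, something like the following sketch could enumerate every two-character punctuation pair. The helper name `find_unexpected_pairs` and the set of "expected" encoding names are my own assumptions, not part of either library; the detector is passed in as a callable so the same loop works with `charset_normalizer.detect` or `chardet.detect`.

```python
import string
from itertools import product
from typing import Any, Callable, Dict, List, Tuple


def find_unexpected_pairs(
    detector: Callable[[bytes], Dict[str, Any]],
    expected: Tuple[str, ...] = ("ascii", "utf_8", "utf-8"),
) -> List[Tuple[bytes, Any]]:
    """Hypothetical helper: list punctuation pairs whose detected
    encoding is not one of the expected names."""
    unexpected = []
    for first, second in product(string.punctuation, repeat=2):
        sample = (first + second).encode("ascii")
        encoding = detector(sample).get("encoding")
        # The detector may return None when it cannot decide.
        normalized = encoding.lower() if isinstance(encoding, str) else encoding
        if normalized not in expected:
            unexpected.append((sample, encoding))
    return unexpected
```

Calling `find_unexpected_pairs(charset_normalizer.detect)` should then list the failing inputs alongside the encoding that was reported for each.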
I understand this is a very small string, but perhaps detection could default to the smallest character set that can represent the input?
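As a caller-side workaround in the meantime, a minimal sketch of that "default to the smallest character set" idea might look like this. The wrapper name `detect_with_ascii_shortcut` is hypothetical, and this is not how charset_normalizer works internally; it only short-circuits pure 7-bit input before handing anything else to the real detector.

```python
from typing import Any, Callable, Dict


def detect_with_ascii_shortcut(
    data: bytes,
    detector: Callable[[bytes], Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical wrapper: report ASCII for pure 7-bit input,
    otherwise defer to the supplied detector."""
    # NUL bytes are technically ASCII but usually signal UTF-16/32 text,
    # so inputs containing them are still passed to the detector.
    if data and data.isascii() and b"\x00" not in data:
        return {"encoding": "ascii", "language": "", "confidence": 1.0}
    return detector(data)
```

For example, `detect_with_ascii_shortcut(b");", charset_normalizer.detect)` would report `ascii` without ever invoking the detector, while non-ASCII and empty inputs are passed through unchanged.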