Description
The example code beautifulsoup_crawler.py scrapes https://crawlee.dev
This page contains special symbols, for example: "ποΈ"
Such symbols are scraped and then saved to a file via context.push_data.
The file is opened for writing without an explicit encoding:
f = await asyncio.to_thread(open, file_path, mode='w')
In that case the system default encoding is used.
The following mini-example summarizes the issue:
import locale
locale.getpreferredencoding() # "cp1252" on Windows, "utf-8" on Linux
"ποΈ".encode("utf-8") # Works fine.
"ποΈ".encode("cp1252") # UnicodeEncodeError
To reproduce the issue with the example file, run it on a Windows machine where locale.getpreferredencoding() returns "cp1252".
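For anyone without a Windows machine at hand, the same failure can probably be simulated on any platform by forcing the encoding when opening the file. The sketch below is only an illustration: the helper name try_write and the file name data.json are made up for this example and are not part of the crawlee code base.

```python
import tempfile
from pathlib import Path

text = "ποΈ"  # sample string from this issue


def try_write(path, text, encoding):
    """Hypothetical helper: return True if `text` can be written using `encoding`."""
    try:
        with open(path, mode="w", encoding=encoding) as f:
            f.write(text)
    except UnicodeEncodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.json"  # made-up file name
    # Forcing cp1252 mimics what open() does by default on the affected Windows setup.
    cp1252_ok = try_write(target, text, "cp1252")
    utf8_ok = try_write(target, text, "utf-8")

print("cp1252:", cp1252_ok, "utf-8:", utf8_ok)
```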
This exposes a bigger issue: the code either makes assumptions about the encoding of the extracted data, which might not always be correct, or relies on the system default encoding, which can differ between environments. Hard-coding utf-8 would probably be the fastest fix for writing to a file. But what if the user wants to use a different encoding?
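A minimal sketch of what an explicit, overridable encoding could look like when writing a file from async code. This is not the actual crawlee implementation: the names write_text_async and _write_text, and the file name item.json, are hypothetical. The only assumption is that the write goes through asyncio.to_thread, as in the line quoted above.

```python
import asyncio
import tempfile
from pathlib import Path


def _write_text(file_path, text, encoding):
    # Blocking write with an explicit encoding instead of the system default.
    with open(file_path, mode="w", encoding=encoding) as f:
        f.write(text)


async def write_text_async(file_path, text, encoding="utf-8"):
    """Hypothetical helper: defaults to utf-8, but the caller can override it."""
    await asyncio.to_thread(_write_text, file_path, text, encoding)


async def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "item.json"  # made-up file name
        await write_text_async(path, "ποΈ")
        # Read back with the same encoding to check the round trip.
        return path.read_text(encoding="utf-8")


result = asyncio.run(demo())
```

Defaulting to utf-8 keeps the behavior identical across platforms, while the encoding parameter leaves room for users who need something else.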
Unexpected encodings could also cause other issues when the soup is created without the optional from_encoding argument. https://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings
One way to solve the issue would be to let users specify the encodings that the system should use where needed. Here is an example solution draft that only handles the encoding used when writing to a file, which is enough to fix the code example on Windows:
Pijukatel#1
(I am not familiar with the code base, so this is a naive proposal. There are probably more suitable ways to address this; it is just an example solution that would give users a way to deal with such issues.)