Example code beautifulsoup_crawler.py not working on Windows due to encoding assumptions. #532

@Pijukatel

Description

The example code beautifulsoup_crawler.py scrapes https://crawlee.dev
That page contains special symbols, for example: "🏗️"
These symbols are scraped and saved to a file via context.push_data.
The file is opened for writing without an explicit encoding:

f = await asyncio.to_thread(open, file_path, mode='w')

In that case the system default encoding is used.
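For comparison, here is a minimal sketch of the same call with the encoding passed explicitly. asyncio.to_thread forwards keyword arguments to the wrapped function, so encoding='utf-8' reaches open(). The helper name write_text and the file name item.json are my own placeholders, not names from the code base.

```python
import asyncio
import tempfile
from pathlib import Path

# The symbol from the issue, written with escapes so this file stays ASCII.
SYMBOL = '\U0001F3D7\uFE0F'


async def write_text(file_path: Path, text: str) -> None:
    """Hypothetical helper: write text with an explicit encoding."""
    # Keyword arguments are forwarded to open(), so the result no longer
    # depends on locale.getpreferredencoding().
    f = await asyncio.to_thread(open, file_path, mode='w', encoding='utf-8')
    try:
        await asyncio.to_thread(f.write, text)
    finally:
        await asyncio.to_thread(f.close)


async def main() -> str:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / 'item.json'  # placeholder file name
        await write_text(path, SYMBOL)
        # Read back with the same encoding to confirm the round trip.
        return path.read_text(encoding='utf-8')


result = asyncio.run(main())
```

With the explicit argument the write should behave the same on Windows and Linux.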

The following mini-example summarizes the issue:

import locale

locale.getpreferredencoding()  # "cp1252" on Windows, "utf-8" on Linux
"🏗️".encode("utf-8")  # Works fine.
"🏗️".encode("cp1252")  # Raises UnicodeEncodeError

To reproduce the issue with the example file, run it on a Windows machine where locale.getpreferredencoding() returns "cp1252".
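Without a Windows machine at hand, the failure can probably be simulated by passing encoding='cp1252' to open() directly, which is what the default would resolve to there. This is only a sketch of the file-writing step, not the crawler itself; try_write is a name I made up.

```python
import tempfile
from pathlib import Path

# The symbol from the issue, written with escapes so this file stays ASCII.
SYMBOL = '\U0001F3D7\uFE0F'


def try_write(encoding: str) -> bool:
    """Return True if SYMBOL can be written to a file using `encoding`."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / 'out.txt'
        try:
            with open(path, mode='w', encoding=encoding) as f:
                f.write(SYMBOL)
        except UnicodeEncodeError:
            # cp1252 has no mapping for this code point.
            return False
        return True


cp1252_ok = try_write('cp1252')  # simulates the Windows default
utf8_ok = try_write('utf-8')
```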

This exposes a bigger issue: the code either makes assumptions about the encoding of the extracted data that might not always be correct, or relies on the system default encoding, which can differ between environments. Hard-coding utf-8 would probably be the fastest simple fix for writing to a file. But what if the user wants to use a different encoding?

Unexpected encodings could also cause other issues when the soup is created without the optional from_encoding argument. See https://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings
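To illustrate the kind of problem I mean, here is a standard-library sketch (it does not use BeautifulSoup) of what happens when bytes are decoded with the wrong codec. No exception is raised; the text is silently corrupted. As far as I understand the linked documentation, from_encoding is the way to tell BeautifulSoup which codec to use when the markup is passed as bytes.

```python
# 'caf\u00e9' is "cafe" with an acute accent on the e.
original = 'caf\u00e9'
raw = original.encode('utf-8')  # the bytes as a server might send them

# Decoding with the wrong codec succeeds but yields mojibake:
# the two UTF-8 bytes of the accented letter become two separate characters.
wrong = raw.decode('latin-1')

# Decoding with the codec that was actually used restores the text.
right = raw.decode('utf-8')
```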

One way to solve the issue would be to give users an option to specify the preferred encodings that the system should use when needed. Here is an example solution draft that handles only the encoding used when writing to a file, to fix the code example on Windows:
Pijukatel#1

(I am not familiar with the code base, so this is a naive proposal. There are probably more suitable ways to address this; it is just an example solution that would give users the option to deal with such issues.)
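As a rough illustration of the idea (not the content of the linked draft and not the library's API), a writer could take an encoding parameter that defaults to utf-8. The function name save_item and its signature are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def save_item(item: dict[str, Any], file_path: Path, *, encoding: str = 'utf-8') -> None:
    """Hypothetical writer: utf-8 by default, overridable by the caller."""
    with open(file_path, mode='w', encoding=encoding) as f:
        # ensure_ascii=False keeps non-ASCII characters as-is, so the chosen
        # encoding actually matters for the bytes on disk.
        json.dump(item, f, ensure_ascii=False, indent=2)


item = {'title': '\U0001F3D7\uFE0F Crawlee'}  # made-up sample record

with tempfile.TemporaryDirectory() as tmp:
    default_path = Path(tmp) / 'default.json'
    save_item(item, default_path)  # uses utf-8
    loaded_default = json.loads(default_path.read_text(encoding='utf-8'))

    utf16_path = Path(tmp) / 'utf16.json'
    save_item(item, utf16_path, encoding='utf-16')  # caller's choice
    loaded_utf16 = json.loads(utf16_path.read_text(encoding='utf-16'))
```

A default of utf-8 would fix the example on Windows, while the parameter leaves room for users who need something else.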

Labels: bug, t-tooling