Example code beautifulsoup_crawler.py not working on Windows due to encoding assumptions. #532

The example code [beautifulsoup_crawler.py](https://github.com/apify/crawlee-python/blob/3c3dfe8cdbf3b72c46018914a2b6479b31c5dd99/docs/examples/code/beautifulsoup_crawler.py) scrapes https://crawlee.dev
This page contains special symbols, for example: "🏗️"
Such symbols are scraped and saved to a file using [context.push_data](https://github.com/apify/crawlee-python/blob/3c3dfe8cdbf3b72c46018914a2b6479b31c5dd99/docs/examples/code/beautifulsoup_crawler.py#L40).
The file is written without an explicit encoding:
https://github.com/apify/crawlee-python/blob/3c3dfe8cdbf3b72c46018914a2b6479b31c5dd99/src/crawlee/memory_storage_client/_dataset_client.py#L353

In that case the system default encoding is used.

The following mini-example summarizes the issue:
```
import locale
locale.getpreferredencoding() # "cp1252" on Windows, "utf-8" on Linux
"🏗️".encode("utf-8") # Works fine. 
"🏗️".encode("cp1252") # UnicodeEncodeError
```

To reproduce the issue with the example file, run it on a Windows machine where `locale.getpreferredencoding()` returns "cp1252".
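For anyone without a Windows machine at hand, the failure can be approximated on any OS by passing the encoding explicitly. This is only a sketch: `try_write` is a hypothetical helper, not crawlee code, and it assumes the item is serialized with `ensure_ascii=False` so that the symbol reaches the file as-is.

```python
import json
import tempfile
from pathlib import Path

# Assumption: a scraped item containing the symbol from the page.
item = {"title": "🏗️ Crawlee"}
text = json.dumps(item, ensure_ascii=False)


def try_write(path: Path, encoding: str) -> bool:
    """Hypothetical helper: return True if `text` can be written with `encoding`."""
    try:
        with open(path, "w", encoding=encoding) as f:
            f.write(text)
    except UnicodeEncodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    # "cp1252" stands in for the Windows default; "utf-8" for the Linux default.
    results = {
        enc: try_write(Path(tmp) / f"{enc}.json", enc)
        for enc in ("utf-8", "cp1252")
    }

print(results)
```

Passing `encoding="cp1252"` here only imitates what `open()` does implicitly when no encoding is given on such a Windows setup.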

This exposes a bigger issue: the code either makes assumptions about the encoding of the extracted data that might not always be correct, or it relies on the system default encoding, which can differ between environments. Hard-coding utf-8 would probably be the fastest simple fix for writing to a file. But what if the user wants to use a different encoding?

Unexpected encodings could also cause other issues when the soup is created without the optional **from_encoding** argument: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings
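Independent of BeautifulSoup, here is a small sketch of what a wrong encoding guess does on the reading side. It uses only `bytes.decode` and a made-up sample string, so it illustrates the general problem rather than the parser's actual detection logic.

```python
# Bytes as they might arrive from a UTF-8 encoded page (made-up sample).
raw = "café".encode("utf-8")

right = raw.decode("utf-8")    # correct guess
wrong = raw.decode("cp1252")   # wrong guess: no error, but garbled text

print(right == "café", wrong == "café")
```

Unlike the write case, the wrong guess here does not raise an error; it silently produces garbled text, which is harder to notice.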

One way to solve the issue would be to give users an option to specify the encodings that the system should use when needed. Here is a draft solution that handles only the encoding used when writing to a file, which is enough to fix the code example on Windows:
https://github.com/Pijukatel/crawlee-python/pull/1
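To illustrate the general idea (this is not the code from the linked draft), here is a minimal sketch of a writer that defaults to utf-8 but lets the caller override it. `DEFAULT_ENCODING`, `write_item` and `read_item` are hypothetical names and do not exist in crawlee.

```python
import json
import tempfile
from pathlib import Path
from typing import Any

# Hypothetical default; not an existing crawlee setting.
DEFAULT_ENCODING = "utf-8"


def write_item(path: Path, item: dict[str, Any], encoding: str = DEFAULT_ENCODING) -> None:
    """Write `item` as JSON using an explicit encoding instead of the system default."""
    with open(path, "w", encoding=encoding) as f:
        f.write(json.dumps(item, ensure_ascii=False, indent=2))


def read_item(path: Path, encoding: str = DEFAULT_ENCODING) -> dict[str, Any]:
    """Read the JSON back with the same explicit encoding."""
    with open(path, "r", encoding=encoding) as f:
        return json.loads(f.read())


item = {"title": "🏗️ Crawlee"}

with tempfile.TemporaryDirectory() as tmp:
    default_path = Path(tmp) / "default.json"
    custom_path = Path(tmp) / "custom.json"

    # Default: utf-8 regardless of what the OS prefers.
    write_item(default_path, item)
    round_trip_default = read_item(default_path)

    # Override: the user asks for a different encoding.
    write_item(custom_path, item, encoding="utf-16")
    round_trip_custom = read_item(custom_path, encoding="utf-16")

print(round_trip_default == item, round_trip_custom == item)
```

With an explicit default, the result no longer depends on `locale.getpreferredencoding()`, while users with other requirements can still choose their own encoding.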

(I am not familiar with the code base, so this is a naive proposal. There are probably more suitable ways to address this; it is just an example solution that would give users a way to deal with such issues.)



