Replace bs4 stub with beautifulsoup4 in dependencies #977
justinwolfington wants to merge 1 commit
The `bs4` package on PyPI is a dummy stub (Beautiful Soup maintainer's own anti-name-squatting placeholder) that just installs `beautifulsoup4`. Listing it as a direct dependency adds an extra step in dependency resolution and pulls in a package with no source code of its own.

`beautifulsoup4` registers the `bs4` import name, so the existing `from bs4 import BeautifulSoup` and `import bs4` usages in `data_juicer/download/downloader.py` and `data_juicer/ops/mapper/extract_tables_from_html_mapper.py` continue to work unchanged.

Effect:
- One fewer hop in dependency resolution.
- Cleaner dependency graph for downstream consumers (notably anyone using mirrors / private indexes that don't host the `bs4` stub).
Code Review
This pull request updates the pyproject.toml file by replacing the bs4 dependency with beautifulsoup4. The reviewer suggested adding a minimum version constraint to the new dependency to ensure reproducibility and reminded the author to update the uv.lock file to reflect this change.
      "emoji==2.2.0",  # emoji handling
      "tabulate",
-     "bs4",
+     "beautifulsoup4",
It is recommended to provide a minimum version constraint for beautifulsoup4 to ensure reproducibility and prevent the installation of very old versions, which is consistent with other core dependencies in this file (e.g., datasets, numpy).
Additionally, since uv.lock is explicitly included in the build configuration (lines 197, 203, 206), please ensure that the lock file is updated (e.g., by running uv lock) to reflect this change and avoid synchronization issues during the build process.
Suggested change:
-     "beautifulsoup4",
+     "beautifulsoup4>=4.11.0",
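To illustrate what the suggested `>=4.11.0` floor would rule out, here is a rough sketch of a minimum-version check. It assumes plain `X.Y.Z` release numbers and deliberately ignores the full PEP 440 rules (pre-releases, epochs, local versions); the helper names `version_tuple` and `satisfies_minimum` are hypothetical.

```python
import re


def version_tuple(version):
    """Turn a release string like '4.12.3' into (4, 12, 3).

    Simplification: only the first three numeric components are kept, so
    this is not a substitute for a real PEP 440 parser.
    """
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def satisfies_minimum(installed, minimum="4.11.0"):
    """Return True if `installed` is at least `minimum` (numeric compare)."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == "__main__":
    for candidate in ("4.9.3", "4.11.0", "4.12.3"):
        print(candidate, satisfies_minimum(candidate))
```

The comparison is numeric rather than lexical, which matters here: as strings, `"4.9.3"` would sort after `"4.11.0"`, while as tuples it correctly falls below the floor.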
Summary
Swap `"bs4"` for `"beautifulsoup4"` in `pyproject.toml`'s `dependencies`.

Why
The `bs4` package on PyPI is a dummy stub maintained by Beautiful Soup's author to prevent name-squatting. Its only purpose is to depend on `beautifulsoup4` (the real library) so that users who accidentally type `pip install bs4` get something working. The PyPI page itself says: "This is a dummy package managed by the developer of Beautiful Soup to prevent name squatting."

Listing `bs4` as a direct dependency adds an extra hop to dependency resolution (`bs4` → `beautifulsoup4`).

What stays the same
`beautifulsoup4` registers the `bs4` import name, so the two existing usages in this repo continue to work unchanged:
- `data_juicer/download/downloader.py:12`: `from bs4 import BeautifulSoup`
- `data_juicer/ops/mapper/extract_tables_from_html_mapper.py:1`: `import bs4`

No code changes needed elsewhere.

Test plan
`pip install py-data-juicer` after this lands installs `beautifulsoup4` directly without going through the `bs4` shim, and `import bs4` still resolves.
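The test plan could be spot-checked in a fresh environment with something like the sketch below. It uses only the standard library; the function names `is_installed` and `check_environment` are hypothetical, and the result reflects whatever environment it runs in (another package may still pull in the `bs4` stub).

```python
from importlib.metadata import PackageNotFoundError, distribution
from importlib.util import find_spec


def is_installed(dist_name):
    """Return True if a distribution with this name is installed."""
    try:
        distribution(dist_name)
    except PackageNotFoundError:
        return False
    return True


def check_environment():
    """Report the three facts the test plan cares about."""
    return {
        # `import bs4` still resolves
        "bs4_importable": find_spec("bs4") is not None,
        # the PyPI stub distribution named `bs4`
        "stub_installed": is_installed("bs4"),
        # the real library
        "real_installed": is_installed("beautifulsoup4"),
    }


if __name__ == "__main__":
    print(check_environment())
```

In a clean environment after this change, the expected outcome would be `bs4_importable` and `real_installed` true and `stub_installed` false, though that has not been verified here.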