Replace bs4 stub with beautifulsoup4 in dependencies #977

Open
justinwolfington wants to merge 1 commit into datajuicer:main from justinwolfington:swap-bs4-to-beautifulsoup4

@justinwolfington

Summary

Swap "bs4" for "beautifulsoup4" in pyproject.toml's dependencies.

Why

The bs4 package on PyPI is a dummy stub maintained by Beautiful Soup's author to prevent name-squatting. Its only purpose is to depend on beautifulsoup4 (the real library) so that users who accidentally type pip install bs4 get something working. The PyPI page itself says: "This is a dummy package managed by the developer of Beautiful Soup to prevent name squatting."

Listing bs4 as a direct dependency:

  • Adds an extra hop in dependency resolution (bs4 → beautifulsoup4).
  • Pulls in a package that contains no real code of its own.

What stays the same

beautifulsoup4 registers the bs4 import name, so the two existing usages in this repo continue to work unchanged:

  • data_juicer/download/downloader.py:12: from bs4 import BeautifulSoup
  • data_juicer/ops/mapper/extract_tables_from_html_mapper.py:1: import bs4

No code changes needed elsewhere.

Test plan

  • CI passes (no expected functional change).
  • After this lands, pip install py-data-juicer installs beautifulsoup4 directly rather than through the bs4 shim, and import bs4 still resolves.
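
The second check could be scripted roughly as follows. This is a minimal sketch using only importlib.metadata from the standard library; the function names installed_version and check_environment are hypothetical, and the script only reports what is installed rather than asserting a particular outcome.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def check_environment() -> Dict[str, Optional[str]]:
    """Report the real library and the stub side by side (hypothetical helper)."""
    return {name: installed_version(name) for name in ("beautifulsoup4", "bs4")}


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {found if found is not None else 'not installed'}")
```

In a fresh environment built from the updated pyproject.toml, the expectation would be a version for beautifulsoup4 and "not installed" for bs4, unless another package pulls the stub in.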

The `bs4` package on PyPI is a dummy stub (Beautiful Soup maintainer's
own anti-name-squatting placeholder) that just installs `beautifulsoup4`.
Listing it as a direct dependency adds an extra step in dependency
resolution and pulls in a package with no source code of its own.

`beautifulsoup4` registers the `bs4` import name, so the existing
`from bs4 import BeautifulSoup` and `import bs4` usages in
data_juicer/download/downloader.py and
data_juicer/ops/mapper/extract_tables_from_html_mapper.py continue to
work unchanged.

Effect:
- One fewer hop in dependency resolution.
- Cleaner dependency graph for downstream consumers (notably anyone
  using mirrors / private indexes that don't host the bs4 stub).
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the pyproject.toml file by replacing the bs4 dependency with beautifulsoup4. The reviewer suggested adding a minimum version constraint to the new dependency to ensure reproducibility and reminded the author to update the uv.lock file to reflect this change.

Comment thread pyproject.toml
  "emoji==2.2.0", # emoji handling
  "tabulate",
- "bs4",
+ "beautifulsoup4",
Contributor

medium

It is recommended to provide a minimum version constraint for beautifulsoup4 to ensure reproducibility and prevent the installation of very old versions, which is consistent with other core dependencies in this file (e.g., datasets, numpy).

Additionally, since uv.lock is explicitly included in the build configuration (lines 197, 203, 206), please ensure that the lock file is updated (e.g., by running uv lock) to reflect this change and avoid synchronization issues during the build process.

Suggested change
- "beautifulsoup4",
+ "beautifulsoup4>=4.11.0",

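To see which entries in a dependency list carry no version constraint, a rough sketch along these lines could help. It is a simplified regex check, not a full PEP 508 parser (packaging.requirements.Requirement would be more robust); the function name unconstrained is hypothetical, and the sample list reuses the entries visible in the diff above.

```python
import re

# Version comparison operators from PEP 440, longest first.
_SPECIFIER = re.compile(r"(===|==|~=|!=|<=|>=|<|>)")


def unconstrained(requirements):
    """Return requirement strings that have no version specifier.

    Hypothetical helper; a simplification that ignores edge cases.
    """
    result = []
    for req in requirements:
        spec = req.split(";", 1)[0]  # drop environment markers
        if "@" in spec:  # direct URL references are skipped
            continue
        if not _SPECIFIER.search(spec):
            result.append(spec.strip())
    return result


if __name__ == "__main__":
    deps = ["emoji==2.2.0", "tabulate", "beautifulsoup4"]
    print(unconstrained(deps))
```

With the suggested change applied, beautifulsoup4>=4.11.0 would no longer appear in the result.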