@ljluestc

Fixes issue #134 (originally #455): the number of href URLs in a news site's HTML source differs from the number of URLs Newspaper scrapes.

Changes

  • Added restrict_to_homepage_urls option to newspaper.build to limit articles to homepage <a href> links.
  • Integrated BeautifulSoup for homepage URL extraction.
  • Fixed indexing bug in user example code.
  • Added test case for Reuters homepage scraping.
  • Updated documentation with new option.
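
To illustrate the idea behind restricting articles to homepage links, here is a minimal sketch. It is not the code from this PR: the PR uses BeautifulSoup, while this sketch swaps in the standard library's html.parser so it runs without extra dependencies. The names AnchorCollector, homepage_urls, and restrict_to_homepage are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so "a" also matches <A>.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def homepage_urls(html, base_url):
    """Return the set of absolute http(s) URLs linked from the homepage HTML."""
    parser = AnchorCollector()
    parser.feed(html)
    urls = set()
    for href in parser.hrefs:
        # Resolve relative links against the homepage and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        # Skip mailto:, javascript:, and other non-web schemes.
        if absolute.startswith(("http://", "https://")):
            urls.add(absolute)
    return urls


def restrict_to_homepage(candidate_urls, html, base_url):
    """Keep only the candidate article URLs that the homepage links to."""
    allowed = homepage_urls(html, base_url)
    return [url for url in candidate_urls if url in allowed]
```

In this sketch, candidate_urls stands in for whatever article URLs the scraper discovered; anything not linked from the homepage HTML is dropped, which is the behavior the new option is described as providing.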

Testing

  • Verified ~300 articles scraped from Reuters homepage.
  • Ensured article URLs match homepage patterns.
  • Tested error handling for failed downloads.
  • Ran existing test suite to confirm no regressions.
