These scripts were created as part of my college project, ReadUniverse, where I needed a large volume of book data for development and dummy content.
Before running the scripts, make sure you have Python 3.8+ installed. Then install the required dependencies using:
```
pip install -r requirements.txt
```
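The exact packages live in `requirements.txt`; the contents below are only an illustrative guess at what a scraper like this typically depends on, not the project's actual dependency list:

```
# Illustrative guess only — install from the project's own requirements.txt.
requests
beautifulsoup4
```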
This project consists of three main Python scripts that work together to collect and extract book data from Goodreads:

- `url.py` — Collects book URLs based on a genre, shelf, list, or search query.
- `list.py` — Automatically generated after running `url.py`; by default, this file stores the collected book URLs (an illustrative sketch follows below).
- `scraper.py` — Extracts detailed book information from the URLs stored in `list.py`.
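For reference, the generated URL file is a plain Python module holding a list of book URLs. The sketch below is illustrative; the variable name `urls` and the exact formatting are assumptions, not guaranteed to match the real output:

```python
# list.py — illustrative sketch of the file url.py generates.
# The variable name `urls` is an assumption, not the confirmed output format.
urls = [
    "https://www.goodreads.com/book/show/36236124-fight-club",
    "https://www.goodreads.com/book/show/5907.The_Hobbit",
]
```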
The following command-line arguments configure the URL collection process:
- `--url` (string, required): The Goodreads shelf, search, or list page URL from which to start scraping book URLs.
- `--max` (integer, optional): The maximum number of book URLs to collect. Default: `20`.
- `--delay` (integer, optional): Delay in seconds between consecutive page requests, to respect Goodreads' rate limits and avoid overloading their servers. Default: `2`.
- `--output` (string, optional): The filename where the collected URLs will be saved as a Python file containing a list of URLs. Default: `"list.py"`.
Note: The collected URLs will be saved into the specified output file (default is `list.py`).
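For orientation, a minimal `argparse` setup matching the flags above could look like this sketch; it illustrates the documented interface and is not taken from `url.py`'s actual source:

```python
import argparse

# Sketch of a CLI matching the documented flags (illustrative, not url.py's real code).
parser = argparse.ArgumentParser(description="Collect Goodreads book URLs.")
parser.add_argument("--url", required=True,
                    help="Goodreads shelf, search, or list page URL to start from")
parser.add_argument("--max", type=int, default=20,
                    help="maximum number of book URLs to collect")
parser.add_argument("--delay", type=int, default=2,
                    help="seconds to wait between consecutive page requests")
parser.add_argument("--output", default="list.py",
                    help="Python file to write the collected URL list to")
args = parser.parse_args()
```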
Usage example:

```
python url.py --url https://www.goodreads.com/shelf/show/fantasy --max 50 --delay 1 --output books.py
```

Currently, `scraper.py` does not accept any command-line arguments. It automatically processes the list of book URLs saved in the output file generated by `url.py` (default: `list.py`).
Usage example:

```
python scraper.py
```

- Run `url.py` to collect book URLs and save them to a Python file (e.g., `list.py` or a custom filename).
- Run `scraper.py` to scrape detailed book data from the URLs contained in that file.
- The scraped data will be exported in JSON and CSV formats for further analysis (see the end-to-end sketch below).
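To make the pipeline concrete, here is a minimal sketch of how `scraper.py` could load the generated URL file and export records to JSON and CSV. The `urls` variable name, the `scrape_book` stub, and the load mechanism are assumptions for illustration, not the project's actual implementation:

```python
import csv
import importlib.util
import json

def scrape_book(url: str) -> dict:
    # Stand-in for the real per-page extraction logic.
    return {"url": url}

# Load the URL list from the generated file (assumed here to be list.py
# exposing a top-level `urls` variable — both names are assumptions).
spec = importlib.util.spec_from_file_location("booklist", "list.py")
booklist = importlib.util.module_from_spec(spec)
spec.loader.exec_module(booklist)

records = [scrape_book(url) for url in booklist.urls]

# Export the same records in both formats.
with open("data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

if records:
    with open("data.csv", "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
```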
The extracted data are saved in `data.json` and `data.csv`. Below is a JSON example:
```json
{
"title": "Fight Club",
"authorName": "Chuck Palahniuk",
"description": "Chuck Palahniuk showed himself to be his generation",
"isbn": "9780393355949",
"publication": "1996-08-17",
"pages": 224,
"category": [
"Fiction",
"Classics",
"Thriller",
"Contemporary",
"Novels",
"Mystery",
"Literature"
],
"likes": 69,
"averageRating": 4.18,
"totalRating": 625058,
"totalReview": 25009,
"price": 57000,
"stock": 7,
"imageURL": "https://images-na.ssl-images-amazon.com/images/S/compressed.photo.goodreads.com/books/1558216416i/36236124.jpg"
}
```

Note: The `likes`, `price`, and `stock` fields are generated only for dummy data; they do not come from the actual Goodreads website.
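Because these three fields are synthetic, they can be populated however suits your dummy data; one simple, purely illustrative approach (not the project's confirmed method) is random generation:

```python
import random

# Illustrative dummy values for the three synthetic fields; the value
# ranges are arbitrary assumptions, not taken from the project.
book = {}
book["likes"] = random.randint(0, 100)
book["price"] = random.randrange(10_000, 100_000, 1_000)  # e.g. a price in IDR
book["stock"] = random.randint(0, 20)
```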
This project provides a scalable and modular solution for extracting comprehensive book data from Goodreads. The `url.py` script collects book URLs based on specified criteria (genre, shelf, or search) and saves them in a Python file (default: `list.py`). The `scraper.py` script processes these URLs to scrape detailed book metadata, exporting the results in JSON and CSV formats. This approach enables efficient, large-scale data harvesting suitable for analytics, research, machine learning datasets, or app development.