
Boston 311 Data Cleaning and Preparation Pipeline

Python 3.11 · MIT License · Jupyter Notebook · Parquet

This repository contains the complete, reproducible data pipeline for acquiring, cleaning, and preparing over 2.5 million Boston 311 service requests from 2015 to 2024. The primary goal of this phase was to resolve significant data quality issues, most notably by recovering over 650,000 records with missing location data, yielding a robust, analysis-ready dataset with a 98.6% retention rate.

Project Status: Data Acquisition and Cleaning phase is complete. The final output is a single, cleaned Parquet file located at data/processed/boston_311_cleaned.parquet.


Key Features

  • Automated Data Acquisition: Scripts automatically download ten years of 311 service request data and all required geospatial shapefiles from their official sources. The scripts are idempotent, checking for existing files before downloading.
  • Comprehensive Data Cleaning: The 01_data_cleaning.ipynb notebook merges all yearly files, standardizes data types, removes redundant columns, and handles systematic inconsistencies in the raw data.
  • Advanced Geospatial Imputation: A multi-stage spatial imputation process was used to recover critical location data for records with valid coordinates:
    • ZIP Codes: Imputed for 551,136 records using a point-in-polygon join with a custom-filtered Massachusetts ZCTA shapefile.
    • Street Names: Imputed for 9,101 records using a nearest-neighbor join against the official Boston SAM address point database.
    • Neighborhoods: Imputed and standardized for 98,211 records using a point-in-polygon join with official neighborhood boundaries.
  • Rigorous Validation: The accuracy of all geospatial data sources was confirmed through a suite of test notebooks that verify known locations against the shapefiles.
  • Reproducible Environment: The project includes an environment.yml file to ensure the analysis can be reproduced reliably with all necessary dependencies.
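
The two imputation joins above (point-in-polygon for ZIP codes and neighborhoods, nearest-neighbor for street names) can be sketched without geospatial dependencies. This is an illustrative toy, not the notebook's actual code; the polygon, address points, and function names are all hypothetical, and the real pipeline operates on proper shapefile geometries:

```python
from math import hypot

def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside polygon, a list of (x, y) vertices?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def impute_zip(point, zcta_polygons):
    """Point-in-polygon join: first ZIP whose polygon contains the point."""
    for zip_code, poly in zcta_polygons.items():
        if point_in_polygon(point[0], point[1], poly):
            return zip_code
    return None

def impute_street(point, sam_points):
    """Nearest-neighbor join: street name of the closest address point."""
    nearest = min(sam_points, key=lambda p: hypot(p[0] - point[0], p[1] - point[1]))
    return nearest[2]

# Toy data: one square "ZIP" polygon and two (x, y, street) address points.
zctas = {"02118": [(0, 0), (1, 0), (1, 1), (0, 1)]}
sam = [(0.1, 0.1, "Washington St"), (0.9, 0.9, "Tremont St")]
print(impute_zip((0.5, 0.5), zctas))    # → 02118
print(impute_street((0.2, 0.2), sam))   # → Washington St
```

In practice both joins are done with vectorized spatial indexes rather than these linear scans; the sketch only shows the geometric idea behind each imputation stage.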

Folder Structure

The project is organized to separate raw data, processed data, notebooks, and scripts for clarity and reproducibility.

BOSTON-311-ANALYSIS/
│
├── .gitignore
├── environment.yml
├── README.md
│
├── data/
│   ├── processed/
│   │   ├── boston_311_cleaned.parquet
│   │   ├── (↓ these Parquets will be generated here by the script)
│   │   ├── boston_neighborhood_boundaries.parquet
│   │   ├── boston_neighborhood_boundaries_remapped.parquet
│   │   ├── live_street_address_management_sam_addresses.parquet
│   │   └── massachusetts_zip_boundaries.parquet
│   └── raw/
│       └── (CSVs will be downloaded here by the script)
│
├── notebooks/
│   ├── 01_data_cleaning.ipynb
│   └── test_geocode/
│       ├── 00_test_geocode_neighborhood.ipynb
│       ├── 00_test_geocode_street_name.ipynb
│       └── 00_test_geocode_zip.ipynb
│
└── scripts/
    ├── 01_fetch_311_data.py
    └── 02_prepare_geodata.py
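
Per the Key Features above, the fetch scripts skip files that already exist on disk. A minimal sketch of that idempotent pattern (the function name, URL, and paths are placeholders, not the repository's actual code):

```python
from pathlib import Path
from urllib.request import urlretrieve

def fetch_if_missing(url: str, dest: Path) -> str:
    """Download url to dest unless dest already exists; report what happened."""
    if dest.exists():
        return "skipped"              # idempotent: never re-download
    dest.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, dest)            # the real scripts may add retries, etc.
    return "downloaded"

# Hypothetical usage: fetch each yearly CSV only once.
# for year in range(2015, 2025):
#     fetch_if_missing(f"https://example.org/311_{year}.csv",
#                      Path("data/raw") / f"311_{year}.csv")
```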

How to Run This Project

To reproduce the data preparation pipeline, follow these steps:

  1. Clone the repository:

    git clone https://github.com/shivaaang/boston-311-analysis.git
    cd boston-311-analysis
  2. Create and activate the Conda environment: This will install all the required packages listed in environment.yml.

    conda env create -f environment.yml
    conda activate boston311
  3. Run the main notebook: The entire data pipeline is orchestrated within the main Jupyter Notebook. Open and run all the cells in notebooks/01_data_cleaning.ipynb. The notebook will automatically:

    • Execute the necessary scripts to download all raw 311 data and geospatial files.
    • Perform all cleaning, imputation, and processing steps.
    • Save the final, analysis-ready dataset to data/processed/boston_311_cleaned.parquet.

Data Sources

All inputs are downloaded automatically by the acquisition scripts. As described above, they include:

  • Boston 311 service request data (2015-2024), fetched as yearly CSV files from the city's official source.
  • A Massachusetts-filtered ZCTA (ZIP Code Tabulation Area) boundary shapefile, used for ZIP code imputation.
  • The official Boston SAM (Street Address Management) address point database, used for street name imputation.
  • Official Boston neighborhood boundary files, used for neighborhood imputation and standardization.
