The cm-data-ingestion package provides easy data ingestion from various geodata sources. It is built on the dlt framework (https://dlthub.com/), so every custom source can be pipelined into any of the standard destinations supported by dlt (https://dlthub.com/docs/dlt-ecosystem/destinations/). The main idea behind the package is to break down geodata silos and make the data easily accessible in standard data environments for further visualization or analytics. The package is also built with the analytics-as-code approach in mind.
Currently supported sources:
- OvertureMaps
- OpenStreetMap
- WorldPop
- GTFS
- GeoBoundaries
Run the following command to install the cm-data-ingestion package on your system:
```
pip install git+https://github.com/clevermaps/cm-data-ingestion.git
```

More detailed documentation can be found in the docs folder.
Example configurations and usage can be found in example.py, which demonstrates how to set up ingestion for various sources.
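As a minimal sketch of what such a setup can look like (the source module, function name, and parameters below are illustrative assumptions, not the package's confirmed API; see example.py and the docs folder for the real signatures):

```python
import dlt

# Hypothetical import: the actual source names live in
# src/cm_data_ingestion/sources.
from cm_data_ingestion.sources.geoboundaries import geoboundaries_source

# A standard dlt pipeline; the destination can be any backend
# supported by dlt (DuckDB, BigQuery, Postgres, ...).
pipeline = dlt.pipeline(
    pipeline_name="geoboundaries_demo",
    destination="duckdb",
    dataset_name="geoboundaries",
)

# The source arguments (country code, admin level) are placeholders.
load_info = pipeline.run(geoboundaries_source(countries=["CZE"], adm_level="ADM1"))
print(load_info)
```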
The cm-data-ingestion repository is designed to facilitate the ingestion, processing, and management of various geospatial and transit data sources. It is structured into several key components, enabling modular and extensible data workflows:
- Sources: Located in src/cm_data_ingestion/sources, this directory contains modules responsible for fetching and processing data from different providers such as GeoBoundaries, GTFS (General Transit Feed Specification), OpenStreetMap, OvertureMaps, and WorldPop. Each source module encapsulates the logic specific to its data format and API.
- Pipelines: Found in src/cm_data_ingestion/pipelines, pipelines orchestrate the data ingestion workflows. They manage the sequence of operations, including data extraction, loading, and optionally basic normalization transformations.
- Helpers: Utility functions and shared logic are organized under helpers within both sources and pipelines. These include common data processing routines, API interaction helpers, and configuration management.
The typical ingestion workflow:
- Configuration: Users define ingestion configurations specifying providers, data items, and options.
- Data Extraction: Source modules fetch raw data from external APIs or files, handling authentication, downloading, and initial parsing.
- Loading: Processed data is loaded into DuckDB databases or other destinations for downstream use (a query sketch follows this list).
- Transformation: Optionally, data is normalized using basic staging transformations defined as dbt models (a dbt sketch follows below).
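For the loading step, a minimal sketch assuming the default dlt DuckDB destination (the table name below is an assumption; actual names follow the resource names of the source):

```python
import duckdb

# With destination="duckdb", dlt writes a local file named
# <pipeline_name>.duckdb by default; the dataset name becomes the schema.
con = duckdb.connect("geoboundaries_demo.duckdb")

# "boundaries" is a placeholder table name.
print(con.sql("SELECT * FROM geoboundaries.boundaries LIMIT 5"))
```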
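For the optional transformation step, dlt ships a dbt runner that can execute a dbt package against the pipeline's destination (requires the dlt[dbt] extra). Whether this repository wires its staging models through this helper, and the package path used below, are assumptions:

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="geoboundaries_demo",
    destination="duckdb",
    dataset_name="geoboundaries",
)

# "dbt_staging/" is a hypothetical path to the staging dbt models.
dbt = dlt.dbt.package(pipeline, "dbt_staging/")
for model in dbt.run_all():
    print(model.model_name, model.status)
```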
Key technologies:
- Python 3 for core logic and scripting.
- DuckDB for embedded analytical database capabilities.
- PyArrow and related libraries for efficient data handling.
- Requests and other HTTP libraries for API communication.
- Pytest for unit testing.
The modular design allows easy addition of new data sources or pipelines by adhering to established interfaces and patterns, as sketched below.
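As an illustration, a new source can follow the standard dlt decorator pattern; everything below (the endpoint, resource name, and record shape) is a placeholder, not an existing provider API:

```python
import dlt
import requests

@dlt.resource(name="regions", write_disposition="replace")
def regions(api_url: str = "https://example.org/api/regions"):
    # Fetch raw records and yield them; dlt infers the schema and
    # handles normalization and loading.
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()
    yield from response.json()

@dlt.source
def example_geodata_source():
    # A source groups one or more resources into a unit a pipeline can run.
    return regions

pipeline = dlt.pipeline(pipeline_name="example_geodata", destination="duckdb")
print(pipeline.run(example_geodata_source()))
```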
0.0.1 - Initial version