Python scripts and a Docker container to export the Verifier Alliance PostgreSQL database in Parquet format and upload it to an S3 bucket. This fork provides a modified config to work with the Sourcify database, which itself uses a modified version of the Verifier Alliance database schema.
The latest export is available at https://export.sourcify.dev.
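For a quick look at the data, here is a minimal consumer-side sketch (not part of this repo) that fetches the export's manifest.json (its format is documented further below) and reads the first listed Parquet file with pandas. The assumption that file paths are served relative to the export URL, and the pandas/pyarrow dependency, are not guaranteed by this README:
# Consumer-side sketch, not part of this repo. Assumes manifest.json and the
# listed file paths are served relative to the export URL, and that pandas
# and pyarrow are installed.
import io
import json
import urllib.request

import pandas as pd

BASE_URL = "https://export.sourcify.dev"

# Fetch the manifest describing the latest dump
with urllib.request.urlopen(f"{BASE_URL}/manifest.json") as resp:
    manifest = json.load(resp)
print("dump created at:", manifest["dateStr"])

# Read the first Parquet file listed for the "code" table
first_file = manifest["files"]["code"][0]  # e.g. code/code_0_100000_zstd.parquet
with urllib.request.urlopen(f"{BASE_URL}/{first_file}") as resp:
    df = pd.read_parquet(io.BytesIO(resp.read()))
print(df.head())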
- Python 3
Create a virtual environment:
python -m venv venv
Activate the virtual environment:
source venv/bin/activate
Install dependencies:
pip install -r requirements.txt
Run the script with:
python main.py
The script takes some additional env vars for debugging purposes:
- DEBUG: Enables debug logging, reduces chunk sizes by 100x, processes only 1 file per table, and skips the GCS upload.
- DEBUG_TABLE: The name of a single table to dump on its own. All other tables are skipped.
- DEBUG_OFFSET: Adds an offset to the SELECT queries: "SELECT * FROM {table_name} OFFSET {os.getenv('DEBUG_OFFSET')}"
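For example, a debug run limited to a single table could look like this (the exact truthy value DEBUG expects and the offset shown here are assumptions, not documented by the script):
DEBUG=true DEBUG_TABLE=verified_contracts DEBUG_OFFSET=100000 python main.py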
The rest of the env vars can be found in the .env-template file. Copy the .env-template file to .env and fill in the values.
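For example:
cp .env-template .env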
For local development, set the GOOGLE_APPLICATION_CREDENTIALS environment variable to point to your service account key JSON file:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

In Cloud Run, authentication is automatic via Workload Identity.
The config.py file contains, for each database table, the chunk size, the number of chunks per file, and the datatype of each column in the table.
Example:
{
'name': 'verified_contracts',
'datatypes': {
'id': 'Int64',
'created_at': 'datetime64[ns]',
'updated_at': 'datetime64[ns]',
'created_by': 'string',
'updated_by': 'string',
'deployment_id': 'string',
'compilation_id': 'string',
'creation_match': 'bool',
'creation_values': 'string',
'creation_transformations': 'string',
'runtime_match': 'bool',
'runtime_values': 'string',
'runtime_transformations': 'string'
},
'chunk_size': 10000,
'num_chunks_per_file': 10
}

This config gives 10,000 * 10 = 100,000 rows per file.
The files will be named verified_contracts_0_100000_zstd.parquet, verified_contracts_100000_200000_zstd.parquet, and so on (zstd is the compression algorithm).
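A simplified sketch of how these config values translate into row ranges and file names (illustrative only; the function name below is not the script's actual API):
# Illustrative only: derive per-file row ranges and output names from the config.
table_config = {
    'name': 'verified_contracts',
    'chunk_size': 10000,
    'num_chunks_per_file': 10,
}

rows_per_file = table_config['chunk_size'] * table_config['num_chunks_per_file']  # 100,000

def file_name(table_name, file_index, rows_per_file):
    start = file_index * rows_per_file
    end = start + rows_per_file
    return f"{table_name}_{start}_{end}_zstd.parquet"

print(file_name(table_config['name'], 0, rows_per_file))  # verified_contracts_0_100000_zstd.parquet
print(file_name(table_config['name'], 1, rows_per_file))  # verified_contracts_100000_200000_zstd.parquet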
The script also generates a manifest.json that contains a timestamp of when the dump was created and the list of files uploaded to Google Cloud Storage.
{
"timestamp": 1718042395518,
"dateStr": "2024-06-10T17:59:55.518972Z",
"files": {
"code": [
"code/code_0_100000_zstd.parquet",
"code/code_100000_200000_zstd.parquet",
"code/code_200000_300000_zstd.parquet",
"code/code_300000_400000_zstd.parquet",
"code/code_400000_500000_zstd.parquet",
"code/code_500000_600000_zstd.parquet",
"code/code_600000_700000_zstd.parquet",
"code/code_700000_800000_zstd.parquet"
],
"contract_deployments": [...],
"compiled_contracts": [...],
"verified_contracts": [...]
}
}

Build the image:
docker build --tag=sourcify/parquet-export --platform=linux/amd64 .
Publish:
docker push sourcify/parquet-export
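To run the container locally, something like the following should work, assuming the image's entrypoint runs main.py and configuration is provided via the .env file described above (this command is an assumption, not documented by the repo):
docker run --rm --env-file .env sourcify/parquet-export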