This repository was archived by the owner on Sep 23, 2025. It is now read-only.

Conversation


@space-buzzer space-buzzer commented Apr 14, 2021

This PR adds a command to "replay" the API CSVs stored in GitHub as data updates, so that the changes are stored in the DB as batches.

The command takes an input file with a list of commits, iterates over them in order, fetches each commit's raw CSV content from GitHub (via the raw content host rather than the GitHub API, which hits its rate limit pretty fast), and submits each commit as a new published batch.
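A minimal sketch of the raw-content fetch described above. The repo, commit SHA, and file path shown are placeholders, not values from this PR; the point is only that `raw.githubusercontent.com` serves file contents at a specific commit without the API's rate limits:

```python
import urllib.request

RAW_BASE = "https://raw.githubusercontent.com"

def raw_csv_url(repo: str, commit_sha: str, path: str) -> str:
    """Build the raw-content URL for a file as it existed at a given commit."""
    return f"{RAW_BASE}/{repo}/{commit_sha}/{path}"

def fetch_csv(repo: str, commit_sha: str, path: str) -> str:
    """Download the raw CSV text (no API token, no API rate limit)."""
    with urllib.request.urlopen(raw_csv_url(repo, commit_sha, path)) as resp:
        return resp.read().decode("utf-8")
```

The replay loop would call `fetch_csv` once per commit in the input file, in order.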

A few things happen locally (in the command) to reduce the size of the updates and to compute each batch's message, changed fields, etc.
The heuristics are:

  • Updates that changed 56 rows for 56 different states, on a single day => daily
  • Everything else => edit
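The daily-vs-edit heuristic above can be sketched as a small classifier. The input shape (a list of `(state, date)` pairs for the rows changed in one commit) is an assumption for illustration:

```python
def classify_batch(changed_rows):
    """Classify a commit's changed rows as a 'daily' publish or an 'edit'.

    changed_rows: list of (state, date) pairs, one per changed row.
    Heuristic: exactly 56 rows, covering 56 distinct states, all on a
    single date => daily; anything else => edit.
    """
    states = {state for state, _ in changed_rows}
    dates = {date for _, date in changed_rows}
    if len(changed_rows) == 56 and len(states) == 56 and len(dates) == 1:
        return "daily"
    return "edit"
```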

There are also some heuristics around date/time formatting, and commits with bad data are skipped entirely.
The diff between two consecutive commits (submitting only the rows that changed or were added) is computed locally in the command; it's faster this way.
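The local diff step could look roughly like this. Keying rows by `(state, date)` and ignoring deletions are assumptions made for the sketch, not details confirmed by the PR:

```python
import csv
import io

def diff_rows(prev_csv: str, curr_csv: str, key=("state", "date")):
    """Return rows added or changed between two consecutive CSV snapshots.

    Rows are keyed by the `key` columns; a row is emitted when its key is
    new or when any of its values differ from the previous snapshot.
    Deleted rows are ignored in this sketch.
    """
    def index(text):
        reader = csv.DictReader(io.StringIO(text))
        return {tuple(row[k] for k in key): row for row in reader}

    prev, curr = index(prev_csv), index(curr_csv)
    return [row for k, row in curr.items() if prev.get(k) != row]
```

Only the rows this returns would be submitted in the batch, which keeps each update small.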

I changed some logging -- this is minor.
I commented out the requirement to submit states as part of the batch, because I didn't do the cross-referencing of the states_info history to the commits.
