-
Notifications
You must be signed in to change notification settings - Fork 234
Refactor the internal _load_remote_dataset function to simplify datasets' definitions #2917
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ataset definitions
| # check resolution | ||
| if resolution not in dataset.resolutions: | ||
| for res in dataset.resolutions: | ||
| if res.code == resolution: | ||
| valid_registrations = res.registrations | ||
| is_tiled = res.tiled | ||
| break | ||
| else: | ||
| raise GMTInvalidInput(f"Invalid resolution '{resolution}'.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This for-loop seems a lot more inefficient compared to the previous if resolution not in dataset.resolutions dictionary-based lookup. I almost think we should go with the intermediate {"01d": Resolution(), ...} dictionary you mentioned at #2917 (comment), or come up with a better data structure than a list of Resolution NamedTuples. I almost feel like that datasets variable could be a nested JSON or a multi-level pandas.DataFrame object (but maybe that's overkill).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we could just keep the datasets variable as is, but bring in all the type hint stuff. The remote datasets aren't really updated that often, though I know you're working on those Moon/Venus/Mars PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This for-loop seems a lot more inefficient compared to the previous
if resolution not in dataset.resolutionsdictionary-based lookup.
Yes, it's almost 10-times slower.
I almost think we should go with the intermediate
{"01d": Resolution(), ...}dictionary you mentioned at #2917 (comment), or come up with a better data structure than a list ofResolutionNamedTuples. I almost feel like thatdatasetsvariable could be a nested JSON or a multi-level pandas.DataFrame object (but maybe that's overkill).
What about a dictionary:
{
"01d": Resolution("01d"),
}
It's more clear than:
{
"01d": Resolution(),
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about a dictionary:\n\n{\n "01d": Resolution("01d"),\n}
Yes, that could work.
For each GMT remote dataset, we need to define it like:
The
resolutionsproperty is a dictionary of available resolutions and the corresponding registrations and tile information. As you can see,["gridline", "pixel"]is duplicated multiple times. I feel maintaining such a dictionary is tedious.Since most resolutions support both "gridline" and "pixel" registrations, it makes sense to let
Resolution.registrationshave a default value["gridline", "pixel"]. Similarly,Resolution.tiledcan have a default valueFalse.Then, the dataset definition can be written as:
The new function definition looks more compact, but an entry like
"01d": Resolution()looks weird, since it may be unclear whatResolution()means.Thus, I add the new
codeproperty to theResolutionclass, then aResolutionobject can be defined likeor the shortest version:
After the above changes, entries like
"01d": Resolution("01d")are still weird since the resolution code (e.g.,01d) must be duplicated twice. Thus, I changedresolutionsfrom a dict to a list.Here is the final version of the dataset definition:
which I think is more compact and more readable.