Description
What happened?
If a GRIB file has messages for a parameter (like 'Total precipitation' or 'tp') that is expressed as an average ('stepType' = 'avg'), and one message describes the average over the preceding hour while another describes the average since the reference time (t=0, model start, start of prediction, etc.), then cfgrib's open_datasets function cannot tell these two messages apart and consequently includes only one (the first). The second message is silently dropped: there is no error, warning, or other indication to the user that the data returned is not, in fact, all the data from the GRIB file.
NOTE: This would happen with any stepType that describes some form of time interval (average, accumulation, maximum, etc.). It would also ignore any number of messages past the first, if more than two are present in the GRIB file.
I have also identified the cause of this behaviour, and potentially a (start for a) fix.
Within cfgrib, when opening a GRIB file, the enforce_unique_attributes function (in dataset.py) is used as the first step in build_variable_components to ensure that the resulting dataset is a valid hypercube. The error raised when it is not is used by raw_open_datasets (in xarray_store.py) to keep refining a set of filter_by_keys values until the entire GRIB file can be read into hypercubes without conflicts.
Inside the GRIB message, the time interval of the data is encoded via 'forecast time' (octets 19-22 in Section 4 of the GRIB message, called 'forecastTime' by eccodes). For a message, say, 16 hours ahead of the reference time, if the stepType is 'instant', forecastTime would be 16. If the stepType is 'avg' and the data describes the average over the preceding hour, forecastTime would be 15. And if the data describes the average since the reference time, forecastTime would be 0.
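To make the arithmetic above concrete, here is a minimal sketch of how forecastTime relates to the step type and the interval length. The function name and its parameters are my own, purely for illustration; this is not code from cfgrib or eccodes.

```python
def forecast_time(valid_hour, step_type, interval_hours=None):
    """Return the forecastTime (in hours) for a message, as described above.

    valid_hour     -- hours between the reference time and the end of the period
    step_type      -- 'instant', or an interval type such as 'avg'
    interval_hours -- length of the interval; None means "since the reference time"

    Hypothetical helper for illustration only.
    """
    if step_type == "instant":
        # Instantaneous data: forecastTime is simply the valid time.
        return valid_hour
    if interval_hours is None:
        # Interval starts at the reference time.
        return 0
    if not 0 <= interval_hours <= valid_hour:
        raise ValueError("interval must fit between reference time and valid time")
    # Interval data: forecastTime marks the start of the interval.
    return valid_hour - interval_hours


print(forecast_time(16, "instant"))  # 16
print(forecast_time(16, "avg", 1))   # 15
print(forecast_time(16, "avg"))      # 0
```

So the two 'avg' messages for hour 16 differ only in forecastTime (and in the length of the interval), which is exactly the information that is currently not taken into account.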
The problem is that the set of attribute keys provided to enforce_unique_attributes (DATA_ATTRIBUTES_KEYS) does not include this attribute, or any derived attribute (stepRange for example). If you add "forecastTime" to the list DATA_ATTRIBUTES_KEYS, the messages are correctly distinguished and all present in the resulting datasets.
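The effect can be illustrated with a simplified model. This is not cfgrib's actual implementation: the messages are plain dicts, the key lists are shortened stand-ins for DATA_ATTRIBUTES_KEYS, and group_messages is a hypothetical helper. It only shows why two messages become indistinguishable when the distinguishing key is not part of the comparison.

```python
from collections import defaultdict

# Hypothetical stand-ins for two GRIB messages valid at hour 16.
MESSAGES = [
    {"shortName": "tp", "stepType": "avg", "forecastTime": 0},   # average since t=0
    {"shortName": "tp", "stepType": "avg", "forecastTime": 15},  # average over the last hour
]


def group_messages(messages, keys):
    """Group messages by the values of the given keys (illustrative only)."""
    groups = defaultdict(list)
    for msg in messages:
        groups[tuple(msg.get(key) for key in keys)].append(msg)
    return dict(groups)


# Without forecastTime, both messages fall into the same group and look identical.
without_ft = group_messages(MESSAGES, ["shortName", "stepType"])
# With forecastTime, each message gets its own group.
with_ft = group_messages(MESSAGES, ["shortName", "stepType", "forecastTime"])

print(len(without_ft), len(with_ft))  # 1 2
```

In this model the first grouping has no way to signal a conflict, which matches what I see in cfgrib: no error is raised, so no further filter_by_keys refinement takes place.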
While it is possible to supply read_keys as a kwarg to open_datasets, these only come into play with the extra_keys in build_variable_components, and are not used to enforce unique attributes. I have tried this, but it does not result in getting the 'lost' messages in the output datasets.
You can use backend_kwargs={"filter_by_keys": {"forecastTime": <some_value>}} to get the separate messages, but that requires that you know all the possible values ahead of time, and that you even know that this problem occurs. It is my understanding that the point of the open_datasets() function is to be able to fully read in a GRIB file without knowing this. As it stands, you simply don't get the data, and you wouldn't know you are missing some of the GRIB messages until you fully compare the output datasets to the input GRIB file.
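For completeness, a sketch of that workaround, assuming you already know the forecastTime values present in the file. The helper names filter_kwargs and open_per_forecast_time are hypothetical; only cfgrib.open_datasets and the backend_kwargs / filter_by_keys arguments come from cfgrib itself.

```python
def filter_kwargs(forecast_times):
    """Build one backend_kwargs dict per known forecastTime value."""
    return [{"filter_by_keys": {"forecastTime": ft}} for ft in forecast_times]


def open_per_forecast_time(path, forecast_times):
    """Open the file once per forecastTime; returns {forecastTime: [datasets]}.

    Requires cfgrib (imported lazily so the helper above can be used without it).
    """
    import cfgrib

    return {
        kw["filter_by_keys"]["forecastTime"]: cfgrib.open_datasets(path, backend_kwargs=kw)
        for kw in filter_kwargs(forecast_times)
    }


# Example for an f010 file with intervals starting at t=0 and t=6
# (the path is a placeholder):
# datasets = open_per_forecast_time("path/to/file.grib2", [0, 6])
```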
The reason I am unsure whether adding 'forecastTime' to DATA_ATTRIBUTES_KEYS is a desirable fix is that it potentially results in undesirable behaviour when opening GRIB files containing messages spanning multiple timesteps. I believe that the varying values of the forecastTime attribute would force what is effectively the same parameter into different datasets. That might mean a different solution is required, or that some more work is required to prevent this from happening when it is not desired. Perhaps different attributes like lengthOfTimeRange can be of help.
What are the steps to reproduce the bug?
- Get a GRIB file with multiple messages for the same parameter and time, but with differing time intervals.
  - I recommend a GRIB file from NCEP. One can be downloaded with ease from https://nomads.ncep.noaa.gov/gribfilter.php?ds=gfs_0p25_1hr . Make sure to select a file some time past t=0, say 10 hours ahead (the file ending with f010). Select 'ACPCP' or 'APCP' as Parameter, leave Levels to 'All' ('surface' is the only provided level for these parameters), and enter some small subregion to save data. NCEP provides these two parameters as averages both since t=0 and since the most-recent-6-hour-interval. This means that timestep 10 will have an average over the past 10 hours and an average over the past 4 hours, i.e. since t=6.
- Verify using a tool like grib_ls that the GRIB file includes 2 messages for these parameters
- Attempt to open the file with cfgrib.open_datasets()
- Observe that there is only one entry per parameter
Version
0.9.10.4
Platform (OS and architecture)
WSL2 Ubuntu 22.04.2 LTS
Relevant log output
No response
Accompanying data
No response
Organisation
No response