Conversation

cpelley commented Dec 2, 2014

NOT FOR MERGING
This PR is currently up for discussion and is not yet intended for merging (test results have not been updated).

This PR:

  1. Makes the resulting regridded array the same dtype as the source (rather than 64-bit, which consumes twice as much memory for us here, since we use 32-bit data). A minimal sketch of this behaviour follows the list below.
  2. Defines the dtype used when operating on the source and target coordinate bounds as the highest precision among the source coordinate bounds. We should not describe our regridded data as being defined on a more precise grid than the source, as it does not make sense to increase precision (which can currently happen if, for example, we define 64-bit bounded coordinates on our target grid while our source has 32-bit bounds).
  • Agree on the statistical correctness of these changes.
  • Update the test results to reflect these changes.
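
A minimal sketch (plain NumPy, not the Iris code itself) of the behaviour point 1 is after: the arithmetic can still run at higher precision internally, but the returned array keeps the source dtype. The regrid_like_source helper and its weights argument are purely illustrative.

    import numpy as np

    def regrid_like_source(src_data, weights):
        # Illustrative only: do the weighted combination in float64 for safety,
        # then hand the result back in the source's own dtype.
        result = np.tensordot(weights.astype(np.float64),
                              src_data.astype(np.float64), axes=1)
        return result.astype(src_data.dtype)

    src = np.linspace(0, 1, 4, dtype=np.float32)
    weights = np.eye(4)
    print(regrid_like_source(src, weights).dtype)  # float32, matching the source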

cpelley (Author) commented Dec 2, 2014

@jkettleb and @rockdoc, this PR may be of interest to you. Any ideas are welcome :) whether you are for or against these changes.

esc24 (Member) commented Dec 2, 2014

> as it does not make sense to increase precision

I'm not sure I agree with this statement. I can imagine wanting to regrid to very precise locations. I'd be surprised if the dtype of the resulting cube coordinates did not match (or exceed?) my grid_cube.

cpelley (Author) commented Dec 2, 2014

> > as it does not make sense to increase precision
>
> I'm not sure I agree with this statement. I can imagine wanting to regrid to very precise locations. I'd be surprised if the dtype of the resulting cube coordinates did not match (or exceed?) my grid_cube.

Thanks for your comment @esc24. To clarify: you think you should be able to regrid, for example, data defined on a 32-bit source onto a 64-bit grid (with the 64-bit grid being your more accurate/precise locations).

I disagree with the concept of regridding to "more precise locations" as the dtype of the source grid coordinates determines the level of uncertainty of these 'locations' (ignoring how the data was derived). To regrid to 'more precise locations' is to imply that your original data is defined at a higher precision than it really is.

The user should have to explicitly re-cast the result to force the behaviour you're describing, as doing it implicitly seems wrong to me.

esc24 (Member) commented Dec 2, 2014

I understand your position @cpelley, and it makes sense. I just happen to disagree with it from a practical perspective. I think the precision (not accuracy or level of uncertainty) of the resulting grid should be defined by my target grid.

rhattersley (Member) commented:

> I think the precision (not accuracy or level of uncertainty) of the resulting grid should be defined by my target grid.

This makes a lot of sense. If I've asked Iris to re-grid my data onto a particular grid, I would expect the result to match that grid exactly.

If we change Iris to use the target grid's dtype does that fix your problem?

> I disagree with the concept of regridding to "more precise locations" as the dtype of the source grid coordinates determines the level of uncertainty of these 'locations'

It's not as simple as that. I can use float32 to record some common grids with perfect precision (e.g. 0 - 360 in steps of 0.5). But the area weighting and my target grid may not be so neat, so I'm perfectly justified in using float64 for all calculations.
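
A quick NumPy check of that point (illustrative only):

    import numpy as np

    # A regular half-degree grid is represented exactly in float32, so storing
    # the coordinates at 32 bits loses nothing, even if the regridding
    # arithmetic itself is carried out in float64.
    lons64 = np.arange(0.0, 360.0, 0.5)
    lons32 = lons64.astype(np.float32)
    print(np.array_equal(lons64, lons32.astype(np.float64)))  # True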

esc24 (Member) commented Dec 3, 2014

> If we change Iris to use the target grid's dtype does that fix your problem?

I believe that this change would give the user the ability to control the behaviour in an intuitive way.

cpelley (Author) commented Dec 3, 2014

> If we change Iris to use the target grid's dtype does that fix your problem?

> I believe that this change would give the user the ability to control the behaviour in an intuitive way.

@rhattersley I have to say that I had not considered grids defined with 'perfect precision'; good point. It would be great if there were identifying metadata, in the case of the NetCDF file format, to describe precision, but that's another matter!

Thanks @esc24 and @rhattersley, as always it's great to get different points of view to converge on a neater and more appropriate solution :) Basing the dtype on the target grid does sound like the way to go, satisfying all concerned.

Happy to implement the changes, though I would love to hear what @jkettleb thinks of the proposed changes first.

rockdoc commented Dec 11, 2014

@cpelley @esc24 @rhattersley

One issue to bear in mind in this context is the fact that, because Iris does not currently support the notion of loading grids from 'cubeless' source files (e.g. netcdf files containing lat + long variables, and often bounds variables, but no data variables), we are in the habit of manufacturing our own target grid cubes from such sources.

In adopting this tactic one usually wants to use the smallest data type possible (typically np.int8) for the data payload in order to minimise the memory footprint of the cube. The reason is that for high-resolution global grids the lat-long coordinates and bounds together can total several GB of memory. Of course, if there was a way of manufacturing a grid cube without having to reify the data array then this problem would, I think, disappear (if I'm understanding the issue correctly - I may not be :-)

Therefore an alternative approach to those mentioned above, and one which I currently use, is to add a datatype=None keyword argument to the appropriate regrid API functions/methods. By default, if this keyword is undefined then the returned cube is of the same type as the input cube (even if type promotion has occurred internally in order to maintain precision in the result). Otherwise, the type of the returned cube's data array is cast (potentially down as well as up) to match datatype.

The above suggestion relates to the cube's data array. One could adopt a similar tactic for the type of the grid coordinates and bounds.
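
A rough sketch of the kind of wrapper being described; regrid_with_dtype and the datatype keyword are hypothetical, while regrid_area_weighted_rectilinear_src_and_grid is the experimental area-weighted regridder available at the time:

    from iris.experimental.regrid import (
        regrid_area_weighted_rectilinear_src_and_grid)

    def regrid_with_dtype(src_cube, grid_cube, datatype=None):
        # Hypothetical: regrid as usual (which may promote the payload to
        # float64 internally), then cast the returned data to `datatype`,
        # defaulting to the source cube's own dtype (down-casting as well as
        # up-casting is allowed).
        result = regrid_area_weighted_rectilinear_src_and_grid(src_cube, grid_cube)
        out_dtype = src_cube.data.dtype if datatype is None else datatype
        result.data = result.data.astype(out_dtype, copy=False)
        return result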

rhattersley (Member) commented:

> In adopting this tactic one usually wants to use the smallest data type possible (typically np.int8) for the data payload in order to minimise the memory footprint of the cube.

To save on memory you can use biggus.ConstantArray(shape, dtype=...) instead of a real data array. You can make the shape as large as you like and it won't use any memory.
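
For example (a sketch only, assuming the Iris of the time accepts a biggus array as a cube's data payload, which is how its lazy data worked):

    import numpy as np
    import biggus
    import iris.cube

    # A huge payload that allocates no real memory; only the shape and dtype
    # matter for a manufactured target-grid cube.
    payload = biggus.ConstantArray((7000, 9000), dtype=np.int8)
    grid_cube = iris.cube.Cube(payload)
    # ...then attach the target grid's coordinates and bounds to grid_cube as usual.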

rockdoc commented Dec 11, 2014

> To save on memory you can use biggus.ConstantArray(shape, dtype=...) instead of a real data array.

Thanks for this tip @rhattersley, I hadn't thought to use biggus.

cpelley (Author) commented Dec 11, 2014

> Of course, if there was a way of manufacturing a grid cube without having to reify the data array then this problem would, I think, disappear (if I'm understanding the issue correctly - I may not be :-)

@rockdoc, apologies, my description of this PR may be rather cryptic. The issue is: what behaviour would one expect when dealing with a source and target of different data types (and likewise for their coordinates)? The issue is not directly one of storage, as the end result still needs to be a 'real' array. This PR hopes to give control over, and to question, both the precision used to calculate the returned result and the precision of that end result.

cpelley (Author) commented Dec 16, 2014

I'm going to close this for now. If people think it's worth me creating a Google Groups discussion thread on this then please let me know.

cpelley closed this Dec 16, 2014

alistairsellar commented:

Hi,

Just to add, from a user perspective, that this problem confused the hell out of me. I was regridding float data from one global grid to another, and the resulting cube (and hence netCDF file) had dtype=double. I didn't notice until after the files had been used by an official configuration (and so can't be removed), so the output files are using GBs more space than they should. I guess I should check more carefully next time.

Personally, my expectation would be that the output data has the same dtype as the input data, as described in point one of the original post above. I have worked around this in my own script using

    ocube.data = ocube.data.astype(np.float32, copy=False)

but the memory footprint of the code is larger than it should be.
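
For reference, the workaround in context might look like the sketch below (the AreaWeighted scheme is just one possible regridder; icube and grid_cube stand in for Alistair's actual cubes):

    import numpy as np
    import iris.analysis

    def regrid_keep_float32(icube, grid_cube):
        # The regridded payload comes back as float64 (the behaviour discussed
        # above) and is materialised before the cast, hence the
        # larger-than-necessary peak memory footprint.
        ocube = icube.regrid(grid_cube, iris.analysis.AreaWeighted())
        ocube.data = ocube.data.astype(np.float32, copy=False)
        return ocube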

So I'd be grateful if this change could be implemented in Iris.

Thanks,
Alistair
