Conversation

@navidcy (Member) commented Oct 9, 2025

No description provided.

@navidcy navidcy added the testing 🧪 Tests get priority in case of emergency evacuation label Oct 9, 2025
@simone-silvestri (Collaborator)

I am very interested in this. Let's hope it works and we can move on from Julia 1.10.

@simone-silvestri (Collaborator)

I am disabling the Reactant tests for the moment to check whether the rest works.

@simone-silvestri (Collaborator)

If the docs still break on the internal_tide.jl example NaNing, I would try a GridFittedBottom instead of a PartialCellBottom (which is not even tested) and, if tests pass, merge this PR and move forward with assessing what the problems with partial cells are.

@simone-silvestri (Collaborator)

Seems that we are hitting the same NaN issue on the internal tide example

@navidcy (Member, Author) commented Oct 9, 2025

> Seems that we are hitting the same NaN issue on the internal tide example

The ghosts of the past still haunt us...

@simone-silvestri (Collaborator)

Apparently GridFittedBottom also fails...

@simone-silvestri (Collaborator)

If I run the example locally, it works. Why would it error on CI? Do we have a way to reproduce this error locally?

@ali-ramadhan (Member)

> If I run the example locally, it works. Why would it error on CI? Do we have a way to reproduce this error locally?

One thing to try might be to run the example locally and on CI using the exact same Manifest.toml if possible. We can commit a Manifest.toml to this branch for debugging. I can't think of which dependency would lead to such a big difference but it's one thing we can control for.
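A possible workflow for that, sketched below; the `docs/` path is an assumption, so adjust it to whichever environment the failing build actually uses:

```julia
# Sketch, assuming the docs build uses the docs/ environment (adjust the
# path as needed). Resolve and record exact dependency versions locally,
# then commit the resulting Manifest.toml so CI resolves identical versions.
using Pkg
Pkg.activate("docs")   # hypothetical path to the docs environment
Pkg.instantiate()      # resolves and writes docs/Manifest.toml if missing
# Then: git add -f docs/Manifest.toml && git commit && git push
```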

@navidcy (Member, Author) commented Oct 16, 2025

From the Julia v1.11 chat I recall that the error was showing up only on Linux, not on macOS?

@giordano (three comments marked as resolved)
@giordano (Collaborator) commented Oct 27, 2025

The plot thickens: it works correctly in Julia v1.12 on Ampere eMAG (aarch64) with AlmaLinux 8.10 as the operating system, which rules out an operating-system difference. aarch64 is also the architecture on macOS, so I'm starting to suspect there's an architecture dependence. Can someone point me to the operation performed on the u field? Is there any chance you're running any BLAS operation there?

Edit: using, in x86-64 Julia v1.12, the libopenblas shipped with Julia v1.10 doesn't fix the issue, so that would rule out BLAS version differences.

Edit 2: it also works on an Intel(R) Xeon(R) CPU X5670 with CentOS 8 Stream, so I'm getting more and more confused, sigh. Pointers to the relevant code would still be very welcome.

@glwagner (Member)

> The plot thickens: it works correctly in Julia v1.12 on Ampere eMAG (aarch64) [...] Pointers to the relevant code would still be very welcome.

Nice work so far though!!

The entire time-step is a complex chain of operations. I do think it is a good start to save all fields every time-step. We may find that differences arise in one field versus another. Note that the NaNChecker checks u only as a proxy for the entire state. Here the prognostic state should be model.velocities (u, v, w) and model.tracers.b.

@glwagner (Member) commented Oct 27, 2025

To save every iteration, change this line

schedule = TimeInterval(save_fields_interval),

to schedule = IterationInterval(1). Also I think we should add v (I see u and w but not v there).

The difference should arise in the very first time-step? We could compare those. It seems annoyingly laborious to do this across architectures, but maybe, @giordano, you have good ideas on how to do this efficiently.
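One way to do the comparison, once snapshots from both machines are in hand, is sketched below. The `first_divergence` helper is hypothetical (not existing Oceananigans tooling); it assumes the prognostic state has been loaded on each machine as a NamedTuple of plain arrays:

```julia
# Hypothetical helper (not Oceananigans code): given the prognostic state
# (u, v, w, b) saved on two machines as NamedTuples of same-shape arrays,
# return the first field/index where they differ, treating NaN == NaN as
# a match so only genuine divergences are reported.
function first_divergence(state_a::NamedTuple, state_b::NamedTuple)
    for name in keys(state_a)
        a, b = state_a[name], state_b[name]
        for idx in eachindex(a, b)
            xa, xb = a[idx], b[idx]
            same = xa == xb || (isnan(xa) && isnan(xb))
            same || return (; field=name, index=idx, a=xa, b=xb)
        end
    end
    return nothing   # states are identical
end
```

For example, `first_divergence((u=[1.0, 2.0],), (u=[1.0, NaN],))` points at field `:u`, index 2, which would localize where the first-time-step states disagree.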

@giordano giordano marked this pull request as ready for review November 11, 2025 17:41
@simone-silvestri (Collaborator) commented Nov 11, 2025

I think I might have found the reason for the abstract operation test breaking. Apparently, the GPU compiler (and the CPU one as well) does not really like this method, which leads to dynamic evaluation:

@inline Base.getindex(Π::MultiaryOperation{LX, LY, LZ, N}, i, j, k) where {LX, LY, LZ, N} =
    Π.op(ntuple(γ -> Π.▶[γ](i, j, k, Π.grid, Π.args[γ]), Val(N))...)

I have added a workaround that works for multiary operations with up to 6 operands; however, there might be a better (more general) solution?
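The manual-splitting pattern behind that workaround can be shown with a self-contained sketch; `Multiary` and `evaluate` here are illustrative stand-ins, not the Oceananigans types:

```julia
# Illustrative sketch of the manual-splitting workaround; `Multiary` and
# `evaluate` are hypothetical stand-ins, not the Oceananigans code.
struct Multiary{N, O, A}
    op   :: O
    args :: A
end
Multiary(op, args...) = Multiary{length(args), typeof(op), typeof(args)}(op, args)

# Generic fallback: ntuple + splatting over Val(N), the construct the
# compiler may fail to infer, falling back to dynamic dispatch.
@inline evaluate(m::Multiary{N}, i) where N =
    m.op(ntuple(γ -> m.args[γ][i], Val(N))...)

# Hardcoded methods for small N avoid the ntuple/splat entirely, so the
# call chain is fully static; dispatch picks these over the fallback.
@inline evaluate(m::Multiary{1}, i) = m.op(m.args[1][i])
@inline evaluate(m::Multiary{2}, i) = m.op(m.args[1][i], m.args[2][i])
@inline evaluate(m::Multiary{3}, i) = m.op(m.args[1][i], m.args[2][i], m.args[3][i])
```

For example, `evaluate(Multiary(+, [1, 2], [10, 20], [100, 200]), 2)` dispatches to the hardcoded N = 3 method and returns 222, while a 7-operand `Multiary` would fall through to the generic `ntuple` method.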

@simone-silvestri (Collaborator)

Anyway, if we agree with this workaround, I am eager to merge this PR so that Oceananigans can finally be officially used with Julia 1.12!

@giordano (Collaborator)

ntuple + splatting are probably hard to reason about, although N is a static parameter 🤔

@giordano (Collaborator) left a review comment:

🚀

    Π.op(ntuple(γ -> Π.▶[γ](i, j, k, Π.grid, Π.args[γ]), Val(N))...)

# Try to improve inferrability by hardcoding methods for small N
@inline Base.getindex(Π::MultiaryOperation{LX, LY, LZ, 1}, i, j, k) where {LX, LY, LZ} =
A collaborator replied:

I don't think this is bad; ntuple is implemented with similar manual splitting. This might cause longer compilation latency, but so does Val.

@simone-silvestri (Collaborator)

There seems to be one last issue connected to the docs deployment.

@giordano (Collaborator)

 ! [remote rejected] HEAD -> gh-pages (cannot lock ref 'refs/heads/gh-pages': is at a00f0e6e5641011ad80648a72f2e2f21ed6ec749 but expected 534c22e3519adcadb1802bc44ce20e67bcac43c6)

Is that a race condition during a push? Can you restart that job? (I can't myself.)

Side note: in Breeze.jl I split the docs job into two parts, build and deployment, so that if the deploy fails for some reason and you need to restart it, you don't have to rebuild the docs all over again.

@simone-silvestri (Collaborator)

Ok, let's merge away.

@simone-silvestri simone-silvestri merged commit 2e6d7f8 into main Nov 11, 2025
70 checks passed
@simone-silvestri simone-silvestri deleted the ncc/julia-v1.12 branch November 11, 2025 21:25
@navidcy (Member, Author) commented Nov 11, 2025

O.M.G.!
Amazing!

@navidcy (Member, Author) commented Nov 11, 2025

I'm changing my .zshrc so that julia is now julia +1.12!!

7 participants