Implementation of ResMLP and gMLP (with improvements to MLPMixer and PatchEmbedding)
#125
Conversation
1. Refactor of MLPMixer to allow for more customisation
2. Refined API for PatchEmbedding, DropPath and LayerScale
3. Cleaned up some utility functions
4. Fixed minor formatting errors
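As background for item 2 above, here is a minimal sketch of the usual patch-embedding construction: a convolution whose kernel and stride both equal the patch size, followed by flattening the spatial grid into a token dimension. The names and the refined `PatchEmbedding` API in this PR may well differ.

```julia
using Flux

# Hypothetical sketch, not the PR's actual API: embed non-overlapping
# patches with a strided convolution, then flatten (W, H) into a token
# dimension so the output has shape (planes, npatches, batch).
function patch_embedding(patch_size::Int; inchannels = 3, planes = 768)
    return Chain(Conv((patch_size, patch_size), inchannels => planes;
                      stride = patch_size),
                 x -> permutedims(reshape(x, :, size(x, 3), size(x, 4)),
                                  (2, 1, 3)))
end
```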
What is with the Ubuntu failures with the KILL signal 🥲
Per one of Kyle's earlier comments, this may be more OOMs. I saw a similar issue pop up in FluxBench, so perhaps something is leaking in the test suite, something is being tested with too large an input, or GC is not running aggressively enough?
Inputs are similar for all models, so that's probably unlikely? Not sure about GC or leaks in the test suite, although it's interesting that it seems to be OS-dependent somehow. macOS has been unaffected across PRs, while Windows is fine on this one but errors on the res2net one.
Right, but some of the newer models might be larger. I wonder if the glibc memory bloat issue might be a factor on Linux, but this is all speculation.
These OOMs are why we disable some of the tests. Can you try adding manual GC calls? Maybe we can see if FluxML can get some more dedicated CI options. Otherwise, we can always "chunk" the tests so that each CI job stays within the memory limit.
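For concreteness, a sketch of what chunking plus manual GC calls could look like in a test runner; the `TEST_GROUP` variable and group names here are hypothetical, not the repository's actual setup.

```julia
using Test

# Hypothetical: let each CI job run only a subset of the suite, selected
# via an environment variable, and collect garbage between groups so no
# single job's peak memory exceeds the runner's limit.
const TEST_GROUP = get(ENV, "TEST_GROUP", "all")

if TEST_GROUP in ("all", "mixers")
    @testset "MLPMixer / ResMLP / gMLP" begin
        # ... forward and gradient tests for the mixer variants ...
    end
    GC.gc()  # the manual collection suggested above
end

if TEST_GROUP in ("all", "other")
    @testset "other models" begin
        # ... remaining model tests ...
    end
    GC.gc()
end
```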
Huh, that's unexpected. I did add the manual GC calls.
The GC can certainly be version-dependent. It's also possible that we are right on the edge of maxing out and the results are non-deterministic.
I don't think the GC calls are making any difference. What is consistent between both runs so far is that the KILL signal happens right when we start the "other" tests. So it seems like the MLP mixer variants are more memory intensive; maybe right at our 7 GB limit. Can you try reducing the batch size to 1? As for why macOS never has these issues, it looks like those machines get twice as much memory as Windows or Linux: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources
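Concretely, the suggestion amounts to something like this in the tests (shapes assume the standard 224×224 RGB input; `MLPMixer()` stands in for whichever variant is under test):

```julia
using Flux, Metalhead

model = MLPMixer()
x = rand(Float32, 224, 224, 3, 1)  # batch size 1 instead of a larger batch
y = model(x)  # peak memory of the forward/backward pass scales with batch size
```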
Sorry, edited my comment above after you pushed, but let's see what happens anyways.
This is exactly what's confusing me: they're not. The ViT variants are the heaviest by far, and they didn't trip the memory limit when they were merged, so this is surprising, because the MLPMixer variants are less intensive on both memory and compute. Edit: oh, I just realised we aren't testing the ViT variants, only the base model 🤦🏽‍♂️ that probably explains it.
But yeah, we will probably need a different approach for the tests anyways, given that some of the ViT variants will be in the multi-100M-parameter range; this seems hacky and also doesn't seem to work all the time. Edit: nvm, figured out that some of the models are up to 1 GB in size (especially the […]).
1. Cleaned up `mlpblock` implementation
2. More elaborate API for the `mlpmixer` constructor
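For context, a minimal sketch of the kind of two-layer, expand-then-project MLP block the first item refers to; the actual `mlpblock` in the PR may take different arguments or defaults.

```julia
using Flux

# Illustrative only: a channel-mixing MLP block as in the MLPMixer paper,
# expanding to a hidden width and projecting back, with dropout throughout.
mlpblock(planes::Int, hidden_planes::Int; activation = gelu, dropout = 0.0) =
    Chain(Dense(planes, hidden_planes, activation),
          Dropout(dropout),
          Dense(hidden_planes, planes),
          Dropout(dropout))
```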
Is this GTG?
darsnack left a comment
Sorry about that, I left some comments. I'll need a little more time to review the spatial gating unit properly.
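For readers following along, here is a rough sketch of the spatial gating unit as the gMLP paper describes it, assuming a (channels, tokens, batch) layout; the PR's actual implementation may arrange dimensions or initialisation differently (the paper also initialises the token projection near zero with unit bias so the unit starts near identity).

```julia
using Flux

# Sketch of the gMLP spatial gating unit: split channels in half, normalise
# one half, project it across tokens, and gate the other half with it.
struct SpatialGatingUnit{N, P}
    norm::N   # LayerNorm over the gated half of the channels
    proj::P   # Dense layer acting across the token dimension
end

Flux.@functor SpatialGatingUnit

SpatialGatingUnit(channels::Int, ntokens::Int) =
    SpatialGatingUnit(LayerNorm(channels ÷ 2), Dense(ntokens, ntokens))

function (sgu::SpatialGatingUnit)(x)          # x: (channels, tokens, batch)
    c = size(x, 1) ÷ 2
    u, v = x[1:c, :, :], x[(c + 1):end, :, :] # split along channels
    v = sgu.norm(v)
    v = permutedims(v, (2, 1, 3))             # (tokens, channels ÷ 2, batch)
    v = sgu.proj(v)                           # linear projection across tokens
    v = permutedims(v, (2, 1, 3))             # back to (channels ÷ 2, tokens, batch)
    return u .* v
end
```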
CI failures are upstream on nightly, during precompilation, because […]
darsnack left a comment
Wow, this is a very clean implementation now that I've had the chance to appreciate it!
Why is Windows CI OOMing on nightly 😑 it's the same code as on the stable version.
Yeah, the memory issues are still problematic on GitHub Actions... we'll probably need a long-term alternative as the number of models increases.
darsnack left a comment
Some small changes, but looks ready otherwise.
Co-authored-by: Kyle Daruwalla <[email protected]>
Thanks @theabhirath!
This is an implementation of ResMLP. In the process, I ended up doing quite a lot of things, including an almost complete rewrite of the base MLPMixer model itself to make it cleaner and more understandable, as well as fixing stuff like `PatchEmbedding`, `DropPath` and `LayerScale`. There is also some utility function cleanup (and some random formatting errors that I fixed as I found them), but mostly this PR deals with MLPMixer, ResMLP and `PatchEmbedding`. `gradtest` passes, so I think this should be fine on that front.
Edit: also added gMLP to the implementations.
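As a note on `gradtest`: a hedged sketch of the kind of gradient smoke test this refers to (the helper in the actual test suite may differ) is below.

```julia
using Flux

# Illustrative gradient smoke test: run a backward pass through the model
# and check that every gradient that exists is finite.
function gradtest(model, input)
    ps = Flux.params(model)
    gs = Flux.gradient(() -> sum(model(input)), ps)
    return all(p -> gs[p] === nothing || all(isfinite, gs[p]), ps)
end

# e.g. gradtest(model, rand(Float32, 224, 224, 3, 1))
```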