Conversation

@theabhirath
Member

@theabhirath theabhirath commented Feb 25, 2022

This is an implementation of ResMLP. In the process, I ended up doing quite a lot of things, including an almost complete rewrite of the base MLPMixer model itself to make it cleaner and more understandable, as well as fixes to PatchEmbedding, DropPath and LayerScale. There is also some utility function cleanup (and some random formatting errors that I fixed as I found them), but mostly this PR deals with MLPMixer, ResMLP and PatchEmbedding. gradtest passes, so I think this should be fine on that front.

Edit: also added gMLP to the implementations

1. Refactor of MLPMixer to allow for more customisation
2. Refined API for PatchEmbedding, DropPath and LayerScale
3. Cleaned up some utility functions
4. Fixed minor formatting errors
@theabhirath
Member Author

What is with the Ubuntu failures with the KILL signal? 🥲

@ToucheSir
Member

Per one of Kyle's earlier comments, this may be more OOMs. I saw a similar issue pop up in FluxBench, so perhaps something is leaking in the test suite, something is being tested with too large an input, or the GC is not running aggressively enough?

@theabhirath
Member Author

> Per one of Kyle's earlier comments, this may be more OOMs. I saw a similar issue pop up in FluxBench, so perhaps something is leaking in the test suite, something is being tested with too large an input, or the GC is not running aggressively enough?

Inputs are similar for all models, so that seems unlikely? Not sure about GC or leaks in the test suite, although it's interesting that it seems to be OS-dependent somehow. macOS has been unaffected across PRs, while Windows is fine on this one but errors on the res2net one.

@ToucheSir
Member

Right, but some of the newer models might be larger. I wonder if the glibc memory bloat issue might be a factor on Linux, but this is all speculation.

@darsnack
Member

These OOMs are why we disable gradtest in the first place. It isn't surprising that it is OS-dependent, because the "kill" switch is not really up to Julia. If the issue is accumulated memory that isn't being garbage collected, then both the number of tests and the size of each matters. So it's not the input size to ResMLP that matters as much as where it sits in the test suite order.

Can you try adding manual GC.gc() calls between the @testset blocks for each model?
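Something along these lines, roughly (just a sketch; the model constructors, input size, and expected output here are placeholders for whatever the testsets actually use):

```julia
using Test, Metalhead

@testset "MLPMixer" begin
    model = MLPMixer()
    @test size(model(rand(Float32, 224, 224, 3, 2))) == (1000, 2)
end
GC.gc()  # reclaim the model and its activations before the next testset

@testset "ResMLP" begin
    model = ResMLP()
    @test size(model(rand(Float32, 224, 224, 3, 2))) == (1000, 2)
end
GC.gc()
```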

Maybe we can see if FluxML can get some more dedicated CI options. Otherwise, we can always "chunk" the tests so that an ENV variable controls which test sets are run. Then our Actions script would invoke Julia multiple times, each with a different value of that variable.
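For example, a rough sketch of that chunking idea in runtests.jl (the TEST_GROUP variable name and the group labels here are made up):

```julia
# Hypothetical runtests.jl layout: an environment variable picks which test groups run.
const GROUP = get(ENV, "TEST_GROUP", "all")

GROUP in ("all", "convnets") && include("convnets.jl")
GROUP in ("all", "mixers")   && include("mixers.jl")   # MLPMixer, ResMLP, gMLP tests
GROUP in ("all", "vit")      && include("vit.jl")
```

The CI matrix would then run Pkg.test() once per group with TEST_GROUP set accordingly, so each Julia process only ever loads one slice of the models.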

@theabhirath
Member Author

theabhirath commented Feb 25, 2022

Huh. That's unexpected - I did add GC.gc() between the groups of testsets (not between individual models just yet). Is that Julia-version dependent somehow? Also, the KILL signal comes right after the GC call... that doesn't feel right somehow.

@theabhirath theabhirath changed the title Implementation of ResMLP Implementation of ResMLP (and improvements to MLPMixer and PatchEmbedding) Feb 25, 2022
@theabhirath theabhirath changed the title Implementation of ResMLP (and improvements to MLPMixer and PatchEmbedding) Implementation of ResMLP (with improvements to MLPMixer and PatchEmbedding) Feb 25, 2022
@darsnack
Member

darsnack commented Feb 25, 2022

The GC can certainly be version-dependent. It's also possible that we are right on the edge of maxing out and the results are non-deterministic. You could try going more fine-grained, or even add GC.gc() directly in gradtest's definition.
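For reference, a rough sketch of what that could look like (the real gradtest in the test suite may well have a different signature; this is just the idea):

```julia
using Zygote

# Sketch of a gradtest helper that frees memory as soon as the backwards pass is done.
function gradtest(model, input)
    gs = Zygote.gradient(m -> sum(m(input)), model)
    GC.gc()  # reclaim intermediate activations before the next model is tested
    return gs[1] !== nothing
end
```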

> Also, the KILL signal comes right after the GC call... that doesn't feel right somehow.

I don't think the GC calls are making any difference. What is consistent between both runs so far is that the KILL signal happens right when we start the "other" tests. So it seems like the MLPMixer variants are more memory-intensive; maybe right at our 7 GB limit. Can you try reducing the batch size to 1?
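i.e. something like this in the testsets (again assuming a gradtest that takes the model and a raw input array):

```julia
# A batch size of 1 keeps peak memory well under the runner's 7 GB limit
x = rand(Float32, 224, 224, 3, 1)
@test gradtest(MLPMixer(), x)
```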

As for why macOS never has these issues, it looks like those machines get twice as much memory as Windows or Linux: https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#supported-runners-and-hardware-resources

@darsnack
Member

darsnack commented Feb 25, 2022

Sorry, I edited my comment above after you pushed, but let's see what happens anyways.

@theabhirath
Member Author

theabhirath commented Feb 25, 2022

> it seems like the MLPMixer variants are more memory-intensive

This is exactly what's confusing me - they're not. The ViT variants are the heaviest by far, and they didn't trip the memory limit when they were merged, so this is surprising to me because the MLPMixer variants are less intensive on both memory and compute.

Edit: Oh, I just realised we aren't testing the ViT variants, only the base model 🤦🏽‍♂️ - that probably explains it.

@theabhirath
Member Author

theabhirath commented Feb 25, 2022

But yeah, we will probably need a different approach for the tests anyways, given that some of the ViT variants will be in the multi-hundred-million parameter range - this approach seems hacky and also doesn't seem to work all the time.

Edit: nvm, I figured out that some of the models are up to 1 GB in size (especially the xlarge ones) - so GC calls inside the MLPMixer testsets seem to fix everything up.

1. Cleaned up `mlpblock` implementation
2. More elaborate API for mlpmixer constructor
@theabhirath theabhirath changed the title Implementation of ResMLP (with improvements to MLPMixer and PatchEmbedding) Implementation of ResMLP and gMLP (with improvements to MLPMixer and PatchEmbedding) Feb 26, 2022
@theabhirath
Member Author

Is this GTG?

Member

@darsnack darsnack left a comment

Sorry about that, I left some comments. I'll need a little more time to review the spatial gating unit properly.

@theabhirath
Member Author

theabhirath commented Mar 1, 2022

CI fails upstream on nightly during precompilation because libblas is not found - for NNLib 🥲. I was facing this locally as well on Julia master. Worth opening an issue somewhere?

@theabhirath theabhirath requested a review from darsnack March 3, 2022 18:02
@darsnack darsnack mentioned this pull request Mar 11, 2022
Member

@darsnack darsnack left a comment

Wow, this is a very clean implementation now that I've had the chance to appreciate it!

@theabhirath
Member Author

Why is Windows CI OOMing on nightly? 😑 It's the same code as on the stable version.

@theabhirath
Member Author

Yeah, the memory issues are still problematic on GitHub Actions... we'll probably need a long-term alternative as the number of models increases.

Member

@darsnack darsnack left a comment

Some small changes, but looks ready otherwise.

@theabhirath theabhirath requested a review from darsnack March 15, 2022 20:45
@darsnack darsnack merged commit 13cbf02 into FluxML:master Mar 17, 2022
@darsnack
Member

Thanks @theabhirath!

@theabhirath theabhirath deleted the resmlp branch March 18, 2022 14:50