
Conversation

@AmericanPresidentJimmyCarter

Closes #1611

I haven't had time to test it much, but you can make route configs (CLI args) like:

            '--tread_config=' + json.dumps({
                'routes': [
                    {
                        'start_layer_idx': 8,
                        'end_layer_idx': -8,
                        'selection_ratio': 0.5,
                    },
                ],
            }),

It works, it does speed up training, and the speedup is proportional to the selection_ratio (the closer that is to 1.0, the more tokens are dropped).
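To illustrate what a route does between its start and end layers, here is a minimal numpy sketch of ratio-based token dropping. The actual trainer operates on torch tensors; the function name and signature here are illustrative, not SimpleTuner's API.

```python
import numpy as np

def drop_tokens(x: np.ndarray, selection_ratio: float, rng=None):
    """Keep a random subset of tokens; selection_ratio is the fraction dropped.

    x: (batch, seq_len, dim) hidden states.
    Returns the kept tokens and their original indices (needed later for RoPE).
    """
    rng = rng or np.random.default_rng(0)
    b, s, d = x.shape
    n_keep = max(1, int(round(s * (1.0 - selection_ratio))))
    kept = np.empty((b, n_keep, d), dtype=x.dtype)
    keep_idx = np.empty((b, n_keep), dtype=np.int64)
    for i in range(b):
        # Sample without replacement, then sort to preserve token order.
        idx = np.sort(rng.choice(s, size=n_keep, replace=False))
        keep_idx[i] = idx
        kept[i] = x[i, idx]
    return kept, keep_idx
```

With selection_ratio=0.5 on a 16-token sequence, 8 tokens survive per sample, which is where the compute savings come from.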

I haven't messed with the router configurations much to see what works and what doesn't.

It also worked with masked loss, where masking prevents certain tokens from being dropped. That slows training down, since more tokens are retained.
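The interaction with masked loss can be sketched as follows: positions flagged by the mask are always kept, so the kept count can exceed the ratio's target, which is why training slows down. Again a numpy sketch with illustrative names, not the PR's actual code.

```python
import numpy as np

def drop_tokens_masked(x, selection_ratio, protect_mask, rng=None):
    """Drop tokens by ratio, but never drop positions where protect_mask is 1.

    x: (batch, seq_len, dim); protect_mask: (batch, seq_len) of 0/1.
    Returns per-sample lists, since kept counts can differ across the batch.
    """
    rng = rng or np.random.default_rng(0)
    s = x.shape[1]
    target_keep = max(1, int(round(s * (1.0 - selection_ratio))))
    kept, keep_idx = [], []
    for i in range(x.shape[0]):
        protected = np.flatnonzero(protect_mask[i])
        free = np.flatnonzero(protect_mask[i] == 0)
        # Only fill up to the target with unprotected tokens; protected
        # tokens are kept unconditionally, even if that exceeds the target.
        n_extra = max(0, target_keep - protected.size)
        extra = rng.choice(free, size=min(n_extra, free.size), replace=False)
        idx = np.sort(np.concatenate([protected, extra]).astype(np.int64))
        keep_idx.append(idx)
        kept.append(x[i, idx])
    return kept, keep_idx
```

If the masked region covers more tokens than the ratio would normally keep, every masked token survives and the effective drop rate falls below the configured one.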

Token dropping also had to be implemented for RoPE in FLUX, as this is required to make TREAD work. I'm not sure that implementation is 100% correct, but it seems to work.
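The key point with RoPE is that after dropping tokens mid-network, the surviving tokens' positions are no longer contiguous, so the rotary tables must be indexed by each token's original position rather than recomputed for the shorter sequence. A hedged numpy sketch of that idea (standard 1-D RoPE, not FLUX's exact axial variant):

```python
import numpy as np

def rope_tables(seq_len, head_dim, base=10000.0):
    """Standard RoPE angle tables for positions 0..seq_len-1."""
    inv_freq = 1.0 / base ** (np.arange(0, head_dim, 2) / head_dim)
    angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, head_dim // 2)
    return np.cos(angles), np.sin(angles)

def apply_rope(x, cos, sin, keep_idx=None):
    """Rotate query/key features. After token dropping, index the tables
    with the ORIGINAL positions of the surviving tokens, not 0..n_keep-1."""
    if keep_idx is not None:
        cos, sin = cos[keep_idx], sin[keep_idx]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Gathering the tables with keep_idx keeps each token's rotary phase consistent with where it originally sat in the sequence, which is presumably what the FLUX change has to guarantee.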

The more tokens you drop, the higher the loss appears to be at the start of LoRA/LoKr training. It corrects fairly rapidly and the images look more normal; I'm unsure whether that's just the network adjusting to the smaller number of tokens supplied in the intermediate layers.

There was also a bug in the main branch that seemed to break training with masked loss for FLUX. This was fixed with a `self.config.model_flavour == "kontext"` guard.

@bghira
Owner

bghira commented Jul 28, 2025

replaced by #1675

@bghira bghira closed this Jul 28, 2025


Successfully merging this pull request may close these issues.

Consider adding TREAD training