This is a discussion of how to minimize the memory usage of attention. Current state: investigating Apex's [scaled_masked_softmax](https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/main/megatron/fused_kernels/scaled_masked_softmax.h) (as vendored into Megatron-DeepSpeed's fused kernels) to understand how it operates.
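
For reference, here is a minimal eager-mode PyTorch sketch of what a scaled masked softmax computes, assuming the usual convention (scale the raw attention scores, fill masked positions with a large negative value, then softmax over the key dimension). This is not the fused kernel itself, just the unfused equivalent; the tensor shapes follow Megatron's `[b, np, sq, sk]` layout.

```python
import torch


def scaled_masked_softmax_ref(scores: torch.Tensor,
                              mask: torch.Tensor,
                              scale: float) -> torch.Tensor:
    """Unfused reference for a scaled masked softmax.

    scores: [b, np, sq, sk]  raw attention scores (Q @ K^T)
    mask:   [b, 1,  sq, sk]  bool, True = position is masked out
    scale:  scalar applied to the scores before the softmax
    """
    # Scale the raw scores.
    scaled = scores * scale
    # Masked positions get a large negative value so they go to ~0 after softmax.
    masked = scaled.masked_fill(mask, -10000.0)
    # Softmax over the key (last) dimension.
    return torch.softmax(masked, dim=-1)
```

In this eager version, `scaled` and `masked` are each an extra `[b, np, sq, sk]` tensor on top of the input and output, which is part of why the attention scores dominate activation memory; a fused kernel can produce the probabilities in a single pass without materializing those intermediates.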