Transforming vision and spatial tasks into language modeling problems.
Read the report: https://api.wandb.ai/links/ccaven/fzcevldh
The important files are as follows:
- `generate_block_bezier_dataset.py` constructs the training data for the VQ-VAE
- `train_vqvae.py` actually trains the VQ-VAE
- `generate_random_plane_2_dataset.py` constructs the training data for the transformer
- `train_transformer_5.py` actually trains the transformer
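The overall idea is that the VQ-VAE turns an image into a grid of discrete codebook indices, and the transformer then models those indices as a token sequence, just like text. The snippet below is a minimal sketch of that tokenization step, not the repository's actual code; the module names, codebook size, and latent shapes are illustrative assumptions.

```python
# Minimal sketch: quantize an encoder's latent grid into discrete tokens
# that a GPT-style transformer can predict autoregressively.
import torch
import torch.nn as nn

class ToyVectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):
        # z: (batch, height, width, code_dim) latent grid from a conv encoder
        flat = z.reshape(-1, z.shape[-1])                    # (B*H*W, D)
        dist = torch.cdist(flat, self.codebook.weight)       # distance to each code
        indices = dist.argmin(dim=-1)                        # nearest code per cell
        quantized = self.codebook(indices).reshape(z.shape)  # back to a latent grid
        return quantized, indices.reshape(z.shape[:-1])

# Hypothetical usage: pretend an encoder produced an 8x8 latent grid.
quantizer = ToyVectorQuantizer()
latent_grid = torch.randn(1, 8, 8, 64)
_, tokens = quantizer(latent_grid)      # tokens: (1, 8, 8) integer code ids
token_sequence = tokens.flatten(1)      # (1, 64) sequence for next-token prediction
print(token_sequence.shape)
```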
I made quite a few attempts with various methods, so I left those methods in the repository. For example, the `src/encoder_decoder` folder and the `train_encoder_decoder.py` file contain an attempt at a different kind of image autoencoder, where the decoder is not a convolutional network but an autoregressive next-token predictor. The `src/diffusion` folder contains a similar attempt where the decoder is a diffusion network.
- The `src/nanogpt` folder is largely taken from Andrej Karpathy's nanoGPT project.
- The `src/diffusion_2` folder is cloned from milmor/diffusion-transformer and includes the original files and license.