
@mmoskal mmoskal commented Feb 16, 2024

This PR introduces support for some features of AI Controller Interface (AICI). In particular it allows for:

  • uploading a controller (Wasm binary) via POST /v1/controllers (see REST docs)
  • tagging REST APIs POST /v1/controllers/tags and GET /v1/controllers/tags
  • running a controller via POST /v1/run
  • the controller can return initial fixed tokens (turned into prompt in vLLM)
  • the controller can also generate output with a constraint (regex, LR(1) grammar, etc.)
  • controller results are always returned via streaming API
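To make the run endpoint concrete, here is a hypothetical helper for building a POST /v1/run request body. The field names ("controller", "controller_arg") are an assumption based on this PR's description, not the authoritative AICI REST schema; check the REST docs linked above for the real shape.

```python
import json

def build_run_request(controller, controller_arg):
    """Hypothetical request builder for POST /v1/run.

    Field names are an assumption, not the official AICI schema:
      controller     - an uploaded module id or a tag (e.g. gh:... form)
      controller_arg - the argument passed to the controller, e.g. a
                       pyctrl script's source text
    """
    return json.dumps({
        "controller": controller,
        "controller_arg": controller_arg,
    })

body = build_run_request("gh:microsoft/aici/pyctrl", "print('hi')")
```

Since results are always streamed, a real client would send this body and then consume a streaming response rather than a single JSON reply.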

The following are not supported yet:

  • fast-forward tokens (adding a number of tokens in one batch, after the initial prompt is processed)
  • backtracking (popping a number of entries from the KV cache)
  • forking (allowing the controller to split a sequence into several sequences)

I have some questions about how these features should be implemented.

cc @simon-mo @emrekiciman

Giving it a spin

Currently, the best way to run is from the AICI repo which contains the forked vLLM as a submodule.

git clone --recursive https://github.com/microsoft/aici
cd aici/py/vllm
python setup.py develop
cd ../..
./scripts/vllm-server.sh

The last command requires a recent rust compiler - see AICI dev setup for more info (you may want to use the devcontainer labeled "vLLM experimental" from the AICI repo).

Once the server is running, you can try out different controllers in a separate terminal (this uses .wasm binaries released in the AICI repo):

cd aici
./aici.sh run --ctrl gh:microsoft/aici/pyctrl controllers/pyctrl/samples/yesno.py 
./aici.sh run --ctrl gh:microsoft/aici/jsctrl controllers/jsctrl/samples/hello.js

Due to the limitations listed above, FixedTokens can only be used once, at the top, and there is no fork().

In the future, it will be possible to download the released aicirt binary, while pyaici will be published on PyPI and can be listed in requirements.txt here. Then it will just be a matter of passing --aici-rt to the vLLM server.

Here's a short overview of AICI's capabilities, and the issues with implementing them in vLLM.

Logit biases

In AICI, the controller computes logit biases for the next sampling, while the GPU is working on the logits. These are used to constrain output to adhere to a certain regular expression, LR(1) grammar, etc.

I apply these in _apply_logits_processors in sampler.py.
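The masking idea can be shown with a minimal pure-Python sketch (in vLLM the analogous step happens on a torch tensor inside _apply_logits_processors in sampler.py; the function name below is mine, not vLLM's):

```python
import math

def apply_token_bias(logits, allowed_token_ids):
    """Constrain the next sampling step by adding a bias per vocab entry:
    disallowed tokens get -inf, so softmax gives them probability zero.

    `logits` is a plain list of floats, one per vocab token.
    """
    bias = [0.0 if i in allowed_token_ids else -math.inf
            for i in range(len(logits))]
    return [l + b for l, b in zip(logits, bias)]

# Only tokens 1 and 2 remain sampleable; 0 and 3 are masked out.
masked = apply_token_bias([1.0, 2.0, 3.0, 0.5], allowed_token_ids={1, 2})
```

A controller implementing, say, a regex constraint would compute the allowed set from its current automaton state each step.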

FF-tokens and backtracking

Fast-forward tokens (also called zero-entropy tokens, forced tokens, or fixed tokens) are tokens that are added to the current sequence in one step, possibly after some generation steps. The initial prompt can be thought of as ff-tokens (this is in fact how AICI views it).

An example where they are useful is generating data that adheres to a certain JSON schema. The controller first forces {"name":" to be generated, then the model generates John", the controller forces ,\n"age":, the model generates 42, and so on. Another example is chain-of-thought reasoning: after the model generates a sentence, the controller forces more instructions for the model, the model generates more text, and so on.
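The JSON-schema interleaving can be sketched as a toy loop that alternates forced (ff) text with model generation. `generate` here is a stand-in for the model, not any real API:

```python
def constrained_json(generate):
    """Toy interleaving of forced text and model generation for a fixed
    schema. In AICI the forced pieces would be ff-tokens appended in one
    step; `generate(prefix)` stands in for free generation under a
    constraint."""
    out = '{"name": "'
    out += generate(out)           # model fills in the name
    out += '", "age": '
    out += generate(out)           # model fills in the age
    out += "}"
    return out

# A fake "model" that maps each prefix to its continuation.
fake_model = {
    '{"name": "': "John",
    '{"name": "John", "age": ': "42",
}
doc = constrained_json(lambda prefix: fake_model[prefix])
# → '{"name": "John", "age": 42}'
```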

FF-tokens work together with backtracking (popping KV cache entries from the current sequence): for example, we can have the model "show its work", then backtrack over the work and append only the final result before continuing with reasoning.
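On the sequence itself, the "show work, then backtrack and keep only the result" pattern is just a pop-then-append on the token list (a real engine would drop the matching KV-cache entries too); a minimal sketch:

```python
def backtrack_and_append(tokens, num_pop, ff_tokens):
    """Pop `num_pop` tokens from the end of the sequence (in a real
    engine, also invalidating the matching KV-cache entries), then
    append fast-forward tokens in a single step."""
    assert num_pop <= len(tokens)
    return tokens[:len(tokens) - num_pop] + list(ff_tokens)

seq = [10, 11, 12, 13, 14]                 # "work shown" by the model
seq = backtrack_and_append(seq, 3, [99])   # keep only the final result
```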

They are also useful with forking: for example, we first give the model some text, and then in separate forks add ff-tokens that instruct the model to score the text according to different criteria; the prompt is processed only once, but the scoring happens in parallel.

Thus, ff-token sequences can be anywhere from a few to a few thousand tokens long.

The way I implemented this in rLLM (a reference LLM Rust server with a subset of vLLM functionality) was by keeping track of the number of tokens in the current sequence for which the KV cache has been correctly computed (num_kv_computed), and updating it as batches go through. I think right now in vLLM this is implicit (either all are computed or none).

For the actual computation of the KV cache, I think context_attention_fwd will do what I need with some hacking.

As for performance: if vLLM is running, say, 20 sequences in parallel and one of them requests 5 ff-tokens, I think I would currently need to do a prompt computation step for that one sequence while pausing the remaining 19. It would be nice to either account for this in the batching (e.g., run a few more tokens and see if one of the remaining 19 also needs ff-tokens, or maybe a 21st sequence is coming in), or support mixed steps with both prompt and generation tokens. However, that is music for the future.

Question: Would adding a num_kv_computed field to Sequence be a good way to handle this?
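To illustrate what the proposed field would track, here is a minimal sketch (this is not vLLM's actual Sequence class; the field and method names are mine): the KV-valid prefix length stays put when ff-tokens are appended and shrinks on backtrack, and the gap to the sequence length is exactly what needs a prompt-style pass.

```python
from dataclasses import dataclass, field

@dataclass
class Sequence:
    """Hypothetical sketch of the num_kv_computed idea, not vLLM's
    Sequence class."""
    token_ids: list = field(default_factory=list)
    num_kv_computed: int = 0   # prefix length with a valid KV cache

    def append_ff_tokens(self, ff_tokens):
        # KV entries for these tokens are not computed yet.
        self.token_ids.extend(ff_tokens)

    def backtrack(self, n):
        del self.token_ids[len(self.token_ids) - n:]
        self.num_kv_computed = min(self.num_kv_computed,
                                   len(self.token_ids))

    def tokens_needing_prefill(self):
        return len(self.token_ids) - self.num_kv_computed

seq = Sequence(token_ids=[1, 2, 3], num_kv_computed=3)
seq.append_ff_tokens([4, 5])   # now 2 tokens need a prompt-style pass
seq.backtrack(4)               # back to [1]; KV valid for 1 token only
```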

Forking

In AICI the controllers can ask the LLM runtime to fork the current sequence. This also forks the controller. Controllers in different forks (but the same request) can communicate with each other. Forking is mostly useful together with ff-tokens.

If a controller asks to be forked, logit generation should proceed with a single sequence; then, in sampling, the logits need to be duplicated and separate logit biases applied to each fork, creating multiple sequences. The main issue I see here is that the biases need to be applied before temperature, softmax, etc., while currently duplication (for n sampling etc.) happens after those steps.

Question: Should I pre-duplicate the logits for AICI, and keep the current duplication logic for n sampling etc?
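The pre-duplication being proposed can be shown with a small pure-Python sketch: one sequence's logits are copied per fork, and each fork's bias is added before any temperature or softmax would run. In vLLM this would happen on tensors inside the sampler; the function below is illustrative only.

```python
import math

def fork_logits(logits, per_fork_biases):
    """Duplicate one sequence's logits into several forks and apply each
    fork's bias *before* temperature/softmax. Returns one biased logit
    row per fork."""
    return [[l + b for l, b in zip(logits, bias)]
            for bias in per_fork_biases]

forks = fork_logits(
    [1.0, 2.0, 3.0],
    per_fork_biases=[
        [0.0, -math.inf, 0.0],   # fork 0: token 1 banned
        [-math.inf, 0.0, 0.0],   # fork 1: token 0 banned
    ],
)
```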

Controlling sampling

This is not yet implemented in AICI, but it is planned. The idea is that the controller can ask the LLM runtime to change sampling parameters for certain parts of the generation. For example, it may want to lower the temperature for a correctness-critical part of the output.

Question: Would the approach here be to add an optional pointer to sampling parameters in Sequence?

AaronFriel commented Mar 20, 2024

Does automatic prefix caching (#2614) obviate the need for AICI to be clever here? Would it instead be better to implement fast-forward, backtracking, and forking by managing parallel requests?

mmoskal commented Jul 10, 2024

Closing in favor of #6273.

@mmoskal mmoskal closed this Jul 10, 2024