AI Controller Interface (AICI) integration #2888
Closed
Does automatic prefix caching (#2614) obviate the need for AICI to be clever, and instead it would be better to implement feed forward, backtracking, and forking by managing parallel requests?
This was referenced Mar 29, 2024
Contributor (Author)
closing in favor of #6273
This PR introduces support for some features of AI Controller Interface (AICI). In particular, it allows for:

- `POST /v1/controllers` (see REST docs)
- `POST /v1/controllers/tags` and `GET /v1/controllers/tags`
- `POST /v1/run`

The following are not supported yet:
I have some questions about how these features should be implemented.
cc @simon-mo @emrekiciman
Giving it a spin
Currently, the best way to run is from the AICI repo which contains the forked vLLM as a submodule.
The last command requires a recent Rust compiler; see the AICI dev setup for more info (you may want to use the devcontainer labeled "vLLM experimental" from the AICI repo).
Once the server is running, in a separate terminal, you can try out different controllers (this uses `.wasm` binaries released in the AICI repo):

```
cd aici
./aici.sh run --ctrl gh:microsoft/aici/pyctrl controllers/pyctrl/samples/yesno.py
./aici.sh run --ctrl gh:microsoft/aici/jsctrl controllers/jsctrl/samples/hello.js
```

Due to the limitations listed above, `FixedTokens` can only be used once at the top, and there is no `fork()`.

In the future, it will be possible to download the released `aicirt` binary, while `pyaici` will be published on PyPI and can be listed in `requirements.txt` here. Then it will just be a matter of passing `--aici-rt` to the vLLM server.

Here's a short overview of the capabilities of AICI and the issues with implementing them in vLLM.
Logit biases
In AICI, the controller computes logit biases for the next sampling, while the GPU is working on the logits. These are used to constrain output to adhere to a certain regular expression, LR(1) grammar, etc.
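To make the mechanism concrete, here is a minimal sketch of applying controller-supplied logit biases before sampling. The function name and data layout are illustrative assumptions, not vLLM's or AICI's actual API; vLLM applies biases via logits processors in `sampler.py`.

```python
# Hypothetical sketch: applying per-sequence logit biases from a controller
# before sampling. A large negative bias effectively forbids a token; a
# 0.0 bias leaves the logit unchanged. Names here are illustrative only.

def apply_aici_biases(logits, seq_biases):
    """logits: one row of floats per sequence;
    seq_biases: a parallel list of bias rows, or None for "no constraint"."""
    out = []
    for row, bias in zip(logits, seq_biases):
        if bias is None:
            out.append(list(row))  # no controller constraint for this sequence
        else:
            out.append([l + b for l, b in zip(row, bias)])
    return out

# Constrain a 4-token vocabulary so only tokens 1 and 3 can be sampled:
NEG_INF = float("-inf")
masked = apply_aici_biases(
    [[0.5, 1.2, -0.3, 2.0]],
    [[NEG_INF, 0.0, NEG_INF, 0.0]],
)
```

The key ordering constraint is that these biases must be added to the raw logits, i.e. before temperature scaling and softmax.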
I apply these in `_apply_logits_processors` in `sampler.py`.

FF-tokens and backtracking
Fast-forward tokens (also called zero-entropy tokens, forced tokens, or fixed tokens) are tokens that are added to the current sequence in one step, possibly after some generation steps. The initial prompt can be thought of as ff-tokens (this is in fact how AICI views it).
An example where they are useful is generating data adhering to a certain JSON schema. The controller first forces `{"name":"` to be generated, then the model generates `John"`, the controller forces `,\n"age":`, the model generates `42`, and so on. Another example is chain-of-thought reasoning: after the model generates a sentence, the controller forces more instructions for the model, the model generates more text, and so on.

FF-tokens work together with backtracking (popping KV cache entries from the current sequence); for example, we can have the model "show its work", but then backtrack over the work and only append the final result before continuing with reasoning.
They are also useful with forking: for example, we first give the model some text, and then in separate forks add ff-tokens that instruct the model to score the text based on separate criteria; we only processed the prompt once, but scoring happens in parallel.
Thus, sequences of ff-tokens can be anywhere from a few to a few thousand tokens long.
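The JSON-schema example above can be sketched as a controller loop that alternates forced text with normal decoding. This is an illustrative toy, not AICI's controller API: `model_step` stands in for real decoding, and the `force`/`generate` helpers are invented names.

```python
# Hypothetical sketch of ff-token interleaving: the controller forces the
# JSON skeleton in fixed steps and lets the model fill in the values.

def build_json(model_step):
    parts = []
    def force(text):      # ff-tokens: appended in one step, no sampling
        parts.append(text)
    def generate(hint):   # normal decoding (stubbed out here)
        parts.append(model_step(hint))
    force('{"name": "')
    generate("name")
    force('",\n"age": ')
    generate("age")
    force("}")
    return "".join(parts)

# With a stub "model", the controller still guarantees the JSON skeleton:
result = build_json(lambda hint: {"name": "John", "age": "42"}[hint])
```

Whatever the stub model emits, the structural tokens are guaranteed by the controller, which is the point of ff-tokens.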
The way I had this implemented in rLLM (a reference LLM Rust server with a subset of vLLM functionality) was by keeping track of the number of tokens in the current sequence for which the KV cache was correctly computed (`num_kv_computed`) and then updating it as the batches went through. I think right now in vLLM this is implicit (either all are computed or none).

For the actual computation of the KV cache, I think `context_attention_fwd` will do what I need with some hacking.

As for performance: if vLLM is running, say, 20 sequences in parallel and one of them requests 5 ff-tokens, right now I think I would need to do a prompt computation step for this one sequence while pausing the remaining 19. It would be nice to either account for that somehow in the batching (e.g., do a few more tokens and see if one of the remaining 19 also needs ff-tokens, or maybe a 21st one is coming in), or else run mixed steps with both prompt and generation tokens. However, this is a matter for the future.
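A minimal sketch of the proposed bookkeeping, assuming a simplified `Sequence` class (the real vLLM `Sequence` is more involved; only `num_kv_computed` is the field this PR proposes, and the helper methods are illustrative):

```python
# Hypothetical sketch: tracking how much of a sequence's KV cache is valid.

class Sequence:
    def __init__(self, prompt_token_ids):
        self.token_ids = list(prompt_token_ids)
        # KV entries are valid for token_ids[:num_kv_computed].
        self.num_kv_computed = 0

    def append_ff_tokens(self, tokens):
        # Fast-forward tokens extend the sequence; their KV entries
        # still need to be computed in a prefill-style step.
        self.token_ids.extend(tokens)

    def backtrack(self, n):
        # Popping tokens also invalidates any KV entries past the new end.
        del self.token_ids[len(self.token_ids) - n:]
        self.num_kv_computed = min(self.num_kv_computed, len(self.token_ids))

    def tokens_needing_kv(self):
        # What the next batch would have to process for this sequence.
        return self.token_ids[self.num_kv_computed:]

seq = Sequence([1, 2, 3])
seq.num_kv_computed = 3          # prompt KV already computed
seq.append_ff_tokens([7, 8, 9])  # 3 ff-tokens appended in one step
```

With this counter, "all computed or none" becomes a special case (`num_kv_computed` equal to 0 or to the sequence length), so ordinary generation is unaffected.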
Question: Would adding a `num_kv_computed` field to `Sequence` be a good way to handle this?

Forking
In AICI the controllers can ask the LLM runtime to fork the current sequence. This also forks the controller. Controllers in different forks (but the same request) can communicate with each other. Forking is mostly useful together with ff-tokens.
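The fork-time logit handling described here can be sketched as follows: logits are computed once for the parent sequence, then duplicated, with each fork getting its own controller-supplied bias. The function and layout are illustrative assumptions, not vLLM internals.

```python
# Hypothetical sketch: one forward pass produces parent_logits; forking
# duplicates that row and applies a separate bias per fork, so each fork
# samples under its own constraint.

def fork_logits(parent_logits, fork_biases):
    """parent_logits: one logit row; fork_biases: one bias row per fork.
    Returns one biased logit row per fork."""
    return [
        [l + b for l, b in zip(parent_logits, bias)]
        for bias in fork_biases
    ]

NEG_INF = float("-inf")
parent = [1.0, 2.0, 3.0]
# Fork 0 may only emit token 0; fork 1 may only emit token 2.
forks = fork_logits(parent, [[0.0, NEG_INF, NEG_INF],
                             [NEG_INF, NEG_INF, 0.0]])
```

This ordering (duplicate, then bias, then temperature/softmax) is what motivates the question below about where duplication should happen.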
If a controller asks to be forked, logit generation should proceed with a single sequence; then, during sampling, the logits need to be duplicated and separate logit biases added to each fork, creating multiple sequences. The main issue I see here is that the biases need to be applied before temperature, softmax, etc., while currently duplication (for `n` sampling etc.) happens after these steps.

Question: Should I pre-duplicate the logits for AICI, and keep the current duplication logic for `n` sampling etc.?

Controlling sampling
This is not yet implemented in AICI, but it is planned. The idea is that the controller can ask the LLM runtime to change sampling parameters for certain parts of generation. For example, it may want to lower the temperature for some correctness-critical part of the output.
Question: Would the approach here be to add an optional pointer to sampling parameters in `Sequence`?
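One way the optional pointer could look, as a minimal sketch: the `SamplingOverride` type, the `sampling_override` field name, and the fallback logic are all hypothetical, not existing vLLM types.

```python
# Hypothetical sketch: an optional per-segment sampling override on a
# sequence, falling back to the request's base parameters when unset.

class SamplingOverride:
    def __init__(self, temperature):
        self.temperature = temperature

class Sequence:
    def __init__(self, base_temperature=1.0):
        self.base_temperature = base_temperature
        self.sampling_override = None  # optional pointer set by the controller

    def effective_temperature(self):
        if self.sampling_override is not None:
            return self.sampling_override.temperature
        return self.base_temperature

seq = Sequence(base_temperature=0.8)
# Controller lowers temperature for a correctness-critical segment:
seq.sampling_override = SamplingOverride(temperature=0.1)
t = seq.effective_temperature()
```

Clearing the pointer restores the request-level parameters, so the override naturally scopes to a segment of generation.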