
@mmoskal mmoskal commented Feb 16, 2024

This PR introduces support for some features of AI Controller Interface (AICI). In particular it allows for:

  • uploading a controller (Wasm binary) via POST /v1/controllers (see REST docs)
  • tagging REST APIs POST /v1/controllers/tags and GET /v1/controllers/tags
  • running a controller via POST /v1/run
  • the controller can return initial fixed tokens (turned into prompt in vLLM)
  • the controller can also generate output with a constraint (regex, LR(1) grammar, etc.)
  • controller results are always returned via streaming API
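To make the run endpoint concrete, here is a hypothetical helper for building a POST /v1/run request body. The field names ("controller", "controller_arg") are an assumption based on this PR's description, not the authoritative AICI REST schema; check the REST docs linked above for the real shape.

```python
import json

def build_run_request(controller, controller_arg):
    """Hypothetical request builder for POST /v1/run.

    Field names are an assumption, not the official AICI schema:
      controller     - an uploaded module id or a tag (e.g. gh:... form)
      controller_arg - the argument passed to the controller, e.g. a
                       pyctrl script's source text
    """
    return json.dumps({
        "controller": controller,
        "controller_arg": controller_arg,
    })

body = build_run_request("gh:microsoft/aici/pyctrl", "print('hi')")
```

Since results are always streamed, a real client would send this body and then consume a streaming response rather than a single JSON reply.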

The following are not supported yet:

  • fast-forward tokens (adding a number of tokens in one batch, after the initial prompt is processed)
  • backtracking (popping a number of entries from the KV cache)
  • forking (allowing the controller to split a sequence into several sequences)

I have some questions about how these features should be implemented.

cc @simon-mo @emrekiciman

Giving it a spin

Currently, the best way to run is from the AICI repo which contains the forked vLLM as a submodule.

git clone --recursive https://github.com/microsoft/aici
cd aici/py/vllm
python setup.py develop
cd ../..
./scripts/vllm-server.sh

The last command requires a recent rust compiler - see AICI dev setup for more info (you may want to use the devcontainer labeled "vLLM experimental" from the AICI repo).

Once the server is running, you can try out different controllers in a separate terminal (this uses .wasm binaries released in the AICI repo):

cd aici
./aici.sh run --ctrl gh:microsoft/aici/pyctrl controllers/pyctrl/samples/yesno.py 
./aici.sh run --ctrl gh:microsoft/aici/jsctrl controllers/jsctrl/samples/hello.js

Due to the limitations listed above, FixedTokens can only be used once, at the top, and there is no fork().

In the future, it will be possible to download the released aicirt binary, while pyaici will be published on PyPI and can be listed in requirements.txt here. Then it will just be a matter of passing --aici-rt to the vLLM server.

Here's a short overview of AICI's capabilities, and the issues with implementing them in vLLM.

Logit biases

In AICI, the controller computes logit biases for the next sampling, while the GPU is working on the logits. These are used to constrain output to adhere to a certain regular expression, LR(1) grammar, etc.

I apply these in _apply_logits_processors in sampler.py.
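The masking idea can be shown with a minimal pure-Python sketch (in vLLM the analogous step happens on a torch tensor inside _apply_logits_processors in sampler.py; the function name below is mine, not vLLM's):

```python
import math

def apply_token_bias(logits, allowed_token_ids):
    """Constrain the next sampling step by adding a bias per vocab entry:
    disallowed tokens get -inf, so softmax gives them probability zero.

    `logits` is a plain list of floats, one per vocab token.
    """
    bias = [0.0 if i in allowed_token_ids else -math.inf
            for i in range(len(logits))]
    return [l + b for l, b in zip(logits, bias)]

# Only tokens 1 and 2 remain sampleable; 0 and 3 are masked out.
masked = apply_token_bias([1.0, 2.0, 3.0, 0.5], allowed_token_ids={1, 2})
```

A controller implementing, say, a regex constraint would compute the allowed set from its current automaton state each step.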

FF-tokens and backtracking

Fast-forward tokens (also called zero-entropy tokens, forced tokens, or fixed tokens) are tokens that are added to the current sequence in one step, possibly after some generation steps. The initial prompt can be thought of as ff-tokens (this is in fact how AICI views it).

An example where they are useful is generating data that adheres to a certain JSON schema. The controller first forces {"name":" to be generated, then the model generates John", the controller forces ,\n"age":, the model generates 42, and so on. Another example is chain-of-thought reasoning: after the model generates a sentence, the controller forces more instructions for the model, the model generates more text, and so on.
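The JSON-schema interleaving can be sketched as a toy loop that alternates forced (ff) text with model generation. `generate` here is a stand-in for the model, not any real API:

```python
def constrained_json(generate):
    """Toy interleaving of forced text and model generation for a fixed
    schema. In AICI the forced pieces would be ff-tokens appended in one
    step; `generate(prefix)` stands in for free generation under a
    constraint."""
    out = '{"name": "'
    out += generate(out)           # model fills in the name
    out += '", "age": '
    out += generate(out)           # model fills in the age
    out += "}"
    return out

# A fake "model" that maps each prefix to its continuation.
fake_model = {
    '{"name": "': "John",
    '{"name": "John", "age": ': "42",
}
doc = constrained_json(lambda prefix: fake_model[prefix])
# → '{"name": "John", "age": 42}'
```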

FF-tokens work together with backtracking (popping KV cache entries from the current sequence): for example, we can have the model "show its work", then backtrack over the work and append only the final result before continuing with reasoning.
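On the sequence itself, the "show work, then backtrack and keep only the result" pattern is just a pop-then-append on the token list (a real engine would drop the matching KV-cache entries too); a minimal sketch:

```python
def backtrack_and_append(tokens, num_pop, ff_tokens):
    """Pop `num_pop` tokens from the end of the sequence (in a real
    engine, also invalidating the matching KV-cache entries), then
    append fast-forward tokens in a single step."""
    assert num_pop <= len(tokens)
    return tokens[:len(tokens) - num_pop] + list(ff_tokens)

seq = [10, 11, 12, 13, 14]                 # "work shown" by the model
seq = backtrack_and_append(seq, 3, [99])   # keep only the final result
```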

They are also useful with forking: for example, we first give the model some text, and then in separate forks add ff-tokens that instruct the model to score the text according to different criteria; the prompt is processed only once, but the scoring happens in parallel.

Thus, ff-token sequences can be anywhere from a few to a few thousand tokens long.

The way I implemented this in rLLM (a reference LLM Rust server with a subset of vLLM functionality) was by keeping track of the number of tokens in the current sequence for which the KV cache has been correctly computed (num_kv_computed), and updating it as batches go through. I think right now in vLLM this is implicit (either all are computed or none).

For the actual computation of the KV cache, I think context_attention_fwd will do what I need with some hacking.

As for performance: if vLLM is running, say, 20 sequences in parallel and one of them requests 5 ff-tokens, I think I would currently need to do a prompt computation step for that one sequence while pausing the remaining 19. It would be nice to either account for this in the batching (e.g., run a few more tokens and see if one of the remaining 19 also needs ff-tokens, or maybe a 21st sequence is coming in), or support mixed steps with both prompt and generation tokens. However, that is music for the future.

Question: Would adding a num_kv_computed field to Sequence be a good way to handle this?
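To illustrate what the proposed field would track, here is a minimal sketch (this is not vLLM's actual Sequence class; the field and method names are mine): the KV-valid prefix length stays put when ff-tokens are appended and shrinks on backtrack, and the gap to the sequence length is exactly what needs a prompt-style pass.

```python
from dataclasses import dataclass, field

@dataclass
class Sequence:
    """Hypothetical sketch of the num_kv_computed idea, not vLLM's
    Sequence class."""
    token_ids: list = field(default_factory=list)
    num_kv_computed: int = 0   # prefix length with a valid KV cache

    def append_ff_tokens(self, ff_tokens):
        # KV entries for these tokens are not computed yet.
        self.token_ids.extend(ff_tokens)

    def backtrack(self, n):
        del self.token_ids[len(self.token_ids) - n:]
        self.num_kv_computed = min(self.num_kv_computed,
                                   len(self.token_ids))

    def tokens_needing_prefill(self):
        return len(self.token_ids) - self.num_kv_computed

seq = Sequence(token_ids=[1, 2, 3], num_kv_computed=3)
seq.append_ff_tokens([4, 5])   # now 2 tokens need a prompt-style pass
seq.backtrack(4)               # back to [1]; KV valid for 1 token only
```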

Forking

In AICI the controllers can ask the LLM runtime to fork the current sequence. This also forks the controller. Controllers in different forks (but the same request) can communicate with each other. Forking is mostly useful together with ff-tokens.

If a controller asks to be forked, logit generation should proceed with a single sequence; then, in sampling, the logits need to be duplicated and separate logit biases applied to each fork, creating multiple sequences. The main issue I see here is that the biases need to be applied before temperature, softmax, etc., while currently duplication (for n sampling etc.) happens after those steps.

Question: Should I pre-duplicate the logits for AICI, and keep the current duplication logic for n sampling etc?
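The pre-duplication being proposed can be shown with a small pure-Python sketch: one sequence's logits are copied per fork, and each fork's bias is added before any temperature or softmax would run. In vLLM this would happen on tensors inside the sampler; the function below is illustrative only.

```python
import math

def fork_logits(logits, per_fork_biases):
    """Duplicate one sequence's logits into several forks and apply each
    fork's bias *before* temperature/softmax. Returns one biased logit
    row per fork."""
    return [[l + b for l, b in zip(logits, bias)]
            for bias in per_fork_biases]

forks = fork_logits(
    [1.0, 2.0, 3.0],
    per_fork_biases=[
        [0.0, -math.inf, 0.0],   # fork 0: token 1 banned
        [-math.inf, 0.0, 0.0],   # fork 1: token 0 banned
    ],
)
```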

Controlling sampling

This is not yet implemented in AICI, but it is planned. The idea is that the controller can ask the LLM runtime to change sampling parameters for certain parts of the generation. For example, it may want to lower the temperature for a correctness-critical part of the output.

Question: Would the approach here be to add an optional pointer to sampling parameters in Sequence?

AaronFriel commented Mar 20, 2024

Does automatic prefix caching (#2614) obviate the need for AICI to be clever here? Would it instead be better to implement fast-forward, backtracking, and forking by managing parallel requests?

mmoskal commented Jul 10, 2024

Closing in favor of #6273.

@mmoskal mmoskal closed this Jul 10, 2024