Conversation
Code Metrics Report=============================================================================== Language Files Lines Code Comments Blanks =============================================================================== C Header 2 35 28 0 7 Dockerfile 1 34 25 0 9 Happy 1 442 369 0 73 JSON 11 102 101 0 1 Python 46 2018 1718 62 238 TOML 20 617 544 11 62 YAML 1 9 8 1 0 ------------------------------------------------------------------------------- Jupyter Notebooks 4 0 0 0 0 |- Markdown 2 77 32 31 14 |- Python 2 196 169 1 26 (Total) 273 201 32 40 ------------------------------------------------------------------------------- Markdown 28 1960 0 1481 479 |- BASH 5 101 98 0 3 |- JSON 1 12 12 0 0 |- Python 5 92 82 0 10 |- Rust 6 408 365 19 24 |- TOML 2 75 63 0 12 (Total) 2648 620 1500 528 ------------------------------------------------------------------------------- Rust 195 59957 54396 1100 4461 |- Markdown 100 906 13 843 50 (Total) 60863 54409 1943 4511 =============================================================================== Total 310 65174 57189 2655 5330 =============================================================================== |
|
@EricLBuehler One thing I am pondering is that |
@Dan-wanna-M yes, we parallelize the sampling process. Perhaps we could change the API to accept some generic type which has some method to do the logits processing? Then, you could contain the state in that type in |
|
@Dan-wanna-M I just updated it so now the API is: pub trait CustomLogitsProcessor: Send + Sync {
/// Logits and sequence context (prompt and generated tokens), returning modified tokens.
fn apply(&self, logits: &Tensor, context: &[u32]) -> Result<Tensor>;
}In the example, we have a custom struct: struct ThresholdLogitsProcessor {
threshold: f64,
}
impl CustomLogitsProcessor for ThresholdLogitsProcessor {
fn apply(&self, logits: &Tensor, _context: &[u32]) -> Result<Tensor> {
// Mask is 1 for true, 0 for false.
let mask = logits.ge(self.threshold)?;
logits.broadcast_mul(&mask.to_dtype(logits.dtype())?)
}
}Would this work if you could put shared state into the custom struct w/ a mutex? |
I spent some time rethinking it and I think as long as one generation only uses one logits processor then it should work(So a batch of N generation will need N separate logits processor) |
Sounds good, and this API implements |
Yes it should be mergeable now |
|
Sounds good @Dan-wanna-M! Just merged it. |
* Add custom logits processor api * Typos * Nicer interface and update example * Fix doctest * Update docs
* Implement dry penalty * Add dry sampling params to requests * Handle it * Clippy * Review: "Implement DRY penalty" (#645) * Silence bogus Clippy warning Clippy's suggestion cannot be implemented because of borrowing issues * Get rid of unnecessary type annotations Interesting that Clippy doesn't catch this * Store default sequence breakers in a slice It's nicer when the length is not hardcoded * Make default sequence breakers private No need to leak this as it's not used elsewhere * Limit match length Avoids quadratic runtime and potential DoS with adversarial inputs Ref oobabooga/text-generation-webui#6047 * "Fix" sequence breaker tokenization Most tokenizers encode punctuation tokens differently depending on where they occur in the input, and which tokens surround them. With the default sequence breakers, the appropriate encoding usually corresponds to the encoding produced when the token occurs after a word, rather than by itself. To emulate this, prefix the token with "a" before encoding, and extract the final token of the result. See LostRuins/koboldcpp#982 for a correct solution to this problem. * Nicer * Even better * Complete merge * Fix saturating sub * Handle when no context * Make context the entire sequence and refactor * Remove slicing for all * Fix the bug with penalty Credit to @p-e-w for finding this! Co-authored-by: Philipp Emanuel Weidmann <pew@worldwidemann.com> * Add custom logits processor API (#702) * Add custom logits processor api * Typos * Nicer interface and update example * Fix doctest * Update docs * Update exports * Add Gemma 2 PagedAttention support (#704) * Add gemma2 paged attn support * Non cuda support? * Remove error * It works * Faster RmsNorm in gemma/gemma2 (#703) * Fix bug in metal isq (#706) * Support GGUF BF16 tensors (#691) * Support GGUF bf16 tensors * Fix loading of bf16 ggml tensor * Fix dequant of bf16 * Use merged rev * Softcapping, real batching + sliding window support for Flash Attention (#707) * Flash attention varlen kind of works * Seems to work * Now it's nice * Sliding window support and clippy * Remove warning * Support smollm * Update rev to match merged * Remove some usages of 'pub' in models (#708) * Support the Phi 3.5 V model (#710) * Update image_seq_len * Update the examples * Format * Implement the Phi 3.5 MoE model (#709) * Copy the model * Add most of it * Add the blocksparse moe parts * Clippy * Fix mscales * A batch of fixes * Correctly cast it * Handle isq on gate * Even more progress * Runs now * Clippy * Fix to use layernorm * Remove unused * Add docs * Add more docs * Apply review comments * Update readme --------- Co-authored-by: Philipp Emanuel Weidmann <pew@worldwidemann.com>
New sampling order:
@Dan-wanna-M would this API work? You can find an example of using it here. Basically, you provide a Vec of
Arc<dyn Fn(&Tensor, &[u32]) -> Result<Tensor> + Send + Sync>, which are applied.Please let me know if this works for you or what changes would be necessary, and I can merge this PR. kbnf looks super cool!