finetune: SGD optimizer, more CLI args #13873
JohannesGaessler merged 7 commits into ggml-org:master
Conversation
Perhaps no need to review until I have an actual SGD impl in a follow-on, @JohannesGaessler - but a few general questions about contributing:
WilliamTambellini left a comment
You'd better keep that change, as it takes time to get more feedback/approval.
Any changes made to the ggml source in this repository will eventually be synced to the ggml repository and vice versa; it is completely fine. I think the issue of a git submodule was previously brought up and rejected.
My opinion is that people serious about training should be writing a program rather than using a command-line tool. Still, I think it's good to make things such as the learning rate configurable in the provided example program.
I don't remember whether those args were put in by me when I copypasted code or by Georgi when he later refactored it but I myself definitely did not make an intentional choice to use these exact arguments.
I don't know, sorry.
JohannesGaessler left a comment
None of the previous perplexity-specific arguments are needed.
For adding an SGD optimizer, add a new ggml op like
yes, will do. should the actual SGD impl be a subsequent pull req (or several, e.g. starting first w/ just a CPU impl) or do you want it all in one pull req?
Either way would be fine with me as long as there are no broken or unfinished features on master at any point.
Force-pushed from e752031 to e689af8
matiaslin left a comment
Looking forward to the next PR(s).
You should see frivolous clang-format changes (using the project's .clang-format) only on lines changed in the PR (using git-clang-format). If there's something undesirable we could figure out what in the format config does it.
Don't autoformat code en masse unless it's done in a dedicated PR; it makes it unnecessarily difficult to track what was actually changed in a PR.
Sorry, I didn't read the
part.
Force-pushed from 7534bbf to 48a16bf
Hi @WilliamTambellini @JohannesGaessler, I think this is usable now, inviting code nitpicks etc. :)
The second (the actually usable SGD) commit is 48a16bf (also shown above).
This mixes up two different projects: the CLI changes/renaming and SGD. It needs to be split into 2 PRs.
@slaren ?
I'm not aware of anything I can do on my end to get this merged (is someone waiting on me that I'm unaware of?). I just marked a 'conflict' above as resolved, but I think ultimately the decision rests with the llama.cpp maintainers.
@JohannesGaessler completed, ready for final review/merge
@graehl you may rebase onto master to simplify the merge for JohannesGaessler
Don't think there's anything I can currently do (please be specific if I'm mistaken, I'm new).
From my side, I'm basically waiting for you to look at graehl#1 and merge it if it's fine. I see it has a merge conflict again, I'll fix it.
Rebase YOUR branch onto master (then force-push to your branch), see 0cc4m's changes, and cherry-pick 0cc4m's commits (or rebase his changes onto yours).
Thanks for spelling this out, that was easy - didn't squash so we can keep 0cc4m's contribution separate, but it's all rebased and you should see it here.
As I said, please use the human-readable parameters, and only the human-readable parameters, as the ones being passed to |
add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating
m, v tensors.
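For reference, a minimal sketch of the per-element update this op performs (it matches the math in the Vulkan shader further down; the function below is illustrative, not the actual ggml kernel):

```c
#include <stdint.h>

// Illustrative only: SGD step with weight decay, as described in this PR.
// x is updated in place; no m/v moment tensors are needed, unlike AdamW.
static void sgd_step(float * x, const float * grad, int64_t n, float alpha, float wd) {
    const float keep = 1.0f - alpha * wd;  // fraction of the old weight kept
    for (int64_t i = 0; i < n; ++i) {
        x[i] = x[i] * keep - alpha * grad[i];
    }
}
```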
support finetune.cpp arg -opt SGD (or sgd). (default adamw as before)
llama 3.2-1b-F32 result: observed 11 GB GPU RAM (41 sec/epoch)
when using SGD instead of 19 GB (55 sec/epoch) using adamw.
(wikipedia 100 lines finetune)
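(A rough sanity check on that gap, my arithmetic rather than anything measured in the PR: AdamW keeps two extra F32 moment tensors per trainable parameter, so for a ~1B-parameter model that is roughly 2 × 4 bytes × 1e9 ≈ 8 GB of additional optimizer state, which is consistent with the ~8 GB difference reported above.)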
(
using the same GPU memory, adamw can only do 512 batch/context before
OOM, reaching:
train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00
val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00
SGD is superior, though it converges more slowly, with a max of 1728
batch/context before OOM (esp. see the better validation perf):
train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00
val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00
)
note: when finetuning long enough (or w/ a high enough -lr),
validation accuracy *eventually* drops ('catastrophic forgetting').
the -lr-half (halflife) option is useful for SGD to avoid oscillation or
super slow underdamped learning (it makes setting -lr more forgiving).
the terminal -lr for now is set by -lr-halvings, i.e. if you want at most
1/8 the initial -lr you set -lr-halvings 3.
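My reading of that schedule, as a sketch (the unit of the halflife and whether the floor is exactly what -lr-halvings sets are assumptions on my part, not taken from the code):

```c
#include <math.h>

// Illustrative only: exponential halflife decay of the learning rate with a
// floor after `halvings` halvings, e.g. halvings = 3 bottoms out at lr0 / 8.
static float decayed_lr(float lr0, float halflife, int halvings, float t) {
    const float lr     = lr0 * powf(0.5f, t / halflife);
    const float lr_min = lr0 * powf(0.5f, (float) halvings);
    return lr > lr_min ? lr : lr_min;
}
```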
note: objective loss not directly comparable between adamw, sgd? -
check perplexity or accuracy or consider relative improvements
for convergence
new finetune args: -wd 1e-9 to enable weight decay in sgd or adamw,
and -epochs N to cap the number of epochs (default 2 as before)
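(For example, an illustrative combination of just the optimizer-related flags described here, with model/dataset arguments omitted and the values made up: `-opt sgd -lr 1e-4 -lr-halvings 3 -wd 1e-9 -epochs 2`.)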
cache (1 - wd*alpha) in 'adamw' opt struct -
no noticeable perf benefit, disabled (still done
for new SGD though)
since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params
would probably be able to change between SGD and AdamW with each epoch
but would need to use adamw for the first (unconfirmed - no cmdline arg
to set such a policy yet)
test-opt checks adamw as before and now sgd (except for a few disabled
tests for sgd only; probably just needs logging values and adding
alternate reference values); tolerance on the 'regression'
test is broader for sgd (so we don't need many more epochs)
This is fine and done now, but I cannot be confident the vulkan end of things is correct after the change (I just haven't read up on how the vulkan API works, at all).
You can change what you want; once things are ready I'll do a proper review of the Vulkan parts and make sure they are okay.
From my end I would consider this PR now essentially good to merge. So unless there is something else left to do, I will make some cosmetic changes and rely on @0cc4m to fix Vulkan if necessary. After that I will approve and merge.
I didn't change anything at all in vulkan - it's all Greek to me :) Do take a look. Perhaps the tests weren't really running on vulkan (I had disabled them since I didn't have an impl). The change is that the op params tensor [1] is now sgd.wd instead of 1 - sgd.wd*sgd.alpha. ([0] is just sgd.alpha)
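A sketch of the host-side layout being described (the helper and its names are just for illustration, not the actual ggml code):

```c
// Illustrative: the new SGD op's 2-element params tensor.
static void fill_sgd_params(float params[2], float alpha, float wd) {
    params[0] = alpha;  // learning rate
    params[1] = wd;     // weight decay; previously this slot held 1 - wd * alpha
}
```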
Yeah, no worries. Here's a diff that does that change on the Vulkan shader, and removes two unnecessary preprocessor steps.

diff --git a/ggml/src/ggml-vulkan/vulkan-shaders/opt_step_sgd.comp b/ggml/src/ggml-vulkan/vulkan-shaders/opt_step_sgd.comp
index 3d5e1d98f..6426dedee 100644
--- a/ggml/src/ggml-vulkan/vulkan-shaders/opt_step_sgd.comp
+++ b/ggml/src/ggml-vulkan/vulkan-shaders/opt_step_sgd.comp
@@ -1,9 +1,6 @@
 #version 450
 #include "generic_head.comp"
-#include "types.comp"
-
-#extension GL_EXT_control_flow_attributes : enable
 layout(local_size_x = 512, local_size_y = 1, local_size_z = 1) in;
@@ -19,7 +16,7 @@ void main() {
     }
     const float alpha = data_params[0];
-    const float keep = data_params[1];
+    const float keep = 1.f - alpha * data_params[1];
     data_x[i] = data_x[i] * keep - alpha * data_grad[i];
 }

If you apply that the CI should pass again.
I made some changes and pushed them to
I believe I successfully applied both. If anything else can be done to get this merged let me know.
Please fix the build issue in the pipelines.
The build failures are my fault. I don't know why, but for some reason
Thanks for the work and the persistence, everyone. For bookkeeping I changed the title/commit message to also mention SGD.
I think this PR broke the SYCL build. Maybe we just need to update the "supports" function.


add to ggml-opt learning rate (adamw alpha) cmdline arg, and an optimizer enum defaulting to adamw,
preparatory to work to support SGD
These add, in common args, a set of optimizer options active only for the new FINETUNE example (which includes all the previous finetune.cpp PERPLEXITY options as a precaution).
perhaps breaking with precedent, the ggml_opt_optimizer_params struct is included directly as args - if desired, we can instead just add learning rate and optimizer type to a struct independent of ggml-opt.h
as proposed in #13835.
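To make the "learning rate as an arg" idea concrete, here is a minimal sketch of how a CLI value could be threaded into ggml-opt through its parameter callback; it assumes the pre-existing ggml_opt_optimizer_params struct and ggml_opt_get_default_optimizer_params from ggml-opt.h, and is not the exact code in this PR:

```c
#include "ggml-opt.h"

// Illustrative only: override the default AdamW parameters with a learning
// rate passed in through the userdata pointer (e.g. parsed from -lr).
static struct ggml_opt_optimizer_params lr_from_cli(void * userdata) {
    struct ggml_opt_optimizer_params params = ggml_opt_get_default_optimizer_params(NULL);
    params.adamw.alpha = *(const float *) userdata;
    return params;
}
```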