Curve-fitting comparison between the Adam and L-BFGS optimizers
I'm studying NN tools for theoretical chemistry simulations, especially potential energy surface (PES) fitting.
At first, I chose TensorFlow for the NN simulation. I successfully constructed a diabatic PES in TensorFlow 2.4 with the Adam optimizer, and the result has been published in J. Chem. Phys. 155, 214102 (2021). However, some reviews say that second-order optimizers, like Levenberg-Marquardt, can converge better and more efficiently. My previous simulation usually took almost 10^7 epochs over a week to converge, whereas the reviews report convergence at the 10^3-epoch level with the L-M optimizer. So I want to test the performance of a second-order optimizer on this regression problem.
- Fabio Di Marco has compared Levenberg-Marquardt and Adam with TensorFlow; the target function is the sinc function.
- Soham Pal has compared L-BFGS and Adam with PyTorch on a linear regression problem.
- A NN-PES review has compared some optimizers, but it lacks details, and MATLAB has a higher learning cost (from my point of view).
Since TensorFlow does not have an official second-order optimizer, I will use the PyTorch L-BFGS optimizer in this test.
You can find information about the L-BFGS algorithm on many websites, so I will not discuss it here. However, when you use L-BFGS in PyTorch, you need to define a 'closure' function for the gradient evaluation. I'm not very familiar with optimization algorithms and simply follow the code written by Soham Pal. The 'train' function is:
import torch
from torch.autograd import Variable

device = "cpu"  # the full script in optimizer_test/ sets device itself; CPU is assumed here

def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    lm_lbfgs = model.to(device)
    # special closure function required by L-BFGS: it re-evaluates the model and returns the loss
    for batch, (X, y) in enumerate(dataloader):
        # Variable is deprecated in modern PyTorch; plain tensors work the same way
        x_ = Variable(X, requires_grad=True)
        y_ = Variable(y)

        def closure():
            # Zero gradients
            optimizer.zero_grad()
            # Forward pass
            y_pred = lm_lbfgs(x_)
            # Compute loss
            loss = loss_fn(y_pred, y_)
            # Backward pass
            loss.backward()
            return loss

        # step() may evaluate the closure several times (e.g. during the line search)
        optimizer.step(closure)
        loss = closure()
        if batch % train_size == 0:  # train_size (number of training points) is defined in the full script
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")
    return loss

Also, I use the strong_wolfe line-search option; otherwise the loss becomes very large (I don't know the reason).
optimizer_lbfgs = torch.optim.LBFGS(model.parameters(), lr=1,
                                    history_size=100, max_iter=20,
                                    line_search_fn="strong_wolfe"
                                    )

The code for this test can be found in optimizer_test/.
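For comparison, the Adam runs use the standard PyTorch training loop: Adam takes exactly one gradient step per step() call, so no closure is needed. A minimal sketch of the counterpart I assume here (the learning rate is illustrative, not necessarily the value used in optimizer_test/):

optimizer_adam = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr is illustrative

def train_adam(dataloader, model, loss_fn, optimizer):
    model.train()
    for X, y in dataloader:
        X, y = X.to(device), y.to(device)
        optimizer.zero_grad()          # zero gradients
        loss = loss_fn(model(X), y)    # forward pass and loss
        loss.backward()                # backward pass
        optimizer.step()               # one first-order update, no closure
    return loss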
I compared two NN structures (a minimal PyTorch sketch follows the list):
- One hidden layer with 20 neurons and a linear output. I denote this t20, where "t" means the tanh activation function.
- Two hidden layers with 20 neurons each and a linear output. I denote this t20-t20.
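Here is how the two architectures can be built in PyTorch (a sketch; the layer widths follow the t20 / t20-t20 naming, and the variable names are mine, not necessarily those in optimizer_test/):

import torch.nn as nn

# t20: one hidden layer of 20 tanh neurons, linear output
model_t20 = nn.Sequential(
    nn.Linear(1, 20), nn.Tanh(),
    nn.Linear(20, 1),
)

# t20-t20: two hidden layers of 20 tanh neurons each, linear output
model_t20_t20 = nn.Sequential(
    nn.Linear(1, 20), nn.Tanh(),
    nn.Linear(20, 20), nn.Tanh(),
    nn.Linear(20, 1),
)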
I use 20000 points sampled from the sinc function: x ∈ [-1, 1], y = sinc(x) = sin(x)/x for x ≠ 0 and sinc(0) = 1.
80% of the data was randomly chosen for training.
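A sketch of how such a dataset can be generated and split (uniform sampling on the interval and full-batch loading are my assumptions about the details; note that torch.sinc uses the normalized sin(pi*x)/(pi*x) convention, so I write sin(x)/x explicitly):

import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

n_points = 20000
x = torch.rand(n_points, 1) * 2.0 - 1.0                  # uniform samples in [-1, 1]
y = torch.where(x == 0, torch.ones_like(x),              # sinc(0) = 1
                torch.sin(x) / x)                        # unnormalized sinc: sin(x)/x

dataset = TensorDataset(x, y)
n_train = int(0.8 * n_points)                            # 80% for training
train_set, test_set = random_split(dataset, [n_train, n_points - n_train])
train_loader = DataLoader(train_set, batch_size=n_train)     # full-batch training is my assumption
test_loader = DataLoader(test_set, batch_size=len(test_set))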
It is not surprising that Adam t20 performs worst, and Adam t20-t20 seems to have the same performance as L-BFGS. However, if we zoom into the boundary region:
The green line (Adam t20-t20) deviates a lot from the target.
The loss-decay curve on a logarithmic scale illustrates the fitting error better.
Adam t20-t20 is still several orders of magnitude worse than L-BFGS t20.
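A sketch of how the loss curves can be recorded and plotted on a logarithmic scale (this reuses the names from the sketches above; the epoch count, loss function, and learning rates are illustrative, not the exact settings in optimizer_test/):

import matplotlib.pyplot as plt

loss_fn = torch.nn.MSELoss()
train_size = n_train                      # used by train() for its print interval
opt_adam = torch.optim.Adam(model_t20_t20.parameters(), lr=1e-3)
opt_lbfgs = torch.optim.LBFGS(model_t20.parameters(), lr=1, history_size=100,
                              max_iter=20, line_search_fn="strong_wolfe")

loss_history_adam, loss_history_lbfgs = [], []
for epoch in range(2000):                 # epoch count is illustrative
    loss_history_adam.append(float(train_adam(train_loader, model_t20_t20, loss_fn, opt_adam)))
    loss_history_lbfgs.append(float(train(train_loader, model_t20, loss_fn, opt_lbfgs)))

plt.semilogy(loss_history_adam, label="Adam t20-t20")
plt.semilogy(loss_history_lbfgs, label="L-BFGS t20")
plt.xlabel("epoch")
plt.ylabel("training loss")
plt.legend()
plt.show()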
The memory usage is almost the same because the networks are relatively small. However, second-order optimizers commonly need more memory to store gradient history.
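As a rough, back-of-the-envelope estimate (mine, not measured): the L-BFGS two-loop recursion keeps about 2 × history_size difference vectors the size of the parameter vector, which for these tiny networks is still well under a megabyte:

n_params = sum(p.numel() for p in model_t20_t20.parameters())   # 1 -> 20 -> 20 -> 1 gives 481 parameters
history_size = 100
extra_bytes = 2 * history_size * n_params * 4                   # float32, 4 bytes per entry
print(f"{n_params} parameters, ~{extra_bytes / 1024:.0f} KiB of L-BFGS history")  # ~376 KiB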
Please try a second-order optimizer in regression problems if possible, especially for small networks.


