
Conversation

@fehiepsi
Member

@fehiepsi fehiepsi commented Dec 22, 2017

This pull request is a WIP tutorial on how to create a (simple) GP regression model using Pyro. It definitely still needs more explanations, comments, plots, etc. Any comments would be really helpful for me to complete it.

  • Add an introduction, formulas, remarks, comments to code.
  • Explain how to define a Distribution. (skip as suggested by @fritzo)
  • Explain how to use random_module primitive. (skip)
  • Explain Pyro's model-guide + SVI approach. (skip)
  • Implement MAP inference.
  • Simplify the tutorial after moving some parts to pyro.contrib.gp.

@fritzo
Member

fritzo commented Dec 22, 2017

cc @ysaatchi @karalets

Member

@fritzo fritzo left a comment


This looks great, and deserves more than a tutorial 😄 I think we should add GPRegressor as a first-class part of Pyro. Would you be willing to:

  1. Create a PR that adds a pyro.contrib.gp.GPRegressor, say in a new file pyro/contrib/gp.py, including a couple of tests? I suggest basing this off of @dwd31415's MultivariateNormal in #651, since that PR already adds tests and deals with torch.potri in a safe way.
  2. Simplify this PR to from pyro.contrib.gp import GPRegressor.

(EDIT changed pyro.nn to pyro.contrib)

@fritzo
Member

fritzo commented Dec 26, 2017

If you separate the MultivariateNormalTri and GPRegressor into a separate PR, then you can avoid explaining some of the coding details in this PR (Distributions, nn.random_module) and focus more attention on the interesting parts of Gaussian Processes.

@fehiepsi
Member Author

fehiepsi commented Dec 27, 2017

@fritzo Thank you for your suggestions! contrib.gp is a good place to start building a skeleton for GP. I will try to finish this tutorial first (without explaining implementation details; it seems that the remaining work is to add MAP inference). After my vacation (next week), I will focus on contrib.gp to see how far it can go. :)

@fritzo
Member

fritzo commented Jan 3, 2018

@fehiepsi If it's easier for you, it would be fine to split this up into two PRs by:

  1. update this tutorial to use @dwd31415's MultivariateNormal, which is now merged;
  2. merge this PR;
  3. factor out GaussianProcess into pyro.contrib.gp in a second PR.

@fehiepsi
Member Author

fehiepsi commented Jan 3, 2018

@fritzo: I am on vacation and will return in a week. Your suggestion works for me; I will follow it when I get back home.

@fehiepsi
Member Author

@fritzo I have updated the tutorial after moving the GPR model and RBF kernel to pyro.contrib.gp and using the MultivariateNormal distribution that is now implemented.

@fehiepsi fehiepsi changed the title [WIP] Gaussian Process tutorial Gaussian Process tutorial Jan 15, 2018
Member

@fritzo fritzo left a comment


Looks great! Thanks for factoring out the gp library, I think it will be useful.

Calculate the covariance matrix of inputs on active dimensions.
:param torch.autograd.Variable X: A 2D tensor of size `N x input_dim`.
:param torch.autograd.Variable X2: A 2D tensor of size `N x input_dim`.
Member

replace X2 with Z here and below


class RBF(Kernel):
"""
Implementation of RBF kernel.
Member

Could you expand this to "Radial Basis Function kernel"?
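
For readers following along, the Radial Basis Function (squared exponential) kernel under discussion is k(x, z) = variance * exp(-||x - z||^2 / (2 * lengthscale^2)). A minimal standalone sketch of that computation (not the pyro.contrib.gp implementation) might look like:

import torch

def rbf_kernel(X, Z, variance=1.0, lengthscale=1.0):
    # minimal RBF sketch: k(x, z) = variance * exp(-||x - z||^2 / (2 * lengthscale^2))
    # X is N x D, Z is M x D; returns the N x M covariance matrix
    X = X / lengthscale
    Z = Z / lengthscale
    # squared Euclidean distances between all pairs of rows
    sq_dist = (X ** 2).sum(1, keepdim=True) + (Z ** 2).sum(1, keepdim=True).t() \
        - 2.0 * X.matmul(Z.t())
    return variance * torch.exp(-0.5 * sq_dist.clamp(min=0))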

.gitignore Outdated
@@ -1,3 +1,4 @@
# temp
Member

nit: remove this line

kernel_fn = pyro.random_module(self.kernel.name, self.kernel, self.priors)
kernel = kernel_fn()
K = kernel.K(self.X) + self.noise.repeat(self.input_dim).diag()
zero_loc = Variable(torch.zeros(self.input_dim).type_as(K.data))
Member

I think to get correct GPU device placement you'll need to

zero_loc = K.new([0]).expand(self.input_dim)

Member Author

I see no problem with using the type_as method (in PyTorch 0.3). However, I only have 1 GPU to test with. Did you mean that this will fail for multi-GPU setups?

Member

Yes, K.new(...) ensures that the result is on the same GPU as K, whereas I believe .type_as() only ensures that the result is on some GPU. I only learned this recently, and we still have a few bugs in Pyro relating to GPU placement.
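
A small illustration of the difference (a general PyTorch pattern, not code from this PR): .new(...) allocates on the same device as the source tensor, whereas .type_as() only matches the tensor type, so the result may land on the default GPU instead of K's GPU.

import torch
from torch.autograd import Variable

K = Variable(torch.eye(3))   # in a multi-GPU run, imagine K.data lives on cuda:1
# same type AND same device as K.data:
zero_loc = Variable(K.data.new([0]).expand(K.size(0)))
# same type as K.data, but allocated on the current default device, which may differ:
zero_loc_risky = Variable(torch.zeros(K.size(0)).type_as(K.data))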

Member

Now I try to use .new() whenever I create a new tensor.

Member Author

Aha, thank you a lot for the tip! Will change soon.

@fritzo
Member

fritzo commented Jan 18, 2018

@karalets Could you please review the high-level approach? I've already reviewed the software details.

@eb8680 eb8680 requested review from karalets and ysaatchi January 18, 2018 23:28
def __init__(self, variance=torch.ones(1), lengthscale=torch.ones(1), active_dims=None, name="RBF"):
super(RBF, self).__init__(active_dims=active_dims, name=name)
self.variance = Parameter(variance)
self.lengthscale = Parameter(lengthscale)


Need to add a description of how lengthscale works. Typically you have one independent lengthscale per dimension, but you seem to be assuming that the lengthscales across all dimensions are the same (?)

Member Author

@ysaatchi You are right (originally, I wrote this module only for 1-dimensional data). We need to refactor it a bit. I will solve it by introducing an input_dim parameter (similar to GPy or GPflow) so we can set the correct shape for the initial lengthscale.
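
A rough sketch of that refactor, in the spirit of GPy/GPflow ARD kernels (the input_dim argument and default shapes here are assumptions, not the final contrib.gp API):

import torch
from torch.nn import Parameter

class RBF(Kernel):  # assumes the Kernel base class from this PR
    def __init__(self, input_dim, variance=None, lengthscale=None, active_dims=None, name="RBF"):
        super(RBF, self).__init__(active_dims=active_dims, name=name)
        if variance is None:
            variance = torch.ones(1)
        if lengthscale is None:
            # one independent lengthscale per input dimension (ARD)
            lengthscale = torch.ones(input_dim)
        self.variance = Parameter(variance)
        self.lengthscale = Parameter(lengthscale)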

self.kernel = kernel
# TODO: define noise as a nn.Module, so we can train/set prior to it
if noise is None:
self.noise = Variable(X.data.new([1]))


What does this do? Needs a comment

Member Author

Noise plays the role of the Gaussian likelihood. I intend to add Likelihood and Mean-function modules in a future pull request, so temporarily I let it be constant. See #681 for the plan I have in mind. This pull request serves the purpose of simplifying the original tutorial code.

self.y = y
self.input_dim = X.size(0)
self.kernel = kernel
# TODO: define noise as a nn.Module, so we can train/set prior to it

@ysaatchi ysaatchi Jan 19, 2018


Not the best idea; you should define noise as a likelihood with its own hypers and optimize it that way. In general we need to support arbitrary likelihoods for the GP, so defining them at this early stage will be very helpful.

Member Author

Yes, I agree!
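
For context, the kind of likelihood module being deferred might look roughly like the sketch below (the class name, attributes, and how it would plug into the GP model are assumptions, not the plan in #681):

import torch
import torch.nn as nn
from torch.nn import Parameter

class Gaussian(nn.Module):
    # sketch of a Gaussian likelihood carrying its own trainable noise hyperparameter
    def __init__(self, variance=None, name="Gaussian"):
        super(Gaussian, self).__init__()
        self.name = name
        if variance is None:
            variance = torch.ones(1)
        self.variance = Parameter(variance)

    def forward(self, K):
        # add observation noise to the diagonal of a covariance matrix
        return K + self.variance.expand(K.size(0)).diag()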

zero_loc = Variable(K.data.new([0]).expand(self.input_dim))
pyro.sample("f", dist.MultivariateNormal(zero_loc, K), obs=self.y)

def guide(self):


What is the purpose of the guide in this context? It seems like you are doing inference in forward(), so what is the point of the guide?

Member Author

Originally, I wanted to put some constraints on the parameters, but that would require writing a wrapper around nn.Parameter (to support transform methods). Then I found that setting priors and using the guide for MAP inference might be a simpler idea. Do you see a better way of using the guide?
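
A minimal sketch of that idea, i.e. a guide that learns a point estimate (a Delta distribution) for each kernel parameter that has a prior; the parameter names and initialization below are illustrative assumptions, not the tutorial's exact code:

import pyro
import pyro.distributions as dist
from torch.autograd import Variable

def guide(self):
    guide_priors = {}
    for name, prior in self.priors.items():
        # one learnable point estimate per prior; SVI over this guide performs MAP inference
        p_MAP = pyro.param("{}_MAP".format(name),
                           Variable(prior.sample().data.clone(), requires_grad=True))
        guide_priors[name] = dist.Delta(p_MAP)
    kernel_fn = pyro.random_module(self.kernel.name, self.kernel, guide_priors)
    return kernel_fn()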

K_zx = K_xz.t()
K_zz = kernel(Z)
loc = K_zx.matmul(self.y.gesv(K)[0]).squeeze(1)
covariance_matrix = K_zz - K_zx.matmul(K_xz.gesv(K)[0])


This is very inefficient (calling gesv twice on an NxN matrix); see the GPML book (Gaussian Processes for Machine Learning) for the correct pseudocode for doing this.


Out of interest, does gesv play well with autograd -- quite cool if so :)

Member Author

@ysaatchi Originally, I put noise as a hyperparameter and used the Cholesky decomposition. Then, when the noise is small, the Lapack error "the leading minor of order ... is not positive definite" annoyed me. In addition, somehow, I found torch.trtrs to be unstable (pytorch/pytorch#4296), so I used gesv instead. Of course, these problems might have come from bugs in my code at that time.

Anyway, using gesv might not be a good approach, so I will use the Cholesky decomposition again.

p/s: gesv supports autograd, but does not support batches yet. :)
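
For reference, the GPML-style computation (Rasmussen & Williams, Algorithm 2.1) being converged towards looks roughly like the sketch below, written against the torch.potrf/trtrs API available at the time (variable names are illustrative; per the discussion above, trtrs lacked CUDA and gradient support at that point, which is why explicit inverses were used as a workaround):

def gp_predict(K, K_xz, K_zz, y):
    # K    : N x N train covariance with noise already added to the diagonal
    # K_xz : N x M train/test cross-covariance;  K_zz : M x M test covariance
    L = K.potrf(upper=False)                      # K = L L^T
    A = K_xz.trtrs(L, upper=False)[0]             # A = L^{-1} K_xz
    v = y.unsqueeze(1).trtrs(L, upper=False)[0]   # v = L^{-1} y
    loc = A.t().matmul(v).squeeze(1)              # K_zx K^{-1} y
    cov = K_zz - A.t().matmul(A)                  # K_zz - K_zx K^{-1} K_xz
    return loc, cov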


@ysaatchi ysaatchi left a comment


Looks like a good start. I have some major questions at this stage; once they are clarified we can move forward.

@fritzo
Member

fritzo commented Jan 20, 2018

@fehiepsi We plan to merge this early next week. We're about to do a big refactoring and I want to make sure your PR can come along for the ride.

@fehiepsi
Member Author

@fritzo @ysaatchi I have made several changes reflecting your reviews. As for the change from noise to a Gaussian likelihood (an nn.Module), I think it is better to do that in another pull request. My main concern is not its implementation, but naming: we should have a good way to distinguish the parameters of kernels from those of likelihoods/mean functions when setting priors on them.

@fehiepsi
Member Author

fehiepsi commented Jan 20, 2018

@ysaatchi I made a benchmark to compare the following three methods on 1000x1000 matrices.

# each method computes K^T (L L^T)^{-1} K, where L is a lower-triangular Cholesky factor

def method1(L, K):
    # Cholesky solve via potrs
    return K.t().matmul(K.potrs(L, upper=False))

def method2(L, K):
    # triangular solve via trtrs, then v^T v
    v = K.trtrs(L, upper=False)[0]
    return v.t().matmul(v)

def method3(L, K):
    # explicit inverse of the triangular factor
    v = L.inverse().matmul(K)
    return v.t().matmul(v)

On my CPU: method2 is fastest (10ms), method1 is slower (13ms), method3 is slowest (20ms).
On my GPU: method1 is much faster than method3 (2ms comparing to 10ms), method2 throws an error.

Edited:
The drawbacks of method2 are that torch.trtrs() does not support CUDA tensors, and that torch.potrs and torch.trtrs do not support gradients. So I think that using method3 is the best option for GPR.

@fritzo
Member

fritzo commented Jan 22, 2018

@fehiepsi FYI we're refactoring MultivariateNormal in #693 as part of our migration to PyTorch distributions. I've tried to preserve all the functionality you need for your tutorial. Let me know if you have any problems with the new version and I'll prioritize fixing it for you.

@fritzo
Member

fritzo commented Jan 22, 2018

method3 is the best option

I believe torch.trtrs currently lacks gradient support. It's also fine to make a helper like

def _kernel_norm(L, K):
    if L.is_cuda:
        # work around lack of CUDA support for trtrs
        v = L.inverse().matmul(K)
    else:
        v = K.trtrs(L, upper=False)[0]
    return v.t().matmul(v)

@fehiepsi
Member Author

@fritzo I think that calling K.inverse() is better than L.inverse(); L is only helpful when it is used together with torch.trtrs(). So in my implementation, I chose method1 among the following 4 methods (previously I chose method2, but @ysaatchi pointed out that it is inefficient to call torch.gesv twice on the same kernel matrix). Anyway, I think that this is ready to merge.

# each of the four methods computes loc = A^T K^{-1} y and scale = A^T K^{-1} A
def method1(K, A, y):
    K_inv = K.inverse()
    A_t = A.t()
    loc = A_t.matmul(K_inv.matmul(y))
    scale = A_t.matmul(K_inv.matmul(A))
    return loc, scale
    
def method2(K, A, y):
    A_t = A.t()
    loc = A_t.matmul(y.gesv(K)[0])
    scale = A_t.matmul(A.gesv(K)[0])
    return loc, scale

def method3(K, A, y):
    L = K.potrf(upper=False)
    L_inv = L.inverse()
    L_inv_A = L_inv.matmul(A)
    L_inv_y = L_inv.matmul(y)
    L_inv_A_t = L_inv_A.t()
    loc = L_inv_A_t.matmul(L_inv_y)
    scale = L_inv_A_t.matmul(L_inv_A)
    return loc, scale
    
def method4(K, A, y):
    L = K.potrf(upper=False)
    L_inv_A = A.trtrs(L, upper=False)[0]
    L_inv_y = y.trtrs(L, upper=False)[0]
    L_inv_A_t = L_inv_A.t()
    loc = L_inv_A_t.matmul(L_inv_y)
    scale = L_inv_A_t.matmul(L_inv_A)
    return loc, scale

@fritzo
Member

fritzo commented Jan 23, 2018

Hi @fehiepsi I'm going to merge this now so it can follow our refactoring work, but feel free to keep submitting updates in follow-up PRs. Thanks for contributing this!

@fritzo fritzo merged commit 1fe22a5 into pyro-ppl:dev Jan 23, 2018
@martinjankowiak
Collaborator

@fehiepsi Is it OK if I submit a PR to clean up the language/organization of this tutorial a bit, to make it easier to follow?

@fehiepsi
Member Author

@martinjankowiak Sure, that would be great! Given that the GP API is stable now (the current PRs do not affect the API), it is a good time to revise it. I am happy to see your changes.

@fehiepsi fehiepsi deleted the add-gp-tutorial branch June 10, 2018 06:42