Conversation
From what I understand of #31 and the downstream lurk-lab/lurk-beta#648, you need a hash of the public parameters that changes every time the circuits involved in those public parameters change, and can be used as a public parameter cache key so that no one can load the wrong public parameters for the current circuits?
If I have understood that requirement correctly, then good news! That digest already exists: it's the public parameter digest, it was introduced in microsoft/Nova#168 and it is cached in the public parameters already. No need to add anything: if the circuits change, the public parameter digest will change correspondingly.[1]
Now, it's possible I missed something, just lmk.
/cc @porcuquine
Footnotes
[1]:
Now you'll tell me that the public parameters that are used for Supernova don't include the full range of circuits used in the accumulators: they contain but the largest instance, and not the full set. I would argue that this is plain wrong, in the sense that it invalidates the purpose of https://github.com/microsoft/Nova/pull/168, which is to bind the prover to a specific set of accumulators.
In that sense, even if the public parameters only need the shape of the largest circuit for most of the time, they should at least pass the whole ROM (all accumulators) at the time of populating the digest. The pattern to use there, I think, is a builder:
https://rust-unofficial.github.io/patterns/patterns/creational/builder.html, and there the builder would receive the ROM and possibly not keep it in the PPs after emitting the hash. Or (and I'd probably prefer this) keep the whole array of R1CS shapes in the PPs.
Fortunately, we can tackle all this a bit later, because, as you probably noticed, there are two sets of public parameters, one for Nova (which is fine), one for Supernova (which I have strong issues with, as laid out in the above).
src/r1cs.rs (outdated):

```rust
    digest: G::Scalar::ZERO,
};

// it's a shame we have to recompute everything here; perhaps we can optimize this somehow
```
We will indeed need that optimization, because the same bytes will be hashed again in https://github.com/lurk-lab/arecibo/blob/c3608280054fd183c65873b88251e2710d483c3f/src/lib.rs#L203-L204
Evidence that this is performance-sensitive exists in https://github.com/microsoft/Nova/pull/182/files, which introduced the benchmark and tests for the public parameters' digest.
Can we compute it only if it's equal to G1::Scalar::ZERO (the default initial value)?
Padding occurs as a one-time cost at the end when producing the snark. Technically we currently don't even need to check any digest at the point of padding, since it's assumed that the public params are already loaded and valid (though in the future it might be necessary to check again due to some new complexity). So my inclination is to delete this line and add a comment making it clear that this is "wrong," but we don't care.
Indeed, this is pretty compute heavy so we have to be careful.
@arthurpaulino we could compute it for any default initial value (like 37, or None with Option<F>). We just had to pick a default one to be able to serialize the structure (and make sure it's the same default each time).
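The sentinel ambiguity can be avoided entirely by storing the digest as an `Option`, computed lazily on first use. A minimal sketch of the idea, with an invented `Shape` type, `u64` standing in for a field-element digest, and std's `DefaultHasher` standing in for the real hash:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical shape; `digest: Option<u64>` replaces the `Scalar::ZERO` sentinel.
struct Shape {
    a: Vec<u64>,
    digest: Option<u64>, // None means "not yet computed", unlike an ambiguous ZERO
}

impl Shape {
    fn new(a: Vec<u64>) -> Self {
        Shape { a, digest: None }
    }

    // Compute only when absent; subsequent calls reuse the cached value.
    fn digest(&mut self) -> u64 {
        if let Some(d) = self.digest {
            return d;
        }
        let mut h = DefaultHasher::new();
        self.a.hash(&mut h);
        let d = h.finish();
        self.digest = Some(d);
        d
    }
}
```

This sidesteps the "what if the real digest happens to equal the default?" question, at the cost of needing a serialization story for `None` (e.g. serializing `Option<F>` consistently each time).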
I believe the issue is that we need not only a digest that changes whenever the parameters do but one that can also be computed without incurring the major cost of actually generating the parameters. Specifically, we need this as a cache key exactly in order to avoid that. We can think of there being several parts:
Since we are not implementing the parameter caching as an arecibo feature, we should primarily consider a solution that addresses the 'application-specific' needs. Ideally this can be done in a way that is not intrusive and incurs no performance penalty for applications that don't make use of this. In other words, I think it is indeed the case that we need to export the ability to get a digest just from the application-specific part. For NIVC, this will need to include all R1CS shapes, as you note. For downstream purposes, it may be acceptable to compose such a composite digest from a canonical digest of the individual

To make sure we're on the same page, let me note that the ROM is a concept only used in the toy 'NIVC application' used in the SuperNova tests as part of the initial draft. It's essentially a static representation of the NIVC 'program' to be proved. I don't think this distinction is significant to the discussion, but I mention it so we can clarify any confusion about terminology. I agree that all shapes must be accounted for in the parameter digest for SuperNova. As noted above, I believe that is already the case. I also agree that neither the creation method nor precise content of those parameters is optimal. Finally, I suspect that once we have reached a better version of creation/content there will also be a better method of creating the joint digest than the current.
I think this can now be reworked, to use
Force-pushed 8e50258 to b82b65f (compare)
Force-pushed b82b65f to 9c2e3e2 (compare)
@porcuquine Rebasing on #40, I've hacked around the
```rust
    self.digest.write_digest(byte_sink)
}

fn to_bytes(&self) -> Result<Vec<u8>, io::Error> {
```
Perhaps we could implement a quick proc macro for this
I think what you want here is a standard derive macro for the Digestible trait, no?
What I don't understand is that in the above, to_bytes and write_digestable_bytes were producing the same bytes in a different fashion. Here, we're breaking this invariant. Why?
to_bytes is what creates a digest from a raw object T. write_digestible_bytes is how T should be represented when creating the digest of a larger object S that contains T. For R1CSShape and PublicParams we break the invariant because when computing the digest of PublicParams that contains R1CSShapes, we simply want to absorb shape.digest instead of rehashing all of shape again. Thus the implementation of write_digestible_bytes for R1CSShape should just write its digest and nothing else.
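The asymmetry being described can be made concrete with a small sketch. All names here are stand-ins (the trait is simplified to return `Vec<u8>` rather than stream into a writer, and `u64` plays the role of the real digest):

```rust
use std::io::{self, Write};

// Sketch of the two serialization roles discussed above.
trait Digestible {
    // Bytes that are hashed to produce this object's own digest.
    fn to_bytes(&self) -> Vec<u8>;
    // Bytes this object contributes when it is a field of a larger digested object.
    fn write_digestable_bytes<W: Write>(&self, sink: &mut W) -> io::Result<()>;
}

// Hypothetical stand-in for R1CSShape.
struct R1csShape {
    rows: Vec<u8>,
    digest: u64,
}

impl Digestible for R1csShape {
    fn to_bytes(&self) -> Vec<u8> {
        self.rows.clone() // the full (expensive) serialization
    }

    // Deliberately breaks the "same bytes" invariant: absorb only the
    // precomputed digest instead of re-serializing the whole shape.
    fn write_digestable_bytes<W: Write>(&self, sink: &mut W) -> io::Result<()> {
        sink.write_all(&self.digest.to_le_bytes())
    }
}
```

This is exactly the source of the ambiguity raised below: for such a type, hashing `to_bytes` and hashing the `write_digestable_bytes` stream give different results.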
But now for all structures that use the digests of their constituent fields (rather than the serialization bytes of their constituent fields), I have two ways of getting to a digest for the top-level structure: hashing the output of to_bytes, and asking write_digestable_bytes to write the digest to a byte sink.
For some structures this is the same. For some this is not. This ambiguity is bound to end in tears.
src/digest.rs (outdated):

```rust
fn write_bytes<W: Sized + io::Write>(&self, byte_sink: &mut W) -> Result<(), io::Error>;
/// Write the digest representation of Self in a byte buffer
/// this is what you want to write when Self is being digested into another digest
fn write_digest<W: Sized + io::Write>(&self, byte_sink: &mut W) -> Result<(), io::Error>;
```
I think this may need a better name. I think the separated functions make sense, but for example SimpleDigestible implements this function in a way that does not involve writing its digest.
Maybe this can be called write_summary_bytes or something? That is not a well-thought-out suggestion, but the point is that this function might either write the digest or the input over which the digest was created — depending on context. At least that is what I've understood.
That does make sense; maybe write_digestable_bytes to indicate that these are the bytes being digested (into another container)?
Yes, I think that's the ticket.
```rust
// TODO: at this point within the proving pipeline, we assume that `R1CSShape::pad`
// is called only in the ppsnark/snark setup process. Thus, the digest's use as a check
// to prevent drifting public params is not a concern, and we can avoid doing a massive
// recomputation for a new digest that will never be used. However, this is ad-hoc and
// we should find a nicer solution
digest: self.digest,
```
Another concern: as I wrote this, I realized that currently master doesn't take this concern into account either, but it is more subtle. After padding, the digest of the public params will not match the new padded shape. Looking at the setup code in snark.rs/ppsnark.rs, I see that we commit again to the padded shape though, so this shouldn't be an issue, right?
cc: @huitseeker
The point of the builder pattern I produced in #31 was to make sure you'd run all the initialization steps before getting the object, and therefore did not have to check an object to see if it had been padded or not. Did we forget padding as part of this?
I don't understand what you mean. My point is that after the digest is built, we mutate the shape by padding it, and we don't rebuild the new padded shape to update the digest. However, the new digest wouldn't be used anyway, since the transcript implicitly consumes the digest of the unpadded shape through pp.digest. Hence we can elide the expensive recompute.
However, I don't know if this may raise ways to attack the snark protocol.
Force-pushed ef54705 to 6f48fcc (compare)
Force-pushed 32ee450 to 67760c4 (compare)
Force-pushed 67760c4 to b4f3ce8 (compare)
I think I understand what you're trying to do, that is you want:

- the `Digestible` trait's main functionality should be to stream bytes into an `io::Write`r passed as argument,
- an implementation of the `Digestible` trait by which, for objects that are serializable, you stream the bytes of their serialized form. That's exactly the `SimpleDigestible` implementation, or at least the `write_digestable_bytes` (sic) of it. To recall, the functionality of this trait should be to inject a stream of bytes in a `Write`r,
- an extension trait of `Digestible`, say `Hashable`, by which you can equip the `Digestible` objects with a `digest()` function (or equivalent). The functionality of this trait should be to compute an actual hash from the `Digestible` bytes,
- a proc-macro by which you can derive the `Digestible` trait on an object O, so that for each field of that object, assuming those fields implement the `Hashable` trait, you can stream:
  - a marker of the type of the object, which should be a domain separator (the `serde_name` crate would help),
  - the hashed bytes of each field,
  - except for the digest field of O.

  To recall, the functionality of this trait (`Digestible`) should also be to inject bytes in a `Write`r.

One goal of this would be to not allow any hash collisions between types that stream their ground serialization bytes, and those that stream their hash bytes.
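The byte-selection/hash-computation split sketched above could look roughly like this. This is only an illustration under stated assumptions: `u64` and std's `DefaultHasher` stand in for a real field-element digest and hasher, the proc-macro and domain-separation parts are elided, and `Params` is an invented example type:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;
use std::io::{self, Write};

// Byte selection: each type decides which bytes feed the hasher,
// streamed into an io::Write sink with no intermediate allocation.
trait Digestible {
    fn write_bytes<W: Write>(&self, sink: &mut W) -> io::Result<()>;
}

// Hash computation: identical for every Digestible, provided once
// as an extension trait with a blanket implementation.
trait Hashable: Digestible {
    fn digest(&self) -> u64 {
        // Adapter that streams bytes straight into the hasher.
        struct HashWriter(DefaultHasher);
        impl Write for HashWriter {
            fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
                self.0.write(buf); // Hasher::write
                Ok(buf.len())
            }
            fn flush(&mut self) -> io::Result<()> {
                Ok(())
            }
        }
        let mut w = HashWriter(DefaultHasher::new());
        self.write_bytes(&mut w).expect("writing to a hasher cannot fail");
        w.0.finish()
    }
}
impl<T: Digestible> Hashable for T {}

// Example user type (hypothetical).
struct Params {
    label: &'static str,
}

impl Digestible for Params {
    fn write_bytes<W: Write>(&self, sink: &mut W) -> io::Result<()> {
        sink.write_all(self.label.as_bytes())
    }
}
```

The point of the split is that the `Hashable` body is written exactly once, so there is exactly one path from "bytes" to "digest" for the whole class of objects.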
```rust
/// Write the digest representation of Self in a byte buffer
/// this is what you want to write when Self is being digested into another digest
```
I'm not sure I understand this wording: what is "the digest representation of self"? What does "Self is being digested into another digest" mean?
```rust
fn write_bytes<W: Sized + io::Write>(&self, byte_sink: &mut W) -> Result<(), io::Error>;
/// Write the digest representation of Self in a byte buffer
/// this is what you want to write when Self is being digested into another digest
fn write_digestable_bytes<W: Sized + io::Write>(
```
I think the adjective is digestible.
```rust
    Ok(v)
}
/// this is what you want when you need the representation of Self to compute its digest
fn to_bytes(&self) -> Result<Vec<u8>, io::Error>;
```
Note: in a trait, the default implementation is always overridable.
```rust
self.inner.write_bytes(&mut hasher)?;
let bytes = self.inner.to_bytes()?;
hasher.update(&bytes);
```
Because this requires allocating the fully serialized form of the object in bytes before passing it to the hasher, this has a completely different performance envelope than the original.
Please use the io::Write API instead.
```rust
self
    .augmented_circuit_params_secondary
    .write_digestable_bytes(&mut byte_sink)?;
self.digest.write_digestable_bytes(&mut byte_sink)?;
```
In order to populate the digest for this object, I call the present to_bytes, which ingests the value of self.digest? Does this make sense? If by some stroke of luck I initialize an object with a different starting (uninitialized) digest, but identical to another in all other respects, I will get a distinct final digest: should I?
:O nice catch
I'm a bit confused by your suggestions for

Given this, we could have 2 traits for each point, instead of 1 trait with two functions (plus
@winston-h-zhang The digest computed here is security-critical, so we need to be more precise in specifying it. There's no such thing as a function that can be specified with as few words as `T -> Digest`, and I don't know how to parse what `T < S` should mean.
In order to define a digest, you need:

1. the actual hash computation, which involves a hasher that can munge bytes. It should not require allocating these bytes in e.g. a vector;
2. the selection of the bytes to be added as argument to (1.).

Things we should do:

- define statically an unambiguous way of getting the bytes that serve as an input for a hasher, for a large class of objects;
- not define any other way of getting any other bytes that could possibly be construed to be a suitable input for that very hasher.
With that said, I understand the purpose of this PR is to also offer an indirection by selecting the digests of constituent fields as inputs, so as to avoid re-computing the hash of those embedded objects (but see note 1 below)
The reason I suggest splitting this in two traits is that part (1.) above is the same for all objects, irrespective of their constituents, and (2.) is not.
But this goes a bit further: we also need to use domain separation when inserting those digests, so that there is no possible way of creating a hash collision between the following objects:
```rust
mod foo {
    struct S([u8; 32]);
    impl SimpleDigest for [u8; 32] {}
}
// and
mod bar {
    struct S([u8; 32]);
    impl SimpleDigest for S {}
}
```

This will require at least using `serde_name` as indicated above, and adding a domain separator to the addition of the digest in the hasher's input, something I believe your approach to `Digestible` does not do today (and I have trouble imagining how to make it do so).
For a glimpse of why this is important, see https://crypto.junod.info/posts/recursive-hash/
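One common way to get the domain separation described here is to frame every contribution with a length-delimited type tag before it reaches the hasher. The helper below is a hypothetical sketch (in real code the tag would come from something like `serde_name` and the stream would feed a real hasher); with it, the two `S` types above can no longer produce colliding byte streams:

```rust
use std::io::{self, Write};

// Domain-separated framing: prefix a payload with a length-delimited type tag,
// so streams for different types (or for "raw bytes" vs "digest bytes") can
// never be byte-for-byte identical.
fn write_framed<W: Write>(sink: &mut W, type_tag: &str, payload: &[u8]) -> io::Result<()> {
    sink.write_all(&(type_tag.len() as u64).to_le_bytes())?;
    sink.write_all(type_tag.as_bytes())?;
    sink.write_all(&(payload.len() as u64).to_le_bytes())?;
    sink.write_all(payload)
}
```

With this framing, `foo::S` and `bar::S` wrapping the same 32 bytes contribute different inputs to the hasher, which is exactly the property the recursive-hash attack exploits when it is missing.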
Please schedule some time with me so we can go over this in depth.
Notes: [1] [2]
[1]: Since our hashers, including sha3, are quite fast, it's actually quite fine to start an implementation by just not using digests as inputs to the hasher (in 2. above), thereby flattening the bytes involved in the digest computation to the serialization bytes of the structure. We have two desires:
a. speeding up the equality comparison in the case of embedded structures, e.g. when we have two PublicParameters that each embed circuits that are known to be distinct,
b. speeding up the computation of overall hash of the top-level structure, by reusing the hash of the imbricated structure, rather than re-computing it.
Note that a. comes quite ahead of b. in priority. (b.) is a secondary concern and a mere performance optimization. I believe the following is an acceptable option to start iterating from, if sub-optimal from a perf standpoint:
```rust
fn cache_key(pp: PublicParams<...>) -> String {
    let str1 = pp.r1cs_shape_primary.to_bytes(); // full serialization and hashing of r1cs
    let str2 = pp.to_bytes(); // full serialization and hashing of pp, including the second serialization & hashing of r1cs
    format!("{str1}-{str2}")
}
```

Then we can write the interesting part (a. above): a function that reads the pp from cached files only if the r1cs prefix matches.
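That prefix-matching part could be as small as the following sketch. Everything here is hypothetical scaffolding: `u64` stands in for the real digests and the key format is invented for illustration:

```rust
// Hypothetical cache-key scheme from note [1]: the r1cs digest is a hex prefix
// of the key, so a cached entry can be rejected cheaply when the circuit changed,
// without deserializing the (large) public parameters.
fn cache_key(r1cs_digest: u64, pp_digest: u64) -> String {
    format!("{r1cs_digest:016x}-{pp_digest:016x}")
}

// Accept a cached entry only when the circuit (r1cs) prefix matches the
// digest recomputed for the current circuits.
fn cache_entry_matches(key: &str, expected_r1cs_digest: u64) -> bool {
    key.starts_with(&format!("{expected_r1cs_digest:016x}-"))
}
```

This keeps priority (a.) cheap (a string comparison) while leaving (b.), reusing embedded digests inside the hash computation itself, for later.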
[2]: Prior to this PR, I wrote the `Digestible` trait with two functions: `write_bytes`, and the decorative, documentary `to_bytes`. I now realize with this PR that adding the second function was a mistake, because it is giving you ideas about giving a funny meaning to `to_bytes`. I think we should remove it. I'm also in favor of changes that involve redefining the trait hierarchy ... as long as it makes rigorous sense.
```rust
self._p_c1.write_digestable_bytes(&mut byte_sink)?;
self._p_c2.write_digestable_bytes(&mut byte_sink)?;
```
Another important point: do you think this actually binds the public parameters to the shape of the circuit?
Because phantom types beg to differ:
https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=ffd3e4adc7538effc0a088b1b0474734
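The playground example boils down to the following self-contained sketch (names invented here): because `PhantomData` contributes no bytes, two differently parameterized values serialize, and therefore digest, identically, so the phantom fields do not bind the parameters to the circuit type:

```rust
use std::marker::PhantomData;

// Hypothetical public-parameters stand-in, generic over a circuit type C.
struct Params<C> {
    data: [u8; 4],
    _p_c: PhantomData<C>, // the circuit appears only as a type, never as bytes
}

impl<C> Params<C> {
    // Sketch serialization: PhantomData adds nothing to the output,
    // so C is invisible to any digest computed over these bytes.
    fn bytes(&self) -> Vec<u8> {
        self.data.to_vec()
    }
}

struct CircuitA;
struct CircuitB;
```

Binding to the circuit therefore has to come from actual data (the shapes), not from the phantom type parameters.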
I think these can be left out, exactly because they are not actually 'present'. The shapes are there independently, if I'm not mistaken — as straightforward data: https://github.com/lurk-lab/arecibo/blob/b4f3ce80980e6a9f985b5fe9ec44f37b1e21d267/src/lib.rs#L198-L202
Yes, that's what I was getting at.
I hesitate to interject, because I have not tracked this discussion entirely. However, I want to make sure I'm following. I understand generally what @winston-h-zhang is going for. I think what @huitseeker describes above is slightly different. At least the wording of this sentence suggests that. The nuance I observe is that in @winston-h-zhang's plan, there are two distinct ways of ensuring that a constituent object is included in a containing digest.

In this model, the contained object's digest will always be computed exactly once. However, inclusion of that digest as part of the input to be digested in the container is not directly equivalent to the first instance. It still binds the parent's hash to the content of the contained object — but it will not result in the same final digest. I believe this is fine. It might be that this was absolutely clear to both parties already. In that case, carry on. If not, maybe this helps in some way.
@porcuquine thanks for clarifying the goals; I believe that we are all on the same page as to what they are. They are also clear in the code. I'm just highlighting that performing the second case (hashing digests recursively) safely (without opening ourselves to collision attacks) requires rigor and care. And my note [1] suggests that until we have defined a rigorous and safe way to perform this in code (something that will indeed have advantages from a performance standpoint), perhaps we could limit ourselves to the somewhat more modest goal of computing digests strictly from serialized forms (not digests themselves), and simply using the digest of a constituent field as a prefix of a cache key when this field's digest is expected to be enough to discriminate between two sets of public parameters.
```rust
self.A.write_digestable_bytes(&mut byte_sink)?;
self.B.write_digestable_bytes(&mut byte_sink)?;
self.C.write_digestable_bytes(&mut byte_sink)?;
self.digest.write_digestable_bytes(&mut byte_sink)?;
```
Closed in favor of #49
Resolves #31 and lurk-lab/lurk-beta#648 downstream.
This PR adds a small mechanism to distinguish circuits via a digest. This is essentially the same as what is done for `PublicParams.digest`. Downstream in `lurk-rs`, the digest is needed to quickly determine if a set of public params matches the given circuits. Note that the cost of `circuit_digest` is somewhat small compared to fully generating params, but it remains to be benchmarked. If the cost is too high, maybe something more clever will be needed.